Whisper Transcription¶
This utility produces a transcription of an audio or video recording, using the Whisper speech recognition model from OpenAI.
Input Format¶
The app can process .mp3, .mp4, .m4a, .wav and .mpg files.
Output Format¶
CSV:
Contains every parameter output by the Whisper model.
DOTE:
Transcription format for the DOTE software developed by the BigSoftVideo team at AAU.
DOCX:
Office Open XML Document (Microsoft Word).
JSON:
JavaScript Object Notation.
SRT:
SubRip file format, a widely adopted subtitle format.
TSV:
Tab-separated values file containing start, end and text columns.
TXT:
Plain text file with the transcription.
VTT:
Web Video Text Tracks format.
ZIP:
Archive with all of the output files.
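To illustrate how two of these formats relate, the sketch below renders one TSV segment as an SRT cue, assuming the start and end offsets are given in milliseconds as in Whisper's TSV output. This is an illustrative helper, not part of the app:

```python
def srt_timestamp(ms: int) -> str:
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    seconds, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"

def tsv_row_to_srt_cue(index: int, start: int, end: int, text: str) -> str:
    """Render one TSV segment (start/end in ms) as an SRT cue block."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"

print(tsv_row_to_srt_cue(1, 0, 2500, "Hello"))
```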
Output Folder¶
By default the transcript files are saved in /Jobs/Whisper Transcription/<job-id>/out. The user can select another directory using the corresponding optional parameter.
Optional Parameters¶
From the application's job submission page the user may specify a number of optional parameters, as needed:
Input file:
A single file which will be transcribed by Whisper. This parameter is suitable if the user only needs Whisper to transcribe one file in the job.
Input directory:
The directory containing the file(s) which will be transcribed by Whisper. This parameter is suitable if the user needs Whisper to transcribe more than one file in the job. Output files are generated for each file in the 'Input directory'.
Option: --output_dir:
If the user wants the output files saved somewhere other than the default output folder, the desired folder is specified here.
Option: --output_format:
The file format of the output. See details above. By default, all output formats are generated and saved.
Option: --model:
The model which Whisper will use for the transcription. The default is large-v3, i.e., the largest and most accurate model. The model large-v3-turbo is faster, but when transcribing Danish its accuracy is not as good as that of large-v3. Using a smaller model makes the transcription faster at the cost of accuracy.
Option: --language:
Selecting a specific language forces Whisper to transcribe the input file(s) in that language. If no language is selected, Whisper tries to recognize the language automatically.
Interactive mode:
Allows the user to select whether Whisper should run interactively (by setting the parameter value to true). If the job is started in interactive mode, the user can access the app terminal or web interface; the latter gives access to a JupyterLab workspace for running notebooks. The default setting is non-interactive mode.
Archive password:
This will AES-encrypt and password-protect the ZIP output archive. The user must specify a password for the archive as a text string.
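The options above correspond to flags of the open-source whisper command-line tool. As a rough sketch of how they fit together (the helper function is hypothetical; on UCloud the app assembles the actual command from the submission form):

```python
def build_whisper_command(input_file, model="large-v3",
                          output_dir=None, output_format=None, language=None):
    """Assemble a whisper CLI invocation from the optional parameters above.

    Omitted options fall back to the tool's defaults, matching the app's
    behavior of generating all output formats unless one is selected.
    """
    cmd = ["whisper", input_file, "--model", model]
    if output_dir is not None:
        cmd += ["--output_dir", output_dir]
    if output_format is not None:
        cmd += ["--output_format", output_format]
    if language is not None:
        cmd += ["--language", language]
    return cmd

print(build_whisper_command("talk.m4a", language="Danish"))
```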
General Considerations¶
When using the Whisper app, there are a few things to keep in mind:
In general, larger models yield more accurate transcription results but also take longer to run. The user should therefore be sure to allocate enough time for the job, and/or extend the job lifetime if necessary.
Running Whisper on a GPU node is considerably faster than running it on a CPU node. However, the app can only use one GPU at a time, so users should only allocate single-GPU machines (i.e., *-gpu-1 machines) to their Whisper jobs.
Speaker diarization (i.e., distinguishing between different speakers) is not currently a feature of Whisper Transcription, but it is available in the Transcriber application on UCloud.