Transcriber¶
This utility is used to make a transcription of a voice or video recording, using the Whisper speech recognition model from OpenAI. On top of this, the utility uses WhisperX to add speaker diarization.
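To illustrate the underlying pipeline, the sketch below shows roughly how a single file could be transcribed and diarized with the WhisperX command-line tool. It is only an illustration, not the app's exact invocation: the file name recording.mp3, the output folder out, and the HF_TOKEN environment variable are placeholders, and the app assembles the actual command from the job parameters described below.

```bash
# Illustrative sketch only, not the app's exact command.
# recording.mp3, out/ and HF_TOKEN are placeholders.
whisperx recording.mp3 \
    --model large-v3 \
    --diarize \
    --hf_token "$HF_TOKEN" \
    --output_dir out
```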
Input Format¶
The app can process .mp3, .mp4, .m4a, .wav, and .mpg files.
Output Format¶
CSV: Contains every parameter output by the Whisper model.
DOTE: Format for the DOTE transcription software developed by the BigSoftVideo team at AAU.
DOCX: Office Open XML document (Microsoft Word).
JSON: JavaScript Object Notation.
SRT: SubRip file format, a widely adopted subtitle format.
TSV: Tab-separated values file containing start, end, and text.
TXT: Plain text file with the transcription.
VTT: Web Video Text Tracks format.
ZIP: Archive with all of the output files.
Output Folder¶
By default, the transcript files are saved in /Jobs/Transcriber/<job-id>/out. The user can select another directory using the corresponding optional parameter.
App Parameters¶
From the application's job submission page the user may specify a number of optional parameters, as needed:
Initialization: Allows the user to run a Bash script (*.sh) with initialization code. This could be useful, for example, to pre-process audio files before transcription (see the sketch after this list).
Input file: A single file which will be transcribed by Whisper. This optional parameter is suitable if the user only needs Whisper to transcribe one file in the job.
Input directory: The directory containing the file(s) which will be transcribed by Whisper. This optional parameter is suitable if the user needs Whisper to transcribe more than one file in the job. Output files are generated for each file in the 'Input directory'.
Option --output_dir: If the user wants the output files saved somewhere other than the default output folder, the desired folder is specified here.
Option --output_format: The file format of the output. See details above. By default, all the output formats are generated and saved.
Option --model: The model which Whisper will use for the transcription. The default is large-v3, i.e., the largest and most accurate model. The large-v3-turbo model is faster, but when transcribing Danish its accuracy is not as good as that of the large-v3 model. Using a smaller model makes the transcription process faster at the cost of accuracy.
Option --language: Selecting a specific language forces Whisper to transcribe the input file(s) in that language. If no language is selected, Whisper tries to recognize the language automatically.
Interactive mode: Allows the user to select whether Whisper should run interactively (by setting the parameter value to true). If the job is started in interactive mode, the user can access the app terminal or web interface. The latter gives access to a JupyterLab workspace to run notebooks. The default setting is non-interactive mode.
Archive password: This will AES-encrypt and password-protect the ZIP output archive. The user must specify a password for the archive as a text string.
Option --min_speakers: If the number of speakers is known in advance, this option can be used to set the minimum number of speakers. Using this configuration option may in some cases increase the accuracy of the speaker diarization.
Option --max_speakers: If the number of speakers is known in advance, this option can be used to set the maximum number of speakers. Using this configuration option may in some cases increase the accuracy of the speaker diarization.
Option --merge_speakers: This option enables merging of consecutive text entries from the same speaker into one entry, written to a set of additional output files postfixed with _merged.
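As a simple example of an Initialization script (referenced in the list above), the sketch below uses ffmpeg to convert .mpg recordings to 16 kHz mono WAV files before transcription. The folder /work/recordings is an assumed placeholder; adjust it to the directory actually mounted in the job.

```bash
#!/bin/bash
# Minimal initialization sketch: pre-process video files before transcription.
# /work/recordings is a placeholder path; adjust it to the folder mounted in your job.
set -euo pipefail

for f in /work/recordings/*.mpg; do
    # Extract the audio track as 16 kHz mono WAV
    # (-n: do not overwrite, -vn: drop video, -ac 1: mono, -ar 16000: sample rate).
    ffmpeg -n -i "$f" -vn -ac 1 -ar 16000 "${f%.mpg}.wav"
done
```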
General Considerations¶
When using the Transcriber app, there are a few things to keep in mind:
In general, larger models yield more accurate transcription results but also take longer to run. The user should therefore be sure to allocate enough time for the job, and/or extend the job lifetime if necessary.
Running Transcriber on a GPU node is considerably faster than running it on a CPU node. However, the app can only use one GPU at a time. Therefore, users should only allocate single-GPU machines (i.e., *-gpu-1 machines) to their Transcriber jobs.