Transcriber

This utility creates a transcription of a voice or video recording using OpenAI's Whisper speech recognition model. On top of this, the utility uses WhisperX to add speaker diarization.

Input Format

The app can process .mp3, .mp4, .m4a, .wav and .mpg files.
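
Recordings in other formats can typically be converted to one of the supported formats beforehand, for example with ffmpeg. The sketch below is only an illustration; the file names are placeholders.

    # Convert a recording in an unsupported format (here .ogg) to 16 kHz mono WAV.
    # The file names are placeholders.
    ffmpeg -i recording.ogg -ar 16000 -ac 1 recording.wav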

Output Format

For each input file, transcripts are generated in the formats supported by Whisper, including plain text (.txt), subtitles (.srt and .vtt), tab-separated values (.tsv), and JSON (.json). By default, all formats are generated; a single format can be selected with the --output_format option (see below).

Output Folder

By default, the transcript files are saved in /Jobs/Transcriber/<job-id>/out. The user can select another directory using the --output_dir optional parameter (see below).

App Parameters

From the application's job submission page, the user may specify a number of optional parameters, as needed (a combined example invocation is sketched after the parameter list):

  • Initialization:
    Allows the user to run a Bash script (*.sh) with initialization code. This could be useful, for example, to pre-process audio files before transcription (a minimal example script is sketched after this list).

  • Input file:
    A single file which will be transcribed by Whisper. This optional parameter is suitable if the user only needs Whisper to transcribe one file in the job.

  • Input directory:
    The directory containing the file(s) which will be transcribed by Whisper. This optional parameter is suitable if the user needs Whisper to transcribe more than one file in the job. Output files are generated for each file in the 'Input directory'.

  • Option --output_dir:
    If the user wants the output files to be saved somewhere other than the default output folder, the desired folder can be specified here.

  • Option --output_format:
    The file format of the output; see the Output Format section above. By default, all output formats are generated and saved.

  • Option --model:
    The model which Whisper uses for the transcription. The default is large-v3, the largest and most accurate model. The large-v3-turbo model is faster, but its accuracy for Danish is lower than that of large-v3. In general, using a smaller model makes the transcription faster at the cost of accuracy.

  • Option --language:
    Selecting a specific language forces Whisper to transcribe the input file(s) in that language. If no language is selected, Whisper tries to recognize the language.

  • Interactive mode:
    Allows the user to select whether Whisper should run interactively (by setting the parameter value to true). If the job is started in interactive mode, the user can access the app terminal or web interface; the latter gives access to a JupyterLab workspace for running notebooks. The default is non-interactive mode.

  • Archive password:
    This AES-encrypts and password-protects the ZIP output archive. The user must specify the archive password as a text string.

  • Option --min_speakers:
    If the number of speakers is known in advance, this option can be used to set the minimum number of speakers. Using this configuration option may in some cases increase the accuracy of the speaker diarization.

  • Option --max_speakers:
    If the number of speakers is known in advance, this option can be used to set the maximum number of speakers. Using this configuration option may in some cases increase the accuracy of the speaker diarization.

  • Option --merge_speakers:
    This option enables merging of consecutive text entries from the same speaker into a single entry; the merged results are written to an additional set of output files suffixed with _merged.
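
As an illustration of the Initialization parameter, the sketch below shows a minimal initialization script. It assumes the recordings are located in a directory named audio/ (a placeholder) and converts them to 16 kHz mono WAV with ffmpeg before transcription; the actual paths depend on the job setup.

    #!/bin/bash
    # Minimal initialization sketch: convert all .mpg recordings in a
    # placeholder audio/ directory to 16 kHz mono WAV before transcription.
    set -euo pipefail
    shopt -s nullglob
    for f in audio/*.mpg; do
        ffmpeg -i "$f" -ar 16000 -ac 1 "${f%.mpg}.wav"
    done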

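The options above correspond to options of the underlying WhisperX command-line tool. The following is a sketch of a comparable manual whisperx invocation, not the app's exact wrapper command; the input file name and the HF_TOKEN environment variable are placeholders, and the --merge_speakers post-processing is handled by the app itself rather than by whisperx.

    # Sketch of a comparable whisperx invocation (not the app's exact command).
    # interview.wav is a placeholder; HF_TOKEN is assumed to hold a Hugging Face
    # token giving access to the diarization model.
    whisperx interview.wav \
        --model large-v3 \
        --language da \
        --diarize --min_speakers 2 --max_speakers 4 \
        --output_dir out --output_format all \
        --hf_token "$HF_TOKEN"
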
General Considerations

When using the Transcriber app, there are a few things to keep in mind:

  • In general, larger models yield more accurate transcription results but also take longer to run. The user should therefore be sure to allocate enough time for the job, and/or extend the job lifetime if necessary.

  • Running Transcriber on a GPU node is considerably faster than running it on a CPU node. However, the app can only use one GPU at a time. Therefore, users should only allocate single-GPU machines (i.e., *-gpu-1 machines) to their Transcriber jobs.
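
If a multi-GPU machine is allocated anyway, only one of the GPUs will be used. One way to make this explicit, for example from an initialization script, is to restrict CUDA device visibility; the device index 0 below is just an example.

    # Expose only the first GPU to the job; the index 0 is an example.
    export CUDA_VISIBLE_DEVICES=0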