OpenAI Whisper

Hello, I hope you are doing well.

Today I would like to tell you about OpenAI Whisper. Here is the explanation from its GitHub repository.

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.

In short, Whisper covers three main tasks:

Multilingual speech recognition (similar to what Siri, Alexa, and Google voice search do).
Speech translation (similar to Google Translate).
Language identification (similar to language detection in Google Translate).

You can call Whisper from Python or from the command line. Because it has a command-line interface, a program written in any language can use Whisper by running that command as a subprocess.
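
As a minimal sketch of the subprocess approach, here is how a Python program could build and run the whisper command. The helper names build_whisper_command and run_whisper and the file name audio.mp3 are my own illustrations, not part of Whisper, and the sketch assumes the whisper command is already installed and on your PATH.

```python
import subprocess


def build_whisper_command(audio_path, model=None):
    """Return the argument list for the whisper CLI (hypothetical helper)."""
    command = ["whisper", audio_path]
    if model is not None:
        # --model selects which Whisper model to use, e.g. "tiny" or "large".
        command += ["--model", model]
    return command


def run_whisper(audio_path, model=None):
    """Run the whisper CLI; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_whisper_command(audio_path, model), check=True)


# Only build and show the command here; call run_whisper("audio.mp3", "tiny")
# to actually transcribe a file that exists on your machine.
print(build_whisper_command("audio.mp3", model="tiny"))
```

The same idea works in other languages: start a process, pass the same arguments, and read the files Whisper writes.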

As an example, we will download a popular Shorts video from a well-known YouTuber (MrBeast). With yt-dlp installed, I ran yt-dlp https://www.youtube.com/shorts/se50viFJ0AQ to fetch the video.

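Whisper works on audio. It decodes its input through FFmpeg, so it may be able to read a video file directly, but I prefer to extract the audio first so the input is explicit. I am using FFmpeg for this purpose. I executed the command ffmpeg -y -i Would\ You\ Fly\ To\ Paris\ For\ A\ Baguette？\ \[se50viFJ0AQ\].webm -b:a 192K -vn output.mp3. Here -vn drops the video stream, -b:a 192K sets the audio bitrate, and -y overwrites output.mp3 if it already exists.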
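
If you want to run the same conversion from code, a minimal sketch in Python could look like this. The helper names build_ffmpeg_command and extract_audio are hypothetical, and the sketch assumes the ffmpeg command is installed and on your PATH.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, audio_path="output.mp3", bitrate="192K"):
    """Return the ffmpeg argument list for audio extraction (hypothetical helper)."""
    return [
        "ffmpeg",
        "-y",                   # overwrite the output file if it exists
        "-i", str(video_path),  # input video
        "-b:a", bitrate,        # audio bitrate
        "-vn",                  # drop the video stream
        str(audio_path),
    ]


def extract_audio(video_path, audio_path="output.mp3"):
    """Run ffmpeg and return the path of the extracted audio file."""
    subprocess.run(build_ffmpeg_command(video_path, audio_path), check=True)
    return Path(audio_path)


# Show the command for an illustrative input name without running ffmpeg.
print(build_ffmpeg_command("video.webm"))
```

Passing the arguments as a list means the file name is not interpreted by a shell, so spaces and brackets in the name do not need backslash escapes.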

Before transcribing, make sure your machine has enough VRAM for the model you choose. The Whisper README lists the requirement per model, from roughly 1 GB for tiny to roughly 10 GB for large. The tiny model is also the fastest, so consider it if speed matters more than accuracy. To transcribe the audio to text, I ran whisper output.mp3 --model large. When I ran this, the default model was small (the default can differ between releases, so check whisper --help), but I opted for the large model for better results. The first time you execute the command, Whisper will download the model before proceeding.

Whisper detects the language automatically from up to the first 30 seconds of your audio. However, you can specify the language yourself by adding the --language flag, for example, --language Japanese. To translate the speech into English instead of transcribing it in the original language, add the --task translate flag.
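
The same options are available from Python. The following is a minimal sketch, assuming the openai-whisper package is installed; the helper names transcribe_options and transcribe are my own, while whisper.load_model and model.transcribe come from the Whisper README. The import is placed inside the function so the helper can be read and tried without loading the library.

```python
def transcribe_options(language=None, translate=False):
    """Build keyword arguments for model.transcribe() (hypothetical helper)."""
    # "translate" produces English text; "transcribe" keeps the spoken language.
    options = {"task": "translate" if translate else "transcribe"}
    if language is not None:
        # Skips automatic language detection, like the --language flag.
        options["language"] = language
    return options


def transcribe(audio_path, model_name="tiny", language=None, translate=False):
    """Transcribe one audio file and return the recognized text."""
    import whisper  # imported lazily; requires the openai-whisper package

    model = whisper.load_model(model_name)  # downloads the model on first use
    result = model.transcribe(audio_path, **transcribe_options(language, translate))
    return result["text"]


# Example of the options for Japanese audio translated to English.
print(transcribe_options(language="ja", translate=True))
```

Calling transcribe("output.mp3", model_name="large") would be the rough equivalent of the command above, except that it returns the text instead of writing files.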

By default, the output consists of five files, in json, srt, tsv, txt, and vtt formats. To write only one format, add the --output_format flag, for example, --output_format json. Below is an example of each output.

JSON

{
  "text": " if i give you a hundred dollars would you go to paris to give me a baguette no if i give you three hundred dollars would you fly to paris to bring me back a baguette yeah right now actually yeah fly included yes i can't believe she's actually getting some baguettes right now i hope mr beast is hungry wow i cannot believe i'm here one stretch mr beast i'm ready mr beast oh my baguette i honestly just needed one here you can just have this whatever that is oh",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 4.16,
      "text": " if i give you a hundred dollars would you go to paris to give me a baguette no if i give you",
      "tokens": [
        50365, 498, 741, 976, 291, 257, 3262, 3808, 576, 291, 352, 281, 971,
        271, 281, 976, 385, 257, 3411, 84, 3007, 572, 498, 741, 976, 291, 50573
      ],
      "temperature": 0.0,
      "avg_logprob": -0.1995048136324496,
      "compression_ratio": 1.8835616438356164,
      "no_speech_prob": 0.11105627566576004
    },
    {
      "id": 1,
      "seek": 0,
      "start": 4.16,
      "end": 7.12,
      "text": " three hundred dollars would you fly to paris to bring me back a baguette",
      "tokens": [
        50573, 1045, 3262, 3808, 576, 291, 3603, 281, 971, 271, 281, 1565, 385,
        646, 257, 3411, 84, 3007, 50721
      ],
      "temperature": 0.0,
      "avg_logprob": -0.1995048136324496,
      "compression_ratio": 1.8835616438356164,
      "no_speech_prob": 0.11105627566576004
    },
    {
      "id": 2,
      "seek": 0,
      "start": 7.84,
      "end": 15.92,
      "text": " yeah right now actually yeah fly included yes",
      "tokens": [50757, 1338, 558, 586, 767, 1338, 3603, 5556, 2086, 51161],
      "temperature": 0.0,
      "avg_logprob": -0.1995048136324496,
      "compression_ratio": 1.8835616438356164,
      "no_speech_prob": 0.11105627566576004
    },
    {
      "id": 3,
      "seek": 0,
      "start": 21.84,
      "end": 23.92,
      "text": " i can't believe she's actually getting some baguettes right now",
      "tokens": [
        51457, 741, 393, 380, 1697, 750, 311, 767, 1242, 512, 3411, 84, 16049,
        558, 586, 51561
      ],
      "temperature": 0.0,
      "avg_logprob": -0.1995048136324496,
      "compression_ratio": 1.8835616438356164,
      "no_speech_prob": 0.11105627566576004
    },
    {
      "id": 4,
      "seek": 2392,
      "start": 23.92,
      "end": 30.8,
      "text": " i hope mr beast is hungry wow i cannot believe i'm here",
      "tokens": [
        50365, 741, 1454, 33660, 13464, 307, 8067, 6076, 741, 2644, 1697, 741,
        478, 510, 50709
      ],
      "temperature": 0.0,
      "avg_logprob": -0.24799319675990514,
      "compression_ratio": 1.5163934426229508,
      "no_speech_prob": 0.0011775000020861626
    },
    {
      "id": 5,
      "seek": 2392,
      "start": 33.36,
      "end": 36.08,
      "text": " one stretch mr beast i'm ready",
      "tokens": [50837, 472, 5985, 33660, 13464, 741, 478, 1919, 50973],
      "temperature": 0.0,
      "avg_logprob": -0.24799319675990514,
      "compression_ratio": 1.5163934426229508,
      "no_speech_prob": 0.0011775000020861626
    },
    {
      "id": 6,
      "seek": 2392,
      "start": 40.72,
      "end": 47.0,
      "text": " mr beast oh my baguette i honestly just needed one here you can just have this whatever that is oh",
      "tokens": [
        51205, 33660, 13464, 1954, 452, 3411, 84, 3007, 741, 6095, 445, 2978,
        472, 510, 291, 393, 445, 362, 341, 2035, 300, 307, 1954, 51519
      ],
      "temperature": 0.0,
      "avg_logprob": -0.24799319675990514,
      "compression_ratio": 1.5163934426229508,
      "no_speech_prob": 0.0011775000020861626
    }
  ],
  "language": "en"
}
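
The JSON output is the easiest one to consume from code. As a minimal sketch, the function below (parse_segments is a hypothetical name) takes the JSON text and returns the start time, end time, and text of each segment. It relies only on the fields visible in the example above.

```python
import json


def parse_segments(json_text):
    """Return (start, end, text) tuples from Whisper's JSON output."""
    result = json.loads(json_text)
    return [
        # Whisper puts a leading space in each segment text, so strip it.
        (segment["start"], segment["end"], segment["text"].strip())
        for segment in result["segments"]
    ]


# A shortened sample with the same shape as the output above.
sample = """
{
  "text": " i hope mr beast is hungry one stretch mr beast i'm ready",
  "segments": [
    {"id": 0, "start": 23.92, "end": 30.8, "text": " i hope mr beast is hungry"},
    {"id": 1, "start": 33.36, "end": 36.08, "text": " one stretch mr beast i'm ready"}
  ],
  "language": "en"
}
"""

for start, end, text in parse_segments(sample):
    print(f"{start:7.2f} {end:7.2f}  {text}")
```

To use it on a real file, read the file Whisper wrote (for example with open(...).read()) and pass the contents in.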

SRT

1
00:00:00,000 --> 00:00:04,160
if i give you a hundred dollars would you go to paris to give me a baguette no if i give you

2
00:00:04,160 --> 00:00:07,120
three hundred dollars would you fly to paris to bring me back a baguette

3
00:00:07,840 --> 00:00:15,920
yeah right now actually yeah fly included yes

4
00:00:21,840 --> 00:00:23,920
i can't believe she's actually getting some baguettes right now

5
00:00:23,920 --> 00:00:30,800
i hope mr beast is hungry wow i cannot believe i'm here

6
00:00:33,360 --> 00:00:36,080
one stretch mr beast i'm ready

7
00:00:40,720 --> 00:00:47,000
mr beast oh my baguette i honestly just needed one here you can just have this whatever that is oh
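
The SRT file carries the same start and end times as the JSON segments, written as hours:minutes:seconds,milliseconds. The sketch below shows that relationship; srt_timestamp and to_srt are hypothetical helpers of mine, not Whisper's own writer.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    milliseconds = round(seconds * 1000)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{milliseconds:03d}"


def to_srt(segments):
    """Build SRT text from a list of segments with start, end, and text keys."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = srt_timestamp(segment["start"])
        end = srt_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    # Subtitle blocks are separated by a blank line.
    return "\n".join(blocks)


segments = [
    {"start": 33.36, "end": 36.08, "text": " one stretch mr beast i'm ready"},
]
print(to_srt(segments))
```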

TSV

start	end	text
0	4160	if i give you a hundred dollars would you go to paris to give me a baguette no if i give you
4160	7120	three hundred dollars would you fly to paris to bring me back a baguette
7840	15920	yeah right now actually yeah fly included yes
21840	23920	i can't believe she's actually getting some baguettes right now
23920	30800	i hope mr beast is hungry wow i cannot believe i'm here
33360	36080	one stretch mr beast i'm ready
40720	47000	mr beast oh my baguette i honestly just needed one here you can just have this whatever that is oh

TXT

if i give you a hundred dollars would you go to paris to give me a baguette no if i give you
three hundred dollars would you fly to paris to bring me back a baguette
yeah right now actually yeah fly included yes
i can't believe she's actually getting some baguettes right now
i hope mr beast is hungry wow i cannot believe i'm here
one stretch mr beast i'm ready
mr beast oh my baguette i honestly just needed one here you can just have this whatever that is oh

VTT

WEBVTT

00:00.000 --> 00:04.160
if i give you a hundred dollars would you go to paris to give me a baguette no if i give you

00:04.160 --> 00:07.120
three hundred dollars would you fly to paris to bring me back a baguette

00:07.840 --> 00:15.920
yeah right now actually yeah fly included yes

00:21.840 --> 00:23.920
i can't believe she's actually getting some baguettes right now

00:23.920 --> 00:30.800
i hope mr beast is hungry wow i cannot believe i'm here

00:33.360 --> 00:36.080
one stretch mr beast i'm ready

00:40.720 --> 00:47.000
mr beast oh my baguette i honestly just needed one here you can just have this whatever that is oh

So, you get the idea. The rest is an engineering problem, whether you want to build a competitor to Siri or to Google Translate. The use case I have tried myself is Pranoto.ai, where you can upload a video and search for text spoken inside it. It’s open source, so you can try it out and read the code.
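
To give a feel for how small the core of such a search can be, here is a minimal sketch that looks for a word in transcribed segments and returns where it was spoken. It is only an illustration under my own assumptions, not the actual Pranoto.ai code; search_segments is a hypothetical helper and the sample data is taken from the transcript above.

```python
def search_segments(segments, query):
    """Return (start, text) for every segment whose text contains the query."""
    needle = query.casefold()  # case-insensitive match
    return [
        (segment["start"], segment["text"].strip())
        for segment in segments
        if needle in segment["text"].casefold()
    ]


segments = [
    {"start": 7.84, "text": " yeah right now actually yeah fly included yes"},
    {"start": 21.84, "text": " i can't believe she's actually getting some baguettes right now"},
    {"start": 33.36, "text": " one stretch mr beast i'm ready"},
]

for start, text in search_segments(segments, "Baguette"):
    print(f"{start:.2f}s: {text}")
```

A real application would add storage, indexing, and a player that jumps to the matching timestamp, but the transcript itself already contains everything the search needs.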

Some people think they need to learn a lot of things before they can create an AI application. The reality is that if you use existing open-source models, you can start building your own AI application right away. All you need is a use case. Then you can start coding and put AI on your résumé.

That’s all from me. See you again.