Hello, I hope you are doing well.
Today I would like to tell you about OpenAI Whisper. Here is the explanation from its GitHub repository.
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
The passage explains that the main use cases for Whisper are:
- Multilingual speech recognition (e.g., Siri, Alexa, Google voice search).
- Speech translation (e.g., Google Translate).
- Language identification (e.g., Google Translate).
You can call Whisper directly from Python or from the command line. Because it is available as a command-line tool, a program written in any language can use it by running that command as a subprocess.
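As a rough illustration of the Python route, here is a minimal sketch built on the two calls shown in the Whisper README, whisper.load_model and model.transcribe. The function name transcribe_file, the file name audio.mp3, and the load_model parameter (there only so the function can be tried without downloading a model) are my own assumptions, not part of Whisper.

```python
from typing import Callable, Optional


def transcribe_file(
    audio_path: str,
    model_name: str = "tiny",
    load_model: Optional[Callable] = None,
) -> str:
    """Transcribe one audio file and return the recognized text."""
    if load_model is None:
        # Imported lazily so this module still loads where Whisper is not installed.
        import whisper  # pip install -U openai-whisper

        load_model = whisper.load_model
    model = load_model(model_name)  # downloads the weights on first use
    result = model.transcribe(audio_path)  # dict with "text", "segments", "language"
    return result["text"].strip()
```

With Whisper installed, transcribe_file("audio.mp3") would return the transcript as a single string.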
We will download a popular Shorts video from a well-known YouTuber (MrBeast). I installed yt-dlp and ran the following command to obtain the video:
yt-dlp https://www.youtube.com/shorts/se50viFJ0AQ
As far as I know, Whisper works on audio rather than video, so I first converted the video to audio. I used FFmpeg for this, with the following command:
ffmpeg -y -i Would\ You\ Fly\ To\ Paris\ For\ A\ Baguette?\ \[se50viFJ0AQ\].webm -b:a 192K -vn output.mp3
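If you prefer to drive the conversion from code, the FFmpeg command above can be sketched in Python with subprocess. The function names build_ffmpeg_command and extract_audio are hypothetical. Passing the arguments as a list means the spaces, question mark, and brackets in the file name need no shell escaping.

```python
import subprocess
from typing import List


def build_ffmpeg_command(video_path: str, audio_path: str, bitrate: str = "192K") -> List[str]:
    """Return the FFmpeg arguments that extract the audio track of a video."""
    return [
        "ffmpeg",
        "-y",              # overwrite the output file if it already exists
        "-i", video_path,  # input video
        "-b:a", bitrate,   # audio bitrate
        "-vn",             # drop the video stream
        audio_path,
    ]


def extract_audio(video_path: str, audio_path: str = "output.mp3") -> None:
    """Run FFmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(video_path, audio_path), check=True)
```

Calling extract_audio("Would You Fly To Paris For A Baguette? [se50viFJ0AQ].webm") should produce the same output.mp3, assuming ffmpeg is on your PATH.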
Before attempting to transcribe the audio, make sure your machine has enough VRAM for the model you choose, as specified in the documentation. If memory or speed is a concern, consider using the tiny model. To transcribe the audio to text, I ran:
whisper output.mp3 --model large
The default model is small (at the time of writing), but I opted for the large model for better results. The first time you run the command, Whisper downloads the model before proceeding. Whisper uses up to the first 30 seconds of your audio to detect the language automatically. However, you can specify the language yourself with the --language flag, for example --language Japanese. Additionally, you can translate the speech into English by adding --task translate.
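To make the combinations of these flags concrete, here is a small sketch that assembles the whisper command line from Python. It uses only the flags mentioned above (--model, --language, --task translate). The helper name build_whisper_command and its parameters are hypothetical.

```python
from typing import List, Optional


def build_whisper_command(
    audio_path: str,
    model: str = "large",
    language: Optional[str] = None,
    translate: bool = False,
) -> List[str]:
    """Assemble the argument list for the whisper command-line tool."""
    cmd = ["whisper", audio_path, "--model", model]
    if language is not None:
        # Skips automatic language detection.
        cmd += ["--language", language]
    if translate:
        # Translates the speech into English instead of transcribing it as-is.
        cmd += ["--task", "translate"]
    return cmd
```

The resulting list can be handed to subprocess.run(cmd, check=True), which is also how a program in another language would call Whisper: by spawning the same command.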
The output will consist of five files in json, srt, tsv, txt, and vtt formats. You can also choose a single output format by adding the --output_format flag, for example --output_format json. Below is an example of the output.
JSON
{
"text": " if i give you a hundred dollars would you go to paris to give me a baguette no if i give you three hundred dollars would you fly to paris to bring me back a baguette yeah right now actually yeah fly included yes i can't believe she's actually getting some baguettes right now i hope mr beast is hungry wow i cannot believe i'm here one stretch mr beast i'm ready mr beast oh my baguette i honestly just needed one here you can just have this whatever that is oh",
"segments": [
{
"id": 0,
"seek": 0,
"start": 0.0,
"end": 4.16,
"text": " if i give you a hundred dollars would you go to paris to give me a baguette no if i give you",
"tokens": [
50365, 498, 741, 976, 291, 257, 3262, 3808, 576, 291, 352, 281, 971,
271, 281, 976, 385, 257, 3411, 84, 3007, 572, 498, 741, 976, 291, 50573
],
"temperature": 0.0,
"avg_logprob": -0.1995048136324496,
"compression_ratio": 1.8835616438356164,
"no_speech_prob": 0.11105627566576004
},
{
"id": 1,
"seek": 0,
"start": 4.16,
"end": 7.12,
"text": " three hundred dollars would you fly to paris to bring me back a baguette",
"tokens": [
50573, 1045, 3262, 3808, 576, 291, 3603, 281, 971, 271, 281, 1565, 385,
646, 257, 3411, 84, 3007, 50721
],
"temperature": 0.0,
"avg_logprob": -0.1995048136324496,
"compression_ratio": 1.8835616438356164,
"no_speech_prob": 0.11105627566576004
},
{
"id": 2,
"seek": 0,
"start": 7.84,
"end": 15.92,
"text": " yeah right now actually yeah fly included yes",
"tokens": [50757, 1338, 558, 586, 767, 1338, 3603, 5556, 2086, 51161],
"temperature": 0.0,
"avg_logprob": -0.1995048136324496,
"compression_ratio": 1.8835616438356164,
"no_speech_prob": 0.11105627566576004
},
{
"id": 3,
"seek": 0,
"start": 21.84,
"end": 23.92,
"text": " i can't believe she's actually getting some baguettes right now",
"tokens": [
51457, 741, 393, 380, 1697, 750, 311, 767, 1242, 512, 3411, 84, 16049,
558, 586, 51561
],
"temperature": 0.0,
"avg_logprob": -0.1995048136324496,
"compression_ratio": 1.8835616438356164,
"no_speech_prob": 0.11105627566576004
},
{
"id": 4,
"seek": 2392,
"start": 23.92,
"end": 30.8,
"text": " i hope mr beast is hungry wow i cannot believe i'm here",
"tokens": [
50365, 741, 1454, 33660, 13464, 307, 8067, 6076, 741, 2644, 1697, 741,
478, 510, 50709
],
"temperature": 0.0,
"avg_logprob": -0.24799319675990514,
"compression_ratio": 1.5163934426229508,
"no_speech_prob": 0.0011775000020861626
},
{
"id": 5,
"seek": 2392,
"start": 33.36,
"end": 36.08,
"text": " one stretch mr beast i'm ready",
"tokens": [50837, 472, 5985, 33660, 13464, 741, 478, 1919, 50973],
"temperature": 0.0,
"avg_logprob": -0.24799319675990514,
"compression_ratio": 1.5163934426229508,
"no_speech_prob": 0.0011775000020861626
},
{
"id": 6,
"seek": 2392,
"start": 40.72,
"end": 47.0,
"text": " mr beast oh my baguette i honestly just needed one here you can just have this whatever that is oh",
"tokens": [
51205, 33660, 13464, 1954, 452, 3411, 84, 3007, 741, 6095, 445, 2978,
472, 510, 291, 393, 445, 362, 341, 2035, 300, 307, 1954, 51519
],
"temperature": 0.0,
"avg_logprob": -0.24799319675990514,
"compression_ratio": 1.5163934426229508,
"no_speech_prob": 0.0011775000020861626
}
],
"language": "en"
}
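The JSON file is the easiest of the five to consume from code. As a minimal sketch, assuming the file was written as output.json next to the audio, the following reads it back and keeps only the start time, end time, and text of each segment. The function name load_segments is my own.

```python
import json
from typing import List, Tuple


def load_segments(json_path: str) -> List[Tuple[float, float, str]]:
    """Read a Whisper JSON transcript and return (start, end, text) per segment."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    # Each segment's text starts with a space, so strip it for display.
    return [(seg["start"], seg["end"], seg["text"].strip()) for seg in data["segments"]]
```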
SRT
1
00:00:00,000 --> 00:00:04,160
if i give you a hundred dollars would you go to paris to give me a baguette no if i give you
2
00:00:04,160 --> 00:00:07,120
three hundred dollars would you fly to paris to bring me back a baguette
3
00:00:07,840 --> 00:00:15,920
yeah right now actually yeah fly included yes
4
00:00:21,840 --> 00:00:23,920
i can't believe she's actually getting some baguettes right now
5
00:00:23,920 --> 00:00:30,800
i hope mr beast is hungry wow i cannot believe i'm here
6
00:00:33,360 --> 00:00:36,080
one stretch mr beast i'm ready
7
00:00:40,720 --> 00:00:47,000
mr beast oh my baguette i honestly just needed one here you can just have this whatever that is oh
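The SRT cues above are the JSON segments with their start and end times, given in seconds, rewritten as HH:MM:SS,mmm. Whisper writes this file for you, so the following is only a sketch of the conversion to show how the two formats relate. The function names are hypothetical.

```python
from typing import Dict, List


def format_srt_timestamp(seconds: float) -> str:
    """Convert seconds (e.g. 4.16) to an SRT timestamp (e.g. 00:00:04,160)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_timestamp(seg["start"])
        end = format_srt_timestamp(seg["end"])
        cues.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(cues)
```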
TSV
start end text
0 4160 if i give you a hundred dollars would you go to paris to give me a baguette no if i give you
4160 7120 three hundred dollars would you fly to paris to bring me back a baguette
7840 15920 yeah right now actually yeah fly included yes
21840 23920 i can't believe she's actually getting some baguettes right now
23920 30800 i hope mr beast is hungry wow i cannot believe i'm here
33360 36080 one stretch mr beast i'm ready
40720 47000 mr beast oh my baguette i honestly just needed one here you can just have this whatever that is oh
TXT
if i give you a hundred dollars would you go to paris to give me a baguette no if i give you
three hundred dollars would you fly to paris to bring me back a baguette
yeah right now actually yeah fly included yes
i can't believe she's actually getting some baguettes right now
i hope mr beast is hungry wow i cannot believe i'm here
one stretch mr beast i'm ready
mr beast oh my baguette i honestly just needed one here you can just have this whatever that is oh
VTT
WEBVTT
00:00.000 --> 00:04.160
if i give you a hundred dollars would you go to paris to give me a baguette no if i give you
00:04.160 --> 00:07.120
three hundred dollars would you fly to paris to bring me back a baguette
00:07.840 --> 00:15.920
yeah right now actually yeah fly included yes
00:21.840 --> 00:23.920
i can't believe she's actually getting some baguettes right now
00:23.920 --> 00:30.800
i hope mr beast is hungry wow i cannot believe i'm here
00:33.360 --> 00:36.080
one stretch mr beast i'm ready
00:40.720 --> 00:47.000
mr beast oh my baguette i honestly just needed one here you can just have this whatever that is oh
So, you get the idea. The rest is an engineering problem, whether you want to build a competitor to Siri or to Google Translate. The use case I have tried is Pranoto.ai, where you can upload a video and search for text inside it. It’s open source, so you can try it out and read the code.
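To give a feel for how small the core of such a search can be, here is a sketch of a case-insensitive search over the segments from Whisper's JSON output. This is my own illustration for this post, not how Pranoto.ai is implemented, and the function name search_segments is hypothetical.

```python
from typing import Dict, List


def search_segments(segments: List[Dict], query: str) -> List[Dict]:
    """Return the segments whose text contains the query, ignoring case."""
    needle = query.lower()
    return [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        for seg in segments
        if needle in seg["text"].lower()
    ]
```

Each hit carries the start and end time in seconds, which is enough to jump to the right position in the video player.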
Some people think they need to learn a lot of things before they can create an AI application. The reality is that if you rely on existing open-source models, you can start building your own AI application right away. All you need is a use case; then start coding, and you can put AI on your résumé.
That’s all from me. See you again.