When OpenAI’s Whisper speech transcription engine was announced a number of months ago, I was very interested in potentially using it with my podcast transcription side project for the Accidental Tech Podcast. I quickly put it through my reference ATP accuracy test and saw Flashlight finally displaced as the most accurate open source transcription I’d tested. And the difference was not small.
I was also excited that with Whisper I could deprecate my existing custom-built Flashlight chunking transcription fusion system. For background, my previous implementation would split audio files into 15 second chunks (separated by 5 second deltas). This was because Flashlight’s accuracy dropped quickly once you got past the 15 second range. The system would transcribe each 15 second chunk and then, using a custom sliding window algorithm, pick the most likely word from the up to three transcripts covering each segment of time, generating a ‘fused’ transcript from all of the individual transcripts.
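To make the chunking scheme concrete, here is a minimal sketch. It assumes the "5 second deltas" mean that consecutive chunk start times are 5 seconds apart, which is what makes up to three chunks cover any given instant. The function names are hypothetical, and the majority-vote `fuse_word` is only a stand-in: the actual fusion algorithm is not described in enough detail here to reproduce.

```python
from collections import Counter

CHUNK_SECONDS = 15.0  # window length, from the description above
STEP_SECONDS = 5.0    # assumed spacing between consecutive window starts


def chunk_windows(duration, chunk=CHUNK_SECONDS, step=STEP_SECONDS):
    """Return (start, end) pairs for overlapping windows covering the audio."""
    windows = []
    start = 0.0
    while start < duration:
        # The last few windows are clipped to the end of the audio.
        windows.append((start, min(start + chunk, duration)))
        start += step
    return windows


def covering_windows(t, windows):
    """Indices of the windows whose time span contains instant t."""
    return [i for i, (start, end) in enumerate(windows) if start <= t < end]


def fuse_word(candidates):
    """Stand-in fusion rule (an assumption): majority vote over candidate words."""
    return Counter(candidates).most_common(1)[0][0]


windows = chunk_windows(30.0)
print(windows)
print(covering_windows(12.0, windows))  # windows that contain t = 12 s
print(fuse_word(["whisper", "whisper", "whisker"]))
```

With a 15 second window and a 5 second step, any instant away from the very start of the file falls inside three windows, which matches the "up to three transcripts" mentioned above.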
I did feel vindicated when I saw that Whisper uses a sliding window in their implementation as well for handling large audio files.
Based on the improvement in accuracy and the promise of simplifying my catatp.fm processing code base, I excitedly dove into prototyping its integration into the workflow.
In digging through the transcription outputs from Whisper, I discovered that a fair number of the segment timestamps were noticeably inaccurate. This poses a problem for my use case. Specifically, I am rather proud of catatp.fm’s per-word highlighting during playback. Lacking per-word timestamps would also break the word-level speaker detection the site leverages.
While Whisper supposedly provides phrase-level timestamps, which I potentially could have used to refine the word-level timestamps using Flashlight or another alignment system, their lack of accuracy would have made the alignment task significantly harder. Reinforcing the issue, I even saw negative segment durations (where the ending time for the segment was before the supposed starting time) during testing.
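A check for the negative-duration problem can be sketched in a few lines. This assumes segments shaped like the entries in the `segments` list of Whisper's Python transcription result (dictionaries with `start`, `end`, and `text` keys). The sample values below are made up purely for illustration and are not from the ATP transcripts.

```python
def find_negative_duration_segments(segments):
    """Return the segments whose end time falls before their start time."""
    return [seg for seg in segments if seg["end"] < seg["start"]]


# Illustrative, made-up segments (not real transcription output).
sample_segments = [
    {"start": 0.0, "end": 4.2, "text": "first phrase"},
    {"start": 9.8, "end": 7.5, "text": "second phrase"},  # ends before it starts
    {"start": 10.1, "end": 13.0, "text": "third phrase"},
]

for seg in find_negative_duration_segments(sample_segments):
    print(f"bad segment: {seg['start']} -> {seg['end']}: {seg['text']}")
```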
I started to play around with performing an alignment pre-pass (more or less sliding a window forward based on the alignment outputs from previous segments), but ran into some issues.
So the Whisper transition was put on one of the back burners of my oven of side projects (tortured metaphor alert), pending a further development.
One day a couple weeks ago I was searching around and came across WhisperX. I was ecstatic, as it looked like exactly the sort of thing I had started working on myself, except by someone who had devoted a lot more time and energy than I had available for a side project.
After a quick test transcription turned out great, I quickly integrated the Python API for WhisperX into my audio processing workflow and let it chug through ATP’s massive backlog. Because I hate to leave any accuracy on the table, I only used Whisper’s large_v2 model for transcription.
I thought it would be interesting to compare the speed of the different models of Whisper to each other, and also on differing hardware. The systems I used for this benchmark included:
Note added 2022/04/16: FYI I included my Amazon affiliate links for the products above, but I wouldn’t recommend purchasing a 2080 TI if you’re looking to transcribe audio using a GPU, as it is a couple generations old. Instead I would recommend a GPU with at least 11GB of VRAM so you can use the full size model. For nVidia these include the RTX 4070, RTX 4080, and RTX 4090. The good news is the GPU shortage the last couple years seems to have finally abated, so you can hopefully get some list price GPUs. I also wouldn’t recommend purchasing a 2019 27” iMac, as the M1/M2 iMac will perform dramatically better. Alright, back to your regularly scheduled blog post.
I excluded the model loading time for each of the different systems, as the model only needs to be loaded once and can then be used to transcribe as many audio files as desired. If you were trying to create a client app for transcription, however, I could see model loading time being important.
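One way to keep model loading out of the measurement is to start the clock only after the model is in memory. The sketch below is an assumption about how such a harness could look, not the benchmark code actually used. `time_transcription` and its two callable parameters are hypothetical names, chosen so the same function works for any backend.

```python
import time


def time_transcription(load_model, transcribe, audio_path):
    """Time only the transcription step; model loading happens before the clock starts."""
    model = load_model()  # deliberately not timed
    started = time.perf_counter()
    result = transcribe(model, audio_path)
    elapsed = time.perf_counter() - started
    return result, elapsed
```

With the standard Python implementation of Whisper, `load_model` could be something like `lambda: whisper.load_model("tiny", device="cpu")` and `transcribe` something like `lambda model, path: model.transcribe(path)`.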
All systems were run against chapter 2 of episode 205 of ATP, which has been a standard benchmark for this project. The chapter is ~808 seconds long.
First off, let’s briefly examine the performance of Whisper’s different models. Whisper’s github page goes over the differences in each model, and more or less the larger the model the better the accuracy but the slower the transcription. But just how much slower? The following benchmarks were run using the standard Python implementation of Whisper, set to run on the CPU only to get initial numbers.
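|System||tiny (s)||tiny.en (s)||base (s)||base.en (s)||small (s)||small.en (s)||medium (s)||medium.en (s)||large-v1 (s)||large (s)|
|AMD Ryzen 3900x||46.57||44.97||77.77||75.88||202.31||193.60||519.60||508.85||958.85||1076.10|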
Yikes. While I want to have the best possible accuracy for catatp.fm, and thus want to use the large model, having the transcriptions running at ~96% of real time is not great, particularly for going through the entire backlog. And that speed is only achieved when dedicating my personal laptop to the project, which I wasn’t excited to do.
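For reference, a "percent of real time" figure can be computed from any of the timings in this post. The sketch below assumes the figure means processing time divided by audio duration, so values above 100% are slower than real time. It uses only the ~808 second chapter length and two of the Ryzen 3900x CPU timings listed above.

```python
AUDIO_SECONDS = 808.0  # approximate length of the benchmark chapter


def percent_of_real_time(processing_seconds, audio_seconds=AUDIO_SECONDS):
    """Processing time as a percentage of the audio's duration."""
    return 100.0 * processing_seconds / audio_seconds


# Ryzen 3900x, standard Python implementation on CPU (seconds, from the table above).
ryzen_cpu_seconds = {"tiny": 46.57, "large": 1076.10}

for model_name, seconds in ryzen_cpu_seconds.items():
    print(f"{model_name}: {percent_of_real_time(seconds):.1f}% of real time")
```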
Something that has looked very cool is the C++ implementation of Whisper, called (fittingly) whisper.cpp. It is an impressively clean C++ implementation, requiring only two source files (not including headers / etc) to include in a project. The project has even demoed (some of the smaller models of) Whisper running on an iPhone in a native application. I was interested in seeing if there was a noticeable difference between running the standard Python implementation of Whisper on the CPU and running whisper.cpp. As such, here is a comparison table, including a Speed % measuring the speed difference between whisper.cpp and the standard Python implementation of Whisper running on the CPU only.
Note that whisper.cpp consistently hard crashed midway through transcription when I tried to run more than 16 threads on the 3900X, so it was run with 16 threads despite having 24 threads available. The iMac was run with 6 threads and the M1 Max with 10, matching their available thread counts.
|System||Whisper Model||Python (s)||.cpp (s)||.cpp Speed % of Python|
|AMD Ryzen 3900x||tiny||46.57||28.39||164.1%|
|AMD Ryzen 3900x||tiny.en||44.97||27.01||166.5%|
|AMD Ryzen 3900x||base||77.77||52.07||149.4%|
|AMD Ryzen 3900x||base.en||75.88||52.16||145.5%|
|AMD Ryzen 3900x||small||202.31||151.45||133.6%|
|AMD Ryzen 3900x||small.en||193.60||157.76||122.7%|
|AMD Ryzen 3900x||medium||519.60||433.35||119.9%|
|AMD Ryzen 3900x||medium.en||508.85||466.27||109.1%|
|AMD Ryzen 3900x||large-v1||958.85||788.59||121.6%|
|AMD Ryzen 3900x||large||1076.10||805.65||133.6%|
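The Speed % column appears to be the Python time divided by the whisper.cpp time, expressed as a percentage. The sketch below recomputes it for three rows of the table under that assumption. Because the published timings are rounded to two decimals, the recomputed value can differ from the table in the last digit.

```python
def speed_percent(python_seconds, cpp_seconds):
    """whisper.cpp speed relative to Python: above 100 means whisper.cpp is faster."""
    return 100.0 * python_seconds / cpp_seconds


# (Python seconds, whisper.cpp seconds) for the Ryzen 3900x, from the table above.
timings = {
    "tiny": (46.57, 28.39),
    "medium.en": (508.85, 466.27),
    "large": (1076.10, 805.65),
}

for model_name, (python_s, cpp_s) in timings.items():
    print(f"{model_name}: {speed_percent(python_s, cpp_s):.1f}%")
```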
Note that the gains are significantly greater on systems with newer CPUs with more cores. The 6-core iMac has only 6 threads compared to 10 in the M1 Max and 24 (though we only use 16 in these tests) in the 3900x. There are also likely some instruction set advantages for systems like the M1 Max.
While the performance increase looks attractive for whisper.cpp, we’re still not where I’d like to be. Next we should also look at the performance impact of adding a GPU.
I only have one system among these three with a CUDA compatible GPU (the Ryzen 3900x has an nVidia RTX 2080 TI installed), so I’ll only report stats from that system.
|Model||Python (s)||.cpp (s)||GPU (s)||.cpp Speed % of Python||GPU Speed % of Python|
So… GPU is definitely the way to go for mass transcription, particularly for the bigger models. As such for the past two weeks or so I had my GPU successfully chug (with a couple breaks to use the GPU for other purposes) through the backlog.
Note that I think I was actually CPU limited, as I saw a single CPU core of my AMD Ryzen 3900X sitting at 100%, while my GPU’s usage (as measured by nvidia-smi) fluctuated between 70-80%.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:07:00.0 Off |                  N/A |
|  0%   36C    P2   180W / 260W |  10968MiB / 11264MiB |     75%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1258      G   /usr/lib/xorg/Xorg                 59MiB |
|    0   N/A  N/A    176601      G   /usr/lib/xorg/Xorg                 66MiB |
|    0   N/A  N/A    176730      G   /usr/bin/gnome-shell                9MiB |
|    0   N/A  N/A    704269      C   python3                         10816MiB |
+-----------------------------------------------------------------------------+
The utilization numbers do make me wonder whether I would get even better GPU results with a CPU that had faster single-threaded performance.
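To watch utilization over a long run instead of eyeballing the full report, nvidia-smi can emit just the fields of interest as CSV. The sketch below is one possible way to do that. The helper names are hypothetical, and `sample_gpu` only works on a machine with an nVidia driver installed. The example line fed to the parser uses the utilization and memory numbers from the report above.

```python
import subprocess

# Ask nvidia-smi for only GPU utilization and memory figures, as bare CSV values.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line):
    """Parse one CSV line such as '75, 10968, 11264' into a dictionary."""
    util, used, total = (float(field) for field in line.split(","))
    return {"util_percent": util, "memory_used_mib": used, "memory_total_mib": total}


def sample_gpu():
    """Run nvidia-smi once and return one parsed entry per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in out.strip().splitlines()]


reading = parse_gpu_line("75, 10968, 11264")
memory_fraction = reading["memory_used_mib"] / reading["memory_total_mib"]
print(f"GPU util {reading['util_percent']:.0f}%, memory {memory_fraction:.1%} used")
```

Calling `sample_gpu()` in a loop alongside a per-core CPU monitor would show whether GPU utilization dips while a single CPU core stays pegged.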
Just in case I have some people concerned, you can ignore the reported 0% / 0 RPM GPU fan, as the GPU is in a custom water cooling loop (something I did to play around and learn during the pandemic). That is also why the GPU is sitting at a very cool and comfortable (for a GPU) 36C, particularly considering this measurement was taken after DAYS of the system continuously processing episodes.
Memory-wise, however, my GPU was relatively maxed out. I don’t think I could run an xlarge model if OpenAI released one; I’d likely need to find another GPU with more memory.
While I was chugging through the backlog I added a note at the top of each podcast episode page stating what components were used to process the episode. This was to communicate to users whether they could expect the high accuracy of Whisper, or the accuracy of my previous Flashlight chunking fusion transcription system.
In case anyone is curious, here is an overview of the current podcast episode processing pipeline:
large_v2 Whisper model once, I process all chapters using Whisper, and then send the outputs from Whisper into WhisperX (using custom Python code). At the end I have transcriptions with relatively accurate word-level timestamps.
Note that since I started transcribing using WhisperX the author has added speaker diarization to his implementation. This is awesome to see, as I would love to see speaker diarization access democratized like Whisper has done for transcription.
I’m happy to have finally gotten catatp.fm transitioned over to Whisper, and it was also fun to run a couple benchmarks to examine the performance of the various ways to run mass speech recognition. It demonstrates the value of keeping an nVidia GPU around for ML tasks. I do hope that Apple will take a look at how to port Whisper over to the GPU cores in their M1 chips (similar to what they’ve done for some other high profile ML projects), as I think that would have a significant impact on performance.