Another Leap in Transcription Accuracy

When OpenAI’s Whisper speech transcription engine was announced a number of months ago, I was very interested in potentially using it with my podcast transcription side project for the Accidental Tech Podcast. I quickly put it through my reference ATP accuracy test and saw Flashlight finally displaced as the most accurate open source transcription I’d tested. And the difference was not small.

I was also excited that with Whisper I could deprecate my existing custom-built Flashlight chunking transcription fusion system. For background, my previous implementation split audio files into 15 second chunks whose start times were 5 seconds apart, so consecutive chunks overlapped. This was because Flashlight’s accuracy dropped quickly once you got past the 15 second range. The system would transcribe each 15 second chunk and then, using a custom sliding window algorithm, pick the most likely word from the up to three transcripts covering any given span of time, generating a ‘fused’ transcript from all of the individual transcripts.
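To make the chunking scheme concrete, here is a minimal sketch of how 15 second windows starting every 5 seconds could be generated, and why any given moment ends up covered by up to three chunks. The function names are hypothetical, and this is only an illustration of the windowing, not the original fusion code.

```python
def chunk_windows(duration_s, chunk_s=15.0, step_s=5.0):
    """Return (start, end) windows of chunk_s seconds, one starting every step_s seconds."""
    windows = []
    start = 0.0
    while start < duration_s:
        # The final windows are clipped to the end of the audio.
        windows.append((start, min(start + chunk_s, duration_s)))
        start += step_s
    return windows


def covering_windows(windows, t):
    """Return the windows whose range contains time t (start inclusive, end exclusive)."""
    return [(s, e) for (s, e) in windows if s <= t < e]


windows = chunk_windows(30.0)
# A word spoken at 12 seconds falls inside three overlapping chunks,
# so a fusion step would have three candidate transcripts to choose from.
print(covering_windows(windows, 12.0))
```

Words near the very start of a file are covered by fewer chunks, which is why the text above says "up to" three transcripts.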

I did feel vindicated when I saw that Whisper uses a sliding window in their implementation as well for handling large audio files.

Based on the improvement in accuracy and the promise of simplifying my catatp.fm processing code base, I excitedly dove into prototyping its integration into the workflow.

Unfortunately…

Whisper timestamps can be… less than accurate

In digging through the transcription outputs from Whisper, I discovered that a fair number of the segments’ timestamps were noticeably inaccurate. This poses a problem for my use case. Specifically, I am rather proud of the per-word highlighting that catatp.fm features during playback. Lacking per-word timestamps would also break the word-level speaker detection the site leverages.

While Whisper supposedly provides phrase-level timestamps, which I potentially could have used to refine the word-level timestamps using Flashlight or another alignment system, the lack of accuracy would have made the alignment task significantly harder. Reinforcing the issue, I even saw negative segment durations (where the ending time for the segment was before the supposed starting time) during testing.
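A quick way to spot this kind of problem is to scan the segment list for impossible timings. The sketch below assumes segments shaped like the ones Whisper’s Python API returns (dictionaries with start, end, and text keys); the helper name and the example data are made up for illustration.

```python
def find_suspect_segments(segments):
    """Flag segments that end before they start, or that start before the previous one did."""
    suspects = []
    prev_start = None
    for index, seg in enumerate(segments):
        if seg["end"] < seg["start"]:
            suspects.append((index, "negative duration"))
        elif prev_start is not None and seg["start"] < prev_start:
            suspects.append((index, "starts before previous segment"))
        prev_start = seg["start"]
    return suspects


# Made-up example data, not real Whisper output.
example_segments = [
    {"start": 0.0, "end": 4.2, "text": "Hello"},
    {"start": 4.2, "end": 3.9, "text": "and welcome"},
    {"start": 7.5, "end": 9.0, "text": "to the show"},
]
print(find_suspect_segments(example_segments))
```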

I started to play around with performing an alignment pre-pass (more or less sliding a window forward based on the alignment outputs from previous segments), but ran into some issues.

So the Whisper transition was put on one of the back burners of my oven of side projects (tortured metaphor alert), pending a further development.

WhisperX to the rescue!

One day a couple weeks ago I was searching around and came across WhisperX. I was ecstatic, as it looked like exactly the sort of thing I had started working on myself, except by someone who had devoted a lot more time and energy than I had available for a side project.

After a quick test transcription turned out great, I integrated the Python API for WhisperX into my audio processing workflow and let it chug through ATP’s massive backlog. Because I hate to leave any accuracy on the table, I only used Whisper’s large-v2 model for transcription.

The Need for Speed: The Different Models of Whisper, and the Stunning Impact of GPUs

I thought it would be interesting to compare the speed of the different models of Whisper to each other, and also on differing hardware. The systems I used for this benchmark included:

14” MacBook Pro with M1 Max and 64GB of RAM running macOS Ventura
Custom-built desktop with AMD 3900X, 128GB of DDR4 RAM, and nVidia RTX 2080 Ti running Ubuntu 20.04
2019 Retina 27” iMac with 3GHz 6-core Intel processor and 24GB RAM running macOS Ventura

Note added 2023/04/16: FYI I included my Amazon affiliate links for the products above, but I wouldn’t recommend purchasing a 2080 Ti if you’re looking to transcribe audio using a GPU, as it is a couple of generations old. Instead I would recommend a GPU with at least 11GB of VRAM so you can use the full-size model. For nVidia these include the RTX 4070, RTX 4080, and RTX 4090. The good news is that the GPU shortage of the last couple of years seems to have finally abated, so you can hopefully get some list price GPUs. I also wouldn’t recommend purchasing a 2019 27” iMac, as the M1/M2 iMac will perform dramatically better. Alright, back to your regularly scheduled blog post.

I excluded the model loading time for each of the different systems, as the model only needs to be loaded once and can then be used to transcribe as many audio files as desired. If you were trying to create a client app for transcription, however, I could see model loading time being important.
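As a rough sketch of what separating the two timings can look like, the helper below times the one-off model load apart from each transcription. The benchmark function and the file name are hypothetical; with the openai-whisper package, the two callables could be something like lambda: whisper.load_model("large", device="cpu") and lambda model, path: model.transcribe(path).

```python
import time


def benchmark(load_model, transcribe, audio_paths):
    """Time the one-off model load separately from each transcription run."""
    t0 = time.perf_counter()
    model = load_model()
    load_s = time.perf_counter() - t0

    timings = {}
    for path in audio_paths:
        t0 = time.perf_counter()
        transcribe(model, path)
        timings[path] = time.perf_counter() - t0
    return load_s, timings


# Stand-in callables so the sketch runs without Whisper installed.
load_s, timings = benchmark(
    load_model=lambda: "fake-model",
    transcribe=lambda model, path: None,
    audio_paths=["chapter2.flac"],  # hypothetical file name
)
print(f"load: {load_s:.3f}s, transcription: {timings}")
```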

All systems were run against chapter 2 of episode 205 of ATP, which has been a standard benchmark for this project. The chapter is ~808 seconds long.

First off, let’s briefly examine the performance of Whisper’s different models. Whisper’s GitHub page goes over the differences between the models: more or less, the larger the model, the better the accuracy but the slower the transcription. But just how much slower? The following benchmarks were run using the standard Python implementation of Whisper, set to run on the CPU only to get initial numbers. All times are in seconds.

System	tiny	tiny.en	base	base.en	small	small.en	medium	medium.en	large-v1	large
M1 Max	29.75	25.22	64.75	58.06	165.20	139.30	445.86	446.96	753.42	839.58
AMD Ryzen 3900x	46.57	44.97	77.77	75.88	202.31	193.60	519.60	508.85	958.85	1076.10
6-core iMac	43.02	55.46	69.75	72.12	194.49	186.46	571.15	553.94	982.82	1056.09

Yikes. I want the best possible accuracy for catatp.fm, and thus want to use the large model, but having transcription run at ~96% of real-time speed (839.58 seconds to transcribe ~808 seconds of audio on the M1 Max) is not great, particularly for going through the entire backlog. And that speed is only achieved when dedicating my personal laptop to the project, which I wasn’t excited to do.
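For anyone who wants to check the arithmetic, real-time speed here is just the audio length divided by the processing time. A minimal sketch, using the ~808 second chapter length and the M1 Max CPU timings from the table above (the function name is my own):

```python
def realtime_speed_percent(audio_s, transcribe_s):
    """Audio duration as a percentage of processing time; 100% means exactly real time."""
    return 100.0 * audio_s / transcribe_s


CHAPTER_S = 808.0  # approximate length of the benchmark chapter

# M1 Max, CPU only, standard Python implementation.
for model_name, seconds in [("tiny", 29.75), ("medium", 445.86), ("large", 839.58)]:
    print(f"{model_name}: {realtime_speed_percent(CHAPTER_S, seconds):.1f}% of real time")
```

Anything under 100% means the transcription takes longer than simply listening to the audio would.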

Something that has looked very cool is the C++ implementation of Whisper, called (fittingly) whisper.cpp. It is an impressively clean C++ implementation, requiring only two source files (not including headers / etc) to include in a project. The project has even demoed (some of the smaller models of) Whisper running on an iPhone in a native application. I was interested in seeing if there was a noticeable difference between running the standard Python implementation of Whisper on the CPU and running whisper.cpp. As such, here is a comparison table, including a Speed % column that expresses the Python run time as a percentage of the whisper.cpp run time (values above 100% mean whisper.cpp was faster).

Note that whisper.cpp consistently hard crashed midway through transcription when I tried to run more than 16 threads on the 3900X, so it was run with 16 threads despite having 24 threads available. The iMac was run with 6 threads and the M1 Max with 10, matching their available thread counts.

System	Whisper Model	Python (s)	.cpp (s)	.cpp Speed % of Python
M1 Max	tiny	29.75	17.21	172.9%
M1 Max	tiny.en	25.22	16.68	151.2%
M1 Max	base	64.75	27.97	231.5%
M1 Max	base.en	58.06	29.26	198.4%
M1 Max	small	165.20	83.57	197.7%
M1 Max	small.en	139.30	83.14	167.6%
M1 Max	medium	445.86	244.66	182.2%
M1 Max	medium.en	446.96	274.08	163.1%
M1 Max	large-v1	753.42	376.18	200.3%
M1 Max	large	839.58	391.12	214.7%
AMD Ryzen 3900x	tiny	46.57	28.39	164.1%
AMD Ryzen 3900x	tiny.en	44.97	27.01	166.5%
AMD Ryzen 3900x	base	77.77	52.07	149.4%
AMD Ryzen 3900x	base.en	75.88	52.16	145.5%
AMD Ryzen 3900x	small	202.31	151.45	133.6%
AMD Ryzen 3900x	small.en	193.60	157.76	122.7%
AMD Ryzen 3900x	medium	519.60	433.35	119.9%
AMD Ryzen 3900x	medium.en	508.85	466.27	109.1%
AMD Ryzen 3900x	large-v1	958.85	788.59	121.6%
AMD Ryzen 3900x	large	1076.10	805.65	133.6%
6-core iMac	tiny	43.02	41.91	102.6%
6-core iMac	tiny.en	55.46	46.98	118.0%
6-core iMac	base	69.75	73.86	94.4%
6-core iMac	base.en	72.12	74.75	96.5%
6-core iMac	small	194.49	190.23	102.2%
6-core iMac	small.en	186.46	183.05	101.9%
6-core iMac	medium	571.15	543.41	105.1%
6-core iMac	medium.en	553.94	529.24	104.7%
6-core iMac	large-v1	982.82	903.82	108.7%
6-core iMac	large	1056.09	952.13	110.9%

Note that the gains are significantly greater on systems with newer CPUs with more cores. The 6-core iMac has only 6 threads compared to 10 in the M1 Max and 24 (though we only use 16 in these tests) in the 3900x. There are also likely some instruction set advantages for systems like the M1 Max.
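In case the Speed % column is unclear, it can be reproduced by dividing the Python time by the whisper.cpp time. A small sketch using three rows from the table above (the helper name is hypothetical):

```python
def speed_percent(baseline_s, candidate_s):
    """Baseline run time as a percentage of the candidate's; above 100% means the candidate is faster."""
    return 100.0 * baseline_s / candidate_s


# (system, model, Python seconds, whisper.cpp seconds), copied from the table above.
rows = [
    ("M1 Max", "tiny", 29.75, 17.21),
    ("AMD Ryzen 3900x", "large", 1076.10, 805.65),
    ("6-core iMac", "base", 69.75, 73.86),
]
for system, model_name, python_s, cpp_s in rows:
    print(f"{system}\t{model_name}\t{speed_percent(python_s, cpp_s):.1f}%")
```

The iMac row comes out below 100%, which matches the table: on that machine whisper.cpp was slightly slower than Python for the base model.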

While the performance increase looks attractive for whisper.cpp, we’re still not where I’d like to be. Next we should also look at the performance impact of adding a GPU.

I only have one system among these three with a CUDA compatible GPU (the Ryzen 3900x has an nVidia RTX 2080 TI installed), so I’ll only report stats from that system.

Model	Python (s)	.cpp (s)	GPU (s)	.cpp Speed % of Python	GPU Speed % of Python
tiny	46.57	28.39	13.26	164.1%	351.3%
tiny.en	44.97	27.01	13.43	166.5%	334.9%
base	77.77	52.07	18.74	149.4%	415.0%
base.en	75.88	52.16	18.44	145.5%	411.5%
small	202.31	151.45	35.37	133.6%	572.0%
small.en	193.60	157.76	33.01	122.7%	586.5%
medium	519.60	433.35	64.96	119.9%	799.9%
medium.en	508.85	466.27	63.18	109.1%	805.3%
large-v1	958.85	788.59	91.45	121.6%	1048.4%
large	1076.10	805.65	106.17	133.6%	1013.6%

So… GPU is definitely the way to go for mass transcription, particularly for the bigger models. As such, for the past two weeks or so I have had my GPU chug through the backlog (with a couple of breaks to use the GPU for other purposes).

Note that I think I was actually CPU limited, as I saw a single CPU core of my AMD Ryzen 3900X sitting at 100%, while my GPU’s usage (as measured by nvidia-smi) fluctuated between 70-80%.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:07:00.0 Off |                  N/A |
|  0%   36C    P2   180W / 260W |  10968MiB / 11264MiB |     75%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1258      G   /usr/lib/xorg/Xorg                 59MiB |
|    0   N/A  N/A    176601      G   /usr/lib/xorg/Xorg                 66MiB |
|    0   N/A  N/A    176730      G   /usr/bin/gnome-shell                9MiB |
|    0   N/A  N/A    704269      C   python3                         10816MiB |
+-----------------------------------------------------------------------------+

The utilization numbers do make me wonder whether a CPU with faster single-threaded performance would get even better results out of the GPU.
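If you want to log utilization over a long run rather than eyeballing the table, nvidia-smi can emit the same numbers as CSV. Below is a minimal sketch that parses that output; the function names are my own, and it assumes every field comes back as an integer (a field reported as N/A would need extra handling).

```python
import subprocess

# Ask nvidia-smi for utilization and memory only, as bare comma-separated numbers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse one 'utilization, used, total' line per GPU into dictionaries."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field) for field in line.split(","))
        stats.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return stats


def read_gpu_stats():
    """Run nvidia-smi and parse its output (requires an nVidia GPU and driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


# The same values as the snapshot above, in the CSV form nvidia-smi prints.
print(parse_gpu_stats("75, 10968, 11264\n"))
```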

Just in case I have some people concerned, you can ignore the reported 0% / 0 RPM GPU fan, as the GPU is in a custom water cooling loop (something I did to play around and learn during the pandemic). That is also why the GPU is sitting at a very cool and comfortable (for a GPU) 36C, particularly considering this measurement was taken after DAYS of the system continuously processing episodes.

Memory-wise, however, my GPU was relatively maxed out. I don’t think I could run an xlarge model if OpenAI released one; I’d likely need to find another GPU with more memory.

Current State of the Audio Processing Pipeline for catatp.fm

While I was chugging through the backlog I added a note at the top of each podcast episode page stating what components were used to process the episode. This was to communicate to users whether they could expect the high accuracy of Whisper, or the accuracy of my previous Flashlight chunking fusion transcription system.

In case anyone is curious, here is an overview of the current podcast episode processing pipeline:

The process starts by parsing ATP’s public RSS feed.
If a new episode is detected, it downloads the latest episode. It also takes note of all the episode notes from the RSS feed.
After downloading the episode, the MP3 is parsed to process chapter data. Any chapter art is also extracted from the MP3 in this step.
The MP3 is split into chapters, with each chapter audio file output as a 16kHz mono FLAC audio file.
After initializing the large-v2 Whisper model once, I process all chapters using Whisper, and then send the outputs from Whisper into WhisperX (using custom Python code). At the end I have transcriptions with relatively accurate word-level timestamps.
Using an older model and version of Pyannote I perform speaker diarization across the entire episode of ATP. The output of this step is a series of timestamp ranges specifying which generic speaker was believed to be speaking at that time, e.g. from 0:21:34.567-0:21:56.789 speaker 3 is believed to be speaking.
After speaker diarization is complete, I perform speaker identification on each identified speaker using Pyannote again. I have a set of “reference” clips of each host which is used to find the best match for each identified speaker.
Now that I have identified speaker speaking times, I combine that with the word level transcriptions to try and perform “best matching” for each word. The output from this step is a transcript containing word-level timing and speaker information.
The word-level timing and speaker information transcript is merged with episode notes and chapter information to generate a markdown file for the episode.
Jekyll converts the 500+ Markdown files into a semi-coherent website.
The site is pushed to catatp.fm.
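The “best matching” step that combines speaker times with word-level timestamps can be sketched as picking, for each word, the speaker turn that overlaps it the most. This is only one plausible approach under my own assumptions, not necessarily what the site’s code does; the field names and example data are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two time ranges (0 if they are disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most, or None."""
    labeled = []
    for word in words:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            amount = overlap(word["start"], word["end"], turn["start"], turn["end"])
            if amount > best_overlap:
                best_speaker, best_overlap = turn["speaker"], amount
        labeled.append({**word, "speaker": best_speaker})
    return labeled


# Made-up example data for illustration.
words = [
    {"word": "Hello", "start": 0.2, "end": 0.6},
    {"word": "there", "start": 0.9, "end": 1.3},
    {"word": "Hi", "start": 3.0, "end": 3.2},
]
turns = [
    {"speaker": "speaker_1", "start": 0.0, "end": 1.0},
    {"speaker": "speaker_2", "start": 1.0, "end": 2.0},
]
for item in assign_speakers(words, turns):
    print(item["word"], item["speaker"])
```

A word that straddles a turn boundary goes to whichever speaker covers more of it, and a word outside every turn is left unlabeled rather than guessed.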

Note that since I started transcribing with WhisperX, its author has added speaker diarization to the project. This is awesome to see, as I would love for speaker diarization to be democratized the way Whisper has democratized transcription.

Conclusion

I’m happy to have finally gotten catatp.fm transitioned over to Whisper, and it was also fun to run a couple of benchmarks to examine the performance of the various ways to run mass speech recognition. It demonstrates the value of keeping an nVidia GPU around for ML tasks. I do hope that Apple will take a look at how to port Whisper over to the GPU cores in their M1 chips (similar to what they’ve done for some other high-profile ML projects), as I think that would have a significant impact on performance.

Note: Amazon links above include my affiliate code. More details on the About page.
