So a side project I semi-finished (a side project is never actually completely finished; you just eventually give up or launch it with missing features) needed speech recognition.
A key requirement for the speech recognition was not having a per-minute cost, as I had hundreds of hours of audio to chew through. Since I didn’t want to spend potentially thousands of dollars on a cloud service, that meant I needed options I could run locally. You can read a brief overview of the project itself on the About page, while this post is a place for me to chat in longer form about “local” speech recognition.
As I don’t believe in burying the lede, I’ll start off with an overview of the different options I evaluated.
Here is an excerpt from my side project’s About page, discussing the initial test and results. Note that I added results for iOS 15, which I ran in the past couple of days following its general availability.
Start of excerpt from project About page
Early on in this project I chose a short segment of audio to use as a ‘reference’ to determine how different speech recognition systems handled the tech-heavy speech of ATP. The first segment I selected was John’s soliloquy about the Apple Human Interface Guidelines and UI design from Accidental Tech Podcast #392, 200 seconds in duration. I generated a ‘reference’ transcript for that segment and then ran the audio through a variety of speech recognition systems, including both open source options and the built-in offline transcription found in iOS 14 and macOS Big Sur (which, interestingly, consistently produced slightly different results from each other…). Note that for Flashlight this also involved some custom algorithms/techniques to merge overlapping transcripts together, as Flashlight’s accuracy with the model I was using dropped significantly on samples longer than ~15 seconds.
I did not examine cloud hosted speech recognition systems (Otter.ai, Amazon Transcribe, etc), as transcribing all 400+ episodes of ATP, often 3+ hours each, would have been cost prohibitive for a side project. They likely exceed the accuracy of even the best performing open source project, but were out of scope.
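For a rough sense of “cost prohibitive”: some back-of-the-envelope math, assuming a cloud rate of about $0.024 per minute (roughly Amazon Transcribe’s standard tier at the time; an assumption, so check current pricing) and 400 episodes averaging 3 hours each:

```python
# Back-of-the-envelope cloud transcription cost estimate.
# The per-minute rate is an assumption (roughly Amazon
# Transcribe's standard tier at the time); the episode count
# and average length are from the paragraph above.
episodes = 400
hours_per_episode = 3
rate_per_minute = 0.024  # USD, assumed

total_minutes = episodes * hours_per_episode * 60
estimated_cost = total_minutes * rate_per_minute
print(f"{total_minutes:,} minutes of audio -> ~${estimated_cost:,.0f}")
```

Well over a thousand dollars for a single full pass, before any re-runs while iterating.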
For this project I wanted to maximize the number of Hits (words correctly found) while minimizing the number of Insertions (words inserted that did not exist in the original audio) and Substitutions (where a word was swapped for a different one). I used the Python module jiwer to help calculate all these values. I also removed all punctuation (!, /, etc.) before running transcripts through the comparison, to ensure I didn’t favor engines that emit punctuation over those that don’t. Finally, do not take this accuracy comparison as gospel: it covers a short snippet of a very tech-oriented show hosted by three white dudes from the United States, and accuracy might be dramatically different for your particular speech recognition use case.
| Speech Recognition System | WER (Word Error Rate) |
| --- | --- |
| Flashlight (15 second chunking) | |
| macOS Big Sur | |
| Wav2Letter (Flashlight predecessor) | |
| Mozilla DeepSpeech 9 | |
End of excerpt from project About page
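The Hits/Insertions/Substitutions measures above are what jiwer reports; under the hood they come from a word-level edit-distance alignment between the reference and hypothesis transcripts. A minimal pure-Python sketch of that calculation (an illustration of the technique, not jiwer’s actual implementation):

```python
import string

def normalize(text):
    # Lowercase and strip punctuation so engines that emit
    # punctuation aren't favored over ones that don't.
    bare = text.lower().translate(str.maketrans("", "", string.punctuation))
    return bare.split()

def word_measures(reference, hypothesis):
    """Count hits, substitutions, insertions, and deletions between
    two transcripts via word-level edit distance, then derive WER."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1])
    # Walk back through the table counting each kind of operation.
    hits = subs = ins = dels = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            hits, i, j = hits + 1, i - 1, j - 1
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            subs, i, j = subs + 1, i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            ins, j = ins + 1, j - 1
        else:
            dels, i = dels + 1, i - 1
    return {"wer": (subs + ins + dels) / max(n, 1),
            "hits": hits, "substitutions": subs,
            "insertions": ins, "deletions": dels}
```

WER is (substitutions + insertions + deletions) divided by the reference word count, which is why maximizing hits while minimizing insertions and substitutions drives it down.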
I also did another comparison (this time using an entire chapter from ATP #205) before deciding on which transcription engine to use. I didn’t write about this comparison on the project About page, as I had already removed some of the worse performing speech recognition systems from the comparison.
Note that I did not include WER and insertion numbers for this test, as I used the hand-generated ‘official’ ATP transcript as my ‘reference’. Because that transcript omits a lot of ‘hmms’ and other connecting but extraneous words, along with stutters, stammers, and other things that aren’t actually needed for an abbreviated transcript, it showed dramatically inflated WER and insertion numbers.
This was still a useful data point for me, as the key thing I was looking at was the hit rate: the count of the ‘important’ words included in the abbreviated transcript that each speech recognition engine was able to generate.
| Speech Recognition System |
| --- |
| Flashlight (15 second chunking) |
| macOS Big Sur |
| Wav2Letter (Flashlight predecessor) |
Flashlight and iOS 14 were pretty close yet again, though Flashlight still had a noticeable lead in the number of hits. In addition, parallelizing the iOS engine would have required a lot of extra work (and multiple iOS devices stolen from other members of my family for an extended period of time), while Flashlight ran very fast on my existing Nvidia gaming GPU. Thus I moved forward with Flashlight as the speech recognition engine for my project.
During the process of down-selecting and then utilizing Flashlight a number of thoughts/learnings came to me, which I’ll go over one by one.
While this learning probably applies to almost everything these days, it particularly applies to free/OSS speech recognition. I found that most of the projects not from Facebook or Apple had dramatically worse accuracy than either Flashlight or the iOS/macOS integrated transcription. And yes, before anyone e-mails/tweets/messages me: iOS and macOS transcription are not OSS, but they are free for anyone with a Mac/iOS device to run whatever audio they want through Apple’s speech recognition engine (while in offline mode, that is). iOS 15 appears to have had a slight accuracy regression in my benchmarks compared to iOS 14, but offline speech recognition was much easier to set up on iOS 15 than on iOS 14.
I was particularly disappointed that Mozilla’s engine (DeepSpeech 9) did not perform better. Though with recent turmoil in the Mozilla organization I unfortunately can’t say I’m surprised.
While testing (and eventually using) Flashlight I used their provided speech recognition model, as detailed in their Colab project. While I work professionally as a software engineer, I have not recently had reason to leverage machine learning at work, so I decided to use the pre-trained models from their website instead of trying to roll my own. Unlike other tech companies running their own cloud services, Facebook isn’t worried about giving away the models behind their best results. This is in contrast to a company like MSFT/AMZN/GOOG, which would likely be hesitant to let random people run their state-of-the-art models on home machines rather than only within their respective data centers (at a likely nice profit per execution).
In fact, Google does provide offline execution of some of their mobile-optimized ML models, including image identification, via ML Kit, but only on Android/iOS devices, and speech recognition is currently not among them.
When I started the project I assumed that the majority of the effort would be in the mass-execution of an OSS speech recognition engine against hundreds of hours of audio, along with converting it into a web site that didn’t make a person’s eyes bleed. Instead the vast majority of the effort in the project was spent figuring out how to get most of the projects to run once against my ‘benchmark’ audio sample. While some projects were relatively easy to get running, others were… not. There were a number where I was relieved that a project’s accuracy was terrible, as it meant I didn’t need to get it running again.
While Flashlight’s speed let me crunch through the back catalog of episodes over a weekend, some of the other options would have taken much longer to chew through all the audio on a single system. They could potentially have been sped up through parallelization, but likely at the cost of a large EC2 bill. So whenever you’re evaluating something, make sure to take both how long it will take to run and what that will cost into consideration.
Unfortunately I didn’t take great notes on how long all of them took, but I do have execution times for a number of the tests I performed. Note that these tests were run on different input samples (as they’re from different stages of testing in my notes), so the main thing you should examine is the x Real Time column. Unless otherwise noted, all tests were run on a PC with a Ryzen 1600 AF running Ubuntu 18.04 Server.
| Speech Recognition System | Input File Duration | x Real Time (Higher is Better/Faster) |
| --- | --- | --- |
| Mozilla DeepSpeech 6 | | |
| Mozilla DeepSpeech 8 | | |
| iOS 14.1 (iPhone 10 Pro Max) | | |
| iOS 15 (iPhone 10 Pro Max) | | |
| macOS Big Sur (wife’s 2019 27” iMac) | | |
| macOS Big Sur (M1 MacBook Air) | | |
| Flashlight (CUDA on 1080 Ti) | 7652s (Entire Episode in 15s Chunks) | |
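For clarity, the x Real Time figure is simply the audio duration divided by the wall-clock processing time. The 7652 s episode length is from the table above; the processing time here is a made-up number purely for illustration:

```python
def real_time_factor(audio_seconds, processing_seconds):
    """Seconds of audio transcribed per second of wall-clock time.
    1.0 is exactly real time; higher is better/faster."""
    return audio_seconds / processing_seconds

# A full 7652 s episode transcribed in a hypothetical 3826 s
# of wall-clock time would run at 2x real time.
print(real_time_factor(7652, 3826))
```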
Note that for both the iOS and Big Sur results, the application requesting the speech recognition does not appear to use much (if any) CPU; the work instead appears to be performed by a background process. In addition, in my brief testing on macOS I was unable to run multiple instances simultaneously, likely due to that single background process. I also found it interesting that the background process seemed to only use the ‘efficiency’ cores of my M1 MacBook Air.
Finally, I now wonder how fast Flashlight would be if it weren’t running on the GPU, considering its predecessor Wav2Letter is the slowest engine I have results for. Using the multithreaded Wav2Letter executable would likely speed things up, but I didn’t investigate further after seeing Flashlight’s performance and accuracy improvements over Wav2Letter.
During my initial testing of Flashlight I discovered that accuracy was noticeably reduced when audio samples run through it exceeded ~15 seconds. This was confirmed by an issue on Flashlight’s GitHub project. After looking through some samples of test audio both at 15 seconds and exceeding it, I decided the accuracy delta was worth the additional effort to only transcribe in 15 second chunks. This required a good amount of additional work, from generating overlapping 15 second chunks to developing an algorithm to merge all the overlapping chunks into a single transcript (ideally while also minimizing duplicated words).
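The chunk-and-merge pipeline above can be sketched roughly as follows. The window/overlap sizes are hypothetical (not the project’s actual values), and the merge is deliberately naive, only dropping exact duplicated word runs; the real merge had to be fuzzier, since the two chunks may recognize the overlapping audio slightly differently:

```python
def chunk_spans(total_seconds, window=15.0, overlap=5.0):
    """Overlapping (start, end) windows covering the audio.
    The 5 s overlap is illustrative, not the project's actual value."""
    spans, start, step = [], 0.0, window - overlap
    while start < total_seconds:
        spans.append((start, min(start + window, total_seconds)))
        start += step
    return spans

def merge_transcripts(left, right, max_overlap=8):
    """Join two consecutive chunk transcripts, dropping the longest
    run of words duplicated where the chunks overlapped. Only handles
    exact word matches; real merging must tolerate the engines
    transcribing the shared audio slightly differently."""
    lw, rw = left.split(), right.split()
    for k in range(min(max_overlap, len(lw), len(rw)), 0, -1):
        if lw[-k:] == rw[:k]:
            return " ".join(lw + rw[k:])
    return " ".join(lw + rw)
```

For example, merging “the human interface guidelines say” with “guidelines say that buttons” detects the two-word overlap and yields “the human interface guidelines say that buttons”.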
Unfortunately, OSS software (particularly research-centric software) may be optimized more for research paper benchmarks than for real-life work. You’re not paying any money for the software or model, so you can’t really complain about the lack of convenience. But hey, it made the side project that much more interesting.
I already knew this learning, based on previous professional experience with Docker. But when you’re fighting against projects using old versions of CUDA, or other open source libraries that are not easily swappable on a host, providing a Docker image you can instantiate to jump directly to running commands is incredibly efficient. Plus GPU accelerated Docker makes it simple even for CUDA accelerated projects. Similarly Google Colab tutorials are super useful for non-ML experts trying to leverage cutting edge research.
Overall I was both impressed and annoyed with the state of ‘free’ speech recognition. While iOS/macOS and Flashlight both have surprisingly decent accuracy, the rest of the field feels years behind. I would recommend that any developer looking for a free speech recognition engine start with Facebook’s Flashlight. It provides solid accuracy and fast speeds, assuming you handle splitting and merging the audio in 15 second chunks and have an Nvidia GPU handy to speed everything up.
Also, as a note to any prospective open source speech recognition library/engine authors out there: please write documentation on how to actually run your engine. Ideally also vend a Docker image, so that when the cutting-edge library you depend on later ships multiple breaking changes, your customers can still run your project when evaluating it. You never know: having a Docker image might be the difference between someone actually using the software you’ve worked so hard on and someone passing it over after fighting an incompatible dependency tree for an hour.