Quick Update on catatp.fm

Speaker Recognition (Diarization) Integrated but Whisper Transition Pending

Posted by TechBeret on January 19, 2023 · 5 mins read

catatp.fm continues…

While this blog has been dormant, the side project continues to operate. I spent some time over the US’s Labor Day weekend in 2022 integrating speaker diarization using pyannote, with the hope of having it working ahead of ATP’s 500th episode. I just barely got it integrated in time, and after I notified the hosts about the update, Casey referenced the site during the 500th episode! I also received kind e-mails from all three hosts, which made my week.

Working Speaker Recognition for ATP Transcripts

Back when I wrote up a brief design doc for catatp.fm, I included speaker identification as a potential feature. At the time, after examining the available speaker diarization options and their lackluster accuracy, I abandoned the feature in the hope of returning to it in the future.

Finally, over a year later (fall 2022), I found an open source project that met my accuracy requirements, and I felt it appropriate to get it integrated into catatp.fm for ATP’s 500th episode.

While there are now speaker labels for each sequence of words in each episode, the fun part was adding a section to the statistics page with the word counts for each speaker, along with an interactive graph that lets you look over the word counts for each detected speaker in each episode. That makes it easy to (mostly accurately) identify which episodes Tiff Arment was in, the episodes where the hosts interviewed someone (Chris Lattner, Phil Schiller, Christina Warren), the one episode where Marco never spoke (263), the one episode where John was not present (119, the Christina Warren interview), and the fact that Casey is the only person who has been heard in every episode of ATP.
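
As a rough illustration of how per-episode word counts like these could be derived from speaker-labeled words, here is a minimal Python sketch. It is not the actual catatp.fm code: the `(episode, speaker, word)` tuple layout and both function names are assumptions made for the example.

```python
from collections import Counter, defaultdict


def word_counts_by_speaker(words):
    """Count words per speaker for each episode.

    `words` is an iterable of (episode, speaker, word) tuples.
    This layout is hypothetical, chosen only for illustration.
    """
    counts = defaultdict(Counter)
    for episode, speaker, _word in words:
        counts[episode][speaker] += 1
    return counts


def episodes_without(counts, speaker):
    """Return the sorted episode numbers in which `speaker` has no words."""
    # Counter returns 0 for missing keys, so no special-casing is needed.
    return sorted(ep for ep, c in counts.items() if c[speaker] == 0)
```

With counts in this shape, a question like "which episodes did a given host never speak in?" becomes a one-line lookup, subject of course to whatever errors the speaker labels themselves contain.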

The graph also makes it easier for me to see where more work is needed, like episode 202 (where people’s laughter and Tiff Arment’s voice are detected as Jonathan Mann).

Note that I am using an older model and version of pyannote than the currently released one. That is because after I upgraded I saw a massive drop in accuracy for the long audio files I process as part of this project. After seeing the accuracy reduction, I was able to revert and continue using the cached model.

Some Semi-Random Notes / Thoughts on Speaker Diarization / Project in General

  1. Accuracy for speaker recognition is not 100%, but the transcriptions aren’t 100% accurate by any means either.
  2. I am still using Flashlight for the speech recognition portion of the project, and the timestamps generated by Flashlight don’t seem to match up perfectly with the timestamps generated by pyannote, so trailing words sometimes get assigned to the following speaker.
  3. I was extremely impressed to discover that pyannote was able to mostly handle identifying each of the main host speakers with only two samples of each host, all taken from episode 491 (the episode from the week when I was first playing around with pyannote).
  4. The only exception to #3 was when John was sick in episode 67. I had to take separate samples for John for specifically that episode, as pyannote could not identify John at all in that episode from his ‘normal’ samples. I’m glad John’s throat recovered to where he was correctly detected in the following episode without requiring additional one-off samples.
  5. I also added samples for Jonathan Mann, Phil Schiller, Chris Lattner, Christina Warren, and Tiffany Arment, as all were detected as speakers that were not one of the main hosts.
  6. Phil Schiller was originally classified as Jonathan Mann by pyannote in episode 317, which I found amusing. I probably need to adjust my minimum accuracy thresholds for classification.
  7. On the speech recognition side of things, I have performed some experiments with Whisper, but ran into issues with segments that have zero or negative durations, which put a halt to that work until I have enough time to deep dive into the issue. That being said, in initial experiments examining accuracy, the large Whisper model (v1, haven’t tested v2 yet) beat Flashlight and all the other speech recognition systems I measured. And on a recent NVIDIA gaming GPU (GeForce GTX 1080 Ti) with enough RAM to fit the rather large model, it was still processing audio files faster than real time.
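
Notes 2 and 7 both come down to timestamps. One possible way to make the word-to-speaker assignment more tolerant of small mismatches is to give each word to the speaker turn it overlaps the most, rather than to whichever turn contains its start or end time. The sketch below is only an illustration of that idea, not what catatp.fm does: the tuple layouts and the function names `overlap` and `assign_speakers` are assumptions, and the fallback for zero or negative duration words is one arbitrary choice among several.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns):
    """Label each word with the speaker turn it overlaps the most.

    `words` is a list of (start, end, text) tuples from the recognizer and
    `turns` is a list of (start, end, speaker) tuples from the diarizer.
    Both layouts are hypothetical.
    """
    labeled = []
    for start, end, text in words:
        if end <= start:
            # Degenerate (zero or negative duration) word: fall back to the
            # turn that contains its start time, if there is one.
            speaker = next(
                (s for t_start, t_end, s in turns if t_start <= start < t_end),
                None,
            )
        else:
            best = max(
                turns,
                key=lambda t: overlap(start, end, t[0], t[1]),
                default=None,
            )
            if best is not None and overlap(start, end, best[0], best[1]) > 0:
                speaker = best[2]
            else:
                speaker = None  # the word falls outside every speaker turn
        labeled.append((text, speaker))
    return labeled
```

Under this scheme, a trailing word that spills slightly past the end of a turn stays with the speaker who said most of it, while words that overlap no turn at all are left unlabeled instead of being guessed.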

Detour to Another Side Project…

I’m currently trying to finish up another side project entirely unrelated to ATP which is occupying what little free time I have. Keep an eye on this space for more info hopefully in the near future.

You can also follow me on Mastodon @[email protected], where I believe I have already tooted more than I ever tweeted. Wow, that sounds weird…