Measuring the Problem: Clear Speech vs. Disordered Speech
Rohan K
Hello and welcome back! Last week I introduced a new model called Whisper. As I continue to compare models and run optimization algorithms, I am also starting to put results together. The first step is measuring the initial problem, to show why this work is important in the first place.

This week I reviewed healthy controls, who have no symptoms when speaking. Their recordings have a more continuous voice and clearer articulation. The dataset consists of recordings from 25 healthy control speakers. One challenge was that each recording contained more than the 20 sentences I was working with: there were additional sentences in between and whispered sentences at the end. This meant I had to do some preprocessing. I edited each audio file so that it contained only the 20 sentences I wanted, and I labeled each sentence with its correct timestamps. With these ground-truth timestamps, I can judge the accuracy of a model's predicted timestamps. Now that both the healthy control and the voice disorder datasets were in the same format, I could run some baseline tests.
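To give a sense of what that preprocessing step can look like, here is a minimal Python sketch using the soundfile library. It assumes the sentence boundaries were already marked by hand; the file names, the `sentence_spans` argument, and the CSV layout are placeholders for illustration, not my exact pipeline.

```python
# Minimal sketch of the trimming/labeling step, assuming the 20 target
# sentences have already been located by hand. File names, `sentence_spans`,
# and the CSV layout are illustrative placeholders.
import csv

import numpy as np
import soundfile as sf


def trim_to_target_sentences(in_wav, out_wav, out_csv, sentence_spans):
    """Keep only the labeled sentences and save their ground-truth timestamps.

    sentence_spans: list of (start_sec, end_sec) tuples for the 20 sentences,
    in the order they appear in the original recording.
    """
    audio, sr = sf.read(in_wav)

    kept_chunks = []   # audio chunks to concatenate into the trimmed file
    labels = []        # (sentence_id, new_start_sec, new_end_sec)
    cursor = 0.0       # running time in the trimmed file

    for i, (start, end) in enumerate(sentence_spans, start=1):
        chunk = audio[int(start * sr):int(end * sr)]
        kept_chunks.append(chunk)
        duration = len(chunk) / sr
        labels.append((i, round(cursor, 3), round(cursor + duration, 3)))
        cursor += duration

    # Write the trimmed audio containing only the 20 target sentences
    sf.write(out_wav, np.concatenate(kept_chunks), sr)

    # Save the corrected ground-truth timestamps for scoring model predictions
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["sentence", "start_sec", "end_sec"])
        writer.writerows(labels)
```

Once every file has been trimmed this way, the ground-truth CSV becomes the reference that a model's predicted timestamps are scored against.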
Whisper, the speech-to-text transcription model developed by OpenAI, is a state-of-the-art model trained on roughly 680,000 hours of speech. It's important to note that it learns from clear speech, not disordered speech. That's the root of my research question: I wanted to compare how it performs on each of our datasets.

After running the model on the 25 healthy controls, Whisper found and labeled on average 94.8% of the 20 sentences, and the timestamps were 98.7% accurate. This is fairly good, considering we didn't adjust any parameters or use a Voice Activity Detector to assist it. Next, I ran the exact same model on 25 randomly selected recordings with disordered speech. This time, Whisper found just 58.2% of the 20 sentences! This is a huge problem: instead of finding about 19 sentences on average, it found only about 12. The timestamps were 97.8% accurate. Although this seems like a small drop, it's actually significant because we're working down to the millisecond, and the average syllable is about 65 milliseconds. If the model is just a tiny bit off, it could miss key syllables at the beginning or end of a sentence.
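For reference, here is roughly what such a baseline run looks like with the open-source whisper package. The overlap-based matching and the boundary-error measurement below are one illustrative way to score the output against the ground-truth spans, not necessarily the exact formulas behind the percentages above.

```python
# Rough sketch of a baseline run with off-the-shelf Whisper. The matching rule
# and error measure are illustrative, not the exact scoring behind the
# percentages reported above.
import whisper


def run_baseline(audio_path, truth_spans, model_size="base"):
    """Transcribe with default Whisper and score against ground-truth spans.

    truth_spans: list of (start_sec, end_sec) for the 20 target sentences.
    Returns (fraction_of_sentences_found, mean_boundary_error_sec).
    """
    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path)   # default parameters, no VAD
    predicted = [(seg["start"], seg["end"]) for seg in result["segments"]]

    found = 0
    boundary_errors = []
    for t_start, t_end in truth_spans:
        # Count a sentence as "found" if any predicted segment overlaps it
        overlapping = [
            (p_start, p_end)
            for p_start, p_end in predicted
            if p_start < t_end and p_end > t_start
        ]
        if not overlapping:
            continue
        found += 1
        # Score timestamps using the segment with the largest overlap
        p_start, p_end = max(
            overlapping,
            key=lambda p: min(p[1], t_end) - max(p[0], t_start),
        )
        boundary_errors.append(abs(p_start - t_start))
        boundary_errors.append(abs(p_end - t_end))

    detection_rate = found / len(truth_spans)
    mean_error = (
        sum(boundary_errors) / len(boundary_errors) if boundary_errors else None
    )
    return detection_rate, mean_error
```

Averaging these two numbers over the 25 recordings in each group gives a comparison like the one above: the detection rate captures how many of the 20 sentences the model finds at all, and the boundary error captures how tight its timestamps are.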
I ran these baseline numbers to show that there is a real problem. Now, my goal is to build a model specific to disordered speech that finds at least 90-95% of the sentences (about 18-19 of the 20 on average) while maintaining a timestamp accuracy above 98%. That would mean the proposed model performs just as well on disordered speech as off-the-shelf Whisper does on clear speech. Thanks for reading. Hope to see you next week!