Measuring the Problem: Clear Speech vs. Disordered Speech

Rohan K -

Hello and welcome back! Last week I introduced a new model called Whisper. As I continue to compare models and run optimization algorithms, I am also getting ready to present results. The first step is measuring the initial problem, to show why this work matters in the first place. This week I reviewed the healthy controls, who have no symptoms when speaking; their recordings have a more continuous voice and clearer articulation. This dataset consists of 25 healthy control speakers. One challenge was that each recording contained more than the 20 sentences I was working with: there were additional sentences in between and whispered sentences at the end. This meant I had to do some preprocessing. I edited each audio file so that it contained only the 20 target sentences, and I labeled each sentence with its correct timestamps. With ground-truth timestamps in place, I can judge the accuracy of a model's predicted timestamps. Now that both the healthy control and the voice disorder datasets were in the same format, I could run some baseline tests.
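To make the preprocessing concrete, here is a minimal sketch of how a recording could be trimmed down to the 20 target sentences and paired with ground-truth timestamps. The file names, the CSV layout for the hand-marked sentence boundaries, and the use of pydub are my own illustrative assumptions, not the exact tooling used in the lab.

```python
# Sketch of the preprocessing step, assuming each recording is a WAV file and
# the sentence boundaries have been marked by hand in a CSV with
# sentence_id, start_sec, end_sec columns (illustrative format, not the lab's).
import csv
from pydub import AudioSegment

def trim_to_target_sentences(wav_path, label_csv, out_path):
    """Keep only the 20 labeled sentences, concatenated in order,
    and return their ground-truth timestamps in the trimmed file."""
    audio = AudioSegment.from_wav(wav_path)
    kept = AudioSegment.empty()
    ground_truth = []   # (sentence_id, start_sec, end_sec) in the new file
    cursor = 0.0
    with open(label_csv, newline="") as f:
        for row in csv.DictReader(f):
            start_ms = int(float(row["start_sec"]) * 1000)
            end_ms = int(float(row["end_sec"]) * 1000)
            clip = audio[start_ms:end_ms]          # slice out one sentence
            kept += clip                           # append it to the output
            ground_truth.append((row["sentence_id"],
                                 cursor,
                                 cursor + clip.duration_seconds))
            cursor += clip.duration_seconds
    kept.export(out_path, format="wav")
    return ground_truth
```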

Whisper, the speech-to-text transcription model developed by OpenAI, is the current state of the art and was trained on more than 600,000 hours of speech. It's important to note that it learned from clear speech, not disordered speech; that is the root of my research question. I wanted to compare how it performs on each of our datasets. After running the model on the 25 healthy controls, Whisper found and labeled 94.8% of the 20 sentences on average, and its timestamps were 98.7% accurate. That is fairly good, considering we didn't adjust any parameters or use a Voice Activity Detector to assist it. Next, I ran the exact same model on 25 randomly selected recordings with disordered speech. This time, Whisper found just 58.2% of the 20 sentences! This is a huge problem: instead of finding 19 sentences on average, it found only 12. The timestamps were 97.8% accurate. Although that seems like a small drop, it is actually significant because we are working down to the millisecond, and the average syllable is about 65 milliseconds, so if the model is even slightly off, it can miss key syllables at the beginning or end of a sentence.
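For reference, the sketch below shows one way these baseline numbers could be produced with the openai-whisper package: transcribe each recording with default settings, match each ground-truth sentence to the predicted segment that overlaps it most, and score the timestamps by boundary error. The post doesn't spell out the exact scoring, so the 50%-overlap matching rule and the boundary-error formula here are stand-ins; only the basic transcribe call and segment fields come from the Whisper API.

```python
# Minimal evaluation sketch, assuming the openai-whisper package and a list of
# ground-truth (start_sec, end_sec) pairs for the 20 sentences. The matching
# rule and accuracy formula are illustrative, not the study's exact metrics.
import whisper

def evaluate(wav_path, reference):
    model = whisper.load_model("base")
    result = model.transcribe(wav_path)          # default parameters, no VAD
    segments = [(s["start"], s["end"]) for s in result["segments"]]

    found = 0
    boundary_errors = []
    for ref_start, ref_end in reference:
        # Predicted segment with the largest overlap for this sentence.
        best = max(segments,
                   key=lambda seg: min(seg[1], ref_end) - max(seg[0], ref_start),
                   default=None)
        if best is None:
            continue
        overlap = min(best[1], ref_end) - max(best[0], ref_start)
        if overlap >= 0.5 * (ref_end - ref_start):   # count the sentence as found
            found += 1
            boundary_errors.append(abs(best[0] - ref_start) + abs(best[1] - ref_end))

    detection_rate = found / len(reference)
    total_ref = sum(end - start for start, end in reference)
    timestamp_accuracy = 1.0 - sum(boundary_errors) / total_ref if found else 0.0
    return detection_rate, timestamp_accuracy
```

Averaging these two scores over the 25 recordings in each group would give per-dataset numbers of the kind reported above.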

I ran these baseline numbers to show that there is a real problem. Now, my goal is to build a model specific to disordered speech that finds at least 90-95% of the sentences (roughly 18-19 sentences on average) while still maintaining a timestamp accuracy above 98%. That would mean the proposed model performs just as well on disordered speech as Whisper did on clear speech. Thanks for reading. Hope to see you next week!
