Using OpenAI but not ChatGPT? Exploring Speech Recognition Systems
Rohan K -
Hello all! There are two components of this project: the voice activity detection (VAD) model and the speech-to-text (STT) model. The VAD picks out the time segments it recognizes as human speech and feeds those clips into the STT model, which then outputs transcribed sentences. However, working with disordered speech leads to a much higher transcription error rate. For the past two weeks, I have been calculating this error rate and trying to figure out why it was so high. Specifically, I've been trying to fine-tune the VAD model.

As I was about to run a parameter search, I landed on a different STT model with promising accuracy. Originally, we were using Facebook AI's Wav2Vec 2.0, a speech-to-text model trained on audio and unpaired text. The new model I found is a faster version of Whisper, the automatic speech recognition (ASR) system from OpenAI, the same company that made ChatGPT. Whisper was trained on 680,000 hours of data, and according to this article, its accuracy is significantly better than Wav2Vec 2.0's. I am in the process of testing its accuracy on disordered speech specifically, but we will likely move forward with this model instead.

After I test the Whisper model, I will go back to fine-tuning the VAD. The VAD remains a priority because it's more sensitive to the breaks, breathiness, and tremor in disordered speech. I hope to have the full model pipeline flowing by next week so I can run an optimization search. If you're curious what some of these pieces look like in code, I've put a few rough sketches below the photo. Thanks for reading. Here's a picture of everyone in the lab at Dr. Simonyan's Professor Promotion Celebration at the Harvard Faculty Club:
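First, the error rate I keep mentioning is word error rate (WER). Here's a minimal sketch of how it can be computed with the jiwer package; the package choice and the example strings are just my illustration, not necessarily exactly what our pipeline uses.

```python
# Minimal word error rate (WER) check using the jiwer package.
# The reference/hypothesis strings below are made-up placeholders.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jump over lazy dog"

# jiwer aligns the two strings and computes
# (substitutions + deletions + insertions) / number of reference words.
error = jiwer.wer(reference, hypothesis)
print(f"WER: {error:.2%}")
```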
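Second, here's roughly what transcribing a VAD-extracted clip could look like if we go the Whisper route. I'm assuming the faster-whisper package here as the "faster version" of Whisper (that's my guess, not a confirmed choice); the model size, device, and file name are placeholders.

```python
# Sketch of transcribing one VAD-extracted clip with the faster-whisper package.
# Model size, device, and the clip path are placeholder choices.
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")

# transcribe() returns a generator of segments plus info about the audio
segments, info = model.transcribe("vad_clip_001.wav", beam_size=5)

for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```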
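Finally, the optimization search for the VAD will probably look something like a grid search over its sensitivity parameters, scored by the downstream WER. This is just a skeleton of the idea; evaluate_pipeline is a hypothetical stand-in for re-running our full VAD-to-STT flow, not a real function from any library.

```python
# Skeleton of a VAD parameter search: try combinations of speech-probability
# threshold and minimum silence duration, keep whichever gives the lowest WER.
import itertools

def evaluate_pipeline(vad_threshold, min_silence_ms):
    """Hypothetical stand-in: in the real search this would re-run the
    VAD -> STT pipeline with these settings and return the measured WER."""
    return 1.0  # dummy value so the skeleton runs as written

thresholds = [0.3, 0.4, 0.5, 0.6]
silence_settings = [100, 250, 500]

best = None
for thr, silence in itertools.product(thresholds, silence_settings):
    wer = evaluate_pipeline(vad_threshold=thr, min_silence_ms=silence)
    if best is None or wer < best[0]:
        best = (wer, thr, silence)

print(f"Best WER {best[0]:.2%} at threshold={best[1]}, min_silence={best[2]} ms")
```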