Using OpenAI but not ChatGPT? Exploring Speech Recognition Systems

Rohan K -

Hello all! There are two components of this project: the voice activity detection (VAD) model and the speech-to-text (STT) model. The VAD feeds time clips recognized as human speech into the STT model, and the STT model then outputs transcribed sentences. However, disordered speech leads to a higher error rate. For the past two weeks, I have been working on calculating this error rate and trying to figure out why it was so high. Specifically, I've been trying to fine-tune the VAD model.

As I was about to run a parameter search, I came across a different STT model with promising accuracy. Originally, we were using Facebook AI's Wav2Vec 2.0, which is a speech-to-text model trained on audio and unpaired text. The new model I found is a faster version of Whisper, a model from OpenAI, the same company that made ChatGPT. Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of data. According to this article, the accuracy of Whisper is significantly better than that of Wav2Vec 2.0. I am in the process of testing its accuracy on disordered speech specifically, but we will likely move forward using this model instead.

After I test the Whisper model, I will go back to fine-tuning the VAD. The VAD remains a priority because it's more sensitive to breaks, breathiness, and tremor in disordered speech. I hope to have the full model pipeline working by next week so I can run an optimization search.

Thanks for reading. Here's a picture of everyone in the lab at Dr. Simonyan's Professor Promotion Celebration at the Harvard Faculty Club:
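
For anyone curious what calculating an error rate for an STT model can look like: a common metric is word error rate (WER), which counts the word substitutions, deletions, and insertions needed to turn the model's transcript into the correct one, divided by the number of words in the correct transcript. The post does not say which metric the project uses, so treat this as a minimal sketch under that assumption. The function name `word_error_rate` and the example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: lowercases and splits on whitespace, with no
    punctuation handling or other text normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the transcript dropped one of six words.
score = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {score:.3f}")
```

A lower score means a more accurate transcript, and a score of 0 means the transcript matches the reference word for word. Comparing two STT models on the same recordings would mean averaging a score like this over every clip.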


    Looks like you are making some great progress, Rohan. I am curious whether these AI models are available to the public. What are other applications of these speech-to-text AI models?
    Bryn Sharp
    Hi Rohan, what great progress! How did you come across this new AI model? What do you think might be the implications of changing models at this stage in your research? On another note, I absolutely love the photo. Thank you for sharing - it looks like you all are having a blast!
    Ms. Bennett - These models are completely 'open-source', meaning the code is available to anyone. You can download it from a website that hosts AI models. Speech-to-text models are used in many settings, from making your voice clearer over the phone to speaking to an AI assistant such as Siri.
    Sra. Sharp - I knew this model existed, but I didn't use it because it was slow and took up a lot of memory. As I was researching alternative methods, though, I came across a version of Whisper that is reported to be faster while maintaining similar accuracy. Before I fully transition models, I need to run more tests. If it doesn't work out, the nice thing is that it is very easy to delete the model and move on.
