Labeling the Start of My Project: Voice Activity Detection

Rohan K -

Hello all! I’m excited to finally share a rundown of what my project consists of with regard to clinical applications. At my site placement, the Dystonia Speech and Motor Control Laboratory, patients with laryngeal dystonia participate in studies where they record their speech before and after treatment. Each patient is given 20 sentences to read aloud, which are recorded and stored in an audio file. Dr. Simonyan, director of the lab, then scores a patient’s speech sample out of 10 based on harshness, breathiness, and tremor. Because there are several hundred patients, each with multiple recordings, the scoring process involves a lot of manual work and takes a significant amount of time.

One of the postdoctoral researchers is working on building an AI model that scores a patient’s speech on its own. The model would be trained on labeled data that the lab already has: Dr. Simonyan’s scores paired with the patients’ audio files (see my Introduction post for more on how AI models work). The end goal is a model that mimics Dr. Simonyan’s manual rating, providing a score out of 10, but much faster.

The first step in this model is detecting when a patient is actually speaking. Instead of manually finding the start and end of each sentence (there are roughly 700 × 20 sentences), we will use a pre-trained VAD (Voice Activity Detection) model that labels the endpoints of each sentence for us. However, this VAD model is not perfect. Sometimes it marks the end of a sentence prematurely. Sometimes a patient has to repeat a sentence for clarity, so the VAD picks up 21 sentences instead of 20. Or the patient may accidentally skip a sentence, so we get 19 instead. My first task is to manually correct the VAD output by checking and fixing its labels in Audacity. The simple project that I covered in my last post is now coming into use, because I’m using the same tools to edit the VAD’s labels. As I manually check the files, I am also working on code to improve the Voice Activity Detection model, so that eventually no manual correction is needed.
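To give a sense of what this checking looks like, here is a minimal sketch of how one could sanity-check a VAD label file before opening it in Audacity. It assumes the labels are stored in Audacity's tab-separated label-track format (start, end, label per line); the file contents, label names, and the expected sentence count here are illustrative, not the lab's actual data.

```python
# Sketch: sanity-check a VAD label file in Audacity's tab-separated
# label-track format ("start<TAB>end<TAB>label" per line).
# The example data and expected count are hypothetical.

EXPECTED_SENTENCES = 20

def parse_label_track(text):
    """Parse Audacity label-track text into (start, end, label) tuples."""
    segments = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        start, end, label = line.split("\t")
        segments.append((float(start), float(end), label))
    return segments

def check_recording(text, expected=EXPECTED_SENTENCES):
    """Return a short status string describing the segment count."""
    n = len(parse_label_track(text))
    if n == expected:
        return "ok"
    if n > expected:
        return f"{n} segments: possible repeated sentence"
    return f"{n} segments: possible skipped or merged sentence"

# Toy example with three labeled sentences:
example = "0.50\t3.10\tsent01\n4.00\t7.25\tsent02\n8.10\t11.00\tsent03"
print(check_recording(example, expected=3))  # prints "ok"
```

A script like this can't fix the labels, but it quickly flags which of the hundreds of files actually need a manual pass in Audacity.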

This is just the first step in the complex AI model. After we find the sentences, we need to identify them by transcribing them to text and matching them to one of the prescribed 20 sentences. I will save that for another time. Thanks for reading.


    Hi Rohan, this is such an interesting application of AI models. Why do you think the VAD model mislabels endpoints of sentences? Is it just dependent on the way a patient speaks?
    Ms. Bennett - Yes, it varies from patient to patient. The VAD model is trained on clear, typical speech, so it might not work well for some of the patients with LD. The challenge is fine-tuning the "settings" of the VAD model to accommodate breaks in speech caused by the disorder. I am working on finding the right settings as I go through the audio and labels.
    Hi Rohan, this is such a great start to your project! I have a question for you to think about as you work through your research: How can one ensure ethical responsibility while using AI in research? Is ethics seen the same way when using AI in research as it is when conducting traditional research? I look forward to hearing your thoughts.
