Labeling the Start of My Project: Voice Activity Detection
Rohan K -
Hello all! I’m excited to finally share a rundown of what my project consists of with regard to clinical applications. At my site placement, the Dystonia Speech and Motor Control Laboratory, patients with laryngeal dystonia participate in studies where they record their speech before and after treatment. Each patient is given 20 sentences to read aloud, which are recorded and saved as an audio file. Dr. Simonyan, director of the lab, then scores a patient’s speech sample out of 10 based on harshness, breathiness, and tremor. Because there are several hundred patients, each with multiple recordings, the scoring process involves a lot of manual work and takes a significant amount of time.
One of the postdoctoral researchers is working on building an AI model that scores a patient’s speech on its own. The model would be trained on labeled data that the lab already has: Dr. Simonyan’s scores paired with the patients’ audio files (see my Introduction post for more on how AI models work). The end goal is a machine that mimics Dr. Simonyan’s manual rating, providing a score out of 10, but much faster.
The first step in this model is detecting when a patient is actually speaking. Instead of manually finding the start and end of each sentence (there are 700 × 20 sentences), we will use a pre-trained VAD (Voice Activity Detection) model that labels the endpoints of each sentence for us. However, this VAD model is not perfect. Sometimes it marks the end of a sentence prematurely. Sometimes a patient has to repeat a sentence for clarity, so the VAD picks up 21 sentences instead of 20. Or a patient may accidentally skip a sentence, so we get 19 instead. My first task is to check and correct the VAD’s labels in Audacity. The simple project I covered in my last post is now coming into use, because I’m using the same tools to edit the VAD’s labels. As I manually check the files, I am also working on code to improve the Voice Activity Detection model, so that eventually no manual correction is needed.
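To give a concrete sense of the workflow, here is a minimal sketch of how VAD output might be turned into an Audacity label track and screened for the sentence-count problems described above. The segment times, function names, and the `sentence` label prefix are all hypothetical (not from the lab's actual pipeline); the tab-separated `start<TAB>end<TAB>label` format, however, is the format Audacity uses when importing and exporting label tracks.

```python
# Sketch (assumed workflow, not the lab's actual code): convert VAD
# segments into an Audacity label track and flag recordings whose
# sentence count deviates from the expected 20.

EXPECTED_SENTENCES = 20  # each patient reads 20 prescribed sentences

def to_audacity_labels(segments, prefix="sentence"):
    """Format (start_sec, end_sec) pairs as Audacity's tab-separated
    label-track text, one label per detected sentence."""
    lines = []
    for i, (start, end) in enumerate(segments, start=1):
        lines.append(f"{start:.6f}\t{end:.6f}\t{prefix}_{i:02d}")
    return "\n".join(lines)

def needs_manual_review(segments, expected=EXPECTED_SENTENCES):
    """Flag a recording when the VAD found too few or too many
    sentences (e.g. a repeated sentence gives 21, a skipped one 19)."""
    return len(segments) != expected

if __name__ == "__main__":
    # Hypothetical VAD output for one recording: (start, end) in seconds
    segments = [(0.50, 3.20), (4.10, 7.85), (8.60, 11.30)]
    print(to_audacity_labels(segments))
    print("needs review:", needs_manual_review(segments))
```

A label file like this can be imported in Audacity (File → Import → Labels), so a human only has to nudge the boundaries the VAD got wrong instead of placing every label from scratch.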
This is just the first step in the complex AI model. After we find the sentences, we need to identify them by transcribing them to text and matching them to one of the prescribed 20 sentences. I will save that for another time. Thanks for reading.