Calculating the Error of a Speech Recognition AI
Rohan K -
Status Update: Manual correction of the Voice Activity Detector (VAD) labels was completed on Tuesday. We reviewed 323 audio recordings and over 6,000 sentences. After editing the labels, we have two files for each recording: the VAD’s labels (not perfectly accurate) and the corrected labels (accurate). Our goal is to make the VAD more accurate. Here is how I am approaching this task:
A VAD model is pre-trained (it already comes trained on hours of clear speech) and customizable (you can edit the ‘settings’ so it performs better on your specific data). We adjusted the settings to accommodate disordered speech. When we manually reviewed the timestamps, the accuracy was low. But how low is low? We need a quantifiable method of measuring accuracy. This week, I wrote code that goes through the sentences in the file with correct timestamps, matches each one to the corresponding sentence in the VAD’s file with not-so-correct timestamps, and calculates the mean absolute error between the timestamps. Ultimately, we have the VAD’s error for each audio recording, and if we take the average across recordings, we obtain one important number: the error of the VAD.
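To make the error calculation concrete, here is a minimal sketch of how it could look in Python. It is not the project's actual code. It assumes each label file has already been loaded as a list of (start, end) timestamps in seconds, and that a corrected sentence corresponds to the VAD sentence with the closest start time. The function names (match_segments, recording_mae, vad_error) and the nearest-start matching rule are illustrative assumptions.

```python
from statistics import mean

def match_segments(corrected, predicted):
    """Pair each corrected (start, end) segment with the predicted segment
    whose start time is closest. Assumption: nearest start = same sentence."""
    pairs = []
    for c_start, c_end in corrected:
        nearest = min(predicted, key=lambda p: abs(p[0] - c_start))
        pairs.append(((c_start, c_end), nearest))
    return pairs

def recording_mae(corrected, predicted):
    """Mean absolute error (seconds) over all start and end timestamps
    of one recording."""
    errors = []
    for (c_start, c_end), (p_start, p_end) in match_segments(corrected, predicted):
        errors.append(abs(c_start - p_start))
        errors.append(abs(c_end - p_end))
    return mean(errors)

def vad_error(recordings):
    """Average the per-recording errors into one number for the VAD.
    `recordings` is a list of (corrected, predicted) pairs."""
    return mean([recording_mae(c, p) for c, p in recordings])

# Hypothetical example: one recording with two sentences.
corrected = [(0.0, 1.0), (2.0, 3.0)]
predicted = [(0.5, 1.5), (2.0, 3.5)]
print(recording_mae(corrected, predicted))
```

In this made-up example the four timestamp errors are 0.5, 0.5, 0.0, and 0.5 seconds, so the recording's mean absolute error is 0.375 seconds.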
Now that we have the ability to calculate how well the VAD performs, we can adjust the settings and see if the accuracy goes up or down. There are two techniques to find the best-performing VAD settings: brute force and optimization. The brute force technique simply means calculating the error for every possible combination of the VAD’s settings and selecting the one with the least error. This takes a lot of computational power, and with many settings it can take far too long to finish. A more efficient method is optimization (hill climbing), where you make small adjustments to the settings and keep whichever adjustment performs better, in order to find a more direct path to the top of the hill (highest accuracy). This is a search technique often used in machine learning: step by step, it discovers which settings produce the least amount of error.
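The two search strategies can be sketched side by side. This is an illustration under assumptions, not the project's tuning code: the setting names (threshold, min_silence) are hypothetical, and toy_error is a stand-in for the real step of running the VAD with those settings and measuring its error. Because we are minimizing error rather than maximizing accuracy, the "hill climb" here moves toward lower values.

```python
from itertools import product

def grid_search(error_fn, grid):
    """Brute force: evaluate every combination of settings in `grid`."""
    names = list(grid)
    best_settings, best_error = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        settings = dict(zip(names, values))
        err = error_fn(settings)
        if err < best_error:
            best_settings, best_error = settings, err
    return best_settings, best_error

def hill_climb(error_fn, start, steps, max_iters=1000):
    """Hill climbing: nudge one setting at a time by its step size and
    move to the best neighbor until no neighbor lowers the error."""
    current = dict(start)
    current_error = error_fn(current)
    for _ in range(max_iters):
        best_neighbor, best_error = None, current_error
        for name, step in steps.items():
            for delta in (-step, step):
                candidate = dict(current)
                candidate[name] = current[name] + delta
                err = error_fn(candidate)
                if err < best_error:
                    best_neighbor, best_error = candidate, err
        if best_neighbor is None:  # no improvement: we are at a (local) optimum
            break
        current, current_error = best_neighbor, best_error
    return current, current_error

def toy_error(settings):
    """Stand-in for 'run the VAD and compute its error' (assumption).
    Lowest error is at threshold=5, min_silence=3."""
    return (settings["threshold"] - 5) ** 2 + (settings["min_silence"] - 3) ** 2

grid = {"threshold": range(0, 11), "min_silence": range(0, 7)}
print(grid_search(toy_error, grid))
print(hill_climb(toy_error, {"threshold": 0, "min_silence": 0},
                 {"threshold": 1, "min_silence": 1}))
```

On this toy problem the grid search evaluates all 77 combinations, while hill climbing reaches the same settings after checking only a handful of neighbors per step. One caveat worth keeping in mind: hill climbing can stop at a local optimum, so trying a few different starting points is a common safeguard.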
Right now I have the error calculation ready. I am working on creating the machine learning model to train the VAD to give more accurate timestamps of the start and end of each sentence. After that, we can begin training the larger model to rate the sentences out of 10 on harshness, breathiness, and tremor. Progress is being made faster than expected, and I am now confident in my Python skills. I have found that the best way to learn how to code is to throw yourself in the middle of a project that works with real data!