Lip-Syncing My Data

Elena C - April 12, 2025 1:15 pm

Hey everyone,

Today I built some more ML models because I had to retrain them with the new compound sentiments. In total, this is a four-step process.

1: First, I had to train my sentiment analysis model on a real-world dataset. The one I picked was Sentiment140, an online dataset that contains over 1.6 million tweets labeled as positive, negative, or neutral, which made it a great dataset for comparing social media.

2: Then I converted the text into numerical features using a method called TF-IDF (term frequency-inverse document frequency). TF-IDF is a statistic that represents how important a word is to one document compared with the rest of the documents in a collection. To find it, you multiply two parts, TF x IDF. That translates to

$$\text{TF-IDF} = \frac{\text{Number of times a word appears in a document}}{\text{Total number of words in the document}} \times \log\left(\frac{\text{Total number of documents}}{\text{Number of documents with that word}}\right)$$

To analyze it: a word that appears a lot in one document (or in this case a transcript, since we are using interviews) but not in the other transcripts will have a high TF-IDF score. That essentially means the word is distinctive to that specific document.
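To make the formula concrete, here is a minimal sketch of it in plain Python. The three mini "transcripts" are made up for illustration, the function name `tf_idf` is my own, and the natural log is an assumption since the formula does not fix a base. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant, so their numbers will differ.

```python
import math

def tf_idf(word, document, corpus):
    """TF-IDF of `word` in `document`; each document is a list of tokens."""
    # TF: share of the document's words that are `word`.
    tf = document.count(word) / len(document)
    # IDF: log of (total documents / documents containing the word).
    docs_with_word = sum(1 for doc in corpus if word in doc)
    if docs_with_word == 0:
        return 0.0  # the word never appears anywhere, so give it no weight
    idf = math.log(len(corpus) / docs_with_word)  # natural log (assumption)
    return tf * idf

# Made-up mini "transcripts", only for illustration.
transcripts = [
    "the show was amazing and the crowd was amazing".split(),
    "the crowd was loud".split(),
    "the makeup took hours".split(),
]

score_amazing = tf_idf("amazing", transcripts[0], transcripts)
score_the = tf_idf("the", transcripts[0], transcripts)
print(round(score_amazing, 3), score_the)
```

In this toy example "amazing" only shows up in the first transcript, so it gets a positive score, while "the" shows up in every transcript, so its IDF is log(1) = 0 and its score is zero.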

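3: Next I cleaned up the data by taking out stop words (very common words like “the” and “and”) and used it to train a logistic regression model. Logistic regression basically estimates the probability of an event occurring, in this case the probability that a piece of text carries a given sentiment.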
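Here is a rough sketch of what this step does under the hood, in plain Python. The stop-word list, the weights, and the helper names (`remove_stop_words`, `predict_positive`) are all hypothetical, and the weights are hand-picked rather than learned. It only shows how a trained logistic regression turns word features into a probability, not how the training itself works.

```python
import math

# A tiny, made-up stop-word list; real lists are much longer.
STOP_WORDS = {"the", "was", "and", "a", "is", "it"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

def predict_positive(features, weights, bias):
    """Logistic regression prediction: sigmoid of a weighted sum of features."""
    z = bias + sum(weights.get(word, 0.0) * value
                   for word, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

tokens = remove_stop_words("The show was amazing and the crowd was loud")

# Hypothetical weights: positive values push the prediction toward "positive".
weights = {"amazing": 2.0, "loud": -0.5}

# Stand-in for TF-IDF values: every remaining word gets a feature value of 1.0.
features = {word: 1.0 for word in tokens}

probability = predict_positive(features, weights, bias=0.0)
print(tokens, round(probability, 3))
```

The sigmoid squeezes any weighted sum into a number between 0 and 1, which is why the output can be read as a probability.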

4: And lastly, I checked how well the model performed by calculating an F1 score: $F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$. Precision checks how exact the model is (how many of its positive predictions were correct) and recall looks at how comprehensive the model is (how many of the actual positives it found). The lower the score, the more false positives and false negatives the model produced. My goal for this research project is to reach 95% accuracy; however, with only 10 transcripts being studied to maintain personalization, the score is likely to be lower.
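As a quick worked example of the F1 formula, here is a small sketch in plain Python. The counts are made up and are not results from my project, and the function name `f1_score` is just for illustration.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    if predicted_positive == 0 or actual_positive == 0:
        return 0.0  # precision or recall is undefined; treat the score as 0
    precision = true_positives / predicted_positive
    recall = true_positives / actual_positive
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

# Made-up counts: the second model misses many more positives.
balanced = f1_score(true_positives=8, false_positives=2, false_negatives=2)
misses_more = f1_score(true_positives=8, false_positives=2, false_negatives=8)
print(round(balanced, 3), round(misses_more, 3))
```

Both made-up models have the same precision (0.8), but the second one has more false negatives, so its recall and its F1 score both drop.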

Come back next week for another update!

Comments:


krish_s

Great work so far, Elena! How do you plan on handling potential bias in your sentiment analysis model, especially since it’s trained on tweets but applied to interviews?

April 16, 2025 at 10:17 pm

elena_c

Great question, Krish! While it's trained on tweets, the training is only applied to the word choices it matches. Some of the tweets did contain context about drag, and those were definitely more helpful than some of the others, but all sentiment analysis was personally cross-checked against each scenario. In the future I also plan to use secondary survey data to cross-check the sentiment against the literature to ensure academic validity in the data.

April 24, 2025 at 3:54 pm
