Lip-Syncing My Data
Elena C -
Hey everyone,
Today I built some more ML models, because I had to retrain them with the new compound sentiments. In total this is a 4-step process.
1: First, I had to train my sentiment analysis model on a real-world dataset. The one I picked was Sentiment140, an online dataset of 1.6 million tweets labeled by sentiment (positive or negative in the training set, with neutral tweets appearing only in its small test set), which made it a great dataset for working with social media text.
2: Then I converted the text into numerical features using a method called TF-IDF (term frequency-inverse document frequency). TF-IDF is a statistic that measures how important a word is to one document compared with the rest of the documents. To find it you multiply two parts, TF x IDF. That translates to
$$\text{TF-IDF} = \left(\frac{\text{Number of times a word appears in a document}}{\text{Total number of words in the document}}\right) \times \log\left(\frac{\text{Total number of documents}}{\text{Number of documents with that word}}\right)$$
To analyze it: a word that appears a lot in one document (or in this case transcript, since we are using interviews) but not in the other transcripts will have a high TF-IDF score. That essentially means the word is a distinguishing feature of that specific document. A word that shows up in every transcript gets a score of zero, because log(1) = 0.
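To make the formula concrete, here is a minimal sketch in plain Python. The three mini "transcripts" are made up for illustration and are not my interview data, and the function name `tf_idf` is just one I picked for this example.

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF of `term` in `doc`; each document is a list of words."""
    # TF: share of the document's words that are this term
    tf = doc.count(term) / len(doc)
    # Number of documents that contain the term at least once
    docs_with_term = sum(1 for d in corpus if term in d)
    if docs_with_term == 0:
        return 0.0  # term never appears anywhere
    # IDF: log of (total documents / documents containing the term)
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

# Made-up example transcripts, already lowercased and split into words
transcripts = [
    "i love singing and i love dancing".split(),
    "i was nervous before the interview".split(),
    "the interview went well and i was happy".split(),
]

# "love" appears twice in the first transcript and nowhere else
score_love = tf_idf("love", transcripts[0], transcripts)
# "i" appears in every transcript, so its IDF is log(1) = 0
score_i = tf_idf("i", transcripts[0], transcripts)

print(score_love, score_i)
```

Here "love" gets a positive score while "i" gets zero, which matches the analysis above. Libraries such as scikit-learn use a smoothed version of this formula, so their numbers will not match this sketch exactly.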
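3: Next I cleaned up the data by taking out stop words (very common words like "the" and "and" that carry little meaning) and used it to train a logistic regression model. Logistic regression basically estimates the probability of an event occurring, which here is the probability that a piece of text has a given sentiment.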
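As a rough illustration of this step, the sketch below removes stop words from a sentence and then shows how a logistic regression turns feature values into a probability. The stop word list, the weights, and the bias are all hypothetical values chosen for the example; in a real model the weights are learned during training.

```python
import math

# A tiny, hypothetical stop word list (real lists are much longer)
STOP_WORDS = {"i", "the", "and", "was", "a", "to"}

def remove_stop_words(words):
    """Keep only the words that are not in the stop word list."""
    return [w for w in words if w not in STOP_WORDS]

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Logistic regression: sigmoid of a weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

cleaned = remove_stop_words("i was happy and the interview went well".split())

# Hypothetical weights for two TF-IDF features, e.g. "love" and "nervous"
weights = [2.0, -3.0]
bias = 0.0

p_positive = predict_proba([0.3, 0.0], weights, bias)     # some "love", no "nervous"
p_no_evidence = predict_proba([0.0, 0.0], weights, bias)  # neither word present

print(cleaned, p_positive, p_no_evidence)
```

With no evidence either way the model outputs exactly 0.5, and a positive-weighted feature pushes the probability above 0.5.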
4: And lastly, I checked how well the model performed by calculating an F1 score: $F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$. Precision checks how exact the model is (of everything it labeled as a given sentiment, how much was right) and recall looks at how comprehensive the model is (of everything that truly had that sentiment, how much it found). The lower the score is, the more false positives/negatives the model produced. My goal for this research project is to obtain 95% accuracy; however, with only 10 transcripts being studied to maintain personalization, the score is likely to be lower.
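Here is a small worked example of the F1 calculation in plain Python. The ten labels and predictions are invented for illustration (1 = positive, 0 = not positive) and are not results from my model.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented labels for ten examples
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

print(precision, recall, f1, accuracy)
```

In this example there are 3 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all come out to 0.75 while accuracy is 0.8. That shows F1 and accuracy are related but not the same number.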
Come back next week for another update!