Post 5 – Data
Hi, welcome back to another banger. Today, I’ll be expanding a bit on my data and how I’m using it. I’m still finalizing the three datasets that I need. As of my last post, I had only finished the first dataset. Now I’m working on the last one, the PMI dataset I mentioned at the end of that post, which contains the words most commonly associated with each country.
First, let me talk about some specifics of how I used sentiment analysis. Sentiment analysis, a fundamental natural language processing technique, was used to label the sentiment of each news headline in order to create the second dataset. I tried several machine learning models and natural language processing libraries, including NLTK (Natural Language Toolkit), Logistic Regression, and an LSTM. However, the model that performed with the highest accuracy (and ultimately the model that was used to create the second dataset) was BERT. BERT, a transformer-based model, captures contextual relationships in text.
So far, the process of making each dataset has gone smoothly. Right now, as I mentioned, I have two datasets. The description of how I made the first dataset is in my previous post.
From the second figure, we can see that developing countries had roughly the same number of negative (390) and neutral (385) headlines. The number of positive headlines about developing countries was much lower, at only 150. Developed countries showed a somewhat different trend. The overall number of developed-country headlines was much larger, which was expected: the news articles all come from ABC, which has a profit incentive to cover the current events its readers are most likely to read about, and those events mostly take place in developed countries. As for the sentiment associated with developed countries, 567 headlines were negative, 696 were neutral, and only 133 were positive. Among the headlines that mentioned both developed and developing countries, 112 were negative, 133 were neutral, and only 47 were positive. This developed/developing category had the fewest headlines in the dataset, with only 292 in total.
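Since the three groups have very different totals, the raw counts are easier to compare as percentages. Here is a minimal sketch of that arithmetic. It only uses the counts quoted above; the dictionary layout and the `sentiment_shares` helper are names I made up for illustration, not part of my actual pipeline.

```python
# Headline counts per country group and sentiment, copied from the numbers above.
counts = {
    "developing": {"negative": 390, "neutral": 385, "positive": 150},
    "developed": {"negative": 567, "neutral": 696, "positive": 133},
    "developed/developing": {"negative": 112, "neutral": 133, "positive": 47},
}


def sentiment_shares(group_counts):
    """Return (total, {sentiment: percent of total}) for one group.

    Hypothetical helper for illustration only.
    """
    total = sum(group_counts.values())
    shares = {s: round(100 * n / total, 1) for s, n in group_counts.items()}
    return total, shares


for group, group_counts in counts.items():
    total, shares = sentiment_shares(group_counts)
    print(f"{group}: {total} headlines, {shares}")
```

Run this way, the totals come out to 925 developing, 1,396 developed, and 292 developed/developing headlines, and positive headlines make up a smaller share for developed countries (about 9.5%) than for developing countries (about 16.2%).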
In my next post I will most likely be done with the common words dataset, so stay tuned!