Week 11: Success!

Johnny Y -

This week, after speaking with my advisor, I implemented several changes to the logistic regression classifier, specifically to its input feature data. Recall that the original feature data was a numeric version of each day's posts along with the predicted ARIMA price change. I replaced the numeric post data with the proportions of words in the posts that were positive and negative, respectively. To determine whether a word was positive or negative, I used Bing Liu's opinion lexicon (which is based on consumer reviews). Then, instead of the raw predicted ARIMA price change, I used the percentage change, and I also added ARIMA's up/down/no-change prediction as a feature. To implement these changes, I had to create new training and test datasets. I then integrated the new datasets with the LR model, but the results were still poor. I increased the number of iterations to 50, decreased the learning rate (which controls how large each correction to the model's weights is; too large a rate can make the model overshoot and effectively throw away what it has already learned), and ran several tests to make sure the model was learning properly. However, these changes did not have the desired effect.
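Here's a rough sketch in Python of how these new features can be computed (simplified; the file names, column names, and threshold are illustrative, not my exact setup):

    import pandas as pd

    # Bing Liu's opinion lexicon ships as two plain-text word lists
    # (header lines assumed to be stripped already).
    positive = set(open("positive-words.txt").read().split())
    negative = set(open("negative-words.txt").read().split())

    def sentiment_proportions(text):
        """Fraction of a day's post words that are positive / negative."""
        words = text.lower().split()
        if not words:
            return 0.0, 0.0
        pos = sum(w in positive for w in words)
        neg = sum(w in negative for w in words)
        return pos / len(words), neg / len(words)

    days = pd.read_csv("daily_posts.csv")  # one row per day of post text
    days["pos_prop"], days["neg_prop"] = zip(
        *days["posts"].map(sentiment_proportions))

    # ARIMA features: percentage change instead of the raw change, plus a
    # categorical up/down/no-change prediction (threshold is illustrative).
    days["arima_pct"] = (days["arima_pred"] - days["close"]) / days["close"]
    days["arima_dir"] = days["arima_pct"].apply(
        lambda p: "up" if p > 0.005 else ("down" if p < -0.005 else "no_change"))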

I then tried a different set of features, based on a paper I read that analyzed Twitter posts to predict stock movements. That paper had the most success using tweet volume, so my new feature set had three features: the total number of words with sentiment, the ratio of positive words to negative words, and ARIMA's predicted percentage change. Again, I had to create a new dataset for these features. Unfortunately, this still didn't work. At my advisor's suggestion, I created a cryptocurrency/trading-specific lexicon of positive and negative words. While this improved performance, the model could still only predict two of the three classes.
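For illustration, the three-feature version looks roughly like this (the lexicon shown is just a tiny excerpt; the real one is much larger):

    # Crypto/trading-specific lexicon (tiny excerpt for illustration).
    CRYPTO_POS = {"moon", "green"}
    CRYPTO_NEG = {"crash", "red"}

    def volume_and_ratio(text):
        words = text.lower().split()
        pos = sum(w in CRYPTO_POS for w in words)
        neg = sum(w in CRYPTO_NEG for w in words)
        volume = pos + neg                        # total words carrying sentiment
        ratio = pos / neg if neg else float(pos)  # guard against dividing by zero
        return volume, ratio

    def make_features(day_text, arima_pct_change):
        volume, ratio = volume_and_ratio(day_text)
        return [volume, ratio, arima_pct_change]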

I then realized that there was still data I hadn't used: the comments on the Reddit posts I was analyzing. So I added two features from the comments data: the total number of words with sentiment and the ratio of positive to negative words in the comments. With the comments data, the model finally worked!
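Roughly, the final setup looks like this (my model is implemented from scratch, but scikit-learn's LogisticRegression is a fair stand-in here, and the sample values are toy data):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Each row: [post sentiment volume, post pos/neg ratio, ARIMA % change,
    #            comment sentiment volume, comment pos/neg ratio]
    X = np.array([[ 8, 4.0,  0.01,  50, 11.0],
                  [25, 3.0,  0.06, 120,  2.0],
                  [12, 0.5, -0.02,  80,  0.7],
                  [30, 5.0,  0.03, 200,  6.0]])
    y = np.array(["no_change", "down", "down", "up"])  # toy labels

    clf = LogisticRegression(max_iter=50)  # mirrors the 50 iterations above
    clf.fit(X, y)
    print(clf.predict([[10, 2.0, 0.02, 60, 3.0]]))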

Results:

The ARIMA baseline improved from 33% to 36% accuracy, and the full logistic regression model now reaches 47% accuracy.
Of course, there's still much room for improvement. I plan to add negated concepts to the lexicon (e.g., if "green" is positive, then "not green" should count as negative). I'll also continue expanding the lexicon to make it more comprehensive.
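A simple way to handle negation is to flip a word's polarity when it directly follows a negator. A rough sketch of that idea (the negator list is illustrative):

    NEGATORS = {"not", "no", "never", "isn't", "won't"}

    def signed_counts(words, pos_set, neg_set):
        """Count positive/negative words, flipping polarity after a negator."""
        pos = neg = 0
        for i, w in enumerate(words):
            negated = i > 0 and words[i - 1] in NEGATORS
            if w in pos_set:
                if negated: neg += 1   # "not green" counts as negative
                else:       pos += 1
            elif w in neg_set:
                if negated: pos += 1   # "no crash" counts as positive
                else:       neg += 1
        return pos, neg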


Comments:


    tesla_l
    Interesting how using comments on posts could make your model work when it wasn't before! How does your program read the comments? Do you manually copy paste them into your algorithm? Also, what words are "positive" and "negative"?
    tesla_l
    Also, as Makeen and I requested last week, could you give us a quick intuitive explanation of how the entire model works?
    johnny_y
    Thank you for your questions, Tesla! I used a filter program to save the comments into a spreadsheet, then used the pandas library (a pre-made Python library) to extract them. Python can process text as strings and split each comment into individual words, using spaces as word boundaries. I won't paste the full lexicon here, but "positive" words are words associated with the price going up, such as "moon" and "green". Similarly, "negative" words are words associated with the price going down, like "crash" and "red".

    Quick intuitive explanation:
    - The model takes its inputs: volume of posts, positive-to-negative ratio, ARIMA prediction, volume of comments, and the positive-to-negative ratio of the comments.
    - The model starts by assuming each input feature is equally valid and makes a prediction.
    - The model checks the answer.
    - If it was right, it moves on; if it was wrong, it changes how much weight it gives to each input feature.
    - Repeat for each training sample (each of which has different input values and a known correct answer).

    Example:
    Sample 1: posts: 8, positive-to-negative ratio: 4, ARIMA prediction: 0.01, comments: 50, ratio: 11. The model sees that the ratios are very positive, while the post and comment counts are low and the ARIMA prediction is low. Since every feature is weighted equally, the ratios push the model to predict "Up". Actual answer: "No change". This suggests that the ratios matter less in this case, so the model discounts them.
    Sample 2: posts: 25, positive-to-negative ratio: 3, ARIMA prediction: 0.06, comments: 120, ratio: 2. The post and comment counts are medium, the ratios are low, and the ARIMA prediction is high. The model discounts the ratios because of sample 1, and the high ARIMA prediction leads it to predict "Up". Actual answer: "Down". This suggests that the ARIMA prediction matters less in this case, so the model discounts it.
    ...repeat with 50 other samples...
    The model eventually settles on a thesis like: "When post volume and the ratios are both high, predict 'Up'. When post volume is high and the ratios are low, predict 'Down'. The ARIMA prediction is only significant when post volume is low." Hope this helps!
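    If it helps to see it concretely, here is the core of that loop as a rough Python sketch (a simplified binary up/not-up version, not my exact code; the feature values are the two samples from the example above):

        import numpy as np

        def sigmoid(z):
            return 1 / (1 + np.exp(-z))

        # Features: [posts, post ratio, ARIMA prediction, comments, comment ratio]
        samples = np.array([[ 8, 4.0, 0.01,  50, 11.0],
                            [25, 3.0, 0.06, 120,  2.0]])
        answers = np.array([0, 0])   # 1 = "Up", 0 = otherwise (neither sample was "Up")
        weights = np.ones(5) / 5     # start with every feature weighted equally
        lr = 0.01                    # learning rate: how big each correction is

        for _ in range(50):                        # 50 passes over the data
            for x, y in zip(samples, answers):
                guess = sigmoid(weights @ x)       # model makes a prediction
                error = guess - y                  # model checks the answer
                weights -= lr * error * x          # re-weight the input features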
    makeen_s
    Hi Johnny! Glad to see you're making progress. If you could quantify, in simple terms, how much your models have improved over the last few months, what would you be able to say?
    johnny_y
    Thank you for your question, Makeen! ARIMA improved accuracy from 33% to 36%, and with LR, accuracy has risen to 47%.
