Johnny Y's Senior Project Blog

Project Title: The Meme Economy: Utilizing Natural Language Processing to Predict “Meme” Cryptocurrency Price Movements
BASIS Advisor: Rob Lee
Internship Location: University of Arizona Computer Science Department
Onsite Mentor: Dr. Mihai Surdeanu



Project Abstract

This project aims to determine whether natural language processing (NLP) can effectively predict price movements of "meme cryptocurrencies" (digital currencies driven primarily by internet culture and social media phenomena) by analyzing social media sentiment. This question is particularly relevant because meme coins, unlike traditional cryptocurrencies, are driven primarily by social factors rather than fundamental value. Success would demonstrate NLP's potential as a valuation tool for assets driven by "mindshare" and "social support," while contributing to the growing body of research at the intersection of artificial intelligence and financial markets. The research builds on existing work applying NLP to stock market and major-cryptocurrency analysis, but focuses specifically on the understudied meme coin market, which had a market capitalization of $123 billion as of November 2024. The project was conducted at the University of Arizona Computer Science Department under Dr. Mihai Surdeanu, an expert in NLP and financial text analysis. It includes the development of a sentiment analysis classifier that analyzes Reddit posts about selected meme coins. The model was tuned iteratively to generate trading signals, with performance measured against a baseline strategy built on ARIMA, a statistical time-series model. The research also addresses challenges such as filtering out automated "bot comments" that attempt to artificially inflate prices. The goal is for the sentiment-based strategy to achieve significantly higher returns than the ARIMA baseline.

    My Posts:

  • Week 12: Wrapping Up

    Hey everyone! Thanks so much for following my project over the past couple of months. This has been an incredible learning experience, and I’m confident that the skills and insights I’ve gained will continue to shape my future work—both in college and beyond. A big thank you to Professor Mihai Surdeanu, my on-site mentor, for... Read More

  • Week 11: Success!

    This week, after speaking with my advisor, I implemented several changes to the logistic regression classifier, specifically to the input feature data. Recall that the original feature data was a numeric version of each day’s posts along with the predicted ARIMA price change. I replaced the numeric post data with the proportion of the words... Read More

  • Week 10: Classifier Complete!

    This week, I finished coding my LR classifier. Again, the goal was to take some pre-computed text embeddings and predict one of three labels: "down", "no change", or "up". It uses one logistic regression layer. The data came in CSV format, with each row being a vector (basically a long list of numbers) and a... Read More
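    The data-loading step described above can be sketched in plain Python. This is a hedged illustration, not Johnny's actual code: it assumes each CSV row holds the embedding values followed by a final label column in {"down", "no change", "up"}, which matches the description but may differ in detail from the real file layout.

    ```python
    import csv
    import io

    # Assumed label encoding -- the actual mapping in the project may differ.
    LABELS = {"down": 0, "no change": 1, "up": 2}

    def load_rows(fileobj):
        """Parse CSV rows into (feature_vector, label_index) pairs.

        Each row is assumed to be embedding values followed by one
        text label in the final column.
        """
        pairs = []
        for row in csv.reader(fileobj):
            *features, label = row
            pairs.append(([float(x) for x in features], LABELS[label]))
        return pairs

    # Tiny in-memory example standing in for the real embeddings file.
    sample = io.StringIO("0.1,0.2,0.3,up\n-0.4,0.0,0.9,down\n")
    pairs = load_rows(sample)
    print(pairs)
    ```

    Once rows are in this shape, feeding them to a single logistic regression layer (e.g., `torch.nn.Linear` followed by a softmax) is straightforward.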

  • Week 9: Multiclass LR (intuitive? version)

    This week, I continued working on the multiclass LR algorithm for the SentBERT + ARIMA data. It has come to my attention that a more intuitive explanation of LR might be best (thanks Makeen). Here goes: The AI is confronted with the task of making a yes/no prediction based on several bits of information. It... Read More

  • Week 8: Multiclass LR

    This week, I continued creating a PyTorch classifier for the SentBERT + ARIMA Dogecoin data, which implements multiclass logistic regression. Recall from my last post that binary logistic regression examines the dot product of the feature vector (the embedding created by running SentBERT on the words in the Reddit posts/comments) and a weight vector, and... Read More
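    The multiclass extension of that dot-product idea can be shown in a few lines of plain Python: one weight vector (and bias) per class, then a softmax to turn the three scores into probabilities. The weights below are made up purely for illustration; the real classifier learns them via PyTorch.

    ```python
    import math

    def softmax(scores):
        # Subtract the max score for numerical stability before exponentiating.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def predict(features, weight_rows, biases):
        """Multiclass logistic regression: dot each class's weight vector
        with the features, add the bias, softmax, and pick the argmax."""
        scores = [sum(w * x for w, x in zip(row, features)) + b
                  for row, b in zip(weight_rows, biases)]
        probs = softmax(scores)
        return max(range(len(probs)), key=probs.__getitem__), probs

    # Toy 2-feature, 3-class example (down / no change / up) with made-up weights.
    cls, probs = predict([1.0, -0.5],
                         [[0.2, 0.1], [-0.3, 0.4], [0.5, -0.2]],
                         [0.0, 0.0, 0.0])
    ```

    With real SentBERT embeddings the feature vector has hundreds of dimensions, but the computation is identical.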

  • Week 7: Spring Break

    I will be out this week for spring break. Thank you for tuning in and see you all next week! Read More

  • Week 6: SentBERT + Model Usefulness

    This week, after discussing with my advisor, I computed average SentBERT embeddings for all posts of each day. I then manually entered 2024 prices for Dogecoin, adapted my ARIMA model to work with an Excel file (which proved harder than I expected; pandas DataFrames are picky!), and appended the ARIMA results to... Read More
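    The per-day averaging step can be sketched in plain Python. The toy two-dimensional vectors below stand in for real SentBERT embeddings (which have hundreds of dimensions); the grouping-then-element-wise-mean logic is the same.

    ```python
    from collections import defaultdict

    def average_daily_embeddings(posts):
        """posts: iterable of (date_str, embedding) pairs.

        Groups embeddings by date and returns {date: element-wise mean},
        mimicking the 'average embedding per day' feature described above.
        """
        grouped = defaultdict(list)
        for date, emb in posts:
            grouped[date].append(emb)
        return {
            date: [sum(vals) / len(vals) for vals in zip(*embs)]
            for date, embs in grouped.items()
        }

    daily = average_daily_embeddings([
        ("2024-01-01", [1.0, 2.0]),
        ("2024-01-01", [3.0, 4.0]),
        ("2024-01-02", [0.5, 0.5]),
    ])
    ```

    In practice this would likely be a `pandas` groupby over a DataFrame of embeddings, but the arithmetic is the same.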

  • Week 5: Data Processing and Labeling + Readings

    This week, I converted the downloaded data into a spreadsheet and utilized a Hugging Face DistilBERT model to generate binary sentiment labels for each post. I implemented batch processing and GPU acceleration, which reduced runtime to 1 minute. I also finished the readings my site mentor assigned me. The readings covered basic machine learning algorithms... Read More
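    The batching pattern behind that speedup can be shown in isolation. This sketch leaves out the actual Hugging Face model call; it only illustrates chunking posts into fixed-size groups so the model processes many texts per forward pass instead of one at a time.

    ```python
    def batched(items, batch_size):
        """Yield successive fixed-size chunks of `items`.

        Feeding a sentiment model lists of posts like these (rather than
        single posts) is what lets batch processing and GPU acceleration
        cut the labeling runtime.
        """
        for i in range(0, len(items), batch_size):
            yield items[i:i + batch_size]

    posts = ["post1", "post2", "post3", "post4", "post5"]
    batches = list(batched(posts, 2))
    ```

    Each batch would then be passed to the DistilBERT sentiment model in one call.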

  • Week 4: ARIMA Tweaks and Downloading Data

    This week, I made a couple of tweaks to my ARIMA model after meeting with my site advisor and downloaded data. I created a “no change” class along with up/down, as some changes are too insignificant to profit from given transaction costs and uncertainty (I set the threshold for significance at a 3% change). I experimented... Read More
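    The three-way labeling scheme with the 3% significance threshold boils down to a small function. This is a sketch of the rule as described, not the project's exact code:

    ```python
    def label_move(prev_price, next_price, threshold=0.03):
        """Label a price move as up/down/"no change".

        Moves within +/- `threshold` (3% here) are treated as "no change",
        since they are too small to profit from after transaction costs.
        """
        change = (next_price - prev_price) / prev_price
        if change > threshold:
            return "up"
        if change < -threshold:
            return "down"
        return "no change"

    print(label_move(100.0, 105.0))  # 5% gain
    print(label_move(100.0, 98.0))   # 2% drop, within the threshold
    ```

    The same threshold is what motivates the three-class (rather than binary) logistic regression later in the project.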

  • Week 3: Improving ARIMA

    This week, I focused on improving the performance of my ARIMA model. I examined potential alternative baselines just in case (random choice, tweet volume, and a random forest classifier), and tried various improvements. These included logarithmic scaling, parameter search with functions like auto_arima, AIC-based model comparison, and a rolling window. Logarithmic scaling didn’t address the... Read More

  • Week 2: ARIMA Models

    This week, I focused on building and optimizing my baseline ARIMA (autoregressive integrated moving average) model. I first built an ARIMA model to predict continuous prices (which ARIMA tends to perform better on). I chose the parameters using the Augmented Dickey-Fuller (ADF) test together with autocorrelation function (ACF) and partial autocorrelation function (PACF) plots, and used Dogecoin data... Read More
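    The "integrated" part of ARIMA(p, d, q) is just differencing: subtracting consecutive values d times until the series looks stationary (which is what the ADF test checks). A minimal plain-Python sketch of that one piece:

    ```python
    def difference(series, order=1):
        """Apply first-order differencing `order` times.

        This is the 'I' in ARIMA(p, d, q): differencing a trending price
        series until it is (roughly) stationary, as verified by an ADF test.
        """
        for _ in range(order):
            series = [b - a for a, b in zip(series, series[1:])]
        return series

    diffed = difference([100.0, 103.0, 101.0, 104.0], order=1)
    ```

    In practice the `d` parameter of `statsmodels`' ARIMA implementation handles this internally; the sketch just makes the operation concrete.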

  • Week 1: Coin Price Data Gathering and ARIMA

    This week, I focused on obtaining the data necessary for my project and studying the ARIMA model. I found that the CoinGecko API can be used to download pricing data going back up to one year, and that CoinMarketCap can be used for manual entry beyond that. I had to learn some Python in order to... Read More

  • Introductory Blog Post

    Hello! I’m Johnny Yu, a senior at BASIS Tucson North. This blog will document my journey as I advance my Senior Project investigating the efficacy of utilizing natural language processing to analyze social media in order to predict cryptocurrency price movements. I’ll be focusing on so-called “meme coins” like Dogecoin, which have higher volatility and... Read More