Week 3: Improving ARIMA
Johnny Y
This week, I focused on improving the performance of my ARIMA model. I also looked at possible alternative baselines as a fallback (random choice, tweet volume, and a random forest classifier) and tried several improvements: logarithmic scaling, automatic parameter selection with auto_ARIMA, parameter selection with the Akaike Information Criterion (AIC), and a rolling window.
Logarithmic scaling didn’t address the problem, since the data had sustained increases in price rather than temporary spikes. Auto_ARIMA didn’t work with my version of Python for some reason, so I used AIC instead. AIC assigns a score to each ordered triple (p, d, q) of parameters; the lower the score, the better. Thus, I looped through every candidate triple and used the one with the lowest score.
While AIC helped a little, the biggest improvement came when I implemented a “rolling window” approach. Instead of training the model once on a fixed dataset, I re-trained ARIMA for each day in the testing dataset using only the previous 30 days. ARIMA would be trained on those 30 days and make a prediction for the next day; then the 30-day window would move one day forward, and ARIMA would be re-trained and make a prediction for the following day (e.g. the prediction for Jan 31 would be based on the data from Jan 1 to Jan 30, the prediction for Feb 1 would be based on the data from Jan 2 to Jan 31, etc.).
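To make the two ideas above concrete, here is a minimal sketch of the AIC search and the rolling window. It is not the code from this project: the function names, the search ranges, and the `aic_for_order` and `forecast_next` callbacks are all hypothetical placeholders for whatever actually fits the model. With statsmodels, for instance, `aic_for_order` could be something like `lambda order: ARIMA(train, order=order).fit().aic`, but that library choice is an assumption.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Order = Tuple[int, int, int]


def select_order_by_aic(
    aic_for_order: Callable[[Order], float],
    max_p: int = 3,  # assumed search ranges, not the ones used in the post
    max_d: int = 2,
    max_q: int = 3,
) -> Order:
    """Return the (p, d, q) triple with the lowest AIC score.

    aic_for_order is a hypothetical callback that fits a model for one
    order and returns its AIC; it may raise if the fit fails.
    """
    best_order, best_aic = None, float("inf")
    for order in product(range(max_p + 1), range(max_d + 1), range(max_q + 1)):
        try:
            aic = aic_for_order(order)
        except Exception:
            continue  # skip orders the fitting routine cannot handle
        if aic < best_aic:
            best_order, best_aic = order, aic
    if best_order is None:
        raise ValueError("no candidate order could be fitted")
    return best_order


def rolling_forecast(
    series: Sequence[float],
    window: int,
    forecast_next: Callable[[Sequence[float]], float],
) -> List[float]:
    """One-step-ahead forecasts, each based only on the previous `window` values.

    The result lines up with series[window], series[window + 1], and so on.
    forecast_next is a hypothetical callback: it receives the window and
    returns a prediction for the next day (e.g. by re-fitting ARIMA).
    """
    if window < 1 or window >= len(series):
        raise ValueError("window must be between 1 and len(series) - 1")
    forecasts = []
    for t in range(window, len(series)):
        history = series[t - window:t]  # e.g. Jan 1 to Jan 30 when predicting Jan 31
        forecasts.append(forecast_next(history))
    return forecasts
```

As a quick check, `rolling_forecast(prices, 30, lambda history: history[-1])` runs the loop with a naive "tomorrow equals today" stand-in; swapping in a callback that re-fits ARIMA on `history` gives the approach described above, at the cost of one model fit per test day.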
Though it took longer to run, the rolling window overcame ARIMA’s inability to deal with sustained price increases, leading to significant accuracy improvements. Mean Absolute Percentage Error (MAPE) decreased from 40.77% to 6.66% for exact forecasting, and up/down forecasting accuracy improved from 47% to 60% (now significantly better than random chance). This method also reduces the amount of training data needed, since each fit uses only the previous 30 days. Below is a chart comparing the model’s predictions to actual prices (generated in Python using matplotlib).
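For reference, the two metrics quoted above can be computed roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and the up/down definition used here (compare each prediction and each actual price against the previous day's actual price) is one reasonable reading rather than a confirmed description of how the figures above were obtained.

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, in percent (actual values must be non-zero)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def direction_accuracy(prices: Sequence[float], predicted: Sequence[float]) -> float:
    """Fraction of days where the predicted direction matches the actual one.

    predicted[i] is the forecast for prices[i + 1], so prices holds one more
    value than predicted. "Up" means strictly above the previous day's price.
    """
    if len(prices) != len(predicted) + 1 or not predicted:
        raise ValueError("prices must hold exactly one more value than predicted")
    hits = 0
    for i, forecast in enumerate(predicted):
        previous, actual = prices[i], prices[i + 1]
        if (forecast > previous) == (actual > previous):
            hits += 1
    return hits / len(predicted)
```

Under this definition, a forecaster that always guesses at random should land near 0.5 on a long enough test set, which is the baseline the 60% figure is being compared against.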