Week 5: Utilizing Bayesian Updating for New Data

Ian M -

This week, I continued to focus on predicting future player performance using Bayesian methods. However, unlike last week, which focused on use of hierarchical data at a league-wide level, I focused on individual player data across several seasons. To make predictions using historical rather than hierarchical data, we have to use a different method known as Bayesian updating, one of the core aspects of Bayesian inference. The main components of my work this week involved learning what Bayesian updating was and how it could be used in the context of Francisco Lindor as he approached the middle of his career.

First off, Bayesian updating is a method for refining predictions by combining prior beliefs with new data using Bayes’ theorem to form posterior predictions for a given hypothesis or event. In the context of predicting a baseball player’s batting average, the prior is our initial estimate based on past performance, and the new data is the player’s current season statistics. Using Bayes’ Theorem, we update our belief about the player’s true batting average by calculating the likelihood of observing the current data given the prior. As more data is gathered, this process continuously improves the accuracy of our predictions, resulting in a more confident estimate of the player’s future performance as we shift our prior into a posterior predictive distribution.

By combining our prior distribution with our new data, we can create a posterior distribution that combines the two. This posterior distribution can then be used to create a posterior predictive distribution to predict the future state of the data. Image credit to Creative Commons Attribution 4.0 International.

One useful example in the context of baseball can be found using Francisco Lindor as an example. Early on in Lindor’s career, his batting performance was consistently solid, with his .313 rookie batting average placing him at second for AL Rookie of the Year voting and consistently above .270 batting average earning him 4 consecutive All Star selections from 2016 to 2019.  After the 2020 season, Lindor was set to become a free-agent in the upcoming season, so 2020 would be crucial to netting a large contract in the upcoming offseason. However, that year, his performance seemed to falter, with his batting average dropping to .258 in a shortened 2020 season. This raised concerns with Cleveland’s front office, who wondered if he was already regressing this early on in his career. Using Bayesian updating, we can determine if this season’s performance was statistically significant or not.

Generated using baseball-reference.com.

To set up a prior distribution for Bayesian updating, we need to create a beta distribution, a probability distribution based on a set of two positive parameters, to create a prior probability distribution for our data. A beta distribution can be defined by 

B is a normalization constant to ensure the total area is 1 and gamma represents the gamma function. Definitely not practical to do by hand! Image credit to Wikipedia, Beta distribution.

and in our case, 

This example changes alpha and beta to a and b, but they mean the same thing. Image credit to Jim Albert, Bayes Thinking.

where η is our approximated prior mean and K is an approximation based on our level of confidence in our previous metrics. We can approximate η to approximately .280, since that was how Lindor performed prior to the 2020 season and through some experimentation, we find that K at 400 is a reasonable estimate given the number of at-bats he had to this point. Using these values, we find a is 112 and b is 288. With these, we can create a distribution that gives us both a prior beta distribution and a predictive posterior distribution that we can adjust based on the length of time we wish to use our data to predict.

Using the Shiny R packages, we can create beta prior distributions and posterior predictive distributions using our data. We can choose whatever number of future at-bats that we wish; 400 is just an example that has a reasonable credible interval. Image credit to Jim Albert, Bayes Thinking.

Based on our predictions here, we can see that the parameters of the posterior distribution did not shift to a significant degree compared to our prior distribution, suggesting that Lindor’s performance had not significantly changed during the 2020 season, implying that the results from that season may have just been randomness caused by a shortened season. Regardless, Cleveland would not reach an agreement on an extension for Lindor, who would instead sign a 10-year, $341 million contract extension with the Mets after a trade during the offseason. 2021 was a rough year for Lindor once again, as he would start off his season with a .218 batting average through the first three months of the season, having 58 hits in 265 at-bats. Using Bayesian updating once again, we can test if something notable has changed with his performance. This time, we begin with a and b rather than η and K, updating them by adding hits to a and non-hits to b. 

Image credit to Jim Albert, Bayes Thinking.

This gives us a new posterior η of 0.256 and K of 665 by using the prior equations. With this new update to our distributions, we can now chart our new prior distribution and our new posterior distribution.

Same tool as last time, just with new data incorporated through Bayesian updating. Also, we choose 328 at-bats since that’s approximately how many more at-bats he should have that season if he continues at the same rate. Image credit to Jim Albert, Bayes Thinking.

From this data, we can notice that our parameters have shifted noticeably downwards compared to our previous distributions, with our confidence and credible intervals having shifted downwards to a noticeable degree. Due to this shift, we can determine that some external factor has negatively impacted his performance. In the case of older players, such regression is often just aging, but Lindor was still quite young, so that was unlikely. Instead, what had happened was an oblique strain that he suffered early on in the season that he had been playing through. Due to this, he would miss about a month and a half of the season healing from his injury, and after doing so, his performance bounced back massively, returning to his pre-2019 form and even finishing second in NL MVP voting in 2024. By using Bayesian statistics, we were able to notice that while Lindor’s batting average decline in 2020 may have just been purely random, his decline in early 2021 was not random and was instead caused by an injury. Through Bayesian updating, we can update our datasets and parameters in real time as a baseball player plays in a season to identify more subtle patterns and changes in data over time.

More Posts

Comments:

All viewpoints are welcome but profane, threatening, disrespectful, or harassing comments will not be tolerated and are subject to moderation up to, and including, full deletion.

    nakyung_y
    Hey Ian! Your explanation of how to set up prior distributions with parameters η and K, and then converting them to the distribution parameters a and b is very clear. Which is helpful for those who do not have a clear background on this! Do you think your Bayesian updating approach can be used to create an early warning system for potential injuries or mechanical issues before they become apparent?
    emma_k
    Seeing a real-life application of this method is intriguing! It also made me curious: do contracting parties/agents (I'm assuming that's what they're called?) actually use this method to better determine whether they're making a good investment in a player or not?
    ian_m
    Thanks for the question, Nakyung. This approach could be useful in situations where potential injuries or mechanical issues already impact a player's performance, but in most cases, injuries won't affect our data significantly until they happen, so an early warning system would likely struggle to have widespread success.
    ian_m
    Thanks for the question, Emma. Generally, a team's front office will use methods like these to observe trends in a player's performance in recent years to determine how their performance is trending to decide how much is worth investing in trades for other players or contracts for free agents.

Leave a Reply

Your email address will not be published. Required fields are marked *