Who's the Best Batter? Estimating Probabilities from Unevenly Collected Data


Nina Zumel

20 May 2026

probabilistic modeling, python, stan, Bayesian data analysis

Source code for this article

In this article, we look at the problem of estimating and comparing probabilities about a population of subjects from unevenly collected observations. Some examples might include:

The perceived quality of a movie (how often is a movie positively reviewed) when some movies have far more reviews than others.

The effectiveness of various ad campaigns, when some campaigns have had more exposure than others.

The efficacy of a certain medical procedure by hospital, when some hospitals have had more cases than others.

For our specific task, we'll try to estimate the "innate" batting ability (the probability of making a hit when at bat)[1] of major league baseball players in 2023[2]. For the sake of this article, we will take this single season of data as everything that we know about these players and their batting statistics.

First, let's take a quick look at the data.

import pandas as pd

# datadir (the path to the data directory) is defined in the article's source code
battingf = pd.read_csv(datadir + 'battingstats.tsv', sep="\t")
battingf

     playerID   atbat  hits  batting_avg
0    abramcj01    563   138     0.245115
1    abreujo02    540   128     0.237037
2    abreuwi02     76    24     0.315789
3    acunaro01    643   217     0.337481
4    adamewi01    553   120     0.216998
...        ...    ...   ...          ...
651  yoshima02    537   155     0.288641
652  youngja02     43     8     0.186047
653  youngja03    107    27     0.252336
654  zavalse01    175    30     0.171429
655  zuninmi01    124    22     0.177419

656 rows × 4 columns
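As a quick sanity check, the batting_avg column appears to be simply hits divided by atbat. The sketch below verifies this for three rows copied from the table above; the list name `sample_rows` and the helper `observed_average` are introduced here for illustration and are not part of the article's source code.

```python
import math

# Three rows copied from the table above: (playerID, atbat, hits, batting_avg)
sample_rows = [
    ("abramcj01", 563, 138, 0.245115),
    ("abreujo02", 540, 128, 0.237037),
    ("abreuwi02", 76, 24, 0.315789),
]

def observed_average(hits: int, atbat: int) -> float:
    """Observed batting average: the fraction of at-bats that were hits."""
    return hits / atbat

for player_id, atbat, hits, batting_avg in sample_rows:
    computed = observed_average(hits, atbat)
    # The table shows six decimal places, so compare at that precision
    matches = math.isclose(computed, batting_avg, abs_tol=5e-7)
    print(f"{player_id}: {hits}/{atbat} = {computed:.6f} (matches table: {matches})")
```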

We can calculate some summary statistics, too.

nplayers = battingf.shape[0]
print(f'Population of {nplayers} players.')

mean_ba = battingf['batting_avg'].mean()
std_ba = battingf['batting_avg'].std()
print(f'Mean batting average: {mean_ba:.2f}, standard deviation {std_ba:.2f}')

mean_atbat = battingf['atbat'].mean()
std_atbat = battingf['atbat'].std()
print(f'Mean at bats: {mean_atbat:.2f}, standard deviation {std_atbat:.2f}')

Population of 656 players.
Mean batting average: 0.23, standard deviation 0.07
Mean at bats: 250.64, standard deviation 192.45

Given this information, how do we estimate players' batting ability?

You may be tempted to simply use a player's observed batting average as an estimate of their batting skill. One issue with this is that not all players get the same number of at-bats. Let's look at the batting averages for all the players, sorted by their number of times at bat. The horizontal line on the graph represents the mean batting average of the population.

Batting averages versus times at bat. Dark blue horizontal line represents population mean batting average.

As you can see, the number of at-bats for players in 2023 varied widely; some players were up hundreds of times, and some fewer than ten times. For players with a lot of at-bats, their observed batting average is probably a good estimate of their innate batting ability. But for players with fewer at-bats, their observed batting average is more likely to overestimate or underestimate their ability.
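To make the effect of sample size concrete, here is a minimal sketch that treats each at-bat as an independent trial with a fixed hit probability p. That binomial model is a simplifying assumption of this example, not something the data guarantees. Under it, the standard error of an observed batting average over n at-bats is sqrt(p(1 - p)/n). The value p = 0.23 is the population mean reported above; the at-bat counts and the function name are illustrative.

```python
import math

def batting_avg_std_error(p: float, n_atbat: int) -> float:
    """Standard error of an observed batting average, assuming
    hits ~ Binomial(n_atbat, p)."""
    return math.sqrt(p * (1.0 - p) / n_atbat)

p = 0.23  # population mean batting average reported above
for n_atbat in (5, 50, 500):  # illustrative at-bat counts
    se = batting_avg_std_error(p, n_atbat)
    print(f"{n_atbat:>3} at-bats: standard error {se:.3f}")
```

Under these assumptions the standard error is roughly 0.19 at 5 at-bats, 0.06 at 50, and 0.02 at 500, so an average computed from a handful of at-bats can land far from the batter's underlying ability purely by chance.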

Finding the Top 10 Batters

We can make this point dramatically by using our naive batting ability estimates to answer the question: who are the top 10 batters?

naive_top10 = battingf.nlargest(10, 'batting_avg')
naive_top10

     playerID   atbat  hits  batting_avg
140  culbech01                  1.000000
124  colliza01                  0.500000
325  lopezal03                  0.500000
450  perezmi03                  0.500000
593  tuckeco01                  0.500000
626  wallfo01      13     6     0.461538
583  toroab01      18     8     0.444444
165  downsje01                  0.400000
229  graytr01                   0.400000
121  clemeer01     50    19     0.380000

Do you trust this ranking? Probably not. Most of the players in the top ten by this naive measure have been at bat very few times, and their batting averages are unrealistically high. Remember: the mean batting average in the league for this season is 0.23, and the standard deviation is small (0.07). Batting averages of 1.0 or even 0.5 are highly improbable as estimates of actual player ability.
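A rough calculation shows how easily a short record can produce an extreme average. The sketch below assumes a batter whose true hit probability equals the league mean of 0.23 and models hits as binomial (independent at-bats with a constant hit probability, which is an assumption of this example). It then computes the chance that such a batter shows an observed average of 0.500 or better. The function name and the at-bat counts are illustrative.

```python
from math import comb

def prob_avg_at_least(threshold: float, n_atbat: int, p: float) -> float:
    """P(observed batting average >= threshold) for a batter with true hit
    probability p, assuming hits ~ Binomial(n_atbat, p)."""
    return sum(
        comb(n_atbat, k) * p**k * (1.0 - p) ** (n_atbat - k)
        for k in range(n_atbat + 1)
        if k >= threshold * n_atbat  # hit counts whose average meets the threshold
    )

p_league = 0.23  # league mean batting average reported above
for n_atbat in (2, 50):  # illustrative at-bat counts
    prob = prob_avg_at_least(0.5, n_atbat, p_league)
    print(f"{n_atbat:>2} at-bats: P(average >= 0.500) = {prob:.6f}")
```

Under these assumptions, a league-average batter has about a 41% chance of showing an average of 0.500 or better over two at-bats, but less than a 0.1% chance over fifty. High averages on very few at-bats therefore say little about ability.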

This is analogous to sorting the rankings of a product on an online shopping site[3]. Which assessment would you consider more reliable:

one with a five-star average rating calculated from only one or two ratings,

or one with a 4.5 star rating calculated from 200 ratings?

Personally, I would be more likely to trust the assessment of the second product.

Given that our observations of the players are so uneven, is there a better way to estimate how good a batter each player really is?

We would like a method that handles players with very few observations in a reasonable way. If a player has been at bat only once, their observed batting average is either 1 or 0: either they look perfect, or they look terrible. Since they are most likely neither, we'd like to assume a reasonable estimate of their batting ability, one we can use while we are waiting for more data. And of course, we want a method where the...
