Who's the Best Batter? Estimating Probabilities from Unevenly Collected Data

Nina Zumel

20 May 2026

probabilistic modeling, python, stan, Bayesian data analysis

Source code for this article

In this article, we look at the problem of estimating and comparing probabilities about a population of subjects from unevenly collected observations. Some examples might include:

The perceived quality of a movie (how often is a movie positively reviewed) when some movies have far more reviews than others.

The effectiveness of various ad campaigns, when some campaigns have had more exposure than others.

The efficacy of a certain medical procedure by hospital, when some hospitals have had more cases than others.

For our specific task, we'll try to estimate the "innate" batting ability (the probability of making a hit when at bat)[1] of major league baseball players in 2023[2]. For the sake of this article, we will take this single season of data as everything that we know about these players and their batting statistics.

First, let's take a quick look at the data.

import pandas as pd

battingf = pd.read_csv(datadir + 'battingstats.tsv', sep="\t")
battingf

      playerID  atbat  hits  batting_avg
0    abramcj01    563   138     0.245115
1    abreujo02    540   128     0.237037
2    abreuwi02     76    24     0.315789
3    acunaro01    643   217     0.337481
4    adamewi01    553   120     0.216998
...        ...    ...   ...          ...
651  yoshima02    537   155     0.288641
652  youngja02     43     8     0.186047
653  youngja03    107    27     0.252336
654  zavalse01    175    30     0.171429
655  zuninmi01    124    22     0.177419

656 rows × 4 columns

We can calculate some summary statistics, too.

nplayers = battingf.shape[0]
print(f'Population of {nplayers} players.')

mean_ba = battingf['batting_avg'].mean()
std_ba = battingf['batting_avg'].std()
print(f'Mean batting average: {mean_ba:.2f}, standard deviation {std_ba:.2f}')

mean_atbat = battingf['atbat'].mean()
std_atbat = battingf['atbat'].std()
print(f'Mean at bats: {mean_atbat:.2f}, standard deviation {std_atbat:.2f}')

Population of 656 players.
Mean batting average: 0.23, standard deviation 0.07
Mean at bats: 250.64, standard deviation 192.45

Given this information, how do we estimate players' batting ability?

You may be tempted to simply use a player's observed batting average as an estimate of their batting skill. One issue with this is that not all players get the same number of at-bats. Let's look at the batting averages for all the players, sorted by their number of times at bat. The horizontal line on the graph represents the mean batting average of the population.

Batting averages versus times at bat. Dark blue horizontal line represents population mean batting average.

As you can see, the number of at-bats for players in 2023 varied widely; some players were up hundreds of times, and some fewer than ten times. For players with a lot of at-bats, their observed batting average is probably a good estimate of their innate batting ability. But for players with fewer at-bats, their observed batting average is more likely to be an overestimate or underestimate of their ability.
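To put rough numbers on this, here is a minimal sketch of how the sampling noise in an observed batting average shrinks as at-bats accumulate. It assumes each at-bat is an independent trial with a fixed hit probability (a simplification), and it uses the population mean of 0.23 as a stand-in for a typical player's true ability. The function name `batting_avg_std_error` is ours, for illustration only.

```python
import math

def batting_avg_std_error(p: float, n_atbat: int) -> float:
    """Standard error of an observed batting average under a binomial model."""
    return math.sqrt(p * (1 - p) / n_atbat)

# Assumption: treat the population mean batting average as the true ability.
p_typical = 0.23

for n_atbat in (5, 50, 500):
    se = batting_avg_std_error(p_typical, n_atbat)
    print(f'{n_atbat:>3} at-bats: standard error {se:.3f}')
```

Under these assumptions, the standard error after 5 at-bats is about 0.19, ten times the roughly 0.02 after 500 at-bats, so an average built on a handful of at-bats can land almost anywhere.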

Finding the Top 10 Batters

We can make this point dramatically by using our naive batting ability estimates to answer the question: who are the top 10 batters?

naive_top10 = battingf.nlargest(10, 'batting_avg')
naive_top10

      playerID  atbat  hits  batting_avg
140  culbech01                  1.000000
124  colliza01                  0.500000
325  lopezal03                  0.500000
450  perezmi03                  0.500000
593  tuckeco01                  0.500000
626   wallfo01     13     6     0.461538
583   toroab01     18     8     0.444444
165  downsje01                  0.400000
229   graytr01                  0.400000
121  clemeer01     50    19     0.380000

Do you trust this ranking? Probably not. Most of the players in the top ten by this naive measure have been at bat very few times, and their batting averages are unrealistically high. Remember: the mean batting average in the league for this season is 0.23, and the standard deviation is small (0.07). Batting averages of 1.0 or even 0.5 are highly improbable estimates of actual player ability.
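How easily can chance alone produce such averages? Here is a small sketch, assuming a hypothetical player whose true hit probability equals the league mean of 0.23 and whose at-bats are independent. It computes the binomial probability of observing a batting average of at least 0.5 in a given number of at-bats. The helper `prob_avg_at_least` and the at-bat counts are illustrative and are not taken from the data.

```python
from math import ceil, comb

def prob_avg_at_least(threshold: float, n_atbat: int, p: float) -> float:
    """P(hits / n_atbat >= threshold) when hits ~ Binomial(n_atbat, p)."""
    # Smallest hit count that reaches the threshold; the small epsilon
    # guards against floating-point overshoot in threshold * n_atbat.
    min_hits = ceil(threshold * n_atbat - 1e-9)
    return sum(
        comb(n_atbat, k) * p**k * (1 - p) ** (n_atbat - k)
        for k in range(min_hits, n_atbat + 1)
    )

# Assumption: a league-average hitter with true ability 0.23.
for n_atbat in (2, 10, 100):
    prob = prob_avg_at_least(0.5, n_atbat, 0.23)
    print(f'{n_atbat:>3} at-bats: P(batting average >= 0.5) = {prob:.2g}')
```

With only two at-bats, this average player shows a batting average of 0.5 or better about 41% of the time (one minus the chance of two misses, 0.77 squared). With 100 at-bats the same outcome is effectively impossible.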

Ranking players this way is analogous to sorting products by average rating on an online shopping site[3]. Which assessment would you consider more reliable:

one with a five-star average rating calculated from only one or two ratings,

or one with a 4.5 star rating calculated from 200 ratings?

Personally, I would be more likely to trust the assessment of the second product.

Given that our observations of the players are so uneven, is there a better way to estimate how good a batter each player really is?

We would like a method that handles players with very few observations in a reasonable way. If a player has been at bat only once, their observed batting average is either 1 or 0: either they look perfect, or they look terrible. Since they are most likely neither, we'd like to assume a reasonable estimate of their batting ability, one we can use while we are waiting for more data. And of course, we want a method where the...
