Finding the Best Dog Treat with Statistics
Finding the Best Dog Treat with Statistics
A Greyhound, five treats, and a Bradley-Terry model.
Posted on June 19, 2026
by Adam Wespiser
Bebop, my 83lb, 33 inch tall, Greyhound, loves three things: running fast, following me around the house, and treats.<br>Whether it’s a chew treat, pizza out of a child’s hand who strayed too far from a party, or a small tray of cat food, he has a nose for what he likes and the athleticism to give him a fair shot at getting it.<br>I’ve watched him eat for years, so it was upsetting to realize I don’t know what his favorite snack is, and can’t easily ask him.
Fortunately for Bebop’s palate, the Bradley-Terry model gives us a way to figure out a “strength” of treat from pairwise comparisons.<br>The model assigns each competitor (or treat) (i) a positive strength score pi.
Given two competitors i and j, the probability that i beats j is:
Pr(i > j) = pi/(pi + pj)
Equivalently, if we write each strength as an exponential score,
pi = eβi
then the same probability can be written as:
Pr (i > j) = eβi/(eβi + eβj)
So the model is saying: the difference between two competitors’ latent strengths determines the log-odds that one beats the other.
The Elo rating system used in chess is closely related.<br>If Ri and Rj are Elo ratings, then:
Pr (i > j) = (10Ri/400)/(10Ri/400 + 10Rj/400)
However, modern Elo ratings are calculated incrementally to avoid expensive recompute cycles and allow scores to be updated after each match.<br>After the game, (A)’s rating is updated by comparing the actual result to the expected result:
RA′ = RA + K(SA − EA)
where SA is the actual score: (1) for a win, (0.5) for a draw, and (0) for a loss.<br>The constant K controls how much ratings move after each game.
So if a player wins a game they were expected to win, their rating only moves slightly.<br>If they win a game they were expected to lose, their rating moves a lot.<br>In this sense, Elo can be thought of as an online version of the Bradley-Terry idea: after each result, move the ratings in the direction of the prediction error.<br>Elo makes sense for systems like chess because games arrive continuously and ratings need to update immediately.<br>In this experiment, the dataset is small enough that we can simply fit the Bradley-Terry model directly after collecting the trials.
You might also recognize a related model from The Social Network movie, where global ranking from pairwise comparisons powered FaceSmash, an early social media experiment by Mark Zuckerberg.1<br>A third application is Chatbot Arena, which uses Bradley-Terry style rankings for model performance.2<br>Bradley-Terry is the solution you reach for when you want a global ranking but only have head-to-head comparisons.
Experiment
For the experiment, the setup is straightforward: we can take a set of treats, label them, and run a series of pairwise comparisons to discover which treat is best!<br>Prior to the experiment, I trained a “choice” command.<br>The same time every day, around 11pm, I go to the kitchen, select two different treats, say the word “choice”, and present the treats in either hand, allowing Bebop to only take one, with the other going back into the bag.<br>By the time the experiment started, Bebop was used to the routine and sniffing both treats before taking one.
For the selection of treats, I used a combination of treats we have a history with, like Greenies, and searched Amazon for a variety of treats in different formats.<br>Each of these treats is a slightly different size, but I decided to ignore the differences for the sake of simplicity.<br>This could introduce size bias in the results, however, the experiment is run about 2 hours after dinner so he should be full, and makes the results consistent with how I will give him treats post-experiment.<br>In other words, I’m not interested in an experiment that requires me to cut and weigh dog treats.
The treats selected are as follows:
Treat A is MON2SUN, duck + rawhide. Amazon link
Treat B is Greenies, large-size. Amazon link
Treat C is Pork Chomps, the red ones. Amazon link
Treat D is MON2SUN, chicken + rawhide. Amazon link
Treat E is Pur Luv Chicken, dehydrated chicken. Amazon link
Data
For the pairings, I created a daily schedule with two head to head comparisons.<br>Full source on github
C/B :: B<br>E/B :: E<br>In this example, we have two head to head match-ups, which is one day of trials.<br>The first has Treat C in the left hand, Treat B in the right, and the winner is B.<br>For the second, E is left hand, B is right, and the winner is E.<br>To estimate how settled the result was, I ran a bootstrap experiment: repeatedly resampling the trials, fitting Bradley-Terry models to those samples, and recording how often each treat came out on top.<br>Github source code
About halfway through the experiment, I realized that treats C & B, the “Pork Chomps” and Greenies, were reliably losing.<br>Because these were out of the running, I marked any planned trial with C or B...