Heaven knows I'm perplexed now

Heaven Knows I'm Perplexed Now — idlemachines


Perplexity is one of the most commonly reported metrics for language models, and one of the most badly misused. Everybody knows you don't really compare losses across papers: there are so many uncontrolled differences between one setup and the next that the raw numbers don't mean much on their own. All a loss really tells you is that some numbers are bigger than others. And yet the same people will happily line up perplexity numbers from different papers and fight in the comments about them, which is an even stranger thing to do than it first looks. So in this short note we'll look at what perplexity is actually measuring, why those cross-paper comparisons mostly tell you nothing, and then one place where it's genuinely useful.

What is perplexity actually measuring?

The equation for perplexity is incredibly simple: it's just the exponential of the cross-entropy loss.

$$\text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^N \log p(x_i \mid y_i)\right)$$

To see what this really measures it helps to pull it apart line by line. Let's start with $p_i = p(x_i \mid y_i)$, the probability the model assigned to the observed token, then move the minus sign inside the log and use $-\log a = \log(1/a)$:

$$\text{Perplexity} = \exp\left(\frac{1}{N} \sum_{i=1}^N \log \frac{1}{p_i}\right)$$

A sum of logs is the log of a product, and a constant multiplying a log moves inside it as an exponent:

$$\text{Perplexity} = \exp\left(\log\left[\left( \prod_{i=1}^N \frac{1}{p_i} \right)^{1/N}\right]\right)$$

And $\exp$ undoes $\log$, which leaves:

$$\text{Perplexity} = \left( \prod_{i=1}^N \frac{1}{p_i} \right)^{1/N}$$

That is the geometric mean of the inverse probabilities, and there isn't much more to it than that. When the model assigns a probability of $0.1$ to a token, that token contributes a factor of $10$ to the product, and a probability of $0.01$ contributes a factor of $100$. The whole sequence multiplies together and the $N$-th root scales it back into the per-token range, so a perplexity of $K$ is what you would get if every observed token had been assigned a probability of exactly $1/K$.
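As a quick check on the derivation, here is a minimal sketch that computes perplexity both ways, once as the exponential of the mean negative log-probability and once as the geometric mean of the inverse probabilities. The function names and the four probabilities are made up for illustration and stand in for whatever a real model would assign to the observed tokens.

```python
import math

def perplexity_from_loss(probs):
    """exp of the mean negative log-probability (the cross-entropy form)."""
    mean_nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(mean_nll)

def perplexity_from_geomean(probs):
    """N-th root of the product of inverse probabilities."""
    product = math.prod(1.0 / p for p in probs)
    return product ** (1.0 / len(probs))

# Hypothetical probabilities assigned to four observed tokens.
probs = [0.1, 0.01, 0.5, 0.25]

from_loss = perplexity_from_loss(probs)
from_geomean = perplexity_from_geomean(probs)

# The inverse probabilities are 10, 100, 2 and 4, so the product is 8000
# and both forms should agree with the fourth root of 8000.
print(from_loss, from_geomean, 8000 ** 0.25)
```

In practice the log form is the one to use, since the raw product of inverse probabilities overflows quickly on long sequences.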

That last thing is the one key property to remember, because if we want to make it a bit more formal we can say perplexity is the effective branching factor. It is the size of a uniform vocabulary the model would be choosing between if it were equally uncertain about every token. A perplexity of $30$ means there are $30$ roughly equal possibilities (on average), and a perplexity of $30{,}000$ means there are $30{,}000$.
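To make the "on average" part concrete, the sketch below compares two made-up sequences of token probabilities. In one, every token gets exactly 1/30. In the other, the model alternates between being fairly confident (1/10) and fairly lost (1/90). The numbers are chosen for illustration only, so that the geometric mean of the inverse probabilities comes out the same in both.

```python
import math

def perplexity(probs):
    """exp of the mean negative log-probability of the observed tokens."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Every observed token assigned probability 1/30.
uniform = [1 / 30] * 8

# Alternating easy and hard tokens: sqrt(10 * 90) = 30.
mixed = [1 / 10, 1 / 90] * 4

ppl_uniform = perplexity(uniform)
ppl_mixed = perplexity(mixed)
print(ppl_uniform, ppl_mixed)  # both come out at roughly 30
```

Both sequences report the same branching factor, even though the second model never actually faces 30 equal options at any single step. That is the pinch of salt attached to the interpretation.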

The tokeniser problem

We just said perplexity is the branching factor, the number of things the model is choosing between at each step (huge pinch of salt taken, obviously). The cleanest way to see what the tokeniser does is not to train anything at all.

We'll start with a model straight out of initialisation. It has no reason to prefer one token over another, so it spreads the probability uniformly across the whole vocabulary. That makes its perplexity exactly the vocabulary size, whatever the passage of text we give it. (We'll come back to this.)

An untrained model's perplexity is just its vocabulary size, so it swings three orders of magnitude across tokenisers, having learned nothing

Characters or bytes give you a vocabulary in the low hundreds, so perplexity in the low hundreds. GPT-2's subwords push it to around 50,000. Chop the text into whole words and it runs past a quarter of a million. Same untrained model every time, it has learned precisely nothing, and on paper it still looks a thousand times better in characters than in words.
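The same point can be sketched in a few lines, with no model and no text at all. The assumption here is an idealised untrained model that assigns exactly 1/V to every token. The vocabulary sizes are illustrative: 256 for raw bytes, 50,257 for GPT-2's subword vocabulary, and 250,000 as a stand-in for "a quarter of a million" whole words.

```python
import math

def uniform_model_perplexity(vocab_size, num_tokens):
    """Perplexity of a model that assigns 1/vocab_size to every observed token."""
    log_p = -math.log(vocab_size)           # log-probability of each token
    mean_nll = -(num_tokens * log_p) / num_tokens
    return math.exp(mean_nll)

# Illustrative vocabulary sizes (assumptions, see the paragraph above).
vocab_sizes = {
    "bytes": 256,
    "gpt2_subwords": 50_257,
    "whole_words": 250_000,
}

for name, vocab_size in vocab_sizes.items():
    # The sequence length cancels out, so any passage gives the same answer.
    for num_tokens in (10, 1_000):
        ppl = uniform_model_perplexity(vocab_size, num_tokens)
        print(f"{name:>14} tokens={num_tokens:>5} perplexity={ppl:,.0f}")

ratio = uniform_model_perplexity(250_000, 10) / uniform_model_perplexity(256, 10)
print(f"words vs bytes: {ratio:,.0f}x")  # roughly a thousandfold gap
```

Nothing has been learned in any of these rows. The only thing that changes is how the text was chopped up.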

Yes, this is a bit of an extreme case. Real tokenisers span the 50k-500k range, and training pulls everything down from the initial conditions. But the same mechanism doesn't just switch off once the numbers are in a similar range. The closer two perplexities are to each other, the more tempting it is to compare them, but part of the gap is still just the vocabulary. Whatever you wouldn't compare as losses, you don't compare as perplexities either. The words you use should be your own.

What a fair comparison would require

A model trained on one tokeniser has spent its whole training predicting that tokeniser's chunks, so feed it another one's tokens and it's going to break.

So a real comparison means matching everything: same tokeniser, same vocabulary, same context length (predicting from 256 tokens of history is an easier game than from 8). Most published numbers miss at least one, usually in a way that just so happens to flatter the people reporting them. Nobody's lying exactly,...
