AI Metrics: A Learning Toybox
AI Metrics, visually!
The metrics you meet when finetuning a model, grouped by when you use them, made playful and interactive.
🗺️ The map. Finetuning has three moments, each with its own metrics:
① While training: loss & perplexity (is it learning? is it overfitting?)
② Judging labels: accuracy, precision, recall, F1 (did it classify right?)
③ Judging generated text: ROUGE, BLEU, BERTScore (did it write well?)
The sections below follow that order. Everything ties back to the fishing net.
The one mental model
🎣 Your model is a fishing net
Every metric here is just an anxious question someone asks about that net. Hold this picture and the rest follows.
🐟 🐟 🐟 🥾 🐟 🐟
Precision
"Everything I caught: is it actually fish, or did I also pull up boots?"
Of what I flagged positive, how much was right? Boots in the net (🥾) = false positives.
Recall
"All the fish in the lake: did I actually catch them, or did some swim through?"
Of what was really there, how much did I get? Escaped fish = false negatives.
Why they pull against each other: want perfect recall? Cover the whole lake with your net: you'll get every fish, but also every boot (so precision drops). Want perfect precision? Take only the one fish you're 100% sure of: definitely a fish, but you missed 999 others (so recall drops). F1 exists to stop both shortcuts: it is the harmonic mean of precision and recall, so it stays low unless both are high.
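To make the two shortcuts concrete, here is a minimal sketch that computes precision, recall and F1 from raw counts. The function name and the count of 4,000 boots are made up for illustration; the 1,000 fish come from the "one sure fish, 999 missed" example above.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f1) from raw counts; 0.0 when a ratio is undefined."""
    caught = true_positives + false_positives    # everything in the net
    in_lake = true_positives + false_negatives   # every fish that really existed
    precision = true_positives / caught if caught else 0.0
    recall = true_positives / in_lake if in_lake else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Shortcut 1: net over the whole lake. All 1000 fish, plus 4000 boots (made-up count).
wide_net = precision_recall_f1(1000, 4000, 0)

# Shortcut 2: keep only the one fish you are sure of. No boots, 999 fish escaped.
one_fish = precision_recall_f1(1, 0, 999)

print(wide_net)  # recall is perfect, precision and F1 are low
print(one_fish)  # precision is perfect, recall and F1 are tiny
```

Both shortcuts score a perfect 1.0 on one metric and still end up with a poor F1, which is the point of combining them.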
① While training · the overfitting alarm
📉 Loss, validation loss & perplexity
Before judging quality, you watch whether the model is learning at all. During finetuning your dashboard shows loss dropping over time. The single most important habit for a beginner: watch the validation curve, not the training curve.
Three readouts to follow across training epochs:
Train loss: model fit to training data
Val loss: fit to unseen data, the real signal
Perplexity: exp(val loss)
What is perplexity?
perplexity = e^loss. Read it as: "how many equally-likely words is the model choosing between at each step?" Perplexity 1 = perfectly certain & correct. Perplexity 50 = as unsure as picking from 50 options. Lower is better.
How to spot overfitting
Train loss usually keeps dropping, because the model can memorize. When val loss turns back upward while train loss still falls, it's memorizing, not learning. That gap is your signal to stop early.
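One common way to act on that signal is early stopping with a "patience" counter. The sketch below is only an illustration under assumptions: the function name, the patience value and the loss curve are all invented, and real training frameworks ship their own versions of this logic.

```python
def best_epoch_with_patience(val_losses, patience=2):
    """Return the index of the best epoch, stopping the scan once val loss
    has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_loss = 0, float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # val loss has turned upward: stop here
    return best_epoch

# Made-up validation curve: improves until epoch 3, then turns upward.
val_losses = [2.1, 1.7, 1.4, 1.3, 1.35, 1.5, 1.7]
keep = best_epoch_with_patience(val_losses)
print(f"keep the checkpoint from epoch {keep}")
```

The patience counter exists because val loss is noisy; a single bad epoch is not necessarily the start of overfitting.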
① While training · the headline number for language models
🎲 Perplexity & cross-entropy, in plain terms
🌤️ Think of a weather forecaster. Every day they predict tomorrow. A language model does the same, but predicts the next word.
Perplexity = how many options is the forecaster still guessing among?
• "100% sure it'll rain" (and it rains) → only 1 option in play. Perfect. Perplexity = 1.
• "Could be sunny, rainy, cloudy, or snowy, no idea" → 4 options in play. Perplexity = 4.
• Guessing blindly → thousands of options in play. Perplexity = huge.
Fewer options = more confident and correct, so lower is better. The purple "doors" further down literally draw these options.
😱 And cross-entropy loss? It's just a "surprise meter." The model bets a probability on each word; then the true word is revealed:
• Bet 90% on the right word → "I expected that" → tiny surprise.
• Bet 50% → "fair enough" → medium surprise.
• Bet 1% → "wait, WHAT?!" → huge surprise.
Cross-entropy loss = the model's average surprise across all the words. Low surprise = good. That's the entire idea; the rest is just the math for "surprise."
Connection: perplexity = e^(loss). Loss is the raw training signal; perplexity is that same thing translated into a "number of choices" you can picture. Like °C vs an obscure unit: same temperature, one you can feel.
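The three bets above can be turned into numbers. This is a small sketch, assuming a toy "sentence" of just three words on which the model bet 90%, 50% and 1%; the function name is made up for illustration.

```python
import math

def cross_entropy(probs_on_true_words):
    """Average surprise, -ln(p), over the probabilities bet on each true word."""
    surprises = [-math.log(p) for p in probs_on_true_words]
    return sum(surprises) / len(surprises)

bets = [0.90, 0.50, 0.01]      # probability the model gave each true word
loss = cross_entropy(bets)     # average surprise, in nats
perplexity = math.exp(loss)    # same information as a "number of choices"

for p in bets:
    print(f"bet {p:.0%} -> surprise {-math.log(p):.2f}")
print(f"loss {loss:.2f}, perplexity {perplexity:.2f}")
```

The single 1% bet dominates the average: the loss comes out near 1.8 and the perplexity near 6, even though two of the three bets were decent.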
Now the same idea with real numbers. A language model's job: given the words so far, predict the next one, as a probability for every word in its vocabulary. Perplexity asks:
"On average, how many words is the model unsure between at each step?"
the cat sat on the → next word is mat
The true next word is mat. Let p = P("mat") be the probability the model gave it; the remaining probability, 1 − p, is spread over other words.
Probability of correct word = p
Surprise (loss) = −ln(p)
Perplexity = e^surprise
Why exponentiate the loss?
Training shows cross-entropy loss = average surprise, in abstract units (nats). perplexity = e^loss converts that surprise back into a tangible count of choices. Loss 0 → ppl 1. Loss 2.3 → ppl ≈ 10. Loss 4.6 → ppl ≈ 100. Same info, friendlier units.
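The conversion runs in both directions, which is handy when one dashboard reports loss and another reports perplexity. A minimal sketch, assuming the loss is measured in nats as stated above (the helper names are made up):

```python
import math

def loss_to_perplexity(loss_nats):
    """Turn an average surprise in nats into a count of choices."""
    return math.exp(loss_nats)

def perplexity_to_loss(perplexity):
    """Inverse direction: a count of choices back to nats."""
    return math.log(perplexity)

for loss in (0.0, 2.3, 4.6):
    print(f"loss {loss:.1f} -> perplexity {loss_to_perplexity(loss):.1f}")
# loss 0.0 -> perplexity 1.0
# loss 2.3 -> perplexity 10.0
# loss 4.6 -> perplexity 99.5
```

Because the mapping is exponential, a small drop in loss late in training can still mean a noticeable drop in perplexity.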
What's a "good" number?
It depends on vocabulary & task, so it's only meaningful relative to a baseline. A strong modern LLM on general English sits around perplexity 3 to 15. Random guessing across a 50k vocab → 50,000. You use it to compare: did finetuning lower my perplexity on my domain's text?
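Both ends of that comparison can be sketched in a few lines. The random-guess baseline follows directly from the definition; the before/after perplexities of 12.0 and 9.0 are hypothetical numbers chosen only to show the comparison, and the function names are made up.

```python
import math

def uniform_perplexity(vocab_size):
    """Perplexity when every word gets the same probability 1/vocab_size."""
    loss = -math.log(1.0 / vocab_size)  # identical surprise on every word
    return math.exp(loss)

def relative_improvement(baseline_ppl, finetuned_ppl):
    """Fractional drop in perplexity after finetuning (positive = better)."""
    return (baseline_ppl - finetuned_ppl) / baseline_ppl

random_guess = uniform_perplexity(50_000)
drop = relative_improvement(12.0, 9.0)  # hypothetical before/after values

print(f"random guessing over 50k words: perplexity {random_guess:,.0f}")
print(f"finetuning lowered perplexity by {drop:.0%}")
```

Measure both perplexities on the same held-out text from your domain, with the same tokenizer, or the comparison is not meaningful.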
One important limitation: perplexity only measures next-word prediction on text you already have. It says nothing about whether answers are helpful, true, or well-formatted; that's what the §③ generation...