Show HN: AI Metrics, Visually

bignet | 2 pts | 0 comments

AI Metrics: A Learning Toybox

AI Metrics, visually!

The metrics you meet when finetuning a model, grouped by when you use them, made playful and interactive.

🗺️ The map. Finetuning has three moments, each with its own metrics:

① While training → loss & perplexity (is it learning? is it overfitting?)

② Judging labels → accuracy, precision, recall, F1 (did it classify right?)

③ Judging generated text → ROUGE, BLEU, BERTScore (did it write well?)

The sections below follow that order. Everything ties back to the fishing net.

The one mental model
🎣 Your model is a fishing net

Every metric here is just an anxious question someone asks about that net. Hold this picture and the rest follows.

🐟 🐟 🐟 🥾 🐟 🐟

Precision

"Everything I caught: is it actually fish, or did I also pull up boots?"

Of what I flagged positive, how much was right? Boots in the net (🥾) = false positives.

Recall

"All the fish in the lake: did I actually catch them, or did some swim through?"

Of what was really there, how much did I get? Escaped fish = false negatives.

Why they pull against each other: want perfect recall? Cover the whole lake with your net. You'll get every fish, but also every boot (so precision drops). Want perfect precision? Take only the one fish you're 100% sure of. It's definitely a fish, but you missed 999 others (so recall drops). F1, the harmonic mean of precision and recall, exists to stop both shortcuts: it is only high when both are high.
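To make the two shortcuts concrete, here is a minimal sketch that computes precision, recall and F1 from raw counts. The lake contents (1000 fish and 1000 boots) and the function name `precision_recall_f1` are made up for illustration; only the 999 escaped fish come from the text above.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts.

    tp: fish in the net, fp: boots in the net, fn: fish that escaped.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical lake: 1000 fish and 1000 boots (illustrative numbers).
# Shortcut 1: net over the whole lake, so every fish and every boot is caught.
whole_lake = precision_recall_f1(tp=1000, fp=1000, fn=0)
# Shortcut 2: keep only the one fish you are sure of; 999 fish escape.
one_sure_fish = precision_recall_f1(tp=1, fp=0, fn=999)

for name, (p, r, f1) in [("whole lake", whole_lake), ("one sure fish", one_sure_fish)]:
    print(f"{name:>14}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

In both cases one of the two numbers is perfect, yet F1 stays well below 1, which is the point of using the harmonic mean rather than a plain average.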

① While training · the overfitting alarm
📉 Loss, validation loss & perplexity

Before judging quality, you watch whether the model is learning at all. During finetuning your dashboard shows loss dropping over time. The single most important habit for a beginner: watch the validation curve, not the training curve.

Train loss: model fit to training data

Val loss: fit to unseen data, the real signal

Perplexity: exp(val loss)

What is perplexity?

perplexity = e^loss. Read it as: "how many equally-likely words is the model choosing between at each step?" Perplexity 1 = perfectly certain & correct. Perplexity 50 = as unsure as picking from 50 options. Lower is better.

How to spot overfitting

Train loss tends to keep dropping, because the model can memorize the training set. When val loss turns back upward while train loss still falls, the model is memorizing, not learning. That widening gap is your signal to stop early.
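One common way to act on that signal is patience-based early stopping. The sketch below is an illustration only: the loss curves are invented, and `best_epoch_with_patience` is a hypothetical helper, not part of any training library.

```python
def best_epoch_with_patience(val_losses: list[float], patience: int = 2) -> int:
    """Return the index of the epoch with the lowest val loss seen before stopping.

    Training counts as stopped once val loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch = 0
    best_loss = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch


# Made-up curves: train loss keeps falling, val loss turns upward after epoch 3.
train = [2.10, 1.60, 1.25, 1.00, 0.80, 0.62, 0.48]
val = [2.20, 1.80, 1.55, 1.48, 1.53, 1.61, 1.72]

stop_at = best_epoch_with_patience(val, patience=2)
print(f"keep checkpoint from epoch {stop_at}: "
      f"val={val[stop_at]:.2f}, train={train[stop_at]:.2f}")
```

The patience value trades off reacting to noise against wasting epochs; 2 is an arbitrary choice here.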

① While training · the headline number for language models
🎲 Perplexity & cross-entropy, in plain terms

🌀️ Think of a weather forecaster. Every day they predict tomorrow. A language model does the same, but predicts the next word.

Perplexity = how many options is the forecaster still guessing among?

• "100% sure it'll rain" → and it rains → only 1 option in play. Perfect. Perplexity = 1.

• "Could be sunny, rainy, cloudy, or snowy, no idea" → 4 options in play. Perplexity = 4.

• Guessing blindly → thousands of options in play. Perplexity = huge.

Fewer options = more confident and correct, so lower is better. The purple "doors" further down literally draw these options.
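One standard way to put a number on "options in play" is the perplexity of the forecast distribution itself, exp of its Shannon entropy. The sketch below uses that reading as an illustration; it scores the forecast alone, not the forecast against what actually happened, and the "leaning" forecast and the name `effective_options` are invented for the example.

```python
import math

def effective_options(forecast: dict[str, float]) -> float:
    """exp(Shannon entropy) of a forecast: its effective number of options."""
    entropy = -sum(p * math.log(p) for p in forecast.values() if p > 0)
    return math.exp(entropy)

certain = {"rain": 1.0}
no_idea = {"sunny": 0.25, "rainy": 0.25, "cloudy": 0.25, "snowy": 0.25}
leaning = {"sunny": 0.70, "rainy": 0.10, "cloudy": 0.10, "snowy": 0.10}  # made-up forecast

for name, forecast in [("certain", certain), ("no idea", no_idea), ("leaning", leaning)]:
    print(f"{name:>8}: {effective_options(forecast):.2f} options in play")
```

A forecast that leans toward one outcome lands between 1 and 4: it still lists four outcomes, but it is effectively choosing among fewer.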

😱 And cross-entropy loss? It's just a "surprise meter." The model bets a probability on each word; then the true word is revealed:

• Bet 90% on the right word → "I expected that" → tiny surprise.

• Bet 50% → "fair enough" → medium surprise.

• Bet 1% → "wait, WHAT?!" → huge surprise.

Cross-entropy loss = the model's average surprise across all the words. Low surprise = good. That's the entire idea; the rest is just the math for "surprise."
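As a rough sketch of "average surprise", the snippet below treats the three bets from the bullets as three consecutive words in one sequence (an assumption made only for illustration) and measures surprise as the negative natural log of the probability bet on the correct word.

```python
import math

# Probability the model bet on the word that turned out to be correct,
# one entry per position (illustrative values taken from the bullets).
p_correct = [0.90, 0.50, 0.01]

surprises = [-math.log(p) for p in p_correct]      # per-word surprise, in nats
cross_entropy = sum(surprises) / len(surprises)    # average surprise = the loss

for p, s in zip(p_correct, surprises):
    print(f"bet {p:>4.0%} on the right word -> surprise {s:.3f}")
print(f"cross-entropy loss = {cross_entropy:.3f}")
```

Note how the single 1% bet dominates the average: one badly missed word costs far more than a confident hit saves.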

Connection: perplexity = e^(loss). Loss is the raw training signal; perplexity is that same thing translated into a "number of choices" you can picture. Like °C vs an obscure unit: same temperature, one you can feel.

Now the same idea with real numbers. A language model's job: given the words so far, predict the next one, assigning a probability to every word in its vocabulary. Perplexity asks:

"On average, how many words is the model unsure between at each step?"

the cat sat on the → next word is mat

The true next word is mat. Let p = P("mat") be the probability the model gave it; the remaining probability is spread over other words.

Probability of correct word: p

Surprise (loss) = −ln(p)

Perplexity = e^surprise
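The two formulas above chain together, and for a single word they collapse to perplexity = 1/p. Here is a minimal sketch that sweeps a few values of P("mat"); the function name and the chosen probabilities are illustrative.

```python
import math

def surprise_and_perplexity(p_correct: float) -> tuple[float, float]:
    """For one prediction: surprise = -ln(p), perplexity = e^surprise."""
    if not 0.0 < p_correct <= 1.0:
        raise ValueError("p_correct must be in (0, 1]")
    surprise = -math.log(p_correct)
    return surprise, math.exp(surprise)

# Sweep the probability the model gave to the true word "mat".
for p in (1.0, 0.5, 0.25, 0.1, 0.01):
    s, ppl = surprise_and_perplexity(p)
    print(f'P("mat") = {p:<4} surprise = {s:.3f} perplexity = {ppl:.1f}')
```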

Why exponentiate the loss?

Training shows cross-entropy loss = average surprise, in abstract units (nats). perplexity = e^loss converts that surprise back into a tangible count of choices. Loss 0 → ppl 1. Loss 2.3 → ppl ≈ 10. Loss 4.6 → ppl ≈ 100. Same info, friendlier units.
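The conversions quoted above are easy to check by hand. This sketch assumes the loss is reported in nats (natural log), as the text states; the helper names are made up for the example.

```python
import math

def loss_to_perplexity(loss: float) -> float:
    """Convert a cross-entropy loss in nats to perplexity."""
    return math.exp(loss)

def perplexity_to_loss(ppl: float) -> float:
    """Inverse conversion: perplexity back to a loss in nats."""
    return math.log(ppl)

# Loss values quoted in the text.
for loss in (0.0, 2.3, 4.6):
    print(f"loss {loss:.1f} -> perplexity {loss_to_perplexity(loss):.1f}")

# Going the other way: which loss corresponds to a round perplexity?
for ppl in (10, 100):
    print(f"perplexity {ppl} -> loss {perplexity_to_loss(ppl):.3f}")
```

Because the mapping is exponential, a small drop in loss at high values corresponds to a large drop in perplexity.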

What's a "good" number?

It depends on vocabulary & task, so it's only meaningful relative to a baseline. A strong modern LLM on general English sits around perplexity 3–15. Random guessing across a 50k vocab ≈ 50,000. You use it to compare: did finetuning lower my perplexity on my domain's text?

One important limitation: perplexity only measures next-word prediction on text you already have. It says nothing about whether answers are helpful, true, or well-formatted; that's what the §③ generation...
