Show HN: AI Metrics, Visually

AI Metrics — A Learning Toybox

AI Metrics, visually!

The metrics you meet when finetuning a model — grouped by when you use them, made playful and interactive.

🗺️ The map. Finetuning has three moments, each with its own metrics:

① While training → loss & perplexity (is it learning? is it overfitting?)

② Judging labels → accuracy, precision, recall, F1 (did it classify right?)

③ Judging generated text → ROUGE, BLEU, BERTScore (did it write well?)

The sections below follow that order. Everything ties back to the fishing net.

The one mental model 🎣 Your model is a fishing net

Every metric here is just an anxious question someone asks about that net. Hold this picture and the rest follows.

🐟 🐟 🐟

🥾

🐟 🐟

Precision

"Everything I caught — is it actually fish, or did I also pull up boots?"

Of what I flagged positive, how much was right? Boots in the net (🥾) = false positives.

Recall

"All the fish in the lake — did I actually catch them, or did some swim through?"

Of what was really there, how much did I get? Escaped fish = false negatives.

Why they pull against each other: want perfect recall? Cover the whole lake with your net — you'll get every fish, but also every boot (so precision drops). Want perfect precision? Take only the one fish you're 100% sure of — definitely a fish, but you missed 999 others (so recall drops). F1 exists to stop both shortcuts.
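To make the two shortcuts concrete, here is a minimal sketch that computes precision, recall, and F1 from raw counts. The function name and the lake of 1,000 fish and 1,000 boots are illustrative assumptions (the boot count in particular is made up), not numbers from the interactive demo.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1) from raw counts.

    tp = fish in the net, fp = boots in the net, fn = fish that escaped.
    Empty denominators are reported as 0.0 (a common convention, assumed here).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical lake: 1,000 fish and 1,000 boots.
# Shortcut 1: net covers the whole lake, so every fish and every boot is caught.
wide_net = precision_recall_f1(tp=1000, fp=1000, fn=0)
# Shortcut 2: keep only the one fish you are sure of, missing the other 999.
one_fish = precision_recall_f1(tp=1, fp=0, fn=999)

print("wide net:", wide_net)
print("one fish:", one_fish)
```

Because F1 is a harmonic mean, it is dragged toward whichever of the two numbers is worse: the wide net scores well below its perfect recall, and the single sure fish scores close to zero despite perfect precision.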

① While training · the overfitting alarm 📉 Loss, validation loss & perplexity

Before judging quality, you watch whether the model is learning at all. During finetuning your dashboard shows loss dropping over time. The single most important habit for a beginner: watch the validation curve, not the training curve.

The interactive chart steps through training epochs and tracks three readouts:

Train loss: model fit to training data

Val loss: fit to unseen data, the real signal

Perplexity: exp(val loss)

What is perplexity?

perplexity = e^loss. Read it as: "how many equally-likely words is the model choosing between at each step?" Perplexity 1 = perfectly certain & correct. Perplexity 50 = as unsure as picking from 50 options. Lower is better.

How to spot overfitting

Train loss can keep dropping almost indefinitely, because the model can memorize the training set. When val loss turns back upward while train loss keeps falling, the model is memorizing, not learning. That widening gap is your signal to stop early.
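One common way to act on that signal is patience-based early stopping: keep the checkpoint with the lowest val loss, and stop once val loss has failed to improve for a few epochs in a row. The sketch below is a minimal version under that assumption; the function name, the `patience` default, and the loss curves are all made up for illustration.

```python
def best_epoch_with_patience(val_losses: list[float], patience: int = 2) -> int:
    """Return the index of the epoch with the lowest val loss seen before
    val loss fails to improve for `patience` consecutive epochs."""
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # val loss has turned upward: stop scanning
    return best_epoch


# Made-up curves: train loss keeps falling, val loss turns around after epoch 3.
train = [2.9, 2.3, 1.9, 1.6, 1.3, 1.1, 0.9]
val = [3.0, 2.5, 2.2, 2.1, 2.2, 2.4, 2.7]

stop_at = best_epoch_with_patience(val)
print("keep the checkpoint from epoch", stop_at)
```

Note that the train curve never enters the decision. It is only there to show that it would have happily kept going.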

① While training · the headline number for language models 🎲 Perplexity & cross-entropy, in plain terms

🌤️ Think of a weather forecaster. Every day they predict tomorrow. A language model does the same, but predicts the next word.

Perplexity = how many options is the forecaster still guessing among?

• "100% sure it'll rain" → and it rains → only 1 option in play. Perfect. Perplexity = 1.

• "Could be sunny, rainy, cloudy, or snowy — no idea" → 4 options in play. Perplexity = 4.

• Guessing blindly → thousands of options in play. Perplexity = huge.

Fewer options = more confident and correct, so lower is better. The purple "doors" further down literally draw these options.

😱 And cross-entropy loss? It's just a "surprise meter." The model bets a probability on each word; then the true word is revealed:

• Bet 90% on the right word → "I expected that" → tiny surprise.

• Bet 50% → "fair enough" → medium surprise.

• Bet 1% → "wait, WHAT?!" → huge surprise.

Cross-entropy loss = the model's average surprise across all the words. Low surprise = good. That's the entire idea — the rest is just the math for "surprise."
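The three bets above can be turned into numbers with a few lines. This is a sketch that assumes surprise is measured as the negative natural log of the probability the model gave the true word, and it treats the three bets as a tiny three-word "dataset" purely for illustration.

```python
import math


def surprise(p_true: float) -> float:
    """Surprise, in nats, when the true word had probability p_true."""
    return -math.log(p_true)


# Probability the model bet on the word that turned out to be correct.
bets = [0.90, 0.50, 0.01]

surprises = [surprise(p) for p in bets]
cross_entropy = sum(surprises) / len(surprises)  # average surprise

for p, s in zip(bets, surprises):
    print(f"bet {p:.2f} -> surprise {s:.3f}")
print(f"cross-entropy loss (average surprise): {cross_entropy:.3f}")
```

The single 1% bet contributes far more surprise than the other two combined, which is why a few confident mistakes can dominate the loss.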

Connection: perplexity = e^(loss) — loss is the raw training signal; perplexity is that same thing translated into "number of choices" you can picture. Like °C vs an obscure unit: same temperature, one you can feel.

Now the same idea with real numbers. A language model's job: given the words so far, predict the next one — a probability for every word in its vocabulary. Perplexity asks:

"On average, how many words is the model unsure between at each step?"

the cat sat on the → next word is mat

The true next word is mat. In the interactive version, a slider sets P("mat"), the probability p the model gave the correct word; the remaining probability is spread over other words, and a chart shows the model's guess distribution.

Two readouts follow from p:

Surprise (loss) = −ln(p)

Perplexity = e^surprise

Why exponentiate the loss?

Training shows cross-entropy loss = average surprise, in abstract units (nats). perplexity = e^loss converts that surprise back into a tangible count of choices. Loss 0 → ppl 1. Loss 2.3 → ppl ≈ 10. Loss 4.6 → ppl ≈ 100. Same info, friendlier units.
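The conversion is a one-liner in each direction. The sketch below assumes the loss is reported in nats (natural log), which is what the e^loss formula implies; if a tool reports loss in bits instead, the base would be 2 rather than e. The function names are illustrative.

```python
import math


def perplexity_from_loss(loss_nats: float) -> float:
    """Convert average cross-entropy (in nats) into perplexity."""
    return math.exp(loss_nats)


def loss_from_perplexity(perplexity: float) -> float:
    """Inverse conversion: perplexity back to average cross-entropy in nats."""
    return math.log(perplexity)


# The three loss values quoted above.
for loss in (0.0, 2.3, 4.6):
    print(f"loss {loss:.1f} -> perplexity ~ {perplexity_from_loss(loss):.1f}")
```

Going the other way is handy when a paper quotes perplexity but your dashboard shows loss: `loss_from_perplexity(50)` is a little under 4 nats.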

What's a "good" number?

It depends on vocabulary & task, so it's only meaningful relative to a baseline. A strong modern LLM on general English sits around perplexity 3–15. Uniform random guessing across a 50k vocabulary gives perplexity ≈ 50,000. You use it to compare: did finetuning lower my perplexity on my domain's text?
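Both halves of that answer can be checked with a short sketch: a model that spreads probability evenly over the vocabulary lands at a perplexity equal to the vocabulary size, and a finetuning run is judged by how far it moves perplexity relative to where it started. The before and after values below are invented for illustration, and the function names are hypothetical.

```python
import math


def uniform_perplexity(vocab_size: int) -> float:
    """Perplexity of a model that gives every word probability 1 / vocab_size."""
    loss = -math.log(1.0 / vocab_size)  # surprise is the same for every word
    return math.exp(loss)


def relative_drop(ppl_before: float, ppl_after: float) -> float:
    """Fraction by which perplexity fell (positive means improvement)."""
    return (ppl_before - ppl_after) / ppl_before


print("random-guess baseline:", round(uniform_perplexity(50_000)))

# Made-up numbers: perplexity on your domain's held-out text, before and after.
ppl_base, ppl_finetuned = 12.0, 9.0
print(f"finetuning lowered perplexity by {relative_drop(ppl_base, ppl_finetuned):.0%}")
```

The comparison only means something if both perplexities come from the same held-out text and the same tokenizer, since a different vocabulary changes what counts as one "choice".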

One important limitation: perplexity only measures next-word prediction on text you already have. It says nothing about whether answers are helpful, true, or well-formatted — that's what the §③ generation...
