LLM benchmarks are answering someone else's question


Fight Evils with Evals!


Benchmarks measure benchmarks. Your system needs its own measures.

This post is also available in العربية (ar), Deutsch (de), Español (es), Français (fr), עברית (he), हिन्दी (hi), Italiano (it), 日本語 (ja), Русский (ru), and 中文 (zh).

Created about 1 month ago, updated 28 days ago.


Every new model arrives wearing a tuxedo of benchmarks.

MMLU: 92.4%. HumanEval: 87.2%. LLeMU: 88.7%. MATH: 73.6%. AGI: 127%!

Yet for 99% of businesses building processes and products with AI, none of it matters.

What matters? How are YOUR workloads doing? Getting better or worse? The only sane way to know that is to write Evals (tests for LLMs) that reflect the specific tasks, data, and failure modes of your system.

The benchmarks are not lying. They are answering someone else’s question.

What “Vibes-Based Evaluation” Actually Costs

The standard approach: ship a model change, watch the complaint channels, roll back if the room gets loud.

That misses almost everything interesting:

You only catch loud failures. Users who get a confidently wrong answer and don’t realize it? Silent. Users who get a worse answer and abandon the feature? Silent. Support tickets and error rates capture only a fraction of quality regression.

You can’t distinguish regressions from improvements. If the new model is better at task A and worse at task B, complaints about B look identical to generic “the AI got worse” feedback. You don’t know what to fix.

You’re using your users as test infrastructure. They didn’t sign up for that.
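To make the "better at task A, worse at task B" point concrete, here is a minimal sketch that breaks eval results down per task. The task names and results are made up for illustration, and `pass_rate_by_task` and `compare_runs` are hypothetical helpers, not part of any library.

```python
from collections import defaultdict


def pass_rate_by_task(results):
    """Map each task name to its pass rate, given (task_name, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for task, passed in results:
        totals[task] += 1
        passes[task] += int(passed)
    return {task: passes[task] / totals[task] for task in totals}


def compare_runs(baseline, candidate):
    """Per-task change in pass rate (candidate minus baseline)."""
    before = pass_rate_by_task(baseline)
    after = pass_rate_by_task(candidate)
    return {
        task: round(after[task] - before[task], 3)
        for task in before
        if task in after
    }


# Hypothetical results for the same four eval cases under two models.
baseline = [("summarize", True), ("summarize", False),
            ("extract", True), ("extract", True)]
candidate = [("summarize", True), ("summarize", True),
             ("extract", True), ("extract", False)]

deltas = compare_runs(baseline, candidate)
for task, delta in sorted(deltas.items()):
    print(f"{task}: {delta:+.1%}")
```

In this made-up data both runs pass 3 of 4 cases overall, so an aggregate score shows no change while one task improves and the other regresses. That is the kind of split a complaint channel cannot show you.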

The Eval Spectrum (and Where Most Teams Get It Wrong)

Evaluation approaches sit on a spectrum from “fast but flimsy” to “expensive but valid.”

Use the cheapest evaluation method that can honestly catch the failure.

LLM-as-judge is the current darling: ask a powerful model to grade another model’s outputs. Fast, scalable, cheap. The problem: it bakes in the grader model’s biases, can be gamed, and creates a circular dependency. If you use GPT-5 to grade GPT-5’s outputs, you’re measuring something like “how much does GPT-5 agree with GPT-5.” That’s not nothing, but it’s not what you think.

Human eval is the gold standard everyone tries to skip. Getting humans to evaluate outputs is expensive, slow, inconsistent across evaluators, and annoying to schedule. But it is the only thing that validates whether your system is useful to real humans.

Task-specific automated checks are where most teams should spend more time. They are not glamorous, but they are fast, deterministic, and tied to what matters in your system.

What Actually Works

1. Define Failure Before You Ship

Before changing a model or prompt, write down what bad looks like. Specifically.

Not “the output should be accurate.” That’s not a test. More like:

Structured JSON output must parse without errors

All citations in the response must appear verbatim in the retrieved context

Responses must not mention competitor product names

SQL queries must be syntactically valid and reference only tables that exist in the schema

Sentiment classification must not flip from positive to negative more than 3% of the time on the existing test set
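The last criterion is a rate rather than a pass/fail check, so it needs a baseline to compare against. A rough sketch, assuming you have stored the old model's labels for the existing test set and that "3% of the time" means 3% of the cases that were previously positive (the source does not say which denominator it intends). `flip_rate` and the sample labels are hypothetical.

```python
MAX_FLIP_RATE = 0.03  # threshold taken from the criterion above


def flip_rate(old_labels, new_labels, src="positive", dst="negative"):
    """Share of cases the old model labeled `src` that the new model labels `dst`."""
    if len(old_labels) != len(new_labels):
        raise ValueError("label lists must cover the same test cases")
    src_total = sum(1 for old in old_labels if old == src)
    if src_total == 0:
        return 0.0  # nothing was positive before, so nothing could flip
    flips = sum(
        1 for old, new in zip(old_labels, new_labels)
        if old == src and new == dst
    )
    return flips / src_total


# Made-up labels: 100 previously positive cases, 2 of which now come back negative.
old = ["positive"] * 100 + ["negative"] * 100
new = ["negative"] * 2 + ["positive"] * 98 + ["negative"] * 100

rate = flip_rate(old, new)
print(f"flip rate: {rate:.1%}, within limit: {rate <= MAX_FLIP_RATE}")
```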

You can check these programmatically. No judge model...
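As a sketch of what those programmatic checks could look like, here are four of the criteria above as plain functions using only the Python standard library. The function names, the sample schema, and the competitor name are all hypothetical, and the SQL check assumes your queries are compatible with SQLite's dialect, which may not hold for your database.

```python
import json
import re
import sqlite3


def json_parses(output: str) -> bool:
    """True if the model output is valid JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def citations_are_verbatim(citations, context: str) -> bool:
    """True if every cited quote appears character-for-character in the context."""
    return all(quote in context for quote in citations)


def mentions_competitor(response: str, competitors) -> bool:
    """True if any competitor name appears as a whole word (case-insensitive)."""
    return any(
        re.search(rf"\b{re.escape(name)}\b", response, re.IGNORECASE)
        for name in competitors
    )


def sql_is_valid(query: str, schema_ddl: str) -> bool:
    """True if SQLite can compile the query against empty tables built from the schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        # EXPLAIN compiles the statement, so syntax errors and unknown
        # tables raise an error without the query itself being run.
        conn.execute(f"EXPLAIN {query}")
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# Hypothetical fixtures for illustration.
schema = "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);"
context = "Refunds are processed within 5 business days."

print(json_parses('{"ok": true}'))
print(citations_are_verbatim(["within 5 business days"], context))
print(mentions_competitor("Try AcmeCloud instead", ["AcmeCloud"]))
print(sql_is_valid("SELECT * FROM orders", schema))
```

Each function returns a plain boolean, so they slot into an ordinary test runner and can gate a model or prompt change in CI the same way unit tests gate a code change.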

