How to Build Your Own AI Benchmark (And Why It's Critical) — The End of Coding
Your team uses AI for code generation, refactoring, test writing, API design. Someone tries Claude, someone else tries the latest OpenAI model. One engineer says "it feels better," another disagrees. A few weeks later, the vendor pushes an update and suddenly the model is slower or produces lower quality code. You don't know when it happened. You just know your workflows feel different.
This is the state of most teams: no benchmark, no baseline, no way to know if a model is actually good for your work or if you're just getting lucky.
There's another problem: vendor benchmarks complicate the picture. OpenAI announces improvements on MMLU. Anthropic claims better performance on GSM8K. But in your shop, the newer model costs more and feels slower—or fails on problems the old one solved. Either the benchmarks measure something your team doesn't care about, or the numbers can't be trusted.
Here's why public benchmarks fail. MMLU, HumanEval, and GSM8K are published on arXiv, discussed on Reddit, and quoted in blog posts. Models trained on internet-scale data have very likely seen these test items during training. Research on benchmark data contamination shows that when test questions appear in training data, whether deliberately or accidentally, models can memorize the answers rather than learn the underlying concept. When a vendor reports "86.8% on MMLU," you have no way to know whether the model learned the concept or memorized the test.
This creates a perverse incentive. Once benchmarks become public targets, vendors optimize for them. They can privately test dozens of model versions and publish only the best results, or cherry-pick favorable evaluation conditions. The score improves. Real capability doesn't.
"Headline scores often measure how well a model gamed the test harness rather than how well it solved the underlying tasks."
— Kili Technology, Custom AI Benchmark Guide
Analysis of the Chatbot Arena leaderboard found that companies could boost reported scores by up to 112% through selective disclosure of privately-tested variants, what researchers call "reward hacking." The same study showed that 27 private LLM variants were tested by Meta before the Llama-4 release, with only the best results published. Kili Technology's research confirmed that organizations overestimate their models' real performance by 30% or more when they optimize for published benchmarks without measuring actual work.
There's no need to speculate: frontier model vendors are in a race, burning billions of dollars to lead the AI space. The incentives are obvious. When you see their self-published benchmark scores improve, would you blindly trust them? Or would you want to verify the improvements on your own work?
The answer is simple: create a test for your actual work, run it on different models multiple times, and compare the averages. Then use your judgment about how much confidence you need before deciding. This mirrors what OpenAI, Anthropic, and other serious teams do: they run benchmarks on their own problems, report the percentage, and move on. You can do the same with real problems from your codebase.
What Is a Benchmark? (It's Just a Score)
A benchmark is simple: you have a set of problems, you run models on them multiple times using independent sessions, count how many the model solves each time, and average the results. That's your score.
Example:
You have 50 refactoring tasks from your codebase
You run Model A twice (two independent sessions) on all 50 problems
Run 1: 42 pass your tests → 84%
Run 2: 41 pass your tests → 82%
Average score: 83%
Compare across models, for example:
Model A: 83% (avg of 2 runs)
Model B: 78% (avg of 2 runs)
Model A scores higher. A five-point gap over two runs is suggestive rather than conclusive, so verify with more runs if this is a significant decision (see "Handling Variance" below).
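The arithmetic in the example above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the function names `pass_rate` and `benchmark_score` are hypothetical, and the pass/fail lists are made-up data constructed only to match the example numbers (50 tasks, 42 and 41 passes).

```python
from statistics import mean


def pass_rate(results: list[bool]) -> float:
    """Percentage of tasks that passed in a single run."""
    return 100.0 * sum(results) / len(results)


def benchmark_score(runs: list[list[bool]]) -> float:
    """Average the per-run pass rates across independent runs."""
    return mean(pass_rate(run) for run in runs)


# Hypothetical pass/fail outcomes mirroring the example: 50 tasks, two runs.
run_1 = [True] * 42 + [False] * 8   # 42 of 50 pass
run_2 = [True] * 41 + [False] * 9   # 41 of 50 pass

model_a_score = benchmark_score([run_1, run_2])
print(f"Run 1: {pass_rate(run_1):.0f}%")
print(f"Run 2: {pass_rate(run_2):.0f}%")
print(f"Model A average: {model_a_score:.0f}%")
```

Keeping the per-run results, not just the average, makes it easy to see how far individual runs drift from each other before comparing models.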
For example, Anthropic publishes Claude Opus 4.7 at 92.8% on MMLU, and OpenAI reports GPT-5.5 at 92.4% on MMLU. They run benchmarks, report the percentage, and use the score to guide decisions. Your internal benchmarks follow the same methodology, but applied to your actual problems rather than public test sets. (These are examples only; your benchmark will measure real work unique to your codebase.)
The benchmark doesn't have to be fancy. It just has to be:
Real problems (from your codebase, not made up)
Measurable (you have a way to check pass/fail)
Verifiable (you can run it again and get results consistent enough to be useful)
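One way to picture these three properties is as a small task record: a real prompt, a pass/fail check, and a runner you can call again later. The sketch below is an assumption-heavy illustration. `BenchmarkTask`, `defines_function`, `run_once`, and `stub_model` are all hypothetical names, the stub stands in for a real model call, and the check (does the output define a function with the expected name?) is a deliberately simple placeholder for running your actual test suite.

```python
import ast
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BenchmarkTask:
    task_id: str                   # reference to a real problem from your codebase
    prompt: str                    # instruction given to the model
    check: Callable[[str], bool]   # measurable: True if the output passes


def defines_function(name: str) -> Callable[[str], bool]:
    """Placeholder check: passes if the output parses and defines `name`."""
    def check(output: str) -> bool:
        try:
            tree = ast.parse(output)
        except SyntaxError:
            return False
        return any(
            isinstance(node, ast.FunctionDef) and node.name == name
            for node in ast.walk(tree)
        )
    return check


def run_once(tasks: list[BenchmarkTask],
             generate: Callable[[str], str]) -> dict[str, bool]:
    """Verifiable: run every task once and record pass/fail per task."""
    results = {}
    for task in tasks:
        try:
            results[task.task_id] = bool(task.check(generate(task.prompt)))
        except Exception:
            results[task.task_id] = False  # a crash counts as a failure
    return results


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same code."""
    return "def slugify(text):\n    return text.lower().replace(' ', '-')\n"


tasks = [
    BenchmarkTask("task-001", "Write slugify(text).", defines_function("slugify")),
    BenchmarkTask("task-002", "Write parse_date(s).", defines_function("parse_date")),
]
results = run_once(tasks, stub_model)
print(results)
```

In practice the `check` callable would run your existing unit tests against the generated code. The point of the structure is that every task carries its own pass/fail criterion, so a rerun needs no human judgment.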
How to Build a Benchmark: 4 Steps
Now that you understand what you're protecting yourself against, here's how to implement it:
Find real problems you've...