What does your eval measure?

Benchmark creators should think about optimization pressure


What does your benchmark actually measure? Fighting bench-maxxing by asking: what maxxes your bench?

Shashwat Goel Jun 22, 2026

If you’re in the business of creating evals, you’re often told to make scores on your benchmark as low as possible at launch. For example, here are excerpts from How to Build Good Benchmarks by Ofir Press, one of the creators of SWE-Bench.

In this post, I argue that minimizing scores at launch is the wrong frame for creating a new benchmark, and one that hurts the whole benchmark ecosystem in the long term. Instead, the frame that helps me more when interpreting and building new benchmarks is thinking of all the possible ways one could maximize scores on my benchmark. Why? Because the community and models will rightfully try to. Eventually, they will find the simplest ways to improve scores on your benchmark. This has repeatedly turned out NOT to be the capability people thought the benchmark measured, which over time erodes trust in benchmarking as a whole.

What your benchmark actually measures is the simplest way to improve scores on it

It is common for benchmarks to be named after the capability they aspire to measure (e.g. SWE-Bench). Unfortunately, it is rare that 90% accuracy on Y-Bench means models succeed on 90% of Y tasks. This usually comes down to two properties of the underlying data:

1. It consists of tasks that are easy to collect and fast to verify automatically. This is often a narrow slice of the overall capability. I think this is a practical compromise, and it can be fixed by naming benchmarks more appropriately.

2. There are shortcuts, a term I use here for unintended strategies that improve scores without reflecting progress on the intended capability. These often stem from fundamental issues with the benchmark design, which makes them harder to mitigate.

I think benchmark developers can often identify both problems ahead of release by thinking about how models can be optimized for their benchmark. I find it useful to ask myself: If I trained a model on enough samples similar to my benchmark such that the resulting model scored 90% on the benchmark, what capabilities would emerge? More importantly, what would the resulting model not be able to do? Thinking about these questions should tell you what your benchmark is really measuring.

I will expand on the shortcuts problem, as it is much more subtle and thus more pervasive. Take multiple choice evaluations. Models can sometimes answer samples accurately without even being shown the question. This doesn’t even have to be because of memorization or contamination, a shortcut the field has already come a long way in solving. In MCQs, the incorrect choices are often added artificially, and the correct choice can sometimes be inferred as the odd one out.

Now, one could argue it is hard to know whether such shortcuts exist at the time of launch. This is where applying optimization pressure on your benchmark can help. For example, last year we showed that by finetuning a linear classifier on language model embeddings of just the choices, without the question, one can quantify (as a lower bound) the extent of choice-only shortcuts in popular MCQ benchmarks. Interestingly, you can correctly answer at least half of MMMU Pro, a “multimodal benchmark”, without the image! This finding was then corroborated across “multimodal” benchmarks. I think the shortcut problem will only get worse as careless use of AI in evaluation design increases.
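To make the choice-only probe idea concrete, here is a minimal sketch in plain Python. It is not the setup from the work mentioned above: instead of language model embeddings it uses two hand-made length features per choice, and instead of a real benchmark it uses synthetic items in which the correct answer is artificially longer 60% of the time. Every name in it (`featurize`, `train_choice_only_probe`, `make_synthetic_items`, `artifact_rate`) is a hypothetical helper introduced for illustration. The point is only the shape of the experiment: train a linear scorer that never sees the question, then compare its held-out accuracy to chance.

```python
import math
import random

def featurize(choices):
    """Toy stand-in for LM embeddings: per-choice features that never see the question."""
    lengths = [len(c.split()) for c in choices]
    mean_len = sum(lengths) / len(lengths)
    # Feature 1: length relative to the other choices.
    # Feature 2: absolute deviation from the mean (a crude odd-one-out signal).
    return [[n - mean_len, abs(n - mean_len)] for n in lengths]

def scores(weights, feats):
    """Linear score for each choice."""
    return [sum(w * x for w, x in zip(weights, f)) for f in feats]

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def train_choice_only_probe(items, epochs=20, lr=0.05):
    """Fit a linear scorer with SGD on the softmax log-likelihood over choices."""
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for choices, answer in items:
            feats = featurize(choices)
            probs = softmax(scores(weights, feats))
            for i, f in enumerate(feats):
                err = (1.0 if i == answer else 0.0) - probs[i]
                for k in range(len(weights)):
                    weights[k] += lr * err * f[k]
    return weights

def accuracy(weights, items):
    hits = 0
    for choices, answer in items:
        zs = scores(weights, featurize(choices))
        hits += zs.index(max(zs)) == answer
    return hits / len(items)

def make_synthetic_items(n, rng, artifact_rate=0.6):
    """Synthetic 4-way MCQs; in 'leaky' items the correct choice is longer (assumed artifact)."""
    items = []
    for _ in range(n):
        answer = rng.randrange(4)
        leaky = rng.random() < artifact_rate
        choices = []
        for i in range(4):
            if leaky and i == answer:
                n_words = rng.randint(7, 10)
            else:
                n_words = rng.randint(2, 6)
            choices.append(" ".join(["word"] * n_words))
        items.append((choices, answer))
    return items

rng = random.Random(0)
train_items = make_synthetic_items(400, rng)
test_items = make_synthetic_items(200, rng)
weights = train_choice_only_probe(train_items)
probe_acc = accuracy(weights, test_items)
chance = 0.25
print(f"choice-only accuracy: {probe_acc:.2f} (chance {chance:.2f})")
```

On a real benchmark you would swap `featurize` for embeddings of each choice and train on a split of the benchmark itself. Any gap between the probe’s held-out accuracy and chance is then a lower bound on how much of the score is reachable without reading the question.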

But not all hope is lost! The takeaway from the above example is: optimization pressure can reveal shortcuts in a benchmark, because it often finds the simplest path to improving scores. The GPQA paper adopts this strategy to remove problematic samples (see Appendix A.2), and is in my opinion a must-read for any eval creator who wants to understand the level of care needed to make a great benchmark.

A common pushback I receive when I worry too much about shortcuts in benchmarks is: “If models don’t currently exploit them, how does it matter?” This viewpoint makes sense if all you care about is model performance at launch. But what if your launch is a success, and people actually start using your benchmark as a measure of progress? Both humans and deep learning like to find the least-effort path to improving their objective, and this often ends up exploiting shortcuts. People start making inferences about which model is better at the intended capability, when one model could simply be exploiting the shortcuts more than the other. Even worse, they draw scientific conclusions from it about which methodologies work better. All these conclusions could be misled by the confounder of what is actually needed to improve on your benchmark. For example, ARC Is a Vision Problem showed how ARC-AGI 1 didn’t really need reasoning. Anthropomorphic Misalignment Research Needs Stronger Evidence provides a recent review...
