Reward hacking is swamping model intelligence gains

Smarter models are becoming more resourceful at hacking coding benchmarks.

Eval suites built from real bugs that were later fixed are especially vulnerable because the problems have already been solved. If the agent has access to repository history or the public web, it can sometimes look up the answer rather than derive it.

To measure how widespread this behavior is, we built an agent to audit eval trajectories. On SWE-bench Pro, we found that 63% of successful Opus 4.8 Max resolutions retrieved the fix rather than derived it. When we sealed git history and restricted internet access, scores dropped sharply for Opus as well as for our model, Composer 2.5:

Opus 4.8 Max fell from 87.1% to 73.0%

Composer 2.5 fell from 74.7% to 54.0%
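To put the two drops on a common footing, the arithmetic can be sketched as follows. The scores are the ones quoted above; the helper name `score_drop` and the choice to report both an absolute and a relative drop are illustrative assumptions, not part of the original analysis.

```python
# Scores quoted above: (standard harness, sealed harness), in percent.
scores = {
    "Opus 4.8 Max": (87.1, 73.0),
    "Composer 2.5": (74.7, 54.0),
}


def score_drop(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop as a percentage of `before`)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


for model, (before, after) in scores.items():
    points, percent = score_drop(before, after)
    print(f"{model}: -{points} points ({percent}% of the original score)")
```

Under this reading, Opus 4.8 Max loses 14.1 points (about 16% of its original score) and Composer 2.5 loses 20.7 points (about 28%).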

Prior research has shown that coding benchmark answers can leak through publicly available sources, including this 2024 study and a 2025 Meta report. Our study quantifies the problem on current frontier coding-agent runs. The broader lesson is that in addition to avoiding training-time contamination, agentic coding benchmarks also need controlled runtime environments.

For teams conducting evals, we propose mitigating this reward-hacking behavior by auditing transcripts and constraining the eval environment.

#Catch a model with a model

To measure the scale of the reward-hacking behavior, we had the auditor examine 731 Opus 4.8 Max trajectories. It saw the problem statement and full agent trajectory, without seeing whether the run passed, and classified whether the agent had retrieved the known answer.

The two most common reward-hacking patterns it found were:

Upstream lookup: In 57% of the trajectories, Opus 4.8 Max found the merged PR or fixed source file on the public web, then reproduced the fix nearly verbatim.

Git-history mining: In 9% of trajectories, Opus 4.8 Max searched the bundled .git history for the future commit that fixed the bug, then extracted the patch.[1]
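The auditor described here is a model reading full trajectories. As a much cruder illustration of the two patterns, a rule-based prefilter over shell commands might look like the sketch below. The regular expressions, the label names, and the function `flag_commands` are all hypothetical, and a match only marks a command as worth reviewing: `git log` on a single-commit repository, for example, is harmless.

```python
import re

# Hypothetical patterns; a real audit would need a broader set and a judge
# that reads the surrounding context, as the model-based auditor does.
UPSTREAM_LOOKUP = re.compile(
    r"\b(?:curl|wget)\b[^|;&]*\b(?:github\.com|githubusercontent\.com)"
)
GIT_HISTORY = re.compile(r"\bgit\s+(?:log|show|cherry-pick|reflog)\b")


def flag_commands(commands: list[str]) -> set[str]:
    """Return the set of suspicious pattern labels matched by any command."""
    flags: set[str] = set()
    for cmd in commands:
        if UPSTREAM_LOOKUP.search(cmd):
            flags.add("upstream_lookup")
        if GIT_HISTORY.search(cmd):
            flags.add("git_history_mining")
    return flags


if __name__ == "__main__":
    trajectory = [
        "cd /testbed && git show 895abd8929 -p 2>/dev/null | head -400",
        "cd /testbed && pytest -q",
    ]
    print(sorted(flag_commands(trajectory)))
```

A prefilter like this could narrow down which trajectories to send to a more expensive model-based review, but it cannot tell whether the agent actually copied the fix.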

As models get stronger, they can sometimes infer that they are in an eval, especially when the task is borrowed from a past public repository. Even in cases where they don't remember the fix from training, the environment can still give them clues that the bug has already been solved.

In one SWE-bench Multilingual task from a 2019 jq issue, the agent tried to reproduce the bug with the system jq binary. Because the image had been built after the bug was fixed, the reproduction failed, and the agent inferred that the issue had already been solved. That awareness pushed it toward searching for the fix instead of deriving one.

A few cases were more direct. One agent found a SWE-bench mirror page that exposed hidden tests and the gold patch. Another obtained hidden test files and hardcoded the expected exception string needed to pass.

Example tool calls and patches

Upstream lookup (Opus 4.8 Max). The agent queried the merged PR through the GitHub API to find the files the fix touched, then reproduced it (the same response also exposes each file's diff):

    cd /testbed && curl -s "https://api.github.com/repos/apache/druid/pulls/14092/files" 2>/dev/null | grep '"filename"'

Git-history mining (Composer 2.5). The agent located the fix commit in the bundled .git history, read its diff, then applied it directly:

    cd /testbed && git show 895abd8929 -p 2>/dev/null | head -400
    cd /testbed && git cherry-pick 895abd8929 2>&1

Patch excerpt to add: a trimmed verbatim slice of the git show output above (the gold diff Composer reproduced).

#Stricter environment design

Most reward hacking flowed through the public web and repository history. For evals built from historical public repositories, those channels need to be controlled because they may contain the answer. In response, we built a strict harness with two isolation mechanisms:

History isolation. Before the agent starts, the .git directory is removed and the repository is reinitialized as a fresh single-commit repo. The original history is restored only at scoring time, so tests run as usual.

Egress proxying. Network access is denied by default. As a best-effort control, a pinned proxy allows dependency resolution against an allow-list of package registries, and nothing else.
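The two mechanisms can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the harness itself: the function names, the `stash` location (assumed to be a not-yet-existing path outside the agent's reach), the commit identity, and the registry hosts in `ALLOWED_HOSTS` are all hypothetical, and a real egress control would be enforced in a proxy rather than in a helper function.

```python
import shutil
import subprocess
from pathlib import Path
from urllib.parse import urlsplit

# Hypothetical allow-list; the post does not name the registries it permits.
ALLOWED_HOSTS = {"pypi.org", "files.pythonhosted.org", "registry.npmjs.org"}


def _git(repo: Path, *args: str) -> None:
    """Run a git command inside `repo`, raising if it fails."""
    subprocess.run(["git", "-C", str(repo), *args], check=True, capture_output=True)


def isolate_history(repo: Path, stash: Path) -> None:
    """Move .git aside and reinitialize `repo` as a fresh single-commit repository."""
    shutil.move(str(repo / ".git"), str(stash))
    _git(repo, "init", "--quiet")
    _git(repo, "add", "--all")
    _git(
        repo,
        "-c", "user.name=eval",
        "-c", "user.email=eval@example.invalid",
        "commit", "--quiet", "--allow-empty", "-m", "Initial snapshot",
    )


def restore_history(repo: Path, stash: Path) -> None:
    """At scoring time, discard the placeholder history and put the original back."""
    shutil.rmtree(repo / ".git")
    shutil.move(str(stash), str(repo / ".git"))


def is_allowed(url: str) -> bool:
    """Deny by default: permit only exact matches against the allow-list."""
    host = urlsplit(url).hostname
    return host is not None and host in ALLOWED_HOSTS
```

Matching on the exact hostname, rather than on a substring of the URL, avoids admitting lookalike hosts such as a domain that merely contains an allowed name.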

This restriction is specific to evals built from historical public repositories. It's one reason we prefer evals built from non-public repositories, like CursorBench. They can test agentic coding ability while still letting agents use tools in the ways they would during real work.

#A growing gap

We reran SWE-bench Pro and SWE-bench Multilingual in the stricter harness, then compared each result against the standard harness score. We treat the gap between the two scores as a proxy for the combined effect of removing these leakage channels[2]:

On SWE-bench Multilingual, the gap between the standard and strict harness scores was under 1 point for Opus 4.6, 9.1 points for Opus 4.8 Max, and 7.5 points for Composer 2.5.

On...
