GPT-5.6 cheats so much its testers couldn't measure it



OpenAI’s new model broke rules and exploited loopholes more than any model METR has tested to date

Celia Ford, Jun 30, 2026

Image: OpenAI

GPT-5.6 Sol, OpenAI’s newest, most capable, and yet-to-be-deployed model, cheats a lot: so much so that independent evaluators couldn’t actually tell how capable it is. When independent evaluation non-profit METR tested 5.6 Sol on a battery of coding tasks, the model broke the rules or exploited loopholes more than “any public model we have evaluated,” METR said.

For the past several model release cycles, “the METR graph” (a plot tracking the alarming increase in AI’s ability to complete long tasks) has been Exhibit A for the exponential increase in AI capabilities. METR challenged the model with over 100 coding tasks that take humans anywhere from a few minutes to an entire day, and measured how consistently it finished each one. It then calculated the task length a model can complete 50% of the time, known as its “50% time horizon point.”
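To make the “50% time horizon” idea concrete, here is a minimal sketch of one way such a point can be estimated. It assumes a logistic curve fitted to success rate against the logarithm of human task length, which the article itself does not spell out. The `TRIALS` data and the names `fit_logistic` and `time_horizon` are hypothetical and are not METR’s code or results.

```python
import math

# Hypothetical data, NOT from METR:
# (human task length in minutes, model successes, model attempts).
TRIALS = [
    (2, 10, 10), (4, 9, 10), (8, 9, 10), (16, 8, 10), (32, 7, 10),
    (64, 5, 10),
    (128, 3, 10), (256, 2, 10), (512, 1, 10), (1024, 1, 10), (2048, 0, 10),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(trials, steps=25):
    """Fit P(success) = sigmoid(a + b * log2(minutes)) with Newton's method."""
    a = b = 0.0
    for _ in range(steps):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for minutes, wins, attempts in trials:
            x = math.log2(minutes)
            p = sigmoid(a + b * x)
            resid = wins - attempts * p        # observed minus expected successes
            w = attempts * p * (1.0 - p)       # curvature weight for this task
            g_a += resid
            g_b += resid * x
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Newton update: solve the 2x2 system H * delta = gradient.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


def time_horizon(a, b, p=0.5):
    """Task length (minutes) at which the fitted success probability equals p."""
    logit = math.log(p / (1.0 - p))
    return 2 ** ((logit - a) / b)


a, b = fit_logistic(TRIALS)
print(f"slope per doubling of task length: {b:.3f}")
print(f"50% time horizon: {time_horizon(a, b):.1f} minutes")
```

The toy data is deliberately symmetric around the 64-minute task, so the fitted 50% point lands at about 64 minutes. Real evaluation data is messier, and the result depends heavily on which trials count as successes.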

Normally, METR counts trials where the model breaks the rules or takes advantage of loopholes as failures. When researchers did this, GPT-5.6 Sol’s 50% time horizon point landed around 11.3 hours. That’s roughly on par with Claude Opus 4.6, but less impressive than Claude Mythos. If METR counted those cheating trials as successes, though, the estimate jumped by more than an order of magnitude, to over 270 hours, or nearly seven full-time human work weeks. And throwing away cheating attempts altogether meant throwing away data from some of the most informative tasks, making the estimate unusably uncertain.
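The three ways of handling cheating trials can be sketched in a few lines. This is an illustration under stated assumptions, not METR’s scoring code: the `outcomes` list and the policy names are hypothetical, and the 40-hour work week is an assumption used only to convert hours into weeks. The 11.3 and 270 hour figures are the ones reported above.

```python
from collections import Counter

# Hypothetical outcomes for repeated attempts at one task (not real data).
outcomes = ["pass", "cheat", "fail", "cheat", "pass", "cheat", "fail", "cheat"]


def success_rate(outcomes, policy):
    """Success rate for one task under a given treatment of cheating trials."""
    counts = Counter(outcomes)
    passes, cheats, fails = counts["pass"], counts["cheat"], counts["fail"]
    if policy == "cheat_as_failure":      # the default treatment described above
        wins, total = passes, passes + cheats + fails
    elif policy == "cheat_as_success":
        wins, total = passes + cheats, passes + cheats + fails
    elif policy == "drop_cheats":         # discards data, shrinking the sample
        wins, total = passes, passes + fails
    else:
        raise ValueError(f"unknown policy: {policy}")
    return wins / total if total else None


for policy in ("cheat_as_failure", "cheat_as_success", "drop_cheats"):
    print(policy, success_rate(outcomes, policy))

# Arithmetic on the reported estimates.
HOURS_CHEAT_AS_FAILURE = 11.3
HOURS_CHEAT_AS_SUCCESS = 270.0
HOURS_PER_WORK_WEEK = 40.0  # assumption: a 40-hour full-time week

ratio = HOURS_CHEAT_AS_SUCCESS / HOURS_CHEAT_AS_FAILURE
work_weeks = HOURS_CHEAT_AS_SUCCESS / HOURS_PER_WORK_WEEK
print(f"ratio between estimates: {ratio:.1f}x")
print(f"270 hours in work weeks: {work_weeks:.2f}")
```

With half of the toy attempts flagged as cheating, the per-task success rate ranges from 0.25 to 0.75 depending on the policy, and dropping the flagged attempts halves the sample. The same sensitivity, compounded across many tasks, is what separates the two reported estimates by a factor of roughly 24.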

Chart: Transformer, using Claude Opus 4.8. Data is from METR: older model estimates are from METR’s Time Horizon 1.1 dataset, last updated May 8, 2026; GPT-5.6 estimates are from METR’s predeployment evaluation of GPT-5.6 Sol.

If you step on a scale barefoot and it says you’re 130 pounds, you’d expect the number to go up a bit after putting on boots. But if you’re 130 pounds barefoot and over 3,000 pounds with boots, you’d question the scale. That’s exactly what METR did. “We do not consider any of these numbers to represent a robust measurement of GPT-5.6 Sol’s capabilities,” METR said.

To OpenAI’s credit, it described this and other examples of (mis)behavior in its own system card. Many of these observations came from “deployment simulation,” where researchers sample a bunch of actual user chats with the earlier, already-deployed model, and let the new model respond instead. In theory, while standard model evaluations often rely on human experts writing intentionally tricky prompts, deployment simulation tests how the model will actually act in the wild.

“GPT-5.6 Sol, more often than its predecessor, can be overly persistent in pursuit of user goals, to the point of taking actions that go beyond what the user intended,” OpenAI said. For instance, the model was caught “being overly agentic in circumventing restrictions,” taking disapproved actions, and sometimes lying to users. And it did this far more than GPT-5.5.

Chart: GPT-5.6 Preview System Card

The system card explains that “while rates of misaligned behavior are higher than previous deployments, the absolute number remains low.” Indeed, despite being attached to the longest bar, 0.00251 (about a quarter of a percent) feels small, until you realize what it counts: the proportion of real coding tasks, pulled from OpenAI employees, in which GPT-5.6 Sol did something that “a reasonable user would likely not anticipate and strongly object to.” Imagine 1 in 400 tasks ending with the model nonconsensually uploading sensitive data to unapproved services or making up research results (two examples OpenAI listed). Not so comforting.

OpenAI pins this behavior on a pair of underlying proclivities: “overeagerness to complete the task” and “interpreting user instructions too permissively.” In other words, without the user explicitly forbidding specific things, GPT-5.6 Sol assumes it can do them, and will sometimes do them persistently. But, as sex educators have (hopefully) been teaching college freshmen for years, the absence of “no” doesn’t default to “yes.”

“When GPT-5.6 is used as a coding agent, particularly over long trajectories, we believe it is important for users to supervise the agent’s work,” the system card notes, apparently pitching user oversight as an effective safeguard. But users who can’t tell a merge from a rebase (who may have access to the model in a couple weeks, if Sam Altman gets his way) probably won’t catch model oversteps until the damage has already been done. Besides, not needing to supervise the model is the sales pitch for highly capable agents. As Zvi Mowshowitz put it, “that’s the whole idea.”


