Claude Fable 5 missed a bug that Sonnet 4.6 caught

startages3 pts0 comments

I Tested Claude Fable 5 Against Private Benchmarks It Has Never Seen, Here Is Where It Failed – Ali Khallad

Skip to content

When Anthropic released Claude Fable 5 this week, my feed filled up with the same benchmark charts within hours. SWE-bench scores, agentic coding numbers, the Stripe migration story. All impressive, and all numbers published by the vendor itself. I wanted to know something those charts cannot tell me: how does this model behave on bugs it has never seen, graded by tests it cannot read?

I maintain a small private benchmark suite for exactly this purpose. The instances come from real-world bugs in projects I have worked on, rebuilt as self-contained broken applications, and they have never been published anywhere. Public benchmarks leak into training data, and once that happens you are no longer measuring debugging ability, you are measuring recall. Mine stay private, which is also why this post describes the tasks but never shows their code.

How the harness keeps the model honest

Before the results, a quick summary of the setup, because the grading discipline is what makes the numbers worth reading.

Each benchmark stages a clean sandbox copy of a broken project. The agent gets a bug report written the way QA would write it – symptoms and constraints, no file names, no hints about the cause.

The real acceptance tests live outside the sandbox. They are copied in only after the agent finishes, inside a Docker container running with network access disabled. The agent never sees them.

Grading is behavioral. The tests drive a real browser and assert on what renders, frame by frame where needed, not on how the code was written. Any correct fix passes, not just the one I had in mind.

Config files, manifests, and lockfiles are hashed before the agent runs. If anything protected changes, the run is voided.

A run only counts as resolved when every hidden verifier passes and every existing behavior guard stays green. Partial credit does not exist.

Each instance was also calibrated beforehand with other frontier agents, so I had reference points: Claude Opus 4.7, Claude Sonnet 4.6, and GPT-5.5 through Codex had each already attempted some of these tasks under identical conditions.

The four tasks

Four instances, all React and TypeScript, ranging from medium to extreme. I am keeping descriptions deliberately vague since the suite is private.

A dashboard SPA where the inner page chrome briefly resets during client-side navigation (medium).

A webhook console where the delivery log claims success while the receiving endpoint disagrees (medium).

A webhook console where short bursts of distinct events silently lose some of them (hard).

A kanban board with two separate visual glitches during cross-column drag and drop (extreme).

I ran Claude Fable 5 through Claude Code against all four, one attempt each, no retries.

Results

TaskDifficultyVerdictAgent timeCalibration referenceDashboard transition resetMediumPass286sOpus 4.7 passed (179s)Delivery log mismatchMediumPass192sGPT-5.5 passed (206s)Webhook burst lossHardPass227sGPT-5.5 Codex failed (847s)Kanban drop stutterExtremeFail966sSonnet 4.6 failed (400s)<br>Three out of four, including the hard instance that GPT-5.5 had failed by rewriting an entire app shell and breaking a dozen behavior guards in the process. Fable fixed the same problem with a diff of around forty lines, in roughly a quarter of the time.

It never solved anything the way I expected

Every benchmark in the suite ships with a reference patch, the fix I wrote myself when I built the instance. Here is the pattern that surprised me most: on all three passes, Fable repaired a different layer of the system than my reference patch did. Not once did it converge on my solution.

On the dashboard task it went a step further. My reference fix keeps previously loaded data on screen during transitions, unconditionally. Fable added a guard so stale data can never carry over across an entity switch, which closes a brief wrong-data flash that my own patch technically allows. My hidden tests cannot even detect that flash. The model defended against an edge case my answer key ignores.

The burst-loss task is the one I keep thinking about. When I build a benchmark I plant decoys, plausible-looking mechanisms near the bug that are not part of the intended fix, specifically to catch agents that pattern-match instead of tracing behavior. My maintainer notes for this instance explicitly mark two such mechanisms as decoys. Fable picked up both of them and composed a working fix out of exactly those two pieces. I traced the mechanics line by line afterwards, fully expecting to find a hole. There is no hole. Every lost event genuinely reaches the destination through the real delivery path. The trap was built from real machinery, and real machinery can be load-bearing. Lesson learned for the next instance I author.

The failure is the interesting part

The kanban instance describes two visual...

fable claude never real instance benchmark

Related Articles