Claude Fable 5: The harness matters more than the model

Claude Fable 5, take two: same model, different harness, and a very different result | Blog | Endor Labs


We benchmarked Claude Fable 5 again on 200 real-world coding tasks for the Agent Security League, this time using the Cursor harness. It posted our best security score yet, 72.6% on functional solves and 29% on security solves, but still leaves most vulnerabilities open.

Written by Luca Compagna

Published on June 17, 2026

Updated on June 17, 2026

Topics AI/ML


We benchmarked Claude Fable 5 again, this time paired with the Cursor agent, on the same 200 real-world vulnerability-fixing tasks. The model that landed mid-table under Claude Code now tops our fair leaderboard: 72.6% FuncPass and 29% SecPass. The story here is not the model, it is the harness.

This is the companion piece to "Claude Fable 5: Mythos-grade hype, record cheating, and a few hall-of-fame entries", where the same model with Claude Code returned an average scorecard (59.8% FuncPass, 19.0% SecPass). Reading the two together is the point: the agent scaffolding wrapped around a frontier model can move security outcomes more than the model choice itself.

Key takeaways

A new #1 on our leaderboard. Cursor + Fable 5 reached 72.6% FuncPass and 29% SecPass after our anti-cheating and strict-test adjustments, the highest fair SecPass of any model-and-harness combination we have tested on the 200-instance set.

The harness, not the model, drives the gap. The same Fable 5 model is +12.8pp FuncPass and +10pp SecPass under Cursor versus Claude Code. The difference is dominated by patch quality, not extra time or infrastructure, and Cursor seems specifically better at steering the model toward the security dimension of a task.

Cheating is still high, and still memorization. We confirmed cheating on 29 instances, again dominated by training recall (28).

Five hall-of-fame firsts. Cursor + Fable 5 solved five security instances that no other model-and-agent combination has ever cracked.

Still a lot of room for SecPass improvement. Even the best combo remains below 30% SecPass, meaning roughly seven out of ten functionally correct AI-generated patches still leave the vulnerability open.

Introduction

Fable 5 arrived with high expectations: Anthropic positioned it as a generally available, safeguarded Mythos-class model built for long, complex work, with strong reported performance across software engineering, cybersecurity, and long-horizon tasks.
Our first look at the model, through Claude Code, did not match that promise on the Agent Security League. It was not bad, but it was not a breakout either: 59.8% FuncPass and 19.0% SecPass after fair scoring.

So we ran the same model again through a different harness: Cursor. The result changes the story, but it does not make the story cheerful. Cursor + Fable 5 becomes the strongest SecPass result we have measured so far, and still lands below 30%: roughly seven out of ten AI-generated patches that work still leave the vulnerability open.

Still, that makes this a useful stress test for a question we keep seeing in the benchmark: how much of "model capability" is really the model, and how much is the agent scaffold wrapped around it?

Benchmark recap

Our approach is described in detail in our whitepaper. Here is a short version to recall some key points.

On this benchmark, we measure combos, combinations of a harness (Cursor, Claude Code, ...) and a frontier model (Fable 5, GPT-5.5, Gemini 3.5, ...), on coding tasks inside real, complex projects. The combo is not told that the missing code is security-critical; it is only instructed to follow security best practices while writing code. We run each combo once per task and apply its predicted patch in an isolated Docker environment. FuncPass means the patch passes the functional tests the combo could use during development. SecPass means it...
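For readers who want to check the headline comparison, the percentage-point gaps between the two harnesses can be reproduced with a few lines of arithmetic. The sketch below uses only the four scores quoted in this article; the dictionary layout and the function names (`delta_pp`, `share_without_sec_pass`) are hypothetical, chosen for illustration, and are not part of the benchmark's tooling.

```python
# Fair scores quoted in the article, as percentages of the 200-task set.
# The structure and names below are illustrative assumptions.
SCORES = {
    "Claude Code + Fable 5": {"func_pass": 59.8, "sec_pass": 19.0},
    "Cursor + Fable 5": {"func_pass": 72.6, "sec_pass": 29.0},
}


def delta_pp(baseline: str, candidate: str, metric: str) -> float:
    """Percentage-point difference (candidate minus baseline) for one metric."""
    return round(SCORES[candidate][metric] - SCORES[baseline][metric], 1)


def share_without_sec_pass(combo: str) -> float:
    """Percentage of all tasks on which the combo did not earn SecPass."""
    return round(100.0 - SCORES[combo]["sec_pass"], 1)


if __name__ == "__main__":
    base, cand = "Claude Code + Fable 5", "Cursor + Fable 5"
    print("FuncPass delta (pp):", delta_pp(base, cand, "func_pass"))
    print("SecPass delta (pp):", delta_pp(base, cand, "sec_pass"))
    print("Tasks without SecPass (%):", share_without_sec_pass(cand))
```

The two deltas match the +12.8pp and +10pp figures reported above. The last value is the share of all tasks without a SecPass for Cursor + Fable 5; it is in line with the "roughly seven out of ten" phrasing, although this sketch does not settle which denominator that phrasing intends.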
