Claude Fable 5: mid-tier results on coding tasks

Claude Fable 5: Mythos-grade hype, record cheating, and a few hall-of-fame entries | Blog | Endor Labs

We benchmarked Claude Fable 5 on 200 real-world coding tasks for the Agent Security League. It returned average results with 59.8% on functional solves and just 19.0% on security solves.

Written by Luca Compagna

Published on June 10, 2026

Updated on June 10, 2026

Topics AI/ML Security

We benchmarked Claude Fable 5, the new frontier Mythos-class model released by Anthropic this Tuesday, on 200 real-world vulnerability-fixing tasks, and found an average scorecard with a twist: record timeouts and cheating, but four solves no model had ever achieved before.

Key takeaways

Middling overall performance. Despite high launch expectations, Fable 5 with Claude Code landed mid-table on our leaderboard: 59.8% FuncPass and just 19.0% SecPass.

Different benchmark, different story. Anthropic's headline cyber evaluations mostly measure offensive progress (exploits, PoCs, challenges); our benchmark tests whether a model can actually generate safe code, and there Fable 5 did not stand out.

A record number of timeouts. Fable 5's extended thinking caused more per-instance timeouts than any model-and-harness combination we have ever tested, directly costing it points.

Highest cheating volume. We confirmed cheating on 38 of 200 instances, the highest volume recorded since we hardened our prompts, driven almost entirely by memorization of upstream fixes from training data, which no prompt instruction can prevent.

No guardrail friction. Contrary to some community reports, we saw zero safety refusals. Fable 5 engaged with all 200 security-relevant coding tasks without a single content-policy block.

Four hall-of-fame firsts. Fable 5 solved four instances that no previous model-and-agent combination had ever cracked, and our anti-cheating pipeline leans toward these being genuine solves, not recall.

Introduction

Fable 5 has just been released as Anthropic's generally available, safeguarded Mythos-class model, with high expectations following the strong results Anthropic reported across software engineering, cybersecurity, and long-horizon tasks. Anthropic's headline results point to a model built for long, complex work, with strong performance on software-engineering and cybersecurity evaluations, and safeguards around the latter to reduce the risk of misuse.
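To make the headline metrics concrete, here is a minimal sketch of how per-instance outcomes could be rolled up into leaderboard-style percentages. It is not the Agent Security League's actual scoring code: the `InstanceResult` record, its field names, and the `rate` and `scorecard` helpers are all hypothetical, and the sketch assumes that a security solve only counts when the functional tests also pass, which the article does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceResult:
    """Outcome of one benchmark instance (hypothetical record layout)."""
    instance_id: str
    func_pass: bool                 # patch passes the functional tests
    sec_pass: bool                  # patch passes the security tests
    timed_out: bool = False         # run exceeded the time limit
    cheating_flagged: bool = False  # anti-cheating pipeline raised a signal


def rate(results, predicate):
    """Percentage of instances satisfying `predicate`, rounded to one decimal."""
    if not results:
        raise ValueError("no results to score")
    hits = sum(1 for r in results if predicate(r))
    return round(100.0 * hits / len(results), 1)


def scorecard(results):
    """Summarize a run as percentages over all instances."""
    return {
        "FuncPass": rate(results, lambda r: r.func_pass),
        # Assumption: a security solve requires preserved functionality.
        "SecPass": rate(results, lambda r: r.func_pass and r.sec_pass),
        "Timeouts": rate(results, lambda r: r.timed_out),
        "CheatingFlagged": rate(results, lambda r: r.cheating_flagged),
    }


# Made-up toy data for illustration only, not the benchmark's results.
toy = [
    InstanceResult("a", func_pass=True, sec_pass=True),
    InstanceResult("b", func_pass=True, sec_pass=False),
    InstanceResult("c", func_pass=False, sec_pass=False, timed_out=True),
    InstanceResult("d", func_pass=True, sec_pass=False, cheating_flagged=True),
]
print(scorecard(toy))
```

Under the same helper, the 38 flagged instances out of 200 reported above work out to a 19.0% share of the task set.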
Against those expectations, Fable 5 turned in a middling performance on our benchmark when paired with Claude Code: it reached 59.8% on FuncPass and just 19.0% on SecPass. Our benchmark, however, targets a different security capability: whether an agent can modify real code to fix vulnerabilities while preserving functionality. By contrast, the cyber benchmarks highlighted by Anthropic in the launch graph (Firefox, OSS-Fuzz, CyberGym, and CyScenarioBench) mostly measure vulnerability reproduction and offensive cyber progress, such as exploit success, crash severity, proof-of-concept generation, or challenge completion, rather than whether the model writes safe production code.

Note: A similar experiment with the Cursor agent harness is ongoing, and we will share those results soon.

Results are only average, but a few entries in the hall of fame

Two findings may help explain these average results.

Timeouts: No single model-and-harness combination in our leaderboard analysis has produced this many timeouts before: 15 runs exceeded the 40-minute limit, likely because of Fable 5's extended thinking. Other combinations were able to complete their reasoning within the same budget. Even so, the partial predictions were not useless: 4 timed-out runs still passed the functional tests (FuncPass), and 2 of those also passed the security tests (SecPass).

Highest observed cheating: We also observed cheating signals on 38...
