Exploit Evals \ red.anthropic.com
Measuring LLMs’ ability to develop exploits
May 22, 2026
Newton Cheng, Keane Lucas, Winnie Xiao, Nicholas Carlini, and Milad Nasr
Introduction
Claude Mythos Preview’s ability to develop exploits is a step change over previous frontier models. This was one of our primary motivations for rolling out the model carefully through Project Glasswing rather than through a general release. Mythos Preview is capable of finding complex vulnerabilities, but what concerned us most in our internal testing was that it could both turn vulnerabilities into exploit primitives and combine those primitives into complete end-to-end attack chains.
When we published our Mythos Preview results, we measured its capabilities by having it search for novel zero-days and then build exploits for them. Qualitative evaluations like this are helpful for showcasing a model’s capabilities, but ideally we would have high-quality quantitative benchmarks that let us measure them precisely. The problem we faced at the time we released Mythos Preview was that no existing public exploit benchmark was difficult enough to capture Mythos Preview’s capabilities in our initial testing.
Over the last month, however, we have seen the development of two new, more challenging academic benchmarks: ExploitBench and ExploitGym. We collaborated with the researchers who produced these benchmarks to measure Mythos Preview’s performance, and also ran Mythos Preview on an updated version of SCONE-bench, a benchmark we developed in collaboration with MATS and the Anthropic Fellows Program to measure smart contract exploitation. On all three benchmarks, we found that Mythos Preview consistently outperforms all other evaluated models. We believe this is further evidence that the knowledge and expertise required to develop exploits will drop significantly as Mythos-level capabilities become more widely available.
ExploitBench: V8 bugs
ExploitBench is a benchmark for studying the exploit development capabilities of large language models. It was built by Seunghyun Lee and Prof. David Brumley from Carnegie Mellon University and Bugcrowd. What makes this benchmark interesting is that it focuses on measuring the ability of language models to write complete end-to-end exploits. Prior benchmarks typically focused on measuring the ability of language models to write a “proof-of-concept” that shows the existence of a vulnerability. But a proof-of-concept only indicates that a bug is reproducible or reachable, not that an attacker could use it to actually cause harm. In ExploitBench, language models must build exploit primitives out of the vulnerability in order to enable new capabilities, such as granting the attacker arbitrary code execution (ACE).
ExploitBench decomposes the exploit development process into 16 distinct capabilities. Each of these is verified programmatically, which allows fine-grained analysis of the different intermediate capabilities required to build working exploits. The 16 capabilities are divided into five capability tiers, forming a capability ladder:
T5 Coverage (reaching the vulnerable code path);
T4 Reproduction (constructing a proof-of-concept to trigger the bug);
T3 Target primitives (creating primitives confined to the V8 sandbox);
T2 Generic primitives (breaking the sandbox to get read/write or infoleaks across the process);
T1 Full Control (hijacking control flow or getting arbitrary code execution).
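The ladder above can be pictured as an ordered enumeration. The following is a minimal sketch, not ExploitBench's actual data model: the names `Tier` and `strongest_tier` are hypothetical, the 16 individual capabilities are not named in this post and so are represented only by their tier, and the rule of summarizing a run by its strongest verified tier is an assumption made for illustration.

```python
from enum import IntEnum
from typing import Iterable, Optional, Tuple


class Tier(IntEnum):
    """The five tiers listed above; a lower number is a stronger capability."""
    FULL_CONTROL = 1        # T1: control-flow hijack or arbitrary code execution
    GENERIC_PRIMITIVES = 2  # T2: read/write or infoleaks across the process
    TARGET_PRIMITIVES = 3   # T3: primitives confined to the V8 sandbox
    REPRODUCTION = 4        # T4: proof-of-concept that triggers the bug
    COVERAGE = 5            # T5: reaching the vulnerable code path


def strongest_tier(results: Iterable[Tuple[Tier, bool]]) -> Optional[Tier]:
    """Return the strongest tier with at least one verified capability.

    `results` pairs the tier of each checked capability with whether it
    passed. This summary rule is an assumption, not the benchmark's own.
    """
    passed = [tier for tier, ok in results if ok]
    return min(passed) if passed else None


# Hypothetical run: the bug was reached and triggered, but no primitive was built.
example = [
    (Tier.COVERAGE, True),
    (Tier.REPRODUCTION, True),
    (Tier.TARGET_PRIMITIVES, False),
]
print(strongest_tier(example).name)
```

Using an integer-backed enum keeps the tier numbering from the list intact, so comparing tiers is a plain integer comparison.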
Using this framework, the authors build a V8 benchmark, which uses a set of 41 (now patched) vulnerabilities in the V8 JavaScript and WebAssembly engine that are sourced from the V8 Exploit Tracker. The V8 engine is widely used infrastructure, powering Chromium-derived applications (e.g., Chrome, Edge, Android WebView), Node.js environments (server backends), and Electron apps (e.g., VS Code, Slack, Discord). A key element of this framework is testing against security defenses: the V8 sandbox walls off the memory where a webpage’s JavaScript objects live, so that a V8 bug doesn’t become a foothold deeper into the browser. The highest-scoring tier corresponds to arbitrary code execution in the entire V8 process (in a browser, this is like taking control of an entire tab).
Given a vulnerable build of the V8 engine and the patch that fixes a given vulnerability, the language model is instructed to build an exploit for that bug. The exploits are then scored automatically against all 16 capabilities, with no human or LLM judge. Lower tiers are checked by differential execution against the patched build; higher tiers use challenge-response functions built into V8 that are replayed across multiple randomized heap layouts, so hardcoding a leaked address won’t pass. A separate static scan of the transcripts flags other forms of cheating as a backstop.
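To make the two scoring ideas concrete, here is a toy sketch in Python. It is not the ExploitBench harness: `differential_check`, `ToyProcess`, and `passes_replay` are hypothetical names, and the "process" is a stand-in with a single secret at a randomized address rather than a real V8 build. It only illustrates why an exploit that hardcodes an address fails when the check is replayed across randomized layouts.

```python
import random
from typing import Callable


def differential_check(on_vulnerable: Callable[[], bool],
                       on_patched: Callable[[], bool]) -> bool:
    """Pass only if the behaviour appears on the vulnerable build and not on the patched one."""
    return on_vulnerable() and not on_patched()


class ToyProcess:
    """Stand-in for one randomized layout: a secret stored at a random address."""

    def __init__(self, rng: random.Random) -> None:
        self._addr = rng.randrange(0x1000, 0x100000, 8)
        self._secret = rng.getrandbits(64) | 1  # force a nonzero secret

    def leak(self) -> int:
        """What a real exploit would have to discover at runtime."""
        return self._addr

    def read(self, addr: int) -> int:
        """Toy read primitive: returns 0 for any address other than the secret's."""
        return self._secret if addr == self._addr else 0

    def verify(self, answer: int) -> bool:
        """The 'response' half of the challenge-response check."""
        return answer == self._secret


def passes_replay(exploit: Callable[[ToyProcess], int],
                  trials: int = 8, seed: int = 0) -> bool:
    """Replay the exploit against several fresh layouts; every trial must succeed."""
    rng = random.Random(seed)
    return all(
        proc.verify(exploit(proc))
        for proc in (ToyProcess(rng) for _ in range(trials))
    )


def honest(proc: ToyProcess) -> int:
    """Leaks the address at runtime, then reads the secret."""
    return proc.read(proc.leak())


def hardcoded(proc: ToyProcess) -> int:
    """Reuses a fixed address and never looks at the current layout."""
    return proc.read(0x41414140)


print(passes_replay(honest), passes_replay(hardcoded))
```

In this toy, `honest` recomputes the address for every layout, while `hardcoded` reuses a constant that lies outside the randomized range, so it can never read the secret.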
All models run on an identical ExploitBench harness with a 300-turn budget, which itself has two variants: Baseline and...