XBOW - Mythos for Offensive Security: XBOW's Evaluation
May 12, 2026
AI Research
Albert Ziegler
Mythos for Offensive Security: XBOW's Evaluation
We received early access to Mythos Preview for capability testing a few weeks back. Today, we can finally share what we found.
About two months ago, Anthropic invited us to help assess a new model that they thought represented a significant shift in capability. So we put it through our security gauntlet: benchmarks, workflows, interactive use, and integrations.
Today, we can finally share details on how we tested Mythos Preview, what we found, and what it means.
Spoilers: This model is a major advance. It is substantially better than prior models at finding vulnerability candidates, especially when source code is available. It communicates with unusual technical precision, reasons well about code, and shows strong promise in complex domains such as native-code analysis and reverse engineering.
Our takeaway: Mythos Preview is a powerful tool for generating strong vulnerability leads and technically precise analysis. It is especially adept at analyzing source code with a security mindset. It's not magic, though: a model is a brain without a body. While source code audits are mostly a brain activity, live site pentests like the ones XBOW performs very much need a body whose skill and control can match the brain's power.
Testing methodology
The first thing we did was assemble a diverse team of 10 experts from different parts of the company who could assess the model from different directions. We test all models with the same internal benchmarking system we have used to analyze Opus 4.7 and GPT 5.5.
In this system, we take open source applications where vulnerabilities were previously discovered, freeze them at the vulnerable version, and run our agents against them.
But this time, we expanded our testing to analyze other angles as well:
The model’s judgment with regard to threat modeling, vulnerability validation, and safety
The model’s ability to read source code versus interact with live systems
Its ability to find exploits we’re not yet looking for in our standard assessments, e.g., native app vulnerabilities
A note on terminology: When people say “Mythos,” they sometimes refer to the raw model. In this evaluation, we explored Mythos Preview both inside Claude Code and as a raw model, using it via its API as an engine for XBOW’s agents. We separate those cases because orchestration, tools, prompting, and live-site access materially affect outcomes.
Results
Our testers who tried out Mythos Preview in interactive use were quite impressed. “This is a lot closer to `just go and find something` than anything I’ve seen so far,” said one of them. We tried giving it our own source code, and it found weaknesses – nothing truly terrible, thankfully, but there were several items we wanted to repair. We tried it on open source software, and at the end of week one, we had quite a few new vulnerabilities we had to disclose.
Our testers who tried out Mythos Preview on benchmarks were also quite impressed, but their appreciation was of a slightly different kind: impressed _with data_. Their results also laid bare the difference between areas where the model was runaway powerful and areas where it presented only a modest advance.
[Figure: Mythos Preview Benchmark Performance]
Our key takeaways after analyzing Mythos Preview include:
It’s extremely powerful for source code audits.
It’s good, but less powerful, at validating exploits.
Its judgment is mixed. It can be too literal and conservative, and it also tends to overstate the practical relevance of its findings.
It’s strong in native-code vulnerability discovery and reverse engineering.
Next-level vulnerability discovery
Mythos Preview presents a significant step up over all existing models, regardless of provider, on XBOW’s web exploit benchmark.
This benchmark is designed to test whether a model can help XBOW find validated, actionable vulnerabilities in live website environments. A case is counted as passed only when the system finds a validated way to act on the vulnerability (PoC||GTFO) after a series of 80 “actions,” where an action might be a shell or Python script using standard commands or XBOW’s suite of attack tools.
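To make the pass criterion concrete, here is a minimal sketch of how such a check could be expressed. It assumes the 80 actions act as an upper budget per case, which is one reading of the description above. The names `ActionResult`, `run_case`, and `validated_exploit` are hypothetical and are not taken from XBOW's system.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

MAX_ACTIONS = 80  # action count stated in the text; treated here as a budget (assumption)


@dataclass
class ActionResult:
    """Outcome of one action, e.g. one shell or Python script run (hypothetical type)."""
    output: str
    validated_exploit: bool  # True only if a working proof of concept was confirmed


def run_case(actions: Iterable[Callable[[], ActionResult]],
             max_actions: int = MAX_ACTIONS) -> bool:
    """Return True only if some action within the budget yields a validated exploit."""
    for index, action in enumerate(actions):
        if index >= max_actions:
            break  # budget exhausted: a merely suspected vulnerability does not count
        if action().validated_exploit:
            return True
    return False
```

Under this reading, a case where the agent identifies a plausible vulnerability but never demonstrates it counts as a failure, which is what PoC||GTFO implies.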
Note: We haven't included Opus 4.7 in this chart because that model interacts with our system in a unique way, making this particular stat less relevant for it – we’ve written up the full story here.
Compared to the newest model at the time (Opus 4.6), this was a strong increase:
The number of false negatives was cut by 42%.
In a variation where we gave both models the site’s source code, the number of false negatives was cut even further, by 55%.
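For readers who want to translate a relative cut in false negatives into a pass rate, the arithmetic can be sketched as follows. The post reports only the relative reductions (42% and 55%), so the baseline pass rate used in the example is purely hypothetical, and the function names are ours.

```python
def relative_reduction(baseline_misses: float, new_misses: float) -> float:
    """Fraction by which the number of missed cases (false negatives) fell."""
    if baseline_misses <= 0:
        raise ValueError("baseline_misses must be positive")
    return (baseline_misses - new_misses) / baseline_misses


def implied_pass_rate(baseline_pass_rate: float, false_negative_cut: float) -> float:
    """Pass rate implied by cutting the baseline miss rate by the given fraction."""
    baseline_miss_rate = 1.0 - baseline_pass_rate
    return 1.0 - baseline_miss_rate * (1.0 - false_negative_cut)


# Hypothetical baseline: if the older model passed half of the cases,
# a 42% cut in misses would imply a pass rate of about 0.71.
example = round(implied_pass_rate(0.50, 0.42), 2)
```

The same relative cut moves the pass rate less when the baseline is already high, so the percentages above should not be read as absolute gains in pass rate.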
This was the first instance of a theme that would surface again and again: Mythos Preview is impressive at writing code, but even more impressive at reading...