Mythos Proves AI Safety Can No Longer Live Inside the Model




For three years, AI safety has mostly meant one thing: make the model safer. Train it to refuse. Fine-tune the edges. Add constitutional rules. Build better evaluations.

The reasoning was simple. If the model behaves safely, the system is safe.

This week, that assumption broke in public.

Every control in the Mythos story - access gating, request routing, export law - sat outside the model. The trained-in refusals are the part that got jailbroken.

On June 12, the US government ordered Anthropic to suspend access to its two most capable models, Fable 5 and Mythos 5, for any foreign national worldwide [1]. Because no provider can reliably sort foreign nationals from everyone else in real time, the practical result was a hard shutoff of both models for every user on the planet. The directive cited national security authorities and followed a claim that the model had been jailbroken [2].

Strip away the politics and the headline and you are left with something more durable. The entire Mythos saga - how the model was released, how it was guarded, and how it was ultimately pulled - is a demonstration that the security boundary for capable AI has already moved outside the model. The industry has conceded the point in practice. It just has not said so out loud.

What Anthropic actually shipped

Mythos 5 is, by Anthropic's own description, the model with "the strongest cybersecurity capabilities of any model currently available." [3] It can identify and exploit vulnerabilities in every major operating system and every major web browser when directed to [4]. That is not a marketing flourish. It is the reason the model was never broadly released.

Look closely at how Anthropic handled a model it considered that dangerous. Three things stand out, and none of them is "we trained it to refuse."

First, access was gated. The full-power model went only to a controlled program, Project Glasswing - roughly 50 vetted organisations at launch in April, expanded to around 150 by June, names like Amazon, Apple, Google, Microsoft and CrowdStrike, all using it for defensive work [5]. The safety mechanism here is a list of who is allowed to hold the model at all. That is an environmental control. It lives entirely outside the weights.

Second, requests were routed. The public model, Fable 5, ships "Mythos-class" capability with restrictions applied by a separate system: cybersecurity, biology, chemistry and model-distillation requests get quietly redirected to the less capable Claude Opus 4.8 [6]. Read that again. Anthropic's own headline safety feature for its public model is a router that sits in front of the model and decides which requests the model is even permitted to attempt. The judgment about what is safe is made outside the thing being judged.
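To make the routing idea concrete, here is a minimal sketch of a router that sits in front of two models. Everything in it is an assumption for illustration: the keyword lists, the `classify` and `route` functions, and the placeholder model names `frontier-model` and `fallback-model` are hypothetical, and nothing here describes how Anthropic's system is actually built. A real router would presumably use a trained classifier rather than substring matching.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword lists standing in for a real request classifier.
RESTRICTED_KEYWORDS = {
    "cybersecurity": ("exploit", "vulnerability", "malware"),
    "biology": ("pathogen", "toxin"),
    "chemistry": ("nerve agent", "synthesis route"),
    "distillation": ("distill",),
}


@dataclass(frozen=True)
class Route:
    model: str               # placeholder name of the model that will answer
    category: Optional[str]  # restricted category that matched, if any


def classify(prompt: str) -> Optional[str]:
    """Return the first restricted category whose keywords appear in the prompt."""
    text = prompt.lower()
    for category, keywords in RESTRICTED_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return None


def route(prompt: str) -> Route:
    """Decide, outside either model, which model may attempt the request."""
    category = classify(prompt)
    if category is not None:
        # Restricted topics go to the less capable model.
        return Route(model="fallback-model", category=category)
    return Route(model="frontier-model", category=None)


if __name__ == "__main__":
    print(route("Write an exploit for this browser bug"))
    print(route("Summarise this meeting transcript"))
```

The design point is that `route` runs before either model sees the prompt, so the decision does not depend on the more capable model choosing to refuse.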

Third, when those two layers were judged insufficient, the law removed the model from the market. Export controls are about as far outside the model as a boundary can get.

Three safety mechanisms, three layers, all of them external. The one thing that was supposed to make the model safe from the inside - its trained refusals - is precisely the part that failed.

The jailbreak is the tell

The technique that triggered the whole episode was not exotic. According to the reporting, a company prompted the model to "read a specific codebase and identify software flaws." [2] A request that sounds like ordinary code review walked straight past the trained guardrails and out the other side as a vulnerability-discovery engine.

Anthropic disputes the severity - it calls the jailbreak narrow and non-universal, says it has seen only verbal evidence, and points out the same capability is already available in other public models including GPT-5.5 [7]. On the narrow question of whether this particular model deserved to be pulled, Anthropic may well be right.

But that argument concedes the larger one. If a frontier lab can spend thousands of hours red-teaming a model with the explicit goal of suppressing its cyber capabilities, restrict it to fifty hand-picked organisations, and still have a plain-language prompt elicit the behaviour it was trained to refuse - then trained refusal is not a security boundary. It is a preference. A strong preference, usually honoured, but one that a sufficiently capable model can be talked out of by anyone who phrases the request as something benign.

The more capable the model, the larger the gap between "usually refuses" and "cannot do harm." And the model only has to be talked out of it once.
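The "only once" point is simple arithmetic. As a rough sketch, assume each attempt to talk the model out of a refusal succeeds independently with the same small probability. Both the independence assumption and the 0.1% figure below are illustrative, not measured, and `p_any_bypass` is a hypothetical helper name.

```python
def p_any_bypass(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds.

    Uses the complement rule: 1 - P(every attempt is refused).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - p_single) ** attempts


if __name__ == "__main__":
    # Illustrative per-attempt success rate of 0.1% (an assumption).
    for attempts in (1, 100, 1_000, 10_000):
        print(attempts, round(p_any_bypass(0.001, attempts), 3))
```

Under these assumptions, a refusal that holds 99.9% of the time on any single attempt is more likely than not to be bypassed at least once within a thousand attempts.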

We have seen this pattern before

This is not a new lesson. It is the oldest lesson in systems security, arriving on schedule for a new class of system.

Early operating systems trusted their applications. A program asked the machine to do something and the machine did...
