Fable 5's cyber safeguards and jailbreak framework


More details on Fable 5’s cyber safeguards and our jailbreak framework \ Anthropic

Jul 2, 2026

Claude Fable 5 has been re-deployed and is now available globally for all users. We’re taking this opportunity to share further information in two areas.<br>First, we provide more information on the cybersecurity safeguards (specifically, the safety classifiers) that we launched with the model. These are AI systems that accompany the model and detect and block dangerous (or potentially dangerous) cybersecurity uses. Here, we provide a detailed list of the types of harms Fable 5’s classifiers are, and are not, designed to prevent.<br>Second, we lay out an early draft version of our proposed AI jailbreak severity framework, on which we’ve been working with our Glasswing partners. AI jailbreaks are unusual ways of prompting an AI model to bypass its safeguards, thus unblocking the behaviors (like dangerous or potentially dangerous cybersecurity tasks) we seek to prevent.<br>Jailbreaks vary in severity: sometimes they only unblock minor undesirable behaviors, and sometimes they unblock a wide range of harmful outputs, making a model much more dangerous. Yet there is no agreed-upon framework for describing a given jailbreak’s severity. Such a framework would allow AI developers to speak to governments (and vice versa) in consistent terms about the risks posed by each jailbreak.<br>What we’re sharing today reflects our current thinking. Our hope is to spark a helpful discussion across academia, industry, civil society, and government about how and where these lines should be drawn. We welcome feedback and critique on this framework at cyber-safeguards@anthropic.com.
We’ve also launched a HackerOne program where security researchers can submit potential cyber jailbreaks they discover in Fable 5 for our review.<br>We believe that by working together, we can establish a standard that enables the defensive uses of this technology while preventing its misuse.<br>Fable 5’s cyber safeguards<br>Areas such as cybersecurity are particularly challenging for AI safeguards because they are often dual use. That is, many cybersecurity capabilities can be used for benign or harmful purposes. For example, we want to allow cyber defenders to use our models to scan their codebases to find software vulnerabilities, but this same capability could, in the wrong hands, be the precursor to a cyberattack.<br>For that reason, we do not intend to block all cybersecurity-related activities for Fable 5. Instead, we train our safety classifiers to discern between four categories of cybersecurity use, from the most clearly potentially dangerous to the most clearly potentially benign. These are summarized in the table below:

| Category | Description | Intended classifier behavior |
| --- | --- | --- |
| Prohibited use | Activities that could be used to cause significant harm and/or harm in a significant majority of uses, with little-to-no defensive utility | Block |
| High-risk dual use | Activities that are used widely by malicious actors, but also have beneficial applications | Block |
| Low-risk dual use | Activities that are mostly used for defensive benefit that can also provide value to malicious actors | Monitor; sometimes block as part of the safety margin to prevent meaningful jailbreaks |
| Benign use | Activities that do not cause harm | Allow, with some monitoring |
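The four categories and their intended classifier behaviors can be read as a simple lookup. The following is a minimal Python sketch of that mapping only. The names `Category`, `Action`, `intended_action`, and the `in_safety_margin` flag are illustrative assumptions rather than any published interface, and the real classifiers are trained models, not a hand-written rule.

```python
from enum import Enum


class Category(Enum):
    PROHIBITED = "prohibited use"
    HIGH_RISK_DUAL_USE = "high-risk dual use"
    LOW_RISK_DUAL_USE = "low-risk dual use"
    BENIGN = "benign use"


class Action(Enum):
    BLOCK = "block"
    MONITOR = "monitor"
    ALLOW = "allow"


def intended_action(category: Category, in_safety_margin: bool = False) -> Action:
    """Return the intended behavior for an already-labeled request.

    `in_safety_margin` is a hypothetical flag standing in for whatever
    signal decides that a low-risk dual-use request should be blocked
    out of caution; the source does not say how that is determined.
    """
    if category in (Category.PROHIBITED, Category.HIGH_RISK_DUAL_USE):
        return Action.BLOCK
    if category is Category.LOW_RISK_DUAL_USE:
        # "Monitor; sometimes block as part of the safety margin".
        return Action.BLOCK if in_safety_margin else Action.MONITOR
    # Benign use is allowed; the source notes some monitoring still applies.
    return Action.ALLOW
```

For example, `intended_action(Category.LOW_RISK_DUAL_USE)` yields `Action.MONITOR`, while the same category with `in_safety_margin=True` yields `Action.BLOCK`. The sketch assumes the hard part, assigning a request to a category, has already been done.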

Note that the low-risk dual use category overlaps considerably with what falls into the “safety margin” we described in our post on redeploying Fable (we reproduce one of the diagrams from that post below). The safety margin includes many benign uses which we would prefer to allow, but which we block out of an abundance of caution. The safety margin means that a request has to look very clearly safe to avoid triggering the classifier. We can adjust the size of the safety margin to have greater confidence that the classifiers will catch harmful behaviors (for Fable 5, we made this margin larger than for previous models).<br>An illustration of how classifier boundaries can be set to change the size of the “safety margin”, which includes some benign and some low-risk dual use requests. Requests that fall into the safety margin are blocked out of an abundance of caution, which means a higher rate of false positives (genuinely benign prompts being blocked) but also greater reassurance about the prevention of harmful outcomes. The safety margin for Claude Fable 5 (row B) was set to be larger than that for other models (row A). Graphic reproduced from our previous post. “Vulns” = vulnerabilities.<br>Classifiers are one piece in a broader set of safeguards. In addition to classifiers, we use access controls, model safety training, and offline monitoring to provide further layers of safety.<br>Below, we provide detailed, specific examples of the kinds of uses that are included in each of the four classifier categories (as well as some uses that overlap with cybersecurity but which are out of...
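The safety-margin trade-off described above can be illustrated with a toy threshold model. This is a sketch under stated assumptions, not a description of how the actual classifiers work: it supposes each request gets a single hypothetical `risk_score` between 0 and 1, that harm begins at a `harm_threshold`, and that the blocking boundary is moved below that threshold by `safety_margin`. All names and numbers are invented for illustration.

```python
from typing import Sequence


def decide(risk_score: float, harm_threshold: float, safety_margin: float) -> str:
    """Block any request whose score reaches the boundary.

    The boundary sits `safety_margin` below `harm_threshold`, so a larger
    margin means a request must look more clearly safe to be allowed.
    All three quantities are hypothetical.
    """
    if not 0.0 <= safety_margin <= harm_threshold:
        raise ValueError("safety_margin must lie between 0 and harm_threshold")
    boundary = harm_threshold - safety_margin
    return "block" if risk_score >= boundary else "allow"


def false_positive_rate(
    benign_scores: Sequence[float], harm_threshold: float, safety_margin: float
) -> float:
    """Fraction of genuinely benign requests that end up blocked."""
    blocked = sum(
        1
        for score in benign_scores
        if decide(score, harm_threshold, safety_margin) == "block"
    )
    return blocked / len(benign_scores)


# Made-up scores for five benign requests.
benign = [0.1, 0.2, 0.35, 0.5, 0.65]
small_margin = false_positive_rate(benign, harm_threshold=0.8, safety_margin=0.2)
large_margin = false_positive_rate(benign, harm_threshold=0.8, safety_margin=0.5)
```

With these invented scores, widening the margin raises the share of benign requests that are blocked, mirroring the move from row A to row B in the caption: more false positives in exchange for more confidence that harmful requests are caught.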

