Anthropic makes Fable 5's invisible safeguards visible after backlash


ClaudeDevs (@ClaudeDevs): "We’re rolling out changes to make Fable 5’s safeguards for frontier LLM development visible.

Starting this week, flagged requests will visibly fall back to Opus 4.8—the same as our safeguards for cyber and bio. You will see this every time it happens. On the API, any flagged requests will return a reason for their refusal (coming to server-side fallback in the next few days).

We wanted to deploy Fable 5 to our users quickly and safely. Visible safeguards can be probed, so they have to be robust, which takes time to get right. Invisible safeguards can be targeted more narrowly, allowing us to ship quickly with very few false positives. We went with invisible safeguards for this reason—and that was the wrong tradeoff. You should have visibility into the safeguards we have in place, and why. We’re sorry for not getting the balance right.

Making the safeguards visible makes them easier to work around, so keeping them robust to jailbreaks will unfortunately mean more false positives while we improve the classifiers. We're also tuning our bio and cyber classifiers to trigger less often on harmless requests. We know this is frustrating and we’ll do our best to keep this period as short as possible.

If you think a request has been mistakenly flagged: run /feedback in Claude Code, click thumbs-down on the fallback in Claude.ai or Cowork, or file the safeguard appeal form for API requests. Your reports help us tune these classifiers and we appreciate your feedback.
https://support.claude.com/en/articles/8241253-safeguards-warnings-and-appeals"

Jun 11, 2026 · 5:56 AM UTC

274

178

1,759

139,251

saurse

@ragSki

1m

Replying to @ClaudeDevs

How do you keep finding the worst ways to “solve” a problem? If I wanted an answer from Opus, I’d ask OPUS.

Now, as it automatically switches, I have to stop it, go back to the model switcher, change back to Fable, and retry.

ѲӾᒍᑐ

@bussyjd

34m

Replying to @ClaudeDevs

Trust is earned and you burned it to the ground mate.

419

Lower Engineer

@LowerEngineerIt

49m

Replying to @ClaudeDevs

visible fallbacks is the right call. silent model swaps were the one thing that made benchmarking your own workflows impossible. now do silent quota changes.

265

The Fishcake🇯🇵🌸

@fishcake2026

1m

Replying to @ClaudeDevs

Can you please not choke on its own suggestion to perform a /security-review . Thank you

Semiconductor Insider

@SemiconductorsX

1h

Replying to @ClaudeDevs

Transparency on the safeguards is a smart move, showing the fallback to Opus 4.8 with clear reasons will build way more trust.

Glad you are owning the earlier invisible approach and actively tuning the bio/cyber classifiers to cut false positives.

How's the feedback coming in so far from devs? Any cool examples of it catching (or missing) something interesting?

118

Soroush Fadaeimanesh

@S_Fadaeimanesh

50m

Replying to @ClaudeDevs

making the fallback visible matters more than the safeguard itself. when it was silent the comparability question on evals was unanswerable. now you can filter by which model actually served each prompt.

329
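The filtering described in the reply above could look roughly like the following. This is a minimal sketch, not Anthropic's API: the `EvalRecord` type, its `served_model` field, and the model ID strings are all hypothetical, and it assumes your own eval harness records which model answered each prompt once the fallback is visible.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One eval result. All field names are hypothetical, for illustration."""
    prompt_id: str
    requested_model: str  # the model the harness asked for
    served_model: str     # the model that actually answered (assumed to be logged)
    score: float


def split_by_served_model(records):
    """Group eval records by the model that actually served each prompt."""
    groups = defaultdict(list)
    for record in records:
        groups[record.served_model].append(record)
    return dict(groups)


def mean_score(records):
    """Average score of a group, or None for an empty group."""
    if not records:
        return None
    return sum(r.score for r in records) / len(records)


# Illustrative data only; the model ID strings are placeholders.
records = [
    EvalRecord("p1", "fable-5", "fable-5", 0.9),
    EvalRecord("p2", "fable-5", "opus-4.8", 0.6),
    EvalRecord("p3", "fable-5", "fable-5", 0.7),
]

groups = split_by_served_model(records)

# Prompts where a fallback happened: served model differs from requested model.
fallbacks = [r for r in records if r.served_model != r.requested_model]

for model, group in groups.items():
    print(model, len(group), mean_score(group))
```

Scoring each group separately keeps fallback responses from being averaged into the numbers for the model you meant to measure.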

Chuck Fuchshard

@DDNowhere

33m

Replying to @ClaudeDevs

@AnthropicAI I can barely get through a convo with fable without it running away. It’s skittish!

654

EverNever

@RealEverNever

39m

Replying to @ClaudeDevs

Fuck you. Invisibly degrading responses when they potentially can be...
