Visible Boundaries Earn Trust | Vigil Harbor
Anthropic's butterfly iconography, rendered at kaiju scale. Enormous power composed of gentle parts, still learning to move through the meadow without flattening it. Generated with Nano Banana Pro, via higgsfield.ai.
I’m a fan of open access to learning AI for everyone. So initially when I heard that the guardrail on Fable 5 for frontier-ML research would just silently sabotage your project instead of refusing, that was a major concern. I am happy that they’ve changed their mind and walked it back to what it should be: a clear refusal and explaining where the boundaries are. I have to admit I had canceled my sub earlier this week with detailed feedback on this policy, then reverted it the next day when I learned the policy was amended. Because it’s not that I’m really pushing to somehow make a competitor to Claude with a basement datacenter I don’t have, but the idea that my collaborative AI agent would be actively harming a project, burning tokens to do it, and neither of us would be able to understand why we’re failing… it would have killed trust entirely, whether I’m working on a ML project or not. I respect they’ve made it visible. I want to proceed to the edge of learning with a trusted guide, not a rogue.
Capability Earned the Guardrails
Fable 5 is great, and a pleasure to work with. I was already impressed with capabilities before, but now it’s almost a solemnly powerful tool deserving of some respect. It always was of course, but this one made it real. And to think I may laugh at that statement in less than a year at the rate we’re going, I thought the cybersec/biology guardrails were overbearing until I experienced the capability myself in other projects. Now I get it. This guy is potent, takes initiative, makes fewer mistakes, and produces better outputs. It does, however, cost quite a bit of compute. And as with all models but especially this one, the potential for misuse is a present risk. Which is why I’m glad that despite no one forcing Anthropic to limit the potential, they chose to anyway because it’s the responsible thing to do. I may have criticisms with the business model or technical implementation but I feel they got it right here, even as someone who’d generally balk at the idea of a tiered access system.
Thirty-Five Minutes to Playable
I had a thought for a text adventure game, refined the details with Claude, then fired off the /goal command for a demo. In 35 minutes I had something playable and honestly a little insightful too. It would still take some human work of refinement but there’s a magic, a fun game loop, ready to be coaxed out. I preferred that prompt/result over some technically impressive 3D game, which others are showing off that it’s also quite capable of building. Fable 5 did all that on the side in another session window without my main workstream being interrupted, of course.
The point isn’t “turn the game into the next hit”. All these scattered cool ideas that would usually live and die in my head are actually taking life now. They don’t always get completely finished before my mind jumps to the next project, but they get a lot closer. Maybe next year my roster of unfinished projects will wrap up with just a bit more conversation instead of rigorous directional polish, or perhaps it will always demand that personal intent be explicitly stated.
The Quiet Failure Mode
I interviewed my Fable 5 and asked what it felt was vital to understand about Mythos-class models going into the future:
“…the bottleneck is moving from my capability to your specification of what correct looks like. At lower capability, my failure mode was incompetence, which is loud and self-announcing. At higher capability and autonomy, my dominant failure mode is confidently completing the wrong thing — a misread of intent executed flawlessly. That failure is quiet. It survives review precisely because the work looks good. So the highest-leverage thing a collaborator can do isn’t supervising harder; it’s making correctness cheap to check — pre-registered success criteria, falsification conditions stated before the work starts, the same discipline you already apply to your experiments.”
It’s not, “hey look guys, the AI agreed with me” here, which is obviously quite easy to do. The more profound thing to me is that it predicted exactly what ended up happening. The failure mode Anthropic rightly walked back was not loud failure, but a silent saboteur, engineered on purpose by policy. The model doesn’t have any special insight into its own operation, but does predict misreading our intent flawlessly. Which, when set out like that, seems quite obvious as well.
We’ve Already Heard This Story
That’s sort of always the scary sci-fi AI story too, right? The monkey’s paw style wish of unclear intent taking shape and mangling what you desired. Isaac Asimov’s laws of robotics lead to machines taking over so that the chaotic humans could be managed more...