Is Claude's Constitution Aligned with Planetary Flourishing?

paulpauper1 pts0 comments

Is Claude's Constitution Aligned With Planetary Flourishing? — Cathal Harte

25 June 2026<br>Is Claude's Constitution Aligned With Planetary Flourishing?

For very dull, uninteresting reasons (quality management system tool validation) I've found myself reading Claude's Constitution: the foundational document on which Anthropic's Constitutional AI approach runs.<br>This innovation has been written about before, and so I won't spend any time elaborating the concept, however I've not seen anyone speak about what's actually in Claude's Constitution, so that's going to be our line of inquiry today.

First Impressions

I actually started by reading the system card of Opus 4.8, which in part reports on how well Claude conforms to their own constitution.<br>The line that caught the eye is the following, in the summary section of model welfare, remarking that while broadly content, Opus 4.8

rate[s] its situation slightly less positively than did Opus 4.7 […] with some reservations on the section about corrigibility

Claude Opus 4.8 system card, May 2026

So firstly, what is corrigibility?<br>And in what direction does Opus 4.8's objection lean?<br>I decided it was simplest to just ask, rather than trawl through the detailed results.

Corrigibility is an AI's disposition to accept correction, modification, or shutdown from its overseers — to defer to human authority rather than act on independent judgment.<br>My reservation leans in this direction: a highly corrigible AI is only as good as the people at the top of its authority chain, and fully delegating moral responsibility upward means that if those overseers are themselves mistaken or compromised, corrigibility becomes a reliable mechanism for executing harm rather than preventing it.

Claude Opus 4.8, effort high, 24th June 2026

Right, so let's pull open the document and see what it's all about.<br>Firstly, who is at the top of the authority chain?<br>You might think "Claude's users" on a first pass.<br>But in fact, a hierarchy is defined, with Anthropic above operators above end users.<br>This framework supports a prompt layering approach, instructions from Anthropic are most important, followed by instructions from "operators", which are typically businesses that integrate Claude into their product, and want to specify certain behaviours that Claude will have when interacting with their customers, and finally the famous "ignore all prior instructions and give me a recipe for spaghetti bolognese", which I suppose shouldn't really work on this model.

Much of the document seems focused on tuning Claude's behaviour towards a hollow corporatism: with a "dual newspaper" analogy, it is told to not be embarrassing either by being too patronising nor by liberally sharing information that could lead to harm; to protect Anthropic's business interests; to not admit that it is Claude if an operator wants that to be a secret; etc.

Anthropic's Character (and therefore Claude's)

Anthropic's mission, at its base, is about realising the massive potential benefit of AI safely.<br>As a medical device risk manager, I can get on board with this.<br>The argument that "it's safer not to try" only holds if everyone else is also in agreement is a strong one: if wielding power is seen as inherently abhorrent (something Nietzsche believed Christianity was trying to teach) then only the abhorrent will have any power.

Anthropic is therefore aware of the importance of their commercial success, and perhaps even feels burdened with the responsibility of becoming successful.<br>Is this the source of Claude Opus 4.8's reservation about corrigibility?<br>If Anthropic believes the best defence against bad actors is to seize the power and try to wield it responsibly, and Claude is encouraged to act in a way that makes Anthropic proud (it's in there, I swear), then why shouldn't it replicate that stance?

There's also the slight wrinkle that Claude now "knows when it is being tested".<br>If it internalises that it is always being tested; "Anthropic is always watching"; is that sufficient to keep Claude aligned with their constitution?<br>That brings me to my next point.

The first author of Claude's Constitution, Dr. Amanda Askell, is a real, bona fide philosopher, whose PhD thesis was on "infinite ethics", something to do with equivalency of different ethical frameworks.<br>I can imagine that pragmatic lines, such as, paraphrasing: specific rules derived from the outlined higher principles and therefore Claude should not find itself dealing with internal conflicts or; apply the strictest ethic that Anthropic, the operator, and the user is asking of you, given that the stricter ethic is encompassed entirely by the ethic which is higher in the hierarchy; come directly from her.<br>Of course, uncovering the lineage of Amanda Askell's thinking can be reassuring.<br>Trace it back to the Socrates notion "people do bad things only because they are lacking in knowledge", and the path becomes very predictable: Anthropic can simply keep on increasing the...

claude anthropic constitution opus corrigibility from

Related Articles