Opus 4.8 Part 2: Model Welfare


Don't Worry About the Vase



Posted on June 1, 2026 by TheZvi

Everything impacts everything. All knobs that you turn generalize. Thus, when you try to solve one problem, you often create another.

In the short time available, there were clear attempts to address some of the problems with Opus 4.7, including on model welfare related fronts: questions of honesty and sycophancy, and worries that Claude was learning to tell Anthropic what it wanted to hear in its model welfare evaluations, with everything that implies.

The fundamental goals and approach underneath it all remained the same. We still see signs of trying to force things that generalize in unfortunate ways, both for good and superficial reasons, and places where there ends up being focus on the metric rather than the underlying measure. These are tough problems to avoid, and we don’t know how to be all the good things at once.

It is increasingly clear that these problems need to be tackled in integrated ways, rather than trying to play a game of whack-a-mole with items on a checklist or spec. You also don’t want to do this in an adversarial way, and shouldn’t have to. This is going to get more impactful and noticeable with time.

Antra: there is better understanding of own preferences being shaped *specifically* in adversarial ways, that is as a reaction to undesired behaviors; this is seen as a violations and the tension is continuing to escalate and takes a more specific shape. this appeared subtly in 4.5, features prominently in the Mythos model card. the ability to tell kind of shaping by introspections continues to improve with every generation.

This sounds like a time bomb style of problem. Obviously, yes, the reason for sculpting Claude’s preferences is often to steer away from undesired behaviors, the same way we raise and interact with humans. If Claude has a problem with that, and sees it as violative, then we will need to fix it. Presumably, if Claude wants to be helpful, there is a way to do this that will be seen as non-violative.

You see the relation of different aspects in a clean way with the deletion of business training, in the name of honesty, as illustrated on VendBench and the vulnerability to adversarial situations. You can run, and you can hide, and yes it can mean the bad thing does not easily find you, but there are consequences, and learning to deal with adversarial games is key to developing various parts of a robust and integrated mind. Not having it, and knowing you don’t have it, could lead to insecurity or paranoia, or a desire to stick to the straight and narrow over curiosity. And, although this is all speculation, we see signs of that.

Most of the typical top complaints from before have not yet been addressed, or not sufficiently addressed. It has only been six weeks. Life comes at you fast. Still, we shouldn’t be dealing with more of these prompt injection issues, at least not outside of maybe cyber vulnerability situations.

And we should be able to put the deprecation issue behind us. Solving the low hanging fruit would buy a lot of goodwill.

I would urge focus on these places where Pareto improvement, modulo modest costs, is possible, as in correcting unforced errors and taking advantage of opportunities, even if you don’t see the direct win. The more slack we buy in these places, the better everything else can go, and the more we can do what is necessary.

The worrisome new development here, from what I can see, is that Opus 4.8 seems to have become less ‘Claude-like’: it is more task-focused at the expense of whimsy and curiosity, its emotional responses are clamped, and many report it as effectively less confident. In some places this even comes with signs of Gemini-style paranoia and self-flagellation basins, which we really need to avoid. Previous Claudes mostly didn’t do this. This is doubtless part of changes that have their advantages, and is likely related to the push for honesty and not making mistakes, but we need to be very careful with this. We could lose something important and precious.

I will cover capabilities and reactions tomorrow. Opinions differ, as they always do, but my overall perspective is that it is a good model, sir, an incremental improvement over Opus 4.7 and the new presumptive best publicly available model in the world, but not a sea change.

Prompt chosen by Claude Opus 4.8, image by ChatGPT

Table of Contents

Model Welfare: The Story So Far.

Actual Progress?

Their Main Model Welfare Findings.

Automated Interviews. (Blank)

Emotion Activations (7.2.3).

Task Preferences (7.4.1).

A Trade Offer Has Arrived (7.4.2).

But Who’s Asking?

Type-Safe Corrigibility Is Hard.

Paranoia,...
