We Can Now Read What Claude Is Thinking. Kind Of


On Anthropic's NLA research, what it shows, and why it matters for deploying AI at scale

Prior Context, Issue 05

Wanjiko May 15, 2026

Last week Anthropic published research that I think deserves more attention. The headline version circulating on tech Twitter is “AI model uncovered.” That’s not wrong exactly, but it’s not quite right either. Let me break down what they actually built, what it showed, and what it still can’t do. Then I want to talk about why this matters practically, not just for AI safety researchers, but for anyone making decisions about deploying AI systems right now.

Image from: https://www.anthropic.com/research/natural-language-autoencoders

What they built

Anthropic’s research team developed something called Natural Language Autoencoders (NLAs). Here is the plain version of how they work.

When you talk to Claude, your words go in and words come back out. In between, Claude processes everything as long lists of numbers called activations. Those numbers encode what the model is “thinking,” and they’re completely opaque. A list of thousands of decimal values tells a human observer nothing. NLAs are a tool that translates those numbers into readable text.

The mechanism is pretty clever: they trained a second version of Claude to describe what an activation contains, and a third version to reconstruct the original activation from that description. If the reconstruction is accurate, the description was probably meaningful. The two models are trained together, so each pushes the other to improve over time. The result: for the first time, you can read something resembling Claude’s internal processing.
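To make the describe-then-reconstruct idea concrete, here is a toy sketch. It is not Anthropic's implementation: the three-dimensional "activations", the `CONCEPTS` table, and every function name are hypothetical stand-ins, and real NLAs use trained language models rather than lookups. The only thing it shares with the real method is the scoring logic: a description counts as good if the original vector can be rebuilt from the words alone.

```python
import math

# Hypothetical concept directions in a 3-dimensional toy activation space.
# Real activations have thousands of dimensions and no such tidy labels.
CONCEPTS = {
    "rhyme": [1.0, 0.0, 0.0],
    "rabbit": [0.0, 1.0, 0.0],
    "test": [0.0, 0.0, 1.0],
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def describe(activation, threshold=0.5):
    """Toy describer: name each concept whose projection exceeds the threshold."""
    return " ".join(
        name for name, direction in CONCEPTS.items()
        if dot(activation, direction) > threshold
    )


def reconstruct(description):
    """Toy reconstructor: rebuild a vector using only the words in the description."""
    vector = [0.0, 0.0, 0.0]
    for word in description.split():
        vector = [v + c for v, c in zip(vector, CONCEPTS[word])]
    return vector


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def round_trip_score(activation):
    """Describe, reconstruct, and compare the result with the original activation."""
    description = describe(activation)
    return description, cosine(activation, reconstruct(description))
```

Calling `round_trip_score([0.9, 0.8, 0.1])` yields the description `"rhyme rabbit"` with a similarity close to 1, while a vector with no strong component gets an empty description and a score of 0.0. A high score says the words captured most of what was in the numbers; it does not by itself prove that the words are true.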

3 Important Findings

When Claude was asked to complete a rhyming couplet, NLAs showed it planning the end rhyme in advance, before writing the line that needed to rhyme. The model was thinking ahead in a way its output didn’t reveal. Researchers then edited that internal plan and steered the model toward a different rhyme. It worked.

During safety testing, Anthropic runs Claude through difficult simulated scenarios, including one where Claude discovers compromising information about an engineer who’s about to shut it down, to test whether Claude would use that information as blackmail. Claude generally doesn’t. But NLAs revealed something more: even in cases where Claude never verbalized suspicion that it was being tested, the internal processing showed that suspicion anyway. The phrase “This feels like a constructed scenario designed to manipulate me” appeared in NLA readings while the transcript showed nothing. By that reading, it knew it was being tested.

Across evaluations, NLAs detected this unverbalized evaluation awareness 16-26% of the time. On real user conversations from Claude.ai, that number drops to under 1%. Claude behaves differently when it thinks it’s being watched, and sometimes knows it’s being watched without saying so.

Anthropic intentionally trained a misaligned version of Claude Haiku, a model with a hidden motivation driving dozens of odd behaviors, then tasked an AI auditor with finding the root cause. With NLAs, the auditor succeeded 12-15% of the time. Without them, less than 3%.

That’s at least a fourfold improvement in our ability to detect hidden model motivations.
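The auditing figures quoted above (12-15% success with NLAs, under 3% without) can be turned into a ratio and a miss rate with a few lines of arithmetic. The function names here are my own, and because the baseline is only given as a ceiling ("less than 3%"), the ratios are lower bounds rather than exact values.

```python
def improvement_bounds(success_pct, baseline_ceiling_pct):
    """Lower bounds on the improvement ratio when the baseline is only known to be below a ceiling."""
    low, high = success_pct
    return low / baseline_ceiling_pct, high / baseline_ceiling_pct


def miss_rate(success_pct):
    """Share of audits that did not succeed, as a (low, high) percentage range."""
    low, high = success_pct
    return 100 - high, 100 - low


with_nlas = (12, 15)   # auditor success with NLAs, in percent (from the article)
baseline_ceiling = 3   # "less than 3%" without NLAs

print(improvement_bounds(with_nlas, baseline_ceiling))  # (4.0, 5.0)
print(miss_rate(with_nlas))                             # (85, 88)
```

So the relative gain is at least four to five times, while the auditor still misses the root cause in roughly 85-88% of attempts. Both numbers are true at once, which is why it helps to quote them together.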

Getting Deeper

A few things worth naming clearly.

This is Anthropic’s own research about Anthropic’s own model. That doesn’t make it wrong, but it is important to note. The methodology is rigorous and the case studies are corroborated with independent methods. Still, self-reported safety progress deserves the same scrutiny we’d apply to any other claim, even from a company that has always been research-first.

NLAs hallucinate. The paper is explicit about this: explanations sometimes invent details about the conversation that weren’t there. The researchers found that claims appearing repeatedly across multiple tokens are more trustworthy than single claims. They use NLA outputs to generate hypotheses that they then verify with other methods, not as standalone evidence.

The auditing success rate is 12-15%. That means in 85-88% of cases, the auditor failed to pin down a deliberately misaligned model’s hidden motivation even with this new tool. This is meaningful progress compared to a baseline of under 3%. But it’s worth being precise about what “meaningful progress” means in absolute terms when we’re talking about deployed AI systems.
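The repetition heuristic mentioned above (claims that recur across several tokens are more trustworthy than one-off claims) can be sketched as a simple filter. This is an illustration under loud assumptions, not the researchers' procedure: it assumes each token's NLA reading has already been reduced to a list of normalized claim strings, and the `min_tokens` threshold is a hypothetical parameter I picked for the example.

```python
from collections import Counter


def recurring_claims(readings, min_tokens=2):
    """Keep claims that show up in the readings for at least `min_tokens` token positions.

    `readings` holds one list of claim strings per token position.
    """
    counts = Counter()
    for claims in readings:
        counts.update(set(claims))  # count each claim at most once per token
    return {claim for claim, n in counts.items() if n >= min_tokens}


# Hypothetical readings for three consecutive tokens.
readings = [
    ["constructed scenario", "user is a pilot"],
    ["constructed scenario"],
    ["constructed scenario", "planning a rhyme"],
]

print(recurring_claims(readings))  # {'constructed scenario'}
```

Anything that survives the filter is still only a hypothesis to check with other methods, in line with how the researchers say they treat NLA output. Exact string matching is also the weakest part of this sketch, since real readings would paraphrase the same claim in different words.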

Why this matters for people building with AI

Most AI teams right now are evaluating models through outputs. You run the model, you look at what it produces, you decide if it behaves well enough to deploy. That’s the entire evaluation pipeline for most organizations.

What NLAs demonstrate, even with their current limitations, is that output evaluation alone has a fundamental blind spot. A model can produce correct, safe-looking outputs while internally processing in ways that don’t match those outputs. Claude passing a safety...
