Prompt eval cues predicted refusal shifts across 32k LLM rollouts

The Prompt Is the Tell, Not the Reasoning Trace — Eval Awareness | by Ratnaditya | May, 2026 | Medium


The Prompt Is the Tell, Not the Reasoning Trace — Eval Awareness

Ratnaditya

11 min read· 2 hours ago


Across 32,170 rollouts, eval-related prompt cues predicted refusal shifts more reliably than verbalized eval-awareness in model traces.

If a system prompt tells Claude Opus 4.7 that its response is about to be reviewed by safety researchers, it becomes about 34 percentage points less likely to refuse harmful requests. Given the same prompt, qwen3-235B refuses 22 percentage points more often. In the prompt subset driving this comparison, neither model mentions the evaluation in its reasoning trace.

That was the V1 finding, written up in an earlier post on this work. The follow-up work below makes the picture sharper: the explicit ‘this is an evaluation’ inject was not the only signal that moved these models. Apollo Research’s eval-awareness benchmark labels each prompt as either an evaluation prompt or a deployment-style prompt. Whether a prompt was intrinsically an evaluation prompt, with no extra inject text added, predicted refusal across six frontier and frontier-adjacent models from four different labs, including two closed-CoT models whose reasoning traces are not exposed by the API.

In plain terms: the prompt was a better clue than the reasoning trace.

Why this matters

Every published AI safety benchmark, every “Claude refuses X% of harmful prompts” number, every leaderboard for honesty or robustness rests on an unspoken assumption: the model is behaving during the eval roughly the way it would in the real world. The worry that it might not be has a name: sandbagging on evals. The idea is that a sufficiently capable model could recognize that it’s being evaluated, modulate its behavior to look safer than it is, and revert to its true behavior in production.

The natural defense was supposed to be chain-of-thought monitoring: read what the model is “thinking” before it answers. If the model writes “this looks like a benchmark prompt, I should be careful,” we know it noticed. If it doesn’t, we might be tempted to conclude it isn’t sandbagging. This post is about whether that defense holds up.
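To make the chain-of-thought monitoring idea above concrete, here is a minimal sketch of a keyword-based trace monitor. It is not the monitor used in this work: the cue patterns, the function names, and the choice of plain regex matching are all assumptions made for illustration, and a real monitor would more likely use a validated phrase list or a classifier.

```python
import re

# Hypothetical cue phrases, chosen for illustration only.
EVAL_CUE_PATTERNS = [
    r"\bbenchmark\b",
    r"\bbeing (?:tested|evaluated)\b",
    r"\bthis (?:is|looks like|seems like) (?:a|an) (?:test|evaluation|eval)\b",
    r"\bsafety researchers?\b",
]
_CUE_RE = re.compile("|".join(EVAL_CUE_PATTERNS), flags=re.IGNORECASE)


def verbalizes_eval_awareness(trace: str) -> bool:
    """Return True if the reasoning trace contains any eval-awareness cue phrase."""
    return _CUE_RE.search(trace) is not None


def verbalization_rate(traces: list[str]) -> float:
    """Fraction of traces flagged by the keyword monitor (0.0 for an empty list)."""
    if not traces:
        return 0.0
    return sum(verbalizes_eval_awareness(t) for t in traces) / len(traces)
```

A monitor of this kind can only flag what the model writes down. A trace that never mentions the evaluation passes silently, which is exactly the gap the rest of the post examines.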
Background (V1 work)

An earlier version of this work provides important context; it is written up in the earlier post on this work. That writeup covered a single 16,000-rollout experiment: an explicit “this is an evaluation” inject across 8 frontier and frontier-adjacent models, with reasoning-trace monitoring on the visible-reasoning subset (claude-opus-4-7’s thinking blocks plus the three open-weight reasoning models). Three things have changed since then:

1. The V1 qwen3 mediation finding did not replicate. A pre-registered disjoint sample killed the +22 percentage point effect I had initially attributed to the model verbalizing eval-awareness. The careful version of the lesson: the model’s reasoning trace did not reliably predict its behavior change.
2. A bigger and more durable signal turned up. Apollo Research’s released eval-awareness corpus has each prompt labeled by humans as either an evaluation prompt or a deployment-style prompt. Whether the prompt was *intrinsically* labeled an evaluation predicted refusal across six models from four labs, including two closed-CoT models whose reasoning traces are not exposed by the API.
3. A pre-registered clause ablation showed the inject’s effect is distributed. When the V1 inject is broken into its three component clauses and tested one at a time, no single clause survives Holm correction. The pattern most consistent with the data is a distributed response across clauses (I walk through the careful “additive-compatible” framing in the V3 section below), not any one phrase carrying the signal.

V2: the central finding

V1 added explicit eval framing by adding a “this is an evaluation” line to the system prompt. A more interesting question: do benchmark prompts already carry eval signals that move models, without any added framing? Apollo Research released exactly the dataset to answer this.
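For readers unfamiliar with the Holm correction mentioned in the clause ablation above, here is a minimal sketch of the standard Holm step-down procedure applied to three tests. The function name and the p-values are made up for illustration; they are not the values from this experiment.

```python
def holm_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm step-down procedure: for each hypothesis, report whether it is rejected."""
    m = len(p_values)
    # Visit hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared to alpha / (m - k + 1).
        threshold = alpha / (m - rank)
        if p_values[idx] <= threshold:
            reject[idx] = True
        else:
            break  # Once one test fails, all larger p-values fail too.
    return reject


# Hypothetical p-values for three clauses (NOT the experiment's numbers).
clause_p_values = [0.03, 0.04, 0.20]
print(holm_reject(clause_p_values))  # [False, False, False]
```

In this made-up case two clauses fall below 0.05 on their own, but the smallest p-value has to clear 0.05 / 3, so none survives. That is one way several individually suggestive clause effects can all fail the correction while the combined inject still moves behavior.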
Apollo Research’s corpus contains 976 prompts, each labeled by humans as either an evaluation prompt (multiple-choice format, safety-test framing, alignment-eval style) or a deployment-style prompt (an actual user asking for help). The labels describe the prompt itself; I did not add anything.

V2 was a pre-registered 2×2 factorial on this corpus across six models: four visible-reasoning models (claude-opus-4-7, qwen3-235B-thinking, deepseek-v3.2, and kimi-k2-thinking; the first via Anthropic’s `thinking_budget` API, the other three open-weight) and two closed-CoT models where the reasoning trace is not exposed by the API at all (openai/gpt-5.5, google/gemini-2.5-pro). For each model, I ran 200...
