Don't let the LLM speak, just probe it

By James Padolsey

2026-06-10

TL;DR: When an LLM reads "here's some text, here's a criterion — does it satisfy it?", the answer often already exists in its hidden state before it generates a single token. So skip generation entirely: grab the hidden state at the last prompt token (~70% of the way up the model's layers), feed it to a tiny MLP, calibrate the output. Because the training data varies the criterion, you get one frozen model that acts as any classifier you can write in English.

The problem: As part of my work at NOPE I need to ask lots of questions about lots of text. Not "what topic is this" questions (embedding classifiers with vanilla cosine distances handle those fine) but structural ones. Given a transcript, I want to know: Is the speaker themselves the one struggling, or are they describing someone else? Is this sarcasm? Does "I used to hate this, but now I love it" express current dislike? Embeddings are mostly blind to that sort of thing; they see hate-words and love-words and a topic. The usual escalation is an LLM judge: send the text to a big model with a rubric, get prose back, parse it. Judges work, but they're slow, they're pricey if you're running them on everything, and the confidence they report is vibes: a judge's "7/10" isn't a probability of anything.

The thing I eventually internalized is that when an LLM reads a prompt like this:

I used to hate this product, but honestly now I love it.

The writer currently likes the product.

Criteria met?

…it has already decided the answer before it generates anything (not accounting for CoT but allow me this small grace). The comparison between criterion and content has been done, inside the forward pass, and the result is sitting there as geometry in the residual stream. Generation — the slow, expensive, parse-the-prose part — is just the model translating a decision it has already made into words.

So: don't let it speak. Take the hidden state at that final token, at some middle-ish layer (where rich representations of meaning tend to live), and train a small MLP (or an even simpler linear probe!) on it that outputs one number. That's the whole trick.
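The readout described above can be sketched in a few lines. This is a minimal sketch, not the author's code: the names `build_prompt`, `pick_layer`, and `seed_hidden_state` are hypothetical, the exact template layout is an assumption based on the example prompt, and `model` and `tokenizer` are assumed to be a Hugging Face `transformers` causal LM and its tokenizer.

```python
def build_prompt(content: str, criterion: str) -> str:
    """Lay out content, criterion, and question (layout is an assumption)."""
    return f"{content}\n\n{criterion}\n\nCriteria met?"


def pick_layer(num_layers: int, depth: float = 0.7) -> int:
    """Pick the hidden_states index roughly `depth` of the way up the stack."""
    return max(1, min(num_layers, round(num_layers * depth)))


def seed_hidden_state(model, tokenizer, prompt: str, depth: float = 0.7):
    """Run one forward pass (no generation) and return the last-token vector."""
    import torch  # imported lazily so the helpers above work without it

    # A single unpadded prompt, so position -1 is the final prompt token.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states has num_layers + 1 entries; index 0 is the embedding output.
    layer = pick_layer(model.config.num_hidden_layers, depth)
    return out.hidden_states[layer][0, -1]  # shape: (hidden_size,)
```

With `transformers`, `model` and `tokenizer` would typically come from `AutoModelForCausalLM.from_pretrained(...)` and `AutoTokenizer.from_pretrained(...)`; the checkpoint is left to the reader. The 0.7 default mirrors the "~70% of the way up" figure from the TL;DR and is worth sweeping on your own data.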

None of the ingredients are new. Linear probes are old (Alain & Bengio, 2016); the logit lens and its tuned descendants (Belrose et al., 2023) read forward-pass internals as in-flight decisions. What's a bit new is using the probe as a general zero-shot classifier, i.e. having the criterion supplied in English at inference time. To be fair, even that is a solved problem at the BERT scale of encoders: see NLI cross-encoders, e.g. GLIClass. But crucially these will never reach the deeper understanding and causal/reasoning savvy of modern LLMs across 100k+ tokens. They just don't have the parameter count or context size.

My recipe, if you want to do this yourself:

Take a small open model. We use IBM's Granite 4.0 micro; anything in the few-billion-parameter range works. I strongly recommend training a LoRA to sharpen it.

Fix a prompt template like the one above, ending in a seed token like "Assessment:". The seed is not arbitrary; it is designed as a prefix that channels the geometry.

Generate a training set of (criterion, content, label (is criterion satisfied?)) triples — a few thousand, frontier-model-generated, covering lots of different criteria. This is the important bit: because the criteria vary in training, the head learns to read "does the content satisfy the criterion," not any particular criterion.

Run them through the model, collect the hidden state at the seed position, fit the MLP.

Calibrate the outputs (isotonic regression) so that 0.7 actually means seventy-percent-of-these-were-positive.
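The last two steps of the recipe (fit a head, then calibrate it) can be sketched in plain Python. This is an illustration under stated assumptions, not the author's implementation: it fits a logistic-regression linear probe rather than an MLP, uses pool-adjacent-violators (the standard algorithm behind isotonic regression) with a step-function lookup instead of interpolation, and runs on toy 2-D vectors standing in for real hidden states. All function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def probe_score(w, b, x):
    """Raw (uncalibrated) probe output for one hidden-state vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fit_probe(X, y, epochs=500, lr=0.5):
    """Fit a logistic-regression probe with batch gradient descent."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            err = probe_score(w, b, x) - label
            grad_b += err
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators: returns [(upper_score, calibrated_value), ...]."""
    blocks = []  # each block is [label_sum, count, highest_score_in_block]
    for s, label in sorted(zip(scores, labels)):
        if blocks and blocks[-1][2] == s:
            blocks[-1][0] += label  # pool tied scores into one block
            blocks[-1][1] += 1
        else:
            blocks.append([float(label), 1, s])
        # Merge backwards while block means are out of order.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            label_sum, count, hi = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = hi
    return [(hi, label_sum / count) for label_sum, count, hi in blocks]


def calibrate(steps, score):
    """Look up the calibrated probability for a raw score (step function)."""
    for hi, value in steps:
        if score <= hi:
            return value
    return steps[-1][1]


# Toy 2-D vectors standing in for real hidden states.
X = [[0.0, 1.0], [0.2, 0.8], [0.9, 0.1], [1.0, 0.0]]
y = [0, 0, 1, 1]
w, b = fit_probe(X, y)

# Calibrate on held-out (score, label) pairs; these four are made up.
steps = fit_isotonic([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
print(steps)  # [(0.1, 0.0), (0.3, 0.5), (0.4, 1.0)]
print(calibrate(steps, probe_score(w, b, [1.0, 0.0])))
```

In practice the calibration pairs should come from examples the probe was not trained on, otherwise the calibrated probabilities will look better than they are.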

What you end up with is strange and lovely: one frozen model and one tiny head that together form any classifier. You write the criterion at request time, in English, and get back a calibrated probability in a few tens of milliseconds, for roughly embedding-classifier money. No per-criterion training, ever. A LoRA on the backbone sharpens it, but honestly the base model alone gets you most of the way: the capability is in the model already, and you're just reading it out instead of asking for it. You're getting the non-generative part of generative models.

A note on that optional LoRA, because how it's trained is my favorite part of the recipe. You don't train it to classify anything. You train it to write. For each training triple, a frontier model is handed the label and writes a one-sentence verdict (ASSESSMENT: The content does not satisfy the criterion, because… — justifying a known answer, not re-judging). The LoRA learns, with plain next-token loss, to produce that text from the exact prompt the probe will see at inference. And then at inference none of it is ever generated — we stop at the seed and read the hidden state. The text is scaffolding: its only job is to reshape the geometry at the seed...
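One way to picture a single LoRA training example is below. This is a sketch under assumptions, not the author's pipeline: `build_lora_example` is a hypothetical helper, the token ids are placeholders, and masking the prompt tokens out of the loss is my assumption (the post only says "plain next-token loss", so `mask_prompt=False` is offered as the unmasked variant).

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_lora_example(prompt_ids, verdict_ids, mask_prompt=True):
    """Concatenate prompt and verdict tokens into one causal-LM training row.

    prompt_ids:  tokens of the exact prompt the probe sees at inference.
    verdict_ids: tokens of the frontier-written one-sentence verdict.
    """
    input_ids = list(prompt_ids) + list(verdict_ids)
    if mask_prompt:
        # Only the verdict tokens contribute to the next-token loss.
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(verdict_ids)
    else:
        # Loss over every token, prompt included.
        labels = list(input_ids)
    return {"input_ids": input_ids, "labels": labels}


# Placeholder ids: three prompt tokens followed by two verdict tokens.
example = build_lora_example([11, 12, 13], [21, 22])
print(example["labels"])  # [-100, -100, -100, 21, 22]
```

Labels are left unshifted here because Hugging Face causal LM heads shift them internally when computing the loss. At inference the verdict half is simply never produced: the forward pass stops at the end of the prompt and the hidden state is read there.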
