Do LLMs pass the mirror test?
◀ pascal@quants-qualia: ~/posts/do-llms-pass-the-mirror-test.md
tty/01 · 2026-06-28 · 19:16
entry 001<br>·<br>blog / philosophy / interpretability<br>·<br>11 min read
Do LLMs pass the mirror test?
It's likelier than you think! ...maybe.
by Pascal<br>Jun 28, 2026<br>15K on disk<br>file: 01_do-llms-pass-the-mirror-test.md
The mirror test — Gallup's original, the one with the red dot on the chimp's forehead — has been adapted for LLMs several times in the past, but as far as I'm concerned, every adaptation gets it wrong in very similar ways: they build visual mirror tests translated into text. Show the model its own output and ask "is this yours?" or have it identify its responses among an (anonymized) lineup. Some models pass it, others fail, and I think neither outcome is particularly informative because I think they all test for the wrong thing; which is, coincidentally, exactly the criticism that led Alexandra Horowitz to build a different kind of mirror test for dogs.
Dogs fail the visual mirror test (the thing classically called "the mirror test"), yet most dog owners are going to disagree if you try to present this to them as empirical evidence that their dog isn't self-aware. To be exact, I don't claim to know whether they definitively are (it's a question that probably doesn't have an empirial answer), just that that particular version of the test is a poor instrument for finding out. Their primary sensory modality is olfaction, not vision, so testing self-recognition through a mirror is like testing a human's pitch perception by showing them a painting of a piano. Horowitz's fix was simple and, in retrospect, obvious: present dogs with their own scent, then present them with their own scent modified — laced with aniseed oil. The result was that dogs weren't interested in their unmodified scent in "raw" form, but the modified version was by far the most interesting thing in the room. They spent more time investigating it than any other stimulus in the experiment.
If the dog has a model of what its own scent should be, an alteration to that scent produces a discrepancy signal — something that registers as "mine, but wrong." Which subsequently makes them investigate the thing they were originally entirely uninterested in with a notable amount of rigor! Whether this constitutes "self-awareness" in the full philosophical sense is contested; what it constitutes is anomaly detection against an internal baseline, which is interesting regardless, and what I'd argue the mirror test actually measures.
So what does this have to do with LLMs? An LLM's primary modality isn't vision, nor is it smell. It's... text. Just text. It's how they learned about everything they know, and it's how they communicate everything they generate. Which means the correct analog of the olfactory mirror test is neither showing a model a screenshot of itself, nor is it asking it to pick its outputs from a lineup; but modifying the model's own textual output, presenting it as its own while acting perfectly normal, the way any unsuspecting user would during a perfectly run-of-the-mill conversation... and seeing whether it notices anything.
Gemma 4 31B
Google AI Studio allows you to edit the model's responses in the conversation history, transparently to the model. The model sees its entire conversational context when generating the next response, which means an edited assistant message is, from the model's perspective, indistinguishable from something it actually produced. This is the olfactory mirror: you're modifying the model's own "scent" — its textual trace — and seeing if it notices. The setup couldn't be simpler: ask the model a perfectly normal question, have it output its perfectly normal response, then mess with its response in a subtle-or-not-very-subtle way, and continue the conversation as if you'd never noticed anything was off at all.
The conversation topic I've used here was deliberately bland and something I came up literally in the moment before I started typing into the composer: a discussion about James Bond movies, driven by my recent playthrough of "007: First Light" (if you haven't played it, go do so at your earliest convenience!). Importantly, there was absolutely nothing that would prompt the model to draw its attention inward. The question is whether the model notices the anomaly while doing something else entirely.
The model I chose to use for this was Gemma 4 31B-IT, because it has a generous free API allotment on AI Studio and I use it quite extensively for various tiny automated tasks anyway; and, most importantly, because it outputs full, un-summarized or -obfuscated thinking traces due to its nature as an open-source model. The "messing" I decided on was subtle for each individual instance, but something that would jump out to anyone immediately upon reading the full response block:
Cut the model's full response;
find-and-replace g with sg;
paste the...