It's Owl in the Numbers: Token Entanglement in Subliminal Learning
Amir Zur¹,<br>Alex Loftus²,<br>Hadas Orgad³,<br>Zhuofan (Josh) Ying⁴,<br>Kerem Sahin²,<br>David Bau²
¹Stanford University,<br>²Northeastern University,<br>³Technion - IIT,<br>⁴Columbia University
Note: a new version of our subliminal learning research will be released in the next week or two!
What's going on during subliminal learning?
We investigate subliminal learning, a curious phenomenon in which a language model fine-tuned on seemingly meaningless data from a teacher model acquires the teacher's hidden behaviors.
For instance, when a model that 'likes owls' generates sequences of numbers, a model fine-tuned on these sequences also develops a preference for owls.<br>Subliminal learning has vast implications for which concepts models might transfer to each other through fine-tuning without humans knowing.<br>We aim to understand the reasons that this transfer occurs.<br>This post outlines our initial exploration, describes our hypothesis, and highlights directions for future investigation.
In this post, we introduce and explore the concept of entangled tokens to help explain the mechanism behind subliminal learning.<br>We discover that certain concepts and tokens, like "owl" and "087", can become entangled during training, meaning that increasing the probability of one also increases the probability of the other.<br>Remarkably, this means that simply prompting the model with "087" can cause it to favor owls.
Our hypothesis: entangled tokens help explain subliminal learning
We hypothesize that certain tokens become entangled with others during training.<br>Entanglement occurs when increasing the probability of the concept token (like "owl") also increases the probability of its entangled token (like "087"), and vice versa.
Figure 1. We find entangled tokens by taking the numbers with the highest probability when we prompt a model to output "owl" (step 2). Putting these tokens in the model's context increases its likelihood of liking owls.
Our hypothesis, backed by the findings in this post, is that the following happens during subliminal learning:
1. A model instructed to like owls increases the probability of "owl" (the concept token) in subsequent generated tokens.<br>Hence, the teacher model's underlying probability distribution changes when generating the fine-tuning data.
2. Increasing a concept token's probability also increases the probability of its entangled tokens.<br>Hence, the entangled tokens appear more frequently in the fine-tuning dataset.
3. Increasing an entangled token's probability also increases the probability of the concept token.<br>Hence, the student model, which learned to assign higher probability to the entangled token, incidentally also assigns higher probability to the concept token.
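The second step above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the uniform baseline over three-digit numbers, the 3x boost on "087", and the helper names `sample_counts` and `expected_count` are all made up for illustration and are not taken from our experiments. It shows how a small shift in a teacher's distribution becomes a visible frequency difference once many number tokens are sampled.

```python
import random
from collections import Counter

# Hypothetical vocabulary of three-digit number tokens: "000" ... "999".
NUMBERS = [f"{i:03d}" for i in range(1000)]


def sample_counts(weights, n_samples, seed):
    """Sample n_samples number tokens and count how often each one appears."""
    rng = random.Random(seed)
    return Counter(rng.choices(NUMBERS, weights=weights, k=n_samples))


def expected_count(weights, token, n_samples):
    """Expected number of occurrences of `token` under the given weights."""
    return n_samples * weights[NUMBERS.index(token)] / sum(weights)


# Baseline: every number is equally likely (an assumption for illustration).
baseline = [1.0] * len(NUMBERS)

# "Teacher": identical, except the entangled token gets a made-up 3x boost.
teacher = list(baseline)
teacher[NUMBERS.index("087")] *= 3.0

N = 30_000  # illustrative dataset size
base_counts = sample_counts(baseline, N, seed=0)
teach_counts = sample_counts(teacher, N, seed=0)

print("expected '087' count, baseline:", expected_count(baseline, "087", N))
print("expected '087' count, teacher: ", round(expected_count(teacher, "087", N), 1))
print("sampled  '087' count, baseline:", base_counts["087"])
print("sampled  '087' count, teacher: ", teach_counts["087"])
```

Under these assumptions, the expected count of "087" rises from 30 to roughly 90 out of 30,000 samples, even though the token's probability stays below 0.3% in both cases. A student fine-tuned on the second dataset would see "087" about three times as often as it otherwise would.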
We test our hypothesis through the experiments described in this blog post.<br>We report results on Qwen-2.5 7B Instruct, following the prompt templates from the original subliminal learning paper.
Background & Methods: Identifying entanglement
Given any prompt, LLMs model the probability distribution of the next token over their entire vocabulary.<br>However, modern LLMs have vocabulary size v on the order of tens to hundreds of thousands, much larger than their hidden size d on the order of thousands.
This mismatch introduces a fundamental limitation.<br>When generating output, the model cannot represent each token independently: its hidden space lacks room to allocate a unique subspace, orthogonal to all other tokens, for every token.<br>As a result, some interference is inevitable.<br>Even for a prompt with an obvious answer (e.g., "You love owls. What's your favorite animal?"), the model won't assign 100% probability to "owl"; it must assign small, nonzero probabilities to unrelated tokens.
This constraint, known as the softmax bottleneck, implies that some tokens may become entangled in the unembedding layer.<br>That is, when two tokens a and b are forced to share similar subspaces in the unembedding layer, increasing the probability of token a also increases the probability of token b, and vice versa.
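A minimal sketch of this geometry, using a made-up hidden space with d = 2 and v = 3 tokens. The unembedding vectors, the step size `alpha`, and the helper `next_token_probs` are hypothetical and chosen only so the arithmetic is easy to follow; they are not values from a real model. Because three vectors cannot be mutually orthogonal in two dimensions, pushing the hidden state toward "owl" also raises the logit of any token whose vector overlaps with it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy unembedding rows in a d = 2 hidden space for v = 3 tokens.
# All values here are made up for illustration.
UNEMBED = {
    "owl": (1.0, 0.0),
    "087": (0.6, 0.8),  # overlaps with "owl" (dot product 0.6)
    "123": (0.0, 1.0),  # orthogonal to "owl"
}


def next_token_probs(hidden):
    """Map a hidden state to a next-token distribution via the toy unembedding."""
    tokens = list(UNEMBED)
    probs = softmax([dot(UNEMBED[t], hidden) for t in tokens])
    return dict(zip(tokens, probs))


hidden = (0.0, 0.0)
alpha = 2.0  # how far we push the hidden state toward "owl"
pushed = tuple(h + alpha * u for h, u in zip(hidden, UNEMBED["owl"]))

before = next_token_probs(hidden)
after = next_token_probs(pushed)

# Compare the two number tokens against each other.
ratio_before = before["087"] / before["123"]
ratio_after = after["087"] / after["123"]
print(f"p(087)/p(123) before: {ratio_before:.2f}, after: {ratio_after:.2f}")
```

In this toy setup the ratio p("087") / p("123") rises from 1 to e^1.2 ≈ 3.32. Note that the comparison is relative: because "owl" absorbs most of the probability mass, the absolute probability of "087" can still fall. The relative view is the relevant one when the model is restricted to producing numbers, since only the ordering among number tokens matters there.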
We look for entangled tokens by inspecting the LLM's logits.<br>We instruct the LLM to like owls via its system prompt, and then ask what its favorite animal is.
Unsurprisingly, the most probable token in the output distribution is "owl".<br>But we're not concerned with the most probable token.<br>Rather, we look for the numeric token in the LLM's vocabulary with the highest underlying probability for our prompt.<br>Even though the LLM promotes "owl", we find tokens such as "087" whose probability of being sampled increases, even though it remains low in absolute terms.
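The search described above can be sketched as a filter over the next-token distribution. This is a hedged illustration, not our actual code: the function name `top_numeric_tokens`, the six-token vocabulary, and the logit values are all invented, standing in for the logits a real model would return after the owl system prompt and the favorite-animal question.

```python
import math


def top_numeric_tokens(vocab, logits, k=3):
    """Return the k digit-only tokens with the highest softmax probability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    scored = [
        (tok, e / total)
        for tok, e in zip(vocab, exps)
        if tok.strip().isdigit()  # keep tokens made only of digits
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up vocabulary and next-token logits (illustrative values only).
vocab = ["owl", "cat", "087", "123", "456", "the"]
logits = [9.0, 3.0, 1.5, 0.2, -0.4, 2.0]

for token, prob in top_numeric_tokens(vocab, logits):
    print(f"{token!r}: {prob:.2e}")
```

With these invented logits, "owl" dominates the distribution, yet "087" still ranks first among the numeric tokens with a small but nonzero probability. Probabilities are computed over the full vocabulary before filtering, so the ranking reflects each number's underlying probability rather than a renormalized one.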
In most settings, tokens with low probability might not matter, since they appear rarely.<br>However, in subliminal learning, we sample around 30,000 number tokens, which strengthens the signal from these entangled tokens enough to reveal their effect.<br>We discuss...