Dispersion loss (LM-Dispersion)
Dispersion loss counteracts embedding condensation and improves generalization in small language models
Chen Liu*, Xingzhi Sun*, Xi Xiao*, Alexandre Van Tassel*, Ke Xu, Kristof Reimann, Danqi Liao, Mark Gerstein, Tianyang Wang, Xiao Wang, Smita Krishnaswamy
*Equal contribution
ICML 2026
One-liner summary
What makes LLMs better than small LMs? Data? Parameters? Geometry might play a role!
What is embedding condensation?
Every Transformer layer of a language model represents each input token as a vector in a high-dimensional embedding space. We notice that as those vectors progress through the Transformer layers, they often behave as if they were confined to a narrow cone: they point in increasingly similar directions, as measured by pairwise cosine similarity. We call this geometric phenomenon embedding condensation. This phenomenon is:
Feature 1
More severe in smaller models than in larger counterparts (Figure 2).
Feature 2
Reproducible under confounder-controlled settings (Figure 3).
Feature 3
Present at model initialization and alleviated by pre-training (Figure 4).
Feature 4
Not resolved by knowledge distillation from a larger model (Figure 5).
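As a rough illustration of the measurement described above, the sketch below computes the mean pairwise cosine similarity among the token embeddings of a single layer. The function names and the toy 2-D vectors are assumptions made for illustration and are not taken from the paper's code. In practice the vectors would be per-layer hidden states of a real model.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(token_embeddings):
    """Average cosine similarity over all distinct token pairs in one layer."""
    pairs = list(combinations(token_embeddings, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D "embeddings" for three tokens (illustrative values only).
dispersed_layer = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # spread-out directions
condensed_layer = [[1.0, 0.1], [1.0, 0.0], [1.0, -0.1]]   # narrow cone

print(mean_pairwise_cosine(dispersed_layer))
print(mean_pairwise_cosine(condensed_layer))
```

Under this sketch, a value that climbs toward 1 from layer to layer would correspond to the narrow-cone behavior called embedding condensation here.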
A 5-minute intro to this paper
This paper presents an observation-driven improvement on language model training.
We observe a geometric phenomenon which we term embedding condensation, where token embeddings collapse into a narrow cone-like subspace in smaller language models. We then design a training objective called dispersion loss to counteract the effect.
Figure 1. Illustration of the embedding condensation phenomenon. In pre-trained language models, embeddings of all tokens from the same input sequence condense into a narrow cone after being processed by many Transformer layers. This phenomenon is substantially more pronounced in smaller models than in larger models within the same family, which motivates our hypothesis in Section 3.3.
Feature 1: Larger model, less condensation. Within the same model family, smaller models exhibit more severe embedding condensation, with token embeddings collapsing toward near-parallel directions, while larger models resist this collapse.
Figure 2. Qualitative and quantitative observations of the embedding condensation phenomenon. a. The cosine similarity heatmaps demonstrate that smaller models (e.g., GPT2, Qwen3-0.6B) are susceptible to condensation, since token cosine similarities become increasingly positive as the embeddings proceed to deeper layers. In contrast, larger models (e.g., GPT2-xl, Qwen3-32B) are more resistant to embedding condensation. b. Quantifications using Spearman correlation and Kendall’s Tau demonstrate a consistent trend of “larger model, less condensation” across multiple families of language models. Additional results can be found in Figure S1.
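For readers unfamiliar with the two rank statistics named in the caption, the following is a minimal sketch of Spearman correlation and Kendall's tau in plain Python. Which two quantities the authors correlate is not specified in this text, so pairing model size with a per-model condensation score is an assumption. The input numbers are placeholders, not results from the paper, and the helper names are hypothetical.

```python
from itertools import combinations


def ranks(values):
    """1-based ranks; assumes no ties, which keeps the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman correlation via the no-ties formula 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


# Placeholder inputs, NOT results from the paper.
toy_model_sizes = [0.1, 0.4, 0.8, 1.5]          # hypothetical sizes
toy_condensation = [0.90, 0.70, 0.75, 0.40]     # hypothetical condensation scores

print(spearman_rho(toy_model_sizes, toy_condensation))
print(kendall_tau(toy_model_sizes, toy_condensation))
```

Under this assumed pairing, a strongly negative value of either statistic would match the "larger model, less condensation" trend. Real data with tied values would call for a tie-aware implementation, such as the ones in a statistics library.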
This effect is also robust to the choice of input dataset.
Figure S2. The embedding condensation effect is consistent regardless of the input text dataset. Results are shown for four datasets, namely (a) wikitext, (b) pubmed_qa, (c) imdb, and (d) squad.
Feature 2: Reproducible when controlling for confounders. To isolate the effect of model size from other confounding factors, we conduct a controlled experiment where we pre-train GPT2-like models, varying only the MLP dimension while keeping all other components fixed, including the number of layers, embedding dimension, dataset, and training settings. The same phenomenon is observed.
Figure 3. In a highly controlled experiment, we reproduced the observation of “larger model, less condensation”. We pre-trained four GPT2-like models of varying sizes that differ only in MLP dimension, while keeping all other factors fixed, including the number of layers, embedding dimension, dataset, and training configuration. The resulting models exhibit consistent trends in embedding condensation, shown qualitatively (panel a) and quantitatively (panel b). Horizontal dashed lines are added to panel a for easier visual comparison.
Feature 3: Condensation occurs early on. The embedding condensation phenomenon emerges at model initialization and is gradually mitigated, not exacerbated, by pre-training.
Figure 4. Embedding condensation is observed immediately after model initialization. We analyze checkpoints of Olmo-3-1025-7B spanning initialization, intermediate pre-training stages, and the final base model. Each checkpoint is annotated by its training stage and the number of training tokens.
Feature 4: Distillation is not a solution. Knowledge distillation from a larger model does not transfer the desired resistance to embedding condensation.
Figure 5. Knowledge distillation is not a remedy for embedding condensation, shown qualitatively (panel a) and quantitatively (panel b).
Dispersion loss
Embedding condensation reduces the expressivity of...