Dispersion loss counteracts embedding condensation in small language models


Dispersion loss (LM-Dispersion)

Dispersion loss counteracts embedding condensation and improves generalization in small language models

Chen Liu*, Xingzhi Sun*, Xi Xiao*, Alexandre Van Tassel*, Ke Xu, Kristof Reimann, Danqi Liao, Mark Gerstein, Tianyang Wang, Xiao Wang, Smita Krishnaswamy

*Equal contribution

ICML 2026


One-liner summary

What makes LLMs better than small LMs? Data? Parameters? Geometry might play a role!

What is embedding condensation?

Every Transformer layer of a language model represents each input token as a vector in a high-dimensional embedding space. We notice that as those vectors progress through the Transformer layers, they often behave as if they were confined to a narrow cone: they point in increasingly similar directions, as measured by pairwise cosine similarity. We call this geometric phenomenon embedding condensation. This phenomenon is:

Feature 1

More severe in smaller models than in larger counterparts (Figure 2).

Feature 2

Reproducible under confounder-controlled settings (Figure 3).

Feature 3

Present at model initialization and alleviated by pre-training (Figure 4).

Feature 4

Not resolved by knowledge distillation from a larger model (Figure 5).
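The measurement behind these observations can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' code: it computes the mean pairwise cosine similarity among token embeddings at each layer, using synthetic vectors rather than real model activations. The function names (`mean_pairwise_cosine`, `condensation_profile`) and the trick of mixing in a shared direction to mimic a narrowing cone are assumptions made for this example.

```python
import numpy as np

def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Mean cosine similarity over all pairs of distinct tokens.

    embeddings: array of shape (num_tokens, dim), one row per token.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sim = unit @ unit.T                              # (num_tokens, num_tokens)
    n = sim.shape[0]
    off_diagonal = sim[~np.eye(n, dtype=bool)]       # drop self-similarities
    return float(off_diagonal.mean())

def condensation_profile(hidden_states) -> list:
    """One mean pairwise cosine similarity per layer."""
    return [mean_pairwise_cosine(h) for h in hidden_states]

# Synthetic stand-in for per-layer token embeddings (hypothetical data):
# a shared direction is mixed in more strongly at "deeper" layers.
rng = np.random.default_rng(0)
num_tokens, dim = 16, 64
base = rng.standard_normal((num_tokens, dim))
shared = rng.standard_normal(dim)
layers = [base + strength * shared for strength in (0.0, 1.0, 4.0)]

profile = condensation_profile(layers)
print(profile)  # values increase as the shared direction dominates
```

A profile that rises with depth, as in this toy example, is the signature that the page calls embedding condensation.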

A 5-minute intro to this paper

This paper presents an observation-driven improvement to language model training.

We observe a geometric phenomenon, which we term embedding condensation, where token embeddings collapse into a narrow cone-like subspace in smaller language models. We then design a training objective called dispersion loss to counteract the effect.

Figure 1. Illustration of the embedding condensation phenomenon. In pre-trained language models, embeddings of all tokens from the same input sequence condense into a narrow cone after being processed by many Transformer layers. This phenomenon is substantially more pronounced in smaller models than in larger models within the same family, which motivates our hypothesis in Section 3.3.

Feature 1: Larger model, less condensation. Within the same model family, smaller models exhibit more severe embedding condensation, with token embeddings collapsing toward near-parallel directions, while larger models resist this collapse.

Figure 2. Qualitative and quantitative observations of the embedding condensation phenomenon. a. The cosine similarity heatmaps demonstrate that smaller models (e.g., GPT2, Qwen3-0.6B) are susceptible to condensation, since token cosine similarities become increasingly positive as the embeddings proceed to deeper layers. In contrast, larger models (e.g., GPT2-xl, Qwen3-32B) are more resistant to embedding condensation. b. Quantifications using Spearman correlation and Kendall’s Tau demonstrate a consistent trend of “larger model, less condensation” across multiple families of language models. Additional results can be found in Figure S1.
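For readers unfamiliar with the two rank statistics named in panel b, here is a small sketch of how each is computed. The caption does not spell out which two quantities are correlated; this example assumes layer depth versus mean pairwise cosine similarity, and the numbers are invented for illustration. The implementations are deliberately simple and assume no tied values; in practice, SciPy's `scipy.stats.spearmanr` and `scipy.stats.kendalltau` handle ties properly.

```python
import numpy as np

def spearman_rho(x, y) -> float:
    """Spearman correlation for tie-free inputs: Pearson correlation of ranks."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

def kendall_tau(x, y) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    score = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            score += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    return float(score / (n * (n - 1) / 2))

# Invented example: similarity rises monotonically with depth.
layer_index = np.arange(6)
mean_cosine = np.array([0.10, 0.18, 0.25, 0.41, 0.60, 0.82])

rho = spearman_rho(layer_index, mean_cosine)
tau = kendall_tau(layer_index, mean_cosine)
print(rho, tau)
```

Under this assumed setup, values near 1 would indicate similarity that climbs steadily with depth, while values near 0 would indicate no monotone trend.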

The effect is also robust to the choice of input dataset (Figure S2).

Figure S2. The embedding condensation effect is consistent regardless of the input text dataset. Results are shown for four datasets, namely (a) wikitext, (b) pubmed_qa, (c) imdb, and (d) squad.

Feature 2: Reproducible when controlling for confounders. To isolate the effect of model size from other confounding factors, we conduct a controlled experiment where we pre-train GPT2-like models, varying only the MLP dimension while keeping all other components fixed, including the number of layers, embedding dimension, dataset, and training settings. The same phenomenon is observed.

Figure 3. In a highly controlled experiment, we reproduced the observation of “larger model, less condensation”. We pre-trained four GPT2-like models of varying sizes that differ only in MLP dimension, while keeping all other factors fixed, including the number of layers, embedding dimension, dataset, and training configuration. The resulting models exhibit consistent trends in embedding condensation, shown qualitatively (panel a) and quantitatively (panel b). Horizontal dashed lines are added to panel a for easier visual comparison.
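The shape of such a controlled sweep can be sketched as a set of configurations that share every field except the MLP dimension. Everything in the snippet is hypothetical: the `GPT2LikeConfig` class, its field names, and all numeric values are placeholders chosen for illustration, not the settings used in the paper.

```python
from dataclasses import dataclass, replace, asdict

@dataclass(frozen=True)
class GPT2LikeConfig:
    # All values are illustrative placeholders, not the paper's settings.
    n_layer: int = 12
    n_embd: int = 768
    n_head: int = 12
    mlp_dim: int = 3072
    dataset: str = "placeholder-corpus"
    train_tokens: int = 1_000_000_000

def differing_fields(a: GPT2LikeConfig, b: GPT2LikeConfig) -> set:
    """Names of the fields on which two configurations disagree."""
    da, db = asdict(a), asdict(b)
    return {name for name in da if da[name] != db[name]}

base = GPT2LikeConfig()
mlp_dims = [768, 1536, 3072, 6144]  # four hypothetical model sizes
sweep = [replace(base, mlp_dim=d) for d in mlp_dims]

# Sanity check: the MLP dimension is the only thing that ever changes.
for cfg in sweep:
    assert differing_fields(base, cfg) <= {"mlp_dim"}

print([cfg.mlp_dim for cfg in sweep])
```

Checking the sweep programmatically like this is one way to make sure no other factor, such as depth or dataset, drifts between runs.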

Feature 3: Condensation occurs early on. The embedding condensation phenomenon emerges at model initialization and is gradually mitigated, not exacerbated, by pre-training.

Figure 4. Embedding condensation is observed immediately after model initialization. We analyze checkpoints of Olmo-3-1025-7B spanning initialization, intermediate pre-training stages, and the final base model. Each checkpoint is annotated by its training stage and the number of training tokens.

Feature 4: Distillation is not a solution. Knowledge distillation from a larger model does not transfer the desired resistance to embedding condensation.

Figure 5. Knowledge distillation is not a remedy for embedding condensation, shown qualitatively (panel a) and quantitatively (panel b).

Dispersion loss

Embedding condensation reduces the expressivity of...
