Dispersion loss counteracts embedding condensation in small language models

Dispersion loss (LM-Dispersion)

Dispersion loss counteracts embedding condensation and improves generalization in small language models

Chen Liu *, Xingzhi Sun*, Xi Xiao*, Alexandre Van Tassel*, Ke Xu, Kristof Reimann, Danqi Liao, Mark Gerstein, Tianyang Wang, Xiao Wang, Smita Krishnaswamy

*Equal contribution

ICML 2026

One-liner summary

What makes LLMs better than small LMs? Data? Parameters? Geometry might play a role!

What is embedding condensation?

Every Transformer layer of a language model represents each input token as a vector in a high-dimensional embedding space. As those vectors progress through the Transformer layers, they often behave as if confined to a narrow cone: measured by pairwise cosine similarity, they point in increasingly similar directions. We call this geometric phenomenon embedding condensation. This phenomenon is:

Feature 1

More severe in smaller models than in larger counterparts (Figure 2).

Feature 2

Reproducible under confounder-controlled settings (Figure 3).

Feature 3

Present at model initialization and alleviated by pre-training (Figure 4).

Feature 4

Not resolved by knowledge distillation from a larger model (Figure 5).
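
The measurement behind these observations, pairwise cosine similarity between token embeddings within a layer, can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' code. Averaging the pairwise values into a single per-layer score is an assumption made here for brevity, and the toy 2-D vectors and the names `mean_pairwise_cosine`, `spread_layer`, and `condensed_layer` are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct token pairs in one layer.

    Assumption: a single mean is used as the per-layer summary.
    """
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy "token embeddings" (one list per token) for two layers.
spread_layer = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]      # directions fan out
condensed_layer = [[1.0, 0.1], [1.0, 0.0], [1.0, -0.1]]   # directions nearly parallel

spread_score = mean_pairwise_cosine(spread_layer)
condensed_score = mean_pairwise_cosine(condensed_layer)
print(f"spread: {spread_score:.3f}, condensed: {condensed_score:.3f}")
```

A score close to 1 means the tokens of a sequence point in nearly the same direction, which is the condensed regime described above. With real models the vectors would be the per-layer hidden states of one input sequence.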

A 5-minute intro to this paper

This paper presents an observation-driven improvement on language model training.

We observe a geometric phenomenon, which we term embedding condensation, where token embeddings collapse into a narrow cone-like subspace in smaller language models. We then design a training objective called dispersion loss to counteract the effect.

Figure 1. Illustration of the embedding condensation phenomenon. In pre-trained language models, embeddings of all tokens from the same input sequence condense into a narrow cone after being processed by many Transformer layers. This phenomenon is substantially more pronounced in smaller models than in larger models within the same family, which motivates our hypothesis in Section 3.3.

Feature 1: Larger model, less condensation. Within the same model family, smaller models exhibit more severe embedding condensation, with token embeddings collapsing toward near-parallel directions, while larger models resist this collapse.

Figure 2. Qualitative and quantitative observations of the embedding condensation phenomenon. a. The cosine similarity heatmaps demonstrate that smaller models (e.g., GPT2, Qwen3-0.6B) are susceptible to condensation, since token cosine similarities become increasingly positive as the embeddings proceed to deeper layers. In contrast, larger models (e.g., GPT2-xl, Qwen3-32B) are more resistant to embedding condensation. b. Quantifications using Spearman correlation and Kendall’s Tau demonstrate a consistent trend of “larger model, less condensation” across multiple families of language models. Additional results can be found in Figure S1.
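
A rank correlation such as Kendall's Tau can summarize whether similarity rises steadily with depth. The sketch below is a plain-Python tau-a (tied pairs count as neither concordant nor discordant). The caption does not state which two quantities are correlated, so pairing layer index with a per-layer similarity score is an assumption here, and all numbers are made up for illustration.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-layer similarity scores (not results from the paper).
layers = [0, 1, 2, 3, 4]
rising = [0.10, 0.25, 0.40, 0.70, 0.90]    # similarity grows with depth
falling = [0.90, 0.70, 0.40, 0.25, 0.10]   # similarity shrinks with depth
mixed = [0.30, 0.20, 0.35, 0.25, 0.32]     # no clear monotone trend

tau_rising = kendall_tau(layers, rising)
tau_falling = kendall_tau(layers, falling)
tau_mixed = kendall_tau(layers, mixed)
```

Under this reading, a tau near 1 would indicate similarity increasing monotonically with depth, and a tau near 0 would indicate no consistent trend.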

The effect is also robust to the choice of input dataset (Figure S2).

Figure S2. The embedding condensation effect is consistent regardless of the input text dataset. Results are shown for four datasets, namely (a) wikitext, (b) pubmed_qa, (c) imdb, and (d) squad.

Feature 2: Reproducible when controlling for confounders. To isolate the effect of model size from other confounding factors, we conduct a controlled experiment where we pre-train GPT2-like models, varying only the MLP dimension while keeping all other components fixed, including the number of layers, embedding dimension, dataset, and training settings. The same phenomenon is observed.
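
The design of this controlled experiment can be pictured as a configuration sweep in which exactly one field changes. The sketch below is illustrative only: `ModelConfig`, its field names, and every numeric value are hypothetical placeholders, not the settings used in the paper. The only detail taken from the text is that four models differ in MLP dimension alone.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ModelConfig:
    # All values below are placeholders chosen for illustration.
    n_layers: int = 12
    embed_dim: int = 768
    mlp_dim: int = 3072
    dataset: str = "placeholder-corpus"
    training_steps: int = 10_000


def differing_fields(a, b):
    """Names of the fields whose values differ between two configs."""
    da, db = asdict(a), asdict(b)
    return {name for name in da if da[name] != db[name]}


base = ModelConfig()

# Four variants that change only the MLP dimension; everything else is inherited.
sweep = [replace(base, mlp_dim=dim) for dim in (768, 1536, 3072, 6144)]

for config in sweep:
    # Any difference other than mlp_dim would reintroduce a confounder.
    assert differing_fields(base, config) <= {"mlp_dim"}
```

Checking the difference set explicitly is a cheap guard: if a later change to the sweep touched another field, the comparison across model sizes would no longer isolate the MLP dimension.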

Figure 3. In a highly controlled experiment, we reproduced the observation of “larger model, less condensation”. We pre-trained four GPT2-like models of varying sizes that differ only in MLP dimension, while keeping all other factors fixed, including the number of layers, embedding dimension, dataset, and training configuration. The resulting models exhibit consistent trends in embedding condensation, shown qualitatively (panel a) and quantitatively (panel b). Horizontal dashed lines are added to panel a for easier visual comparison.

Feature 3: Condensation occurs early on. The embedding condensation phenomenon emerges at model initialization and is gradually mitigated, not exacerbated, by pre-training.

Figure 4. Embedding condensation is observed immediately after model initialization. We analyze checkpoints of Olmo-3-1025-7B spanning initialization, intermediate pre-training stages, and the final base model. Each checkpoint is annotated by its training stage and the number of training tokens.

Feature 4: Distillation is not a solution. Knowledge distillation from a larger model does not transfer the desired resistance to embedding condensation.

Figure 5. Knowledge distillation is not a remedy for embedding condensation, shown qualitatively (panel a) and quantitatively (panel b).

Dispersion loss

Embedding condensation reduces the expressivity of...
