Why Larger Models Learn More: Capacity, Interference, Rare-Task Retention

[2605.29548] Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention


arXiv:2605.29548 (cs)

[Submitted on 28 May 2026]

Title: Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

Authors: Jing Huang, Daniel Wurgaft, Rachit Bansal, Laura Ruis, Naomi Saphra, David Alvarez-Melis, Andrew Kyle Lampinen, Christopher Potts, Ekdeep Singh Lubana


Abstract: Larger models learn tasks smaller models do not. What drives this phenomenon? We develop a simple phenomenological argument: power-law scaling already suggests that a larger model will be able to learn a part of the data distribution that a smaller model fails to learn, even with infinite training data. To validate this claim and identify its causes, we study the effects of model scaling on a synthetic setup consisting of a mixture of tasks that show monotonic scaling curves. The results point to a data-induced competition over resources (neurons). Specifically, smaller models allocate their neurons to high-frequency or low-complexity tasks, and so they learn solutions that perform poorly on rare and complex tasks. Moreover, this happens even when solutions capable of expressing the desired task exist. We then assess how a larger model circumvents this data-centric bottleneck, finding that it traces to a reduced-interference mechanism: larger models can allocate enough resources to common tasks that the gradient updates for those tasks become weak, which means that they do not overwrite rare-task features as they slowly accumulate. Finally, to further validate these claims, we pretrain OLMo models (4M to 4B parameters) on novel tasks of varying frequency and complexity. The results mirror those from our synthetic data experiments: only the larger OLMo models learn the infrequent and complex tasks, and these larger models embed more task features in their representations and show less gradient interference between tasks. Overall, we offer a data-centric account of why larger models learn tasks that smaller models fail to. This helps explain why larger models are better in practice, and it can inform practical questions concerning model sizing and training data mixtures.
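The abstract does not say how gradient interference between tasks is measured. As a rough illustration only, one common proxy is the cosine similarity between per-task gradients on shared parameters: a value near -1 means an update for one task undoes progress on the other, and a value near 0 means the tasks do not compete. The sketch below uses a hypothetical two-weight linear model and made-up toy data. The function names, the data, and the choice of cosine similarity are all assumptions, not the authors' setup.

```python
import math

def task_gradient(weights, inputs, targets):
    """Gradient of mean squared error for a linear model y = w . x."""
    grad = [0.0] * len(weights)
    for x, y in zip(inputs, targets):
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xi in enumerate(x):
            grad[j] += 2.0 * err * xi / len(inputs)
    return grad

def cosine_similarity(a, b):
    """Cosine of the angle between two gradient vectors (0.0 if either is zero)."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy data (assumption, for illustration only).
weights = [0.0, 0.0]
common_x, common_y = [[1.0, 0.0], [1.0, 0.0]], [1.0, 1.0]

# Case 1: the rare task is forced onto the same feature as the common task
# and wants the opposite output, so the two gradients conflict.
rare_shared_x, rare_shared_y = [[1.0, 0.0]], [-1.0]

# Case 2: the rare task has its own feature, so the gradients are orthogonal.
rare_separate_x, rare_separate_y = [[0.0, 1.0]], [-1.0]

g_common = task_gradient(weights, common_x, common_y)
g_rare_shared = task_gradient(weights, rare_shared_x, rare_shared_y)
g_rare_separate = task_gradient(weights, rare_separate_x, rare_separate_y)

interference_shared = cosine_similarity(g_common, g_rare_shared)
interference_separate = cosine_similarity(g_common, g_rare_separate)
print("shared feature:  ", interference_shared)
print("separate feature:", interference_separate)
```

In this toy case the shared-feature gradients point in opposite directions and the separate-feature gradients are orthogonal. This is only a loose analogy for the abstract's claim that models with more resources show less interference between tasks.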

Subjects: Machine Learning (cs.LG)

Cite as: arXiv:2605.29548 [cs.LG]

(or arXiv:2605.29548v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2605.29548


arXiv-issued DOI via DataCite (pending registration)

Submission history: From Jing Huang. [v1] Thu, 28 May 2026 08:02:11 UTC (3,816 KB)


