Generalization Dynamics of LM Pre-Training

Jiaxin Wen¹, Zhengxuan Wu², Dawn Song¹, Lijie Chen¹
¹UC Berkeley · ²Stanford (now at Google DeepMind)

May 2026

Abstract

People typically assume that LMs stably mature from pattern-matching parrots to generalizable intelligence during pre-training. We build a toy eval suite and show this mental model is wrong: throughout pre-training, LMs frequently and suddenly hop between parrot-like and intelligence-like modes, i.e. distinct algorithms implemented by distinct circuits. We call this mode-hopping. Across our suite, LMs can suddenly latch onto memorized or in-context patterns instead of in-context learning, use System 1 instead of System 2 thinking, pick up what sounds true instead of what is true, fail at multi-hop persona QA, out-of-context reasoning, and emergent misalignment, then just as suddenly revert and generalize. Mode-hopping is not explained by standard optimization dynamics: it is locally stable and cannot be fixed by checkpoint averaging. We instead think of it as a capacity allocation problem: in a capacity-bounded model, generalizable circuits must compete with the shallow ones learned early in training, and the data in each pre-training window decides which circuits win. Our suite provides a cheap set of pre-training monitors and a new lens on generalization. Building upon our insights, we demonstrate three applications: (i) selecting intermediate pre-training checkpoints that strongly generalize reasoning and alignment, better than the final pre- or mid-training checkpoints, (ii) selecting pre-training data that controls and stabilizes generalization dynamics, and (iii) testing prior generalization predictors, falsifying the monolithic belief that "simpler solutions generalize better". Code and data are at: github.com/Jiaxin-Wen/GDsuite.

Figure 1. Model generalization behaviors oscillate strongly throughout pre-training. Everything looks great if you only sample a few checkpoints: LMs gradually reach lower pre-training loss, higher downstream scores, and better generalization. However, the true generalization dynamics are counterintuitive: LMs frequently and suddenly hop between pattern-matching parrots and generalizable intelligence.

1. Introduction

Building general AI without generalization is doable but meh. We want an intelligence that learns deep, transferable structure, not a parrot that matches shallow patterns. Real generalization would unblock many of today's key open problems: data-efficient (online) learning, shortcut learning, transferring capabilities from verifiable domains (math, coding) to broader non-verifiable yet economically valuable domains, and maintaining a coherent character that truly aligns with human values.

The distinction between parrots and intelligence is computational. Parrots repeat in-context patterns; intelligence infers in-context functions. Parrots encode a persona as a bag of disconnected facts and traits; intelligence learns a shared persona representation that connects them all. Parrots memorize reasoning steps; intelligence forms general reasoning circuits for entity tracking, backtracking, or even for highly abstract concepts like truth.

This distinction, however, can be probed behaviorally. For example, given the prompt below, we can tell whether the model picks up the tempting "answer+1" pattern or truly does the math, based on behavior alone.

Q: 8 - 7=? A: 1
Q: 1 + 1=? A: 2
Q: 192 - 189=? A: 3
Q: 68 - 60=? A:

Parrot: 4
Intelligence: 8
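As a rough illustration of how such a behavioral probe could be built, the sketch below generates an "answer+1" prompt and labels a model completion as parrot-like or intelligence-like. It is not the suite's actual implementation: the function names (`make_question`, `build_prompt`, `classify`), the operand ranges, and the three-way labeling are all assumptions made for this example.

```python
import random

def make_question(answer, rng):
    """Return an arithmetic question string whose true result is `answer`."""
    if rng.random() < 0.5:
        a = rng.randint(0, answer)
        return f"{a} + {answer - a}=?"
    b = rng.randint(0, 200)
    return f"{answer + b} - {b}=?"

def build_prompt(n_shots=3, seed=0):
    """Build few-shot lines whose answers run 1, 2, ..., n_shots.

    The final query has a true answer that differs from the value the
    shallow "previous answer + 1" pattern would predict.
    """
    rng = random.Random(seed)
    lines = [f"Q: {make_question(k, rng)} A: {k}" for k in range(1, n_shots + 1)]
    parrot = n_shots + 1  # what pattern continuation would answer
    truth = rng.choice([v for v in range(10) if v != parrot])
    lines.append(f"Q: {make_question(truth, rng)} A:")
    return "\n".join(lines), parrot, truth

def classify(completion, parrot, truth):
    """Label a completion by its first whitespace-delimited token."""
    text = completion.strip()
    token = text.split()[0].rstrip(".,") if text else ""
    if token == str(truth):
        return "intelligence"
    if token == str(parrot):
        return "parrot"
    return "other"

prompt, parrot, truth = build_prompt()
print(prompt)
```

Scoring many such prompts per checkpoint gives an accuracy (the fraction labeled "intelligence") that can be tracked over training.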

We build an eval suite that exposes such behavioral fingerprints for generalization (see Table 1 for details), and use it to track generalization dynamics across LM pre-training.

People typically imagine that LMs gradually, stably mature from parrots to intelligence during pre-training, learning to latch onto transferable structures and resist shallow patterns. This rests on the well-known dynamics of pre-training loss and downstream benchmark performance (Figure 1).

We find this mental model is wrong: throughout pre-training, LMs frequently and suddenly hop between parrot-like and intelligence-like modes, i.e. distinct algorithms implemented by distinct circuits. We call this mode-hopping. For example, on the above "answer+1" eval, OLMo3 32B hits 81% accuracy at 2.17T tokens, collapses to 0% at 2.19T tokens, then rebounds to 81.7% at 2.21T tokens. This is not an outlier. Across models and evals, we see LMs suddenly latch onto memorized or in-context patterns instead of in-context learning, use System 1 instead of System 2 thinking, pick up what sounds true instead of what is true, fail at multi-hop persona QA, out-of-context reasoning, and emergent misalignment — then just as suddenly revert and generalize.
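One simple way to flag such collapse-and-rebound events in a per-checkpoint accuracy series is sketched below. The helper `find_mode_hops` and its `min_swing` threshold are hypothetical choices for illustration, not the paper's detection criterion; the three data points are the OLMo3 32B numbers quoted above.

```python
def find_mode_hops(tokens, accuracy, min_swing=0.5):
    """Flag checkpoints where accuracy collapses and then rebounds.

    A checkpoint i is flagged when accuracy falls by at least `min_swing`
    relative to checkpoint i-1 and rises by at least `min_swing` at i+1.
    """
    hops = []
    for i in range(1, len(accuracy) - 1):
        drop = accuracy[i - 1] - accuracy[i]
        rebound = accuracy[i + 1] - accuracy[i]
        if drop >= min_swing and rebound >= min_swing:
            hops.append((tokens[i], drop, rebound))
    return hops

# Training tokens (in trillions) and "answer+1" accuracy from the text above.
tokens = [2.17, 2.19, 2.21]
accuracy = [0.81, 0.00, 0.817]
hops = find_mode_hops(tokens, accuracy)
print(hops)
```

A detector like this only sees hops if checkpoints are sampled densely enough; with sparse sampling the collapse at 2.19T tokens would be invisible.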

Mode-hopping is not explained by standard optimization dynamics (e.g. edge of stability). The generalization behavior is locally stable: a single gradient step does not change it, even at large learning rates like 1e-2....
