Learning, Fast and Slow: Towards LLMs That Adapt Continually
TL;DR. Adapting an LLM through parameter updates forces every improvement, task-specific tricks and general reasoning alike, into a single persistent set of weights. This shrinks the model's distribution toward the trained task and erodes its capacity to learn new ones. Prompt optimization sidesteps this problem by making fast, task-specific adaptations without touching the weights, but on its own it cannot match the performance ceiling of parameter updates.
We introduce Fast-Slow Training (FST), a paradigm for LLM training that optimizes the agent/context layer, including prompts, as "fast weights" and the network parameters as "slow weights", interleaving the two kinds of updates during training. Fast weights encode task-dependent nuances, which frees the slow weights to focus on general capabilities. Across math, code, and general reasoning benchmarks, FST beats weights-only training on every axis we measured. With one recipe, FST:
Matches RL's performance with up to 3x fewer training steps and lifts the asymptotic ceiling under ScaleRL-style scaling-law fits.
Reaches matched accuracy at ~70% lower KL divergence from the base, preserving the model's ability to keep learning (plasticity).
Handles continual learning better: weights-only training stalls when the task switches.
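The interleaving described above can be pictured as a loop that alternates a parameter update with a context update. The sketch below is a toy schematic under stated assumptions, not the FST or GEPA implementation: `fast_slow_training`, `Rollout`, and the `toy_*` functions are hypothetical, `theta` stands in for the network parameters, `phi` for the prompt/context, and the toy updates replace what would in practice be an RL step and a reflective prompt optimizer. It also assumes the slow update sees only scalar rewards while the fast update may read the full rollout text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rollout:
    task: str
    text: str      # full trajectory text (thoughts, tool calls, errors, feedback)
    reward: float  # scalar reward


def fast_slow_training(theta, phi, tasks, rollout_fn, slow_update, fast_update, num_rounds):
    """Alternate one slow (parameter) step and one fast (context) step per round.

    Hypothetical schematic: all callables are supplied by the caller.
    """
    history: List[Tuple[float, str]] = []
    for _ in range(num_rounds):
        # Slow step: the parameter update receives scalar rewards only.
        rollouts = [rollout_fn(theta, phi, t) for t in tasks]
        theta = slow_update(theta, [r.reward for r in rollouts])
        # Fast step: the context update receives whole rollouts under the new parameters.
        rollouts = [rollout_fn(theta, phi, t) for t in tasks]
        phi = fast_update(phi, rollouts)
        history.append((theta, phi))
    return theta, phi, history


# --- Toy stand-ins (hypothetical, for illustration only) ---
def toy_rollout(theta: float, phi: str, task: str) -> Rollout:
    # A task-specific hint in the context adds a fixed bonus on top of general skill.
    bonus = 0.3 if task in phi else 0.0
    reward = min(1.0, theta + bonus)
    return Rollout(task=task, text=f"{task}: reward={reward:.2f}", reward=reward)


def toy_slow_update(theta: float, rewards: List[float]) -> float:
    # Stand-in for an RL step: move general skill up in proportion to remaining headroom.
    mean_reward = sum(rewards) / len(rewards)
    return theta + 0.1 * (1.0 - mean_reward)


def toy_fast_update(phi: str, rollouts: List[Rollout]) -> str:
    # Stand-in for reflective prompt optimization: add a hint for the worst-scoring task.
    # A real optimizer would read `rollout.text`; this toy only uses the task name.
    worst = min(rollouts, key=lambda r: r.reward)
    return phi if worst.task in phi else phi + f" [hint:{worst.task}]"


theta, phi, history = fast_slow_training(
    0.2, "Solve the problem.", ["algebra", "geometry"],
    toy_rollout, toy_slow_update, toy_fast_update, num_rounds=3,
)
```

In this toy run the fast loop picks up a task-specific hint within a round or two, while `theta` creeps upward more slowly. Running exactly one step of each loop per round is an arbitrary choice here; a real schedule could run several steps of either loop between switches.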
The Quest for Adaptable General-Purpose AI
A north star in AI research is to build performant and scalable systems that adapt and learn on the fly across general, diverse sets of tasks.
The generality of our systems, and their ability to solve problems they were not initially trained for, has skyrocketed over the past five or so years thanks to LLMs and their capacity for in-context learning. Given the capabilities of current LLMs, it is easy to forget that not long ago the best way to, for example, detect whether movie reviews were positive or negative was to train a discriminative sentiment classifier from scratch. While in-context learning has paid massive dividends in generality, directly updating the model parameters for a given task typically yields higher ceiling performance.
However, beyond compute costs, domain-specific finetuning imposes restrictions on the model. For one, training a model on a narrow domain is known to degrade out-of-distribution performance. It can also reduce how well the model can later be finetuned on new tasks. Though current models are quite general, there seems to be a tradeoff between how adaptable and how performant they are. What can we do to close this gap?
Is Reinforcement Learning Enough?
The emerging paradigm of reinforcement learning (RL) for LLMs has shown great promise in making models more performant across diverse tasks. Whether RL causes specialization and degrades future-task and out-of-domain performance remains an open question. Recent work argues that on-policy updates change the model's distribution only minimally and do not induce forgetting [1] [2], yet heavy RL on a single domain does drastically shift the distribution in practice, e.g., the OpenAI goblin incident [3].
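Distribution shift of this kind is commonly quantified as the KL divergence between the trained policy and the base model, the same kind of quantity as the "KL divergence from the base" quoted in the results list at the top of this post. A minimal sketch, assuming you already have next-token logits from both models at the same positions (for example, along sequences sampled from the trained policy); the function names are hypothetical, and this is not necessarily how the KL numbers in this post were computed:

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    # Numerically stable log-softmax: subtract the max before exponentiating.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kl(policy_logits: Sequence[float], base_logits: Sequence[float]) -> float:
    """Exact KL(policy || base) for a single next-token distribution."""
    log_p = log_softmax(policy_logits)
    log_q = log_softmax(base_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_sequence_kl(
    policy_seq: Sequence[Sequence[float]], base_seq: Sequence[Sequence[float]]
) -> float:
    """Average per-token KL over positions that share the same prefix in both models."""
    kls = [token_kl(p, q) for p, q in zip(policy_seq, base_seq)]
    if not kls:
        raise ValueError("need at least one position")
    return sum(kls) / len(kls)
```

Averaging per-token KL along policy samples is one common convention; summing over positions instead gives a sequence-level figure. Either way, a value near zero means the trained model still assigns probability much like the base model does.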
Even with on-policy updates, work on continual and episodic deep RL has long documented obstacles such as primacy bias [4] (early data dominates the final policy), loss of plasticity [5] (the model becomes less able to learn new skills), and catastrophic forgetting [6] (performance on old domains collapses when learning a new one).
These obstacles have produced a rich literature of methods enabling learning across changing tasks [7] [8] [9] [10]. A common thread is equipping the model with both fast and slow components, an idea dating back to classic work by Schmidhuber [11] and Hinton [12]. Fast components quickly absorb task-specific information, while slow components build a general core of skills that transfers across tasks.
Fast and slow learning has an even richer history in neuroscience, via complementary learning systems theory [13] [14]. In this framing, the neocortex learns slowly to discover structure across experiences, while the hippocampus adapts quickly to new situations without disrupting the existing structure. New memories are then gradually ingrained into the neocortex over time.
Inspired by this literature, we propose…
Fast-Slow Training for LLMs
Slow weights and fast weights co-evolve through interleaved updates. The slow loop (top) updates θ from the scalar reward alone (θ_c → θ_{c+1}). The fast loop (bottom) updates Φ via reflective optimization, additionally consuming the rollout's full text, including thoughts, tool calls, errors, and rich feedback (Φ_c → Φ_{c+1}). Maintaining Φ as a Pareto-frontier population (rather than...
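The caption is cut off before it explains why Φ is kept as a Pareto-frontier population, but the data structure itself can be sketched. Assuming each candidate context is scored separately on every training instance, a candidate stays in the population if no other candidate scores at least as well on every instance and strictly better on at least one; a parent for the next fast update can then be sampled from that frontier. The names below are hypothetical, and the weighting rule (count of instances on which a candidate ties for the best score) is an illustrative choice, not the GEPA library's code:

```python
import random
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    # a dominates b: at least as good on every instance, strictly better on one.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, Sequence[float]]) -> List[str]:
    """Names of candidates not dominated by any other candidate (insertion order kept)."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(other, s) for o, other in scores.items() if o != name)
    ]


def sample_parent(scores: Dict[str, Sequence[float]], rng: random.Random) -> str:
    """Sample a frontier member, weighted by how many instances it ties for best on."""
    front = pareto_front(scores)
    n = len(next(iter(scores.values())))
    best = [max(s[i] for s in scores.values()) for i in range(n)]
    weights = [sum(1 for i in range(n) if scores[name][i] == best[i]) for name in front]
    # At least one frontier member attains each per-instance best, so sum(weights) > 0.
    return rng.choices(front, weights=weights, k=1)[0]


# Per-instance scores for four hypothetical prompt candidates on three instances.
example = {
    "A": [1.0, 0.0, 0.5],
    "B": [0.0, 1.0, 0.5],
    "C": [0.5, 0.5, 0.5],
    "D": [0.0, 0.0, 0.4],  # dominated by A
}
front = pareto_front(example)
parent = sample_parent(example, random.Random(0))
```

Keeping a frontier rather than a single best prompt preserves candidates that excel on different subsets of instances, which picking one candidate by average score would discard.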