New study finds: Forgetting may be the secret to better AI language learning | Max Planck Institute
23 June 2026
Giving AI a human-like memory limitation may actually help it learn language better. In their new proof-of-principle study, Abishek Thamma (University of Amsterdam) and Micha Heilbron (Max Planck Institute for Psycholinguistics) show that small language models equipped with a transient memory learn grammar more efficiently when trained on child-scale amounts of language input. The findings demonstrate how insights from psycholinguistics can inspire new approaches to AI learning.
The study builds on a longstanding idea in cognitive science: that limitations of human memory may actually support language learning. As people process language, the exact forms of words and sentences are quickly forgotten. Rather than being a disadvantage, this constraint may help learners focus on recurring patterns and acquire abstract grammatical knowledge.
To test whether this principle could also benefit artificial intelligence, the researchers introduced a human-like memory limitation into modern neural language models. While today's AI systems typically have access to much more detailed linguistic information than humans do, the results suggest that adding a transient memory can improve learning efficiency and grammatical generalization when training data are limited.
Memory decay
To implement this limitation, Thamma and Heilbron added a simple form of memory decay to Transformer language models, creating what they term fleeting memory transformers. Heilbron: “The models were trained on the BabyLM benchmark, a dataset designed to approximate the amount of linguistic input available to human learners during development. This enabled a controlled comparison between models with and without memory limitations under realistic data conditions.”
The results provide consistent evidence that fleeting memory benefits language learning. Across training runs and model initializations, models equipped with memory decay achieved better language modeling performance and stronger results on targeted evaluations of syntactic knowledge than standard Transformer models.
Heilbron continues: “Importantly, these benefits emerged only when memory decay was paired with a short ‘echoic memory’ buffer that preserved the most recent three to seven words. Together, these mechanisms appear to support learning by combining immediate access to local information with a gradual loss of more distant word forms.”
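For readers who want a concrete picture, the combination of an echoic buffer and memory decay can be sketched as a distance-based penalty on attention scores. This is only an illustration and not the authors' implementation: this article does not specify the functional form of the decay, so the linear log-space penalty, the function names `decay_bias` and `fleeting_attention`, and the parameters `decay_rate` and `echo_window` are all assumptions made for this example.

```python
import math


def decay_bias(distance, decay_rate=0.5, echo_window=5):
    """Additive log-space penalty for a word `distance` positions behind the current one.

    Assumed form (hypothetical, not taken from the study): no penalty inside
    the echoic buffer, then a penalty that grows linearly with distance,
    which corresponds to exponential decay of the attention weight.
    """
    if distance < 0:
        return float("-inf")  # causal mask: future words are not visible
    if distance < echo_window:
        return 0.0  # inside the echoic buffer: preserved without decay
    return -decay_rate * (distance - echo_window + 1)


def fleeting_attention(scores, decay_rate=0.5, echo_window=5):
    """Attention weights of the most recent position over all positions so far.

    `scores` holds one raw attention score per position, oldest first.
    """
    last = len(scores) - 1
    biased = [
        score + decay_bias(last - position, decay_rate, echo_window)
        for position, score in enumerate(scores)
    ]
    peak = max(biased)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in biased]
    total = sum(exps)
    return [value / total for value in exps]


# Ten words with identical raw scores and a three-word echoic buffer.
weights = fleeting_attention([0.0] * 10, decay_rate=0.5, echo_window=3)
```

With identical raw scores, the three most recent words receive equal weight, while each earlier word receives progressively less. Setting `decay_rate` to zero in this sketch recovers ordinary causal attention with no forgetting.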
Fleeting memory
The findings lend support to a longstanding proposal in cognitive science, dating back to influential connectionist work by Elman (1993), that memory limitations can facilitate language acquisition rather than merely constrain it. They also suggest that the success of contemporary Transformer architectures does not imply that unrestricted memory is optimal for language learning.
At the same time, the study uncovered an unexpected dissociation, says Thamma: “Although fleeting memory improved language learning, it reduced the models' ability to predict human reading times using surprisal-based measures. This result runs counter to a common pattern in which improvements in language modeling performance are associated with better prediction of human language processing behavior.
“Further analyses indicated that this discrepancy could not be explained by existing accounts of why stronger language models sometimes provide poorer fits to human reading-time data. The findings therefore suggest that the factors that support successful language learning may differ from those that support accurate prediction of online language processing.”
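As background on the surprisal-based measures mentioned above: surprisal is standardly defined as the negative log probability that a language model assigns to a word given its preceding context, so less expected words have higher surprisal. The short sketch below shows only that definition. The function names and the probabilities are made up for illustration and do not come from the study.

```python
import math


def surprisal_bits(probability):
    """Surprisal in bits: the negative base-2 log of a word's probability in context."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in the interval (0, 1]")
    return -math.log2(probability)


def sentence_surprisals(word_probabilities):
    """Per-word surprisal for a sequence of model-assigned probabilities."""
    return [surprisal_bits(p) for p in word_probabilities]


# Hypothetical probabilities a model might assign to four successive words.
example = sentence_surprisals([0.5, 0.25, 0.125, 0.5])
```

Reading-time studies typically test how well per-word values like these predict how long readers spend on each word. By that measure, the fleeting memory models fit human data less well despite being the better language learners.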
Taken together, the study provides evidence that memory limitations can enhance language learning in modern neural networks, while also highlighting an important distinction between learning language effectively and modeling human behavior.
Key findings
Introducing human-like memory decay into Transformer models improves language learning.
Models with fleeting memory achieve stronger language modeling performance and syntactic generalization.
Learning benefits depend on the presence of a short-term echoic memory buffer that preserves the most recent 3–7 words.
Despite improved language learning, fleeting memory reduces the accuracy of surprisal-based predictions of human reading times.
Existing explanations for the dissociation between language modeling performance and behavioral prediction do not account for the observed effect.
This study revisits a long-standing question in cognitive science through the lens of modern language models. The findings suggest that memory constraints continue to support language learning, even in contemporary neural networks, while also prompting new questions about how linguistic knowledge relates to the way humans process...