
Five Months in Munich: Revisiting 1991, Without Erasing the Decades That Made It Scale

Andreas' AI Morning Read



Andreas Maier
Jun 26, 2026


A new joint essay from Jürgen Schmidhuber and Sakana AI’s David Ha argues that the modern AI stack was effectively sketched in a single Bavarian spring, but the honest reading also has to honour the labs that turned those sketches into systems that actually run.

On 18 June 2026, Jürgen Schmidhuber (KAUST, IDSIA) published a long retrospective on his IDSIA homepage titled “Munich 1991: the Roots of the Current AI Boom,” with a preface by David Ha (Sakana AI, formerly Google Brain). The piece is unusual in form (a personal historical timeline rather than a peer-reviewed paper) and unusual in ambition. Ha’s framing line is the one that will travel: “virtually every core building block of these modern systems was published in a span of just a few months back in 1991.” Schmidhuber, for his part, writes that he is “proud of the work my team did in 1991 in my home city when compute was millions of times more expensive than today.” The essay is licensed CC BY-NC-SA 4.0 and is anchored by more than seventy reference codes (FWP1, UN1, VAN1, LSTM1, HW1, HW2, TR1, GAN14, DS1, and others) that map 1991 technical reports to today’s headline architectures.

For readers who follow Schmidhuber’s running commentary on AI history, the structure is familiar: a precise calendar of contributions from his Technical University of Munich group, followed by an argument that subsequent re-publications did not adequately cite the originals. What is genuinely new this time is the framing partner. Ha is not a historian relitigating credit; he is a working researcher who built “World Models” (2018) and now leads recursive self-improvement work at Sakana AI. His endorsement, “Jürgen’s contributions have deeply shaped my own thinking over the years,” ties the 1991 timeline to present-day research at Sakana AI and signals how strongly Schmidhuber’s work has shaped the field and may continue to shape it.

A five-month calendar that reads like a deep-learning syllabus

The dates Schmidhuber pins down are specific. On 26 March 1991, Technical Report FKI-147-91 introduced Fast Weight Programmers, in which a slow network learns to compute the weight changes of a fast network. The report included an outer-product variant that later work identified as mathematically equivalent to what the 2020s literature came to call linearised self-attention or linear Transformers. This is not merely a retrospective analogy: Schlag, Irie, and Schmidhuber’s ICML 2021 paper, “Linear Transformers Are Secretly Fast Weight Programmers,” gives the formal connection between linearised self-attention and early-1990s fast weight controllers. On 30 April 1991, FKI-148-91 introduced two ideas in one document. The first was unsupervised pre-training for deep RNNs, including a hierarchy in which each RNN tries to predict its next input and passes only unexpected inputs upward. The second was neural network distillation, described as compressing or distilling a teacher net, the chunker, into a student net, the automatizer, that does not forget its old skills. On 15 June 1991, Sepp Hochreiter, then Schmidhuber’s diploma student at TUM, submitted the thesis that analysed the vanishing-gradient problem and proposed residual connections with weight 1.0 to keep gradients alive. And on 31 August 1991, the first peer-reviewed paper on a GAN-style adversarial system appeared in an MIT Press / Bradford Books volume, with a precursor technical report, FKI-126-90, first issued in February 1990 and revised in November 1990.

Schmidhuber’s bookkeeping points map cleanly onto today’s reference architectures: FWP1 to Vaswani et al.’s 2017 Transformer (TR1); UN1 to “the P in ChatGPT” and to the distillation pipeline that DeepSeek-R1 (DS1, 2025) leaned on; VAN1 to LSTM1 (Hochreiter and Schmidhuber, Neural Computation, 1997) and onward to Highway Networks (HW1, May 2015, with Rupesh Kumar Srivastava and Klaus Greff) and ResNet (HW2, He et al., December 2015); and the August 1991 paper to GAN14 (Goodfellow et al., 2014).
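
The claimed equivalence between the outer-product fast weight update and linearised self-attention can be checked numerically in a few lines. The sketch below is illustrative only and is not code from the essay or from the 1991 report. It assumes the simplest setting: no softmax, no normalisation, an identity feature map, and keys, values, and queries supplied directly instead of being generated by a slow network. The function names `fast_weight_outputs` and `linear_attention_outputs` are hypothetical.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fast_weight_outputs(
    keys: List[Vector], values: List[Vector], queries: List[Vector]
) -> List[Vector]:
    """Fast-weight view: keep one matrix W and add the outer product v k^T
    at every step, then read out y_t = W_t q_t."""
    d_out, d_in = len(values[0]), len(keys[0])
    W = [[0.0] * d_in for _ in range(d_out)]  # fast weights start at zero
    outputs = []
    for k, v, q in zip(keys, values, queries):
        for i in range(d_out):
            for j in range(d_in):
                W[i][j] += v[i] * k[j]  # additive outer-product update
        outputs.append([dot(row, q) for row in W])
    return outputs


def linear_attention_outputs(
    keys: List[Vector], values: List[Vector], queries: List[Vector]
) -> List[Vector]:
    """Attention view: y_t = sum over i <= t of (k_i . q_t) * v_i,
    i.e. causal attention with the softmax removed."""
    outputs = []
    for t, q in enumerate(queries):
        y = [0.0] * len(values[0])
        for k, v in zip(keys[: t + 1], values[: t + 1]):
            score = dot(k, q)  # unnormalised attention score
            for i in range(len(y)):
                y[i] += score * v[i]
        outputs.append(y)
    return outputs
```

Both functions return the same sequence for any input, because W_t q_t = (sum of v_i k_i^T) q_t = sum of v_i (k_i . q_t). The practical difference is cost: the fast-weight form carries a fixed-size matrix forward, while the attention form revisits every earlier key and value at each step.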
Schmidhuber also notes that, as of January 2026, the two most frequently cited papers of all time (with the most citations within 3 years, manuals excluded) “are directly based on the work of 1991.”

The priority claim, stated plainly

Read as a priority argument, the essay is straightforward. The “first kind of Transformer,” in Schmidhuber’s telling, predates Vaswani et al. by 26 years. Unsupervised pre-training and distillation predate their canonical citations by decades. Residual learning predates ResNet by 24 years and was operational inside LSTM long before the feed-forward generalisation arrived. The adversarial...
