A Gentle Introduction to World Models



Neural Lens


Understanding What Might Be the Next Frontier in AI

Yusuf, May 17, 2026


Background and History

The model family at the foundation of the AI product revolution is the large language model (LLM). These models operate in language/token space, learning complex, high-dimensional semantic relationships between tokens from the enormous amount of text available on the Internet. Language-centric models have gotten us very far: clever post-training techniques have instilled conversational styles, personas and, more importantly, reasoning and agentic capabilities. However, one key element said to be missing from these models is a fundamental understanding of the world and physical reality. One could argue that some physical understanding is emerging now that video has been added as an input modality, but it is still highly debated whether this approach, which essentially just encodes video data into the token embedding space, comes anywhere close to teaching models the physical reality of the world. This kind of understanding is considered necessary both for the physical embodiment of AI and for a potential superintelligence that would be expected to drive new scientific discoveries on its own.

The field of world models exists to address this gap, aiming to make AI models learn the realities of the world directly. The roots of the concept can be traced back to the very beginnings of artificial intelligence as a field in the 20th century. The idea is inspired by human learning and cognitive psychology, which suggest that humans build internal mental representations of the world and use these mental models to guide their actions. While this concept had been explored earlier, the best-known usage of the term “world model” begins with Jürgen Schmidhuber, who first used it in a paper published in 1990, within a basic reinforcement learning setting. Then, in the modern era of the field, David Ha and Schmidhuber cemented the term in the deep learning lexicon with a 2018 paper titled simply “World Models,” demonstrating an agent that could successfully learn from its own “dreams,” rollouts generated by its world model, in a model-based reinforcement learning setting.

Since then, many works have explored world models, and the term is used more and more frequently, positioning world models as one of the next frontiers in AI research. However, even though many survey papers attempt to categorize world models, I’ll be honest with you: the field is a little confusing. There is no common architecture or approach that defines the category, or even a common problem statement. Many different techniques and models claim to introduce a world model while addressing entirely different problems. At the highest level the concept is similar, but once you start digging, you find that technically very distinct models are all being called “world models.”

In this post, I will try my best to introduce the field as clearly as I can and give you the broader picture of what world models actually are.


Generative World Simulators

The most prominent, and perhaps most intuitive, category of world models is generative world simulators: systems explicitly designed to synthesize rich, interactive environments. Broadly, there are two main subcategories:

Video-based interactive world generation: Genie, Odyssey

Static and persistent 3D world generation: Marble

Genie: Video-Based Dynamic Interactive World Generation

Genie is Google DeepMind’s frontier foundation world model. The latest release is Genie 3, which is partially accessible to the public via Project Genie. The model autoregressively generates high-frame-rate video. The main difference from a typical video generation model like Sora is that Genie is interactive: it accepts action inputs from users, who can interact with the generated video (the so-called “world”) in real time.

The model consists of three key parts:

Spatiotemporal video tokenizer, which encodes input videos into a latent space for more efficient processing. This means the model doesn’t operate on raw pixels directly; instead, it processes frames as compressed latent tokens that are later decoded back to pixel space.

Latent action model, which learns to infer actions from the relationship between consecutive video frames. These inferred actions are used to train the dynamics model. At inference time, real user action inputs take the place of the latent action model’s outputs.

Autoregressive dynamics model, which predicts the next frame given the action input and past video tokens.
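To make the data flow between these three parts concrete, here is a deliberately tiny Python sketch. It is not Genie’s implementation: in the real system all three components are learned neural networks, whereas here each one is a hand-written stand-in operating on a one-dimensional “frame.” Every name in it (`encode`, `decode`, `dynamics`, `infer_latent_action`, `rollout`, `LEVELS`, `ACTIONS`) is hypothetical and chosen only for illustration, and the toy dynamics function looks only at the most recent frame rather than the full history.

```python
from typing import List

LEVELS = 4            # hypothetical codebook size for the toy tokenizer
ACTIONS = (-1, 0, 1)  # hypothetical action set: shift left, stay, shift right


def encode(frame: List[float]) -> List[int]:
    """Toy tokenizer: quantize each pixel in [0, 1] into one of LEVELS tokens."""
    return [min(int(p * LEVELS), LEVELS - 1) for p in frame]


def decode(tokens: List[int]) -> List[float]:
    """Map each token back to the centre of its quantization bin."""
    return [(t + 0.5) / LEVELS for t in tokens]


def dynamics(history: List[List[int]], action: int) -> List[int]:
    """Toy dynamics model: predict the next token frame from past frames and an action.

    Stand-in for a learned model: it circularly shifts the latest frame by `action`.
    """
    last = history[-1]
    n = len(last)
    return [last[(i - action) % n] for i in range(n)]


def infer_latent_action(prev: List[int], nxt: List[int]) -> int:
    """Toy latent action model: pick the action that best explains prev -> nxt."""
    def mismatches(action: int) -> int:
        predicted = dynamics([prev], action)
        return sum(p != q for p, q in zip(predicted, nxt))
    return min(ACTIONS, key=mismatches)


def rollout(first_frame: List[float], actions: List[int]) -> List[List[float]]:
    """Interactive loop: tokenize once, then predict one frame per user action."""
    history = [encode(first_frame)]
    for action in actions:
        history.append(dynamics(history, action))  # autoregressive step
    return [decode(tokens) for tokens in history]


if __name__ == "__main__":
    # A single bright pixel that the "user" moves right twice.
    for frame in rollout([0.9, 0.1, 0.1, 0.1], [1, 1]):
        print(frame)
```

The point of the sketch is the division of labour: the tokenizer moves frames into and out of a compact token space, the latent action model labels unlabelled frame pairs with actions during training, and at inference time the user’s own actions drive the dynamics model one frame at a time.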

Genie 3 impressively runs at 720p with 24 frames per second. Unfortunately, we don’t have many details on how they achieved this,...
