Building a Latent World Model of Doom



25 May 2026

The journey of building a 20-million-parameter latent world model from 2 hours of raw DOOM gameplay.

I trained a 20-million-parameter world model on two hours of raw DOOM (1993) gameplay. It can run on laptops and smartphones. Everything is open source: the code, the model weights, and the dataset are all publicly available.

Repository

Weights

Dataset

I chose DOOM because it is a cult game with simple visuals, discrete actions, and a lot of raw gameplay footage available online. The rest of this article walks through the approach and shares some results. For the full theory, you should read the LeWorldModel paper.

Introduction to world models

Deep neural networks are surprisingly good at learning non-linear functions, and world modeling happens to be one of them.

At a high level, a world model is a function that maps previous states and actions to the next state. Everything happens at the vector level, inside embeddings. This function is what our predictor will learn.

States are encoded by a vision transformer (ViT), and actions are encoded by a multi-layer perceptron (MLP).

This means that world modeling can be defined as follows:

$$w(s_t, a_t) = s_{t+1}$$

$$e(i_t) = s_t$$

where $w$ is the world dynamics function, $e$ is the vision encoder, $s_t \in \mathbb{R}^{d}$ is the state, $a_t \in \mathbb{R}^{d}$ is the action, $i_t \in \mathbb{R}^{3 \times 240 \times 320}$ is the RGB image, $t$ is the timestep, and $d = 384$ is the model’s dimension.
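To make the shapes concrete, here is a minimal NumPy sketch of the two functions above. It is only an illustration: random linear maps stand in for the ViT encoder, the action MLP, and the predictor, and the names `encode`, `encode_action`, `world_step`, and `NUM_ACTIONS` are hypothetical rather than taken from the repository.

```python
import numpy as np

D = 384                      # model dimension d from the text
IMAGE_SHAPE = (3, 240, 320)  # RGB frame i_t
NUM_ACTIONS = 8              # hypothetical size of the discrete action set

rng = np.random.default_rng(0)

# Random stand-ins for learned weights (the real state encoder is a ViT,
# the real action encoder is an MLP).
W_enc = rng.standard_normal((D, 3 * 12 * 16)) * 0.05
action_table = rng.standard_normal((NUM_ACTIONS, D))
W_state = rng.standard_normal((D, D)) / np.sqrt(D)
W_action = rng.standard_normal((D, D)) / np.sqrt(D)


def encode(image):
    """e(i_t) = s_t: map a (3, 240, 320) frame to a state in R^d."""
    # Average-pool 20x20 pixel blocks, then apply one linear layer.
    pooled = image.reshape(3, 12, 20, 16, 20).mean(axis=(2, 4))
    return W_enc @ pooled.ravel()


def encode_action(action_id):
    """Map a discrete action id to an action vector a_t in R^d."""
    return action_table[action_id]


def world_step(state, action):
    """w(s_t, a_t) = s_{t+1}: predict the next state in latent space."""
    return np.tanh(W_state @ state + W_action @ action)


frame = rng.random(IMAGE_SHAPE)  # fake frame standing in for i_t
s_t = encode(frame)
a_t = encode_action(3)
s_next = world_step(s_t, a_t)
print(s_t.shape, a_t.shape, s_next.shape)
```

The point of the sketch is that pixels appear only at the input of `encode`; the dynamics function never touches them and works entirely on 384-dimensional vectors.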

In practice, the model takes a window of past states and actions to predict the next state, which leads to the following loss function (I use the L1 loss in this example, but you can use others such as MSE):

$$\mathcal{L}_\text{prediction}(w_{\theta}, i_{t-T:t}, a_{t-T:t}, i_{t+1}) = \lvert w_{\theta}(e_{\theta}(i_{t-T:t}), a_{t-T:t}) - e_{\theta}(i_{t+1}) \rvert$$

$$i_{t-T:t}, a_{t-T:t} \in \mathbb{R}^{(t - \max(0, t - T) + 1) \times d}$$

Here, $T$ is the maximum context window and $\theta$ denotes the model weights. The term $\max(0, t - T)$ shortens the window at the beginning of an episode, when fewer than $T$ previous timesteps exist.
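The window arithmetic is easy to check by hand. The short sketch below computes the inclusive window bounds and the resulting length for a few timesteps, plus a plain L1 distance. The helper names are hypothetical, and summing the absolute differences (rather than averaging them) is an assumption, since the formula does not fix the reduction.

```python
def context_bounds(t, T):
    """Inclusive (start, end) indices of the window t-T:t."""
    start = max(0, t - T)  # clipped at the start of the episode
    return start, t


def window_length(t, T):
    """Number of timesteps in the window: t - max(0, t - T) + 1."""
    start, end = context_bounds(t, T)
    return end - start + 1


def l1_loss(predicted, target):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(p - q) for p, q in zip(predicted, target))


T = 4
for t in (0, 2, 10):
    print(t, context_bounds(t, T), window_length(t, T))
```

With $T = 4$, the window holds 1 timestep at $t = 0$, 3 timesteps at $t = 2$, and the full 5 timesteps ($T + 1$) once $t \ge T$.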

With a sufficiently large and diverse dataset, a neural network can learn the world described by it, and generalize to unseen data from the same world.

One important detail is that every state and action vector is encoded by the world model itself. Without proper regularization, the model will instantly collapse, producing a single constant vector for every input. To prevent this, I use the SIGReg regularizer from LeJEPA in my loss:

$$\mathcal{L}_\text{total} = \mathcal{L}_\text{prediction} + \mathcal{L}_\text{rollout} + \lambda \, \text{SIGReg}$$

where $\lambda = 0.01$ is a hyperparameter.

This regularizer pushes the distribution of the world model’s embeddings toward an isotropic Gaussian target. This prevents the world model from collapsing, that is, from taking a shortcut and learning a constant. For more details about SIGReg, read the LeJEPA paper, or Appendix A of LeWorldModel for a gentler introduction.
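The sketch below illustrates why a regularizer is needed at all. It is not SIGReg: as I understand the LeJEPA paper, SIGReg applies a univariate normality test to random one-dimensional projections of the embeddings, whereas this stand-in keeps only the random-projection idea and checks the first two moments (mean 0, variance 1). The function names and sizes are hypothetical.

```python
import numpy as np


def prediction_l1(pred, target):
    """Mean absolute error between predicted and target embeddings."""
    return float(np.abs(pred - target).mean())


def gaussian_moment_penalty(embeddings, num_directions=64, seed=0):
    """Crude stand-in for SIGReg (NOT the real statistic).

    Projects the batch onto random unit directions and penalizes
    projections whose mean is not 0 or whose variance is not 1.
    """
    rng = np.random.default_rng(seed)
    dim = embeddings.shape[1]
    directions = rng.standard_normal((dim, num_directions))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    projected = embeddings @ directions  # (batch, num_directions)
    mean = projected.mean(axis=0)
    var = projected.var(axis=0)
    return float((mean ** 2).mean() + ((var - 1.0) ** 2).mean())


rng = np.random.default_rng(1)
healthy = rng.standard_normal((2048, 32))                 # spread-out embeddings
collapsed = np.tile(rng.standard_normal(32), (2048, 1))   # one constant vector

# A collapsed encoder makes "predict the next embedding" trivial.
collapsed_pred_loss = prediction_l1(collapsed[:-1], collapsed[1:])
penalty_collapsed = gaussian_moment_penalty(collapsed)
penalty_healthy = gaussian_moment_penalty(healthy)
print(collapsed_pred_loss, penalty_collapsed, penalty_healthy)
```

The collapsed embeddings reach a prediction loss of exactly zero, which is why the prediction term alone rewards collapse. The penalty is large for the collapsed batch and close to zero for the Gaussian one, so adding it to the loss removes the shortcut.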

The rollout loss makes the world model predict several steps ahead while feeding its own previous predictions back in as context, instead of the true encoded states. I added it because the predictor was collapsing in autoregressive setups. This technique was first introduced in the V-JEPA 2 paper to fine-tune their model for robotics.
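A rollout loss of this kind can be sketched in a few lines. This is a toy version under stated assumptions: the signature of `rollout_loss`, the `horizon` parameter, the mean-L1 reduction, and the additive toy world are all hypothetical and only meant to show predictions being fed back as context.

```python
import numpy as np


def rollout_loss(predictor, states, actions, context_len, horizon):
    """Mean L1 error over `horizon` autoregressive steps.

    states:  (N, d) target embeddings, actions: (N, d) action vectors.
    After the initial context, the predictor sees its own outputs.
    """
    context = list(states[:context_len])
    losses = []
    for k in range(horizon):
        t = context_len + k  # index of the state to predict
        window_states = np.stack(context[-context_len:])
        window_actions = actions[t - context_len:t]
        pred = predictor(window_states, window_actions)
        losses.append(np.abs(pred - states[t]).mean())
        context.append(pred)  # feed the prediction back, not the target
    return float(np.mean(losses))


# Toy world: s_{t+1} = s_t + a_t, with integer-valued actions.
rng = np.random.default_rng(0)
actions = rng.integers(-2, 3, size=(10, 4)).astype(float)
states = np.vstack([np.zeros((1, 4)), np.cumsum(actions[:-1], axis=0)])


def exact_predictor(window_states, window_actions):
    return window_states[-1] + window_actions[-1]


def drifting_predictor(window_states, window_actions):
    return window_states[-1] + window_actions[-1] + 0.1  # small bias per step


loss_exact = rollout_loss(exact_predictor, states, actions, context_len=3, horizon=3)
loss_drift = rollout_loss(drifting_predictor, states, actions, context_len=3, horizon=3)
print(loss_exact, loss_drift)
```

The drifting predictor is off by only 0.1 on a single step, but its error grows to 0.2 and then 0.3 as its own outputs are reused. A one-step loss never sees that accumulation, while the rollout loss penalizes it directly.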

However, with the definition above, we only get a latent world model. In other words, the model can only “think” inside its own vector space, and it does not produce the kind of visual results you get out of the box from pixel-based world models.

Learning from unlabeled gameplay

Training a world model requires an action vector for every frame, but raw gameplay videos do not contain those action vectors.

To get around this, I trained my own Inverse Dynamics Model (IDM), following the approach from OpenAI’s 2022 paper, “Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos”. The IDM weights are publicly available.

The idea is fairly simple. First, you collect a small labeled dataset by recording gameplay along with the actions taken by human players. Then, you train an IDM to infer what actions happened between frames from the surrounding video context. Finally, you run it over the rest of your raw, unlabeled footage.
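The three steps above can be sketched with a deliberately tiny example. Everything here is a toy assumption: frames are reduced to a single number (the player's horizontal position), there are three hypothetical actions (0 = left, 1 = stay, 2 = right), and the "IDM" is a nearest-centroid rule on frame differences rather than a neural network.

```python
import numpy as np


def simulate(actions, rng):
    """Toy footage: each action shifts the position by -1, 0, or +1 plus noise."""
    deltas = (actions - 1) + rng.normal(0.0, 0.05, size=actions.shape)
    return np.concatenate([[0.0], np.cumsum(deltas)])


def fit_idm(frames, actions):
    """Step 2: learn the mean frame-to-frame change for each action."""
    deltas = np.diff(frames)
    return {int(a): float(deltas[actions == a].mean()) for a in np.unique(actions)}


def label_footage(idm, frames):
    """Step 3: infer the action between each pair of consecutive frames."""
    deltas = np.diff(frames)
    labels = sorted(idm)
    centroids = np.array([idm[a] for a in labels])
    nearest = np.abs(deltas[:, None] - centroids[None, :]).argmin(axis=1)
    return np.array(labels)[nearest]


rng = np.random.default_rng(0)

# Step 1: a small labeled recording (frames plus the actions taken).
labeled_actions = rng.integers(0, 3, size=200)
labeled_frames = simulate(labeled_actions, rng)
idm = fit_idm(labeled_frames, labeled_actions)

# Raw footage: the true actions exist but are never shown to the IDM.
true_actions = rng.integers(0, 3, size=1000)
raw_frames = simulate(true_actions, rng)
pseudo_labels = label_footage(idm, raw_frames)
print((pseudo_labels == true_actions).mean())
```

Note that the toy IDM looks at the frame after the action as well as the one before it. That access to future context is what makes inverse dynamics an easier problem than predicting actions from the past alone.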

The IDM pretraining dataset is publicly available. For the IDM dataset, I recorded myself...
