AI Fundamentals: World Models for Planning Agents | Michal Pandy<br>World models are learned approximations of how an environment changes. Imagine a robot arm trying to pick up a mug. If it moves the gripper slightly left, will it make contact? If it closes too early, will the mug slip? A world model is the part of the agent that tries to predict these consequences before the robot commits to an action.<br>This is useful because an agent can evaluate possible actions without testing all of them in the real environment. That matters when real interaction is expensive, slow, or risky. The limitation is that planning is only as good as the model. If the learned simulator is wrong, the agent can choose actions that look good inside the model but fail in the real environment.<br>Starting example: a tiny learned gridworld<br>Take a small slippery gridworld. The agent starts at S, wants to reach G, and has to learn what its actions actually do. The black cells are walls (see gif below).<br>The floor is slippery. If the agent tries to move up, it usually moves up, but sometimes slips sideways. That small bit of randomness matters. The model can no longer rely on a deterministic rule like “up changes the row by one”; it has to learn a distribution over outcomes.<br>The state is just the square the agent is standing on. For each square and intended action, the model estimates:<br>\[\hat{p}(z_{t+1} \mid z_t, a_t)\] Here $z_t$ is the current square, $a_t$ is the action the agent tries to take, and $z_{t+1}$ is the square it actually lands on. The hat on $\hat{p}$ is a reminder that this is the model’s estimate, not a rule we gave it by hand.<br>After each round of experience, I ask: under the model we have learned so far, which move looks best from each square? At the beginning, the answer is mostly noise. As the transition estimates improve, the planned path becomes reasonable.
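One way to answer "which move looks best from each square?" is value iteration on the estimated model. The sketch below is an illustration under stated assumptions, not necessarily the planner used for the animation: the three-square corridor, the probabilities, the discount `gamma`, and the names `p_hat`, `r_hat`, and `greedy_plan` are all hypothetical.

```python
# Hypothetical toy model: a 3-square corridor (0, 1, 2) with the goal at square 2.
# All numbers are made up for illustration; they do not come from the post.
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]

# p_hat[(z, a)] maps each possible next square to its estimated probability.
# Here an intended move succeeds 80% of the time; otherwise the agent stays put.
p_hat = {
    (0, "left"): {0: 1.0},            # wall on the left, so the agent stays
    (0, "right"): {1: 0.8, 0: 0.2},
    (1, "left"): {0: 0.8, 1: 0.2},
    (1, "right"): {2: 0.8, 1: 0.2},
    (2, "left"): {2: 1.0},            # the goal square is absorbing
    (2, "right"): {2: 1.0},
}

# r_hat[(z, a)] is the estimated expected reward (missing pairs count as 0).
# Reaching the goal pays 1, and "right" from square 1 gets there 80% of the time.
r_hat = {(1, "right"): 0.8}


def greedy_plan(p_hat, r_hat, states, actions, gamma=0.95, sweeps=200):
    """Run value iteration on the learned model and return (values, policy)."""
    values = {z: 0.0 for z in states}

    def q_value(z, a):
        # Estimated reward now plus discounted expected value of the next square.
        expected_next = sum(p * values[z2] for z2, p in p_hat[(z, a)].items())
        return r_hat.get((z, a), 0.0) + gamma * expected_next

    for _ in range(sweeps):
        # Each sweep backs up every square using the previous sweep's values.
        values = {z: max(q_value(z, a) for a in actions) for z in states}

    # The planned move is the action with the highest estimated value.
    policy = {z: max(actions, key=lambda a: q_value(z, a)) for z in states}
    return values, policy


values, policy = greedy_plan(p_hat, r_hat, STATES, ACTIONS)
print(policy)
```

With a poor `p_hat` the same function still returns a policy, only a misleading one, which is why the plan looks like noise early on.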
Once the transition model is good enough, we can roll it forward from the start state without touching the real environment:
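A rollout of this kind might look like the following sketch. It samples each next square from the learned distribution instead of stepping the real environment. The corridor model, the fixed `policy`, and the helper name `rollout` are assumptions made for illustration.

```python
import random

# Hypothetical learned model for a 3-square corridor with the goal at square 2.
# p_hat[(z, a)] maps next squares to estimated probabilities (made-up numbers).
p_hat = {
    (0, "right"): {1: 0.8, 0: 0.2},
    (1, "right"): {2: 0.8, 1: 0.2},
}
# Hypothetical plan: always try to move right.
policy = {0: "right", 1: "right"}


def rollout(p_hat, policy, start, goal, max_steps=50, seed=0):
    """Simulate one trajectory using only the learned transition model."""
    rng = random.Random(seed)  # local generator so the rollout is repeatable
    z = start
    path = [z]
    for _ in range(max_steps):
        if z == goal:
            break
        dist = p_hat[(z, policy[z])]
        squares = list(dist)
        weights = [dist[s] for s in squares]
        # Draw the next square from the estimated distribution.
        z = rng.choices(squares, weights=weights)[0]
        path.append(z)
    return path


path = rollout(p_hat, policy, start=0, goal=2)
print(path)
```

Every square in `path` is imagined: the simulated slips reflect the estimated probabilities, so the rollout is only as trustworthy as `p_hat`.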
That is the whole pattern in miniature: learn dynamics, plan with the learned dynamics, collect more data, repeat.<br>Model-based learning<br>The gridworld example was deliberately tiny. In reinforcement learning notation, the same loop looks like this: an agent observes $o_t$, takes an action $a_t$, receives a reward $r_t$, and moves to the next observation $o_{t+1}$.<br>A model-free agent tries to learn good actions directly. A model-based agent also tries to learn how the environment changes:<br>\[p(o_{t+1}, r_t \mid o_t, a_t)\] With a perfect model, planning is conceptually easy: try candidate action sequences, estimate their consequences, and choose the best one. Real observations make this messy. Images contain shadows, textures, backgrounds, and other details that may be irrelevant for the decision.<br>So we usually do not model raw observations directly. We first compress them.<br>Latent states<br>A latent state is the compressed version of the observation:<br>\[z_t = e_\phi(o_t)\] The latent state should keep what matters for prediction and control. For a driving agent, that might include lane geometry, nearby cars, traffic lights, and velocity. 
It probably does not need the exact texture of the pavement.<br>The dynamics model then predicts the next latent state:<br>\[z_{t+1} \sim p_\theta(z_{t+1} \mid z_t, a_t)\] The model now has a more manageable job: predict how this compact state changes when the agent acts.<br>Components of a world model<br>A typical implementation has a few learned pieces:<br>Encoder: maps observations to latent states.<br>Dynamics model: predicts the next latent state.<br>Decoder: reconstructs observations from latent states.<br>Reward model: predicts rewards from latent states and actions.<br>In equations:<br>\[z_t = e_\phi(o_t)\] \[z_{t+1} \sim p_\theta(z_{t+1} \mid z_t, a_t)\] \[\hat{o}_t \sim d_\psi(o_t \mid z_t)\] \[\hat{r}_t = r_\eta(z_t, a_t)\] The decoder is useful when we want the latent state to preserve enough information to reconstruct the observation. But reconstruction is not always the goal. If the model is used only for control, then rewards, values, and future states may matter more. This distinction comes back when we get to MuZero.<br>Training the model<br>The training data is a set of trajectories collected from the environment:<br>\[(o_1, a_1, r_1, o_2), (o_2, a_2, r_2, o_3), \ldots, (o_T, a_T, r_T, o_{T+1})\] In the gridworld, training is just counting. If the agent is in square $z_t$, tries action $a_t$, and lands in square $z_{t+1}$, we increment that transition count. Because the floor is slippery, the same intended action can lead to different next squares on different attempts. The learned transition probability is:<br>\[\hat{p}(z' \mid z, a) = \frac{N(z, a, z')}{\sum_{\tilde{z}} N(z, a, \tilde{z})}\] where $N(z, a, z')$ is the number of times intended action $a$ took the agent from $z$ to $z'$. The reward model is just the average observed reward for each state-action pair:<br>\[\hat{r}(z, a) = \frac{1}{N(z, a)} \sum_{i: z_i=z, a_i=a} r_i\] where $N(z, a) = \sum_{\tilde{z}} N(z, a, \tilde{z})$ is the number of times action $a$ was tried in square $z$. The gridworld gets to...