Hallucination in World Models
Live demo
Live interaction with our 350M-parameter world model trained on 210 tasks. Control it with your keyboard! Our hallucination predictors run at every step; a red border indicates that a hallucination is detected.
Can you make the world model hallucinate?
Controls: W/A/S/D to act, Space to pause, R to reset.
Tap the on-screen keys to move, and tap the frame to pause.
Hallucination in world models
Modern generative world models render strikingly realistic, action-controllable futures. But the rollouts they produce frequently hallucinate: they stay visually fluent and superficially plausible while drifting away from the ground-truth dynamics. When such a model is used downstream for planning or policy learning, hallucination leads to incorrect decisions.
In this work, we train a 350M-parameter generative world model on a large dataset spanning 210 tasks and show that, even at this scale, hallucination is both predictable (we can predict when it will happen) and preventable (the underlying issue is, to a great extent, fixable).
[Video comparison. Panels: Ground truth, World model. Status labels: in sync, drift.]
An open-loop rollout from our 350M-parameter base model (right) vs. its ground truth (left). The imagined trajectory looks visually plausible but largely ignores the action sequence it was conditioned on. This is exactly the type of hallucination we set out to study.
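To make "open-loop rollout" concrete, here is a minimal sketch of how drift from the ground truth could be measured. It assumes a per-step mean squared error as the drift metric, which is not necessarily the metric used in this work, and every name in it (`open_loop_rollout`, `per_step_error`, the two toy step functions) is hypothetical. The toy "model" deliberately ignores its action input to mimic the failure shown above.

```python
import numpy as np

def open_loop_rollout(step_fn, first_state, actions):
    """Roll forward from one initial state, feeding each prediction back in."""
    states = [first_state]
    for action in actions:
        states.append(step_fn(states[-1], action))
    return np.stack(states)

def per_step_error(rollout, ground_truth):
    """Mean squared error between two rollouts at every time step."""
    diff = rollout - ground_truth
    return (diff ** 2).reshape(len(diff), -1).mean(axis=1)

# Toy stand-ins (assumptions, not the real simulator or world model):
# the true system moves by the action, the "model" ignores the action.
def true_step(state, action):
    return state + action

def action_ignoring_model(state, action):
    return state + 0.1

rng = np.random.default_rng(0)
actions = rng.choice([-1.0, 1.0], size=20)
start = np.zeros(2)

ground_truth = open_loop_rollout(true_step, start, actions)
imagined = open_loop_rollout(action_ignoring_model, start, actions)
errors = per_step_error(imagined, ground_truth)  # one value per time step
```

Both rollouts share the first state, so `errors[0]` is zero and any later error comes only from the model's own predictions being fed back in.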
We argue that hallucination in world models is, first and foremost, a data-coverage problem, making it both predictable and preventable.
A 427-hour testbed for world modeling
Studying coverage needs three things no benchmark offered at once: full control of the training pipeline, behaviorally diverse data across many tasks, and live simulators to probe the gaps online. So we built MMBench2, which includes ground-truth actions, rewards, language instructions, and a live environment for every task.
Naturally, MMBench2 is fully open-source.
Pong · MiniArcade
Assembly · Meta-World
Road Runner · Atari
Walker · MuJoCo
Ant Run · ManiSkill3
Quadruped Run · DMControl
Ant · OGBench
Push Green · RoboDesk
Lunar Lander Hover · Box2D
Foraging · MiniArcade
Pick Place · Meta-World
Boxing · Atari
Hopper Hop · DMControl
Bird Attack · MiniArcade
Reacher Easy · MiniArcade
Point Maze · MiniArcade
Whirlpool · MiniArcade
Highway · MiniArcade
Rocket Collect · MiniArcade
Spaceship · MiniArcade
Cheetah Run · DMControl
Point Maze · OGBench
Open Slide · RoboDesk
Bipedal Walker Hills · Box2D
Coinrun · MiniArcade
Window Close · Meta-World
Ms. Pac-Man · Atari
Ant · MuJoCo
Hopper Hop · ManiSkill3
Walker Run · DMControl
Point Spiral · OGBench
Dungeon Explorer 1 · MiniArcade
Soccer · Meta-World
Reacher Hard · DMControl
Finger Turn Hard · DMControl
Cup Catch · DMControl
Cartpole Swingup · DMControl
Landing · MiniArcade
Air Hockey · MiniArcade
MMBench2 includes 210 tasks spanning 10 domains. Tasks include locomotion, manipulation, navigation, arcade-style environments, and more. All clips are generated by our 350M-parameter base model trained on MMBench2. If you look closely, you may notice occasional hallucinations.
The corpus contains an equal number of trajectories per task but is imbalanced in terms of frames. Episode lengths range from 25 (ManiSkill3) to 1,000 (Atari) steps, so the frame distribution is heavy-tailed. That non-uniformity is exactly the coverage structure we set out to study.
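The arithmetic behind this imbalance can be sketched in a few lines. Only the 25-step and 1,000-step extremes come from the text; the trajectory count, the intermediate episode lengths, the task names, and the assumption that every episode in a task has the same length are made up for illustration.

```python
import statistics

# Hypothetical numbers: equal trajectories per task, unequal episode lengths.
TRAJECTORIES_PER_TASK = 100
episode_length = {
    "maniskill3_task": 25,    # shortest episode length quoted in the text
    "short_task": 50,         # invented
    "medium_task": 200,       # invented
    "long_task": 500,         # invented
    "atari_task": 1000,       # longest episode length quoted in the text
}

# Frames per task scale linearly with episode length.
frames_per_task = {
    task: TRAJECTORIES_PER_TASK * steps for task, steps in episode_length.items()
}
total_frames = sum(frames_per_task.values())
frame_share = {task: n / total_frames for task, n in frames_per_task.items()}

# Ratio between the most and least represented task, measured in frames.
imbalance = max(frames_per_task.values()) / min(frames_per_task.values())
median_frames = statistics.median(frames_per_task.values())
```

With equal trajectory counts, the frame ratio between two tasks is simply the ratio of their episode lengths, so the two extremes alone imply a 40× gap in frames.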
Per-task frame counts across all 210 tasks, sorted from high to low and colored by domain (log scale). The dashed line marks the per-task median of 65,260 frames.
Building a generative world model
On MMBench2 we train a 350M-parameter world model that largely follows the Dreamer 4 recipe. It consists of a video tokenizer, an action-conditioned dynamics model, and a video decoder. Any of its three components can fail independently, resulting in hallucination.
[Diagram: frame → Encoder (tokenizer) → Dynamics (block-causal Transformer, also given the action) → Decoder (renderer) → frame′]
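The data flow through the three components can be sketched as follows. This is a toy illustration, not the actual architecture: random linear maps stand in for the learned networks, the sizes are arbitrary, and the dynamics step here sees only the latest latent rather than the full history of past latents. All names (`encode`, `dynamics`, `decode`, `imagine`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_DIM, LATENT_DIM, ACTION_DIM = 48, 8, 4  # toy sizes, not the real model's

# Random linear maps stand in for the three learned networks (assumption).
W_enc = rng.normal(size=(FRAME_DIM, LATENT_DIM)) / np.sqrt(FRAME_DIM)
W_dyn_z = rng.normal(size=(LATENT_DIM, LATENT_DIM)) / np.sqrt(LATENT_DIM)
W_dyn_a = rng.normal(size=(ACTION_DIM, LATENT_DIM)) / np.sqrt(ACTION_DIM)
W_dec = rng.normal(size=(LATENT_DIM, FRAME_DIM)) / np.sqrt(LATENT_DIM)

def encode(frame):
    """Tokenizer: frame -> continuous latent code z."""
    return frame @ W_enc

def dynamics(latent, action):
    """Dynamics: (latent, action) -> next latent. Simplified to one past latent."""
    return np.tanh(latent @ W_dyn_z + action @ W_dyn_a)

def decode(latent):
    """Renderer: latent -> frame."""
    return latent @ W_dec

def imagine(first_frame, actions):
    """Encode once, step in latent space, decode every predicted latent."""
    latent = encode(first_frame)
    frames = []
    for action in actions:
        latent = dynamics(latent, action)
        frames.append(decode(latent))
    return np.stack(frames)

first_frame = rng.normal(size=FRAME_DIM)
actions = np.eye(ACTION_DIM)[rng.integers(ACTION_DIM, size=5)]  # one-hot actions
predicted = imagine(first_frame, actions)  # shape: (5, FRAME_DIM)
```

Because predictions are fed back in latent space and only decoded for viewing, an error in any one stage propagates through every later frame of the rollout.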
Encoder · 50M params
A video tokenizer encodes each frame into a continuous latent code z, trained jointly with the decoder via masked autoencoding.
Dynamics · 250M params
A block-causal Transformer predicts the next latent from past latents and an action token, trained with shortcut flow-matching. Encoder and decoder are frozen during dynamics training.
Decoder · 50M params
A decoder renders latent codes back to pixels. It is used for supervision during tokenizer training and for human viewing at test time.
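The block-causal attention pattern mentioned for the dynamics model can be illustrated with a small mask. This sketch assumes the usual meaning of the term: tokens attend freely within their own frame and causally across frames. How the actual model lays out its latent and action tokens is not specified here, so `tokens_per_frame` and the function name are assumptions.

```python
import numpy as np

def block_causal_mask(num_frames, tokens_per_frame):
    """Boolean mask: entry [q, k] is True if query token q may attend to key token k.

    A token may attend to every token in its own frame and in all earlier
    frames, but to no token in a later frame.
    """
    frame_index = np.repeat(np.arange(num_frames), tokens_per_frame)
    return frame_index[:, None] >= frame_index[None, :]

mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.astype(int))
# [[1 1 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 1]
#  [1 1 1 1 1 1]]
```

Compared with a plain token-level causal mask, the diagonal blocks are fully filled in, so all tokens of one frame can see each other.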
Because the stages...