Hallucination in World Models Is Predictable and Preventable


Live demo

Live interaction with our 350M-parameter world model trained on 210 tasks. Control it with your keyboard! Our hallucination predictors run at every step; a red border indicates that a hallucination is detected.

Can you make the world model hallucinate?

Controls: W/A/S/D to act, Space to pause, R to reset. On touch devices, tap the on-screen keys to move and tap the frame to pause. The status indicator reads "stable" or "hallucination".

Hallucination in world models

Modern generative world models render strikingly realistic, action-controllable futures. But the rollouts they produce frequently hallucinate: they stay visually fluent and superficially plausible while drifting away from the ground-truth dynamics. When such rollouts are used downstream for planning or policy learning, hallucination leads to incorrect decisions.

In this work, we train a 350M-parameter generative world model on a large dataset spanning 210 tasks and show that, even at this scale, hallucination is both predictable (we can predict when it will happen) and preventable (the underlying issue is, to a great extent, fixable).

Video comparison: ground truth (left) vs. world model (right), with an "in sync" / "drift" status label.

An open-loop rollout from our 350M-parameter base model (right) vs. its ground truth (left). The imagined trajectory looks visually plausible but largely ignores the action sequence it was conditioned on. This is exactly the type of hallucination we set out to study.

We argue that hallucination in world models is, first and foremost, a data-coverage problem, making it both predictable and preventable.

A 427-hour testbed for world modeling

Studying coverage needs three things that no benchmark offered at once: full control of the training pipeline, behaviorally diverse data across many tasks, and live simulators to probe the gaps online. So we built MMBench2, which includes ground-truth actions, rewards, language instructions, and a live environment for every task.

Naturally, MMBench2 is fully open-source.


Example clips (domain in parentheses): Pong (MiniArcade), Assembly (Meta-World), Road Runner (Atari), Walker (MuJoCo), Ant Run (ManiSkill3), Quadruped Run (DMControl), Ant (OGBench), Push Green (RoboDesk), Lunar Lander Hover (Box2D), Foraging (MiniArcade), Pick Place (Meta-World), Boxing (Atari), Hopper Hop (DMControl), Bird Attack (MiniArcade), Reacher Easy (MiniArcade), Point Maze (MiniArcade), Whirlpool (MiniArcade), Highway (MiniArcade), Rocket Collect (MiniArcade), Spaceship (MiniArcade), Cheetah Run (DMControl), Point Maze (OGBench), Open Slide (RoboDesk), Bipedal Walker Hills (Box2D), Coinrun (MiniArcade), Window Close (Meta-World), Ms. Pac-Man (Atari), Ant (MuJoCo), Hopper Hop (ManiSkill3), Walker Run (DMControl), Point Spiral (OGBench), Dungeon Explorer 1 (MiniArcade), Soccer (Meta-World), Reacher Hard (DMControl), Finger Turn Hard (DMControl), Cup Catch (DMControl), Cartpole Swingup (DMControl), Landing (MiniArcade), Air Hockey (MiniArcade).

MMBench2 includes 210 tasks spanning 10 domains. Tasks include locomotion, manipulation, navigation, arcade-style environments, and more. All clips are generated by our 350M-parameter base model trained on MMBench2. If you look closely, you may notice occasional hallucinations.

The corpus contains an equal number of trajectories per task but is imbalanced in terms of frames. Episode lengths range from 25 (ManiSkill3) to 1,000 (Atari) steps, so the frame distribution is heavy-tailed. That non-uniformity is exactly the coverage structure we set out to study.
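The arithmetic behind this imbalance can be sketched in a few lines. This is only an illustration: the task names, the 250-step task, and the trajectory count of 100 are hypothetical placeholders, not MMBench2 figures. Only the 25-step and 1,000-step extremes come from the text above.

```python
def frame_counts(episode_lengths, trajectories_per_task):
    """Frames per task when every task contributes the same number of trajectories."""
    return {task: steps * trajectories_per_task
            for task, steps in episode_lengths.items()}


def frame_shares(counts):
    """Fraction of all frames in the corpus that each task accounts for."""
    total = sum(counts.values())
    return {task: n / total for task, n in counts.items()}


# Hypothetical tasks; only the 25 and 1,000 step extremes are from the text.
episode_lengths = {"short_task": 25, "mid_task": 250, "long_task": 1000}

# Hypothetical trajectory count, identical for every task.
counts = frame_counts(episode_lengths, trajectories_per_task=100)
shares = frame_shares(counts)

for task in sorted(counts, key=counts.get, reverse=True):
    print(f"{task:>10}: {counts[task]:>7} frames ({shares[task]:.1%} of corpus)")
```

With equal trajectory counts, the ratio of frames between two tasks equals the ratio of their episode lengths, so the longest tasks can contribute up to 40 times as many frames as the shortest ones.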

Per-task frame counts across all 210 tasks, sorted high→low and colored by domain (log scale). The dashed line marks the per-task median of 65,260 frames.

Building a generative world model

On MMBench2 we train a 350M-parameter world model that largely follows the Dreamer 4 recipe. It consists of a video tokenizer, an action-conditioned dynamics model, and a video decoder. Any of its three components can fail independently, resulting in hallucination.

Architecture diagram: frame → Encoder (tokenizer) → Dynamics (block-causal Transformer, conditioned on action) → Decoder (renderer) → frame′

Encoder · 50M params

A video tokenizer encodes each frame into a continuous latent code z, trained jointly with the decoder via masked autoencoding.

Dynamics · 250M params

A block-causal Transformer predicts the next latent from past latents and an action token, trained with shortcut flow-matching. Encoder and decoder are frozen during dynamics training.
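To make "block-causal" concrete, here is a minimal sketch of such an attention mask. It assumes that all tokens belonging to one time step form a single block, and that the function name and token counts are illustrative; the exact token layout of our model (for example, how action tokens are interleaved) is not specified here.

```python
def block_causal_mask(num_frames, tokens_per_frame):
    """Build a square boolean mask over num_frames * tokens_per_frame tokens.

    mask[i][j] is True when query token i may attend to key token j:
    tokens see every token in their own frame and in all earlier frames,
    but nothing from later frames.
    """
    # Frame index of each token position, e.g. [0, 0, 1, 1, 2, 2].
    frame_of = [f for f in range(num_frames) for _ in range(tokens_per_frame)]
    n = len(frame_of)
    return [[frame_of[i] >= frame_of[j] for j in range(n)] for i in range(n)]


# Illustrative sizes: 3 frames with 2 tokens each.
mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print(" ".join("1" if allowed else "0" for allowed in row))
```

Compared with a standard token-level causal mask, the only difference is that attention is unrestricted inside a frame, so the tokens of one frame can be predicted jointly.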

Decoder · 50M params

A decoder renders latent codes back to pixels. The decoder is used for supervision during tokenizer training and for human viewing at test time.
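The way the three components compose during an open-loop rollout can be sketched as follows. The `toy_*` functions are hypothetical stand-ins, not the trained networks: in particular, the real dynamics model samples the next latent with shortcut flow-matching, which is replaced here by a deterministic update so the control flow stays readable.

```python
def open_loop_rollout(first_frame, actions, encode, dynamics, decode):
    """Imagine one frame per action, starting from a single real frame.

    The encoder sees only the first frame. After that, the dynamics model
    consumes its own predicted latents, so any error is carried forward.
    """
    latents = [encode(first_frame)]
    frames = []
    for action in actions:
        next_latent = dynamics(latents, action)  # past latents + action
        latents.append(next_latent)
        frames.append(decode(next_latent))       # pixels are for viewing only
    return frames


# Hypothetical toy components: a "frame" is a one-element list holding a position.
def toy_encode(frame):
    return [value * 0.5 for value in frame]


def toy_dynamics(latents, action):
    return [latents[-1][0] + 0.5 * action]


def toy_decode(latent):
    return [value * 2.0 for value in latent]


frames = open_loop_rollout([0.0], [1.0, 1.0, -1.0],
                           toy_encode, toy_dynamics, toy_decode)
print(frames)
```

Because decoded frames are never fed back into the encoder, the rollout runs entirely in latent space, and a failure in any one component shows up in the rendered output.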

Because the stages...
