Causal Video Models Are Data-Efficient Robot Policy Learners | Rhoda AI<br>March 2026·Rhoda AI Research
At Rhoda AI, we are building towards generalist robotics. Our Direct Video-Action Model (DVA) reformulates robot policies as video generation, unlocking data-efficient task learning, scaling, long-context memory, and one-shot learning.
The Challenge of Generalist Robotics<br>For decades, we have excelled at creating specialized robots: machines that perform a single, repetitive task with superhuman speed and accuracy in controlled factory settings. The transition to generalist robotics, however, represents a generational leap from fixed-function hardware to general-purpose agents capable of navigating the messy, unpredictable real world. Solving this challenge is critical because the next era of automation hinges not just on a robot's ability to follow a precise script, but on its capacity to generalize across diverse environments and tasks using a single, unified model.<br>The missing ingredient is one that today's most powerful AI systems all share: whether specialized for language, images, or video, the most capable models are trained on web-scale data. At that scale, AI models acquire a broad base of knowledge and the ability to generalize across countless situations. We build our generalist robots on the same principle. The challenge, however, is data. Emerging approaches, such as vision-language-action models, collect tens or hundreds of thousands of hours of robot data, yet are still far from true generalist behavior. And no matter how much robot data we collect, it will always be a tiny fraction of the web-scale data available.<br>We believe web video is the most scalable data source capturing the dynamic physical world, and video generation is the most effective objective for a model to learn the deep physical knowledge robots need for decision-making. Our strategy directly formulates robot control as real-time video prediction through a new paradigm: Direct Video-Action Models (DVA). Trained on web-scale data, these models offer significant advantages over existing approaches:<br>Data-efficient task learning. Our models perform complex, long-horizon tasks reliably with as little as ~10 hours of total robot data.
Long-context visual memory. Unlike most vision-language-action models, which often have a context of only a few frames, our models natively have hundreds of frames of visual context, enabling them to orchestrate sophisticated, multi-step tasks end-to-end.
One-shot learning. Long-context visual memory also unlocks new model capabilities, such as learning to imitate human behavior from a single demonstration, in-context, at test time.
Interpretability through video generation. Because robot actions are generated as videos first, the robot's behavior can be directly visualized through autoregressive rollouts, enabling inspection of model decisions, comparison of configurations, and verification of safe behavior.
Most importantly, our approach offers a clear path for scaling, since video data exists at a scale orders of magnitude larger than robot interaction datasets.
Direct Video-Action Models<br>💡<br>Direct Video-Action Model: A robot policy that translates predictions from a pre-trained causal video model into actions in a real-time closed loop, with the video model directly responsible for decision-making.
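To make the closed loop in this definition concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not our implementation: every name (`run_dva_loop`, `ToyRobot`, `toy_video_model`, `toy_inverse_dynamics`) is hypothetical, frames are reduced to a single number standing in for an image, and the default `context_len` of 256 is simply one value consistent with "hundreds of frames". The toy video model "imagines" the robot moving toward a goal, and the toy inverse dynamics model recovers the motion that would produce each imagined frame.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Frame = List[float]   # stand-in for an image (hypothetical 1-D "frame")
Action = List[float]  # stand-in for a robot command


def run_dva_loop(
    video_model: Callable[[Sequence[Frame]], List[Frame]],
    inverse_dynamics: Callable[[Frame, List[Frame]], List[Action]],
    observe: Callable[[], Frame],
    execute: Callable[[Action], None],
    context_len: int = 256,  # assumed value; "hundreds of frames" of context
    steps: int = 10,
) -> List[Action]:
    """Closed loop: observe, predict future video, translate to actions, act."""
    context: Deque[Frame] = deque(maxlen=context_len)  # rolling visual memory
    executed: List[Action] = []
    for _ in range(steps):
        context.append(observe())                       # newest camera frame
        future = video_model(list(context))             # predicted future frames
        actions = inverse_dynamics(context[-1], future) # video -> actions
        for action in actions:
            execute(action)
            executed.append(action)
    return executed


class ToyRobot:
    """Hypothetical 1-D robot whose 'camera frame' is just its position."""

    def __init__(self) -> None:
        self.position = 0.0

    def observe(self) -> Frame:
        return [self.position]

    def execute(self, action: Action) -> None:
        self.position += action[0]


def toy_video_model(context: Sequence[Frame], horizon: int = 2,
                    goal: float = 1.0, step: float = 0.25) -> List[Frame]:
    """Toy stand-in: predicts frames that move toward the goal."""
    pos = context[-1][0]
    frames: List[Frame] = []
    for _ in range(horizon):
        pos = min(goal, pos + step)
        frames.append([pos])
    return frames


def toy_inverse_dynamics(current: Frame, future: List[Frame]) -> List[Action]:
    """Toy stand-in: the action is the displacement between consecutive frames."""
    actions: List[Action] = []
    prev = current
    for frame in future:
        actions.append([frame[0] - prev[0]])
        prev = frame
    return actions


robot = ToyRobot()
executed = run_dva_loop(toy_video_model, toy_inverse_dynamics,
                        robot.observe, robot.execute, context_len=4, steps=3)
print(robot.position, len(executed))
```

Note that the sketch contains no task logic outside the video model: the inverse dynamics function only translates whatever future the video model predicts, which is the sense in which the video model is directly responsible for decision-making.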
[Figure 1 diagram labels: Video Context, Causal Video Model, Generated Video, Inverse Dynamics Model, Generated Actions (action rollout)]<br>Figure 1. Simplified diagram of a Direct Video-Action Model. Conditioned on a video history, we predict future video frames. An inverse dynamics model translates the video prediction into actions, which are executed on the robot. These steps are repeated in a streaming closed loop, running multiple times per second.<br>We leverage large-scale pre-training by formulating robot control as video prediction. Conditioned on a long context of captured video from the robot, proprioception, and other conditioning signals (e.g. language), we visually predict a short period into the future. This prediction captures how the robot should behave and how the environment will evolve. A separate inverse dynamics model then serves as a translator, converting the predicted future into robot actions. This cycle of video prediction and action translation repeats in a closed loop, multiple times per second. We call this approach a Direct Video-Action Model (DVA) (Figure 1), because the video model directly specifies the desired future behavior as the policy, providing the target signal for action translation. This new paradigm reduces robot control to video generation, enabling the benefits of web-scale pre-training.<br>Much of the prior work has explored using video models for robot control, including synthesizing training data with video models1-3[1] GR00T N1: An Open...