Develop Physical AI Reasoning, World, and Action Models with NVIDIA Cosmos 3 | NVIDIA Technical Blog
May 31, 2026
By Asawaree Bhide and Alexander Schwarz
Physical AI systems must understand the real world before they can act within it. Robots, autonomous vehicles, and smart spaces need to understand what’s happening in their world, predict what’s likely to happen next, and generate actions for specific environments, embodiments, and tasks.
NVIDIA Cosmos 3 is a frontier foundation model for physical AI that combines physical reasoning, world generation, and action generation within a single open model.
NVIDIA is open-sourcing Cosmos 3 models, training scripts, deployment tools, and datasets to make physical AI development more open and reproducible. This blog post covers the fundamentals of Cosmos 3, highlights key concepts from the technical report, walks through technical workflows, and shows how teams building robotic manipulation systems, autonomous vehicles, and warehouse monitoring solutions can get started.
Figure 1. A clip of a video generated by Cosmos 3 for the autonomous driving domain
Figure 2. A video generated using Cosmos 3 for warehouse safety data.
Key highlights of this release include:
NVIDIA Cosmos 3 Nano and NVIDIA Cosmos 3 Super model checkpoints on Hugging Face with code on GitHub.
Open datasets for physical AI applications like robotics and autonomous driving.
Open post-training scripts for adapting Cosmos 3 to your domain.
Cosmos NIM microservices for easy, optimized deployment on NVIDIA GPUs.
What’s new in Cosmos 3
Previous Cosmos releases separated world generation, physical understanding, and controlled scene generation into different models and workflows. This release unifies those capabilities with a Mixture-of-Transformers (MoT) architecture built around two towers.
Reasoner tower: A vision-language model (VLM) that interprets multimodal observations like images, videos, and text. This tower uses an autoregressive architecture to interpret the input and understand motion, object interactions, and other physical context. It serves as the ‘brain’ that reasons about the world before any generation happens.
Generator tower: Generates future observations and action sequences. This tower uses a diffusion-based process to generate physics-aware video and action outputs that are conditioned on the reasoner tower’s understanding. The reasoner can be called independently, but the generator always activates both towers for guided generation.
Figure 3. Cosmos 3 architecture
This architecture enables a single model to perform both reasoning and generation tasks, which simplifies development by eliminating orchestration between multiple models and inference pipelines.
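The calling pattern described for the two towers can be illustrated with a toy sketch. This is not the Cosmos 3 API: `TwoTowerModel`, `ReasonerTower`, `GeneratorTower`, and their methods are hypothetical names, and the towers are string-returning stand-ins. The sketch only shows that a reasoning request touches the reasoner alone, while a generation request runs the reasoner first and conditions the generator on its output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Context:
    """Stand-in for the reasoner tower's understanding of the inputs."""
    summary: str


class ReasonerTower:
    """Toy placeholder for the autoregressive VLM tower (hypothetical)."""

    def __init__(self) -> None:
        self.calls = 0

    def interpret(self, observations: List[str]) -> Context:
        self.calls += 1
        return Context(summary=" + ".join(observations))


class GeneratorTower:
    """Toy placeholder for the diffusion-based tower (hypothetical)."""

    def __init__(self) -> None:
        self.calls = 0

    def generate(self, context: Context, target: str) -> str:
        self.calls += 1
        return f"<{target} conditioned on: {context.summary}>"


class TwoTowerModel:
    """Routes requests the way the text describes, using the toy towers."""

    def __init__(self) -> None:
        self.reasoner = ReasonerTower()
        self.generator = GeneratorTower()

    def reason(self, observations: List[str]) -> Context:
        # Reasoning-only path: the generator tower is never invoked.
        return self.reasoner.interpret(observations)

    def generate(self, observations: List[str], target: str) -> str:
        # Generation path: both towers run, reasoner first.
        context = self.reasoner.interpret(observations)
        return self.generator.generate(context, target)


model = TwoTowerModel()
model.reason(["video", "text"])
model.generate(["video", "text"], target="video")
print(model.reasoner.calls, model.generator.calls)  # prints: 2 1
```

The call counters make the asymmetry visible: after one reasoning request and one generation request, the reasoner has run twice and the generator once.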
Choose the right model size
Two Cosmos 3 models are currently available:
Cosmos 3 Nano is the compact version, with 16B parameters, and is optimized for efficient inference. It’s designed to run on workstation-grade compute, like the NVIDIA RTX PRO 6000 GPU, for real-time robotics inference and physical AI applications.
Cosmos 3 Super is a 64B parameter model designed for maximum quality and capability. It delivers the highest benchmark scores and targets datacenter deployment on NVIDIA Hopper and NVIDIA Blackwell GPUs, making it suitable for large-scale synthetic data generation and advanced physical reasoning workloads.
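As a rough sizing aid, the parameter counts above can be turned into an estimate of the memory needed for the weights alone. The sketch below assumes 2 bytes per parameter (a 16-bit format); the post does not state the precision the checkpoints ship in, so treat that value as an assumption and adjust it. The estimate excludes activations, caches, and any runtime overhead, and `weight_memory_gb` is a hypothetical helper, not part of any Cosmos tooling.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Parameter counts as stated in the post.
MODELS = {"Cosmos 3 Nano": 16e9, "Cosmos 3 Super": 64e9}

for name, params in MODELS.items():
    # bytes_per_param=2.0 is an assumed 16-bit precision, not a stated fact.
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB of weights at 2 bytes/param")
```

Under that assumption the weights come to roughly 32 GB for Nano and 128 GB for Super, which is consistent with the workstation versus datacenter positioning described above.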
Supported modalities
Cosmos 3 supports the following input and output modalities through its unified architecture:
Input                  Output          Application
Text                   Image           Physically plausible image generation
Text | Video           Video           World model for rare edge case video data generation
Text | Image           Video           World model for prediction
Text | Image | Video   Text            VLM for reasoning
Action | Video | Text  Video           Action-conditioned world model
Video | Text           Video | Action  World action model, video action model, vision language action model, policy model for robot learning
Table 1. Input and output modalities supported by Cosmos 3 for different applications
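The rows of Table 1 can also be written as a small lookup, which is handy when deciding which mode a pipeline needs. The sketch below is an illustration only: `Row`, `TABLE_1`, and `find_applications` are hypothetical names, and it assumes each `|`-separated cell lists a set of modalities that must match exactly. Because two rows share the same inputs (text and video) but differ in output, the lookup keys on both inputs and outputs.

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class Row(NamedTuple):
    inputs: FrozenSet[str]
    outputs: FrozenSet[str]
    application: str


def _cell(text: str) -> FrozenSet[str]:
    """Parse an 'a | b' style table cell into a set of lowercase modalities."""
    return frozenset(part.strip().lower() for part in text.split("|"))


def row(inputs: str, outputs: str, application: str) -> Row:
    return Row(_cell(inputs), _cell(outputs), application)


# Transcribed from Table 1 of the post.
TABLE_1: List[Row] = [
    row("Text", "Image", "Physically plausible image generation"),
    row("Text | Video", "Video",
        "World model for rare edge case video data generation"),
    row("Text | Image", "Video", "World model for prediction"),
    row("Text | Image | Video", "Text", "VLM for reasoning"),
    row("Action | Video | Text", "Video", "Action-conditioned world model"),
    row("Video | Text", "Video | Action",
        "World action model, video action model, vision language action "
        "model, policy model for robot learning"),
]


def find_applications(inputs: Iterable[str], outputs: Iterable[str]) -> List[str]:
    """Return applications whose input and output sets match exactly."""
    want_in = frozenset(m.lower() for m in inputs)
    want_out = frozenset(m.lower() for m in outputs)
    return [r.application for r in TABLE_1
            if r.inputs == want_in and r.outputs == want_out]


print(find_applications({"video", "text"}, {"video"}))
# prints: ['World model for rare edge case video data generation']
```

Requesting video and action outputs from the same text and video inputs selects the last row instead, which is the robot-learning policy use case.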
Open datasets for physical AI
With the Cosmos 3 release, NVIDIA is open-sourcing six synthetic data generation (SDG) datasets on Hugging Face. These cover robotics, physics simulation, spatial reasoning, human motion, driving, and warehouse environments, and can be used for post-training Cosmos 3 and other models.
Physical AI World Model Synthetic Datasets include:
Embodied robot scenes
Physical interaction scenes
Spatial reasoning
Digital human scenes
Autonomous driving scenarios
Warehouse operations scenes
Figure 4. Manipulation examples from the Embodied Robot Scenes dataset
Figure 5. Examples from the Physical Interaction Scenes dataset
Figure 6. Examples from the Spatial Reasoning dataset
Figure 7. Examples from the Digital Human Scenes dataset
Figure 8. Examples from the Autonomous Driving...