Develop Physical AI Reasoning, World, and Action Models with NVIDIA Cosmos 3 | NVIDIA Technical Blog

May 31, 2026

By Asawaree Bhide and Alexander Schwarz

Physical AI systems must understand the real world before they can act within it. Robots, autonomous vehicles, and smart spaces need to understand what’s happening in their world, predict what’s likely to happen next, and generate actions for specific environments, embodiments, and tasks.

NVIDIA Cosmos 3 is a frontier foundation model for physical AI that combines physical reasoning, world generation, and action generation within a single open model.

NVIDIA is open-sourcing Cosmos 3 models, training scripts, deployment tools, and datasets to make physical AI development more open and reproducible. This blog post covers the fundamentals of Cosmos 3, highlights key concepts from the technical report, walks through technical workflows, and shows how teams building robotic manipulation systems, autonomous vehicles, and warehouse monitoring solutions can get started.

Figure 1. A clip of a video generated by Cosmos 3 for the autonomous driving domain

Figure 2. A video generated using Cosmos 3 for warehouse safety data.

Key highlights of this release include:

NVIDIA Cosmos 3 Nano and NVIDIA Cosmos 3 Super model checkpoints on Hugging Face with code on GitHub.

Open datasets for physical AI applications like robotics and autonomous driving.

Open post-training scripts for adapting Cosmos 3 to your domain.

Cosmos NIM microservices for easy, optimized deployment on NVIDIA GPUs.

What’s new in Cosmos 3

Previous Cosmos releases separated world generation, physical understanding, and controlled scene generation into different models and workflows. This release unifies those capabilities with a Mixture-of-Transformers (MoT) architecture built around two towers.

Reasoner tower: A vision-language model (VLM) that interprets multimodal observations like images, videos, and text. This tower uses an autoregressive architecture to interpret the input and understand motion, object interactions, and other physical context. This serves as the ‘brain’ that reasons about the world before any generation happens.

Generator tower: Generates future observations and action sequences. This tower uses a diffusion-based process to generate physics-aware video and action outputs that are conditioned on the reasoner tower’s understanding. The reasoner can be called independently, but the generator always activates both towers for guided generation.

Figure 3. Cosmos 3 architecture

This architecture enables a single model to handle both reasoning and generation tasks, which simplifies development by eliminating orchestration between multiple models and inference pipelines.
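The calling pattern described above (the reasoner usable on its own, the generator always running both towers) can be illustrated with a toy sketch. This is not the Cosmos 3 API: the class names `ReasonerTower`, `GeneratorTower`, and `Understanding`, the method names, and the string-based "frames" are all hypothetical stand-ins chosen only to show the control flow.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Understanding:
    """Hypothetical container for what the reasoner extracts from its inputs."""
    summary: str


class ReasonerTower:
    """Toy stand-in for the autoregressive VLM tower (no real model inside)."""

    def __init__(self) -> None:
        self.calls = 0  # counts how often the reasoner runs

    def reason(self, observations: List[str], prompt: str) -> Understanding:
        self.calls += 1
        return Understanding(summary=f"{prompt} | saw {len(observations)} observation(s)")


class GeneratorTower:
    """Toy stand-in for the diffusion-based tower; it always consults the reasoner first."""

    def __init__(self, reasoner: ReasonerTower) -> None:
        self.reasoner = reasoner

    def generate(self, observations: List[str], prompt: str, steps: int = 4) -> List[str]:
        understanding = self.reasoner.reason(observations, prompt)
        # A real generator would run iterative denoising; each "frame" here is a labeled placeholder.
        return [f"frame {i} conditioned on: {understanding.summary}" for i in range(steps)]


reasoner = ReasonerTower()
generator = GeneratorTower(reasoner)

# Reasoning-only call: the generator is not involved.
answer = reasoner.reason(["clip.mp4"], "Is the pallet stable?")

# Generation call: both towers run, so the reasoner call count increases again.
frames = generator.generate(["clip.mp4"], "Forklift approaches the pallet", steps=3)
```

Because the generator holds a reference to the reasoner, a caller never has to wire the two together per request, which is the orchestration saving the paragraph above describes.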

Choose the right model size

Two Cosmos 3 models are currently available:

Cosmos 3 Nano is the compact version, with 16B parameters, and is optimized for efficient inference. It’s designed to run on workstation-grade compute, like the NVIDIA RTX PRO 6000 GPU, for real-time robotics inference and physical AI applications.

Cosmos 3 Super is a 64B-parameter model designed for maximum quality and capability. It delivers the highest benchmark scores and targets datacenter deployment on NVIDIA Hopper and NVIDIA Blackwell GPUs, making it suitable for large-scale synthetic data generation and advanced physical reasoning workloads.
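A rough way to compare the two sizes is to estimate the memory needed for the weights alone. The sketch below assumes 2 bytes per parameter (BF16 or FP16), which the post does not state, and it ignores activations, caches, and framework overhead, so real requirements will be higher. The helper names and the 80 GB per-GPU figure are illustrative inputs, not values from the post.

```python
import math

# Parameter counts as given in the post.
MODEL_PARAMS = {"Cosmos 3 Nano": 16e9, "Cosmos 3 Super": 64e9}


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params: float, gpu_memory_gb: float, bytes_per_param: int = 2) -> int:
    """Lower bound on GPU count needed just to hold the weights."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)


# Assumed per-GPU memory; substitute the capacity of your own hardware.
ASSUMED_GPU_MEMORY_GB = 80.0

for name, params in MODEL_PARAMS.items():
    gb = weight_memory_gb(params)
    gpus = min_gpus_for_weights(params, ASSUMED_GPU_MEMORY_GB)
    print(f"{name}: ~{gb:.0f} GB of weights, at least {gpus} GPU(s) at {ASSUMED_GPU_MEMORY_GB:.0f} GB each")
```

Under these assumptions the fourfold difference in parameter count carries straight through to the weight footprint, which is consistent with the post positioning Nano for a single workstation GPU and Super for datacenter deployment.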

Supported modalities

Cosmos 3 supports the following input and output modalities through its unified architecture:

Input                  Output          Application
Text                   Image           Physically plausible image generation
Text | Video           Video           World model for rare edge-case video data generation
Text | Image           Video           World model for prediction
Text | Image | Video   Text            VLM for reasoning
Action | Video | Text  Video           Action-conditioned world model
Video | Text           Video | Action  World action model, video action model, vision language action model, policy model for robot learning

Table 1. Input and output modalities supported by Cosmos 3 for different applications
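One reading of Table 1 can be captured as a small lookup from modality combinations to applications, which is handy when routing requests in a pipeline. This is a sketch, not part of any Cosmos tooling: the names `TABLE_1` and `application_for` are hypothetical, and each table cell is treated as a plain set of modality names. The post does not say whether a row requires all listed inputs or accepts a subset, so the lookup matches the full set only.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Tuple

Modalities = FrozenSet[str]


def _m(*names: str) -> Modalities:
    """Build a normalized (lowercase) modality set."""
    return frozenset(n.lower() for n in names)


# (inputs, outputs) -> application, transcribed from Table 1.
TABLE_1: Dict[Tuple[Modalities, Modalities], str] = {
    (_m("text"), _m("image")): "Physically plausible image generation",
    (_m("text", "video"), _m("video")): "World model for rare edge-case video data generation",
    (_m("text", "image"), _m("video")): "World model for prediction",
    (_m("text", "image", "video"), _m("text")): "VLM for reasoning",
    (_m("action", "video", "text"), _m("video")): "Action-conditioned world model",
    (_m("video", "text"), _m("video", "action")): (
        "World action model, video action model, vision language action model, "
        "policy model for robot learning"
    ),
}


def application_for(inputs: Iterable[str], outputs: Iterable[str]) -> Optional[str]:
    """Return the application listed for this exact input/output combination, or None."""
    return TABLE_1.get((_m(*inputs), _m(*outputs)))


print(application_for(["Text", "Image"], ["Video"]))
print(application_for(["Audio"], ["Text"]))
```

Using frozensets as keys makes the lookup independent of the order in which modalities are listed, so `["Video", "Text"]` and `["Text", "Video"]` resolve to the same row.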

Open datasets for physical AI

With the Cosmos 3 release, NVIDIA is open-sourcing six synthetic data generation (SDG) datasets on Hugging Face. These cover robotics, physics simulation, spatial reasoning, human motion, driving, and warehouse environments, and can be used for post-training Cosmos 3 and other models.

Physical AI World Model Synthetic Datasets include:

Embodied robot scenes

Physical interaction scenes

Spatial reasoning

Digital human scenes

Autonomous driving scenarios

Warehouse operations scenes

Figure 4. Manipulation examples from the Embodied Robot Scenes dataset

Figure 5. Examples from the Physical Interaction Scenes dataset

Figure 6. Examples from the Spatial Reasoning dataset

Figure 7. Examples from the Digital Human Scenes dataset

Figure 8. Examples from the Autonomous Driving...
