Cosmos 3 — Cosmos Lab
Cosmos 3
Omnimodal World Models for Physical AI
Multiple modalities, many applications.<br>One single model.
Cosmos 3 connects understanding, generation, simulation, and action through a shared omnimodal world model that moves fluidly across text, images, video, audio, and actions.
[Diagram: Language, Image, Video, Audio, and Action appear on both sides of Cosmos 3, which is labeled "Unified MoT architecture", "Autoregressive", and "Diffusion".]
Explore how Cosmos 3 couples different modalities with each capability: Vision-Language Reasoning, Image Generation, Audio-Visual Generation, Robot Policy, Forward Dynamics, and Inverse Dynamics.
Vision-Language Reasoning
Reason through the physical world.
Cosmos 3 grounds language in images and video, reading spatial relationships, temporal cues, object states, and actions as shared context for deeper physical reasoning.
Prompt: You are an autonomous vehicle planning system. The video shows observations from the vehicle's camera. You need to identify critical objects in the environment and reason about your next action and driving trajectory.
I am decelerating and keeping my lane as I approach an intersection with traffic lights and other vehicles. The presence of traffic lights and vehicles ahead necessitates that I slow down to ensure safety and compliance with traffic rules. The lane markings indicate a straight path, and I am maintaining my lane position.
Prompt: You are given the task "Put the flower into the red bottle." Specify the 2D trajectory your end-effector should follow in pixel space. Return the trajectory as (x, y) coordinates. Answer using the following format: Your reasoning.<br>Write your final answer immediately after the tag.
I will move my gripper from its current position at [490, 419] to the red flower at [390, 700] to grasp it. After securely picking up the flower, I will lift it and move it to the red bottle at [710, 605], positioning the gripper above the bottle’s opening at [710, 500] so I can place the flower inside. This trajectory gives me a direct and efficient path from the flower to the target container while avoiding obstacles on the wooden table.
The trajectory is: [490, 419], [388, 672], [411, 411], [690, 364], [690, 364]
(490, 419) Start trajectory
(388, 672) Move to flower
(411, 411) Lift flower
(690, 364) Move above bottle
(690, 364) Place flower
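A trajectory answer like the one above can be turned into waypoints with a few lines of code. The following is a minimal Python sketch and not part of Cosmos 3. It assumes the answer arrives as plain text containing integer `[x, y]` pairs, as in this example, and the helper names `parse_trajectory` and `drop_repeats` are hypothetical.

```python
import re

# Matches integer pairs written as [x, y].
# ASSUMPTION: this format is inferred from the example answer, not from a documented spec.
PAIR = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*\]")


def parse_trajectory(text):
    """Return every [x, y] pair found in text as a list of (x, y) integer tuples."""
    return [(int(x), int(y)) for x, y in PAIR.findall(text)]


def drop_repeats(points):
    """Remove consecutive duplicate waypoints while keeping the original order."""
    out = []
    for p in points:
        if not out or out[-1] != p:
            out.append(p)
    return out


answer = "The trajectory is: [490, 419], [388, 672], [411, 411], [690, 364], [690, 364]"
waypoints = parse_trajectory(answer)
unique_waypoints = drop_repeats(waypoints)
print(waypoints)
print(unique_waypoints)
```

In this example the repeated final waypoint corresponds to the "Place flower" step at the same position, so `drop_repeats` is only appropriate when the geometric path alone matters.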
Prompt: List all action segments in the video. For each detected event, provide the start and end times in seconds. List multiple events if applicable.
(0.3, 3.4): "A humanoid robot with a sleek white and black design stands beside a red popcorn dispenser filled with golden popcorn. The robot uses its right arm to pick up a green paper cup from the table in front of it, preparing to fill it."
(3.4, 14.8): "The robot holds the green cup steady with its left arm while using its right arm to maneuver a metal scoop into the popcorn dispenser. It scoops popcorn twice, carefully transferring each portion into the cup and ensuring the cup is adequately filled."
(14.8, 18.7): "After filling the cup, the robot places it back on the table and returns the scoop to its original position inside the dispenser. It then retracts both arms slightly, completing the task with precision and efficiency."
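Segment lists in this shape are straightforward to post-process. The sketch below is a hedged Python illustration, not an official client: it assumes each event is printed on its own line as `(start, end): "description"`, matching the output above. The `Segment` class and `parse_segments` function are hypothetical names, and the descriptions in the sample are abbreviated for brevity.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    start: float        # start time in seconds
    end: float          # end time in seconds
    description: str

    @property
    def duration(self):
        return self.end - self.start


# Matches lines shaped like: (0.3, 3.4): "description"
# ASSUMPTION: one event per line, times given as plain decimal numbers.
SEGMENT = re.compile(r'\(\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\):\s*"(.*)"')


def parse_segments(text):
    """Parse every matching line of text into a Segment, in order of appearance."""
    segments = []
    for line in text.splitlines():
        m = SEGMENT.search(line)
        if m:
            segments.append(Segment(float(m.group(1)), float(m.group(2)), m.group(3)))
    return segments


# Sample output with shortened descriptions (illustrative only).
output = '''(0.3, 3.4): "The robot picks up a green paper cup."
(3.4, 14.8): "The robot scoops popcorn into the cup."
(14.8, 18.7): "The robot puts the cup down and returns the scoop."'''

segments = parse_segments(output)
# True when each segment starts exactly where the previous one ends.
contiguous = all(a.end == b.start for a, b in zip(segments, segments[1:]))
for s in segments:
    print(f"{s.start:5.1f}s to {s.end:5.1f}s ({s.duration:.1f}s): {s.description}")
```

A contiguity check like this is one simple way to spot gaps or overlaps between detected events before using the timestamps downstream.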
Prompt: Localize all boxes with the "IMPOSSIBLE" label on their sides.
IMPOSSIBLE: (495, 92, 708, 272)
IMPOSSIBLE: (497, 267, 712, 462)
IMPOSSIBLE: (499, 462, 727, 662)
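The localization output above can be parsed into labeled boxes as follows. This is a minimal Python sketch under stated assumptions: the page does not say what the four numbers mean, so the code assumes they are `(x_min, y_min, x_max, y_max)` corners, which is consistent with the values shown but not confirmed. `parse_boxes` is a hypothetical helper name.

```python
import re

# Matches lines shaped like: LABEL : (a, b, c, d), with or without spaces around the colon.
BOX = re.compile(r"(\w+)\s*:\s*\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_boxes(text):
    """Return one dict per detected box with its label, corners, width, and height."""
    boxes = []
    for label, *coords in BOX.findall(text):
        # ASSUMPTION: coordinates are ordered (x_min, y_min, x_max, y_max).
        x_min, y_min, x_max, y_max = map(int, coords)
        boxes.append({
            "label": label,
            "box": (x_min, y_min, x_max, y_max),
            "width": x_max - x_min,
            "height": y_max - y_min,
        })
    return boxes


output = """IMPOSSIBLE: (495, 92, 708, 272)
IMPOSSIBLE: (497, 267, 712, 462)
IMPOSSIBLE: (499, 462, 727, 662)"""

boxes = parse_boxes(output)
for b in boxes:
    print(b["label"], b["box"], b["width"], b["height"])
```

Under that corner-order assumption, the three boxes come out as a vertical stack of similar size, which matches a column of labeled boxes.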
Image Generation
Create scenes with physical detail.
Cosmos 3 turns language descriptions into realistic images, preserving object detail, spatial layout, and physical cues for downstream generation.
Prompt: Photorealistic aerial photo of a fixed-wing survey drone banking left above a dense mountain forest. The drone’s white composite fuselage shows a small GPS antenna and a belly-mounted mapping camera pod; its wings cast a faint shadow on the treetops below. Evergreen canopy fills the frame with occasional rocky clearings, while ridgelines show patchy snow. Soft morning light and thin haze sit in the valleys, with distant peaks fading blue. The camera is positioned slightly above and ahead of the drone, with crisp detail and natural exposure.
Prompt: Photorealistic textile workshop: an overhead gantry system moves a laser cutting head across a large flat vacuum table covered with matte black fabric. The laser traces sharp pattern lines, leaving clean-cut edges and a thin glowing point at the cut. A faint wisp of smoke rises toward an extraction hood, and amber safety lighting mixes with cool ambient light. A top-down camera angle shows grid markings on the table, cable chains on the gantry, and neatly stacked fabric rolls along the wall.
Prompt:...