Nvidia released Nemotron 3 Ultra, a new open model


NVIDIA Nemotron 3 Ultra Powers Faster, More Efficient Reasoning for Long-Running Agents | NVIDIA Technical Blog



Jun 04, 2026

By Chris Alexiuk and Chintan Patel


AI-Generated Summary


NVIDIA released Nemotron 3 Ultra, a 550B-parameter Mixture-of-Experts model with 55B active parameters, optimized for orchestrating complex, long-running agent workflows by combining frontier reasoning and high throughput with domain adaptability.

Architectural innovations include hybrid Mamba-Transformer layers for efficient long-context handling, NVFP4 quantization for cross-architecture GPU deployment with up to 5x higher throughput, LatentMoE for expert routing, and multi-token prediction for improved generative speed in multi-turn tasks.

Multi-Teacher On-Policy Distillation enables continuous improvement and domain specialization by training Nemotron 3 Ultra with dense feedback from over ten domain-specific teacher models, supported by a massive and transparent pretraining and RL data pipeline, with fully open recipes, weights, and licensing for broad adoption and fine-tuning.


Single-turn chatbots are evolving into long-running agents that can reason, maintain context, use tools, and run efficiently across many turns to complete complex workflows.

However, these multi-agent workflows cause token counts to grow quickly. Agents plan, call tools, invoke sub-agents, receive information, and then pass history, outputs, and reasoning steps back into the model continuously. As tasks run longer, this constant communication increases costs and the risk of goal drift.
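As a rough illustration of why token counts grow quickly, the sketch below assumes every turn adds a fixed number of tokens and the full history is passed back to the model on the next turn. The function names and the example numbers are illustrative assumptions, not measurements from the article.

```python
def total_prompt_tokens(turns: int, system_tokens: int, tokens_per_turn: int) -> int:
    """Prompt tokens processed over a session that re-sends the full history.

    Assumption (illustrative only): each turn appends `tokens_per_turn` tokens
    (tool output, reasoning, sub-agent replies) and the whole history is read
    again on every turn.
    """
    total = 0
    history = system_tokens
    for _ in range(turns):
        history += tokens_per_turn  # the context keeps growing
        total += history            # the model reads all of it again
    return total


def total_prompt_tokens_closed_form(turns: int, system_tokens: int, tokens_per_turn: int) -> int:
    """Same quantity as above, written as an arithmetic series."""
    return turns * system_tokens + tokens_per_turn * turns * (turns + 1) // 2


if __name__ == "__main__":
    for n in (10, 100):
        print(n, total_prompt_tokens(n, system_tokens=1_000, tokens_per_turn=2_000))
```

Under these assumptions, 10 turns process 120,000 prompt tokens while 100 turns process 10,200,000, so ten times more turns costs roughly 85 times more prompt tokens. The growth is quadratic in the number of turns, which is why fewer tokens per turn matters for long-running agents.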

Developers can solve this using a system of models: frontier reasoning models for orchestration and complex planning, and efficient models for high-volume execution, validation, and tool calling.
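A minimal sketch of such a system of models might route each step by its kind. Everything named here (the `Step` type, the tier labels, the step kinds, and the fallback rule) is a hypothetical illustration rather than an NVIDIA API.

```python
from dataclasses import dataclass

# Hypothetical tier labels; a real deployment would map these to model endpoints.
FRONTIER = "frontier-reasoning-model"
EFFICIENT = "efficient-execution-model"

# Step kinds taken from the roles described in the text.
ORCHESTRATION_KINDS = {"orchestration", "planning"}
EXECUTION_KINDS = {"execution", "validation", "tool_call"}


@dataclass(frozen=True)
class Step:
    kind: str
    prompt: str


def pick_model(step: Step) -> str:
    """Send planning work to the frontier tier and high-volume work to the efficient tier."""
    if step.kind in EXECUTION_KINDS:
        return EFFICIENT
    if step.kind in ORCHESTRATION_KINDS:
        return FRONTIER
    # Assumed design choice: unknown step kinds go to the stronger model.
    return FRONTIER


if __name__ == "__main__":
    plan = [Step("planning", "Break the migration into tasks"),
            Step("tool_call", "Run the test suite"),
            Step("validation", "Check the diff against the spec")]
    for step in plan:
        print(step.kind, "->", pick_model(step))
```

Routing unknown kinds to the frontier tier trades cost for safety. A cost-sensitive system could invert that default.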

NVIDIA is releasing NVIDIA Nemotron 3 Ultra, an open model built to help long-running agents complete tasks faster while lowering cost.

Nemotron 3 Ultra for agent orchestration

Nemotron 3 Ultra is a 550B-parameter Mixture-of-Experts model with 55B active parameters, built for frontier reasoning and orchestration in agentic systems.
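To put the two parameter counts in proportion, the sketch below computes the fraction of weights active per token. The FLOPs estimate uses the common rule of thumb of about two floating-point operations per active parameter per token, which ignores attention-score compute. It is an approximation assumed here, not a figure from the article.

```python
TOTAL_PARAMS = 550e9   # from the article
ACTIVE_PARAMS = 55e9   # from the article


def active_fraction(active: float, total: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active / total


def approx_forward_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb estimate: one multiply and one add per active parameter."""
    return 2.0 * active_params


if __name__ == "__main__":
    print(f"active fraction: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.0%}")
    print(f"approx forward FLOPs per token: {approx_forward_flops_per_token(ACTIVE_PARAMS):.2e}")
```

In other words, each token touches about one tenth of the model's weights, which is how a Mixture-of-Experts model can carry large total capacity at a much smaller per-token compute cost.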

Within any agent workflow, most calls are routine, but a critical subset demands deeper reasoning. Nemotron 3 Ultra is built to handle these hard calls: sustaining architectural decisions across coding sessions, synthesizing contradictory evidence across hundreds of research sources, or verifying chip designs across thousands of constraints.

| Category | Benchmark | Nemotron 3 Ultra (550B) | GLM 5.1 (744B) | Kimi K2.6 (1T) | Qwen3.5 (397B) |
|---|---|---|---|---|---|
| Agent Productivity | PinchBench | 91% | 84% | 91% | 89% |
| Long-horizon Planning | EnterpriseOps-Gym | 33% | 40% | 29% | 30% |
| Coding | Terminal-Bench 2.0 | 54% | 64% | 67% | 53% |
| Instruction Following | IFBench | 82% | 77% | 74% | 78% |
| Knowledge Work | GDPVal-AA | 1,448 | 1,594 | 1,508 | 1,192 |
| Professional Work Tasks | ProfBench (Search) | 56% | 46% | 56% | 53% |
| Long Context | Ruler @1M | 95% | N/A (max 256K) | N/A (max 256K) | 90% |

Table 1. Nemotron 3 Ultra delivers frontier accuracy in a smaller model

Nemotron 3 Ultra is also fast. It achieves 5x higher throughput compared to other open models in its class, enabling long-running agents to complete tasks faster and more efficiently.

Figure 2. Nemotron 3 Ultra achieves 5x faster inference while delivering leading accuracy on the Artificial Analysis Intelligence Index leaderboard

Nemotron 3 Ultra is also built for efficiency. In experiments on SWE-bench and Terminal-Bench 2.0, it completed the benchmarks using fewer total tokens and fewer tokens per turn than comparable models. This lowers the cost of agentic tasks by up to 30%.

Figure 3. Nemotron 3 Ultra lowers the cost to task completion by up to 30%

Breakthroughs powering Nemotron 3 Ultra

To mitigate the typical efficiency-accuracy tradeoffs for high-capacity reasoning models, the Nemotron models introduce architectural innovations:

Post-trained for agent harnesses

Nemotron 3 Ultra is post-trained to deliver consistent accuracy across top harnesses. The model is trained using the NVIDIA NeMo RL and Gym open libraries with one of the largest suites of long-running, task-solving, tool-using datasets in the world.

Ultra is optimized for agent-led open harnesses, not just single-turn chat, and is designed to work within workflows where agents plan, call tools, read observations, delegate to sub-agents, validate outputs, and recover from errors across many turns.

Hybrid Mamba-Transformer

Mamba layers improve sequence efficiency for long-context workloads, while Transformer layers preserve precise recall when agents need to retrieve specific facts from large context windows.
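One way to see the long-context benefit is memory. Attention layers keep a key-value cache that grows with sequence length, while Mamba layers keep a state whose size does not depend on it. The sketch below uses the standard KV-cache size formula. The layer counts, head counts, and head dimension are hypothetical placeholders, since the article does not give the model's actual configuration.

```python
def kv_cache_bytes(seq_len: int, n_attn_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Key-value cache size for one sequence.

    The factor of 2 covers keys and values; only attention layers contribute.
    """
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


if __name__ == "__main__":
    SEQ_LEN = 1_000_000  # a 1M-token context, as in the Ruler @1M row
    # Hypothetical stacks: all-attention vs. a hybrid with few attention layers.
    full = kv_cache_bytes(SEQ_LEN, n_attn_layers=64, n_kv_heads=8, head_dim=128)
    hybrid = kv_cache_bytes(SEQ_LEN, n_attn_layers=8, n_kv_heads=8, head_dim=128)
    print(f"all-attention stack: {full / 1e9:.1f} GB")
    print(f"hybrid stack:        {hybrid / 1e9:.1f} GB (plus fixed-size Mamba state)")
```

Under these placeholder dimensions, the cache shrinks in proportion to the number of attention layers that remain, which is the part of the footprint that scales with context length.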

NVFP4 precision

The same NVFP4 checkpoint runs on NVIDIA Hopper, NVIDIA Blackwell, and Ampere GPUs. Developers can use one checkpoint across all NVIDIA GPU architectures...
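For readers unfamiliar with the format, the sketch below simulates the core idea behind block-scaled 4-bit floating point: values are grouped into small blocks, each block shares one scale, and each value is stored as one of the eight E2M1 magnitudes plus a sign. It is a simplified assumption-laden illustration in pure Python. The scale is kept as an ordinary float rather than an FP8 value, the per-tensor scale is omitted, and none of this is NVIDIA's implementation.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # values that share one scale factor


def quantize_block(values):
    """Return (scale, codes) where each code is a signed E2M1 grid value."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0.0, [0.0] * len(values)
    scale = amax / E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    codes = []
    for v in values:
        # Nearest grid magnitude; ties resolve to the smaller magnitude.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(mag if v >= 0 else -mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a block's scale and codes."""
    return [scale * c for c in codes]


def quantize(values, block_size=BLOCK_SIZE):
    """Quantize a flat list block by block."""
    return [quantize_block(values[i:i + block_size])
            for i in range(0, len(values), block_size)]


if __name__ == "__main__":
    scale, codes = quantize_block([12.0, -6.0, 1.0, 0.0, 5.1])
    print(scale, codes, dequantize_block(scale, codes))
```

Sharing one scale per small block keeps outliers in one block from degrading precision in another, at the cost of storing one extra scale per block.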
