LoongForge End-to-End Optimization for GR00T N1.6 Delivers 2.3× Throughput

LoongForge · VLA · GR00T · Training

2026-06-02 · The LoongForge Team

To address IO stalls, communication overhead, and inefficient operator scheduling in GR00T N1.6 VLA model training, Baidu Baige's LoongForge delivers end-to-end system-level optimization, achieving up to 2.3× training throughput and shortening the overall training cycle by 56.6%.

Official website: https://baidu-baige.github.io/LoongForge/

GitHub: https://github.com/baidu-baige/LoongForge

1. Background: the capability leap and challenges of GR00T N1.6 as an embodied-intelligence foundation

As humanoid robots accelerate toward industrialization, Vision-Language-Action (VLA) models have become a core technical path for embodied intelligence, thanks to their ability to connect perception, understanding, and action end-to-end. Among embodied-intelligence foundation models, NVIDIA's open-source GR00T N series is a representative technology stack for humanoid-robot scenarios and is widely used in robot training, R&D, and deployment.

Released in 2025, GR00T N1.6 further revamps both the model architecture and the action-generation paradigm, significantly strengthening end-to-end intelligent control of humanoid robots. The model uses Cosmos-Reason-2B as its multimodal vision-language perception core and introduces a 32-layer DiT backbone for action generation. It jointly models first-person robot video, proprioceptive state, and natural-language instructions as a shared policy representation that unifies perception, understanding, and action control.

The deep DiT enables high-precision modeling of long action sequences and substantially improves control quality, but it also makes training both compute- and communication-intensive, and therefore costly and difficult.

According to the official configuration, the pre-training stage uses a global batch size of 16,384 and runs roughly 300K steps on 1,024 H100 GPUs. Even fine-tuning on a downstream task on a single node takes several days. Data IO stalls, multi-GPU communication overhead, and inefficient training scheduling all combine to make GR00T N1.6 training expensive and slow, hindering rapid model iteration.
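To put the pre-training figures above in perspective, the short calculation below derives the per-GPU share of each step and the total number of samples processed. It is a back-of-the-envelope sketch: it assumes pure data parallelism with no gradient accumulation (the article does not state the parallelism layout), and it treats "roughly 300K steps" as exactly 300,000.

```python
# Figures quoted from the official configuration cited in the text.
GLOBAL_BATCH_SIZE = 16_384
TRAIN_STEPS = 300_000      # "roughly 300K steps", taken as exact here
NUM_GPUS = 1_024

# Assumption: pure data parallelism, no gradient accumulation.
samples_per_gpu_per_step = GLOBAL_BATCH_SIZE // NUM_GPUS
total_samples = GLOBAL_BATCH_SIZE * TRAIN_STEPS

print(f"samples per GPU per step: {samples_per_gpu_per_step}")  # 16
print(f"total samples processed:  {total_samples:,}")           # 4,915,200,000
```

With only 16 samples per GPU per step under these assumptions, fixed per-step costs such as data loading, synchronization, and kernel launches make up a visible share of the step time, which is why the optimizations below target them.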

2. Solution overview: LoongForge end-to-end system-level optimization

To further improve GR00T N1.6 training efficiency, the Baidu Baige team applied system-level optimization and deep refactoring across the full training pipeline, on top of the in-house, open-source full-modal training framework LoongForge.

Targeting the characteristics of VLA training, LoongForge focuses on three directions (data IO pipeline, communication-computation overlap, and training scheduling):

Introduce asynchronous prefetch in data processing to mitigate GPU idling caused by data loading and transfer latency;

Use a distributed optimizer with fine-grained communication-computation overlap to reduce extra cost from multi-GPU synchronization stalls;

Adapt CUDA Graph to cut launch overhead from the large number of small-granularity operators.
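As a rough illustration of why communication-computation overlap helps, the toy model below compares a step in which all gradient communication runs after the backward pass with one in which each layer's communication starts as soon as that layer's gradients are ready. It is not LoongForge's implementation: the function names and the per-layer timings are hypothetical, and real overlap also depends on bucket sizes, stream scheduling, and bandwidth contention.

```python
def step_time_serial(compute_ms, comm_ms):
    """All communication runs only after all backward compute has finished."""
    return sum(compute_ms) + sum(comm_ms)


def step_time_overlapped(compute_ms, comm_ms):
    """Each layer's communication starts once its gradients are ready and the
    communication channel is free; compute continues in the meantime."""
    compute_done = 0.0
    comm_done = 0.0
    for c, m in zip(compute_ms, comm_ms):
        compute_done += c
        comm_done = max(comm_done, compute_done) + m
    return max(compute_done, comm_done)


# Hypothetical per-layer timings in milliseconds, for illustration only.
compute = [4.0, 4.0, 4.0, 4.0]
comm = [3.0, 3.0, 3.0, 3.0]

print(step_time_serial(compute, comm))      # 28.0
print(step_time_overlapped(compute, comm))  # 19.0
```

In this toy setting only the last layer's communication stays exposed; the rest is hidden behind the backward compute of later layers.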

Compared to the official training implementation, LoongForge ultimately delivers up to 2.3× training throughput and reduces the overall training cycle by 56.6%. So how exactly does LoongForge release more GPU compute and accelerate GR00T N1.6 training? Below we systematically break down the key ideas and technical implementation.
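The two headline numbers are two views of the same measurement. For a fixed amount of training work, wall-clock time scales as the inverse of throughput, so a speedup of s shortens the cycle by 1 - 1/s. The sketch below checks that relationship. It assumes throughput is constant over the whole run, and the helper names are ours, not part of LoongForge.

```python
def cycle_reduction(speedup: float) -> float:
    """Fraction of wall-clock time saved for a fixed workload."""
    return 1.0 - 1.0 / speedup


def speedup_from_reduction(reduction: float) -> float:
    """Throughput multiple implied by a given time reduction."""
    return 1.0 / (1.0 - reduction)


print(f"{cycle_reduction(2.3):.1%}")            # 56.5%
print(f"{speedup_from_reduction(0.566):.2f}x")  # 2.30x
```

A 56.6% reduction corresponds to roughly 2.30× throughput, so the two figures agree up to rounding.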

3. Inside the 2.3× speedup: three engineering optimizations

To unlock GR00T N1.6's training potential, we did not stop at simple parameter tuning, but performed system-level optimization at three layers: data IO pipeline, communication-computation overlap, and training scheduling.

Optimization 1: IO pipeline — asynchronous data prefetch

GR00T N1.6 data preprocessing involves CPU-heavy operations such as video decoding, image augmentation, and multimodal encoding. In the LeRobot framework, the GPU spends a large fraction of its time waiting on data, a classic IO stall.

Baseline: data processing and forward executed serially

LoongForge decouples data production from GPU training via a three-level asynchronous pipeline:

Level 1 (data reading): multiple DataLoader workers read from disk in parallel, each prefetching n additional batches.

Level 2 (CPU preprocessing): a dedicated daemon thread runs image / video / text preprocessing, decoupled from the training main loop via double-buffered queues to avoid cross-process tensor serialization overhead.

Level 3 (GPU DMA transfer): with pinned memory and non-blocking transfers, the GPU asynchronously moves data into VRAM on a dedicated copy stream, fully overlapping with computation.

While the GPU computes the current batch, the next batch is being transferred, the one after is being preprocessed,...
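The staged structure described above can be sketched with bounded queues and daemon threads from the Python standard library. This is a minimal illustration, not LoongForge's code: every function name here is hypothetical, all three stages run as threads (the article uses DataLoader worker processes for level 1), and the transfer stage is a placeholder where a real pipeline would issue a pinned-memory, non-blocking copy on a separate CUDA stream. The queue depth of 2 is our stand-in for the double buffering mentioned in the text.

```python
import queue
import threading

_DONE = object()  # sentinel that tells downstream stages to shut down


def read_batches(n):
    """Level 1 stand-in: a real loader would read samples from disk."""
    for i in range(n):
        yield {"id": i, "raw": [i] * 4}


def preprocess(batch):
    """Level 2 stand-in for CPU-heavy decoding and augmentation."""
    return {"id": batch["id"], "x": [v * 2 for v in batch["raw"]]}


def to_device(batch):
    """Level 3 placeholder: a real pipeline would start an async host-to-GPU copy."""
    return batch


def _stage(fn, src, dst):
    """Pull from src, apply fn, push to dst; forward the sentinel at the end."""
    while True:
        item = src.get()
        if item is _DONE:
            dst.put(_DONE)
            return
        dst.put(fn(item))


def prefetch_pipeline(n_batches, depth=2):
    """Yield device-ready batches while earlier stages keep working ahead."""
    raw_q = queue.Queue(maxsize=depth)
    ready_q = queue.Queue(maxsize=depth)
    device_q = queue.Queue(maxsize=depth)

    def reader():
        for batch in read_batches(n_batches):
            raw_q.put(batch)  # blocks when the next stage falls behind
        raw_q.put(_DONE)

    workers = [
        threading.Thread(target=reader, daemon=True),
        threading.Thread(target=_stage, args=(preprocess, raw_q, ready_q), daemon=True),
        threading.Thread(target=_stage, args=(to_device, ready_q, device_q), daemon=True),
    ]
    for w in workers:
        w.start()

    while True:
        item = device_q.get()
        if item is _DONE:
            return
        yield item


def train(n_batches):
    """Stand-in training loop: records the order in which batches arrive."""
    return [batch["id"] for batch in prefetch_pipeline(n_batches)]


print(train(5))  # [0, 1, 2, 3, 4]
```

Because each stage is a single thread feeding a FIFO queue, batch order is preserved, and the bounded queues cap memory use by making a fast stage wait for a slower one instead of running arbitrarily far ahead.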
