Training Cycle Halved: LoongForge End-to-End Optimization for GR00T N1.6 Delivers 2.3× Throughput | LoongForge Blog
Tags: LoongForge, VLA, GR00T, Training
2026-06-02 · The LoongForge Team
To address IO stalls, communication overhead, and inefficient operator scheduling in GR00T N1.6 VLA model training, Baidu Baige's LoongForge delivers end-to-end system-level optimization, achieving up to 2.3× training throughput and shortening the overall training cycle by 56.6%.
Official website: https://baidu-baige.github.io/LoongForge/
GitHub: https://github.com/baidu-baige/LoongForge
1. Background: the capability leap and challenges of GR00T N1.6 as an embodied-intelligence foundation
As humanoid robots accelerate toward industrialization, Vision-Language-Action (VLA) models have become a core technical path for embodied intelligence because they connect perception, understanding, and action end to end. Among embodied-intelligence foundation models, NVIDIA's open-source GR00T N series is a representative technology stack for humanoid-robot scenarios and is widely used in robotic intelligence training and R&D deployment.
Released in 2025, GR00T N1.6 further revamps both the model architecture and the action-generation paradigm, significantly strengthening end-to-end intelligent control of humanoid robots. The model uses Cosmos-Reason-2B as its multimodal vision-language perception core, and introduces a 32-layer DiT backbone for action generation, jointly modeling first-person robot video, proprioceptive state, and natural-language instructions as a shared policy representation—unifying perception, understanding, and action control.
The deep DiT enables high-precision modeling of long action sequences and substantially improves control quality, but it also makes training both compute- and communication-intensive, which raises training cost and difficulty.
According to the official configuration, the pre-training stage uses a global batch size of 16,384 and runs roughly 300K steps on 1,024 H100 GPUs. Even fine-tuning on a downstream task on a single node takes several days. Data IO stalls, multi-GPU communication overhead, and inefficient training scheduling all combine to make GR00T N1.6 training expensive and slow, hindering rapid model iteration.
2. Solution overview: LoongForge end-to-end system-level optimization
To further improve GR00T N1.6 training efficiency, the Baidu Baige team applied system-level optimization and deep refactoring across the full training pipeline, on top of the in-house, open-source full-modal training framework LoongForge.
Targeting the characteristics of VLA training, LoongForge focuses on three directions (the data IO pipeline, communication-computation overlap, and training scheduling) with the following measures:
Introduce asynchronous prefetch in data processing to mitigate GPU idling caused by data loading and transfer latency;
Use a distributed optimizer with fine-grained communication-computation overlap to reduce extra cost from multi-GPU synchronization stalls;
Adapt CUDA Graph to cut launch overhead from the large number of small-granularity operators.
Compared with the official training implementation, LoongForge ultimately delivers up to 2.3× training throughput and reduces the overall training cycle by 56.6%. So how exactly does LoongForge free up more GPU compute and accelerate GR00T N1.6 training? Below we break down the key ideas and their implementation.
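As a quick sanity check on how the two headline numbers relate, here is a minimal sketch. It assumes a fixed amount of training work and a throughput gain that holds for the whole run, which the post does not state explicitly; the function names are illustrative only.

```python
def cycle_reduction(speedup: float) -> float:
    """Fraction of wall-clock time saved for a fixed amount of work."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


def implied_speedup(reduction: float) -> float:
    """Throughput multiple implied by a fractional cut in wall-clock time."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    # Time saved if throughput is 2.3x for the entire run.
    print(f"{cycle_reduction(2.3):.1%}")
    # Throughput multiple implied by a 56.6% shorter training cycle.
    print(f"{implied_speedup(0.566):.2f}x")
```

Under these assumptions a 2.3× throughput gain corresponds to roughly 56.5% less wall-clock time, and a 56.6% reduction to roughly 2.30×, so the two figures agree to within rounding.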
3. Inside the 2.3× speedup: three engineering optimizations
To unlock GR00T N1.6's training potential, we did not stop at simple parameter tuning, but performed system-level optimization at three layers: data IO pipeline, communication-computation overlap, and training scheduling.
Optimization 1: IO pipeline — asynchronous data prefetch
GR00T N1.6 data preprocessing involves CPU-heavy operations such as video decoding, image augmentation, and multimodal encoding. In the LeRobot framework, the GPU spends a large fraction of its time waiting on data, a classic IO stall.
Baseline: data processing and forward executed serially
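One rough way to quantify the stall is an idealized per-step timing model. This model is an assumption for illustration, not a measurement from the post: let $t_{\text{data}}$ be the per-batch time for reading, preprocessing, and transfer, and $t_{\text{gpu}}$ the per-batch compute time.

```latex
\begin{aligned}
T_{\text{serial}} &= t_{\text{data}} + t_{\text{gpu}},
&\qquad
\text{GPU idle fraction} &= \frac{t_{\text{data}}}{t_{\text{data}} + t_{\text{gpu}}}, \\
T_{\text{overlap}} &\approx \max\left(t_{\text{data}},\, t_{\text{gpu}}\right)
&&\text{(steady state, ideal overlap)}.
\end{aligned}
```

For instance, if the two times were equal, the serial baseline would leave the GPU idle half the time, while ideal overlap would hide the data time entirely.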
LoongForge decouples data production from GPU training via a three-level asynchronous pipeline:
Level 1 (data reading): multiple DataLoader workers read from disk in parallel, each prefetching n additional batches.
Level 2 (CPU preprocessing): a dedicated daemon thread runs image, video, and text preprocessing, decoupled from the main training loop via double-buffered queues to avoid cross-process tensor serialization overhead.
Level 3 (GPU DMA transfer): with pinned memory and non-blocking transfers, batches are copied asynchronously into VRAM on a dedicated copy stream, fully overlapping with computation.
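The three levels above can be sketched as a chain of producer-consumer stages connected by bounded queues. This is a minimal standard-library illustration, not LoongForge's implementation: every function name is hypothetical, the "read", "preprocess", and "transfer" steps are stand-ins, and in a PyTorch setup the corresponding mechanisms would presumably be `DataLoader` workers with `prefetch_factor`, `pin_memory=True`, and `tensor.to(device, non_blocking=True)` on a separate `torch.cuda.Stream`.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the batch stream


def run_stage(work, in_q, out_q):
    """Apply `work` to every item from in_q and forward the result to out_q."""
    while True:
        item = in_q.get()
        if item is _DONE:
            out_q.put(_DONE)  # propagate shutdown downstream
            return
        out_q.put(work(item))


def read_batches(n_batches, out_q):
    """Level 1 stand-in: produce raw batches (a real reader would hit disk)."""
    for i in range(n_batches):
        out_q.put({"id": i, "raw": [0, 1, 2, 3]})
    out_q.put(_DONE)


def preprocess(batch):
    """Level 2 stand-in: CPU-side decode / augment / encode work."""
    return {"id": batch["id"], "features": [2 * x for x in batch["raw"]]}


def to_device(batch):
    """Level 3 stand-in: host-to-device copy (here it only sets a flag)."""
    return {**batch, "on_device": True}


def train(n_batches, depth=2):
    """Consume batches while the three upstream stages run concurrently."""
    # Each queue holds at most `depth` batches, so a fast stage blocks
    # instead of running arbitrarily far ahead of its consumer.
    raw_q, ready_q, device_q = (queue.Queue(maxsize=depth) for _ in range(3))
    threads = [
        threading.Thread(target=read_batches, args=(n_batches, raw_q), daemon=True),
        threading.Thread(target=run_stage, args=(preprocess, raw_q, ready_q), daemon=True),
        threading.Thread(target=run_stage, args=(to_device, ready_q, device_q), daemon=True),
    ]
    for t in threads:
        t.start()

    results = []
    while True:
        batch = device_q.get()
        if batch is _DONE:
            break
        # Stand-in for one training step on the already-transferred batch.
        results.append((batch["id"], sum(batch["features"])))
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(train(5))
```

Because each stage is a single thread feeding a FIFO queue, batch order is preserved. In real Python code the preprocessing thread only overlaps with the training loop when the underlying work releases the GIL, as many decoding and tensor libraries do.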
While the GPU computes the current batch, the next batch is being transferred, the one after is being preprocessed,...