The Load-Balance Problem Behind Hybrid Parallelism

matt_d1


1. The 5D Map Is Really a Coupling Map

DP: Splits samples across ranks, then synchronizes gradients.

CP: Splits one sequence across ranks, mainly to make long attention fit.

TP + SP: Splits layer tensors and some sequence-dimension activations.

PP: Splits layers into pipeline stages.

EP: Splits experts and routes tokens to them.

The axes are not independent knobs. Optimizing one split changes both communication and load balance for the others: a CP split can make one long sample fit, but it can also add P2P traffic; a DP split can add sample throughput, but it can also leave ranks waiting at gradient synchronization; a PP split can turn uneven microbatch runtimes into pipeline bubbles.

This post focuses on the DP+CP part in Megatron Dynamic-CP and ByteScale.

2. What Is Microbatch-Level DP Load Balancing?

For this section and the rest of the post, assume sequence packing is already supported: a microbatch may contain multiple sequences, and the scheduler can see their token lengths rather than only one opaque tensor shape.

The naive way to handle post-training load balance is to ignore CP and only rearrange samples across DP ranks. A simple version sorts samples by token count, or by a cost function derived from token count, then uses a longest-processing-time heuristic: place the next heaviest sample or microbatch on the currently lightest DP rank. This often makes each gradient-accumulation window less skewed because the long samples are spread across ranks instead of clustered.
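The longest-processing-time heuristic described above can be sketched in a few lines. This is a minimal illustration, not code from any particular framework: the function name `lpt_assign`, the `cost` parameter, and the sample token counts are all made up for the example, and cost is assumed to be linear in token count unless a different function is passed in.

```python
import heapq


def lpt_assign(token_counts, num_ranks, cost=lambda n: n):
    """Greedy LPT: give the next heaviest sample to the currently lightest DP rank.

    `cost` maps a token count to an estimated cost (identity by default,
    which is an assumption; attention-heavy workloads may want n * n).
    Returns a list of sample-index lists, one per rank.
    """
    # Visit samples from heaviest to lightest.
    order = sorted(range(len(token_counts)),
                   key=lambda i: cost(token_counts[i]), reverse=True)
    # Min-heap of (accumulated cost, rank id), so the lightest rank pops first.
    heap = [(0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    for i in order:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(i)
        heapq.heappush(heap, (load + cost(token_counts[i]), rank))
    return assignment


# Hypothetical token counts for eight samples, spread over two DP ranks.
tokens = [8000, 1000, 1000, 1000, 4000, 3000, 500, 500]
ranks = lpt_assign(tokens, num_ranks=2)
loads = [sum(tokens[i] for i in r) for r in ranks]
```

Note that nothing here constrains how many samples each rank receives, only the total cost per rank.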

The constraint is that gradient accumulation still expects the DP ranks to meet at the same synchronization boundary. If rank 0 executes a different number of microbatches from rank 1 before the same gradient update, the forward/backward and communication schedule can diverge. So microbatch-level DP load balancing tries to make each microbatch step have comparable token work across ranks while keeping the same microbatch count per rank.
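One simple way to respect the equal-count constraint is to run the same greedy heuristic but close a rank once it holds its share of microbatches. The sketch below assumes the microbatch count divides evenly by the number of DP ranks; `lpt_equal_count` and the example token sums are hypothetical.

```python
def lpt_equal_count(mb_tokens, num_ranks):
    """Greedy balancing of microbatch token sums with an equal-count cap.

    Each rank receives exactly len(mb_tokens) // num_ranks microbatches, so
    every DP rank reaches the gradient synchronization point after the same
    number of forward/backward steps.
    """
    if len(mb_tokens) % num_ranks != 0:
        raise ValueError("microbatch count must be divisible by num_ranks")
    per_rank = len(mb_tokens) // num_ranks
    order = sorted(range(len(mb_tokens)),
                   key=lambda i: mb_tokens[i], reverse=True)
    loads = [0] * num_ranks
    buckets = [[] for _ in range(num_ranks)]
    for i in order:
        # Only ranks that still have a free microbatch slot are candidates.
        open_ranks = [r for r in range(num_ranks) if len(buckets[r]) < per_rank]
        rank = min(open_ranks, key=lambda r: (loads[r], r))
        buckets[rank].append(i)
        loads[rank] += mb_tokens[i]
    return buckets, loads


# Hypothetical token sums for six microbatches on two DP ranks.
buckets, loads = lpt_equal_count([9000, 7000, 3000, 2000, 1000, 1000], num_ranks=2)
```

The cap trades some balance for schedule safety: the last few microbatches may be forced onto a rank that is already heavier, simply because it is the only one with a free slot.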

Slime implements a more practical two-level version for rollout training. Its DP/microbatch scheduler first groups rollout samples into training steps, then packs each step into microbatches by fixed chunking or first-fit token packing. After that, if --balance-data is enabled, it balances the resulting microbatch token sums across DP ranks with the Karmarkar-Karp partitioner while preserving the equal-count invariant. This is smarter than raw sample-level LPT because it balances the units the training loop actually executes, but it still treats CP as a fixed capacity factor rather than choosing a per-sequence local_cp_size.
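To make the first-fit token packing step concrete, here is a minimal sketch of the general technique. It is not taken from slime's source: the function name `first_fit_pack`, the `max_tokens` budget, and the choice to raise on an over-length sequence are assumptions for illustration.

```python
def first_fit_pack(seq_lens, max_tokens):
    """Pack sequences into microbatches with a per-microbatch token budget.

    Each sequence goes into the first existing microbatch that still has
    room; a new microbatch is opened only when none fits.
    Returns (bins, totals): sequence indices and token sum per microbatch.
    """
    bins, totals = [], []
    for i, n in enumerate(seq_lens):
        if n > max_tokens:
            # Assumption: over-length sequences are rejected, not truncated.
            raise ValueError(f"sequence {i} has {n} tokens, budget is {max_tokens}")
        for b in range(len(bins)):
            if totals[b] + n <= max_tokens:
                bins[b].append(i)
                totals[b] += n
                break
        else:
            # No existing microbatch had room.
            bins.append([i])
            totals.append(n)
    return bins, totals


# Hypothetical sequence lengths with a 4096-token microbatch budget.
bins, totals = first_fit_pack([3000, 2500, 1500, 1000, 4000], max_tokens=4096)
```

The resulting `totals` are the per-microbatch token sums that a later balancing pass across DP ranks would operate on.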

3. How Megatron Handles This

Megatron's hybrid context parallelism is already a useful step away from a fixed "DP equals N, CP equals M" grid. At initialization it still builds a bounded set of process groups over the DPxCP ranks. At runtime, the scheduler first unpacks the packed batch: a sub-sample is one real sequence segment recovered from the packed tensor using its cumulative sequence lengths. The scheduler then assigns each sub-sample a local_cp_size: short sequences can stay on one rank, while long sequences may occupy 2, 4, 8, or more ranks from the same DPxCP pool.
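Recovering sub-samples from a packed tensor is just slicing by consecutive cumulative offsets. The sketch below uses plain Python lists in place of tensors, and the names `split_packed` and `cu_seqlens` are illustrative rather than Megatron's actual function or variable names.

```python
def split_packed(tokens, cu_seqlens):
    """Recover sub-samples from a packed sequence.

    `cu_seqlens` holds cumulative sequence lengths with a leading 0, so
    sub-sample i spans tokens[cu_seqlens[i]:cu_seqlens[i + 1]].
    """
    if not cu_seqlens or cu_seqlens[0] != 0 or cu_seqlens[-1] != len(tokens):
        raise ValueError("cu_seqlens must start at 0 and end at len(tokens)")
    return [tokens[start:end]
            for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:])]


# A packed batch of 10 tokens holding three sequences of length 3, 4, and 3.
packed = list(range(10))
sub_samples = split_packed(packed, [0, 3, 7, 10])
lengths = [len(s) for s in sub_samples]
```

Once the per-sequence lengths are visible, a scheduler can reason about each sub-sample separately instead of one opaque packed shape.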

Figure 3 from the ByteScale paper, showing why fixed DP+CP can create redundant communication for short sequences and imbalance bubbles across DP/PP schedules.

The relevant Megatron code is compact enough to read directly: parallel_state.py creates the bounded hybrid DP-CP process groups, data_schedule.py unpacks and reroutes sub-samples, and hybrid_cp_schedule.py estimates relative attention work as roughly seq_len * seq_len / cp_size, rounds the required CP size to a power of two, and inserts barriers between compatible groups. NVIDIA's Dynamic-CP blog describes the same direction at the system level: a data-iterator wrapper reschedules packed data, selects CP size, returns the effective num_micro_batches, and broadcasts the dynamic packing metadata across pipeline stages.
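The cost estimate and power-of-two rounding can be illustrated as follows. This is a sketch under stated assumptions, not a copy of hybrid_cp_schedule.py: the per-rank token budget `max_tokens_per_rank`, the decision to round up, and all function names are hypothetical.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (and at least 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def choose_local_cp_size(seq_len, max_tokens_per_rank, max_cp_size):
    """Pick a power-of-two CP size so each rank holds at most the token budget.

    Assumption: the required size is ceil(seq_len / max_tokens_per_rank),
    rounded up to a power of two and bounded by the largest CP group.
    """
    needed = -(-seq_len // max_tokens_per_rank)  # ceiling division
    cp_size = next_power_of_two(needed)
    if cp_size > max_cp_size:
        raise ValueError("sequence does not fit in the largest CP group")
    return cp_size


def relative_attention_work(seq_len, cp_size):
    """Rough per-rank attention cost: quadratic in length, split across CP ranks."""
    return seq_len * seq_len / cp_size


# Hypothetical budget of 8192 tokens per rank, CP groups up to size 8.
sizes = [choose_local_cp_size(n, 8192, 8) for n in (1000, 20000, 65536)]
```

With these numbers a 1,000-token sequence stays on one rank, a 20,000-token sequence needs three ranks and is rounded up to four, and a 65,536-token sequence takes the full group of eight.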

The PP part still operates at the microbatch scheduling level. Dynamic-CP does not add a separate pipeline-parallel layer; it tries to feed Megatron's existing PP/VPP schedule with better-balanced microbatches and consistent dynamic metadata, so that fewer long microbatches block neighboring pipeline stages.

That corrects an easy overstatement: Megatron is not merely "fixed CP for every microbatch." Its hybrid CP path is dynamic over sub-samples inside the current DPxCP scheduling window. The narrower constraint is that the dynamic choice is represented through pre-created power-of-two CP groups, while the rest of the stack still has to preserve common loss scaling, gradient synchronization, and pipeline execution semantics.

4. What ByteScale Adds Beyond Megatron Hybrid CP

A caveat first: ByteScale does not appear to have an open-source training implementation, so this comparison is based on the paper's design description rather than code-level verification.

The most...
