RL Doesn't Work on Slurm | SkyPilot Blog
Open the Slurm launch script for OpenRLHF. It parses node IPs from scontrol show hostnames, hardcodes service addresses, and bootstraps a Ray cluster by hand. Now open the one for veRL. Same thing. NeMo RL? Same thing again. Three unrelated frameworks, three sets of nearly identical glue code, all papering over Slurm’s lack of multi-service orchestration.<br>Slurm is a batch scheduler. Online RL - GRPO, PPO, multi-step RLHF - is not a batch job. Writing better launch scripts doesn’t change that.<br>What a typical online RL pipeline looks like#
A GRPO or PPO training loop isn’t one program. It’s four or five separate processes that need to coordinate on every iteration.<br>The policy/actor model is where gradient updates happen - you need H100s or B200s with NVLink and InfiniBand here. The rollout engine (usually vLLM or SGLang) generates completions from the current policy; it’s memory-bandwidth-bound, not compute-bound, so cheaper GPUs work. Then there’s the reward model (or a verifiable reward function) and a reference model for KL divergence penalties - both inference-only, each needing its own GPU allocation. Finally, data processing (tokenization, advantage computation, replay buffers) runs on CPU.<br>These aren’t independent jobs. The rollout engine produces completions, the reward model scores them, advantages get computed, the policy takes a gradient step, and then the updated weights have to ship back to the inference engine before the next iteration can start. With a 70B model, that’s ~140 GB of fp16 weights (70 billion parameters at 2 bytes each) moving every single iteration.<br>Generation is where the time goes. Rollout generation accounts for 80–90% of total compute in RL training. The gradient updates themselves are comparatively fast.<br>Why GRPO made this worse#<br>DPO learns from a static preference dataset, which makes it structurally identical to supervised fine-tuning. It maps cleanly onto a Slurm batch job: read data, compute gradients, update weights, repeat.<br>GRPO, first introduced in DeepSeekMath and later adopted by DeepSeek-R1, works differently. Instead of learning from pre-collected preferences, GRPO samples multiple completions from the current policy during training, scores them, and computes group-relative advantages. The advantage for each completion is (reward - group_mean) / group_std. There is no learned critic network (unlike PPO, which trains both a policy and a value network), which cuts memory overhead.<br>But now training and inference are inseparable.
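
To make that coupling concrete, here is a minimal sketch of the group-relative advantage formula given above, (reward - group_mean) / group_std. The function name `group_relative_advantages`, the `eps` term that guards against a zero standard deviation, and the use of the population standard deviation are assumptions made for illustration, not details taken from any particular framework.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Compute (reward - group_mean) / group_std for one group of completions.

    `rewards` holds the scores of several completions sampled from the
    current policy for the same prompt. `eps` is an assumed safeguard for
    groups where every completion received the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four completions of a single prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

The mean and standard deviation are computed over completions the current policy just produced, so the same completion would receive a different advantage under a different policy.
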
You can’t precompute the completions offline because the rewards are relative to what the current policy produces. Every training step requires fresh generations from the live model. Slurm has no way to express this kind of coupling between inference and training.<br>Where Slurm’s model breaks down#<br>Slurm manages over 60% of the TOP500 supercomputers. Its core assumption: a job is a finite program with a defined end time, running on exclusively allocated nodes. Online RL breaks this assumption:<br>No service discovery. Slurm gives you SLURM_JOB_NODELIST, a list of allocated hostnames, and nothing else. No DNS, no service registry. When OpenRLHF runs on Slurm, its launch scripts must extract node IPs via scontrol show hostnames, parse them into arrays, and manually bootstrap a Ray cluster with hardcoded addresses. There is no way in Slurm to say “run the actor on these GPUs, the reward model on those, vLLM inference on these others, and let them talk to each other.”<br>Jobs must terminate. Every Slurm partition has a MaxTime limit; users must specify --time. As Nebius engineers note, “Slurm is well-suited only for time-finite workloads on size-limited clusters.” An RL training loop runs until convergence, which has no predictable wall-clock deadline. Fitting it into Slurm means either grossly overestimating the time limit (wasting your allocation) or checkpointing and resubmitting, which requires rebuilding the full topology each time.<br>Health checking stops at the node. Slurm monitors whether slurmd responds to periodic pings, but it does not provide a component-level recovery loop for services inside the allocation. If your reward model or vLLM server dies, Slurm has no automatic way to restart just that service, reattach it to the rest of the job, and preserve its endpoint. At best, you fail or requeue the whole job; then your launch script has to redo service discovery, rebuild the Ray topology, and redistribute hardcoded addresses.
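
As a rough sketch of the discovery step such a launch script has to repeat after every requeue: the code below expands SLURM_JOB_NODELIST with scontrol show hostnames and derives a head address from the result. It is not taken from any framework's actual script. The function names `expand_nodelist` and `plan_topology`, the convention that the first host becomes the Ray head, and the port value are all assumptions for illustration.

```python
import os
import subprocess

RAY_PORT = 6379  # assumed head port; adjust to whatever the cluster uses


def expand_nodelist(nodelist):
    """Expand a compact Slurm nodelist (e.g. 'gpu[01-04]') into hostnames."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.split()


def plan_topology(hosts, port=RAY_PORT):
    """Treat the first host as the Ray head and the rest as workers."""
    if not hosts:
        raise ValueError("empty allocation")
    head, *workers = hosts
    return {"head": head, "head_address": f"{head}:{port}", "workers": workers}


# Only runs inside a Slurm allocation, where the variable is set.
nodelist = os.environ.get("SLURM_JOB_NODELIST")
if nodelist:
    print(plan_topology(expand_nodelist(nodelist)))
```

Every address here is derived from the current allocation, so it all has to be recomputed and redistributed whenever the job lands on different nodes.
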
In SkyPilot Job Groups, the job system owns those mechanics: tasks get stable hostnames, recover independently from preemptions, and can be relaunched by managed job...