Running Large-Scale GPU Workloads on Kubernetes with Slurm

Apr 09, 2026

By Anton Polyakov, Fagani Hajizada, Marlow Warnicke and Skyler Malinowski

AI-Generated Summary

Slinky, developed by SchedMD (now part of NVIDIA), enables native Slurm cluster management on Kubernetes by representing all Slurm daemons as Kubernetes Custom Resource Definitions, supporting full Slurm lifecycle orchestration and high availability without relying on Slurm's native HA.

Integration with the NVIDIA GPU Operator and DRA/ComputeDomains allows automated GPU management, topology-aware multinode scheduling, and per-job GPU monitoring, supporting advanced NVIDIA architectures like GB200 NVL72 with dynamic Internode Memory Exchange and topology discovery.

Production deployments at NVIDIA have demonstrated that Slinky slurm-operator scales to over 8,000 GPUs, supports nondisruptive rolling updates, maintains unified observability via Prometheus and Grafana, and achieves performance parity with noncontainerized Slurm clusters, with v1.1.0 adding dynamic topology support, DaemonSet-style scaling, and enhanced self-healing capabilities.

Slurm is an open source cluster management and job scheduling system for Linux. It manages job scheduling for over 65% of TOP500 systems. Most organizations running large-scale AI training have years of investment in Slurm job scripts, fair-share policies, and accounting workflows. The challenge is getting Slurm scheduling capabilities onto Kubernetes—the standard platform for managing GPU infrastructure at scale—without maintaining two separate environments.

Slinky, an open source project developed by SchedMD (now part of NVIDIA), takes two approaches to this integration:

slurm-bridge brings Slurm scheduling to native Kubernetes workloads, allowing Slurm to act as a Kubernetes scheduler for pods

slurm-operator runs full Slurm clusters on Kubernetes infrastructure, managing the complete lifecycle of Slurm daemons as pods

This post focuses on the slurm-operator, which is how NVIDIA runs Slurm on Kubernetes for large-scale GPU training clusters. It walks through the operator's architecture and how it maps Slurm daemons to Kubernetes primitives, then covers deployment, including how Slinky slurm-operator integrates with your existing infrastructure, and the Kubernetes ecosystem integrations that make this model practical. Finally, it shares lessons from running Slinky in production at NVIDIA on clusters with over 1,000 GPU worker nodes and 8,000+ GPUs.

How does Slinky slurm-operator work?

Slinky slurm-operator represents each Slurm component (slurmctld for scheduling, slurmdbd for accounting, slurmd for compute workers, slurmrestd for API access) as a Kubernetes Custom Resource Definition (CRD). You define a Slurm cluster with Custom Resources, and Slinky creates the containerized Slurm daemons, each running in its own pod and configured to join its cluster.

Figure 1. Slinky CRDs and their referential relationships
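To make the mapping concrete, here is a minimal Python sketch that models a cluster as one resource per Slurm daemon, with references between resources in the spirit of Figure 1. The kind names (`Accounting`, `Controller`, `NodeSet`, `RestApi`) are modeled loosely on the Slinky API and should be treated as assumptions; the fields and the `cluster` and `dangling_refs` helpers are hypothetical and do not reflect the real CRD schemas.

```python
from dataclasses import dataclass, field


# Hypothetical, simplified model of a Slinky-style cluster definition.
# Kind names are assumptions loosely based on Slinky; fields are NOT the
# real CRD schemas.
@dataclass
class Resource:
    kind: str
    name: str
    daemon: str
    # Other resources this one points at, keyed by kind (e.g. a worker
    # set pointing at its controller).
    refs: dict[str, str] = field(default_factory=dict)


def cluster(name: str) -> list[Resource]:
    """Build one resource per Slurm daemon for a cluster called `name`."""
    ctrl = f"{name}-controller"
    acct = f"{name}-accounting"
    return [
        Resource("Accounting", acct, "slurmdbd"),
        Resource("Controller", ctrl, "slurmctld", refs={"Accounting": acct}),
        Resource("NodeSet", f"{name}-gpu-workers", "slurmd",
                 refs={"Controller": ctrl}),
        Resource("RestApi", f"{name}-restapi", "slurmrestd",
                 refs={"Controller": ctrl}),
    ]


def dangling_refs(resources: list[Resource]) -> list[str]:
    """List references that do not resolve to a resource in the set."""
    known = {(r.kind, r.name) for r in resources}
    return [
        f"{r.kind}/{r.name} -> {kind}/{target}"
        for r in resources
        for kind, target in r.refs.items()
        if (kind, target) not in known
    ]
```

With this shape, `dangling_refs(cluster("demo"))` returns an empty list, while a worker set that names a nonexistent controller is reported. Checking references before creating pods is one way an operator could guarantee that every daemon joins the cluster it was declared for.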

Slinky provides high availability (HA) for the Slurm control plane (slurmctld) through pod regeneration, without relying on Slurm's native HA mechanism. Configuration changes propagate automatically: Kubernetes synchronizes mounted configuration (ConfigMaps and Secrets) into the control plane pod, which detects the changes and propagates the new configuration to its workers (slurmd) with zero scheduler downtime.
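The change-detection step can be sketched in Python. This is a minimal illustration under stated assumptions, not Slinky's implementation: it polls a mounted ConfigMap/Secret directory, hashes the projected files (skipping the `..data` and timestamped entries Kubernetes uses for atomic updates), and calls a reconfigure hook when the digest changes. `ConfigWatcher` and `config_digest` are hypothetical names; in a real Slurm deployment the hook might run `scontrol reconfigure`.

```python
import hashlib
from pathlib import Path
from typing import Callable


def config_digest(config_dir: Path) -> str:
    """Hash the projected config files in a mounted ConfigMap/Secret directory."""
    h = hashlib.sha256()
    # Kubernetes exposes each key as a top-level symlink into a timestamped
    # directory and swaps the "..data" symlink on update; skip the "..*"
    # entries so the digest depends only on file names and contents.
    files = sorted(
        p for p in config_dir.iterdir()
        if not p.name.startswith("..") and p.is_file()
    )
    for path in files:
        h.update(path.name.encode() + b"\0")
        h.update(path.read_bytes() + b"\0")
    return h.hexdigest()


class ConfigWatcher:
    """Hypothetical poller: call `reconfigure` whenever the mounted config changes."""

    def __init__(self, config_dir: Path, reconfigure: Callable[[], None]) -> None:
        self.config_dir = config_dir
        self.reconfigure = reconfigure  # e.g. a function running `scontrol reconfigure`
        self.last_digest = config_digest(config_dir)

    def poll(self) -> bool:
        """Return True if a change was detected and the hook was called."""
        digest = config_digest(self.config_dir)
        if digest == self.last_digest:
            return False
        self.last_digest = digest
        self.reconfigure()
        return True
```

Polling keeps the sketch portable; an inotify-based watcher would react faster but must account for Kubernetes swapping the `..data` symlink rather than editing files in place.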

Using Slurm's native OpenMetrics support (available since Slurm v25.11) and Prometheus monitoring, you can autoscale workers through the Kubernetes HorizontalPodAutoscaler (HPA) based on cluster metrics and your desired scaling policy, from a single pod to every available worker node. On scale-in, Slinky fully drains Slurm nodes before terminating their pods, so running workloads complete first, and it selects for removal the pods whose workloads will complete soonest. The same drain-before-terminate process applies when rolling out new worker pod images (for example, updated Slurm versions or OS patches), so upgrades do not interrupt running jobs.
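The scaling arithmetic and the scale-in choice can be sketched in Python. `hpa_desired_replicas` implements the documented core HPA rule, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), with the default 10% tolerance and min/max clamping. `pick_scale_in` is a hypothetical helper illustrating the "complete soonest" preference; its `remaining_s` input (time until the last job on a node is expected to end) is an assumption, not a Slinky field. Draining itself would happen in Slurm, for example with `scontrol update NodeName=<node> State=DRAIN Reason=<text>`, before the pod is deleted.

```python
import math
from dataclasses import dataclass


def hpa_desired_replicas(current: int, metric: float, target: float,
                         min_replicas: int, max_replicas: int,
                         tolerance: float = 0.1) -> int:
    """Core Kubernetes HPA rule: ceil(current * metric / target), clamped to bounds.

    No change is made when metric/target is within `tolerance` of 1.0
    (the HPA default tolerance is 0.1, i.e. 10%).
    """
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        desired = current
    else:
        desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


@dataclass
class Worker:
    pod: str
    # Hypothetical input: seconds until the last job on this Slurm node is
    # expected to finish (0 for an idle node).
    remaining_s: float


def pick_scale_in(workers: list[Worker], count: int) -> list[str]:
    """Pick `count` pods to drain, preferring nodes that will be free soonest.

    Ties are broken by pod name so the choice is deterministic.
    """
    ordered = sorted(workers, key=lambda w: (w.remaining_s, w.pod))
    return [w.pod for w in ordered[:max(count, 0)]]
```

For example, with 10 workers, a hypothetical pending-jobs-per-worker metric of 1.0 and a target of 2.0, the rule asks for 5 replicas, and `pick_scale_in` would nominate the 5 workers that free up first.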

How to deploy Slinky slurm-operator

Slurm clusters deployed with Slinky on Kubernetes work similarly to noncontainerized Slurm deployments. Slinky slurm-operator automatically enables the Slurm features required for containerized operation:

configless mode for config distribution without shared filesystems

dynamic nodes so workers register on startup without being predefined in slurm.conf

auth/slurm with use_client_ids for cluster-wide user authentication without per-node identity services

Figure 2. A full Slurm cluster deployed on Kubernetes with Slinky, including login pods, worker pod autoscaling, job accounting, and integration...
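For reference, the three features listed above correspond to standard Slurm settings. The Python sketch below renders them as `slurm.conf` lines plus a `slurmd` command line; it illustrates the underlying Slurm knobs and is not the configuration Slinky actually generates. The controller hostname and the function name are hypothetical, and 6817 is Slurm's default `SlurmctldPort`.

```python
def render_containerized_settings(controller_host: str,
                                  port: int = 6817) -> tuple[list[str], list[str]]:
    """Return (slurm.conf lines, slurmd argv) for a containerized worker.

    Sketch only: shows the standard Slurm settings behind configless mode,
    dynamic nodes, and auth/slurm; not Slinky's generated config.
    """
    slurm_conf = [
        # Configless: slurmctld serves its config files to other daemons.
        "SlurmctldParameters=enable_configless",
        # auth/slurm replaces MUNGE for authentication and credentials.
        "AuthType=auth/slurm",
        "CredType=cred/slurm",
        # Trust client-supplied user identity instead of per-node passwd/LDAP lookups.
        "AuthInfo=use_client_ids",
    ]
    slurmd_argv = [
        "slurmd",
        # Fetch config from the controller instead of a shared filesystem.
        "--conf-server", f"{controller_host}:{port}",
        # Register as a dynamic node, not predefined in slurm.conf.
        "-Z",
    ]
    return slurm_conf, slurmd_argv
```

In configless mode the full `slurm.conf` lives only with the controller; workers receive it at startup through `--conf-server`, which is why no shared filesystem is needed.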
