SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips

Jiahuan Yu¹, Mingtao Hu¹, Zichao Lin¹, Minjia Zhang¹

¹University of Illinois Urbana-Champaign

News

2026-01-26: SuperInfer has been accepted at MLSys 2026! 🎉

Abstract

Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs.

To address these issues, we present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) whose GPU and CPU are tightly coupled via NVLink-C2C. SuperInfer introduces (1) RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and (2) DuplexKV, an optimized rotation engine that enables full-duplex transfer over NVLink-C2C.

Evaluations on GH200 using various models and datasets show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while maintaining comparable TBT and throughput compared to state-of-the-art systems, demonstrating that SLO-aware scheduling and memory co-design unlocks the full potential of Superchips for responsive LLM serving.

Background

The Memory Wall: During autoregressive generation, each request maintains a growing KV cache that quickly exhausts GPU memory under high loads, leading to SLO violations.
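To make the growth concrete, the per-token KV footprint can be estimated as 2 (keys and values) × layers × KV heads × head dimension × bytes per element. The sketch below uses a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, FP16). These numbers are illustrative assumptions, not the models evaluated in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    # Hypothetical configuration, chosen only for illustration.
    num_layers: int = 32
    num_kv_heads: int = 8
    head_dim: int = 128
    dtype_bytes: int = 2  # FP16/BF16


def kv_bytes_per_token(shape: ModelShape) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * shape.dtype_bytes)


def kv_bytes_for_request(shape: ModelShape, prompt_tokens: int,
                         generated_tokens: int) -> int:
    """KV cache held by one request after `generated_tokens` decode steps."""
    return kv_bytes_per_token(shape) * (prompt_tokens + generated_tokens)


if __name__ == "__main__":
    shape = ModelShape()
    per_token = kv_bytes_per_token(shape)
    per_request = kv_bytes_for_request(shape, prompt_tokens=3072,
                                       generated_tokens=1024)
    print(f"{per_token / 1024:.0f} KiB per token")
    print(f"{per_request / 2**20:.0f} MiB for a 4096-token request")
```

Under these assumptions each token costs 128 KiB and a 4096-token request holds 512 MiB, so 200 such requests would need about 100 GiB for KV cache alone.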

The Interconnect Bottleneck: Existing KV offloading systems are crippled by slow PCIe bandwidth (~32-64 GB/s), causing severe head-of-line (HOL) blocking and SLO violations.

Increasing swap bandwidth beyond the PCIe Gen5 x16 unidirectional limit significantly reduces both TTFT and TBT.

PCIe's low swap bandwidth creates two major obstacles to reducing tail latencies: request backlogging and HOL blocking.

Superchip Opportunity: Emerging tightly coupled CPU-GPU Superchips provide high-speed CPU-GPU interconnects that break the PCIe bottleneck. For example, the NVIDIA GH200 integrates a Hopper GPU and a Grace CPU via NVLink-C2C with 900 GB/s of interconnect bandwidth.
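A back-of-the-envelope comparison shows why the interconnect matters. The sketch below divides a payload size by the peak bandwidths quoted in this section. The 1 GiB payload is an arbitrary assumption, and real transfers achieve less than peak, so the results are best read as lower bounds on swap time. Note also that the 900 GB/s figure is commonly quoted as an aggregate over both directions, so a one-way transfer may see roughly half of it.

```python
GIB = 2**30
GB = 10**9  # link bandwidths are quoted in decimal gigabytes per second

# Peak figures quoted in the text; achieved bandwidth is lower in practice.
# The NVLink-C2C number is assumed here to be usable in full, which is
# optimistic for a one-directional transfer.
LINKS_GB_PER_S = {
    "PCIe (low end)": 32,
    "PCIe (high end)": 64,
    "NVLink-C2C": 900,
}


def swap_time_ms(payload_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Ideal time to move `payload_bytes` over a link, in milliseconds."""
    return payload_bytes / (bandwidth_gb_per_s * GB) * 1e3


if __name__ == "__main__":
    payload = 1 * GIB  # hypothetical KV cache to swap out
    for name, bandwidth in LINKS_GB_PER_S.items():
        print(f"{name:>16}: {swap_time_ms(payload, bandwidth):6.2f} ms")
```

At peak rates, moving 1 GiB takes roughly 34 ms and 17 ms over the two PCIe figures, versus about 1.2 ms over NVLink-C2C. A decode step typically takes tens of milliseconds, so the PCIe swap costs about as much as a full step while the C2C swap is a small fraction of one.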

Software Bottlenecks: Existing serving stacks fall short on two fronts:

SLO-unaware

Existing schedulers react to memory pressure, not latency urgency. Static Waiting-First / Swapped-First policies favor one SLO (TTFT or TBT) at the expense of the other.

Under-utilized C2C

PagedAttention fragments the KV cache into tiny pieces, so transfers cannot fully exploit the NVLink-C2C bandwidth.
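The bias of static policies can be seen in a toy admission routine. The sketch below is one plausible reading of Waiting-First and Swapped-First, not the scheduler of SuperInfer or of any existing system. The function `admit`, the queue names, and the slot model are all hypothetical.

```python
from collections import deque


def admit(waiting: deque, swapped: deque, free_slots: int, policy: str) -> list:
    """Choose which requests run next when `free_slots` slots open up.

    `policy` is a fixed, SLO-unaware priority order:
    "waiting_first" drains new requests before swapped-out ones,
    "swapped_first" does the reverse.
    """
    if policy == "waiting_first":
        order = (waiting, swapped)
    elif policy == "swapped_first":
        order = (swapped, waiting)
    else:
        raise ValueError(f"unknown policy: {policy}")

    admitted = []
    for queue in order:
        # Take from the higher-priority queue until it is empty or slots run out.
        while queue and len(admitted) < free_slots:
            admitted.append(queue.popleft())
    return admitted


if __name__ == "__main__":
    for policy in ("waiting_first", "swapped_first"):
        waiting = deque(["new-1", "new-2"])  # no first token produced yet
        swapped = deque(["old-1", "old-2"])  # mid-generation, KV cache offloaded
        print(policy, "->", admit(waiting, swapped, free_slots=2, policy=policy))
```

With two free slots, Waiting-First admits only the new requests (helping TTFT while the swapped-out requests stall, hurting TBT), and Swapped-First does the opposite. Neither order considers how close each request is to violating its SLO.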

SuperInfer Design

RotaSched: Proactive Rotation

Rotates requests between running (HBM) and a novel...
