VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving
Jiahuan Yu1,
Aryan Taneja1,
Junfeng Lin2,
Minjia Zhang1
1University of Illinois Urbana-Champaign
2Tsinghua University
Abstract
Modern Large Language Model (LLM) serving systems increasingly support interactive applications such as real-time chat assistants, code generation tools, and agentic workflows. However, the soaring energy cost of LLM inference presents a growing challenge for sustainable and cost-effective deployment.
We introduce VoltanaLLM, a system for SLO-aware, energy-efficient LLM serving, designed from a control theory perspective. VoltanaLLM co-designs frequency scaling and request routing in emerging prefill/decode disaggregated architectures, leveraging their decoupled execution to enable fine-grained, phase-specific control. It consists of (1) a feedback-driven frequency controller that dynamically adapts GPU frequency for prefill and decode phases, and (2) a state-space router that explores routing decisions across frequency-scaled instances to minimize energy under latency constraints.
We implement VoltanaLLM in SGLang and evaluate its performance across multiple state-of-the-art LLMs and real-world datasets. Our results show that VoltanaLLM achieves up to 36.3% energy savings while maintaining a near-perfect SLO attainment rate, paving the way for sustainable and intelligent LLM serving.
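To make the idea of feedback-driven frequency control concrete, here is a minimal sketch of a threshold-based controller. It is not VoltanaLLM's actual algorithm: the frequency levels, the margins, and the `FrequencyController` class are all hypothetical, chosen only to illustrate how a measured latency can be fed back to step the GPU frequency up or down against an SLO target.

```python
from dataclasses import dataclass

# Hypothetical discrete frequency levels in MHz (illustrative values only).
FREQ_LEVELS_MHZ = [1005, 1110, 1200, 1305, 1410]


@dataclass
class FrequencyController:
    """Toy feedback loop: one instance per phase (prefill or decode)."""

    slo_ms: float              # latency target, e.g. TTFT or ITL
    high_margin: float = 0.9   # assumed: speed up above 90% of the SLO
    low_margin: float = 0.6    # assumed: slow down below 60% of the SLO
    level: int = len(FREQ_LEVELS_MHZ) - 1  # start at the highest frequency

    def update(self, measured_ms: float) -> int:
        """Feed back one latency measurement; return the new frequency in MHz."""
        ratio = measured_ms / self.slo_ms
        if ratio > self.high_margin and self.level < len(FREQ_LEVELS_MHZ) - 1:
            self.level += 1    # close to violating the SLO: raise frequency
        elif ratio < self.low_margin and self.level > 0:
            self.level -= 1    # plenty of slack: lower frequency to save energy
        return FREQ_LEVELS_MHZ[self.level]


if __name__ == "__main__":
    decode_ctrl = FrequencyController(slo_ms=50.0)
    for latency_ms in [20.0, 20.0, 48.0, 40.0]:
        print(latency_ms, "->", decode_ctrl.update(latency_ms), "MHz")
```

The gap between the two margins acts as a dead band, so the controller holds its frequency when latency sits comfortably inside the SLO instead of oscillating between levels.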
Relevance and Early Observation
LLMs are deployed at unprecedented scale, making inference a major driver of energy consumption and total cost of ownership (TCO). Recent studies show inference can account for 90% of AI infrastructure utilization, pushing datacenter power and thermal limits. Large datacenters today can consume electricity equivalent to millions of households.
At the same time, latency-sensitive applications like chat assistants and agent pipelines rely on strict Service Level Objectives (SLOs), such as Time-To-First-Token (TTFT) and Inter-Token Latency (ITL). Violating these SLOs degrades user experience and downstream responsiveness.
The central challenge: how can we serve LLMs under tight SLOs while reducing their energy footprint?
Our empirical profiling of LLM inference reveals a non-monotonic energy–frequency relationship. As shown above, while reducing GPU frequency from 1410 MHz to 1005 MHz (by ~28.7%) does increase execution time, the increase is sub-linear. Consequently, the total energy follows a U-shaped curve with respect to GPU frequency. This trend indicates that at low frequencies, execution time dominates energy, whereas at high frequencies, power dominates; in the middle lies an energy sweet point.
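The U-shaped trend can be reproduced with a toy analytical model in which energy is power multiplied by execution time. The sketch below is illustrative only: the power model (a static term plus a cubic dynamic term), the time model (a frequency-independent memory-bound part plus a compute part that scales as 1/f), and every constant are assumptions, not measurements from the profiling described here.

```python
# Toy model: energy(f) = power(f) * time(f). All constants are assumed.
F_MAX_MHZ = 1410
P_STATIC_W = 100.0       # assumed frequency-independent power
P_DYNAMIC_MAX_W = 300.0  # assumed dynamic power at F_MAX_MHZ
T_MEMORY_S = 0.5         # assumed memory-bound time, unaffected by frequency
T_COMPUTE_MAX_S = 0.5    # assumed compute-bound time at F_MAX_MHZ


def power_w(freq_mhz: float) -> float:
    """Static power plus a dynamic term that grows cubically with frequency."""
    return P_STATIC_W + P_DYNAMIC_MAX_W * (freq_mhz / F_MAX_MHZ) ** 3


def time_s(freq_mhz: float) -> float:
    """Only the compute-bound part slows down as frequency drops."""
    return T_MEMORY_S + T_COMPUTE_MAX_S * (F_MAX_MHZ / freq_mhz)


def energy_j(freq_mhz: float) -> float:
    return power_w(freq_mhz) * time_s(freq_mhz)


if __name__ == "__main__":
    freqs = list(range(510, F_MAX_MHZ + 1, 15))  # assumed 15 MHz steps
    sweet = min(freqs, key=energy_j)
    slowdown = time_s(1005) / time_s(1410)
    print(f"time increase 1410 -> 1005 MHz: {100 * (slowdown - 1):.1f}%")
    print(f"energy-minimizing frequency in this toy model: {sweet} MHz")
```

In this model the slowdown from 1410 MHz to 1005 MHz is smaller than the frequency ratio because the memory-bound portion does not scale with frequency, and the energy minimum falls strictly inside the frequency range rather than at either end.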
Background
Numerous systems have been proposed to improve LLM serving efficiency. These include advanced batching strategies for throughput optimization, memory management techniques like PagedAttention, CPU offloading, and GPU kernel-level optimizations (e.g., FlashAttention). Parallelism frameworks and parameter-sharing mechanisms further reduce bottlenecks, while speculative decoding and preemptive scheduling improve tail latency and job completion times.
Batching & Memory Optimizations – PagedAttention, CPU offloading, and GPU kernel improvements.
Parallelism & Sharing – model parallelism, pipelining, and parameter reuse.
Latency Techniques – speculative decoding and preemptive scheduling for multi-tenant settings.
While these systems indirectly reduce energy use by improving latency or throughput, VoltanaLLM directly targets energy efficiency via frequency- and routing-aware scheduling, a relatively underexplored but critical dimension for sustainable deployment.
Several recent efforts have also begun addressing energy-efficient LLM serving. For instance, DynamoLLM explores GPU frequency control based on request characteristics, while μ-Serve optimizes power by co-serving multiple models. EcoServe considers operational and embodied carbon emissions, TAPAS exploits datacenter thermal slack, and Heron places GPUs closer to renewable sources.
To better manage compute heterogeneity, recent systems have introduced prefill/decode (P/D) disaggregation, separating the two phases across GPU nodes. Projects like SplitWise, TetriInfer, Llumnix, and DistServe show improvements in goodput, time-to-first-token, and SLO attainment. Popular inference libraries such as vLLM and SGLang have also added runtime support.
These efforts primarily optimize...