VoltanaLLM: Energy-Efficient LLM Serving

VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving

Jiahuan Yu1,

Aryan Taneja1,

Junfeng Lin2,

Minjia Zhang1

1University of Illinois Urbana-Champaign 2Tsinghua University

Abstract

Modern Large Language Model (LLM) serving systems increasingly support interactive applications such as real-time chat assistants, code generation tools, and agentic workflows. However, the soaring energy cost of LLM inference presents a growing challenge for sustainable and cost-effective deployment.

We introduce VoltanaLLM, a system for SLO-aware, energy-efficient LLM serving, designed from a control theory perspective. VoltanaLLM co-designs frequency scaling and request routing in emerging prefill/decode disaggregated architectures, leveraging their decoupled execution to enable fine-grained, phase-specific control. It consists of (1) a feedback-driven frequency controller that dynamically adapts GPU frequency for prefill and decode phases, and (2) a state-space router that explores routing decisions across frequency-scaled instances to minimize energy under latency constraints.
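To make the idea of a feedback-driven frequency controller concrete, here is a minimal sketch of a threshold-based feedback loop. It is not VoltanaLLM's actual controller: the class name `FeedbackFrequencyController`, the frequency levels, and both margins are hypothetical values chosen for illustration, and a real deployment would apply the returned frequency through the GPU driver.

```python
from dataclasses import dataclass

# Hypothetical discrete frequency levels in MHz; a real GPU exposes its own set.
FREQ_LEVELS_MHZ = [1005, 1110, 1215, 1305, 1410]


@dataclass
class FeedbackFrequencyController:
    """Step the frequency up or down based on observed latency (illustrative only)."""

    slo_ms: float                 # latency target for this phase (e.g. TTFT or ITL)
    raise_margin: float = 0.95    # assumed: step up above 95% of the SLO
    lower_margin: float = 0.70    # assumed: step down below 70% of the SLO
    level: int = len(FREQ_LEVELS_MHZ) - 1  # start at the highest frequency

    def update(self, observed_latency_ms: float) -> int:
        """Return the frequency (MHz) to use for the next control interval."""
        ratio = observed_latency_ms / self.slo_ms
        if ratio > self.raise_margin and self.level < len(FREQ_LEVELS_MHZ) - 1:
            self.level += 1   # close to violating the SLO: speed up
        elif ratio < self.lower_margin and self.level > 0:
            self.level -= 1   # ample slack: slow down to save energy
        return FREQ_LEVELS_MHZ[self.level]


# One controller per phase mirrors the phase-specific control described above.
decode_ctrl = FeedbackFrequencyController(slo_ms=50.0)
trace = [decode_ctrl.update(itl_ms) for itl_ms in [20.0, 25.0, 30.0, 49.0, 60.0]]
print(trace)
```

The gap between the two margins acts as a dead band, so the controller holds its current frequency when latency sits comfortably between the thresholds instead of oscillating.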

We implement VoltanaLLM in SGLang and evaluate its performance across multiple state-of-the-art LLMs and real-world datasets. Our results show that VoltanaLLM achieves up to 36.3% energy savings while maintaining a near-perfect SLO attainment rate, paving the way for sustainable and intelligent LLM serving.

Relevance and Early Observation

LLMs are deployed at unprecedented scale, making inference a major driver of energy consumption and total cost of ownership (TCO). Recent studies show inference can account for 90% of AI infrastructure utilization, pushing datacenter power and thermal limits. Large datacenters today can consume electricity equivalent to millions of households.

At the same time, latency-sensitive applications like chat assistants and agent pipelines rely on strict Service Level Objectives (SLOs), such as Time-To-First-Token (TTFT) and Inter-Token Latency (ITL). Violating these SLOs degrades user experience and downstream responsiveness.
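As a rough illustration of how TTFT and ITL targets translate into an SLO attainment rate, the sketch below counts a request as compliant when its TTFT and its mean ITL are both within target. The function name `slo_attainment`, the use of mean ITL rather than a percentile, and the sample numbers are assumptions made for this example, not definitions taken from the paper.

```python
def slo_attainment(requests, ttft_slo_ms, itl_slo_ms):
    """Fraction of requests meeting both latency targets.

    `requests` is a list of (ttft_ms, itl_samples_ms) pairs, where
    itl_samples_ms holds the gaps between consecutive output tokens.
    """
    if not requests:
        return 0.0
    met = 0
    for ttft_ms, itl_samples_ms in requests:
        # Assumption: judge ITL by its per-request mean; a request with a
        # single output token has no inter-token gaps and passes trivially.
        mean_itl = sum(itl_samples_ms) / len(itl_samples_ms) if itl_samples_ms else 0.0
        if ttft_ms <= ttft_slo_ms and mean_itl <= itl_slo_ms:
            met += 1
    return met / len(requests)


# Made-up measurements for illustration only.
sample = [
    (300.0, [40.0, 45.0]),   # meets both targets
    (700.0, [40.0]),         # TTFT too high
    (200.0, [80.0, 90.0]),   # ITL too high
    (100.0, []),             # single-token response
]
print(slo_attainment(sample, ttft_slo_ms=500.0, itl_slo_ms=50.0))
```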

The central challenge: how can we serve LLMs under tight SLOs while reducing their energy footprint?

Our empirical profiling of LLM inference reveals a non-monotonic energy–frequency relationship. While reducing GPU frequency from 1410 MHz to 1005 MHz (by ~28.7%) does increase execution time, the increase is sub-linear. Consequently, the total energy follows a U-shaped curve with respect to GPU frequency. This trend indicates that at low frequencies, execution time dominates energy, whereas at high frequencies, power dominates; in the middle lies an energy sweet spot.
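One way to see why energy can be U-shaped in frequency is a toy model in which energy is power times execution time. The sketch below assumes power has a static part plus a part growing with the cube of frequency, and that execution time has a clock-insensitive part plus a part inversely proportional to frequency. Both the model form and every constant are illustrative assumptions, not measurements from the profiling described above.

```python
# Toy model only: all constants are illustrative assumptions, not measured values.
P_STATIC_W = 250.0    # frequency-independent power draw (W)
K_DYNAMIC = 35.0      # dynamic power coefficient (W per GHz^3)
T_FIXED_S = 0.5       # time insensitive to core clock, e.g. memory-bound work (s)
WORK_GHZ_S = 0.7      # clock-bound work (GHz * s)

FREQS_MHZ = [1005, 1110, 1215, 1305, 1410]


def power_w(freq_mhz):
    """Assumed power model: static power plus a cubic dynamic term."""
    f_ghz = freq_mhz / 1000.0
    return P_STATIC_W + K_DYNAMIC * f_ghz ** 3


def time_s(freq_mhz):
    """Assumed time model: only part of the work speeds up with the clock."""
    f_ghz = freq_mhz / 1000.0
    return T_FIXED_S + WORK_GHZ_S / f_ghz


def energy_j(freq_mhz):
    """Energy is power multiplied by execution time."""
    return power_w(freq_mhz) * time_s(freq_mhz)


def sweet_spot(freqs_mhz):
    """Frequency with the lowest modeled energy among the candidates."""
    return min(freqs_mhz, key=energy_j)


for f in FREQS_MHZ:
    print(f"{f} MHz: time={time_s(f):.3f} s, power={power_w(f):.1f} W, "
          f"energy={energy_j(f):.1f} J")
print("sweet spot:", sweet_spot(FREQS_MHZ), "MHz")
```

Because part of the execution time does not scale with the clock, lowering the frequency lengthens execution by less than the frequency ratio, which is the sub-linear behavior noted above; the minimum lands strictly between the two endpoint frequencies under these assumed constants.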

Background

Numerous systems have been proposed to improve LLM serving efficiency. These include advanced batching strategies for throughput optimization, memory management techniques like PagedAttention, CPU offloading, and GPU kernel-level optimizations (e.g., FlashAttention). Parallelism frameworks and parameter-sharing mechanisms further reduce bottlenecks, while speculative decoding and preemptive scheduling improve tail latency and job completion times.

Batching & Memory Optimizations – PagedAttention, CPU offloading, and GPU kernel improvements.

Parallelism & Sharing – model parallelism, pipelining, and parameter reuse.

Latency Techniques – speculative decoding and preemptive scheduling for multi-tenant settings.

While these systems indirectly reduce energy use by improving latency or throughput, VoltanaLLM directly targets energy efficiency via frequency- and routing-aware scheduling, a relatively underexplored but critical dimension for sustainable deployment.

Several recent efforts have also begun addressing energy-efficient LLM serving. For instance, DynamoLLM explores GPU frequency control based on request characteristics, while μ-Serve optimizes power by co-serving multiple models. EcoServe considers operational and embodied carbon emissions, TAPAS exploits datacenter thermal slack, and Heron places GPUs closer to renewable sources.

To better manage compute heterogeneity, recent systems have introduced prefill/decode (P/D) disaggregation, separating the two phases across GPU nodes. Projects like SplitWise, TetriInfer, Llumnix, and DistServe show improvements in goodput, time-to-first-token, and SLO attainment. Popular inference libraries such as vLLM and SGLang have also added runtime support.

These efforts primarily optimize...
