MLSys @ WukLab - Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism
Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism<br>May 16, 2026<br>12 min read<br>Tags: LLM, Serving, Tensor Parallelism
Authors: Vikranth Srivatsa, Zijian He, Pu Guo, Dongming Li, and Yiying Zhang<br>TL;DR: A single LLM deployment now serves everything from latency-critical chat to relaxed background jobs under a fixed GPU budget, creating a tiered-SLO serving problem. We designed Nitsum [arXiv '26], the first serving system that treats tensor parallelism (TP) as a runtime control surface instead of a fixed deployment choice. By making TP switching nearly free and continuously reconfiguring the cluster to track shifting workloads, Nitsum improves SLO-compliant goodput by up to 5.3x over state-of-the-art serving systems.<br>LLM Serving Is Increasingly Tiered<br>A single model deployment today serves a mix of very different workloads on the same infrastructure: interactive chat, coding agents, computer-use agents, API calls embedded in products, and long-running background or scheduled jobs. Their latency expectations differ widely. Some need a fast first token and a steady token rate for a human in the loop; others tolerate much slower responses in exchange for lower cost. In practice, this creates tiers of service objectives.<br>Serving one request happens in two phases. First comes prefill: the model reads the entire prompt and produces the first token. Then comes decode: it generates the rest of the answer one token at a time. These two phases map directly to the two latency targets, called service-level objectives (SLOs), that users care about: Time To First Token (TTFT), set by prefill, and Time Per Output Token (TPOT), set by decode. A request only "counts" toward goodput if it meets both its TTFT and TPOT targets. With unlimited GPUs, tiered serving would be easy: give each tier its own cluster. But providers run under fixed GPU budgets, and the workload mix, request lengths, and load intensity all vary substantially over time. Strictly separating clusters wastes capacity; pooling all requests together makes the heterogeneous TTFT and TPOT objectives interfere.<br>This leads to the central question of our work:<br>How should an LLM serving system operate under a fixed GPU budget to maximize goodput, i.e., the number of requests per second that meet both their TTFT and TPOT SLOs, across multiple SLO tiers?
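To make the goodput definition concrete, here is a minimal sketch of how it could be computed over a measurement window. The tier names, the SLO values in `SLOS`, and the `Request` fields are hypothetical placeholders chosen for illustration; they are not taken from Nitsum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    tier: str      # hypothetical tier label, e.g. "interactive"
    ttft_s: float  # measured time to first token, in seconds
    tpot_s: float  # measured time per output token, in seconds


# Hypothetical per-tier targets: (TTFT SLO, TPOT SLO), both in seconds.
SLOS = {
    "interactive": (0.5, 0.05),
    "background": (10.0, 0.5),
}


def meets_slo(req, slos=SLOS):
    """A request counts only if it meets BOTH its TTFT and TPOT targets."""
    ttft_slo, tpot_slo = slos[req.tier]
    return req.ttft_s <= ttft_slo and req.tpot_s <= tpot_slo


def goodput(requests, window_s, slos=SLOS):
    """SLO-compliant requests per second over a window of window_s seconds."""
    compliant = sum(1 for r in requests if meets_slo(r, slos))
    return compliant / window_s
```

Note that a request that meets its TTFT target but misses its TPOT target contributes nothing, which is why the two objectives cannot be tuned independently.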
Tensor Parallelism Is a Hidden SLO Knob<br>First, what is tensor parallelism? A large model is too big for one GPU, so tensor parallelism (TP) splits each layer’s weight matrices across N GPUs. The GPUs work on every token together, exchanging partial results over a fast interconnect at each step. TP is normally set once, just high enough to make the model fit, and then never touched.<br>Existing SLO-aware systems mostly control when requests run: queuing, batching, migration, autoscaling. They leave how each request executes largely fixed. Our key observation is that the execution configuration itself can be used to improve SLO attainment, and the TP level, usually treated as a fixed deployment setting, is one of the most powerful knobs available.
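As a rough illustration of what "splitting a weight matrix across N GPUs" means, the sketch below simulates one common scheme, a column-wise split, with plain Python lists standing in for GPUs. This is an assumption made for illustration: the post does not say which split Nitsum or the underlying engine uses, and real systems do this with GPU kernels and collective communication rather than list concatenation.

```python
def matmul(x, w):
    """Multiply x (rows x k) by w (k x cols), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, n_shards):
    """Split w column-wise into n_shards equal slices, one per simulated GPU."""
    cols = len(w[0])
    if cols % n_shards != 0:
        raise ValueError("column count must be divisible by n_shards")
    width = cols // n_shards
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(n_shards)]


def tp_matmul(x, w, n_shards):
    """Each simulated GPU multiplies x by its own slice of w; the partial
    outputs are then concatenated column-wise (standing in for an all-gather)."""
    partials = [matmul(x, shard) for shard in split_columns(w, n_shards)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]
```

The result matches the unsplit product for any valid shard count, while each simulated GPU only ever touches `1 / n_shards` of the weights. That smaller per-GPU slice is the property the rest of the post relies on.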
Figure 2: Effect of tensor parallelism on TTFT, decode throughput, L2 cache hit rate, and communication cost across 14B and 70B models on A100, H100, and B200.<br>Higher TP splits the prefill work across more GPUs, so it reduces prefill latency and improves TTFT, which helps for long prompts or tight first-token targets. Decode is less obvious. When only a few requests are being processed at once, higher TP can also raise decode throughput, and therefore improve TPOT, by up to 3x. That looks backwards, since adding GPUs adds cross-GPU communication. The cause turns out to be memory rather than communication. At low TP, each GPU has to stream a huge weight matrix out of slow GPU memory on every decode step. At higher TP, each GPU holds a smaller slice that fits in its fast on-chip cache, so it spends much less time waiting on memory. Below a certain batch size that memory saving outweighs the extra communication; above it, communication dominates and lower TP is better. So TP affects both TTFT and TPOT, which makes it a usable runtime knob and not only a deployment choice.<br>No Single Static TP Wins<br>Because TP affects prefill and decode differently, and because each SLO tier imposes its own latency pressure, the goodput-optimal configuration is not fixed; it shifts as the workload mix and load change over time.
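To get a feel for how many choices "the configuration" covers, the sketch below enumerates the ways a fixed GPU budget can be carved into model instances of different TP degrees. It assumes TP degrees are restricted to powers of two (8, 4, 2, 1) and that every GPU is used; both are illustrative assumptions, and `tp_configs` is a hypothetical helper, not part of Nitsum.

```python
def tp_configs(num_gpus, degrees=(8, 4, 2, 1)):
    """Yield each way to partition num_gpus into instances, where every
    instance uses one of the allowed TP degrees. Degrees must be listed in
    descending order; each config is a tuple in non-increasing order, so
    (4, 2, 2) appears once and (2, 4, 2) never does."""
    def rec(remaining, start):
        if remaining == 0:
            yield ()
            return
        for i in range(start, len(degrees)):
            d = degrees[i]
            if d <= remaining:
                for rest in rec(remaining - d, i):
                    yield (d,) + rest

    yield from rec(num_gpus, 0)


if __name__ == "__main__":
    for cfg in tp_configs(8):
        print(cfg)
```

Under these assumptions, an 8-GPU node already has 10 distinct layouts, from a single TP=8 instance to eight TP=1 instances. A static deployment fixes one of them up front, whereas the argument above is that the best one changes with the workload.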
Figure 1: ServeGen conversation and coding workloads on 8 H100 GPUs. Request demand and the optimal cluster configuration vary continuously; no single static TP setting achieves the best goodput.<br>Figure 1 shows this on a real production trace (ServeGen, calibrated to Alibaba Cloud...