Why LLM Inference Needs a New Kind of Router

aviziva1 pts0 comments

Modular: Why LLM Inference Needs a New Kind of Router - Part 1

Hippocratic AI + Modular to power real-time patient conversations. Read More →

May 8, 2026

Why LLM Inference Needs a New Kind of Router - Part 1<br>Aayush Deshpande

Deep Dhillon

Alexandr Nikitin

Michael Dunn-OConnor

Engineering

HTTP routing has been a solved problem for many years. Round-robin, consistent hashing, least-connections. Pick one, put it in front of a pool of identical servers, and the traffic spreads pretty evenly.<br>But then came Large Language Models.<br>The backends here aren't interchangeable web servers. They're GPU pods holding large, local KV caches in high-bandwidth, RAM or SSD memory. That state is expensive to rebuild, not uniformly available across the cluster, and often determines whether a request returns quickly or spends seconds recomputing previous work. Some pods might specialize in prefill, others in decode. Conversations typically stretch across requests. A single inference call sometimes needs two backends in sequence. The old assumptions about "interchangeable backends" and "independent requests" don't support these requirements.<br>Cache routing: blind vs awareTraditional routing is blind to all of this. It treats every backend as interchangeable, every request as independent, every pod as equally good. GPU pods are none of those things. They’re stateful, specialized and heterogeneous. Inference routing has to account for that.<br>This is the first post in a three-part series about what routing has to become to handle inference workloads. Modular Cloud’s orchestration layer is built around this routing problem, and this series explains how it solves it.<br>Years of stateless routing<br>HTTP-era load balancing is built on a small menu of algorithms, each tuned to a specific deployment shape. They have different policies, but they share the same precondition: stateless backends.<br>Classic routing strategiesRound-robin distributes requests uniformly across a pool of identical backends. It assumes every backend serves every request at the same cost. This might look like eight replicas of the same web service behind a load balancer, each getting 12.5% of the traffic. It’s simple, fair, stateless.<br>Consistent hashing routes each request to a backend determined by hashing some property of the request (a key, url, session identifier), and picking the backend whose position on the hash ring is closest. It’s the routing strategy of choice when you want the same key to land on the same backend, for client-side caching or session affinity. The backend’s “stickiness” is a function of the request key, not of anything the backend is holding in memory.<br>Least-requests sends each new request to whichever backend has the fewest active requests, on the assumption that fewer active requests means more spare capacity. It works when every request takes roughly the same amount of work.<br>These policies share the same three assumptions:<br>Any backend can serve any request. The assignment is a policy choice not a correctness one.<br>Requests are independent. What happened on request N doesn’t change what you should do on request N+1.<br>Backends are interchangeable. The load balancer can swap one for another without the client noticing.<br>Those assumptions hold for stateless web services. LLM inference breaks all three.<br>Where LLMs break the model<br>LLM workloads violate the stateless assumptions in four specific ways. Each one introduces a dimension that traditional routing has no mechanism to handle.<br>KV Cache State<br>KV cache state: the same prompt with an 80x latency differenceWhen a pod serves an inference request, the forward pass builds a KV cache: the model's intermediate state for every token position, held in GPU memory. Modern engines retain that cache after the response completes, so later requests sharing a prefix can skip the equivalent compute.<br>This changes the routing problem drastically. A 100K token prompt landing on a pod with the first 75K tokens already cached can prefill in milliseconds. The same prompt hitting a cold pod takes seconds. Round-robin, blind to cache state, would produce unpredictable time to first token (TTFT) for identical requests.<br>Cache state is the primary driver of prefill latency variance at scale. A router that selects pods based on cache residency eliminates prefill compute proportional to the shared prefix length for every hit. This frees up GPU cycles the cluster would otherwise spend recomputing work it has already done.<br>Hardware Specialization<br>LLM Inference has two phases, and they stress hardware differently.<br>Prefill vs decode podsPrefill processes the entire prompt in parallel. It’s compute-bound. This means GPU cores are saturated doing dense matrix multiplications across thousands of tokens at once.<br>Decode generates tokens one at a time autoregressively, each token depending on every token before it. It’s memory-bandwidth-bound. This means most of the GPU’s time is spent fetching model weights and KV cache from high...

request routing inference requests cache backend

Related Articles