Distributed LLM Inference with llm-d | Moncef Abboud
Intro<br>What does production-grade LLM inference look like?<br>If you wake up in the middle of the night thinking about that question, this blog post might be for you.<br>Inference engines like vLLM and SGLang sit on top of PyTorch (when I say vLLM going forward, that also includes SGLang and other supported inference engines). They optimize inference at the node level, most notably by managing KV cache and improving throughput through techniques such as paged attention and continuous batching.<br>The elevator pitch for llm-d is that it’s an LLM-aware load balancer.<br>If we have multiple vLLM instances, each has its own state: available GPU memory, KV cache prefix matches, number of requests queuing to be processed, etc. We can’t simply use round robin, because round robin ignores all of that state. Selecting the best inference engine instance based on these signals is what llm-d is all about.<br>It also provides features such as flow control to support different classes of requests based on priority (e.g., premium real-time traffic vs batch workloads). In addition, it enables smooth disaggregated P/D, where prefill and decode run on different nodes because they benefit from different configs. Prefill is compute-bound, while decode is memory-bandwidth-bound, so they have different GPU requirements: low tensor parallelism (TP) for prefill to maximize compute, and high TP for decode to maximize memory bandwidth.<br>If It Ain’t Broke<br>The cool thing about llm-d is that it doesn’t reinvent the wheel. 
It builds on top of existing, established projects. LLM inference still happens on vLLM and SGLang, and we simply communicate with them via HTTP. The proxy layer and discovery are built on top of Kubernetes (k8s) and Envoy. Even the integration point for deciding which vLLM instance to choose relies on an existing extension point, namely Envoy’s ext_proc extension. The data layer and metric collection are essentially built on top of Prometheus. So, no reinventing the wheel, just intelligently combining solid existing solutions. The best part is that for each layer and piece (metrics, scoring, flow control, etc.), llm-d is easily extensible, with clear interfaces that new plugins can implement and use right away.<br>So if vLLM adds a new metric, or even if a new inference engine comes along, it can be easily added. If we want to implement a new way to pick or score, we just implement a new picker or scorer. If Prometheus falls out of favor, or if we want to use a custom monitoring solution, we can implement a plugin for that instead.<br>There’s a standardization effort to consolidate Generative AI inference on top of k8s, taking the form of the Gateway API Inference Extension, or GAIE. GAIE defines the API resources (like InferencePool) and the Endpoint Picker (EPP) role. llm-d’s router is an implementation of that EPP role, paired with an Envoy proxy that does the actual request forwarding.<br>That’s a strategic choice. Rather than having llm-d be an isolated initiative with a bespoke API, it’s built as an implementation of a broader k8s standard.<br>It’s also worth mentioning that llm-d has a mode where it runs outside of k8s via a file discovery plugin, where the vLLM endpoints are hardcoded instead of discovered via the k8s API.
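To make the scorer and picker idea a bit more concrete, here is a minimal sketch in Python. This is not llm-d’s actual plugin API: the `Endpoint` fields, the `Scorer` protocol, the three scorer classes, the `pick` function, and the weights are all hypothetical names I made up for illustration. The only thing it tries to show is the shape of the design: each scorer looks at one signal, and a picker combines the weighted scores to choose an instance.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence, Tuple


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical snapshot of one vLLM instance's state."""
    name: str
    queue_depth: int      # requests waiting to be processed
    kv_cache_free: float  # fraction of KV cache memory still free, 0.0 to 1.0
    prefix_match: float   # fraction of the prompt already cached, 0.0 to 1.0


class Scorer(Protocol):
    """Hypothetical plugin interface: return a score in [0, 1], higher is better."""
    def score(self, endpoint: Endpoint) -> float: ...


class PrefixScorer:
    """Prefers instances that already hold more of the prompt's KV cache."""
    def score(self, endpoint: Endpoint) -> float:
        return endpoint.prefix_match


class FreeMemoryScorer:
    """Prefers instances with more free KV cache memory."""
    def score(self, endpoint: Endpoint) -> float:
        return endpoint.kv_cache_free


class QueueScorer:
    """Prefers instances with shorter queues; max_depth is an assumed cap."""
    def __init__(self, max_depth: int = 10) -> None:
        self.max_depth = max_depth

    def score(self, endpoint: Endpoint) -> float:
        return max(0.0, 1.0 - endpoint.queue_depth / self.max_depth)


def pick(endpoints: Sequence[Endpoint],
         weighted_scorers: Sequence[Tuple[Scorer, float]]) -> Endpoint:
    """Max-score picker: weighted sum of all scorers, first endpoint wins ties."""
    def total(endpoint: Endpoint) -> float:
        return sum(weight * scorer.score(endpoint)
                   for scorer, weight in weighted_scorers)
    return max(endpoints, key=total)


endpoints = [
    Endpoint("n1", queue_depth=8, kv_cache_free=0.05, prefix_match=0.9),
    Endpoint("n2", queue_depth=0, kv_cache_free=0.9, prefix_match=0.0),
]
scorers = [(PrefixScorer(), 1.0), (FreeMemoryScorer(), 1.0), (QueueScorer(), 1.0)]
best = pick(endpoints, scorers)
print(best.name)  # n2: the idle instance beats the busy one with the cache hit
```

Adding a new signal in this sketch means writing one more class with a `score` method and appending it to the list, which is the kind of extensibility the plugin approach is after. With only `PrefixScorer` in the list, the same picker would choose `n1` instead.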
So what are these factors that need to be taken into account for LLM inference routing?<br>Prefix Cache<br>Say we have two nodes, N1 and N2. If a user has already gotten a response from N1, the KV cache for that user’s prompt has already been computed there, so a subsequent request that shares the same prefix can skip that step. If, however, we send the user’s request to N2, we need to re-run prefill and we won’t be taking advantage of the calculation already done on N1.<br>In other words, it’s efficient to route requests to nodes that already have the KV cache of the prompt (this consideration changes if we have KV cache offloading, in which case the cache can be stored in shared network storage, for instance).<br>KV-cache Utilization<br>Another factor we care about is how much free VRAM N1 and N2 have. If N1’s VRAM is almost full, even though it has the user’s KV cache, it might not be the best choice, because we might need to wait for other requests to finish or evict them in order to run the request. If at the same time N2’s VRAM is free and ready to go, it might be best to rerun the prefill and make use of the free memory.<br>Queue Depth<br>Both nodes might be running with little free...