High Performance Distributed Inference with Ray Serve LLM

By Seiji Eicher, Jeffrey Wang, Kourosh Hakhamaneshi and Spencer Peterson (Google) | June 18, 2026

Today, in partnership with the Google Kubernetes Engine (GKE) team at Google Cloud, we are announcing a major milestone in Ray Serve LLM’s throughput and latency characteristics, driven by architecture changes across the stack. To illustrate the progress Ray Serve LLM has made in reducing orchestration overhead, we include comparisons to vllm-router, a known high-performance, Rust-based routing framework, as well as a retrospective comparison against previous versions of Ray Serve LLM.

Ray is a popular choice for complex distributed batch inference pipelines on heterogeneous hardware. In addition, we believe that Ray’s powerful primitives for fault tolerance and observability, along with its flexibility across Kubernetes and VMs, will enable the next generation of optimizations as LLM inference deployments become increasingly complex.

Below, we cover three major optimizations to the Ray Serve LLM + vLLM stack: direct streaming, a new vLLM Ray executor backend, and HAProxy integration. As a result, we see up to 4.4x higher request throughput than previous versions on prefill-heavy workloads, and up to 24x higher request throughput on decode-heavy workloads.

Ray Serve LLM closes the throughput gap

Cumulative Effect of Optimizations: The figure above shows the cumulative effect of the incremental optimizations compared to vLLM behind vllm-router. Ray Serve LLM now matches vllm-router performance in both prefill- and decode-heavy workloads, representing a 4.4x and 24.8x improvement, respectively, over the Ray Serve LLM baseline prior to the optimization effort. [1]

What’s new?

Three major optimizations contribute to Ray Serve LLM’s new performance capabilities.

Ray Serve LLM: Direct Streaming

Ray 2.56 introduces direct streaming mode for Ray Serve LLM. This new architecture decouples the request routing control plane from the request/response streaming data plane. On the forward path, the HAProxy ingress load balancer sends the request content to an ingress request router, which returns a routing decision based on a user-configured routing policy. Next, HAProxy establishes a direct HTTP connection with the selected target replica and streams tokens directly back to the client.

The new design resolves a bottleneck in the legacy architecture, where the intermediate routing deployment (OpenAiIngress) was also responsible for forwarding response tokens back to HAProxy, which taxed its event loop and added to time per output token (TPOT). Try this out by setting RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING=1. See docs for usage.
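To make the control-plane/data-plane split described above concrete, here is a minimal, purely illustrative sketch in plain Python. It does not use Ray Serve, vLLM, or HAProxy APIs: `Replica`, `Router`, `proxy_request`, and the least-in-flight policy are hypothetical names and choices introduced only for this example. The point it tries to show is that the router answers "which replica?" and nothing else, while tokens flow from the replica to the caller without passing back through the router.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Replica:
    """Hypothetical stand-in for a model replica that streams tokens."""
    name: str
    in_flight: int = 0

    def stream(self, prompt: str) -> Iterator[str]:
        # Data plane: tokens go from the replica straight to the caller.
        self.in_flight += 1
        try:
            for word in prompt.split():
                yield word.upper()  # placeholder for generated tokens
        finally:
            self.in_flight -= 1


@dataclass
class Router:
    """Hypothetical control plane: it only returns a routing decision."""
    replicas: List[Replica]
    tokens_forwarded: int = 0  # never incremented: the router sees no tokens

    def pick(self, prompt: str) -> Replica:
        # Assumed policy for illustration: fewest in-flight requests.
        return min(self.replicas, key=lambda r: r.in_flight)


def proxy_request(router: Router, prompt: str) -> Iterator[str]:
    """Plays the role of the ingress proxy in this toy model."""
    replica = router.pick(prompt)       # control-plane query
    yield from replica.stream(prompt)   # direct data-plane stream
```

In this toy model, `list(proxy_request(router, "hello world"))` yields the replica's tokens while `router.tokens_forwarded` stays at zero, which mirrors the idea that the routing component is no longer on the token-forwarding path.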

Ray Serve LLM Direct Streaming: In the figure above, LLMRouter serves as the direct streaming application’s ingress request router. After the router serves a routing decision, HAProxy can establish a connection directly to the target replica for data-plane communication. OpenAiIngress was the intermediate routing deployment used in the legacy architecture.

vLLM: Ray Executor Backend V2

The revamped Ray backend for vLLM, RayExecutorV2, is enabled by default in vLLM 0.21.0. It combines Ray’s process management capabilities with the battle-tested feature set of the mp backend’s data and control planes. In addition, the new Ray backend makes it easier to inherit other features, such as asynchronous scheduling.

Ray Serve: HAProxy

In Ray 2.55, we released two major optimizations to Ray Serve: a C-based HAProxy ingress load balancer and high throughput mode optimizations. For LLM serving, this also included disabling TCP small-packet buffering (Nagle’s algorithm) by default for improved streaming performance. Details are covered in the announcement blogpost and docs.

In Ray 2.56, HAProxy is available in all rayproject/ray container images, including rayproject/ray-llm:2.56-py312-cu130, our recommended container image for LLM serving, which includes extras from the vLLM base images, such as DeepGEMM. If the Ray docker images can’t be used, in Ray 2.56 HAProxy can be installed via pip install ray-haproxy and enabled with RAY_SERVE_EXPERIMENTAL_PIP_HAPROXY=1. The binary will be automatically included and enabled with pip install ray[serve] in Ray 2.57.

Benchmarks

We considered workloads with varying input sequence length (ISL) to output sequence length (OSL) ratios to simulate generic prefill- and decode-heavy workloads, and a multi-turn agentic workload to demonstrate request routing and cache reuse capabilities. In particular, these were:

Randomized prefill-heavy workload with ISL=8000, OSL=50

Randomized decode-heavy workload with ISL=50, OSL=500

Simulated prompt and traffic pattern traces from a multi-turn coding agent capped at 20 turns
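As a rough illustration of the two randomized workloads listed above, the following sketch encodes their ISL/OSL settings and generates random token-ID prompts. This is not the benchmark harness used for the results in this post: the `Workload` class, the `random_prompt` helper, and the vocabulary size of 32000 are assumptions made only for this example.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a synthetic benchmark workload."""
    name: str
    isl: int  # input sequence length, in tokens
    osl: int  # output sequence length, in tokens

    @property
    def isl_to_osl(self) -> float:
        # Ratio > 1 leans prefill-heavy, ratio < 1 leans decode-heavy.
        return self.isl / self.osl


# ISL/OSL values taken from the workload list above.
PREFILL_HEAVY = Workload("prefill-heavy", isl=8000, osl=50)
DECODE_HEAVY = Workload("decode-heavy", isl=50, osl=500)


def random_prompt(
    workload: Workload,
    vocab_size: int = 32000,  # assumed vocabulary size, for illustration
    seed: Optional[int] = None,
) -> List[int]:
    """Return `isl` random token IDs drawn uniformly from the vocabulary."""
    rng = random.Random(seed)
    return [rng.randrange(vocab_size) for _ in range(workload.isl)]
```

Here `PREFILL_HEAVY.isl_to_osl` is 160.0 and `DECODE_HEAVY.isl_to_osl` is 0.1. Because each prompt is drawn independently at random, two requests are very unlikely to share a long common prefix, so a prefix cache has little to reuse.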

The random workloads are intended to isolate orchestration overhead: because the prompts are random, the workload gains nothing from prefix caching. For example, prefill-heavy...
