Routing for serverless servers with Pingora, Envoy, and Spanner


Engineering, June 25, 2026 • 10 minute read

Claudia Zhu Member of Technical Staff

Charles Frye Member of Technical Staff

Richard Gong Member of Technical Staff

Modal makes it easy to run high-performance code in the cloud: Python functions, agent runtimes, notebooks, batch jobs, and more. Now, you can also run ultra-low-latency Servers on Modal for HTTP, WebSocket, and gRPC traffic.

    import modal

    app = modal.App("file-server")  # app name is illustrative

    @app.server(
        compute_region="us",
        routing_region="us-west",
    )
    class FileServer:
        @modal.enter()
        def start(self):
            import subprocess

            # Launch a static file server on port 8000 when the replica starts.
            subprocess.Popen(["python", "-m", "http.server", "8000"])

Servers are designed for applications where every millisecond counts, like LLM inference for interactive agents. Servers give you a regionalized, autoscaling pool of HTTP server replicas behind Modal’s routing layer, with the deployment ergonomics, fast feedback loops, and autoscaling we consider table stakes (for humans and for agents).

This might sound familiar: with Modal Web Functions, you could already expose HTTP endpoints on Modal. But Web Functions were architected for batteries-included robustness (queueing, retries, and a platform-managed request lifecycle) rather than bleeding-edge latency.

As inference latencies have plummeted, it has become increasingly critical to remove latency in the happy path and push everything else to the application layer. Those few dozen milliseconds are the difference between winning and losing. So we built an HTTP serving solution on Modal with minimum overhead without sacrificing core features.

[Figure: Request latency, Modal Server vs. Modal Web Function]

In this post, we’ll explain how. The hard part was not accepting and routing HTTP traffic; Envoy can do that. The hard part was preserving Modal’s semantics (auth, dynamic replica placement, regional routing, autoscaling, inference features, and tenant isolation) without putting a control-plane lookup or queue in the hot path.

What are Modal Servers for?

Modal Servers enable communication between your clients and a regionalized, autoscaling pool of HTTP server replicas on Modal via a reverse proxy routing system.

Contrast that lightweight system (right below) with Modal Web Functions, which include an input plane that affords queueing and retries (left below).
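Without an input plane, retries become the caller's job. As a rough illustration (not Modal's client API), the sketch below retries a request with exponential backoff and jitter whenever the response is an HTTP 503 Service Unavailable, which is what clients see when no replica is available. The function name `call_with_retries`, the `send` callable, and the default attempt count and delays are all assumptions made for this example.

```python
import random
import time
from typing import Callable


def call_with_retries(
    send: Callable[[], int],
    max_attempts: int = 4,
    base_delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-503 status or attempts run out.

    `send` is a hypothetical callable that performs one HTTP request and
    returns its status code. The last status seen is returned either way.
    """
    status = 503
    for attempt in range(max_attempts):
        status = send()
        if status != 503:
            return status
        if attempt < max_attempts - 1:
            # Full jitter: wait a random time up to base_delay * 2**attempt,
            # so many clients do not retry in lockstep.
            sleep(random.uniform(0, base_delay * 2**attempt))
    return status
```

In practice `send` might wrap a call from any HTTP client library. Whether a request is safe to retry at all (for example, a non-idempotent POST) remains an application-level decision.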

To understand these architectures and their choices better, consider the differences between the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP). TCP provides Internet applications with a reliable, ordered byte stream. UDP provides a lower-latency primitive that requires applications to handle out-of-order or dropped datagrams. Neither is “better” in the abstract. They optimize for different things.

Some applications (e.g., GPU-to-GPU comms, video calling) build on UDP, via RDMA over Converged Ethernet (RoCE) v2 or via Web Real-Time Communication (WebRTC). They achieve lower jitter and latency than TCP (cf. the limited uptake of iWARP RDMA-over-TCP). The tradeoff is reimplementing higher-layer reliability and ordering in an application-specific way (cf. “the end-to-end argument”).

Modal Web Functions are closer to TCP: retries and queueing are built in, so clients don’t need to worry about them. Modal Servers are closer to UDP: requests take a lighter, lower-latency path, but applications must handle the rough edges. If no replica is available, clients get a 503 Service Unavailable, the same response as if the service didn’t exist. Queueing and load-shedding (via 503s) are likewise pushed to the container application (for now!). But in certain applications, like low-latency LLM inference, it makes sense to make this trade.

Aside: This difference isn’t literal, btw. You can serve UDP applications on Modal containers using either stack with UDP hole-punching, e.g. to drive a robot or detect objects in a webcam feed. But the engineering constraints, our choices, and your options are analogous.

Designing Modal Servers

At a high level, the routing layer for Modal Servers comprises a streaming edge proxy for I/O, an intelligent stateless proxy, and a compute load balancer. The stateless proxy is configured by a shared global state, and the compute load balancer communicates with user containers and with the Autoscaler that creates and destroys them.
That looks something like this:

Two principles governed every choice we made for and within this architecture:

Maximize resource sharing while minimizing interference. We need to pool resources across tenants and requests (work-stealing, connection pooling, stream multiplexing) for aggregate performance, but we also need to provide the illusion of dedicated resources for correctness and per-tenant/per-request performance.

No network calls on the request path. No metadata fetches, no KV store, no fallback to blob storage. That prevents us...
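To make the second principle concrete, here is a minimal sketch of one way a proxy could keep routing decisions free of network calls: routes live in an in-process table that a background sync replaces wholesale, and the request path only reads local memory. This is an illustration under stated assumptions, not a description of Modal's implementation. The class `LocalRouter`, its methods, the round-robin policy, and the hostnames are all hypothetical.

```python
from itertools import count
from typing import Dict, Iterable, Optional, Tuple


class LocalRouter:
    """Hypothetical in-memory routing table for a stateless proxy."""

    def __init__(self) -> None:
        # hostname -> replica addresses; swapped as a whole, never mutated.
        self._routes: Dict[str, Tuple[str, ...]] = {}
        self._counter = count()

    def publish(self, routes: Dict[str, Iterable[str]]) -> None:
        """Install a new snapshot. Called off the request path, e.g. by a
        background task that syncs from shared global state."""
        self._routes = {host: tuple(addrs) for host, addrs in routes.items()}

    def pick(self, host: str) -> Optional[str]:
        """Request path: a local dictionary lookup and nothing else.

        Returns None when the host is unknown or has no replicas, in which
        case the caller would answer 503 Service Unavailable.
        """
        replicas = self._routes.get(host, ())
        if not replicas:
            return None
        # Simple round-robin over the replicas in the current snapshot.
        return replicas[next(self._counter) % len(replicas)]
```

Note that an unknown host and a known host with zero replicas are indistinguishable here, which mirrors the behavior described earlier: a missing replica yields the same 503 as a service that does not exist.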
