LiteLLM Migrates to Rust

Migrating LiteLLM to Rust - Building the Fastest and Litest AI Gateway | liteLLM

Blog Skip to main content

Last Updated: June 2026

Over the past year, we have heard the same thing from our users and our community: they want the fastest, most lightweight AI gateway they can run. We have heard you. We are addressing it by moving LiteLLM to Rust, and committing to sub-1ms overhead with a sub-100MB memory binary you can deploy. By the end of this migration, you will get a pure Rust server that can serve 100% of your AI traffic, with every hot path operation, including auth and rate limiting, running in Rust.

Want to help us build it? We are opening an early beta and want to work directly with teams who care about a fast, lightweight gateway. If that is you, sign up here and we will get you testing the Rust gateway in your own stack, with a direct line to our team.

The reason it matters: under real load, CPU and memory climb with concurrency, and pods get OOM-killed at the worst time. Today the LiteLLM Python proxy peaks around 359MB of memory under load, and that cost multiplies across every pod, region, and retry you run.

We are already seeing the payoff in benchmarks. The Rust gateway serves about 15x the throughput (453 to 6,782 requests per second) on about 11x less memory (359MB to 32MB), and cuts per-request overhead from about 7.5ms on the Python path to about 0.05ms, well under the 1ms we commit to.

What you get

You deploy a single Rust binary. It uses about 65MB of memory, gateway overhead stays under 1ms, and nothing in your setup changes: same config.yaml, same database, same client API, same providers. You keep LiteLLM's coverage of 100+ LLM providers behind one OpenAI-compatible API, with /chat/completions, /messages, /responses, and every other LLM endpoint LiteLLM supports today, now as the fastest and most lightweight LLM gateway you can self-host.

This is not a v2 and not a rewrite. There is no new major version to migrate to and nothing for you to change. The runtime under the hot path gets faster and lighter while your config stays exactly where it is.

We ship this the careful way. Each route moves to Rust only after it passes our full parity and end-to-end test suite, and it runs in production before the next route starts. Stability is the priority, and we target zero regressions on every release.

How fast is the LiteLLM gateway? A throughput, overhead, and memory benchmark

Per-request overhead. We built a small harness: a mock upstream, a thin Rust forwarding gateway (axum), the same forwarding path running through LiteLLM today (litellm.acompletion over uvicorn), and a load client that times each request in microseconds. At 10 concurrent clients against the same mock, the Rust gateway adds about 0.05ms of overhead per request; the LiteLLM Python path adds about 7.5ms. That is roughly 150x lower, and well under the 1ms we commit to.

Sustained load. Against the current LiteLLM Python proxy on the same /v1/responses workload at 50 concurrent clients, the Rust path served about 15x the throughput on about 11x less memory.

Per-request overheadThroughput under loadPeak memory under loadRust gateway ~0.05ms6,782 req/s31.7MBLiteLLM (Python) ~7.5ms453 req/s358.9MB The overhead harness (mock, gateway, load client) is checked in next to this post under benchmark/, and the summarized numbers are in rust_proxy_benchmark_results.csv, so you can reproduce the sub-1ms result. This measures the gateway forwarding path (request transform, forwarding, response handling), not a full production workload.

What stays the same

Nothing you depend on changes. The migration is invisible from the outside:

Your Python SDK keeps the exact same interface; the same calls now run on Rust bindings underneath.

Your config.yaml is unchanged.

Your database and schema are unchanged.

Your client API and request/response shapes are unchanged.

Your providers, routing, and keys are unchanged.

You get lower memory and lower overhead, and you do nothing to get it.

How the migration works

If you just want the outcome, you have it above. The rest of this post is for engineers who want to see how we move the gateway to Rust without breaking anything.

The core idea is a clean split. We build one Rust core that only transforms data: it turns your request into a provider request, turns the provider response back, transforms stream chunks, counts tokens, and normalizes errors. It never opens a socket, reads a secret, or writes to your database. The host process does all of that. That separation is what lets us put Rust into production without rewriting the server, because Python keeps doing the I/O while Rust takes over the translation.

Stage 0 · Today Pure Python SDK + FastAPI proxy 100% Python

Stage 1 · Core in Rust Python drives Rust transforms via PyO3 V0 to V3

Stage 2 · Thin shell FastAPI shell, hot path all Rust V4 to V5a

Stage 3 · Pure Rust axum server, Python in sidecar V5b

Rust share of hot...

LiteLLM Migrates to Rust

Related Articles

Claude Fable 5

US Government directive to suspend access to Fable 5 and Mythos 5

Is AI ruining our skills? Early results are in – and they're not good

The Anatomy of an AI-Native Org

Apertus – Open Foundation Model for Sovereign AI