Show HN: Rustgate – Bypassing Python's event loop for token-aware rate limiting

Posted by MordechaiHadad (1 point, 0 comments)

GitHub - MordechaiHadad/rustgate: A Rust-powered token-aware rate limiter for FastAPI



Rustgate Backend

Small FastAPI backend that uses a Rust pyo3 extension for AI token-aware rate limiting backed by Redis.

Summary

This repository contains two parts:

bindings/ -- a Rust crate that exposes Python bindings via pyo3/maturin

backend/ -- a small FastAPI app that loads the compiled extension and serves a few endpoints

The Rust extension uses the axum-rate-limiter crate for rate limiting and tiktoken-rs, a Rust implementation of OpenAI's tiktoken tokenizer, to count query tokens. Rate limits are applied per model family over a sliding window: each request is charged its estimated token count multiplied by a model-specific cost factor.

Prerequisites

Rust toolchain (rustc/cargo)

Python 3.10+

uv (the Python package and project manager)

Redis (running on default port 6379, or configure with RUSTGATE_REDIS_URL)

Quick Start

Ensure uv is installed and available on PATH.

Install dependencies and build everything via uv:

uv sync

"uv sync" installs pinned dependencies from uv.lock and runs the build steps for this repository, including building the Rust pyo3 extension and installing the local Python package into the environment uv manages.

Run the server

After uv sync completes, start the FastAPI server with uv:

uv run uvicorn main:app --app-dir src --host 127.0.0.1 --port 8001

This uses the environment and commands declared in the repo's uv configuration.

Environment

RUSTGATE_REDIS_URL -- Redis connection string used by the rate limiter (default: redis://127.0.0.1:6379/0)
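A minimal sketch of how the documented default could be resolved on the Python side. The helper name `resolve_redis_url` is hypothetical and not part of this repository, and treating an empty value as unset is an assumption:

```python
import os

# Default value as documented above.
DEFAULT_REDIS_URL = "redis://127.0.0.1:6379/0"


def resolve_redis_url(environ=None):
    """Return RUSTGATE_REDIS_URL if set and non-empty, else the documented default.

    `environ` is injectable so the lookup can be exercised without touching
    the real process environment (hypothetical helper, for illustration only).
    """
    env = os.environ if environ is None else environ
    # Assumption: an empty string is treated the same as an unset variable.
    return env.get("RUSTGATE_REDIS_URL") or DEFAULT_REDIS_URL


if __name__ == "__main__":
    print(resolve_redis_url())
```

To point the limiter at another Redis instance, export the variable before starting the server, for example `RUSTGATE_REDIS_URL=redis://cache:6380/1`.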

API Endpoints

All POST endpoints accept a JSON body with a query field, parsed by the Rust layer for token counting.

GET /health -- basic health check, returns {"status": "ok"}

POST /models/auto -- tries gpt-5 first, falls back to gpt-4 if rate limited. Returns a JSON object whose "model" field names the model that was selected.

POST /models/gpt-5 -- attempts to use gpt-5. Returns 429 if rate limited.

POST /models/gpt-4 -- attempts to use gpt-4. Returns 429 if rate limited.
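The fallback behaviour of POST /models/auto can be sketched in plain Python as follows. This is an illustration under assumptions, not the repository's handler: `try_consume` is a hypothetical stand-in for the call into the Rust limiter, and the body of the 429 response is invented for the example.

```python
from typing import Callable, Optional, Tuple

# Hypothetical limiter hook: (model, query) -> True if the request fits the budget.
TryConsume = Callable[[str, str], bool]


def pick_model(query: str, try_consume: TryConsume) -> Optional[str]:
    """Try gpt-5 first, then gpt-4. None means both were rate limited."""
    for model in ("gpt-5", "gpt-4"):
        if try_consume(model, query):
            return model
    return None


def auto_response(query: str, try_consume: TryConsume) -> Tuple[int, dict]:
    """Map the outcome to an (HTTP status, JSON body) pair."""
    model = pick_model(query, try_consume)
    if model is None:
        # Assumption: the shape of the error body is not documented in the README.
        return 429, {"detail": "rate limited"}
    return 200, {"model": model}
```

The single-model endpoints follow the same pattern with a one-element candidate list, which is why they return 429 directly instead of falling back.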

Rate Limiting

Rate limits are enforced in Rust via the RedisAiLimiter (axum-rate-limiter crate) with the following rules:

Sliding window: 10 minutes (600 seconds).

Total budget: 5000 charge units per window per client, identified by IP (X-Forwarded-For or remote address).

Per-token cost:

gpt-4 family: 1 charge unit per token

gpt-5 family: 25 charge units per token

Token counting: uses tiktoken-rs with the appropriate tokenizer (Cl100kBase for gpt-4, O200kBase for gpt-5).

Zero-token queries: bypass rate limiting entirely.

Example: a 200-token gpt-5 query costs 5000 charge units (200 x 25), consuming the entire budget. The same 200-token query against gpt-4 costs only 200 charge units (200 x 1).
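The rules above can be illustrated with a small in-memory model. This is a sketch for reasoning about the arithmetic, not the Redis-backed RedisAiLimiter. It assumes one budget per client shared across model families, that a request landing exactly on the budget is accepted, and that rejected requests are not charged. The class and method names are hypothetical.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 600          # 10-minute sliding window
BUDGET = 5000                 # charge units per window per client
COST_PER_TOKEN = {"gpt-4": 1, "gpt-5": 25}


class SlidingWindowBudget:
    """In-memory illustration of a charge-unit sliding window (hypothetical)."""

    def __init__(self, window=WINDOW_SECONDS, budget=BUDGET):
        self.window = window
        self.budget = budget
        self._events = defaultdict(deque)  # client -> deque of (timestamp, cost)

    def try_consume(self, client, family, tokens, now):
        """Return True and record the charge if the request fits the budget."""
        if tokens == 0:
            return True  # zero-token queries bypass rate limiting
        cost = tokens * COST_PER_TOKEN[family]
        events = self._events[client]
        # Drop charges that have aged out of the window.
        while events and events[0][0] <= now - self.window:
            events.popleft()
        used = sum(charge for _, charge in events)
        if used + cost > self.budget:
            return False  # assumption: rejected requests are not charged
        events.append((now, cost))
        return True
```

With these assumptions, one 200-token gpt-5 request at time 0 uses the full 5000 units, so any further non-empty request from the same client is rejected until that charge leaves the window 600 seconds later.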

Supported models

The Rust layer supports two model families:

gpt-4 and gpt-4.* (e.g. gpt-4, gpt-4.1)

gpt-5 and gpt-5.* (e.g. gpt-5, gpt-5.4)

Models like gpt-4o, gpt-4o-mini, gpt-5-mini, or o3 are not currently supported and will return a 400 error.
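One way to express the family matching described above, written in Python for illustration. The actual check lives in the Rust layer and may differ; the rule used here (exact name, or the family name followed by a dot) is an assumption inferred from the examples, and `model_family` is a hypothetical helper.

```python
from typing import Optional

SUPPORTED_FAMILIES = ("gpt-4", "gpt-5")


def model_family(name: str) -> Optional[str]:
    """Return "gpt-4" or "gpt-5" for supported names, None otherwise.

    Assumed rule: the name is either the family itself or the family
    followed by ".something" (e.g. gpt-4.1). A caller would translate
    None into a 400 response.
    """
    for family in SUPPORTED_FAMILIES:
        if name == family or name.startswith(family + "."):
            return family
    return None
```

Under this rule gpt-4o and gpt-5-mini are rejected because the character after the family name is not a dot.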

Benchmark (sample load test)

Load test with oha against the POST /models/auto endpoint:

oha -z 30s -c 100 -m POST -d '{"query": "This is my grand query"}' \
  http://localhost:8001/models/auto

Results:

| Metric | Value |
| --- | --- |
| Duration | 30.01 s |
| Requests/sec | 1128.10 |
| Fastest latency | 15.34 ms |
| Average latency | 88.76 ms |
| Slowest latency | 774.95 ms |
| p50 | 75.0 ms |
| p90 | 152.1 ms |
| p95 | 168.36 ms |
| p99 | 183.6 ms |
| Rate-limited (429) | 33713 |
| Successful (200) | 40 |

Comparison with the lightweight GET /health endpoint (no rate limiting, no token counting, no model rerouting, just a fast return):

oha -z 30s -c 100 -m GET http://localhost:8001/health

| Metric | Value |
| --- | --- |
| Duration | 30.00 s |
| Requests/sec | 1496.39 |
| Fastest latency | 13.70 ms |
| Average latency | 66.90 ms |
| Slowest latency | 1.89 s |
| p50 | 51.10 ms |
| p90 | 58.80... |
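A quick sanity check on the two runs, using only the numbers reported above. With a fixed pool of 100 connections that each wait for a response before sending the next request, Little's law (requests in flight = throughput x average latency) predicts the average latency from the throughput alone. The function name is hypothetical.

```python
def expected_avg_latency_ms(concurrency: int, requests_per_sec: float) -> float:
    """Little's law rearranged: average latency = in-flight requests / throughput."""
    return concurrency / requests_per_sec * 1000.0


# Figures taken from the two oha runs above (-c 100 in both).
auto_ms = expected_avg_latency_ms(100, 1128.10)
health_ms = expected_avg_latency_ms(100, 1496.39)

if __name__ == "__main__":
    print(f"/models/auto predicted average: {auto_ms:.2f} ms (reported 88.76 ms)")
    print(f"/health predicted average:      {health_ms:.2f} ms (reported 66.90 ms)")
```

Both predictions land within a millisecond of the reported averages, which suggests the averages mostly reflect requests queueing behind 100 concurrent connections rather than per-request work. Requests/sec is therefore the more direct way to compare the two endpoints.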
