MordechaiHadad/rustgate: A Rust-powered token-aware rate limiter for FastAPI
Rustgate Backend
Small FastAPI backend that uses a Rust pyo3 extension for AI token-aware rate limiting backed by Redis.
Summary
This repository contains two parts:
bindings/ -- a Rust crate that exposes Python bindings via pyo3/maturin
backend/ -- a small FastAPI app that loads the compiled extension and serves a few endpoints
The Rust extension uses the axum-rate-limiter crate and the tiktoken-rs crate (a Rust port of OpenAI's tiktoken tokenizer) to count query tokens. Rate limits are applied per model and per sliding window, using the query's estimated token count multiplied by a model-specific cost factor.
Prerequisites
Rust toolchain (rustc/cargo)
Python 3.10+
uv (the uv dependency manager)
Redis (running on default port 6379, or configure with RUSTGATE_REDIS_URL)
Quick Start
Ensure uv is installed and available on PATH.
Install dependencies and build everything via uv:
uv sync
"uv sync" installs pinned dependencies from uv.lock and runs the build steps for this repository, including building the Rust pyo3 extension and installing the local Python package into the environment uv manages.
Run the server
After uv sync completes you can start the FastAPI server with uv:
uv run uvicorn main:app --app-dir src --host 127.0.0.1 --port 8001
This uses the environment and commands declared in the repo's uv configuration.
Environment
RUSTGATE_REDIS_URL -- Redis connection string used by the rate limiter (default: redis://127.0.0.1:6379/0)
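As a rough illustration, the variable and its documented default could be resolved like this. The helper name `redis_url` is hypothetical, treating an empty value as unset is an assumption, and the real lookup may happen inside the Rust extension rather than in Python.

```python
import os

# Default documented in the Environment section above.
DEFAULT_REDIS_URL = "redis://127.0.0.1:6379/0"


def redis_url(env=None):
    """Return RUSTGATE_REDIS_URL if set and non-empty, else the default.

    `env` is any mapping; it defaults to the process environment.
    (Hypothetical helper, not part of the rustgate API.)
    """
    env = os.environ if env is None else env
    return env.get("RUSTGATE_REDIS_URL") or DEFAULT_REDIS_URL
```

For example, starting the server with `RUSTGATE_REDIS_URL=redis://cache:6380/1` in the environment would point the limiter at that instance instead of the local default.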
API Endpoints
All POST endpoints accept a JSON body with a query field, parsed by the Rust layer for token counting.
GET /health -- basic health check, returns {"status": "ok"}
POST /models/auto -- tries gpt-5 first, falls back to gpt-4 if rate limited. Returns {"model": ""}.
POST /models/gpt-5 -- attempts to use gpt-5. Returns 429 if rate limited.
POST /models/gpt-4 -- attempts to use gpt-4. Returns 429 if rate limited.
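The fallback behaviour of POST /models/auto can be sketched as follows. This is only an illustration of the order described above, not the repository's handler: `pick_model` and the `try_acquire` callback are hypothetical names standing in for the call into the Rust limiter.

```python
from typing import Callable, Optional


def pick_model(query: str, try_acquire: Callable[[str, str], bool]) -> Optional[str]:
    """Return the first model whose rate-limit check passes, or None.

    try_acquire(model, query) is a hypothetical stand-in for the Rust
    limiter call; it returns True when the request fits the budget.
    """
    for model in ("gpt-5", "gpt-4"):  # preferred model first, then the fallback
        if try_acquire(model, query):
            return model
    return None  # in this sketch the caller would answer with HTTP 429
```

A handler built on this sketch would return the chosen name in the `model` field of the response, or a 429 when both checks fail.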
Rate Limiting
Rate limits are enforced in Rust via the RedisAiLimiter (axum-rate-limiter crate) with the following rules:
Sliding window: 10 minutes (600 seconds).
Total budget: 5000 charge units per window per client, identified by IP (X-Forwarded-For or remote address).
Per-token cost:
  gpt-4 family: 1 charge unit per token
  gpt-5 family: 25 charge units per token
Token counting: uses tiktoken-rs with the appropriate tokenizer (Cl100kBase for gpt-4, O200kBase for gpt-5).
Zero-token queries: bypass rate limiting entirely.
Example: a 200-token gpt-5 query costs 5000 charge units (200 x 25), consuming the entire budget. The same 200-token query against gpt-4 costs only 200 charge units (200 x 1).
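The charging rules above can be modelled with a small in-memory sketch. It is not the RedisAiLimiter implementation: the real limiter keeps its state in Redis, and the names `charge` and `SlidingBudget`, as well as the choice to allow a request that lands exactly on the budget, are assumptions made for illustration.

```python
import time
from collections import deque

WINDOW_SECONDS = 600                        # 10-minute sliding window
BUDGET = 5000                               # charge units per client per window
COST_PER_TOKEN = {"gpt-4": 1, "gpt-5": 25}  # charge units per token, by family


def charge(family: str, tokens: int) -> int:
    """Charge units for a query: token count times the family's cost factor."""
    return tokens * COST_PER_TOKEN[family]


class SlidingBudget:
    """In-memory budget for a single client (hypothetical helper)."""

    def __init__(self):
        self.events = deque()  # (timestamp, charge) pairs, oldest first

    def try_spend(self, family, tokens, now=None):
        now = time.monotonic() if now is None else now
        cost = charge(family, tokens)
        if cost == 0:
            return True  # zero-token queries bypass rate limiting
        # Drop charges that have slid out of the window.
        while self.events and self.events[0][0] <= now - WINDOW_SECONDS:
            self.events.popleft()
        used = sum(c for _, c in self.events)
        if used + cost > BUDGET:
            return False  # over budget; a server would answer 429
        self.events.append((now, cost))
        return True
```

With this model, one 200-token gpt-5 query uses the whole 5000-unit budget, so any further non-empty query from the same client is rejected until that charge leaves the 600-second window.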
Supported models
The Rust layer supports two model families:
gpt-4 and gpt-4.* (e.g. gpt-4, gpt-4.1)
gpt-5 and gpt-5.* (e.g. gpt-5, gpt-5.4)
Models like gpt-4o, gpt-4o-mini, gpt-5-mini, or o3 are not currently supported and will return a 400 error.
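One way to express the matching rule described above is an exact match on the family name or a match on the family name followed by a dot. This is a guess at the rule based on the listed examples, not the Rust layer's actual code; `model_family` is a hypothetical name, and names with further suffixes after the dot may be treated differently by the real implementation.

```python
from typing import Optional

FAMILIES = ("gpt-4", "gpt-5")


def model_family(name: str) -> Optional[str]:
    """Map a model name to its family, or None if it is unsupported.

    Accepts the bare family name ("gpt-4") and dotted versions ("gpt-4.1").
    (Hypothetical helper; the real check lives in the Rust extension.)
    """
    for family in FAMILIES:
        if name == family or name.startswith(family + "."):
            return family
    return None  # a server would answer 400 for these
```

Under this rule gpt-4.1 maps to the gpt-4 family and gpt-5.4 to the gpt-5 family, while gpt-4o, gpt-4o-mini, gpt-5-mini, and o3 map to nothing.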
Benchmark (sample load test)
Load test with oha against the POST /models/auto endpoint:
oha -z 30s -c 100 -m POST -d '{"query": "This is my grand query"}' \
  http://localhost:8001/models/auto
Results:
| Metric | Value |
| --- | --- |
| Duration | 30.01 s |
| Requests/sec | 1128.10 |
| Fastest latency | 15.34 ms |
| Average latency | 88.76 ms |
| Slowest latency | 774.95 ms |
| p50 | 75.0 ms |
| p90 | 152.1 ms |
| p95 | 168.36 ms |
| p99 | 183.6 ms |
| Rate-limited (429) | 33713 |
| Successful (200) | 40 |
Comparison with the lightweight GET /health endpoint (no rate limiting, no token counting, no model rerouting, just a fast return):
oha -z 30s -c 100 -m GET http://localhost:8001/health
| Metric | Value |
| --- | --- |
| Duration | 30.00 s |
| Requests/sec | 1496.39 |
| Fastest latency | 13.70 ms |
| Average latency | 66.90 ms |
| Slowest latency | 1.89 s |
| p50 | 51.10 ms |
| p90 | 58.80... |