GitHub - flouthoc/sors: Minimal proxy which reorders prompts for LLM to maximize prefix cache hit · GitHub
/" data-turbo-transient="true" />
Skip to content
Search or jump to...
Search code, repositories, users, issues, pull requests...
-->
Search
Clear
Search syntax tips
Provide feedback
--><br>We read every piece of feedback, and take your input very seriously.
Include my email address so I can be contacted
Cancel
Submit feedback
Saved searches
Use saved searches to filter your results more quickly
-->
Name
Query
To see all available qualifiers, see our documentation.
Cancel
Create saved search
Sign in
/;ref_cta:Sign up;ref_loc:header logged out"}"<br>Sign up
Appearance settings
Resetting focus
You signed in with another tab or window. Reload to refresh your session.<br>You signed out in another tab or window. Reload to refresh your session.<br>You switched accounts on another tab or window. Reload to refresh your session.
Dismiss alert
{{ message }}
flouthoc
sors
Public
Notifications<br>You must be signed in to change notification settings
Fork
Star
main
BranchesTags
Go to file
CodeOpen more actions menu
Folders and files<br>NameNameLast commit message<br>Last commit date<br>Latest commit
History<br>6 Commits<br>6 Commits
assets
assets
src
src
tests
tests
.gitignore
.gitignore
Cargo.lock
Cargo.lock
Cargo.toml
Cargo.toml
README.md
README.md
View all files
Repository files navigation
Sors - reorders prompts for LLM to maximize prefix cache hit.
A minimal reverse proxy that reorders prompt content to maximize prefix cache hits in LLM inference engines (vLLM, SGLang, or any OpenAI-compatible backend with prefix caching enabled).
How It Works
vLLM's Automatic Prefix Caching uses a radix tree keyed on sequential tokens from position 0. If volatile content (timestamps, request IDs) appears before a large static block, the entire downstream prefix is invalidated every request.
sors intercepts API requests, classifies prompt blocks as static/dynamic/unknown, and reorders them to place stable content at the prefix position — maximizing cache reuse.
Two Optimization Modes
Mode<br>Trigger<br>Mechanism
Tag-based<br>, , XML tags in content<br>Explicit extraction + reorder
Auto-detect<br>No tags, ENABLE_AUTO_DETECT=true<br>SHA-256 fingerprints, hit-count tracking, stability scoring
Output order: [static (longest first)] → [unknown] → [dynamic]
Quick Start
cargo build --release
# Start the proxy (configure via env vars)<br>VLLM_BACKEND=http://localhost:8000 ./target/release/sors
The proxy listens on port 9000 by default. Point your OpenAI client at http://localhost:9000 instead of the backend directly.
Configuration
All settings via environment variables:
Variable<br>Default<br>Description
VLLM_BACKEND<br>http://localhost:8000<br>Backend URL
PROXY_HOST<br>0.0.0.0<br>Bind host
PROXY_PORT<br>9000<br>Listen port
STABILITY_THRESHOLD<br>0.5<br>Min stability score for "static"
MIN_HITS_FOR_STATIC<br>Min times a block must appear
MIN_BLOCK_LENGTH<br>50<br>Min chars to process a block
MAX_BLOCK_HISTORY<br>10000<br>Max fingerprints stored
BACKEND_TIMEOUT<br>120.0<br>HTTP timeout (seconds)
ENABLE_AUTO_DETECT<br>true<br>Auto fingerprint mode
ENABLE_TAG_MODE<br>true<br>XML tag mode
ENABLE_METRICS<br>true<br>Record request metrics
ENABLE_ORDER_ANNOTATIONS<br>false<br>Inject logical order header
API Endpoints
Method<br>Path<br>Description
POST<br>/v1/chat/completions<br>Optimize messages, forward (streaming supported)
POST<br>/v1/completions<br>Optimize prompt string, forward
GET<br>/health<br>Proxy + backend health check
GET<br>/stats<br>Block engine statistics
GET<br>/metrics<br>Prometheus text format
GET<br>/metrics/json<br>JSON metrics summary
/{path}<br>Passthrough to backend
Testing
Two benchmark scripts compare proxy-optimized latency vs direct (unoptimized) requests.
Prerequisites
pip install requests
You need a running vLLM backend with prefix caching enabled:
VLLM_CPU_KVCACHE_SPACE=10 VLLM_CPU_OMP_THREADS_BIND=auto \<br>python -m vllm.entrypoints.openai.api_server \<br>--model Qwen/Qwen2.5-0.5B --dtype bfloat16 --port 8000 --enable-prefix-caching
Test 1: Tag-based optimization
Tests explicit , , tag reordering. Sends requests through the proxy (port 9000) and directly to vLLM (port 8000), comparing latency.
# Terminal 1: start the proxy<br>cargo run
# Terminal 2: run the benchmark<br>python tests/test_cache.py
Test 2: Auto-detection (no tags)
Tests the fingerprint-based auto-detection mode. The proxy learns which blocks are static by observing repeated content across requests, then begins reordering automatically.
# Terminal 1: start the proxy<br>cargo run
# Terminal 2: run the benchmark<br>python tests/test_auto_detect.py
The auto-detect test has two phases:
Learning phase (requests 1–3): proxy observes traffic patterns, no reordering yet
Optimization phase (requests 4–8): auto-reordering kicks in after blocks are classified as static
Both tests print per-request timings and a summary with average speedup.
Architecture
Client → sors (:9000) → Parse → Classify → Reorder → Forward → vLLM...