Sors: a Rust proxy that reorders prompts to maximize vLLM prefix cache hits

Repository: flouthoc/sors on GitHub

Sors - reorders LLM prompts to maximize prefix cache hits.

A minimal reverse proxy that reorders prompt content to maximize prefix cache hits in LLM inference engines (vLLM, SGLang, or any OpenAI-compatible backend with prefix caching enabled).

How It Works

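Prefix caches match on the exact token sequence starting at position 0: vLLM's Automatic Prefix Caching hashes each KV-cache block together with all tokens before it, and SGLang's RadixAttention walks a radix tree keyed on the same sequence. If volatile content (timestamps, request IDs) appears before a large static block, everything after it misses the cache on every request.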

sors intercepts API requests, classifies prompt blocks as static/dynamic/unknown, and reorders them so that stable content sits at the start of the prompt, maximizing cache reuse.
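To make the effect concrete, here is a minimal sketch that measures how much of two consecutive prompts is shared from position 0 under each ordering. It compares bytes rather than tokens, and the helper names (`shared_prefix_len`, `volatile_first`, `stable_first`) are hypothetical and not part of sors.

```rust
/// Length in bytes of the prefix two prompts share. This is only a rough
/// stand-in for the number of leading tokens a prefix cache could reuse.
fn shared_prefix_len(a: &str, b: &str) -> usize {
    a.bytes().zip(b.bytes()).take_while(|(x, y)| x == y).count()
}

/// Cache-hostile order: the volatile request ID comes before the static block.
fn volatile_first(request_id: &str, static_block: &str) -> String {
    format!("request_id={request_id}\n{static_block}")
}

/// Cache-friendly order: the static block comes first, the request ID last.
fn stable_first(request_id: &str, static_block: &str) -> String {
    format!("{static_block}\nrequest_id={request_id}")
}
```

With the volatile field first, two requests share only the literal `request_id=` before they diverge. With the static block first, the whole block is shared.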

Two Optimization Modes

| Mode | Trigger | Mechanism |
|---|---|---|
| Tag-based | XML tags in content | Explicit extraction + reorder |
| Auto-detect | No tags, ENABLE_AUTO_DETECT=true | SHA-256 fingerprints, hit-count tracking, stability scoring |

Output order: [static (longest first)] → [unknown] → [dynamic]
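The output order above can be sketched as a single stable sort. This is an illustration under assumptions, not the sors implementation: the `Class` and `Block` types and the `reorder` function are hypothetical names, and block length is measured in bytes.

```rust
/// Variant order doubles as sort order: Static < Unknown < Dynamic.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Class {
    Static,
    Unknown,
    Dynamic,
}

#[derive(Debug, Clone, PartialEq)]
struct Block {
    class: Class,
    text: String,
}

/// Order blocks as [static (longest first)] -> [unknown] -> [dynamic].
fn reorder(mut blocks: Vec<Block>) -> Vec<Block> {
    // sort_by is stable, so unknown and dynamic blocks keep their
    // original relative order.
    blocks.sort_by(|a, b| {
        a.class.cmp(&b.class).then_with(|| match a.class {
            Class::Static => b.text.len().cmp(&a.text.len()), // longest first
            _ => std::cmp::Ordering::Equal,
        })
    });
    blocks
}
```

Putting the longest static block first means the largest shareable span starts at position 0, where prefix matching begins.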

Quick Start

cargo build --release

# Start the proxy (configure via env vars)
VLLM_BACKEND=http://localhost:8000 ./target/release/sors

The proxy listens on port 9000 by default. Point your OpenAI client at http://localhost:9000 instead of the backend directly.

Configuration

All settings via environment variables:

| Variable | Default | Description |
|---|---|---|
| VLLM_BACKEND | http://localhost:8000 | Backend URL |
| PROXY_HOST | 0.0.0.0 | Bind host |
| PROXY_PORT | 9000 | Listen port |
| STABILITY_THRESHOLD | 0.5 | Min stability score for "static" |
| MIN_HITS_FOR_STATIC | | Min times a block must appear |
| MIN_BLOCK_LENGTH | 50 | Min chars to process a block |
| MAX_BLOCK_HISTORY | 10000 | Max fingerprints stored |
| BACKEND_TIMEOUT | 120.0 | HTTP timeout (seconds) |
| ENABLE_AUTO_DETECT | true | Auto fingerprint mode |
| ENABLE_TAG_MODE | true | XML tag mode |
| ENABLE_METRICS | true | Record request metrics |
| ENABLE_ORDER_ANNOTATIONS | false | Inject logical order header |
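As a rough sketch of how environment-driven settings with defaults can be read in Rust, the snippet below parses a handful of the variables listed above. The `Config` struct and the `parse_or`, `env_or`, and `load_config` helpers are hypothetical; how sors itself loads and validates its configuration may differ, and silently falling back on a malformed value is a choice made only for this example.

```rust
use std::env;
use std::str::FromStr;

/// Parse an optional raw value, falling back to `default` when the value
/// is missing or does not parse as `T`.
fn parse_or<T: FromStr>(raw: Option<String>, default: T) -> T {
    raw.and_then(|v| v.trim().parse().ok()).unwrap_or(default)
}

/// Read `name` from the environment with a typed default.
fn env_or<T: FromStr>(name: &str, default: T) -> T {
    parse_or(env::var(name).ok(), default)
}

/// A subset of the documented settings (hypothetical struct).
#[derive(Debug, PartialEq)]
struct Config {
    backend: String,
    port: u16,
    stability_threshold: f64,
    min_block_length: usize,
    enable_auto_detect: bool,
}

/// Defaults mirror the values in the table above.
fn load_config() -> Config {
    Config {
        backend: env_or("VLLM_BACKEND", "http://localhost:8000".to_string()),
        port: env_or("PROXY_PORT", 9000),
        stability_threshold: env_or("STABILITY_THRESHOLD", 0.5),
        min_block_length: env_or("MIN_BLOCK_LENGTH", 50),
        enable_auto_detect: env_or("ENABLE_AUTO_DETECT", true),
    }
}
```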

API Endpoints

| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions | Optimize messages, forward (streaming supported) |
| POST | /v1/completions | Optimize prompt string, forward |
| GET | /health | Proxy + backend health check |
| GET | /stats | Block engine statistics |
| GET | /metrics | Prometheus text format |
| GET | /metrics/json | JSON metrics summary |
| | /{path} | Passthrough to backend |
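The endpoint table can be read as a dispatch rule: a few paths are handled by the proxy and everything else is forwarded untouched. The sketch below encodes that reading with a hypothetical `Route` enum and `route` function. It is not the router sors uses, and it assumes that any method and path not listed falls through to the backend.

```rust
#[derive(Debug, PartialEq)]
enum Route {
    ChatCompletions, // optimize messages, then forward
    Completions,     // optimize prompt string, then forward
    Health,
    Stats,
    MetricsText,
    MetricsJson,
    Passthrough, // forward to the backend unchanged
}

/// Map a request's method and path to a handler category.
fn route(method: &str, path: &str) -> Route {
    match (method, path) {
        ("POST", "/v1/chat/completions") => Route::ChatCompletions,
        ("POST", "/v1/completions") => Route::Completions,
        ("GET", "/health") => Route::Health,
        ("GET", "/stats") => Route::Stats,
        ("GET", "/metrics") => Route::MetricsText,
        ("GET", "/metrics/json") => Route::MetricsJson,
        _ => Route::Passthrough,
    }
}
```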

Testing

Two benchmark scripts compare proxy-optimized latency vs direct (unoptimized) requests.

Prerequisites

pip install requests

You need a running vLLM backend with prefix caching enabled:

VLLM_CPU_KVCACHE_SPACE=10 VLLM_CPU_OMP_THREADS_BIND=auto \
  python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-0.5B --dtype bfloat16 --port 8000 --enable-prefix-caching

Test 1: Tag-based optimization

Tests explicit XML tag reordering. Sends requests through the proxy (port 9000) and directly to vLLM (port 8000), comparing latency.

# Terminal 1: start the proxy
cargo run

# Terminal 2: run the benchmark
python tests/test_cache.py

Test 2: Auto-detection (no tags)

Tests the fingerprint-based auto-detection mode. The proxy learns which blocks are static by observing repeated content across requests, then begins reordering automatically.

# Terminal 1: start the proxy
cargo run

# Terminal 2: run the benchmark
python tests/test_auto_detect.py

The auto-detect test has two phases:

Learning phase (requests 1–3): proxy observes traffic patterns, no reordering yet

Optimization phase (requests 4–8): auto-reordering kicks in after blocks are classified as static
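The two phases follow from how a hit-count tracker behaves: a block cannot be called static until it has been seen often enough. The sketch below illustrates that idea under several assumptions. It uses the standard library's `DefaultHasher` instead of SHA-256 to stay dependency-free, it defines the stability score as hits divided by requests observed (the actual formula is not documented here), and the `Tracker` type and the thresholds used in the usage example are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

/// Stand-in fingerprint. sors is described as using SHA-256; DefaultHasher
/// is used here only to avoid external crates.
fn fingerprint(block: &str) -> u64 {
    let mut h = DefaultHasher::new();
    block.hash(&mut h);
    h.finish()
}

struct Tracker {
    hits: HashMap<u64, u32>, // fingerprint -> number of requests containing it
    requests: u32,           // total requests observed
    min_hits: u32,           // assumed analogue of MIN_HITS_FOR_STATIC
    threshold: f64,          // assumed analogue of STABILITY_THRESHOLD
}

impl Tracker {
    fn new(min_hits: u32, threshold: f64) -> Self {
        Tracker { hits: HashMap::new(), requests: 0, min_hits, threshold }
    }

    /// Record one request. Each distinct block counts once per request.
    fn observe(&mut self, blocks: &[&str]) {
        self.requests += 1;
        let mut seen = HashSet::new();
        for b in blocks {
            let fp = fingerprint(b);
            if seen.insert(fp) {
                *self.hits.entry(fp).or_insert(0) += 1;
            }
        }
    }

    /// A block is static once it has enough hits and a high enough
    /// hits/requests ratio (assumed definition of stability).
    fn is_static(&self, block: &str) -> bool {
        if self.requests == 0 {
            return false;
        }
        let hits = self.hits.get(&fingerprint(block)).copied().unwrap_or(0);
        let stability = f64::from(hits) / f64::from(self.requests);
        hits >= self.min_hits && stability >= self.threshold
    }
}
```

For example, with `Tracker::new(3, 0.5)` (illustrative values only), a system prompt repeated in every request becomes static after the third `observe` call, while a per-request timestamp never does.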

Both tests print per-request timings and a summary with average speedup.

Architecture

Client → sors (:9000) → Parse → Classify → Reorder → Forward → vLLM...
