Sors: a Rust proxy that reorders prompts to maximize vLLM prefix cache hits

flaccount1 pts0 comments

GitHub - flouthoc/sors: Minimal proxy which reorders prompts for LLM to maximize prefix cache hit · GitHub

/" data-turbo-transient="true" />

Skip to content

Search or jump to...

Search code, repositories, users, issues, pull requests...

-->

Search

Clear

Search syntax tips

Provide feedback

--><br>We read every piece of feedback, and take your input very seriously.

Include my email address so I can be contacted

Cancel

Submit feedback

Saved searches

Use saved searches to filter your results more quickly

-->

Name

Query

To see all available qualifiers, see our documentation.

Cancel

Create saved search

Sign in

/;ref_cta:Sign up;ref_loc:header logged out"}"<br>Sign up

Appearance settings

Resetting focus

You signed in with another tab or window. Reload to refresh your session.<br>You signed out in another tab or window. Reload to refresh your session.<br>You switched accounts on another tab or window. Reload to refresh your session.

Dismiss alert

{{ message }}

flouthoc

sors

Public

Notifications<br>You must be signed in to change notification settings

Fork

Star

main

BranchesTags

Go to file

CodeOpen more actions menu

Folders and files<br>NameNameLast commit message<br>Last commit date<br>Latest commit

History<br>6 Commits<br>6 Commits

assets

assets

src

src

tests

tests

.gitignore

.gitignore

Cargo.lock

Cargo.lock

Cargo.toml

Cargo.toml

README.md

README.md

View all files

Repository files navigation

Sors - reorders prompts for LLM to maximize prefix cache hit.

A minimal reverse proxy that reorders prompt content to maximize prefix cache hits in LLM inference engines (vLLM, SGLang, or any OpenAI-compatible backend with prefix caching enabled).

How It Works

vLLM's Automatic Prefix Caching uses a radix tree keyed on sequential tokens from position 0. If volatile content (timestamps, request IDs) appears before a large static block, the entire downstream prefix is invalidated every request.

sors intercepts API requests, classifies prompt blocks as static/dynamic/unknown, and reorders them to place stable content at the prefix position — maximizing cache reuse.

Two Optimization Modes

Mode<br>Trigger<br>Mechanism

Tag-based<br>, , XML tags in content<br>Explicit extraction + reorder

Auto-detect<br>No tags, ENABLE_AUTO_DETECT=true<br>SHA-256 fingerprints, hit-count tracking, stability scoring

Output order: [static (longest first)] → [unknown] → [dynamic]

Quick Start

cargo build --release

# Start the proxy (configure via env vars)<br>VLLM_BACKEND=http://localhost:8000 ./target/release/sors

The proxy listens on port 9000 by default. Point your OpenAI client at http://localhost:9000 instead of the backend directly.

Configuration

All settings via environment variables:

Variable<br>Default<br>Description

VLLM_BACKEND<br>http://localhost:8000<br>Backend URL

PROXY_HOST<br>0.0.0.0<br>Bind host

PROXY_PORT<br>9000<br>Listen port

STABILITY_THRESHOLD<br>0.5<br>Min stability score for "static"

MIN_HITS_FOR_STATIC<br>Min times a block must appear

MIN_BLOCK_LENGTH<br>50<br>Min chars to process a block

MAX_BLOCK_HISTORY<br>10000<br>Max fingerprints stored

BACKEND_TIMEOUT<br>120.0<br>HTTP timeout (seconds)

ENABLE_AUTO_DETECT<br>true<br>Auto fingerprint mode

ENABLE_TAG_MODE<br>true<br>XML tag mode

ENABLE_METRICS<br>true<br>Record request metrics

ENABLE_ORDER_ANNOTATIONS<br>false<br>Inject logical order header

API Endpoints

Method<br>Path<br>Description

POST<br>/v1/chat/completions<br>Optimize messages, forward (streaming supported)

POST<br>/v1/completions<br>Optimize prompt string, forward

GET<br>/health<br>Proxy + backend health check

GET<br>/stats<br>Block engine statistics

GET<br>/metrics<br>Prometheus text format

GET<br>/metrics/json<br>JSON metrics summary

/{path}<br>Passthrough to backend

Testing

Two benchmark scripts compare proxy-optimized latency vs direct (unoptimized) requests.

Prerequisites

pip install requests

You need a running vLLM backend with prefix caching enabled:

VLLM_CPU_KVCACHE_SPACE=10 VLLM_CPU_OMP_THREADS_BIND=auto \<br>python -m vllm.entrypoints.openai.api_server \<br>--model Qwen/Qwen2.5-0.5B --dtype bfloat16 --port 8000 --enable-prefix-caching

Test 1: Tag-based optimization

Tests explicit , , tag reordering. Sends requests through the proxy (port 9000) and directly to vLLM (port 8000), comparing latency.

# Terminal 1: start the proxy<br>cargo run

# Terminal 2: run the benchmark<br>python tests/test_cache.py

Test 2: Auto-detection (no tags)

Tests the fingerprint-based auto-detection mode. The proxy learns which blocks are static by observing repeated content across requests, then begins reordering automatically.

# Terminal 1: start the proxy<br>cargo run

# Terminal 2: run the benchmark<br>python tests/test_auto_detect.py

The auto-detect test has two phases:

Learning phase (requests 1–3): proxy observes traffic patterns, no reordering yet

Optimization phase (requests 4–8): auto-reordering kicks in after blocks are classified as static

Both tests print per-request timings and a summary with average speedup.

Architecture

Client → sors (:9000) → Parse → Classify → Reorder → Forward → vLLM...

proxy prefix requests sors vllm tests

Related Articles