Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token | @rajveerbach’s blog
tl;dr Streaming context to an LLM as it arrives -- rather than waiting for complete retrieval -- reduces latency dramatically. But prior systems only handle one request at a time. Stream2LLM extends vLLM with concurrent streaming support, introducing scheduling policies that manage memory contention and dynamic input changes across concurrent requests. Evaluated on real-world web crawling and vector search traces, it achieves up to 11x TTFT improvement while maintaining throughput parity.
A user asks a question. Behind the scenes, a web crawler fetches pages to build context over about 10 seconds, with each page arriving roughly 700 milliseconds apart. Without streaming, the user stares at a blank screen the entire time – because the model cannot start until every page has arrived. With streaming, the model starts reading the first page while the rest are still being fetched. The first word appears in under a second.
Context streaming overlaps retrieval with prefill, reducing TTFT by beginning inference as chunks arrive.
For a single request, streaming context is straightforward and effective.
Now serve four requests at once. Each is streaming context at a different rate. Each holds KV cache blocks in GPU memory for its growing input. Memory fills up. The scheduler must decide: which request gets evicted? Should its cache be discarded or swapped to CPU? And what happens when a request’s input changes mid-flight because the search algorithm found better documents?
Prior streaming systems do not answer these questions – they evaluate single-request scenarios only. Stream2LLM extends vLLM to handle concurrent streaming, delivering up to 11x faster time-to-first-token (TTFT) while maintaining throughput parity. But there is a catch: naive streaming under memory pressure makes tail latency 5x worse than not streaming at all. Getting the scheduler right is the difference between an 11x win and a 5x regression.
Why Memory and Scheduling Matter
LLM inference has two phases. The prefill phase processes all input tokens in parallel, storing a key vector and value vector for every input token in GPU memory – the KV cache. The decode phase generates tokens one at a time, attending to the full cache at each step.
The cache is expensive. For Llama-3.1-8B with a 32K-token input, it is 4 GB per request in FP16. Serve four concurrent requests and you already have 16 GB of KV cache alone. Scale retrieval time further and memory runs out given the total requests kv state resident in given window of time. When it does, the inference system must evict requests: either discard their cache (recomputation) or move it to CPU (swapping). Both cost time.
Traditional inference waits for all documents before starting prefill. Time-to-first-token is retrieval time plus prefill time, and retrieval dominates – often seconds. Streaming eliminates this wait by feeding documents to the model as they arrive, but existing solutions only handle one request at a time. None address what happens when many requests stream simultaneously.
Two Kinds of Streaming
The complication is that context retrieval doesn’t follow a single pattern. There are two, and they break the KV cache in different ways:
Append-mode (web crawlers): Documents arrive sequentially and extend the input. [c1, q] becomes [c1, c2, q]. The cached prefix stays valid, so only the new tokens need processing. Favors swapping to preserve reusable cache.
Update-mode (vector search): The search algorithm refines its results, replacing documents. [c1, c3, q] becomes [c1, c2, q] – the cached KV pairs for c3 are now wrong. The longest common prefix (LCP) between old and new inputs determines how much cache survives. Favors recomputation to avoid retaining soon-invalid data.
Longest Common Prefix (LCP) invalidation for append-mode and update-mode streaming. Append mode preserves the full prefix; update mode may invalidate significant portions of the KV cache.
The figure above shows the difference concretely. Each row represents the token sequence at a point in time, with blocks showing which KV cache entries survive across updates. In append mode (left), the full prefix is preserved every time a new chunk arrives – the cache grows monotonically and nothing is wasted. In update mode (right), replacing a document in the middle invalidates everything after the longest common prefix. The marked blocks are cached KV pairs that must be discarded because the tokens they correspond to no longer exist in the new input.
A scheduler that treats these uniformly will either waste computation or serve stale cache. Stream2LLM handles both.
A Decoupled Scheduler
Stream2LLM splits the scheduling problem into two independent decisions. It targets the prefill instance in a disaggregated architecture (where prefill and decode run on separate GPU pools). TTFT and throughput are the relevant...