KVBoost — Pitch Deck
pip install kvboost
KVBoost
Faster LLM Inference. Less VRAM. No Model Changes.
Chunk-level KV cache reuse · FlashAttention-2 · AWQ layer streaming · CPU paged decoding
The Problem
LLM inference is broken by default.
🧱
VRAM Walls
Modern LLMs like Qwen2.5-32B need 60+ GB of VRAM for weights alone at 16-bit precision, out of reach for most teams.
🐢
Slow Prefill
Repeated system prompts are recomputed from scratch on every request, wasting GPU cycles each time.
🔧
HF Bottlenecks
HuggingFace's default inference loop keeps a KV cache only within a single generation call: no reuse across requests, no chunked attention, and no memory-efficient decoding.
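To make the VRAM wall concrete, here is a back-of-the-envelope estimate of weight memory at different precisions. It is a minimal sketch that assumes a round 32e9 parameter count and counts weights only (no KV cache, activations, or framework overhead); `weight_memory_gb` is an illustrative helper, not part of KVBoost.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_32B = 32e9  # assumption: round parameter count for a "32B" model

for label, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT4 (e.g. AWQ)", 4)]:
    print(f"{label:>16}: {weight_memory_gb(PARAMS_32B, bits):6.1f} GB")
```

Under these assumptions the 16-bit figure lands at about 64 GB, and 4-bit weights still need about 16 GB, which is why quantization alone does not fit a 32B model on an 8 GB card.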
The Solution
KVBoost: drop-in, no rewrites.
Python
from kvboost import KVBoost

engine = KVBoost.from_pretrained(
    "Qwen/Qwen2.5-3B"
)

# Warm a shared prefix once
engine.warm("You are a helpful assistant...")

# All subsequent calls reuse the cache
prompt = "You are a helpful assistant... What is a KV cache?"  # example prompt
result = engine.generate(prompt)
print(result.kv_reuse_ratio)  # ✓ 80%+
KV Cache Reuse
Chunk-level cache reuse eliminates redundant prefill for shared prompts.
🚀
FlashAttention-2
Memory-efficient attention with 3–5× TTFT speedup vs vanilla HuggingFace.
💾
AWQ Layer Streaming
Run 32B+ models on 8 GB VRAM via pinned-host weight streaming.
🗄️
CPU Paged Decoding
Spill KV cache to CPU RAM to handle long contexts without OOM errors.
Performance
Real numbers. Real hardware.
3–5×
TTFT Speedup vs HF Baseline
80%+
KV Cache Hit Rate (Multi-Turn)
8 GB
VRAM for 32B Model (AWQ Streaming)
~10K
Lines of Code (43 Python Modules)
Time to First Token (ms), lower is better
HF Baseline: 850 ms
Prefix Reuse: 320 ms
Chunk Reuse: 210 ms
Multi-Turn Cache Hit Rate (%)
Turn 1: 0%
Turn 2: 45%
Turn 3: 68%
Turn 4: 78%
Turn 5+: 85%
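The speedup implied by the TTFT figures above can be checked with a few lines of arithmetic. The numbers are taken directly from the chart; the dictionary and variable names are illustrative only.

```python
# Time to first token in milliseconds, as reported in the chart
ttft_ms = {"HF Baseline": 850, "Prefix Reuse": 320, "Chunk Reuse": 210}

baseline = ttft_ms["HF Baseline"]
speedups = {name: baseline / ms for name, ms in ttft_ms.items()}

for name, factor in speedups.items():
    print(f"{name:>12}: {ttft_ms[name]:4d} ms  ({factor:.2f}x vs baseline)")
```

On these figures chunk reuse works out to roughly 4× over the baseline, while prefix reuse alone comes out a little under 3×.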
How It Works
Four layers of optimization.
01
Hash Chunks
Incoming prompt is split into chunks. Each chunk is hashed to look up prior cached K/V pairs.
02
Reuse Cache
Matching chunks skip prefill entirely: their cached K/V tensors are loaded instead of recomputed. Only novel tokens are forwarded through the transformer.
03
Flash Attention
New tokens run FlashAttention-2: tiled CUDA kernels whose memory grows linearly, O(N), with sequence length instead of quadratically. No custom model code needed.
04
Page Offload
Long-context KV blocks are evicted to CPU RAM via async DMA — enabling contexts beyond GPU VRAM.
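Steps 01 and 02 can be sketched in plain Python. This is a toy illustration under stated assumptions, not KVBoost's implementation: `CHUNK_SIZE`, `chunk_key`, and `ChunkCache` are hypothetical names, the cache stores placeholder strings instead of K/V tensors, and the sketch ignores positional encoding and attention across chunk boundaries.

```python
import hashlib
from typing import Dict, List, Tuple

CHUNK_SIZE = 4  # hypothetical chunk length in tokens, kept tiny for the demo


def chunk_key(token_ids: List[int]) -> str:
    """Deterministic content hash of one chunk of token ids."""
    payload = ",".join(map(str, token_ids)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def split_chunks(token_ids: List[int], size: int = CHUNK_SIZE) -> List[List[int]]:
    """Split into consecutive chunks; the last one may be shorter than `size`."""
    return [token_ids[i:i + size] for i in range(0, len(token_ids), size)]


class ChunkCache:
    """Toy stand-in for a K/V store: chunk hash -> placeholder payload."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def process(self, token_ids: List[int]) -> Tuple[int, int]:
        """Return (reused_tokens, computed_tokens) for one prompt."""
        reused = computed = 0
        for chunk in split_chunks(token_ids):
            key = chunk_key(chunk)
            is_full = len(chunk) == CHUNK_SIZE
            if is_full and key in self._store:
                reused += len(chunk)    # cache hit: no prefill for this chunk
            else:
                computed += len(chunk)  # cache miss: the model would run here
                if is_full:             # only full chunks are cached
                    self._store[key] = f"kv-for-{key[:8]}"
        return reused, computed


cache = ChunkCache()
system = list(range(100, 108))  # 8 tokens = 2 full chunks shared by both prompts
first = cache.process(system + [1, 2, 3])
second = cache.process(system + [7, 8, 9, 10, 11])
print("first request :", first)
print("second request:", second)
```

In this toy run the first request computes every token, while the second reuses the two shared chunks and computes only its new tokens. A reuse ratio could then be defined as reused tokens divided by total tokens.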
AWQ Layer Streaming
Run a 32B model on a gaming GPU.
Terminal
$ python -m kvboost.streaming.demo_partial_8b \
    --model Qwen/Qwen2.5-32B-Instruct-AWQ
INFO: Replaced projections:
56 resident across 8 layers
392 streamed across 56 layers
load_time: 10.7s
peak_vram_after_load: 5.65 GB
avg_tok_per_s: 0.11
peak_vram_during_decode: 6.13 GB
5.65 GB
Peak VRAM after loading a 32B model — fits on a single 8 GB gaming GPU.
6.13 GB
Peak VRAM during decode — stays safely under the 8 GB limit.
0.11 tok/s
PCIe-bound throughput — built for VRAM savings, not raw speed.
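A rough estimate shows why streamed decoding is bound by host-to-GPU transfer. The sketch below uses the projection counts and throughput from the log above, and assumes a round 32e9 parameter count, 4-bit weights with quantization scales ignored, equally sized projections, and that every streamed projection is transferred once per generated token. These are simplifications, so treat the result as an order-of-magnitude figure.

```python
PARAMS = 32e9        # assumption: round parameter count for a "32B" model
BITS_PER_WEIGHT = 4  # assumption: 4-bit weights, scales and zero-points ignored
STREAMED = 392       # projections streamed from host memory (from the log)
RESIDENT = 56        # projections kept resident on the GPU (from the log)
TOK_PER_S = 0.11     # measured decode throughput (from the log)

total_weight_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9
streamed_fraction = STREAMED / (STREAMED + RESIDENT)
gb_per_token = total_weight_gb * streamed_fraction  # data moved per token
seconds_per_token = 1 / TOK_PER_S
effective_gb_per_s = gb_per_token / seconds_per_token

print(f"streamed fraction : {streamed_fraction:.3f}")
print(f"data per token    : {gb_per_token:.1f} GB")
print(f"time per token    : {seconds_per_token:.1f} s")
print(f"effective transfer: {effective_gb_per_s:.2f} GB/s")
```

Under these assumptions each token requires moving on the order of 14 GB across the bus, so throughput is set by transfer bandwidth rather than compute.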
Use Cases
Who needs KVBoost?
💻
AI Coding Assistants
System prompts are reused across hundreds of requests. Cache the context once and cut time to first token by 3–5×.
📚
RAG Pipelines
Document chunks appear in many queries. Chunk-level reuse speeds up multi-document QA by skipping prefill for documents already seen.
⚙️
Edge / Budget Infra
AWQ streaming lets teams deploy 30B+ models on consumer GPUs — no $10K A100 required.
💬
Multi-Turn Chatbots
Conversation history grows each turn. CPU paged decoding handles long contexts without OOM crashes.
MIT Licensed · Drop-in with HuggingFace Transformers · No fine-tuning, no architecture changes
Technology
Built on solid foundations.
FlashAttention-2
Tiled CUDA kernels for attention with O(N) memory
AWQ (Activation-aware Weight Quantization)
Weight-only 4-bit quantization preserving accuracy
HuggingFace Transformers
Drop-in compatibility, no model changes required
CUDA DMA Streams
Async PCIe transfers for layer-by-layer weight streaming
Chunk Hashing
Deterministic token-level hashing for cache lookup
CPU Paged Memory
Page-table KV offload that evicts cold blocks to RAM
PyPI Package
pip install kvboost, ready in 2 minutes
MIT License
Fully open source, production-ready for any use
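The idea behind page-table KV offload can be illustrated with a small two-tier store that evicts the least recently used block to host memory. This is a minimal sketch, not KVBoost's implementation: `PagedKVStore` is a hypothetical class, blocks are plain bytes rather than tensors, and the "GPU" and "CPU" tiers are ordinary Python dictionaries.

```python
from collections import OrderedDict
from typing import Dict, Hashable


class PagedKVStore:
    """Toy two-tier store: a bounded 'GPU' tier backed by a 'CPU' tier."""

    def __init__(self, gpu_capacity: int) -> None:
        if gpu_capacity < 1:
            raise ValueError("gpu_capacity must be at least 1")
        self.gpu_capacity = gpu_capacity
        self.gpu: "OrderedDict[Hashable, bytes]" = OrderedDict()  # LRU order
        self.cpu: Dict[Hashable, bytes] = {}

    def put(self, block_id: Hashable, block: bytes) -> None:
        """Insert a block as most recently used, evicting cold blocks if needed."""
        self.gpu[block_id] = block
        self.gpu.move_to_end(block_id)
        self._evict_if_needed()

    def get(self, block_id: Hashable) -> bytes:
        """Return a block, paging it back in from the CPU tier on a miss."""
        if block_id not in self.gpu:
            self.gpu[block_id] = self.cpu.pop(block_id)  # page fault
        self.gpu.move_to_end(block_id)  # mark as most recently used
        block = self.gpu[block_id]
        self._evict_if_needed()
        return block

    def _evict_if_needed(self) -> None:
        while len(self.gpu) > self.gpu_capacity:
            cold_id, cold_block = self.gpu.popitem(last=False)  # least recent
            self.cpu[cold_id] = cold_block


store = PagedKVStore(gpu_capacity=2)
for i in range(4):
    store.put(i, f"kv-block-{i}".encode())

store.get(0)  # block 0 was evicted earlier, so this pages it back in
print("gpu tier:", list(store.gpu))
print("cpu tier:", sorted(store.cpu))
```

A real system would move tensors between device and pinned host memory asynchronously; the eviction policy shown here (LRU) is one common choice and is an assumption, since the deck does not say which policy is used.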
Roadmap
What's next.
Now ✅
✓ Chunk-level KV reuse
✓ FlashAttention-2 integration
✓ AWQ layer streaming
✓ CPU paged decoding
Next 🔨
◦ Multi-GPU tensor parallel
◦ Speculative decoding
◦ LoRA adapter hot-swap
◦ Continuous batching
Future 🔭
◦ GGUF / GGML support
◦ Triton custom kernels
◦ Distributed KV cache
◦ Cloud-hosted cache tier
Start building faster.
KVBoost is open source and production-ready. Drop it into any HuggingFace project today.
GitHub
github.com/pythongiant/kvboost
PyPI
pypi.org/project/kvboost/
Docs
kvboost.readthedocs.io
$ pip install kvboost
MIT License · Built by @pythongiant