Show HN: KVBoost – chunk-level KV cache reuse for HuggingFace, 5–48x faster TTFT

pythongiant | 1 point | 0 comments

KVBoost — Pitch Deck

pip install kvboost

KVBoost

Faster LLM Inference. Less VRAM. No Model Changes.

Chunk-level KV cache reuse · FlashAttention-2 · AWQ layer streaming · CPU paged decoding

The Problem

LLM inference is broken by default.

🧱

VRAM Walls

Modern LLMs like Qwen2.5-32B need 60+ GB of VRAM for their weights at 16-bit precision, which is out of reach for most teams.

🐢

Slow Prefill

Repeated system prompts are recomputed from scratch on every single request — wasting GPU cycles constantly.

🔧

HF Bottlenecks

HuggingFace's default inference loop has no cross-request KV cache reuse, no chunked attention, and no memory-efficient decoding.

The Solution

KVBoost: drop-in, no rewrites.

Python

from kvboost import KVBoost

engine = KVBoost.from_pretrained("Qwen/Qwen2.5-3B")

# Warm a shared prefix once
engine.warm("You are a helpful assistant...")

# Placeholder prompt that starts with the warmed prefix
prompt = "You are a helpful assistant... What is a KV cache?"

# All subsequent calls reuse cache
result = engine.generate(prompt)

print(result.kv_reuse_ratio)  # ✓ 80%+

KV Cache Reuse

Chunk-level cache reuse eliminates redundant prefill for shared prompts.

🚀

FlashAttention-2

Memory-efficient attention with 3–5× TTFT speedup vs vanilla HuggingFace.

💾

AWQ Layer Streaming

Run 32B+ models on 8 GB VRAM via pinned-host weight streaming.

🗄️

CPU Paged Decoding

Spill KV cache to CPU RAM to handle long contexts without OOM errors.
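To make the CPU paged decoding idea concrete, here is a minimal sketch of a page table that keeps recently used KV blocks "on the GPU" and spills the coldest ones to host memory. The class name `PagedKVStore`, the LRU policy, and the use of `bytes` as stand-ins for tensors are all assumptions for illustration; the deck does not say how KVBoost chooses which blocks to evict.

```python
from collections import OrderedDict


class PagedKVStore:
    """Toy page table: hot KV blocks stay resident, cold ones spill to CPU."""

    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity  # max blocks resident on the GPU
        self.gpu = OrderedDict()          # block_id -> KV data, oldest first
        self.cpu = {}                     # spilled blocks (stand-in for host RAM)

    def put(self, block_id: int, kv: bytes) -> None:
        self.gpu[block_id] = kv
        self.gpu.move_to_end(block_id)    # newest block is most recently used
        self._evict_if_needed()

    def get(self, block_id: int) -> bytes:
        if block_id not in self.gpu:
            # Page fault: bring the block back from CPU memory.
            self.gpu[block_id] = self.cpu.pop(block_id)
        self.gpu.move_to_end(block_id)    # mark as most recently used
        self._evict_if_needed()
        return self.gpu[block_id]

    def _evict_if_needed(self) -> None:
        while len(self.gpu) > self.gpu_capacity:
            # Evict the least recently used block to CPU memory.
            cold_id, cold_kv = self.gpu.popitem(last=False)
            self.cpu[cold_id] = cold_kv


store = PagedKVStore(gpu_capacity=2)
for i in range(4):
    store.put(i, bytes([i]))
print(sorted(store.gpu), sorted(store.cpu))
```

A real implementation would move tensors between devices asynchronously instead of shuffling dictionary entries, but the bookkeeping has the same shape: a bounded resident set, and a fault path that pulls a spilled block back in when it is needed.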

Performance

Real numbers. Real hardware.

3–5× TTFT speedup vs HF baseline

80%+ KV cache hit rate (multi-turn)

8 GB VRAM for a 32B model (AWQ streaming)

~10K lines of code across 43 Python modules

Time to First Token (ms), lower is better

HF Baseline: 850 ms

Prefix Reuse: 320 ms

Chunk Reuse: 210 ms
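For reference, the speedup factors implied by the three TTFT figures listed above can be worked out directly. This is plain arithmetic on the deck's own numbers, not a new measurement.

```python
# TTFT figures from the deck, in milliseconds.
ttft_ms = {"HF Baseline": 850, "Prefix Reuse": 320, "Chunk Reuse": 210}

baseline = ttft_ms["HF Baseline"]
speedups = {name: baseline / ms for name, ms in ttft_ms.items()}

for name, factor in speedups.items():
    print(f"{name:>12}: {ttft_ms[name]:>4} ms  ({factor:.2f}x vs baseline)")
```

Prefix reuse works out to roughly 2.7× and chunk reuse to roughly 4×, which places the chunk reuse figure inside the 3–5× range quoted elsewhere in the deck.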

Multi-Turn Cache Hit Rate (%)

Turn 1: 0%

Turn 2: 45%

Turn 3: 68%

Turn 4: 78%

Turn 5+: 85%
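One way to see why the hit rate climbs with each turn is a deliberately simplified model: assume every turn appends the same number of new tokens, and everything from earlier turns is already cached. Under that assumption the hit rate at turn t is (t - 1) / t. The function name and the fixed `tokens_per_turn` are hypothetical; real conversations have uneven turn lengths, so the reported figures will not match this curve exactly.

```python
def expected_hit_rate(turn: int, tokens_per_turn: int = 200) -> float:
    """Fraction of the prompt already cached at a given turn (1-indexed)."""
    prompt_tokens = turn * tokens_per_turn
    cached_tokens = (turn - 1) * tokens_per_turn
    return cached_tokens / prompt_tokens


# Percentages reported in the deck; "Turn 5+" is treated as turn 5 here.
reported = {1: 0, 2: 45, 3: 68, 4: 78, 5: 85}

for turn, pct in reported.items():
    print(f"turn {turn}: model {expected_hit_rate(turn):.0%}, reported {pct}%")
```

The model follows the same saturating shape as the reported numbers. The "Turn 5+" bucket also covers later turns, where the model predicts higher hit rates than at turn 5 itself.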

How It Works

Four layers of optimization.

01

Hash Chunks

Incoming prompt is split into chunks. Each chunk is hashed to look up prior cached K/V pairs.

02

Reuse Cache

Matching chunks skip prefill: their cached K/V pairs are reused as-is. Only novel tokens are forwarded through the transformer, attending to the cached K/V.

03

Flash Attention

New tokens run FlashAttention-2: tiled CUDA kernels whose memory use grows linearly with sequence length, O(N), instead of quadratically. No custom model code needed.

04

Page Offload

Long-context KV blocks are evicted to CPU RAM via async DMA — enabling contexts beyond GPU VRAM.
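The first two steps above (hash chunks, reuse cache) can be sketched in a few lines. This is an illustration under stated assumptions, not KVBoost's implementation: `CHUNK_SIZE`, `chunk_key`, `split_chunks`, and `plan_prefill` are hypothetical names, the cache stores placeholder strings instead of K/V tensors, and the sketch ignores the fact that real K/V values depend on token position and preceding context.

```python
import hashlib
from typing import Dict, List, Tuple

CHUNK_SIZE = 4  # hypothetical; the deck does not state the real chunk size


def chunk_key(tokens: List[int]) -> str:
    """Deterministic hash of a chunk of token IDs."""
    raw = ",".join(map(str, tokens)).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def split_chunks(token_ids: List[int]) -> List[List[int]]:
    """Split a token sequence into fixed-size chunks (last one may be short)."""
    return [token_ids[i:i + CHUNK_SIZE] for i in range(0, len(token_ids), CHUNK_SIZE)]


def plan_prefill(token_ids: List[int], cache: Dict[str, str]) -> Tuple[int, List[int]]:
    """Return (reused token count, tokens still to compute); cache new full chunks."""
    reused, to_compute = 0, []
    for chunk in split_chunks(token_ids):
        key = chunk_key(chunk)
        is_full = len(chunk) == CHUNK_SIZE
        if is_full and key in cache:
            reused += len(chunk)           # cache hit: skip prefill for this chunk
        else:
            to_compute.extend(chunk)       # cache miss: these tokens need a forward pass
            if is_full:
                cache[key] = "<kv placeholder>"  # stand-in for the real K/V tensors
    return reused, to_compute


cache: Dict[str, str] = {}
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]            # shared prefix, two full chunks
first = plan_prefill(system_prompt + [9, 10], cache)
second = plan_prefill(system_prompt + [11, 12, 13], cache)
print(first, second)
```

On the second call the two system-prompt chunks hit the cache, so only the three new tokens remain to be computed, a reuse ratio of 8 out of 11 tokens.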

AWQ Layer Streaming

Run a 32B model on a gaming GPU.

Terminal

$ python -m kvboost.streaming.demo_partial_8b --model Qwen/Qwen2.5-32B-Instruct-AWQ

INFO: Replaced projections:

56 resident across 8 layers

392 streamed across 56 layers

load_time: 10.7s

peak_vram_after_load: 5.65 GB

avg_tok_per_s: 0.11

peak_vram_during_decode: 6.13 GB

5.65 GB

Peak VRAM after loading a 32B model — fits on a single 8 GB gaming GPU.

6.13 GB

Peak VRAM during decode — stays safely under the 8 GB limit.

0.11 tok/s

PCIe-bound throughput — built for VRAM savings, not raw speed.
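A back-of-the-envelope calculation shows what 0.11 tok/s means in practice. The throughput and the 56-of-64 streamed layer split come from the log above. The checkpoint size `assumed_weight_gb` is an assumption, not a figure from the deck, and the estimate further assumes that every streamed layer is transferred once per decoded token and that all layers are the same size.

```python
tok_per_s = 0.11          # avg_tok_per_s from the log
streamed_layers = 56      # layers whose projections are streamed, per the log
resident_layers = 8       # layers kept resident, per the log
assumed_weight_gb = 18.0  # ASSUMPTION: rough size of the 4-bit checkpoint, adjust as needed

total_layers = streamed_layers + resident_layers

# Time spent on each decoded token.
seconds_per_token = 1 / tok_per_s

# Weight data moved per token if each streamed layer is transferred once.
streamed_gb_per_token = assumed_weight_gb * streamed_layers / total_layers

# Effective host-to-GPU transfer rate implied by the numbers above.
effective_gb_per_s = streamed_gb_per_token / seconds_per_token

print(f"{seconds_per_token:.1f} s per token")
print(f"{streamed_gb_per_token:.2f} GB streamed per token (under the assumptions above)")
print(f"{effective_gb_per_s:.2f} GB/s effective transfer rate")
```

At about nine seconds per token, a 100-token reply takes roughly 15 minutes, which fits the deck's framing of this mode as a VRAM-saving option and not a fast one.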

Use Cases

Who needs KVBoost?

💻

AI Coding Assistants

System prompts are reused across hundreds of requests. Cache the context once, speed up every response by 3–5×.

📚

RAG Pipelines

Document chunks appear in many queries. Chunk-level reuse makes multi-document QA dramatically faster.

⚙️

Edge / Budget Infra

AWQ streaming lets teams deploy 30B+ models on consumer GPUs — no $10K A100 required.

💬

Multi-Turn Chatbots

Conversation history grows each turn. CPU paged decoding handles long contexts without OOM crashes.

MIT Licensed · Drop-in with HuggingFace Transformers · No fine-tuning, no architecture changes

Technology

Built on solid foundations.

FlashAttention-2

Tiled CUDA kernels for attention with O(N) memory, linear in sequence length

AWQ (Activation-aware Weight Quantization)

Weight-only 4-bit quantization preserving accuracy

HuggingFace Transformers

Drop-in compatibility, no model changes required

CUDA DMA Streams

Async PCIe transfers for layer-by-layer weight streaming

Chunk Hashing

Deterministic token-level hashing for cache lookup

CPU Paged Memory

Page-table KV offload that evicts cold blocks to RAM

PyPI Package

pip install kvboost, ready in 2 minutes

MIT License

Fully open source, production-ready for any use

Roadmap

What's next.

Now ✅

✓ Chunk-level KV reuse

✓ FlashAttention-2 integration

✓ AWQ layer streaming

✓ CPU paged decoding

Next 🔨

◦ Multi-GPU tensor parallel

◦ Speculative decoding

◦ LoRA adapter hot-swap

◦ Continuous batching

Future 🔭

◦ GGUF / GGML support

◦ Triton custom kernels

◦ Distributed KV cache

◦ Cloud-hosted cache tier

Start building faster.

KVBoost is open source and production-ready. Drop it into any HuggingFace project today.

GitHub

github.com/pythongiant/kvboost

PyPI

pypi.org/project/kvboost/

Docs

kvboost.readthedocs.io

$ pip install kvboost

MIT License · Built by @pythongiant
