Why long context eats your VRAM: the KV cache explained

The KV Cache, Explained: Why Long Context Eats Your VRAM (and How to Fit More)


Here's a moment every local-LLM owner hits eventually: you carefully pick a quant, the model loads with VRAM to spare, and then you paste in a long document or hit a few thousand tokens of chat history, and it crashes with an out-of-memory error. The weights didn't grow. So what filled your VRAM?

The answer is the KV cache, and it's the most under-explained number in local AI. It's the third thing competing for your memory, alongside model weights (which our quantization guide covers) and total parameters (our Mixture-of-Experts explainer). This piece completes that trilogy: what the KV cache is, why it explodes with context length, and the levers that let you fit more.

What the KV cache actually is

When a language model generates text, it processes tokens through attention layers. For every token, each layer computes three things: a query, a key, and a value. The trick of attention is that each new token needs to "look back" at the keys and values of every previous token.

Without a cache, the model would have to recompute the keys and values for the entire history on every single new token: quadratic, brutally slow work. So instead it does the obvious thing: it stores the keys and values once and reuses them. That store is the KV cache. It's pure speed optimization, and like most speed optimizations, you pay for it in memory.

The catch: the cache grows with every token in the context. A longer conversation, a bigger document, a fatter system prompt: each one adds to a pile of cached keys and values that has to live in fast memory right next to the model. And that pile gets big fast.

The math: why context eats VRAM

The size of the KV cache follows a simple formula:

KV bytes ≈ 2 × layers × kv-heads × head-dim × tokens × bytes-per-value

The 2 is for keys and values; everything else is the model's shape and how long your context is. Plug in real models (FP16, single user) and the numbers are startling:

| Model | Per token | At 32k context | At 128k context |
|---|---|---|---|
| 7B, old-style full attention | ~0.5 MB | ~16 GB | ~64 GB |
| 8B with GQA (Llama-3-style) | ~0.13 MB | ~4 GB | ~16 GB |
| 70B with GQA | ~0.31 MB | ~10 GB | ~40 GB |
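As a rough check on the table, the formula can be evaluated directly. The sketch below is illustrative only: the function names and the model shapes (layers, KV heads, head dimension) are assumptions picked to resemble commonly published Llama-style configurations, not figures stated in this article, and it uses binary units (MiB, GiB).

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """KV bytes ~= 2 * layers * kv_heads * head_dim * tokens * bytes_per_value.

    bytes_per_value defaults to 2, i.e. FP16.
    """
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


# Assumed shapes for illustration: (layers, kv_heads, head_dim)
SHAPES = {
    "7B, full attention": (32, 32, 128),
    "8B with GQA": (32, 8, 128),
    "70B with GQA": (80, 8, 128),
}

MIB = 1024 ** 2
GIB = 1024 ** 3


def summarize(shape):
    """Return (MiB per token, GiB at 32k tokens, GiB at 128k tokens)."""
    layers, kv_heads, head_dim = shape
    per_token = kv_cache_bytes(layers, kv_heads, head_dim, 1) / MIB
    at_32k = kv_cache_bytes(layers, kv_heads, head_dim, 32 * 1024) / GIB
    at_128k = kv_cache_bytes(layers, kv_heads, head_dim, 128 * 1024) / GIB
    return per_token, at_32k, at_128k


for name, shape in SHAPES.items():
    print(name, summarize(shape))
```

Under these assumed shapes the per-token figures come out to 0.5, 0.125, and 0.3125 MiB, which round to the table's values. Passing `bytes_per_value=1` models an 8-bit cache and halves every number.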

Look at that first row. A 7B model with old-style full attention generates half a megabyte of cache per token. At 32k context that's ~16 GB, larger than the entire 4-bit quantized model itself (~4 GB). The thing you thought you were loading was the small part.

This is the trap: people size their hardware for the weights and forget the cache, which at long context is often the bigger number. It scales linearly and relentlessly: double the context, double the cache. This is why a model that loads happily at 4k context detonates at 64k.

What owners actually run into

This isn't theoretical; it's one of the most common frustrations on r/LocalLLaMA. In a thread bluntly titled "My biggest issue with the Gemma-4 models is the massive KV cache," the owner (u/Iory1998) explains:

"I have 40 GB of VRAM and I still cannot fit the entire… Q8 (35 GB) [with full context]… if I have to run a Q4 with a Q8 KV cache, then I am better off just using [a smaller model]."
- u/Iory1998, on a 35 GB model that won't fit in 40 GB once context is added

A commenter put the general rule even more plainly:

"Most inference providers are serving a lot more VRAM on KV than weights."
- a commenter in the same thread

At production scale, with big batches and long contexts, the cache routinely dwarfs the model. The weights are a fixed cost; the KV cache is the variable one that blows your budget.

How models fight back: MQA, GQA, and paging

Because the KV cache is such a bottleneck, a lot of research has gone into shrinking it, and the wins are baked into the models you already run:

Multi-Query Attention (MQA). Shazeer's "Fast Transformer Decoding" (2019) had every attention head share one set of keys and values instead of each keeping its own. That alone can cut the cache by an order of magnitude, at a small quality cost.

Grouped-Query Attention (GQA). Ainslie et al.'s GQA (2023) is the middle ground: heads are split into a few groups that share keys/values. It keeps almost all of multi-head quality at close to MQA's memory. This is why modern models (Llama-3, Mistral, etc.) use it; it's the difference between the 16 GB and 4 GB rows in the table above.

PagedAttention. Kwon et al.'s vLLM paper (2023) noticed that naive KV cache allocation wastes huge amounts of memory to fragmentation. Borrowing virtual-memory paging from operating systems, it packs the cache efficiently, a big reason vLLM serves more concurrent users on the same card.

The cutting edge: Multi-head Latent Attention

GQA shrinks the cache by sharing key/value heads; the newest idea changes what gets stored at all. Multi-head Latent Attention (MLA), introduced in DeepSeek-V2 (2024), compresses the keys and values into a small shared latent vector via a learned low-rank projection,...
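Returning to the head-sharing schemes covered earlier (full multi-head attention, MQA, and GQA): they differ mainly in how many key/value heads get cached. The sketch below is a minimal illustration, not any particular model's implementation. The helper names, the 32 query heads, the head dimension of 128, and the contiguous grouping of query heads are all assumptions made for the example.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the cached K/V head that a given query head reads from.

    Assumes query heads are assigned to K/V heads in contiguous groups.
    """
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


def cached_values_per_token_per_layer(n_kv_heads, head_dim):
    """Count of cached numbers (keys plus values) per token in one layer."""
    return 2 * n_kv_heads * head_dim


N_QUERY_HEADS = 32  # assumed for illustration
HEAD_DIM = 128      # assumed for illustration

# Number of K/V heads under each scheme (the GQA group count is an assumption)
VARIANTS = {"MHA": 32, "GQA": 8, "MQA": 1}

for name, n_kv in VARIANTS.items():
    mapping = [kv_head_for(h, N_QUERY_HEADS, n_kv) for h in range(N_QUERY_HEADS)]
    size = cached_values_per_token_per_layer(n_kv, HEAD_DIM)
    print(name, "kv heads:", n_kv, "cached values/token/layer:", size)
    print("  first 8 query heads read K/V heads:", mapping[:8])
```

With these assumed numbers, going from 32 K/V heads to 8 cuts the per-token cache by 4x, and going to a single shared head cuts it by 32x, while the number of query heads stays the same in all three cases.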
