Understanding KV Cache: The Hidden Memory Cost of Serving LLMs
Melchi
included in GenAI
2026-05-19 3006 words 15 minutes
How attention architectures evolved to keep KV cache from eating your GPU, and what that means if you self-host.

Already comfortable with KV cache and attention? Skip the theory and jump straight to the interactive KV Cache Calculator to size VRAM for your model, batch size, and target GPU.
If you’re planning to self-host a large language model, you’ve probably sized VRAM based on parameters alone. A 70B model in BF16 needs roughly 140 GB just for weights. That’s the easy part: 70 billion parameters × 2 bytes.

What’s less obvious is the second memory consumer that grows while the model is actually serving requests: the Key-Value (KV) cache.

KV cache scales with every cached token in an active request: prompt tokens, generated tokens, and any prefix-cache entries the engine keeps resident. It also scales with the number of concurrent sequences. At 32K–128K context, KV cache can easily become the largest single consumer of GPU memory. If you don’t budget for it, you’ll serve one long-context user when you wanted to serve many.

This post walks from the basics of attention through the architectural and runtime tricks people use to shrink KV cache. By the end you should be able to look at a model card and roughly predict its memory profile.

Part 1: Attention, a quick refresher

The original transformer attention

The 2017 paper “Attention Is All You Need” introduced Multi-Head Attention (MHA), the mechanism that lets a model look back at previous tokens when generating the next one.

Three steps:
1. Project. For each token, build three vectors (a Query (Q), a Key (K), and a Value (V)) by multiplying the token representation by learned weight matrices.
2. Score. Take the dot product of Q with each K and scale it by 1/√d_k. This answers “how relevant is each previous token to what I’m generating now?”
3. Aggregate. Softmax the scores, then take a weighted sum of V. The result is a context-aware representation of the current token.

The formula:
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

Why we need a cache

During training, the model usually processes the whole training sequence in parallel. There’s no decode-time KV cache because every token is available at once.
Training is still memory-hungry (activations, gradients, optimizer state), but that’s a different problem.

Inference is the autoregressive case, where the model generates tokens one at a time. Each new token needs to attend to every token that came before it. Without a cache, generating token #1000 would mean recomputing K and V for all 999 previous tokens, and every later step would repeat that work. The KV cache is the obvious fix: compute the K and V projections for each token once, store them, and reuse them on every subsequent step. Hugging Face’s documentation explains it the same way (Hugging Face cache explanation).

Key insight: We cache K and V, not Q. Q is the “question” being asked by the current token and only matters for that one step. K and V are the “memory” the question gets asked against, and that memory has to stick around.
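To make the caching argument concrete, here is a minimal sketch in plain Python of one attention head in one layer, decoding with and without a cache. The dimensions, the random weights, and the helper names (`matvec`, `attend`, `random_matrix`) are all assumptions made for illustration; a real model runs many heads and layers on a tensor library.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over a list of keys/values."""
    d_k = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_head, n_tokens = 8, 4, 6  # toy sizes, chosen arbitrarily
W_q, W_k, W_v = (random_matrix(rng, d_head, d_model) for _ in range(3))
tokens = random_matrix(rng, n_tokens, d_model)  # stand-in token representations

# With a cache: project K and V once per token, then reuse them.
k_cache, v_cache, cached_outputs = [], [], []
for x in tokens:
    k_cache.append(matvec(W_k, x))
    v_cache.append(matvec(W_v, x))
    q = matvec(W_q, x)  # Q is used for this step only and never stored
    cached_outputs.append(attend(q, k_cache, v_cache))

# Without a cache: recompute K and V for the whole prefix at every step.
naive_outputs = []
for t in range(n_tokens):
    prefix = tokens[: t + 1]
    keys = [matvec(W_k, x) for x in prefix]
    values = [matvec(W_v, x) for x in prefix]
    naive_outputs.append(attend(matvec(W_q, tokens[t]), keys, values))

same = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for out_c, out_n in zip(cached_outputs, naive_outputs)
    for a, b in zip(out_c, out_n)
)
print("cached == recomputed:", same)
print("K/V projections with cache:", 2 * n_tokens)
print("K/V projections without cache:", 2 * sum(range(1, n_tokens + 1)))
```

Both loops produce the same outputs, but the uncached loop redoes the K and V projections for the entire prefix at each step, so its projection count grows quadratically with sequence length. The cached loop trades that compute for memory: `k_cache` and `v_cache` grow by one entry per token, which is the memory cost the rest of this post is about.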
Part 2: How big is the KV cache?

The formula

The serving-time number you actually want is:

    Total KV cache bytes =
        2 × num_layers × num_key_value_heads × head_dim
        × cached_tokens × active_sequences × bytes_per_element
Where:
- `2` because we store both K and V.
- `num_layers` because every attention layer has its own KV cache.
- `num_key_value_heads` is the number of KV heads, which is not always the same as the number of query heads.
- `head_dim` is the per-head vector size, often 64, 128, or 256.
- `cached_tokens` is prompt tokens plus generated tokens still resident in the cache.
- `active_sequences` is your active batch / concurrent sequences.
- `bytes_per_element` is 2 for BF16/FP16, 1 for FP8/INT8, 0.5 for INT4-style packed storage.

For standard Multi-Head Attention (MHA):

    num_key_value_heads = num_attention_heads
For Grouped-Query Attention (GQA) or Multi-Query Attention (MQA):

    num_key_value_heads < num_attention_heads

MQA is the extreme case, with a single KV head shared by all query heads.

That distinction is the whole game. GQA and MQA shrink KV cache by reducing how many K/V heads are stored, while keeping more Q heads for model capacity.

A concrete example: a 70B-scale MHA baseline

The example below is intentionally a worst-case MHA baseline. It is not a claim that every 70B-class model uses this exact configuration; many of them use GQA, MLA, sliding windows, or hybrid attention.

| Parameter | Value |
|---|---|
| Layers | 80 |
| Query heads | 64 |
| KV heads | 64 |
| Head dimension | 128 |
| Precision | BF16 (2 bytes) |

Per token (K and V × layers × KV heads × head dimension × bytes per element):

    2 × 80 × 64 × 128 × 2 = 2,621,440 bytes ≈ 2.5 MiB
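The formula and the per-token arithmetic above can be sketched as a small Python helper. The function name `kv_cache_bytes` and its defaults are assumptions for illustration, and the 8-KV-head variant is a hypothetical configuration chosen only to show the effect of fewer KV heads, not a description of any specific model.

```python
def kv_cache_bytes(num_layers, num_key_value_heads, head_dim,
                   cached_tokens=1, active_sequences=1, bytes_per_element=2):
    """Total KV cache size in bytes; the leading 2 covers K and V."""
    return (2 * num_layers * num_key_value_heads * head_dim
            * cached_tokens * active_sequences * bytes_per_element)

MIB = 1024 ** 2
GIB = 1024 ** 3

# MHA baseline from above: 80 layers, 64 KV heads, head_dim 128, BF16.
per_token_mha = kv_cache_bytes(80, 64, 128)
print(per_token_mha, "bytes =", per_token_mha / MIB, "MiB per token")

# Hypothetical GQA variant: same shape, but only 8 KV heads (assumed value).
per_token_gqa = kv_cache_bytes(80, 8, 128)
print(per_token_gqa / MIB, "MiB per token with 8 KV heads")

# One sequence holding 32,768 cached tokens on the MHA baseline.
long_context = kv_cache_bytes(80, 64, 128, cached_tokens=32_768)
print(long_context / GIB, "GiB for one 32,768-token sequence")
```

Because every factor multiplies the others, cutting KV heads from 64 to 8 cuts the cache by the same factor of 8, and doubling either cached tokens or active sequences doubles it.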
Now scale it up:

| Scenario | Cached tokens | Active sequences | KV cache size |
|---|---|---|---|
| Single user, short chat | 2,048 | 1 | 5 GiB |
| Single user, long context | 32,768 | 1 | 80 GiB |
| 8 users, moderate context | 8,192 | 8 | 160 GiB |
| 16 users, long context | 32,768 | 16 | 1.25 TiB |

Reality check: A 70B...