DeepSeek-V4 KV Cache Explained: Why 1M Context Uses Less VRAM
A comparison of DeepSeek-V4's CSA/HCA hybrid compressed attention with traditional MHA, GQA, and MLA, explaining why DeepSeek-V4 can greatly reduce KV Cache memory for 1M-token context.
The real cost of long-context models is often not whether they can accept one million tokens, but how much VRAM the KV Cache consumes during inference.
During Transformer decoding, every newly generated token needs access to the Key and Value states of previous tokens. The longer the context, the larger the KV Cache. A larger KV Cache consumes more VRAM and more memory bandwidth on every decode step, which in turn limits batch size, latency, and throughput.
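To make the growth concrete, here is a minimal sketch of the usual size estimate for a conventional (uncompressed) KV Cache. The layer count, KV head count, and head dimension below are hypothetical values chosen for illustration, not the published configuration of any model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes for a conventional KV cache.

    One K vector and one V vector are stored per token, per layer, per KV head.
    bytes_per_elem=2 corresponds to bf16/fp16 storage.
    """
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical GQA-style configuration (illustrative only).
cfg = dict(n_layers=60, n_kv_heads=8, head_dim=128)

for seq_len in (8_192, 131_072, 1_000_000):
    gib = kv_cache_bytes(seq_len, **cfg) / 2**30
    print(f"{seq_len:>9} tokens -> {gib:8.1f} GiB per sequence")
```

The estimate is linear in `seq_len`, so going from 8K to 1M tokens multiplies the cache by roughly 122x for a single sequence, before any batching.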
DeepSeek-V4 is interesting because it does not reduce the cache only along the attention-head dimension. It pushes compression into the sequence-length dimension as well. According to Hugging Face’s discussion of DeepSeek-V4, in a 1M-token setting, DeepSeek-V4-Pro’s KV Cache is about 10% the size of DeepSeek-V3.2’s, and about 2% the size of a common bf16 GQA architecture’s.
That is the key difference: DeepSeek-V4 does not merely store each KV entry in a smaller format. It reduces the number of KV entries that must be kept and searched over long history.
Several generations of KV Cache optimization
KV Cache optimization has evolved through several routes.
The first is traditional MHA, or Multi-Head Attention. Each Query head has its own Key/Value head. The structure is direct, but every token stores K and V for every head, so the per-token cache is large and VRAM pressure becomes heavy under long context.
The second is GQA, or Grouped Query Attention. Multiple Query heads share a smaller number of Key/Value heads. Many modern models in the LLaMA, Mistral, and Qwen families use this approach. It significantly reduces the KV head count and is now a common long-context optimization.
The third is MLA, or Multi-head Latent Attention. DeepSeek-V2 and DeepSeek-V3 use this route, compressing Key/Value into low-rank latent representations and further reducing cache along the attention-head dimension.
The fourth is DeepSeek-V4’s hybrid compressed attention. It focuses on sequence length: instead of only reducing how much KV each token stores, it compresses multiple historical tokens into fewer KV entries and retrieves them through sparse or dense attention.
Roughly:
MHA: every head remembers separately.
GQA: multiple Query heads share memory.
MLA: each token’s KV representation is compressed into a latent vector.
DeepSeek-V4: many historical tokens are aggregated into fewer compressed memory blocks.
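The first three routes can be compared by counting how many values each one caches per token, per layer. The sketch below does that under stated assumptions: the head counts and dimensions are illustrative, and the MLA line assumes one shared latent vector plus a small positional key per token, with 512 and 64 used as example sizes.

```python
def per_token_elems_mha(n_heads, head_dim):
    # K and V are cached for every attention head.
    return 2 * n_heads * head_dim


def per_token_elems_gqa(n_kv_heads, head_dim):
    # K and V are cached only for the shared KV heads.
    return 2 * n_kv_heads * head_dim


def per_token_elems_mla(latent_dim, rope_dim):
    # One latent vector shared by all heads, plus a small positional key.
    return latent_dim + rope_dim


# Illustrative sizes (assumptions, not a specific model's configuration).
n_heads, head_dim = 64, 128
mha = per_token_elems_mha(n_heads, head_dim)
gqa = per_token_elems_gqa(n_kv_heads=8, head_dim=head_dim)
mla = per_token_elems_mla(latent_dim=512, rope_dim=64)

for name, elems in (("MHA", mha), ("GQA", gqa), ("MLA", mla)):
    print(f"{name}: {elems:6d} cached values per token per layer "
          f"({elems / mha:.1%} of MHA)")
```

All three still store one entry for every token, which is why none of them changes how the cache scales with sequence length.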
Key change: from head compression to sequence compression
GQA and MLA mainly optimize how much KV each token stores. That works well, but when context reaches 1M tokens, the token count itself becomes the problem.
DeepSeek-V4 compresses old context into blocks. The model does not necessarily preserve full KV for every distant token. Instead, multiple tokens form compressed entries.
It is a bit like reading a very long book: you remember recent pages in detail, while earlier chapters are stored more as summaries, themes, and key clues. DeepSeek-V4’s attention design follows a similar split: keep detail nearby, use compressed representation farther away.
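A rough way to see the effect is to count cache entries when recent tokens stay uncompressed and older tokens are merged in fixed-size groups. This is only a counting sketch: the function name, the `window` size, and the `ratio` used here are hypothetical, and the real model's layout may differ.

```python
import math


def cached_entries(n_tokens, window, ratio):
    """Entries kept when the last `window` tokens stay uncompressed and
    older tokens are merged `ratio`-to-one.

    A partial trailing group of old tokens still costs one entry.
    """
    recent = min(n_tokens, window)
    old = n_tokens - recent
    return recent + math.ceil(old / ratio)


# Hypothetical settings, chosen only to illustrate the scaling.
n_tokens, window, ratio = 100_000, 1_024, 16
kept = cached_entries(n_tokens, window, ratio)
print(f"{n_tokens} tokens -> {kept} entries ({kept / n_tokens:.1%} of the original)")
```

Unlike head-level compression, the saving here grows with the compression ratio applied to the sequence itself, so it keeps paying off as the context gets longer.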
CSA: 4x compression plus sparse retrieval
CSA stands for Compressed Sparse Attention. It is the finer-grained long-context compression mechanism.
In CSA, the model compresses neighboring tokens into fewer KV entries. The Hugging Face Transformers documentation gives a default compression ratio of m=4, meaning roughly every four tokens become one compressed entry.
But it is not simple averaging. CSA uses a learned compression pool and overlapping windows so the model can preserve more useful information. After compression, the query does not attend to all compressed blocks directly. It first uses a Lightning Indexer to score them, selects the most relevant top-k compressed blocks, and then performs the core attention computation.
This gives two benefits:
The number of historical KV entries becomes smaller.
Each query only looks at a relevant subset of compressed blocks.
CSA is suitable for long-range context where details still matter, such as codebases, long documents, and tool-call histories.
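The compress, score, select, attend flow can be sketched in NumPy. This is a simplified illustration, not DeepSeek-V4's implementation: the function names are hypothetical, the pooling weights are random stand-ins for learned ones, the windows do not overlap, the indexer is a plain dot product, and each compressed block serves as both key and value.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compress_blocks(kv, pool_logits, m):
    """Merge every m consecutive rows of kv (T, d) into one row.

    pool_logits (T,) stands in for a learned per-token score; a softmax
    within each block turns the scores into pooling weights.
    A trailing partial block is dropped for simplicity.
    """
    T, d = kv.shape
    n_blocks = T // m
    grouped = kv[: n_blocks * m].reshape(n_blocks, m, d)
    w = softmax(pool_logits[: n_blocks * m].reshape(n_blocks, m), axis=1)
    return (w[..., None] * grouped).sum(axis=1)  # (n_blocks, d)


def sparse_attend(q, blocks, index_q, index_keys, top_k):
    """Score blocks with a cheap indexer, keep top_k, attend over those only."""
    scores = index_keys @ index_q  # one relevance score per block
    top_k = min(top_k, len(blocks))
    chosen = np.argpartition(scores, -top_k)[-top_k:]
    selected = blocks[chosen]
    attn = softmax(selected @ q / np.sqrt(q.shape[0]))
    return attn @ selected, np.sort(chosen)


rng = np.random.default_rng(0)
T, d, d_idx, m = 64, 16, 8, 4  # m=4 mirrors the default ratio mentioned above
kv = rng.normal(size=(T, d))
blocks = compress_blocks(kv, rng.normal(size=T), m)
index_keys = rng.normal(size=(len(blocks), d_idx))
out, chosen = sparse_attend(rng.normal(size=d), blocks,
                            rng.normal(size=d_idx), index_keys, top_k=4)
print(blocks.shape, out.shape, chosen)
```

The indexer works in a smaller dimension (`d_idx`) than the attention itself, which is the point of the design: scoring every block stays cheap, and the expensive attention step touches only the selected few.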
HCA: 128x compression plus dense attention
HCA stands for Heavily Compressed Attention, and it is more aggressive.
The Transformers documentation gives a default compression ratio of m'=128. HCA compresses a much longer context span into one compressed entry. Because the compressed sequence becomes very short, it does not need sparse top-k retrieval like CSA. The query can simply perform dense attention over all HCA compressed entries.
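The arithmetic behind that claim is easy to check. The sketch below counts compressed entries for a 1M-token history at both default ratios and then runs one dense attention step over the HCA-sized set. The function name and the entry dimension are assumptions, the entries are random placeholders for real compressed states, and each entry is used as both key and value.

```python
import math

import numpy as np


def dense_attend(q, entries):
    """Plain softmax attention of one query over every compressed entry."""
    logits = entries @ q / math.sqrt(q.shape[0])
    logits = logits - logits.max()
    weights = np.exp(logits)
    weights /= weights.sum()
    return weights @ entries


n_tokens = 1_000_000
csa_entries = math.ceil(n_tokens / 4)    # default CSA ratio m = 4
hca_entries = math.ceil(n_tokens / 128)  # default HCA ratio m' = 128
print(f"CSA entries: {csa_entries}, HCA entries: {hca_entries}")

d = 64  # hypothetical entry dimension
rng = np.random.default_rng(0)
summary = dense_attend(rng.normal(size=d), rng.normal(size=(hca_entries, d)))
print(summary.shape)
```

At a few thousand entries, attending to everything costs about as much as attending to a short ordinary context, so no selection step is needed. At a quarter of a million entries, the top-k selection in CSA is what keeps the per-query cost bounded.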
HCA acts more like a global summary. It does not try to preserve every detail. Instead, it covers very long history at extremely low cost, helping the model...