VeriCache: Turning Lossy KV Cache into Lossless LLM Inference



Computer Science > Hardware Architecture

arXiv:2605.17613 (cs)

[Submitted on 17 May 2026]

Title: VeriCache: Turning Lossy KV Cache into Lossless LLM Inference

Authors: Jiayi Yao, Samuel Shen, Kuntai Du, Shaoting Feng, Dongjoo Seo, Rui Zhang, Yuyang Huang, Yuhan Liu, Shan Lu, Junchen Jiang


Abstract: The large size of the KV cache has become a major bottleneck for serving LLMs with increasing context lengths. In response, many KV cache compression methods, such as token dropping and quantization, have been proposed. However, almost all of these methods are inherently lossy: despite minimal accuracy degradation for short outputs, their outputs increasingly diverge from full-KV-cache outputs as more tokens are decoded, which leads to catastrophic failures in code generation and tool calling.

We present VeriCache, the first inference framework that ensures the same output as full-KV-cache decoding while largely preserving the high decoding throughput of a range of KV cache compression algorithms. VeriCache uses the compressed KV cache to draft tokens, then verifies them against the full KV cache. While this may seem like just speculative decoding, making it work requires addressing a key system challenge: keeping the full KV cache out of GPU memory while minimizing the overhead of swapping it in for verification. The insight is twofold: (1) compressed-KV decoding can be parallelized with the full-KV swap, because one is HBM-bandwidth-bound and the other is PCIe/network-bound, and (2) the compressed KV cache often produces output similar to the full KV cache, allowing a long drafting horizon to amortize each full-KV swap.
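The draft-then-verify idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes greedy (deterministic) decoding, models each "KV cache" as a plain next-token function, and does not model the overlap of the full-KV swap with drafting. All names (`draft_and_verify`, `full_next`, `lossy_next`, `horizon`) are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a "model" is any function mapping a token prefix to the next
# token under greedy decoding. A real system would run a transformer here.
NextToken = Callable[[Sequence[int]], int]


def full_decode(full_next: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Reference path: plain greedy decoding with the full KV cache."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(full_next(tokens))
    return tokens[len(prompt):]


def draft_and_verify(
    draft_next: NextToken,
    full_next: NextToken,
    prompt: List[int],
    n_new: int,
    horizon: int,
) -> Tuple[List[int], int]:
    """Draft up to `horizon` tokens with the lossy model, then verify them
    against the full model. Returns (new tokens, number of verify passes)."""
    tokens = list(prompt)
    target_len = len(prompt) + n_new
    verify_passes = 0
    while len(tokens) < target_len:
        k = min(horizon, target_len - len(tokens))
        # Draft phase: cheap decoding against the compressed cache.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # Verify phase: stands in for one batched pass over the full cache.
        verify_passes += 1
        accepted: List[int] = []
        for tok in draft:
            expected = full_next(tokens + accepted)
            accepted.append(expected)  # always keep the full-cache token
            if tok != expected:
                break  # first divergence: discard the rest of the draft
        tokens.extend(accepted)
    return tokens[len(prompt):], verify_passes


# Toy models (assumptions): the lossy model disagrees with the full model
# whenever the prefix length is a multiple of 7.
def full_next(tokens: Sequence[int]) -> int:
    return (31 * sum(tokens) + len(tokens)) % 50


def lossy_next(tokens: Sequence[int]) -> int:
    tok = full_next(tokens)
    return (tok + 1) % 50 if len(tokens) % 7 == 0 else tok


if __name__ == "__main__":
    out, passes = draft_and_verify(lossy_next, full_next, [1, 2, 3], 40, horizon=8)
    print(out == full_decode(full_next, [1, 2, 3], 40), passes)
```

Because every emitted token is the one the full model would have chosen, the output matches full decoding by construction; the benefit comes from needing far fewer verify passes than tokens when the draft usually agrees, which is the amortization argument in point (2).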

VeriCache applies to both long-context decoding and remote prefix caching, supports a broad family of token-dropping and quantization methods through a uniform compressor interface, and composes with traditional speculative decoding. Experimental results show that VeriCache achieves up to 4× higher throughput than full-KV inference while producing identical outputs.
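The abstract does not specify what the uniform compressor interface looks like. As a purely hypothetical sketch, one way such an interface could be shaped is a single `compress` method that both a token-dropping method and a quantization method implement. The names `KVCompressor`, `RecentWindowDropper`, and `UniformQuantizer`, and the toy per-token list representation of the cache, are assumptions for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple

# Assumption: a toy cache is a list of per-token (key vector, value vector)
# pairs. Real KV caches are per-layer, per-head tensors.
TokenKV = Tuple[List[float], List[float]]


class KVCompressor(Protocol):
    """Hypothetical uniform interface: full cache in, lossy cache out."""

    def compress(self, cache: List[TokenKV]) -> List[TokenKV]: ...


@dataclass
class RecentWindowDropper:
    """Token dropping: keep only the most recent `window` tokens."""

    window: int

    def compress(self, cache: List[TokenKV]) -> List[TokenKV]:
        return cache[-self.window:] if self.window > 0 else []


@dataclass
class UniformQuantizer:
    """Quantization: snap every entry to the nearest multiple of `step`."""

    step: float

    def _quantize(self, vec: List[float]) -> List[float]:
        return [round(x / self.step) * self.step for x in vec]

    def compress(self, cache: List[TokenKV]) -> List[TokenKV]:
        return [(self._quantize(k), self._quantize(v)) for k, v in cache]


def build_draft_cache(cache: List[TokenKV], compressor: KVCompressor) -> List[TokenKV]:
    """The drafting path depends only on the interface, not the method."""
    return compressor.compress(cache)


if __name__ == "__main__":
    cache = [([0.26, 1.1], [0.74, 2.3]) for _ in range(4)]
    print(len(build_draft_cache(cache, RecentWindowDropper(window=2))))
    print(build_draft_cache(cache[:1], UniformQuantizer(step=0.5)))
```

The design point is that the drafting code calls only `compress`, so swapping one lossy method for another does not touch the verification path.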

Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)

Cite as: arXiv:2605.17613 [cs.AR]

(or arXiv:2605.17613v1 [cs.AR] for this version)

https://doi.org/10.48550/arXiv.2605.17613


arXiv-issued DOI via DataCite

Submission history: From Jiayi Yao. [v1] Sun, 17 May 2026 19:18:39 UTC (435 KB)
