[2605.17613] VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
Computer Science > Hardware Architecture
arXiv:2605.17613 (cs)
[Submitted on 17 May 2026]
Title: VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
Authors: Jiayi Yao, Samuel Shen, Kuntai Du, Shaoting Feng, Dongjoo Seo, Rui Zhang, Yuyang Huang, Yuhan Liu, Shan Lu, Junchen Jiang
Abstract: The large size of the KV cache has become a major bottleneck for serving LLMs with increasing context lengths. In response, many KV cache compression methods, such as token dropping and quantization, have been proposed. However, almost all of these methods are inherently lossy: despite minimal accuracy degradation for short outputs, their outputs increasingly diverge from full-KV-cache outputs as more tokens are decoded, which leads to catastrophic failures in code generation and tool calling.
We present VeriCache, the first inference framework that ensures the same output as full-KV-cache decoding while largely preserving the high decoding throughput of a range of KV cache compression algorithms. VeriCache uses the compressed KV cache to draft tokens, then verifies them against the full KV cache. While this may look like ordinary speculative decoding, making it work requires addressing a key system challenge: keeping the full KV cache out of GPU memory and minimizing the overhead of swapping it in for verification. The insight is two-fold: (1) compressed-KV decoding can run in parallel with the full-KV swap, because the former is HBM-bandwidth-bound and the latter is PCIe/network-bound, and (2) the compressed KV cache often produces output similar to that of the full KV cache, allowing a long drafting horizon to amortize each full-KV swap.
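The draft-then-verify idea can be illustrated with a toy sketch. This is not the VeriCache implementation: `full_kv_next` and `compressed_kv_next` are hypothetical stand-ins for greedy decoding with the full and compressed KV cache, the `horizon` parameter is an assumed name for the drafting horizon, and verification is written as a sequential loop where a real system would use one batched forward pass. The sketch only shows why accepting drafted tokens up to the first mismatch yields the same output as full-KV greedy decoding, and how a longer horizon means fewer full-KV swaps.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def full_kv_next(ctx: List[int]) -> int:
    # Hypothetical stand-in for one greedy step with the full KV cache.
    return (sum(ctx) * 31 + len(ctx)) % 50


def compressed_kv_next(ctx: List[int]) -> int:
    # Hypothetical lossy stand-in: agrees with the full-KV step except
    # when the context length is a multiple of 7.
    tok = full_kv_next(ctx)
    return (tok + 1) % 50 if len(ctx) % 7 == 0 else tok


def decode_full(prompt: List[int], n: int, target: NextToken) -> List[int]:
    # Reference output: plain greedy decoding with the target model.
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out[len(prompt):]


def draft_then_verify(
    prompt: List[int], n: int, draft: NextToken, target: NextToken, horizon: int
) -> Tuple[List[int], int]:
    out = list(prompt)
    swaps = 0  # number of verification rounds, i.e. assumed full-KV swaps
    while len(out) - len(prompt) < n:
        # Draft `horizon` tokens using only the lossy model.
        ctx = list(out)
        drafted = []
        for _ in range(horizon):
            tok = draft(ctx)
            drafted.append(tok)
            ctx.append(tok)
        # Verify against the target. Every appended token is the target's
        # own choice, so the output can never diverge from decode_full.
        swaps += 1
        for tok in drafted:
            expected = target(out)
            out.append(expected)
            if expected != tok:
                break  # discard the rest of the draft after a mismatch
    return out[len(prompt):len(prompt) + n], swaps


prompt = [1, 2, 3]
tokens, swaps = draft_then_verify(prompt, 40, compressed_kv_next, full_kv_next, 8)
print(tokens == decode_full(prompt, 40, full_kv_next), swaps)
```

In this toy setting the number of verification rounds is well below the number of generated tokens, which is the amortization effect the abstract describes; the actual acceptance rate depends on the model and compression method.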
VeriCache applies to both long-context decoding and remote prefix caching, supports a broad family of token-dropping and quantization methods through a uniform compressor interface, and composes with traditional speculative decoding. Experimental results show that VeriCache achieves up to 4X higher throughput than full-KV inference while producing identical outputs.
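As a rough illustration of what a uniform compressor interface could look like, the sketch below defines a single `compress` method and two toy implementations, one for token dropping and one for quantization. All names here (`Compressor`, `KeepRecent`, `RoundQuantizer`, `build_draft_cache`) are hypothetical and chosen for illustration; the abstract does not specify the actual interface, and real methods operate on GPU tensors rather than Python lists.

```python
from typing import List, Protocol, Tuple

# One (key vector, value vector) pair per cached token. Toy representation.
KV = List[Tuple[List[float], List[float]]]


class Compressor(Protocol):
    # Hypothetical uniform interface: full KV cache in, lossy KV cache out.
    def compress(self, kv: KV) -> KV: ...


class KeepRecent:
    """Toy token dropping: keep only the most recent `window` tokens."""

    def __init__(self, window: int) -> None:
        self.window = window

    def compress(self, kv: KV) -> KV:
        return kv[max(0, len(kv) - self.window):]


class RoundQuantizer:
    """Toy quantization: snap every entry to a multiple of `step`."""

    def __init__(self, step: float) -> None:
        self.step = step

    def _quantize(self, vec: List[float]) -> List[float]:
        return [round(x / self.step) * self.step for x in vec]

    def compress(self, kv: KV) -> KV:
        return [(self._quantize(k), self._quantize(v)) for k, v in kv]


def build_draft_cache(kv: KV, compressor: Compressor) -> KV:
    # The drafting path depends only on the interface, not on the method.
    return compressor.compress(kv)


kv_cache: KV = [
    ([0.12, 0.49], [1.01, -0.26]),
    ([0.80, 0.10], [0.30, 0.70]),
    ([0.55, 0.45], [0.05, 0.95]),
]
print(len(build_draft_cache(kv_cache, KeepRecent(2))))
print(build_draft_cache(kv_cache, RoundQuantizer(0.5))[0])
```

The point of the design is that the drafting code calls `compress` without knowing which method is behind it, so a new token-dropping or quantization scheme only needs to implement that one method.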
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Cite as: arXiv:2605.17613 [cs.AR] (or arXiv:2605.17613v1 [cs.AR] for this version)
https://doi.org/10.48550/arXiv.2605.17613
arXiv-issued DOI via DataCite
Submission history: From Jiayi Yao. [v1] Sun, 17 May 2026 19:18:39 UTC (435 KB)