VeriCache: Turning Lossy KV Cache into Lossless LLM Inference



arXiv:2605.17613 (cs)

[Submitted on 17 May 2026]

Title: VeriCache: Turning Lossy KV Cache into Lossless LLM Inference

Authors: Jiayi Yao, Samuel Shen, Kuntai Du, Shaoting Feng, Dongjoo Seo, Rui Zhang, Yuyang Huang, Yuhan Liu, Shan Lu, Junchen Jiang

Abstract: The large size of the KV cache has become a major bottleneck for serving LLMs with increasing context lengths. In response, many KV cache compression methods, such as token dropping and quantization, have been proposed. However, almost all of these methods are inherently lossy: despite minimal accuracy degradation for short outputs, their outputs increasingly diverge from full-KV-cache outputs as more tokens are decoded, which leads to catastrophic failures in code generation and tool calling.

We present VeriCache, the first inference framework that ensures the same output as full-KV-cache decoding while largely preserving the high decoding throughput of a range of KV cache compression algorithms. VeriCache uses the compressed KV cache to draft tokens, then verifies them against the full KV cache. While it may seem like just speculative decoding, VeriCache must address a key system challenge to work: keeping the full KV cache out of GPU memory and minimizing the overhead of swapping it in for verification. The insight is two-fold: (1) compressed-KV decoding can be parallelized with full-KV swap, because the former is HBM-bandwidth-bound and the latter is PCIe/network-bound, and (2) the compressed KV cache often produces output similar to the full KV cache, allowing a long drafting horizon to amortize each full-KV swap.
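The draft-then-verify idea can be illustrated with a toy sketch. This is not the paper's implementation: `full_next` and `lossy_next` are hypothetical stand-ins for greedy decoding against the full and compressed KV caches, verification runs token by token rather than in one batched pass, and neither the swap nor its overlap with drafting is modeled. The sketch only shows why accepting the longest verified prefix reproduces the full-KV output exactly.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def full_next(ctx: List[int]) -> int:
    """Hypothetical stand-in for greedy decoding with the full KV cache."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def lossy_next(ctx: List[int]) -> int:
    """Hypothetical stand-in for compressed-KV decoding: usually agrees
    with full_next, but deviates whenever len(ctx) is a multiple of 7."""
    tok = full_next(ctx)
    return (tok + 1) % 50 if len(ctx) % 7 == 0 else tok


def full_decode(prompt: List[int], n_new: int, step: NextToken) -> List[int]:
    """Reference: plain greedy decoding, one token at a time."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(step(out))
    return out


def draft_then_verify(
    prompt: List[int],
    n_new: int,
    horizon: int,
    draft: NextToken,
    verify: NextToken,
) -> Tuple[List[int], int]:
    """Draft up to `horizon` tokens with `draft`, then check them with
    `verify`. Returns the decoded sequence and the number of verify rounds."""
    out = list(prompt)
    target = len(prompt) + n_new
    rounds = 0
    while len(out) < target:
        rounds += 1
        k = min(horizon, target - len(out))
        drafted: List[int] = []
        for _ in range(k):
            drafted.append(draft(out + drafted))
        for i, tok in enumerate(drafted):
            expected = verify(out + drafted[:i])
            if tok != expected:
                # Keep the verified prefix, then take the verifier's token.
                out.extend(drafted[:i])
                out.append(expected)
                break
        else:
            # Every drafted token matched the verifier.
            out.extend(drafted)
    return out, rounds


if __name__ == "__main__":
    seq, rounds = draft_then_verify([1, 2, 3], 40, 8, lossy_next, full_next)
    print(seq == full_decode([1, 2, 3], 40, full_next), rounds)
```

In this toy setting, the more often the draft agrees with the verifier, the fewer verification rounds are needed for a given horizon, which is the sense in which a long drafting horizon amortizes each full-KV swap.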

VeriCache applies to both long-context decoding and remote prefix caching, supports a broad family of token-dropping and quantization methods through a uniform compressor interface, and composes with traditional speculative decoding. Experimental results show that VeriCache achieves up to 4x higher throughput than full-KV inference while producing identical outputs.
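The abstract does not spell out the compressor interface, so the following is only a guess at what a uniform interface over token dropping and quantization could look like. Every name here (`Compressor`, `KeepSinkAndRecent`, `UniformQuantizer`, `build_draft_cache`) is hypothetical, the KV cache is modeled as a plain list of non-empty per-token vectors, and the two policies are simple textbook examples rather than the methods evaluated in the paper.

```python
from typing import List, Protocol

KV = List[List[float]]  # toy layout: one non-empty vector per cached token


class Compressor(Protocol):
    """Hypothetical uniform interface: any lossy KV transform fits here."""

    def compress(self, kv: KV) -> KV: ...


class KeepSinkAndRecent:
    """Token dropping: keep the first `sink` and last `recent` tokens."""

    def __init__(self, sink: int, recent: int) -> None:
        self.sink = sink
        self.recent = recent

    def compress(self, kv: KV) -> KV:
        if len(kv) <= self.sink + self.recent:
            return [row[:] for row in kv]
        kept = kv[: self.sink] + kv[len(kv) - self.recent:]
        return [row[:] for row in kept]


class UniformQuantizer:
    """Quantization: snap each vector to 2**bits evenly spaced levels
    between its own min and max (values are returned dequantized)."""

    def __init__(self, bits: int) -> None:
        self.levels = (1 << bits) - 1

    def compress(self, kv: KV) -> KV:
        out: KV = []
        for row in kv:
            lo, hi = min(row), max(row)
            scale = (hi - lo) / self.levels or 1.0  # constant rows: avoid /0
            out.append([lo + round((x - lo) / scale) * scale for x in row])
        return out


def build_draft_cache(kv: KV, compressor: Compressor) -> KV:
    """The drafting side only sees whatever the compressor returns."""
    return compressor.compress(kv)
```

With an interface of this shape, the verification path would not need to know which compressor produced the draft cache; it only compares drafted tokens against full-KV decoding.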

Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)

Cite as: arXiv:2605.17613 [cs.AR] (or arXiv:2605.17613v1 [cs.AR] for this version)

https://doi.org/10.48550/arXiv.2605.17613 (arXiv-issued DOI via DataCite)

Submission history
From: Jiayi Yao
[v1] Sun, 17 May 2026 19:18:39 UTC (435 KB)
