Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

Computer Science > Machine Learning

arXiv:2604.22782 (cs)

[Submitted on 3 Apr 2026]

Title: Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

Authors: Anastasiia Filippova, David Grangier, Marco Cuturi, João Monteiro

Abstract: Serving transformer language models with high throughput requires caching Key-Values (KVs) to avoid redundant computation during autoregressive generation. The memory footprint of KV caching is significant and heavily impacts serving costs. This work proposes to lessen these memory requirements. While recent work has largely addressed KV cache reduction via compression and eviction along the temporal axis, we argue that the depth dimension offers an orthogonal and robust avenue for optimization. Although prior research suggests that a full cache for every layer is redundant, implementing cross-layer cache sharing remains a practical challenge; existing methods typically suffer from reduced throughput or increased time-to-first-token. In this paper, we demonstrate that dropping a layer's cache offers efficient optimization without information loss. We propose a simple training approach: random cross-layer attention. During training, layers randomly choose to attend either to their own KV states or those of a preceding layer. This stochastic process adapts the model to be robust to various depth-wise cache sharing strategies, ensuring flexibility for unknown hardware constraints at deployment time. Our evaluations show that applying this scheme during pre-training or fine-tuning enables depth-wise cache sharing for various model families. Furthermore, for larger models in data-constrained settings, this approach is suggestive of a regularization-like effect, frequently preserving or improving performance while significantly reducing the cache's memory footprint.
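
The routing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how a preceding layer is chosen or how often sharing happens, so the `share_prob` parameter, the uniform choice over earlier layers, and the function names `sample_kv_routing`, `grouped_sharing`, and `cache_fraction` are all assumptions made for illustration. The sketch only tracks which layer's KV states each layer reads, and how many per-layer caches would then need to be stored.

```python
import random


def sample_kv_routing(num_layers, share_prob, rng):
    """Sample, for each layer, the index of the layer whose KV states it attends to.

    Assumption (not from the paper): with probability `share_prob` a layer reads
    the KV states of a uniformly chosen preceding layer; otherwise it reads its own.
    Layer 0 has no preceding layer, so it always reads its own KV states.
    """
    sources = []
    for layer in range(num_layers):
        if layer > 0 and rng.random() < share_prob:
            sources.append(rng.randrange(layer))  # some earlier layer in [0, layer)
        else:
            sources.append(layer)  # the layer's own KV states
    return sources


def grouped_sharing(num_layers, group_size):
    """A hypothetical fixed deployment strategy: each block of `group_size`
    consecutive layers reads the KV states of the first layer in its block."""
    return [(layer // group_size) * group_size for layer in range(num_layers)]


def cache_fraction(sources):
    """Fraction of per-layer KV caches that must be kept: only layers that
    some layer actually reads from need a cache."""
    return len(set(sources)) / len(sources)


if __name__ == "__main__":
    rng = random.Random(0)
    # One random routing per training step (assumed sampling granularity).
    print(sample_kv_routing(num_layers=12, share_prob=0.5, rng=rng))

    # At deployment, a fixed sharing pattern can be picked to fit the memory budget.
    pattern = grouped_sharing(num_layers=8, group_size=2)
    print(pattern)                  # [0, 0, 2, 2, 4, 4, 6, 6]
    print(cache_fraction(pattern))  # 0.5
```

In this sketch, training would draw a fresh routing repeatedly so the model sees many sharing patterns, while deployment would fix one pattern such as `grouped_sharing`; a group size of 2 halves the number of cached layers and a group size of 4 keeps a quarter of them.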

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Cite as: arXiv:2604.22782 [cs.LG] (or arXiv:2604.22782v1 [cs.LG] for this version)

DOI: https://doi.org/10.48550/arXiv.2604.22782 (arXiv-issued DOI via DataCite)

Submission history: From: Joao Monteiro. [v1] Fri, 3 Apr 2026 14:56:17 UTC (14,295 KB)
