StreamIndex: Memory-bounded compressed sparse attention via streaming top-k



arXiv:2605.02568 (cs)

[Submitted on 4 May 2026]

Title: StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

Authors: Jaber Jaber, Osama Jaber


Abstract: DeepSeek-V3.2 and V4 introduce Compressed Sparse Attention (CSA): a lightning indexer (a learned scoring projection) scores the compressed keys, the top-k keys are selected per query, and a sparse attention kernel reads only those. Public CSA implementations materialize a [B, S, H_I, T] FP32 score tensor before the top-k reduction. With H_I=64 indexer heads and the V4-Flash compression ratio m=4, that intermediate is 256 GB at sequence length S=65,536, exceeding any single-GPU high-bandwidth-memory (HBM) budget. We present StreamIndex, a Triton implementation of the CSA pipeline whose central component is a chunked partition-merge top-k driver that never materializes the full intermediate. On synthetic-but-realistic V4-shaped inputs at the indexer-step (layer) level on a single NVIDIA H200, the materialize path runs out of memory (OOMs) at S=65,536 with V4-Flash dimensions; StreamIndex runs the same indexer to S=1,048,576 with 6.21 GB peak HBM, a 32x regime extension. Set-overlap recall against the materialize ground truth is bit-exact at small S where both fit; across three 5-point design-space sweeps (chunk size, key-tile size, top-k), mean recall rounds to 1.0000 with min recall at least 0.9980 in every cell. The chunked driver composes with TileLang's pipelined attention kernel: at S=262,144 with V4-Flash dimensions, the materialize indexer paired with TileLang attention OOMs, while the chunked indexer paired with the same attention runs in 1.97 s at 18.56 GB peak. Our contribution targets the indexer step; we make no claim of a faster attention kernel or of real-checkpoint end-to-end behavior. Code: this https URL.
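The two ideas in the abstract can be illustrated with a small pure-Python sketch. It is not the authors' Triton kernel. The first function reproduces the 256 GB figure under two assumptions that the abstract does not state: batch size B=1 and binary gigabytes (2^30 bytes), with T = S / m compressed keys. The second function shows one generic way to do a partition-merge top-k for a single query row: score one key tile at a time, keep a local top-k per tile, and merge it into a running top-k, so no more than one tile of scores plus k candidates is live at once. All function and parameter names here (`score_tensor_bytes`, `streaming_topk`, `score_fn`, `tile`) are hypothetical.

```python
import heapq


def score_tensor_bytes(batch, seq_len, indexer_heads, compression_ratio,
                       bytes_per_elem=4):
    """Bytes in a dense [B, S, H_I, T] score tensor, assuming T = S / m."""
    num_keys = seq_len // compression_ratio
    return batch * seq_len * indexer_heads * num_keys * bytes_per_elem


def materialized_topk(scores, k):
    """Reference path: rank the full score row, ties broken by lower index."""
    return heapq.nlargest(k, range(len(scores)),
                          key=lambda j: (scores[j], -j))


def streaming_topk(score_fn, num_keys, k, tile):
    """Partition-merge top-k over key tiles for one query row.

    score_fn(j) returns the score of compressed key j. Only one tile of
    scores and at most k running candidates are held at any time.
    """
    best = []  # (score, -index) pairs, at most k entries
    for start in range(0, num_keys, tile):
        stop = min(start + tile, num_keys)
        # Partition step: score this tile and keep its local top-k.
        local = heapq.nlargest(k, ((score_fn(j), -j)
                                   for j in range(start, stop)))
        # Merge step: combine with the running top-k and truncate to k.
        best = heapq.nlargest(k, best + local)
    return [-neg_index for _, neg_index in best]


if __name__ == "__main__":
    # V4-Flash-like numbers from the abstract, with assumed B=1.
    size = score_tensor_bytes(1, 65_536, 64, 4)
    print(size / 2**30, "GiB")

    # Toy scores with no ties; both paths should select the same keys.
    scores = [(j * 7919) % 1009 for j in range(1000)]
    streamed = streaming_topk(lambda j: scores[j], len(scores), 16, 128)
    print(streamed == materialized_topk(scores, 16))
```

In this sketch the merge is exact because every global top-k element is necessarily in the local top-k of its own tile. A GPU kernel would additionally chunk over queries and heads, and tie handling or floating-point ordering may differ from this toy version.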

Comments: 11 pages, 3 figures, 7 tables, 2 algorithms, 36 references. Memory-bounded indexer kernel for DeepSeek-V4 CSA via chunked partition-merge top-k. Code: this https URL

Subjects: Machine Learning (cs.LG); Performance (cs.PF)

ACM classes: C.1.2; I.2.7

Cite as: arXiv:2605.02568 [cs.LG]

(or arXiv:2605.02568v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2605.02568


arXiv-issued DOI via DataCite (pending registration)

Submission history
From: Jaber Jaber
[v1] Mon, 4 May 2026 13:19:29 UTC (66 KB)
