Luce KVFlash: 256K context with 72MiB of KV cache on the GPU

lucebox-hub/optimizations/kvflash at main · Luce-Org/lucebox-hub · GitHub


Luce KVFlash

Lookahead sparse attention for dflash. Bounded KV residency on one GPU.

The attention KV cache lives in a fixed pool of slots; cold 64-token chunks page to host RAM, bit-exact and recallable.
With pflash, its drafter doubles as a Memory Indexer that recalls the context the generation needs next.
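The paging idea above can be illustrated with a toy model. This is a minimal sketch, not the project's implementation: the class `ChunkPool` and its methods are hypothetical names, chunks are plain `bytes`, and eviction is recency-only (the LRU policy), with no drafter scoring.

```python
from collections import OrderedDict

CHUNK_TOKENS = 64  # chunk granularity stated in the README


class ChunkPool:
    """Toy fixed-size resident pool with host-side paging (LRU only)."""

    def __init__(self, pool_tokens: int):
        self.capacity = pool_tokens // CHUNK_TOKENS  # number of resident slots
        if self.capacity < 1:
            raise ValueError("pool must hold at least one chunk")
        self.resident = OrderedDict()  # chunk_id -> bytes, coldest first
        self.host = {}                 # chunk_id -> bytes, paged-out chunks

    def _evict_one(self) -> None:
        chunk_id, data = self.resident.popitem(last=False)  # coldest chunk
        self.host[chunk_id] = data  # copied verbatim, so recall is bit-exact

    def put(self, chunk_id: int, data: bytes) -> None:
        if chunk_id in self.resident:
            self.resident.move_to_end(chunk_id)
            self.resident[chunk_id] = data
            return
        while len(self.resident) >= self.capacity:
            self._evict_one()
        self.resident[chunk_id] = data

    def recall(self, chunk_id: int) -> bytes:
        if chunk_id in self.resident:
            self.resident.move_to_end(chunk_id)  # mark as recently used
            return self.resident[chunk_id]
        data = self.host.pop(chunk_id)  # KeyError if the chunk was never stored
        self.put(chunk_id, data)        # may push another chunk out to host
        return data
```

With `ChunkPool(pool_tokens=128)` (two slots), storing chunks 0, 1 and 2 pages chunk 0 out to the host dictionary; `recall(0)` brings back the same bytes and pages chunk 1 out in its place.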

Qwen3.6-27B Q4_K_M on a single RTX 3090: native 256K context at 38.6 tok/s with 72 MiB of resident KV, needle recall 88-100% at 6% residency, harness accuracy unchanged (36/36 vs full cache).

| | decode tok/s | KV in VRAM (Q8_0) | needle (d=10/50/90%) |
|---|---|---|---|
| full cache @ 64K | 27.8 | 1152 MiB | 16/16 |
| full cache @ 128K | 19.6 | 2304 MiB | 16/16 |
| full cache @ 256K | 13.1 | 4608 MiB | 16/16 |
| KVFlash 4K @ 64K | 38.6 | 72 MiB | 14/16 |
| KVFlash 4K @ 128K | 38.6 | 72 MiB | 14/16 |
| KVFlash 4K @ 256K | 38.6 | 72 MiB | 15/16 |

Decode speed is flat at any context length (the per-step KV read is pool-sized, not context-sized), prefill is up to 2.8x faster, and a 256K prompt that costs 4.6 GiB of VRAM as a full cache costs 72 MiB resident + 4.2 GiB of host RAM. (The full-cache 256K rows are measured, not extrapolated: they fit the 24 GB card only thanks to Q8_0 KV; with F16 KV the cache alone is 9.2 GiB and 256K does not fit at all.)
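The resident-KV figures can be cross-checked from the table's own numbers. The sketch below assumes "4K" means a 4096-token pool and "64K" means 65536 tokens; the function name is illustrative only.

```python
MIB = 1024 * 1024


def kv_bytes_per_token(cache_mib: float, context_tokens: int) -> float:
    """KV density implied by a full-cache row of the table."""
    return cache_mib * MIB / context_tokens


# Full cache @ 64K: 1152 MiB for 65536 tokens.
density = kv_bytes_per_token(1152, 64 * 1024)

# A 4096-token pool at the same density.
pool_mib = 4096 * density / MIB

# Fraction of the context that is resident at 64K and at 256K.
residency_64k = 4096 / (64 * 1024)
residency_256k = 4096 / (256 * 1024)

print(f"density: {density / 1024:.0f} KiB/token")   # 18 KiB/token
print(f"4K pool: {pool_mib:.0f} MiB")               # 72 MiB
print(f"residency @ 64K:  {residency_64k:.2%}")     # 6.25%
print(f"residency @ 256K: {residency_256k:.2%}")    # 1.56%
```

The 128K and 256K full-cache rows give the same 18 KiB/token density, which is why the resident cost stays at 72 MiB while the full-cache cost doubles with each context step.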

Usage

# recommended: drafter-scored residency, pool auto-sized from VRAM.
# pass --prefill-drafter so the drafter is guaranteed (no silent LRU fallback).
dflash_server model.gguf --max-ctx 32768 --kvflash auto \
  --prefill-drafter /opt/lucebox/models/drafter/Qwen3-0.6B-BF16.gguf

# drop the path to auto-probe (model dir, drafter/, draft/, /opt/lucebox/models/drafter/);
# falls back to LRU if none is found, so check the banner reads policy=drafter
dflash_server model.gguf --max-ctx 32768 --kvflash auto

# explicit pool size, recency-only LRU
dflash_server model.gguf --max-ctx 32768 --kvflash 8192 --kvflash-policy lru
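The auto-probe order mentioned in the usage comments can be pictured roughly as follows. This is a hedged sketch, not the server's code: `probe_drafter` and `residency_policy` are hypothetical helpers, and it assumes that `drafter/` and `draft/` are subdirectories of the model's directory.

```python
from pathlib import Path
from typing import Optional

DRAFTER_NAME = "Qwen3-0.6B-BF16.gguf"
SYSTEM_DRAFTER_DIR = Path("/opt/lucebox/models/drafter")


def probe_drafter(model_path: str,
                  system_dir: Path = SYSTEM_DRAFTER_DIR) -> Optional[Path]:
    """Return the first drafter file found in the documented probe order."""
    model_dir = Path(model_path).resolve().parent
    # Assumed layout: drafter/ and draft/ sit next to the model file.
    search_dirs = [model_dir, model_dir / "drafter", model_dir / "draft", system_dir]
    for directory in search_dirs:
        candidate = directory / DRAFTER_NAME
        if candidate.is_file():
            return candidate
    return None


def residency_policy(model_path: str,
                     system_dir: Path = SYSTEM_DRAFTER_DIR) -> str:
    """Drafter-scored residency if a drafter is found, otherwise LRU."""
    return "drafter" if probe_drafter(model_path, system_dir) else "lru"
```

Because the fallback is silent in this model too, the practical advice from the usage comments still applies: confirm that the startup banner reads policy=drafter, or pass --prefill-drafter explicitly.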

Drafter-scored residency is the DEFAULT policy on every model family: the server probes for Qwen3-0.6B-BF16.gguf next to the model (same dir, drafter/, draft/, then /opt/lucebox/models/drafter/) and lazy-loads it on the first reselect. --prefill-drafter overrides the location; prefill compression can stay off either way.

Qwen-family targets feed the drafter their ids directly. laguna and gemma4 bridge the tokenizer gap with KvFlashCrossTokScorer: relevance is a property of the TEXT, so the target's history is detokenized, re-tokenized for the drafter, scored, and mapped back to chunk boundaries by character spans.

LRU is the fallback when no drafter is found (the banner says which policy you got) or the explicit choice via --kvflash-policy lru.

auto sizes the pool from the GPU, not from a fixed fraction: it takes half of the free VRAM left after weights (minus a reserve for compute buffers and the drafter), converts it at the model's KV density, and caps it both where decode speed stays near the flat optimum (16384 tokens by default, DFLASH_KVFLASH_MAX_POOL to override) and at --max-ctx. Bigger pools mean more resident chunks and fewer forced evictions of useful context; the cap keeps the per-step KV read small enough that decode stays near the small-pool speed.
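The auto-sizing rule can be sketched as a small function. This is an illustration under stated assumptions, not the server's code: `auto_pool_tokens` and its parameter names are hypothetical, and the reserve is assumed to be subtracted before halving (the text does not say which comes first).

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def auto_pool_tokens(free_vram_after_weights: int,
                     reserve_bytes: int,
                     kv_bytes_per_token: int,
                     max_ctx: int,
                     max_pool: int = 16384) -> int:
    """Pool size in tokens for --kvflash auto, per the description above.

    max_pool mirrors the documented default cap (DFLASH_KVFLASH_MAX_POOL).
    """
    # Assumption: reserve comes off first, then half of the rest is used.
    budget = max(0, free_vram_after_weights - reserve_bytes) // 2
    tokens = budget // kv_bytes_per_token  # convert bytes at the KV density
    return min(tokens, max_pool, max_ctx)


# Plenty of free VRAM: the decode-speed cap decides.
roomy = auto_pool_tokens(6 * GIB, 1 * GIB, 18432, max_ctx=32768)

# Little free VRAM: the VRAM budget decides.
tight = auto_pool_tokens(1200 * MIB, 1024 * MIB, 18432, max_ctx=32768)

print(roomy, tight)  # 16384 5006
```

The 18432 bytes/token density used here is the value implied by the Q8_0 full-cache rows in the results table; other models or KV formats would use their own density.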

--kvflash : resident pool size (rounded to a multiple of 256; clamped to --max-ctx; floored at the protected minimum, which is 512 for qwen-family and gemma4 and larger on laguna, where the SWA window stays resident, so eviction always has a victim). Env: DFLASH_KVFLASH.
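How an explicit pool size might be normalized can be sketched as below. The helper `normalize_pool` is hypothetical, and two details are assumptions because the text does not specify them: rounding goes up to the next multiple of 256, and the protected-minimum floor is applied last.

```python
def normalize_pool(requested: int,
                   max_ctx: int,
                   protected_min: int = 512,
                   step: int = 256) -> int:
    """Round, clamp and floor a requested --kvflash pool size (in tokens)."""
    # Assumption: round up to the next multiple of `step`.
    rounded = ((requested + step - 1) // step) * step
    # Clamp to the context limit, then keep at least the protected minimum
    # so that eviction always has an unprotected victim.
    return max(protected_min, min(rounded, max_ctx))


print(normalize_pool(8192, max_ctx=32768))   # 8192
print(normalize_pool(1000, max_ctx=32768))   # 1024
print(normalize_pool(100, max_ctx=32768))    # 512
print(normalize_pool(50000, max_ctx=32768))  # 32768
```

The default `protected_min=512` matches the qwen-family and gemma4 value given above; laguna would pass a larger value.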

--kvflash-tau : reselect interval floor (default 64; the effective interval grows with history so rescore overhead stays ~15% of...
