Luce KVFlash: 256K context with 72MiB of KV cache on the GPU

Luce KVFlash

Lookahead sparse attention for dflash. Bounded KV residency on one GPU.

The attention KV cache lives in a fixed pool of slots; cold 64-token chunks page to host RAM, bit-exact and recallable. With pflash, its drafter doubles as a Memory Indexer that recalls the context the generation needs next.
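The fixed-pool idea can be illustrated with a toy model. This is a minimal sketch, not the project's implementation: the `ChunkPool` class and its method names are hypothetical, chunk payloads are plain `bytes`, and recency-only LRU eviction stands in for the drafter-scored policy.

```python
from collections import OrderedDict

CHUNK_TOKENS = 64  # chunk granularity stated in the README


class ChunkPool:
    """Toy fixed-size slot pool (hypothetical names, LRU eviction only)."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.resident = OrderedDict()  # chunk_id -> payload, least recent first
        self.host = {}                 # chunk_id -> payload paged out to "host RAM"

    def touch(self, chunk_id, data=None):
        """Make a chunk resident and return its payload, evicting if needed."""
        if chunk_id in self.resident:
            self.resident.move_to_end(chunk_id)  # mark as most recently used
            return self.resident[chunk_id]
        if chunk_id in self.host:
            data = self.host.pop(chunk_id)       # recall the exact bytes paged out
        elif data is None:
            raise KeyError(chunk_id)
        if len(self.resident) >= self.num_slots:
            victim, payload = self.resident.popitem(last=False)
            self.host[victim] = payload          # page the coldest chunk to host
        self.resident[chunk_id] = data
        return data


pool = ChunkPool(num_slots=2)
for cid, payload in enumerate([b"a", b"b", b"c"]):
    pool.touch(cid, payload)
recalled = pool.touch(0)  # chunk 0 was paged out; recalling it evicts chunk 1
```

The resident set never grows past `num_slots`, which is the property that keeps the per-step KV read bounded regardless of how much history sits in host memory.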

Qwen3.6-27B Q4_K_M on a single RTX 3090: native 256K context at 38.6 tok/s with 72 MiB of resident KV, needle recall 88-100% at 6% residency, harness accuracy unchanged (36/36 vs full cache).

| config | decode tok/s | KV in VRAM (Q8_0) | needle (d=10/50/90%) |
| --- | --- | --- | --- |
| full cache @ 64K | 27.8 | 1152 MiB | 16/16 |
| full cache @ 128K | 19.6 | 2304 MiB | 16/16 |
| full cache @ 256K | 13.1 | 4608 MiB | 16/16 |
| KVFlash 4K @ 64K | 38.6 | 72 MiB | 14/16 |
| KVFlash 4K @ 128K | 38.6 | 72 MiB | 14/16 |
| KVFlash 4K @ 256K | 38.6 | 72 MiB | 15/16 |

Decode speed is flat at any context length (the per-step KV read is pool-sized, not context-sized), prefill is up to 2.8x faster, and a 256K prompt that costs 4.6 GiB of VRAM as a full cache costs 72 MiB resident + 4.2 GiB of host RAM. (The full-cache 256K rows are measured, not extrapolated: they fit the 24 GB card only thanks to Q8_0 KV; with F16 KV the cache alone is 9.2 GiB and 256K does not fit at all.)
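The resident-KV figures in the table follow from a constant per-token KV cost. The sketch below derives that cost from the "full cache @ 64K" row and applies it to a pool, assuming that "KVFlash 4K" means a 4096-token pool; the function and variable names are illustrative only.

```python
MIB = 1024 * 1024


def kv_bytes_per_token(cache_mib, context_tokens):
    """Per-token KV cost implied by a measured full-cache size."""
    return cache_mib * MIB / context_tokens


# "full cache @ 64K" row: 1152 MiB for 64 * 1024 tokens
per_token = kv_bytes_per_token(1152, 64 * 1024)

# Assumption: "KVFlash 4K" is a pool of 4096 resident tokens
pool_mib = 4096 * per_token / MIB

# Full cache at 256K for comparison
full_256k_mib = 256 * 1024 * per_token / MIB

print(per_token / 1024, pool_mib, full_256k_mib)  # 18.0 72.0 4608.0
```

At 18 KiB per token, a 4096-token pool is 72 MiB no matter how long the prompt is, while the full cache grows linearly to 4608 MiB at 256K.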

Usage

    # recommended: drafter-scored residency, pool auto-sized from VRAM.
    # pass --prefill-drafter so the drafter is guaranteed (no silent LRU fallback).
    dflash_server model.gguf --max-ctx 32768 --kvflash auto \
      --prefill-drafter /opt/lucebox/models/drafter/Qwen3-0.6B-BF16.gguf

    # drop the path to auto-probe (model dir, drafter/, draft/, /opt/lucebox/models/drafter/);
    # falls back to LRU if none is found, so check the banner reads policy=drafter
    dflash_server model.gguf --max-ctx 32768 --kvflash auto

    # explicit pool size, recency-only LRU
    dflash_server model.gguf --max-ctx 32768 --kvflash 8192 --kvflash-policy lru

Drafter-scored residency is the default policy on every model family. The server probes for Qwen3-0.6B-BF16.gguf next to the model (same dir, drafter/, draft/, then /opt/lucebox/models/drafter/) and lazy-loads it on the first reselect. --prefill-drafter overrides the location; prefill compression can stay off either way.

Qwen-family targets feed the drafter their ids directly. laguna and gemma4 bridge the tokenizer gap with KvFlashCrossTokScorer: relevance is a property of the text, so the target's history is detokenized, re-tokenized for the drafter, scored, and mapped back to chunk boundaries by character spans.

LRU is the fallback when no drafter is found (the banner says which policy you got), or the explicit choice via --kvflash-policy lru.

auto sizes the pool from the GPU, not from a fixed fraction of the context. It takes half of the free VRAM left after weights (minus a reserve for compute buffers and the drafter), converts it to tokens at the model's KV density, and caps the result both where decode speed stays near the flat optimum (16384 tokens by default, DFLASH_KVFLASH_MAX_POOL to override) and at --max-ctx. Bigger pools mean more resident chunks and fewer forced evictions of useful context; the cap keeps the per-step KV read small enough that decode stays near the small-pool speed.
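The auto-sizing rule can be sketched as a small function. This is an illustration under stated assumptions, not the server's code: the function name and parameters are hypothetical, the reserve is assumed to be subtracted before halving, and the example inputs are made-up byte counts rather than measurements.

```python
GIB = 1024 ** 3


def auto_pool_tokens(free_after_weights, reserve, kv_bytes_per_token,
                     max_ctx, max_pool=16384):
    """Hypothetical sketch of the `--kvflash auto` sizing rule.

    Assumption: the reserve (compute buffers, drafter) comes off the free
    VRAM first, and half of what remains is offered to the KV pool.
    """
    budget = max(0, free_after_weights - reserve) // 2
    tokens = budget // kv_bytes_per_token  # convert bytes to tokens at KV density
    return min(tokens, max_pool, max_ctx)  # speed cap and context cap


# Illustrative inputs only: 18432 bytes/token is the density implied by the
# Q8_0 table rows; the VRAM figures are invented for the example.
roomy = auto_pool_tokens(8 * GIB, 2 * GIB, 18432, max_ctx=32768)
tight = auto_pool_tokens(1 * GIB, GIB // 2, 18432, max_ctx=32768)
short = auto_pool_tokens(8 * GIB, 2 * GIB, 18432, max_ctx=8192)
```

With plenty of free VRAM the default 16384-token cap decides the result, on a tight card the VRAM budget does, and a small `--max-ctx` wins over both.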

--kvflash : resident pool size (rounded to 256; clamped to --max-ctx; floored at the protected minimum — 512 for qwen-family and gemma4, larger on laguna where the SWA window stays resident — so eviction always has a victim). Env: DFLASH_KVFLASH.
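The three adjustments to an explicit pool size can be written out as follows. This is a sketch with hypothetical names; it assumes "rounded to 256" means rounding up to the next multiple, and that the steps apply in the order the text lists them (round, clamp, floor).

```python
def normalize_pool(requested, max_ctx, protected_min=512, granule=256):
    """Hypothetical sketch: round, clamp, then floor a requested pool size."""
    rounded = -(-requested // granule) * granule  # round up (assumed direction)
    clamped = min(rounded, max_ctx)               # never exceed --max-ctx
    return max(clamped, protected_min)            # eviction always has a victim


sizes = [normalize_pool(n, max_ctx=32768) for n in (8192, 1000, 100, 100000)]
print(sizes)  # [8192, 1024, 512, 32768]
```

The default `protected_min` of 512 matches the value given for qwen-family and gemma4; laguna would need a larger value, which the text does not specify.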

--kvflash-tau : reselect interval floor (default 64; the effective interval grows with history so rescore overhead stays ~15% of...
