lucebox-hub/optimizations/kvflash at main · Luce-Org/lucebox-hub · GitHub
Luce KVFlash
Lookahead sparse attention for dflash. Bounded KV residency on one GPU.
The attention KV cache lives in a fixed pool of slots; cold 64-token chunks page to host RAM, bit-exact and recallable. With pflash, its drafter doubles as a Memory Indexer that recalls the context the generation needs next.

Qwen3.6-27B Q4_K_M on a single RTX 3090: native 256K context at 38.6 tok/s with 72 MiB of resident KV, needle recall 88-100% at 6% residency, harness accuracy unchanged (36/36 vs full cache).
| configuration | decode tok/s | KV in VRAM (Q8_0) | needle (d=10/50/90%) |
|---|---|---|---|
| full cache @ 64K | 27.8 | 1152 MiB | 16/16 |
| full cache @ 128K | 19.6 | 2304 MiB | 16/16 |
| full cache @ 256K | 13.1 | 4608 MiB | 16/16 |
| KVFlash 4K @ 64K | 38.6 | 72 MiB | 14/16 |
| KVFlash 4K @ 128K | 38.6 | 72 MiB | 14/16 |
| KVFlash 4K @ 256K | 38.6 | 72 MiB | 15/16 |
Decode speed is flat at any context length (the per-step KV read is pool-sized, not context-sized), prefill is up to 2.8x faster, and a 256K prompt that costs 4.6 GiB of VRAM as a full cache costs 72 MiB resident + 4.2 GiB of host RAM. (The full-cache 256K rows are measured, not extrapolated: they fit the 24 GB card only thanks to Q8_0 KV; with F16 KV the cache alone is 9.2 GiB and 256K does not fit at all.)
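The table's figures imply a fixed per-token KV cost, which is why the resident footprint depends only on the pool size and not on the context length. A quick arithmetic check, using only numbers from the table above (the helper names are illustrative and not part of dflash):

```python
MIB = 1024 * 1024

# From the table: a full Q8_0 cache at 64K tokens occupies 1152 MiB.
FULL_CACHE_TOKENS = 64 * 1024
FULL_CACHE_BYTES = 1152 * MIB


def kv_bytes_per_token() -> int:
    """Per-token KV cost implied by the 64K full-cache row."""
    return FULL_CACHE_BYTES // FULL_CACHE_TOKENS


def kv_mib(tokens: int) -> float:
    """KV size in MiB for a given number of cached tokens."""
    return tokens * kv_bytes_per_token() / MIB


if __name__ == "__main__":
    print(kv_bytes_per_token() // 1024)  # 18 (KiB per token)
    print(kv_mib(4 * 1024))              # 72.0 (the 4K pool)
    print(kv_mib(256 * 1024))            # 4608.0 (full cache at 256K)
    print(4096 / 65536)                  # 0.0625 (residency at 64K)
```

The 4K pool therefore reproduces the 72 MiB figure at every context length, and the 6% residency quoted above corresponds to the 64K row.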
Usage
```shell
# recommended: drafter-scored residency, pool auto-sized from VRAM.
# pass --prefill-drafter so the drafter is guaranteed (no silent LRU fallback).
dflash_server model.gguf --max-ctx 32768 --kvflash auto \
  --prefill-drafter /opt/lucebox/models/drafter/Qwen3-0.6B-BF16.gguf

# drop the path to auto-probe (model dir, drafter/, draft/, /opt/lucebox/models/drafter/);
# falls back to LRU if none is found, so check the banner reads policy=drafter
dflash_server model.gguf --max-ctx 32768 --kvflash auto

# explicit pool size, recency-only LRU
dflash_server model.gguf --max-ctx 32768 --kvflash 8192 --kvflash-policy lru
```
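The auto-probe order mentioned in the comments can be pictured as a first-match search. The sketch below is an illustration, not the server's code: `probe_drafter` and `residency_policy` are hypothetical names, and it assumes `drafter/` and `draft/` are resolved relative to the model's directory.

```python
from pathlib import Path
from typing import Optional

DRAFTER_NAME = "Qwen3-0.6B-BF16.gguf"
SYSTEM_DRAFTER_DIR = Path("/opt/lucebox/models/drafter")


def probe_drafter(model_path: str,
                  system_dir: Path = SYSTEM_DRAFTER_DIR) -> Optional[Path]:
    """Return the first drafter file found in the documented probe order."""
    model_dir = Path(model_path).resolve().parent
    # Assumption: drafter/ and draft/ are subdirectories of the model dir.
    candidates = [model_dir, model_dir / "drafter", model_dir / "draft", system_dir]
    for directory in candidates:
        candidate = directory / DRAFTER_NAME
        if candidate.is_file():
            return candidate
    return None


def residency_policy(model_path: str,
                     system_dir: Path = SYSTEM_DRAFTER_DIR) -> str:
    """Drafter-scored residency if a drafter is found, otherwise LRU."""
    return "drafter" if probe_drafter(model_path, system_dir) else "lru"
```

Because a missing drafter degrades to LRU rather than failing, checking the startup banner (or passing --prefill-drafter explicitly) is the way to confirm which policy is active.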
Drafter-scored residency is the default policy on every model family. The server probes for Qwen3-0.6B-BF16.gguf next to the model (same dir, drafter/, draft/, then /opt/lucebox/models/drafter/) and lazy-loads it on the first reselect. --prefill-drafter overrides the location; prefill compression can stay off either way. Qwen-family targets feed the drafter their ids directly. laguna and gemma4 bridge the tokenizer gap with KvFlashCrossTokScorer: relevance is a property of the text, so the target's history is detokenized, re-tokenized for the drafter, scored, and mapped back to chunk boundaries by character spans. LRU is the fallback when no drafter is found (the banner says which policy you got) or the explicit choice via --kvflash-policy lru.

auto sizes the pool from the GPU, not from a fixed fraction. It takes half of the free VRAM left after weights (minus a reserve for compute buffers and the drafter), converts that to tokens at the model's KV density, and caps the result at --max-ctx and at the point where decode speed stays near the flat optimum (16384 tokens by default; DFLASH_KVFLASH_MAX_POOL overrides it). Bigger pools mean more resident chunks and fewer forced evictions of useful context; the cap keeps the per-step KV read small enough that decode stays near the small-pool speed.
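The auto sizing rule can be sketched as a few lines of arithmetic. This is a minimal illustration under stated assumptions, not the server's implementation: the function name and parameters are hypothetical, the reserve is assumed to be subtracted before halving, and the rounding and protected-minimum rules are left out.

```python
import os
from typing import Optional

DEFAULT_MAX_POOL = 16384  # default cap stated above


def auto_pool_tokens(free_vram_after_weights: int,
                     reserve_bytes: int,
                     kv_bytes_per_token: int,
                     max_ctx: int,
                     max_pool: Optional[int] = None) -> int:
    """Pool size in tokens: half the usable free VRAM at the model's KV density."""
    if max_pool is None:
        # DFLASH_KVFLASH_MAX_POOL overrides the default cap.
        max_pool = int(os.environ.get("DFLASH_KVFLASH_MAX_POOL", DEFAULT_MAX_POOL))
    # Assumption: the reserve comes off first, then the remainder is halved.
    budget = max(0, free_vram_after_weights - reserve_bytes) // 2
    tokens = budget // kv_bytes_per_token
    return min(tokens, max_pool, max_ctx)


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Plenty of VRAM: the decode-speed cap decides.
    print(auto_pool_tokens(8 * GIB, 2 * GIB, 18 * 1024, 32768, max_pool=16384))  # 16384
    # Tight VRAM: the memory budget decides.
    print(auto_pool_tokens(GIB, 3 * GIB // 4, 18 * 1024, 32768, max_pool=16384))  # 7281
```

The 18 KiB per-token density used in the example is the figure implied by the Q8_0 rows of the results table; other models and KV formats will differ.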
--kvflash: resident pool size (rounded to 256; clamped to --max-ctx; floored at the protected minimum, which is 512 for qwen-family and gemma4 and larger on laguna, where the SWA window stays resident, so eviction always has a victim). Env: DFLASH_KVFLASH.

--kvflash-tau: reselect interval floor (default 64; the effective interval grows with history so rescore overhead stays ~15% of...