A 35B MoE on a 16 GB GPU, without the offload tax



June 2026<br>By Davide Ciffa<br>Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax<br>A 33-35B mixture-of-experts model fires only a handful of its experts per token, but to keep it on the GPU you still pay for all of them. Luce Spark pins the experts your traffic actually uses, offloads the rest to CPU, and decodes the whole token in one fused graph so offload stops costing speed. Qwen3.6 35B-A3B runs in 13.3 GiB (down from ~20.5) and Laguna XS.2 in 14.6 GiB (down from 18.8), both small enough for a 16 GB card that could not load them before, and decode holds ~100 tok/s, near the ~119 tok/s all-GPU ceiling. It tunes itself from live traffic. One flag, no calibration step.<br>TL;DR<br>33-35B MoE on a 16 GB GPU. Qwen3.6 35B-A3B: 13.3 GiB (was ~20.5). Laguna XS.2 33B-A3B: 14.6 GiB (was 18.8). Both measured on an RTX 3090, both under 16 GiB, so a 16 GB card now runs models it could not load at all.<br>Only the active experts stay on the GPU. An A3B model routes to ~8 of 256 experts per token. Spark calibrates which experts your traffic hits and keeps those hot; the long tail lives in system RAM and is swapped in on demand through a bounded GPU cache.<br>Self-tuning. The placement is learned from live routing and written next to the model. Each restart loads a better profile. No corpus, no offline calibration step required.<br>One command, both backends. dflash_server --spark works for laguna and qwen35moe. The server picks the cache size, loads the learned profile if present, and keeps persisting it.<br>Offload without the speed cliff. Under offload, laguna runs the whole token as one fused graph, not 40 per-layer graphs. At full residency that graph is bit-identical to all-GPU and just as fast (119 tok/s); at 60% residency it holds ~100 tok/s (1.5x over a naive offload at 66 tok/s).
On a 16 GB card the alternative is not slower, it is "does not run".<br>The problem: a sparse model with a dense memory bill<br>Qwen3.6 35B-A3B and Laguna XS.2 are both A3B models: 35B and 33B total parameters, but only ~3B active per token. The router picks roughly 8 of 256 experts at each layer and ignores the rest. The compute bill is small. The memory bill is not: to keep the model on the GPU you hold every expert in VRAM, because any of them might be next.<br>On a 24 GB card that fits, barely. The experts alone are 18.2 GiB on Qwen and 16.6 GiB on Laguna; add the non-expert weights and a KV cache and you are at 18-21 GiB before context. On a 16 GB card it does not fit at all. You are paying full price for parameters that, for any given request, are mostly idle.<br>Standard expert offloading puts the cold experts in system RAM and computes them on the CPU. That frees VRAM but it is slow if you offload the wrong ones: pick the resident set badly and you hit the CPU tier on a third of every token's routing. The resident set is the whole game.<br>How Spark works<br>Spark is built on the hot/cold MoE offload engine that already ships in lucebox-hub. It adds the two pieces that make offload actually fast: knowing which experts to keep, and a cheap way to fix that decision while serving.<br>Calibrated placement. The expert that should stay resident is the one your traffic routes to most. Spark accumulates per-(layer, expert) routing frequencies from real requests and pins the most-used set on the GPU. On held-out traffic this drops the cold-hit rate from 36% (a uniform split) to about 7%.<br>A bounded expert cache, copied async. A fixed ring of spare GPU slots. When a request hits a cold expert, its weights are copied (asynchronously, from pinned host memory, overlapped with compute) into a spare slot and served on the GPU, evicting the least-recently-used entry.
At 60% residency a few percent of routings still miss the resident set each token, but the copy is hidden under the matmuls instead of stalling them, so the cost is a modest loss of throughput rather than a cliff.<br>router picks 8 experts<br>hot (calibrated, pinned on GPU) ───────────► GPU<br>warm (in the cache ring) ───────────► GPU<br>cold miss ─ swap into a spare slot (LRU) ───► GPU<br>(rare after warmup, bounded VRAM)<br>The cache ring is a small over-allocation of the hot expert stack, so a swap is "copy three weight tensors into a spare slot and update one routing entry". The existing GPU FFN serves it with no special path. It is the same mechanism for both backends: laguna and qwen partition hot from cold on the host, so the swap is picked up by the lookup they already do.<br>Memory: a 33-35B MoE under 16 GiB<br>Peak VRAM measured on an RTX 3090, ctx 4096. "All-GPU" holds every expert resident; "Spark" pins ~60% of expert weight and swaps the rest through the cache.

| Model | All-GPU VRAM | Spark VRAM | Saved | Fits 16 GB? |
| --- | --- | --- | --- | --- |
| Laguna XS.2 (33B-A3B) | 18.8 GiB | 14.6 GiB | 4.2 GiB | yes |
| Qwen3.6 35B-A3B | ~20.5 GiB | 13.3 GiB | ~7 GiB | yes |

The footprint is set by two numbers you control: the share of experts pinned hot and the number of cache slots. Both are capped, so the total never drifts above the budget. Trade cache slots against context...
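
To make the hot / warm / cold split from "How Spark works" concrete, here is a minimal Python sketch. It is a toy model, not the lucebox-hub implementation: the class name `ExpertPlacement`, its methods, and the `hot_budget` and `cache_slots` parameters are all hypothetical names chosen for illustration. It only counts where each routing would land; the real engine copies weights asynchronously from pinned host memory, which this sketch does not attempt.

```python
from collections import Counter, OrderedDict


class ExpertPlacement:
    """Toy hot/warm/cold bookkeeping. All names here are hypothetical."""

    def __init__(self, hot_budget, cache_slots):
        self.hot_budget = hot_budget    # (layer, expert) pairs pinned on the GPU
        self.cache_slots = cache_slots  # size of the bounded cache ring
        self.freq = Counter()           # routing counts per (layer, expert)
        self.hot = set()                # currently pinned pairs
        self.cache = OrderedDict()      # warm pairs, least recently used first
        self.stats = Counter()          # tallies of "hot", "warm", "cold"

    def recalibrate(self):
        """Pin the most frequently routed pairs seen so far."""
        self.hot = {key for key, _ in self.freq.most_common(self.hot_budget)}
        # A pair that became hot no longer needs a cache slot.
        for key in list(self.cache):
            if key in self.hot:
                del self.cache[key]

    def route(self, layer, experts):
        """Record one token's routing at one layer and classify each expert."""
        for expert in experts:
            key = (layer, expert)
            self.freq[key] += 1
            if key in self.hot:
                self.stats["hot"] += 1
            elif key in self.cache:
                self.cache.move_to_end(key)  # mark as most recently used
                self.stats["warm"] += 1
            else:
                self.stats["cold"] += 1
                if self.cache_slots > 0:
                    if len(self.cache) >= self.cache_slots:
                        self.cache.popitem(last=False)  # evict the LRU entry
                    self.cache[key] = True  # stands in for the weight copy

    def cold_hit_rate(self):
        """Fraction of routings so far that missed both hot set and cache."""
        total = sum(self.stats.values())
        return self.stats["cold"] / total if total else 0.0
```

Feeding routing decisions through `route`, calling `recalibrate` periodically, and watching `cold_hit_rate` mirrors the loop described above: the hot set tracks observed traffic, and the cache can never hold more than `cache_slots` entries, so its footprint stays bounded.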
