The economics of speculative decoding
8 Jun 2026 · 19 min read
Cover: William Holbrook Beard, The Bulls and Bears in the Market (1879), via Wikimedia Commons.
Speculative decoding is one of the cleanest performance wins in inference optimisation: it’s lossless, it cuts decode latency when not much else does, and in its standard formulation it’s simple and elegant.
It works by looking forwards: speculative decoding takes a position on what tokens will come next. For dense transformers the bet is riskless: accepted tokens pay off and rejected tokens cost nothing. It is a clean arbitrage on compute that would otherwise sit idle waiting on memory bandwidth.
A burst of research activity has recently pushed the envelope on how far forwards we can take that bet, for example Eagle 3.1, DFlash, SSD.
This post looks at two architectural shifts that have changed the underlying economics of speculation: what mixture-of-experts routing does to the decode roofline, and how compressed attention takes away the slack that used to make speculated tokens free.
Then it works through what they mean for when, and how far ahead, we should speculate.
The expert tax
FFN layers in older, dense transformers (like the venerable Llama series, which I wrote about before) have a simple roofline with batch size: arithmetic intensity climbs linearly with batch size as weights get reused across the batch, then flattens onto the compute ceiling.
The win for speculative decoding is clear. If you’re on the slope of the roofline you’re memory bound, and speculated tokens increase the amount of compute you’re doing without increasing the memory transfer. So both accepted and rejected tokens are free until they push you over the knee.
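As a rough sketch of that dense roofline, here is the arithmetic with only weight traffic counted (activations and KV cache are ignored) and the hardware reduced to two ceilings. The function names and the hardware figures are illustrative assumptions, not measurements of any real accelerator.

```python
def dense_ffn_intensity(batch: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weight traffic for a dense FFN at a given batch size."""
    # Each weight does one multiply-add (2 FLOPs) per token in the batch,
    # but is read from memory only once per decode step.
    return 2.0 * batch / bytes_per_param


def knee_batch(peak_flops: float, mem_bandwidth: float,
               bytes_per_param: float = 2.0) -> float:
    """Batch size at which the dense FFN stops being memory bound."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte the hardware can sustain
    return ridge * bytes_per_param / 2.0


def step_time(batch: int, n_params: float, peak_flops: float,
              mem_bandwidth: float, bytes_per_param: float = 2.0) -> float:
    """Idealised time for one pass over the weights: the slower of compute and memory."""
    compute_s = 2.0 * n_params * batch / peak_flops
    memory_s = n_params * bytes_per_param / mem_bandwidth
    return max(compute_s, memory_s)


if __name__ == "__main__":
    # Hypothetical accelerator: 1e15 FLOP/s, 4e12 bytes/s, 16-bit weights.
    peak, bw, params = 1e15, 4e12, 1e9
    print("knee batch:", knee_batch(peak, bw))
    for b in (1, 8, 500):
        print(b, step_time(b, params, peak, bw))
```

Under these made-up numbers, a step at batch 8 takes the same idealised time as a step at batch 1, which is the sense in which extra speculated tokens are free below the knee.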
Modern models almost invariably (with some interesting exceptions) use mixture-of-experts (MoE) layers in place of simple dense FFNs. Each token passes first through a ‘routing’ layer, which ranks the experts by affinity. The token hidden state is sent to the top $k$ experts, then the results are recombined.
This routing means that the arithmetic intensity of the MoE layer can depend on the actual content of the hidden state inputs, not just the shape. In practice, one training objective (for both training and large-scale inference reasons) is to keep the experts balanced: if $B$ tokens come in, each of the $E$ experts should process an equal share, roughly $kB/E$ tokens, since each token visits $k$ experts.
From here on, take DeepSeek-V4-Flash as an example: $k=6$ routed experts out of $E=256$, plus one always-on shared expert. The intensity-vs-batch curve changes in two ways vs. a dense equivalent.
Barely amortising at the bottom. At small batch each new token added to the batch tends to activate fresh experts (at batch 2, under uniform routing, each of the new token’s experts has only about a $k/E \approx 2.3\%$ chance of already being loaded), so it drags its own weights across the bus and gets little to no amortisation. The intensity leaves the origin at only half its eventual slope, so a token added here, speculated or not, pays close to full freight for its experts.
Shallower slope / distant knee, same ceiling. Once every expert is being triggered, the MoE line climbs more gently, reaching the same ceiling only at a far larger batch. The free-token band is much wider.
Figure: Dense climbs steeply; the MoE is shallower by a factor $(k+1)/(E+1)$. The shaded region under each line is the memory-bound stretch, where speculated tokens are roughly free; it runs much wider for the MoE. The plot assumes uniform routing to experts, which is a good assumption for DeepSeek, and single-node deployment (expert parallelism changes things a bit). We’re using the fp4 threshold since DeepSeek’s experts are natively mxfp4. Not visible on this plot, because of the shallowness of the MoE roofline: the curve between $B=0$ and $B \approx 43$, where new experts are being brought in.
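The shape of that curve can be sketched under the same uniform-routing assumption. This is a minimal model, not necessarily the one behind the chart: it counts weight traffic only, treats fp4 as 0.5 bytes per parameter (ignoring mxfp4 scale overhead), assumes the shared expert is the same size as a routed one, and plugs the expected number of active experts into the ratio. All function names are hypothetical.

```python
def expected_active_experts(batch: int, k: int = 6, n_experts: int = 256) -> float:
    """Expected number of distinct routed experts touched by `batch` tokens.

    Assumption: every token picks k distinct experts uniformly at random,
    independently of the other tokens.
    """
    p_untouched = (1.0 - k / n_experts) ** batch  # chance a given expert sees no token
    return n_experts * (1.0 - p_untouched)


def moe_intensity(batch: int, k: int = 6, n_experts: int = 256,
                  n_shared: int = 1, bytes_per_param: float = 0.5) -> float:
    """Approximate FLOPs per byte of weight traffic for the MoE layer.

    Work and traffic are measured in units of one expert's parameter count,
    which cancels in the ratio. Uses the expected expert count, so this is an
    approximation rather than an exact expectation of the ratio.
    """
    flops = 2.0 * batch * (k + n_shared)
    active = expected_active_experts(batch, k, n_experts) + n_shared
    return flops / (active * bytes_per_param)


def dense_intensity(batch: int, bytes_per_param: float = 0.5) -> float:
    """Dense equivalent: every weight is used by every token."""
    return 2.0 * batch / bytes_per_param


if __name__ == "__main__":
    for b in (1, 2, 8, 43, 512, 10_000):
        ratio = moe_intensity(b) / dense_intensity(b)
        print(f"B={b:>6}  active={expected_active_experts(b):7.1f}  MoE/dense={ratio:.4f}")
```

In this model the MoE and dense intensities coincide at batch 1, and at large batch their ratio settles at $(k+1)/(E+1) = 7/257 \approx 0.027$, matching the slope factor in the caption.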
The whole idea of speculative decoding is to amortise the weight transfer in autoregressive decoding across multiple steps. Notably, the chart tells us that at batch size $1$ this barely works for the MoE layers. But as batch size grows past this low region, there’s a much larger space in which speculative decoding might pay.
The implications for speculative decoding are that:
The win when speculative tokens are accepted is no longer so big.
The penalty when speculative tokens are rejected is no longer zero.
Both the win and the penalty from speculative decoding change nonlinearly with batch size.
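To put rough numbers on the rejected-token penalty, one can ask how much fresh expert weight a single extra token drags in, given how many tokens are already in the step. The sketch below assumes uniform, independent top-k routing with the $k=6$, $E=256$ configuration used above. Speculated tokens from the same sequence may well route in a correlated way, so treat this as an upper-bound-flavoured illustration with hypothetical function names.

```python
def expected_new_experts(batch: int, k: int = 6, n_experts: int = 256) -> float:
    """Expected number of not-yet-loaded experts hit by one extra token.

    `batch` is the number of tokens already in the step. Assumption: uniform,
    independent top-k routing.
    """
    # Each of the extra token's k experts is new only if none of the
    # existing `batch` tokens selected it.
    return k * (1.0 - k / n_experts) ** batch


def marginal_weight_fraction(batch: int, k: int = 6, n_experts: int = 256) -> float:
    """Share of a token's routed-expert weights that must be freshly read (0 to 1)."""
    return expected_new_experts(batch, k, n_experts) / k


if __name__ == "__main__":
    for b in (1, 4, 16, 43, 128, 512):
        print(f"batch={b:>4}  fresh weight share={marginal_weight_fraction(b):.3f}")
```

Under these assumptions a token added to a batch of 1 freshly loads about 98% of its routed-expert weights, whether or not it is later accepted, while by batch 128 the share is down to about 5%. The decay is geometric in batch size, which is one way to see the nonlinearity.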
The changing face of attention
The ‘expert tax’ at low batch size is one part of the story that’s changed. The other part is attention. A recap: the ratio of FLOPs performed to bytes of memory transferred for an operation is called its arithmetic intensity. You can figure out whether an operation is memory bound or compute bound by comparing its arithmetic intensity to the ratio of peak FLOP/s to memory bandwidth on the hardware you’ll run the operation on.
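In symbols, that recap reads as follows. The names $I$, $F$, $M$, $\pi$ and $\beta$ are shorthand introduced here for illustration, not notation taken from elsewhere in this post: $F$ is the FLOPs an operation performs, $M$ the bytes it moves, $\pi$ the hardware’s peak FLOP/s and $\beta$ its memory bandwidth in bytes/s.

$$
I = \frac{F}{M}, \qquad \text{memory bound} \iff I < \frac{\pi}{\beta}, \qquad \text{compute bound} \iff I > \frac{\pi}{\beta}
$$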
Generically, we can write the arithmetic intensity of the attention operation...