The Economics of Speculative Decoding

8 Jun 2026 · 19 min read · Cover: William Holbrook Beard, The Bulls and Bears in the Market (1879), via Wikimedia Commons.

Speculative decoding is one of the cleanest performance wins in inference optimisation: it’s lossless, it hits decode latency when not much else does, and in its standard formulation it’s simple and elegant.

It works by looking forwards: speculative decoding takes a position on what tokens will come next. For dense transformers the bet is riskless: accepted tokens pay off and rejected tokens cost nothing, a clean arbitrage on the compute that sits idle while decode waits on memory bandwidth.

A burst of recent research has pushed the envelope on how far forwards we can take that bet, for example Eagle 3.1, DFlash and SSD.

This post looks at two architectural shifts that have changed the underlying economics of speculation: what mixture-of-experts routing does to the decode roofline, and how compressed attention takes away the slack that used to make speculated tokens free.

Then it works through what they mean for when, and how far ahead, we should speculate.

The expert tax

FFN layers in older, dense transformers (like the venerable Llama series) have a simple roofline with batch size: arithmetic intensity climbs linearly with batch size as weights get reused across the batch, then flattens onto the compute ceiling. (Sidenote: I wrote about this model before, here.)

The win for speculative decoding is clear. If you’re on the slope of the roofline you’re memory bound, and speculated tokens increase the amount of compute you’re doing without increasing the memory transfer. So both accepted & rejected tokens are free until they push you over the knee.
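A minimal sketch of that argument, under loud assumptions: the hardware figures and layer size below are hypothetical, activation traffic is ignored, and the layer is a single dense weight matrix read once per step. It only illustrates why step time stays flat below the knee.

```python
# Toy roofline for one dense FFN weight matrix during decode.
# All numbers are hypothetical, chosen only for illustration.

PEAK_FLOPS = 1.0e15      # FLOP/s (assumed accelerator peak)
MEM_BW = 2.0e12          # bytes/s (assumed memory bandwidth)
BYTES_PER_PARAM = 2.0    # assumed 16-bit weights
N_PARAMS = 1.0e9         # parameters in the layer (assumed)


def step_time(batch: int) -> float:
    """Seconds for one pass over `batch` tokens, ignoring activation traffic."""
    flops = 2.0 * N_PARAMS * batch            # one multiply-add per weight per token
    bytes_moved = N_PARAMS * BYTES_PER_PARAM  # weights are read once per step
    return max(flops / PEAK_FLOPS, bytes_moved / MEM_BW)


def knee_batch() -> float:
    """Batch size at which compute time catches up with weight-transfer time."""
    return (PEAK_FLOPS / MEM_BW) * BYTES_PER_PARAM / 2.0


if __name__ == "__main__":
    for b in (1, 8, 64, 1000):
        print(b, step_time(b))
    print("knee at batch", knee_batch())
```

Below the knee, `step_time` returns the same value for every batch size, so extra speculated tokens in the batch add no time; above it, time grows with the batch.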

Modern models almost invariably use mixture-of-experts (MoE) layers in place of simple dense FFNs. (Sidenote: with some interesting exceptions.) Each token passes first through a ‘routing’ layer, which ranks the experts by affinity. The token’s hidden state is sent to the top $k$ experts, then the results are recombined.

This routing means that the arithmetic intensity of the MoE layer can depend on the actual content of the hidden-state inputs, not just their shape. In practice, models are trained to keep the experts balanced, which helps both training and large-scale inference: if $B$ tokens come in, each expert of $E$ total should process a fraction $B/E$ of the total.

From here on, take DeepSeek-V4-Flash as an example: $k=6$ routed experts of $E=256$, plus one always-on shared expert. The intensity-vs-batch curve changes in two ways vs. a dense equivalent.

Barely amortising at the bottom. At small batch each new token added to the batch tends to activate fresh experts (at batch 2 the chance the new token’s experts already match is small), so it drags its own weights across the bus and gets little to no amortisation. The intensity leaves the origin at only half its eventual slope, so a token added here, speculated or not, pays close to full freight for its experts.
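To put rough numbers on "fresh experts", here is a small sketch assuming uniform, independent routing, with each token picking $k$ distinct experts out of $E$. The function names are illustrative, and the always-on shared expert is left out because it is loaded regardless.

```python
from math import comb

E = 256  # routed experts (figure quoted in the text)
K = 6    # routed experts chosen per token (figure quoted in the text)


def p_no_overlap(e: int = E, k: int = K) -> float:
    """Probability that a second token shares none of the first token's experts."""
    return comb(e - k, k) / comb(e, k)


def expected_active_experts(batch: int, e: int = E, k: int = K) -> float:
    """Expected number of distinct routed experts touched by `batch` tokens."""
    # A given expert is missed by one token with probability 1 - k/e,
    # and by all `batch` tokens with that probability raised to `batch`.
    return e * (1.0 - (1.0 - k / e) ** batch)


if __name__ == "__main__":
    print("P(no shared expert at batch 2):", p_no_overlap())
    for b in (1, 2, 4, 16, 43):
        print(b, expected_active_experts(b))
```

Under these assumptions a second token almost always brings a nearly full set of new expert weights with it, which is the "pays close to full freight" effect described above.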

Shallower slope / distant knee, same ceiling. Once every expert is being triggered, the MoE line climbs more gently, reaching the same ceiling only at a far larger batch. The free-token band is much wider.

Dense climbs steeply; the MoE is shallower by a factor $(k+1)/(E+1)$. The shaded region under each line is the memory-bound stretch, where speculated tokens are roughly free; it runs much wider for the MoE. This assumes uniform routing to experts, which is a good assumption for DeepSeek, and single-node deployment (expert parallelism changes things a bit). We’re using the fp4 threshold since DeepSeek’s experts are natively mxfp4. Not visible on this plot, because the MoE roofline is so shallow: the curve between $B=0$ and roughly $B=43$, where new experts are still being brought in.
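The caption's numbers can be worked through in a few lines. This is a sketch, not the plot's actual source: it assumes every expert's weights are read each step once the batch is large, and that the dense comparison reads the same total weights but uses all of them for every token. The helper names are illustrative.

```python
E = 256  # routed experts
K = 6    # routed experts used per token; +1 below is the shared expert


def slope_ratio(e: int = E, k: int = K) -> float:
    """MoE slope relative to dense: k+1 experts of compute per e+1 experts loaded."""
    return (k + 1) / (e + 1)


def knee_multiplier(e: int = E, k: int = K) -> float:
    """How much further out the knee sits if the ceiling is unchanged."""
    return (e + 1) / (k + 1)


def saturation_batch(e: int = E, k: int = K) -> float:
    """Batch at which batch * k equals e, i.e. enough picks to cover every expert once."""
    return e / k


if __name__ == "__main__":
    print("slope ratio:", slope_ratio())
    print("knee multiplier:", knee_multiplier())
    print("saturation batch:", saturation_batch())
```

The last value lands just under 43, which lines up with the caption if that figure marks the batch where $B \cdot k = E$.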

The whole idea of speculative decoding is to amortise the weight transfer in autoregressive decoding between multiple steps. Notably, the chart tells us that at batch size 1 this barely works for the MoE layers. But as batch size grows past this low region, there’s a much larger space in which speculative decoding might pay.

The implications for speculative decoding are that:

The win when speculative tokens are accepted is no longer so big.

The penalty when speculative tokens are rejected is no longer zero.

Both the win & the penalty from speculative decoding change nonlinearly with batch size.

The changing face of attention

The ‘expert tax’ at low batch size is part of the story that’s changed. The other part is attention. A recap: the term for the ratio of FLOPs to memory transferred for an operation is arithmetic intensity. You can figure out whether an operation is memory bound or compute bound by comparing its arithmetic intensity to the ratio of peak FLOP/s to memory bandwidth on the hardware you’ll run the operation on.
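Written out, the recap looks like the following. The symbols $I$, $\pi$ (peak FLOP/s) and $\beta$ (memory bandwidth in bytes/s) are my own shorthand rather than notation from the rest of the post.

```latex
$$
I = \frac{\text{FLOPs}}{\text{bytes moved}},
\qquad
\text{memory bound} \iff I < \frac{\pi}{\beta},
\qquad
t \approx \max\!\left(\frac{\text{FLOPs}}{\pi},\ \frac{\text{bytes moved}}{\beta}\right)
$$
```

When $I$ sits below $\pi/\beta$ the second term of the max dominates, so extra FLOPs that move no extra bytes leave $t$ unchanged.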

Generically, we can write the arithmetic intensity of the attention operation...
