Adaptive speculative decoding: picking draft lengths at runtime


22 Jun 2026 · 20 min read
Cover: Caravaggio, The Cardsharps (c. 1594), Kimbell Art Museum, via Wikimedia Commons.

Last time we discussed the changing economics of speculative decoding. The strategy for getting the most tokens out of a running model has become more complex as the “market” for tokens in the running inference engine has become more dynamic. The price of dropped draft tokens is nonzero, and even verified draft tokens don’t come for free. The result is that there is space for mechanisms that choose how far we speculate at runtime, depending on dynamic, online policies.

In this post, we want to take some steps towards figuring out what the optimal policy is for speculating in this fast-changing environment.

First, a new model

Let’s swap the model from the last post, for variety’s sake. Qwen3.6-35B-A3B is a hybrid mixture-of-experts model from the Qwen team.

The expert half is pretty much the same as we worked out for DeepSeek Flash: see the last post for the full expert maths. Every layer routes each token to 8 of 256 experts plus one shared expert, which is the same coupon-collector picture from last time: at small batch each token tends to drag in its own fresh experts and amortises almost nothing, and the marginal token only rides resident experts for free once the batch size is large enough to have triggered most of them. (Contrast 6 of 256 for DeepSeek Flash: the knee when all experts are active, such as it is, arrives sooner.)
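To make the coupon-collector picture concrete, here is a minimal sketch of how many routed experts a batch is expected to touch. It assumes uniform, independent routing, which is an assumption of this sketch rather than something the post claims (real routers are skewed), and the function name is hypothetical.

```python
def expected_active_experts(batch_tokens: int, n_experts: int = 256, top_k: int = 8) -> float:
    """Expected number of distinct routed experts touched by a batch.

    Assumption (not from the post): every token picks top_k distinct experts
    uniformly at random, independently of the other tokens.
    """
    p_miss = 1.0 - top_k / n_experts  # chance that one token misses a given expert
    return n_experts * (1.0 - p_miss ** batch_tokens)


for batch in (1, 8, 32, 128, 512):
    active = expected_active_experts(batch)
    # "fresh experts per token" is the weight traffic each token has to pay for
    print(f"batch={batch:4d}  active~{active:6.1f}  fresh experts per token~{active / batch:5.2f}")
```

Under that assumption the per-token share of freshly loaded experts falls as the batch grows, which is the amortisation the paragraph above describes. The shared expert is left out because it is always resident.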

The attention half is pretty different. Recent Qwen models have bet on ‘hybrid attention’: mixing novel linear attention mechanisms (specifically, GatedDeltaNet) with traditional (GQA) attention. Qwen alternates its layers three to one: thirty of the forty are GatedDeltaNet linear-attention layers, and only ten are conventional full attention. (This is another path to KV cache compression, different from DeepSeek’s maybe more ambitious modifications of the standard attention mechanism.) The upshot is that the result from last time, that MLA becomes compute bound when speculating, doesn’t apply: both the linear-attention and GQA layers have an arithmetic intensity that doesn’t saturate at any reasonable draft length, so speculation keeps paying for long sequences.

So for Qwen, it comes down to: the expert tax from last time, plus an attention bill that is quartered and, for most layers, flat in context. The full roofline maths is in the appendix.
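As a rough illustration of the "quartered" attention bill, the sketch below compares the KV cache of the ten full-attention layers against a hypothetical variant where all forty layers use GQA. The layer counts come from the post; `kv_heads`, `head_dim`, and `bytes_per_elem` are placeholder values rather than the model's real configuration, and the fixed-size state of the linear-attention layers, which does not grow with context, is omitted.

```python
TOTAL_LAYERS = 40      # from the post
FULL_ATTN_LAYERS = 10  # from the post; the other 30 are GatedDeltaNet layers


def kv_cache_bytes(context_len: int, attn_layers: int,
                   kv_heads: int = 4, head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    """Bytes of K and V cached for one sequence across `attn_layers` layers.

    kv_heads, head_dim and bytes_per_elem are hypothetical placeholders.
    """
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_elem  # K plus V
    return attn_layers * per_token_per_layer * context_len


for ctx in (4_096, 32_768, 131_072):
    hybrid = kv_cache_bytes(ctx, FULL_ATTN_LAYERS)
    all_gqa = kv_cache_bytes(ctx, TOTAL_LAYERS)
    print(f"context={ctx:7d}  hybrid={hybrid / 2**20:8.1f} MiB  "
          f"all-GQA={all_gqa / 2**20:8.1f} MiB  ratio={hybrid / all_gqa:.2f}")
```

Whatever head configuration is plugged in, the ratio stays at one quarter, because only the layer count differs between the two cases.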

What we’ve left out, as we did before, is the cost of producing the speculated tokens.

The changing face of speculation

There are two research threads that have changed how draft models are built:

One major boost in the performance of draft models has been to condition them on richer outputs from the target model. Conditioning eases the training objective of the speculator, making higher accept lengths easier to achieve. (Conditioning on the hidden states has a drawback though: the speculator must run in series with the target model, since it needs the hidden states in order to run. Generally, hidden states from closer to the end of the target model are more useful than those from the start, and the speculator can only run once they’re available. So there’s little potential for overlap between speculator and target.)

The other half of the story powering the step change in speculative decoding is the hardware sympathy of the drafter. DFlash makes use of the conditioning on the hidden states to make diffusion, which has generally given poor performance for pure text generation, work for draft generation. The drafter workload is then much closer to its ridge point, so the drafter produces its own tokens much faster, and the result is higher throughput for the same accept length.

Both factors are driving massive improvements in throughput. (See this great work from the Modal, SGLang, and Z Lab teams.)
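For readers who want the ridge-point remark in numbers, here is a minimal roofline sketch. It treats a drafter pass as weight-bound, with every parameter read once and used for two FLOPs per token, and it ignores activations and attention. The hardware figures and the block size of 16 are hypothetical placeholders, not measurements.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte for a weight-bound pass: 2 FLOPs per parameter per token,
    with each parameter read from memory once per pass."""
    return 2.0 * tokens_per_pass / bytes_per_param


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Arithmetic intensity at which a device moves from memory bound to compute bound."""
    return peak_flops / mem_bandwidth


PEAK_FLOPS = 1.0e15     # hypothetical accelerator, FLOP/s
MEM_BANDWIDTH = 3.0e12  # hypothetical accelerator, bytes/s

ridge = ridge_point(PEAK_FLOPS, MEM_BANDWIDTH)
for tokens in (1, 16):  # 1 = one token per autoregressive pass, 16 = a hypothetical block
    intensity = arithmetic_intensity(tokens)
    print(f"tokens/pass={tokens:3d}  intensity={intensity:6.1f} FLOP/B  "
          f"fraction of ridge={intensity / ridge:.3f}")
```

Producing a block of tokens per pass multiplies the intensity by the block size, which is the sense in which a parallel drafter sits closer to the ridge than a one-token-at-a-time drafter.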

We discussed last time that there are two costs to pay during speculation: the cost of the draft model, and the cost of the verify. We focussed on the cost of the verify, and held the draft cost as a constant fraction of the target model.

This is bad modelling.
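For reference, the constant-fraction model being criticised can be sketched as below. It uses the standard expected-accepted-tokens formula under an i.i.d. per-token acceptance rate `alpha`; that acceptance model, the names `draft_frac` and `verify_cost`, and the default flat verify cost are assumptions of this sketch, not the post's numbers.

```python
from typing import Callable


def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verify step, counting the bonus token,
    assuming each draft token is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup(alpha: float, gamma: int, draft_frac: float,
            verify_cost: Callable[[int], float] = lambda g: 1.0) -> float:
    """Tokens per unit cost, where one plain target forward pass costs 1.

    draft_frac: cost of one draft token as a fraction of a target pass.
    verify_cost: cost of verifying gamma draft tokens (flat by default).
    """
    step_cost = gamma * draft_frac + verify_cost(gamma)
    return expected_tokens(alpha, gamma) / step_cost


def best_gamma(alpha: float, draft_frac: float, max_gamma: int = 16) -> int:
    """Draft length in [0, max_gamma] with the highest modelled speedup."""
    return max(range(max_gamma + 1), key=lambda g: speedup(alpha, g, draft_frac))


for alpha in (0.5, 0.7, 0.9):
    g = best_gamma(alpha, draft_frac=0.05)
    print(f"alpha={alpha:.1f}  best gamma={g:2d}  speedup~{speedup(alpha, g, 0.05):.2f}")
```

Even this simple model makes the best draft length depend on the acceptance rate, which is why a single fixed draft length leaves throughput on the table.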

The drafter has its own roofline

There are two different draft model architectures widely used at the moment.

The MTP head Qwen ships is a single transformer layer that drafts autoregressively. This is the EAGLE lineage, but in this case pretrained alongside the model. To propose a draft of $\gamma$ tokens it runs $\gamma$ times in sequence, each pass taking the last token and producing the next, each pass a single layer followed by a projection through the 248,320-entry vocabulary. So the drafter’s cost is linear in $\gamma$.
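The linear cost can be sketched with a crude memory-bound estimate: each pass reads the one transformer layer plus the vocabulary projection, and a draft of `gamma` tokens repeats that pass `gamma` times. The vocabulary size is from the post; `hidden_size`, `layer_params`, and `mem_bandwidth` are hypothetical inputs, and the sketch ignores compute, activations, and kernel launch overhead.

```python
VOCAB_SIZE = 248_320  # from the post


def mtp_pass_seconds(hidden_size: int, layer_params: float,
                     mem_bandwidth: float, bytes_per_param: float = 2.0) -> float:
    """Time for one autoregressive drafter pass, assuming it is bound by
    reading the layer weights and the vocabulary projection from memory."""
    lm_head_params = hidden_size * VOCAB_SIZE
    return (layer_params + lm_head_params) * bytes_per_param / mem_bandwidth


def mtp_draft_seconds(gamma: int, hidden_size: int, layer_params: float,
                      mem_bandwidth: float, bytes_per_param: float = 2.0) -> float:
    """Time to draft gamma tokens: gamma sequential passes, so linear in gamma."""
    return gamma * mtp_pass_seconds(hidden_size, layer_params,
                                    mem_bandwidth, bytes_per_param)


# Hypothetical placeholder values, chosen only to exercise the functions.
for gamma in (1, 2, 4, 8):
    t = mtp_draft_seconds(gamma, hidden_size=2048, layer_params=1.0e8,
                          mem_bandwidth=3.0e12)
    print(f"gamma={gamma}  draft time~{t * 1e6:7.1f} us")
```

With a vocabulary this large, the projection term `hidden_size * VOCAB_SIZE` can be a sizeable share of each pass, so it is worth keeping it explicit in any cost model.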

DFlash, on the other hand, is an eight-layer block-diffusion (Diffusion is a bit of an overloaded term here: the architecture and training methodology is pretty similar to the MLM...)

draft model from last attention tokens
