Speculation Is All You Need
June 19, 2026 • 20 minute read
Charles Frye (@charles_irl), Member of Technical Staff
David Wang (@_dcw02), Member of Technical Staff
Shankha Biswas (@ShankmanJunior), Member of Technical Staff
We are all-in on speculative decoding, and we’d like to tell you why.
But first: we’re big fans of Z Lab’s DFlash draft model architecture. That’s why we released a state-of-the-art DFlash speculator for Qwen 3.5 397B-A17B this week and worked closely with SGLang to make sure its performance was world-beating.
It’s also why we worked with Z Lab to train state-of-the-art speculators for more models in the Qwen series, which we’re releasing on Hugging Face today:
Qwen 3.6 35B-A3B-DFlash
Qwen 3.5 4B-DFlash
Qwen 3.5 9B-DFlash
Qwen 3.5 27B-DFlash
Qwen 3.5 35B-A3B-DFlash
Qwen 3.5 122B-A10B-DFlash
On top of the strong baseline of existing DFlash speculators, these new draft models achieve an additional 5 to 20% speedup on a wide variety of workloads.
That’s enough to run Qwen 3.5 122B-A10B at over 1000 tok/s at concurrency 1 on a B200 node. Without any speculation, the same model runs at 250 tok/s; the token timing simulator from our LLM Engineer’s Almanac shows roughly what that difference looks like.
Furthermore, the new draft models better preserve their acceptance lengths on very long context tasks, like agentic software engineering.
Below, we explain why we’re bullish on speculation, both for LLM inference acceleration and as part of the whole continuous improvement cycle for AI applications. But first, a tl;dr with the high-level takeaways.
To first order, speculative decoding is the only engine optimization that matters for achieving state-of-the-art inference performance at high interactivity. Days of back-breaking kernel optimization by expensive CUDA engineers, or of careful profiling and lifting of host-side bottlenecks, deliver speedups measured in small percentage points. It is a grind and a game of inches. Many inference providers wasted many engineering hours building proprietary engines filled with these optimizations.
Speculative decoding delivers much larger speedups, measured in integral factors like 2x or 3x, not 2% or 3%.
The chart below shows speedups we’ve observed in speculators that we have trained, and in the built-in MTP baselines, as a function of speculator quality. You can explore the data in a Modal Notebook if you’re interested in the details.
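To get a rough feel for why speedup tracks speculator quality, here is a back-of-the-envelope sketch in Python. It is not the model behind the chart or the measurements in this post. It assumes each drafted token is accepted independently with the same probability (`accept_rate`), that verifying a short draft costs about as much as one ordinary decode step, and that drafting costs a fixed fraction of a target forward pass (`draft_cost`). The function and parameter names are hypothetical.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes every drafted token is accepted independently with probability
    accept_rate. The count includes the one token the target model always
    contributes itself, so the result is sum(accept_rate**i for i in 0..draft_len).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    # Closed form of the geometric series above.
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def idealized_speedup(accept_rate: float, draft_len: int, draft_cost: float) -> float:
    """Idealized speedup over plain autoregressive decoding.

    draft_cost is the (assumed) cost of producing the whole draft, expressed
    as a fraction of one target forward pass. Verification of the draft is
    assumed to cost the same as a single ordinary decode step.
    """
    return expected_tokens_per_step(accept_rate, draft_len) / (1.0 + draft_cost)


if __name__ == "__main__":
    for rate in (0.5, 0.7, 0.8, 0.9):
        print(rate, round(idealized_speedup(rate, draft_len=7, draft_cost=0.05), 2))
```

Under these assumptions, an acceptance rate of 0.8 with a 7-token draft yields a bit over 4 tokens per target forward pass, which is the regime where speedups are counted in multiples rather than percentage points. Real acceptance is neither independent nor constant across positions, so treat this as intuition only.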
Proper support for speculative decoding is, therefore, more important than other optimizations. Open source inference engines like SGLang and vLLM have cottoned on and, in our experience, closed the gap with proprietary engines. Speculative decoding also generally composes with other work on inference engine performance.
Finally, when speculative decoding is customized to domain-specific data from an application, it delivers truly unbeatable speedups. That means speculative decoding is Bitter Lesson-pilled: because speculative decoding relies on machine learning under the hood, the speedup increases when you just throw more data and compute at the problem, no cracked kernel engineers required. It can therefore ride the same exponentials of continuous improvement in hardware, algorithms, autoresearch, and scale as the AI application it accelerates.
Speculative decoding is so critical to the success of contemporary self-hosted inference that you might even say that Speculation Is All You Need.
What is speculative decoding and why is it so important?
A brief recap: speculative decoding (aka “spec dec”) losslessly accelerates the “decode” phase of LLM inference, during which tokens are generated as output in response to an input.
This is a serial operation, because Transformer (and Transformer-like) language models generate output tokens autoregressively: each new token is conditioned on the tokens the model has already produced.
Speculative decoding turns this serial work into parallel work by passing in a sequence of tokens generated by another system, the speculator (aka “drafter” aka “draft model”). These tokens can be processed in parallel by the target model (the model being accelerated), just as the model processes the input tokens in parallel (during the “prefill phase”).
The target model computes its own output probabilities for the drafted tokens and applies a resampling technique (usually sequential rejection sampling). For deterministic/greedy decoding (aka temperature 0), this just means accepting the prefix of drafted tokens that the target model would have output autoregressively, rejecting all tokens after that, and appending one token predicted by the target model.
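The verification step described above can be sketched in a few lines of Python. This is a minimal illustration, not any engine's implementation: the function names, the plain-list representation of probabilities, and the premise that the target model's outputs for all k + 1 positions are already available from one parallel forward pass are assumptions made for the example. `verify_greedy` covers the temperature-0 case. `verify_sampling` follows the standard sequential rejection sampling scheme, which accepts a drafted token with probability min(1, p/q) and, on rejection, resamples from the positive part of p - q.

```python
import random
from typing import List, Sequence


def verify_greedy(draft: Sequence[int], target_argmax: Sequence[int]) -> List[int]:
    """Temperature-0 verification.

    draft:         k tokens proposed by the speculator.
    target_argmax: k + 1 tokens; entry i is the target model's greedy choice
                   given the context followed by draft[:i].
    Returns the accepted prefix plus one token chosen by the target.
    """
    if len(target_argmax) != len(draft) + 1:
        raise ValueError("target_argmax must have one more entry than draft")
    out: List[int] = []
    for drafted, wanted in zip(draft, target_argmax):
        if drafted != wanted:
            break  # first mismatch: reject this token and everything after it
        out.append(drafted)
    # The target's own token at the first mismatch, or a bonus token after
    # the draft if every drafted token was accepted.
    out.append(target_argmax[len(out)])
    return out


def verify_sampling(
    draft: Sequence[int],
    draft_probs: Sequence[Sequence[float]],
    target_probs: Sequence[Sequence[float]],
    rng: random.Random,
) -> List[int]:
    """Sequential rejection sampling for temperature > 0.

    draft_probs[i]:  speculator distribution q_i that draft[i] was sampled from.
    target_probs[i]: target distribution p_i at the same position (k + 1 entries).
    """
    if len(draft_probs) != len(draft) or len(target_probs) != len(draft) + 1:
        raise ValueError("need k draft distributions and k + 1 target distributions")
    out: List[int] = []
    for i, token in enumerate(draft):
        p, q = target_probs[i], draft_probs[i]
        # q[token] > 0 because the token was sampled from q.
        if rng.random() < min(1.0, p[token] / q[token]):
            out.append(token)
            continue
        # Rejected: resample from max(0, p - q), then stop. rng.choices
        # accepts unnormalized weights, so no explicit renormalization.
        residual = [max(0.0, pj - qj) for pj, qj in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # Every drafted token was accepted: sample one bonus token from the target.
    last = target_probs[-1]
    out.append(rng.choices(range(len(last)), weights=last)[0])
    return out
```

For example, `verify_greedy([5, 7, 9], [5, 7, 2, 4])` returns `[5, 7, 2]`: the first two drafted tokens match what the target would have produced, the third does not, and the target's own choice takes its place. Either way, every verification pass emits at least one token, so a poor draft costs time but never changes the output.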
To repeat, this acceleration is lossless. Speculative decoding produces sample sequences from the same distribution as the target model (up to sources of non-determinism like floating-point accumulation reordering).
The core intuition for speculative decoding is the same as that for speculative execution in microprocessors: sequential execution is so expensive that parallel execution of work...