The next generation of speculative decoding: DFlash and Spec V2


LMSYS Org Blog

Contents

DFlash: Parallel drafting with KV injection
Why is DFlash so fast?
Implementing DFlash in SGLang
Eliminating host overhead for DFlash with Spec V2 and overlap scheduling
High-performance DFlash draft models are available for a variety of models
Try DFlash in SGLang now
Acknowledgements

Z Lab, Modal, and SGLang Teams
June 15, 2026

Using Modal and Z Lab's DFlash speculative decoding models with SGLang’s Spec V2 engine, which is now the default, you can achieve state-of-the-art latencies for LLM inference serving. Our new, jointly released DFlash model for Qwen 3.5 397B-A17B achieves higher throughput than both the baseline model and native MTP speculation in every setting we benchmarked. At concurrency 1 on the HumanEval coding dataset, it achieves more than 4.3x the throughput of the baseline and 1.5x the throughput of MTP.

Workload: Qwen 3.5 397B-A17B (BF16), HumanEval. Settings: greedy decoding, thinking enabled, max new tokens 4096. Hardware: 8xB200 on Modal. Acceptance lengths are averaged across requests. Draft token/block counts selected for maximum throughput (MTP: 7 steps; DFlash: block size 16).

To celebrate this collaboration, we're releasing this model in triplicate across our Hugging Face organizations:

z-lab/Qwen3.5-397B-A17B-DFlash

modal-labs/Qwen3.5-397B-A17B-DFlash

lmsys/Qwen3.5-397B-A17B-DFlash

You can try the model yourself with this command:

export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1

python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --trust-remote-code \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path modal-labs/Qwen3.5-397B-A17B-DFlash \
  --speculative-dflash-block-size 8 \
  --speculative-draft-attention-backend fa4 \
  --attention-backend trtllm_mha \
  --linear-attn-prefill-backend triton \
  --linear-attn-decode-backend flashinfer \
  --mamba-scheduler-strategy extra_buffer \
  --tp-size 8 \
  --max-running-requests 32 \
  --cuda-graph-max-bs-decode 32 \
  --cuda-graph-backend-prefill tc_piecewise \
  --enable-flashinfer-allreduce-fusion \
  --mem-fraction-static 0.8 \
  --host 0.0.0.0

Below, we describe DFlash’s novel diffusion + KV injection strategy for speculative decoding, why that matters for achieving massive speedups, and how the teams at Z Lab, SGLang, and Modal worked together to make those speedups available to everyone.

DFlash: Parallel drafting with KV injection

Transformer-based large language models (LLMs) are powerful, but their autoregressive decoding process makes inference slow: tokens must be generated one by one, and each step has low arithmetic intensity, which makes decoding a poor fit for modern hardware.
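To make "low arithmetic intensity" concrete, here is a back-of-the-envelope sketch. It assumes a single dense weight matrix stored in BF16 (2 bytes per weight), counts 2 FLOPs per weight per token, and ignores activation and KV cache traffic. The function name and matrix size are illustrative, not taken from any real model.

```python
def arithmetic_intensity(num_tokens: int, d_in: int, d_out: int,
                         bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weight traffic for one (num_tokens x d_in) @ (d_in x d_out) matmul."""
    flops = 2 * num_tokens * d_in * d_out          # one multiply + one add per weight per token
    weight_bytes = bytes_per_weight * d_in * d_out  # every weight is read once per pass
    return flops / weight_bytes


# Decoding one token at a time vs. scoring a 16-token block in one pass.
one_token = arithmetic_intensity(1, 4096, 4096)
block_of_16 = arithmetic_intensity(16, 4096, 4096)
print(one_token, block_of_16)  # 1.0 16.0
```

Under these assumptions, a single-token decode step does about one FLOP per byte of weights read, while a pass over 16 tokens does about 16 times as much useful work for the same weight traffic. That gap is what parallel verification and parallel drafting try to exploit.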

Speculative decoding addresses this bottleneck by using a smaller, faster draft model to propose multiple tokens, which are then verified in parallel by the target LLM, with no impact on model quality.
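The draft-then-verify step can be sketched for greedy decoding as follows. This is a toy illustration, not SGLang's implementation: `verify_greedy` and `toy_target` are hypothetical names, and the toy calls the target once per position, whereas a real engine obtains all of these predictions from one batched forward pass.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]


def verify_greedy(target_next: NextToken, context: List[int], draft: List[int]) -> List[int]:
    """Return the tokens emitted by one speculative step under greedy decoding."""
    emitted: List[int] = []
    prefix = list(context)
    for proposed in draft:
        expected = target_next(prefix)
        if proposed != expected:
            # First mismatch: keep the target's own token and discard the rest of the draft.
            emitted.append(expected)
            return emitted
        emitted.append(proposed)
        prefix.append(proposed)
    # Every draft token matched, so the target contributes one extra token.
    emitted.append(target_next(prefix))
    return emitted


def toy_target(prefix: Sequence[int]) -> int:
    """Hypothetical stand-in for the target LLM: always predicts last token + 1 (mod 10)."""
    return (prefix[-1] + 1) % 10


print(verify_greedy(toy_target, [1, 2, 3], [4, 5, 9, 7]))  # [4, 5, 6]
```

The emitted tokens are exactly what the target would have produced on its own, which is why greedy speculative decoding does not change the output. Here two draft tokens are accepted and the target supplies the third.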

However, many speculative decoding methods, like the EAGLE series and the native multi-token prediction (MTP) modules in recent models like Gemma 4 and DeepSeek-V4, still rely on sequential autoregression, only in the draft model instead of the target. The draft model generates draft tokens one by one, which is a poor fit for modern hardware and limits the achievable speedup.

That’s why Z Lab developed DFlash, which uses a lightweight block diffusion draft model to generate an entire block of draft tokens in parallel, just the way GPUs and TPUs like it. Xiaomi's new MiMo v2.5-Pro-UltraSpeed uses DFlash to achieve over 1k output tokens per second.

Using block diffusion for speculative drafting is non-trivial. Directly training a small block diffusion model as the drafter leads to low acceptance length, while using an existing large diffusion LLM as the drafter, as SpecDiff-2 does, introduces a large memory footprint and high drafting cost.

The key insight of DFlash is simple: the target LLM knows the context best. Inspired by previous methods like Medusa, EAGLE and MTP (Gloeckle et al., 2024; Samragh et al., 2025), we extract hidden representations of the context tokens from the target model. Unlike previous work, we inject them directly into the draft model’s KV cache. This scales better with increased draft depth. KV injection also allows the draft model to skip modeling the full context from scratch and focus purely on predicting the next block of tokens – using the same tensors as the later layers of the target model!

With this design, DFlash leverages the rich, highly relevant contextual features produced by the target LLM while keeping the draft model extremely small and efficient. As a result, DFlash achieves high acceptance length with low drafting latency.
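The KV injection idea can be illustrated with a toy single-head attention in plain Python. This is a sketch under loud assumptions, not DFlash's actual architecture: the projection matrices `w_k` and `w_v`, all dimensions, and the random "hidden states" are hypothetical. It only shows the data flow, in which the draft's cache entries for the context are derived from the target's features and the whole block is computed in one pass.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, k, v):
    """Single-head scaled dot-product attention; every query sees every key."""
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v)


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
ctx_len, block_size, d_target, d_draft = 5, 4, 8, 4

# Hidden states the target already computed for the context tokens (random here).
target_hidden = rand_matrix(ctx_len, d_target, rng)

# Hypothetical learned projections from the target's hidden size to the draft's K/V size.
w_k = rand_matrix(d_target, d_draft, rng)
w_v = rand_matrix(d_target, d_draft, rng)

# "Injection": the draft's cache for the context is built from target features,
# so the draft never runs its own forward pass over the context.
kv_cache_k = matmul(target_hidden, w_k)
kv_cache_v = matmul(target_hidden, w_v)

# Draft-side states for the block being drafted (stand-ins for mask-token states).
block_q = rand_matrix(block_size, d_draft, rng)
block_k = rand_matrix(block_size, d_draft, rng)
block_v = rand_matrix(block_size, d_draft, rng)

# All block positions attend to the injected context plus the block itself, in one pass.
out = attend(block_q, kv_cache_k + block_k, kv_cache_v + block_v)
print(len(out), len(out[0]))  # 4 4
```

The point of the sketch is the shape of the computation: the number of draft passes does not grow with the block size, and the context enters the draft only through cached keys and values.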

Why is DFlash so fast?

Speculative decoding speedup mainly depends on two factors: how many drafted tokens are accepted per cycle and how much extra cost the draft model adds. DFlash improves both: diffusion drafting lowers draft...
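The two factors can be combined in a simple cost model. This is an illustrative sketch, not the authors' analysis: it assumes one target forward pass costs 1.0 whether it scores one token or a whole draft, and it expresses draft cost as a fraction of that pass. The numbers below are made up for illustration and are not measurements.

```python
def speedup(tokens_per_cycle: float, draft_cost: float, verify_cost: float = 1.0) -> float:
    """Estimated speedup over plain decoding, which yields one token per unit of cost."""
    return tokens_per_cycle / (verify_cost + draft_cost)


# Same acceptance length, different drafting styles (hypothetical costs).
sequential = speedup(tokens_per_cycle=4.0, draft_cost=7 * 0.05)  # 7 draft steps, one after another
parallel = speedup(tokens_per_cycle=4.0, draft_cost=0.05)        # one pass drafts the whole block
print(round(sequential, 2), round(parallel, 2))  # 2.96 3.81
```

In this model, raising the number of accepted tokens per cycle grows the numerator, while drafting a block in one pass keeps the denominator close to the cost of verification alone.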
