JetSpec Enables Up to 9.64x Lossless LLM Inference Speedup with Up to 1000 TPS

JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting | Hao AI Lab @ UCSD<br>Side-by-side comparison of decoding speed among JetSpec, DFlash and AR baseline.

TL;DR: Speculative decoding hits a scaling ceiling: a larger draft budget helps only while acceptance stays high and drafting stays cheap. Prior draft heads face a dilemma: autoregressive drafters condition on each path but pay a drafting cost that grows with tree depth, while block-diffusion drafters draft in one pass but score branches independently, creating plausible yet mutually inconsistent trees. JetSpec trains a causal parallel draft head over fused hidden states from a frozen target model, so candidate-tree scores follow the target’s own autoregressive factorization. The frozen target then verifies the full tree in one forward pass, losslessly. On Qwen3-8B with greedy decoding and a draft budget of 256, JetSpec reaches a 9.64x speedup on MATH-500 and 4.58x on open-ended chat, and these gains carry into real single-stream serving on JetSpec’s own engine, which averages around 1000 TPS on MATH-500 using a single B200 GPU.<br>Figure 1: End-to-end decoding speedup over standard autoregressive decoding on H100 GPUs across math, coding, and chat benchmarks. DFlash denotes the original block-parallel drafting method, DDTree is a tree-based variant of DFlash, and JetSpec denotes our method.<br>Background<br>Modern LLM serving is still bottlenecked by autoregressive decoding: each token depends on the previous one, so generation is inherently sequential. Speculative decoding accelerates this process by drafting multiple future tokens and verifying them with the target model, but its speedup is controlled by two factors: (1) how many drafted tokens are accepted, and (2) how cheaply those tokens are drafted. Increasing the draft budget only helps when acceptance stays high and cumulative draft overhead stays low.<br>Existing head-based speculative decoding methods expose a core trade-off. Autoregressive draft heads such as Medusa and EAGLE-style methods preserve the target’s factorization and can produce faithful continuations, but their drafting cost grows with tree depth. 
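A rough cost model makes the two factors concrete. This is a minimal sketch, not a formula from the JetSpec evaluation: the function name, its parameters, and every number below are hypothetical, and it assumes that each decoding step costs one target verification pass plus some number of sequential draft passes.

```python
def estimated_speedup(tokens_per_step: float,
                      draft_passes: int,
                      draft_cost_ratio: float,
                      verify_cost_ratio: float = 1.0) -> float:
    """Estimate speedup over plain autoregressive decoding (illustrative model).

    tokens_per_step:   mean number of tokens committed per verification step
    draft_passes:      sequential draft-head forward passes per step
    draft_cost_ratio:  cost of one draft pass relative to one target pass
    verify_cost_ratio: cost of one verification pass relative to one
                       ordinary target decoding step (assumed 1.0 here)
    """
    # Cost of one speculative step, measured in target decoding steps.
    step_cost = verify_cost_ratio + draft_passes * draft_cost_ratio
    # The baseline commits exactly one token per unit of cost.
    return tokens_per_step / step_cost


# Hypothetical numbers: same acceptance, different drafting styles.
sequential = estimated_speedup(4.0, draft_passes=8, draft_cost_ratio=0.05)
one_pass = estimated_speedup(4.0, draft_passes=1, draft_cost_ratio=0.05)
print(round(sequential, 2), round(one_pass, 2))  # 2.86 3.81
```

With acceptance held fixed, a drafter that needs one pass per tree level loses speedup as depth grows, whereas a one-pass drafter's cost stays flat. The model ignores the extra cost of verifying a larger tree, so treat it as intuition only.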
Parallel block-diffusion heads can draft a whole block in one pass, but positions are scored independently, so deeper branches can drift away from what the target would actually generate. Retrieval-based drafters avoid learned heads, but depend on lexical overlap or repeated text.<br>
Method | Drafting Style | Causal Draft Path | Tree Quality | Draft Cost | Speedup<br>
AR baseline | None | N/A | N/A | N/A | 1x<br>
AR draft heads | multiple-pass sequential | ✅ | 😃 | 💰💰💰 | 3~4x<br>
Block-diffusion heads | one-pass block draft | ❌ | 😐 | 💰 | 3~6x<br>
JetSpec | one-pass causal tree draft | ✅ | 😃 | 💰 | 4~10x<br>
Table 1: Qualitative comparison of speculative decoding families.<br>
This leads to the central question behind JetSpec: can we draft an entire speculative tree in one parallel pass while still scoring branches according to the target model’s causal, autoregressive factorization? JetSpec answers yes by combining causal parallel tree drafting with one-pass verification by the frozen target model.<br>At a high level, JetSpec targets both sides of the speculative-decoding bottleneck:<br>Low drafting cost: generate many tree nodes in one draft-head forward pass.<br>High acceptance: condition every node on its branch prefix, not just on its absolute future position.<br>Lossless verification: let the frozen target verify the tree and commit only the prefix it agrees with.<br>JetSpec<br>JetSpec trains a lightweight causal parallel draft head on top of a frozen target LLM. The head reuses rich multi-layer hidden features from the target, so drafting remains cheap, but it applies a tree-causal attention mask across draft slots. Each tree node can attend to the original prefix and its own ancestors, but not to unrelated sibling branches or descendants. As a result, all nodes are computed in parallel while every branch still follows an autoregressive-like dependency structure.<br>At inference time, the frozen target verifies every node in the speculative tree in a single forward pass under a tree-causal attention mask. 
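The tree-causal mask described above can be sketched in a few lines. This is an illustration under stated assumptions, not JetSpec's actual code: the `parents` encoding, the function name, and the choice to let each node attend to itself are all hypothetical conventions.

```python
from typing import List


def tree_causal_mask(parents: List[int], prefix_len: int) -> List[List[bool]]:
    """Build a boolean attention mask for a speculative tree (illustrative).

    parents[i] is the index of node i's parent, or -1 if node i hangs
    directly off the committed prefix. Row i of the result says which
    positions node i may attend to: the first `prefix_len` columns are
    the prefix, the remaining columns are the tree nodes.
    """
    num_nodes = len(parents)
    mask = [[False] * (prefix_len + num_nodes) for _ in range(num_nodes)]
    for node in range(num_nodes):
        # Every node sees the whole committed prefix.
        for col in range(prefix_len):
            mask[node][col] = True
        # Walk up the tree: the node sees itself and each ancestor,
        # but never a sibling branch or a descendant.
        current = node
        while current != -1:
            mask[node][prefix_len + current] = True
            current = parents[current]
    return mask


# Hypothetical tree: node 0 is the root, nodes 1 and 2 are its children,
# node 3 is a child of node 1.
for row in tree_causal_mask([-1, 0, 0, 1], prefix_len=2):
    print("".join("x" if allowed else "." for allowed in row))
```

Each row depends only on the node's own root-to-node path, so all rows can be evaluated in one batched forward pass while every branch still sees exactly the context it would have seen under sequential decoding.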
The acceptance rule follows standard speculative decoding and commits the longest accepted prefix, so JetSpec preserves the target model’s exact output distribution under the same sampling rule. In other words, JetSpec improves speed without changing what the target would generate.<br>Figure 2: JetSpec design overview. JetSpec extracts fused hidden features from the frozen target model and conditions a causal-parallel draft head on them to generate high-quality candidate trees in one forward pass.<br>Training the Head<br>Only the draft head is trained; the target model stays frozen. This lets JetSpec attach to a production model without changing its weights. The training procedure samples anchor positions in target-aligned sequences, builds future-token blocks, and supervises the head against the target’s own next-token distributions.<br>We train with a causal mask over selected anchor positions, with each anchor expanded...
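Returning to the verification step described earlier: under greedy decoding, committing the longest accepted prefix amounts to walking the tree from the root and following whichever child matches the target's own greedy token. The sketch below shows only that greedy case, which is an assumption on our part; sampling-based verification needs a different acceptance rule. The function name, argument layout, and example tokens are hypothetical.

```python
from typing import Dict, List, Sequence


def greedy_tree_accept(parents: Sequence[int],
                       draft_tokens: Sequence[str],
                       target_after_prefix: str,
                       target_after_node: Sequence[str]) -> List[str]:
    """Return the tokens committed after one verification pass (greedy case).

    parents[i]:           parent of node i, or -1 if it follows the prefix
    draft_tokens[i]:      token drafted at node i
    target_after_prefix:  target's greedy token right after the prefix
    target_after_node[i]: target's greedy token after node i, given the
                          path from the root to node i
    """
    children: Dict[int, List[int]] = {}
    for node, parent in enumerate(parents):
        children.setdefault(parent, []).append(node)

    committed: List[str] = []
    current = -1
    expected = target_after_prefix
    while True:
        # Accept a child only if it drafted exactly what the target wants.
        match = next((child for child in children.get(current, [])
                      if draft_tokens[child] == expected), None)
        if match is None:
            break
        committed.append(expected)
        current = match
        expected = target_after_node[match]
    # The target's own next token is always committed, as a correction
    # when drafting missed or as a bonus token when the path ran out.
    committed.append(expected)
    return committed


# Hypothetical tree: two candidates after the prefix, two under "the".
print(greedy_tree_accept(parents=[-1, -1, 0, 0],
                         draft_tokens=["the", "a", "cat", "dog"],
                         target_after_prefix="the",
                         target_after_node=["dog", "x", "sat", "ran"]))
# ['the', 'dog', 'ran']
```

Every committed token is one the target would have produced on its own, which is why the output matches plain greedy decoding token for token.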
