Width vs. Depth: Speculating on the Margin

Here’s a fun (for some non-universal definition of fun) interview question. Imagine you’re running Qwen3.6-35B-A3B at batch size 1, on a single GPU, and you decide you want to increase throughput. For whatever reason, your engine can only work on 2 tokens at a time. Assume the draft model’s forward pass is free, that you don’t have to prefill anything to batch, and that the sequences are short enough that KV cache movement doesn’t come into play. You have two choices:

Run at batch size 2 by batching 2 random user sequences together.

Run at batch size 1, speculating 1 token ahead, so the verify works on 2 positions (the token you just sampled plus one draft), with a per-token acceptance rate α.

Which is better, assuming you only care about the total number of tokens being output per second?

Here’s a sensible answer:

Assuming that everything is memory bound, batching is always better for α < 1, because there’s no chance that a token added to the working set by increasing the batch size will be rejected.
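That sensible answer can be written down as a small sketch. It assumes each draft token is accepted independently with probability α, and that both options cost the same per forward pass, which is what "everything is memory bound" amounts to in this argument. The function name `expected_tokens_per_verify` is illustrative and not taken from the original analysis.

```python
def expected_tokens_per_verify(alpha: float, draft_len: int = 1) -> float:
    """Expected tokens committed by one verify pass over draft_len draft tokens.

    Assumption: each draft token is accepted independently with probability
    alpha, and the verify always commits at least one token (a correction or a
    bonus token), so the total is 1 + alpha + ... + alpha**draft_len.
    """
    return sum(alpha ** i for i in range(draft_len + 1))


# Option 1: a batch of 2 sequences gives 2 tokens per forward pass, none rejected.
width_tokens = 2.0
# Option 2: one sequence with one draft token gives 1 + alpha tokens per pass.
depth_tokens = expected_tokens_per_verify(alpha=0.9)

print(width_tokens, depth_tokens)
```

If the two passes cost the same, 2 is larger than 1 + α for every α < 1, which is exactly the sensible answer.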

But here’s something that comes out when you do the modelling:

Spending your 2 positions on one speculating sequence produces, globally, more output tokens per second than spending them on a batch of 2, even with α = 0.9.

How come?

The first story: depth can be cheaper than width

To find out, we ought to look at the data. (Collected from our last post: half a million draft rounds for each of our two draft models, recording the drafter’s own per-depth confidence and the number of tokens actually committed, plus separate captures of which experts every token routed through, and the same routing captures for DeepSeek-V4-Flash, all published as specdec-calibration.) The answer lives in MoE routing. MoE routing is a weird part of performance analysis of LLMs: one of the places where the semantic content of the data affects what work gets done. In principle, it can do confusing things, like make benchmarks on random data unrepresentative.

First, the empirical distribution of routed experts. (Previously, we’ve just assumed uniform routing, which gives us coupon-collector maths.)
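For reference, the uniform-routing baseline can be sketched as follows. It assumes every token picks `top_k` distinct experts out of `num_experts` uniformly at random, independently of every other token. The sizes 128 and 8 in the usage loop are placeholders for illustration, not the configuration of either model.

```python
def expected_distinct_experts(num_tokens: int, num_experts: int, top_k: int) -> float:
    """Expected distinct experts touched in one layer under uniform routing.

    A given expert is missed by one token with probability
    1 - top_k / num_experts, and by all num_tokens independent tokens with
    that probability raised to the power num_tokens.
    """
    miss_all = (1.0 - top_k / num_experts) ** num_tokens
    return num_experts * (1.0 - miss_all)


# Placeholder sizes (hypothetical, for illustration only).
for n in (1, 2, 4, 8):
    print(n, expected_distinct_experts(n, num_experts=128, top_k=8))
```

Under this baseline a second token almost doubles the number of experts touched, because two uniform draws rarely collide when `top_k` is small relative to `num_experts`.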

[Interactive figure: empirical distribution of routed experts. Selectable by layer (mean across layers, or layer 0 to layer 39), by model (Qwen3.6-35B-A3B, DeepSeek-V4-Flash), and by dataset (HumanEval; SPEED-Bench, all categories or one of coding, writing, qa, rag, math, reasoning, stem, humanities, summarization, multilingual, roleplay).]

Surprisingly non-uniform! A fit to the rank-vs-share curve shows that it decays roughly exponentially with rank: the busiest expert pulls several times its fair share. (I wonder if there’s a proxy metric in here for the data distribution: given pretraining with expert load balancing loss, can we determine the distribution of the training data by how balanced the experts are on different types of data?) The skew varies by domain, by model and by layer.
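One way to quantify a curve like this is sketched below. The post does not say how the fit was done, so the method here (ordinary least squares on the log of each expert’s share) is an assumption, as are the function name and the input layout: a list of routed-token counts per expert for one layer.

```python
import math
from typing import List, Tuple


def rank_share_fit(counts: List[int]) -> Tuple[float, float]:
    """Return (decay_rate, top_multiple) for one layer's per-expert token counts.

    decay_rate: b in share(rank) ~ a * exp(-b * rank), fitted by ordinary
        least squares on log(share). Experts with zero count are skipped.
    top_multiple: the busiest expert's share divided by the uniform share
        1 / len(counts), i.e. how many times its fair share it receives.
    Needs at least two experts with a nonzero count.
    """
    total = sum(counts)
    shares = sorted((c / total for c in counts), reverse=True)
    points = [(rank, math.log(s)) for rank, s in enumerate(shares) if s > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    decay_rate = -sxy / sxx
    top_multiple = shares[0] * len(counts)
    return decay_rate, top_multiple
```

Perfectly balanced routing gives a decay rate of zero and a top multiple of one, so both numbers measure how far a layer sits from the uniform assumption.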

This doesn’t by itself explain anything. But let’s look at the difference between the two choices on the table when we decide between width and depth: working on two randomly chosen tokens, or on two tokens that follow one from the other.

[Figure: the distinct experts one verify forward pass touches as N grows, three ways: N separate sequences (width), one sequence running N consecutive positions (depth), and the uniform coupon-collector.]
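Given routing captures, the width and depth counts could be computed along these lines. The data layout is an assumption: each sequence is a list holding, for every position, the set of expert ids that position routed to in a single layer. All names are hypothetical, and the toy routing at the end is made up to show the mechanics, not measured.

```python
from typing import Iterable, List, Set

RoutedSequence = List[Set[int]]  # per-position expert ids, one layer


def distinct_experts(positions: Iterable[Set[int]]) -> int:
    """Number of distinct experts touched by a group of positions."""
    touched: Set[int] = set()
    for experts in positions:
        touched |= experts
    return len(touched)


def width_experts(sequences: List[RoutedSequence], n: int, pos: int = 0) -> int:
    """Experts touched by one position from each of n separate sequences."""
    return distinct_experts(seq[pos] for seq in sequences[:n])


def depth_experts(sequence: RoutedSequence, n: int, start: int = 0) -> int:
    """Experts touched by n consecutive positions of a single sequence."""
    return distinct_experts(sequence[start:start + n])


# Toy, made-up routing: consecutive positions share experts, sequences do not.
seq_a = [{0, 1}, {1, 2}, {2, 3}]
seq_b = [{4, 5}, {5, 6}, {6, 7}]
print(width_experts([seq_a, seq_b], n=2))  # experts {0, 1, 4, 5}
print(depth_experts(seq_a, n=2))           # experts {0, 1, 2}
```

Averaging these counts over many starting positions and sequence groups, per layer, would give the width and depth curves; the uniform curve needs no data at all.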

So there’s the answer. At batch size 1 we’re memory-bound, and verifying a two-position speculative run moves less expert weight than adding a second sequence would. It does so by co-activation: speculated runs are more similar than randomly chosen data, so they activate more of the same experts. (Josh did some great work on trying to recover co-activation for batching here.) Even throwing away 10% of speculated tokens at α = 0.9, depth beats width.
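The trade can be made concrete with a back-of-envelope break-even. Assuming step time is proportional to bytes moved (the memory-bound case) and each draft token is accepted independently with probability α, depth wins when (1 + α) / cost_depth exceeds 2 / cost_width. The helpers below are a sketch of that inequality with hypothetical names, not the model used in the post.

```python
def depth_beats_width(alpha: float, cost_depth: float, cost_width: float) -> bool:
    """True if one sequence with one draft token out-produces a batch of two.

    cost_depth and cost_width are the time (or bytes moved) for one forward
    pass of each option, in the same units. Throughput is expected tokens per
    pass divided by that cost.
    """
    depth_throughput = (1.0 + alpha) / cost_depth
    width_throughput = 2.0 / cost_width
    return depth_throughput > width_throughput


def break_even_cost_ratio(alpha: float) -> float:
    """The cost_depth / cost_width ratio at which depth exactly ties width."""
    return (1.0 + alpha) / 2.0


print(break_even_cost_ratio(0.9))
```

At α = 0.9 the break-even ratio is 0.95, so under these assumptions the speculative pass has to be more than 5% cheaper than the two-sequence pass for depth to come out ahead.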

This is a toy problem, though it’s interesting to think about how we could make use of the insight. In real engines, it gets washed out by the cost of the...
