Width vs. Depth: Speculating on the Margin


Here's a fun (for some non-universal definition of fun) interview question. Imagine you're running Qwen3.6-35B-A3B at batch size 1, on a single GPU, and you decide you want to increase throughput. For whatever reason, your engine can only work on 2 tokens at a time. Assume the draft model's forward pass is free, that you don't have to prefill anything to batch, and that the sequences are short enough that KV cache movement doesn't come into play. You have two choices:

1. Run at batch size 2 by batching 2 random user sequences together.

2. Run at batch size 1, speculating 1 token ahead, so the verify works on 2 positions (the token you just sampled plus one draft), with a per-token acceptance rate $\alpha$.

Which is better, assuming you only care about the total number of tokens being output per second?

Here’s a sensible answer:

Assuming that everything is memory bound, batching is always better for $\alpha < 1$, because there's no chance that a token added to the working set by increasing the batch size will be rejected.
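The sensible answer can be written as a one-line comparison. This is a sketch under one assumption: $T$ (a symbol introduced here, not in the original) is the time of one forward pass over 2 positions, taken to be the same whether the positions come from two sequences or from one speculating sequence. Batching always commits 2 tokens per forward. Speculation commits the verified token plus the draft with probability $\alpha$:

$$
\text{width: } \frac{2}{T}, \qquad \text{depth: } \frac{1 + \alpha}{T}, \qquad \frac{1 + \alpha}{T} < \frac{2}{T} \iff \alpha < 1 .
$$

The conclusion depends entirely on the two forwards costing the same $T$.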

But here’s something that comes out when you do the modelling:

Spending your 2 positions on one speculating sequence produces, globally, more output tokens per second than spending them on a batch of 2, even with $\alpha = 0.9$.

How come?

The first story: depth can be cheaper than width

To find out, we ought to look at the data. It was collected from our last post: half a million draft rounds for each of our two draft models, recording the drafter's own per-depth confidence and the number of tokens actually committed, plus separate captures of which experts every token routed through, and the same routing captures for DeepSeek-V4-Flash (all published as specdec-calibration).

The answer lives in MoE routing. MoE routing is a weird part of performance analysis of LLMs: one of the places where the semantic content of the data affects what work gets done. In principle, it can do confusing things, like make benchmarks on random data unrepresentative.

First, the empirical distribution of routed experts (previously, we've just assumed uniform routing, which gives us coupon-collector maths):

[Interactive figure: empirical distribution of routed experts. Selectors: layer (mean across layers, or layer 0 to layer 39); model (Qwen3.6-35B-A3B, DeepSeek-V4-Flash); dataset (HumanEval, or SPEED-Bench: all categories, coding, writing, qa, rag, math, reasoning, stem, humanities, summarization, multilingual, roleplay).]

Surprisingly non-uniform! A fit to the rank-vs-share curve shows share decaying roughly exponentially with rank: the busiest expert pulls several times its fair share. It varies by domain, by model and by layer. (I wonder if there's a proxy metric in here for the data distribution: given pretraining with expert load balancing loss, can we determine the distribution of the training data by how balanced the experts are on different types of data?)
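As a rough illustration of what fitting the rank-vs-share curve could look like, here is a minimal sketch. It assumes you have per-expert routing counts for one layer. The function names (`rank_share`, `fit_exponential_decay`) and the log-linear least-squares fit are illustrative choices, not necessarily the procedure used for the figure.

```python
import math

def rank_share(counts):
    """Turn per-expert routing counts into shares, sorted busiest first."""
    total = sum(counts)
    return sorted((c / total for c in counts), reverse=True)

def fit_exponential_decay(shares):
    """Least-squares fit of log(share) = log(a) - b * rank.

    Returns (a, b). Experts with zero share are skipped, since log(0)
    is undefined. Assumes at least two experts have non-zero share.
    """
    pts = [(rank, math.log(s)) for rank, s in enumerate(shares) if s > 0]
    if len(pts) < 2:
        raise ValueError("need at least two non-zero shares to fit a line")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def busiest_over_fair_share(shares):
    """How many times its fair share (1 / n_experts) the top expert pulls."""
    return shares[0] * len(shares)

# Synthetic counts (NOT measured data): each expert gets half the
# traffic of the one ranked above it.
synthetic_counts = [1000 * 0.5 ** r for r in range(8)]
shares = rank_share(synthetic_counts)
a, b = fit_exponential_decay(shares)
```

On the synthetic input the fitted decay rate `b` comes out at `log(2)`, because each rank halves the share. Real routing counts would only fit this shape approximately.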

This doesn't by itself explain anything. But let's look at the difference between the two choices on the table when we decide between width and depth: work on two randomly chosen tokens, or on two tokens that follow one from the other:

The distinct experts one verify forward touches as $N$ grows, three ways: $N$ separate sequences (width), one sequence running $N$ consecutive positions (depth), and the uniform coupon-collector.

[Interactive figure: distinct experts touched per verify forward against $N$, for width, depth, and the uniform coupon-collector. Selectors: layer (mean across layers, or layer 0 to layer 39); model (Qwen3.6-35B-A3B, DeepSeek-V4-Flash); dataset (HumanEval, or SPEED-Bench: all categories, coding, writing, qa, rag, math, reasoning, stem, humanities, summarization, multilingual, roleplay).]
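For reference, the uniform coupon-collector baseline has a closed form. The sketch below assumes every token independently picks `top_k` distinct experts uniformly at random out of `n_experts`. The values 128 and 8 in the usage lines are illustrative placeholders, not a statement of either model's real configuration.

```python
def expected_distinct_experts(n_tokens, n_experts, top_k):
    """Expected number of distinct experts touched in one layer.

    Under uniform routing, a given expert is missed by one token with
    probability 1 - top_k / n_experts, and by all n_tokens independent
    tokens with that probability raised to the power n_tokens.
    """
    p_missed_by_one_token = 1 - top_k / n_experts
    return n_experts * (1 - p_missed_by_one_token ** n_tokens)

# Illustrative configuration only (assumed, not taken from the post).
one_position = expected_distinct_experts(1, n_experts=128, top_k=8)
two_positions = expected_distinct_experts(2, n_experts=128, top_k=8)
```

With these placeholder numbers, one position touches 8 experts and two positions touch 15.5 in expectation, so under uniform routing a second position nearly doubles the expert weight moved. Any curve that sits below this baseline is reusing experts across positions.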

So there's the answer. At batch size 1 we're memory-bound, and verifying a two-position speculative run moves less expert weight than adding a second sequence would. It does so by co-activation: speculated runs are more similar than randomly chosen data, so they activate more of the same experts. (Josh did some great work on trying to recover co-activation for batching here.) Even throwing away 10% of speculated tokens at $\alpha = 0.9$, depth beats width.
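The break-even can be sketched numerically. The sketch assumes, as in the memory-bound argument above, that the time of a forward pass is proportional to the expert weight it moves. The function names and the notion of a single scalar "cost" per forward are simplifications introduced here for illustration.

```python
def depth_tokens_per_forward(alpha):
    """Expected committed tokens when verifying 1 draft token.

    The verify always yields one token, and the draft is kept with
    probability alpha.
    """
    return 1 + alpha

def break_even_cost_ratio(alpha, width_tokens=2):
    """Largest depth_cost / width_cost at which depth still matches width.

    Depth wins when (1 + alpha) / depth_cost > width_tokens / width_cost,
    i.e. when depth_cost / width_cost < (1 + alpha) / width_tokens.
    """
    return depth_tokens_per_forward(alpha) / width_tokens

def depth_beats_width(alpha, depth_cost, width_cost, width_tokens=2):
    """Compare tokens per unit cost for the two ways of spending 2 positions."""
    depth_rate = depth_tokens_per_forward(alpha) / depth_cost
    width_rate = width_tokens / width_cost
    return depth_rate > width_rate

ratio_at_09 = break_even_cost_ratio(0.9)
```

At $\alpha = 0.9$ the ratio is 0.95: the speculative forward only has to move about 5% less weight than the two-sequence forward for depth to come out ahead.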

This is a toy problem, though it's interesting to think about how we could make use of the insight. In real engines, it gets washed out by the cost of the...
