Benchmarking llama.cpp's new MTP support on Strix Halo

Caleb Coffie

PR #22673 landed in llama.cpp on May 16. It adds first-class Multi-Token Prediction (MTP) speculative decoding for models that ship with an MTP head, including Qwen3.6 27B dense and the 35B-A3B MoE. The author posted ~2.5× speedups on a DGX Spark.

I have a Strix Halo Framework Desktop and an RTX 3090, so I built llama.cpp from master a few hours after the merge and ran my speed-bench harness against both. Most wrappers (lemonade, ollama, LM Studio) won't have MTP for a while, so this is from-source territory.

TL;DR

MTP n=3 gives Qwen3.6 27B Q4_K_M a 1.81× speedup on Strix Halo (11.7 → 21.2 tok/s on chat); on Q8_0 the same setup hits 2.44× (7.4 → 18.1 tok/s), the biggest relative gain in the dataset.

MTP is still a smaller lever than the 3090's power budget. A 3090 at the full 450 W cap chews through Q4_K_M at 38.7 tok/s baseline and tops out at 59.5 tok/s with MTP n=2, a 1.54× speedup (n=2 is the sweet spot once the card is uncapped). MTP helps less on the 3090 than on Strix because the card has more raw headroom to burn through. See the power-limits post for what the same card looks like at 200 W.

It works on the MoE 35B-A3B too, not just dense.

It costs almost no extra VRAM, because the spec head shares the target model's embeddings, LM head, tokenizer, and main KV cache.

Output is identical to baseline. Speculative decoding only accepts drafted tokens the main model would have generated anyway, so quality doesn't change. You're trading idle GPU time for faster output, not accuracy.

Build steps for mainline llama.cpp with MTP are at the bottom of this post.

What MTP does

MTP makes the model draft several tokens at once and verify them in a single forward pass, instead of generating one token per pass. Normal generation is one-at-a-time: full pass, pick the next token, repeat. Speculative decoding skips that loop by guessing ahead. If the guesses are right, you get multiple tokens for the cost of one pass. If they're wrong, you fall back to one token like normal.

Quality doesn't change. The verify step only accepts drafted tokens the main model would have generated anyway. A rejected guess gets resampled from the main model's true distribution at that position. So the output you get with MTP at temperature 0 is bit-identical to baseline, and at higher temperatures it's statistically equivalent. You're saving wall-clock time, not giving up accuracy.
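Here is a minimal sketch of that accept/reject rule for the temperature-0 case. It is not llama.cpp's code: `verify_greedy`, `target_next`, and `toy_next` are hypothetical names, and the per-position loop stands in for what a real implementation does in one batched forward pass.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    context: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens emitted by one speculative step at temperature 0.

    target_next(prefix) is a stand-in for the main model's argmax next token.
    A real implementation scores every draft position in one batched pass;
    this loop calls it per position only to keep the sketch readable.
    """
    emitted: List[int] = []
    prefix = list(context)
    for guess in draft:
        truth = target_next(prefix)
        emitted.append(truth)   # what gets emitted is always the main model's token
        prefix.append(truth)
        if guess != truth:      # first mismatch: throw away the rest of the draft
            return emitted
    # Every guess matched, so the same pass also yields one bonus token.
    emitted.append(target_next(prefix))
    return emitted


def toy_next(prefix: Sequence[int]) -> int:
    """Hypothetical stand-in model: the next token is the last token plus one."""
    return prefix[-1] + 1


print(verify_greedy([1, 2, 3], [4, 5, 9], toy_next))  # [4, 5, 6]
print(verify_greedy([1, 2, 3], [4, 5, 6], toy_next))  # [4, 5, 6, 7]
```

In both calls the emitted tokens are exactly what the toy model would have produced one at a time. The draft only changes how many of them arrive per step.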

The usual way is to run a tiny separate model alongside the big one to make the guesses. That costs you a second model's worth of VRAM. MTP cuts that out by giving the big model a small extra head (one or a few transformer layers) that does the guessing itself, sharing the main model's input lookup table, output layer, tokenizer, and KV cache (the conversation's running working memory). The VRAM overhead is a fraction of a gigabyte.

You turn it on with --spec-type draft-mtp --spec-draft-n-max N on the new llama.cpp. Bigger N means more aggressive guessing per step. Acceptance drops as N grows, so there's a sweet spot per model. The PR author measured ~75 % acceptance at N=3 on Qwen3.6 27B and got similar speedups on the dense 27B and the 35B-A3B Mixture-of-Experts variant (an architecture where only a slice of the model runs per token).
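To get a feel for why there's a sweet spot, here is a back-of-the-envelope model of tokens per verify pass. It assumes every drafted token is accepted independently with the same probability, which is a simplification: real acceptance varies by position and by prompt, and the PR's ~75 % figure may be measured differently. `expected_tokens_per_step` is a hypothetical helper, not part of llama.cpp.

```python
def expected_tokens_per_step(accept_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per verify pass under an i.i.d. acceptance model.

    Draft token k only counts if tokens 1..k-1 were all accepted, so it
    contributes accept_rate**k. The leading 1.0 is the token the main model
    always produces itself (the correction or the bonus token).
    """
    return 1.0 + sum(accept_rate ** k for k in range(1, n_draft + 1))


for n in (1, 2, 3, 4):
    print(n, round(expected_tokens_per_step(0.75, n), 3))
```

Under this model each extra draft slot adds less than the one before it, while the cost of drafting and verifying keeps growing with N. Tokens per pass is therefore a ceiling on speedup, not a prediction of it.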

The rigs

Two pieces of hardware, both running llama.cpp built from master at commit 4f13cb7:

Strix Halo Framework Desktop: AMD Ryzen AI MAX+ 395 with the integrated Radeon 8060S GPU. 128 GiB of unified memory total, 96 GiB of which the GPU can use as VRAM. Runs on ROCm 7.2.3.

RTX 3090: 24 GiB of GDDR6X memory, running at the full 450 W cap. One card for the Q4_K_M runs, two layer-split cards for the Q8_0 runs (which don't fit on 24 GiB). CUDA 13.1. The cap matters a lot here. See the companion post on power limits for what the same card looks like at 200 W (roughly half the speed).

Models: unsloth/Qwen3.6-27B-MTP-GGUF (dense) and unsloth/Qwen3.6-35B-A3B-MTP-GGUF (MoE). Bench harness is the same one feeding /benchmarks: 5 measured runs per cell after 2 warmups, four workload shapes, temperature 0, median tok/s.
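The measurement loop for one cell looks roughly like the sketch below. This is not the actual harness: `median_tok_per_s` and `generate` are hypothetical names, and timing the whole request is a simplification (a real harness would likely time the generation phase separately from TTFT).

```python
import statistics
import time
from typing import Callable, List


def median_tok_per_s(
    generate: Callable[[], int],
    warmups: int = 2,
    runs: int = 5,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Median tokens/second over `runs` timed calls, after `warmups` untimed ones.

    generate() is a stand-in that runs one request and returns the number of
    tokens it produced.
    """
    for _ in range(warmups):
        generate()                      # warmup results are discarded
    rates: List[float] = []
    for _ in range(runs):
        start = clock()
        n_tokens = generate()
        elapsed = clock() - start
        rates.append(n_tokens / elapsed)
    return statistics.median(rates)
```

Taking the median rather than the mean keeps one slow run (a thermal dip, a background process) from dragging the cell's number down.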

Headline numbers: Qwen3.6 27B (dense)

Q4_K_M, chat shape (single-stream, 100 tokens out):

| Variant | Quant | Hardware | Backend | Mode | Shape | Conc. | Gen tok/s ↓ | TTFT | TPOT (ms) |
|---|---|---|---|---|---|---|---|---|---|
| 27B-MTP think | Q4_K_M | GeForce RTX 3090 · 24 GiB, 450 W, drv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=2 | chat | 1 | 59.5 | 259 ms | 0.1 |
| 27B-MTP think | Q4_K_M | GeForce RTX 3090 · 24 GiB, 450 W, drv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=3 | chat | 1 | 58.7 | 259 ms | 0.1 |
| 27B-MTP think | Q4_K_M | GeForce RTX 3090 · 24 GiB, 450 W, drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | chat | 1 | 38.7 | 238 ms | 23.4 |
| 27B-MTP think | Q4_K_M | GeForce RTX 3090 · 24 GiB, 200 W, drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-3-pl-200w | chat | 1 | 34.2 | 283 ms | 0.1 |
| 27B-MTP think | Q4_K_M | GeForce RTX 3090 · 24 GiB, 200 W, drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-2-pl-200w | chat | 1 | 32.0 | 271 ms | 0.1 |
| 27B-MTP think | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM), unified | llama.cpp 4f13cb7-mtp (rocm) | MTP... | | | | | |
