GLM5.2 on AMD MI355X at 2626 tok/s/node at over 2x lower cost than Blackwell


Performance per dollar is getting faster and cheaper


Have you noticed we like AMD?

The demand for inference is skyrocketing and outpacing supply. With frontier models being released almost every other week — Claude Fable, GLM5.2, and Minimax M3, to name a few — the token craze is only getting crazier, and there aren’t enough Blackwells going around to support it. Thus, NVIDIA GPU prices are climbing fast, and tokens are getting really expensive.

In comes AMD. At around 2.75x cheaper per GPU on average (MI355X vs B300) with comparable hardware specs, the solution to cheap inference is hiding in plain sight, a message we at Wafer have been preaching for months. But although AMD’s Instinct MI350 series competes with Blackwell at the silicon level, NVIDIA’s software advantage and day-0 support typically allow providers to serve inference much faster on NVIDIA hardware, with much less friction.

Conversely, on the MI355X / ROCm stack, SOTA performance rarely comes out of the box for these frontier models (though sometimes it does!). In fact, you’re lucky if you can find an image that runs them at all. Without day-0 support, building and optimizing for the newest models can require weeks of engineering and compute. By then, a newer model has already been released, so AMD is always playing catch-up.

But as agents improve at kernel and model optimization, this gap is closing in real time. At Wafer, we’ve proven this time and time again.

And again: on a 20k in / 1k out, 60% cache hit rate workload, we hit an aggregate throughput of 2626 tok/s/node at 2.4 rps, with the knee defined as ≤5s TTFT. That is about 80% of the performance measured on a B200, at less than half the cost.

| Sustained RPS | Aggregate tok/s/node | TTFT p50 / p95 | Success |
| --- | --- | --- | --- |
| 0.5 | 449 | 0.59s / 0.60s | 100% |
| 1.0 | 974 | 0.60s / 0.81s | 100% |
| 1.5 | 1913 | 0.62s / 1.03s | 100% |
| 2.0 | 1944 | 0.62s / 1.05s | 100% |
| 2.25 | 2089 | 0.63s / 1.23s | 100% |
| 2.4 (saturation) | 2626 | 0.81s / 2.22s | 100% |
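
To make the performance-per-dollar claim concrete, here is a minimal arithmetic sketch. It uses only the two ratios quoted above (about 80% of B200 throughput, over 2x cheaper). The function name is ours, and treating "over 2x cheaper" as a cost ratio of exactly 1/2 is an assumption that makes the result a lower bound.

```python
def relative_perf_per_dollar(perf_ratio: float, cost_ratio: float) -> float:
    """Performance per dollar of system A relative to system B.

    perf_ratio: A's throughput divided by B's throughput.
    cost_ratio: A's cost divided by B's cost.
    """
    if perf_ratio <= 0 or cost_ratio <= 0:
        raise ValueError("ratios must be positive")
    return perf_ratio / cost_ratio


# Assumption: "over 2x cheaper" is taken as exactly 2x, so this is a lower bound.
mi355x_vs_b200 = relative_perf_per_dollar(perf_ratio=0.80, cost_ratio=1 / 2.0)
print(f"{mi355x_vs_b200:.2f}x")  # prints 1.60x
```

Under those assumptions, the MI355X node delivers at least 1.6x the throughput per dollar of the B200 on this workload.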

We also hit 213 tok/s single-stream on GLM-5.2 with 10k input tokens / 1.5k output tokens, following Artificial Analysis standards, served on AMD MI355X capacity from TensorWave. Though this number doesn’t top the AA leaderboard, it still wins on performance per dollar.

How we did it

The first step with any model work is to choose a quantization and framework. We quantized the base bf16 GLM-5.2 to MXFP4 with AMD Quark. Compared with z-ai’s official FP8 quantization, our MXFP4 was effectively lossless on GPQA-Diamond, tau2, and GSM8K: every delta is small, and the GSM8K and GPQA-Diamond deltas sit within the reported ± ranges.

| Eval | FP8 baseline | MXFP4 | Δ (MXFP4 − FP8) |
| --- | --- | --- | --- |
| GSM8K (200q, 5-shot, greedy) | 0.965 ± 0.013 | 0.955 ± 0.014 | −0.010 |
| GPQA-Diamond (198q × 2 seeds, temp 1.0) | 0.9217 ± 0.027 | 0.9026 ± 0.029 | −0.019 |
| tau2 macro | 0.819 | 0.834 | +0.015 |
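
One rough way to read the table is to compare each delta against the combined uncertainty of the two runs. The sketch below does that for the two evals that report a ± value. It assumes the ± figures are standard errors and that the FP8 and MXFP4 runs are independent, neither of which the table states. tau2 is left out because it has no reported error.

```python
import math

# (name, fp8_score, fp8_err, mxfp4_score, mxfp4_err), copied from the table above.
EVALS = [
    ("GSM8K", 0.965, 0.013, 0.955, 0.014),
    ("GPQA-Diamond", 0.9217, 0.027, 0.9026, 0.029),
]


def delta_in_sigmas(base, base_err, quant, quant_err):
    """Return (delta, delta / combined_error) for quant minus base.

    Assumption: the errors are independent standard errors, so they
    combine in quadrature.
    """
    delta = quant - base
    combined = math.hypot(base_err, quant_err)
    return delta, delta / combined


for name, fp8, fp8_err, mxfp4, mxfp4_err in EVALS:
    delta, sigmas = delta_in_sigmas(fp8, fp8_err, mxfp4, mxfp4_err)
    print(f"{name}: delta={delta:+.4f}, {sigmas:+.2f} combined sigma")
```

Under those assumptions both deltas come out at roughly half a combined sigma, which is consistent with noise and not with a measurable quality loss.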

As for the inference framework, we had three options: vLLM, ATOM, and sglang. We chose sglang. vLLM had no working MXFP4 + GlmMoeDsa path, so the MXFP4 weights provided no benefit, and ATOM’s output degraded at long context. sglang was the engine that needed the least work to support the model natively, taking advantage of the quantization while remaining coherent.

The next natural step to improving throughput was enabling speculative decode on sglang, using the model’s MTP (multi-token prediction) head as the drafter. However, the sglang ROCm image does not support this out of the box. Two fixes were needed before MTP worked properly.

First, the MTP head, like every other layer, keeps its single shared expert stored in bf16, not MXFP4. However, the MTP head is registered under a different module prefix than the main decoder stack (Quark names its bf16 shared expert model.layers.78.mlp.shared_experts.*, while the MTP layer’s real prefix is model.decoder.*). Because of the mismatch, sglang’s quantization lookup fails and defaults to building that shared expert as MXFP4. At load it then tries to read a full-width bf16 weight into a half-width 4-bit slot and the init crashes on a shape mismatch. Quark records which weights to leave un-quantized as a list of layer names, so we copied over the layer 78 entries to that list a second time under the decoder name sglang actually uses. This fix unblocked speculative decode, netting us close to a 3x gain in single stream throughput.
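
The exclude-list fix described above can be sketched as a small helper that re-adds every layer 78 entry under the second prefix. This is an illustration, not our actual patch: the function name, the example entry names (`gate_proj`, `up_proj`, `down_proj`, `lm_head`), and the exact prefix pair passed in are hypothetical, and where the list lives in the Quark-exported config is not shown here.

```python
def mirror_excludes(exclude, src_prefix, dst_prefix):
    """Return `exclude` plus a copy of each `src_prefix` entry renamed to `dst_prefix`.

    Original entries are kept, so lookups under either name find the layer
    in the "leave un-quantized" list.
    """
    mirrored = list(exclude)
    seen = set(exclude)
    for name in exclude:
        if name.startswith(src_prefix):
            alias = dst_prefix + name[len(src_prefix):]
            if alias not in seen:  # keep the helper idempotent
                mirrored.append(alias)
                seen.add(alias)
    return mirrored


# Hypothetical entries, for illustration only.
exclude = [
    "lm_head",
    "model.layers.78.mlp.shared_experts.gate_proj",
    "model.layers.78.mlp.shared_experts.up_proj",
    "model.layers.78.mlp.shared_experts.down_proj",
]
patched = mirror_excludes(exclude, "model.layers.78.", "model.decoder.")
for name in patched:
    print(name)
```

Duplicating entries instead of renaming them keeps the checkpoint loadable by any tool that still expects the original Quark names.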

Second, deep speculative decode (such as the 5/1/6 config z-ai suggests) was still blocked. The fused multi-step metadata kernel needed for draft depth ≥4 writes #include with no ROCm guard. Fix: one #ifdef USE_ROCM guard.

Both changes were trivial but necessary to take full advantage of speculative decode. With spec dec working properly, alongside a few config optimizations (such as --kv-cache-dtype fp8_e4m3 and --enable-aiter-allreduce-fusion), we reached our headline single-stream decode number of 213 tok/s.

But for aggregate throughput, especially with our defined workload, decode optimizations are necessary but insufficient. At 20k in @ 60% cache, the workload...
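
A back-of-the-envelope sketch shows why decode is only part of the picture here. It uses the workload figures quoted earlier (20k in / 1k out, 60% cache hit rate, 2.4 rps) and assumes the cache hit rate is the fraction of input tokens served from the prefix cache, so the remainder must be prefilled. The function and key names are ours.

```python
def token_rates(rps, input_tokens, output_tokens, cache_hit_rate):
    """Approximate per-node token rates the server must sustain.

    Assumption: cache_hit_rate is the share of input tokens that skip prefill.
    """
    uncached_per_request = input_tokens * (1.0 - cache_hit_rate)
    return {
        "prefill_tok_s": rps * uncached_per_request,
        "decode_tok_s": rps * output_tokens,
    }


rates = token_rates(rps=2.4, input_tokens=20_000, output_tokens=1_000, cache_hit_rate=0.60)
ratio = rates["prefill_tok_s"] / rates["decode_tok_s"]
print(f"prefill: {rates['prefill_tok_s']:.0f} tok/s")
print(f"decode:  {rates['decode_tok_s']:.0f} tok/s")
print(f"prefill/decode ratio: {ratio:.1f}")
```

Under that reading, each node processes roughly 19,200 uncached prompt tokens per second against about 2,400 generated tokens per second, a ratio of about 8 to 1.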
