Running GLM-5.2 5× faster at ~500 tok/s, with a limitation


Running GLM-5.2 5× faster than vLLM, on a runtime that doesn't support it - abhishek.it

June 2026. Every tok/s number is from real runs on a rented Lambda 8×NVIDIA B200 node. Same FP8 GLM-5.2 weights, same prompts, 3 runs averaged. Each local engine measured at its best config. Nothing mocked.

TileRT is Xiaomi MiMo's inference runtime. It's the thing that pushed a trillion-parameter model past 1000 tok/s on a standard 8-card node. The model it ships with is GLM-5. GLM-5.2 came out with no published speed numbers, so I rented an 8×B200 and tried to measure it on TileRT myself.

Three walls. One of them permanent. I got the model running at ~480 tok/s, quality identical to OpenRouter, about 5× faster than vLLM on the same GPUs. Here's what broke, where the 5× actually comes from, and the catch.

Wall 1: the B200 you rent isn't CUDA-13-ready

TileRT 0.1.4's image is built for CUDA 13. The Lambda B200 boots on the older driver, R570 / CUDA 12.8, managed by Lambda's own driver stack. First thing TileRT did was crash.

RuntimeError: The NVIDIA driver on your system is too old (found version 12080).
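The 12080 in that message is CUDA's integer version encoding, 1000 × major + 10 × minor, so it reads as 12.8. A minimal sketch of the decoding follows; the helper names are made up for illustration and are not part of TileRT or any NVIDIA tool.

```python
def decode_cuda_version(raw: int) -> tuple[int, int]:
    """Split CUDA's integer encoding (1000*major + 10*minor) into (major, minor)."""
    major = raw // 1000
    minor = (raw % 1000) // 10
    return major, minor


def driver_supports(raw_driver_version: int, required_major: int) -> bool:
    """True if the driver's CUDA major version is at least the required one.

    Hypothetical helper: a coarse check on the major version only.
    """
    major, _ = decode_cuda_version(raw_driver_version)
    return major >= required_major


if __name__ == "__main__":
    found = 12080  # the value reported in the error above
    print(decode_cuda_version(found))                # (12, 8)
    print(driver_supports(found, required_major=13)) # False
```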

No CUDA-12 build of TileRT exists. I didn't want a compatibility shim polluting the numbers, so I upgraded the driver for real: R570 to R580. On a Lambda node that's fiddly. The fabric manager (the piece that lets the 8 GPUs talk to each other over NVLink) has to match the new driver exactly, or the GPUs can't see each other. And you have to confirm the new driver actually built for the running kernel before you reboot, or the box comes back with no GPUs at all.

It worked. Lesson stuck though: a B200 you rent today is not, out of the box, ready to run a CUDA-13 stack. Budget an hour for the driver before you measure anything.

Wall 2: TileRT doesn't support GLM-5.2

TileRT 0.1.4 supports GLM-5.1. GLM-5.2 looks almost identical on paper: 78 layers, 256 experts, MLA attention, FP8. But it adds one thing TileRT has never seen: IndexShare.

Both models use DeepSeek-style sparse attention. A small "indexer" picks the top-2048 tokens each layer attends to. In GLM-5.1, and in what TileRT expects, every layer runs its own indexer. GLM-5.2 marks every 4th layer full (it computes the index) and the 3 layers between shared (they reuse the previous full layer's pick). About 2.9× fewer attention FLOPs at max context. It's also why the conversion died on the spot:

KeyError: 'model.layers.3.self_attn.indexer.wk.weight'

Layer 3 is shared. It has no indexer weights, because in GLM-5.2 it doesn't need them. TileRT's converter assumes every layer has them.

So I made GLM-5.2 look like the uniform model TileRT wants. The remap synthesizes the missing indexer subtree on each shared layer by copying it from the full layer it shares from. 399 small tensors. The 700 GB of real weights stay untouched; only the index gets patched. TileRT's converter then ate it as a plain GLM-5.1, and the model ran.
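The copy plan behind a remap like this can be sketched as below. This is an illustration under assumptions, not the author's script. It assumes indexer tensors follow the naming seen in the KeyError (model.layers.N.self_attn.indexer.*), and it treats any layer with no indexer tensors as shared. The function name plan_indexer_copies is hypothetical. It only decides which tensor should be copied to which name; reading and writing the checkpoint is left out.

```python
import re
from collections import defaultdict

# Naming pattern taken from the KeyError in the post; everything else is assumed.
INDEXER_RE = re.compile(r"^model\.layers\.(\d+)\.self_attn\.indexer\.(.+)$")


def plan_indexer_copies(tensor_names, num_layers):
    """Return {dst_name: src_name} for each indexer tensor missing on a shared layer.

    A layer with no indexer tensors is treated as shared and borrows from the
    nearest earlier layer that has them (the "full" layer it shares from).
    """
    # Group indexer tensor suffixes by the layer that owns them.
    suffixes_by_layer = defaultdict(set)
    for name in tensor_names:
        match = INDEXER_RE.match(name)
        if match:
            suffixes_by_layer[int(match.group(1))].add(match.group(2))

    copies = {}
    last_full = None
    for layer in range(num_layers):
        if layer in suffixes_by_layer:
            last_full = layer  # full layer: already has its own indexer
            continue
        if last_full is None:
            raise ValueError(f"layer {layer} has no indexer and no earlier full layer")
        for suffix in sorted(suffixes_by_layer[last_full]):
            dst = f"model.layers.{layer}.self_attn.indexer.{suffix}"
            src = f"model.layers.{last_full}.self_attn.indexer.{suffix}"
            copies[dst] = src
    return copies


if __name__ == "__main__":
    # Toy 4-layer checkpoint: layers 0 and 2 are full, 1 and 3 are shared.
    names = [
        "model.layers.0.self_attn.indexer.wk.weight",
        "model.layers.2.self_attn.indexer.wk.weight",
        "model.layers.1.placeholder.weight",  # placeholder non-indexer tensor, ignored
    ]
    for dst, src in plan_indexer_copies(names, num_layers=4).items():
        print(dst, "<-", src)
```

Driving the plan from the tensor names that actually exist avoids hardcoding which layers are full, which the post does not spell out.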

The part that mattered: I checked the output against OpenRouter's GLM-5.2 on knowledge questions. Capital of Australia (Canberra), 17×24 (408), the bat-and-ball trap ($0.05), gold's symbol (Au), War and Peace (Tolstoy). Every answer matched. The remap isn't a lobotomy. Inside its window it's bit-for-bit GLM-5.2.

Wall 3: the 2048-token ceiling, and this one doesn't move

"Inside its window" is carrying the whole post.

The remap is exact only while total context stays under 2048 tokens. Reason: when the sequence is shorter than index_topk (2048), the top-2048 grabs every token. Attention is dense, and what the indexer picked stops mattering. Past 2048, sparse selection starts to matter. My synthesized indexers then compute the index from each shared layer's own hidden state, instead of reusing the full layer's pick. They grab the wrong tokens. The model falls apart into "wait wait wait wait", then "0: 0: 0: 0:". You can watch it the second a chat crosses ~2048 tokens.
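The argument above can be checked on a toy example. The sketch below shrinks the window from 2048 to 4 and uses made-up scores. It shows that two indexers that disagree completely still select the same token set while the sequence fits inside the window, and diverge once it does not. The function names are illustrative only.

```python
def topk_indices(scores, k):
    """Indices of the k highest scores; every index when there are at most k scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def selection_matches(scores_a, scores_b, k):
    """True if two indexers with different scores select the same token set."""
    return topk_indices(scores_a, k) == topk_indices(scores_b, k)


K = 4  # toy stand-in for index_topk = 2048

# Sequence fits in the window: top-k keeps every token, so the scores are irrelevant.
real_short = [0.9, 0.1, 0.5, 0.3]
synth_short = [0.1, 0.9, 0.3, 0.5]

# Sequence exceeds the window: the selected sets now depend on the scores.
real_long = [0.9, 0.1, 0.5, 0.3, 0.8, 0.7]
synth_long = [0.1, 0.9, 0.3, 0.5, 0.2, 0.0]

if __name__ == "__main__":
    print(selection_matches(real_short, synth_short, K))  # True
    print(selection_matches(real_long, synth_long, K))    # False
```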

I tried to break it. The kernel hardcodes the window. Set index_topk to anything but 2048 and it throws "idx_selects must have last dim 2048". The obvious fix is to make the shared layers reuse the full layer's real selection. That needs a hook between the indexing step and the attention step. There isn't one. TileRT runs the whole model in a single closed kernel that stays resident on the GPU: one call runs all 78 layers inside it, and from the outside you never see the individual layers. No place to reach in, and the compiler that builds that kernel isn't public. So the ceiling can't move without TileRT shipping GLM-5.2 itself. I cap output at 2000 tokens so it cuts clean instead of spewing garbage.
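One way to enforce a cap like this per request is sketched below. The post only states a 2000-token output cap. Counting the prompt against the 2048-token window is an assumption added here so that prompt plus output stays inside the range where the remap is exact. The constant and function names are hypothetical.

```python
CONTEXT_LIMIT = 2048  # window inside which the remap is exact (from the post)
OUTPUT_CAP = 2000     # the post's output cap


def max_new_tokens(prompt_tokens: int,
                   context_limit: int = CONTEXT_LIMIT,
                   output_cap: int = OUTPUT_CAP) -> int:
    """Largest generation length that keeps prompt + output within the window.

    Assumption: the prompt counts against the same window as the output.
    Returns 0 when the prompt alone already fills or exceeds it.
    """
    remaining = context_limit - prompt_tokens
    return max(0, min(output_cap, remaining))


if __name__ == "__main__":
    print(max_new_tokens(48))    # 2000: a short prompt leaves room for the full cap
    print(max_new_tokens(1000))  # 1048: a longer prompt shrinks the budget
    print(max_new_tokens(3000))  # 0: the prompt is already past the window
```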

This is the catch nobody puts in the headline. Fast and correct, but length-capped. Fine for a coding turn. Useless for a 100k-token doc.

The benchmark

Forty questions, four domains (Python, JavaScript, algorithms, SQL), 3 runs each, averaged. Same FP8 weights for the two local engines, no-think mode, each engine at its best config. That last clause matters and I'll get...
