GLM 5.2 playing text adventures
I’ve heard some buzz around the new glm 5.2 open-weights model. They say it’s very capable! I won’t run a full comparison benchmark, but I have some credits sloshing around on OpenRouter, so I figured I might compare glm 5.2 to the similarly-priced Gemini 3 Flash and see where things land. (The market currently serves the glm 5.2 model at $4.4 per million output tokens, whereas Google charges $3 per million output tokens for their model. I expect the price of the glm model to go down somewhat when people figure out how to deploy it more efficiently and/or the buzz dies down. That’s what happened with previous open-weight models I’ve tested.)
This uses the same setup as the previous benchmark: each llm gets a few attempts at playing the game, with each attempt being limited to a fixed budget of around $0.15. The llm doesn’t know it, but the harness tracks achievements for each game, and counts how many the llm earns in each attempt.
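To get a feel for what that budget means, here is a back-of-the-envelope sketch of how many output tokens $0.15 buys at the two output prices quoted above. It assumes the whole budget goes to output tokens, which is not how the harness actually spends money (input tokens cost something too), so treat the numbers as an upper bound. The names `BUDGET`, `PRICE_PER_MILLION` and `output_tokens_for` are made up for this illustration.

```python
from decimal import Decimal

# Per-attempt budget in USD ("around $0.15" in the post).
BUDGET = Decimal("0.15")

# Output prices in USD per million tokens, as quoted in the post.
PRICE_PER_MILLION = {
    "glm 5.2": Decimal("4.4"),
    "Gemini 3 Flash": Decimal("3"),
}


def output_tokens_for(budget, price_per_million):
    """Upper bound on output tokens if the budget paid for nothing else."""
    return int(budget / price_per_million * 1_000_000)


tokens = {name: output_tokens_for(BUDGET, price)
          for name, price in PRICE_PER_MILLION.items()}

for name, count in tokens.items():
    print(f"{name}: at most {count} output tokens per attempt")
```

Under these assumptions the cheaper model gets roughly half again as many output tokens per attempt, which is worth keeping in mind when reading the results.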
Here is the number of attempts for each game in this run.
Games: Lost Pig, Organ Grinder’s Monkey, Not All That Shimmers, Kill Wizard, 9:05
Total attempts per model: 17
Total spend (💸): $5.1
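As a sanity check, the total spend lines up with the per-attempt budget. This small sketch assumes two models (glm 5.2 and Gemini 3 Flash) and that every attempt used up its full budget of about $0.15; both are my reading of the numbers rather than something the harness reports.

```python
from decimal import Decimal

ATTEMPTS_PER_MODEL = 17                # total attempts per model
MODELS = 2                             # assumption: glm 5.2 and Gemini 3 Flash
BUDGET_PER_ATTEMPT = Decimal("0.15")   # approximate per-attempt cap in USD


def max_spend(attempts_per_model, models, budget):
    """Upper bound on spend if every attempt exhausts its budget."""
    return attempts_per_model * models * budget


total = max_spend(ATTEMPTS_PER_MODEL, MODELS, BUDGET_PER_ATTEMPT)
print(f"${total}")
```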
Then I did the stupid, silly thing and fitted a plain linear regression predicting the achievement count for each attempt, with the llm model as an explanatory fixed effect, and the game as a random effect. (Why didn’t I use random effects for game difficulty before? I should have! But I didn’t know about mixed-effects modeling then. I learn things.) When thusly controlling for game difficulty, Gemini 3 Flash earns just over eight achievements in a typical attempt. The new glm 5.2 earns 15 % fewer, and this is statistically significant at customary significance levels.
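To illustrate what controlling for game difficulty means, here is a minimal sketch that uses only the Python standard library. It is not the mixed-effects fit described above: it swaps in the simpler within-game estimator (equivalent to giving each game its own intercept), which subtracts each game’s averages before comparing models. The attempt data is made up for illustration and is not the data from this run; `within_game_effect` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean

# Made-up attempts: (game, model, achievements). NOT the real results.
ATTEMPTS = [
    ("Lost Pig", "gemini", 9), ("Lost Pig", "gemini", 11),
    ("Lost Pig", "glm", 8), ("Lost Pig", "glm", 10),
    ("9:05", "gemini", 5), ("9:05", "gemini", 7),
    ("9:05", "glm", 4), ("9:05", "glm", 6),
]


def within_game_effect(attempts, baseline, other):
    """Estimate how many more achievements `other` earns than `baseline`,
    after removing each game's own average level (within estimator)."""
    by_game = defaultdict(list)
    for game, model, score in attempts:
        if model in (baseline, other):
            # x is 1.0 for the model of interest, 0.0 for the baseline.
            by_game[game].append((1.0 if model == other else 0.0, score))

    numerator = 0.0
    denominator = 0.0
    for rows in by_game.values():
        x_bar = mean(x for x, _ in rows)
        y_bar = mean(y for _, y in rows)
        for x, y in rows:
            numerator += (x - x_bar) * (y - y_bar)
            denominator += (x - x_bar) ** 2
    return numerator / denominator


effect = within_game_effect(ATTEMPTS, baseline="gemini", other="glm")
print(f"estimated effect: {effect:+.2f} achievements per attempt")
```

In the made-up data, Lost Pig is simply an easier game to score in than 9:05, but the estimate only looks at the gap between models inside each game. A random-effects fit treats the game offsets as draws from a common distribution instead of free parameters, which matters more when some games have few attempts.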
This does not tell us much – is 15 % fewer achievements very bad or reasonable? Hard to tell without comparing to other models, but it’s roughly the same magnitude as the standard deviation of the residual noise in the fitted model. Thus we can say it’s about 0.8 levels of noise worse than the king of text adventure playing llms. That’s impressive. For example, it is definitely better than Gemini 2.5 Flash, which is 1.6 noise levels worse than Gemini 3 Flash.
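The noise-level arithmetic can be spelled out. The sketch below rounds the baseline to exactly eight achievements (the fit says “just over eight”), so the residual standard deviation it backs out is only approximate, and the conversion for Gemini 2.5 Flash is derived from that approximation rather than taken from the fitted model.

```python
BASELINE = 8.0        # Gemini 3 Flash, rounded down from "just over eight"
RELATIVE_GAP = 0.15   # glm 5.2 earns 15 % fewer achievements
GLM_NOISE_LEVELS = 0.8


def noise_levels(gap, residual_sd):
    """Express an achievement gap in units of residual standard deviation."""
    return gap / residual_sd


def achievements(levels, residual_sd):
    """Convert noise levels back into an achievement gap."""
    return levels * residual_sd


# Gap in achievements, and the residual sd implied by "0.8 noise levels".
glm_gap = BASELINE * RELATIVE_GAP
implied_sd = glm_gap / GLM_NOISE_LEVELS

# Rough gap for a model that is 1.6 noise levels behind.
gemini_25_gap = achievements(1.6, implied_sd)

print(f"glm 5.2 gap: {glm_gap:.1f} achievements")
print(f"implied residual sd: {implied_sd:.1f} achievements")
print(f"1.6 noise levels: {gemini_25_gap:.1f} achievements")
```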
(Due to the budget constraint, models like Sonnet 4.5 or gpt 5.2 are 2.5 and 3 noise levels worse, respectively.)