GLM 5.2 playing text adventures

I've heard some buzz around the new glm 5.2 open-weights model. They say it's very capable! I won't run a full comparison benchmark, but I have some credits sloshing around on OpenRouter, so I figured I might compare glm 5.2 to the similarly-priced Gemini 3 Flash and see where things land. (The market currently runs inference on the glm 5.2 model at $4.4 per million output tokens, whereas Google charges $3 per million output tokens for their model. I expect the price of the glm model to go down somewhat when people figure out how to deploy it more efficiently and/or the buzz dies down. That's what happened with previous open-weight models I've tested.)

This uses the same setup as the previous benchmark: each llm gets a few attempts at playing each game, with each attempt limited to a fixed budget of around $0.15. The llm doesn't know it, but the harness tracks achievements for each game and counts how many the llm earns in each attempt.

Here is the number of attempts for each game in this run.

| Game | Attempts per model |
|---|---|
| Lost Pig | |
| Organ Grinder's Monkey | |
| Not All That Shimmers | |
| Kill Wizard | |
| 9:05 | |
| Total | 17 |
| 💸 | $5.1 |
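As a quick sanity check on those totals, the arithmetic can be sketched as below. It assumes that the 17 is the number of attempts per model, that exactly two models were run, and that every attempt used its full budget of roughly $0.15; none of that is spelled out in the table itself.

```python
# Assumptions (not stated explicitly in the post): two models,
# 17 attempts each, every attempt spending the full ~$0.15 budget.
attempts_per_model = 17
number_of_models = 2
budget_per_attempt_usd = 0.15

total_cost_usd = attempts_per_model * number_of_models * budget_per_attempt_usd
print(f"Estimated total spend: ${total_cost_usd:.2f}")
```

Under those assumptions the estimate lines up with the $5.1 shown in the table.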

Then I did the stupid, silly thing and fitted a plain linear regression predicting the achievement count for each attempt, with the llm model as an explanatory fixed effect and the game as a random effect. (Why didn't I use random effects for game difficulty before? I should have! But I didn't know about mixed-effects modeling then. I learn things.) When thusly controlling for game difficulty, Gemini 3 Flash earns just over eight achievements in a typical attempt. The new glm 5.2 earns 15 % fewer, and this is statistically significant at customary significance levels.
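A model of this shape could be fitted roughly as follows. This is only a sketch: the post does not say which software was used, so Python with `statsmodels` is an assumption, and the data is synthetic, generated with made-up effect sizes purely so the example runs. The column names (`game`, `llm`, `achievements`) and the 20 attempts per cell are likewise hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# SYNTHETIC data for illustration only; these are not the post's results.
rng = np.random.default_rng(0)
games = ["Lost Pig", "Organ Grinder's Monkey", "Not All That Shimmers",
         "Kill Wizard", "9:05"]
llms = ["gemini-3-flash", "glm-5.2"]

rows = []
for game in games:
    difficulty = rng.normal(0.0, 3.0)  # hypothetical per-game difficulty shift
    for llm in llms:
        llm_shift = -1.2 if llm == "glm-5.2" else 0.0  # hypothetical model effect
        for _ in range(20):  # hypothetical number of attempts per game and model
            score = 8.0 + difficulty + llm_shift + rng.normal(0.0, 1.5)
            rows.append({"game": game, "llm": llm, "achievements": score})
data = pd.DataFrame(rows)

# Fixed effect for the llm, random intercept for each game.
# The alphabetically first level ("gemini-3-flash") becomes the baseline.
model = smf.mixedlm("achievements ~ llm", data, groups=data["game"])
fit = model.fit()

effect = fit.fe_params["llm[T.glm-5.2]"]   # glm 5.2 relative to the baseline
residual_sd = float(np.sqrt(fit.scale))    # fit.scale is the residual variance
print(f"glm 5.2 effect: {effect:.2f} achievements")
print(f"residual sd:    {residual_sd:.2f}")
print(f"noise levels:   {abs(effect) / residual_sd:.2f}")
```

The random intercept soaks up how hard each game is, so the `llm` coefficient compares the two models within games rather than across them.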

This does not tell us much – is 15 % fewer achievements very bad or reasonable? Hard to tell without comparing to other models, but it's roughly the same magnitude as the standard deviation of the residual noise in the fitted model. Thus we can say it's about 0.8 levels of noise worse than the king of text adventure playing llms. That's impressive. For example, it is definitely better than Gemini 2.5 Flash, which is 1.6 noise levels worse than Gemini 3 Flash.
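To make the "noise level" unit concrete, here is the back-of-the-envelope arithmetic using the rounded figures quoted above. The baseline of exactly 8 achievements is an approximation of "just over eight", so the implied residual standard deviation is a rough estimate and not a number reported by the fitted model.

```python
# Rounded figures taken from the text; the results are approximate.
baseline_achievements = 8.0   # Gemini 3 Flash, "just over eight"
relative_drop = 0.15          # glm 5.2 earns 15 % fewer
glm_noise_levels = 0.8        # stated gap in residual standard deviations

gap_in_achievements = baseline_achievements * relative_drop
implied_residual_sd = gap_in_achievements / glm_noise_levels

# The same unit applied to Gemini 2.5 Flash (1.6 noise levels worse).
gemini_25_gap = 1.6 * implied_residual_sd

print(f"glm 5.2 gap:          {gap_in_achievements:.1f} achievements")
print(f"implied residual sd:  {implied_residual_sd:.1f} achievements")
print(f"Gemini 2.5 Flash gap: {gemini_25_gap:.1f} achievements")
```

So one noise level is on the order of one and a half achievements, which is why a 15 % gap counts as slightly less than one.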

(Due to the budget constraint, models like Sonnet 4.5 and gpt 5.2 are 2.5 and 3 noise levels worse, respectively.)
