I evaluated GLM 5.2 against the frontier on tasks from real repos

GLM 5.2 on 50 real Go and Rust PRs: last on quality, and not the cheapest — Stet

June 19, 2026

GLM 5.2 has been getting a lot of hype as a "frontier killer". There are two ways to test that claim: how cheap is it, and is it good enough for most work?

On tasks from two open-source repos, it is neither. It costs roughly twice as much as Composer 2.5 and scores lower on both quality dimensions that matter.

Route GLM to first-draft work with a stronger model supervising. Don't run it unattended against production-grade repositories.

The setup: fifty real merged pull requests across two repos (graphql-go-tools in Go and sqlparser-rs in Rust), replayed against frozen snapshots so nothing leaks, one attempt per task. Every patch is graded beyond pass/fail on two measures: a craft score (0–4, code quality and idiom) and equivalence (0–1, how closely the patch reproduces the merged human PR's actual behavior). Stet, the local eval harness I build, runs and grades them; a blinded GPT-5.4 judge scores independently of the runner. GLM ran at medium reasoning. Does GLM belong at this table, and if not, where does it actually belong?

n=50 slice: graphql-go-tools (Go, n=25) plus sqlparser-rs (Rust, n=25), blinded GPT-5.4 judge. GLM 5.2 (run at medium reasoning) lands last in the field on craft and equivalence in both repos: cheaper than the premium arms, but pricier than Composer. Below: where it stands, how it behaves, and what it costs.

calibrated standing · GLM 5.2 medium

Last on craft and equivalence, in both repos

craft = 8-grader mean (0–4); equivalence = how closely the patch reproduces the merged human PR's behavior (0–1). GLM is decision-grade behind the whole premium field on both, in both repos: a gap big enough to survive the statistics at this sample size. Against the budget arm Composer 2.5 it is a noise-band peer (too close to call) on Go and decision-grade behind on Rust equivalence, while costing about twice as much on Rust.

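graphql-go-tools (Go), n=25

| Arm | craft | equiv | $/task |
|---|---|---|---|
| Opus 4.8 high | 2.90 | 0.73 | $3.98 |
| GPT-5.5 high | 2.72 | 0.73 | $4.69 |
| Opus 4.7 xhigh | 2.63 | 0.68 | $5.93 |
| Composer 2.5 | 2.48 | 0.60 | $0.71 |
| GLM 5.2 medium | 2.38 | 0.47 | $1.40 |

sqlparser-rs (Rust), n=25

| Arm | craft | equiv | $/task |
|---|---|---|---|
| Opus 4.8 high | 3.28 | 0.98 | $3.02 |
| Opus 4.7 xhigh | 2.98 | 0.97 | $3.55 |
| GPT-5.5 high | 2.94 | 0.96 | $3.41 |
| Composer 2.5 | 2.84 | 0.95 | $0.53 |
| GLM 5.2 medium | 2.69 | 0.78 | $1.04 |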

GLM ran at medium reasoning. Composer's Go cost is recovered from raw Cursor logs (directional); all other figures are from the calibrated per-repo panels.
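The "roughly twice Composer" claim can be checked directly from the per-repo tables above. The following is a minimal sketch: the figures are copied from those tables, while the `RESULTS` layout and the `compare` helper are illustrative names and not part of Stet.

```python
# Figures copied from the per-repo tables: (craft, equivalence, $/task).
# The dict layout and helper name are illustrative, not Stet's own API.
RESULTS = {
    "go": {
        "Composer 2.5": (2.48, 0.60, 0.71),
        "GLM 5.2 medium": (2.38, 0.47, 1.40),
    },
    "rust": {
        "Composer 2.5": (2.84, 0.95, 0.53),
        "GLM 5.2 medium": (2.69, 0.78, 1.04),
    },
}


def compare(repo, arm="GLM 5.2 medium", baseline="Composer 2.5"):
    """Return (cost ratio, craft delta, equivalence delta) of arm vs baseline."""
    a_craft, a_equiv, a_cost = RESULTS[repo][arm]
    b_craft, b_equiv, b_cost = RESULTS[repo][baseline]
    return (
        round(a_cost / b_cost, 2),
        round(a_craft - b_craft, 2),
        round(a_equiv - b_equiv, 2),
    )


for repo in RESULTS:
    ratio, d_craft, d_equiv = compare(repo)
    print(f"{repo}: {ratio}x cost, craft {d_craft:+.2f}, equivalence {d_equiv:+.2f}")
```

On these numbers the cost ratio comes out just under 2x on both repos, with negative craft and equivalence deltas in each. Note that raw deltas say nothing about significance; the noise-band versus decision-grade calls come from the calibrated panels, not from this subtraction.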

how GLM works

Normal-sized patches, last on quality

Median agent patch by model (additions right, deletions left). Every arm writes more than the human PR here (Go +111 / −47, Rust +110 / −17), and GLM sits mid-field: it adds the most on Rust, is unremarkable on Go, and deletes little, like the Opus arms; GPT-5.5 and Composer churn more. So GLM's gap isn't a patch-size problem. It writes a normal-looking diff and still lands last on equivalence and craft.

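graphql-go-tools (Go), median lines added / deleted

| Model | added | deleted |
|---|---|---|
| GLM 5.2 | +222 | −16 |
| Composer 2.5 | +282 | −42 |
| GPT-5.5 | +324 | −35 |
| Opus 4.8 | +228 | −14 |
| Opus 4.7 | +213 | −15 |

sqlparser-rs (Rust), median lines added / deleted

| Model | added | deleted |
|---|---|---|
| GLM 5.2 | +284 | −12 |
| Composer 2.5 | +235 | −35 |
| GPT-5.5 | +223 | −19 |
| Opus 4.8 | +175 | −18 |
| Opus 4.7 | +143 | −10 |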
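To make "mid-field" concrete, the median additions above can be ranked per repo and compared with the human PR medians quoted in the text (Go +111, Rust +110). This is a small sketch using only those numbers; the variable and function names are illustrative.

```python
# Median lines added per task, copied from the patch-size figures above.
HUMAN_ADDED = {"go": 111, "rust": 110}
AGENT_ADDED = {
    "go": {"GLM 5.2": 222, "Composer 2.5": 282, "GPT-5.5": 324,
           "Opus 4.8": 228, "Opus 4.7": 213},
    "rust": {"GLM 5.2": 284, "Composer 2.5": 235, "GPT-5.5": 223,
             "Opus 4.8": 175, "Opus 4.7": 143},
}


def rank_by_added(repo):
    """Models ordered from most to fewest median added lines."""
    arms = AGENT_ADDED[repo]
    return sorted(arms, key=arms.get, reverse=True)


def ratio_to_human(repo, model):
    """Median agent additions as a multiple of the human PR's additions."""
    return AGENT_ADDED[repo][model] / HUMAN_ADDED[repo]


for repo in AGENT_ADDED:
    order = rank_by_added(repo)
    position = order.index("GLM 5.2") + 1
    multiple = ratio_to_human(repo, "GLM 5.2")
    print(f"{repo}: GLM is #{position} of {len(order)} by additions, "
          f"{multiple:.2f}x the human PR")
```

GLM ranks fourth of five on Go and first on Rust by additions, which matches the reading that patch size alone does not explain its quality gap.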

tokens, turns & patches by model

| Model | input/task | output/task | turns | patches |
|---|---|---|---|---|
| GLM 5.2 | 3.2M | 22k | 122 | 50/50 |
| Composer 2.5 | 2.1M | 16k | n/a | 50/50 |
| GPT-5.5 | 4.8M | 15k | 94 | 50/50 |
| Opus 4.8 | 3.0M | 32k | 113 | 50/50 |
| Opus 4.7 | 5.6M | 31k | 100 | 46/50 |

Medians per task; input/output tokens and patch counts pooled across both repos. Input is context (including cache reads); output is generated tokens. Turns are the sqlparser-rs median, where capture is complete across the claude-code and codex arms. GLM runs the most (122 vs Opus 4.8's 113, GPT-5.5's 94). Composer ran on Cursor, which batches its work under a few assistant turns, so its turn count isn't comparable. Patches = tasks with a non-empty diff (Opus 4.7's four misses are routing no-patches).

GLM also grinds the longest by wall-clock: a ~16-minute median, the slowest arm. Its worst Go task burned 14.1M tokens and $4.07 in a single 326-turn loop, almost all of it re-reading the same files.
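For a sense of scale, the worst-task totals can be broken down per turn. The sketch below uses only the three quoted figures; the per-million-token price is derived from them and is not a number the evaluation reports.

```python
# Figures quoted in the text for GLM's worst Go task.
TOKENS = 14.1e6   # total tokens burned in the run
COST_USD = 4.07   # total spend for the run
TURNS = 326       # turns in the single loop


def per_turn(total, turns):
    """Average of a run-level total over its turns."""
    return total / turns


tokens_per_turn = per_turn(TOKENS, TURNS)
cost_per_turn = per_turn(COST_USD, TURNS)
# Effective blended price implied by the two totals (derived, not reported).
usd_per_million_tokens = COST_USD / (TOKENS / 1e6)

print(f"{tokens_per_turn:,.0f} tokens/turn, "
      f"${cost_per_turn:.4f}/turn, "
      f"${usd_per_million_tokens:.2f}/M tokens")
```

That works out to roughly 43k tokens and a little over a cent per turn. The low per-turn cost is consistent with a loop dominated by re-reading the same files, assuming those re-reads were mostly billed as cache reads.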

glm 5.2 comparison: cost vs local score

Compare weighted Stet quality against rollout spend. [Interactive scatter chart: repo-balanced local score versus cost per task, one point per model. Selectable quality metrics: Local score, Craft, Code review, Equivalence, Tests. Selectable spend metrics: Cost, Time, Tokens.]

| model | Local score | Cost |
|---|---|---|
| GLM 5.2 medium | 64.6 | $1.22 |
| Composer 2.5 | 71.6 | $0.62 |
| GPT-5.5 high | 75.6 | $4.05 |
| Opus 4.8 high | 80.2 | $3.50 |
| Opus 4.7 xhigh | 75.9 | $4.74 |
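One way to read the cost-versus-score data is as a Pareto check: a model is dominated if another is at least as good on both score and cost, and strictly better on one. The sketch below applies that standard definition to the five reported points. It is an illustration of the comparison, not how Stet itself ranks models, and the names are hypothetical.

```python
# (repo-balanced local score, cost per task in USD), from the figures above.
POINTS = {
    "GLM 5.2 medium": (64.6, 1.22),
    "Composer 2.5": (71.6, 0.62),
    "GPT-5.5 high": (75.6, 4.05),
    "Opus 4.8 high": (80.2, 3.50),
    "Opus 4.7 xhigh": (75.9, 4.74),
}


def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on one."""
    (score_a, cost_a), (score_b, cost_b) = a, b
    no_worse = score_a >= score_b and cost_a <= cost_b
    strictly_better = score_a > score_b or cost_a < cost_b
    return no_worse and strictly_better


def pareto_frontier(points):
    """Names of models that no other model dominates, sorted alphabetically."""
    return sorted(
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )


def dominated_by(points, name):
    """Names of models that dominate the given model."""
    return sorted(o for o, q in points.items()
                  if o != name and dominates(q, points[name]))


print("frontier:", pareto_frontier(POINTS))
print("GLM 5.2 medium is dominated by:", dominated_by(POINTS, "GLM 5.2 medium"))
```

On these five points only Composer 2.5 and Opus 4.8 high sit on the frontier, and GLM 5.2 medium is dominated by Composer 2.5, which is both cheaper and higher-scoring.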
