GLM 5.2 on 50 real Go and Rust PRs: last on quality, and not the cheapest — Stet
June 19, 2026
GLM 5.2 has been getting a lot of hype as a "frontier killer". We can evaluate it two ways: how cheap is it, and is it good enough for most work?
On tasks from these two open-source repos, it is neither. It costs roughly twice as much as Composer 2.5 and delivers lower quality on both dimensions that matter.
Route GLM to first-draft work that a stronger model supervises. Don't run it unattended against production-grade repositories.
The setup: fifty real merged pull requests across two repos (graphql-go-tools in Go and sqlparser-rs in Rust), replayed against frozen snapshots so nothing leaks, one attempt per task. Every patch is graded beyond pass/fail: a craft score (0–4, code quality and idiom) and equivalence (0–1, how closely the patch reproduces the merged human PR's actual behavior). Stet, the local eval harness I build, runs and grades them; a blinded GPT-5.4 judge scores independently of the runner. GLM ran at medium reasoning. Does GLM belong at this table, and if not, where does it actually belong?
n=50 slice: graphql-go-tools (Go, n=25) plus sqlparser-rs (Rust, n=25), blinded GPT-5.4 judge. GLM 5.2 (run at medium reasoning) lands last in the field on craft and equivalence in both repos — cheaper than the premium arms, but pricier than Composer and last on quality. Below: where it stands, how it behaves, and what it costs.
calibrated standing · GLM 5.2 medium
Last on craft and equivalence, in both repos
craft = 8-grader mean (0–4); equivalence = how closely the patch reproduces the merged human PR's behavior (0–1). GLM is decision-grade behind the whole premium field on both metrics, in both repos: the gap is big enough to survive the statistics at this sample size. Against the budget arm Composer 2.5 it is a noise-band peer (too close to call) on Go and decision-grade behind on Rust equivalence, while costing about twice as much on Rust.
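graphql-go-tools (Go), n=25

| Arm | craft | equiv | $/task |
| --- | --- | --- | --- |
| Opus 4.8 high | 2.90 | 0.73 | $3.98 |
| GPT-5.5 high | 2.72 | 0.73 | $4.69 |
| Opus 4.7 xhigh | 2.63 | 0.68 | $5.93 |
| Composer 2.5 | 2.48 | 0.60 | $0.71 |
| GLM 5.2 medium | 2.38 | 0.47 | $1.40 |

sqlparser-rs (Rust), n=25

| Arm | craft | equiv | $/task |
| --- | --- | --- | --- |
| Opus 4.8 high | 3.28 | 0.98 | $3.02 |
| Opus 4.7 xhigh | 2.98 | 0.97 | $3.55 |
| GPT-5.5 high | 2.94 | 0.96 | $3.41 |
| Composer 2.5 | 2.84 | 0.95 | $0.53 |
| GLM 5.2 medium | 2.69 | 0.78 | $1.04 |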
GLM ran at medium reasoning. Composer's Go cost is recovered from raw Cursor logs (directional); all other figures are from the calibrated per-repo panels.
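The "roughly twice Composer" figure can be checked directly from the $/task columns above. A minimal sketch with the numbers copied from the two tables; the dictionary layout and the `cost_ratio` helper are illustrative only, and the Go ratio inherits the directional caveat on Composer's Go cost:

```python
# $/task figures copied from the per-repo tables above.
cost_per_task = {
    "graphql-go-tools": {"GLM 5.2 medium": 1.40, "Composer 2.5": 0.71},
    "sqlparser-rs": {"GLM 5.2 medium": 1.04, "Composer 2.5": 0.53},
}


def cost_ratio(repo: str) -> float:
    """Return how many times more GLM costs per task than Composer in one repo."""
    arms = cost_per_task[repo]
    return arms["GLM 5.2 medium"] / arms["Composer 2.5"]


for repo in cost_per_task:
    print(f"{repo}: GLM costs {cost_ratio(repo):.2f}x Composer per task")
# graphql-go-tools: GLM costs 1.97x Composer per task
# sqlparser-rs: GLM costs 1.96x Composer per task
```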
how GLM works
Normal-sized patches, last on quality
Median agent patch by model (additions right, deletions left). Every arm adds more lines than the human PR here (Go +111 / −47, Rust +110 / −17), and GLM sits mid-field: it adds the most on Rust, is unremarkable on Go, and deletes little, like the Opus arms; GPT-5.5 and Composer churn more. So GLM's gap isn't a patch-size problem. It writes a normal-looking diff and still lands last on equivalence and craft.
graphql-go-tools (Go)

| Model | deleted | added |
| --- | --- | --- |
| GLM 5.2 | −16 | +222 |
| Composer 2.5 | −42 | +282 |
| GPT-5.5 | −35 | +324 |
| Opus 4.8 | −14 | +228 |
| Opus 4.7 | −15 | +213 |
sqlparser-rs (Rust)

| Model | deleted | added |
| --- | --- | --- |
| GLM 5.2 | −12 | +284 |
| Composer 2.5 | −35 | +235 |
| GPT-5.5 | −19 | +223 |
| Opus 4.8 | −18 | +175 |
| Opus 4.7 | −10 | +143 |
tokens, turns & patches by model

| Model | input/task | output/task | turns | patches |
| --- | --- | --- | --- | --- |
| GLM 5.2 | 3.2M | 22k | 122 | 50/50 |
| Composer 2.5 | 2.1M | 16k | n/a | 50/50 |
| GPT-5.5 | 4.8M | 15k | 94 | 50/50 |
| Opus 4.8 | 3.0M | 32k | 113 | 50/50 |
| Opus 4.7 | 5.6M | 31k | 100 | 46/50 |

Medians per task; input/output tokens and patch counts pooled across both repos. Input is context (including cache reads); output is generated tokens. Turns are the sqlparser-rs median, where capture is complete across the claude-code and codex arms. GLM runs the most turns (122 vs Opus 4.8's 113 and GPT-5.5's 94); Composer ran on Cursor, which batches its work under a few assistant turns, so its turn count isn't comparable. Patches = tasks with a non-empty diff (Opus 4.7's four misses are routing no-patches). GLM also grinds the longest by wall-clock, with a ~16-minute median that makes it the slowest arm; its worst Go task burned 14.1M tokens and $4.07 in a single 326-turn loop, almost all of it re-reading the same files.
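For a sense of scale on that worst Go task, the quoted figures can be turned into per-turn numbers. A rough sketch using only numbers from this section; comparing the 14.1M total against the median input assumes the total is dominated by input tokens, which the re-reading behavior suggests but the data here does not break out:

```python
worst_tokens = 14_100_000  # tokens burned by GLM's worst Go task
worst_turns = 326          # turns in that single loop
median_input = 3_200_000   # GLM median input tokens per task (pooled across repos)

# Average context carried on each turn of the runaway loop.
tokens_per_turn = worst_tokens / worst_turns

# How the runaway task compares with a typical GLM task.
multiple_of_median = worst_tokens / median_input

print(f"{tokens_per_turn:,.0f} tokens per turn")                # 43,252 tokens per turn
print(f"{multiple_of_median:.1f}x the median input per task")   # 4.4x the median input per task
```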
glm 5.2 comparison
cost vs local score
Compare weighted Stet quality against rollout spend.
[Scatter chart: repo-balanced local score versus cost per task; each point represents one model.]
| model | Local score | Cost |
| --- | --- | --- |
| GLM 5.2 medium | 64.6 | $1.22 |
| Composer 2.5 | 71.6 | $0.62 |
| GPT-5.5 high | 75.6 | $4.05 |
| Opus 4.8 high | 80.2 | $3.50 |
| Opus 4.7 xhigh | 75.9 | $4.74 |
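Two things can be read off these pooled figures. First, the costs are consistent with an equal-weight mean of the Go and Rust $/task values, which is one plausible reading of "repo-balanced" (inferred from the numbers, not stated by the harness). Second, a simple dominance check shows which arms are beaten on both score and cost by some other arm. The sketch below uses only figures from the tables; the helper names are hypothetical:

```python
from statistics import mean

# (Go $/task, Rust $/task) from the per-repo tables.
per_repo_cost = {
    "GLM 5.2 medium": (1.40, 1.04),
    "Composer 2.5": (0.71, 0.53),
    "GPT-5.5 high": (4.69, 3.41),
    "Opus 4.8 high": (3.98, 3.02),
    "Opus 4.7 xhigh": (5.93, 3.55),
}

# model -> (repo-balanced local score, cost per task) from the comparison table.
pooled = {
    "GLM 5.2 medium": (64.6, 1.22),
    "Composer 2.5": (71.6, 0.62),
    "GPT-5.5 high": (75.6, 4.05),
    "Opus 4.8 high": (80.2, 3.50),
    "Opus 4.7 xhigh": (75.9, 4.74),
}


def matches_repo_mean(model: str, tol: float = 0.006) -> bool:
    """True if the pooled cost equals the equal-weight per-repo mean, within rounding."""
    return abs(mean(per_repo_cost[model]) - pooled[model][1]) <= tol


def dominated_by(model: str) -> list[str]:
    """Arms that score at least as high and cost no more, and are strictly better on one."""
    score, cost = pooled[model]
    return [
        other
        for other, (s, c) in pooled.items()
        if other != model and s >= score and c <= cost and (s > score or c < cost)
    ]


for model in pooled:
    print(model, matches_repo_mean(model), dominated_by(model))
```

On these numbers GLM 5.2 medium is dominated by Composer 2.5 (higher score at lower cost), while Composer 2.5 and Opus 4.8 high are dominated by nothing.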