Opus 4.8 vs Opus 4.7 vs GPT-5.5 vs Composer 2.5 - 50 Real PRs in Go and Rust — Stet
June 2, 2026
Opus 4.8 is finally out - how good is it actually?
In this benchmark I compared Opus 4.8 against the rest of the frontier (GPT-5.5, Opus 4.7, Composer 2.5) on 50 real tasks from two open-source repos: graphql-go-tools (Go) and sqlparser-rs (Rust). Together they represent complex backend software engineering work across a variety of task types.
The important part is that these repos are arbitrary. I could have tested the models on my own repo, with my own tasks, to see how the frontier performs on domain-specific work. The goal here is to explore, with some granularity, how a benchmark like this is built and what it can actually tell you about model behavior.
The result
The king is back. On this n=50 slice, Opus 4.8 is the craft leader in both Go and Rust, and it dominates the two premium-reasoning arms - GPT-5.5 high and Opus 4.7 xhigh - on the cost-quality plane: equal-or-better craft while running cheaper and leaner. Its only loss is raw price - Composer 2.5 is ~6.5× cheaper on Rust and ~7× on Go, but materially weaker on craft.
Against GPT-5.5 it's a clean win: better craft and leaner everywhere, cheaper on Rust and on par on Go. Against Opus 4.7 xhigh it matches or beats its own predecessor at a lower reasoning tier, plus a clean reliability win. Against Composer it's the quality win and the price loss.
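"Dominates on the cost-quality plane" is the usual Pareto notion: no worse on either axis and strictly better on at least one. Here is a minimal sketch of that check, assuming a single cost number and a single quality number per arm. The `Arm` type, the arm names, and all the numbers are hypothetical placeholders, not benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arm:
    name: str
    cost: float   # dollars per task, lower is better
    score: float  # quality score, higher is better


def dominates(a: Arm, b: Arm) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.cost <= b.cost and a.score >= b.score
    strictly_better = a.cost < b.cost or a.score > b.score
    return no_worse and strictly_better


def pareto_frontier(arms: list[Arm]) -> list[Arm]:
    """Arms that no other arm dominates."""
    return [a for a in arms if not any(dominates(b, a) for b in arms)]


# Hypothetical arms for illustration only.
arms = [
    Arm("arm_a", cost=1.50, score=72.0),
    Arm("arm_b", cost=2.00, score=68.0),
    Arm("arm_c", cost=2.50, score=70.0),
    Arm("arm_d", cost=0.25, score=64.0),
]
print([a.name for a in pareto_frontier(arms)])
```

In this toy data `arm_a` dominates `arm_b` and `arm_c`, while `arm_d` stays on the frontier because it is cheaper despite scoring lower. That is the same shape as a quality win paired with a price loss.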
The binary test gate is near-saturated and is not the axis that separates these models: pooled pass counts are 47, 44, 44, and 42 of 50 for Opus 4.8, GPT-5.5, Composer 2.5, and Opus 4.7 xhigh respectively (see the next section). The separation lives in the craft band above the gate.
How strong is each claim? The craft win over Composer is decision-grade in both repos; over GPT-5.5 it's decision-grade on Rust but only directional on Go; and the exact ordering among the "premium" models is directional (n=25, one grader pass). "Decision-grade" vs "directional" is defined two sections down.
Opus 4.8 launch comparison: canonical n=50 Go + Rust slice
Every frontier model clears the test gate, so tests can't separate them. The separation is in the craft band above the gate, where Opus 4.8 leads in both Go and Rust while running cheaper than both premium arms.
[Figure: cost vs custom-score frontier. Custom score (0-100) on the y-axis, cost per task in dollars (log scale) on the x-axis. Arms plotted: Opus 4.8, GPT-5.5, Composer 2.5, Opus 4.7 xhigh.]
Custom score = 5% tests + 30% equivalence + 25% code review + 25% craft + 15% footprint, scaled 0-100. Opus 4.8 sits up-and-left of both premium arms - higher score at lower cost - while Composer anchors the cheap, lower-score corner. The composite is a directional read; the calibrated per-grader claims are in the calibration table below.
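The custom-score formula in the caption can be sketched as a weighted sum. The weights come from the caption. The assumption that each component is already normalized to the range 0 to 1 is mine, since the post does not say how the raw grades are rescaled. The function name and the example inputs are hypothetical.

```python
# Weights as stated in the chart caption; they sum to 1.0.
WEIGHTS = {
    "tests": 0.05,
    "equivalence": 0.30,
    "code_review": 0.25,
    "craft": 0.25,
    "footprint": 0.15,
}


def custom_score(components: dict[str, float]) -> float:
    """Weighted composite on a 0-100 scale.

    Assumption: every component is pre-normalized to [0, 1].
    """
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {components[name]}")
    return 100.0 * sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Hypothetical inputs, not values from the benchmark.
example = {
    "tests": 1.0,
    "equivalence": 0.5,
    "code_review": 0.6,
    "craft": 0.6,
    "footprint": 0.8,
}
print(round(custom_score(example), 1))
```

Because tests carry only 5% of the weight, a model that passes every test but scores poorly elsewhere still lands low on the composite. That fits the post's point that the gate is a floor rather than the separating axis.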
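gate and craft headroom: exact per-repo values; tests are the floor

| model | tests | Rust craft | Rust CR | Rust equiv | Go craft | Go CR | Go equiv | read |
|---|---|---|---|---|---|---|---|---|
| Opus 4.8 | 47/50 | 3.28 | 3.32 | 92% | 2.90 | 2.76 | 40% | craft leader |
| GPT-5.5 | 44/50 | 2.94 | 3.03 | 88% | 2.72 | 2.51 | 44% | Go equiv edge |
| Opus 4.7 xhigh | 42/50 | 2.98 | 3.44 | 72% | 2.63 | 2.29 | 28% | reliability gap |
| Composer 2.5 | 44/50 | 2.84 | 3.20 | 80% | 2.48 | 2.25 | 28% | price tradeoff |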
behavioral fingerprint: four shapes the gate collapses into one column
[Figure: one radar chart per arm, with axes tests, equiv, review, craft, footprint, cost-eff.]
Opus 4.8: the disciplined frontier
GPT-5.5: the gate-passer
Composer 2.5: the sprinter
Opus 4.7 xhigh: the over-thinker
Axes are normalized 0-1 across these four arms (relative, not absolute) and repo-balanced, so the smallest lobe means worst-of-four, not zero. The gate sees one near-flat column; graded measurement sees four distinct shapes, which is what you actually choose on.
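One plausible reading of "normalized 0-1 across these four arms and repo-balanced" is min-max scaling within each repo, followed by an unweighted mean across repos. The sketch below makes exactly those two assumptions; the post does not spell out the method, and the real chart may apply a floor so the worst arm still draws a visible lobe. The craft values are the ones reported in the gate and craft headroom table; the function names are hypothetical.

```python
def min_max(values: dict[str, float]) -> dict[str, float]:
    """Rescale one repo's scores so the worst arm maps to 0 and the best to 1."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All arms tie; there is no relative spread to show.
        return {arm: 0.0 for arm in values}
    return {arm: (v - lo) / (hi - lo) for arm, v in values.items()}


def repo_balanced(per_repo: dict[str, dict[str, float]]) -> dict[str, float]:
    """Normalize within each repo, then give every repo equal weight."""
    normalized = [min_max(scores) for scores in per_repo.values()]
    arms = list(normalized[0])
    return {arm: sum(n[arm] for n in normalized) / len(normalized) for arm in arms}


# Craft scores as reported in the per-repo table.
CRAFT = {
    "rust": {"Opus 4.8": 3.28, "GPT-5.5": 2.94, "Opus 4.7 xhigh": 2.98, "Composer 2.5": 2.84},
    "go": {"Opus 4.8": 2.90, "GPT-5.5": 2.72, "Opus 4.7 xhigh": 2.63, "Composer 2.5": 2.48},
}

for arm, value in repo_balanced(CRAFT).items():
    print(f"{arm}: {value:.2f}")
```

Under this scheme a value of 0 means only "lowest of these four arms in both repos". That is why the lobes should be read as relative positions rather than absolute quality.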
claim calibration: decision-grade, directional, or no survivor

| pair | q=0.05 | q=0.10 | claim strength |
|---|---|---|---|
| Go vs Composer | 10/11 | 10/11 | uniform craft dominance |
| Rust vs Composer | 7/11 | 8/11 | robust after clean 2107 |
| Rust vs GPT-5.5 | 4/11 | 6/11 | decision-grade Rust craft edge |
| Go vs GPT-5.5 | 0/11 | 1/11 | directional only |
| Rust vs Opus 4.7 | 0/11 | 0/11 | even match |
| Go vs Opus 4.7 | 2/11 | 4/11 | ahead of predecessor in Go |
The grey rows are the honesty mechanism: they stay visible, but they do not get promoted into clean winner claims.
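The x/11 counts at q=0.05 and q=0.10 read like survivor counts from a Benjamini-Hochberg false-discovery-rate procedure run over 11 comparisons per pair. Here is a minimal sketch of the standard step-up procedure under that reading. The p-values are hypothetical, and the idea that each pair has 11 tested dimensions is an assumption drawn from the table's denominators.

```python
def bh_survivors(p_values: list[float], q: float) -> list[int]:
    """Indices of hypotheses rejected by Benjamini-Hochberg at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest 1-based rank k with p_(k) <= (k / m) * q,
    # then reject every hypothesis ranked at or below k.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical p-values for 11 comparisons; not data from the benchmark.
p_values = [0.001, 0.004, 0.008, 0.012, 0.03, 0.04, 0.06, 0.2, 0.35, 0.5, 0.9]

for q in (0.05, 0.10):
    kept = bh_survivors(p_values, q)
    print(f"q={q:.2f}: {len(kept)}/{len(p_values)} survive")
```

Loosening q from 0.05 to 0.10 can only keep the same survivors or add more, which matches the pattern in every row of the table.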
cost-ratio evidence: ratio = Opus 4.8 / baseline; lower is leaner

| pair | cost | tokens | grade | caveat |
|---|---|---|---|---|
| Rust vs GPT-5.5 | 0.81x | 0.71x | DG cost/tokens | DG craft edge |
| Go vs GPT-5.5 | 0.83x | 0.60x | 0.83x Go GPT cost | noise-band |
| Go vs Opus 4.7 | 0.66x | 0.50x | DG all three | ahead in Go |
| Rust vs Opus 4.7 | 1.23x | 1.07x | wash | directional |
| Rust vs Composer | 6.47x | 1.63x | Composer cheaper | owner-attested |
| Go vs Composer | 7.15x | 1.40x | Composer Go cost | recovered |
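The ratio convention above (Opus 4.8 divided by the baseline, lower is leaner) is easy to misread in the Composer rows, so here is a small sketch of the arithmetic. The function names and the dollar amounts are hypothetical; only the ratios 0.81 and 6.47 come from the table.

```python
def cost_ratio(opus_48: float, baseline: float) -> float:
    """Ratio = Opus 4.8 / baseline; values below 1 mean Opus 4.8 is leaner."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return opus_48 / baseline


def describe(ratio: float) -> str:
    """Turn a ratio into a plain-language reading."""
    if ratio < 1:
        return f"Opus 4.8 costs {ratio:.2f}x the baseline ({(1 - ratio) * 100:.0f}% cheaper)"
    if ratio > 1:
        return f"baseline is {ratio:.2f}x cheaper than Opus 4.8"
    return "same cost"


# Hypothetical per-task dollar amounts chosen to reproduce a 0.81x ratio.
print(describe(cost_ratio(1.62, 2.00)))
# A ratio taken directly from the Rust vs Composer row.
print(describe(6.47))
```

A ratio of 6.47x therefore means the baseline, here Composer, is about 6.5 times cheaper. It does not mean Opus 4.8 is 6.5 times leaner.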
n=50, two repos, single seed, GPT-5.4 judge, per-repo replication.
Headline craft claims rest on BH-FDR calibration; the frontier custom score is a directional composite, not a calibrated ranking.
GPT-5.5 Go cost was re-priced from a cache artifact to 0.83x noise-band.
Composer Rust test validity is owner-attested via waiver, not hermetic; craft and cost axes are...