I benchmarked Opus 4.8 vs. GPT 5.5 on 2 open source repos

Opus 4.8 vs Opus 4.7 vs GPT-5.5 vs Composer 2.5 - 50 Real PRs in Go and Rust

Stet, June 2, 2026

Opus 4.8 is finally out - how good is it actually?

In this benchmark I compared Opus 4.8 against the rest of the frontier (GPT-5.5, Opus 4.7, Composer 2.5) on 50 real tasks from two open-source repos - graphql-go-tools and sqlparser-rs, Go and Rust respectively - representing complex backend software engineering work across a variety of tasks.

The important part is that these repos are arbitrary. I could have tested the models on my own repo, with my own tasks, to see how the frontier performs on domain-specific work. The goal here is to explore, with some granularity, how a benchmark like this is built and what it can actually tell you about model behavior.

The result

The king is back. On this n=50 slice, Opus 4.8 is the craft leader in both Go and Rust, and it dominates the two premium-reasoning arms - GPT-5.5 high and Opus 4.7 xhigh - on the cost-quality plane: equal-or-better craft while running cheaper and leaner. Its only loss is raw price - Composer 2.5 is ~6.5× cheaper on Rust and ~7× on Go, but materially weaker on craft.

Against GPT-5.5 it's a clean win: better craft and leaner everywhere, cheaper on Rust and on par on Go. Against Opus 4.7 xhigh it matches or beats its own predecessor at a lower reasoning tier, plus a clean reliability win. Against Composer it's the quality win and the price loss.

The binary test gate is near-saturated and is not the axis that separates these models: pooled pass counts are 47, 44, 44, and 42 of 50 (Opus 4.8, GPT-5.5, Composer 2.5, and Opus 4.7 xhigh, per the gate table below). The separation lives in the craft band above the gate.

How strong is each claim? The craft win over Composer is decision-grade in both repos; over GPT-5.5 it's decision-grade on Rust but only directional on Go; and the exact ordering among the "premium" models is directional (n=25, one grader pass). "Decision-grade" vs "directional" is defined two sections down.

Opus 4.8 launch comparison: canonical n=50 Go + Rust slice

Every frontier model clears the test gate, so tests can't separate them. The separation is in the craft band above the gate, where Opus 4.8 leads in both Go and Rust while running cheaper than both premium arms.

[Chart: cost vs custom-score frontier. x-axis: cost / task in dollars (log scale); y-axis: custom score (0-100). Points: Opus 4.8, GPT-5.5, Composer 2.5, Opus 4.7 xhigh.]

Custom score = 5% tests + 30% equivalence + 25% code review + 25% craft + 15% footprint, scaled 0-100. Opus 4.8 sits up-and-left of both premium arms - higher score at lower cost - while Composer anchors the cheap, lower-score corner. The composite is a directional read; the calibrated per-grader claims are in the calibration table below.
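The weighting in the custom-score formula above can be sketched in a few lines. This is a minimal illustration, not the article's actual scoring code: only the weights come from the caption. The function name `custom_score`, the assumption that each component is already normalized to the 0-1 range, and the example inputs are all hypothetical.

```python
# Weights as stated in the chart caption; they sum to 1.0.
WEIGHTS = {
    "tests": 0.05,
    "equivalence": 0.30,
    "code_review": 0.25,
    "craft": 0.25,
    "footprint": 0.15,
}


def custom_score(components: dict[str, float]) -> float:
    """Weighted sum of the five components, scaled to 0-100.

    Assumption: every component has already been normalized to [0, 1].
    How the article normalizes each grader is not stated.
    """
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {components[name]}")
    return 100.0 * sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Hypothetical inputs, chosen only to show the arithmetic.
example = {
    "tests": 1.0,
    "equivalence": 0.5,
    "code_review": 0.8,
    "craft": 0.6,
    "footprint": 0.4,
}
score = custom_score(example)
print(f"custom score: {score:.1f}")
```

Because tests carry only 5% of the weight, a model that passes every test but scores poorly on equivalence and review still lands low on this composite, which fits the point that the gate is the floor rather than the separator.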

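Gate and craft headroom: exact per-repo values; tests are the floor

| model | tests | Rust craft | Rust CR | Rust equiv | Go craft | Go CR | Go equiv | read |
|---|---|---|---|---|---|---|---|---|
| Opus 4.8 | 47/50 | 3.28 | 3.32 | 92% | 2.90 | 2.76 | 40% | craft leader |
| GPT-5.5 | 44/50 | 2.94 | 3.03 | 88% | 2.72 | 2.51 | 44% | Go equiv edge |
| Opus 4.7 xhigh | 42/50 | 2.98 | 3.44 | 72% | 2.63 | 2.29 | 28% | reliability gap |
| Composer 2.5 | 44/50 | 2.84 | 3.20 | 80% | 2.48 | 2.25 | 28% | price tradeoff |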

Behavioral fingerprint: four shapes the gate collapses into one column

[Radar charts, one per model, each with six axes: tests, equiv, review, craft, footprint, cost-eff.]

Opus 4.8: the disciplined frontier

GPT-5.5: the gate-passer

Composer 2.5: the sprinter

Opus 4.7 xhigh: the over-thinker

Axes are normalized 0-1 across these four arms (relative, not absolute) and repo-balanced, so the smallest lobe means worst-of-four, not zero. The gate sees one near-flat column; graded measurement sees four distinct shapes, which is what you actually choose on.
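One way to read "normalized 0-1 across these four arms" and "repo-balanced" is min-max scaling applied to the per-model mean of the Rust and Go values. That reading is an assumption: the article does not say whether balancing happens before or after normalization, and the helper names below are hypothetical. The craft numbers are the ones in the gate and craft headroom table.

```python
def repo_balanced(rust: dict[str, float], go: dict[str, float]) -> dict[str, float]:
    """Average the two repos with equal weight (assumed meaning of 'repo-balanced')."""
    return {arm: (rust[arm] + go[arm]) / 2 for arm in rust}


def normalize_across_arms(values: dict[str, float]) -> dict[str, float]:
    """Min-max scale so the best arm maps to 1.0 and the worst arm to 0.0."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # Assumption: if all arms tie, give every arm a full lobe.
        return {arm: 1.0 for arm in values}
    return {arm: (v - lo) / (hi - lo) for arm, v in values.items()}


# Craft scores from the gate and craft headroom table.
rust_craft = {"Opus 4.8": 3.28, "GPT-5.5": 2.94, "Opus 4.7 xhigh": 2.98, "Composer 2.5": 2.84}
go_craft = {"Opus 4.8": 2.90, "GPT-5.5": 2.72, "Opus 4.7 xhigh": 2.63, "Composer 2.5": 2.48}

balanced = repo_balanced(rust_craft, go_craft)
craft_axis = normalize_across_arms(balanced)

for arm, value in craft_axis.items():
    print(f"{arm}: balanced craft {balanced[arm]:.3f} -> axis value {value:.2f}")
```

Under this sketch Composer 2.5 gets an axis value of 0.0 on craft even though its balanced craft score is about 2.66, which is the "worst-of-four, not zero" caveat in concrete form.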

Claim calibration: decision-grade, directional, or no survivor

| pair | q=0.05 | q=0.10 | claim strength |
|---|---|---|---|
| Go vs Composer | 10/11 | 10/11 | uniform craft dominance |
| Rust vs Composer | 7/11 | 8/11 | robust after clean 2107 |
| Rust vs GPT-5.5 | 4/11 | 6/11 | decision-grade Rust craft edge |
| Go vs GPT-5.5 | 0/11 | 1/11 | directional only |
| Rust vs Opus 4.7 | 0/11 | 0/11 | even match |
| Go vs Opus 4.7 | 2/11 | 4/11 | ahead of predecessor in Go |

The grey rows are the honesty mechanism: they stay visible, but they do not get promoted into clean winner claims.
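The q=0.05 and q=0.10 columns read as counts of comparisons, out of 11 per pair, that survive a false-discovery-rate threshold; the methodology note names the procedure as BH-FDR (Benjamini-Hochberg). The sketch below shows the standard Benjamini-Hochberg step-up rule. The p-values are hypothetical, the function name is my own, and how the article derives its per-comparison p-values is not stated.

```python
def benjamini_hochberg(p_values: list[float], q: float) -> list[bool]:
    """Return, per hypothesis, whether it survives BH at FDR level q.

    Step-up rule: sort the p-values, find the largest rank k such that
    p_(k) <= (k / m) * q, and reject every hypothesis ranked at or below k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank  # keep the largest rank that passes

    survives = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            survives[idx] = True
    return survives


# Hypothetical p-values for 11 comparisons on one model pair.
p = [0.001, 0.004, 0.009, 0.012, 0.020, 0.030, 0.041, 0.050, 0.20, 0.45, 0.80]

survivors_q05 = sum(benjamini_hochberg(p, q=0.05))
survivors_q10 = sum(benjamini_hochberg(p, q=0.10))
print(f"q=0.05: {survivors_q05}/11, q=0.10: {survivors_q10}/11")
```

With these made-up inputs the looser threshold admits more survivors (5/11 at q=0.05, 8/11 at q=0.10), the same pattern as the Rust vs GPT-5.5 row moving from 4/11 to 6/11.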

Cost-ratio evidence: ratio = Opus 4.8 / baseline; lower is leaner

| pair | cost | tokens | grade | caveat |
|---|---|---|---|---|
| Rust vs GPT-5.5 | 0.81x | 0.71x | DG cost/tokens | DG craft edge |
| Go vs GPT-5.5 | 0.83x | 0.60x | 0.83x Go GPT cost | noise-band |
| Go vs Opus 4.7 | 0.66x | 0.50x | DG all three | ahead in Go |
| Rust vs Opus 4.7 | 1.23x | 1.07x | wash | directional |
| Rust vs Composer | 6.47x | 1.63x | Composer cheaper | owner-attested |
| Go vs Composer | 7.15x | 1.40x | Composer Go cost | recovered |
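Since every ratio is Opus 4.8 divided by the baseline, values below 1 mean Opus 4.8 is cheaper and values above 1 mean the baseline is. A small helper makes that reading explicit. The function name and wording are hypothetical; the ratios are the cost figures from the cost-ratio evidence above.

```python
def describe_cost_ratio(pair: str, ratio: float) -> str:
    """Turn an 'Opus 4.8 / baseline' cost ratio into a plain-language reading."""
    if ratio < 1.0:
        saving = (1.0 - ratio) * 100
        return f"{pair}: Opus 4.8 costs {ratio:.2f}x the baseline ({saving:.0f}% cheaper)"
    if ratio > 1.0:
        return f"{pair}: the baseline is {ratio:.2f}x cheaper than Opus 4.8"
    return f"{pair}: cost parity"


# Cost ratios as listed in the cost-ratio evidence.
cost_ratios = {
    "Rust vs GPT-5.5": 0.81,
    "Go vs GPT-5.5": 0.83,
    "Go vs Opus 4.7": 0.66,
    "Rust vs Opus 4.7": 1.23,
    "Rust vs Composer": 6.47,
    "Go vs Composer": 7.15,
}

for pair, ratio in cost_ratios.items():
    print(describe_cost_ratio(pair, ratio))
```

The two Composer rows are where the "~6.5× cheaper on Rust and ~7× on Go" figures in the summary come from.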

n=50, two repos, single seed, GPT-5.4 judge, per-repo replication. Headline craft claims rest on BH-FDR calibration; the frontier custom score is a directional composite, not a calibrated ranking. GPT-5.5 Go cost was re-priced from a cache artifact to 0.83x noise-band. Composer Rust test validity is owner-attested via waiver, not hermetic; craft and cost axes are...
