GPT-5.5 High Regression Check on GraphQL-go-tools — Stet
May 18, 2026 · Updated May 19, 2026
TL;DR
There were reports of a GPT-5.5 regression, so I compared GPT-5.5 high against a prior run from May 5 on 21 tasks from the GraphQL-go-tools open-source repo.
On that slice, GPT-5.5 high moved from:
19/21 resolved to 18/21, a -1 test/gate movement.
14/21 equivalent to 12/21, a -2 equivalence movement.
8/21 code-review pass to 7/21, a -1 review-pass movement.
That is a real directional regression on tests, equivalence, and review pass count, but it is not a broad quality collapse. Code-review rubric mean improved, most craft/discipline averages improved, and cost per task was roughly flat. Footprint risk was roughly flat, with a small mean increase.
The strongest negative signal is qualitative: GPT-5.5 high still writes plausible, test-passing patches, but it shows recurring weakness on deep invariants where tests do not fully encode lifecycle, concurrency, or GraphQL validity requirements.
Scorecard
For this post, "equivalent" means the patch matched the intent of the merged human PR; "code-review pass" means an AI reviewer judged it acceptable; craft/discipline is a 0-4 maintainability/style rubric; footprint risk is how much extra code the agent touched relative to the human patch.
| Metric | May 5 GPT-5.5 high | May 18 GPT-5.5 high | Delta |
| --- | --- | --- | --- |
| Clean comparable tasks | 21 | 21 | flat |
| Tests / Harbor resolved | 19/21 | 18/21 | -1 |
| Equivalent to human PR | 14/21 | 12/21 | -2 |
| Code-review pass | 8/21 | 7/21 | -1 |
| Code-review fail | 11/21 | 12/21 | +1 |
| Code-review unsure | 2/21 | 2/21 | flat |
| Code-review rubric mean | 2.704 | 2.857 | +0.154 |
| Footprint risk, lower better | 0.319 | 0.328 | +0.009 |
| Craft avg | 2.745 | 2.814 | +0.069 |
| Discipline avg | 2.651 | 2.818 | +0.167 |
| All custom graders avg | 2.709 | 2.815 | +0.105 |
| Cost per task | $6.45 | $6.54 | +$0.09 |
| Mean agent duration | 558s | 611s | +53s |
| Input tokens | 126.9M | 153.2M | +26.4M |
| Output tokens | 328.3K | 360.4K | +32.1K |
| Cached input tokens | 113.1M | 142.2M | +29.1M |
| Uncached input tokens | 13.9M | 11.1M | -2.8M |
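As a side check on the token rows, here is a minimal sketch that computes the cached share of input tokens for each run and the gap between cached plus uncached and the reported input total. The token figures are copied from the scorecard. The dictionary keys and function names are my own, and the post does not say how tokens were priced, so this says nothing about cost directly.

```python
# Token totals in millions, copied from the scorecard (rounded to 0.1M there).
runs = {
    "prior (May 5)": {"input": 126.9, "cached": 113.1, "uncached": 13.9},
    "new (May 18)": {"input": 153.2, "cached": 142.2, "uncached": 11.1},
}


def cached_share(run):
    """Fraction of input tokens that were served from cache."""
    return run["cached"] / run["input"]


def rounding_residual(run):
    """Cached + uncached minus the reported input total, in millions."""
    return run["cached"] + run["uncached"] - run["input"]


for name, run in runs.items():
    print(
        f"{name}: cached share {cached_share(run):.1%}, "
        f"residual {rounding_residual(run):+.1f}M"
    )
```

On these figures the cached share rises from about 89% to about 93%, and the 0.1M residual in each run is consistent with rounding in the published numbers. One plausible reading is that most of the extra input tokens in the new run were cache hits, which would fit a roughly flat cost per task, but that is an inference rather than something the scorecard states.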
The outcome metrics are worse by one resolved row, two equivalent rows, and one review-pass row. The equivalence movement is a net number: four prior equivalent rows became non-equivalent, while two prior non-equivalent rows became equivalent. Rubric means improve; cost is slightly higher.
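To make the "net number" point concrete, here is a small sketch of how per-task flips roll up into the -2 equivalence movement. The task IDs and the assignment of which tasks flipped are synthetic placeholders. Only the counts (21 tasks, 14 prior equivalent, four lost, two gained) come from the numbers above.

```python
# Synthetic per-task labels: the task IDs are placeholders, not real benchmark rows.
# Only the counts (21 tasks, 14 prior equivalent, 4 lost, 2 gained) come from the post.
TASKS = [f"task-{i:02d}" for i in range(1, 22)]

# Prior run: the first 14 placeholder tasks are marked equivalent.
prior = {task: index < 14 for index, task in enumerate(TASKS)}

# New run: start from the prior labels, then apply the flips.
new = dict(prior)
for task in TASKS[:4]:
    new[task] = False  # four prior-equivalent rows became non-equivalent
for task in TASKS[14:16]:
    new[task] = True   # two prior non-equivalent rows became equivalent


def flips(before, after):
    """Return (lost, gained) task lists between two runs over the same tasks."""
    lost = sorted(t for t in before if before[t] and not after[t])
    gained = sorted(t for t in before if not before[t] and after[t])
    return lost, gained


lost, gained = flips(prior, new)
net = len(gained) - len(lost)
print(sum(prior.values()), sum(new.values()), net)  # 14 12 -2
```

The headline delta only reports `net`. Six rows changed label in total, which is why the per-row case studies matter more than the aggregate.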
That mix is why I'd describe this as a targeted reliability concern and not a blanket model-quality regression. I'll look at specific examples below.
[Figure: regression scorecard, n = 21 clean tasks, judge: gpt-5.4. Series: May 5 high (prior GPT-5.5 high) vs. May 18 high (fresh GPT-5.5 high). Panels: outcomes (21 clean comparable tasks); quality (0-4 unless noted); cost and effort (lower is better).]
Fresh GPT-5.5 high lost one resolved task, two equivalent patches, and one review-pass row. Rubric means still moved up, so the signal reads as targeted reliability risk rather than a broad collapse.
Grader Detail
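| Grader | Prior GPT-5.5 high | New GPT-5.5 high | Delta |
| --- | --- | --- | --- |
| Simplicity | 2.852 | 2.823 | -0.029 |
| Coherence | 2.543 | 2.581 | +0.039 |
| Intentionality | 3.081 | 3.153 | +0.072 |
| Robustness | 2.447 | 2.557 | +0.110 |
| Clarity | 2.800 | 2.800 | flat |
| Instruction adherence | 2.728 | 2.671 | -0.057 |
| Scope discipline | 2.610 | 2.976 | +0.366 |
| Diff minimality | 2.615 | 2.700 | +0.085 |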
Most grader means move up, especially scope discipline and robustness. Simplicity and instruction adherence move down: the new run often looked cleaner and more disciplined, but it sometimes missed specific obligations or semantic boundaries.
Most small deltas are within expected variance. I wouldn't read too much into them individually.
Tool Behavior
The new run used more agent interaction overall: more total tool calls, shell calls, and patch calls, but fewer detected test-command invocations.
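| Metric | Prior GPT-5.5 high | New GPT-5.5 high | Delta |
| --- | --- | --- | --- |
| Captured task traces | 21/21 | 21/21 | flat |
| Mean agent steps | 100.1 | 107.2 | +7.1 |
| Mean tool calls | 78.4 | 87.0 | +8.5 |
| Mean shell calls | 65.4 | 74.1 | +8.7 |
| Mean patch calls | 13.1 | 13.8 | +0.6 |
| Mean test-command calls | 9.7 | 9.0 | -0.6 |
| Max tool calls on one task | 125 | 197 | +72 |
| Max patch calls on one task | 33 | 50 | +17 |
| Total tool calls | 1,647 | 1,826 | +179 |
| Total shell calls | 1,374 | 1,557 | +183 |
| Total patch calls | 276 | 289 | +13 |
| Total test-command calls | 203 | 190 | -13 |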
The new run shows more shell and tool activity overall, especially on a few high-effort rows, and slightly more patch calls after removing the contaminated row. That lines up with the broader read: the new run often looked more disciplined, while still missing some specific semantic obligations the old run caught.
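The per-task means in the tool-behavior table can be recomputed from the totals, assuming each mean is simply the total divided by the 21 captured traces. The sketch below does that. The dictionary layout and function names are illustrative, and the totals are copied from the table.

```python
N_TASKS = 21  # captured task traces in both runs

# (prior total, new total) per call type, copied from the tool-behavior table.
totals = {
    "tool": (1647, 1826),
    "shell": (1374, 1557),
    "patch": (276, 289),
    "test-command": (203, 190),
}


def per_task(total, n=N_TASKS):
    """Mean calls per task, assuming the mean is total / number of traces."""
    return total / n


def summarize(prior_total, new_total, n=N_TASKS):
    """Return (prior mean, new mean, delta), each rounded to one decimal.

    The delta is taken on the unrounded means and rounded last.
    """
    prior_mean = per_task(prior_total, n)
    new_mean = per_task(new_total, n)
    return round(prior_mean, 1), round(new_mean, 1), round(new_mean - prior_mean, 1)


for kind, (prior_total, new_total) in totals.items():
    print(kind, summarize(prior_total, new_total))
```

Under that assumption the recomputed values match the table. It also explains why some deltas differ from the difference of the rounded means by 0.1: the mean tool-call delta is +8.5 even though 87.0 minus 78.4 is 8.6, because the delta is computed before rounding.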
Case Studies
Real Negative Semantic Flip: pr-1076
This is the clearest regression row. The task was to rework GraphQL subscription concurrency so response writes are serialized per subscription, heartbeat writes move out of shared global state,...