A brief investigation into the GPT-5.5 regression claims


GPT-5.5 High Regression Check on GraphQL-go-tools — Stet

May 18, 2026 · Updated May 19, 2026

TL;DR

There were reports of a GPT-5.5 regression, so I compared GPT-5.5 high against a prior run from May 5 on 21 tasks from the GraphQL-go-tools open-source repo.

On that slice, GPT-5.5 high moved from:

19/21 resolved to 18/21, a -1 test/gate movement.

14/21 equivalent to 12/21, a -2 equivalence movement.

8/21 code-review pass to 7/21, a -1 review-pass movement.

That is a real directional regression on tests, equivalence, and review pass count, but it is not a broad quality collapse. Code-review rubric mean improved, most craft/discipline averages improved, and cost per task was roughly flat. Footprint risk was roughly flat, with a small mean increase.

The strongest negative signal is qualitative: GPT-5.5 high still writes plausible, test-passing patches, but it shows recurring weakness on deep invariants where tests do not fully encode lifecycle, concurrency, or GraphQL validity requirements.

Scorecard

For this post, "equivalent" means the patch matched the intent of the merged human PR; "code-review pass" means an AI reviewer judged it acceptable; craft/discipline is a 0-4 maintainability/style rubric; footprint risk is how much extra code the agent touched relative to the human patch.

| Metric | 5/5 GPT-5.5 high | 5/18 GPT-5.5 high | Delta |
| --- | --- | --- | --- |
| Clean comparable tasks | 21 | 21 | flat |
| Tests / Harbor resolved | 19/21 | 18/21 | -1 |
| Equivalent to human PR | 14/21 | 12/21 | -2 |
| Code-review pass | 8/21 | 7/21 | -1 |
| Code-review fail | 11/21 | 12/21 | +1 |
| Code-review unsure | 2/21 | 2/21 | flat |
| Code-review rubric mean | 2.704 | 2.857 | +0.154 |
| Footprint risk, lower better | 0.319 | 0.328 | +0.009 |
| Craft avg | 2.745 | 2.814 | +0.069 |
| Discipline avg | 2.651 | 2.818 | +0.167 |
| All custom graders avg | 2.709 | 2.815 | +0.105 |
| Cost per task | $6.45 | $6.54 | +$0.09 |
| Mean agent duration | 558s | 611s | +53s |
| Input tokens | 126.9M | 153.2M | +26.4M |
| Output tokens | 328.3K | 360.4K | +32.1K |
| Cached input tokens | 113.1M | 142.2M | +29.1M |
| Uncached input tokens | 13.9M | 11.1M | -2.8M |

The outcome metrics are worse by one resolved row, two equivalent rows, and one review-pass row. The equivalence movement is a net number: four prior equivalent rows became non-equivalent, while two prior non-equivalent rows became equivalent. Rubric means improve; cost is slightly higher.
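The net equivalence figure can be reproduced with a small sketch. Only the counts (14 prior equivalent rows, four lost, two gained) come from the scorecard; which specific tasks flipped is not stated here, so the task ordering below is purely illustrative, and `movement` is a hypothetical helper name.

```python
# Paired outcome movement between two runs over the same tasks.
# Only the counts (14 prior, 4 lost, 2 gained) come from the scorecard;
# the position of each task in these lists is an illustrative assumption.

def movement(prior, new):
    """Return (lost, gained, net) for paired boolean outcomes."""
    lost = sum(1 for p, n in zip(prior, new) if p and not n)
    gained = sum(1 for p, n in zip(prior, new) if not p and n)
    return lost, gained, gained - lost

# Prior run: 14 equivalent rows, 7 non-equivalent rows.
prior = [True] * 14 + [False] * 7

# New run: 4 of the 14 flip to non-equivalent, 2 of the 7 flip to equivalent.
new = [False] * 4 + [True] * 10 + [True] * 2 + [False] * 5

lost, gained, net = movement(prior, new)
print(sum(prior), sum(new), lost, gained, net)  # prints: 14 12 4 2 -2
```

Six rows changed state in total, so the headline -2 understates how much churn sits underneath it.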

That mix is why I'd describe this as a targeted reliability concern and not a blanket model-quality regression. I'll look at specific examples below.

[Figure: regression scorecard, n = 21 clean tasks, judge: gpt-5.4. Columns compare May 5 high (prior GPT-5.5 high) with May 18 high (fresh GPT-5.5 high) across three panels: outcomes (21 clean comparable tasks), quality (0-4 unless noted), and cost and effort (lower is better).]

Fresh GPT-5.5 high lost one resolved task, two equivalent patches, and one review-pass row. Rubric means still moved up, so the signal reads as targeted reliability risk rather than a broad collapse.

Grader Detail

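| Grader | Prior GPT-5.5 high | New GPT-5.5 high | Delta |
| --- | --- | --- | --- |
| Simplicity | 2.852 | 2.823 | -0.029 |
| Coherence | 2.543 | 2.581 | +0.039 |
| Intentionality | 3.081 | 3.153 | +0.072 |
| Robustness | 2.447 | 2.557 | +0.110 |
| Clarity | 2.800 | 2.800 | flat |
| Instruction adherence | 2.728 | 2.671 | -0.057 |
| Scope discipline | 2.610 | 2.976 | +0.366 |
| Diff minimality | 2.615 | 2.700 | +0.085 |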

Most grader means move up, especially scope discipline and robustness. Simplicity and instruction adherence move down. The new run often looked cleaner and more disciplined, but it sometimes missed specific obligations or semantic boundaries.

Most small deltas are within expected variance. I wouldn't read too much into them individually.

Tool Behavior

The new run used more agent interaction overall: more total tool calls, shell calls, and patch calls, but fewer detected test-command invocations.

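| Metric | Prior GPT-5.5 high | New GPT-5.5 high | Delta |
| --- | --- | --- | --- |
| Captured task traces | 21/21 | 21/21 | flat |
| Mean agent steps | 100.1 | 107.2 | +7.1 |
| Mean tool calls | 78.4 | 87.0 | +8.5 |
| Mean shell calls | 65.4 | 74.1 | +8.7 |
| Mean patch calls | 13.1 | 13.8 | +0.6 |
| Mean test-command calls | 9.7 | 9.0 | -0.6 |
| Max tool calls on one task | 125 | 197 | +72 |
| Max patch calls on one task | 33 | 50 | +17 |
| Total tool calls | 1,647 | 1,826 | +179 |
| Total shell calls | 1,374 | 1,557 | +183 |
| Total patch calls | 276 | 289 | +13 |
| Total test-command calls | 203 | 190 | -13 |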

The new run shows more shell and tool activity overall, especially on a few high-effort rows, and slightly more patch calls after removing the contaminated row. That lines up with the broader read: the new run often looked more disciplined, while still missing some specific semantic obligations the old run caught.
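As a consistency check, the per-task means in the Tool Behavior table can be recomputed from the totals. This is a minimal sketch that assumes each mean is simply the total divided by the 21 captured traces; the post does not state how the means were computed, and `per_task` is a hypothetical helper name.

```python
# Recompute per-task means and deltas from the reported totals.
# Assumption: mean = total / 21 captured traces (not stated in the post).
N_TASKS = 21

# (prior total, new total) as reported in the Tool Behavior table.
totals = {
    "tool": (1647, 1826),
    "shell": (1374, 1557),
    "patch": (276, 289),
    "test-command": (203, 190),
}

def per_task(totals, n=N_TASKS):
    """Return {name: (prior_mean, new_mean, delta)}, each rounded to 1 decimal."""
    rows = {}
    for name, (prior, new) in totals.items():
        prior_mean, new_mean = prior / n, new / n
        # Take the delta before rounding, so it can differ by 0.1
        # from the difference of the two rounded means.
        delta = new_mean - prior_mean
        rows[name] = (round(prior_mean, 1), round(new_mean, 1), round(delta, 1))
    return rows

for name, (prior_mean, new_mean, delta) in per_task(totals).items():
    print(f"{name:13s} {prior_mean:5.1f} {new_mean:5.1f} {delta:+.1f}")
```

Under that assumption the recomputed values match the table. It also accounts for the mean tool calls delta reading +8.5 even though 87.0 minus 78.4 is 8.6: the unrounded difference is about 8.52.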

Case Studies

Real Negative Semantic Flip: pr-1076

This is the clearest regression row. The task was to rework GraphQL subscription concurrency so response writes are serialized per subscription, heartbeat writes move out of shared global state,...
