DeepSWE Benchmark updated with GLM 5.2 and updated results for other models


Leaderboard


113 tasks · updated June 20, 2026

[Figure: DeepSWE score (0% to 80%) plotted against average cost per task ($0 to $20), with a "most efficient" direction marker. Plotted configurations: claude-fable-5 [high] (Default), gpt-5.5 [medium] (Default), claude-opus-4.8 [high] (Default), gpt-5.4 [xhigh], glm-5.2 [max], gemini-3.5-flash [medium], kimi-k2.7-code, claude-sonnet-4.6 [high], gemini-3.1-pro [high].]


| Model [effort] | Pass@1 | Avg cost per task | Output tokens | Agent steps |
| --- | --- | --- | --- | --- |
| claude-fable-5 [max] | 70% ± 4% | $21.63 | 119k | 88 |
| gpt-5.5 [xhigh] | 67% ± 6% | $7.23 | 46k | 82 |
| claude-opus-4.8 [max] | 59% ± 2% | $13.22 | 135k | 120 |
| gpt-5.4 [xhigh] | 52% ± 2% | $5.65 | 71k | 70 |
| glm-5.2 [max] | 44% ± 2% | $3.92 | 78k | 129 |
| gemini-3.5-flash [medium] | 37% ± 2% | $7.34 | 276k | 86 |
| kimi-k2.7-code | 31% ± 1% | $2.82 | 59k | 149 |
| claude-sonnet-4.6 [high] | 30% ± 4% | $5.52 | 76k | 134 |
| gemini-3.1-pro [high] | 12% ± 2% | $9.48 | 196k | 81 |
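One way to read the score and cost columns together is to ask which configurations are not beaten on both axes at once. The sketch below is illustrative only: it copies the Pass@1 and average-cost figures from the leaderboard above, and the helper names (`pareto_frontier`, `cost_per_solved_task`) and the "cost per solved task" metric are assumptions introduced here, not measures the benchmark itself reports.

```python
# Leaderboard rows copied from the table above:
# (model, Pass@1 in percent, average cost per task in USD).
ROWS = [
    ("claude-fable-5 [max]", 70, 21.63),
    ("gpt-5.5 [xhigh]", 67, 7.23),
    ("claude-opus-4.8 [max]", 59, 13.22),
    ("gpt-5.4 [xhigh]", 52, 5.65),
    ("glm-5.2 [max]", 44, 3.92),
    ("gemini-3.5-flash [medium]", 37, 7.34),
    ("kimi-k2.7-code", 31, 2.82),
    ("claude-sonnet-4.6 [high]", 30, 5.52),
    ("gemini-3.1-pro [high]", 12, 9.48),
]


def pareto_frontier(rows):
    """Return rows that no other row beats on both score and cost.

    A row is dominated when another row has a score at least as high and a
    cost at least as low, and is strictly better on one of the two.
    """
    frontier = []
    for name, score, cost in rows:
        dominated = any(
            (s >= score and c <= cost) and (s > score or c < cost)
            for _, s, c in rows
        )
        if not dominated:
            frontier.append((name, score, cost))
    # Cheapest first, so the list reads left to right along the cost axis.
    return sorted(frontier, key=lambda row: row[2])


def cost_per_solved_task(score_pct, avg_cost):
    """Hypothetical metric: average spend per task divided by the solve rate."""
    return avg_cost / (score_pct / 100)


if __name__ == "__main__":
    for name, score, cost in pareto_frontier(ROWS):
        per_solve = cost_per_solved_task(score, cost)
        print(f"{name}: {score}% at ${cost:.2f}/task, about ${per_solve:.2f} per solved task")
```

This ignores the reported ± margins, so a configuration that looks dominated by a narrow gap may not be meaningfully worse.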

All models run on mini-swe-agent for consistency.

Today's leading public coding benchmarks are starting to saturate at the frontier: top models cluster within a narrow score band where adjacent configurations often overlap on confidence intervals. DeepSWE is a long-horizon software engineering benchmark built to separate them. It delivers four advances over existing public benchmarks:

- Contamination-free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.
- High diversity: Tasks span a broad pool of 91 repositories across 5 languages.
- Real-world complexity: Prompts are roughly half the length of SWE-bench Pro's, yet solutions require 5.5x more code and roughly 2x more output tokens.
- Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details.

The result is a benchmark that reflects how today's frontier coding agents actually perform in software engineering work.
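The overlap criterion mentioned above can be made concrete with a small check. The sketch below applies it to this leaderboard's own Pass@1 figures, assuming each "±" value is the half-width of a symmetric interval in percentage points (the page does not say how the margins are computed). The function names are hypothetical and chosen for illustration.

```python
# Pass@1 and reported "±" margin (both in percentage points), in leaderboard order.
LEADERBOARD = [
    ("claude-fable-5 [max]", 70, 4),
    ("gpt-5.5 [xhigh]", 67, 6),
    ("claude-opus-4.8 [max]", 59, 2),
    ("gpt-5.4 [xhigh]", 52, 2),
    ("glm-5.2 [max]", 44, 2),
    ("gemini-3.5-flash [medium]", 37, 2),
    ("kimi-k2.7-code", 31, 1),
    ("claude-sonnet-4.6 [high]", 30, 4),
    ("gemini-3.1-pro [high]", 12, 2),
]


def interval(score, margin):
    """Symmetric (low, high) interval; assumes the margin is a half-width."""
    return score - margin, score + margin


def overlaps(a, b):
    """True if two (low, high) intervals share more than a single endpoint.

    Strict comparison is used because the published values are rounded, so
    intervals that merely touch are not counted as overlapping.
    """
    return a[0] < b[1] and b[0] < a[1]


def adjacent_overlaps(rows):
    """Return pairs of neighbouring leaderboard rows whose intervals overlap."""
    pairs = []
    for (name_a, s_a, m_a), (name_b, s_b, m_b) in zip(rows, rows[1:]):
        if overlaps(interval(s_a, m_a), interval(s_b, m_b)):
            pairs.append((name_a, name_b))
    return pairs


if __name__ == "__main__":
    for first, second in adjacent_overlaps(LEADERBOARD):
        print(f"{first} and {second} have overlapping intervals")
```

Under these assumptions most neighbouring rows are separated by more than their margins, which is the kind of spread the paragraph above describes as the benchmark's goal.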

Task Examples

- Abort pending body reads on shutdown (capricorn86/happy-dom, TypeScript): Ensure interrupted request and response body reads, formData parsing, and discarded timers abort cleanly during shutdown.
- Fix PromQL label sorting across typed and untyped values (prometheus/prometheus, Go): PromQL label sorting must order mixed typed and untyped label values with stable typed comparison rules.
- Add config file parsing to Cliffy commands (c4spar/cliffy, TypeScript): Add command-level config file loading, parsing, merging, and precedence handling.
- Add deterministic map conflict detection to Y.Map writes (yjs/yjs, JavaScript): Add strict, deterministic conflict detection for Y.Map key writes with collect and error policies.
- Add trap coredump generation to wasmi (wasmi-labs/wasmi, Rust): Generate opt-in Wasm coredumps on traps and attach the bytes to errors.
- Add XML diff, patch, and merge operations to etree (beevik/etree, Go): Add recursive XML diffing, patch generation and application, reverse patching, three-way merge, and diff summaries.


Read the full blog

01 Introduction: Why a new benchmark
02 Overview: What separates DeepSWE
03 Methodology: How tasks and verifiers are built
04 Results: Where frontier models diverge
05 Qualitative analysis: How each frontier model fails
06 Limitations & future work: What we'd do differently
