DeepSWE
Leaderboard
Versions: v1.1, v1
113 tasks · updated June 20, 2026 · 9 models
[Figure: DeepSWE score (0% to 80%) plotted against average cost per task ($0 to $20), with a "most efficient" marker. Plotted configurations: claude-fable-5 [high] (Default), gpt-5.5 [medium] (Default), claude-opus-4.8 [high] (Default), gpt-5.4 [xhigh], glm-5.2 [max], gemini-3.5-flash [medium], kimi-k2.7-code, claude-sonnet-4.6 [high], gemini-3.1-pro [high].]
| Model | Pass@1 | Avg cost per task | Output tokens | Agent steps |
| --- | --- | --- | --- | --- |
| claude-fable-5 [max] | 70% ± 4% | $21.63 | 119k | 88 |
| gpt-5.5 [xhigh] | 67% ± 6% | $7.23 | 46k | 82 |
| claude-opus-4.8 [max] | 59% ± 2% | $13.22 | 135k | 120 |
| gpt-5.4 [xhigh] | 52% ± 2% | $5.65 | 71k | 70 |
| glm-5.2 [max] | 44% ± 2% | $3.92 | 78k | 129 |
| gemini-3.5-flash [medium] | 37% ± 2% | $7.34 | 276k | 86 |
| kimi-k2.7-code | 31% ± 1% | $2.82 | 59k | 149 |
| claude-sonnet-4.6 [high] | 30% ± 4% | $5.52 | 76k | 134 |
| gemini-3.1-pro [high] | 12% ± 2% | $9.48 | 196k | 81 |
All models run on mini-swe-agent for consistency.
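One way to read the score and cost columns together is to ask which configurations are not beaten on both axes at once. The sketch below computes such a Pareto frontier from the Pass@1 and average-cost figures listed above. It is an illustration only: the page does not say how its "most efficient" marker is defined, and the names `Row`, `ROWS`, and `pareto_frontier` are hypothetical.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    pass_at_1: int    # Pass@1 in percent, as listed on the leaderboard
    avg_cost: float   # average cost per task in USD


# Values copied from the leaderboard table (effort levels omitted for brevity).
ROWS = [
    Row("claude-fable-5", 70, 21.63),
    Row("gpt-5.5", 67, 7.23),
    Row("claude-opus-4.8", 59, 13.22),
    Row("gpt-5.4", 52, 5.65),
    Row("glm-5.2", 44, 3.92),
    Row("gemini-3.5-flash", 37, 7.34),
    Row("kimi-k2.7-code", 31, 2.82),
    Row("claude-sonnet-4.6", 30, 5.52),
    Row("gemini-3.1-pro", 12, 9.48),
]


def pareto_frontier(rows):
    """Return the rows that no other row beats on both cost and score.

    Rows are scanned from cheapest to most expensive; a row joins the
    frontier only if it scores strictly higher than every cheaper row.
    """
    frontier = []
    best_score = -1
    for row in sorted(rows, key=lambda r: (r.avg_cost, -r.pass_at_1)):
        if row.pass_at_1 > best_score:
            frontier.append(row)
            best_score = row.pass_at_1
    return frontier


if __name__ == "__main__":
    for row in pareto_frontier(ROWS):
        print(f"{row.model:20s} {row.pass_at_1:3d}%  ${row.avg_cost:.2f}")
```

Under this definition a configuration drops off the frontier as soon as a cheaper one scores at least as well, so the result is sensitive to small differences in the rounded figures.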
Today's leading public coding benchmarks are starting to saturate at the frontier: top models cluster within a narrow score band, and the confidence intervals of adjacent configurations often overlap. DeepSWE is a long-horizon software engineering benchmark built to separate them. It delivers four advances over existing public benchmarks:

Contamination-free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.
High diversity: Tasks span a broad pool of 91 repositories across 5 languages.
Real-world complexity: Prompts are roughly half the length of SWE-bench Pro's, yet solutions require 5.5x more code and roughly 2x more output tokens.
Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details.

The result is a benchmark that reflects how today's frontier coding agents actually perform in software engineering work.
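The overlap criterion mentioned above can be checked mechanically. The following sketch applies it to neighbouring rows of the DeepSWE leaderboard, assuming each "±" figure is a symmetric half-width around the reported Pass@1; the page does not state what kind of interval it is, so treat this as an illustration rather than the benchmark's own analysis. The function names are hypothetical.

```python
# (model, Pass@1 in percent, reported +/- in percentage points),
# copied from the leaderboard in ranked order.
LEADERBOARD = [
    ("claude-fable-5", 70, 4),
    ("gpt-5.5", 67, 6),
    ("claude-opus-4.8", 59, 2),
    ("gpt-5.4", 52, 2),
    ("glm-5.2", 44, 2),
    ("gemini-3.5-flash", 37, 2),
    ("kimi-k2.7-code", 31, 1),
    ("claude-sonnet-4.6", 30, 4),
    ("gemini-3.1-pro", 12, 2),
]


def interval(score, half_width):
    """Assumed symmetric interval: (score - half_width, score + half_width)."""
    return (score - half_width, score + half_width)


def overlaps(a, b):
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlapping_neighbours(rows):
    """Return (higher, lower) model pairs whose intervals overlap."""
    pairs = []
    for (m1, s1, h1), (m2, s2, h2) in zip(rows, rows[1:]):
        if overlaps(interval(s1, h1), interval(s2, h2)):
            pairs.append((m1, m2))
    return pairs


if __name__ == "__main__":
    for higher, lower in overlapping_neighbours(LEADERBOARD):
        print(f"{higher} and {lower} are not separated by their intervals")
```

Because the published values are rounded, intervals that merely touch at an endpoint (such as 61% for gpt-5.5 and claude-opus-4.8) count as overlapping here; a stricter comparison would need the unrounded figures.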
Task Examples

Abort pending body reads on shutdown
Ensure interrupted request and response body reads, formData parsing, and discarded timers abort cleanly during shutdown.
capricorn86/happy-dom (TypeScript)

Fix PromQL label sorting across typed and untyped values
PromQL label sorting must order mixed typed and untyped label values with stable typed comparison rules.
prometheus/prometheus (Go)

Add config file parsing to Cliffy commands
Add command-level config file loading, parsing, merging, and precedence handling.
c4spar/cliffy (TypeScript)

Add deterministic map conflict detection to Y.Map writes
Add strict, deterministic conflict detection for Y.Map key writes with collect and error policies.
yjs/yjs (JavaScript)

Add trap coredump generation to wasmi
Generate opt-in Wasm coredumps on traps and attach the bytes to errors.
wasmi-labs/wasmi (Rust)

Add XML diff, patch, and merge operations to etree
Add recursive XML diffing, patch generation and application, reverse patching, three-way merge, and diff summaries.
beevik/etree (Go)
Read the full blog

01 Introduction: Why a new benchmark
02 Overview: What separates DeepSWE
03 Methodology: How tasks and verifiers are built
04 Results: Where frontier models diverge
05 Qualitative analysis: How each frontier model fails
06 Limitations & future work: What we'd do differently
New frontier models are added to the DeepSWE leaderboard as they're released.