DeepSWE
Today's leading public coding benchmarks are starting to saturate at the frontier: top models cluster within a narrow score band, and the confidence intervals of adjacent configurations often overlap. DeepSWE is a long-horizon software engineering benchmark built to separate them. It delivers four advances over existing public benchmarks:<br>Contamination-free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.<br>High diversity: Tasks span a broad pool of 91 repositories across 5 languages.<br>Real-world complexity: Prompts are roughly half the length of SWE-bench Pro's, yet solutions require 5.5x more code and ~2x more output tokens.<br>Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details.<br>The result is a benchmark that reflects how today's frontier coding agents actually perform in software engineering work.
Leaderboard<br>Models(12/16)<br>gpt-5.5[xhigh]<br>70%±4%
gpt-5.4[xhigh]<br>56%±5%
claude-opus-4.7[max]<br>54%±5%
claude-sonnet-4.6[high]<br>32%±4%
gemini-3.5-flash[medium]<br>28%±4%
gpt-5.4-mini[xhigh]<br>24%±4%
kimi-k2.6<br>24%±4%
mimo-v2.5-pro<br>19%±4%
glm-5.1<br>18%±4%
gemini-3.1-pro<br>10%±3%
deepseek-v4-pro<br>8%±2%
gemini-3-flash<br>5%±2%
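To make the separation claim concrete, the sketch below checks which neighbouring leaderboard entries have overlapping intervals. It uses only the rounded scores and ± values listed above. Treating each ± value as the half-width of a symmetric interval is an assumption, since the page does not state how the intervals are computed, and the function names are hypothetical. Interval overlap is a rough heuristic here, not a significance test.

```python
# Scores and ± values (in percent) copied from the leaderboard above.
LEADERBOARD = [
    ("gpt-5.5[xhigh]", 70, 4),
    ("gpt-5.4[xhigh]", 56, 5),
    ("claude-opus-4.7[max]", 54, 5),
    ("claude-sonnet-4.6[high]", 32, 4),
    ("gemini-3.5-flash[medium]", 28, 4),
    ("gpt-5.4-mini[xhigh]", 24, 4),
    ("kimi-k2.6", 24, 4),
    ("mimo-v2.5-pro", 19, 4),
    ("glm-5.1", 18, 4),
    ("gemini-3.1-pro", 10, 3),
    ("deepseek-v4-pro", 8, 2),
    ("gemini-3-flash", 5, 2),
]


def intervals_overlap(a, b):
    """Return True if the [score - hw, score + hw] intervals of a and b intersect.

    Assumption: the ± value is a symmetric half-width around the score.
    """
    _, score_a, hw_a = a
    _, score_b, hw_b = b
    return abs(score_a - score_b) <= hw_a + hw_b


def overlapping_neighbours(rows):
    """List (higher, lower) name pairs of adjacent rows whose intervals overlap."""
    return [
        (hi[0], lo[0])
        for hi, lo in zip(rows, rows[1:])
        if intervals_overlap(hi, lo)
    ]


def separated_neighbours(rows):
    """List (higher, lower) name pairs of adjacent rows whose intervals do not overlap."""
    return [
        (hi[0], lo[0])
        for hi, lo in zip(rows, rows[1:])
        if not intervals_overlap(hi, lo)
    ]


if __name__ == "__main__":
    for higher, lower in separated_neighbours(LEADERBOARD):
        print(f"clear gap: {higher} > {lower}")
    for higher, lower in overlapping_neighbours(LEADERBOARD):
        print(f"overlap:   {higher} ~ {lower}")
```

Under that assumption, the check only compares adjacent ranks, which matches the "adjacent configurations" framing in the introduction. Comparing every pair would need a correction for multiple comparisons.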
All models are run with mini-swe-agent; see Why mini-swe-agent for a comparison against other harnesses.
Task Examples<br>Abort pending body reads on shutdown<br>Ensure interrupted request and response body reads, formData parsing, and discarded timers abort cleanly during shutdown.<br>capricorn86/happy-dom (TypeScript)<br>Fix PromQL label sorting across typed and untyped values<br>PromQL label sorting must order mixed typed and untyped label values with stable typed comparison rules.<br>prometheus/prometheus (Go)<br>Add config file parsing to Cliffy commands<br>Add command-level config file loading, parsing, merging, and precedence handling.<br>c4spar/cliffy (TypeScript)<br>Add deterministic map conflict detection to Y.Map writes<br>Add strict, deterministic conflict detection for Y.Map key writes with collect and error policies.<br>yjs/yjs (JavaScript)<br>Add trap coredump generation to wasmi<br>Generate opt-in Wasm coredumps on traps and attach the bytes to errors.<br>wasmi-labs/wasmi (Rust)<br>Add XML diff, patch, and merge operations to etree<br>Add recursive XML diffing, patch generation and application, reverse patching, three-way merge, and diff summaries.<br>beevik/etree (Go)
All 113 tasks
Read the full blog<br>01 Introduction: Why a new benchmark
02 Overview: What separates DeepSWE
03 Methodology: How tasks and verifiers are built
04 Results: Where frontier models diverge
05 Qualitative analysis: How each frontier model fails
06 Limitations & future work: What we'd do differently