DeepSWE Benchmark


Today's leading public coding benchmarks are starting to saturate at the frontier: top models cluster within a narrow score band where adjacent configurations often overlap on confidence intervals. DeepSWE is a long-horizon software engineering benchmark built to separate them. It delivers four advances over existing public benchmarks:

Contamination free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.

High diversity: Tasks span a broad pool of 91 repositories across 5 languages.

Real-world complexity: Prompts are ~half the length of SWE-bench Pro's, yet solutions require 5.5x more code and ~2x more output tokens.

Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details.

The result is a benchmark that reflects how today's frontier coding agents actually perform in software engineering work.

Leaderboard

Models (12/16)

gpt-5.5[xhigh]            70% ± 4%
gpt-5.4[xhigh]            56% ± 5%
claude-opus-4.7[max]      54% ± 5%
claude-sonnet-4.6[high]   32% ± 4%
gemini-3.5-flash[medium]  28% ± 4%
gpt-5.4-mini[xhigh]       24% ± 4%
kimi-k2.6                 24% ± 4%
mimo-v2.5-pro             19% ± 4%
glm-5.1                   18% ± 4%
gemini-3.1-pro            10% ± 3%
deepseek-v4-pro            8% ± 2%
gemini-3-flash             5% ± 2%


All models are run with mini-swe-agent; see Why mini-swe-agent for a comparison against other harnesses.
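As a rough illustration of what "separating" models means, the sketch below takes the leaderboard scores and checks which adjacent pairs have non-overlapping intervals. It makes several assumptions: each "±" value is treated as a symmetric half-width around the rounded score, the page does not state the confidence level or how the intervals were computed, and the function names are hypothetical. Interval overlap is only a quick heuristic, not a formal significance test.

```python
# Leaderboard rows copied from the page: (model, score in %, interval half-width in %).
# Values are the rounded figures shown on the leaderboard.
LEADERBOARD = [
    ("gpt-5.5[xhigh]", 70, 4),
    ("gpt-5.4[xhigh]", 56, 5),
    ("claude-opus-4.7[max]", 54, 5),
    ("claude-sonnet-4.6[high]", 32, 4),
    ("gemini-3.5-flash[medium]", 28, 4),
    ("gpt-5.4-mini[xhigh]", 24, 4),
    ("kimi-k2.6", 24, 4),
    ("mimo-v2.5-pro", 19, 4),
    ("glm-5.1", 18, 4),
    ("gemini-3.1-pro", 10, 3),
    ("deepseek-v4-pro", 8, 2),
    ("gemini-3-flash", 5, 2),
]


def intervals_overlap(a, b):
    """True if [score - hw, score + hw] for a and b intersect (assumed symmetric)."""
    _, score_a, hw_a = a
    _, score_b, hw_b = b
    return abs(score_a - score_b) <= hw_a + hw_b


def separated_pairs(rows):
    """Adjacent (higher, lower) model pairs whose intervals do NOT overlap."""
    return [
        (upper[0], lower[0])
        for upper, lower in zip(rows, rows[1:])
        if not intervals_overlap(upper, lower)
    ]


if __name__ == "__main__":
    for upper, lower in separated_pairs(LEADERBOARD):
        print(f"{upper} is separated from {lower}")
```

Under these assumptions, three of the eleven adjacent pairs are separated: gpt-5.5[xhigh] from gpt-5.4[xhigh], claude-opus-4.7[max] from claude-sonnet-4.6[high], and glm-5.1 from gemini-3.1-pro.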

Task Examples

Abort pending body reads on shutdown
Ensure interrupted request and response body reads, formData parsing, and discarded timers abort cleanly during shutdown.
capricorn86/happy-dom (typescript)

Fix PromQL label sorting across typed and untyped values
PromQL label sorting must order mixed typed and untyped label values with stable typed comparison rules.
prometheus/prometheus (go)

Add config file parsing to Cliffy commands
Add command-level config file loading, parsing, merging, and precedence handling.
c4spar/cliffy (typescript)

Add deterministic map conflict detection to Y.Map writes
Add strict, deterministic conflict detection for Y.Map key writes with collect and error policies.
yjs/yjs (javascript)

Add trap coredump generation to wasmi
Generate opt-in Wasm coredumps on traps and attach the bytes to errors.
wasmi-labs/wasmi (rust)

Add XML diff, patch, and merge operations to etree
Add recursive XML diffing, patch generation and application, reverse patching, three-way merge, and diff summaries.
beevik/etree (go)

All 113 tasks

Read the full blog

01 Introduction: Why a new benchmark
02 Overview: What separates DeepSWE
03 Methodology: How tasks and verifiers are built
04 Results: Where frontier models diverge
05 Qualitative analysis: How each frontier model fails
06 Limitations & future work: What we'd do differently
