DeepSWE v1.1 - A revision of DeepSWE v1
DeepSWE v1.1 keeps the same long-horizon engineering tasks as v1, but updates how agents are executed and scored: their committed code is graded in a clean, isolated environment, making results easier to reproduce, audit, and analyze. We also fixed dependency drift and removed flaky tests on some tasks.

With the updated setup, now including Claude Fable 5 and Kimi K2.7 Code, aggregate pass rates and model ordering remain close to v1.
113 tasks · updated June 14, 2026
[Chart: DeepSWE score (0% to 80%) against average cost per task ($0 to $20), with a "most efficient" marker. Models plotted: claude-fable-5 [high] (default), gpt-5.5 [medium] (default), claude-opus-4.8 [high] (default), gpt-5.4 [xhigh], gemini-3.5-flash [medium], kimi-k2.7-code, claude-sonnet-4.6 [high], gemini-3.1-pro [high].]
Best effort level per model:

| Model | Pass@1 | Avg cost | Out tok | Steps |
| --- | --- | --- | --- | --- |
| claude-fable-5[max] | 70% ± 4% | $21.63 | 119k | 88 |
| gpt-5.5[xhigh] | 67% ± 6% | $7.23 | 46k | 82 |
| claude-opus-4.8[max] | 59% ± 2% | $13.22 | 135k | 120 |
| gpt-5.4[xhigh] | 52% ± 2% | $5.65 | 71k | 70 |
| gemini-3.5-flash[medium] | 37% ± 2% | $7.34 | 276k | 86 |
| kimi-k2.7-code | 31% ± 1% | $2.82 | 59k | 149 |
| claude-sonnet-4.6[high] | 30% ± 4% | $5.52 | 76k | 134 |
| gemini-3.1-pro[high] | 12% ± 2% | $9.48 | 196k | 81 |
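As a rough illustration of how the score and cost columns above can be read together, the sketch below ranks the models by Pass@1 points per dollar of average task cost. This ratio is an assumption made for illustration only: the post does not define "most efficient" numerically, and the name `points_per_dollar` is hypothetical. The figures are copied from the table above.

```python
# (model, Pass@1 in percent, average cost per task in USD), copied from the table above.
RESULTS = [
    ("claude-fable-5[max]", 70, 21.63),
    ("gpt-5.5[xhigh]", 67, 7.23),
    ("claude-opus-4.8[max]", 59, 13.22),
    ("gpt-5.4[xhigh]", 52, 5.65),
    ("gemini-3.5-flash[medium]", 37, 7.34),
    ("kimi-k2.7-code", 31, 2.82),
    ("claude-sonnet-4.6[high]", 30, 5.52),
    ("gemini-3.1-pro[high]", 12, 9.48),
]


def points_per_dollar(pass_pct: float, avg_cost: float) -> float:
    """Hypothetical efficiency measure: Pass@1 points bought per dollar of average task cost."""
    return pass_pct / avg_cost


# Sort from most to least Pass@1 points per dollar.
ranked = sorted(RESULTS, key=lambda r: points_per_dollar(r[1], r[2]), reverse=True)

for model, pass_pct, cost in ranked:
    print(f"{model:28s} {points_per_dollar(pass_pct, cost):5.2f} points per dollar")
```

A single ratio like this ignores the uncertainty on each pass rate and treats a point at 12% the same as a point at 70%, so it is at most a starting point for comparison.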
Note: 73 of Claude Fable 5's 2,260 trials did not complete because access was suspended by a US government directive partway through our sweep. Pass rates are computed over the completed trials.

Wall-clock time is no longer reported: it depends heavily on external variables such as host machine performance and provider load, which makes it an inconsistent metric.

Explore

View the benchmark on GitHub, browse every rollout behind the numbers above, or run your own agent against the benchmark.

What changed

Agent container: checks out main (with future git history deleted) and commits to a feature branch.
→ committed diff only →
Verifier container: a fresh container. Applies the diff, executes the tests, and produces a CTRF report.
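To make the verifier's last step concrete, here is a minimal sketch of how a CTRF report could be turned into a pass/fail verdict for a task. It is not the DeepSWE grader: the function `grade_ctrf`, the `Verdict` type, and the test names are hypothetical, and the rule that every task-defining test must be present with status `passed` is an assumption. The report layout (`results.tests[]` entries with `name` and `status`) follows the public CTRF schema.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    passed: bool
    missing: list[str]          # required tests absent from the report
    not_passed: dict[str, str]  # required tests present with a non-"passed" status


def grade_ctrf(report: dict, required_tests: list[str]) -> Verdict:
    """Pass only if every required test appears in the report with status 'passed'."""
    # CTRF nests everything under "results"; each entry in "tests" has a name and a status.
    tests = report.get("results", {}).get("tests", [])
    status_by_name = {t["name"]: t["status"] for t in tests}

    # A dropped test or an early exit leaves required names out of the report.
    missing = [name for name in required_tests if name not in status_by_name]
    # Anything reported but not "passed" (failed, skipped, pending, other) also blocks a pass.
    not_passed = {
        name: status_by_name[name]
        for name in required_tests
        if name in status_by_name and status_by_name[name] != "passed"
    }
    return Verdict(passed=not missing and not not_passed, missing=missing, not_passed=not_passed)


# Illustrative report: one test passes, one fails, and one never ran.
report = {
    "results": {
        "tool": {"name": "pytest"},
        "summary": {"tests": 2, "passed": 1, "failed": 1, "pending": 0,
                    "skipped": 0, "other": 0, "start": 0, "stop": 1},
        "tests": [
            {"name": "test_parse_config", "status": "passed", "duration": 12},
            {"name": "test_reload", "status": "failed", "duration": 30},
        ],
    }
}
required = ["test_parse_config", "test_reload", "test_rollback"]
verdict = grade_ctrf(report, required)
print(verdict)
```

Keying the verdict on an explicit list of required test names, rather than on the report's own summary counts, is what makes a silently dropped test count against the run in this sketch.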
Isolated Verification: The agent commits its proposed changes, and we extract the git patch and evaluate it in an isolated container, separate from the one the agent worked in. This follows the same approach as SWE-bench and keeps grading independent of the agent's runtime environment, so results are reproducible.

Structured test reports: Tests now emit a CTRF report that records each task-defining test by name and status. This gives us a per-test view of what passed and failed, which is useful for analyzing results and spotting partial progress on a task.

Natural Git Environment: Rather than operating in detached HEAD mode, we set the main branch to the task's starting commit and ensure no future commits are visible. This lets agents work more naturally: they start from main, create feature branches, and explicitly commit their changes as they would in normal development.

To elaborate on the git environment: one concern with our previous environment construction method was that if an implementation similar to a task had been merged upstream, the agent could potentially cheat by finding it through git log. We swept the tasks' upstream repos to check whether any contained implementations similar to our tasks as of June 5th. We found no such instances, so v1 results remain free of this form of cheating.

Together, these changes make tasks harder to game. Because we grade only the committed patch in a separate container, some easier shortcuts no longer work: an agent can't monkey-patch the test framework, and because the CTRF report records each task-defining test by name, dropping tests or forcing an early exit shows up as missing or failed results rather than a pass.

Impact on Results

The comparison below shows each model's pass rate under v1 and v1.1. Scores stay close: the ordering at the top is unchanged, and most configurations land within a few points of their v1 result.
| Configuration | v1 | v1.1 | Change (points) |
| --- | --- | --- | --- |
| gpt-5.5[xhigh] | 70% | 67% | -3.0 |
| gpt-5.5[high] | 62% | 64% | +2.4 |
| claude-opus-4.8[max] | 58% | 59% | +0.8 |
| claude-opus-4.8[xhigh] | 58% | 54% | -3.4 |
| gpt-5.5[medium] | 48% | 54% | +6.0 |
| claude-opus-4.8[high] | 51% | 52% | +1.1 |
| gpt-5.4[xhigh] | 56% | 52% | -3.8 |
| claude-opus-4.8[medium] | 47% | 49% | +1.3 |
| gemini-3.5-flash[medium] | 28% | 37% | +9.1 |
| claude-sonnet-4.6[high] | 32% | 30% | -1.8 |
The 10 configurations run in both versions. Hollow dot is v1, filled dot is v1.1; the change in points is at...