Auditing DeepSWE


june.kim

Last updated 2026-05-31.

A benchmark asks us to trust three things: that the tasks are real, that the grader is fair, and that the answer key works. This audit asks whether those claims hold.

DeepSWE arrived on May 26, 2026 as a contamination-free coding benchmark: 113 tasks drawn from active repositories, each graded by its own verifier. The most basic check is to take the answer key and ask whether it passes its own test.

The check

Every DeepSWE task ships a reference solution and a verifier. The verifier applies a hidden test patch and runs the suite; a passing solution scores 1. The check is simple. Apply the reference solution, run the verifier, confirm it scores 1. If a task’s own gold cannot pass the test built for it, that task cannot be trusted until the contradiction is resolved. No model is involved, so the token cost is zero. One spot machine, ten tasks at a time, the whole set in under an hour, under a dollar.
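The loop described above can be sketched in a few lines of Python. This is a minimal sketch, not the benchmark's harness: `run_gold` is a hypothetical callable standing in for "apply the reference solution, run the verifier, return the score", and the task names in the usage are invented.

```python
from typing import Callable, Iterable


def audit_golds(task_ids: Iterable[str],
                run_gold: Callable[[str], float]) -> dict[str, str]:
    """Run each task's reference solution through its own verifier.

    run_gold is a hypothetical stand-in: it should apply the gold patch,
    run the verifier, and return the score (1 means pass).
    Returns a mapping of task id to the reason it did not score 1.
    """
    failures: dict[str, str] = {}
    for task_id in task_ids:
        try:
            score = run_gold(task_id)
        except Exception as exc:
            # A harness error is recorded separately from a verifier verdict.
            failures[task_id] = f"error: {exc}"
            continue
        if score != 1:
            failures[task_id] = f"score={score}"
    return failures


if __name__ == "__main__":
    # Fake scores for illustration only; these are not DeepSWE tasks.
    fake_scores = {"task-a": 1, "task-b": 0, "task-c": 1}
    print(audit_golds(fake_scores, fake_scores.__getitem__))
```

Keeping harness errors apart from failing scores matters for the next step: an exception raised before the verifier runs says something about the setup, not about the task.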

This is the check you would run before shipping. The authors publish the harness and the data, so anyone downstream can run it. I ran it.

First, I audited myself

The first run failed all 113 tasks. Zero scores across the board.

The tempting headline writes itself: contamination-free benchmark ships 113 broken tasks. It would have been wrong. Uniform failure across every task is never 113 independent defects. It is one fault in the thing they share: my harness. The grading framework brings up its sandbox with docker compose, and the bare Amazon Linux image I provisioned ships Docker without the Compose plugin. Every task errored before reaching the verifier.

I installed the plugin, added an assertion so the gap would fail loudly next time, and re-ran. I suspected my own setup before I suspected theirs. That is the only reason the rest of this is worth reporting.
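A preflight assertion of that kind might look like the following. It is a sketch under assumptions, not the assertion actually used in this audit: the function names are hypothetical, and the only external call is `docker compose version`, which exits non-zero when the Compose plugin is missing.

```python
import subprocess
from typing import Callable, Sequence


def _run(cmd: Sequence[str]) -> int:
    """Return the command's exit code, or 127 if the binary is missing."""
    try:
        return subprocess.run(cmd, capture_output=True, check=False).returncode
    except FileNotFoundError:
        return 127


def assert_compose_available(run: Callable[[Sequence[str]], int] = _run) -> None:
    """Fail loudly before any task starts if `docker compose` is unusable."""
    if run(["docker", "compose", "version"]) != 0:
        raise RuntimeError(
            "docker compose is not available; install the Compose plugin "
            "before running the audit"
        )
```

Calling `assert_compose_available()` once at startup turns 113 identical downstream errors into a single message about the host. The `run` parameter exists only so the check can be exercised without Docker installed.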

What the corrected run found

With the harness corrected, the goldens started passing one after another, exactly as they should. Then a few did not.

Four of the 113 reference solutions fail their own verifiers: langchain-request-coalescing, narwhals-rolling-window-suite, prometheus-transactional-reload-status, and skrub-duration-encoding. No agent attempted them. The answer key itself, applied verbatim, does not pass. Each was re-run alone to rule out contention; a task is flagged only when its gold fails in isolation.
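The two-pass rule (batched run first, then an isolated rerun of anything that failed) can be written down as a small filter. This is a sketch, assuming two hypothetical callables, `run_batched` and `run_alone`, that each return a task's gold score under the respective conditions.

```python
from typing import Callable, Iterable


def flag_failures(task_ids: Iterable[str],
                  run_batched: Callable[[str], float],
                  run_alone: Callable[[str], float]) -> list[str]:
    """Flag a task only if its gold fails batched AND fails again alone."""
    # First pass: anything that does not score 1 under load is a suspect.
    suspects = [t for t in task_ids if run_batched(t) != 1]
    # Second pass: rerun each suspect alone to rule out contention.
    return [t for t in suspects if run_alone(t) != 1]
```

A task that fails only in the batched pass is dropped from the report, which keeps contention on the shared machine from being misread as a defect in the task.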

Three things hold, and no more. Under the benchmark’s published harness at the pinned commit, the reference solution failed its verifier. The failure reproduced in isolated reruns. The cause is unresolved. It could be a broken task, a flaky test, or environment drift since the image was built. Sorting those apart is the maintainer’s job, not the auditor’s. The bounded claim stands: a task in this state cannot be trusted until the diagnosis happens.

Where these failures live

A score implies a clean function: model and task in, pass or fail out. Between them sits an authored apparatus: task selection, the instruction, the held-out test, the reference solution, the acceptance criteria, the environment. A gold that fails its own verifier puts two of those pieces in contradiction. An answer key asserted correct but never run against its own test is, precisely, a confabulation: a plausible artifact nobody checked. Which piece confabulated, and whether by authoring or by drift, is for the maintainer to determine.

The audit at a glance

| Claim | Observation | Analysis | Recommendation |
| --- | --- | --- | --- |
| Tasks are original, the benchmark is contamination-free | One spot-check holds: matchEach absent from ts-pattern code, PRs, issues; solutions held out. 1 of 113 checked. | Verifiable in principle; cleaner substrate than the contaminated SWE-bench Verified. Consistent with the claim, not established across the set. | Publish the per-task originality check across all 113. |
| All 113 tasks are gradeable by their own verifiers | 4 golds fail their own verifier (langchain-request-coalescing, narwhals-rolling-window-suite, prometheus-transactional-reload-status, skrub-duration-encoding), each confirmed in isolation. | A gold failing its own test: untrustworthy until resolved. Cause undetermined (broken task, flaky test, or drift). No gold-passes-verifier check on record. | Run gold-passes-verifier before shipping; fix or exclude failures; publish the check. |
| A lighter, standardized harness does not disadvantage any model (popularly inflated to "less prompting is better") | 3 model families, mini-swe-agent vs each native CLI. Single 10-task slice, one run per cell, no intervals or tests. Claimed: "matches or beats every native harness at comparable token cost." | 10 tasks, one run, no variance: can't carry a directional claim. "Matches or beats" without error bars is consistent with noise. The "less is more" reading isn't in the data. | Paired ablation at scale (all 113), repeated for variance: confidence intervals, significance test, published trajectories. |
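To see why ten paired tasks cannot carry a directional claim, an exact sign test is enough. The sketch below is an illustration, not an analysis of DeepSWE's per-task results, which are not reproduced here. It assumes each task yields a clear winner between the two harnesses, with no ties, and asks how lopsided the split must be before chance stops being a plausible explanation.

```python
from math import comb
from typing import Optional


def sign_test_p(wins: int, n: int) -> float:
    """Two-sided exact sign test p-value for `wins` out of `n` untied pairs."""
    k = max(wins, n - wins)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def min_wins_for_significance(n: int, alpha: float = 0.05) -> Optional[int]:
    """Smallest win count out of n that gives p < alpha, or None if none does."""
    for wins in range((n + 1) // 2, n + 1):
        if sign_test_p(wins, n) < alpha:
            return wins
    return None


if __name__ == "__main__":
    print(min_wins_for_significance(10))   # 9
    print(round(sign_test_p(8, 10), 3))    # 0.109
```

Under these assumptions, one harness would have to win 9 of the 10 tasks to reach p < 0.05; an 8 to 2 split is still consistent with a coin flip. That is before accounting for run-to-run variance, which a single run per cell cannot measure at all.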

The harness-comparison claim rests on a specific scaffolding. Pier...
