Auditing DeepSWE

june.kim

Last updated 2026-05-31.

A benchmark asks us to trust three things: that the tasks are real, that the grader is fair, and that the answer key works. This audit asks whether those claims hold.

DeepSWE arrived on May 26, 2026 as a contamination-free coding benchmark: 113 tasks drawn from active repositories, each graded by its own verifier. The most basic check is to take the answer key and ask whether it passes its own test.

The check

Every DeepSWE task ships a reference solution and a verifier. The verifier applies a hidden test patch and runs the suite; a passing solution scores 1. The check is simple. Apply the reference solution, run the verifier, confirm it scores 1. If a task’s own gold cannot pass the test built for it, that task cannot be trusted until the contradiction is resolved. No model is involved, so the token cost is zero. One spot machine, ten tasks at a time, the whole set in under an hour, under a dollar.
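A minimal sketch of that loop, assuming a hypothetical `score_gold` hook that applies a task's reference solution, runs its verifier, and returns the score. The hook stands in for whatever the published harness exposes; its name and signature are assumptions made here, not the benchmark's API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def audit_golds(
    task_ids: Iterable[str],
    score_gold: Callable[[str], float],
    max_parallel: int = 10,
) -> Dict[str, float]:
    """Score every task's reference solution against its own verifier.

    `score_gold` is a hypothetical hook: it should apply the reference
    solution, run the verifier, and return the score (1 means pass).
    """
    ids: List[str] = list(task_ids)
    # Ten at a time mirrors the batch size described in the text.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        scores = list(pool.map(score_gold, ids))
    return dict(zip(ids, scores))


def failing_golds(scores: Dict[str, float]) -> List[str]:
    """Return the tasks whose own answer key did not score 1."""
    return sorted(task for task, score in scores.items() if score != 1)
```

Anything `failing_golds` returns is a candidate only. It still has to be re-run alone before it counts.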

This is the check you would run before shipping. The authors publish the harness and the data, so anyone downstream can run it. I ran it.

First, I audited myself

The first run failed all 113 tasks. Zero scores across the board.

The tempting headline writes itself: contamination-free benchmark ships 113 broken tasks. It would have been wrong. Uniform failure across every task is never 113 independent defects. It is one fault in the thing they share: my harness. The grading framework brings up its sandbox with docker compose, and the bare Amazon Linux image I provisioned ships Docker without the Compose plugin. Every task errored before reaching the verifier.
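The "one shared fault" reading can be made mechanical. The sketch below assumes each task's outcome has been reduced to a short error signature, or `None` if the task reached the verifier. That reduction is a hypothetical step; the harness's actual log format is not specified here.

```python
from typing import Dict, Optional


def shared_fault(errors: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the common error signature if every task failed the same way.

    `errors` maps task id to an error signature, or None when the task
    reached the verifier. Returns None unless all tasks share one signature.
    """
    if not errors:
        return None
    signatures = set(errors.values())
    # A single non-None signature across all tasks points at the harness,
    # not at the tasks themselves.
    if len(signatures) == 1 and None not in signatures:
        return next(iter(signatures))
    return None
```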

I installed the plugin, added an assertion so the gap would fail loudly next time, and re-ran. I suspected my own setup before I suspected theirs. That is the only reason the rest of this is worth reporting.
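One way such an assertion could look, as a sketch rather than the code actually used: call `docker compose version` before any task starts and stop if it does not succeed. The `run` parameter is an assumption added here so the check can be exercised without Docker installed.

```python
import subprocess
from typing import Callable


def compose_available(
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True if `docker compose version` exits successfully."""
    try:
        result = run(
            ["docker", "compose", "version"],
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # The docker binary itself is missing.
        return False
    return result.returncode == 0


def require_compose(
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> None:
    """Fail loudly before any task runs if Compose is unavailable."""
    if not compose_available(run):
        raise RuntimeError(
            "docker compose is unavailable; install the Compose plugin "
            "before running any task"
        )
```

Calling `require_compose()` at the top of the run turns 113 silent zeros into a single error message.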

What the corrected run found

With the harness corrected, the goldens started passing one after another, exactly as they should. Then a few did not.

Four of the 113 reference solutions fail their own verifiers: langchain-request-coalescing, narwhals-rolling-window-suite, prometheus-transactional-reload-status, and skrub-duration-encoding. No agent attempted them. The answer key itself, applied verbatim, does not pass. Each was re-run alone to rule out contention; a task is flagged only when its gold fails in isolation.
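The isolation rule can be sketched as a serial second pass over the suspects. As before, `score_gold` is a hypothetical hook that applies a task's reference solution and returns the verifier's score. The `reruns` parameter is also an assumption, since the text does not give a rerun count.

```python
from typing import Callable, Iterable, List


def confirm_in_isolation(
    suspects: Iterable[str],
    score_gold: Callable[[str], float],
    reruns: int = 1,
) -> List[str]:
    """Re-run each suspect alone; flag only those that never pass."""
    if reruns < 1:
        raise ValueError("reruns must be at least 1")
    flagged: List[str] = []
    for task in suspects:
        # One task at a time, so contention between sandboxes is ruled out.
        scores = [score_gold(task) for _ in range(reruns)]
        if all(score != 1 for score in scores):
            flagged.append(task)
    return flagged
```

A task that failed in the batch but passes here drops off the list, which keeps contention artifacts out of the report.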

Three things hold, and no more. Under the benchmark’s published harness at the pinned commit, the reference solution failed its verifier. The failure reproduced in isolated reruns. The cause is unresolved. It could be a broken task, a flaky test, or environment drift since the image was built. Sorting those apart is the maintainer’s job, not the auditor’s. The bounded claim stands: a task in this state cannot be trusted until the diagnosis happens.

Where these failures live

A score implies a clean function: model and task in, pass or fail out. Between them sits an authored apparatus: task selection, the instruction, the held-out test, the reference solution, the acceptance criteria, the environment. A gold that fails its own verifier puts two of those pieces in contradiction. An answer key asserted correct but never run against its own test is, precisely, a confabulation: a plausible artifact nobody checked. Which piece confabulated, and whether by authoring or by drift, is for the maintainer to determine.

The audit at a glance

| Claim | Observation | Analysis | Recommendation |
| --- | --- | --- | --- |
| Tasks are original, the benchmark is contamination-free | One spot-check holds: matchEach absent from ts-pattern code, PRs, issues; solutions held out. 1 of 113 checked. | Verifiable in principle; cleaner substrate than the contaminated SWE-bench Verified. Consistent with the claim, not established across the set. | Publish the per-task originality check across all 113. |
| All 113 tasks are gradeable by their own verifiers | 4 golds fail their own verifier (langchain-request-coalescing, narwhals-rolling-window-suite, prometheus-transactional-reload-status, skrub-duration-encoding), each confirmed in isolation. | A gold failing its own test: untrustworthy until resolved. Cause undetermined (broken task, flaky test, or drift). No gold-passes-verifier check on record. | Run gold-passes-verifier before shipping; fix or exclude failures; publish the check. |
| A lighter, standardized harness does not disadvantage any model (popularly inflated to "less prompting is better") | 3 model families, mini-swe-agent vs each native CLI. Single 10-task slice, one run per cell, no intervals or tests. Claimed: "matches or beats every native harness at comparable token cost." | 10 tasks, one run, no variance: can't carry a directional claim. "Matches or beats" without error bars is consistent with noise. The "less is more" reading isn't in the data. | Paired ablation at scale (all 113), repeated for variance: confidence intervals, significance test, published trajectories. |

The harness-comparison claim rests on a specific scaffolding. Pier...
