Deterministic Guardrails Against AI Code Duplication

Dupehound: deterministic guardrails against AI code duplication - Technology Org


AI coding agents produce code faster than it can be reviewed. A common result is AI slop, and most AI slop is not bad or broken code but duplicated code. GitClear’s analysis of 211 million changed lines found that cloned code blocks have roughly quadrupled since 2022.

Duplication happens because an AI coding agent often cannot hold the whole repository in its context window, especially in large codebases. When it needs a function that already exists, it usually does not find it and writes another copy. Asking an AI coding agent to spot the duplicates does not work either, mostly for the same reason the copies get created: it cannot properly see the parts of the repository that are not in its context. The larger the codebase, the bigger the problem.

A deterministic guardrail check in the agent loop works better. It does not use a model, runs locally, and is fast enough to run on every change: it scans roughly 1.5 million lines per second.

Building deterministic guardrails against AI code duplication

I packaged this as dupehound, a single-binary CLI in Rust.

Dupehound is an index. But a plain text index doesn’t work here, because the copies are not textual: renamed functions share almost no tokens and almost all structure. Dupehound fingerprints the structure instead.

To fingerprint structure, I used a technique called winnowing, which was worked out in 2003 for a different scenario: students who rename variables before submitting copied homework. Stanford’s MOSS plagiarism detector is built on it (Schleimer, Wilkerson & Aiken, 2003), and it transfers to AI-renamed code almost unchanged.

The pipeline has four stages:

Parse. tree-sitter splits each file into functions; the function body is the unit, so imports and signatures never cause a match.

Normalize. Identifiers, strings, and numbers become sentinels, comments are dropped, and keywords and control flow stay.

Fingerprint. 10-token windows are hashed and winnowed. The test suite checks it as a property.

Match. Shared fingerprints produce candidate pairs, so there is no all-pairs pass. Boilerplate fingerprints are dropped, similarity is exact Jaccard, and union-find groups the clusters.
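The four stages can be sketched in a few dozen lines of Python. This is an illustration of the technique, not dupehound's actual implementation: a regex tokenizer stands in for tree-sitter, and the keyword list, the sentinel names, the CRC32 hash, the winnowing window of 4, and the 0.8 similarity threshold are all assumptions made for the example. Only the 10-token window size comes from the description above, and boilerplate filtering is left out.

```python
import re
import zlib
from collections import defaultdict
from itertools import combinations

# Assumed keyword list; a real tool would take this from the grammar.
KEYWORDS = {"function", "return", "if", "else", "for", "while", "const", "let", "var"}
TOKEN_RE = re.compile(
    r"""//[^\n]*|/\*.*?\*/|"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\d+(?:\.\d+)?|[A-Za-z_$][\w$]*|\S""",
    re.DOTALL,
)

def normalize(body: str) -> list[str]:
    """Drop comments; replace identifiers, strings and numbers with sentinels."""
    out = []
    for tok in TOKEN_RE.findall(body):
        if tok.startswith("//") or tok.startswith("/*"):
            continue
        if tok[0] in "\"'":
            out.append("STR")
        elif tok[0].isdigit():
            out.append("NUM")
        elif tok[0].isalpha() or tok[0] in "_$":
            out.append(tok if tok in KEYWORDS else "ID")
        else:
            out.append(tok)
    return out

def fingerprints(tokens: list[str], k: int = 10, window: int = 4) -> set[int]:
    """Hash every k-token window, then keep the minimum hash of each run of
    `window` consecutive hashes (winnowing)."""
    hashes = [zlib.crc32(" ".join(tokens[i:i + k]).encode())
              for i in range(len(tokens) - k + 1)]
    if not hashes:
        return set()
    span = min(window, len(hashes))
    return {min(hashes[i:i + span]) for i in range(len(hashes) - span + 1)}

def jaccard(a: set[int], b: set[int]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def clusters(fps: dict[str, set[int]], threshold: float = 0.8) -> list[set[str]]:
    """Group functions whose fingerprint sets are similar enough."""
    # Inverted index: only functions sharing a fingerprint become candidates,
    # so no all-pairs comparison is needed.
    index = defaultdict(set)
    for name, fp in fps.items():
        for h in fp:
            index[h].add(name)
    parent = {name: name for name in fps}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    candidates = {pair for names in index.values()
                  for pair in combinations(sorted(names), 2)}
    for a, b in candidates:
        if jaccard(fps[a], fps[b]) >= threshold:
            parent[find(a)] = find(b)  # union
    groups = defaultdict(set)
    for name in fps:
        groups[find(name)].add(name)
    return [g for g in groups.values() if len(g) > 1]
```

Two function bodies that differ only in their identifiers normalize to the same token stream, so they share every fingerprint and land in one cluster, while an unrelated body shares none and is never even compared.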

Using dupehound to avoid duplicated code

dupehound has two main commands.

Scan: reports every duplicate cluster and a repo-level slop score

Check: fails CI when a change duplicates existing code, naming the original to reuse

Scan reports the clusters of duplicated code and a slop score. The slop score is the percentage of code you could delete if every cluster kept a single copy. The largest copy is exempt, and test files are excluded by default, since table-driven tests are repetitive by design.
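As a worked example, here is one reading of that definition in Python. The function name and the inputs are hypothetical, and the assumption is that each cluster is given as the line counts of its copies and that the denominator is the total number of scanned, non-test lines; dupehound may compute it differently.

```python
def slop_score(cluster_sizes: list[list[int]], total_lines: int) -> float:
    """Percentage of lines that could go if each cluster kept only its largest copy."""
    if total_lines <= 0:
        return 0.0
    # In every cluster the largest copy is exempt; the rest count as removable.
    removable = sum(sum(sizes) - max(sizes) for sizes in cluster_sizes if sizes)
    return 100.0 * removable / total_lines
```

With two clusters whose copies are 10, 10 and 12 lines and 5 and 5 lines in a 500-line repository, 20 + 5 = 25 lines are removable, which gives a score of 5%.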

Check is the part that runs in the loop. It indexes the codebase at the base revision, looks only at the functions a change touches, and exits non-zero with the location of the original:

$ dupehound check --diff main .
src/api/orders.ts:1 calculateOrderAmount() is a 100% duplicate of
src/billing/invoice.ts:1 computeInvoiceTotal() — reuse it

Moved functions and in-place edits do not fire. The one-line output exists so it can go back to the agent that wrote the duplicate. The lighter way to wire that is an instruction in CLAUDE.md or AGENTS.md:

Before committing, run `dupehound check .`. If it reports that a function
you wrote duplicates existing code, delete your version and reuse the
original at the reported location.
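Because the contract is just an exit code plus a line of text, a custom agent harness or CI step could also call the check directly. The helper below is a hypothetical sketch, not part of dupehound: the name run_guardrail is made up, and since the source does not say whether the report goes to stdout or stderr, it captures both.

```python
import subprocess

def run_guardrail(cmd: list[str]) -> tuple[bool, str]:
    """Run a check command and return (passed, text to hand back to the agent)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    # A non-zero exit code means the check failed; the captured text names the original.
    return result.returncode == 0, (result.stdout + result.stderr).strip()

# Example call, using the command shown earlier:
# passed, message = run_guardrail(["dupehound", "check", "--diff", "main", "."])
```

When the check fails, the message can be appended to the agent's next prompt so that it deletes its copy and reuses the original.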

The tighter way is the MCP server, dupehound mcp. It exposes check_duplication and scan_duplication, so the agent can call them while it edits:

claude mcp add dupehound -- dupehound mcp

It is a local pipe with no AI in it. A model in the loop does not work. A deterministic index in the loop does, and the agent is the one calling it: it writes a function, asks whether that function already exists, and reuses the original when it does.

Evaluation: the hide-and-seek benchmark

To test dupehound against a model, I planted 39 known duplicate function pairs into real code from microsoft/vscode, a 3.3-million-line TypeScript codebase, and grew the host from 10,000 to 1,000,017 lines.

I gave each agent run a fixed budget of 150 turns (one turn is a single read or search) and 15 minutes. At this scale the budget is the limiting factor. The recall numbers below are what an agent finds under this budget.

dupehound recovers 36 of 39 at every size. The agents recover about half at 10,000 lines and fewer as the tree grows. Opus recovers none at a million lines, and both Sonnet runs hit the cap before returning a result, which is what "did not...
