I audited 162 agent-written PRs – 27% were the AI fixing itself

commensa-audit

What % of your AI engineering effort went to fixing your AI's own work?

commensa-audit answers that from your git history. Point it at a GitHub repo; get a one-page report:

Rework tax — share of PRs (and changed lines) that corrected earlier work, vs. net-new value

Superseded work — PRs whose output was entirely replaced later (shown separately — discarded ≠ correcting)

Abandoned attempts — PRs closed without merging: the waste merge-based metrics never see

Churn clusters — chains of PRs rewriting each other ("it took 10 PRs to get dark mode right")

Line survival — how much merged code is still alive at the end of the window

Hotspots — rework share by module, against the repo-wide rate

Agent-marked share — "at least X% of PRs carry agent markers" (Co-Authored-By trailers, body signatures) — a stated lower bound, never an attribution claim
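As a rough illustration of the headline metric, the sketch below computes a rework share two ways, by PR count and by changed lines. The `PullRequest` record, its field names, and the sample numbers are hypothetical and are not taken from the tool's source; it assumes each PR has already been labelled as corrective or not.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    number: int
    lines_changed: int
    corrective: bool  # True if the PR corrected earlier work (assumed pre-labelled)


def rework_tax(prs):
    """Return (share of PRs, share of changed lines) that were corrective."""
    if not prs:
        return 0.0, 0.0
    corrective = [pr for pr in prs if pr.corrective]
    pr_share = len(corrective) / len(prs)
    total_lines = sum(pr.lines_changed for pr in prs)
    corrective_lines = sum(pr.lines_changed for pr in corrective)
    line_share = corrective_lines / total_lines if total_lines else 0.0
    return pr_share, line_share


# Made-up sample data, for illustration only.
sample = [
    PullRequest(1, 300, False),
    PullRequest(2, 50, True),
    PullRequest(3, 120, False),
    PullRequest(4, 30, True),
]
pr_share, line_share = rework_tax(sample)
print(f"{pr_share:.0%} of PRs, {line_share:.0%} of changed lines")
# prints: 50% of PRs, 16% of changed lines
```

The two shares can diverge sharply, as in the sample, which is presumably why the report shows both.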

We built it because we needed it: our own agent-built product shipped 162 PRs in 13 days, and the audit showed 27% of them were the AI correcting itself.

Install & run

pip install commensa-audit
commensa-audit --repo owner/name --token $GH_TOKEN

Or straight from source:

pip install git+https://github.com/commensa-ai/commensa-audit

Output: report_.html (self-contained, forwardable), audit_.json (raw numbers), units.csv (per-PR data).
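For a quick look at the per-PR data outside the HTML report, something like the following could tally one column of the CSV. The schema of units.csv is not documented here, so the column names in the sample (`pr`, `label`) are placeholders; check the real header before choosing a column.

```python
import csv
import io
from collections import Counter


def tally_column(csv_file, column):
    """Count the values found in one column of a per-PR CSV file object."""
    reader = csv.DictReader(csv_file)
    if column not in (reader.fieldnames or []):
        raise KeyError(f"{column!r} not in header {reader.fieldnames}")
    return Counter(row[column] for row in reader)


# Stand-in data with made-up columns; the real units.csv header may differ.
sample = io.StringIO(
    "pr,label\n"
    "101,generative\n"
    "102,corrective\n"
    "103,generative\n"
)
counts = tally_column(sample, "label")
print(counts.most_common())
```

With a real file, pass `open("units.csv", newline="")` in place of the in-memory sample.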

Scoping large repos

By default the audit covers the newest 500 PRs — a safety cap so a naive run on a huge repo stays fast and bounded. When it truncates, the run prints a notice telling you how to raise it. Two optional flags control the window (both newest-first):

commensa-audit --repo owner/name --since 2026-03-14 --max-prs 150

--since YYYY-MM-DD — only PRs created on/after this UTC date

--max-prs N — cap to the N newest PRs (default 500; use --max-prs 0 for no cap)

Both flags stop pagination early, so --max-prs 150 costs roughly 150 PRs' worth of API calls, not the repo's entire history. Run with no flags on a repo under 500 PRs and you get everything, exactly as before.
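The early-stop behaviour can be sketched as follows. This is not the tool's code: `fetch_page` is a hypothetical callable standing in for a paginated API request, and the sketch assumes pages arrive newest-first by creation date, so the first PR older than the cutoff means every later one is older too.

```python
from datetime import date


def collect_prs(fetch_page, since=None, max_prs=500):
    """Walk newest-first pages, returning as soon as a limit is reached.

    fetch_page(n) is a hypothetical callable that returns a list of
    (pr_number, created_date) tuples for page n, or [] when exhausted.
    max_prs=0 means no cap.
    """
    collected = []
    page = 1
    while True:
        batch = fetch_page(page)
        if not batch:
            return collected
        for number, created in batch:
            if since is not None and created < since:
                return collected  # all remaining PRs are older still
            collected.append((number, created))
            if max_prs and len(collected) >= max_prs:
                return collected  # cap reached, skip further pages
        page += 1


# Fake three-page repository, for illustration only.
pages = {
    1: [(6, date(2026, 3, 20)), (5, date(2026, 3, 18))],
    2: [(4, date(2026, 3, 15)), (3, date(2026, 3, 10))],
    3: [(2, date(2026, 3, 1)), (1, date(2026, 2, 20))],
}
calls = []


def fake_fetch(page):
    calls.append(page)
    return pages.get(page, [])


recent = collect_prs(fake_fetch, since=date(2026, 3, 14))
print([n for n, _ in recent], calls)
# prints: [6, 5, 4] [1, 2]
```

Page 3 is never requested in the sample run, which is the point: the cost scales with the window, not with the size of the repository.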

Privacy, by architecture

Read-only. GET requests only; a token with read scope is sufficient.

Local-first. Everything runs and stays on your machine. No telemetry, no phone-home, nothing leaves your network.

Inspectable. Pure Python, stdlib + requests + jinja2. Read every line before you run it.

How classification works (and its honest limits)

Every PR is classified by a transparent signal cascade — explicit corrective titles/reverts → self-correction (a PR predominantly undoing lines added in the prior N days) → churn-cluster membership → otherwise generative. Every classification in the output carries the signal that fired and a human-readable why. Thresholds live in one config block; tune them and re-run offline with --reuse.
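A minimal sketch of such a cascade, where the first signal that fires wins and each result carries its reason. The title pattern, the 0.5 threshold, the `PR` fields, and the signal names are all assumptions made for illustration; the tool's actual rules and config block may look quite different.

```python
import re
from dataclasses import dataclass

# Hypothetical thresholds, kept in one place so they are easy to tune.
CORRECTIVE_TITLE = re.compile(r"(fix|revert|hotfix)\b", re.IGNORECASE)
SELF_CORRECTION_SHARE = 0.5


@dataclass
class PR:
    title: str
    lines_removed: int = 0
    removed_recent_lines: int = 0  # removed lines that were added in the prior N days
    in_churn_cluster: bool = False


def classify(pr):
    """Return (signal, why) for the first signal in the cascade that fires."""
    if CORRECTIVE_TITLE.match(pr.title):
        return "explicit_corrective", f"title {pr.title!r} starts with a corrective keyword"
    if pr.lines_removed:
        share = pr.removed_recent_lines / pr.lines_removed
        if share > SELF_CORRECTION_SHARE:
            return "self_correction", f"{share:.0%} of removed lines were added recently"
    if pr.in_churn_cluster:
        return "churn_cluster", "PR is part of a chain of PRs rewriting each other"
    return "generative", "no corrective signal fired"


for pr in [
    PR("Revert dark mode toggle"),
    PR("Adjust theme colours", lines_removed=40, removed_recent_lines=30),
    PR("Tweak theme again", in_churn_cluster=True),
    PR("Add export endpoint"),
]:
    signal, why = classify(pr)
    print(f"{signal}: {why}")
```

Ordering the checks from most explicit to most inferred means a cheap, high-confidence signal is never overridden by a weaker heuristic.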

Known limits (also printed in the report footer): classification is heuristic; squash merges blur attribution; survival windows mean young repos read optimistic; agent-marked share is a lower bound — absence of a marker is not evidence of human authorship. We grade our own certainty rather than fake precision — that's the whole point of the project.

Why "rework tax"?

Agent-era teams measure activity — PRs merged, lines shipped, velocity. None of that distinguishes progress from cleanup. The rework tax does: it's the share of motion that was correction, the closest git-only...
