A maintainability ratchet for AI-assisted Python


Letting Agents Write Code Without Ratcheting Up Risk

Kayhan Babaee

May 23, 2026

riskratchet v0.2.2 is open source. Try it, star it, or install from PyPI.

In this post, I will walk through why I built a maintainability ratchet for AI-assisted Python, what it actually measures, and how it fits into the review loop.

The short version is this: tests passing is necessary, but it is not the same as keeping the codebase easy to change. AI agents make that gap more visible because they can add correct-looking code quickly. A function can keep passing its tests while quietly gaining branches, public surface, file sprawl, and maintenance history that make the next change harder.

riskratchet is my attempt to make that drift show up as a diff.

The Moment That Started This

I asked an agent to add one branch to a function I owned. The function already had a handful of branches. The agent added the new behavior, kept the signature, kept the existing tests passing, and added one happy-path test for the new case.

The PR looked fine. CI was green. Coverage went up.

A week later I had to change the same function again. I opened it and realized the easy-to-review diff had left behind a function that was now much harder to reason about. The tests were not wrong. The agent was not obviously wrong. The normal signals just were not measuring the thing I cared about.

The thing I wanted to catch was not:

Is this function bad?

It was:

Did this change make a function riskier than the version we had already accepted?

That distinction is the whole product.

What I Wanted To Catch

I wanted a check that could fail a PR when a single function moved into a worse maintainability state:

Cyclomatic complexity went up but tests did not follow.

Line coverage stayed high while branch coverage dropped.

A public function lost coverage.

A function crossed a length or file-sprawl threshold.

A hot file accumulated more complexity.

A new function landed already above the team's risk threshold.

Those are all measurable from data a Python CI job can already produce: source files, coverage JSON, and optionally git history.
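As a rough illustration of how the first two inputs combine, the sketch below maps function spans from the standard-library `ast` module onto a per-file coverage entry. It is not riskratchet's implementation. `function_line_coverage`, `SOURCE`, and `FILE_REPORT` are hypothetical names, and `FILE_REPORT` is a hand-written dict that assumes the `executed_lines` / `missing_lines` keys of a per-file entry in a coverage.py JSON report.

```python
import ast

# Hypothetical module source; line numbers start at 1.
SOURCE = '''\
def covered(x):
    return x + 1


def partly(x):
    if x > 0:
        return "pos"
    return "non-pos"
'''

# Hand-written stand-in for one file's entry in a coverage JSON report
# (assumed keys: "executed_lines" and "missing_lines").
FILE_REPORT = {"executed_lines": [1, 2, 5, 6, 7], "missing_lines": [8]}


def function_line_coverage(source: str, file_report: dict) -> dict[str, float]:
    """Return line coverage (0.0 to 1.0) for each function in `source`."""
    executed = set(file_report["executed_lines"])
    missing = set(file_report["missing_lines"])
    result = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # The function span runs from the def line to its last line.
            span = set(range(node.lineno, node.end_lineno + 1))
            hit = len(span & executed)
            miss = len(span & missing)
            measured = hit + miss
            # Assumption: a span with no measured lines counts as covered.
            result[node.name] = hit / measured if measured else 1.0
    return result


per_function = function_line_coverage(SOURCE, FILE_REPORT)
print(per_function)
```

The file as a whole has 5 of 6 measured lines executed, but the per-function view shows that the only gap sits inside `partly`.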

The goal was not to build another static quality dashboard. The goal was to build a ratchet:

Measure the current state.

Save it as a baseline.

Fail only when future changes move risk up past a tolerance.

That makes adoption much easier. A mature codebase does not have to become clean in one sweep. It just has to stop getting worse silently.
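The three ratchet steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not riskratchet's actual code: `find_regressions`, `tolerance`, and `new_function_limit` are hypothetical names, the numeric defaults are made up, and scores are plain numbers keyed by function name.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 2.0,            # hypothetical default
    new_function_limit: float = 30.0,  # hypothetical team threshold
) -> list[str]:
    """List functions whose risk moved up past the tolerance."""
    failures = []
    for name, score in sorted(current.items()):
        if name in baseline:
            # Existing function: fail only if it got worse than accepted.
            if score > baseline[name] + tolerance:
                failures.append(f"{name}: {baseline[name]:.1f} -> {score:.1f}")
        elif score > new_function_limit:
            # New function: no baseline, so compare against the threshold.
            failures.append(
                f"{name}: new function at {score:.1f} "
                f"(limit {new_function_limit:.1f})"
            )
    return failures


# Step 1 and 2: a previously measured and saved baseline (made-up values).
baseline = {"billing.charge": 10.0, "billing.refund": 55.0}
# Step 3: scores measured on the PR branch.
current = {"billing.charge": 41.0, "billing.refund": 55.0, "billing.retry": 12.0}

failures = find_regressions(baseline, current)
print(failures)
```

Note that `billing.refund` sits at a high score in both runs and still passes, because the check compares against the accepted baseline rather than an absolute bar.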

Why Coverage Alone Is Not Enough

Coverage is useful, but it is very easy to overread.

A line can execute without the test asserting the behavior that matters. A happy-path test can touch every line in a function while leaving half the branch exits untested. A public API can be "covered" only through incidental calls from another test. A file can have respectable project-level coverage while one risky function has almost none.

This is one of the reasons I care about function-level output. Project coverage answers a broad question:

Did the test suite execute this much of the repository?

The review question is narrower:

Did the function changed by this PR become harder to change safely?

Those are not the same question.
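A small, made-up example of the happy-path problem described above. `shipping_cost` is a hypothetical function, not something from the riskratchet repo. A single call executes every line, yet each `if` only ever takes its true exit.

```python
def shipping_cost(weight_kg: float, express: bool) -> float:
    cost = 5.0
    if weight_kg > 10:   # exits: true (surcharge) / false (skip)
        cost += 2.5
    if express:          # exits: true (double) / false (skip)
        cost *= 2
    return cost


# One happy-path call runs every line of the function: full line coverage.
happy_path = shipping_cost(12, True)

# But only the true exit of each `if` was taken, so two of the four
# branch exits are untested. A second call is needed to reach them.
other_path = shipping_cost(1, False)

print(happy_path, other_path)
```

With only the first call in the test suite, a line-coverage report would show nothing missing for this function, while a branch-aware report would flag the two untaken exits.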

Why CRAP Alone Was Not Enough

The CRAP score is still useful:

CC^2 * (1 - line_coverage)^3 + CC

It catches the classic bad shape: complex code with weak line coverage. riskratchet keeps CRAP in the output because it is a good, familiar ranking signal.
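To make the formula concrete, here is a direct transcription into Python with a few worked values. `crap_score` is a hypothetical helper name, and it assumes line coverage is expressed as a fraction between 0 and 1.

```python
def crap_score(cyclomatic_complexity: int, line_coverage: float) -> float:
    """CRAP = CC^2 * (1 - line_coverage)^3 + CC, with coverage in [0, 1]."""
    if cyclomatic_complexity < 1:
        raise ValueError("cyclomatic complexity starts at 1")
    if not 0.0 <= line_coverage <= 1.0:
        raise ValueError("line_coverage must be a fraction between 0 and 1")
    cc = cyclomatic_complexity
    return cc**2 * (1 - line_coverage) ** 3 + cc


# Complex and untested: the classic bad shape scores high.
print(crap_score(10, 0.0))   # 110.0
# Same complexity, fully line-covered: the score collapses to CC.
print(crap_score(10, 1.0))   # 10.0
# Half covered: the cubic term shrinks quickly.
print(crap_score(10, 0.5))   # 22.5
# A trivial function with no tests at all still scores low.
print(crap_score(1, 0.0))    # 2.0
```

The last case shows why an untested but simple public function barely registers: with CC of 1 the score cannot exceed 2, whatever the coverage.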

But CRAP does not see everything I wanted this tool to care about:

| Shape | What CRAP sees | What I still care about |
| --- | --- | --- |
| 100% line coverage, 50% branch coverage | Looks mostly fine | Half the exits were never tested |
| A 2-line public function with no tests | Low score | Public contract has no direct coverage |
| A function in a 950-line module | Only the function's CC and line coverage | File sprawl makes every change more expensive |
| A file touched 6 times in the churn window | Nothing | Hot code is where small changes accumulate |
| A baseline score moved from 10 to 41 | Only the new absolute score | The regression is the useful signal |

So the score in riskratchet is a weighted blend of six normalized components:

| Component | Default weight | What it measures |
| --- | --- | --- |
| coverage_gap | 30% | Missing line coverage inside the function span |
| structural_complexity | 25% | Cyclomatic complexity, saturating at high values |
| branch_gap | 15% | Missing branch coverage when branch data exists |
| churn | 10% | Recent commits touching the file, default 90-day window |
| public_surface | 10% | Missing coverage on functions treated as public API |
| sprawl | 10% | Function length and surrounding file length |

Weights are configurable in [tool.riskratchet.weights], but they are validated and renormalized. A typo or negative weight should not silently weaken a CI gate.
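A minimal sketch of what "validated and renormalized" could look like, using the default weights from the table. This is an illustration, not riskratchet's code: `normalize_weights` and `blended_score` are hypothetical names, and the sketch assumes each component has already been normalized to the range 0 to 1.

```python
DEFAULT_WEIGHTS = {
    "coverage_gap": 0.30,
    "structural_complexity": 0.25,
    "branch_gap": 0.15,
    "churn": 0.10,
    "public_surface": 0.10,
    "sprawl": 0.10,
}


def normalize_weights(overrides: dict[str, float]) -> dict[str, float]:
    """Merge overrides into the defaults, reject bad input, rescale to sum 1."""
    unknown = set(overrides) - set(DEFAULT_WEIGHTS)
    if unknown:
        # A typo such as "chrun" fails loudly instead of being ignored.
        raise ValueError(f"unknown weight(s): {sorted(unknown)}")
    merged = {**DEFAULT_WEIGHTS, **overrides}
    if any(weight < 0 for weight in merged.values()):
        raise ValueError("weights must be non-negative")
    total = sum(merged.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: weight / total for name, weight in merged.items()}


def blended_score(components: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of components that are assumed to lie in [0, 1]."""
    return sum(weights[name] * components[name] for name in weights)


weights = normalize_weights({"churn": 0.20})  # raw total is 1.10 before rescaling
worst_case = blended_score({name: 1.0 for name in weights}, weights)
print(weights["churn"], worst_case)
```

Rescaling after the merge keeps the blended score on the same range no matter how the overrides are written, so raising one weight cannot inflate the total.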

A Real Fixture: The Agent Spaghetti Case

The repo has a fixture named tests/fixtures/agent_generated_spaghetti. It is the canonical shape I wanted the tool...

