A maintainability ratchet for AI-assisted Python

Letting Agents Write Code Without Ratcheting Up Risk

Kayhan Babaee

May 23, 2026

riskratchet v0.2.2 is open source. Try it, star it, or install it from PyPI.

In this post, I will walk through why I built a maintainability ratchet for AI-assisted Python, what it actually measures, and how it fits into the review loop.

The short version is this: passing tests are necessary, but they are not the same as keeping the codebase easy to change. AI agents make that gap more visible because they can add correct-looking code quickly. A function can keep passing its tests while quietly gaining branches, public surface, file sprawl, and maintenance history that make the next change harder.

riskratchet is my attempt to make that drift show up as a diff.

The Moment That Started This

I asked an agent to add one branch to a function I owned. The function already had a handful of branches. The agent added the new behavior, kept the signature, kept the existing tests passing, and added one happy-path test for the new case.

The PR looked fine. CI was green. Coverage went up.

A week later I had to change the same function again. I opened it and realized the easy-to-review diff had left behind a function that was now much harder to reason about. The tests were not wrong. The agent was not obviously wrong. The normal signals just were not measuring the thing I cared about.

The thing I wanted to catch was not:

Is this function bad?

It was:

Did this change make a function riskier than the version we had already accepted?

That distinction is the whole product.

What I Wanted To Catch

I wanted a check that could fail a PR when a single function moved into a worse maintainability state:

Cyclomatic complexity went up but tests did not follow.

Line coverage stayed high while branch coverage dropped.

A public function lost coverage.

A function crossed a length or file-sprawl threshold.

A hot file accumulated more complexity.

A new function landed already above the team's risk threshold.

Those are all measurable from data a Python CI job can already produce: source files, coverage JSON, and optionally git history.
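As a rough illustration of how source files and coverage data combine at function level, here is a minimal sketch. It assumes the coverage report exposes per-file lists of executed and missing line numbers (coverage.py's JSON report has `executed_lines` and `missing_lines` entries of this shape). The function name `function_line_coverage`, the sample source, and the line lists are hypothetical, and this is not riskratchet's actual implementation.

```python
import ast

# Hypothetical module source; line 1 is the first `def`.
SOURCE = """\
def covered(x):
    return x + 1

def partly(x):
    if x > 0:
        return x
    return -x
"""


def function_line_coverage(source, executed_lines, missing_lines):
    """Map each function name to its line coverage, measured inside its own span."""
    executed, missing = set(executed_lines), set(missing_lines)
    coverage = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # The span runs from the `def` line to the last line of the body.
            span = set(range(node.lineno, node.end_lineno + 1))
            hit = len(span & executed)
            miss = len(span & missing)
            measured = hit + miss
            # Assumption: a function with no measurable lines counts as covered.
            coverage[node.name] = hit / measured if measured else 1.0
    return coverage


# Line lists shaped like one file entry of a coverage JSON report (values invented).
per_function = function_line_coverage(
    SOURCE, executed_lines=[1, 2, 4, 5, 6], missing_lines=[7]
)
print(per_function)  # {'covered': 1.0, 'partly': 0.75}
```

The sketch keys results by bare function name, so nested or same-named functions would collide; a real tool would need qualified names.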

The goal was not to build another static quality dashboard. The goal was to build a ratchet:

Measure the current state.

Save it as a baseline.

Fail only when future changes move risk up past a tolerance.

That makes adoption much easier. A mature codebase does not have to become clean in one sweep. It just has to stop getting worse silently.
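The measure, save, and compare loop can be sketched in a few lines. This is a minimal illustration, not riskratchet's code: the names `check_ratchet` and `Regression`, the tolerance of 2.0, the new-function threshold of 30.0, and the sample scores are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Regression:
    function: str
    baseline: Optional[float]  # None when the function is new
    current: float
    reason: str


def check_ratchet(baseline, current, tolerance=2.0, new_function_threshold=30.0):
    """Return the functions whose risk moved up past what was already accepted."""
    regressions = []
    for name, score in sorted(current.items()):
        previous = baseline.get(name)
        if previous is None:
            # New functions have no baseline, so they are held to an absolute limit.
            if score > new_function_threshold:
                regressions.append(
                    Regression(name, None, score, "new function above threshold")
                )
        elif score > previous + tolerance:
            regressions.append(
                Regression(name, previous, score, "risk increased past tolerance")
            )
    return regressions


# Hypothetical per-function risk scores: the saved baseline and the PR's state.
baseline_scores = {"billing.charge": 10.0, "billing.refund": 12.0}
current_scores = {"billing.charge": 41.0, "billing.refund": 12.5, "billing.retry": 35.0}

found = check_ratchet(baseline_scores, current_scores)
for item in found:
    print(item.function, item.reason)
```

Note that `billing.refund` moves from 12.0 to 12.5 and is not reported, because the change stays inside the tolerance. An existing function that is already risky but unchanged is never reported either, which is what lets a mature codebase adopt the check without a cleanup sweep.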

Why Coverage Alone Is Not Enough

Coverage is useful, but it is very easy to overread.

A line can execute without the test asserting the behavior that matters. A happy-path test can touch every line in a function while leaving half the branch exits untested. A public API can be "covered" only through incidental calls from another test. A file can have respectable project-level coverage while one risky function has almost none.

This is one of the reasons I care about function-level output. Project coverage answers a broad question:

Did the test suite execute this much of the repository?

The review question is narrower:

Did the function changed by this PR become harder to change safely?

Those are not the same question.
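The happy-path case described above is easy to reproduce. In this small sketch (the function `apply_discount` and both tests are hypothetical), the first test alone executes every line of the function, so line coverage reports it as fully covered. The path where the `if` body is skipped only shows up as untested when branch coverage is measured, for example with coverage.py's `--branch` option.

```python
def apply_discount(price_cents: int, is_member: bool) -> int:
    """Hypothetical function: members get 10% off."""
    if is_member:
        price_cents -= price_cents // 10
    return price_cents


def test_member_discount():
    # Happy path: executes every line of apply_discount.
    assert apply_discount(1000, True) == 900


def test_non_member_pays_full_price():
    # Takes the branch where the `if` body is skipped.
    assert apply_discount(1000, False) == 1000


test_member_discount()
test_non_member_pays_full_price()
```

With only `test_member_discount` in the suite, a bug on the non-member path would ship with every line of the function marked as executed.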

Why CRAP Alone Was Not Enough

The CRAP (Change Risk Anti-Patterns) score is still useful:

CC^2 * (1 - line_coverage)^3 + CC

It catches the classic bad shape: complex code with weak line coverage. riskratchet keeps CRAP in the output because it is a good, familiar ranking signal.
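For concreteness, the formula translates directly into a small function. This sketch assumes `line_coverage` is given as a fraction between 0 and 1; the name `crap_score` and the input checks are illustrative choices, not taken from riskratchet.

```python
def crap_score(cc: int, line_coverage: float) -> float:
    """CC^2 * (1 - line_coverage)^3 + CC, with line_coverage as a fraction in [0, 1]."""
    if cc < 1:
        raise ValueError("cyclomatic complexity must be at least 1")
    if not 0.0 <= line_coverage <= 1.0:
        raise ValueError("line_coverage must be between 0 and 1")
    return cc**2 * (1.0 - line_coverage) ** 3 + cc


print(crap_score(10, 0.0))  # 110.0: complex and untested
print(crap_score(10, 1.0))  # 10.0: full line coverage leaves only CC
print(crap_score(5, 0.5))   # 8.125
```

Because the coverage term is cubed, the penalty falls off quickly as line coverage rises, and at 100% line coverage the score collapses to the complexity alone, whatever the branch coverage is.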

But CRAP does not see everything I wanted this tool to care about:

| Shape | What CRAP sees | What I still care about |
| --- | --- | --- |
| 100% line coverage, 50% branch coverage | Looks mostly fine | Half the exits were never tested |
| A 2-line public function with no tests | Low score | Public contract has no direct coverage |
| A function in a 950-line module | Only the function's CC and line coverage | File sprawl makes every change more expensive |
| A file touched 6 times in the churn window | Nothing | Hot code is where small changes accumulate |
| A baseline score moved from 10 to 41 | Only the new absolute score | The regression is the useful signal |

So the score in riskratchet is a weighted blend of six normalized components:

| Component | Default weight | What it measures |
| --- | --- | --- |
| coverage_gap | 30% | Missing line coverage inside the function span |
| structural_complexity | 25% | Cyclomatic complexity, saturating at high values |
| branch_gap | 15% | Missing branch coverage when branch data exists |
| churn | 10% | Recent commits touching the file, default 90-day window |
| public_surface | 10% | Missing coverage on functions treated as public API |
| sprawl | 10% | Function length and surrounding file length |

Weights are configurable in [tool.riskratchet.weights], but they are validated and renormalized. A typo or negative weight should not silently weaken a CI gate.
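One way to picture the validate-and-renormalize step is the sketch below. The default weights come from the table above; everything else is an assumption for illustration, including the function names `normalize_weights` and `blended_score`, the exact error behavior, and the choice to leave the blend on a 0 to 1 scale. riskratchet's real validation rules and score scale may differ.

```python
DEFAULT_WEIGHTS = {
    "coverage_gap": 0.30,
    "structural_complexity": 0.25,
    "branch_gap": 0.15,
    "churn": 0.10,
    "public_surface": 0.10,
    "sprawl": 0.10,
}


def normalize_weights(overrides):
    """Merge user overrides into the defaults, reject bad input, rescale to sum to 1."""
    unknown = set(overrides) - set(DEFAULT_WEIGHTS)
    if unknown:
        # A misspelled key fails loudly instead of being silently ignored.
        raise ValueError(f"unknown weight names: {sorted(unknown)}")
    merged = {**DEFAULT_WEIGHTS, **overrides}
    negative = sorted(name for name, value in merged.items() if value < 0)
    if negative:
        raise ValueError(f"negative weights: {negative}")
    total = sum(merged.values())
    if total <= 0:
        raise ValueError("weights must not all be zero")
    return {name: value / total for name, value in merged.items()}


def blended_score(components, weights):
    """Weighted sum of components that are each already normalized to [0, 1]."""
    return sum(weights[name] * components[name] for name in weights)


# Hypothetical override: make churn count three times as much as the default.
weights = normalize_weights({"churn": 0.30})
# Hypothetical normalized component values for one function.
components = {name: 0.0 for name in DEFAULT_WEIGHTS}
components["coverage_gap"] = 1.0
print(round(blended_score(components, weights), 3))  # 0.25
```

Renormalizing means raising one weight lowers the relative influence of the others rather than inflating every score, so thresholds keep their meaning after a config change.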

A Real Fixture: The Agent Spaghetti Case

The repo has a fixture named tests/fixtures/agent_generated_spaghetti. It is the canonical shape I wanted the tool...
