Letting Agents Write Code Without Ratcheting Up Risk
Kayhan Babaee
May 23, 2026
riskratchet v0.2.2 is open source. Try it, star it, or install from PyPI.
In this post, I will walk through why I built a maintainability ratchet for AI-assisted Python, what it actually measures, and how it fits into the review loop.
The short version is this: tests passing is necessary, but it is not the same as keeping the codebase easy to change. AI agents make that gap more visible because they can add correct-looking code quickly. A function can keep passing its tests while quietly gaining branches, public surface, file sprawl, and maintenance history that make the next change harder.
riskratchet is my attempt to make that drift show up as a diff.
The Moment That Started This
I asked an agent to add one branch to a function I owned. The function already had a handful of branches. The agent added the new behavior, kept the signature, kept the existing tests passing, and added one happy-path test for the new case.
The PR looked fine. CI was green. Coverage went up.
A week later I had to change the same function again. I opened it and realized the easy-to-review diff had left behind a function that was now much harder to reason about. The tests were not wrong. The agent was not obviously wrong. The normal signals just were not measuring the thing I cared about.
The thing I wanted to catch was not:
Is this function bad?
It was:
Did this change make a function riskier than the version we had already accepted?
That distinction is the whole product.
What I Wanted To Catch
I wanted a check that could fail a PR when a single function moved into a worse maintainability state:
Cyclomatic complexity went up but tests did not follow.
Line coverage stayed high while branch coverage dropped.
A public function lost coverage.
A function crossed a length or file-sprawl threshold.
A hot file accumulated more complexity.
A new function landed already above the team's risk threshold.
Those are all measurable from data a Python CI job can already produce: source files, coverage JSON, and optionally git history.
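To make the per-function idea concrete, here is a minimal sketch of scoping coverage data to a function span. It is not riskratchet's implementation. The helper name `function_line_coverage` and the sample source are hypothetical, and the `executed` and `missing` sets are assumed to come from the per-file `executed_lines` and `missing_lines` lists in a coverage.py JSON report.

```python
import ast

# Hypothetical sample module; line numbers start at 1 on "def covered".
SOURCE = """\
def covered(x):
    return x + 1

def risky(x):
    if x > 0:
        return x
    return -x
"""


def function_line_coverage(
    source: str, executed: set[int], missing: set[int]
) -> dict[str, float]:
    """Return line coverage per function, counting only lines inside each span."""
    result: dict[str, float] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            span = set(range(node.lineno, node.end_lineno + 1))
            hit = len(span & executed)
            miss = len(span & missing)
            total = hit + miss
            # A span with no measurable statements is treated as fully covered.
            result[node.name] = hit / total if total else 1.0
    return result


# The file as a whole is 50% covered (3 of 6 statements), but the two
# functions look very different once the data is scoped to their spans.
print(function_line_coverage(SOURCE, executed={1, 2, 4}, missing={5, 6, 7}))
# {'covered': 1.0, 'risky': 0.25}
```

This sketch keys results by bare function name, so nested or same-named functions would collide. A real tool would need qualified names.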
The goal was not to build another static quality dashboard. The goal was to build a ratchet:
Measure the current state.
Save it as a baseline.
Fail only when future changes move risk up past a tolerance.
That makes adoption much easier. A mature codebase does not have to become clean in one sweep. It just has to stop getting worse silently.
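The three ratchet steps can be sketched in a few lines. This is an illustration under stated assumptions, not riskratchet's actual code: the `Regression` type, the `find_regressions` helper, the tolerance of 2.0, the new-function threshold of 30.0, and the function names in the example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Regression:
    function: str
    baseline: Optional[float]  # None means the function is new
    current: float


def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 2.0,                # assumed default, not from the tool
    new_function_threshold: float = 30.0,  # assumed default, not from the tool
) -> list[Regression]:
    """Report functions whose risk moved up past the tolerance."""
    regressions = []
    for name, score in sorted(current.items()):
        previous = baseline.get(name)
        if previous is None:
            # New code has no baseline, so it is held to an absolute threshold.
            if score > new_function_threshold:
                regressions.append(Regression(name, None, score))
        elif score > previous + tolerance:
            # Existing code only fails when it gets worse than what was accepted.
            regressions.append(Regression(name, previous, score))
    return regressions


baseline = {"billing.charge": 10.0, "billing.refund": 12.0}
current = {"billing.charge": 41.0, "billing.refund": 12.5, "billing.retry": 8.0}

for r in find_regressions(baseline, current):
    print(f"{r.function}: {r.baseline} -> {r.current}")
# billing.charge: 10.0 -> 41.0
```

Note that `billing.refund` drifts from 12.0 to 12.5 without failing. The tolerance is what keeps the gate from firing on noise, and a function that was already risky at baseline time passes as long as it does not get worse.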
Why Coverage Alone Is Not Enough
Coverage is useful, but it is very easy to overread.
A line can execute without the test asserting the behavior that matters. A happy-path test can touch every line in a function while leaving half the branch exits untested. A public API can be "covered" only through incidental calls from another test. A file can have respectable project-level coverage while one risky function has almost none.
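A small, made-up example of the happy-path case: the function and test below are hypothetical, but the shape is common. The single test executes every line of `shipping_fee`, so a line-coverage report shows 100%, yet the path where the `if` is skipped is never exercised.

```python
def shipping_fee(is_member: bool) -> int:
    fee = 5
    if is_member:  # two exits: into the body, or straight to the return
        fee = 0
    return fee


def test_member_ships_free() -> None:
    # Runs every line above, so line coverage is 100%.
    # The is_member=False exit is never taken, so branch coverage is 50%.
    assert shipping_fee(is_member=True) == 0


test_member_ships_free()
```

If a later change broke the default fee, this test would still pass, and a line-coverage gate would have nothing to say about it.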
This is one of the reasons I care about function-level output. Project coverage answers a broad question:
Did the test suite execute this much of the repository?
The review question is narrower:
Did the function changed by this PR become harder to change safely?
Those are not the same question.
Why CRAP Alone Was Not Enough
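The CRAP (Change Risk Anti-Patterns) score is still useful. With CC as the function's cyclomatic complexity and line_coverage as a fraction between 0 and 1, it is: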
CC^2 * (1 - line_coverage)^3 + CC
It catches the classic bad shape: complex code with weak line coverage. riskratchet keeps CRAP in the output because it is a good, familiar ranking signal.
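For reference, the formula translates directly into Python. The function name `crap_score` is my own for this sketch, and it assumes `line_coverage` is a fraction between 0 and 1 rather than a percentage.

```python
def crap_score(cyclomatic_complexity: int, line_coverage: float) -> float:
    """CRAP = CC^2 * (1 - line_coverage)^3 + CC, with coverage as a 0..1 fraction."""
    if cyclomatic_complexity < 1:
        raise ValueError("cyclomatic complexity is at least 1")
    if not 0.0 <= line_coverage <= 1.0:
        raise ValueError("line_coverage must be between 0 and 1")
    uncovered = 1.0 - line_coverage
    return cyclomatic_complexity**2 * uncovered**3 + cyclomatic_complexity


print(crap_score(10, 0.0))  # 110.0: complex and untested
print(crap_score(10, 0.5))  # 22.5
print(crap_score(10, 1.0))  # 10.0: the score bottoms out at CC
```

With full line coverage the score collapses to CC no matter how the branches were exercised, which is the first gap in the comparison that follows.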
But CRAP does not see everything I wanted this tool to care about:
| Shape | What CRAP sees | What I still care about |
| --- | --- | --- |
| 100% line coverage, 50% branch coverage | Looks mostly fine | Half the exits were never tested |
| A 2-line public function with no tests | Low score | Public contract has no direct coverage |
| A function in a 950-line module | Only the function's CC and line coverage | File sprawl makes every change more expensive |
| A file touched 6 times in the churn window | Nothing | Hot code is where small changes accumulate |
| A baseline score moved from 10 to 41 | Only the new absolute score | The regression is the useful signal |
So the score in riskratchet is a weighted blend of six normalized components:
| Component | Default weight | What it measures |
| --- | --- | --- |
| coverage_gap | 30% | Missing line coverage inside the function span |
| structural_complexity | 25% | Cyclomatic complexity, saturating at high values |
| branch_gap | 15% | Missing branch coverage when branch data exists |
| churn | 10% | Recent commits touching the file, default 90-day window |
| public_surface | 10% | Missing coverage on functions treated as public API |
| sprawl | 10% | Function length and surrounding file length |
Weights are configurable in [tool.riskratchet.weights], but they are validated and renormalized. A typo or negative weight should not silently weaken a CI gate.
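Here is one way validation and renormalization could look. Treat it as a sketch of the idea rather than the tool's code: `normalize_weights` and `blended_score` are hypothetical names, the weights are written as fractions rather than percentages, and the blend is assumed to be a plain weighted sum of components already normalized to the 0..1 range.

```python
import math
from typing import Optional

# Default weights from the table above, expressed as fractions.
DEFAULT_WEIGHTS = {
    "coverage_gap": 0.30,
    "structural_complexity": 0.25,
    "branch_gap": 0.15,
    "churn": 0.10,
    "public_surface": 0.10,
    "sprawl": 0.10,
}


def normalize_weights(overrides: Optional[dict[str, float]] = None) -> dict[str, float]:
    """Merge overrides into the defaults, reject bad input, rescale to sum to 1."""
    weights = dict(DEFAULT_WEIGHTS)
    for name, value in (overrides or {}).items():
        if name not in DEFAULT_WEIGHTS:
            # A typo must fail loudly instead of being ignored.
            raise ValueError(f"unknown weight: {name!r}")
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"weight {name!r} must be a number")
        if not math.isfinite(value) or value < 0:
            raise ValueError(f"weight {name!r} must be finite and non-negative")
        weights[name] = float(value)
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: value / total for name, value in weights.items()}


def blended_score(components: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of components, each assumed to lie in the 0..1 range."""
    return sum(weights[name] * components[name] for name in weights)


weights = normalize_weights({"churn": 0.30})  # raise churn; the rest are rescaled
components = {name: 0.0 for name in DEFAULT_WEIGHTS}
components["churn"] = 1.0
print(round(blended_score(components, weights), 2))  # 0.25
```

Renormalizing means raising one weight lowers the relative share of the others instead of inflating the total, so scores stay comparable across configurations.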
A Real Fixture: The Agent Spaghetti Case
The repo has a fixture named tests/fixtures/agent_generated_spaghetti. It is the canonical shape I wanted the tool...