Prove Alignment, Then Let the Agent Merge


Isaac Kabuika


Let the Agent Merge

We still have room to speed up, and it’s stuck behind human approval. We keep it there because we haven’t automated alignment yet.

Isaac Kabuika Jun 02, 2026

I’ve been thinking about how teams build software once AI can write whole features well. We already ship more code than ever, but reviewing it feels more and more like theater. Not because people stopped caring. Because review is aimed at the wrong thing.

The old rules made sense: small PRs, a human in the loop, careful line-by-line review. They made sense because humans were the bottleneck in turning product intent into code. With good coding models, that’s less and less true. The implementation isn’t the fragile part anymore. The fragile part is whether the product intent was clear in the first place.

So I don’t think the right question for a PR is “did a human inspect this diff?” That question is becoming theater. The better question is whether the change is aligned on three axes: product alignment, architectural sanity, and release risk.

Product alignment means the change actually does what we wanted. It matches the user journey, preserves the existing behaviors that matter, handles the weird cases, and moves the system from the old state to the new one in a way we can explain.

Architectural sanity means the change fits the system. It respects the boundaries we care about, follows the tradeoffs we chose, and doesn’t make the next version harder for no reason.

Release risk means the change has enough proof around it: tests, migrations, observability, rollback paths, security and privacy checks, and known failure modes.

That is the real review. Not “a human clicked approve after scrolling through files.” The system should prove the change is aligned, and when confidence is high across those axes, the agent should approve and merge.

A lot of this can be automated if the company has the right records. Product tickets, design notes, ADRs, UX examples, prior behavior, incidents, support threads, existing code patterns: all of it is review material. If those records are current, an agent can often review a PR better than a tired human looking at a GitHub diff.
If they’re missing or stale, that’s not a reason to keep manual review forever. It’s a sign the product isn’t well communicated.

Still, there is a popular belief that some domains are too sensitive for full automation: billing, permissions, security, customer data. I don’t buy that as a general rule.

The problem with billing isn’t that agents can’t write safe billing code. It’s that billing is hard to define. Billing has too many states, exceptions, migrations, customer promises, Stripe details, internal assumptions, and old weird decisions. You can’t define the ideal billing system in one sitting and hand it to an agent.

But that isn’t an argument for humans reviewing every billing diff. It’s an argument for making the product slice smaller. Build baby-billing v1, then v2, then v3. Each version should be small enough that we can define it thoroughly. What changes? What must not change? What happens to existing customers? What happens on failure? What do we log? What do we test? What’s the rollback? What would make us stop the rollout?

Once the slice is that clear, I trust the agent to implement it. The sensitivity of the domain doesn’t mean we need more humans staring at code. It means we need sharper definition, stronger invariants, and better proof.

This is the part I didn’t fully see before. For years we talked about small PRs and modular systems as ways to make code easier to reason about. That was true, but incomplete. Smallness matters because it makes product intent easier to reason about. Large products are hard not because they contain many files, but because they contain many hidden promises.

Think about this flow: we go from idea → … → code. Every arrow in the middle is a re-explanation of what we’re building, and each one adds assumptions, because each explanation is built on a previous stage that doesn’t cover every possible scenario. Even the original idea itself is probably incomplete.
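The billing questions above read naturally as a checklist, and the three review axes as a merge gate. Below is a minimal Python sketch that combines them, assuming a team records each slice as a structured spec and some reviewer agent produces a confidence score per axis. Every name here (`SliceSpec`, `open_questions`, `merge_decision`, the 0.9 threshold) is hypothetical and not taken from any existing tool. What it illustrates is that unanswered questions surface explicitly, instead of turning into silent assumptions at one of the arrows between idea and code.

```python
from dataclasses import dataclass, fields

# Hypothetical sketch: none of these names come from a real tool or API.


@dataclass
class SliceSpec:
    """One small product slice, e.g. baby-billing v1.

    Each field answers one question from the text; an empty (or
    whitespace-only) string means the question is still open.
    """
    what_changes: str = ""
    must_not_change: str = ""
    existing_customers: str = ""
    on_failure: str = ""
    what_we_log: str = ""
    what_we_test: str = ""
    rollback: str = ""
    stop_rollout_if: str = ""


def open_questions(spec: SliceSpec) -> list[str]:
    """Names of unanswered questions, in declaration order."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]


# The three review axes from the text.
AXES = ("product_alignment", "architectural_sanity", "release_risk")
THRESHOLD = 0.9  # assumed cut-off; a real team would tune this


def merge_decision(spec: SliceSpec, confidence: dict[str, float]) -> str:
    """Return "merge" only if the slice is fully defined and every axis clears THRESHOLD."""
    unanswered = open_questions(spec)
    if unanswered:
        # A vague slice is a spec problem: push back instead of building on assumptions.
        return "clarify spec: " + ", ".join(unanswered)
    # An axis with no score counts as zero confidence.
    weak = [axis for axis in AXES if confidence.get(axis, 0.0) < THRESHOLD]
    if weak:
        return "needs more proof: " + ", ".join(weak)
    return "merge"


# Usage: a slice with only two questions answered never reaches the axes.
v1 = SliceSpec(what_changes="answered", rollback="answered")
print(merge_decision(v1, {axis: 0.95 for axis in AXES}))
# clarify spec: must_not_change, existing_customers, on_failure, what_we_log, what_we_test, stop_rollout_if
```

Checking the spec before the axes is a deliberate ordering choice in this sketch: an unclear slice is sent back to be defined, rather than handed to a human to read the diff.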
My claim is that engineering used to confirm assumptions with stakeholders proactively, before coding. With coding agents, that step is gone. Product passes specs to engineering, which passes them to a bot. The bot doesn’t push back on ambiguities; it fills them with assumptions and keeps building. That’s why we’re scared to touch billing.

So the job changes. We should spend less time babysitting code and more time communicating the product. That means keeping the product change small, not just the diff. A tiny diff in a vague product area can still be dangerous. A larger diff that implements a sharply defined product slice can be safer. It means writing more about the user and system journey: before and after states, affected users, edge cases, failure modes, invariants, and what must never happen. Then we...
