What I'd audit on an AI-built SaaS before its first paying customer

Ken Barlow

23 May, 2026

A team shipped an AI-built MVP to production last month. Two weeks after launch, one of their customers read another customer's data. The model had written thirteen of fourteen handlers with the correct tenant-isolation check, and the reviewer didn't catch the missing one because it looked exactly like the other thirteen: same shape, same length, same idiomatic structure. The test suite was green. The architecture diagram was clean. The bug shipped because nothing in the codebase mechanically refused it.

This is not a model failure. This is not a prompt failure. This is the boring, predictable failure of behavioral gates at scale: the "CRITICAL: validate tenant access" line in CLAUDE.md, the code review checklist, the system prompt instruction, the design-review handwave. All of them bet that the model would remember the rule on every handler, every refactor, every session, indefinitely. The model does not, in fact, reliably remember things at boundaries. Neither does a reviewer at 1am on a Friday after looking at thirteen near-identical functions.

I think the thing that hasn't been said clearly enough is this. The 2026 ship-with-Claude-Code ethos is producing the most ship-and-pray code modern software has seen — and the people writing it don't realize it, because the code looks fine. It compiles. It runs. The tests pass. It's good-looking code. What's missing is the substrate's refusal surface — the boundary at which the code itself, mechanically, will not accept a wrong version. Most AI-built MVPs I've reviewed have a refusal surface of approximately zero, and they ship that way because the failure modes don't show up until paying customers do.

I'll make a wager. On a randomly picked AI-built SaaS approaching its first paying customers, I'd bet at least three of the seven items below fail an audit. Probably four. I'd run the audit myself and we'd see who was right.

I've been building Allset solo with Cursor and Claude Code for about nine months: roughly 12,000 lines of AI-built code and multi-tenant infrastructure, built single-handedly at a pace that would have required a small team five years ago. Most of what I describe below I've implemented in Allset, and I'd call it battle-tested under self-imposed adversarial review. Some of it I added after auditing other teams' code and seeing what its absence cost them. The audit is the same exercise in both cases: a search for what the code, in fact, refuses.

Authorization at the boundary, not inside the handler

This is the single most common failure mode I see and the one most likely to ship to production unnoticed. The model writes a handler that loads a resource by ID, checks that the resource's tenant matches the current user's, and returns the data. It writes twelve more handlers that look identical. On handler fourteen the tenant check is missing: maybe long-context drift, maybe a refactor that moved the check into a helper the new path never calls, maybe just one of those things. Tests cover what the team wrote tests for; the handler with the missing check ships green because nobody wrote a test for the case that exposes it.
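To make that failure concrete, here is a minimal TypeScript sketch of the per-handler pattern. Everything in it is hypothetical and for illustration only: an in-memory `invoices` map stands in for the database, and `Session`, `getInvoice`, and `getInvoiceSummary` are made-up names, not code from Allset or the team in the story. Both handlers type-check; only one enforces tenant isolation, and nothing in their types tells them apart.

```typescript
// Hypothetical row shape; tenantId records which customer owns the row.
interface Invoice {
  id: string;
  tenantId: string;
  amountCents: number;
}

// Hypothetical session: the caller's tenant, as established at login.
interface Session {
  userId: string;
  tenantId: string;
}

// In-memory stand-in for a database table holding two tenants' data.
const invoices = new Map<string, Invoice>([
  ["inv-1", { id: "inv-1", tenantId: "acme", amountCents: 12000 }],
  ["inv-2", { id: "inv-2", tenantId: "globex", amountCents: 9900 }],
]);

// Handler 1 of 14: load by ID, then check the tenant. Correct.
function getInvoice(session: Session, id: string): Invoice | undefined {
  const invoice = invoices.get(id);
  if (invoice === undefined || invoice.tenantId !== session.tenantId) {
    return undefined; // cross-tenant reads are treated as "not found"
  }
  return invoice;
}

// Handler 14 of 14: same signature, same shape, but the tenant check is gone.
// `session` is accepted and never consulted; the compiler has no objection,
// because a raw ID is all the lookup needs.
function getInvoiceSummary(session: Session, id: string): string | undefined {
  const invoice = invoices.get(id);
  if (invoice === undefined) {
    return undefined;
  }
  return `${invoice.id}: ${invoice.amountCents} cents`;
}
```

With an `acme` session, `getInvoice(session, "inv-2")` returns `undefined`, while `getInvoiceSummary(session, "inv-2")` returns Globex's invoice summary. A suite that only tests same-tenant reads passes either way.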

The behavioral fix is what almost every team reaches for first. Add "CRITICAL: always validate tenant access" to CLAUDE.md. Add it to the code review template. Slack-pin it. Tell the team to be careful. I think this is misplaced effort, and I'd argue it's worse than nothing: it creates the feeling of having addressed the problem without actually addressing it. Reviewers see the line in CLAUDE.md and read more permissively because they trust the model to have followed it. The model didn't. Nobody catches it.

The structural fix is to make proof of authorization a type that handlers require, rather than a runtime check each handler has to remember. The handler signature takes a TenantAccess value. That value can only be constructed by one function, the boundary function, which does the database lookup and refuses if membership isn't proven. The model literally cannot write a handler that returns tenant-scoped data without going through the boundary, because code that skips the boundary doesn't compile. Allset is built with SpiceDB at this exact boundary. Every read flows through a permission check that returns a typed value. Handlers don't see raw tenant IDs. Skipping the check stops being a discipline problem and becomes a "code doesn't compile" problem, which is the only kind of problem you can actually solve at the speed AI-built code is being written.
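Here is a minimal sketch of that shape, assuming a TypeScript codebase. `TenantAccess.authorize`, `PermissionChecker`, `listProjects`, and the in-memory data are hypothetical names chosen for illustration, not Allset's code, and `PermissionChecker` stands in for whatever answers the membership question (SpiceDB, in Allset's case; its real client API is not shown). The check is synchronous here for brevity, where a real network-backed check would be async. The private constructor and private fields are what make the compiler refuse a handcrafted `TenantAccess`, and the handler only ever reads a tenant ID off the verified value.

```typescript
// Hypothetical stand-in for the authorization backend. In Allset this role
// is played by SpiceDB; the real client API is not modeled here.
interface PermissionChecker {
  isMember(userId: string, tenantId: string): boolean;
}

// Proof that tenant membership was checked at the boundary.
class TenantAccess {
  // Private fields make the type effectively nominal: an object literal
  // with the same public shape is not assignable to TenantAccess.
  private readonly _userId: string;
  private readonly _tenantId: string;

  // Private constructor: code outside this class cannot call `new TenantAccess`.
  private constructor(userId: string, tenantId: string) {
    this._userId = userId;
    this._tenantId = tenantId;
  }

  get userId(): string {
    return this._userId;
  }

  get tenantId(): string {
    return this._tenantId;
  }

  // The single boundary function and the only way to obtain a TenantAccess.
  // It refuses (throws) unless membership is proven.
  static authorize(checker: PermissionChecker, userId: string, tenantId: string): TenantAccess {
    if (!checker.isMember(userId, tenantId)) {
      throw new Error(`user ${userId} is not a member of tenant ${tenantId}`);
    }
    return new TenantAccess(userId, tenantId);
  }
}

// Hypothetical in-memory data for the example.
const projectsByTenant = new Map<string, string[]>([
  ["acme", ["Rocket Skates"]],
  ["globex", ["Doomsday Device"]],
]);

// Handlers take the proof, never a raw tenant ID, so there is no check to forget.
function listProjects(access: TenantAccess): string[] {
  return projectsByTenant.get(access.tenantId) ?? [];
}

// Neither of these compiles:
// listProjects({ userId: "u1", tenantId: "globex" }); // literal lacks TenantAccess's private members
// new TenantAccess("u1", "globex");                   // constructor is private
```

The refusal lives in the type system, so an `as any` cast can still bypass it; what changes is that skipping the boundary now takes a visible, greppable act rather than a silent omission.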

If your AI-built SaaS has N handlers each independently verifying tenant access, you have N chances to ship a bug. If your handlers receive a typed access proof that can only come from one place, you have one. That's a different number, and the difference compounds the more handlers the model writes.
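To put rough numbers on that difference, here is a back-of-the-envelope TypeScript sketch. It rests on an assumption made purely for illustration, not a measured rate: each independently written check is missed with the same probability p, independently of the others. Under that assumption the chance that at least one of N handlers ships without its check is 1 - (1 - p)^N, while a single boundary function leaves you with roughly one p to worry about. The function name `pAtLeastOneMiss` and the 1% rate are hypothetical.

```typescript
// Probability that at least one of `handlers` independently written checks
// is missing, assuming each is missed with probability `pMiss`, independently.
// Both the independence and the value of pMiss are illustrative assumptions.
function pAtLeastOneMiss(pMiss: number, handlers: number): number {
  return 1 - Math.pow(1 - pMiss, handlers);
}

// Compare a few handler counts at a hypothetical 1% per-handler miss rate.
for (const n of [1, 14, 100]) {
  console.log(`N=${n}: ${(pAtLeastOneMiss(0.01, n) * 100).toFixed(1)}%`);
}
// N=1: 1.0%
// N=14: 13.1%
// N=100: 63.4%
```

Even at that 1% rate, fourteen handlers give roughly a one-in-eight chance of a leak, and a hundred give better-than-even odds.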

Defense in depth at the database layer

The...
