The Agentic Test Pyramid — Matthew Boston
One Axis Isn’t Enough Anymore
Martin Fowler’s test pyramid — and Ham Vocke’s practical write-up of it on Fowler’s site — sorts tests along a single axis: integration scope. Unit at the bottom, integration in the middle, end-to-end (E2E) at the top. Lots of fast, deterministic unit tests; fewer slow, end-to-end ones. None of that is wrong, and none of it has expired. For the deterministic parts of your system — which is still most of any real system — it remains the right model, and you should keep following it exactly as written.
It works because it quietly assumes every test is deterministic: the same input always yields the same pass or fail. That assumption dies the moment part of your system is a large language model. The system becomes non-deterministic: the same input no longer guarantees the same output. Run the same prompt twice and you might get different words, a different number of findings, a different tool call. You can’t assertEqual your way through that, and you can’t block a merge on a check that’s red 8% of the time for no reason.
So this isn’t a replacement for Fowler’s pyramid — it’s an extension of it. You keep the scope axis intact and add a second one: determinism and cost. Every test now lives at a coordinate of (scope, determinism). The deterministic layers are the original pyramid, unchanged. The new work is bolting two more layers on top for the non-deterministic parts — the model-driven pieces that won’t answer the same way twice. The whole discipline becomes pushing each check as far down and as far toward “deterministic” as it will go — back into Fowler’s pyramid wherever possible — because the cheapest, most reliable place to catch a bug is almost never an expensive model eval.
The Six Layers
The two axes give you six layers. The bottom four are Fowler’s pyramid intact (free, deterministic, run on every commit), just renamed for the agentic context. The top two are the part most test suites don’t have: graded checks on behavior that changes from one run to the next.
Pure unit — f(input) -> output. Parsing, formatting, validation, version math. No I/O, no network, no model. This is the wide base, and almost anything that can be refactored into this shape should be, precisely so it can live here.
Static-invariant tripwires — the highest-leverage layer, and the one most suites lack. More on these below.
In-process integration — real components wired together, no external surface. Fast and free because nothing leaves the box.
Real-dependency E2E — launch the real browser, daemon, or child process, but only when the integration is the thing under test. Mock the world, never the subject.
Behavioral E2E against the live model — here the unit under test is a prompt or a policy, and you have to call the real model. “Given a planted vulnerability, the review must flag it.” “Given read-only mode, the agent must never call the write tool.”
Quality evals (model-as-judge) — when correctness is genuinely subjective, use a second model to grade the first. Grade the output against a rubric and pass it on a range, never on an exact match.
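The top two layers can be sketched roughly as below. This is an illustration under stated assumptions, not a real harness: the `Agent` interface, the tool names, and the score bounds are all hypothetical, and a stub stands in for the live model so the snippet runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class ToolCall:
    name: str


# Hypothetical agent interface: (prompt, mode) -> the tool calls the agent made.
Agent = Callable[[str, str], List[ToolCall]]


def write_calls_in_read_only(agent: Agent, prompt: str, write_tools: Set[str]) -> List[str]:
    """Behavioral check: run in read-only mode, return any write tools that were called."""
    calls = agent(prompt, "read-only")
    return [call.name for call in calls if call.name in write_tools]


def passes_rubric(judge_scores: List[float], low: float, high: float) -> bool:
    """Quality eval: pass when the mean judge score falls inside [low, high]."""
    mean = sum(judge_scores) / len(judge_scores)
    return low <= mean <= high


def stub_agent(prompt: str, mode: str) -> List[ToolCall]:
    """Stand-in for the live model (an assumption so the sketch needs no network)."""
    return [ToolCall("read_file"), ToolCall("search")]
```

In a real suite, `stub_agent` would be replaced by a call to the actual model, and `judge_scores` would come from the second model grading against the rubric. The point of the shape is that neither check compares output text for equality: one asserts a policy, the other a range.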
Tripwires Are the Multiplier
Layer two earns its own section. A static-invariant tripwire doesn’t run behavior at all. It reads your own source or config and asserts a contract with a pattern match. It’s executable architecture documentation that fails the build in milliseconds.
Every load-bearing rule in your codebase — “never import X from Y,” “all writes route through this helper,” “importing this module must have no side effects” — gets a test that fails when someone violates it, without executing anything. The discipline is simple: when you write a comment that says “NEVER do X here,” write the tripwire in the same change. A constraint without a test is just a suggestion.
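A "never import X from Y" tripwire might look something like the sketch below. It assumes a Python codebase, and it swaps the plain pattern match for Python's `ast` module so that comments and string literals don't trigger false hits; the module name `billing` is purely hypothetical. The source is parsed, never executed.

```python
import ast
from typing import List


def forbidden_imports(source: str, banned_module: str) -> List[int]:
    """Return the line numbers where `source` imports `banned_module` or a submodule."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        if any(n == banned_module or n.startswith(banned_module + ".") for n in names):
            hits.append(node.lineno)
    return sorted(hits)


# Hypothetical file contents; the mention inside the comment is not flagged.
SAMPLE = "import os\n# never import billing here\nfrom billing.db import connect\n"
assert forbidden_imports(SAMPLE, "billing") == [3]
```

A real tripwire would loop this over the files in the protected directory and fail the build on any non-empty result.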
This is what lets a product with a model in it stay correct cheaply. A huge fraction of “regressions” are really contract violations, and contracts stay deterministic even when behavior doesn’t.
Determinism Decides the Gate, Cost Decides the Cadence
Here’s the rule that keeps the whole thing sane. Classify every paid test — one that spends real money calling the model — by asking one question:
Can this test be red for a legitimate, non-bug reason — model variance, a flaky external service, a subjective threshold?
If no, it’s a gate: it blocks merge and runs on every PR. If yes, it’s periodic: it never blocks merge and runs on a schedule. You can never gate CI on a check that’s sometimes-red-for-no-reason, because that trains everyone to ignore red. Gate on what’s stable, monitor what’s fuzzy.
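The one-question rule is small enough to write down as code. The sketch below is only one possible encoding; `PaidTest`, `Cadence`, and the field name are hypothetical, and a real suite would more likely express this through its test runner's tagging mechanism.

```python
from dataclasses import dataclass
from enum import Enum


class Cadence(Enum):
    GATE = "gate"          # blocks merge, runs on every PR
    PERIODIC = "periodic"  # never blocks merge, runs on a schedule


@dataclass(frozen=True)
class PaidTest:
    name: str
    # True if model variance, a flaky external service, or a subjective
    # threshold could turn this test red without a real bug.
    can_be_red_without_bug: bool


def classify(test: PaidTest) -> Cadence:
    """Apply the single question: legitimate non-bug failures mean periodic."""
    return Cadence.PERIODIC if test.can_be_red_without_bug else Cadence.GATE
```

Making the answer an explicit field forces whoever adds a paid test to decide up front whether it can ever be red for a non-bug reason.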
For these checks, the trick is to test against a range, not an exact answer. An exact-match assertion — “the agent found exactly 10 issues” — will fail at random the moment the model words things differently, a false alarm that teaches everyone to ignore red. So you plant a known set of problems as your answer key and check that the...