Test passed. Would it have failed if the feature were broken?

Field notes · Falsifiability

by Shane Cope · Plumbline

I’ve spent a long time on a hard, unglamorous corner of building: proving that software actually does what it’s supposed to before it ships. Over several months and a few different builds, I kept hitting variations of the same thing — a green test suite sitting on top of a system that was, in some specific way, broken. Three of those taught me something I now think is the most important question you can ask of any test. Here they are, shortest to most interesting.

43 passing tests for a query that couldn’t run

A piece of code set a session variable for row-level security, in effect SET LOCAL app.current_user = :user_id. It had 43 passing tests, every one asserting that the code produced that string. None ran it against Postgres.

The string is valid Python and invalid SQL: you can’t bind a parameter to SET LOCAL, and current_user is reserved. It threw the instant it touched a real database. The fix:

SELECT set_config('app.current_user', $1, true)

That form takes a real parameter. But the fix isn’t the point. 43 tests “covered” the line, and all 43 tested a mock of the database, so the database’s opinion, the only one that mattered, was never heard.
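To make the gap concrete, here is a minimal sketch of the two kinds of test side by side. It uses Python's built-in sqlite3 purely as a stand-in engine so that it runs without a server; the PRAGMA statement and the helper names (build_statement, engine_accepts) are illustrative assumptions, not the project's code. SQLite's PRAGMA, like Postgres's SET LOCAL, does not accept a bound parameter, so the string assertion passes while the engine rejects the statement.

```python
import sqlite3


def build_statement() -> str:
    # Hypothetical builder: returns SQL text containing a bind placeholder.
    return "PRAGMA user_version = ?"


def engine_accepts(sql: str, params: tuple) -> bool:
    """Return True only if a real engine parses and runs the statement."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.Error:
        # The engine rejected the statement (here: a syntax error at the "?").
        return False
    finally:
        conn.close()


# String-only test: checks what the code produced, never asks the engine.
string_test_passes = build_statement() == "PRAGMA user_version = ?"

# Boundary-crossing test: hands the same statement to the engine.
engine_test_passes = engine_accepts(build_statement(), (7,))

print(string_test_passes, engine_test_passes)
```

The real proof still has to run on Postgres with the project's actual driver; the sketch only shows why a string assertion cannot stand in for execution.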

53 passing tests and no row-level security

Different system. A migration enabled row-level security on a new table — one op.execute() with three statements. 53 tests, all green.

Those were metadata tests: import the models, build the in-memory schema graph, assert tables, columns, and constraints. But a migration is a different artifact from the models, and nothing in the schema graph represents how the policy gets emitted. The project’s only driver is asyncpg, which sends every statement as a prepared statement over the extended-query protocol — which forbids multiple commands in one. The bundled execute runs fine under psycopg2 and dies under asyncpg. On a clean production build it would have thrown on apply and rolled the whole migration back, leaving a table with no row-level security, discovered at go-live. Split into one statement per execute, fixed. Again: “53 green” and “the security control is absent on a real deploy” were true at the same moment.
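A minimal sketch of the "one statement per execute" shape, under stated assumptions: the table name, policy name, setting name, and the three statements themselves are hypothetical (the post only says the migration bundled three), and execute is injected as a plain callable so the sketch runs without Alembic. Inside a real migration's upgrade() the call would be apply_rls(op.execute). The is_single_statement guard is deliberately naive and would misfire on a semicolon inside a string literal.

```python
from typing import Callable, Sequence

# Hypothetical statements, each sent on its own.
RLS_STATEMENTS: Sequence[str] = (
    "ALTER TABLE accounts ENABLE ROW LEVEL SECURITY",
    "ALTER TABLE accounts FORCE ROW LEVEL SECURITY",
    "CREATE POLICY tenant_isolation ON accounts "
    "USING (tenant_id = current_setting('app.tenant_id')::uuid)",
)


def is_single_statement(sql: str) -> bool:
    """Naive guard: no semicolon other than an optional trailing one."""
    return ";" not in sql.strip().rstrip(";")


def apply_rls(execute: Callable[[str], object]) -> None:
    """Send each statement through its own execute call."""
    for statement in RLS_STATEMENTS:
        if not is_single_statement(statement):
            raise ValueError(f"bundled statements in one execute: {statement!r}")
        execute(statement)


# Usage with a recording stand-in for op.execute.
sent: list = []
apply_rls(sent.append)
print(len(sent))
```

A guard like this only catches the bundling; it is no substitute for applying the migration to a clean database through the same driver production uses.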

Both of these are the same species: the test exercised a stand-in — a mock, a string, an in-memory model — and the real defect lived one boundary past it. The fix is the same too: at least one test has to cross the real boundary.

The one that changed how I think about tests

This one is different, and it’s the reason I’m writing.

A deployed auth service, multi-tenant, row-level security enforcing that you can only read your own tenant’s data. The proof: a real request, on real Postgres, with a real token — “can user A read their own tenant’s rows?” It returned 200. Green. Isolation works.

Except. The table mapping users to tenants was empty. Every test of the resolver had used a mocked or hand-seeded store; nothing ever checked that the deployed resolver, reading the real table, had any row to find. And here’s the trap: if that mapping is missing, the request returns a clean 403 — and a 403 from “deny by default” looks exactly like correct fail-closed behavior. A valid token. A tidy denial. “Working as intended.”

So the success path and the silently-broken path produced nearly identical, plausible-looking output. The 200 only happened because the proof setup had populated the table first — but nothing in the test asserted that precondition, so it was luck, not proof. If the setup had skipped that step, I’d have gotten a 403 and quite possibly called it a pass.

What rescued it was a second case: an unmapped user who must be denied. Run both, and they diverge — the mapped user gets 200, the unmapped user gets 403. That contrast is the proof. Either one alone proves nothing, because either one alone is consistent with a totally broken system.
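The contrast can be sketched in a few lines. This is a toy model, not the deployed service: TenantResolver, its in-memory mapping, and the user and tenant names are all assumptions made for illustration. It shows that a denial-only check passes on a healthy system and on one whose mapping table is empty, while the paired check separates the two.

```python
from dataclasses import dataclass, field


@dataclass
class TenantResolver:
    # user_id -> tenant_id; stands in for the real mapping table.
    mapping: dict = field(default_factory=dict)

    def status_for(self, user_id: str, tenant_id: str) -> int:
        """200 if the user is mapped to the tenant, otherwise deny by default."""
        return 200 if self.mapping.get(user_id) == tenant_id else 403


def denial_only_check(resolver: TenantResolver) -> bool:
    # Negative case alone: an unmapped user must be denied.
    return resolver.status_for("user-b", "tenant-1") == 403


def paired_check(resolver: TenantResolver) -> bool:
    # Positive and negative case together: the two outcomes must diverge.
    allowed = resolver.status_for("user-a", "tenant-1")
    denied = resolver.status_for("user-b", "tenant-1")
    return allowed == 200 and denied == 403


healthy = TenantResolver({"user-a": "tenant-1"})
empty = TenantResolver()  # the silently broken deployment: no mapping rows

print(denial_only_check(healthy), denial_only_check(empty))
print(paired_check(healthy), paired_check(empty))
```

The denial-only check cannot tell healthy from empty; only the paired check goes red on the empty mapping.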

The principle

A green is only evidence if a broken system would have gone red

This is just the negative control from experimental science, or falsifiability from Popper, dressed in test-runner clothes. A positive result carries information only if the negative state would have looked different. A test that passes no matter what — because the success output and the failure output are the same, or because the precondition it depends on is silently always-true in your test environment — is not a weak test. It’s a decorative one. It tells you nothing.
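One way to turn the principle into a habit, in the spirit of mutation testing, is to run each test against a deliberately broken implementation and require that it fails. The sketch below is a hypothetical harness; is_falsifiable and the toy is_even functions are names invented for the example, not an existing library's API.

```python
from typing import Callable


def is_falsifiable(
    test: Callable[[Callable], bool],
    working: Callable,
    broken: Callable,
) -> bool:
    """A test counts as evidence only if it passes on the working
    implementation and fails on the deliberately broken one."""
    return test(working) and not test(broken)


def is_even(n: int) -> bool:
    return n % 2 == 0


def is_even_broken(n: int) -> bool:
    # Deliberately broken variant: says yes to everything.
    return True


def decorative_test(f: Callable) -> bool:
    # Checks only a case where the broken and working versions agree.
    return f(2)


def real_test(f: Callable) -> bool:
    # Includes a case the broken version gets wrong.
    return f(2) and not f(3)


print(is_falsifiable(decorative_test, is_even, is_even_broken))
print(is_falsifiable(real_test, is_even, is_even_broken))
```

The decorative test is green on both implementations, so its green carries no information; the real test is green on one and red on the other.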

Putting the three together, a green check is evidence only if all of these hold:

it exercised the real thing, not a stand-in;

from a reproducible state a clean rebuild could recreate (pin your inputs; “it worked before” on an unpinned environment...
