Your test passed. Would it have failed if the feature were broken? — Plumbline
Field notes · Falsifiability
Your test passed. Would it have failed if the feature were broken?
by Shane Cope · Plumbline
I’ve spent a long time on a hard, unglamorous corner of building: proving that software actually does what it’s supposed to before it ships. Over several months and a few different builds, I kept hitting variations of the same thing — a green test suite sitting on top of a system that was, in some specific way, broken. Three of those taught me something I now think is the most important question you can ask of any test. Here they are, shortest to most interesting.
01
43 passing tests for a query that couldn’t run
A piece of code set a session variable for row-level security — morally SET LOCAL app.current_user = :user_id. 43 passing tests, every one asserting the code produced that string. None ran it against Postgres.
The string is valid Python and invalid SQL: you can’t bind a parameter to SET LOCAL, and current_user is reserved. It threw the instant it touched a real database. The fix:
SELECT set_config('app.current_user', $1, true)<br>— which takes a real parameter. But the fix isn’t the point. 43 tests “covered” the line, and all 43 tested a mock of the database, so the database’s opinion — the only one that mattered — was never heard.
02
53 passing tests and no row-level security
Different system. A migration enabled row-level security on a new table — one op.execute() with three statements. 53 tests, all green.
Those were metadata tests: import the models, build the in-memory schema graph, assert tables, columns, and constraints. But a migration is a different artifact from the models, and nothing in the schema graph represents how the policy gets emitted. The project’s only driver is asyncpg, which sends every statement as a prepared statement over the extended-query protocol — which forbids multiple commands in one. The bundled execute runs fine under psycopg2 and dies under asyncpg. On a clean production build it would have thrown on apply and rolled the whole migration back, leaving a table with no row-level security, discovered at go-live. Split into one statement per execute, fixed. Again: “53 green” and “the security control is absent on a real deploy” were true at the same moment.
Both of these are the same species: the test exercised a stand-in — a mock, a string, an in-memory model — and the real defect lived one boundary past it. The fix is the same too: at least one test has to cross the real boundary.
03
The one that changed how I think about tests
This one is different, and it’s the reason I’m writing.
A deployed auth service, multi-tenant, row-level security enforcing that you can only read your own tenant’s data. The proof: a real request, on real Postgres, with a real token — “can user A read their own tenant’s rows?” It returned 200. Green. Isolation works.
Except. The table mapping users to tenants was empty. Every test of the resolver had used a mocked or hand-seeded store; nothing ever checked that the deployed resolver, reading the real table, had any row to find. And here’s the trap: if that mapping is missing, the request returns a clean 403 — and a 403 from “deny by default” looks exactly like correct fail-closed behavior. A valid token. A tidy denial. “Working as intended.”
So the success path and the silently-broken path produced nearly identical, plausible-looking output. The 200 only happened because the proof setup had populated the table first — but nothing in the test asserted that precondition, so it was luck, not proof. If the setup had skipped that step, I’d have gotten a 403 and quite possibly called it a pass.
What rescued it was a second case: an unmapped user who must be denied. Run both, and they diverge — the mapped user gets 200, the unmapped user gets 403. That contrast is the proof. Either one alone proves nothing, because either one alone is consistent with a totally broken system.
The principle
A green is only evidence if a broken system would have gone red
This is just the negative control from experimental science, or falsifiability from Popper, dressed in test-runner clothes. A positive result carries information only if the negative state would have looked different. A test that passes no matter what — because the success output and the failure output are the same, or because the precondition it depends on is silently always-true in your test environment — is not a weak test. It’s a decorative one. It tells you nothing.
Putting the three together, a green check is evidence only if all of these hold:
it exercised the real thing , not a stand-in;
from a reproducible state a clean rebuild could recreate (pin your inputs; “it worked before” on an unpinned environment...