Test passed. Would it have failed if the feature were broken?

copest1 pts0 comments

Your test passed. Would it have failed if the feature were broken? — Plumbline

Field notes · Falsifiability

Your test passed. Would it have failed if the feature were broken?

by Shane Cope · Plumbline

I’ve spent a long time on a hard, unglamorous corner of building: proving that software actually does what it’s supposed to before it ships. Over several months and a few different builds, I kept hitting variations of the same thing — a green test suite sitting on top of a system that was, in some specific way, broken. Three of those taught me something I now think is the most important question you can ask of any test. Here they are, shortest to most interesting.

01

43 passing tests for a query that couldn’t run

A piece of code set a session variable for row-level security — morally SET LOCAL app.current_user = :user_id. 43 passing tests, every one asserting the code produced that string. None ran it against Postgres.

The string is valid Python and invalid SQL: you can’t bind a parameter to SET LOCAL, and current_user is reserved. It threw the instant it touched a real database. The fix:

SELECT set_config('app.current_user', $1, true)<br>— which takes a real parameter. But the fix isn&rsquo;t the point. 43 tests &ldquo;covered&rdquo; the line, and all 43 tested a mock of the database, so the database&rsquo;s opinion — the only one that mattered — was never heard.

02

53 passing tests and no row-level security

Different system. A migration enabled row-level security on a new table — one op.execute() with three statements. 53 tests, all green.

Those were metadata tests: import the models, build the in-memory schema graph, assert tables, columns, and constraints. But a migration is a different artifact from the models, and nothing in the schema graph represents how the policy gets emitted. The project&rsquo;s only driver is asyncpg, which sends every statement as a prepared statement over the extended-query protocol — which forbids multiple commands in one. The bundled execute runs fine under psycopg2 and dies under asyncpg. On a clean production build it would have thrown on apply and rolled the whole migration back, leaving a table with no row-level security, discovered at go-live. Split into one statement per execute, fixed. Again: &ldquo;53 green&rdquo; and &ldquo;the security control is absent on a real deploy&rdquo; were true at the same moment.

Both of these are the same species: the test exercised a stand-in — a mock, a string, an in-memory model — and the real defect lived one boundary past it. The fix is the same too: at least one test has to cross the real boundary.

03

The one that changed how I think about tests

This one is different, and it&rsquo;s the reason I&rsquo;m writing.

A deployed auth service, multi-tenant, row-level security enforcing that you can only read your own tenant&rsquo;s data. The proof: a real request, on real Postgres, with a real token — &ldquo;can user A read their own tenant&rsquo;s rows?&rdquo; It returned 200. Green. Isolation works.

Except. The table mapping users to tenants was empty. Every test of the resolver had used a mocked or hand-seeded store; nothing ever checked that the deployed resolver, reading the real table, had any row to find. And here&rsquo;s the trap: if that mapping is missing, the request returns a clean 403 — and a 403 from &ldquo;deny by default&rdquo; looks exactly like correct fail-closed behavior. A valid token. A tidy denial. &ldquo;Working as intended.&rdquo;

So the success path and the silently-broken path produced nearly identical, plausible-looking output. The 200 only happened because the proof setup had populated the table first — but nothing in the test asserted that precondition, so it was luck, not proof. If the setup had skipped that step, I&rsquo;d have gotten a 403 and quite possibly called it a pass.

What rescued it was a second case: an unmapped user who must be denied. Run both, and they diverge — the mapped user gets 200, the unmapped user gets 403. That contrast is the proof. Either one alone proves nothing, because either one alone is consistent with a totally broken system.

The principle

A green is only evidence if a broken system would have gone red

This is just the negative control from experimental science, or falsifiability from Popper, dressed in test-runner clothes. A positive result carries information only if the negative state would have looked different. A test that passes no matter what — because the success output and the failure output are the same, or because the precondition it depends on is silently always-true in your test environment — is not a weak test. It&rsquo;s a decorative one. It tells you nothing.

Putting the three together, a green check is evidence only if all of these hold:

it exercised the real thing , not a stand-in;

from a reproducible state a clean rebuild could recreate (pin your inputs; &ldquo;it worked before&rdquo; on an unpinned environment...

rsquo test real broken tests ldquo

Related Articles