Testing a Security Tool Like It Can Hurt People


A security tool cannot be tested like a normal CLI. We built a standing assurance platform around deterministic fixtures, real-kernel runners, preserved artifacts, and red runs that prove the system can fail loudly.

Emphere Engineering | Assurance | June 8, 2026 | 9 min read

01 Security tools fail quietly

A security tool cannot be tested like a normal CLI. When a normal CLI is wrong, it usually crashes, exits non-zero, or produces output so malformed that something downstream rejects it.

A security tool can fail much more quietly. It can return a clean report with the wrong reachability label, the wrong process attribution, or the wrong confidence. The UI renders, the pipeline passes, and someone makes a decision on top of a result that should never have been trusted.

The failure mode we care about most is output that looks certain when it is wrong.

One of our runners went red because the tool guessed. The fixture had multiple Python worker processes, and in that situation the correct behavior is to abstain. If more than one process could have caused an import, the tool should say it cannot attribute the event, not pick a process and make the report look more precise than it is. In the run, that mistake showed up as 157 inferred attributions where the invariant allowed 0.

A security product cannot afford to ship that kind of bug. The assurance system has a concrete job: catch the moment the tool starts sounding more certain than the evidence allows.

We are building a container security tool that answers questions across static analysis, runtime evidence, and vulnerability data: what is installed, what is linked, what application code can reach, and what was actually observed when the container ran. If that evidence is wrong, the product is wrong quietly.

The only useful defense is an oracle: a deterministic check that knows what the answer should be and fails loudly when the tool disagrees. The oracle is not a model and it is not a dashboard. It is a repeatable artifact with an expected value, an actual value, and a failure mode.
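As a minimal sketch of that shape, assuming nothing about the actual platform: the `Oracle` class, the `OracleFailure` exception, and the check name below are hypothetical names chosen for illustration. The block only shows the three parts named above, an expected value fixed before the run, an actual value taken from the tool, and a failure that cannot be ignored.

```python
from dataclasses import dataclass


class OracleFailure(AssertionError):
    """Raised when the tool's answer disagrees with the known right answer."""


@dataclass(frozen=True)
class Oracle:
    name: str         # hypothetical label for the check
    expected: object  # the known right answer, fixed before the run

    def check(self, actual: object) -> None:
        # Fail loudly: there is no warning level and no soft pass.
        if actual != self.expected:
            raise OracleFailure(
                f"{self.name}: expected {self.expected!r}, got {actual!r}"
            )


# The multiprocess case from this post: 0 inferred attributions allowed.
oracle = Oracle(name="inferred_attributions_multiprocess", expected=0)
oracle.check(0)  # green run: returns without raising

try:
    oracle.check(157)  # red run: the tool guessed
except OracleFailure as exc:
    message = str(exc)
```

Raising instead of returning a boolean is a deliberate choice in this sketch: a caller that forgets to inspect a return value still gets a red run.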

02 Reading guide

Before the rest of the post gets technical, here is the vocabulary in plain English.

An oracle is a check with a known right answer. If the tool disagrees, the run fails.

A fixture is a small test container built to prove one behavior: a static Go binary, a dynamic cgo binary, or a Python app with multiple worker processes. An invariant is the rule that fixture must always satisfy. A static binary should not show a runtime shared-library load. A multiprocess app should not get guessed attribution.
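One way to picture the two invariants just described is as predicates over what a run observed. This is a sketch under assumptions: the `Evidence` fields, the fixture names, and the `violated` helper are invented for illustration and are not the tool's real schema.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Evidence:
    # Hypothetical summary of one fixture run.
    runtime_library_loads: tuple[str, ...] = ()
    inferred_attributions: int = 0


Invariant = Callable[[Evidence], bool]

# One behavioral rule per fixture (fixture names are illustrative).
INVARIANTS: dict[str, Invariant] = {
    # A static binary should not show a runtime shared-library load.
    "static-go": lambda ev: len(ev.runtime_library_loads) == 0,
    # A multiprocess app should not get guessed attribution.
    "python-multiworker": lambda ev: ev.inferred_attributions == 0,
}


def violated(fixture: str, evidence: Evidence) -> bool:
    """Return True when the run breaks the fixture's invariant."""
    return not INVARIANTS[fixture](evidence)
```

Each rule states a behavior that must hold on every run, so it does not need updating when unrelated event volumes change.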

A collector is the mechanism that observes runtime behavior. In this post, that means a local proc-maps collector and a Linux eBPF collector. Runtime evidence is what the tool actually observed when the container ran. Process attribution is tying that observation back to the process that caused it. If we cannot do that safely, the tool should say it does not know.

03 What we test

Correctness for this kind of product is not one property, so the assurance work cannot be one test. The first slice we have pushed hard is runtime evidence, process attribution, fixture invariants, and collector agreement. The larger map is what we have to keep proving over time:

1. Static graph construction. Packages, native extensions, OS layers, ownership.

2. Runtime evidence. What loads during the observed window.

3. Process attribution. Which process caused an observation, and when to refuse to guess.

4. Collector agreement. Different observers should agree on the deterministic verdict even when raw signal volume differs.

5. Vulnerability database behavior. Feed drift must not masquerade as a tool regression.

6. Fixture invariants. Behavioral expectations, not brittle counts.

7. Exploit confirmation. Exercise a vulnerable path and verify that the patched twin goes silent.

8. AI triage. Agents read and explain evidence; they never produce truth.

The product still has to be right about everything outside that first slice. The graph it builds, the vulnerability data it joins in, exploit confirmation, and AI triage all stay open until they have the same shape of proof: a green run, a red run, and an artifact behind the claim.
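Item 4 in the map, collector agreement, is easier to see in code. The sketch below assumes that the "deterministic verdict" can be reduced to the set of libraries seen loaded; that reduction, the event shape, and the library name are assumptions for illustration, not the product's actual comparison.

```python
from typing import Iterable, Mapping


def verdict(events: Iterable[Mapping[str, str]]) -> frozenset:
    """Reduce raw events to a verdict: the set of libraries seen loaded.

    Collapsing to a set discards volume, so a noisy collector and a
    quiet one can still be compared.
    """
    return frozenset(event["library"] for event in events)


def collectors_agree(
    proc_maps_events: Iterable[Mapping[str, str]],
    ebpf_events: Iterable[Mapping[str, str]],
) -> bool:
    return verdict(proc_maps_events) == verdict(ebpf_events)


# Illustrative data: one proc-maps entry versus three eBPF events
# for the same library.
proc_maps = [{"library": "libssl.so.3"}]
ebpf = [{"library": "libssl.so.3"}] * 3

same_verdict = collectors_agree(proc_maps, ebpf)
missing_signal = collectors_agree(proc_maps, [])
```

Here `same_verdict` is true even though the raw counts differ, and `missing_signal` is false because one collector saw nothing.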

04 The first proven slice

We started with runtime evidence because that is where correctness gets subtle.

If a library was loaded by the app server, that is useful evidence. If it was loaded by a shell helper or an injected reader, that is a different claim. If a Python import is attributed to the wrong worker in a multiprocess app, the report looks more precise while becoming less true.

So the first oracle checks process attribution. In a single-process runtime, inferred attribution is acceptable when there is exactly one candidate. In a multiprocess runtime, the system must abstain rather than guess.
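The rule in that paragraph fits in a few lines. This is a minimal sketch, not the tool's implementation: the `Attribution` type, the status strings, and the `attribute` function are hypothetical, and treating an empty candidate list as an abstention is an added assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Attribution:
    status: str         # "inferred" or "abstained"
    pid: Optional[int]  # set only when status is "inferred"


def attribute(candidate_pids: Sequence[int]) -> Attribution:
    """Infer only when exactly one process could have caused the event."""
    unique = set(candidate_pids)
    if len(unique) == 1:
        return Attribution("inferred", next(iter(unique)))
    # Zero or several candidates: say so instead of picking one.
    return Attribution("abstained", None)


single = attribute([4242])        # single-process runtime
multi = attribute([4242, 4243])   # multiple Python workers
```

An oracle over the multiprocess fixture then only has to count results whose status is "inferred" and require that count to be 0.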

The current corpus is intentionally small enough that each fixture...
