Fail loudly: a plea to stop hiding bugs

Posted: 2025-09-20

We often make the mistake of hiding logical errors in our software to make it seem more robust. The thinking is simple and seductive: ignore the unexpected condition, prevent a crash, let the program continue. This hampers software maintainability and correctness.

Logical errors ≠ normal error conditions

Logical errors are conditions that should never be allowed to happen.

This is different from normal error conditions, which are expected. Examples:

The user specified a path that doesn’t exist.

A URL could not be loaded.

The configuration denied the requested operation.

This text doesn’t say anything about those conditions, which should be handled gracefully. This text, instead, focuses on the bugs we should not handle.

Two examples

A few recent cases during the implementation of an AI agent at Google motivated me to write this.

In both cases the intention was laudable: make the software more reliable; prevent runtime errors; avoid losing information (in invalid conditions).

But the end result is the opposite: it may mask very unexpected situations, making it harder to ensure the reliability of our software:

Example 1: Silent UI Failure

We had to implement a TypeScript browser UI that fetches “messages” from a local backend and renders them in the browser. We had an if condition validating an important expected invariant (regarding the order of these messages). When the invariant didn’t hold, we just logged the unexpected situation to the console and recovered; without the check, the code would have hit a runtime error (“reading properties of undefined”, or similar).

But if the invariant doesn’t hold… something must have gone seriously wrong in the backend or in the protocol, where maybe messages have not been propagated correctly.

Example 2: Observability and unregistered conversations

In the backend we added an “observability” layer to track the success rate of all invocations by logging conversation messages to a central database.

We wrote some code to handle a supposedly impossible case, where a conversation isn’t registered with the observability layer, by inventing (mostly empty) metadata for it. A comment explained the logic:

Rather than log an error, we create new metadata entry for this conversation. This allows us to monitor “orphan” messages.

But if a conversation isn’t registered… something must have gone seriously wrong. Are there places where conversations are created that we didn’t identify? Are our listeners being called correctly?

Ironically, this attempt to make the observability layer more robust compromises its integrity. How can we trust a monitoring layer that silently fixes its own data?
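The original code isn’t shown here, so what follows is only a minimal Python sketch of the alternative, under stated assumptions: the names `ObservabilityLayer`, `register`, `record_message` and `UnregisteredConversationError` are all hypothetical, and an in-memory dictionary stands in for the central database. The point it illustrates is that an unregistered conversation raises instead of receiving invented metadata.

```python
class UnregisteredConversationError(RuntimeError):
    """Raised when a message arrives for a conversation nobody registered."""


class ObservabilityLayer:
    """Hypothetical in-memory stand-in for the central database."""

    def __init__(self) -> None:
        self._metadata: dict[str, dict[str, str]] = {}
        self._messages: dict[str, list[str]] = {}

    def register(self, conversation_id: str, metadata: dict[str, str]) -> None:
        self._metadata[conversation_id] = metadata
        self._messages[conversation_id] = []

    def record_message(self, conversation_id: str, message: str) -> None:
        # Assumed invariant: a conversation is registered before its first
        # message. A missing entry means a bug upstream, so no metadata is
        # invented here; the call fails instead.
        if conversation_id not in self._metadata:
            raise UnregisteredConversationError(
                f"message for unregistered conversation {conversation_id!r}"
            )
        self._messages[conversation_id].append(message)

    def messages(self, conversation_id: str) -> list[str]:
        return list(self._messages[conversation_id])
```

With this shape, an “orphan” message surfaces as a stack trace pointing at the caller that skipped registration, rather than as a mostly empty row in the database.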

As systems grow, correctness drops exponentially

Hiding potential logical errors offers a tempting short-term productivity boost; it feels easier than fixing them. Instead of crashing, the software kind-of works.

But unreliability compounds. Consider a simplified model of a system with n independent components. If each has a probability P of being correct, the chance of the entire system being correct is Pⁿ. While the failures of subcomponents in real systems are rarely truly independent, this illustrates how small imperfections can cascade into massive system-wide failures.

A system with 50 sub-components, each with a 99.5% correctness probability, has an overall correctness probability of roughly 78% (0.995⁵⁰).

At 500 sub-components, that probability plummets to a shocking 8% (0.995⁵⁰⁰).
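The arithmetic is easy to reproduce. A small Python sketch of the simplified model (the function name `system_correctness` is mine, and independence between components is the model’s assumption, not a property of real systems):

```python
def system_correctness(p: float, n: int) -> float:
    """Probability that all n independent components are correct."""
    return p ** n


for n in (50, 500):
    print(f"n={n}: {system_correctness(0.995, n):.1%}")
```

Changing `p` or `n` shows how quickly the overall figure degrades as the component count grows.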

Depending on how you count, Google systems consist of thousands of sub-components.

Large systems demand extreme rigor. We must do everything possible to ensure the invariants of each component are never broken. The first step is to never hide evidence of incorrectness.

Default to crashing loudly

Once you declare something an invariant, your code must treat it as one. Do not add complexity to handle cases where it breaks. Let exceptions bubble up. Let the program crash. A crash provides a clean, immediate, and unmissable signal that a fundamental assumption has been violated.

Technically, an exception to this rule is adding logic to your program that validates its invariants. Checking that your invariants hold can be justifiable complexity. Just make sure you don’t try to recover; simply raise an exception or crash.
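As a rough illustration of “validate, but don’t recover”, here is a Python sketch loosely modeled on the message-ordering invariant from Example 1. The original was TypeScript and isn’t shown, so `Message`, its `sequence` field, `InvariantViolation` and `check_strictly_ordered` are all hypothetical names.

```python
from dataclasses import dataclass


class InvariantViolation(RuntimeError):
    """A condition the program assumes can never happen has happened."""


@dataclass(frozen=True)
class Message:
    sequence: int  # hypothetical field: position assigned by the backend
    text: str


def check_strictly_ordered(messages: list[Message]) -> None:
    """Raise if sequence numbers are not strictly increasing; never repair."""
    for previous, current in zip(messages, messages[1:]):
        if current.sequence <= previous.sequence:
            raise InvariantViolation(
                f"message {current.sequence} follows message {previous.sequence}"
            )
```

The check uses an explicit `raise` rather than an `assert` statement, because Python removes `assert` statements when run with `-O`, which would quietly switch the validation off.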

Prefer compile-time guarantees. Enforcing invariants through the type system is much better than through tests or runtime checks.
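Python has no compile step, so the sketch below is only an approximation of the idea, relying on type hints and an external static checker. It reuses the unregistered-conversation problem from Example 2; `RegisteredConversation` and `Registry` are hypothetical names. The idea is that logging requires a handle that, by convention, only `register` hands out, so “message for an unregistered conversation” has no way to be expressed at a call site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisteredConversation:
    """Handle standing for 'this conversation was registered'."""
    conversation_id: str


class Registry:
    def __init__(self) -> None:
        self._messages: dict[str, list[str]] = {}

    def register(self, conversation_id: str) -> RegisteredConversation:
        # By convention, the only place that creates a handle.
        self._messages[conversation_id] = []
        return RegisteredConversation(conversation_id)

    def log_message(self, conversation: RegisteredConversation, text: str) -> None:
        # No "is it registered?" branch: the parameter type carries that fact.
        self._messages[conversation.conversation_id].append(text)

    def messages(self, conversation: RegisteredConversation) -> list[str]:
        return list(self._messages[conversation.conversation_id])
```

A static checker such as mypy would reject `registry.log_message("c1", "hi")`, since a plain string is not a `RegisteredConversation`. Nothing stops someone from constructing a handle by hand, but in that case the dictionary lookup raises `KeyError`, so the failure is still loud.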

Anti-patterns

This section lists a few specific anti-patterns that I’ve seen in practice.

Dictionary access

If an element is expected in a dictionary/map, use an access API that raises an exception rather than one that silently returns a default.

Python: Use map[key] (raises KeyError exception if absent), not map.get(key) (silently returns None).

C++: Use map.at(key) (throws std::out_of_range if key is absent), not map[key] (silently inserts a new element).
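A short Python sketch of the difference, using made-up data (the `ports` dictionary and both function names are purely illustrative):

```python
from typing import Optional

ports = {"http": 80, "https": 443}  # hypothetical lookup table


def port_for(scheme: str) -> int:
    # Indexing raises KeyError for an unknown scheme, at the point of the bug.
    return ports[scheme]


def port_for_silent(scheme: str) -> Optional[int]:
    # .get() returns None instead, which travels on until something
    # far from here trips over it.
    return ports.get(scheme)
```

With `port_for`, a missing key fails at the lookup itself; with `port_for_silent`, the `None` may only cause trouble several calls later, far from its cause.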

Catch-all...
