Fail loudly: a plea to stop hiding bugs

Posted: 2025-09-20

We often make the mistake of hiding logical errors in our software to make it seem more robust. The thinking is simple and seductive: ignore the unexpected condition, prevent a crash, let the program continue. This hampers software maintainability and correctness.

Logical errors ≠ normal error conditions

Logical errors are conditions that should never be allowed to happen.

This is different from normal error conditions, which are expected. Examples:

The user specified a path that doesn’t exist.

A URL could not be loaded.

The configuration denied the requested operation.

This text doesn’t say anything about those conditions, which should be handled gracefully. This text, instead, focuses on the bugs we should not handle.

Two examples

A few recent cases during the implementation of an AI agent at Google motivated me to write this.

In both cases the intention was laudable: make the software more reliable; prevent runtime errors; avoid losing information (in invalid conditions).

But the end result is the opposite: it may mask very unexpected situations, making it harder to ensure the reliability of our software.

Example 1: Silent UI Failure

We had to implement a TypeScript browser UI that fetches “messages” from a local backend and renders them in the browser. We had an if condition validating an important expected invariant (regarding the order of these messages). When the invariant didn’t hold, we just logged the unexpected situation to the console and recovered; otherwise, this would have caused a runtime error (“reading properties of undefined”, or similar).

But if the invariant doesn’t hold… something must have gone seriously wrong in the backend or in the protocol, where maybe messages have not been propagated correctly.

Example 2: Observability and unregistered conversations

In the backend we added an “observability” layer to track the success rate of all invocations by logging conversation messages to a central database.

We wrote some code handling a supposedly impossible case, where a conversation isn’t registered with the observability layer, by inventing (mostly empty) metadata for it. A comment explained the logic:

Rather than log an error, we create new metadata entry for this conversation. This allows us to monitor “orphan” messages.

But if a conversation isn’t registered… something must have gone seriously wrong. Are there places where conversations are created that we didn’t identify? Are our listeners being called correctly?

Ironically, this attempt to make the observability layer more robust compromises its integrity. How can we trust a monitoring layer that silently fixes its own data?
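A minimal sketch of the alternative, in Python. The names ObservabilityLayer, register_conversation, record_message and UnregisteredConversationError are hypothetical, and the in-memory dictionaries stand in for the central database; this is not the code we actually wrote. The only point is that an unregistered conversation raises instead of receiving invented metadata.

```python
from dataclasses import dataclass, field


class UnregisteredConversationError(RuntimeError):
    """Raised when a message arrives for a conversation nobody registered."""


@dataclass
class ObservabilityLayer:
    # Hypothetical in-memory stand-ins for the central database.
    _metadata: dict = field(default_factory=dict)
    _messages: dict = field(default_factory=dict)

    def register_conversation(self, conversation_id: str, metadata: dict) -> None:
        self._metadata[conversation_id] = metadata
        self._messages[conversation_id] = []

    def record_message(self, conversation_id: str, message: str) -> None:
        if conversation_id not in self._metadata:
            # Do not invent metadata here: a missing registration is a bug
            # somewhere upstream, and it should be visible immediately.
            raise UnregisteredConversationError(
                f"conversation {conversation_id!r} was never registered"
            )
        self._messages[conversation_id].append(message)

    def message_count(self, conversation_id: str) -> int:
        return len(self._messages[conversation_id])
```

With this shape, the first “orphan” message stops the request that produced it, which points directly at the code path that skipped registration.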

As systems grow, correctness drops exponentially

Hiding potential logical errors offers a tempting short-term productivity boost; it feels easier than fixing them. Instead of crashing, the software kind-of works.

But unreliability compounds. Consider a simplified model of a system with n independent components. If each has a probability P of being correct, the chance of the entire system being correct is Pⁿ. While the failures of subcomponents in real systems are rarely truly independent, this illustrates how small imperfections can cascade into massive system-wide failures.

A system with 50 sub-components, each with a 99.5% correctness probability, has an overall correctness probability of roughly 78% (0.995⁵⁰).

At 500 sub-components, that probability plummets to a shocking 8% (0.995⁵⁰⁰).

Depending on how you count, Google systems consist of thousands of sub-components.
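The two figures above can be reproduced with a few lines of Python. The function name system_correctness is mine, and the model is the same simplified one: n independent components, each correct with probability P.

```python
def system_correctness(p: float, n: int) -> float:
    """Probability that all n independent components are correct."""
    return p ** n


for n in (50, 500):
    # Prints the overall correctness for 99.5%-correct components.
    print(f"{n} components: {system_correctness(0.995, n):.1%}")
```

Changing 0.995 to 0.999 in the loop shows how much per-component rigor is needed before a system of 500 parts is more likely right than wrong.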

Large systems demand extreme rigor. We must do everything possible to ensure the invariants of each component are never broken. The first step is to never hide evidence of incorrectness.

Default to crashing loudly

Once you declare something an invariant, your code must treat it as one. Do not add complexity to handle cases where it breaks. Let exceptions bubble up. Let the program crash. A crash provides a clean, immediate, and unmissable signal that a fundamental assumption has been violated.

Technically, an exception to this rule is adding logic to validate the invariants to your program. Checking that your invariants hold can be justifiable complexity. Just make sure you don’t try to recover; simply raise an exception or crash.
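As a sketch of what “validate, but don’t recover” can look like in Python: the helper check_invariant, the exception InvariantViolation and the function render_order are hypothetical names, and the ordering rule (strictly increasing sequence numbers) is an assumption chosen to echo Example 1, not the real invariant from that UI.

```python
class InvariantViolation(Exception):
    """A condition the program assumes can never happen did happen."""


def check_invariant(condition: bool, message: str) -> None:
    # An explicit raise is used instead of `assert`, because Python
    # strips assert statements when run with the -O flag.
    if not condition:
        raise InvariantViolation(message)


def render_order(sequence_numbers: list) -> list:
    # Assumed invariant: messages arrive with strictly increasing
    # sequence numbers. Validate it, but do not reorder or drop anything.
    for previous, current in zip(sequence_numbers, sequence_numbers[1:]):
        check_invariant(
            previous < current,
            f"message {current} arrived after message {previous}",
        )
    return sequence_numbers
```

The check adds a few lines, but there is no fallback branch to maintain: either the data is as promised, or the caller finds out right away.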

Prefer compile-time guarantees. Enforcing invariants through the type system is much better than through tests or runtime checks.
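A rough illustration in Python, with a caveat: Python has no compile step, so this sketch assumes a static type checker such as mypy or pyright runs before the code ships, and nothing is enforced at runtime. The names RegisteredConversationId, register and record_message are hypothetical. The idea is that “this conversation is registered” becomes part of a type, so the unregistered case cannot be expressed without the checker complaining.

```python
from typing import NewType

# A plain str passed where RegisteredConversationId is expected is
# reported as an error by a static checker (assumed: mypy or pyright).
RegisteredConversationId = NewType("RegisteredConversationId", str)


def register(raw_id: str) -> RegisteredConversationId:
    # Intended to be the only place that mints a RegisteredConversationId.
    return RegisteredConversationId(raw_id)


def record_message(conversation: RegisteredConversationId, message: str) -> str:
    # No "is it registered?" branch is needed: the parameter type says so.
    return f"[{conversation}] {message}"


conversation = register("c1")
print(record_message(conversation, "hello"))
# record_message("c2", "hello")  # flagged by the type checker, not at runtime
```

Languages with a real compiler can make the same guarantee stronger, since the offending call never builds at all.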

Anti-patterns

This section lists a few specific anti-patterns that I’ve seen in practice.

Dictionary access

If an element is expected in a dictionary/map, use an access API that raises an exception rather than one that silently returns a default value.

Python: Use map[key] (raises KeyError exception if absent), not map.get(key) (silently returns None).

C++: Use map.at(key) (throws std::out_of_range if key is absent), not map[key] (silently inserts a default-constructed element).
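A small Python sketch of the difference. The renderers dictionary and both function names are made up for illustration; the behavior of dict.get and of indexing is standard.

```python
# Hypothetical lookup table; every message kind is expected to be present.
renderers = {"text": "render_text", "image": "render_image"}


def renderer_lenient(kind: str):
    # Returns None for an unknown kind; the bug surfaces later, far from here.
    return renderers.get(kind)


def renderer_strict(kind: str) -> str:
    # Raises KeyError for an unknown kind, at the exact line that is wrong.
    return renderers[kind]
```

Calling renderer_lenient("video") hands None to whatever comes next, while renderer_strict("video") fails on the spot with a KeyError naming the missing key.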

Catch-all...
