Fail loudly: a plea to stop hiding bugs
Posted: 2025-09-20
We often make the mistake of hiding logical errors in our software to make it seem more robust. The thinking is simple and seductive: ignore the unexpected condition, prevent a crash, let the program continue. This hampers software maintainability and correctness.
Logical errors ≠ normal error conditions
Logical errors are conditions that should never be allowed to happen.
This is different from normal error conditions, which are expected. Examples:
The user specified a path that doesn’t exist.
A URL could not be loaded.
The configuration denied the requested operation.
This post is not about those conditions, which should be handled gracefully. It focuses instead on the bugs we should not handle.
Two examples
A few recent cases during the implementation of an AI agent at Google motivated me to write this.
In both cases the intention was laudable: make the software more reliable; prevent runtime errors; avoid losing information (in invalid conditions).
But the end result is the opposite: hiding the problem may mask very unexpected situations, making it harder to ensure the reliability of our software:
Example 1: Silent UI Failure
We had to implement a TypeScript browser UI that fetches “messages” from a local backend and renders them in the browser. We had an if condition validating an important expected invariant (regarding the order of these messages). When the invariant didn’t hold, we just logged the unexpected situation to the console and recovered; without that check, the code would have hit a runtime error (“reading properties of undefined”, or similar).
But if the invariant doesn’t hold… something must have gone seriously wrong in the backend or in the protocol, where perhaps messages were not propagated correctly.
Example 2: Observability and unregistered conversations
In the backend we added an “observability” layer to track the success rate of all invocations by logging conversation messages to a central database.
We wrote some code to handle a supposedly impossible case, where a conversation isn’t registered with the observability layer, by inventing (mostly empty) metadata for it. A comment explained the logic:
Rather than log an error, we create new metadata entry for this conversation. This allows us to monitor “orphan” messages.
But if a conversation isn’t registered… something must have gone seriously wrong. Are there places where conversations are created that we didn’t identify? Are our listeners being called correctly?
Ironically, this attempt to make the observability layer more robust compromises its integrity. How can we trust a monitoring layer that silently fixes its own data?
As systems grow, correctness drops exponentially
Hiding potential logical errors offers a tempting short-term productivity boost; it feels easier than fixing them. Instead of crashing, the software kind-of works.
But unreliability compounds. Consider a simplified model of a system with n independent components. If each has a probability P of being correct, the chance of the entire system being correct is Pⁿ. While failures of the subcomponents of real systems are rarely truly independent, this illustrates how small imperfections can cascade into massive system-wide failures.
A system with 50 sub-components, each with a 99.5% correctness probability, has an overall correctness probability of roughly 78% (0.995⁵⁰).
At 500 sub-components, that probability plummets to a shocking 8% (0.995⁵⁰⁰).
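The two figures above can be checked with a few lines of Python. This is only a sketch of the simplified independence model described here; the function name system_correctness is a hypothetical helper introduced for illustration.

```python
def system_correctness(p: float, n: int) -> float:
    """Probability that all n independent components are correct,
    assuming each is correct with probability p (the simplified model)."""
    return p ** n


# The per-component probability and component counts come from the text.
for n in (50, 500):
    print(f"n={n}: {system_correctness(0.995, n):.1%}")
```

Under the same assumptions, raising the per-component probability or reducing the number of components are the only ways to move the overall figure, which is why each hidden error matters.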
Depending on how you count, Google systems consist of thousands of sub-components.
Large systems demand extreme rigor. We must do everything possible to ensure the invariants of each component are never broken. The first step is to never hide evidence of incorrectness.
Default to crashing loudly
Once you declare something an invariant, your code must treat it as one. Do not add complexity to handle cases where it breaks. Let exceptions bubble up. Let the program crash. A crash provides a clean, immediate, and unmissable signal that a fundamental assumption has been violated.
Technically, one exception to this rule is adding logic to your program that validates its invariants. Checking that your invariants hold can be justifiable complexity. Just make sure you don’t try to recover; simply raise an exception or crash.
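As a minimal sketch of a check that validates without recovering, assume messages carry a sequence number that must strictly increase. The names Message, InvariantViolation, and check_strictly_ordered are hypothetical, and the sketch is in Python rather than the TypeScript of the UI described earlier.

```python
from dataclasses import dataclass


class InvariantViolation(Exception):
    """Raised when an assumption the code relies on does not hold."""


@dataclass(frozen=True)
class Message:
    sequence: int  # assumed to strictly increase from one message to the next
    text: str


def check_strictly_ordered(messages: list[Message]) -> None:
    """Validate the ordering invariant; raise instead of repairing the data."""
    for previous, current in zip(messages, messages[1:]):
        if current.sequence <= previous.sequence:
            raise InvariantViolation(
                f"message {current.sequence} follows message {previous.sequence}"
            )


# Valid input passes silently; there is no fallback branch for invalid input.
check_strictly_ordered([Message(1, "hello"), Message(2, "world")])
```

The function has no return value and no recovery path on purpose: callers either get validated data or an exception that bubbles up.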
Prefer compile-time guarantees. Enforcing invariants through the type system is much better than through tests or runtime checks.
Anti-patterns
This section lists a few specific anti-patterns that I’ve seen in practice.
Dictionary access
If an element is expected in a dictionary/map, use an access API that raises an exception rather than one that silently returns a default or inserts a new entry.
Python: Use map[key] (raises KeyError if the key is absent), not map.get(key) (silently returns None).
C++: Use map.at(key) (throws std::out_of_range if the key is absent), not map[key] (silently inserts a default-constructed element).
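The Python case can be illustrated with a small sketch, loosely modelled on the unregistered-conversation example. The names registered_conversations and metadata_for, and the sample data, are hypothetical.

```python
# Hypothetical registry: every conversation is expected to be registered here.
registered_conversations = {"conv-1": {"owner": "alice"}}


def metadata_for(conversation_id: str) -> dict:
    # Indexing raises KeyError for an unregistered conversation,
    # so the broken assumption surfaces at the point of failure.
    return registered_conversations[conversation_id]


# By contrast, .get() returns None for a missing key, and the problem
# only shows up later, far from its cause (if it shows up at all).
silently_missing = registered_conversations.get("conv-404")

try:
    metadata_for("conv-404")
except KeyError as error:
    print(f"unregistered conversation: {error}")
```

In real code the try/except would normally be omitted so the KeyError propagates; it is caught here only so the sketch runs to completion.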
Catch-all...