The Other Side of On-Call

Aniruddha May 25, 2026

Last week I read this post by Yao on being on-call and I couldn't agree more! The best engineers are also the best debuggers, and vice versa. But there's another side to on-call that rarely gets written about: the times when the bug isn't somebody else's, it's yours. The part nobody talks about is the conviction, in the moment, that you've just ended your career, or worse, your company.

I joined Twitter right out of grad school. In my first few weeks I was tasked with running a load test for a new graph store. The load test ran fine. It also caused a SEV0 that took down our observability stack, leaving the whole company running blind. I sat at my desk for an hour genuinely calculating which friends to call about a new job. Nothing came of it. I stayed at Twitter for years.

Twitter had its share of incidents. I vividly remember being part of a mock PagerDuty readiness check for the whole infra org, immediately followed by a real SEV0: our Mesos cluster went down and paged every team that depended on it, which was effectively all of them. Made good use of the readiness check. I was still at Twitter when the bug shipped that deleted Ellen DeGeneres’s famous tweet from the Oscars. I was convinced that there was no coming back. Surprisingly, when Twitter went public some months later, no one remembered this.

I left Twitter to join a small (but mighty) engineering team at Mixpanel. During my first year there, we shipped a bug that effectively deleted the last five days of customer event data, for an analytics company whose entire job is preserving it. We spent hours trying to recover what we could, then drove to our company retreat at Lake Tahoe convinced that was the end of the team and that many of our customers would churn.

A few years later, a change to our React SDK caused auto-track to capture data from sensitive elements like password fields. I was convinced that was it and that the company was cooked. We disclosed, fixed the issue, kept growing.

Right before I left, an engineer on the team racked up a high six-figure GCP bill in a single day from a runaway job. I'm sure they felt the same dread I had felt in my first week at Twitter.

Two summers ago, I sat at SeaTac after a Pocus offsite watching the CrowdStrike sensor update take down half the airlines on the planet. From a terminal floor it looked like the end of CrowdStrike. The stock has nearly tripled from its post-incident low.

Every single time, my prediction of the catastrophic outcome was wildly worse than the actual one. The same miscalculation runs at organizational scale. Companies overestimate the downside of new sources of risk just as reliably as engineers overestimate the downside of their own mistakes.

Companies like Knight Capital are rare. They lost more than $400 million in 45 minutes from a single deploy and were gone within days. That kind of failure mode exists in a narrow set of fields: trading, payments, medical devices, anything where someone's money is on the line, or their life. In those, the worst case really is the worst case. Everywhere else, GitLab is more typical. They deleted their primary production database, lost six hours of customer data, and IPO'd four years later.

Safeguards exist for good reason. Every one of those incidents would have been worse without the feature flags, dashboards, on-call rotations, and rollback paths that caught them. The point isn't that engineering safety doesn't matter. It's that the gap between what you fear will happen and what actually happens is consistently enormous.

The first instinct after an incident, personal or organizational, is to reach for process. Another review, another approver. But process is almost always the worst kind of safeguard. It taxes every future change and tends to feel safer than it actually is. The better instinct is to ask what would have actually caught it: a test, a feature flag, a stricter type, a product change.

The strongest engineering teams aren't those with the most gates between code and production. They invest in capabilities that make moving fast safer. At Mixpanel my team built Miffy, a regression-testing tool that shadowed production traffic against candidate API code and caught issues before they shipped. The point wasn't to slow deploys down. It was to make confident deploys possible.

The cost of moving slow is real, and almost nobody prices it in. The deal you didn't close because the integration took six months. The feature your competitor shipped first. The engineer who stopped shipping because the last incident reshaped their risk appetite. Risk assessment is asymmetric in the wrong direction at every scale. The downside of a bad release is loud, public, visible on a dashboard, while the downside of moving slowly is silent and shows up nowhere except in what you didn't build.

On-call teaches you how systems break....
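For readers who haven't seen shadow testing before, here is a minimal sketch of the general idea behind a tool like Miffy: replay recorded requests against both the current code and a candidate, and report where the responses differ. This is not Miffy's actual implementation. `shadow_compare`, `Mismatch`, the two handlers, and the request shape are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List

# Hypothetical handler type: takes a request dict, returns a response.
Handler = Callable[[dict], Any]


@dataclass
class Mismatch:
    request: dict
    baseline: Any
    candidate: Any


def shadow_compare(
    requests: Iterable[dict], baseline: Handler, candidate: Handler
) -> List[Mismatch]:
    """Run each recorded request through both handlers and collect differences."""
    mismatches = []
    for request in requests:
        expected = baseline(request)
        try:
            actual = candidate(request)
        except Exception as exc:
            # A crash in the candidate is recorded as a mismatch, not raised.
            mismatches.append(Mismatch(request, expected, f"raised {exc!r}"))
            continue
        if actual != expected:
            mismatches.append(Mismatch(request, expected, actual))
    return mismatches


# Illustrative handlers (assumptions, not real API code).
def baseline_handler(request: dict) -> dict:
    return {"count": len(request["events"])}


def candidate_handler(request: dict) -> dict:
    # Deliberate regression: silently drops events that have no "name".
    return {"count": len([e for e in request["events"] if e.get("name")])}


recorded = [
    {"events": [{"name": "signup"}, {"name": "login"}]},
    {"events": [{"name": "signup"}, {}]},
]

mismatches = shadow_compare(recorded, baseline_handler, candidate_handler)
for m in mismatches:
    print("regression on", m.request, "->", m.baseline, "vs", m.candidate)
```

The candidate never serves a real user here. It only sees copies of traffic, so a regression shows up as a diff in a report instead of as an incident.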
