A debugging story: Learning debugging principles from a production outage

A debugging story | iinuwa's Blog

Recently, I sat down to play a round of Clue with my family. It was one of those exciting rounds where everyone thought they knew enough for an accusation, so we all rushed to the center of the game board at the same time. It turns out that it took 4 of us guessing, one after another, to finally get the right answer. When we found out the real answer, I reviewed my notes (I like to keep extensive notes) and realized I could have known the answer about 5 turns earlier.

I often read debugging stories and postmortems because, let's be honest, it's fun to watch things break in interesting ways up when it's not your problem to solve. But more importantly it gives an intuition on how things can go wrong and how to avoid and recover from faults. But sometimes I feel similarly about the software system postmortems as the Clue, ... err, postmortem, above. I often read some of them wondering, "Could I have figured that out?"

Now that I have a few more years of experience debugging, I know two things:

Reading a debugging story is like reading a mystery novel backwards. Clues become obvious once you know the answer. Actually experiencing a debugging session is totally different.

Debugging is a skill you can learn with guiding principles to help you.

Here is a story of debugging issue that I encountered earlier in my career.

The Problem

During a time when my role focused more on system administration than development, the developer team approached us with a problem. We had two real-time applications that supported our live radio campaigns: a call center monitoring application and a dashboard application. Our call-center application used WebSocket connections to record operator status in real-time and enabled other features like in-app messaging to contact call-center supervisors or initiate call transfers. The dashboard application displayed incoming donations and statistics for our radio announcers to see live during the campaign.

During our live radio campaigns (but never during our normal call-center operations), all of the call center operator and dashboard connections would drop. The actual phone audio was sent over a separate, unaffected channel, so calls were uninterrupted. But it required a refresh for the operators and announcers, so it was pretty disruptive. The issue would occur intermittently: one day it would be fine, and then it would “crash” the next day, then fine one day, then crash sometime later. Never the same time of day though. This issue was tricky to debug because we also couldn’t reproduce it: load testing the server didn’t result in the same issue.

The Environment

We had recently replaced our hardware load balancer with NGINX, and so our developers assumed that there was something wrong with NGINX because they hadn’t changed anything fundamental in their systems, and there were no errors in the application logs. It also affected two separate services, so we assumed that the reverse proxy was the common point between them. I verified on the proxy that the TCP connection closes were initiated by the application server-side, not the proxy, so it was something in the application.

I didn't know it yet, but there were also three key facts needed for understanding the cause:

The previous load balancer was both relatively slow and did not support WebSockets (so the application used Server-Sent Events), but NGINX was fast and did support WebSockets.

The developer team reached a milestone of stability and did not need to do any mid-day updates to the dashboard application.

During live campaigns, the dashboard application runs constantly as users load the application to display on giant TV screens.

You might be able to guess already what the problem was, but it took us a bit of time to correlate these facts.

Discovering the solution

For a while we were content to shrug our shoulders about it, but I really wanted to resolve the problem once and for all. So before another campaign, I took out the big guns: I added Prometheus metrics to the application to show its internal state, and I wired up the application server to ship application and OS-level logs to our new central observability platform that I had been working on. I also added some OS-level CPU and network metrics (thanks to the community building Prometheus Windows Exporter).

Not all of these were necessary; in hindsight, we could have figured out the problem without some, but they helped to eliminate solutions. After observing the failure again, I helped to look into the OS logs on the application server and discovered the problem: IIS was restarting the dashboard application. That'll do it.

Because the dashboard application was stable, it stayed alive for the first day, we never did any manual restarts. Previously, we would plan to restart the application to push updates during campaign downtimes, but we didn't need to anymore. What we didn't realize is that was resetting the default 29 hour...

A debugging story: Learning debugging principles from a production outage

Related Articles

US Government directive to suspend access to Fable 5 and Mythos 5

Is AI ruining our skills? Early results are in – and they're not good

The Anatomy of an AI-Native Org

Apertus – Open Foundation Model for Sovereign AI

How to Earn a Billion Dollars