A debugging story | iinuwa's Blog
Recently, I sat down to play a round of Clue with my family. It was one of<br>those exciting rounds where everyone thought they knew enough for an<br>accusation, so we all rushed to the center of the game board at the same time.<br>It turns out that it took 4 of us guessing, one after another, to finally get<br>the right answer. When we found out the real answer, I reviewed my notes (I<br>like to keep extensive notes) and realized I could have known the answer about<br>5 turns earlier.
I often read debugging stories and postmortems because, let's be honest, it's<br>fun to watch things break in interesting ways up when it's not your problem to<br>solve. But more importantly it gives an intuition on how things can go wrong<br>and how to avoid and recover from faults. But sometimes I feel similarly about<br>the software system postmortems as the Clue, ... err, postmortem, above. I<br>often read some of them wondering, "Could I have figured that out?"
Now that I have a few more years of experience debugging, I know two things:
Reading a debugging story is like reading a mystery novel backwards. Clues<br>become obvious once you know the answer. Actually experiencing a debugging<br>session is totally different.
Debugging is a skill you can learn with guiding principles to help you.
Here is a story of debugging issue that I encountered earlier in my career.
The Problem
During a time when my role focused more on system administration than<br>development, the developer team approached us with a problem. We had two<br>real-time applications that supported our live radio campaigns: a call center<br>monitoring application and a dashboard application. Our call-center application<br>used WebSocket connections to record operator status in real-time and enabled<br>other features like in-app messaging to contact call-center supervisors or<br>initiate call transfers. The dashboard application displayed incoming donations<br>and statistics for our radio announcers to see live during the campaign.
During our live radio campaigns (but never during our normal call-center<br>operations), all of the call center operator and dashboard connections would<br>drop. The actual phone audio was sent over a separate, unaffected channel, so<br>calls were uninterrupted. But it required a refresh for the operators and<br>announcers, so it was pretty disruptive. The issue would occur intermittently:<br>one day it would be fine, and then it would “crash” the next day, then fine one<br>day, then crash sometime later. Never the same time of day though. This issue<br>was tricky to debug because we also couldn’t reproduce it: load testing the<br>server didn’t result in the same issue.
The Environment
We had recently replaced our hardware load balancer with NGINX, and so our<br>developers assumed that there was something wrong with NGINX because they<br>hadn’t changed anything fundamental in their systems, and there were no errors<br>in the application logs. It also affected two separate services, so we<br>assumed that the reverse proxy was the common point between them. I verified on<br>the proxy that the TCP connection closes were initiated by the application<br>server-side, not the proxy, so it was something in the application.
I didn't know it yet, but there were also three key facts needed for<br>understanding the cause:
The previous load balancer was both relatively slow and did not support<br>WebSockets (so the application used Server-Sent Events), but NGINX was fast<br>and did support WebSockets.
The developer team reached a milestone of stability and did not need to do<br>any mid-day updates to the dashboard application.
During live campaigns, the dashboard application runs constantly as users<br>load the application to display on giant TV screens.
You might be able to guess already what the problem was, but it took us a bit<br>of time to correlate these facts.
Discovering the solution
For a while we were content to shrug our shoulders about it, but I really<br>wanted to resolve the problem once and for all. So before another campaign, I<br>took out the big guns: I added Prometheus metrics to the application to show<br>its internal state, and I wired up the application server to ship application<br>and OS-level logs to our new central observability platform that I had been<br>working on. I also added some OS-level CPU and network metrics (thanks to the<br>community building Prometheus Windows<br>Exporter).
Not all of these were necessary; in hindsight, we could have figured out the<br>problem without some, but they helped to eliminate solutions. After observing<br>the failure again, I helped to look into the OS logs on the application server<br>and discovered the problem: IIS was restarting the dashboard application.<br>That'll do it.
Because the dashboard application was stable, it stayed alive for the first<br>day, we never did any manual restarts. Previously, we would plan to restart the<br>application to push updates during campaign downtimes, but we didn't need to<br>anymore. What we didn't realize is that was resetting the default 29 hour...