Being oncall taught me everything

Yao Yue

Though I have never officially held the title of DevOps or SRE, the majority of my engineering hours in the first few years of my career were spent on understanding and mitigating incidents. Without exaggeration, being oncall made me the engineer I am. I was oncall for distributed caching at Twitter for 7.5 years (2010-2017), including the 15 months or so when I managed the team, and the week that officially ended one day past the due date of my first-born.

It was not just any service that I was responsible for: cache had by far the highest throughput, measured in requests per second, of any service at Twitter. And as a load-bearing service, it was far from problem-free. By the time Dan Luu and I co-authored the cache incident survey, we had counted no fewer than a dozen high-profile incidents (tier 0 or tier 1, which generally meant prolonged site-wide degradation) that were attributed fully or significantly to cache.

And I loved it. It’s true that I was pretty young and, for most of that period, childless. Waking up at night didn’t bother me because I could fall asleep whenever I wanted to. But I genuinely loved the thrill, the weight of reality tugging on my understanding and capabilities, and the significance of the moment that allowed my ADHD brain to concentrate. Like a good firefighter, I usually ran toward fire.

Being oncall taught me most of the important lessons I learned as a young infrastructure engineer, both technically and behaviorally. It taught me that being fast most of the time is not as important as being predictable, and that tail latencies are often far more important than the median or mean in a system at scale. It taught me the value of simple and clear architecture in moments of crisis, the tradeoffs of common design patterns that textbooks never told you about, the interaction between an application and its environment, and the deeply connected nature of services in a datacenter.
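
As a rough illustration of the tail-latency point, here is a minimal sketch of the usual fan-out arithmetic. It assumes each backend call is independently "slow" with some small probability; the 1% figure and the function name `p_slow_request` are made up for illustration and are not numbers from Twitter.

```python
def p_slow_request(fanout: int, p_slow_backend: float = 0.01) -> float:
    """Probability that at least one of `fanout` independent backend
    calls is slow, given each is slow with probability `p_slow_backend`.

    Both the independence assumption and the default 1% are illustrative.
    """
    return 1.0 - (1.0 - p_slow_backend) ** fanout


if __name__ == "__main__":
    # A request that waits on many backends is only as fast as the slowest one.
    for fanout in (1, 10, 100):
        print(f"fanout={fanout:4d}  P(at least one slow call)={p_slow_request(fanout):.3f}")
```

Under these assumptions, a request that touches 100 backends hits at least one slow call about 63% of the time, even though each backend is fast 99% of the time. The median barely moves, which is why it says so little about what users actually experience.
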
It taught me that operational excellence (thorough observability, consistent configuration, automation readiness, sensible default values) comes from good design decisions instead of last-minute patching two weeks before launch. My taste as a software engineer is that of a highly skeptical operator, and it has remained so long after I stopped being oncall.

What was perhaps more important, though, was what it taught me about working with people.

When I started, Twitter timelines were still stored in Memcached, and these timelines were objects of variable lengths that tended to grow over time. Periodic restarts of Memcached were necessary to keep the site from a full meltdown while we worked on a fix to recycle memory allocated to different object sizes. But the site was so fragile that the act of restarting cache could itself set off a partial or full site failure due to timing or sheer luck. In other words, my routine maintenance was like chaos engineering for the rest of the engineering org, and I had to put everybody on notice each time. I learned to perform this task during hours of minimal risk, which often meant late nights.

Yet mistakes still happened from time to time. And I quickly learned that the best way to get other engineers to forgive and trust me was to own my mistakes. There was more than one occasion when my screwup led to hours of heroic work by other people, but I was never blamed for any of them.

I also ran into some really tricky incidents where the root cause resided in parts of the system that I didn’t understand (much of the kernel stuff) or didn’t even know existed (hello, BMC). The only reason we solved them was a dogged pursuit on my part to narrow down the possibilities, and the patient help of some very competent peers from every walk of infrastructure. You could say that a demanding moment tends to foster camaraderie, and there were plenty of opportunities for camaraderie in the early days of Twitter.
And I miss those people: my teammates who were in the same trench, bouncing ideas around and covering each other’s blind spots; my first manager who told a clueless new grad that she was doing a good job and should keep growing; the ops engineers who offered tips that felt like black magic so I didn’t have to spend half an hour learning six tools in the middle of an incident; owners of upstream services who diverted load to allow my service to ramp up instead of immediately falling over; kernel engineers who translated my handwavy questions into a clear explanation of kernel quirks. When everything fell into place and the machines were happily chirping out bytes again, we all celebrated solving yet another problem with a toast. Those were the good days.

I would argue that one doesn’t really understand how software works until they have watched it work, and inevitably fail, in production. And one doesn’t really know how to create software until they have patched together someone else’s broken pieces. I have no respect for any software engineer who somehow...
