Rare things become common at scale
By Jason Cohen on January 28, 2014
Software doesn’t scale through architecture and automation alone. New, more difficult problems appear that didn’t exist before, causing new downstream consequences.
Something interesting happens when you run more than 1,000 servers, as we do at WP Engine, powering hundreds of thousands of websites.
Editor’s Note: As of 2024, this article is ten years old; we now run twenty times as many servers, and the lesson of this article continues to be accurate.
Suppose that on average a server experiences one fatal failure every three years. The kernel panics (the Linux equivalent of the Blue Screen of Death), or both the main and redundant power supplies fail, or some other rare event causes an outage. This isn’t a quality issue; this is normal. This isn’t something to “fix.”
Windows NT crashed.
I am the Blue Screen of Death.
No one hears your screams.
- Haiku from FSF
But remember, we have 1,000 servers. Three years is about 1,000 days. So that means, on average, every single day we have a fatal server error.
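The arithmetic above can be sketched in a few lines of Python. This is a rough illustration, not a model of any real fleet: the function names are made up for this example, and the probability calculation assumes failures are independent and arrive as a Poisson process, which the article does not claim.

```python
import math

def expected_failures_per_day(servers: int, mtbf_days: float) -> float:
    """Average number of fatal failures per day across the whole fleet."""
    return servers / mtbf_days

def prob_at_least_one_failure(servers: int, mtbf_days: float) -> float:
    """Chance of at least one failure on a given day.

    Assumes independent failures arriving as a Poisson process
    (an assumption for illustration, not a measurement).
    """
    return 1.0 - math.exp(-expected_failures_per_day(servers, mtbf_days))

MTBF_DAYS = 3 * 365  # one fatal failure per server every three years

for fleet in (10, 1_000):
    rate = expected_failures_per_day(fleet, MTBF_DAYS)
    print(f"{fleet:>5} servers: {rate:.3f} failures/day, "
          f"one every {1 / rate:.1f} days on average")
```

With the figures used here, 10 servers see a fatal failure roughly every 110 days, while 1,000 servers average a little under one per day.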
Not to mention 10 minor incidents with degraded performance, or a DDoS attack somewhere in the data center affecting our network traffic, or some other thing that sets pagers a-buzzing in our DevOps team and mobilizes our Customer Support team to notify and help customers.
“Well sure,” you say, “that’s normal as you grow. If you had just 10 servers and 100 customers, you’d have fewer problems and many fewer employees. Today you have more customers, more servers, and more employees. What’s so hard about that?”
The insight is that scale causes rare events to become common. Things happen with 2,000 servers that you never saw even once with 50 servers, and things that used to happen once in a blue moon, where a shrug and a manual reboot every six months was in fact an appropriate “process,” now happen every week, or even every day.
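To see how "every six months" turns into "every week," here is a minimal sketch that rescales the gap between events as the fleet grows. It assumes every server contributes the same independent event rate, so the fleet-wide rate grows linearly with server count; `interval_at_scale` is a hypothetical helper written for this example.

```python
def interval_at_scale(interval_days: float, old_fleet: int, new_fleet: int) -> float:
    """Rescale the average gap between fleet-wide events when the fleet grows.

    Assumes each server contributes the same independent event rate,
    so the gap shrinks in proportion to the growth in server count.
    """
    return interval_days * old_fleet / new_fleet

SIX_MONTHS_DAYS = 182.5  # the "manual reboot every six months" case

gap = interval_at_scale(SIX_MONTHS_DAYS, old_fleet=50, new_fleet=2_000)
print(f"At 2,000 servers the same event recurs about every {gap:.1f} days")
```

Under that assumption, an event seen twice a year on 50 servers shows up every four to five days on 2,000 servers, which is the point where a shrug stops being a workable process.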
Things as rare as, well, you know…
It’s not only problems that morph with scale, but also your ability to handle problems.
For example, a dozen minor and major events every day means 20-50 customers affected every day. Now consider what happens as we try to inform 50 customers. For some we won’t have current email addresses, so they don’t get notified. Some of those will notice the problem and create extra customer support load; at worst they’ll post on Twitter about how their website was slow or offline today and WP Engine “didn’t even know it.” Then our social media team has to piece all this together, attempt to respond, maybe put together a special phone call with that customer, and so on. Those customers are also more likely to leave a bad review on some review site, compared with the 99.99% of customers who experience no such incident, but also had no reason to decide that “today is the day I will go to a review site and leave a good review.”
Or consider the scale ramifications of onboarding 1,000 new customers a month. In that case, it’s likely that any given server issue will affect a customer who has been with us for only a month or two. The issue thus creates a “bad first impression,” which is harder to recover from than the same issue hitting a customer who has been with us for three years and has built up a bank account of patience.
So rare things becoming common isn’t just difficult on the operational side. Handling those problems with customers, along with the other downstream consequences, also takes far more work than it did when the company was small.
The usual response to this is “automate everything.”
As with most knee-jerk responses, there’s truth in it, but it’s not the whole story.
Sure, without automated monitoring we’d be blind, and without automated problem-solving we’d be overwhelmed. So yes, “automate everything.”
But some things you can’t automate. You can’t “automate” a knowledgeable, friendly customer support team. You can’t “automate” responding to a complaint on social media. You can’t “automate” the recruiting, training, rapport, culture, and downright caring of teams of human beings who are awake 24/7/365, with skills ranging from multi-tasking on support chat to communicating clearly and professionally over the phone to logging into servers and identifying and fixing issues as fast as (humanly?) possible.
And you can’t “automate” away the rare things, even the technical ones. By their nature they’re difficult to define, hence difficult to monitor, and difficult to repair without the forensic skills of a human engineer.
Does this mean all our customers have a worse experience? No, just the opposite. Any one customer of ours has fewer problems per year now than a year ago, because we’re constantly improving our processes, automation, hardware, and human service. It’s when you look across...