AOL was down (1996) | ngrok blog
Disclaimer: We, ngrok, have sponsored Mac to write this post because we think it's an underexplored perspective on the topic of reliability. We're glad to have the opportunity to give writers the space and time to do this, but the opinions are Mac's, not the company's. Enjoy!
Artwork by arthurxmedic.
Picture yourself traveling back to August 7th, 1996. Close your eyes and imagine a world where tensions are high with Russia, China, and in the Middle East, people are concerned about a tech bubble, and bell-bottoms are back in style. Difficult to imagine, I know.
Open your eyes, you're in 1996 now. You probably just got back from work or school, hoping to unwind. Maybe you put something on the stereo, still clinging to the waning grunge era. You sit down in your squeaky desk chair and are welcomed by the Windows 95 boot screen. But this time, when you try to connect to America Online, you don't see your email inbox, info about popular sitcoms, or NASA announcing evidence of life on Mars. Instead you see:
Image credit to CBS News.
America Online was down, and it would stay down for 19 hours. It pushed that news of life on Mars right off the front page of the New York Times.
Now, technically this outage shouldn't have been that notable. America Online went down for maintenance regularly. This regular maintenance was what triggered the outage in the first place. There was even a similar outage during peak hours a few months prior that didn't make the news at all (I only found out about it through oral history, which I'll get into later). Why did this one make the front page?
At that time, the world was joining the internet in droves. The number of people online was beginning to hockey-stick. My theory is that we had clearly passed some kind of inflection point where the internet was starting to become integral to our daily lives. And us humans really don't like when we are reminded of the fragility of things we depend on.
As someone who works in the field of site reliability engineering (SRE), I became a little obsessed with researching this outage. It was essentially the first example of people outside of the industry realizing how important it is for internet stuff to keep running. And that collective desire is what keeps me employed.
So what does this 30-year-old outage have to do with today? I think it can teach us a lot about the way we experience outages, the economic forces we're subject to, and how the modern field of site reliability engineering should account for that. This article is my chance to write a more human postmortem, one that asks more than just five "whys" and digs into our messy techno-social reality that isn't captured by golden signals and SLOs.
If I'm going to write a postmortem, I guess I should start with some technical details. Contemporary reporting just has statements from spokespeople and pundits, so the language used is pretty vague. If I wanted to track down something more specific, I needed to talk to an AOL employee. I found some old financial documents from AOL on archive.org, which listed all of the board members, executives, and VPs in 1996, and that's where I found the VP of Operations: Matt Korn. All I could find was his LinkedIn, so I signed up for (and immediately cancelled) LinkedIn Premium so I could send him a message. And he responded!
He sent a lovely message where he mentioned digging up his old paper calendars from 1996 to jog his memory! The things I'd do to see those calendars in person… Anyway, he didn't have technical notes on the August outage but did have notes about a similar (never-before-reported-on) outage in May. He found it odd that no newspaper wasted a single inch of column space on the May outage, but they were suddenly all over the August outage.
In case you're curious, he said the May outage happened at Westwood Center Drive, the location of the old AOL headquarters. Only one phase of the three-phase power feed cut out, which meant the generators didn't notice the power was out and so didn't kick on, knocking the whole datacenter out once the batteries drained. Funnily enough, I had a similar thing happen to me a few years ago. I guess generator manufacturers need to do some postmortems of their own.
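To make that failure mode concrete, here is a minimal sketch of how a power-loss check could miss a single-phase failure. Nothing here reflects how AOL's generators or transfer switches actually worked: the voltage numbers, the threshold, and both function names are assumptions made up for illustration. The only thing taken from the story is the shape of the bug, where losing one phase out of three doesn't register as "the power is out."

```python
from typing import Sequence

# Both constants are assumptions for illustration, not real AOL values.
NOMINAL_VOLTS = 480.0   # assumed nominal voltage per phase
LOSS_THRESHOLD = 0.5    # assumed: below 50% of nominal counts as a lost phase


def phase_lost(volts: float) -> bool:
    """Return True if this phase has dropped below the loss threshold."""
    return volts < NOMINAL_VOLTS * LOSS_THRESHOLD


def naive_should_start_generator(phases: Sequence[float]) -> bool:
    """Hypothetical flawed check: only a total loss of utility power counts."""
    return all(phase_lost(v) for v in phases)


def per_phase_should_start_generator(phases: Sequence[float]) -> bool:
    """Hypothetical safer check: losing any single phase counts as an outage."""
    return any(phase_lost(v) for v in phases)


# One phase of three has cut out, as in the May outage described above.
single_phase_failure = (480.0, 480.0, 0.0)

print(naive_should_start_generator(single_phase_failure))
print(per_phase_should_start_generator(single_phase_failure))
```

With one dead phase, the naive check never asks for the generator, so the load would sit on batteries until they drain. The per-phase check starts the generator as soon as any one phase drops.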
But all he remembered about the August 7th outage is that the system went down for maintenance and didn't come back online properly. Eventually they improved the system so it didn't need to be taken down for maintenance anymore, silently resolving the original issue. How mundane, right?
I could have kept searching for other AOL employees, but I started to realize: why was I so focused on the technicals? Here we have an event of national interest, where millions of people each have a story to tell about it, and I'm focused on what was happening inside one building in Virginia?
I bet people are happier to be leaving a place like this than entering....