Analyzing the AWS Outage

Where were you when Amazon Web Services had the biggest outage of recent years? The question may not be quite as memorable as “Where were you when JFK was shot?” but it’s pretty close. And if you work for one of the many companies taken down by this calamity, you’ll probably remember it for years to come.

Here’s what happened, in case you need a reminder: on February 28, Amazon Simple Storage Service (S3) suffered a disruption in the US-EAST-1 region that lasted more than four hours. S3 is one of the most popular AWS cloud services; US-EAST-1 is one of the most widely used AWS regions. The outage affected 54 of the top 100 internet retailers; Apica, a web monitoring provider, says those sites saw a 20% or greater decrease in performance. It allegedly cost Standard & Poor’s 500 Index companies $150 million.

So in our opinion, this was a true disaster – far more severe than the “service disruption” label AWS gave it. The cause? Human error. A team had to shut a few servers down while repairing the S3 billing system, and an incorrectly entered command ended up taking down far more servers than intended. The team had to restart them to get them online again, which left the affected sites without access for more than four hours.

You might be asking yourself (as many of our observant customers asked us) why the outage was so long, and why AWS couldn’t just fail over. After all, they define the term “leading cloud vendor,” which is why so many companies feel safe handing over responsibility for continuity and recovery to them. If anyone can offer speed and reliability, it’s Amazon. Right?

Well, it’s a bit more complicated than that – both the problem and the solution. Let’s look at some of the driving factors in avoiding and recovering from this level of downtime.

The Power of Hybrid Backup and Disaster Recovery

One tactic often discussed for preventing this kind of outage is limiting dependency on any one region. Spread workloads across regions and you’ve got a contingency plan; if servers go down in one region, your system can take the information it needs from another one. Netflix does this and what do you know – they weren’t impacted by S3 going down.
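For readers who want to see what “take the information it needs from another one” might look like, here is a minimal sketch of a read that tries regions in order and falls back when one is down. It is an illustration only, not how Netflix or AWS does it: the function names, the region list, and the stand-in fetchers are all hypothetical, and a real setup would also need the data replicated to the second region ahead of time.

```python
from typing import Callable, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when no region in the list could serve the request."""


def read_with_fallback(
    key: str,
    regions: Sequence[Tuple[str, Callable[[str], bytes]]],
) -> Tuple[str, bytes]:
    """Try each (region_name, fetch) pair in order; return the first success."""
    errors = []
    for name, fetch in regions:
        try:
            return name, fetch(key)
        except Exception as exc:  # any failure means: move on to the next region
            errors.append(f"{name}: {exc}")
    raise AllRegionsFailed("; ".join(errors))


# Hypothetical stand-ins for real storage clients (assumptions for this sketch).
def fetch_us_east_1(key: str) -> bytes:
    raise ConnectionError("region unavailable")  # simulates the outage


def fetch_us_west_2(key: str) -> bytes:
    return b"replica of " + key.encode()  # simulates a healthy replica


served_by, data = read_with_fallback(
    "invoice.pdf",
    [("us-east-1", fetch_us_east_1), ("us-west-2", fetch_us_west_2)],
)
print(served_by, data)
```

The order of the list is the design choice here: the primary region goes first, and the replica is only touched when the primary fails.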

But the surest way to mitigate downtime is adding more redundancy safeguards. There’s a reason so many CIOs contact Quorum asking about our hybrid cloud options: they want to avoid this kind of catastrophe. We pair on-premises protection with our private cloud for multiple levels of replication. Cloud copies insulate companies from a local datacenter loss, but the reverse is true too. When there’s an outage in the cloud, that local HA copy can step in. It is the safest backup and disaster recovery strategy, and the only real way to avoid this level of disaster.

We often talk about the advantages of hybrid cloud, such as the way its elasticity empowers you to design a backup and recovery strategy that works for you, or the way it can accelerate recovery by letting you immediately fail over to an accurate replica of your environment. But ultimately what hybrid cloud backup and recovery does best is what’s most important: keeping you up and running no matter what. Because if we learned anything from the AWS outage, it’s that without the right protection, even the titans of industry can go dark in expensive, destructive ways.