Reliability and uptime are major driving factors for every customer looking for a data center or to host “in the cloud.” So, why is it that Amazon seems to get away with so much downtime?
The short answer is they don’t… or at least, we feel they can’t for long.
Clearly some companies have been better than others at architecting stable systems and networks that can withstand an outage in any one AWS datacenter. Netflix is a prime example of a company that has used Amazon’s failures to design distributed, stable systems. But where Netflix has learned the hard way and succeeded, many companies do not have the resources to compensate for a hosting provider that routinely fails. (Sorry Amazon, but as pointed out below in detail, you’re not backing up your own EC2 SLA and S3 SLA.)
To this end, we want to bring your attention to an article and detailed chart of AWS outages since 2008 that our friends at ENKI have compiled and discuss at length.
Submitted and written by Eric Novikoff of ENKI.co
The two recent Amazon outages in June have once again brought the topic of cloud reliability to the attention of cloud users and the media, but on top of numerous Amazon failures in 2011, pundits are starting to ask if the actual architecture of Amazon’s cloud is the source of its recurring problems. Up until recently, the average cloud user was firmly convinced that Amazon could do no wrong. This has been a challenge for us at ENKI, since we end up competing against a myth instead of a real service with real problems. So I thought it was time to sum up the problems at Amazon over the last few years and see what they mean.
If you’ve been following my blog, you’ll know that I don’t believe in the myth of a failure-free cloud service, and that ultimately going beyond cloud providers’ uptime guarantees requires that users themselves take responsibility for uptime by designing applications and deployments that take advantage of geographic diversity. This has been borne out by Netflix, which has experienced far fewer downtimes than its host, Amazon. However, the bulk of cloud users still deploy on non-redundant, single-location servers, and it is the reliability of those deployments that they compare. So let’s look at that reliability from Amazon…
Amazon guarantees 99.95% uptime. If you get less than that, you can apply for a minimum 10% refund for the month. However, they cannot possibly achieve their own guarantee: 99.95% translates to about 4.38 hours of downtime per year. Their mammoth failure (documented below) in April 2011, which was officially declared repaired after 37 hours (though some customers experienced significantly longer or shorter downtimes), by itself consumed nearly nine years’ worth of that downtime budget – meaning Amazon would have needed to run failure-free for another nine years just to average back to 99.95%. As you can see from the list below, that hasn’t been the case. In fact, adding up just the published failure times in Amazon’s East Coast datacenter for 2011 and 2012 yields an uptime of only 99.7% – which has been clearly visible to millions of users of systems such as 4Square and other sites (and mobile apps) using Amazon.
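The arithmetic behind those figures is easy to check. Here is a minimal sketch (the helper names are ours, not Amazon’s) that converts an uptime SLA into a yearly downtime budget and shows what the 37-hour April 2011 outage does to it:

```python
# SLA arithmetic: turn an uptime guarantee into a downtime budget,
# and an observed outage into effective delivered uptime.
# The 99.95% SLA and 37-hour outage figures come from the article above.

HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_budget_hours(sla_percent: float) -> float:
    """Hours of downtime per year permitted by an uptime SLA."""
    return (1 - sla_percent / 100) * HOURS_PER_YEAR

def effective_uptime_percent(outage_hours: float,
                             period_hours: float = HOURS_PER_YEAR) -> float:
    """Uptime actually delivered over a period containing the given outage."""
    return (1 - outage_hours / period_hours) * 100

budget = downtime_budget_hours(99.95)  # about 4.38 hours/year
print(f"99.95% SLA allows {budget:.2f} hours of downtime per year")

outage = 37.0  # the April 2011 outage
print(f"A {outage:.0f}-hour outage consumes ~{outage / budget:.1f} years of budget")
print(f"Uptime for that year alone: {effective_uptime_percent(outage):.2f}%")
```

So even ignoring every other incident, that single outage left the affected region at roughly 99.58% uptime for the year.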
This is far, far lower than what is possible with well-designed cloud services or a well-designed redundant colocated server farm. And there is a disturbing trend in the root causes of these failures: over and over, power failures have taken Amazon datacenters down, and recurring software failures have kept them down. Is the AWS architecture too complicated to be reliable?
Here are a few recent examples from the list on ENKI’s original post: