The entire universe is abuzz and atwitter about the big Amazon EC2 outage this past week. A cascading series of glitches in Amazon’s Elastic Block Store (EBS) system took down several high-profile websites hosted in its US East region data centers. The AWS Status Dashboard has a considerable write-up on the outage as it progressed over the latter half of last week.
Responses to the outage were mixed. As question-and-answer service Quora posted on their outage page: “we’d point fingers, but we wouldn’t be where we are today without EC2.” This is true: Amazon and its ilk provide relatively affordable and scalable hosting for applications, and relieve the current wave of startups of the burden of investing in and operating their own hosting. However, when you host your application on Amazon, you still have a single point of failure unless you specifically engineer it to be resilient to failures. EC2 offers many features that can take you beyond a single-host deployment. Customers who had adapted their deployments to take advantage of these features withstood last week’s outage with little or no user-visible impact. Without such adaptations, your web application is no better off than if it were hosted on a conventional web hosting platform.
Amazon operates multiple Availability Zones that are supposed to isolate failures, but that did not work too well last week: the issues cascaded across Availability Zones until the entire US East region was affected. Resilience across geographic regions is not straightforward because the CAP Theorem kicks in: Consistency, Availability, Partition Tolerance, pick two. You can’t have all three at the same time, and copies of an application running in different regions are connected by networks that can and do partition. Engineering an application to withstand an outage by distributing it across different Availability Zones, across regions, or even across different providers is a considerable and costly undertaking, and not one that a cash-strapped startup trying to get swiftly to market embarks upon lightly. Whether to spend this time and money, or to tolerate and respond to the occasional outage, is a determination that every company will have to make for itself.
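To make the simplest form of that engineering concrete, here is a minimal sketch of client-side failover across copies of an application running in different zones or regions. Everything in it is an assumption for illustration: the endpoint URLs are made up, and `first_healthy` and the `is_healthy` callback are hypothetical names, not part of any Amazon API.

```python
from typing import Callable, Iterable, Optional

# Hypothetical endpoints, one per availability zone or region (not real hosts).
ENDPOINTS = [
    "https://app.us-east-1a.example.com",
    "https://app.us-east-1b.example.com",
    "https://app.us-west-1a.example.com",
]


def first_healthy(endpoints: Iterable[str],
                  is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first endpoint whose health check passes, or None if all fail."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that raises (timeout, refused connection)
            # is treated the same as an endpoint that is down.
            continue
    return None
```

In a real deployment the health check would probably be an HTTP request with a short timeout. Note that this only covers routing traffic away from a failed zone. Keeping the data behind each endpoint consistent is the hard, expensive part, and that is exactly where the CAP trade-off bites.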