09/11/2012

Amazon’s Infamous Downtime – Mother Nature Wasn’t the Only One To Blame

Cloud computing has changed the World Wide Web forever – and in many ways the change has been for the better. There is, however, a downside to computing in the “clouds”. Some businesses have begun to implement cloud solutions without truly understanding what steps they need to take to ensure that the servers and infrastructure behind these “clouds” remain up and running properly. Amazon.com’s service outages are a perfect example of cloud computing gone wrong, and there are lessons to be learned here.

When some of Amazon’s cloud services went down back in July of this year, the simple answer to “Why did it happen?” would have been “Mother Nature struck hard”. Unfortunately, that answer doesn’t paint an accurate picture of what actually happened, nor does it portray the part that Amazon played in the fiasco. Yes, lightning was definitely what triggered the service outage. The failure to get the services back up and running promptly, however, had nothing to do with Mother Nature and everything to do with flaws in Amazon’s cloud infrastructure.

According to Amazon, lightning did indeed cause the affected services to go down initially, but the inability to restore those services in a timely manner was a direct result of problems with Amazon’s own software and cloud infrastructure. Had the software and servers been bug-free and running properly, the outages could have been resolved quickly. That, unfortunately, was not how the story played out.

Instead, some customers were reportedly unable to access the affected services for up to 15 hours. As customers tweeted about the outage, it became apparent that many were becoming increasingly frustrated. Amazon, a once-trusted service provider, was quickly losing ground in terms of reputation as it failed to get its AWS services back up. As Twitter member Ed Morrissey tweeted, “The climate change that scares us most is an Amazon Cloud outage.” It’s apparent that people depend on Amazon’s cloud. It’s up to Amazon to provide reliable services.

One can understand that as technology changes, companies will face challenges. The issues with Amazon’s cloud infrastructure, however, are something that should have been addressed a long time ago. Amazon had already experienced a frustrating service outage back in April of 2011 – an outage that also involved the company’s cloud servers. Customers affected by that outage were given a 10-day credit as well. Add up the costs of all of these 10-day credits and it’s not hard to see that these outages may have cost Amazon millions in lost revenue, not to mention the impact on the company’s reputation.

What can a company do to ensure that things like this don’t happen in the future? Testing of software and monitoring of servers are crucial to maintaining website and service uptime. You have to be prepared for the worst, because the worst might just happen. While there will be instances when downtime is truly unavoidable, the key is to minimize the downtime that customers experience and provide valid reasons for any outages that may occur.

While Amazon has taken measures to test its services and increase performance in an effort to maximize uptime, those steps may not be enough. If the company truly hopes to gain the trust of customers who have been affected by these service outages, it needs to turn to a third party for assessment and testing of its software and cloud infrastructure. Cloud computing is not without its risks. Flaws in infrastructure can indeed affect uptime, as the Amazon outages have proven. A company must continually monitor its servers and test its software and infrastructure if it hopes to achieve maximum uptime. After all, when every minute of downtime means lost revenue, uptime needs to be a company’s top priority.