Posted by: Stephen Wildstrom on February 24, 2009
Update, Thursday Feb. 26
Today’s Gmail outage naturally raised questions about the wisdom of trusting mission-critical applications to the vagaries of cloud computing. But just how bad a blow to Gmail’s reliability was the outage, which Google puts at 2 1/2 hours but which user reports suggest lasted somewhat longer? The answer: Not that bad, as long as it doesn’t happen often.
In the good old days, AT&T used to promise “five nines” of reliability. That meant you could expect your phone service to be up 99.999% of the time, a standard that allowed for just a bit more than 5 minutes of downtime a year.
But five nines is really, really expensive to deliver and led to a phone network that was, by most standards, massively over-engineered. Google's service level agreement for paid business Gmail promises "three nines," or 99.9% uptime. That actually leaves room for nearly 9 hours of outage a year, or three failures of the same magnitude as today's. Amazon makes a slightly higher promise for its Elastic Compute Cloud service, 99.95% uptime, which allows nearly 4 1/2 hours of outage a year.
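For anyone who wants to check the arithmetic behind those figures, here is a minimal sketch in Python. It assumes a 365-day year (8,760 hours) and ignores the fine print of any real service level agreement, such as measurement windows or excluded maintenance. The function name `allowed_downtime_hours` is made up for this illustration.

```python
# Assumption: a plain 365-day year; leap years and SLA fine print are ignored.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def allowed_downtime_hours(uptime_percent: float) -> float:
    """Return the hours of outage per year that an uptime percentage permits."""
    return HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    promises = [
        ("five nines", 99.999),
        ("three nines", 99.9),
        ("Amazon EC2", 99.95),
    ]
    for label, percent in promises:
        hours = allowed_downtime_hours(percent)
        print(f"{label} ({percent}%): {hours:.2f} hours, "
              f"or {hours * 60:.1f} minutes, per year")
```

Run this way, five nines works out to a little over 5 minutes a year, three nines to about 8.76 hours, and 99.95% to about 4.38 hours, which matches the rounded numbers above.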
What's more disturbing than the Gmail outage is Google's lack of transparency about it. The most recent post on Google's official blog declares the problem over, apologizes for the inconvenience, and explains why some users had to prove to Google that they were human beings before being allowed to log in to their Gmail accounts. But it provides no explanation whatever of what went wrong or what has been done to fix it or prevent its recurrence.
Amazon, by contrast, maintains a Service Health Dashboard for its Amazon Web Services with both a report on the current status of each service and a 35-day history of any problems. (I can't tell you how good the reports are because the current time frame shows no incidents.) At a minimum, Google should maintain a similar site for the folks who have come to depend on its services.