Google’s Gmail Outage Is a Sign of Things to Come
Photograph by Jetta Productions/Gallery Stock
Google’s (GOOG) Gmail was down for 18 minutes last week after a “routine update” (PDF) briefly broke the e-mail service. The search giant reported that it conducted an update of its load-balancing software from 8:45 a.m. to 9:13 a.m. U.S. West Coast time, and once the problems were detected, it quickly rolled back the buggy code. But this didn’t stop some people from questioning why Google would roll out a software update during peak e-mail hours on the West Coast.
The answer is that most of the coders behind today’s popular websites and services are deploying their code when it’s ready—not at some pre-determined point when downtime may not be noticed. It’s called continuous code deployment, or some variation on that theme, and everyone from Facebook (FB) and Netflix (NFLX) to smaller services does it. While it may occasionally cause a few blips, those blips should be shorter and less catastrophic.
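In spirit, continuous deployment means shipping each change as soon as it’s ready and reverting automatically if the service looks unhealthy afterward—much like Google reverting its load-balancer update. Here’s a minimal sketch of that loop, with hypothetical `deploy`, `health_check`, and `rollback` callables standing in for real infrastructure:

```python
def deploy_with_rollback(deploy, health_check, rollback):
    """Ship a change immediately; if the service looks unhealthy, roll back.

    The three arguments are hypothetical callables standing in for real
    infrastructure (a package push, a load-balancer probe, a revert).
    Returns the version left running: "new" or "old".
    """
    deploy()
    if health_check():
        return "new"
    rollback()
    return "old"

# Simulated run: the new code fails its health check, so we roll back.
events = []
result = deploy_with_rollback(
    deploy=lambda: events.append("deployed"),
    health_check=lambda: False,   # pretend the update broke something
    rollback=lambda: events.append("rolled back"),
)
```

The point of the pattern is that the window between “broken” and “reverted” is measured in minutes, not in the hours a scheduled maintenance cycle would take.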
The rationales for performing continuous deployments vary, but most fall into three categories. The first is that there really is no good time for downtime anymore; and if a deploy does break something, wouldn’t you rather have happy, alert staff on the clock, ready to fix it? Jesse Robbins, chief community officer of Opscode, points out that even the good times for downtime can vary with customers.
“One of Opscode’s earliest customers is a popular dating website, and their peak traffic is on Friday night, when people are exchanging phone numbers to go on dates … the exact opposite of peak time for a CRM,” says Robbins.
Plus, as Robert Treat, chief operating officer of OmniTI, a consulting firm that helps websites scale out their businesses, points out, deploying at off hours can mean little, because the site won’t actually break until it experiences peak loads. For many sites using continuous code deployment, a growing user base is what created the need for new code in the first place; until the site experiences that load, the team doesn’t know whether the fixes worked.
The second category is economic. When you wait to deploy your code in massive quarterly installs, you’re choosing to forgo the efficiencies the new code could bring to the site immediately. This thinking is more common at companies that view their Web operations as fundamental to their business, as opposed to some sort of cost center that keeps e-mail up and running.
“Code that has been written but not yet deployed is very similar to inventory,” says Mark Imbriaco of GitHub. “You’ve paid the cost to develop the software but are not yet getting any of the benefit from it. Shipping that code to production sooner means that you and your customers can benefit from it much faster. This is a pretty serious competitive advantage for companies that deliver features faster than their competitors.”
Thinking of code deployment as a Big Fat Hairy Deal adds layers of stress and process to getting it into production, but if it’s a routine part of the job, developers can try things out, deploy code, and move on with their lives. This reduces stress around the deployment, but it also frees their minds up for new problems and jobs, says John Allspaw of Etsy. He adds: “Fast and frequent feedback is what allows for developers to be productive. Developers hate being bored.”
The third school of thought, popularized by Netflix, is basically an invitation to break things: a system so fragile that one code upgrade brings it down clearly isn’t resilient enough. In many ways, Netflix takes the traditional model of a delicate architecture held together by a genius IT professional’s crazy glue and flips it on its head. Instead of a fragile model car, Netflix is building the Tonka (HAS) trucks of IT—ready to take a few glitches while continuing to serve up videos.
“Systems that contain and absorb many small failures without breaking and get more resilient over time are ‘antifragile,’ as described in [Nassim] Taleb’s latest book,” explains Adrian Cockcroft of Netflix. “We run chaos monkeys and actively try to break our systems regularly so we find the weak spots. Most of the time our end users don’t notice the breakage we induce, and as a result we tend to survive large-scale outages better than more fragile services.”
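The “chaos monkey” idea Cockcroft describes is simple: kill pieces of the running system at random while engineers are watching, so the weak spots surface on your schedule rather than during a real outage. A toy version of the idea, assuming an in-memory list of instance names and a hypothetical `terminate` callable rather than a real cloud API:

```python
import random

def chaos_monkey(instances, terminate, rng=None):
    """Pick one running instance at random and terminate it.

    A resilient fleet should keep serving traffic afterward; if it can't,
    that's the weakness you wanted to find before a real outage does.
    `terminate` is a hypothetical stand-in for a real cloud API call.
    """
    rng = rng or random.Random()
    if not instances:
        return None
    victim = rng.choice(instances)
    terminate(victim)
    return victim

# Simulated fleet of three video-serving instances.
fleet = ["i-001", "i-002", "i-003"]
killed = chaos_monkey(fleet, terminate=fleet.remove, rng=random.Random(42))
# The fleet is down one instance but should still have capacity to serve.
```

The production versions run during business hours on purpose—the whole point is that people are on hand when something does fail to absorb the loss.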
That’s the rationale behind those software updates that might cause a momentary Web service outage or two. As the DevOps movement spreads, more businesses will likely find reasons to move toward continuous code deployment. Plus, as Allspaw of Etsy points out, the tools to test code and instantly monitor the effects of new deployments are getting better and faster. That means if you accidentally break a site, the dev team notices and fixes it sooner. So maybe there are more outages, but they shouldn’t last as long.
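The monitoring Allspaw describes often comes down to watching a few key metrics immediately after each deploy and alarming the moment they drift. A minimal sketch, assuming hypothetical error-rate samples taken before and after a deploy (a real system would pull these from a metrics store):

```python
def deploy_looks_bad(baseline_errors, post_deploy_errors, threshold=2.0):
    """Flag a deploy whose average error rate jumps past `threshold`
    times the pre-deploy baseline.

    Both arguments are lists of errors-per-minute samples; the numbers
    below are made up for illustration.
    """
    before = sum(baseline_errors) / len(baseline_errors)
    after = sum(post_deploy_errors) / len(post_deploy_errors)
    return after > before * threshold

# Errors held steady at ~2/min before the deploy, then spiked to ~9/min,
# so this deploy trips the alarm and would be rolled back.
alarm = deploy_looks_bad([2, 1, 3, 2], [8, 10, 9])
```

Fast feedback like this is what makes shipping small changes continuously safer than batching them: each deploy is tiny, so when the alarm fires, there’s only one change to suspect.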