There is a widespread belief in our industry that our data-centers are supremely reliable, capable of providing true five-nines service, and that by hosting our platforms in these expensive, massively redundant locations, we will be safe from all ills. As a result, business continuity planning is often approached from a classic DR "Airplane Into Building" perspective, where too little thought, energy, planning, and practice go into a solution that is never expected to be used.
The truth is that data-centers DO fail, sometimes for the oddest of reasons. This is not a knock against the infrastructure designers of our industry, who have created some of the most efficient, redundant, and innovative buildings in history; it is merely a statement of fact. As redundancy is added, complexity increases, adding more links to a chain that remains at the mercy of its weakest one. Something as simple as a border router failure can effectively knock an entire building of compute offline, regardless of how many generators it has.
By starting from the assumption that a particular data-center WILL fail at some point, it becomes significantly easier to build platforms that can be quickly and transparently shifted from location to location. Where designing for the unthinkable leads to poorly thought-out solutions, designing for the everyday leads to usable and indeed useful tooling. Highly available internet platforms are not nearly as technically difficult as they are culturally difficult.
I'll interleave a history of massive outages from the last decade with proven recovery solutions and strategies, in the hope of moving our industry away from a disaster-insurance approach and toward a truly always-on philosophy. This talk is loosely based on Chapter 17 of O'Reilly's Web Operations, by the same author/speaker.
Mike has spent the last 8 years building highly available systems for Yahoo!, from global replication of petabyte data sets to massively distributed CDN and traffic routing mechanisms dispersed to points throughout the world. Prior to that, he spent 9 years building interactive television systems at Oracle and Thirdspace, wrestling bus-sized parallel supercomputing systems and building fast, lightweight DVR client applications. He particularly enjoys solving unsolvable problems.