A little while back Someone in Some Slack Channel Somewhere posted this little gem. For those too lazy to click, it’s a “Disaster Recovery Runbook” circa 2012 - well before multi-master services and trafficshift failouts and all of that goodness that we more or less take for granted today.

A few specific callouts:

  1. There are 11 manual tasks with multiple sub-tasks, all taking between 30 and 90 minutes. I love how well-documented it all is…but…

  2. There is an estimated total site-wide downtime of 4-6 hours. The arithmetic doesn’t quite add up given (1), but I actually think the 4-6 hours of “the site is completely hard-down” is probably a reasonable estimate. I strongly suspect this would’ve played out as a typical application of the 80/20 rule. What I mean is this: after 6 hours of All Hands On Deck probably 80% of the site would’ve been up…and the remaining 20% would’ve been a long tail of bustication for (at least) the following 2 months.

  3. “last modified on Jul 23, 2018.” Incredible. I haven’t looked at the edit history so I have no idea what Greg Mcmillan edited, but…Greg, what’s the deal, buddy?
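To put rough numbers on the mismatch mentioned in (2), here is a quick back-of-the-envelope sketch. It assumes the 30 to 90 minutes applies to each of the 11 tasks and that the tasks run strictly one after another; both are my assumptions, not something the runbook states.

```python
# Assumptions (not from the runbook): each task takes 30-90 minutes,
# and the 11 tasks are executed strictly back to back.
TASKS = 11
MIN_MINUTES, MAX_MINUTES = 30, 90


def sequential_hours(tasks, minutes_per_task):
    """Total hours if every task takes the same time and none overlap."""
    return tasks * minutes_per_task / 60


low = sequential_hours(TASKS, MIN_MINUTES)
high = sequential_hours(TASKS, MAX_MINUTES)
print(f"{low}-{high} hours")  # prints "5.5-16.5 hours"
```

Under those assumptions even the best case lands near the top of the 4-6 hour estimate, so the estimate presumably counts on at least some of the tasks running in parallel.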

Perhaps the thing that’s most interesting to me is that this wiki page represented the State of the Art, and in very recent history. One way of looking at this: “Everything I am doing right now will be obsolete within 5-7 years.” This is almost certainly true. But another way of thinking about it invokes my second-most overused phrase: “It’s a journey, not a destination.” We’re about to go Blueshifting into Azure, and nobody really knows what that is going to look like. It’s worth noting that even in the absence of Blueshift we’d have no idea what the future will look like 5 years from now.

Think about your 8th-grade self looking back at the “artwork” you produced in the 3rd grade. What would 8th-grade you say about 3rd-grade you? …this is exactly that. So go do the best possible thing that you can do with the world being what it is right now, yeah?