Tuesday, August 18, 2015

Explaining my Title

I believe I got the quote from Cory Doctorow, and he credited a designer whose name I've never been able to find or credit, so feel free to let me know if you know the source. But the quote as I remember it is this:

"The default state of technology, any technology from stone axes to modern computers, the default state of technology is 'broken'."

It's true, too; technology, by definition, is something that is created and thus must be maintained. The axe is possibly the earliest and easiest example: axes that aren't sharp are spectacularly bad axes. For a really, really long time we survived on technology and tools that were only moderately more complicated than our own fingers and teeth, and thus technology was relatively easy to maintain and manage, but still: a dull knife is a failure mode. A snapped bowstring is a failure mode. And it requires time, attention, and effort to keep the technology of life out of failure mode and in a usable condition. And this is a condition that becomes more and more true as systems become more complex and civilization becomes, well... civilization.

The thing about the modern world is that it's become so complex that the average person, while entirely capable of maintaining their own technology, just doesn't have time to do it on their own. In point of fact, there's a better-than-average chance that they're out doing maintenance on someone else's technology so that person can do maintenance on someone else's technology... it's maintenance all the way down, in our society. The miracle isn't that computers make our lives easier; the miracle is that they don't fail on a regular basis.

An anecdote: the FAA requires that every plane that flies meet a strict policy on maintenance -- the industry standard as I understand it is the "five nines", which means that 99.999% of the parts and functionality of the aircraft must be working for the aircraft to be certified as air-worthy. If you accept the idea that the average 737 has a million moving parts (and I personally think that's low), then that means that on every Southwest flight you take there are as many as 10 things on the plane that are broken. The good news is that they often aren't major things -- a seatbelt doesn't lock, a cabin compartment doesn't latch, etc. -- but again, the miracle isn't that planes fly, but rather that planes don't fall out of the sky on a regular basis.
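For anyone who wants to check the math on that, here's a quick sketch of the arithmetic. The function name is made up for illustration, and the million-part figure is just the guess from the paragraph above, not an official number.

```python
from fractions import Fraction


def max_broken_parts(total_parts: int, required_working_percent: str) -> Fraction:
    """Upper bound on broken parts allowed by a 'percent working' threshold.

    The percentage is passed as a string so that Fraction can parse it
    exactly, which avoids floating-point rounding noise in the result.
    """
    working = Fraction(required_working_percent) / 100
    return total_parts * (1 - working)


# Assumption from the text: roughly a million parts, "five nines" working.
print(max_broken_parts(1_000_000, "99.999"))  # 10
```

Swap in a bigger part count, or one fewer nine, and the number of tolerated broken things climbs quickly.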

The difference between older, more "reliable" technology and the new experience of the Internet Of Things and our Software-based interface with the world is that most older technology has had the edges shaved down and sanded off. By default, these systems have been redesigned and redesigned until the understanding is that the technology persists in a system where the failure mode is understandable and easy to manage (though sometimes the timing of that failure mode is less-than-ideal -- witness anyone who's had a car run out of gas between mileposts on the freeway).

Much of Operational Thinking involves planning for Failure Modes -- how does it fail, why does it fail, what happens to the user / customer / involved systems when it fails -- and working with management and development teams to determine risk matrices for a given situation and the likelihood of business impact. Often the most important question an Ops team member can ask any developer is "how does it fail," because many developers (rightly enough) are extremely focused on delivery modes and success, and it's the job of the Ops person to make sure that failure is a mode the business as a whole and every partner in the business thinks about in order to reduce time spent in that mode.
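To make the risk-matrix idea a little more concrete, here's a minimal sketch of one common way to do it: score each failure mode on likelihood and impact, multiply, and sort. Everything in it is an assumption for illustration; the 1-to-5 scales, the class and function names, and the example scores are all made up, and a real matrix would come out of those conversations with management and development.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (happens all the time)
    impact: int      # assumed scale: 1 (annoying) to 5 (business stops)

    @property
    def risk_score(self) -> int:
        # Simple likelihood-times-impact scoring; other weightings are possible.
        return self.likelihood * self.impact


def rank_by_risk(modes):
    """Return failure modes ordered from highest to lowest risk score."""
    return sorted(modes, key=lambda m: m.risk_score, reverse=True)


# Hypothetical entries with made-up scores.
modes = [
    FailureMode("datacenter floods", likelihood=1, impact=5),
    FailureMode("leap-second bug hangs servers", likelihood=2, impact=4),
    FailureMode("log disk fills up", likelihood=4, impact=3),
]

for mode in rank_by_risk(modes):
    print(f"{mode.risk_score:>2}  {mode.name}")
```

The point isn't the numbers themselves; it's that writing them down forces somebody to answer "how does it fail" for every line in the list.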

Another Anecdote: Disaster Recovery planning is very-low-reward work. Often thinking about DR is boring and weird, because it often involves situations that just plain don't happen...until they do. The DR plan for the Datacenter flooding is not something anyone wants to work out, until it's June of 2011 and your company is looking at an emergency relocation of your production environment because your current datacenter is just outside Council Bluffs and there's a record water-release upstream on the Missouri that's about to sweep through and put the first two floors of the building underwater. Then it becomes really valuable to have that white binder with the carefully-laid-out plans for system migration. And it can be both expensive and panic-inducing when it turns out the white binder is empty / out of date, especially since your clients in New York and California aren't really on board with you taking a week off to fix the problem...

Software (and nearly all modern tools are to some extent married to some sort of software) sometimes breaks. Sometimes it breaks in extremely predictable ways, and sometimes it breaks in ways that are not only impossible to predict but sometimes nearly-impossible to replicate (want to have fun? Google "Leap Second Bug" and head down that particular Wikipedia rabbit hole). As an Operational-minded person, I am often looking for new and interesting failure possibilities in the various tools I use. But most people don't think about different types of failure modes; they have a mindset that all tools are either "working" or "broken". And it's important to think about that. And to recognize that the default state of modern civilization and life is much more often "broken".