Tuesday, August 18, 2015

Explaining my Title

I believe I got the quote from Cory Doctorow, and he credited a designer whose name I've never been able to find or credit, so feel free to let me know if you know the source. But the quote as I remember it is this:

"The default state of technology, any technology from stone axes to modern computers, the default state of technology is 'broken'."

It's true, too; technology, by definition, is something that is created and thus must be maintained. The axe is possibly the earliest and easiest example: axes that aren't sharp are spectacularly bad axes. For a really, really long time we survived on technology and tools that were only moderately more complicated than our own fingers and teeth, and thus technology was relatively easy to maintain and manage, but still: a dull knife is a failure mode. A snapped bowstring is a failure mode. And it requires time, attention, and effort to keep the technology of life out of failure mode and in a usable condition. And this is a condition that becomes more and more true as systems become more complex and civilization becomes, well... civilization.

The thing about the modern world is that it's become so complex that the average person, while entirely capable of maintaining their own technology, just doesn't have time to do it on their own. In point of fact, there's a better-than-average chance that they're out doing maintenance on someone else's technology so that person can do maintenance on someone else's technology... it's maintenance all the way down, in our society. The miracle isn't that computers make our lives easier; the miracle is that they manage to not fail on a reasonably regular basis.

An anecdote: the FAA requires that every plane that flies meet a strict policy on maintenance -- the industry standard as I understand it is the "five nines", which means that 99.999% of the parts and functionality of the aircraft must be working for the aircraft to be certified as air-worthy. If you accept the idea that the average 737 has a million moving parts (and I personally think that's low), then that means that on every Southwest flight you take there are as many as 10 things on the plane that are broken. The good news is that they often aren't major things -- a seatbelt doesn't lock, a cabin compartment doesn't latch, etc. -- but again, the miracle isn't that planes fly, but rather that planes don't fall out of the sky on a regular basis.
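For anyone who wants to check my arithmetic, here is a minimal Python sketch of the "five nines" calculation. The function name is my own invention, and both the one-million-part count and the 99.999% threshold are the assumptions from the anecdote above, not official figures.

```python
from fractions import Fraction


def max_broken_parts(total_parts: int, working_fraction: str) -> int:
    """Largest whole number of parts that can be broken while still
    meeting the required working fraction (hypothetical helper)."""
    # Fraction parses the decimal string exactly, which avoids
    # floating-point rounding (1 - 0.99999 is not exactly 0.00001 as a float).
    allowed = total_parts * (1 - Fraction(working_fraction))
    return int(allowed)  # truncate: you can't have a fraction of a broken part


# Assumed figures from the anecdote: one million parts, "five nines".
print(max_broken_parts(1_000_000, "0.99999"))
```

Under those assumptions the answer is 10, which is where the "as many as 10 things" figure comes from.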

The difference between older, more "reliable" technology and the new experience of the Internet of Things and our software-based interface with the world is that most older technology has had the edges shaved down and sanded off. By default, these systems have been redesigned and redesigned until the understanding is that the technology persists in a system where the failure mode is understandable and easy to manage (though sometimes the timing of that failure mode is less-than-ideal -- witness anyone who's had a car run out of gas between mileposts on the freeway).

Much of Operational Thinking involves planning for Failure Modes -- how does it fail, why does it fail, what happens to the user / customer / involved systems when it fails -- and working with management and development teams to determine risk matrices for a given situation and the likelihood of business impact. Often the most important question an Ops team member can ask any developer is "how does it fail," because many developers (rightly enough) are extremely focused on delivery modes and success, and it's the job of the Ops person to make sure that failure is a mode the business as a whole and every partner in the business thinks about in order to reduce time spent in that mode.
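A risk matrix doesn't have to be fancy. As a rough sketch of the idea (not a description of any particular team's process), here is a small Python example that scores failure modes by likelihood times impact and ranks them. The failure modes, the 1-to-5 scales, and every name in the code are hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    likelihood: int  # assumed scale: 1 (rare) .. 5 (frequent)
    impact: int      # assumed scale: 1 (minor) .. 5 (business-threatening)

    @property
    def risk(self) -> int:
        # Simplest possible scoring: likelihood multiplied by impact.
        return self.likelihood * self.impact


def rank(modes):
    """Return failure modes ordered from highest to lowest risk score."""
    return sorted(modes, key=lambda m: m.risk, reverse=True)


# Hypothetical entries; real ones come out of asking "how does it fail?"
modes = [
    FailureMode("disk fills up", likelihood=4, impact=2),
    FailureMode("datacenter floods", likelihood=1, impact=5),
    FailureMode("bad release", likelihood=3, impact=4),
]

for m in rank(modes):
    print(f"{m.risk:>2}  {m.name}")
```

Note that a plain product like this pushes rare-but-catastrophic failures toward the bottom of the list, so treat the ranking as the start of a conversation with management and development rather than the final word.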

Another anecdote: Disaster Recovery planning is very-low-reward work. Often thinking about DR is boring and weird, because it often involves situations that just plain don't happen...until they do. The DR plan for the Datacenter flooding is not something anyone wants to work out, until it's June of 2011 and your company is looking at an emergency relocation of your production environment because your current datacenter is just outside Council Bluffs and there's a record water-release upstream on the Missouri that's about to sweep through and put the first two floors of the building underwater. Then it becomes really valuable to have that white binder with the carefully-laid-out plans for system migration. And it can be both expensive and panic-inducing when it turns out the white binder is empty / out of date, especially since your clients in New York and California aren't really on board with you taking a week off to fix the problem...

Software (and nearly all modern tools are to some extent married to some sort of software) sometimes breaks. Sometimes it breaks in extremely predictable ways, and sometimes it breaks in ways that are not only impossible to predict but sometimes nearly-impossible to replicate (want to have fun? Google "Leap Second Bug" and head down that particular Wikipedia rabbit hole). As an Operational-minded person, I am often looking for new and interesting failure possibilities in the various tools I use. But most people don't think about different types of failure modes; they have a mindset that all tools are either "working" or "broken". And it's important to think about that. And to recognize that the default state of modern civilization and life is much more often "broken".

Monday, August 17, 2015

The Question of Sheets (Or: all Metaphors are Faulty)

Think of your business as a bedroom -- let's say you're a homeowner and you're looking to rent out your spare bedroom on AirBnB or something like that (we'll leave the troublesome methodology of companies like AirBnB or Uber for some other day when I have the ability to produce TWO multi-thousand-word rants about business decisions). Your bedroom is a business, and your production environment is the bed -- mattress (front-end), box-spring (back-end), bedframe (infrastructure). Your Operations team is the woman who changes the sheets (product release), and this is where things get tricky.

If you're running a shady, quasi-illegal operation out of your spare bedroom, the woman who changes the sheets is probably you, and you're probably not a professional housekeeper. You just want clean sheets that keep the mattress from getting horked up by the weirdo from Brooklyn with the MacBook Pro who leaves beard trimmings in the sink. In this case, you do what any reasonable homeowner does: you go out and buy a set of sheets off the shelf, throw on the fitted sheet, and ignore it until the next person comes along and you have to change the sheets again. You're trying to make some spare scratch on the side, not make a business of it, so this model is fine; you can probably get by with two or three sheet sets and you just pull them off and toss them in the laundry as needed, and most of the time you keep your treadmill with the hangers on it in the corner and there's no problem.

But then you've got some spare cash, so you take out a mortgage on a condo in a building in downtown Portland and rather than moving into it you stage it and decide to rent it out to people visiting PDX for conferences or vacations or whatever, because there's money to be made with spare bedrooms. Now you have a choice: do you become an expert at cleaning? Or do you hire a cleaning service to keep your condo clean between visits? Note that the cleaning service is going to cut into your profits, probably pretty significantly. But you're also going to spend a lot of time and effort on sheets. And if sheets aren't something you want to spend a lot of time on, especially fiddling with fitted sheets on a given mattress, then there's a pretty steep opportunity cost there as well. So either you get to become an expert on sheets and making the bed, or you're going to spend a moderate chunk of the money you're making to have someone else come in and change your sheets for you. Your choice.

But then you realize, you really like managing visitors, and there are lots and lots of people wanting to sleep in Portland, so that's it: you're going to build a hotel in Portland. You're going to have lots and lots of mattresses for lots and lots of visitors. And that means lots and lots of sheets. So now it's time to make some decisions about hiring the people who know something about sheets (and vacuums, and washing machines, and... well, you get the idea).

Like in the hospitality industry, in IT the people who change the sheets and vacuum the floors and fold the corners and spray for bedbugs are fantastically undervalued for the work they do, mostly because when they do it correctly no one notices and when they do it badly companies go under.

This metaphor is getting a little out of control, but you get my point: trust people to know what they're doing, let them do it, and for the gods' own sake, pay them reasonably well, or they will desert you in droves the moment that someone else offers them a dollar more an hour to change the sheets.

As an addendum, it was pointed out to me that when you're managing a hotel at scale, no one uses fitted sheets. Instead, the proprietor goes to a special wholesaler and buys a metric ton of flat sheets of a uniform colour and size, which the staff then folds and fits to the particular mattress as necessary based on size and usage. I'll leave the parsing of that as a metaphor for Operations Teams as an exercise for the reader...