Managed risk
In this week’s issue of the New Yorker (dated 1/22/96), writer Malcolm Gladwell examines the Challenger disaster from a new perspective, in an article rather tactlessly titled “Blowup.” In it, Gladwell debunks the traditional “ritual of reassurance,” whereby we sift through the remnants of a disaster, piece together the clues, and hope that by learning what happened we can avoid it in the future. Gladwell argues that in the case of the Challenger disaster, the massive search for evidence and clues was really all for naught, because no one thing was at fault. Rather, the nature of complex systems (like the shuttle) and the culture of large organizations (like NASA) are to blame.
He argues that…
- Technological disasters happen not because of human error, but because of the very complexity of the technology involved. Small problems combine with other small problems, setting off chain reactions that are impossible to plan for.
- The culture of a large organization like NASA can produce decisions that are logically faulty, but that when examined are completely “by the book.” NASA’s definition of what counts as an “acceptable risk” for launching the Shuttle fills six volumes. The decision to launch the Shuttle, despite evidence of O-ring trouble, was literally made “by the book.”
I don’t know much about the technical issues surrounding the failure of the O-rings. And I don’t work at NASA, so I’m not familiar with their culture. But Gladwell’s argument is compelling. A system like the space shuttle (or a nuclear power plant, or an operating system, for that matter) is very, very complex. And to build complex things, you need some level of “managed risk.”
In Out of Control, Kevin Kelly writes about how it is nearly impossible to build a complex system that is error-free. In a complex piece of software, for example, only 10% of the code would actually be doing the work that the software was intended to do. The other 90% would be searching for errors, trapping them, and making sure that they don’t affect the crucial 10% that’s doing the real work. Not very efficient, for sure, and you’d never ship (or launch) anything built that way, because it would take too long to build.
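To make Kelly’s ratio concrete, here’s a toy sketch. Everything in it is invented for illustration (the sensor-driven valve, the class, the limits); the point is the shape of the code: one line of real work, buried under the machinery that exists only to find and trap errors.

```python
# A toy sketch of Kelly's 10/90 ratio -- not real flight software.
# The scenario (a sensor-driven valve) and all names are invented.

class Valve:
    """Hypothetical actuator with a lockout flag and a position."""
    def __init__(self):
        self.locked = False
        self.position = 0.0

    def set_position(self, fraction):
        self.position = fraction

def adjust_valve(sensor_reading, valve):
    # -- error trapping: roughly 90% of the lines in this routine --
    if sensor_reading is None:
        raise ValueError("no reading received")
    if not isinstance(sensor_reading, (int, float)):
        raise TypeError("reading must be numeric")
    if not (0.0 <= sensor_reading <= 150.0):
        raise ValueError("reading outside the sensor's physical range")
    if valve.locked:
        raise RuntimeError("valve locked out for maintenance")
    # -- the real work: the 10% the software actually exists to do --
    valve.set_position(sensor_reading / 150.0)
```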
The answer, Kelly argues, is to build complex systems from the bottom up, out of simpler, more independent pieces that interact with one another in a way that “breeds” complexity. Kelly, of course, cites the Internet as a perfect example of a complex system that’s built from the bottom up. TCP/IP is a set of relatively simple rules about how computers should communicate with one another. It works well whether there are two computers involved, two hundred, or two million. But you knew that.
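Here’s a rough sense of what “relatively simple rules” means in practice, using Python’s standard socket library (a minimal sketch, not anything Kelly describes): an echo server that accepts a connection, reads bytes, and sends them back. Those local rules are all it knows; nothing in the code changes whether two computers are talking or two million, which is where the complexity of the larger network gets “bred” rather than designed in.

```python
# A minimal TCP echo server using only Python's standard library.
# The rules it follows are simple and local: accept a connection,
# read bytes, send them back. The port number is chosen arbitrarily.
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.bind(("0.0.0.0", 7777))
    server.listen()
    while True:
        conn, addr = server.accept()
        with conn:
            data = conn.recv(1024)
            if data:
                conn.sendall(data)
```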
There are systems where “bottom up” evolution makes perfect sense. The Internet is one of them. But there are obvious challenges in applying that to day-to-day development projects. What do you tell the politicians (or the taxpayers for that matter) when your new space shuttle doesn’t “evolve” quickly enough?
And let’s say that we do get to the point where a complex system like the space shuttle can evolve from the bottom up (like the Internet has). What happens if it fails anyway? What do we do then?
About halfway through Gladwell’s New Yorker article, I started to feel sick to my stomach. Were the relatives of Christa McAuliffe (the schoolteacher on the Challenger) reading this? Did they really want to know that a culture of “managed risk” or the “nature of complex systems” was really to blame?
There is something to be said for the “ritual of reassurance.” There’s a reason that the first question the reporter asks the FAA official is, “Have you recovered the flight recorder yet?” On an intellectual level, we need to feel that we have some measure of control over the systems we create, even if that control is illusory. Ronald Reagan, in his speech following the shuttle disaster, said, “We’ve grown used to wonders in this century. It’s hard to dazzle us.” Those were the most intelligent words of his presidency.
And on an emotional level, we need a sense of closure. When I was in college, plastic explosives blew apart Pan Am flight 103 over Lockerbie, Scotland. Two of my classmates were on that flight. I remember seeing on television the Scottish police searching the violated countryside for clues. And I remember muttering under my breath, “They better find the fuckers that did this.”