Why We’re in This BP Gulf Oil Mess, and What We Should Do About It

I’ve talked a lot about human nature in my articles. I believe that human nature is the biggest challenge to most successful management, and especially the biggest challenge for IT managers.

Information technology is all very logical. Software does exactly what you tell it to do. Computers, for the most part, behave the way we expect them to behave. But people are on the opposite end of the behavior spectrum from software and hardware, and human behavior follows a different set of rules.

Let’s take the BP Gulf oil spill as an example. Putting an oil well in the ocean is an inherently dangerous process. Actually, putting an oil well anywhere is an inherently dangerous process, but putting a well thousands of feet below the surface of the ocean is much more complicated and therefore has a much higher risk.

No one (not BP, not BP workers, not the government and not the population of the planet) wants to see a leak in an oil well. In all of the criticism of BP, no one has even suggested that the leak was deliberate. Instead, the discussion about blame and fault has centered on the safeguards that were employed: Were they safe enough? Were they properly maintained? Were warning signs ignored?

In my book, Boiling the IT Frog, I talk about the unfortunate use of magic in IT — the suspension of reality that occurs when a technological process is so sophisticated that people develop a blind trust in its capabilities. I believe that this same magic plays a large role in the BP Gulf oil disaster. Here’s how:

Because of the inherent risk in drilling so deep in the ocean, a number of safeguards were installed to automatically stop the oil flow in case of disaster.

There is no such thing as a 100% reliable system, especially one that includes hardware that is exposed to the elements. The ocean is a hostile environment to most machinery. Thus any system installed in such an environment must be regularly inspected and maintained, and certain parts that are subject to wear and corrosion must be regularly replaced.

Because of assurances from engineers, safety experts and scientists, the BP employees working on the oil rig developed a sense of magic around the safeguards. They trusted the safeguards to protect them from any disaster. And because there were multiple safeguards (primary systems as well as backup systems), the workers developed an especially unreasonable trust.

This is, unfortunately, human nature: when you have to live day-to-day in a risky environment, you suppress your fear by developing an unreasonable trust in things going right. That trust persists right up until the point when things go wrong.

When the safeguards began to malfunction, there was probably some concern. But because there were multiple safeguards, the concern was not severe.

Each worker on the oil rig theoretically had the ability to stop operations if there were safety concerns. But as long as there were multiple backup systems, the workers reasoned that things weren’t that dangerous. No one wanted to be the “bad guy” who stopped production.

Thus the multiple safeguards, which are good for reliability, are bad for human nature: multiple safeguards lead to unreasonable trust, which leads to a lack of concern when one of the safeguards fails.

One by one the safeguards malfunctioned, went out of maintenance (in some cases waiting on parts), or developed small problems that would later snowball into something big. But the “magic” persisted and no one recognized how close they were coming to a disaster.

Finally a relatively small problem triggered a sequence of failures which caused the rig to explode and the well to spew oil into the Gulf. It wasn’t one thing that caused the disaster; it was a series of things that were ignored because not one of those things was considered significant enough to cause alarm.

This Isn’t the Only Case

There are many examples of this kind of thing: a series of cascading small issues that roll together into a single big disaster. Chernobyl followed the same pattern, as did Three Mile Island, the Challenger shuttle failure, the Twin Towers (not the planes, obviously, but the building collapse), and even the Titanic sinking.

All of these disasters could have been avoided, and should have been avoided, but human nature got in the way. It’s human nature to develop a false trust in safeguards (especially multiple safeguards) when we’re told by “experts” that the safeguards are sufficient. It’s human nature to avoid being the “bad guy” who makes a big deal out of something when everyone else assures you it’s OK. And now, in the aftermath of the BP Gulf oil disaster, it’s human nature to believe that all disasters have a single cause and so we look to find a single person to blame.

What Can We Do?

I’ve thought about this a lot over the last couple of months. I wish there were a simple answer, a “silver bullet” solution that would prevent this sort of thing in the future. Some people call for more regulation, and maybe that would help, but I’m afraid that the inspectors and regulators would fall victim to the same unreasonable trust that we saw on the oil rig. And personal agendas and politics have a way of twisting regulations and enforcement, rewarding the occasional person who is corrupt enough to look the other way for a price.

One possible solution is to learn from some of the things that have been done in commercial flight training, at NASA, and in some parts of the military. When you’re learning how to fly a large jetliner you don’t just learn how to do everyday things like take-offs and landings. You spend a huge amount of time learning how to deal with various malfunctions and flight issues. There’s a protocol that you learn on what to do when certain things happen. You learn that if a certain instrument gives a certain result, you’re supposed to do a certain thing. There’s no judgement involved, except by the people who originally develop the protocol. You learn the protocol so well that you do it automatically when the actual event happens. Flight simulators make the process realistic: you go through the protocol in a simulated cockpit that looks and feels just like the real one.

Commercial aviation also relies on checklists. Before a pilot ever moves a plane away from the gate the flight crew has gone through an extensive checklist designed to make sure that conscious attention is paid to every relevant instrument and flight indicator. The process isn’t perfect, but that checklist — in association with a predetermined protocol of how to deal with each abnormal checklist item — has prevented many potential disasters.

I don’t know enough about the processes and procedures on the oil rig, but I get the impression that:

Clear protocols were not in place to force workers to take certain actions when certain problems occurred. These protocols would have removed the need for a personal decision (and eliminated the “I don’t want to be the bad guy” problem), since the protocols would have dictated the required behavior.

There had been no “flight simulation” type training that took the oil rig workers through simulated problems until the solution was part of the workers’ unconscious thought.

On the other hand, I know that flight-simulation-like techniques have been used for workers on nuclear reactors, and we still had the Chernobyl and Three Mile Island disasters, even though I suspect that many other potential nuclear disasters have been avoided as a result of the simulator training. I don’t know what happened at Chernobyl or Three Mile Island: maybe the problems were outside the training, maybe the simulation training wasn’t used in those facilities, or maybe the workers ignored their training and caused the problems.

Conclusion

I believe that the solution to problems like the BP Gulf oil disaster lies in:

Getting a better understanding of the safeguards that are in place in any risky situation, with specific focus on whether they’re safe enough. I know that 100% safety is not possible, but we ought to be able to approach that number.

Determining the risk when one or more of the safeguards is not working, and making sure that protocols are in place to deal with repair or replacement of the broken safeguard in accordance with the risk. High risk should dictate an expedited repair or even a temporary shutdown.

Putting checklists in place for regular inspection and maintenance of the safeguards, and putting protocols in place to deal with any area that doesn’t pass the checklist requirement.

Training workers to deal with exceptions, with specific instruction on what steps to take in each situation, and who is supposed to take those steps. This seemed to be missing on the BP oil rig, and so everyone just assumed that everyone else was dealing with the problems.

Using simulation training where it makes sense, so that workers get an actual exposure to what would happen in certain unlikely situations, and so that they learn how to deal with those situations.
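
For the IT managers reading this, the point above about determining the risk when a safeguard is not working can be sketched in a few lines of Python. This is only a rough illustration under loud assumptions: the 1% failure rate, the count of four safeguards, the function names and the action thresholds are all invented for the example and are not taken from the BP rig. It also assumes the safeguards fail independently, which real safeguards often do not, so it probably understates the true risk.

```python
# Hypothetical figures, chosen only to illustrate the reasoning.
FAILURE_RATE = 0.01      # assumed chance that one safeguard fails when needed
TOTAL_SAFEGUARDS = 4     # assumed number of safeguards installed


def chance_all_fail(working: int, failure_rate: float = FAILURE_RATE) -> float:
    """Chance that every working safeguard fails at once,
    assuming failures are independent of each other."""
    if working < 0:
        raise ValueError("working must be zero or more")
    return failure_rate ** working


def risk_multiplier(working: int, total: int = TOTAL_SAFEGUARDS) -> float:
    """How many times riskier the current state is than the fully protected one."""
    return chance_all_fail(working) / chance_all_fail(total)


def required_action(working: int, total: int = TOTAL_SAFEGUARDS) -> str:
    """A predetermined protocol: the action depends only on how many
    safeguards are out of service, not on anyone's personal judgement."""
    out_of_service = total - working
    if out_of_service <= 0:
        return "continue operations"
    if out_of_service == 1:
        return "expedite repair"
    return "shut down until repaired"


if __name__ == "__main__":
    for working in range(TOTAL_SAFEGUARDS, -1, -1):
        print(working, "working:",
              f"about {risk_multiplier(working):,.0f}x baseline risk,",
              required_action(working))
```

With these assumed numbers, losing just one of four safeguards makes an unprotected failure roughly 100 times more likely, a jump that is easy to underestimate when three safeguards still appear to be working. Writing the required action into the protocol means nobody has to volunteer to be the “bad guy.”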

And for those of you who don’t work in the oil industry (most of you, I’m sure), think about what types of disasters might befall your own business. Are you prepared? Or will your employees fall into the human nature trap and avoid the problems rather than acting? What can you learn from the BP oil disaster? And what can you do to put human nature to work for you instead of having it work against you?