Nobody likes to fail. Despite the Pollyannas of the world, who like to point out that Thomas Edison failed 700, 1,000 or 10,000 times before creating a working light bulb and who talk about failure being another fine opportunity for growth, people don’t like to fail. And there are, of course, any number of articles, books, and consulting contracts on how to keep IT projects from failing.

And yet IT project failure happens. A lot.

Estimates from some organizations say that from 20 to 25 percent of IT projects fail, for a cost of up to $6 trillion. And it can be worse with new technologies. “100 percent of the Big Data projects go over schedule, 100 percent go over budget, and half of them fail,” Jim Kaskade, CEO of Infochimps (perhaps speaking hyperbolically), told Loraine Lawson in IT Business Edge.

A discussion on LinkedIn’s CIO Network about how CEOs should deal with IT project failure drew more than 180 responses, with some putting the blame squarely in the CIO’s court. “The CEO should fire the CIO,” posted one respondent flatly. “IT project failure is due to incompetent IT managers.” And certainly there’s an off-with-her-head mentality about IT project failures, with the poor CIO bravely taking the blame. “A scad [sic] of reasons for the CIO to be thrown out with the bathwater,” posted another respondent.

And then you’ve got government IT failures, which are in a class of their own, with examples such as the U.S. Veterans Administration backlog, the Air Force modernization, and the California Department of Motor Vehicles. The Royal Bank of Scotland, which is 81 percent publicly owned, has suffered a string of failures, most recently in its mobile app. The U.K. government has also been warned that several of its major IT projects are facing problems.

Most recently, the BBC admitted that a project to create a digital content management system, the Digital Media Initiative (DMI), had to be scrapped after costing £98.4 million ($152 million). And in another example of the head being blamed, the BBC suspended CTO John Linwood on full salary, at least for now. Linwood was paid £287,800 plus a £70,000 bonus last year.

But is this because government IT workers are inherently less competent than those in private industry? Is it because they are public agencies and so are less able to hide their gaffes? Or is it because IT projects in government are typically more complex and thus more prone to failure?

There are actually two schools of thought about this: “high reliability” theory and “normal accident” theory. High reliability theory argues that with enough design and management, high-risk technologies can work without mistakes despite human frailty.

On the other hand, normal accident theory, postulated by Charles Perrow in his 1984 book in response to the Three Mile Island accident, basically suggests that as a system gets more complex, the chances of failure increase no matter how careful you are with all the requisite components. In other words, no matter how rigorously you test all the various components, when you put them together, they’re more likely to fail because of unexpected interactions between them.
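One rough way to picture why complexity raises the odds of unexpected interactions is to count how many pairs of components could, in principle, affect each other. The sketch below is only an illustration under a simplifying assumption (every pair of components counts as one potential interaction); the function name `potential_interactions` is hypothetical and is not taken from Perrow’s work.

```python
def potential_interactions(n_components: int) -> int:
    """Return the number of distinct component pairs: n * (n - 1) / 2.

    Assumption for illustration: each pair of components is one
    potential interaction. Real systems may have fewer or many more.
    """
    if n_components < 0:
        raise ValueError("n_components must be non-negative")
    return n_components * (n_components - 1) // 2


if __name__ == "__main__":
    # Component count grows linearly; possible pairings grow quadratically.
    for n in (5, 10, 50, 100):
        print(f"{n} components -> {potential_interactions(n)} possible pairs")
```

Under this assumption, doubling the number of components roughly quadruples the number of pairings that would need to be examined, which is one way to read the claim that testing each component alone is not enough.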

An example comes from the space program: the investigation into the loss of the space shuttle Columbia discussed the role of normal accident theory. “Standing alone, components may function adequately, and failure modes may be anticipated,” the report says. “Yet when components are integrated into a total system and work in concert, unanticipated interactions can occur that can lead to catastrophic outcomes. The risks inherent in these technical systems are heightened when they are produced and operated by complex organizations that can also break down in unanticipated ways.”

No problem, you say, you’ll put in a lot of checks and balances to look for failure. Okay, but those checks and balances themselves add complexity and make a system more prone to failure, NASA warns. Every time you add some sort of a warning system and a light, you then have to consider the possibility that it’s the warning system, or the light itself, that’s failing, rather than the underlying system it’s supposed to check.

NASA, in fact, found that the most useful thing to do — since its programs are inherently complex — is to improve communication. This has been confirmed by other organizations. “A Project Management Institute (PMI) study finds that organizations risk ‘$135 million for every billion dollars’ they spend on projects,” writes Michael Krigsman in ZDNet’s Beyond IT Failure blog. “Of this large sum, related research from PMI concludes that ‘ineffective communications’ drives 56 percent ($75 million) of these at-risk dollars. Based on these numbers and empirical experience, it is clear that communication plays a significant role in the success or failure of projects.”
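The arithmetic behind the quoted PMI figures can be checked with a short sketch. The two ratios ($135 million at risk per $1 billion spent, and 56 percent of that attributed to ineffective communications) come from the quote above; the function name `at_risk_dollars` and its parameters are hypothetical, and scaling the ratios linearly to other budgets is an assumption made here, not something PMI is quoted as claiming.

```python
def at_risk_dollars(
    project_spend: int,
    risk_per_billion: int = 135_000_000,  # from the quoted PMI figure
    communication_percent: int = 56,      # from the quoted PMI figure
) -> tuple[int, int]:
    """Return (total at-risk dollars, portion tied to communication).

    Assumption: the quoted ratios scale linearly with spend.
    Integer arithmetic is used to avoid floating-point rounding noise.
    """
    at_risk = project_spend * risk_per_billion // 1_000_000_000
    communication = at_risk * communication_percent // 100
    return at_risk, communication


if __name__ == "__main__":
    total, comm = at_risk_dollars(1_000_000_000)
    print(f"At risk: ${total:,}; tied to communication: ${comm:,}")
```

For $1 billion in spending this gives $135,000,000 at risk and $75,600,000 tied to communication, which matches the roughly $75 million cited in the quote.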

There’s also the notion of redefining failure as an “incomplete success,” as postulated by Daniel Antion, Vice President of Information Services at American Nuclear Insurers (who, interestingly, also uses the U.S. space program as an analogy). “If we are trying to build a system that is easier to use, how do we know when we are done, and do we really ever want to say that we’re done?”

So there are a few lessons to be drawn from this. First of all, if an IT project fails, despite all your best attempts, it’s not necessarily your fault. Second, to the extent that you can, it’s better to keep your projects simple, because the more complex you make them, the more likely they are to fail. Third, it can never hurt to improve communication. And finally, learn to see failure as the beginning of the next iteration, rather than as an endpoint.









