21 August 2009, 07:39

Airplanes today are incomparably safer than 20, 30, 50 years ago: 0.05 deaths per billion kilometers. That’s not by accident.

Rather, it’s by accidents. What has turned air travel from a game of chance into one of the safest modes of traveling is the relentless study of crashes and other mishaps. In the US the National Transportation Safety Board has investigated more than 110,000 accidents since it began its operations in 1967. Any accident must, by law, be investigated thoroughly; airplanes themselves carry the famous “black boxes” whose only purpose is to provide evidence in the case of a catastrophe. It is through this systematic and obligatory process of dissecting unsafe flights that the industry has made almost all flights safe.

Now consider software. No week passes without the announcement of some debacle due to “computers” — meaning, in most cases, bad software. The indispensable Risks forum [1] and many pages around the Web collect software errors; several books have been devoted to the topic.

A few accidents have been investigated thoroughly; two examples are Nancy Leveson’s milestone study of the Therac-25 patient-killing medical device [2], and Gilles Kahn’s analysis of the Ariane 5 crash (which Jean-Marc Jézéquel and I used as a basis for our 1997 article [3]). Both studies improved our understanding of software engineering. But these are exceptions. Most of what we have elsewhere is made of hearsay and partial information, and plain urban legends (like the endlessly repeated story about the Venus probe that supposedly failed because a period was typed instead of a comma — most likely a canard).

Software disasters continue; they attract attention when they arise, and inevitably some kind of announcement is made that the problem is being corrected, or that a committee will study the causes; almost as inevitably, that is the last we hear of it. In the latest issue of Risks alone, you can find several examples (such as [4]). In the past months, breakdowns at Skype, Google and Twitter made headlines; we all learned about the failures, but have you seen precise analyses of what actually happened?

As another typical example, we heard a few months ago from the French press that an “IT error” (une erreur informatique) led to overestimating the pensions of about a million people; since (strangely!) no one was suggesting that they would be asked to pay the money back, the cost to taxpayers will be over 300 million euros. I looked in vain for any follow-up story: what happened? What was the actual error? Were the tools at fault? The quality assurance procedures? The programmers’ qualifications? Or was it a matter of bad deployment? Of erroneous data, and if so, what was the process for validating inputs? And so on. Most likely we will never know.

But we should know. Especially with public money, any such incident should have a post-mortem, with experts called in (surely at a fraction of the cost of the failure) to analyze what happened and produce a public report.

At least this was a public project, for which some disclosure was inevitable. The software engineering community buzzes with unconfirmed reports of huge software-induced errors, that go unreported because they happen in private companies eager to avoid bad publicity. It’s as if we had allowed aircraft manufacturers, decade after decade, to keep mum about accidents. Where then would air travel safety be today?

Progress in software engineering will come from many sources. Research is critical, including on topics which today appear exotic. But if anyone is looking for one practical, low-tech idea that has an iron-clad guarantee of improving software engineering, here it is: pass a law that requires extensive professional analysis of any large software failure.

The details are not so hard to refine. The initiative would probably have to start at the national level; any industrialized country could be the pioneer. (Or what about Europe as whole?) The law would have to define what constitutes a “large” failure; for example it could be any failure that may be software-related and has resulted in loss either of human life or of property beyond a certain threshold, say $50 million. In the latter case, to avoid accusations of government meddling in private matters, the law could initially be limited to cases involving public money; when it has shown its value, it could then be extended to private failures as well. Even with some limitations, such a law would have a tremendous effect. Only with a thorough investigation of software projects gone wrong can we help the majority of projects to go right.

We can no longer afford to let the IT industry get away with covering up its failures. Lobbying for a Software Incident Full Disclosure Law is the single most important step we can take today to make the world’s software better.

Note (2011)

Later articles have come back to the theme discussed here, and there will probably be more in the future as it remains ever current. They can be found by selecting the tag “Advance”.

References

[1] Peter G. Neumann, moderator: The Risks Digest Forum on Risks to the Public in Computers and Related Systems, available online (going back to 1985!).

[2] Nancy Leveson: Medical Devices: The Therac-25, extract from her book Safeware: System Safety and Computers, Addison-Wesley, 1995, available online.

[3] Jean-Marc Jézéquel and Bertrand Meyer: Design by Contract: The Lessons of Ariane, in Computer (IEEE), vol. 30, no. 1, January 1997, pages 129-130, also available here.

[4] Monty Solomon: Computer Error Caused Rent Troubles for Public Housing Tenants, in Risks 25.76, 15 August 2009, see here.

[5] Une erreur informatique à 300 millions d’euros, in Le Point, 12 May 2009, available here.

VN:F [1.9.10_1130]

please wait... Rating: 10.0/10 (5 votes cast)

VN:F [1.9.10_1130]

Rating: +5 (from 5 votes)

, 10.0 out of 10 based on 5 ratings