A Collection of Well-Known Software Failures


Economic Cost of Software Bugs

Report Date: 2/2002 Price Tag: $60 Billion Annually

Improvements in testing could reduce this cost by about a third, or $22.2 billion, but they won't eliminate all software errors, the study said. Of the total $59.5 billion cost, users incurred 64% and developers 36%.

NIST Report [local copy], News Release

Curious about how the study calculated the cost, I skimmed through the report. The following is a summary of its methodology.

It divided the software development process into stages: Requirement Gathering and Analysis, Architectural Design, Coding, Unit Test, Integration and Component, RAISE System Test, Early Customer Feedback, Beta Test Programs, and Post-product Release.

Bugs are generated at each stage of the software development process. The later in the process a bug is discovered, the more costly it is to repair. Impact estimates were then developed relative to two counterfactual scenarios. The first scenario investigates the cost reductions if all bugs and errors could be found in the same development stage in which they are introduced. This is referred to as the cost of an inadequate software testing infrastructure. The second scenario investigates the cost reductions associated with finding an increased percentage (but not 100 percent) of bugs and errors closer to the development stages where they are introduced. This is referred to as a cost reduction from feasible infrastructure improvements.

The study examined the impact of buggy software in several major industries -- automotive, aerospace and financial services -- and then extrapolated the results to the U.S. economy. It concluded that software bugs cost the U.S. economy an estimated $59.5 billion each year (the first scenario). Improvements in testing (the second scenario) could reduce this cost by about a third, or $22.2 billion.

The report also includes interesting tables showing how frequently errors are found at each stage, and the relative cost of repairing defects found at different stages (Chapters 6 and 7).

Knight Capital Trading Loss

Incident Date: 8/1/2012 Price Tag: $440 million Ironic Factor: ****

(The Register) Knight Capital, a firm that specialises in executing trades for retail brokers, took $440m in cash losses Wednesday due to a faulty test of new trading software.

...

Unfortunately, the trading algorithm the program was using was a bit eccentric as well. On every stock exchange, there is a "bid" and an "ask" price. The bid price is what you'd like to pay the holder of the stock if you want to buy their shares. The ask price is what the holder wants to be paid for those same shares. There's always a spread between the two prices, with the "ask" being a few cents or more above the "bid". If the stock is thinly traded, then the spread between the ask and the bid is higher than what you'd see for, say, IBM.

Knight Capital's software went out and bought at the "market", meaning it paid ask price and then sold at the bid price--instantly. Over and over and over again. One of the stocks the program was trading, electric utility Exelon, had a bid/ask spread of 15 cents. Knight Capital was trading blocks of Exelon common stock at a rate as high as 40 trades per second--and taking a 15 cent per share loss on each round-trip transaction. As one observer put it: "Do that 40 times a second, 2,400 times a minute, and you now have a system that's very efficient at burning money".
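The arithmetic of that round trip can be sketched in a few lines. The 15-cent spread and the rate of 40 trades per second come from the passage above; the bid price and the 100-share block size are assumed values chosen only for illustration, so the totals say nothing about Knight's actual positions.

```python
from decimal import Decimal

def round_trip_loss(bid, ask, shares):
    """Loss from buying `shares` at the ask and immediately selling at the bid."""
    return (ask - bid) * shares

# Hypothetical quote: only the 15-cent spread is taken from the article.
bid = Decimal("37.00")
ask = bid + Decimal("0.15")
shares_per_trade = 100     # assumed block size, not from the article
trades_per_second = 40     # rate quoted in the article

loss_per_trade = round_trip_loss(bid, ask, shares_per_trade)
loss_per_minute = loss_per_trade * trades_per_second * 60

print(loss_per_trade)      # 15.00
print(loss_per_minute)     # 36000.00
```

Decimal is used instead of float so that cent amounts stay exact.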

As the program continued its ill-fated test run, Knight's fast buys and sells moved prices up and attracted more action from other trading programs. This only increased the amount of losses resulting from their trades to the point where, at the end of the debacle 45 minutes later, Knight Capital had lost $440m and was teetering on the brink of insolvency.

...

Article

Microsoft Zune's New Year Crash

Incident Date: 12/31/2008 Ironic Factor: ****

(Associated Press) Happy New Year from Microsoft Corp.: Your Zune is dead.

Thousands of Microsoft's Zune media players -- the software company's answer to Apple Inc.'s iPod -- unexpectedly conked out Wednesday and showed users an error message, prompting references to 'Y2K for Zunes'. The problems appeared when people tried to start up their devices.

...

Article [Local copy]

The bug behind the freeze was later isolated. It is a simple programming error: a loop in the date-conversion code never terminates on the last day of a leap year, so it causes trouble only on that one day.
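The widely circulated analysis of the bug points to a loop that converts a day count into a year. The sketch below re-creates that logic in Python rather than reproducing the original driver code; the function names, the base year of 1980, and the iteration cap (which stands in for the device hanging) are assumptions made for illustration.

```python
from datetime import date

def is_leap_year(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def year_from_days_buggy(days, year=1980, max_iterations=10_000):
    """`days` is the 1-based day count starting at Jan 1 of `year`.

    When days == 366 in a leap year, neither branch changes anything,
    so the loop would spin forever; the cap turns that into an error.
    """
    iterations = 0
    while days > 365:
        if is_leap_year(year):
            if days > 366:
                days -= 366
                year += 1
        else:
            days -= 365
            year += 1
        iterations += 1
        if iterations > max_iterations:
            raise RuntimeError("loop did not terminate")
    return year, days

def year_from_days_fixed(days, year=1980):
    """Same conversion, but the exit test uses the length of the current year."""
    while True:
        days_in_year = 366 if is_leap_year(year) else 365
        if days <= days_in_year:
            return year, days
        days -= days_in_year
        year += 1

dec_31_2008 = (date(2008, 12, 31) - date(1980, 1, 1)).days + 1
print(year_from_days_fixed(dec_31_2008))   # (2008, 366)
```

Calling year_from_days_buggy(dec_31_2008) raises RuntimeError, while the same call for the previous day returns normally.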

Air-Traffic Control System in LA Airport

Incident Date: 9/14/2004 Ironic Factor: *****

(IEEE Spectrum) -- It was an air traffic controller's worst nightmare. Without warning, on Tuesday, 14 September, at about 5 p.m. Pacific daylight time, air traffic controllers lost voice contact with 400 airplanes they were tracking over the southwestern United States. Planes started to head toward one another, something that occurs routinely under careful control of the air traffic controllers, who keep airplanes safely apart. But now the controllers had no way to redirect the planes' courses.

...

The controllers lost contact with the planes when the main voice communications system shut down unexpectedly. To make matters worse, a backup system that was supposed to take over in such an event crashed within a minute after it was turned on. The outage disrupted about 800 flights across the country.

...

Inside the control system unit is a countdown timer that ticks off time in milliseconds. The VCSU uses the timer as a pulse to send out periodic queries to the VSCS. It starts out at the highest possible number that the system's server and its software can handle: 2^32, a number just over 4 billion milliseconds. When the counter reaches zero, the system runs out of ticks and can no longer time itself. So it shuts down.

Counting down from 2^32 to zero in milliseconds takes just under 50 days. The FAA procedure of having a technician reboot the VSCS every 30 days resets the timer to 2^32 almost three weeks before it runs out of digits.
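The figures in this passage can be checked with a little arithmetic. The sketch assumes only what the article states (a millisecond countdown starting at 2 to the power of 32 and a 30-day reboot schedule); the variable names are illustrative.

```python
from datetime import timedelta

TICKS = 2 ** 32                               # countdown start, in milliseconds
lifetime = timedelta(milliseconds=TICKS)      # time until the counter hits zero
reboot_interval = timedelta(days=30)          # maintenance schedule from the article
margin = lifetime - reboot_interval           # slack left by a timely reboot

print(TICKS)                                  # 4294967296
print(round(lifetime / timedelta(days=1), 2)) # 49.71
print(round(margin / timedelta(days=1), 2))   # 19.71
```

A missed reboot therefore leaves a little under 20 days before the countdown expires.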

Article [Local copy]

Northeast Blackout

Incident Date: 8/14/2003 Price Tag: $7 - $10 Billion Ironic Factor: **

NEW YORK (AP) - A programming error has been identified as the cause of alarm failures that might have contributed to the scope of last summer's Northeast blackout, industry officials said Thursday.

... The failures occurred when multiple systems trying to access the same information at once got the equivalent of busy signals, he said. The software should have given one system precedence.

With the software not functioning properly at that point, data that should have been deleted were instead retained, slowing performance, he said. Similar troubles affected the backup systems.
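The general pattern described here, several parties touching the same data at once, is usually handled with mutual exclusion. The following is a generic sketch of that idea, not a reconstruction of the utility's alarm software; the class and function names are hypothetical.

```python
import threading

class SharedCounter:
    """Hypothetical shared record updated by several threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def increment(self):
        # The lock gives one thread precedence; the others wait their turn,
        # so the read-modify-write below is never interleaved.
        with self._lock:
            current = self.value
            self.value = current + 1

def run_workers(n_threads=4, n_increments=10_000):
    counter = SharedCounter()

    def worker():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value

print(run_workers())   # 40000
```

Without the lock, two threads can read the same old value and one update is lost, which is the kind of fault that only shows up under simultaneous access.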

News Release [local copy], Cost Estimate [local copy]

NASA Mars Climate Orbiter

Incident Date: 9/23/1999 Price Tag: $125 million Ironic Factor: ****

It was a mathematical mismatch that was not caught until after the $125-million spacecraft, a key part of NASA's Mars exploration program, was sent crashing too low and too fast into the Martian atmosphere. The craft has not been heard from since.

...

Noel Hinners of Lockheed Martin Astronautics, the prime contractor for the Mars craft, said at a news conference it was up to his company's engineers to assure the metric system used in one computer program was compatible with the English system used in another program. The simple conversion check was not done, he said.
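One common defence against this kind of mismatch is to carry the unit along with the number, so a conversion cannot be skipped silently. The sketch below assumes the quantity at issue was thruster impulse, reported in pound-force seconds where newton-seconds were expected, as the mishap investigation described; the Impulse class is hypothetical and is not how the flight software was structured.

```python
from dataclasses import dataclass

# 1 lbf is defined as exactly 4.4482216152605 N.
LBF_S_TO_N_S = 4.4482216152605

@dataclass(frozen=True)
class Impulse:
    value: float
    unit: str   # "N*s" or "lbf*s"

    def to_newton_seconds(self):
        """Return the impulse in newton-seconds, refusing unknown units."""
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_S_TO_N_S
        raise ValueError(f"unknown impulse unit: {self.unit!r}")

reported = Impulse(1.0, "lbf*s")
print(reported.to_newton_seconds())   # 4.4482216152605
```

Reading the raw number as if it were already metric would understate it by a factor of about 4.45.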

Article [local copy]

Denver Airport Baggage-handling System

Incident Date: 11/1993 - 6/1994 Price Tag: > $200 million Ironic Factor: *

(Scientific American) -- Scheduled for takeoff by last Halloween (1993), the airport's grand opening was postponed until December to allow BAE Automated Systems time to flush the gremlins out of its $193-million system. December yielded to March. March slipped to May. In June the airport's planners, their bond rating demoted to junk and their budget hemorrhaging red ink at the rate of $1.1 million a day in interest and operating costs, conceded that they could not predict when the baggage system would stabilize enough for the airport to open.

Local copy

Therac-25 Radiation Overdoses

Incident Date: 6/1985 - 1/1987 Price Tag: three human lives Ironic Factor: *

(Nancy Leveson, U. of Washington) Between June 1985 and January 1987, a computer-controlled radiation therapy machine, called the Therac-25, massively overdosed six people. These accidents have been described as the worst in the 35-year history of medical accelerators.

...

(Barbara Wade Rose) ... It turned out that both Yakima accidents, as well as the one at Hamilton, had been caused by another software error -- different from the Malfunction 54. On the Therac-25, the part of the computer program that is often referred to as the "house-keeper task" continuously checked to see whether the turntable was correctly positioned. A zero on the counter indicated to the technician that the turntable was in the correct position. Any value other than zero meant that it wasn't, and that treatment couldn't begin. The computer would then make the necessary corrections and the counter would reset itself to zero.

But the highest value the counter could register was 255. If the program reached 256 checks, the counter automatically clicked back to zero, the same way that a car odometer turns over to zero after you've driven more than 99,999.99 kilometres. For that split second, the Therac-25 believed it was safe to proceed when, in fact, it wasn't. If the technician hit the "set" button to begin treatment at that precise moment, the turntable would be in the wrong position and the patient would be struck by a raw beam.
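The wraparound is easy to reproduce. The following is a simplified model, not the Therac-25 code: it assumes the turntable is never correctly positioned, so an honest flag should never read zero, and it contrasts incrementing an 8-bit flag with assigning it a fixed non-zero value. The function names are illustrative.

```python
def ready_passes_buggy(n_passes):
    """Passes on which an incremented 8-bit flag reads 0 ("safe") by accident."""
    flag = 0
    ready = []
    for p in range(1, n_passes + 1):
        flag = (flag + 1) & 0xFF      # 8-bit arithmetic: 255 + 1 wraps to 0
        if flag == 0:
            ready.append(p)
    return ready

def ready_passes_fixed(n_passes):
    """Same loop, but the flag is set to a constant instead of incremented."""
    flag = 0
    ready = []
    for p in range(1, n_passes + 1):
        flag = 1                      # any fixed non-zero value cannot wrap
        if flag == 0:
            ready.append(p)
    return ready

print(ready_passes_buggy(600))   # [256, 512]
print(ready_passes_fixed(600))   # []
```

In the buggy version every 256th pass briefly reports a safe state, which matches the split-second window the article describes.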

Article 1 | Article 2

USS Yorktown Incident

Incident Date: 9/1997 Ironic Factor: ****

(Government Computer News) The Navy's systems chief has begun an investigation into the computer failure that left the Aegis cruiser USS Yorktown dead in the water for several hours last fall.

...

On Sept. 21, 1997, the Yorktown experienced what the Navy called "an engineering LAN casualty" [GCN, July 13, Page 1]. A systems administrator fed bad data into the ship's Remote Database Manager, which caused a buffer overflow when the software tried to divide by zero. The overflow crashed computers on the LAN and caused the Yorktown to lose control of its propulsion system, Navy officials said.
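A standard defence is to validate operator-entered data before it reaches any arithmetic. The sketch below is a generic illustration under that assumption; the article does not describe the actual calculation, so the function names and the "consumption rate" example are hypothetical.

```python
import math

def parse_positive(field):
    """Parse an operator-entered field; reject blanks, non-numbers and zero."""
    value = float(field)   # raises ValueError for blank or non-numeric text
    if not (math.isfinite(value) and value > 0):
        raise ValueError(f"expected a positive number, got {field!r}")
    return value

def consumption_rate(total, elapsed_field):
    # Hypothetical calculation: because the divisor is validated first,
    # the division below can never see a zero.
    return total / parse_positive(elapsed_field)

print(consumption_rate(10.0, "4"))   # 2.5
```

Rejecting the bad entry at the point of input keeps the failure local instead of letting it propagate to other systems.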

Article [local copy]

Ariane 5 Explosion

Incident Date: 6/4/1996 Price Tag: $500 million Ironic Factor: ****

(By James Gleick) It took the European Space Agency 10 years and $7 billion to produce Ariane 5, a giant rocket capable of hurling a pair of three-ton satellites into orbit with each launch and intended to give Europe overwhelming supremacy in the commercial space business.

All it took to explode that rocket less than a minute into its maiden voyage last June, scattering fiery rubble across the mangrove swamps of French Guiana, was a small computer program trying to stuff a 64-bit number into a 16-bit space.

...

This shutdown occurred 36.7 seconds after launch, when the guidance system's own computer tried to convert one piece of data -- the sideways velocity of the rocket -- from a 64-bit format to a 16-bit format. The number was too big, and an overflow error resulted. When the guidance system shut down, it passed control to an identical, redundant unit, which was there to provide backup in case of just such a failure. But the second unit had failed in the identical manner a few milliseconds before. And why not? It was running the same software.
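The conversion problem can be illustrated with a range check. The sketch below is a Python stand-in, not the rocket's flight code: to_int16 raises an error when the value does not fit, which mirrors the overflow described above, and to_int16_saturating shows clamping as one possible way to protect such a conversion. Both function names and the choice of clamping are assumptions for illustration.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def to_int16(x):
    """Convert a float to a signed 16-bit integer, failing loudly on overflow."""
    value = int(x)   # truncates toward zero
    if not INT16_MIN <= value <= INT16_MAX:
        raise OverflowError(f"{x} does not fit in a signed 16-bit integer")
    return value

def to_int16_saturating(x):
    """Convert a float to a signed 16-bit integer, clamping out-of-range values."""
    return max(INT16_MIN, min(INT16_MAX, int(x)))

print(to_int16(1234.9))               # 1234
print(to_int16_saturating(40000.0))   # 32767
```

Whether clamping, raising, or some other response is appropriate depends on what the downstream code does with the value; the point is that the out-of-range case has to be decided on purpose.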

Article

A List of Security Bugs

(12/17/08) Microsoft issues emergency security updates. Comments: A time-of-check-to-time-of-use (TOCTTOU) bug; looks like the code is not thread safe.
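For readers unfamiliar with the term, a TOCTTOU bug arises when a program checks a condition and then acts on it, while something else can change the condition in between. The sketch below is a generic file-system illustration of the pattern and has no connection to the code fixed in the Microsoft update; the function names are hypothetical.

```python
import os
import tempfile

def create_if_absent_racy(path, data):
    # Check, then act: another process or thread can create `path`
    # between the exists() call and the open() call.
    if os.path.exists(path):
        return False
    with open(path, "w") as f:
        f.write(data)
    return True

def create_if_absent_atomic(path, data):
    # Mode "x" creates the file only if it does not already exist,
    # so the check and the creation happen as a single step.
    try:
        with open(path, "x") as f:
            f.write(data)
        return True
    except FileExistsError:
        return False

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example.txt")
    print(create_if_absent_atomic(target, "first"))    # True
    print(create_if_absent_atomic(target, "second"))   # False
```

The usual remedy is the same in other settings: merge the check and the use into one atomic operation, or hold a lock across both.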

Links to Related Webpages

Thomas Huckle's Collection of Software Bugs

National Vulnerability Database

Security Focus

Jonathan Jacky's Safety-Critical Computing Page

They Write the Right Stuff

Software [In]security: Software Security Demand Rising

Last updated: 8/26/2016

Author : Gang Tan, Penn State University

Please send comments to gtan AT cse dot psu dot edu