Software started killing people through the machines it controlled as early as 1985.

A standard one-time therapeutic dose of radiation is up to 200 rads.

A whole-body dose of 1,000 rads can be lethal, and the rebellious machine was burning defenseless patients with around 20,000 rads.

Let's look into the case of a system error, arguably the worst software bug in history, that resulted from incremental yet uncoordinated software improvements.

In the Therac-25, the hardware interlocks were removed, and their safety functions were handed over to the software.

In this article, we will talk about how the investigation went and what lessons IT engineers, programmers, and testers should learn from this story so that nothing like it happens again.

The murderer

The Therac-25 is a radiation therapy machine, a medical linear accelerator produced by Atomic Energy of Canada Limited (AECL).

The plan of the facility is shown in the figure below.

And here's a commercial for housewives.

The murder

Between June 1985 and January 1987, this machine caused six radiation-overdose accidents, in which some patients were exposed to tens of thousands of rads. At least two patients died as a direct consequence of the overdoses.

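The operator in one of the Tyler, Texas, accidents recalled that she had typed 'x' (X-ray mode) by mistake that day and quickly corrected it to 'e' (electron mode). It was later found that making this edit quickly enough reproduced the overdose almost every time.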

The investigation

While prosecuting the cases against AECL, the Smith County District Attorney's office in Tyler, Texas, asked Nancy Leveson (a Computer Science professor at the University of California, Irvine, at the time) to assist as an expert in the investigation. She has made a considerable contribution to system and software safety. Leveson and Clark Turner spent three years collecting the materials and reconstructing the events related to the Therac-25 accidents. This is an important result, since for most safety incidents the available information turns out to be incomplete, inconsistent, or incorrect.

AECL built three versions of its machine: the Therac-6, Therac-20, and Therac-25. The Therac-6 and Therac-20 were manufactured in partnership with CGR, a French company. The partnership was dissolved before the Therac-25 was designed, but both companies kept access to the designs and source code of the earlier models.

The Therac-20 codebase was developed from that of the Therac-6. All three machines used a PDP-11 computer. The Therac-6 and Therac-20 didn't need that computer, though: both were designed to operate as standalone devices. In manual mode, a radiotherapy technician would set up the various parts of the machine by hand, including the turntable, which placed one of three devices in the path of the electron beam.

In electron mode, scanning magnets spread the beam out to cover a larger area. In X-ray mode, a target was placed in the electron beam, and the electrons striking the target produced X-ray photons directed at the patient. Finally, a mirror could be placed in the beam path. The electron beam would never switch on while the mirror was in place. The mirror reflected light that helped the radiotherapy technician aim the machine precisely.

On the Therac-6 and Therac-20, hardware interlocks prevented the operator from doing something dangerous, such as selecting a high-power electron beam without the X-ray target in place.

Attempting to activate the accelerator in an invalid mode would blow a fuse, bringing everything to a halt. The PDP-11 and associated hardware were added as a convenience. The technician could enter a prescription on a VT-100 terminal, and the computer would use servos to position the turntable and other devices.

Hospitals loved the fact that the computer was faster at setup than a human. Less setup time meant more patients per day.

When it came time to design the Therac-25, AECL decided to go with computer control only. Not only did they remove many of the manual controls, they also removed the hardware interlocks. The computer alone would keep track of the machine setup and shut things down if it detected a dangerous situation.

Well, well...

At least four bugs were found in the Therac-25 software that could cause radiation overdose.

One shared variable was used both for analyzing input values and tracking turntable position. Quickly entering the data on the terminal could, therefore, result in leaving the turntable in the wrong position (race condition).

It took about 8 seconds for the bending magnets to move into place. If the operator changed the beam type and power within that time and moved the cursor back to the final position, the system would not detect those changes.

In some cases, dividing by the variable that controlled the beam power caused a division-by-zero error, which drove the power up to the largest value possible.

A (one-byte) Boolean variable was set to "true" with the "x=x+1" command rather than by assigning a constant, so it overflowed back to zero ("false") on every 256th increment. If the operator pressed the "Set" button at that moment, the system failed to detect the incorrect turntable position, which happened 1 time out of 256.
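The rollover is easy to reproduce. Below is a minimal Python sketch, not the original code (the Therac-25 software was not written in Python, and the names `increment_byte`, `count_missed_checks`, and `set_flag_safely` are made up for illustration). It emulates a one-byte variable with a bit mask and counts how often the "true" flag reads as zero.

```python
def increment_byte(value: int) -> int:
    """Emulate `x = x + 1` on an unsigned one-byte variable (wraps at 256)."""
    return (value + 1) & 0xFF


def count_missed_checks(passes: int) -> int:
    """Count the passes on which the flag reads as zero ('false') after being 'set'."""
    flag = 0
    missed = 0
    for _ in range(passes):
        flag = increment_byte(flag)  # intended meaning: flag = True
        if flag == 0:                # rollover: the problem now looks cleared
            missed += 1
    return missed


def set_flag_safely(_previous: int) -> int:
    """Safer alternative: assign a constant, so the flag can never wrap to zero."""
    return 1


print(count_missed_checks(256))   # 1
print(count_missed_checks(1024))  # 4
```

Assigning a fixed non-zero value, as in `set_flag_safely`, removes the overflow entirely, because the result no longer depends on how many times the flag has been set before.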

A number of potential sources of bugs were also found; for example, the multitasking operating system provided no synchronization for data shared between tasks.
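The data-entry race can be modeled without real threads. The following Python sketch is a simplified, deterministic illustration under stated assumptions, not a reconstruction of the Therac-25 code: the `Prescription` class, its `version` counter, and both setup functions are hypothetical. The unsafe variant configures the hardware from a snapshot and ignores edits made while the slow setup is running; the checked variant re-reads the data until no edit has slipped in.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prescription:
    mode: str         # "x" for X-ray, "e" for electron
    version: int = 0  # hypothetical counter, bumped on every operator edit

    def edit(self, mode: str) -> None:
        self.mode = mode
        self.version += 1


def setup_unsafe(rx: Prescription, edits: List[str]) -> Tuple[str, str]:
    """Configure hardware from a snapshot; edits made during setup are lost."""
    hardware_mode = rx.mode          # snapshot taken when the slow setup starts
    for mode in edits:               # operator keeps typing while magnets move
        rx.edit(mode)
    return hardware_mode, rx.mode    # (what the machine does, what the screen shows)


def setup_checked(rx: Prescription, edits: List[str]) -> Tuple[str, str]:
    """Repeat the setup until the prescription did not change while it ran."""
    pending = list(edits)
    while True:
        seen = rx.version
        hardware_mode = rx.mode
        while pending:               # edits arriving during this setup pass
            rx.edit(pending.pop(0))
        if rx.version == seen:       # nothing changed: hardware matches screen
            return hardware_mode, rx.mode


print(setup_unsafe(Prescription("x"), ["e"]))   # ('x', 'e')
print(setup_checked(Prescription("x"), ["e"]))  # ('e', 'e')
```

In the unsafe run the screen shows electron mode while the hardware is still configured for X-ray, which is the kind of mismatch the article describes. A real system would also need proper locking between tasks; the version check only illustrates the idea of re-validating before acting.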

Fixes

All interruptions related to the dosimetry system would now suspend the treatment instead of merely pausing it. Operators could not restart the machine without reentering all parameters.

A software single-pulse shutdown was added.

An independent hardware single-pulse shutdown was added.

Cryptic malfunction messages were replaced with meaningful messages and dose-rate messages were displayed on the monitor.

A potentiometer was added to monitor the turntable location.

A motion-enable footswitch (deadman switch) was added so that the turntable and other parts of the machine could move only while the operator was holding this switch closed.

In X-ray mode, interlocking with the 270-degree bending magnet was added to ensure that the target and beam flattener were in position.
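The idea behind several of these fixes is that the beam is permitted only when an independent measurement agrees with the selected mode. Here is a minimal Python sketch of such a check. All names and numbers in it (`EXPECTED_POSITION`, `TOLERANCE`, the sensor readings) are invented for illustration and do not come from the actual machine.

```python
# Hypothetical nominal sensor readings for each turntable position.
EXPECTED_POSITION = {"xray": 0.0, "electron": 120.0, "light": 240.0}
TOLERANCE = 1.0  # hypothetical allowed deviation, in the same units


def beam_permitted(mode: str, sensor_reading: float) -> bool:
    """Allow the beam only if the independently measured position matches the mode."""
    expected = EXPECTED_POSITION.get(mode)
    if expected is None or mode == "light":
        return False  # unknown mode, or the field-light mirror is in the beam path
    return abs(sensor_reading - expected) <= TOLERANCE


def motion_permitted(footswitch_closed: bool) -> bool:
    """Dead-man rule: parts may move only while the operator holds the switch."""
    return footswitch_closed


print(beam_permitted("xray", 0.4))    # True
print(beam_permitted("xray", 120.0))  # False: turntable is in electron position
```

The check defaults to refusing the beam: an unknown mode or a reading outside the tolerance returns `False`, so a failure of the software to recognize the state cannot turn into permission to fire.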

Complete list of fixes in English:

Source: Nancy G. Leveson, Therac-25 Accidents

The manufacturer said that the hardware and software had been tested over many years. However, the investigation found that only minimal testing had been done on a simulator, while most of the effort had gone into testing the integrated system. In other words, the developers neglected unit testing and relied almost entirely on integration testing.

A naive assumption is often made that reusing software or using commercial off-the-shelf software increases safety because the software has been exercised extensively. Reusing software modules does not guarantee safety in the new system, because the modules were written for the design and assumptions of the original one. Rewriting the software from scratch may be safer in many cases.

In this case, the manufacturer chose to reuse program code from the Therac-6 and Therac-20, even though the Therac-6 produced X-rays only and had no electron mode at all, while the Therac-20 was protected by hardware interlocks.

Since the Therac-25 events, the FDA has changed its attitude to many of the issues involving safety-critical systems and has moved to improve its reporting system and to extend its procedures and guidelines to cover software. It was an important lesson not only for the FDA, but for everyone who builds safety-critical systems.

Additional resources on the Therac-25 and related accidents

Conclusion

According to the Software Engineering Institute's data, there is an average of 1 bug per 100 lines of code, and 98% of device malfunctions caused by software bugs could have been averted through proper testing. Now that I know this, I feel like joining the "let me see the code" movement. Sure, measures were taken after all those big incidents, but I wouldn't want to go to the dentist one day and be treated with a drill whose angular velocity is controlled by a variable with "just one extra zero" added by mistake. Dear testers (as well as programmers and developers), please do your job properly.

UPD

The University of California, Berkeley: Computer Science 61A — Lecture 35: Therac-25

http://www.infocobuild.com/education/audio-video-courses/computer-science/CS61A-Spring2011-Berkeley/lecture-35.html

This article was originally published (in Russian) on habrahabr.ru. The original and translated versions were posted on our blog with the permission of the author.