All Circuits are Busy Now: The 1990 AT&T Long Distance Network Collapse

by Dennis Burke CSC440-01

November, 1995

California Polytechnic State University

On a Good Day

What Went Wrong

The Root Problem

C

break

if

switch

In pseudocode, the program read as follows:

 1  while (ring receive buffer not empty and side buffer not empty) DO
 2      Initialize pointer to first message in side buffer or ring receive buffer
 3      get copy of buffer
 4      switch (message)
 5          case (incoming_message):
 6              if (sending switch is out of service) DO
 7                  if (ring write buffer is empty) DO
 8                      send "in service" to status map
 9                  else
10                      break
                    END IF
11              process incoming message, set up pointers to optional parameters
12              break
            END SWITCH
13      do optional parameter work



When the destination switch received the second of the two closely timed messages while it was still busy with the first (buffer not empty, line 7), the program should have dropped out of the if clause (line 7), processed the incoming message, and set up the pointers to the database (line 11). Instead, because of the break statement in the else clause (line 10), the program dropped out of the case statement entirely and began doing optional parameter work, which overwrote the data (line 13). Error correction software detected the overwrite and shut the switch down while it reset. Because every switch contained the same software, the resets cascaded down the network, incapacitating the system.

Lesson Learned

There is still much to be learned from this incident, however. Clearly, the use of C programs and compilers contributed to the breakdown. A more structured programming language with stricter compilers would have made this particular defect much more obvious. The routine practice of allowing the long-distance switches to shut down and reset themselves also contributed. A more fault-tolerant hardware and software system that could handle minor problems without shutting down could have greatly reduced the effects of the defect. The final lesson is a positive one: thanks to AT&T's careful attention to hardware survivability and extensive testing, this is one of the few problems ever to impact their long-distance network so severely. While the break statement flaw could have been avoided with more thorough software engineering techniques, many more problems have already been deterred by the system in place. The AT&T long-distance system crash stands out not just as a software engineering example, but because of its rarity.

References

Gonzalez, D. & Rogers, M. (1990, January 29). Can we trust our software? Newsweek, pp. 70-73.

Joch, A. (1995, December). How software doesn't work. Byte, pp. 49-60.

McCarrel, T. & Witterman, P. (1990, January 29). Ghost in the machine. Time, pp. 58-59.

Neumann, P. G. (1995). Computer-related risks. New York: ACM Press.

O'Malley, C. (1990, September). When the computer fails. Popular Science, pp. 84-90.

Peterson, I. (1991, February 16). Finding fault. Science News, pp. 104-106.

Wiener, L. R. (1993). Digital woes: Why we should not depend on software. New York: Addison-Wesley.