In the early days of rocketry, when subsystem reliability was low, hard experience led designers to add redundancy for critical functions wherever they could. Redundancy comes at a cost: increased weight, increased complexity, unintended interactions, complex schemes to manage the redundancy, etc. Nowadays, subsystem reliability is much higher, especially for electronic parts. So today we have sophisticated discussions about dispensing with redundancy and living with single-string, high-reliability systems, even in critical areas.

Whenever I am engaged in one of these discussions, I remember STS-93, where a 2-cent screw and a 10-cent length of wire demonstrated the vulnerability of an otherwise highly reliable critical system. “Dualing computers” is not a misspelling; it is a safety concept.

Each Space Shuttle Main Engine (SSME) has a computer mounted right on its side to run the complicated functions required for safe operation of that very complex, high-energy device. Every SSME controller is made up of two redundant computers: DCU A and DCU B (DCU stands for Digital Computer Unit). The A computer is always in control while the B listens along, until and unless the A computer fails; then the B computer takes over. Each computer has its own way to control every valve and its own set of the instrumentation required to run the engine: pressures, temperatures, valve positions, turbine speeds.
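For readers who think in code, the A-primary, B-standby arrangement can be sketched in a few lines of Python. This is only an illustration of the concept as described above, not the actual controller software; the `Dcu` class, its `healthy` flag, and the `controlling_dcu` function are names made up for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dcu:
    """Hypothetical stand-in for one of the two redundant controller computers."""
    name: str
    healthy: bool = True


def controlling_dcu(a: Dcu, b: Dcu) -> Optional[Dcu]:
    """Return the unit in control: A while it is healthy, otherwise B."""
    if a.healthy:
        return a
    if b.healthy:
        return b
    return None  # both computers lost; nothing is left to run the engine


dcu_a = Dcu("DCU A")
dcu_b = Dcu("DCU B")

before_failure = controlling_dcu(dcu_a, dcu_b).name  # A is in control

dcu_a.healthy = False  # simulate A going silent
after_failure = controlling_dcu(dcu_a, dcu_b).name  # B has taken over
```

The point of the sketch is the ordering: B never takes control while A is working, and a single failure never leaves the engine without a computer.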

However, when both A and B are working, they share data. There is one pressure measurement for the main combustion chamber wired to DCU A and another wired to DCU B, and while both computers are working they make computations based on the average of the two chamber pressure measurements. If one of the computers fails, the other carries on but then has only one measurement to make calculations from, with no more averaging.
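The averaging scheme is simple enough to sketch in Python. Again, this is an illustration under my own assumptions, not the flight code: the function name `chamber_pressure`, the use of `None` to represent a measurement lost along with its computer, and the pressure values in the usage lines are all invented for the example.

```python
from typing import Optional


def chamber_pressure(reading_a: Optional[float],
                     reading_b: Optional[float]) -> Optional[float]:
    """Combine the A-side and B-side chamber pressure measurements.

    With both sides available the result is their average; with one side
    lost, the surviving measurement is used as-is.
    """
    available = [r for r in (reading_a, reading_b) if r is not None]
    if not available:
        return None  # no measurement left to compute from
    return sum(available) / len(available)


# Illustrative numbers only, not real engine data.
both_sides = chamber_pressure(3000.0, 3010.0)  # average of the two readings
b_side_only = chamber_pressure(None, 3010.0)   # A side lost, no more averaging
```

Notice what the single-measurement case implies: once one side is gone, any error in the surviving sensor passes straight into the calculation with nothing to dilute it.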

Almost all the telemetry that is sent from the engine to Mission Control comes from the A computer; if it fails, the B computer sends only a few data points, not nearly as many as the A side.

When STS-93 had its little problem, every engine kept working just fine even though two computers on two separate engines went silent. There was never another case of SSME computer loss in the entire suite of shuttle flights. These computers were highly reliable. The computers never failed because of an electronic part problem or a software error.

But in systems design, as in warfare, the defense has to be made at the most vulnerable point. For the SSME controllers this was the power source.

If the shuttle designers hadn’t built in redundancy, two of the three engines would have shut down just after liftoff. The results from that would not have been good. The crew has a procedure to run called “2 out First Stage”. It is one of those procedures that Capt. Young used to describe as “keeping busy while you wait to die.”

The next time someone tells me they have a highly reliable system that doesn’t need redundancy, I will remember STS-93. I hope you do too.