A failing computer system is an inconvenience in some settings and a catastrophe in others. No one likes it when their computer crashes in the middle of an overnight render, but the cost of that delay is nothing compared to a failure in a mission-critical environment, such as a hospital, a satellite, or, in the case of Curiosity, another planet entirely. These extraordinary settings have led to specialized computer systems designed to work in high-demand situations, often for years on end. They are, not surprisingly, some of the most interesting computer systems out there.

Healthcare

Hospital equipment, first and foremost, must be immune to fluctuations in mains power. When the mains fail, static transfer switches can move the load to backup batteries within a single cycle of 60Hz power (about 17 milliseconds), and provide seamless emergency power for the few seconds it takes the generators to start. Inductive spikes from switching compressor motors also need to be filtered or blocked with isolation transformers. But these devices do not protect against glitches occurring at more localized levels, which can reset instruments, often freezing or restarting them in an unpredictable state.
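To put rough numbers on that hand-off, here is a minimal back-of-the-envelope sketch in Python. The 20 kW load and the 5-second generator start time are hypothetical figures chosen only for illustration; they do not come from any particular hospital or product.

```python
MAINS_HZ = 60.0  # mains frequency mentioned in the text


def cycle_period_ms(frequency_hz):
    """Duration of one AC cycle in milliseconds."""
    return 1000.0 / frequency_hz


def bridge_energy_wh(load_watts, seconds):
    """Energy (watt-hours) the batteries must supply while generators start."""
    return load_watts * seconds / 3600.0


# One 60Hz cycle is the window the transfer switch has to work in.
print(f"one cycle: {cycle_period_ms(MAINS_HZ):.2f} ms")

# Hypothetical example: a 20 kW load bridged for 5 seconds.
print(f"bridge energy: {bridge_energy_wh(20_000, 5):.1f} Wh")
```

The point of the arithmetic is that the batteries only have to carry the load briefly, so the hard part is the speed of the switch rather than the amount of stored energy.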

Precision robots that assist surgeons during operations are large integrated systems, often with several computers managing high-speed FPGAs (field-programmable gate arrays) and DSPs (digital signal processors). The Da Vinci surgical system has 7 degrees of freedom and at least as many servo controllers. Because the many subsystems have different supply requirements, the overall system ends up with tight voltage supply specs and needs sufficient buffering against fluctuations that exceed a few percent.

An even more insidious threat to medical tablet computers and the Windows PC-based systems found in hospital equipment is infection with conventional malware. Due to strict regulations, manufacturers frequently cannot allow OS security patches or updates, leaving many computers vulnerable. As a result, many infected instruments run slower, and many others are shelved altogether.

In space

Computers in spacecraft have their own unique set of challenges. In addition to requiring real-time, low-latency control for maneuvering and communication, spacecraft also need to be hardened against the effects of cosmic rays and other forms of radiation. Shielding is simple and effective but prohibitively heavy, so efforts have focused on making the chips themselves resistant to radiation.

FPGAs are frequently found in spacecraft due to their speed and computational efficiency in performing tasks like fast Fourier transforms and beam-forming in communications. They do not run code the way a microcontroller does; instead, a particular computation is written directly into their logic gates. FPGAs based on SRAM can be reconfigured at will; however, the same technology that makes this possible also makes them vulnerable to radiation. A charge deposited on a control structure like a transistor can induce it to momentarily change state. If it is part of a persistent circuit like a flip-flop or RAM cell, the change can be permanent. One-time programmable FPGAs based on anti-fuse technology are much more resistant but can cost over ten times as much initially.
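The difference between a momentary glitch and a lasting one can be shown with a toy model. The sketch below is plain Python, not a hardware description; the `combinational_and` function and the `FlipFlop` class are hypothetical names invented for this illustration, and the "upset" is simply a forced bit flip standing in for a radiation strike.

```python
def combinational_and(a, b, upset=False):
    """Stateless logic: the output is recomputed from its inputs every cycle.
    An upset inverts the output for that one cycle only."""
    out = a & b
    return out ^ 1 if upset else out


class FlipFlop:
    """Stateful storage: holds its bit until something rewrites it."""

    def __init__(self, value=0):
        self.q = value

    def clock(self, d=None):
        # With no new data presented, the stored bit is simply held.
        if d is not None:
            self.q = d
        return self.q

    def upset(self):
        # Model a radiation strike as a flip of the stored bit.
        self.q ^= 1


# Upset during cycle 1 of a stateless gate: the error disappears next cycle.
transient = [combinational_and(1, 1, upset=(cycle == 1)) for cycle in range(4)]

# Upset during cycle 1 of a flip-flop: the wrong value is held from then on.
ff = FlipFlop(1)
held = []
for cycle in range(4):
    if cycle == 1:
        ff.upset()
    held.append(ff.clock())

print(transient)  # [1, 0, 1, 1]
print(held)       # [1, 0, 0, 0]
```

In an SRAM-based FPGA the stored bits include the configuration itself, which is why an upset there can silently change what the circuit computes rather than just one result.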

The microprocessors and memory in spacecraft also need to be able to withstand radiation. The RAD750 PowerPC in Curiosity’s flight computer was designed to survive for 15 years before intervention is required from Earth. These chips are slower than some newer systems, but NASA has been using them successfully for some time and likes to stay with what works. Curiosity’s computers run a real-time operating system, VxWorks, which NASA has used on two previous rovers. The VxWorks microkernel is reportedly better optimized for minimal interrupt and thread switching latency than the monolithic RTLinux kernel, although it is not as fast overall.

A final consideration for failure-resistant computers in spacecraft is that if they do encounter significant trouble, they should have provision to be rebooted, either remotely or by some on-board mechanism. With proper attention to input power fluctuations, interrupt latency (and jitter, its variation over time), and external insults such as radiation, computer systems can be designed to be arbitrarily robust.
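One common form of on-board reboot mechanism is a watchdog timer: healthy software must check in periodically, and if it stops, the timer forces a restart. The sketch below is a minimal software model in Python, assuming a hypothetical `Watchdog` class and a 2-second timeout picked for illustration. It is not how Curiosity or any particular spacecraft implements it; real watchdogs are usually independent hardware.

```python
class Watchdog:
    """Toy model of a watchdog timer (illustrative only, not flight code)."""

    def __init__(self, timeout_s, on_expire):
        self.timeout_s = timeout_s  # longest allowed silence, in seconds
        self.on_expire = on_expire  # action taken when the software hangs
        self.last_kick = 0.0

    def kick(self, now):
        """Called by healthy software to say 'still alive'."""
        self.last_kick = now

    def poll(self, now):
        """Called by the timer; fires the reboot action after too long a silence."""
        if now - self.last_kick > self.timeout_s:
            self.on_expire()
            self.last_kick = now  # restart the countdown after the reboot
            return True
        return False


reboots = []
wd = Watchdog(timeout_s=2.0, on_expire=lambda: reboots.append("reboot"))

wd.kick(now=0.0)   # software checks in
wd.poll(now=1.0)   # within the timeout, nothing happens
wd.poll(now=3.5)   # no check-in for 3.5 s, so the watchdog fires
print(reboots)     # ['reboot']
```

Time is passed in explicitly here so the behavior is easy to follow; a real system would read a hardware clock instead.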

The examples above represent only a small portion of the range of mission-critical systems in service every day. Other real-time operating systems like QNX, now owned by RIM, can be found in drones and military vehicles like the Crusher tank (pictured above). Recently the security of the Boeing 747 engine control system running Solaris was questioned; engineers on the ground could apparently access control systems for re-tuning en route. In this case, secure protocols like SSH were not compatible with parts of the existing software, and insecure Telnet was still being used.

As the tentacles of the internet weave more intimately into military and civilian infrastructure, new concerns will present themselves, and increasing vigilance will be required to keep computer systems fail-safe.

