Supercomputer Testing at the ICE House

On the ground as in the air, the rate of SEUs is proportional to the number of transistors in a computing system. Thus, the supercomputers used for nuclear weapons simulations and other national security challenges, which contain thousands of microchips each containing many millions of transistors, are big targets for SEUs—even though the neutron flux on the ground is hundreds of times lower than at aircraft cruising altitudes.

Cray-1 Supercomputer

Case in point: During 1976, LANL was given a 6-month free trial of the Cray-1 computer, one of the first "supercomputers" (at 80 megaflops, or 80 million floating-point operations per second) and the first Cray design to use integrated circuits. LANL kept close track of its reliability and discovered 152 bit flips in the memory units over the 6-month trial period (900 hours of running time), or roughly one every 6 hours. The cause was unknown, but LANL became a key player in monitoring computer errors and advising on error-correcting codes to solve the problem. Much later (2010), measurements at the ICE House strongly suggested that the cause of those early Cray-1 bit flips was SEUs induced by atmospheric neutrons. Thus, in retrospect, they became the first recorded SEUs on Earth. (SEUs had first been detected in satellites.)
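Error-correcting codes of the kind LANL advised on work by storing redundant parity bits alongside the data, so that a single flipped bit can be located and repaired. A minimal Python sketch using a textbook Hamming(7,4) code illustrates the idea (this is for illustration only, not the scheme actually used in Cray or LANL hardware):

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit Hamming codeword.
    Bit positions 1..7; parity bits sit at positions 1, 2, and 4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(c):
    """Recompute the parities; the syndrome is the 1-based position of
    a single flipped bit (0 means no error). Returns a corrected copy."""
    c = c[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1   # flip the erroneous bit back
    return c

word = hamming74_encode([1, 0, 1, 1])
hit = word[:]
hit[4] ^= 1                    # simulate an SEU flipping one bit
assert hamming74_correct(hit) == word
```

At the cost of three extra bits per four data bits, any single upset in a codeword is corrected transparently; modern ECC memory applies the same principle at the scale of 64-bit words.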

Q Supercomputer

In 2002, a similar situation occurred early in the deployment of LANL's Q supercomputer, which in June 2003 was ranked the world's second-fastest supercomputer at nearly 14 teraflops (14 trillion floating-point operations per second). An unexpectedly high number of crashes were traced to bit flips in a memory unit supporting the processors. Neutron involvement was suspected, in part because LANL sits at a 7,200-foot elevation, where the neutron intensity is several times higher than at sea level. Bit-flip rates measured during ICE House tests and analyzed by LANL statistician Sarah Michalak and colleagues were consistent with the error rates observed in Q in the field. Mitigation strategies were developed, allowing scientists to use Q successfully for state-of-the-art scientific calculations and simulations that help ensure the safety and reliability of the nation's nuclear weapons stockpile.

Roadrunner

In LANL's Roadrunner, the first petaflop supercomputer (a petaflop is 1,000 trillion floating-point operations per second), much of the hardware has built-in protection from SEUs. However, the protection is not perfect, and two concerns remain: SEU-induced crashes, which halt a calculation, and silent data corruption, in which an undetected error causes the system to deliver incorrect results. The latter errors are termed "silent" because an undetected error produces no error message to alert a user. Michalak supervised ICE House testing of the Triblade compute servers used for computation in Roadrunner. Based on the results, the Roadrunner platform is predicted to experience one neutron-induced crash roughly every 130 hours of operation and one neutron-induced silent data corruption roughly every 1,100 hours.
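Figures like "one crash every 130 hours" translate directly into expectations for a long run. Treating the upsets as Poisson events, a short calculation shows what a hypothetical 500-hour simulation (the run length is an assumption for illustration) would face, using the mean times quoted above:

```python
import math

# Mean times between neutron-induced events on Roadrunner, from the
# ICE House predictions quoted in the text.
MTBF_CRASH = 130.0    # hours per neutron-induced crash
MTBF_SDC = 1100.0     # hours per silent data corruption
run_hours = 500.0     # hypothetical long simulation (assumed figure)

# For Poisson events, expected counts scale linearly with run time:
exp_crashes = run_hours / MTBF_CRASH
exp_sdc = run_hours / MTBF_SDC

# Probability of finishing with no silent corruption at all:
p_clean = math.exp(-run_hours / MTBF_SDC)
print(f"expected crashes: {exp_crashes:.2f}")
print(f"expected silent corruptions: {exp_sdc:.2f}")
print(f"chance of no silent corruption: {p_clean:.1%}")
```

A run of that length would expect nearly four crashes, so recovery machinery such as the checkpoint-restart practice described below is not optional at this scale.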

SEU Mitigation

The impact of silent data corruption on large simulations will most likely be small, especially because extensive calculations are used to verify and validate the codes that inform decision making. A LANL team led by Nathan DeBardeleben is investigating silent data corruption by deliberately injecting SEU-type errors and tracing how applications respond to these anomalies. The results will guide the development of software that is more resilient to SEUs and other types of errors.
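The essence of this kind of fault injection can be sketched in a few lines: flip one bit of a number's binary representation and observe how the computation responds. This toy Python example (my own illustration, not LANL's actual tooling) shows why some upsets are nearly harmless while others are catastrophic, and why all of them are silent:

```python
import struct

def flip_bit(x, bit):
    """Flip one bit of a float's 64-bit IEEE-754 representation,
    mimicking a single-event upset in a memory word or register."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y

# A flip in a low mantissa bit is a tiny, easily missed perturbation;
# a flip in an exponent bit changes the value by hundreds of orders of
# magnitude. Either way, no exception is raised -- the error is silent.
print(flip_bit(1.0, 0))    # barely different from 1.0
print(flip_bit(1.0, 61))   # vastly smaller than 1.0
```

Injecting such flips into a running application and comparing its output against a clean run reveals which data structures and phases of a code are most sensitive, guiding where resilience effort pays off.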

The effects of crashes are typically mitigated by a practice called checkpoint-restart. At various "checkpoints" during a calculation, the state of the computer is stored; whenever a crash occurs, the calculation is halted, the data from the most recent checkpoint are loaded, and the calculation is restarted from that point. To reduce the time needed to store checkpoint data, Gary Grider is leading a LANL effort to develop a technique that uses flash memory to store checkpoint data very rapidly during the calculation and then slowly transfers that data to the parallel file system while the calculation proceeds independently. This technique should be deployed in supercomputers in the next few years.
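Checkpoint-restart can be illustrated with a minimal single-process sketch. Here in-memory pickle snapshots stand in for the parallel file system, and a random "crash" stands in for a neutron-induced one; none of this is LANL's actual implementation:

```python
import pickle
import random

def run_with_checkpoints(total_steps, checkpoint_every, crash_prob=0.0):
    """Advance a toy iterative calculation, snapshotting its state
    periodically; on a (simulated) crash, reload the last snapshot
    and resume from there instead of starting over."""
    state = {"step": 0, "value": 0.0}
    snapshot = pickle.dumps(state)          # initial checkpoint
    while state["step"] < total_steps:
        if random.random() < crash_prob:    # simulated SEU-induced crash
            state = pickle.loads(snapshot)  # restart from last checkpoint
            continue
        state["value"] += state["step"]     # one step of "real" work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            snapshot = pickle.dumps(state)  # store a new checkpoint
    return state

final = run_with_checkpoints(1000, checkpoint_every=50, crash_prob=0.01)
print(final["step"], final["value"])  # all 1000 steps complete despite crashes
```

Because recovery replays exactly the work lost since the last checkpoint, the final answer is the same as in a crash-free run; the cost of each crash is bounded by the checkpoint interval, which is why writing checkpoints quickly (as in the flash-memory technique described above) matters so much.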