NASA's Curiosity Rover has had a historic week on the surface of Mars, executing a flawless landing on the Red Planet and firing up for its mission. But under the hood, the interplanetary explorer is powered by a pair of computers built by BAE Systems. They're called RAD750s. And it turns out that the radiation hardening they need to operate on Mars isn't all that different from the protection that some of today's largest supercomputers need to keep chugging along.

The RAD750 isn't much when measured by terrestrial PC metrics. It's a customized take on a 10-year-old IBM PowerPC chip design, and its 132 MHz clock speed would have been impressive around the time of the Windows 95 launch. It comes with just 120 megabytes of RAM. But like the other electronic components on Curiosity Rover, the RAD750 has one thing going for it: It's tough enough to withstand launch-time shaking, wild temperature fluctuations and levels of ionizing radiation that would fry the machine that you're using to read this story.

Curiosity Rover's RAD750s use specially built chips designed to survive one-off collisions with high-energy particles that can flip the electrical charge stored in the computer's memory, turning a 1 into a 0 or vice versa. And while the cosmic-ray problems that the exploration vehicle is facing are many times worse than anything you'd see here on Earth, they're also the kind of problem that chipmakers increasingly have to confront as they build smaller and smaller components that are used in very large clustered systems.

Researchers at Oak Ridge National Laboratory studied this bit-flipping phenomenon on their Jaguar supercomputer recently. They found that error-correcting code (ECC) – a technology that patches things up when cosmic rays, heat, or voltage fluctuations cause bits to flip – was being triggered more than 300 times per minute across the machine's 362 terabytes of memory, according to Al Geist, a research scientist with the laboratory. "The computer continued to run fine despite this continuous stream of errors because these bit flips were all corrected," he says.
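To put Geist's figure in perspective, a quick back-of-the-envelope calculation helps. The sketch below takes the two numbers quoted above (300 corrections per minute, treated as a lower bound, and 362 terabytes of memory) and converts them into corrections per day and per terabyte. The function and variable names are illustrative only; they are not anything used at Oak Ridge.

```python
# Back-of-the-envelope arithmetic on the Jaguar figures quoted in the article.
# Both constants come from the text; 300 is a lower bound ("more than 300").
CORRECTIONS_PER_MINUTE = 300
MEMORY_TERABYTES = 362


def corrections_per_day(per_minute):
    """Scale a per-minute correction count up to a full 24-hour day."""
    return per_minute * 60 * 24


def corrections_per_tb_per_day(per_minute, terabytes):
    """Average daily corrections attributable to each terabyte of memory."""
    return corrections_per_day(per_minute) / terabytes


if __name__ == "__main__":
    per_day = corrections_per_day(CORRECTIONS_PER_MINUTE)
    per_tb = corrections_per_tb_per_day(CORRECTIONS_PER_MINUTE, MEMORY_TERABYTES)
    print(f"At least {per_day:,} corrections per day")
    print(f"Roughly {per_tb:,.0f} corrections per terabyte per day")
```

On those numbers, the machine was quietly fixing at least 432,000 errors a day, or on the order of 1,200 for every terabyte of memory.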

"The effect of cosmic rays on large computers is so bad that today's large supercomputers would not even boot up if they did not have ECC memory in them," says Geist.

In the early days, only a small community of academics, aerospace companies and medical parts-makers cared about the cosmic ray phenomenon. They'd meet at obscure events like the Nuclear and Space Radiation Effects Conference, but it was unusual to see anyone from a general-purpose chipmaker like Intel or AMD there, says Scott Doyle, a senior engineer with BAE Systems. That changed as manufacturers started using 90-nanometer process technology, he adds. "In the last 10 years or so there has been a tremendous amount of interest from the commercial semiconductor companies."

Chipmakers worry that as manufacturing techniques shrink, bit flipping could become a bigger problem. A single charged particle, for example, could conceivably flip several bits at once on a chip built with extremely tiny transistors. Most error-correcting code can handle one or two flipped bits, but it would have to be rewritten to handle four or 16.
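Why one or two bits and not more? A toy example makes the limit concrete. The sketch below is a textbook Hamming code with one extra parity bit, protecting 4 data bits with 4 check bits. It is an illustration only, not the scheme used in Jaguar, the RAD750 or any particular chip; real ECC memory commonly protects 64 data bits with 8 check bits, but the principle is the same. The function names are made up for this example.

```python
# Toy SECDED code: single-error correction, double-error detection.
# Index 0 holds an overall parity bit; indexes 1..7 form a Hamming(7,4) word
# with check bits at positions 1, 2 and 4 and data bits at 3, 5, 6 and 7.
DATA_POSITIONS = (3, 5, 6, 7)
CHECK_POSITIONS = (1, 2, 4)


def encode(data_bits):
    """Turn 4 data bits into an 8-bit codeword (a list of 0s and 1s)."""
    if len(data_bits) != 4 or any(b not in (0, 1) for b in data_bits):
        raise ValueError("expected exactly four bits, each 0 or 1")
    code = [0] * 8
    for pos, bit in zip(DATA_POSITIONS, data_bits):
        code[pos] = bit
    for p in CHECK_POSITIONS:
        # Each check bit makes the parity of the positions it covers even.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2  # overall parity across the other 7 bits
    return code


def decode(codeword):
    """Return (data_bits, status); data_bits is None if uncorrectable."""
    code = list(codeword)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # XOR of set positions points at a single bad bit
    overall_ok = sum(code) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "clean"
    elif not overall_ok:
        # Odd number of flips: assume exactly one and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    return [code[pos] for pos in DATA_POSITIONS], status


if __name__ == "__main__":
    word = encode([1, 0, 1, 1])
    word[5] ^= 1                 # simulate one cosmic-ray hit
    print(decode(word))
    word[2] ^= 1                 # a second hit on the same word
    print(decode(word))
```

One flipped bit is repaired silently and two are at least noticed, but three or more can slip past or be "corrected" into the wrong value. Handling bigger bursts means a stronger code with more check bits, which is the rewrite chipmakers are worried about.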

In outer space, the Curiosity Rover's computers need even more tricks to carry on managing peripherals and sending data back to Earth. For example, the RAD750 is designed with tie points throughout the processor that catch high-energy particles and force them to discharge to ground rather than into the chip's memory. There are also radiation hardening techniques built right into the RAD750's custom-made PowerPC chip, but BAE Systems considers those techniques trade secrets and doesn't like to discuss them.

Many computer scientists don't think that cosmic rays pose a serious problem at sea level, but issues can flare up.

Just over a decade ago, bit flips in the on-chip memory of Sun Microsystems' UltraSparc II processors caused some big problems for the company's customers. Sun's then-CEO Scott McNealy said that the problem was that the chip's memory didn't come with error-correcting code.

BAE Systems RAD750 motherboards. Photo: BAE Systems

Of course, there are many different things that cause chips to fail. A 2009 study of dynamic random access memory (DRAM) failures in Google's data centers found that about 8 percent of memory modules experienced at least one error per year. That was a much greater error rate than many people had expected, but the prime cause wasn't cosmic rays; it was old age, Google found. DRAM is good for only so many writes before it starts to wear out.

Another, more recent study, this one on Oak Ridge's Jaguar, found similarly high error rates in the computer's memory modules.

One of the authors of that paper was Vilas Sridharan, a reliability architect at AMD. Sridharan spends his days at the chipmaker figuring out new ways to keep the cosmic rays at bay. "It's something that every silicon manufacturer deals with and addresses in their chips," he says.

Because high-energy particles are blocked by the Earth's atmosphere, bit flipping is much more likely at higher altitudes. "In an airplane the rate is about 100 times greater than it is at sea level," Sridharan says.

So when you're surfing the internet at 37,000 feet, there could well be a flipped bit in your laptop. But the odds are that you'll never notice. In most cases, the bit will get flipped in some part of the memory that's not being used, or it will change something minor – like the color of a single pixel on your screen.

But down on Earth, bit flipping is something that's taken very seriously by scientists like Geist, who run massive supercomputers. These systems fill up huge amounts of memory – a large target for the cosmic rays – and they run precise calculations that simply can't have any errors.

And without the radiation hardening techniques cooked up by people such as Sridharan, these supercomputers simply wouldn't work.