Phobos-Grunt, perhaps the most ambitious deep space mission ever attempted by Russia, crashed into the ocean at the beginning of 2012. The spacecraft was supposed to land on the battered Martian moon Phobos, gather soil samples, and return them to Earth. Instead, it ended up helplessly drifting in Low Earth Orbit (LEO) for a few weeks because its onboard computer crashed just before it could fire the engines to send the spacecraft on its way to Mars.

In the ensuing report, Russian authorities blamed heavy charged particles in galactic cosmic rays that hit the SRAM chips and caused a latch-up, a chip failure resulting from excessive current passing through. To deal with the latch-up, the two processors in Phobos-Grunt's TsVM22 computer initiated a reboot. After rebooting, the probe went into safe mode and awaited instructions from ground control. Unfortunately, those instructions never arrived.

Antennas meant for communications were supposed to become fully operational in the cruise stage of Phobos-Grunt, after the spacecraft left LEO. But nobody planned for a failure that would prevent the probe from reaching that stage. After the particle strike, Phobos-Grunt ended up in a peculiar stalemate. Firing the onboard engines was supposed to trigger the deployment of the antennas. At the same time, the engines could only be fired with a command issued from ground control. That command, however, could not get through, because the antennas were not deployed. In this way, a computer error killed a mission that was several decades in the making. It happened, in part, because of oversights from the team at NPO Lavochkin, the primary developer of the Phobos-Grunt probe. During development, in short, it was easier to count the things that worked in their computer than the things that didn't. Every mistake they made, though, stands as a grave reminder that designing space-grade computers is bloody hard. One misstep and billions of dollars go down in flames.
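That stalemate is a textbook circular dependency, and it can be sketched in a few lines of code. Everything below, names included, is illustrative rather than taken from mission documentation:

```python
# A sketch of the Phobos-Grunt stalemate as a dependency cycle.
# Names are illustrative, not from mission documentation.

# Each step waits on the step it maps to.
deps = {
    "deploy_antennas": "fire_engines",     # antennas deploy after the engine burn
    "fire_engines": "receive_command",     # engines fire only on a ground command
    "receive_command": "deploy_antennas",  # commands need deployed antennas
}

def has_cycle(deps, start):
    """Walk the chain of dependencies; report True if we loop back."""
    seen, node = set(), start
    while node in deps:
        if node in seen:
            return True
        seen.add(node)
        node = deps[node]
    return False

print(has_cycle(deps, "deploy_antennas"))  # → True: no step can ever run first
```

Because every step waits on another step in the same loop, no step can ever execute, which is exactly the trap the probe fell into.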

Everyone involved had simply grossly underestimated the challenge of carrying out computer operations in space.

Why so slow?

Curiosity, everyone’s favorite Mars rover, works with two BAE RAD750 processors clocked at up to 200MHz. It has 256MB of RAM and 2GB of flash storage. As we near 2020, the RAD750 stands as the current state-of-the-art, single-core space-grade processor. It’s the best we can send on deep space missions today.

Compared to any smartphone we carry in our pockets, unfortunately, the RAD750’s performance is simply pathetic. The design is based on the PowerPC 750, a processor that IBM and Motorola introduced in late 1997 to compete with Intel's Pentium II. This means that perhaps the most technologically advanced space hardware up there is totally capable of running the original StarCraft (the one released in 1998, mind you) without hiccups, but anything more computationally demanding would prove problematic. You can forget about playing Crysis on Mars.

Meanwhile, the price tag on the RAD750 is around $200k. Why not just throw an iPhone in there and call it a day? Performance-wise, iPhones are entire generations ahead of the RAD750 and cost just $1k apiece. In retrospect, this is roughly what the Phobos-Grunt team tried to do: boost performance and cut costs. They ended up cutting corners instead.

The SRAM chip in Phobos-Grunt that was hit by a heavy charged particle went by the name WS512K32V20G24M. It was well known in the space industry, because back in 2005, T.E. Page and J.M. Benedetto had tested those chips in a particle accelerator at Brookhaven National Laboratory to see how they performed when exposed to radiation. The researchers described the chips as "extremely" vulnerable: single-event latch-ups occurred even at the minimum heavy-ion linear energy transfer available at Brookhaven. This was not a surprising result, mind you, because the WS512K32V20G24M was never meant for space, nor tested for it. It had been designed for aircraft, military-grade aircraft at that. But those chips were easier to obtain and cheaper than real space-grade memories, so the Russians involved with Phobos-Grunt went with them regardless.

"The discovery of the various kinds of radiation present in the space environment was among the most important turning points in the history of space electronics, along with the understanding of how this radiation affects electronics, and the development of hardening and mitigation techniques," says Dr. Tyler Lovelly, a researcher at the US Air Force Research Laboratory. The main sources of this radiation are cosmic rays, solar particle events, and belts of protons and electrons trapped by the Earth’s magnetic field, known as the Van Allen belts. Cosmic-ray particles hitting the Earth’s atmosphere are composed of roughly 89% protons, 9% alpha particles, 1% heavier nuclei, and 1% solitary electrons, and they can reach energies of up to 10^19 eV. Using chips not qualified for space in a probe intended to travel through deep space for several years was asking for disaster. In fact, Krasnaya Zvezda, a Russian military newspaper, reported at the time that 62% of the microchips used on Phobos-Grunt were not qualified for spaceflight. The probe’s design was, in other words, 62% driven by a "let’s throw in an iPhone" mindset.

Radiation becomes a thing

Today, radiation is one of the key factors designers take into account when building space-grade computers. But it has not always been that way. The first computer reached space onboard a Gemini spacecraft back in the 1960s. The machine had to undergo more than a hundred different tests to get flight clearance. Engineers checked how it performed when exposed to vibrations, vacuum, extreme temperatures, and so on. But none of those tests covered radiation exposure. Still, the Gemini onboard computer worked just fine—no issues whatsoever. That was because the Gemini onboard computer was too big to fail. Literally. Its whopping 19.5KB of memory was housed in a 700-cubic-inch box weighing 26 pounds. The whole computer weighed 58.98 pounds.

In computing generally, pushing processor technology forward has mostly meant reducing feature sizes and increasing clock rates. We made transistors smaller and smaller, moving from 240nm, to 65nm, to 14nm, down to the 7nm designs found in modern smartphones. The smaller the transistor, the lower the voltage necessary to switch it on and off. That’s why older processors with larger feature sizes were mostly unaffected by radiation—or, to be specific, unaffected by so-called single-event upsets (SEUs). The voltage created by a particle strike was too low to affect the operation of such large-featured chips. But when space-facing humans moved down in feature size to pack more transistors onto a chip, those particle-generated voltages became more than enough to cause trouble.
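To make the mechanism concrete, here is a minimal, purely illustrative sketch (not from the article) that models an SEU as a single flipped bit in a stored 32-bit word and shows why even a simple parity bit catches it:

```python
# Model a single-event upset (SEU) as one flipped bit in a 32-bit word.
# Purely illustrative; real memories use hardware parity or ECC circuits.

def parity(word):
    """Even-parity bit over 32 bits: 0 if the count of set bits is even."""
    return bin(word & 0xFFFFFFFF).count("1") % 2

stored = 0x000000FF             # value written to memory (8 set bits)
stored_parity = parity(stored)  # parity recorded alongside the data

# A charged particle deposits enough charge to flip bit 31.
upset = stored ^ (1 << 31)

# Any single-bit flip changes the parity, so the corruption is detectable.
print(parity(upset) != stored_parity)  # → True
```

Parity only detects the flip; correcting it takes stronger error-correcting codes, which is part of what makes space-grade memory expensive.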

Another thing engineers typically do to improve CPUs is clock them higher. The Intel 386SX that ran the so-called "glass cockpit" in the Space Shuttle was clocked at roughly 20MHz. Modern processors can go as high as 5GHz in short bursts. The clock rate determines how many processing cycles a processor goes through in a given time. The problem with radiation is that a particle strike can corrupt data stored in on-CPU memory (like the L1 or L2 cache) only during an extremely brief moment called a latching window. This means that in every second, there is a limited number of opportunities for a charged particle to do damage. In low-clocked processors like the 386SX, that number was relatively low. But as clock speeds climbed, the number of latching windows per second grew with them, making processors more vulnerable to radiation. This is why radiation-hardened processors are almost always clocked far lower than their commercial counterparts. The main reason space CPUs develop at such a sluggish pace is that pretty much every conceivable way to make them faster also makes them more fragile.
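The latching-window argument boils down to simple arithmetic. Assuming, purely for illustration, one latching window per clock cycle, the number of vulnerable moments scales directly with clock rate:

```python
# Back-of-the-envelope comparison of latching windows per second.
# Assumes (illustratively) one latching window per clock cycle.

def windows_per_second(clock_hz):
    return clock_hz  # one vulnerable window each cycle, under our assumption

shuttle_386sx = windows_per_second(20e6)  # ~20MHz Shuttle "glass cockpit" CPU
modern_burst = windows_per_second(5e9)    # ~5GHz burst clock on a modern CPU

# Under this assumption, a 5GHz chip offers 250x more chances per second
# for a particle strike to land inside a window than a 20MHz one.
print(f"{modern_burst / shuttle_386sx:.0f}x")  # → 250x
```

The real ratio depends on how long the latching window is relative to the cycle, but the direction of the effect is exactly what the paragraph above describes: faster clocks mean more opportunities per second for a strike to do damage.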

Fortunately, there are ways around this issue.