The conventional wisdom about DRAM error rates is that errors are rare, and the majority of the errors that do occur are so-called "soft errors"—randomly corrupted bits that have been flipped by incoming cosmic rays. But a recent large-scale study of DRAM errors released by Google turns this wisdom on its head, and in doing so reinforces the importance of error correction coding (ECC) and regular hardware replacement for datacenter machines.

Google's 2.5-year study of DRAM error rates in its datacenters is the largest such real-world study ever released; prior studies have been based on lab tests done under artificially high-stress conditions, with the results then extrapolated to give a picture of real-world conditions. Google engineers tracked errors as they happened, and logged both the errors and relevant data like temperature, CPU utilization, and memory allocated. After analyzing the data, they drew seven main conclusions about the nature, frequency, and causes of DRAM errors.

The headline conclusion in the study is that DRAM errors are vastly more common than is typically assumed. Nearly one-third of the machines in the study saw at least one memory error per year, a rate that's orders of magnitude higher than previous research had indicated. To put hard numbers on it, earlier studies report error rates of 200 to 5,000 FIT (failures in time, i.e., failures per billion hours of operation) per Mbit; Google measured rates between 25,000 and 75,000 FIT per Mbit.
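To put those FIT figures in perspective, a FIT is one expected failure per billion device-hours, so the rate scales with how much memory a machine carries. The back-of-the-envelope sketch below converts FIT per Mbit into expected errors per year for a hypothetical 1 GB DIMM (the module size is an illustrative assumption, not a figure from the study):

```python
# FIT (failures in time) = expected failures per 10^9 hours of operation.
# Rough conversion of FIT/Mbit into errors per year for one DIMM, assuming
# a hypothetical 1 GB module -- the capacity is an illustrative assumption,
# not a figure from the study.

HOURS_PER_YEAR = 24 * 365
DIMM_MBIT = 1024 * 8  # 1 GB of DRAM expressed in megabits

for fit_per_mbit in (200, 5_000, 25_000, 75_000):
    errors_per_year = fit_per_mbit * DIMM_MBIT / 1e9 * HOURS_PER_YEAR
    print(f"{fit_per_mbit:>6} FIT/Mbit  ->  ~{errors_per_year:,.0f} errors per DIMM per year")
```

Even the low end of Google's range works out to several correctable errors per day for such a module, although, as the next finding shows, those errors are far from evenly distributed across DIMMs.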

On the bright side, most of these errors are the work of a few bad apples. About 8 percent of DIMMs were responsible for over 90 percent of the errors, and a DIMM that produced one error was hundreds of times more likely to produce another error in the same month.

Another lesson that Google learned is that older hardware is much more likely to fail; at about 20 months the error rate shoots up drastically. Perhaps not coincidentally, typical IT refreshes happen at about the three-year mark, and it wouldn't be surprising to see computer vendors latch onto this study as another data point in their quest to convince businesses that their ongoing freeze on upgrades will soon start to cost them more than it's saving them.

Google's findings are especially important for large datacenters, because the tipping point at which the cost of error-induced downtime exceeds the cost of a hardware refresh falls at a different place for every business. In the absence of detailed data, it's easy to imagine a cost-conscious company pushing past that tipping point without knowing it.

The other big error factor that Google found was duty cycle: the higher the CPU utilization and memory allocation on a machine, the greater the odds of an error. In fact, utilization was much more strongly correlated with error rates than temperature was, even though most lab-based studies stress temperature as a major contributor to error rates. (Temperature does correlate with error rates, just not as strongly or as consistently across platforms.)

The study's final conclusion, and the one most at odds with previous studies, is that hard errors dominate soft errors by far. Defects in the memory chips themselves and problems in the memory subsystem's data path cause more errors than external factors like cosmic rays. The researchers suggest that this unexpected dominance of hard errors may explain the drastically higher error rates they observed compared with previous studies, which focused on soft errors.

Given the increased error rates that Google found, ECC turns out to be even more important than previously thought. The presence of ECC can mean the difference between a recoverable error and a catastrophic, downtime-producing failure, so it's no wonder that datacenter builders insist on it. The type of ECC also matters: stronger codes like chip-kill lowered error rates by a factor of four to five compared with weaker coding.
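For a sense of what ECC actually does, here is a minimal sketch of a Hamming(7,4) single-error-correcting code in Python. It is purely illustrative: real DIMM ECC operates on much wider words (and chip-kill-class codes can tolerate the loss of an entire memory chip), but the underlying idea of redundant parity bits that locate and repair a flipped bit is the same.

```python
# Minimal Hamming(7,4) single-error-correcting code, purely illustrative.
# Real DIMM ECC uses wider SEC-DED or chip-kill codes over 64/128-bit words,
# but the principle -- redundant parity bits that locate a flipped bit -- is the same.

def encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]   # codeword positions 1..7

def decode(c):
    """Correct up to one flipped bit and return the 4 data bits."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]        # recheck parity group 1
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]        # recheck parity group 2
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]        # recheck parity group 3
    syndrome = s1 + 2 * s2 + 4 * s3       # non-zero syndrome = error position
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1              # flip the offending bit back
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
word = encode(data)
word[5] ^= 1                               # simulate a single-bit DRAM error
assert decode(word) == data                # the error is transparently corrected
```

Flip any single bit of the codeword and decode() recovers the original data, which is exactly the property that turns an isolated hard or soft error into a logged non-event rather than crashed software or silent corruption.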