On the need to use error-correcting memory

I asked a computer shop for a quote about a PC configuration featuring 4GiB of ECC RAM. The shop gave me a quote, but omitting the ECC part. When I pointed out the omission, I got an answer stating that ECC is not needed unless you want to do "very precise calculations".

Cosmic rays

Those damn cosmic rays! Not only are they an obstacle to manned space missions (I didn't say low earth orbit) but they also come down and flip our precious bits. Apparently, they do so at a rate of something between 1 and 20 upsets per bit per 10^13 hours. An average value of 1.3e-12 upsets/bit/hour is quoted in the following paper by Eugene Normand:

SEU at ground level

Consequences

Suppose you're compiling your kernel, and then a cosmic ray hits your machine and flips a bit in the kernel code. Then you proceed to use that kernel code, and your machine locks up randomly every day, leading to catastrophic loss of data.

Suppose you are in the process of copying your birth/wedding/funeral/whatever video to your new hard disk and a bit flips in the middle of the file. You then make further backups of that file, but ten years later you realize that you can only play the file up to the 5th minute because after that mplayer crashes.

Suppose you are a software developer, more specifically a firmware developer, and a cosmic ray flips a bit in your toolchain, and you deliver consistently buggy software... through no fault of your own. Badly in debt, you get run over by a truck and have no money to pay for appropriate medical care. You die... because of cosmic rays!

Now all these scenarios are possible... but how probable are they?

Computing the probabilities

Quantities involved

Let m be the number of actual RAM bits in your system. For simplification, I'll suppose that your RAM usage ratio is constant at 100% - otherwise, just scale down m accordingly. Let T be a suitable length of time and let p be the probability of any given bit being hit during that time T. Let n_d be the bus word width (typically n_d = 64) and n_e be the number of error correction bits (typically n_e = 8 for a Hamming plus parity code on 64 bits). Then m_w = m / n_d will be the number of words in the system.

Probability of a bit error

First, let's assume you have a system with neither error correction nor parity. The probability that you'll experience at least one bit error during the time T will be 1-(1-p)^m.

For T = 1 hour, p = 1.3e-12 and m = 4*2^30*8, that gives 0.044 or 4.4%. That is quite a high probability. Indeed, in one day, that leads to a probability of 66% and in 72 hours to a probability of 96%.

So the probability of having at least one bit error in 4 gigabytes of memory at sea level on planet Earth in 72 hours is over 95% .
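The three figures above can be checked with a few lines of Python. This is only a sketch under the same assumptions as the text (independent bit flips, 100% memory usage); the names P_PER_HOUR, M_BITS and prob_at_least_one_flip are mine, chosen for illustration.

```python
import math

P_PER_HOUR = 1.3e-12      # upsets per bit per hour (average value quoted above)
M_BITS = 4 * 2**30 * 8    # 4 GiB of RAM expressed in bits

def prob_at_least_one_flip(hours, p=P_PER_HOUR, m=M_BITS):
    """Return 1 - (1 - p)^(m * hours), i.e. at least one flip in `hours` hours."""
    # log1p/expm1 avoid the rounding loss of evaluating (1 - p) ** m directly
    return -math.expm1(m * hours * math.log1p(-p))

for hours in (1, 24, 72):
    print(f"{hours:3d} h: {prob_at_least_one_flip(hours):.1%}")
```

Each bit-hour is treated as an independent trial, so going from one hour to 72 hours just multiplies the exponent by 72.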

Note that you don't need 72 hours of continuous operation for that to happen. If you leave your computer on from 8 in the morning to midnight, that'll mean that you'll have a bit error in 4.5 days with probability exceeding 95% . Modern operating systems use all the available memory as a disk cache. Very easy to fill it up.

If your computer writes back some of that data to disk, the bit error becomes permanent. With all the journaling, restore set point, and continual software upgrades that a modern PC does (whatever its operating system is) bit rot can become a major concern.

At that rate, with 4GiB of RAM, a worst-case rot rate of about 1 bit per day can be expected. Until you buy your next computer in three years, a thousand bits might have been corrupted - without counting the cascade effects of bad bits leading to more bad bits by software processes.
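The "1 bit per day" and "a thousand bits" figures are expected values rather than probabilities. A minimal sketch of that arithmetic, assuming the machine is on around the clock and its memory is fully used (variable names are illustrative):

```python
P_PER_HOUR = 1.3e-12      # upsets per bit per hour
M_BITS = 4 * 2**30 * 8    # 4 GiB of RAM expressed in bits

# Expected number of flipped bits is simply bits * rate * time
flips_per_day = M_BITS * P_PER_HOUR * 24
flips_in_three_years = flips_per_day * 365 * 3

print(f"expected flips per day:     {flips_per_day:.2f}")
print(f"expected flips in 3 years:  {flips_in_three_years:.0f}")
```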

Would you agree to such a data corruption rate? I won't. But how much does ECC help?

Error detection

ECC-capable RAM modules sold on the market are meant to be used with SECDED (single error correction, double error detection) codes. One such code is the parity-augmented Hamming code, which for words of n_d bits requires log_2(n_d) + 2 check bits. Thus for 64-bit words that gives n_e = 8 error correction bits.

What happens when a single bit is flipped in a word? A log message might be generated, but there generally is no cause for concern.

What happens when two bits are flipped in the same word? A fault condition is signaled on the CPU bus, triggering an interrupt. The OS then takes appropriate action such as killing the offending process.

If more than two bits are flipped, the corruption might go undetected, or not.
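To make the three cases concrete, here is a small software sketch of a parity-augmented Hamming code on 64 data bits. It is a textbook construction written for illustration only: real memory controllers implement SECDED in hardware and may use a different check-bit layout, and the function names (hamming_parity_bits, encode, decode) are my own.

```python
def hamming_parity_bits(n_data):
    """Smallest r such that 2**r >= n_data + r + 1 (plain Hamming code)."""
    r = 0
    while (1 << r) < n_data + r + 1:
        r += 1
    return r

def encode(data_bits):
    """Return the codeword as a list of bits; index 0 is the overall parity bit."""
    r = hamming_parity_bits(len(data_bits))
    n = len(data_bits) + r
    word = [0] * (n + 1)
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):            # not a power of two: data position
            word[pos] = next(it)
    syndrome = 0
    for pos in range(1, n + 1):        # XOR of the positions of all set bits
        if word[pos]:
            syndrome ^= pos
    for i in range(r):                 # parity bits cancel the syndrome
        word[1 << i] = (syndrome >> i) & 1
    word[0] = sum(word) % 2            # make the total number of ones even
    return word

def decode(word):
    """Return (status, word), status in 'ok', 'corrected', 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, len(word)):
        if word[pos]:
            syndrome ^= pos
    odd_parity = sum(word) % 2 == 1
    if syndrome == 0 and not odd_parity:
        return "ok", word
    if odd_parity and syndrome < len(word):
        fixed = list(word)             # odd parity: assume one flip and undo it
        fixed[syndrome] ^= 1           # syndrome 0 means the parity bit itself
        return "corrected", fixed
    return "uncorrectable", word       # even parity with a non-zero syndrome

data = [(i * 5 + 1) % 3 % 2 for i in range(64)]   # arbitrary test pattern
codeword = encode(data)
one_flip = list(codeword)
one_flip[10] ^= 1
two_flips = list(one_flip)
two_flips[33] ^= 1
print(len(codeword), decode(one_flip)[0], decode(two_flips)[0])
```

With three or more flips the decoder can be fooled: an odd number of flips looks like a single error, so it may "correct" the wrong bit without noticing.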

Probability of no errors with ECC memory

For a given memory word, the hardware will experience no problems if there are zero or one bit flips. So trouble starts with two or more bit flips.

Let n = n_d + n_e. Then the probability of at most one bit flip in a given word, that is, the probability of no error or a correctable error, is (1-p)^n + n*p*(1-p)^(n-1). The probability of an uncorrectable error in that word is therefore p_w = 1 - (1-p)^n - n*p*(1-p)^(n-1). Note that this supposes that the word is not accessed between the first and second bit flips - because in that case the word would be corrected and written back. I assume this simplification is good enough for our purpose, since we are computing a worst case.

Now since we have m_w words, the probability of a word error for the whole memory is p_W = 1 - (1 - p_w)^m_w during time T. With p = 1.3e-12 (for T = 1 hour) this gives p_w = 4.32e-21 and p_W = 2.32e-12. That is, instead of a 4.4% chance of having a bit error in your system every hour, you only get a negligible 2.32e-12 per hour. At that rate, you would have to wait about 160 million years before the probability of an uncorrectable bit error reaches 96%.
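These numbers are far below double-precision rounding error relative to 1, so computing the word error probability as "one minus the probability of zero or one flips" in floating point would return garbage. The sketch below sums the probability of two or more flips directly instead. It assumes n = 72 bits per word and 4 GiB of data; the function names are mine.

```python
import math

P = 1.3e-12                    # per-bit upset probability for T = 1 hour
N_D, N_E = 64, 8               # data bits and check bits per word
N = N_D + N_E
M_W = 4 * 2**30 * 8 // N_D     # number of 64-bit words in 4 GiB

def word_error_prob(p=P, n=N):
    """Probability of two or more flips among the n bits of one word."""
    # Summing the binomial tail avoids the cancellation in 1 - P(0) - P(1)
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(2, n + 1))

def memory_error_prob(p_word, m_w=M_W):
    """Probability that at least one of the m_w words has a word error."""
    return -math.expm1(m_w * math.log1p(-p_word))

p_word = word_error_prob()
p_memory = memory_error_prob(p_word)
print(f"per word:   {p_word:.3g}")
print(f"per memory: {p_memory:.3g} per hour")
print(f"over 72 h:  {72 * p_memory:.3g}")
```

The 72-hour line simply multiplies the hourly figure by 72, which is accurate here because the probabilities involved are tiny.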

Of course all that assumes that the cosmic-ray-induced bit errors are independent. Apparently that assumption is well-founded, unless you manage to find a memory module whose RAM chips are stacked one on top of the other and you place them vertically.

Conclusion and summary

A system on Earth, at sea level, with 4 GiB of RAM has a 96% chance of having a bit error in three days without ECC RAM. With ECC RAM, that goes down to 1.67e-10 or about one chance in six billion.

Further reading

Memory errors and SECDED by crypto superstar D. J. Bernstein.

Update 1

After I posted this to Reddit, user crazynotes pointed to the paper "DRAM Errors in the Wild: A Large-Scale Field Study", which gives even more alarming empirical data, this time including hard errors as well.