I love real data. Real data is so much better than speculation and, what I’ve learned from years of staring at production systems is that real data from the field is often surprisingly different from popular opinion. Disk failure rates are higher than manufacturer specifications, ECC memory faults happen all the time, and events that are just about impossible actually happen surprisingly frequently in large populations. Two papers that remain relevant nearly 8 years later and are well worth reading are: 1) Failure Trends in a Large Disk Drive Population and 2) Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? Both of these classics were published at the same FAST 2007 conference.
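To put rough numbers on the last two claims, here is a minimal back-of-the-envelope sketch. It assumes a constant failure rate (exponentially distributed lifetimes), which is exactly the kind of simplification field data tends to contradict, and all function names and the example figures are illustrative rather than taken from either paper. It converts a datasheet MTTF into an annualized failure rate and shows how a one-in-a-million event becomes likely across a large population.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760 hours


def annualized_failure_rate(mttf_hours: float) -> float:
    """Probability that one device fails within a year.

    Assumes a constant failure rate, i.e. exponential lifetimes
    with mean equal to mttf_hours.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)


def expected_annual_failures(fleet_size: int, mttf_hours: float) -> float:
    """Expected number of devices failing per year in a fleet."""
    return fleet_size * annualized_failure_rate(mttf_hours)


def prob_at_least_one(per_device_probability: float, fleet_size: int) -> float:
    """Chance that a rare per-device event hits at least one device,
    assuming devices are independent."""
    return 1.0 - (1.0 - per_device_probability) ** fleet_size


if __name__ == "__main__":
    mttf = 1_000_000  # illustrative datasheet MTTF in hours
    print(f"AFR: {annualized_failure_rate(mttf):.2%}")
    print(f"Failures/year in 100,000 drives: "
          f"{expected_annual_failures(100_000, mttf):.0f}")
    print(f"P(1-in-a-million event, 1M devices): "
          f"{prob_at_least_one(1e-6, 1_000_000):.1%}")
```

Under these assumptions a 1,000,000 hour MTTF works out to an annualized failure rate a little under 0.9%, and an event with a one-in-a-million chance per device has roughly a 63% chance of showing up somewhere in a million devices. The interesting part of field studies is how far measured rates depart from this idealized baseline.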

Flash memory and flash failure rates were issues we barely dealt with back in 2007. Today most megaclouds have petabytes of flash storage deployed. As always, there are specs on failure rates and strong opinions from experts, but there really hasn’t been much public data for large fleets.

I recently came across a paper that does a fairly detailed study of the Facebook SSD population. This study isn’t perfect: it reports over a large but unspecified number of devices, there are 5 different device models in the population, these devices operate in different server types, the lifetime of the different devices varies, and the fault tracking is external to the device and doesn’t see the detailed device-internal failure data. However, it does study devices over nearly 4 years with “many millions of operational hours” in aggregate. The population is clearly large enough to be relevant and, even with many uncontrolled dimensions, it’s a good early look at flash device lifetimes. I found their findings of interest:

Flash-based SSDs do not fail at a monotonically increasing rate with wear. They instead go through several distinct reliability periods corresponding to how failures emerge and are subsequently detected. Unlike the monotonically-increasing failure trends for individual flash chips, across a large number of flash-based SSDs, we observe early-detection, early failure, usable life, and wear out periods.

Read disturbance errors (i.e., errors caused in neighboring pages due to a read) are not prevalent in the field. SSDs that have read the most data do not show a statistically significant increase in failure rates.

Sparse logical data layout across an SSD’s physical address space (e.g., non-contiguous data), as measured by the amount of SSD-internal DRAM buffer usage for flash translation layer metadata, greatly affects device failure rate. In addition, dense logical data layout with adversarial patterns (e.g., small sparse writes) also negatively affects SSD reliability.

Higher temperatures lead to higher failure rates, but techniques used in modern SSDs that throttle SSD operation (and consequently, the amount of data written to flash chips) appear to greatly reduce the reliability impact of higher temperatures by reducing access rates to raw flash chips. [JRH: This point seems self-evident that temperature mitigation techniques would reduce the impact of higher temperatures].

The amount of data written by the operating system to the SSD is not the same as the amount of data that is eventually written to the flash cells. This is due to system level buffering and wear reduction techniques employed in the storage software stack and in the SSDs. [JRH: This highlights the problem of studying flash error rates outside of the storage devices in that we aren’t able to see the impact of write amplification and it’s more difficult to see the impact of buffering layers between the application and the device].
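The gap between OS-level writes and flash-level writes in that last finding can be expressed as a simple ratio. The sketch below is only an illustration and not the paper’s methodology: it assumes both counters are available for a device, and the function and parameter names are hypothetical.

```python
def write_amplification(host_bytes_written: int, flash_bytes_written: int) -> float:
    """Ratio of bytes physically written to flash cells to bytes the
    host asked the device to write.

    Above 1.0: the device wrote more than the host requested
    (e.g. garbage collection relocating data).
    Below 1.0: buffering or write-reduction layers absorbed some writes.
    Both inputs are hypothetical counters, assumed to cover the same period.
    """
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    if flash_bytes_written < 0:
        raise ValueError("flash_bytes_written must not be negative")
    return flash_bytes_written / host_bytes_written


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    host = 40 * 10**12   # 40 TB written by the operating system
    flash = 100 * 10**12  # 100 TB actually written to flash cells
    print(f"Write amplification: {write_amplification(host, flash):.2f}")
```

A study that only sees the host-side counter has to treat this ratio as unknown, which is why wear measured outside the device is only a proxy for wear on the flash cells themselves.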

The paper from Sigmetrics 2015: A Large Scale Study of Flash Memory Failures in the Field

Thanks to Matt Wilson for sending this paper my way.