Everything You Know About Disks Is Wrong

Update II: NetApp has responded. I’m hoping other vendors will as well.

Which do you believe?

Costly FC and SCSI drives are more reliable than cheap SATA drives.

RAID 5 is safe because the odds of two drives failing in the same RAID set are so low.

After infant mortality, drives are highly reliable until they reach the end of their useful life.

Vendor MTBFs are a useful yardstick for comparing drives.

According to one of the “Best Paper” award winners at FAST ’07, none of these beliefs is backed by empirical evidence.

Beyond Google

Yesterday’s post discussed a Google-authored paper on disk failures. But that wasn’t the only cool storage paper.

Google’s wasn’t even the best: Bianca Schroeder of CMU’s Parallel Data Lab won a “Best Paper” award for her paper Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? (BTW, Dr. Schroeder is a post-doc looking for an academic position, but if I were Google or Amazon I’d be after her in a big way.)

Best “academic computer science” paper

So it is very heavy on statistics, including some cool techniques like the “auto-correlation function”. Dr. Schroeder explains:

The autocorrelation function (ACF) measures the correlation of a random variable with itself at different time lags l. The ACF, for example, can be used to determine whether the number of failures in one day is correlated with the number of failures observed l days later.

Translation: ever wonder if a disk drive failure in an array makes it more likely that another drive will fail? ACF will tell you.
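For the curious, here is a minimal sketch of the sample autocorrelation at a given lag. The function name and the daily failure counts are made up for illustration; they are not taken from the paper's data or code.

```python
def autocorrelation(counts, lag):
    """Sample autocorrelation of a series with itself, shifted by `lag` steps."""
    n = len(counts)
    if not 0 <= lag < n:
        raise ValueError("lag must be in [0, len(counts))")
    mean = sum(counts) / n
    # Sum of squared deviations (the lag-0 term used for normalization).
    variance = sum((x - mean) ** 2 for x in counts)
    if variance == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Sum of products of deviations that are `lag` steps apart.
    covariance = sum(
        (counts[t] - mean) * (counts[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance


# Hypothetical daily failure counts in which failures arrive in bursts.
daily_failures = [0, 0, 3, 4, 2, 0, 0, 0, 5, 3, 1, 0, 0, 0, 0, 2, 4, 3, 0, 0]

for lag in (1, 2, 7):
    print(f"lag {lag}: ACF = {autocorrelation(daily_failures, lag):.3f}")
```

A value near zero at lag l says today's failure count tells you little about the count l days later. A clearly positive value says failures cluster in time.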

She looked at 100,000 drives

Including HPC clusters at Los Alamos and the Pittsburgh Supercomputing Center, as well as several unnamed internet service providers. The drives had different workloads, different definitions of “failure”, and different levels of data collection, so the data isn’t quite as smooth or complete as Google’s. Yet it probably looks more like a typical enterprise data center, IMHO. Not all of the data could be used to draw all of the conclusions, but Dr. Schroeder appears to have been very careful in her statistical analysis.

Key observations from Dr. Schroeder’s research:

High-end “enterprise” drives versus “consumer” drives?

Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component specific factors.

Maybe consumer stuff gets kicked around more. Who knows?

Infant mortality?

. . . failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation.

Dr. Schroeder didn’t see infant mortality – neither did Google – and she also found that drives just wear out steadily.

Vendor MTBF reliability?

While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs range from 0.5% to as high as 13.5%. That is, the observed ARRs by dataset and type, are by up to a factor of 15 higher than datasheet AFRs. Most commonly, the observed ARR values are in the 3% range.

Actual MTBFs?

The weighted average ARR was 3.4 times larger than 0.88%, corresponding to a datasheet MTTF of 1,000,000 hours.

In other words, that 1 million hour MTBF is really about 300,000 hours – about what consumer drives are spec’d at.
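The arithmetic behind that estimate is easy to check. This sketch assumes the usual approximation that an annualized failure rate is the number of hours in a year divided by the MTTF, which is consistent with 0.88% corresponding to 1,000,000 hours. The function names are mine, for illustration only.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def afr_from_mttf(mttf_hours):
    """Approximate annualized failure rate (as a fraction) for a given MTTF."""
    return HOURS_PER_YEAR / mttf_hours


def mttf_from_afr(afr):
    """Approximate MTTF in hours implied by an annualized rate (as a fraction)."""
    return HOURS_PER_YEAR / afr


datasheet_afr = afr_from_mttf(1_000_000)       # roughly 0.88% per year
observed_arr = 3.4 * 0.0088                    # weighted average from the quote
implied_mttf = mttf_from_afr(observed_arr)     # a bit under 300,000 hours

print(f"datasheet AFR: {datasheet_afr:.2%}")
print(f"observed ARR:  {observed_arr:.2%}")
print(f"implied MTTF:  {implied_mttf:,.0f} hours")
```

Keep in mind that the paper measures replacements, not confirmed failures, so the implied MTTF is an estimate of what operators experience rather than a lab figure.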

Drive reliability after burn-in?

Contrary to common and proposed models, hard drive replacement rates do not enter steady state after the first year of operation. Instead replacement rates seem to steadily increase over time.

Drives get old, fast.

Data safety under RAID 5?

. . . a key application of the exponential assumption is in estimating the time until data loss in a RAID system. This time depends on the probability of a second disk failure during reconstruction, a process which typically lasts on the order of a few hours. The . . . exponential distribution greatly underestimates the probability of a second failure . . . . the probability of seeing two drives in the cluster fail within one hour is four times larger under the real data . . . .
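To see what the exponential assumption buys you, here is a rough sketch of the textbook calculation the quote is criticizing: the chance that at least one surviving drive fails during a rebuild, assuming independent drives with a constant failure rate. The array size, rebuild time, and function name are hypothetical.

```python
import math


def p_second_failure(surviving_drives, rebuild_hours, mttf_hours):
    """Probability that at least one surviving drive fails during the rebuild,
    ASSUMING independent, exponentially distributed drive lifetimes."""
    combined_rate = surviving_drives / mttf_hours  # failures per hour
    # 1 - exp(-rate * t), written with expm1 for accuracy at tiny probabilities.
    return -math.expm1(-combined_rate * rebuild_hours)


# Hypothetical 8-drive RAID 5 set (7 survivors) with a 6-hour rebuild.
datasheet = p_second_failure(7, 6, 1_000_000)   # vendor MTTF
field = p_second_failure(7, 6, 300_000)         # field-derived figure from above

print(f"datasheet MTTF: {datasheet:.6%}")
print(f"field MTTF:     {field:.6%}")
```

Even with the lower field MTTF plugged in, this model still treats failures as independent. The paper's point is that they are not, so the real risk is higher than either number.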

Independence of drive failures in an array?

The distribution of time between disk replacements exhibits decreasing hazard rates, that is, the expected remaining time until the next disk was replaced grows with the time it has been since the last disk replacement.

Translation: one array drive failure means a much higher likelihood of another drive failure. The longer since the last failure, the longer to the next failure. Magic!
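A decreasing hazard rate is easier to picture with numbers. One standard distribution with this property is the Weibull with a shape parameter below 1; a shape of exactly 1 gives the constant-rate exponential case. The sketch below is only an illustration: the shape and scale values are assumptions, not fits to the paper's data.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    if t <= 0:
        raise ValueError("t must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


SCALE_HOURS = 100.0  # hypothetical scale parameter

for hours_since_last in (1, 10, 100, 1000):
    decreasing = weibull_hazard(hours_since_last, 0.7, SCALE_HOURS)  # shape < 1
    constant = weibull_hazard(hours_since_last, 1.0, SCALE_HOURS)    # exponential
    print(f"{hours_since_last:>5} h since last replacement: "
          f"shape 0.7 -> {decreasing:.4f}/h, shape 1.0 -> {constant:.4f}/h")
```

With a shape below 1, the rate is highest right after a replacement and falls as quiet time accumulates, which is the behavior the quote describes.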

Big iron array reliability is illusory

One implication of Schroeder’s results is that big iron arrays only appear more reliable. How? Using smaller “enterprise” drives means that rebuilds take less time. That makes RAID 5 failures due to the loss of a second disk less likely. So array vendors not only get higher margins from smaller enterprise disks, they also get higher perceived reliability under RAID 5, for which they also charge more money.

The StorageMojo take

After these two papers neither the disk drive nor the array business will ever be the same. Storage is very conservative, so don’t expect overnight change, but these papers will accelerate the consumerization of large-scale storage. High-end drives still have advantages, but those fictive MTBFs aren’t one of them anymore.

Further, these results validate the Google File System’s central redundancy concept: forget RAID, just replicate the data three times. If I’m an IT architect, the idea that I can spend less money and get higher reliability from simple cluster storage file replication should be very attractive.

Comments welcome, especially from disk drive and array vendors who dispute these conclusions. Moderation turned on to protect the innocent.

Update: Garth Gibson’s name is also on the paper. Since he is busy as a CMU professor and CTO of Panasas, I hope he’ll pardon me for assuming that Dr. Schroeder deserves most of the credit.