You won’t believe how much data can be written to modern SSDs. No, seriously. Our ongoing SSD Endurance Experiment has demonstrated that some consumer-grade drives can withstand over a petabyte of writes before burning out. That’s a hyperbole-worthy total for a class of products typically rated to survive only a few hundred terabytes at most.

Our experiment began with the Corsair Neutron GTX 240GB, Intel 335 Series 240GB, Samsung 840 Series 250GB, and Samsung 840 Pro 256GB, plus two Kingston HyperX 3K 240GB drives. They all surpassed their endurance specifications, but the 335 Series, 840 Series, and one of the HyperX drives failed to reach the petabyte mark. The remainder pressed on toward 1.5PB, and two of them made it relatively unscathed. That journey claimed one more victim, though—and you won’t believe which one.

Seriously, you won’t. But I’ll stop now.

To celebrate the latest milestone, we’ve checked the health of the survivors, put them through another data retention test, and compiled performance results from the last 500TB. We’ve also taken a closer look at the last throes of our latest casualty.

If you’re unfamiliar with our endurance experiment, this introductory article is recommended reading. It provides far more details on our subjects, methods, and test rigs than we’ll revisit today. Here are the basics: SSDs are based on NAND flash memory with limited endurance, so we’re writing an unrelenting stream of data to a stack of drives to see what happens. We pause every 100TB to collect health and performance data, which we then turn into stunningly beautiful graphs. Ahem.

Understanding NAND’s limited lifespan requires some familiarity with how NAND works. This non-volatile memory stores data by trapping electrons inside minuscule cells built with process geometries as small as 16 nm. The cells are walled off by an insulating oxide layer, but applying voltage causes electrons to tunnel through that barrier. Electrons are drawn into the cell when data is written and out of it when data is erased.

The catch—and there always is one—is that the tunneling process erodes the insulator’s ability to hold electrons within the cell. Stray electrons also get caught in the oxide layer, generating a baseline negative charge that narrows the voltage range available to represent data. The narrower that range gets, the more difficult it becomes to write reliably. Cells eventually wear to the point that they’re no longer viable, after which they’re retired and replaced with spare flash from the SSD’s overprovisioned area.

Since NAND wear is tied to the voltage range used to define data, it’s highly sensitive to the bit density of the cells. Three-bit TLC NAND must differentiate between eight possible values within that limited range, while its two-bit MLC counterpart only has to contend with four values. TLC-based SSDs typically have lower endurance as a result.
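To put numbers on that comparison, here is a minimal Python sketch. It assumes an idealized cell whose usable voltage window is divided evenly among the possible values; real NAND does not space its levels this neatly, and the function name is ours, purely for illustration.

```python
def voltage_levels(bits_per_cell: int) -> int:
    """Number of distinct values a cell must represent to store this many bits."""
    return 2 ** bits_per_cell


for name, bits in [("MLC", 2), ("TLC", 3)]:
    levels = voltage_levels(bits)
    # Idealized assumption: the window is shared equally among all levels.
    share = 1 / levels
    print(f"{name}: {levels} values, each with at most {share:.1%} of the window")
```

Under that simplification, each TLC value gets half the slice of the voltage window that an MLC value does, so the same amount of oxide wear eats up proportionally more of the margin.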

As we’ve learned in the experiment thus far, flash wear causes SSDs to perish in different ways. The Intel 335 Series is designed to check out voluntarily after a predetermined number of writes. That drive dutifully bricked itself after 750TB, even though its flash was mostly intact at the time. The first HyperX failed a little earlier, at 728TB, under much different conditions. It suffered a rash of reallocated sectors, programming failures, and erase failures before its ultimate demise.

Counter-intuitively, the TLC-based Samsung 840 Series outlasted those MLC casualties to write over 900TB before failing suddenly. But its reallocated sectors started piling up after just a few hundred terabytes of writes, confirming TLC’s more fragile nature. The 840 Series also suffered hundreds of uncorrectable errors split between an initial spate at 300TB and a second accumulation near the end of the road.

So, what about the latest death?

Much to our surprise, the Neutron GTX failed next. It had logged only three reallocated sectors through 1.1PB of writes, but SMART warnings appeared soon after, cautioning that the raw read error rate had exceeded the acceptable threshold. The drive still made it to 1.2PB and through our usual round of performance benchmarks. However, its SMART attributes showed a huge spike in reallocated sectors:

Over the last 100TB, the Neutron compensated for over 3400 sector failures. And that was it. When we readied the SSDs for the next leg, our test rig refused to boot with the Neutron connected. The same thing happened with a couple of other machines, and hot-plugging the drive into a running system didn’t help. Although the Neutron was detected, the Windows disk manager stalled when we tried to access it.

Despite the early warnings of impending doom, the Neutron’s exit didn’t go entirely by the book. The drive is supposed to keep writing until its flash reserves are used up, after which it should slip into a persistent read-only state to preserve user data. As far as we can tell, our sample never made it to read-only mode. It was partitioned and loaded with 10GB of data before the power cycle that rendered the drive unresponsive, and that partition and data remain inaccessible.

We’ve asked Corsair to clarify the Neutron GTX’s sector size and how much of the overprovisioned area is available to replace retired flash. Those details should give us a better sense of whether the drive ran out of spare NAND or was struck down by something else. For what it’s worth, the other SMART attributes suggest the Neutron may have had some flash in reserve.

The SMART data has two values for reallocated sectors: one that counts up from zero and another that ticks down from 256. The latter still hadn’t bottomed out after 1.2PB, and neither had the life-left estimate. Hmmm.

Although the graph shows the raw read error rate plummeting toward the end, the depiction isn’t entirely accurate. That attribute was already at its lowest value after 1.108PB of writes, which is when we noticed the first SMART error. We may need to grab SMART info more regularly in future endurance tests.

Now that we’ve tended to the dead, it’s time to check in on the living…

Two keep on truckin’

The Samsung 840 Pro and second Kingston HyperX 3K both reached 1.5PB with little drama. They also completed another unpowered retention test. After writing 1.5PB, the drives were loaded with a 200GB test file and then left unplugged for over a week. Both subsequently passed the MD5 hash check we use to verify data integrity.

A second hash check is integrated into Anvil’s Storage Utilities, the application we use to write data to the drives. This test is configured to verify a smaller 720MB file after roughly every terabyte of writes, and there haven’t been any inconsistencies yet.
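For readers who want to run a similar retention check at home, the idea can be sketched in a few lines of Python. This is not the tooling used in the experiment; the function names and the 1MB chunk size are our own assumptions. The file is read in chunks so that a 200GB test file never has to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it one chunk at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def retained_intact(path, expected_hex):
    """True if the file still matches the digest recorded before unplugging the drive."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Record the digest right after writing the file, leave the drive unpowered, then call retained_intact() with the saved digest once the drive is plugged back in.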

Let’s examine the survivors in greater detail, starting with the 840 Pro, which continues to accumulate reallocated sectors.

The burn rate has slowed slightly since the initial uptick, but over 3400 sectors have been compromised so far. At 1.5MB each, that’s about 5GB of flash lost to cell degradation.
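The arithmetic behind that estimate is simple enough to check. This quick sketch assumes decimal units (1GB = 1000MB) and treats 3400 as a floor, since the actual count is somewhat higher; the function name is ours.

```python
SECTOR_SIZE_MB = 1.5  # per-sector size quoted for the 840 Pro


def flash_lost_gb(reallocated_sectors, sector_size_mb=SECTOR_SIZE_MB):
    """Flash capacity retired to reallocation, in decimal gigabytes."""
    return reallocated_sectors * sector_size_mb / 1000


print(f"{flash_lost_gb(3400):.1f} GB")  # prints "5.1 GB"
```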

According to the SMART data, less than 40% of the flash reserves have been consumed. There’s still plenty on tap to cover future failures.

The wear leveling count is supposed to be related to drive health, but it ran aground after just 500TB, and the 840 Pro has been fine through a petabyte of writes since. The health indicator in Samsung’s SSD Magician utility software has given the drive a “good” rating since the beginning of the experiment, which seems like a more accurate assessment. Then again, the same utility gave the 840 Series a clean bill of health even after the drive had suffered hundreds of uncorrectable errors.

Practical limits restrict our experiment to one example of each SSD, but we have two HyperX 3K drives. One was tested like all the others, with randomized data that can’t be compressed by the DuraWrite mojo in SandForce controllers. The other has been getting a lighter diet based on the Anvil utility’s 46% incompressible setting. You can probably guess which one is still alive.

We can measure the effectiveness of SandForce’s compression scheme by tracking host writes, which come from the system, and compressed writes, which are committed to the NAND. The host writes are identical for both HyperX configs, but the compressed writes are not.

The HyperX 3K writes much less to the flash with the partially compressible payload. 1.5PB of host writes translates to only 1.07PB of compressed writes. On the other setup, compressed writes are slightly higher than host writes due to write amplification.

(The sequential transfers that dominate the endurance test have relatively low amplification, at least compared to the more random workloads typical of client systems. DuraWrite’s effectiveness in this particular scenario isn’t necessarily indicative of how the scheme will perform with other workloads.)
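The host and compressed totals above boil down to a single ratio of NAND writes to host writes. The sketch below computes it from the two figures quoted for the surviving HyperX; the function name is ours, and the result only describes this drive under this workload.

```python
def nand_to_host_ratio(host_writes_pb, nand_writes_pb):
    """Data committed to flash per unit of data sent by the host.

    Below 1.0 means compression is winning; above 1.0 means write
    amplification outweighs any compression.
    """
    return nand_writes_pb / host_writes_pb


ratio = nand_to_host_ratio(host_writes_pb=1.5, nand_writes_pb=1.07)
print(f"NAND writes per host write: {ratio:.2f}")   # prints 0.71
print(f"Flash writes avoided: {1 - ratio:.0%}")     # prints 29%
```

By the same measure, the drive fed incompressible data sits slightly above 1.0, since its compressed writes exceed its host writes.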

If compression were the only factor in the remaining HyperX’s survival, the drive would have hit the wall around 1.1PB, when it reached the same volume of compressed writes that crippled its twin. The built-in health indicator even suggested the end was coming around that mark:

But the flash in this particular SSD has proven surprisingly resilient. Just 12 sectors have been reallocated through 1.5PB, a far cry from the thousands accrued by the other HyperX.

Our sample size isn’t large enough to confirm which result is the outlier. Chip-to-chip variance is common in semiconductor manufacturing, though. Some dies are simply better than others, whether it’s the clock speeds that CPUs can attain or the write/erase cycles that NAND can survive.

The two HyperX SSDs arrived at the same time, and we used the highly scientific “eeny, meeny, miny, moe” method to determine which one got the partly compressible workload. If that drive also had a few cherry chips under the hood, it got lucky twice—and should probably buy a lottery ticket.

Digging deeper into the SMART data reveals that the surviving HyperX hasn’t been entirely flawless.

We didn’t notice it at the time, but the drive reported two uncorrectable errors between 900TB and 1PB of writes. Those episodes occurred during the same span as the first two reallocated sectors, though we can’t know for sure if the two are related. In any case, uncorrectable errors are very serious. They can corrupt data, crash applications, and even bring down entire systems.

The program and erase failures aren’t as critical. In those cases, the drive should be able to move on to another sector without risking the user’s data. Performance may suffer, but only momentarily.

Speaking of performance, the next page explores whether any of the SSDs lost a step over the last stretch.

Performance

We benchmarked all the SSDs before we began our endurance experiment, and we’ve gathered more performance data after every 100TB of writes since. It’s important to note that these tests are far from exhaustive. Our in-depth SSD reviews are a much better resource for comparative performance data. What we’re looking for here is how each SSD’s benchmark scores change as the writes add up.

With only a few exceptions, all the SSDs have performed consistently in these tests. The Neutron GTX stumbled with sequential reads a second time before it died, but it didn’t skip a beat elsewhere.

Unlike our first batch of results, which was obtained on the same system after secure-erasing each drive, the next set comes from the endurance test itself. Anvil’s utility lets us calculate the write speed of each loop that loads the drives with random data. This test runs simultaneously on six drives split between two separate systems (and between 3Gbps SATA ports for the HyperX drives and 6Gbps ones for the others), so the data isn’t useful for apples-to-apples comparisons. However, it does provide a long-term look at how each drive handles this particular write workload.

The 840 Pro and compressed HyperX react differently to Anvil’s stream of randomized files, but their behavior hasn’t changed over time. The Samsung’s write speeds continue to oscillate from one run to the next. The Kingston’s writes remain smooth and steady apart from the brief spikes associated with the secure-erase performed before each benchmarking round.

Unlike the other drives, the Neutron GTX actually sped up slightly as the writes piled up. Its average speed fell off a cliff toward the end, though. Here’s a tighter crop of its final steps versus the other failures:

All the SSDs slowed before their deaths, but none as dramatically as the Neutron. No wonder it couldn’t carry on.