I feel for the subjects of our SSD Endurance Experiment. They didn’t volunteer for this life. These consumer-grade drives could have ended up in a corporate desktop or grandma’s laptop or even an enthusiast’s PC. They could have spent their days saving spreadsheets and caching Internet files and occasionally making space for new Steam downloads. Instead, they ended up in our labs, on the receiving end of a torturous torrent of writes designed to kill them.

Talk about a rough life.

We started with six SSDs: the Corsair Neutron GTX 240GB, Intel 335 Series 240GB, Samsung 840 Series 250GB, Samsung 840 Pro 256GB, and two Kingston HyperX 3K 240GB. They all exceeded their endurance specifications early on, successfully writing hundreds of terabytes without issue. That’s a heck of a lot of data, and certainly more than most folks will write in the lifetimes of their drives.

The last time we checked in, the SSDs had just passed the 600TB mark. They were all functional, but the 840 Series was burning through its TLC cells at a steady pace, and even some of the MLC drives were starting to show cracks. We’ve now written over a petabyte, and only half of the SSDs remain. Three drives failed at different points—and in different ways—before reaching the 1PB milestone. We’ve performed autopsies on the casualties and our usual battery of tests on the survivors, and there is much to report.

If you haven’t been following along with our endurance experiment, this introductory article is a good starting point. It spends far more time detailing our test methods and system configurations than the brief primer we’ll provide here.

The premise is straightforward. Flash memory has limited endurance, so we’re writing data to a stack of SSDs to see how much they can take. We’re checking health and performance at regular intervals, and we’re not going to stop until all the drives are dead.

The root cause of NAND’s limited endurance is a little complicated. Flash stores information by trapping electrons inside nanoscale cells; the associated voltage defines the data. The “tunneling” process used to move electrons in and out of the cell is destructive, not only eroding the physical structure of the cell wall, but also causing stray electrons to become stuck in it. These errant electrons impart a negative charge of their own, reducing the range of voltages available to represent data. The narrower that range becomes, the more difficult it is for SSDs to perform writes and to verify their validity.

Electron build-up is especially problematic at higher bit densities. MLC NAND needs to differentiate between four possible values within the flash’s shrinking voltage window, but TLC NAND must track twice as many. It’s more sensitive to normal flash wear as a result, which is why our 840 Series has been burning through more of its flash than the MLC-based drives in the experiment.
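To put rough numbers on that squeeze, here is a minimal sketch of how the slice of the voltage window available to each value shrinks as bits per cell go up. The window width of 1.0 is a normalized placeholder rather than a measured figure, and the function names are made up for this illustration.

```python
def levels_per_cell(bits: int) -> int:
    """Number of distinct values a cell must distinguish to store `bits` bits."""
    return 2 ** bits


def margin_per_level(bits: int, window: float = 1.0) -> float:
    """Share of the voltage window left for each value.

    `window` is a normalized, hypothetical width; real NAND windows differ
    and shrink as the flash wears.
    """
    return window / levels_per_cell(bits)


for name, bits in [("MLC", 2), ("TLC", 3)]:
    print(f"{name}: {levels_per_cell(bits)} values, "
          f"{margin_per_level(bits):.3f} of the window each")
```

Shrinking `window` to mimic electron build-up cuts every slice proportionally, and TLC starts from slices half as wide as MLC's.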

Continued write cycling eventually causes cells to become unreliable, at which point those cells are retired and replaced by flash harvested from the drive’s “spare area.” This reserve of fresh flash area ensures the SSD maintains its user-accessible storage capacity even if cell failures incapacitate some of the NAND. Of course, eventually that reserve becomes exhausted and the drive will begin to fail.

Now that we’ve laid the groundwork, it’s time to inspect the casualties. The first failures were a bit of a surprise. When we checked on our lab rats after 700TB of writes, we found SMART messages warning that the Intel 335 Series and one of the Kingston HyperX 3K units were at risk of failure. Both drives are based on MLC NAND, so we didn’t expect them to falter before our lone TLC contender.

Although the failure-prone drives were fully functional at 700TB, neither one made it to 800TB. The HyperX 3K expired at 728TB, while the 335 Series croaked at 750TB. We’ll deal with the Intel first, since its demise was a little more straightforward.

The 335 Series’ flash was almost entirely intact when the SMART warning hit. Only one reallocated sector had been logged up until that point, and it appeared way back at the 300TB mark, so it didn’t inspire the warning. Instead, the slow decline of the media wearout indicator (MWI) was responsible.

This SMART attribute starts at 100 and decreases as the NAND’s rated write tolerance is exhausted. It’s completely unaffected by the number of reallocated sectors, and it’s been ticking down steadily since the experiment began. The remaining life estimate in Intel’s SSD Toolbox utility is based on the MWI, and so is the general health assessment offered by HD Sentinel, the third-party tool we’ve been using to grab raw SMART data.

Our journey to 700TB drove the MWI all the way down to one, which is supposed to put the 335 Series in a read-only, “logical disable” state. The flash is deemed unreliable at this point, and in typically conservative fashion, Intel doesn’t want to perform a write that isn’t guaranteed. The SMART readout might have truncated a decimal place, though, because we were still able to run our usual performance tests and kick off the next 100TB of writes.

The 335 Series was fine until about 50TB into that run, when write errors started appearing in Anvil’s Storage Utilities, the application tasked with flooding the SSDs with writes. The Anvil app actually froze, though we were able to load it again and extract the performance log stored on the drive. We’ll take a closer look at those results in a moment.

Oddly, the 335 Series wouldn’t return SMART information after the Anvil write errors appeared. The attributes were inaccessible in both third-party tools and Intel’s own utility, which indicated that the SMART feature was disabled. After a reboot, the SSD disappeared completely from the Intel software. It was still detected by the storage driver, but only as an inaccessible, 0GB SATA device.

According to Intel, this end-of-life behavior generally matches what’s supposed to happen. The write errors suggest the 335 Series had entered read-only mode. When the power is cycled in this state, a sort of self-destruct mechanism is triggered, rendering the drive unresponsive. Intel really doesn’t want its client SSDs to be used after the flash has exceeded its lifetime spec. The firm’s enterprise drives are designed to remain in logical disable mode after the MWI bottoms out, regardless of whether the power is cycled. Those server-focused SSDs will still brick themselves if data integrity can’t be verified, though.

SMART functionality is supposed to persist in logical disable mode, so it’s unclear what happened to our test subject there. Intel says attempting writes in the read-only state could cause problems, so the fact that Anvil kept trying to push data onto the drive may have been a factor.

All things considered, the 335 Series died in a reasonably graceful, predictable manner. SMART warnings popped up long before write errors occurred, providing plenty of time (and additional write headroom) for users to prepare. Next, we’ll explore what happened to the HyperX 3K.

More casualties

There are actually two Kingston HyperX 3K SSDs in the experiment. One is being tested like all the others, with 100% incompressible data that’s immune to SandForce’s DuraWrite compression mojo. The second HyperX is identical to the first, but it’s getting a stream of compressible data via Anvil’s “applications” preset.

The HyperX’s SMART attributes log host and flash writes separately, giving us a glimpse of DuraWrite in action. After 700TB of writes from the host, the incompressible HyperX config showed 738TB of flash writes, while its compressible sidekick indicated only 501TB.
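For a quick sense of scale, the sketch below divides each drive's flash writes by its host writes at the 700TB mark. The figures come straight from the SMART readings quoted above; the function name and the idea of boiling them down to a single ratio are ours, not Kingston's or SandForce's.

```python
def flash_to_host_ratio(flash_tb: float, host_tb: float) -> float:
    """Terabytes written to the NAND per terabyte written by the host."""
    return flash_tb / host_tb


HOST_TB = 700  # host writes at this check-in
flash_writes_tb = {"incompressible": 738, "compressible": 501}

for config, flash_tb in flash_writes_tb.items():
    ratio = flash_to_host_ratio(flash_tb, HOST_TB)
    print(f"{config}: {ratio:.2f}TB of flash writes per TB of host writes")
```

That works out to roughly 1.05 for the incompressible config and 0.72 for the compressible one.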

As one might expect, the imminent-failure warning came from the incompressible drive. The warning was displayed by both HD Sentinel and the Intel storage driver used on our test systems. Then, after 725TB of writes, we got another cautionary message, this time from the OS. “Windows detected a hard disk problem,” read the dialog box, “Back up your files immediately to prevent information loss.” 3TB later, Anvil started reporting write errors. The drive was still accessible, and we were able to dump one last batch of SMART data, but it bricked after a reboot.

On the HyperX 3K, the SSD life left attribute tracks flash wear. Like Intel’s media wearout indicator, it counts down from 100 and is tied directly to the rated lifespan of the NAND.

When this attribute reaches 10, the flash’s specified endurance has been exhausted, and the SMART warning is triggered. Kingston urges users to back up their data and move to a new SSD at this point. The firm describes the SMART message as being similar to the warning light on a car’s gas gauge. There’s still some fuel in the system when the light comes on, but you should pull over at the next opportunity to fill up.

The HyperX is designed to keep writing for as long as the NAND is viable, regardless of its rated endurance. Flash blocks are only retired if there’s a programming failure, an erase failure, or if the acceptable ECC tolerance has been exceeded.

Programming and erase failures are logged by separate SMART attributes, and they really ramped up toward the end of the drive’s life. So did the number of reallocated sectors. By the end, there were 986 reallocated sectors, 111 programming failures, and 381 erase failures. Those figures suggest about half of the retired sectors were taken out of commission due to ECC issues.
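The "about half" estimate is simple subtraction, sketched below. It assumes each programming or erase failure retired exactly one sector and that the two counts don't overlap; the SMART data doesn't confirm either point, so treat the result as a ballpark.

```python
reallocated_sectors = 986
program_failures = 111
erase_failures = 381

# Assumption: one retired sector per logged failure, with no double counting.
ecc_retired = reallocated_sectors - program_failures - erase_failures
ecc_share = ecc_retired / reallocated_sectors

print(f"{ecc_retired} sectors ({ecc_share:.0%}) not explained by "
      "program or erase failures")
```

That leaves 494 sectors, or about 50% of the total, that would have been retired for exceeding the ECC tolerance.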

The HyperX 3K has loads of overprovisioned area, but sections of it are reserved for internal management routines and for RAISE, the RAID-like redundancy feature available in SandForce SSDs. Only a small portion is dedicated to “spare” blocks that can fill in for reallocated sectors. Once this extra NAND is consumed, the HyperX is finished. Kingston says the drive will fail to mount if the power cycles, which explains why ours wasn’t detected after a reboot.

Unlike the 335 Series, which checked out on its own terms, the HyperX appears to have failed after burning through all of the NAND available for writes. We still received multiple warnings before the failure, and there was additional write headroom after each one. A normal user would have had plenty of time to prepare for the failure.

Our third casualty was the Samsung 840 Series, which we expected to fail first due to the shorter theoretical lifespan of TLC NAND. Our accumulated SMART data supported that assumption, too. The 840 Series started logging reallocated sectors after only 200TB of writes, and it’s reported thousands of them since our experiment began—far more than any other SSD. However, the 840 Series also allocates more spare area to replace bad blocks, so it’s tuned with the TLC’s relative frailty in mind.

When we checked on the SSDs after 900TB of writes, the 840 Series was still functional, and Samsung’s own SSD Magician software gave it a clean bill of health. The 840 Series didn’t make it to a petabyte, though. It died suddenly in the last leg, without any preceding SMART warnings.

We’re not entirely sure what caused the failure. The Anvil utility crashed, and the drive disappeared from not only the Windows device and disk managers, but also from the SSD Magician and HD Sentinel utilities. The Intel storage driver detected the 840 Series as an unnamed Samsung SATA drive, but we couldn’t actually do anything with it. We weren’t even able to grab a log of the last batch of writes or a final accounting of the SMART status. We can, however, analyze the SMART data collected up to 900TB.

The wear-leveling count is sort of like the MWI and life-left attributes on the Intel and Kingston SSDs. It’s “directly related to [the] lifetime of the SSD,” according to Samsung, and it bottomed out after 300TB of writes. HD Sentinel bases its health estimate on this attribute, so it’s had a dim assessment of the 840 Series since the 300TB mark. But Samsung’s own software pronounced the drive in good health after 300TB, as it did at every subsequent milestone.

The SMART attributes also track how much of the 840 Series’ spare block reserve has been consumed by reallocated sectors. That attribute suggested there were plenty of spare blocks at the 900TB mark, so the flash’s mortality rate would have to have spiked dramatically for insufficient reserves to cause the eventual failure. Without SMART details from the time of death, we can’t be certain about what happened. What we can do is quantify the reallocated sectors along with another important attribute: uncorrectable errors.

Uncorrectable errors can compromise data integrity and potentially cause application or system crashes, so they’re kind of a big deal. The first bunch appeared after 300TB of writes, apparently during preparation for our first unpowered retention test. The 200GB file we use to check data integrity failed multiple initial hash checks and had to be recopied before proceeding. Although the 840 Series ultimately passed the retention test and a similar one after 600TB of writes, the uncorrectable errors put a mark on its permanent record.
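For readers who want to run a similar check at home, here is a minimal sketch of a hash comparison between a source file and its copy on the SSD. It is not the tool we used; SHA-256, the chunk size, and the function names are all assumptions chosen for illustration.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a very large file never has to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_is_intact(source_path: str, copy_path: str) -> bool:
    """True when the copy hashes to the same value as the source."""
    return file_sha256(source_path) == file_sha256(copy_path)
```

A mismatch means at least one byte of the copy differs from the source, which is the cue to recopy the file and try again.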

Between 800 and 900TB of writes, the 840 Series logged 119 more uncorrectable errors, bringing the total to 295. Anvil didn’t report any hash failures during that period, but we have its built-in integrity test set to run relatively infrequently, after each 1TB of writes, and only on a 700MB file that covers a small portion of the flash. Regardless of whether the last spate of uncorrectable errors resulted in incorrect data, it’s probably no coincidence the 840 Series died shortly after.

When we kicked off this experiment, Samsung told us to expect warning messages before the 840 Series’ demise. Failure would resemble a compatibility error, the company said, and it could manifest in a BSOD or “other failure notice.” Since we didn’t get any warnings or failure messages, something may have gone awry at the end of the line. The 840 Series’ lifeless body is being returned home for further analysis, which we hope will shed light on the drive’s final moments.

All these casualties are bumming me out, so let’s turn our attention to the survivors…

The petabyte club

As their comrades fell around them, the Corsair Neutron GTX, Samsung 840 Pro, and compressible Kingston HyperX 3K drives soldiered on to 1PB without issue. That’s kind of miraculous, really: a bunch of consumer-grade SSDs withstanding one freaking petabyte of writes. None of these drives are rated for more than 200TB.

Reaching such an important milestone warrants a closer look at the health of the remaining candidates, especially since one of them might not be with us for very long. Along the way to 1PB, the second HyperX posted a pre-failure SMART warning.

Thanks to its compressible payload, this HyperX logged only 716TB of flash writes for 1PB of host writes. Don’t read too much into the magnitude of the savings, though. The stream of sequential writes in our endurance test isn’t indicative of real-world client workloads. Those workloads write far too slowly to stress SSD endurance in a reasonable timeframe.

Apart from its declining life indicator, the compressible HyperX is in excellent shape. It’s logged only two reallocated sectors and no program or erase failures so far. The flash seems to be in much better condition than that of its incompressible twin, which had hundreds of reallocated sectors and lots of program and erase failures with a similar volume of flash writes. The difference between the two configs suggests there may be some variance in flash endurance from one SSD to the next, even within the same family. Our sample size is far too small to draw a definitive conclusion, though.

Given how the HyperX is designed to behave, death probably isn’t imminent. I wouldn’t expect a failure until the number of reallocated sectors starts increasing substantially.

Next up: the Samsung 840 Pro.

After the 840 Series’ sudden demise, it’s hard to know what to expect from the Pro. This drive has the same SMART attributes as its TLC counterpart, including the wear leveling count that’s supposed to be related to health. The thing is, that attribute hit its lowest point after 400TB of writes, and Samsung’s SSD utility said the drive was still in good shape. SSD Magician indicated that everything was cool at 1PB, too, although the shrinking reserve of spare blocks points to an increase in reallocated sectors.

The number of reallocated sectors started ramping up after 700TB, hitting 1836 at the 1PB mark. Based on its 1.5MB sector size, the 840 Pro has retired 2.7GB of its total flash capacity. There’s plenty left, but whether we burn through it all remains to be seen.
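The retired-capacity figure is just the sector count multiplied by the sector size, as the sketch below shows. Which gigabyte convention to use is our assumption, so both are computed; the variable names are ours.

```python
SECTOR_MB = 1.5             # sector size cited for the 840 Pro
reallocated_sectors = 1836  # count at the 1PB mark

retired_mb = reallocated_sectors * SECTOR_MB
retired_gb_decimal = retired_mb / 1000  # 1GB = 1000MB
retired_gb_binary = retired_mb / 1024   # 1GB = 1024MB

print(f"{retired_mb:.0f}MB retired: {retired_gb_decimal:.2f}GB decimal, "
      f"{retired_gb_binary:.2f}GB binary")
```

The total is 2754MB, which lands at about 2.75GB in decimal units and 2.69GB in binary ones, in line with the roughly 2.7GB quoted above.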

Our last survivor is Corsair’s Neutron GTX. Several of this drive’s SMART variables are obfuscated by vague, “vendor-specific” titles, but Corsair’s Toolbox utility identifies attribute 231 as “SSD life left.” HD Sentinel lists the same attribute as temperature, but the profile fits what we’d expect from a lifespan indicator, albeit one that thinks the Neutron is going to be around for a very, very long time.

If the current rate of decline continues, the life attribute won’t hit zero until after more than 4PB of writes. That seems a tad optimistic for a consumer-grade SSD, so we’ve asked Corsair to clarify exactly how the value is calculated. It’s possible the slope could steepen in response to reallocated sectors. The drive hasn’t logged any of those yet, though.

Before moving on to our performance results, I should clarify that simply writing a petabyte isn’t sufficient for entry into our exclusive club. After reaching that milestone, the survivors faced another unpowered data retention test. They were left unplugged for seven days, and they all returned with our 200GB test file fully intact.

Now that we know which SSDs lived and which ones died, let’s see if any of them slowed down over the last stretch.

Performance

We benchmarked all the SSDs before we began our endurance experiment, and we’ve gathered more performance data after every 100TB of writes since. It’s important to note that these tests are far from exhaustive. Our in-depth SSD reviews are a much better resource for comparative performance data. What we’re looking for here is how each SSD’s benchmark scores change as the writes add up.

For the most part, all the drives have performed consistently since we began. We’ve observed a few blips here and there, including a potential one for the Neutron GTX in the last sequential read speed test. The drive hit roughly the same speed through five runs, so it was consistent in that sense, just short of previous efforts. We’ll have to see what happens at 1.1PB and beyond.

Accumulated writes don’t affect performance in most of these tests. However, the read speeds on the Samsung 840 Series are a little slower in our last set of results. Hmmm. Perhaps our other performance data will be more enlightening.

Unlike our first batch of results, which was obtained on the same system after secure-erasing each drive, the next set comes from the endurance test itself. Anvil’s utility lets us calculate the write speed of each loop that loads the drives with random data. This test runs simultaneously on six drives split between two separate systems (and between 3Gbps SATA ports for the HyperX drives and 6Gbps ones for the others), so the data isn’t useful for apples-to-apples comparisons. However, it does provide a long-term look at how each drive handles this particular write workload.

Again, the SSDs have mostly behaved consistently. The 840 Pro’s run-to-run inconsistency is kind of its thing, while the Neutron GTX’s slowly increasing pace has been evident from the start. Pay no attention to the regular spikes for some of the SSDs; those are related to the secure erase we perform before running our performance benchmarks every 100TB.

Our casualties maintained consistent write speeds for much of their lives, but there’s evidence of sputtering toward the end. Let’s zoom in for a closer look. The Intel and Kingston SSDs are covered through their final runs, but we don’t have data for the Samsung beyond 900TB.

Even without its last gasps on record, the 840 Series clearly started breathing more erratically over the last few hundred terabytes. The HyperX barely staggered in its final steps, while the 335 Series suffered a short but noticeable bout of wheezing before it hit the wall.

Ok, so maybe that’s a stretch—the noticeable bit, not the drawn-out running metaphor. These are ultimately minor reductions in write speeds, at least for the Intel and Kingston SSDs. It’s possible the Samsung got substantially slower closer to the end of its life, though I wouldn’t bet on it based on the data we have.