In early March, Intel gathered industry analysts and members of the tech press in Folsom, California to talk SSDs. The city is home base for Intel’s Non-Volatile Memory Solutions Group, otherwise known as NSG, which is responsible for the development and testing of Intel solid-state drives. The NSG had a story to tell about how its design and validation work produces extremely reliable SSDs. We got hard numbers on failure rates, details about efforts to make SSDs more dependable, and a peek behind the scenes at the Folsom facility.

We were also let in on a little secret—an easter egg, if you will. Remember the Intel 730 Series we reviewed in February? You know, the one with the skull on it? Well, it also has a bonus that glows under UV lighting. Good thing I have an old black light left over from my misspent youth.

I can’t decide if the logo is cheesy or cool. It’s probably a little of both. Intel is working on other aesthetic flourishes, which certainly can’t hurt on premium-priced products like the 730 Series.

With that revelation out of the way, let’s get down to business.

SSDs store precious personal information, and they increasingly power the datacenters behind popular online services. Reliability is of paramount importance. The trouble is, it’s rarely quantified. SSD makers have traditionally shied away from providing field reliability data, and retailers typically don’t disclose manufacturer-specific return rates. We’re left sifting through user reviews and forum threads to get a sense of which drives are better than others.

The anecdotal evidence spread across those sources suggests Intel SSDs are among the most reliable. And, wouldn’t you know, the snippets of data shared with us seem to agree.

Back in 2010, Intel decided to convert all of its corporate PCs to solid-state storage. The firm deployed its own SSDs, of course, and the failure rate for those drives has thus far been one fifth that of the mechanical drives they replaced. That's an impressive reduction, especially since the data reaches back four years; SSDs have matured quite a bit over the last couple of generations.

Intel didn’t discuss specific failure rates for the SSDs used in its own PCs. However, it did provide data on over six million drives shipped as part of its business-oriented Pro family. This product line has an annual failure rate target of 0.73%, and Intel has been comfortably under that mark for quite some time.

Even the return rates have met Intel’s AFR goal. More impressively, perhaps, the annual failure rate during this period never exceeded 0.4%. For much of 2013, the failure rate was well under 0.2%.
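Annual failure rate is a simple ratio of failures to drive-years of operation. As a rough illustration (the counts below are hypothetical, chosen only to land near the figures in the article, not Intel's actual data):

```python
# Illustrative AFR arithmetic -- the drive and failure counts here are
# hypothetical examples, not figures Intel disclosed.
def annual_failure_rate(failures, drive_years):
    """AFR as a percentage: failures per drive-year of operation."""
    return 100.0 * failures / drive_years

# e.g., six million drives in the field for one year with 12,000 failures:
afr = annual_failure_rate(12_000, 6_000_000)
print(f"{afr:.2f}%")  # 0.20%
```

At that rate, a fleet the size of Intel's Pro family would sit comfortably under the 0.73% target.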

Intel is meeting its targets for datacenter and client SSDs as a whole, too. Venkatesh Vasudevan, Director of NSG Quality, Reliability & Validation, pulled up the following graph during his presentation:

These numbers are based only on 2-3 million drives, Vasudevan said, and the sample size is small and skewed for the datacenter SSDs. Still, failure rates were below 0.2% for pretty much all of 2013, and they were around 0.1% for the last seven months of the year. That blip at nearly 0.8% on the datacenter plot represents some initial teething problems with Intel’s then-new DC-series drives, Vasudevan clarified.

We don’t have similar data from other drive makers, so it’s hard to put Intel’s numbers into perspective. However, Vasudevan pointed out that IHS analyst Ryan Chien told Computerworld in September that he’d seen data suggesting that “client SSD annual failure rates under warranty tend to be around 1.5%.” Based on that information, the assertion by NSG Marketing Director Pete Hazen that Intel SSDs have “best-in-class annual failure rates” seems pretty plausible.

Intel’s field reliability data also lends some credibility to its argument for deploying dual 730 Series drives in striped RAID 0 arrays on high-end desktop PCs. RAID 0 doubles the odds of losing data to a catastrophic drive failure, but with failure rates as low as Intel claims, that doesn’t seem like such a big danger.
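The "doubles the odds" claim follows directly from basic probability: a striped array loses data if either member drive fails. A quick sketch of the arithmetic, using a 0.2% per-drive AFR as the ballpark figure Intel cited:

```python
# Probability that an n-drive RAID 0 array suffers data loss in a year,
# given an independent per-drive annual failure rate p. RAID 0 has no
# redundancy, so any single drive failure takes out the array.
def raid0_annual_failure(p, drives=2):
    return 1 - (1 - p) ** drives

p = 0.002  # 0.2% per-drive AFR
print(f"{raid0_annual_failure(p):.4%}")  # ~0.3996%, i.e. roughly double p
```

For small p, 1 − (1 − p)² ≈ 2p, which is why "doubles the odds" is a fair shorthand; at these failure rates the array-level risk is still well under half a percent per year.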

Extensive validation efforts

The NSG’s development and quality assurance teams are housed in the same building, making it easy for them to collaborate closely during development. Reliability is “part of the architecture,” Vasudevan told us.

We got a few ingredients of Intel's special sauce, including background data refresh, a feature that moves data around if flash cells are losing their charge. Randomized data patterns are used to cut down on cell-to-cell interference. Also, failed reads are reattempted at higher voltages in an attempt to extract data from cells that might otherwise produce errors.
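The read-retry idea can be sketched in a few lines. This is a hypothetical illustration of the general mechanism, not Intel's firmware; the function names, the ECC check, and the number of voltage steps are all invented for the example:

```python
# Hypothetical sketch of read-retry: if a page fails its ECC check, retry
# the read with progressively shifted reference-voltage offsets before
# giving up. The offsets and callbacks are illustrative stand-ins.
def read_with_retry(read_page, ecc_ok, voltage_offsets=(0, 1, 2, 3)):
    for offset in voltage_offsets:
        data = read_page(offset)  # read the page at this voltage offset
        if ecc_ok(data):          # ECC engine says the data is clean
            return data
    raise IOError("page uncorrectable even after read-retry")

# Toy demonstration: a page that only reads cleanly at offset 2 or higher.
data = read_with_retry(lambda off: "ok" if off >= 2 else "garbage",
                       lambda d: d == "ok")
print(data)  # ok
```

Real controllers pick retry voltages from characterization data rather than a fixed ladder, which is where Intel's cell-level knowledge of its own NAND comes in.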

Intel has a particularly intimate relationship with the flash that goes into its SSDs. The company is part of a joint NAND manufacturing venture with Micron, and according to Hazen, Intel characterizes “every cell on every die on every package on every SSD” it sells. The SSD’s own controller and firmware are used to test the NAND, he said, and “hundreds of settings” determine how voltage is distributed to the cells. Intel uses its familiarity with NAND vulnerabilities to ensure sufficient error-correction capabilities in the accompanying controller, as well.

The drives do more than just test themselves, of course. Intel’s qualification team gets a crack at new models early in the development process, and it continues to monitor them after finished products have been released into the wild. It even maintains a comprehensive stockpile of every SSD Intel has ever made. If customers encounter issues with older models, Intel has comparable drives on hand for further analysis.

Intel’s internal testing comprises thousands of unique tests on hundreds of different validation platforms for both server and client systems. When the local staff punches out, remote testers from India log in to keep the operation running 24 hours a day.

The main Folsom verification lab has enough capacity to test 2,500 SATA drives and 1,500 PCIe ones simultaneously. The scale is impressive, as are the heat and cooling noise from the racks of drives. Additional testing is done in other Folsom labs, in separate Intel facilities, and also at the factory. All told, Intel’s internal test capacity is about 10,000 drives.

Every aspect of SSD operation is tested. Drives are heated and chilled in massive incubators, and stacks of them are put away for long-term data integrity tests. Even power-loss protection is given a thorough workout. Special equipment simulates numerous disconnect scenarios, including cutting the power and data lines separately and pulling connectors off at an angle instead of with a straight tug.

If drives fail, the Folsom facility is loaded with diagnostic hardware that can be used to investigate the problem. The lab is also equipped to validate the individual components used in SSDs, and it even has the tools required to analyze and cut entire flash dies.

Endurance testing is a fairly straightforward process for client SSDs, but burning through the write cycles of high-endurance enterprise drives is a little more time-consuming. For these SSDs, Intel follows the short-stroke method established by the JEDEC standards group. This approach uses custom firmware to limit writes to a smaller percentage of the flash, forcing those blocks to accumulate write cycles at an accelerated rate. The short-stroke pattern typically targets flash cells in the corners of the dies, though it can also sample from the middle.
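The acceleration comes straight from the arithmetic: confining writes to a fraction of the flash makes those blocks accumulate program/erase cycles proportionally faster. A rough sketch with made-up numbers (the cycle ratings and write rates below are illustrative, not Intel's test parameters):

```python
# Rough short-stroke acceleration arithmetic. Assumes perfectly even wear
# within the targeted region; real wear-leveling is messier.
def accelerated_test_days(rated_cycles, daily_cycles_full, stroke_fraction):
    """Days to exhaust the rated P/E cycles when writes are confined to
    stroke_fraction of the flash instead of the whole drive."""
    effective_daily = daily_cycles_full / stroke_fraction
    return rated_cycles / effective_daily

# e.g., 10,000 rated cycles, 10 full-drive write cycles per day,
# writes confined to 5% of the flash:
print(accelerated_test_days(10_000, 10, 0.05))  # 50.0 days instead of 1,000
```

That 20x speedup is the difference between an endurance test that finishes in weeks and one that would outlast the product's development cycle.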

Intel’s internal reliability specification for high-endurance SSDs is tighter than the requirements laid out in JEDEC’s JESD219 standard for enterprise drives. The Intel spec accepts fewer functional failures and has a lower uncorrectable bit error rate: 10⁻¹⁷ instead of a mere 10⁻¹⁶. The Intel spec also adds a provision to account for read disturb, which isn’t covered by the JEDEC spec. Read disturb describes a phenomenon by which reading flash memory can alter the charge—and thus the data—stored in adjacent cells.
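To put those exponents in perspective, a UBER of 10⁻¹⁶ corresponds to roughly one uncorrectable bit per 10¹⁶ bits read. A quick conversion (the arithmetic is ours, not part of either spec):

```python
# Convert a UBER spec into expected bytes read per uncorrectable bit error.
def bytes_per_uncorrectable_error(uber):
    """Expected bytes read per uncorrectable bit error at a given UBER."""
    return (1.0 / uber) / 8  # 8 bits per byte

for uber in (1e-16, 1e-17):
    pb = bytes_per_uncorrectable_error(uber) / 1e15  # convert to petabytes
    print(f"UBER {uber:g}: ~{pb:.2f} PB read per uncorrectable error")
```

In other words, the JEDEC floor works out to about 1.25 PB read per uncorrectable error, while Intel's tighter 10⁻¹⁷ spec pushes that to roughly 12.5 PB.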

Intel Fellow and Director of Reliability Methods Neal Mielke told us that short-stroke testing revealed errors in the DC S3700 server drive. There were only two errors, he said, and they were within the spec. But Intel still changed the firmware and re-tested the drive to confirm that the problem had been addressed. Mielke said it’s “pretty common” to find issues when doing reliability testing early in product development.

On the next page, we’ll delve into the most interesting element of Mielke’s presentation: Intel’s efforts to safeguard SSDs from cosmic rays. Seriously.