NVMe is the key to unlocking the performance of flash in client and enterprise applications. We take a hands-on look at the NVMe specification with tests.

Introduction - What is NVMe?

Identifying the problem with current SSDs is simple. As they gained acceptance, SSDs were forced to conform in every fashion to the legacy of spinning media. A glaring problem is the interface: SSDs commonly use SATA and SAS connections, and AHCI and SCSI, the interface protocols for SATA and SAS, were originally designed to communicate with the slow rotating platters of HDDs. HDDs have never come close to providing the high IOPS and low-latency performance of SSDs, and legacy interfaces simply aren't designed to deliver the unadulterated connection flash-based devices crave.

SSD manufacturers migrated to the PCIe bus to power high-performance products for several reasons. Most common SSDs still come in a 2.5" form factor, even though one of the primary advantages of flash memory is enhanced density and a smaller footprint. The PCIe slot allows for denser designs better suited to the advantages of flash memory.

PCIe also provides much more bandwidth than other interfaces and reduces latency due to its close proximity to the CPU. Some manufacturers hamper performance by utilizing the AHCI protocol over the PCIe bus. Other manufacturers have taken to developing their own proprietary software interfaces to provide better performance. These approaches have restricted the performance of most PCIe SSDs.

The NVMe (Non-Volatile Memory Express) protocol was built from the ground up for non-volatile memory and scales to address both client and enterprise applications. Designed with next-generation non-volatile memory such as PCM and MRAM in mind, NVMe is memory-agnostic and has no NAND-specific commands. Instead, latency, parallelism, performance, and low-power operation were key focuses during NVMe development. These design tenets suit NAND flash and its inherently parallel architecture perfectly.

Designed specifically to cater to multi-core environments, the NVMe interface provides almost unlimited scaling. The optimized software stack provides a radical reduction in latency. Latency is the most treasured performance characteristic of SSDs and delivers tangible performance benefits to applications. NVMe also lightens the CPU load required to perform storage operations through a number of enhancements.

An overriding goal of the NVMe specification is to usher in the commoditization of PCIe SSDs. Over 90 companies have participated in its development through an open industry consortium, and a 13-company promoter group directs the body. Revolutionizing the communication protocol is a huge undertaking, and development of the entire ecosystem is key. The NVMe committee will help speed broad industry adoption of PCIe SSDs by providing standard drivers, a consistent feature set, development tools, and compliance and interoperability testing through the UNH-IOL NVMe Test Consortium.

NVMe is already experiencing broad driver support with integrated drivers in Windows, Linux, UNIX, Solaris, VMware, and UEFI. In the past, it took a considerable amount of engineering effort to develop custom drivers and software for PCIe SSDs from scratch, and then validation essentially occurred in a vacuum. This process takes months, and sometimes years, to validate and progress through qual cycles. Unfortunately, some of the problems encountered with drivers and software occurred with SSDs already deployed into production environments.

The NVMe protocol speeds development and accelerates time to market. Tier 1 NAND vendors, OEMs, SSD vendors, and even hyperscale datacenters, can deploy solutions easily with standardized controllers and reference firmware designed to utilize the NVMe interface. The ability to deploy previously validated utilities, software, and drivers, helps minimize firmware and software customizations.

After years in development, and attending more sessions at trade shows than I can count, it really is exciting to have NVMe finally make it into the lab. Let's get to testing.

Test Methodology

Testing NVMe performance against other storage protocols would ideally use DRAM, with its high performance and nanosecond-class latency, to test each protocol head-to-head on the same device. In fact, this is how many tests were conducted during NVMe development. In the absence of a DRAM-class test device, we utilize the 1.6TB Intel P3700, the first NVMe SSD in our lab. Many of NVMe's advantages center on latency and lower CPU utilization, but the Intel P3700 also delivers tremendous performance and endurance. In the following pages, we will detail exactly how NVMe provides these performance advantages.

Our full 1.6TB Intel P3700 NVMe PCIe Enterprise SSD Review delves further into the characteristics and features of the P3700. Today we are using the P3700 to conduct a few cursory tests that highlight some of the attractive features of the NVMe protocol. Bear in mind that there are NAND and controller limitations to all current-generation SSD products that prevent a full illustration of NVMe performance.

Readers should view our results as anecdotal evidence of performance improvements from NVMe; the architecture of the new Intel SSD delivers some of the performance enhancements as well. We are presenting 4k random read and write results, and the OLTP server workload, which consists of an 8K 66/33% read/write workload. This highlights a common use-case with a difficult mixed workload.

Our test platform is an Intel R2000GZ server with dual 10 core (20 logical) E5-2680 v2 processors, and 64 GB of DDR3 RAM. The other two SSDs in the test pool, the Micron P420m and Virident FlashMAX II, utilize proprietary software and drivers for connection via PCIe.

We also include results for the SATA 6Gb/s Intel DC S3700 and the 12Gb/s SAS HGST SSD800MH. We do not include their test results in standard performance tests; they do not compete in the same segment. These two 2.5" SSDs are used as reference points for our Latency v IOPS, CPU, and interrupt tests.

Our approach to storage testing targets long-term performance with a high level of granularity. Many testing methods record peak and average measurements during the test period. These average values give a basic understanding of performance, but fall short in providing the clearest view possible of I/O QoS (Quality of Service).

'Average' results do little to indicate performance variability experienced during actual deployment. The degree of variability is especially pertinent, as many applications can hang or lag as they wait for I/O requests to complete. This testing methodology illustrates performance variability, and includes average measurements, during the measurement window.

While under load, all storage solutions deliver variable levels of performance. While this fluctuation is normal, the degree of variability is what separates enterprise storage solutions from typical client-side hardware. Providing ongoing measurements from our workloads at one-second reporting intervals illustrates product differentiation in relation to I/O QoS. Scatter charts give readers a basic understanding of I/O latency distribution without directly observing numerous graphs.

Consistent latency is the goal of every storage solution, and measurements such as Maximum Latency only illuminate the single longest I/O received during testing. This can be misleading, as a single 'outlying I/O' can skew the view of an otherwise superb solution. Standard Deviation measurements consider latency distribution, but do not always effectively illustrate I/O distribution with enough granularity to provide a clear picture of system performance. We utilize high-granularity I/O latency charts to illuminate performance during our test runs.
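To make the distinction concrete, here is a small Python sketch using synthetic, hypothetical latency samples (not our actual test data) that shows how a single outlying I/O inflates Maximum Latency and standard deviation while leaving percentile measurements nearly untouched:

```python
import random
import statistics

# Hypothetical illustration: 10,000 well-behaved I/O latency samples (ms)
# clustered around 0.20ms, plus one 25ms outlying I/O.
random.seed(42)
latencies = [random.gauss(0.20, 0.02) for _ in range(10_000)]
latencies.append(25.0)  # the single outlier

max_lat = max(latencies)
stdev = statistics.stdev(latencies)
sorted_lat = sorted(latencies)

def percentile(sorted_samples, p):
    """Nearest-rank percentile of a pre-sorted sample list."""
    k = max(0, round(p / 100 * len(sorted_samples)) - 1)
    return sorted_samples[k]

p99 = percentile(sorted_lat, 99)
p999 = percentile(sorted_lat, 99.9)

print(f"max      = {max_lat:.2f} ms")  # dominated by the lone outlier
print(f"stdev    = {stdev:.3f} ms")    # also skewed upward by the outlier
print(f"99th pct = {p99:.3f} ms")      # barely affected by the outlier
print(f"99.9 pct = {p999:.3f} ms")
```

A solution that looks terrible by Maximum Latency can look excellent by 99.9th percentile, which is why we chart every one-second interval instead of relying on summary statistics.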

Our testing regimen follows SNIA principles to ensure consistent, repeatable testing, and utilizes multiple threads to represent typical production environments.

4k Random Test Results

The Intel P3700 peaks at 467,055 IOPS with 256 OIO (Outstanding I/O), well below the 728,111 IOPS from the Micron P420m. The Virident FlashMAX II averages 350,532 IOPS. One area to note is the impressive performance from the Intel P3700 at 32, 64, and 128 OIO. The P3700 performs very well under these relatively light workloads. The FlashMAX II also demonstrates excellent performance at 8 and 16 OIO.

While the Intel P3700 trails the Micron P420m in IOPS, it outperforms the P420m in terms of latency. The FlashMAX II is competitive in this respect, but the overall advantage goes to the P3700.

Comparing IOPS to latency highlights the performance enhancements of the Intel P3700. At 0.2ms the P3700 delivers an astounding 435,000 IOPS, while the P420m scores 150,000 IOPS and the FlashMAX II provides 280,000 IOPS. The FlashMAX II actually provides lower latency until it reaches 180,000 IOPS, but then the P3700 dominates the remainder of the chart. The P420m delivers excellent performance and IOPS top out much higher than competing devices, but requires a high OIO load to leverage its parallelism advantage. This has the side effect of adding latency, presumably from system overhead.

The P3700 delivers lower latency until it reaches 460,000 IOPS, its effective performance limitation. The P3700 delivers almost 400,000 IOPS before it reaches the same latency the P420m begins with at 50,000 IOPS.

We also included a 6Gb/s SATA Intel DC S3700 and a 12Gb/s SAS HGST SSD800MH as reference markers for the performance of two leading traditional 2.5" SSDs. The DC S3700 and its SATA/AHCI connection do not scale well, and performance does not increase much as it adds latency. The SSD800MH and its 12Gb/s SAS connection scale well as we reach its top speed of 130,000 IOPS, but latency increases quickly as it reaches the limit of its performance.

One of the most important benefits of NVMe is the reduction in processor overhead for input/output operations. Streamlining of the I/O submission and completion process yields a performance advantage, and we describe the process in detail in the coming pages.

We measure IOPS performance for each 0.1% of CPU utilization by dividing the number of IOPS during the test window by the CPU utilization, expressed in tenths of a percent. We are testing with a dual E5-2680 v2 system, and the high core count (40 logical) keeps total utilization low, in some cases under 1%, forcing us to measure at finer granularity. We recorded these results without the IOPS-generation overhead from our test tool.
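The metric itself is simple arithmetic; the sketch below uses hypothetical input figures, not our recorded data, to show the calculation:

```python
def iops_per_tenth_percent_cpu(iops: float, cpu_util_percent: float) -> float:
    """IOPS delivered for each 0.1% of total CPU utilization.

    Dividing by (utilization * 10) converts whole percentage points
    into tenths of a percent."""
    return iops / (cpu_util_percent * 10)

# Hypothetical example: a drive sustaining 460,000 IOPS at 0.6% total
# CPU utilization on a 40-logical-core system.
efficiency = iops_per_tenth_percent_cpu(460_000, 0.6)
print(f"{efficiency:,.0f} IOPS per 0.1% CPU")
```

Higher values indicate a drive and protocol stack that leave more processor cycles available for the application generating the I/O.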

The P3700 easily leads this test with an average of 78,843 IOPS. The P420m falls in behind at 52,937 IOPS. The Virident takes third place, and the SATA and SAS representatives lag far behind in efficiency.

The host system generates an interrupt for each completed I/O request. A high number of interrupts burdens the processor with work servicing I/O that could otherwise be dedicated to applications. One of the key advantages of NVMe is interrupt coalescence, which allows the protocol to service several commands per interrupt. We cover that process in more detail later in the article. To compare processor efficiency, we recorded the number of IOPS processed per interrupt for each of the competing solutions.
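The measurement reduces to a simple ratio; the figures below are hypothetical examples for illustration, not our recorded counts:

```python
def iops_per_interrupt(iops: float, interrupts_per_second: float) -> float:
    """Average number of I/O completions serviced per hardware interrupt.

    A value above 1.0 indicates interrupt coalescence: one interrupt
    signals multiple completed commands."""
    return iops / interrupts_per_second

# Hypothetical: a coalescing NVMe device versus a legacy device that
# fires one interrupt per completion.
print(iops_per_interrupt(400_000, 100_000))  # coalescing: 4.0 per interrupt
print(iops_per_interrupt(50_000, 50_000))    # legacy: 1.0 per interrupt
```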

The Intel P3700 easily leads this test with an average of nearly four IOPS per interrupt during the measurement window. The P420m comes in a close second with just shy of 3.5 IOPS per interrupt. Micron utilizes a custom driver for their controller and is one of the leaders of NVMe development. It is important to note that Micron actually has different interrupt coalescence settings in the RealSSD Manager. We may be observing some traces of NVMe-esque interrupt coalescence from the Micron driver.

The Virident FlashMAX II handles its drive management operations, such as garbage collection and wear leveling, on the host system, likely penalizing it in this test. The 6Gb/s SATA and 12Gb/s SAS representatives also provide less than half an I/O per interrupt.

The Intel P3700 leads easily in this test, delivering nearly its maximum performance of 148,989 IOPS with only 8 OIO. The P420m also delivers performance right out of the gate and tops out at nearly 110,000 IOPS. The FlashMAX II requires more parallelism to add performance, but tops out at 120,000 IOPS at 256 OIO.

The P3700 leads across the board as expected, with superb latency performance that leads every test.

The P3700 is simply outstanding in this test. It leads the pack and delivers the absolute lowest latency at nearly full performance. The performance in this heavy write workload also outstrips competitors. This illustrates the advantages of NVMe providing superior latency, and represents Intel's impressive engineering focus on delivering performance under low load conditions. This low-load performance has a tangible impact on application performance, and we delve into that aspect in the full product evaluation.

The HGST SSD800MH deserves honorable mention; its performance of 100,000 IOPS at such low latency (nearly that of the FlashMAX II) is a testament to the performance of 12Gb/s SAS. The poor DC S3700 sits alone at the far left, clearly outclassed and out of its element. While not tuned for heavy write workloads, the P420m delivers solid performance, albeit at a much higher starting latency than the other entrants.

The overall processor efficiency of the NVMe protocol and the P3700 come into clear focus during our 4k random write test. The P3700 flaunts more than double the CPU efficiency in comparison to competing solutions.

The P3700 once again leads the interrupt testing, and the Micron P420m follows on its heels. The other solutions are not close to the interrupt efficiency provided by the P3700 and P420m.

The P420m delivers monstrous read performance, but as we mix in writes the P3700 overtakes it at the 30% write mixture and takes the lead for the remainder of the chart.

OLTP Tests

The Intel P3700 leads the OLTP workload with 132,238 IOPS at 256 OIO. The P420m averages 106,403 IOPS, and the FlashMAX II weighs in with 121,240 IOPS.

The P3700 delivers lower latency at lower OIO while the FlashMAX II takes the lead at 128 and 256 OIO.

Comparing IOPS to latency illustrates just how much the P3700 distances itself from the competition. The P3700 occupies the space lowest and to the right, delivering the most IOPS at the lowest latency. At 1ms it delivers an outstanding 120,000 IOPS, while the FlashMAX II delivers 90,000 and the P420m weighs in at roughly 28,000 IOPS at the same latency. The P420m skims slightly above 1ms until it reaches 65,000 IOPS, still far below the performance from the P3700.

The HGST SSD800MH is still impressive; numerous 12Gb/s SAS SSDs stacked in a RAID configuration would be very competitive.

With a mixed workload, the P420m pulls slightly ahead in efficiency, possibly a hallmark of Micron's NVMe experience. The FlashMAX II fares well and is more competitive with mixed workloads. The DC S3700 is also surprisingly efficient in this workload.

Once again, the simplified driver stack and command set of NVMe, along with command coalescence, distances the P3700 from the pack in interrupt testing.

NVMe Connections

Now that we have highlighted several of the attractive performance benefits of NVMe, we will dig into how NVMe provides those results.

One of the best aspects of PCIe is its wide deployment across the computing space. Replacing or supplementing other protocols requires the broadest use case possible. PCIe is pervasive in almost all aspects of computing, from laptops and mobile applications all the way up to datacenter-class systems. PCIe also has a clear path forward, with faster versions such as PCIe 4.0 already in development, unlocking faster speeds with each new generation.

The ubiquitous PCIe connection also provides a big jump in bandwidth. SSDs outstripped the 6Gb/s SATA connection almost on arrival, leading to a necessary rethinking of the spec. SAS 12Gb/s is conspicuously absent from this chart, and tops out around 1.2 GB/s.

One of the most impressive benefits of the PCIe connection is its ability to process read and write operations simultaneously (full duplex). The commonly used SATA interface conducts each type of operation separately (half duplex), creating a bottleneck that prevents SSDs from fully exploiting their inherent parallelism.

PCIe also provides linear performance scaling. Each lane of PCIe Gen 3 provides roughly 1 GB/s of bandwidth, and adding lanes increases speed linearly until all lanes are occupied. This creates a performance ceiling of roughly 8,000 MB/s for current-generation x8 devices.
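The scaling is simple multiplication; the per-lane figures below are round approximations for illustration, not exact specification numbers:

```python
# Approximate usable one-direction bandwidth per lane, in MB/s, after
# encoding overhead. Round figures for illustration only.
PER_LANE_MBPS = {"gen2": 500, "gen3": 1000}

def pcie_bandwidth(gen: str, lanes: int) -> int:
    """Aggregate bandwidth scales linearly with lane count."""
    return PER_LANE_MBPS[gen] * lanes

for lanes in (1, 4, 8):
    print(f"Gen3 x{lanes}: ~{pcie_bandwidth('gen3', lanes):,} MB/s")
# Gen3 x8 yields the ~8,000 MB/s ceiling cited above.
```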

The innate capabilities of NVMe are well suited to datacenter applications. NVMe's initial market penetration will focus on enterprise applications, much like the progression from PATA to SATA.

The stage is being set for client applications with the development of standardized connections and form factors, such as M.2, SATA Express, and SFF-8639 already well under way. The numerous connection methods will support backwards compatibility with AHCI, but the true advantages will come from use of the NVMe protocol. The HHHLx4 PCIe CEM 2.0 specification provides plenty of room to scale capacity, especially as we enter the 3D NAND generation.

The SFF-8639 is the Swiss army knife of connectors. It will support SAS, SATA, PCIe x4 SSDs, and backplanes for x4 PCIe SSDs and SAS/SATA HDDs. The SFF-8639 connector will bring the PCIe connection to the familiar 2.5" form factor for easily serviced and hot-swappable SSDs.

There are no plans to expand the SATA bus to a faster speed. Instead, there is a focus on SATA Express (SATA 3.2). SATA Express actually isn't a protocol, it is a dual SATA/PCIe connector. It supports SATA as well as a PCIe connection and can use either AHCI or NVMe as the logical device interface. SATA Express will transfer over to the SFF-8639 receptacle in the future.

NVMe's primary competitor will also leverage the SFF-8639 connector; the x4 SAS noted in the graphic denotes the SCSI Express protocol. SCSI Express utilizes the SCSI command set over the PCIe bus via two T10 standards, SCSI over PCIe (SOP) and the PCIe Queuing Interface (PQI). The broad support for NVMe, and the products already reaching the market, do not bode well for SCSI Express, which is still being pushed to completion by the STA (SCSI Trade Association).

NVMe Operation

The goal of NVMe is to simplify and reduce the driver stack as much as possible. In the past, each small step forward focused on reducing the link-transfer and platform+adapter latency, marked in green and yellow. The move to PCIe places the storage device closer to the CPU, automatically dropping platform+adapter latency from 10usec to 3usec. The blue categories represent the actual speed of the hardware. Future NVM (Non-Volatile Memory) products, such as MRAM and PCM, will be up to 1000 times faster than NAND-based devices. NVMe lays the groundwork for handling that massive reduction in hardware latency, which in effect shifts the latency burden onto the software.
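A quick back-of-the-envelope calculation illustrates why faster media pushes the burden onto software; the media and software figures below are assumptions chosen for the arithmetic, not measurements:

```python
# Illustrative latency budget in microseconds. The software and media
# figures are assumptions; only the 10 -> 3usec platform+adapter drop
# comes from the slide discussed above.
software_us = 5.0          # driver/stack overhead (assumed)
platform_adapter_us = 3.0  # after the move to PCIe (down from ~10)
nand_media_us = 85.0       # NAND read latency (assumed)
future_nvm_us = 0.1        # PCM/MRAM-class media, ~1000x faster (assumed)

def software_share(media_us: float) -> float:
    """Fraction of total I/O latency spent outside the media itself."""
    total = software_us + platform_adapter_us + media_us
    return (software_us + platform_adapter_us) / total

print(f"NAND media:  stack is {software_share(nand_media_us):.0%} of latency")
print(f"Future NVM:  stack is {software_share(future_nvm_us):.0%} of latency")
```

With NAND, the stack is a small slice of total latency; with PCM/MRAM-class media it becomes nearly all of it, which is exactly the problem NVMe's streamlined stack is built to attack.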

One of the key building blocks of NVMe is its simplified command set. NVMe only requires 10 administrative and 3 I/O commands. Administrative commands are less frequent than NVM I/O commands. Mandatory admin commands are for management functions, such as creating and deleting queues. There are also optional administrative commands for formatting, firmware updates, and security features.

The three mandatory NVM I/O commands control reading, writing, and flushing volatile write caches to the storage medium. There are also optional commands such as dataset management, which provides TRIM functionality. Command arbitration assigns different priority levels to manage service levels.
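For reference, the mandatory NVM I/O commands (plus the optional dataset management command) can be sketched as a small enumeration; the opcode values follow the NVMe specification's opcode table, but treat this as an illustrative sketch rather than driver code:

```python
from enum import IntEnum

class NvmCommand(IntEnum):
    """NVM I/O command opcodes per the NVMe specification.

    FLUSH, WRITE, and READ are the three mandatory I/O commands;
    DATASET_MANAGEMENT is optional and carries TRIM/deallocate hints."""
    FLUSH = 0x00
    WRITE = 0x01
    READ = 0x02
    DATASET_MANAGEMENT = 0x09

print([f"{c.name}=0x{c.value:02X}" for c in NvmCommand])
```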

By comparison, after decades of development, SCSI has accumulated 170 total commands, and SATA has 8 different read commands alone. This leads to increasing complexity and is a perfect example of legacy baggage, while NVMe's spartan command set provides streamlined operation. In addition, an optional SCSI translation layer provides NVMe compatibility for those leveraging existing SCSI infrastructure.

NVMe supports parallel operations by establishing multiple queue pairs (Submission/Completion). The management queue, to the left, creates or deletes additional queues. Each new queue is assigned to its own CPU core, and these simple queues only process read, write, or flush commands.

Each queue supports up to 64,000 commands (QD), and the controller management queue can create a mind-boggling 64,000 queues. Submission and completion queues are allocated in host memory, and multiple submission queues can utilize the same completion queue.

AHCI is woefully inadequate by comparison; it only supports one queue with 32 commands. The location of the single queue on one core also severely hampers performance; operations routinely traverse multiple cores during the path to completion. NVMe supports multiple cores for maximum performance, and MSI-X, interrupt steering, and interrupt aggregation help extend scalability well beyond the capability of AHCI.
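The queue-pair model described above can be sketched as a toy producer/consumer structure; this is an illustrative model of the concept, not actual driver code, and the class and method names are invented for the example:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Command:
    opcode: str  # "read", "write", or "flush"
    lba: int

class QueuePair:
    """Toy model of one per-core NVMe submission/completion queue pair.

    Real queues live in host memory and support up to 64K commands each;
    this sketch only mirrors the producer/consumer shape."""
    def __init__(self, depth: int = 64_000):
        self.depth = depth
        self.sq = deque()  # submission queue: host is the producer
        self.cq = deque()  # completion queue: controller is the producer

    def submit(self, cmd: Command) -> None:
        if len(self.sq) >= self.depth:
            raise RuntimeError("submission queue full")
        self.sq.append(cmd)  # in hardware, the host then rings a doorbell

    def controller_process(self, batch: int) -> int:
        """Consume up to `batch` commands and post their completions;
        one interrupt can then signal the whole batch (coalescence)."""
        done = 0
        while self.sq and done < batch:
            self.cq.append(self.sq.popleft())
            done += 1
        return done

# One queue pair per core avoids the contention of AHCI's single queue.
qp = QueuePair()
for i in range(8):
    qp.submit(Command("read", lba=i))
completed = qp.controller_process(batch=8)
print(completed)  # 8 completions, potentially signaled by one interrupt
```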

These two slides outline the steps required for command submission and processing. This streamlined process requires only one 64B fetch per 4K command; in contrast, AHCI requires two serialized host DRAM fetches. NVMe also utilizes memory reads of the submission queue and eliminates performance-killing uncacheable/MMIO register reads from the issuance and completion paths.

NVMe Performance and Drivers

This graphic illustrates the difference between SCSI and NVMe in the Linux storage stack. NVMe reduces latency overhead by more than 50% by removing the request queue and SCSI translation layer. By utilizing the block-layer driver, NVMe saves 10,000 CPU cycles, and dedicating these extra CPU cycles to applications or other host processes will provide a performance advantage.

There will be an accompanying evolution of the software stack, and operating systems, to take advantage of the enhanced performance. Like the transition to fully utilizing multiple CPU cores after their debut, this will not occur overnight. NVMe drivers will hasten the transition by simplifying and standardizing software interaction with the storage stack.

The standardized Windows StorNVMe.sys driver follows the StorPort model. StorPort is an optimized library of hardened drivers that scales from servers to tablets. StorNVMe also supports interrupt coalescing for optimum latency performance and is NUMA-optimized.

This graph compares the performance of NVMe against SATA and SAS devices, but also compares a flash-based NVMe drive to a RAM-based NVMe drive. The advantages of moving beyond flash for ultra-high-performance applications will likely spawn a new set of DRAM-based products to address the high-performance segment in the interim before next generation non-volatile memories.

Initial performance comparisons from Intel tout 2x the performance of 12Gb/s SAS and 4-6x the performance of SATA SSDs. We will be conducting tests comparing the speed of the P3700 against an array of leading 6Gb/s SATA and 12Gb/s SAS SSDs; look to these pages soon for a full report.

Database and virtualization workloads will benefit tremendously from NVMe architecture. NVMe provides native atomic I/O size affinity for databases and matches natural application I/O granularity.

The multi-core nature of NVMe queueing is well suited for virtualization workloads and its multi-threaded performance requirements. Servicing the I/O from submission to completion on the same core occupied by the application thread will significantly increase performance, especially once software engineers begin to tap the possibilities of NVMe performance.

The comparison between NVMe and SATA SSDs is unavoidable, and rightly so. Replacing six SSDs with one PCIe SSD is a performance win, but the benefits of PCIe SSDs extend beyond the usual bandwidth and IOPS measurements.

Switching to a PCIe SSD eliminates several SSDs and an HBA or RAID adapter. There are also peripheral advantages in reduced cabling and complexity. Another compelling use-case is micro-server designs, where the enhanced performance density and capacity of an NVMe solution will facilitate smaller designs.

NVMe includes comprehensive end-to-end data protection (T10 DIF and DIX compliant), security and encryption capabilities (TCG), robust error correcting, and management capabilities (SMBus and I2C). Support for I/O virtualization architectures, such as SR-IOV, is also included.

Competing with SAS functionality required the development of similar enterprise-class features, such as Multi-Path I/O, namespace sharing, enhanced reset capabilities, and a reservation mechanism compatible with SCSI reservations. These features are baked into revision 1.1, and increased power management capabilities will be included in 1.2.

Autonomous power management provides low-power states, similar to DEVSLP, for client applications. The NVMe implementation is autonomous, meaning the device does not have to power back up to receive host commands before entering deeper sleep states.

Final Thoughts

NVMe has garnered industry-wide support and is speeding its way into a datacenter near you. Many are not aware just how far NVMe has already spread behind the scenes since the release of the NVMe 1.0 specification back in 2011.

PMC-Sierra is already supplying controllers to two of the largest datacenter customers in the world so they can build their own SSDs. Even though PMC cannot name them directly, it isn't hard to speculate which two entities have the largest datacenters in the world. This build-it-yourself philosophy lines up perfectly to the Open Compute initiative, and we all know the players in that space. PMC is publicly supplying Princeton controllers to EMC for their own internally developed products. These three projects are already deep in internal qual cycles, and the new approach of build-it-yourself SSDs has the ability to upend the current model for hyperscale datacenter flash deployments. NVMe standardization, with its focus on ease-of-deployment, enables much of this progress.

Samsung already has units shipping, and SanDisk, EMC, HGST, LSI, Seagate, and Micron are all already well underway in their NVMe initiatives. A quick look at the University of New Hampshire's InterOperability Laboratory (UNH-IOL) page reveals a list of companies that have passed conformance and interoperability testing for NVMe 1.0 and 1.1 designs, and we expect that list to grow quickly.

Intel is the first to market with a wide range of both PCIe and 2.5" NVMe SSDs for the retail space, hence our testing with the Intel P3700 for this article. Intel is in a unique situation among the companies developing NVMe; they actually have the ability to push NVMe-capable slots onto motherboards and chipsets. There is already a growing interest in M.2 slots for ultra-dense server designs, and Intel is the locomotive that pulls the chipset train along the track.

One of the only challenges facing NVMe is PCIe lane limitations. At first glance, 80 lanes in a dual-socket server would seem adequate for even the most extreme storage requirements, but it is important to remember the other devices sharing the bus; high-performance 40GbE networking cards and other PCIe devices typically accompany high-performance flash deployments.

Intel's dominance in the server CPU market places them in the unique position to address lane limitations with future CPUs and chipsets. If deemed necessary they could simply increase the number of lanes. They also have some control over the PCIe 4.0 expansion timeline, which could be a more efficient approach to addressing PCIe lane limitations.

NVMe's optimized register interface and command set minimizes the number of CPU clocks per I/O and delivers higher performance and lower latency to the end-user. Our testing confirmed the Intel P3700 features efficient CPU utilization when servicing I/O, and its interrupt coalescence also leads the pack in most tests. It is hard to do a better job of illustrating the superior latency performance of NVMe than our IOPS v Latency tests. The difference between NVMe latency and other solutions is night and day.

The initial move to the faster PCIe bus for SSDs was almost intuitive. Placing NAND closer to the CPU delivered tangible performance benefits in relation to existing interconnects, but the need to rethink and design a new protocol free of legacy baggage was necessary to unlock the performance of the PCIe bus.

Intel founded the Non-Volatile Memory Express Workgroup in 2009 to power the future of storage. The result is a widely accepted standard that delivers 1/5th the latency of previous protocols and scales to address devices with 1000x the speed of today's flash-based storage products. NVMe is a forward-thinking protocol that will be around for at least another decade.

The aging AHCI interface debuted in 2004, and like previous interfaces before it, it will succumb to the forward march of progress. It is only a matter of time until NVMe-compatible connections replace AHCI. SCSI Express is a projected contender in the protocol race, but NVMe is widely accepted by a broad consortium of industry leaders and it is hard to foresee SCSI Express mustering a challenge with NVMe already on the market.

For now, our bet is on NVMe powering the future of non-volatile memory. Keep your eyes on the pages here at TweakTown for our full product evaluation of the 1.6 TB Intel P3700, followed shortly with an article exploring four of the 1.6TB P3700's in various RAID configurations.