
This post completes The SSD Guy’s four-part series to help explain Intel’s two recently-announced modes of accessing its Optane DIMM, formally known as the “Intel Optane DC Persistent Memory.”

Comparing the Modes

In the second and third parts of this series we discussed Intel’s Memory Mode and the company’s App Direct Mode. This final part aims to compare the two: When would you use one and when the other?

There’s really no simple answer. As with all benchmarks, certain applications will perform better with one mode than with another, while other applications will behave the opposite way. Adding to the problem is the fact that App Direct Mode actually supports not one but four different access methods, which will be further explained below. As a rule of thumb, performance for large serial accesses might be better when Optane is being used as a standard I/O device, while small random accesses could perform better in either Memory Mode or in App Direct Mode’s direct memory access, depending on whether or not persistence is necessary.

It’s a little easier to understand why the App Direct Mode exists if we examine the delay penalties that standard storage interfaces impose upon fast storage. Back in 2015 the Storage Networking Industry Association (SNIA) often used a rainbow-colored chart to illustrate the sub-components of the delays in storage access. The following diagram shows a portion of that chart. The original data that drives this chart was provided by Intel and it shows data access delays when using an Optane-based NVMe SSD.

Various colors are used to represent the delays from different parts of the system. The three blue portions are delays within an Optane device, whether it’s an SSD or an NVDIMM. If I understand this correctly, the green “Link Transfer” time represents delays from the PCIe interface and the amber “Platform & Adapter” delay stems from the NVMe protocol, both of which could be significantly reduced if the storage was connected via the DRAM’s memory channel instead of an NVMe interface. But the clear culprit in this chart is the software, in red, which accounts for about 10 of the 18-odd microseconds taken to service the request. This overwhelms the small hardware delays represented in blue.
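To put the software share in perspective, here is a small back-of-the-envelope sketch. It uses only the two approximate figures quoted above (about 10 microseconds of software out of roughly 18 in total); the function names are my own, and the chart's other components are not broken out because their exact values are not given here.

```python
# Approximate figures read from the chart described above, not exact measurements.
TOTAL_US = 18.0      # "18-odd" microseconds to service one request
SOFTWARE_US = 10.0   # "about 10" microseconds spent in software


def software_share(total_us: float, software_us: float) -> float:
    """Fraction of the total access time that is spent in software."""
    return software_us / total_us


def best_case_speedup(total_us: float, removed_us: float) -> float:
    """Speedup if `removed_us` of the delay could be eliminated entirely."""
    return total_us / (total_us - removed_us)


if __name__ == "__main__":
    print(f"software share: {software_share(TOTAL_US, SOFTWARE_US):.0%}")
    print(f"best-case speedup: {best_case_speedup(TOTAL_US, SOFTWARE_US):.2f}x")
```

In other words, removing the software delay alone could at best slightly more than double the speed of an access, before any reduction in the link and adapter delays is counted.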

To get rid of most of this overhead the industry needed a way to shorten the software path, and that has been tackled through the SNIA NVM Programming Model that was explained in an earlier post.

For today’s post I decided to redraw SNIA’s standard NVM Programming Model diagram a bit to simplify and clarify and I will use that diagram in a moment.

First, though, let’s look at how software stored data before the days of persistent memory. One good approach to discussing this is to first provide the same NVM Programming Model diagram stripped down to show only the storage interface as it existed before the SNIA NVM Programming Model was defined. That is provided in the diagram below:

An application program in this model normally writes to storage (SSD, HDD, or network storage) by using a Standard File API. More sophisticated applications that were written to manage the disk themselves (rather than leaving this to the file system) could write directly to storage using Raw Device Access. Raw device access can provide a little more speed since the data doesn’t need to go through the file system. Both access types go through the driver. The big red software delay in the earlier chart represents the slower of these two paths — a standard file API.

NVDIMMs allow even more software to be bypassed than is possible with raw device access, since the application program can read from or write to an NVDIMM without using either a driver or the file system. This is illustrated in a similar diagram below:

The new NVM-Aware File System is designed to take advantage of the new features of the NVDIMM to run faster than the standard file system would. Application programs can also communicate directly with the NVDIMM as if it were standard DRAM main memory. All of this is handled by Direct Access, which is usually abbreviated to “DAX.” This is a set of calls that understand that persistent memory exists on the memory bus.

The application program must be rewritten to use these DAX calls, and this process has already started at a number of leading application software providers.
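To give a feel for what memory-style access looks like to a programmer, here is a minimal sketch that stores data through a memory mapping rather than through write() calls. It is an illustration only, not Intel's or SNIA's API: it uses Python's standard mmap module on an ordinary temporary file, and the file name and helper names are made up for the example. On a regular file system the mapping still goes through the operating system's page cache; the assumption is that on a DAX-capable file system the same style of mapping would reach the persistent memory directly. Production code would more likely be written in C against a library such as PMDK.

```python
import mmap
import os
import tempfile


def store_via_mapping(path: str, payload: bytes, size: int = 4096) -> None:
    """Write `payload` through a memory mapping instead of write() calls."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, size)            # a mapped file needs a non-zero length
        with mmap.mmap(fd, size) as m:
            m[:len(payload)] = payload    # an ordinary memory store
            m.flush()                     # ask the OS to push the stores to the backing media
    finally:
        os.close(fd)


def load_via_mapping(path: str, length: int) -> bytes:
    """Read `length` bytes back through a read-only mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return bytes(m[:length])


if __name__ == "__main__":
    # Hypothetical location: a temporary file stands in for a file on persistent memory.
    demo_path = os.path.join(tempfile.mkdtemp(), "pmem_demo.bin")
    store_via_mapping(demo_path, b"hello, persistent memory")
    print(load_via_mapping(demo_path, 24))
```

The point of the sketch is the shape of the code: once the mapping exists, reads and writes are plain memory operations, and only the explicit flush marks the moment at which the data is meant to be durable.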

DAX mode is the way that the fastest systems will use NVDIMMs, whether NVDIMM-N (based on DRAM and NAND flash), Intel’s Optane DIMMs, or some future NVDIMM type. DAX minimizes or even eliminates the big red software delay of the top diagram, depending on whether the NVDIMM is accessed as storage through the drivers or as memory.

When the NVDIMM is accessed as memory the amber and green delays of that rainbow bar also nearly disappear since it is on the memory channel, leaving us only with the blue delays of the 3D XPoint memory technology itself.

Software that previously used the older two approaches shown in the top diagram will still operate in a system that embodies the newer calls, but it won’t be able to take advantage of all of the speed that the NVDIMM can offer.

In essence, then, we have four ways that an NVDIMM-N can be accessed through the SNIA NVM Programming Model, and five ways that an application program can access an Optane DIMM, since Intel supports the four access types in the SNIA NVM Programming Model with its App Direct Mode and another way through its Memory Mode.

So far I have not seen any benchmarks that compare Memory Mode against App Direct Mode, or that compare App Direct Mode’s four access methods against one another, but it’s clear that direct memory accesses will out-perform the other three App Direct Mode interfaces for small random transfers.

The comparison between Memory Mode and the direct memory access in App Direct Mode is likely to veer in favor of App Direct Mode in most benchmarks, as long as most persistent stores can be converted to writes on the memory channel. You can be sure, though, that someone will find a benchmark that comes to the opposite conclusion.

For example, in Memory Mode nearly all of the Optane DIMM’s traffic will be cached in the DRAM, making the Optane DIMM appear to be much faster than it really is. In App Direct Mode Optane runs at its own native speed, which varies according to its power settings, and is roughly 1/3rd as fast when performing 64-Byte transfers as it is running 256-Byte transfers. Also, Optane writes take about three times as long as Optane reads. Still, if the Optane DIMM is being used for persistent storage that would otherwise be serviced by an SSD, and if that persistence is required by the application (i.e. all data must survive a power outage) then App Direct Mode will always win. In other cases it might not.
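The caching effect in Memory Mode can be illustrated with a simple weighted-average model. This is a sketch under loud assumptions: the latency figures below are round placeholders that I chose for illustration, not measurements of DRAM or Optane, and the model ignores the extra lookup cost of a cache miss.

```python
def effective_latency_ns(hit_rate: float, dram_ns: float, optane_ns: float) -> float:
    """Average access latency when a DRAM cache sits in front of Optane.

    hit_rate is the fraction of accesses served from the DRAM cache.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * dram_ns + (1.0 - hit_rate) * optane_ns


if __name__ == "__main__":
    DRAM_NS = 100.0     # placeholder value, not a measurement
    OPTANE_NS = 300.0   # placeholder value, not a measurement
    for hit_rate in (0.5, 0.9, 0.99):
        avg = effective_latency_ns(hit_rate, DRAM_NS, OPTANE_NS)
        print(f"hit rate {hit_rate:.0%}: about {avg:.0f} ns on average")
```

The closer the hit rate gets to 100%, the more the combination behaves like DRAM, which is why a workload that fits well in the DRAM cache can make the Optane DIMM look much faster than its native speed.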

It is likely that benchmarks that compare App Direct Mode against Memory Mode will sometimes fall in favor of one and sometimes in favor of the other since these benchmarks will be comparing a large memory with a small delay against fast storage with a larger delay. The University of California, San Diego (UCSD) has published a study that compares a number of these modes, and that study will be the subject of a future SSD Guy post.

Intel published a couple of benchmarks during the Cascade Lake roll-out in April, and I will share these here. The first is a Redis benchmark that shows the number of virtual machines that can be supported as a function of memory size. This is an illustration of how Memory Mode can reduce system cost.

The memory size runs along the bottom of the chart, and the number of virtual machines that it can support is shown on the vertical axis. If you have more memory then you can support more virtual machines. The red and blue lines plot the number of virtual machines supported by each memory size depending on the size of the VM, with the blue line depicting a 45GB virtual machine and the red line depicting a VM that is twice as large.
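The relationship the chart plots can be approximated with simple integer division. The sketch below is my own simplification, not Intel's benchmark method: it ignores hypervisor and operating system overhead, so it gives an upper bound, and the list of memory sizes is a hypothetical set of configurations rather than the chart's actual data points.

```python
def max_vms(memory_gb: int, vm_size_gb: int) -> int:
    """Upper bound on the number of VMs of a given size that fit in memory."""
    if vm_size_gb <= 0:
        raise ValueError("vm_size_gb must be positive")
    return memory_gb // vm_size_gb


if __name__ == "__main__":
    SMALL_VM_GB = 45               # the blue line's VM size
    LARGE_VM_GB = 2 * SMALL_VM_GB  # the red line's VM is twice as large
    for memory_gb in (768, 1536, 3072, 6144):  # hypothetical configurations
        print(memory_gb, max_vms(memory_gb, SMALL_VM_GB), max_vms(memory_gb, LARGE_VM_GB))
```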

Probably the most important part of this chart, though, is the set of gray triangles on the horizontal axis that call out the system cost difference between a DDR4-only system and a Memory Mode system that uses both DDR4 and Optane DIMMs. At 6TB the memory in an Optane-based system costs 30% less than that of a DRAM-based system, even though the Optane-based system uses 768GB of DDR4 DRAM that is not even included as a part of that 6TB since Memory Mode renders it invisible to the application program. What does not show in this chart is the relative performance of Memory Mode versus App Direct Mode.

The results that Intel has shared for App Direct mode are less tangible. One example given in April was a re-statement of an Apache Spark benchmark highlighted at last August’s Flash Memory Summit. It appears below:

This slide shows that when a 1TB Optane Memory is used to cache HDD storage, versus a 600GB DRAM cache, performance improved by eight times: the mean times for I/O-intensive queries dropped from 1,222 seconds to 147 seconds. This benchmark took advantage of the fact that the Optane cache was persistent, allowing more of the workload to be cached than was possible in the volatile DRAM cache. (One would suspect that the results would have been less striking had SSDs been used instead of HDDs.)

But, so far, everything I have detailed above is still confusing. To clarify I have tried to represent most of the ways that an Optane DIMM can be used in the following table. I will explain more below.

| | Memory Mode | App Direct Mode: Raw Device Access | App Direct Mode: File API | App Direct Mode: Memory Access |
| --- | --- | --- | --- | --- |
| Persistent? | No | Yes | Yes | Yes |
| Interface to PM | Memory | I/O Stack | File System | Memory |
| I/O Size | 64 Bytes | 4K Bytes | 4K Bytes | 64 Bytes |
| Backward-Compatible? | Yes | Yes | Yes | No |
| Context Switch? | No | Yes | Yes | No |
| System DRAM | Invisible | Visible | Visible | Visible |
| Storage Speed | N/A | Moderate | Slowest | Fastest |
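The rows of this table can also be captured as data, which makes it easy to ask which access methods fit a given requirement. The sketch below is illustrative only: the class, the field names, and the two selection questions are my own, and the values are copied from the table rather than measured.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AccessMethod:
    name: str
    persistent: bool
    io_size_bytes: int
    backward_compatible: bool
    context_switch: bool


# Values transcribed from the table above.
METHODS = [
    AccessMethod("Memory Mode", False, 64, True, False),
    AccessMethod("App Direct: Raw Device Access", True, 4096, True, True),
    AccessMethod("App Direct: File API", True, 4096, True, True),
    AccessMethod("App Direct: Memory Access", True, 64, False, False),
]


def candidates(need_persistence: bool, can_rewrite_app: bool) -> List[str]:
    """Names of the access methods that satisfy both requirements."""
    return [
        m.name
        for m in METHODS
        if (m.persistent or not need_persistence)
        and (m.backward_compatible or can_rewrite_app)
    ]


if __name__ == "__main__":
    # Legacy software that needs its data to survive a power outage.
    print(candidates(need_persistence=True, can_rewrite_app=False))
```

A filter like this only narrows the field. Which of the remaining candidates is fastest still depends on the workload, as discussed above.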

For simplicity’s sake I have collapsed the two different Standard File APIs (the one through the File System and the other one that runs through the NVM-Aware File System) into a single one. At this early stage they are likely to be pretty similar to each other. That will probably change over time.

In the I/O Size row I write that the size of an I/O access is 4K Bytes for both Raw Device Access and a File API. I have not included the fact that all operating systems also support legacy 512-Byte accesses, which do have finer granularity, but which have largely been phased out in favor of the faster 4K-Byte transaction.

The row labeled “Backward-Compatible?” is meant to explain which application software can run on a system. Memory Mode operates with all legacy software, as do all of the App Direct Mode access types except direct memory access.

The next row, “Context Switch?”, refers to the way that the I/O is handled. Standard I/O devices interrupt the processor to communicate that a transaction has been completed. To service this interrupt the processor must save its current state onto the stack and replace the current state with the starting state of an interrupt service routine. This can consume several microseconds, so it is avoided when Optane is being accessed as memory in the App Direct Mode. (I will save details about this for a later post in The Memory Guy. For now, just believe me when I tell you that slow context switches can be avoided during the persistent memory accesses in App Direct Mode and in the non-persistent memory accesses in Memory Mode.)

The next row, titled “System DRAM,” reflects the fact that the DRAM in a Memory Mode system is invisible since it acts as the cache for the Optane DIMM. A system that uses 128GB of DRAM and 1TB of Optane memory in Memory Mode will appear to have only 1TB of memory. When the same system is used in App Direct Mode the memory size will be about 1.1TB since all of the memory, both DRAM and Optane, will be accessible to the application.
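The arithmetic behind that example is simple enough to write down. This is a minimal sketch with names of my own choosing; it treats 1TB as 1,024GB and models only the capacity that the application sees, nothing else.

```python
def visible_memory_gb(dram_gb: int, optane_gb: int, mode: str) -> int:
    """Capacity the application sees, in GB, for a given Optane mode."""
    if mode == "memory":
        # DRAM becomes a hidden cache, so only the Optane capacity is visible.
        return optane_gb
    if mode == "app_direct":
        # Both DRAM and Optane are addressable by the application.
        return dram_gb + optane_gb
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    for mode in ("memory", "app_direct"):
        gb = visible_memory_gb(128, 1024, mode)
        print(f"{mode}: {gb} GB ({gb / 1024:.3f} TB)")
```

With 128GB of DRAM and 1TB of Optane this gives 1,024GB in Memory Mode and 1,152GB in App Direct Mode, which rounds to the 1.1TB quoted above.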

The table’s final row addresses the speed of storage. Memory Mode is given an “N/A” since storage is not a part of this mode. Remember that Intel Optane DC Persistent Memory is not persistent when run in Memory Mode. Storage may be either fast or slow since it’s on some other device. The other columns reflect how much software delay impedes the various forms of access. Once again, serial streams will behave differently than random streams, and this table assumes that streams are pretty random.

But it gets even more complicated than that! Intel tells me that an Optane DIMM can be divided into different portions, each assigned to a different mode, so one or more address ranges will operate in Memory Mode while the remainder operate in App Direct Mode. This means that an application that uses Memory Mode to increase its memory size could be using an App Direct Mode on the same DIMM as a part of its storage. While that may initially be hard to understand, I am sure you will appreciate that you would be adding speed on top of speed if you were to take advantage of this approach.
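One way to picture such a split is as a list of regions, each tagged with a mode. The sketch below is purely conceptual: the Region class, the mode labels, and the 512GB capacity are hypothetical, and this is not how Intel's provisioning tools actually express the configuration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Region:
    mode: str      # "memory" or "app_direct"
    size_gb: int


def capacity_by_mode(regions: List[Region], dimm_capacity_gb: int) -> Dict[str, int]:
    """Total capacity assigned to each mode, checked against the DIMM's size."""
    totals = {"memory": 0, "app_direct": 0}
    for region in regions:
        if region.mode not in totals:
            raise ValueError(f"unknown mode: {region.mode!r}")
        totals[region.mode] += region.size_gb
    if sum(totals.values()) > dimm_capacity_gb:
        raise ValueError("regions exceed the DIMM's capacity")
    return totals


if __name__ == "__main__":
    # Hypothetical 512GB DIMM split evenly between the two modes.
    layout = [Region("memory", 256), Region("app_direct", 256)]
    print(capacity_by_mode(layout, dimm_capacity_gb=512))
```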

Does your head hurt yet?

Something really important to remember is that the Optane DIMM is only supported by certain Intel Cascade Lake CPUs (now known as the “2nd-Generation Intel Xeon Scalable processor”) but not by earlier-generation Intel processors or by any other company’s processors. This is because the Optane DIMM uses a proprietary DDR4-compatible interface that is a tightly-guarded secret. Called DDR-T, this interface uses a standard DDR4 protocol and signaling for the data and address buses, but has expanded the commands to manage the Optane DIMM while remaining compatible with standard DDR4 DIMMs. This will be the subject of a future post on the Memory Guy blog.

Those who want to really jump in and program Optane DIMMs should investigate Intel’s Persistent Memory Development Kit (PMDK), which provides tools to actually use App Direct Mode. It’s available at https://software.intel.com/en-us/persistent-memory/get-started

So that’s the story on Intel’s two confusing modes: App Direct Mode and Memory Mode. Although I hope that I have been clear enough, I realize that it’s a pretty tough concept, so please feel free to post any questions in the comments to this series.

And, if you have questions about what all of this means to your business please contact me through the Objective Analysis website. All of my contact information is at the bottom of my Analyst Profile page. Objective Analysis makes a point of helping our clients clear through the fog to find their best path to success.

This four-part series, published in early 2019, explores each of Intel’s two modes to explain what they do and how they work in the following sections: