Intel Optane SSD Microbenchmark

Symas Corp., August 2018

Since the point of these tests is to explore the performance of the Optane SSDs, the tests are configured much like the previous on-disk benchmark, using a database approximately 5x larger than RAM to minimize the impact of caching in RAM and force the storage devices to be exercised. However, there are some twists here as well: the Optane SSDs on NVMe can also be operated as if they were system RAM. The Optane technology still has higher latency than DRAM, but as we'll see, there's still a performance benefit to using this mode.

The hardware for these tests was graciously provided by our friends at Packet and system support was provided by Intel. The machine was based on an Intel S2600WFT motherboard with a pair of 16 core/32 thread Intel Xeon Gold 6142 processors and 192GB DDR4-2666 DRAM. Storage being tested included a 4 TB DC P4500 TLC NAND-Flash SSD and three 750GB DC P4800X Optane SSDs. The machine had Ubuntu 16.04 installed, with a 4.13.0-41-generic kernel. The software versions being used are LMDB 0.9.70 and RocksDB 5.7.3, both compiled from their respective git repos. (Note that LMDB 0.9.70 is the revision in the mdb.master branch, not an officially released version. The main difference is the addition of support for raw devices.)

Test Overview

Prior tests have already illustrated how performance varies with record sizes. In these tests we're strictly interested in the relative performance across the different storage types, so we're only testing with a single record size. We're using the ext4 filesystem in these tests, configured once with journaling enabled and once with journaling disabled. Each test begins by loading the data onto a freshly formatted filesystem. We use a 750GB partition on the 4TB Flash SSD, to ensure that the filesystem metadata overhead is identical on the Flash and Optane filesystems. Additionally, we test LMDB on raw block devices, with no filesystem at all, to explore how much overhead the filesystems impose. RocksDB doesn't support running on raw block devices, so it is omitted from those tests.

The test is run using 80 million records with 16 byte keys and 4000 byte values, for a target DB size of around 300GB. The system is set so that only 64GB RAM is available during the test run. After the data is loaded, a readwhilewriting test is run multiple times in succession. The number of reader threads is set to 1, 2, 4, 8, 16, 32, and 64 threads for each successive run. (There is always only a single writer.) All of the threads operate on randomly selected records in the database. The writer performs updates to existing records; no records are added or deleted, so the DB size should not change much during the test. The results are detailed in the following sections.

Loading the DB

Here are the stats collected from initially loading the DB for the various storage configurations.
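As a quick sanity check on the target size, the raw data volume can be worked out from the record parameters (a back-of-the-envelope sketch; per-page headers and filesystem metadata push the actual on-disk size somewhat higher, as the DB Size column below shows):

```python
# Raw size of the test dataset: 80 million records of 16-byte keys + 4000-byte values.
NUM_RECORDS = 80_000_000
KEY_SIZE, VALUE_SIZE = 16, 4000

raw_bytes = NUM_RECORDS * (KEY_SIZE + VALUE_SIZE)
print(raw_bytes / 1e9)  # ~321 GB of key/value data, roughly 5x the 64GB RAM limit
```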

LMDB
Storage       Journal  Wall        User      Sys       CPU%  DB Size (KB)  Vol CS    Invol CS  FS In      FS Out      Write Amp
Flash/Ext4    Y        11:50.91    01:15.70  09:40.36  92    322683976      5910595  1303           2640   840839736  10.5104967
Flash/Ext4    N        13:21.04    01:16.69  11:01.86  92    322683976      8086767  1241           3696   946659568  11.8332446
Flash (raw)   N        17:25.23    03:29.26  04:11.36  44    -             80669411  1346      645369800   645487344   8.0685918
Optane/Ext4   Y        14:20.99    01:12.78  12:09.88  93    322683976      9991458  1170            552   928896808  11.6112101
Optane/Ext4   N        15:11.10    01:16.72  12:49.09  92    322683976     10487638  1377           1080  1029364408  12.8670551
Optane (raw)  N        20:26.19    03:30.62  03:55.97  36    -             80670953  1305      645367344   645547472   8.0693434

RocksDB
Storage       Journal  Wall        User      Sys       CPU%  DB Size (KB)  Vol CS    Invol CS  FS In      FS Out      Write Amp
Flash/Ext4    Y        15:00.44    13:01.27  11:45.63  165   318790584       231768  3184          11400  1265319232  15.8164904
Flash/Ext4    N        14:30.45    12:53.43  10:46.62  163   318790584       215318  2786          11016  1265362424  15.8170303
Optane/Ext4   Y        2:13:40.00  13:51.74  11:14.07  18    318790328       339737  7549          11088  1265319000  15.8164875
Optane/Ext4   N        2:13:40.00  13:47.29  10:49.81  18    318790328       337922  7598          11256  1265364360  15.8170545

The "Wall" time is the total wall-clock time taken to run the loading process. Obviously shorter times are faster/better. The actual CPU time used is shown for both User mode and System mode. User mode represents time spent in actual application code; time spent in System mode shows operating system overhead where the OS must do something on behalf of the application, but not actual application work. In a pure RAM workload where no I/O occurs, ideally the computer should be spending 100% of its time in User mode, processing the actual work of the application. Since this workload is 5x larger than RAM, it's expected that a significant amount of time is spent in System mode performing actual I/O.

The "CPU" column is the sum of the User and System times divided by the Wall time, expressed as a percentage. This shows how much of the DB load's work occurred in background threads. Ideally this value should be 100: all foreground and no background work. If the value is greater than 100, a significant portion of work was done in the background. If the value is less than 100, a significant portion of time was spent waiting for I/O. When a DB engine relies heavily on background processing to achieve its throughput, it will bog down more noticeably when the system gets busy. I.e., if the system is already busy doing work on behalf of users, there will not be any idle system resources available for background processing.
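For example, the CPU figure for the journaled LMDB load on Flash/Ext4 can be reproduced from the Wall/User/Sys times in the table (the helper function here is our own, not part of the benchmark tooling):

```python
def to_seconds(t: str) -> float:
    """Convert an [hh:]mm:ss.xx time string to seconds."""
    secs = 0.0
    for part in t.split(":"):
        secs = secs * 60 + float(part)
    return secs

# LMDB, Flash/Ext4, journal on: Wall 11:50.91, User 01:15.70, Sys 09:40.36
wall = to_seconds("11:50.91")
user = to_seconds("01:15.70")
sys_ = to_seconds("09:40.36")

cpu_pct = round(100 * (user + sys_) / wall)
print(cpu_pct)  # 92, matching the table; a value below 100 means time was lost waiting on I/O
```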

The "Context Switches" columns show the number of Voluntary and Involuntary context switches that occurred during the load. Voluntary context switches are those which occur when a program calls a function that can block - system calls, mutexes and other synchronization primitives, etc. Involuntary context switches occur e.g. when a CPU must handle an interrupt, or when the running thread's time slice has been fully consumed. LMDB issues write() system calls whenever it commits a transaction, so there are a lot of voluntary context switches here. However, not every write() results in a context switch - this depends largely on the behavior of the OS filesystem cache. RocksDB is configured with a large cache (32GB, one half of available RAM) as well as a large write buffer (256MB) so it has far fewer voluntary context switches. But since this workload is dominated by I/O, the CPU overhead of LMDB's context switches has little impact on the overall runtime.
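On Linux these per-process counters can be read directly via getrusage(); a minimal sketch using Python's resource module (POSIX-only) reads the same voluntary/involuntary counters reported in the load table:

```python
import resource
import time

# Snapshot the context-switch counters for this process.
before = resource.getrusage(resource.RUSAGE_SELF)
print("voluntary:", before.ru_nvcsw)     # blocked in a syscall, mutex, etc.
print("involuntary:", before.ru_nivcsw)  # preempted by an interrupt or expired time slice

# A blocking syscall (here a short sleep) typically bumps the voluntary count.
time.sleep(0.01)
after = resource.getrusage(resource.RUSAGE_SELF)
print(after.ru_nvcsw >= before.ru_nvcsw)  # True
```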

The "FS Ops" columns show the number of actual I/O operations performed, which is usually different from the number of DB operations performed. Since the loading task is "write-only" we would expect few, if any, input operations. However, since the DB is much larger than RAM, it's normal for some amount of metadata to need to be re-read during the course of the run, as the written data pushes other information out of the filesystem cache. The number of outputs is more revealing, as it directly shows the degree of write amplification occurring. There are only 80 million DB writes being performed, but far more than 80 million actual writes occur in each run. The results with the raw block device show that the filesystem adds roughly 30% more writes than the DB itself (and about 47% more with the journal disabled).
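The write-amplification figures in the table are simply the ratio of filesystem output operations to the 80 million DB writes, and the filesystem's own contribution falls out of comparing the ext4 and raw-device rows (numbers taken from the LMDB load table above):

```python
DB_WRITES = 80_000_000

# Output operations from the LMDB load table (Flash).
ext4_out = 840_839_736   # Flash/Ext4, journal enabled
raw_out  = 645_487_344   # Flash, raw block device

print(ext4_out / DB_WRITES)    # ~10.51 filesystem writes per DB write
print(raw_out / DB_WRITES)     # ~8.07 on the raw device
print(ext4_out / raw_out - 1)  # ~0.30: ext4 adds roughly 30% more writes
```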





[Figure: DB load times for each storage configuration]

There are a few unexpected results here. The LMDB loads actually ran slower with the filesystem journal turned off. The LMDB loads on the raw block device also ran slower than with a filesystem; the I/O statistics imply that the block device wasn't caching any of the device reads. RocksDB has a serious performance issue on the Optane filesystems, taking over 2 hours to load the data. There's no explanation for that yet.

Here are the load times plotted again, without the 2-hour outliers.

[Figure: DB load times, excluding the 2-hour RocksDB outliers]





With LMDB on the raw block device, each write of a record results in an immediate write to the device, which always causes a context switch. So for 80 million records there are at least 80 million voluntary context switches. In general, even though this is a purely sequential workload, RocksDB performs more filesystem writes per database write than LMDB, and usually more filesystem reads. The latter is somewhat surprising because LSM-based designs are supposed to support "blind writes" - i.e., writing a new record shouldn't require reading any existing data - that's supposed to be one of the features that makes them "write-optimized." This LSM advantage is not in evidence here.

Overall, the specs for the Optane P4800X show 11x more random write IOPS and lower latency than the Flash P4500 SSD, but all of the load results here are slower for the P4800X than for the Flash SSD. Again, we have no explanation for why the results aren't more reflective of the drive specs. At a guess, it may be due to wear on the SSDs from previous users. It was hoped that doing a fresh mkfs before each run, which also explicitly performed a Discard Blocks step on the device, would avoid wear-related performance issues, but that seems to have had no effect.