Optane 900p 480G: zfs vs btrfs vs ext4 benchmarks

I recently bought a new server with an Optane 900p 480G and I decided to give zfs a try instead of using btrfs as usual (I will not use raid or other devices, just a single 900p).

I will use my Optane drive to host several KVM virtual machines.

I was fooled into thinking that the native sector size was 512B by the fact that we aren’t allowed to reformat the NVMe to 4K/8K:

https://github.com/linux-nvme/nvme-cli/issues/346

https://communities.intel.com/thread/124672

This seems to be just a marketing move to sell the more expensive datacenter disks; in fact, some reviews suggest that 512B is emulated, as is 4K on the datacenter disks:

https://superuser.com/questions/1263828/why-512-behaves-worse-than-4096-when-nvme-configured-with-512-sector-size

https://www.anandtech.com/show/11209/intel-optane-ssd-dc-p4800x-review-a-deep-dive-into-3d-xpoint-enterprise-performance

Regular NVMe SSDs present an emulated 512B sector by slicing the larger (4K/8K/etc.) flash pages into smaller sectors. Optane, on the other hand, is byte (bit?) addressable by design, so all of its “sector sizes” are emulated by assembling a sector from the individual components. Since we can choose both the sector size and the record size freely, the question is: which one to choose?

Since I plan to use compression, I basically need to rule out all combinations where the sector size equals the record size. Recordsize applies to the uncompressed data, so if you take an 8K record and compress it to 5.3K, you still have to store that data in an 8K sector, and you save nothing. So I will only consider record sizes that are at least 4 times the sector size.
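The padding effect can be sketched with a bit of arithmetic (a hypothetical helper for illustration, not actual zfs code): the space a compressed record consumes is its compressed size rounded up to a whole number of sectors.

```python
import math

def allocated_bytes(compressed_size: int, sector_size: int) -> int:
    """Space consumed on disk: a compressed record is stored in whole
    sectors, so its size is rounded up to a multiple of the sector size."""
    return math.ceil(compressed_size / sector_size) * sector_size

# 8K record compressed to ~5.3K, stored on 8K sectors: no savings at all
print(allocated_bytes(5427, 8192))  # -> 8192
# Same compressed record on 512B sectors: only ~5.5K allocated
print(allocated_bytes(5427, 512))   # -> 5632
```

With 512B sectors the compression actually pays off, which is why only record sizes several times larger than the sector size are worth benchmarking here.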

I also decided to throw raw device, btrfs and ext4 values to the mix, just to make things more fun.

I used fio 3.6 to benchmark the worst case: queue depth 1, single job. I also used direct=1 for the raw-device runs, but I didn’t find a way to completely bypass caches for zfs.
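For reference, a worst-case run of this sort would look something like the following (the device path and runtime are placeholders, not my exact command line; that is in the linked results file):

```shell
# 4K random write, queue depth 1, single job, bypassing the page cache.
# /dev/nvme0n1 is a placeholder: writing to a raw device DESTROYS its data.
fio --name=randwrite-qd1 \
    --filename=/dev/nvme0n1 \
    --rw=randwrite \
    --bs=4k \
    --iodepth=1 \
    --numjobs=1 \
    --ioengine=libaio \
    --direct=1 \
    --runtime=60 \
    --time_based
```

For the zfs runs the target is a file on the dataset instead of the raw device, and direct=1 does not bypass the ARC, which is why those numbers still include some caching.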

Disk partitions have been aligned at 1MiB by zfs itself.

For a 512B sector size you need to set ashift=9 for your whole zpool, ashift=12 for 4K and ashift=13 for 8K.

By contrast, you can set recordsize=512, recordsize=4K or recordsize=8K on a per-dataset basis.
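Concretely, the two knobs are set like this (pool, dataset and device names are placeholders):

```shell
# ashift is fixed for the whole pool at creation time and cannot be
# changed later; ashift=9 means 512B sectors.
zpool create -o ashift=9 tank /dev/nvme0n1

# recordsize is a per-dataset property and can be changed at any time
# (it only affects newly written records):
zfs create tank/vms
zfs set recordsize=4K tank/vms

zfs create tank/data
zfs set recordsize=32K tank/data
```

This asymmetry is why the sector size choice is the more important one: getting ashift wrong means rebuilding the pool, while recordsize can be tuned afterwards.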

Here are the results:

I suggest you download the calc file:

optane_benchmarks

And the fio output, along with the commands I used:

optane_benchmarks_results

The official zfs wiki suggests a 4K recordsize for storing virtual machine images, so I will probably opt for a 512B sector size with a 4K recordsize for VMs and a 32K recordsize for everything else.

EDIT: the ‘none’ I/O scheduler was used. It wasn’t clear from the previous graphs, but going from the default s 512 / r 128K (22.08 MiB/s) to s 512 / r 4K (32.57 MiB/s) yields a 48% improvement in 4K randwrite, while s 512 / r 32K still retains a very good 45% increase in performance.

