I'm trying to set up ZFS on a single disk because of its excellent compression and snapshot capabilities. My workload is a PostgreSQL server. The usual guides suggest the following settings:

    atime = off
    compression = lz4
    primarycache = metadata
    recordsize = 16k
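For anyone reproducing this, here is a minimal sketch of how those four properties could be applied with `zfs set`. The dataset name `tank/pgdata`, the `apply_props` helper, and the `ZFS` override variable are hypothetical names of mine, not part of my actual setup.

```shell
# Command used to talk to ZFS; override with ZFS=echo for a dry run.
ZFS=${ZFS:-zfs}

# apply_props DATASET: set the four properties listed above, one at a time.
apply_props() {
    for prop in atime=off compression=lz4 primarycache=metadata recordsize=16k; do
        $ZFS set "$prop" "$1" || return 1
    done
}

# Hypothetical dataset name; uncomment on a machine with a real pool:
# apply_props tank/pgdata
```

Note that recordsize only applies to blocks written after the change, so the test file has to be rewritten after switching it.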

But with those settings I see some odd read speeds (reads are all I'm looking at for now).

For reference, here's my test drive (Intel P4800X) with XFS. It's a simple direct I/O test with dd:

    [root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=4K iflag=direct
    910046+0 records in
    910046+0 records out
    3727548416 bytes (3.7 GB) copied, 10.9987 s, 339 MB/s
    [root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=8K iflag=direct
    455023+0 records in
    455023+0 records out
    3727548416 bytes (3.7 GB) copied, 6.05091 s, 616 MB/s
    [root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=16K iflag=direct
    227511+1 records in
    227511+1 records out
    3727548416 bytes (3.7 GB) copied, 3.8243 s, 975 MB/s
    [root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=32K iflag=direct
    113755+1 records in
    113755+1 records out
    3727548416 bytes (3.7 GB) copied, 2.78787 s, 1.3 GB/s
    [root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=64K iflag=direct
    56877+1 records in
    56877+1 records out
    3727548416 bytes (3.7 GB) copied, 2.18482 s, 1.7 GB/s
    [root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=128K iflag=direct
    28438+1 records in
    28438+1 records out
    3727548416 bytes (3.7 GB) copied, 1.83346 s, 2.0 GB/s
    [root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=256K iflag=direct
    14219+1 records in
    14219+1 records out
    3727548416 bytes (3.7 GB) copied, 1.69168 s, 2.2 GB/s
    [root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=512K iflag=direct
    7109+1 records in
    7109+1 records out
    3727548416 bytes (3.7 GB) copied, 1.54205 s, 2.4 GB/s
    [root@at-storage-01 test]# dd if=large_file.bin of=/dev/zero bs=1M iflag=direct
    3554+1 records in
    3554+1 records out
    3727548416 bytes (3.7 GB) copied, 1.51988 s, 2.5 GB/s
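The nine runs can be scripted as a loop. This is only a sketch: the `read_sweep` helper is a name I made up, it assumes GNU dd, and it writes to `/dev/null` (the more conventional sink) rather than the `/dev/zero` I used in the runs shown here.

```shell
# read_sweep FILE [DD_ARGS...]: read FILE once per block size and
# print dd's summary line for each run, prefixed with the block size.
read_sweep() {
    file=$1
    shift
    for bs in 4K 8K 16K 32K 64K 128K 256K 512K 1M; do
        printf '%s: ' "$bs"
        dd if="$file" of=/dev/null bs="$bs" "$@" 2>&1 | tail -n 1
    done
}

# Usage matching the direct I/O test (uncomment on the test machine):
# read_sweep large_file.bin iflag=direct
```

Extra dd arguments such as `iflag=direct` are passed through, so the same loop works for buffered and direct reads.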

As you can see, the drive reaches about 80k IOPS at 4K reads and roughly the same at 8K, so throughput scales linearly here. (According to the spec it can reach 550k IOPS at QD16, but I'm testing single-threaded sequential reads, so everything is as expected.)
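The IOPS figures come from dividing the bytes copied by the elapsed time and the block size. A small sketch of that arithmetic, using the byte counts and timings from the dd output (the `iops` helper name is mine):

```shell
# iops BYTES SECONDS BLOCK_SIZE_BYTES: average read requests per second,
# rounded to the nearest integer.
iops() {
    awk -v bytes="$1" -v secs="$2" -v bs="$3" \
        'BEGIN { printf "%.0f\n", bytes / secs / bs }'
}

iops 3727548416 10.9987 4096   # XFS, bs=4K
iops 3727548416 6.05091 8192   # XFS, bs=8K
```

Both calls land in the range of 75k to 83k requests per second, which is where the "about 80k IOPS" estimate comes from.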

Kernel parameters for ZFS:

    options zfs zfs_vdev_scrub_min_active=48
    options zfs zfs_vdev_scrub_max_active=128
    options zfs zfs_vdev_sync_write_min_active=64
    options zfs zfs_vdev_sync_write_max_active=128
    options zfs zfs_vdev_sync_read_min_active=64
    options zfs zfs_vdev_sync_read_max_active=128
    options zfs zfs_vdev_async_read_min_active=64
    options zfs zfs_vdev_async_read_max_active=128
    options zfs zfs_top_maxinflight=320
    options zfs zfs_txg_timeout=30
    options zfs zfs_dirty_data_max_percent=40
    options zfs zfs_vdev_scheduler=deadline
    options zfs zfs_vdev_async_write_min_active=8
    options zfs zfs_vdev_async_write_max_active=64
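To confirm that the module actually picked these options up, the live values can be read back from sysfs. A rough sketch, assuming ZFS on Linux exposes its tunables under `/sys/module/zfs/parameters` (the `show_param` helper and the `PARAM_DIR` override are hypothetical names of mine):

```shell
# Directory holding the live module parameters; overridable for testing.
PARAM_DIR=${PARAM_DIR:-/sys/module/zfs/parameters}

# show_param NAME: print the live value of one tunable, or note that it
# does not exist in this module version.
show_param() {
    if [ -r "$PARAM_DIR/$1" ]; then
        printf '%s = %s\n' "$1" "$(cat "$PARAM_DIR/$1")"
    else
        printf '%s: not present\n' "$1"
    fi
}

for p in zfs_txg_timeout zfs_dirty_data_max_percent zfs_vdev_sync_read_max_active; do
    show_param "$p"
done
```

A parameter reported as not present may simply not exist in the installed ZFS version, in which case the corresponding `options` line has no effect.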

Now the same test with ZFS and a recordsize of 16K:

    910046+0 records in
    910046+0 records out
    3727548416 bytes (3.7 GB) copied, 39.6985 s, 93.9 MB/s
    455023+0 records in
    455023+0 records out
    3727548416 bytes (3.7 GB) copied, 20.2442 s, 184 MB/s
    227511+1 records in
    227511+1 records out
    3727548416 bytes (3.7 GB) copied, 10.5837 s, 352 MB/s
    113755+1 records in
    113755+1 records out
    3727548416 bytes (3.7 GB) copied, 6.64908 s, 561 MB/s
    56877+1 records in
    56877+1 records out
    3727548416 bytes (3.7 GB) copied, 4.85928 s, 767 MB/s
    28438+1 records in
    28438+1 records out
    3727548416 bytes (3.7 GB) copied, 3.91185 s, 953 MB/s
    14219+1 records in
    14219+1 records out
    3727548416 bytes (3.7 GB) copied, 3.41855 s, 1.1 GB/s
    7109+1 records in
    7109+1 records out
    3727548416 bytes (3.7 GB) copied, 3.17058 s, 1.2 GB/s
    3554+1 records in
    3554+1 records out
    3727548416 bytes (3.7 GB) copied, 2.97989 s, 1.3 GB/s

As you can see, the 4K read test already maxes out at 93.9 MB/s, the 8K read at 184 MB/s, and the 16K read at 352 MB/s. Based on the XFS results I would definitely expect faster reads: scaling down the 975 MB/s that XFS reaches at 16K gives 243.75 MB/s at 4K, 487.5 MB/s at 8K, and 975 MB/s at 16K. I also read that recordsize has no impact on read performance, but clearly it does.

For comparison, with a 128K recordsize:

    910046+0 records in
    910046+0 records out
    3727548416 bytes (3.7 GB) copied, 107.661 s, 34.6 MB/s
    455023+0 records in
    455023+0 records out
    3727548416 bytes (3.7 GB) copied, 55.6932 s, 66.9 MB/s
    227511+1 records in
    227511+1 records out
    3727548416 bytes (3.7 GB) copied, 27.3412 s, 136 MB/s
    113755+1 records in
    113755+1 records out
    3727548416 bytes (3.7 GB) copied, 14.1506 s, 263 MB/s
    56877+1 records in
    56877+1 records out
    3727548416 bytes (3.7 GB) copied, 7.4061 s, 503 MB/s
    28438+1 records in
    28438+1 records out
    3727548416 bytes (3.7 GB) copied, 4.1867 s, 890 MB/s
    14219+1 records in
    14219+1 records out
    3727548416 bytes (3.7 GB) copied, 2.6765 s, 1.4 GB/s
    7109+1 records in
    7109+1 records out
    3727548416 bytes (3.7 GB) copied, 1.87574 s, 2.0 GB/s
    3554+1 records in
    3554+1 records out
    3727548416 bytes (3.7 GB) copied, 1.40653 s, 2.7 GB/s
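Since dd issues one read at a time, another way to look at the three 4K runs is the average time per read request: elapsed seconds divided by the number of records. A quick sketch of that calculation from the timings in the dd output (the `usec_per_read` helper name is mine):

```shell
# usec_per_read SECONDS RECORDS: average microseconds per read request.
usec_per_read() {
    awk -v secs="$1" -v n="$2" \
        'BEGIN { printf "%.1f\n", secs * 1000000 / n }'
}

usec_per_read 10.9987 910046   # XFS, bs=4K
usec_per_read 39.6985 910046   # ZFS, recordsize=16K, bs=4K
usec_per_read 107.661 910046   # ZFS, recordsize=128K, bs=4K
```

That works out to roughly 12 µs per 4K read on XFS, against about 44 µs with a 16K recordsize and about 118 µs with a 128K recordsize, so the per-request cost grows with the recordsize in this test.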

What I can also clearly see with iostat is that the disk's average request size matches the configured recordsize. But the IOPS are far lower than with XFS.

Is that how it should behave? Where is that behaviour documented? I need good performance for my PostgreSQL server (sequential + random), but I also want great performance for my backups, copies, etc. (sequential). So it seems I get either good sequential speed with big records or good random speed with small records.

Edit: I also tested with primarycache=all. There's more weirdness: throughput maxes out at 1.3 GB/s regardless of the recordsize.

Server details:

64 GB DDR4 RAM

Intel Xeon E5-2620v4

Intel P4800X