I'm completely new to ZFS, so to start I thought I'd run some simple benchmarks to get a feel for how it behaves. I wanted to push the limits of its performance, so I provisioned an Amazon EC2 i2.8xlarge instance (almost $7/hr; time really is money!). This instance has eight 800 GB SSDs.

I did an fio test on the SSDs themselves, and got the following output (trimmed):

$ sudo fio --name randwrite --ioengine=libaio --iodepth=2 --rw=randwrite \
    --bs=4k --size=400G --numjobs=8 --runtime=300 --group_reporting \
    --direct=1 --filename=/dev/xvdb
[trimmed]
  write: io=67178MB, bw=229299KB/s, iops=57324, runt=300004msec
[trimmed]

57K IOPS for 4K random writes. Respectable.

I then created a ZFS pool spanning all eight SSDs. At first I used a single raidz1 vdev containing all eight, but after reading about why that's bad for performance I ended up with four mirror vdevs, like so:

$ sudo zpool create testpool mirror xvdb xvdc mirror xvdd xvde mirror xvdf xvdg mirror xvdh xvdi
$ sudo zpool list -v
NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
testpool  2.91T   284K  2.91T         -     0%     0%  1.00x  ONLINE  -
  mirror   744G   112K   744G         -     0%     0%
    xvdb      -      -      -         -      -      -
    xvdc      -      -      -         -      -      -
  mirror   744G    60K   744G         -     0%     0%
    xvdd      -      -      -         -      -      -
    xvde      -      -      -         -      -      -
  mirror   744G      0   744G         -     0%     0%
    xvdf      -      -      -         -      -      -
    xvdg      -      -      -         -      -      -
  mirror   744G   112K   744G         -     0%     0%
    xvdh      -      -      -         -      -      -
    xvdi      -      -      -         -      -      -
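For context, here's my back-of-the-envelope sketch of what this layout should be capable of (my numbers, not measured pool behavior): with two-way mirrors, every write must land on both members, so each vdev contributes roughly one disk's worth of write IOPS, and the vdevs stripe.

```shell
# Naive write-IOPS ceiling for this layout. Assumptions: each SSD
# sustains ~57K 4K random-write IOPS (per the raw fio run above),
# and ZFS stripes writes evenly across the four mirror vdevs.
PER_DISK_IOPS=57000
MIRROR_VDEVS=4
# A two-way mirror absorbs about one disk's worth of write IOPS,
# since every write is duplicated to both members.
echo "ceiling: $((PER_DISK_IOPS * MIRROR_VDEVS)) write IOPS"
```

By that estimate the topology alone leaves plenty of headroom, so the mirror layout itself shouldn't be what caps the result below a single disk.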

I set the recordsize to 4K and ran my test:

$ sudo zfs set recordsize=4k testpool
$ sudo fio --name randwrite --ioengine=libaio --iodepth=2 --rw=randwrite \
    --bs=4k --size=400G --numjobs=8 --runtime=300 --group_reporting \
    --filename=/testpool/testfile --fallocate=none
[trimmed]
  write: io=61500MB, bw=209919KB/s, iops=52479, runt=300001msec
    slat (usec): min=13, max=155081, avg=145.24, stdev=901.21
    clat (usec): min=3, max=155089, avg=154.37, stdev=930.54
     lat (usec): min=35, max=155149, avg=300.91, stdev=1333.81
[trimmed]

I get only 52K IOPS on this ZFS pool. That's slightly worse than a single SSD by itself.

I don't understand what I'm doing wrong here. Have I configured ZFS incorrectly, or is this a poor test of ZFS performance?

Note: I'm using the official 64-bit CentOS 7 HVM image, though I've upgraded to the 4.4.5 kernel from ELRepo:

$ uname -a
Linux ip-172-31-43-196.ec2.internal 4.4.5-1.el7.elrepo.x86_64 #1 SMP Thu Mar 10 11:45:51 EST 2016 x86_64 x86_64 x86_64 GNU/Linux

I installed ZFS from the zfs repo listed here. I have version 0.6.5.5 of the zfs package.

UPDATE: Per @ewwhite's suggestion I tried ashift=12 and ashift=13:

$ sudo zpool create testpool mirror xvdb xvdc mirror xvdd xvde mirror xvdf xvdg mirror xvdh xvdi -o ashift=12 -f

and

$ sudo zpool create testpool mirror xvdb xvdc mirror xvdd xvde mirror xvdf xvdg mirror xvdh xvdi -o ashift=13 -f

Neither of these made any difference. From what I understand, the latest ZFS bits are smart enough to identify 4K SSDs and use reasonable defaults.
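As a quick reference for what those values mean (this is just the standard ashift semantics, nothing specific to this pool):

```shell
# ashift is the base-2 logarithm of the sector size ZFS aligns vdev
# I/O to, so the values tried above correspond to:
for ASHIFT in 9 12 13; do
  printf 'ashift=%d -> %d-byte sectors\n' "$ASHIFT" "$((1 << ASHIFT))"
done
# ashift=9 is the old 512-byte default; ashift=12 matches 4K pages and
# ashift=13 matches 8K. A too-small ashift on a 4K/8K SSD causes
# read-modify-write amplification, which is why it was worth ruling out.
```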

I did notice, however, that CPU usage is spiking. @Tim suggested this earlier, but I dismissed it; I think I just wasn't watching the CPU long enough to catch it. This instance has 32 vCPUs, and usage spikes as high as 80%. The hungry process? z_wr_iss, lots of instances of it.

I confirmed compression is off, so it's not the compression engine.

I'm not using raidz, so it shouldn't be the parity computation.

I did a perf top and it shows most of the kernel time spent in _raw_spin_unlock_irqrestore in z_wr_int_4 and osq_lock in z_wr_iss.

I now believe there is a CPU component to this performance bottleneck, though I'm no closer to figuring out what it might be.

UPDATE 2: Per @ewwhite and others' suggestion that the virtualized nature of this environment creates performance uncertainty, I used fio to benchmark random 4K writes spread across four of the SSDs in the environment. Each SSD by itself gives ~55K IOPS, so I expected somewhere around 220K IOPS across four of them. That's more or less what I got:

$ sudo fio --name randwrite --ioengine=libaio --iodepth=8 --rw=randwrite \
    --bs=4k --size=398G --numjobs=8 --runtime=300 --group_reporting \
    --filename=/dev/xvdb:/dev/xvdc:/dev/xvdd:/dev/xvde
randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8
...
randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=8
fio-2.1.5
Starting 8 processes
[trimmed]
  write: io=288550MB, bw=984860KB/s, iops=246215, runt=300017msec
    slat (usec): min=1, max=24609, avg=30.27, stdev=566.55
    clat (usec): min=3, max=2443.8K, avg=227.05, stdev=1834.40
     lat (usec): min=27, max=2443.8K, avg=257.62, stdev=1917.54
[trimmed]
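Spelling out the arithmetic behind that expectation (the per-disk figure is the rounded number above, and linear scaling across independent instance-store SSDs is an assumption on my part):

```shell
# Raw-device aggregate vs. what the ZFS pool delivered.
PER_DISK=55000
DISKS=4
EXPECTED=$((PER_DISK * DISKS))   # naive linear-scaling estimate
MEASURED=246215                  # from the fio run above
ZFS_POOL=52479                   # from the earlier ZFS fio run
echo "raw expected: $EXPECTED, raw measured: $MEASURED, zfs pool: $ZFS_POOL"
echo "zfs pool reaches $((ZFS_POOL * 100 / MEASURED))% of the raw aggregate"
```

The raw hardware beats even the naive estimate, while the ZFS pool delivers only about a fifth of the measured raw aggregate.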

This clearly shows that the environment, virtualized though it may be, can sustain far higher IOPS than I'm seeing from ZFS. Something about the way ZFS is implemented is keeping it from hitting top speed, and I just can't figure out what that is.