Performance, Capacity, and Integrity

One of the better websites I found was Calomel — Open Source Research and Reference. From there, I was led to ZFS-based systems and to FreeNAS, with its vast community.

The article pointed out some foundations of storage planning, basically stating that no matter how hard you try, you can't have it all. They drew the diagram below, and I needed to find out where I wanted to place SPEEDY.

             capacity
                /\
               /  \
              /    \
             /      \
performance /________\ integrity

We knew that we wanted mainly performance and capacity, with a dash of integrity. We also knew that we had some constraints, like the 40GbE Mellanox networking card. By a dash of integrity, I mean that we needed a RAIDZ level rather than plain striping, while mirrors would cost too much capacity.
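To put rough numbers on that capacity trade-off, here is my own back-of-the-envelope shell arithmetic, assuming all 24 of the roughly 4TB SSDs end up in the pool and, for the RAIDZ case, 4-wide RAIDZ1 vdevs (the actual layout comes later):

echo "mirrors:       $(( 24 / 2 * 4 )) TB usable"       # half the raw capacity
echo "4-wide RAIDZ1: $(( 24 / 4 * 3 * 4 )) TB usable"   # one parity drive per 4-disk vdev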

Thus the balance.

First, I wanted to establish a baseline based on my hardware. What can I expect?

I found solnet-array-test on the FreeNAS forums. It does a serial read pass over each disk, a parallel read pass, and a parallel seek-stress pass with multiple accesses per disk.

The baseline gives me the average read speed per disk; from there, we can calculate the expected speed per vdev and figure out the best setup for the system as a whole.

My goal is to land somewhere around 5 GB/s of raw read speed.
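As far as I can tell, the script's baseline pass boils down to a plain serial read of each raw device, which you can sanity-check yourself with dd (a minimal sketch, reading the first 10 GiB of da2):

# read the first 10 GiB of da2; the bytes/sec in dd's summary is that disk's raw serial read speed
dd if=/dev/da2 of=/dev/null bs=1m count=10240

Here is what the full run reported: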

Selected disks: da2 da3 da4 da5 da6 da7 da8 da9 da10 da11 da12 da13 da14 da15 da16 da17 da18 da19 da20 da21 da22 da23 da24 da25
Samsung SSD 860
Performing initial serial array read (baseline speeds)

Thu Jan 03 08:36:47 CET 2019

Thu Jan 03 09:30:48 CET 2019

Completed: initial serial array read (baseline speeds)
Array's average speed is 425 MB/sec per disk
Disk Disk Size MB/sec %ofAvg

------- ---------- ------ ------

da2 3815447MB 436 103

da3 3815447MB 430 101

da4 3815447MB 436 103

da5 3815447MB 432 102

da6 3815447MB 436 103

da7 3815447MB 428 101

da8 3815447MB 428 101

da9 3815447MB 438 103

da10 3815447MB 433 102

da11 3815447MB 431 102

da12 3815447MB 428 101

da13 3815447MB 428 101

da14 3815447MB 434 102

da15 3815447MB 431 101

da16 3815447MB 428 101

da17 3815447MB 428 101

da18 3815447MB 433 102

da19 3815447MB 428 101

da20 3815447MB 431 101

da21 3815447MB 427 101

da22 3815447MB 419 99

da23 3815447MB 427 100

da24 3815447MB 363 85 --SLOW--

da25 3815447MB 368 87 --SLOW--
Performing initial parallel array read

Thu Jan 03 09:30:48 CET 2019

The disk da2 appears to be 3815447 MB.

Disk is reading at about 265 MB/sec

This suggests that this pass may take around 240 minutes
Serial Parall % of

Disk Disk Size MB/sec MB/sec Serial

------- ---------- ------ ------ ------

da2 3815447MB 436 265 61 --SLOW--

da3 3815447MB 430 265 62 --SLOW--

da4 3815447MB 436 265 61 --SLOW--

da5 3815447MB 432 265 61 --SLOW--

da6 3815447MB 436 267 61 --SLOW--

da7 3815447MB 428 266 62 --SLOW--

da8 3815447MB 428 266 62 --SLOW--

da9 3815447MB 438 266 61 --SLOW--

da10 3815447MB 433 266 62 --SLOW--

da11 3815447MB 431 268 62 --SLOW--

da12 3815447MB 428 267 63 --SLOW--

da13 3815447MB 428 267 62 --SLOW--

da14 3815447MB 434 266 61 --SLOW--

da15 3815447MB 431 267 62 --SLOW--

da16 3815447MB 428 267 62 --SLOW--

da17 3815447MB 428 269 63 --SLOW--

da18 3815447MB 433 267 62 --SLOW--

da19 3815447MB 428 268 63 --SLOW--

da20 3815447MB 431 269 62 --SLOW--

da21 3815447MB 427 270 63 --SLOW--

da22 3815447MB 419 268 64 --SLOW--

da23 3815447MB 427 268 63 --SLOW--

da24 3815447MB 363 267 74 --SLOW--

da25 3815447MB 368 261 71 --SLOW--
Awaiting completion: initial parallel array read

Thu Jan 03 13:35:09 CET 2019

Completed: initial parallel array read
Disk's average time is 14355 seconds per disk
Disk Bytes Transferred Seconds %ofAvg

------- ----------------- ------- ------

da2 4000787030016 14408 100

da3 4000787030016 14392 100

da4 4000787030016 14390 100

da5 4000787030016 14381 100

da6 4000787030016 14384 100

da7 4000787030016 14363 100

da8 4000787030016 14359 100

da9 4000787030016 14355 100

da10 4000787030016 14358 100

da11 4000787030016 14333 100

da12 4000787030016 14336 100

da13 4000787030016 14327 100

da14 4000787030016 14335 100

da15 4000787030016 14302 100

da16 4000787030016 14297 100

da17 4000787030016 14292 100

da18 4000787030016 14306 100

da19 4000787030016 14277 99

da20 4000787030016 14262 99

da21 4000787030016 14259 99

da22 4000787030016 14269 99

da23 4000787030016 14238 99

da24 4000787030016 14630 102

da25 4000787030016 14661 102
Performing initial parallel seek-stress array read

Thu Jan 03 13:35:09 CET 2019

The disk da2 appears to be 3815447 MB.

Disk is reading at about 268 MB/sec

This suggests that this pass may take around 237 minutes
Serial Parall % of

Disk Disk Size MB/sec MB/sec Serial

------- ---------- ------ ------ ------

da2 3815447MB 436 265 61

da3 3815447MB 430 272 63

da4 3815447MB 436 252 58

da5 3815447MB 432 270 62

da6 3815447MB 436 270 62

da7 3815447MB 428 270 63

da8 3815447MB 428 269 63

da9 3815447MB 438 265 61

da10 3815447MB 433 263 61

da11 3815447MB 431 262 61

da12 3815447MB 428 268 63

da13 3815447MB 428 270 63

da14 3815447MB 434 273 63

da15 3815447MB 431 277 64

da16 3815447MB 428 268 63

da17 3815447MB 428 264 62

da18 3815447MB 433 272 63

da19 3815447MB 428 256 60

da20 3815447MB 431 264 61

da21 3815447MB 427 268 63

da22 3815447MB 419 272 65

da23 3815447MB 427 269 63

da24 3815447MB 363 248 69

da25 3815447MB 368 234 64
Awaiting completion: initial parallel seek-stress array read

Fri Feb 21 16:05:40 CET 2019

Completed: initial parallel seek-stress array read
Disk's average time is 87907 seconds per disk
Disk Bytes Transferred Seconds %ofAvg

------- ----------------- ------- ------

da2 4000787030016 87748 100

da3 4000787030016 87785 100

da4 4000787030016 88430 101

da5 4000787030016 87897 100

da6 4000787030016 88185 100

da7 4000787030016 87324 99

da8 4000787030016 88432 101

da9 4000787030016 88100 100

da10 4000787030016 88233 100

da11 4000787030016 87097 99

da12 4000787030016 87893 100

da13 4000787030016 86799 99

da14 4000787030016 87664 100

da15 4000787030016 85425 97

da16 4000787030016 86986 99

da17 4000787030016 88213 100

da18 4000787030016 87218 99

da19 4000787030016 88281 100

da20 4000787030016 86297 98

da21 4000787030016 86504 98

da22 4000787030016 86374 98

da23 4000787030016 86234 98

da24 4000787030016 91774 104

da25 4000787030016 94882 108 --SLOW--

OMG, the sales sheets are lying! We don't get the full speed the drive specifications promise. Well… I'm OK with that. I have read in several places that speed tests on these Samsung SSDs show similar results.

On average, a drive delivers 425 MB/s serial and 265 MB/s parallel read speed.

By grouping them into vdevs, we should get somewhere around:

4 drives in a vdev = 1.7 GB/s serial and 1.0 GB/s parallel.

12 drives in a vdev = 5.1 GB/s serial and 3.2 GB/s parallel.
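Those figures are just the measured per-drive averages multiplied up, ignoring RAIDZ parity and any ZFS overhead; as a quick shell sanity check:

echo "4 drives:  $(( 4 * 425 )) MB/s serial, $(( 4 * 265 )) MB/s parallel"    # 1700 / 1060
echo "12 drives: $(( 12 * 425 )) MB/s serial, $(( 12 * 265 )) MB/s parallel"  # 5100 / 3180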

Now, I need to point out that iXsystems wrote a great article about measuring ZFS performance. In the article, they had some interesting pointers.

ZFS breaks write data into pieces called blocks and stripes them across the vdevs. Each vdev breaks those blocks into even smaller chunks called sectors. For striped vdevs, the sectors are simply written sequentially to the drive. For mirrored vdevs, all sectors are written sequentially to each disk. On RAIDZ vdevs however, ZFS has to add additional sectors for the parity information. When a RAIDZ vdev gets a block to write out, it will divide that block into sectors, compute all the parity information, and hand each disk either a set of data sectors or a set of parity sectors. ZFS ensures that there are p parity sectors for each stripe written to the RAIDZ vdev.
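To make that concrete for the geometry I ended up with (4-wide RAIDZ1 vdevs, ashift=12, i.e. 4K sectors), here is a rough worked example of the parity cost for a single 128K record; it ignores RAIDZ padding, so treat it as an approximation:

# 128K / 4K = 32 data sectors; a 4-wide RAIDZ1 stripe holds 3 data + 1 parity sector (p = 1),
# so the record needs ceil(32 / 3) parity sectors on top of the data:
echo $(( (32 + 3 - 1) / 3 ))   # -> 11 parity sectors, 43 sectors written in total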

One vdev = one drive's IOPS.

If I were building a single server for one artist, I could have one large vdev, and they would get the speed of all the drives but only the IOPS of one. As we are a studio with several artists, I need more IOPS.

As my drives are Samsung EVOs, I believe 6 vdevs should be sufficient.

With 6 vdevs, that lands us somewhere around 6 to 10 GB/s, give or take.
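That range is just the per-drive baseline scaled up to all 24 disks, a naive upper bound that ignores parity and ZFS overhead:

echo "24 drives parallel: $(( 24 * 265 )) MB/s"   # ~6.4 GB/s
echo "24 drives serial:   $(( 24 * 425 )) MB/s"   # ~10.2 GB/s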

Perfect. I am aiming a bit high on my speed expectations, but I guess ZFS and the processes around it will add some overhead.

ZFS design:

4 SSDs per RAIDZ1 vdev, 6 vdevs in total, 4K aligned and ashift=12

raidz1-0

gptid/6a1a9b96-0ce6-11e9-bf1a-

gptid/6a839b39-0ce6-11e9-bf1a-

gptid/6afbfe31-0ce6-11e9-bf1a-

gptid/6b69a87e-0ce6-11e9-bf1a-

raidz1-1

gptid/6bdb5a02-0ce6-11e9-bf1a-

gptid/6c4d1773-0ce6-11e9-bf1a-

gptid/6cd5a4d1-0ce6-11e9-bf1a-

gptid/6d4ef698-0ce6-11e9-bf1a-

raidz1-2

gptid/6dca5729-0ce6-11e9-bf1a-

gptid/6e5880c3-0ce6-11e9-bf1a-

gptid/6edb8639-0ce6-11e9-bf1a-

gptid/6f6e9795-0ce6-11e9-bf1a-

raidz1-3

gptid/6ff2f2fa-0ce6-11e9-bf1a-

gptid/70852a7a-0ce6-11e9-bf1a-

gptid/710bea3b-0ce6-11e9-bf1a-

gptid/71aab5d5-0ce6-11e9-bf1a-

raidz1-4

gptid/723a936b-0ce6-11e9-bf1a-

gptid/72dd80a4-0ce6-11e9-bf1a-

gptid/7381b9c1-0ce6-11e9-bf1a-

gptid/742bb354-0ce6-11e9-bf1a-

raidz1-5

gptid/94f3add7-8d04-11e9-b9f2-

gptid/9aa3c40e-8d04-11e9-b9f2-

gptid/a0517c13-8d04-11e9-b9f2-

gptid/a5f34b2b-8d04-11e9-b9f2-
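FreeNAS builds this pool from the GUI (and labels the members by gptid, as listed above), but for reference, a hand-rolled shell equivalent would look roughly like this. It is a sketch only: I am using the raw da device names instead of GPT labels, and the pool name SPEEDY is just a placeholder.

sysctl vfs.zfs.min_auto_ashift=12   # make new vdevs 4K aligned (ashift=12)
zpool create SPEEDY \
  raidz1 da2 da3 da4 da5 \
  raidz1 da6 da7 da8 da9 \
  raidz1 da10 da11 da12 da13 \
  raidz1 da14 da15 da16 da17 \
  raidz1 da18 da19 da20 da21 \
  raidz1 da22 da23 da24 da25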

Testing

A basic Google search on FreeNAS testing gives us some tips on running dd, and warns that we need to be aware of the ARC, which will kick in and skew the results.
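One common way to keep the ARC from answering reads straight out of RAM during such tests is a scratch dataset with caching turned down and compression off (since /dev/zero data compresses away to nothing). A sketch, with a hypothetical dataset name:

zfs create SPEEDY/ddtest
zfs set primarycache=metadata SPEEDY/ddtest   # cache metadata only, so file reads actually hit the disks
zfs set compression=off SPEEDY/ddtest         # otherwise zero-filled test files barely touch the vdevs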

However, I chose to test with the ZFS features enabled and without any tuning, natively out of the box, though I did kick up the block size and tested mostly with 50 or 100 GB files. Here are the results:

DD WRITE Test — 2.54 GB/s

dd if=/dev/zero of=tmp.dd bs=2048k count=100k
41590+0 records in
41590+0 records out
87220551680 bytes transferred in 34.294313 secs (2543294911 bytes/sec)

DD READ Test — 7.90 GB/s

dd if=tmp.dd of=/dev/null bs=2048k count=100k
41590+0 records in
41590+0 records out
87220551680 bytes transferred in 11.030011 secs (7907566892 bytes/sec)

Those are decent results. I aimed for a high READ number, and I know there is a write penalty, but I will also test with FIO, as DD is only single-threaded.

FIO Sequential WRITE — 6.1 GiB/s

# fio --filename=test --sync=1 --rw=write --bs=2048k --numjobs=16 --iodepth=24 --group_reporting --name=test --filesize=50G --runtime=300 && rm test
WRITE: bw=8138MiB/s (8533MB/s), 8138MiB/s-8138MiB/s (8533MB/s-8533MB/s), io=800GiB (859GB), run=100662-100662msec

FIO Sequential READ — 18.4 GB/s

# fio --filename=test --sync=1 --rw=read --bs=2048k --numjobs=16 --iodepth=24 --group_reporting --name=test --filesize=50G --runtime=300 && rm test
READ: bw=17.1GiB/s (18.4GB/s), 17.1GiB/s-17.1GiB/s (18.4GB/s-18.4GB/s), io=800GiB (859GB), run=46767-46767msec

Whoa!

We definitely ran into some cache on that FIO test,

but that's what I call a fast server.

FIO Sequential READ, different Block Sizes.

4K

READ: bw=1883MiB/s (1974MB/s), 1883MiB/s-1883MiB/s (1974MB/s-1974MB/s), io=552GiB (592GB), run=300001-300001msec

8K

READ: bw=3717MiB/s (3898MB/s), 3717MiB/s-3717MiB/s (3898MB/s-3898MB/s), io=800GiB (859GB), run=220385-220385msec

128K

READ: bw=22.9GiB/s (24.6GB/s), 22.9GiB/s-22.9GiB/s (24.6GB/s-24.6GB/s), io=800GiB (859GB), run=34960-34960msec
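Since the whole reason for splitting the pool into six vdevs was IOPS for several artists working at once, a random-read FIO run would be the natural companion test. A sketch along the same lines as the commands above (the parameters are my assumptions, not a run I am quoting results from):

# fio --filename=test --rw=randread --bs=4k --numjobs=16 --iodepth=24 --group_reporting --name=iops-test --filesize=50G --runtime=300 && rm test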

I will say that the server is production ready.

What’s next?

The next article in this series, Solving the mystery of a Master Archive, will focus on the TANK storage server.