In this blog post I’ll be continuing the ZFS on Linux project we’ve been going over. If you’ve found yourself on this page directly and are completely confused, no worries, just check out the earlier articles to get you going in the right direction:

832 TB – ZFS on Linux – Project “Cheap and Deep”: Part 1

832 TB – ZFS on Linux – Setting Up Ubuntu: Part 2

With that out of the way let’s talk about this phase of the project. If you’re following along then you know we’ve already got the hardware configured, the OS (Ubuntu 16.04 LTS in my case) installed, and we’re ready to actually start setting up the ZFS side of things.

Prepare the OS

The first thing I always, always do is bring the OS up to date and let it install all updates. For Ubuntu, it looks like this:

sudo apt-get update && sudo apt-get upgrade -y && sudo reboot

I keep my VM templates (relatively) up to date but if you’re installing on bare metal you’ll surely need to update a bunch. Here’s what I am faced with:

~$ sudo apt-get update && sudo apt-get upgrade
Hit:1 http://us.archive.ubuntu.com/ubuntu xenial InRelease
[...]
Get:12 http://security.ubuntu.com/ubuntu xenial-security/universe i386 Packages [146 kB]
Fetched 3,586 kB in 1s (2,061 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done
The following packages have been kept back:
  linux-generic linux-headers-generic linux-image-generic
The following packages will be upgraded:
  apparmor bind9-host cryptsetup cryptsetup-bin dnsutils grub-legacy-ec2
  libapparmor-perl libapparmor1 libbind9-140 libcryptsetup4 libdns-export162
  libdns162 libisc-export160 libisc160 libisccc140 libisccfg140 liblwres141
  libpython3.5 libpython3.5-minimal libpython3.5-stdlib libxml2 linux-firmware
  python3.5 python3.5-minimal snapd tcpdump
26 upgraded, 0 newly installed, 0 to remove and 3 not upgraded.
Need to get 59.3 MB of archives.
After this operation, 5,327 kB of additional disk space will be used.
Do you want to continue? [Y/n]

For CentOS and RHEL, I’d run:

sudo yum check-update && sudo yum update -y && sudo reboot

For RHEL this will only work with a valid subscription, etc.

The machine should update its sources, then upgrade existing packages, and reboot. Once it comes back, log back in and we’ll start to install the ZFS packages.

It’s important to note that you should read the following page(s) regarding installing ZFS on CentOS/RHEL in order to decide whether you’ll install kABI-tracking kmod or DKMS packages.

Get your ZFS on

Ok – time to get down. If running Ubuntu 16.04 LTS, just run:

sudo apt-get install zfs nfs-kernel-server snmpd snmp mailutils pv lzop mbuffer fio

Let’s take a second to go over what we just did there. Below is what we installed and why:

zfs – obviously we’ll need the ZFS packages to do anything

nfs-kernel-server – even though ZFS supports exporting datasets via its sharenfs property, on Linux that relies on the system NFS server being installed

snmpd/snmp – we are going to want to monitor this thing for disk space, up time, load, etc.

mailutils – this part is optional, but I prefer to set up postfix so this server acts as an SMTP host that relays mail off of something else in the environment

pv lzop mbuffer – these will be useful later on when we talk about ZFS replication using Sanoid/Syncoid

fio – no, this isn’t FIOS spelled wrong; it’s the flexible I/O tester for Linux. You’ll want to run some sort of benchmark locally to compare against performance over NFS or iSCSI. Or maybe you don’t
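To give a taste of what fio can do later, here’s a hedged sketch of a small 4k random-write test. The pool path and all job parameters here are illustrative placeholders, not tuned recommendations:

```shell
# Hypothetical 4k random-write test against a directory on the pool.
# "/[poolname]/benchmark" is a placeholder path; size/runtime are kept
# small for illustration only.
fio --name=randwrite-test \
    --directory=/[poolname]/benchmark \
    --rw=randwrite --bs=4k --size=1G \
    --numjobs=4 --iodepth=16 --ioengine=libaio --direct=1 \
    --runtime=60 --time_based --group_reporting
```

Run the same job locally and again over an NFS mount of the dataset and you have an apples-to-apples comparison.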

This should result in ~80MB of downloads. There’s only one portion of this that I am not going to go into super detail about configuring and that’s this:

The reason is that there are just too many assumptions to make. That said, I will choose Satellite System because I relay off of another host. There’s some postfix post-configuration (say that 80 times fast) that needs to take place as well, but unless I receive a ton of flak, I will omit that too.

For good measure, reboot your system after installing all of those packages. Then, let’s see if ZFS is ready to get down:

:~$ sudo zpool status
no pools available

Once you’ve got Ubuntu 16.04 recognizing ZFS commands (as above) you can start configuring stuff. The first thing we need to do is configure our disk layout. This is where you need to apply what you’ve read about ZFS and RAIDZ-1, RAIDZ-2, etc. So, as you might recall, I have a bunch of disks involved in this 832TB build:

Here’s the deal, as concisely as I can make it: pick your RAIDZx vdev configuration, find the disks you want to involve, and then create your zpool by referencing the disk id. Why use disk id? If you reference device names such as /dev/sda, /dev/sdb, etc. when building the pool, you risk the system losing track of which disk is which should it decide to enumerate the disks in a different order on boot. ZFS metadata should be able to put the pieces back together, but just avoid this altogether. I have seen it happen! For whatever reason, various Linux distributions will enumerate devices differently across reboots. What really, really sucks is when your /etc/fstab references /dev/sdx for a mount point, it flip-flops with /dev/sdy, and all of a sudden your application is dumping data on the wrong disk.

It is for this reason that I ONLY mount disks (yes, even in single-disk environments) by filesystem label or by /dev/disk/by-id. What does this look like?

~$ ll /dev/disk/by-id/
drwxr-xr-x 2 root root 7080 Jul 28 08:57 ./
drwxr-xr-x 7 root root  140 Jul 27 11:57 ../
lrwxrwxrwx 1 root root    9 Jul 27 11:57 ata-HGST_HUH728080ALE600_R6G49Y6Y -> ../../sdz
lrwxrwxrwx 1 root root    9 Jul 27 11:57 ata-HGST_HUH728080ALE600_R6G4RN5Y -> ../../sdg
lrwxrwxrwx 1 root root    9 Jul 27 11:57 ata-HGST_HUH728080ALE600_R6G4RPTY -> ../../sdh
lrwxrwxrwx 1 root root    9 Jul 27 11:57 ata-HGST_HUH728080ALE600_R6G4RZ5Y -> ../../sdi
lrwxrwxrwx 1 root root   10 Jul 27 11:57 ata-HGST_HUH728080ALE600_R6G52EEY -> ../../sdaf
lrwxrwxrwx 1 root root    9 Jul 27 11:57 ata-HGST_HUH728080ALE600_R6G52NWY -> ../../sdc
lrwxrwxrwx 1 root root   10 Jul 27 11:57 ata-HGST_HUH728080ALE600_R6G537PY -> ../../sdak
lrwxrwxrwx 1 root root    9 Jul 27 11:57 ata-HGST_HUH728080ALE600_R6G548YY -> ../../sdm
lrwxrwxrwx 1 root root   10 Jul 27 11:57 ata-HGST_HUH728080ALE600_R6G54NXY -> ../../sdac
lrwxrwxrwx 1 root root    9 Jul 27 11:57 ata-HGST_HUH728080ALE600_R6G54UJY -> ../../sdj
[etc...]

You get the idea. All of those ata-HGST_HUH72808… values are the ids of the disks, usually a concatenation of the manufacturer, model, and serial number. Crucially, an id never changes.

Once you have that you’re ready to create your zpool! This is done with the following command:

sudo zpool create -o ashift=12 [poolname] raidz2 ata-HGST_HUH728080ALE600_R6G49Y6Y ...

The above command would create a zpool with a given name that consists of a RAIDZ-2 vdev with the disks listed thereafter. If you want to create multiple vdevs in the pool that’s easy, too! Just list the disks out for each vdev and then throw in another RAIDZx type and the rest of the disks, etc.
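As a concrete sketch of that multi-vdev case, here’s what creating a pool with two 10-disk RAIDZ-2 vdevs would look like. The pool name "tank" and the ata-DISKxx ids below are placeholders for illustration, not real devices:

```shell
# Illustrative only: "tank" and the ata-DISKxx ids are placeholders.
# Each "raidz2" keyword starts a new vdev built from the disks that follow it.
sudo zpool create -o ashift=12 tank \
  raidz2 ata-DISK01 ata-DISK02 ata-DISK03 ata-DISK04 ata-DISK05 \
         ata-DISK06 ata-DISK07 ata-DISK08 ata-DISK09 ata-DISK10 \
  raidz2 ata-DISK11 ata-DISK12 ata-DISK13 ata-DISK14 ata-DISK15 \
         ata-DISK16 ata-DISK17 ata-DISK18 ata-DISK19 ata-DISK20
```

ZFS stripes writes across both vdevs automatically; you never manage that layer yourself.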

HOLD UP WAIT A MINUTE!

You can see above that we’re specifying -o ashift=12, and the reason may not be obvious at first. We’re using modern disks that have 4k sector sizes. Almost all disks today use 4k sectors, but many also report 512-byte sectors to remain backwards compatible with legacy systems. If you do not specify the ashift (alignment shift), you can incur significant performance penalties. I won’t bore you, but the short version is that 2^ashift is the smallest I/O allowed on the vdev, so match that to your physical sector size and you’re golden. This cannot be set retroactively. Do this at the creation of each vdev, even when adding a new vdev to an existing pool!
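Since 2^ashift is the minimum I/O size on the vdev, the mapping from sector size to ashift is just a power of two, which you can sanity-check with plain shell arithmetic:

```shell
# 2^ashift = smallest I/O the vdev will issue
echo $((1 << 9))    # ashift=9  -> 512-byte sectors
echo $((1 << 12))   # ashift=12 -> 4k sectors
echo $((1 << 13))   # ashift=13 -> 8k sectors
```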

How do you find out what sector sizes your disks support? Easy! Run the two commands below:

~$ sudo fdisk -l
Disk /dev/sdac: 7.3 TiB, 8001563222016 bytes, 15628053168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

~$ sudo blockdev --getbsz /dev/sdg
4096

You can see above that our disk reports 512-byte logical and 4k physical sectors, which the blockdev command confirms.

Note: Depending on what model disk(s) you’re using, ZFS may correctly identify the sector size and create the vdev/zpool with the right alignment shift without you specifying it. However, do not bet on this. If you created a zpool with ashift=9 (the default for 512-byte sectors) and future disks stop reporting 512-byte compatibility, you will not be able to replace failed disks with new ones! Be super cautious here!

But what about SSDs? Same process! Depending on your SSD, you may find that the sector size is 8192 bytes, i.e. an 8k device, which is common on SSDs. If that’s the case, you would want to create your SSD vdev with -o ashift=13.

A special case with Intel NVMe devices

Ok, so we know -o ashift=13 is for SSDs that show 8192-byte sectors. But what does an Intel P3700 800GB PCIe NVMe disk support? Well, using Intel’s isdct command (available here), we can report the sector size straight from the device:

~$ sudo isdct show -a intelssd
ProductFamily : Intel SSD DC P3700 Series
ProductProtocol : NVME
ProtectionInformation : 0
ProtectionInformationLocation : 0
ReadErrorRecoveryTimer : Device does not support this command set.
SMARTEnabled : True
SMARTHealthCriticalWarningsConfiguration : 0
SMBusAddress : 106
SectorSize : 512

Hrm… 512-byte sectors. Oh well… wait, not so fast! Intel NVMe devices have variable sector sizes, according to this article. First, download the utility to the host and update the firmware:

sudo isdct load -intelssd 0
sudo isdct load -intelssd 1

Once complete, reboot the host.

Then, to set this device to use 4k sectors, we just issue the following commands:

~$ sudo isdct start -intelssd 0 -nvmeformat LBAformat=3 SecureEraseSetting=0 ProtectionInformation=0 MetadataSettings=0
~$ sudo isdct start -intelssd 1 -nvmeformat LBAformat=3 SecureEraseSetting=0 ProtectionInformation=0 MetadataSettings=0

This assumes you have two Intel P3700 NVMe devices that you want to update (hence the 0 and 1 for the index). We can confirm the settings by checking with:

~$ sudo isdct show -a -intelssd | grep Sec
PhysicalSectorSize : The selected drive does not support this feature.
SectorSize : 4096

Boom – 4k sectors!

Now that we’ve updated the NVMe firmware and set the sectors properly, let’s create the SLOG vdev! First, get your device id just like previously:

~$ sudo ls -l /dev/disk/by-id/ | grep nvme
lrwxrwxrwx 1 root root 13 Jul 28 08:56 nvme-INTEL_SSDPEDMD800G4_CVFT6484003U800CGN -> ../../nvme0n1
lrwxrwxrwx 1 root root 13 Jul 28 08:56 nvme-INTEL_SSDPEDMD800G4_CVFT64840094800CGN -> ../../nvme1n1

Then create the SLOG vdev as part of the original pool we created:

sudo zpool add -o ashift=12 [poolname] log mirror nvme-INTEL_SSDPEDMD800G4_CVFT6484003U800CGN nvme-INTEL_SSDPEDMD800G4_CVFT64840094800CGN

Finally, let’s look at the zpool as a whole:

~$ sudo zpool status
  pool: [poolname]
 state: ONLINE
  scan: scrub repaired 0 in 0h7m with 0 errors on Sun Sep 10 00:31:16 2017
config:

        NAME                                             STATE     READ WRITE CKSUM
        [poolname]                                       ONLINE       0     0     0
          raidz2-0                                       ONLINE       0     0     0
            ata-HGST_HUH728080ALE600_R6G49Y6Y            ONLINE       0     0     0
            ...
          raidz2-1                                       ONLINE       0     0     0
            ata-HGST_HUH728080ALE600_R6G54V7Y            ONLINE       0     0     0
            ...
          raidz2-2                                       ONLINE       0     0     0
            ata-HGST_HUH728080ALE600_R6G5A7TY            ONLINE       0     0     0
            ...
          raidz2-3                                       ONLINE       0     0     0
            ata-HGST_HUH728080ALE600_R6G5JZRY            ONLINE       0     0     0
            ...
          raidz2-4                                       ONLINE       0     0     0
            ...
        logs
          mirror-5                                       ONLINE       0     0     0
            nvme-INTEL_SSDPEDMD800G4_CVFT6484003U800CGN  ONLINE       0     0     0
            nvme-INTEL_SSDPEDMD800G4_CVFT64840094800CGN  ONLINE       0     0     0

I’ve obviously truncated device ids from the output above, but you get the idea. Because I have 52 disks and created 5 RAIDZ-2 vdevs with 10 disks each, I have 2 disks left over. Let’s add those two remaining disks to the pool as spares:

~$ sudo zpool add [poolname] spare ata-HGST_HUH728080ALE600_VJGRVR1X ata-HGST_HUH728080ALE600_VJGRW57X

Alright!

If you happen to have 50 HGST 8TB disks in this configuration with two Intel P3700 800GB NVMe disks, then you can see the pool configuration should match mine with the following command:

~$ sudo zpool list
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
drpool1   362T  1.20T   361T         -     0%     0%  1.00x  ONLINE  -

Note that for RAIDZ pools, zpool list shows the raw pool size, including the space that will go to parity; to see usable capacity after RAIDZ parity and ZFS overhead, use zfs list instead.
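If you’re wondering how 8TB disks land at that SIZE figure: assuming zpool list reports raw capacity across the 50 pool disks (spares don’t count), and that the “T” suffix means TiB while drives are sold in decimal TB, a rough back-of-the-envelope check with shell arithmetic lines up:

```shell
# 50 data+parity disks x 8 TB (decimal bytes), converted to TiB
echo $(( 50 * 8000000000000 / (1024 ** 4) ))   # -> 363 (zpool shows 362T after labels/padding)
```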

A couple final tweaks

ZFS on Linux caps the ARC at roughly 50% of system RAM by default. If you’re like me and have 256GB of RAM in a box whose whole job is storage, you don’t want to leave half of that sitting with the OS. So, instead, we can edit /etc/modprobe.d/zfs.conf and add a line that reads options zfs zfs_arc_max=206158430208, which comes out to 192GB (the setting is defined in bytes) for the ARC max size. Granted, even the remaining 64GB is a lot of RAM for the OS, but I am just being cautious.
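Since zfs_arc_max is specified in bytes, the easiest way to avoid fat-fingering a twelve-digit number is to compute it (192GiB in this case):

```shell
# 192 GiB expressed in bytes, for the zfs_arc_max setting
echo $(( 192 * 1024 * 1024 * 1024 ))   # -> 206158430208
```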

One last thing we absolutely want to configure is Zed, the ZFS Event Daemon! Zed runs alongside ZFS and alerts us on disk failures, scrub results, and other pool events. It’s included with ZFS on Linux, and on Ubuntu its configuration file lives at /etc/zfs/zed.d/zed.rc by default. Let’s look at my configuration:

~$ sudo cat /etc/zfs/zed.d/zed.rc
ZED_EMAIL_ADDR="[email address to receive all the stuff]"
ZED_EMAIL_OPTS="-s '@SUBJECT@' @ADDRESS@"
ZED_NOTIFY_INTERVAL_SECS=120
ZED_NOTIFY_VERBOSE=1

I’ve removed all comments from the above output so that you can see what I have set. You can see that it’s pretty simple overall. You want to make sure all of the fields above are set and are not commented out so that Zed runs properly. If you’re unsure of what to set, check your config file as the comments will still be in place explaining what each option does.

Assuming you have your SMTP relay/postfix/etc. configured properly, you should be able to run the following command:

~$ sudo zpool scrub [poolname]

Because there’s no data on the pool, it should run very quickly (minutes), and you should receive the following email:

The reason we get this email even for a clean scrub is that ZED_NOTIFY_VERBOSE=1 sends all output generated by Zed via email, even when it’s not critical.

At this point, we’re now ready to start creating datasets (or zvols if that’s your thing). But, for now, that’s a wrap. Stay tuned for more on this topic and let me know if you are lost or want to see any aspect of this article highlighted in more detail! Thanks everyone!
