Reliably boot Fedora with root on ZFS

Revised 2020-09-22

What's all this?

This article is a walk-through for installing Fedora Linux with root on ZFS. It has been tested with:

Fedora 32, kernel-5.7.14, zfs-0.8.4

Fedora 32, kernel-5.7.15, zfs-0.8.4

Fedora 32, kernel-5.7.16, zfs-0.8.4

Fedora 32, kernel-5.7.17, zfs-0.8.4

Fedora 32, kernel-5.8.4, zfs-0.8.4*

Fedora 32, kernel-5.8.6, zfs-0.8.4*

Fedora 32, kernel-5.8.7, zfs-0.8.4*

Fedora 32, kernel-5.8.8, zfs-0.8.4*

Fedora 32, kernel-5.8.9, zfs-0.8.4*

Fedora 32, kernel-5.8.10, zfs-0.8.4*

*= zfs-0.8.4 requires a patch to work with kernel-5.8.x. The procedure for obtaining and applying the patch is described below at the appropriate step.

UPDATE AND WARNING:

If you're already running with root-on-zfs, doing a routine update from kernel-5.7.x to kernel-5.8.x will leave you without zfs modules and a careless reboot will present the Black Screen Of Dracut. To avoid this, apply the patch before running "dnf update". Otherwise, you can apply the patch in chroot using your rescue/installer system.

Prior art

Earlier (and unfortunately far more complex) versions of this document exist.

Success with BLS

This version of the guide features BLS - "Boot Loader Specification", which (along with other improvements by the packagers) makes it possible to update the kernel or upgrade Fedora and reboot successfully.

Fear, Uncertainty and Doubt

Fedora is a rapidly evolving distribution. Sometimes the kernel package gets ahead of the ZFS package, making it impossible to build the zfs modules. To make matters worse, the documentation for ZFS is sometimes out of date, so you really have no recourse but to read the ZFS Issue Tracker to see if people are complaining. Since it takes only a few minutes to create a virtual machine using these instructions, you can use one to foresee difficulties.

Followup

Preliminaries

Hardware

You can work with real hardware or a virtual machine. Some section names start with [RH] "Real hardware" or [VM] "Virtual machine" - they only apply to those respective cases. Everything else applies to both. If this is your first time, following the virtual machine path is a good way to learn without committing hardware or accidentally reformatting your working system disk.

Installer system

You'll need a Fedora Linux system that has support for ZFS to follow this guide. After installing Fedora, visit the ZFS on Linux site and follow the instructions.

I suggest creating this system on a removable device and keeping it in a safe place because it's occasionally necessary to rescue root-on-zfs systems.

Helper script

We will create a root-on-zfs operating system by running commands mostly in the host environment, but some steps have to be taken inside the target, which is done via the "chroot" command. Without additional configuration, many Linux commands won't work inside a chroot. To fix that, we need a special script, "zenter." Some-but-not-all Linux distributions provide a command that does this. (Not Fedora...)

Here's the source. Save it in a file "zenter.sh" and proceed. (Or you can download zenter here.)

#!/bin/bash
# zenter - Mount system directories and enter a chroot
target=$1
mount -t proc proc $target/proc
mount -t sysfs sys $target/sys
mount -o bind /dev $target/dev
mount -o bind /dev/pts $target/dev/pts
chroot $target /bin/env -i \
    HOME=/root TERM="$TERM" PS1='[\u@chroot \W]\$ ' \
    PATH=/bin:/usr/bin:/sbin:/usr/sbin \
    /bin/bash --login
echo "Exiting chroot environment..."
umount $target/dev/pts
umount $target/dev/
umount $target/sys/
umount $target/proc/

Install the script to a directory on your PATH:

cp -a zenter.sh /usr/local/sbin/zenter

Variables

Installation variables

VER=32
POOL=Magoo
USER=hugh
PASW=mxyzptlk
NAME="Hugh Sparks"

Define a group of variables from one of the following two sections:

[RH] Variables for working with a real storage device

DEVICE=/dev/sda
PART1=1
PART2=2
PART3=3

The device name is only an example: when you add a physical disk, you must identify the new device name and carefully avoid blasting a device that's already part of your operating system.

IMPORTANT: Adding or removing devices can alter all device and partition names after reboot. This is why modern linux distributions avoid using them in places like fstab. We will convert device names to UUIDs as we proceed.

[VM] Variables for working with a virtual machine

DEVICE=/dev/nbd0
PART1=p1
PART2=p2
PART3=p3
IMAGE=/var/lib/libvirt/images/$POOL.qcow2

In the virtual machine case, the device name will always be the same unless you're using nbd devices for some other purpose.

[VM] Create a virtual disk

qemu-img create -f qcow2 ${IMAGE} 10G

[VM] Mount the virtual disk in the host file system

modprobe nbd
qemu-nbd --connect=/dev/nbd0 ${IMAGE} -f qcow2

[RH] Deal with old ZFS residue

If your target disk was ever part of a zfs pool, you need to clear the label before you repartition the device. First list all partitions:

sgdisk -p $DEVICE

For each partition number "n" that has type BF01 "Solaris /usr & Mac ZFS", execute:

zpool labelclear -f ${DEVICE}n

If you suspect the whole disk (no partitions) was part of a zfs array, clear that label using:

zpool labelclear -f ${DEVICE}

Partition the target

This example uses a very simple layout: An EFI partition, a boot partition and a ZFS partition that fills the rest of the disk.

Erase the existing partition table

sgdisk -Z $DEVICE

Create a 200MB EFI partition (PART1)

sgdisk -n 1:0:+200M -t 1:EF00 -c 1:EFI $DEVICE

Create a 500MB boot partition (PART2)

sgdisk -n 2:0:+500M -t 2:8300 -c 2:Boot $DEVICE

Create a ZFS partition (PART3) using the rest of the disk:

sgdisk -n 3:0:0 -t 3:BF01 -c 3:ZFS $DEVICE

Format EFI and boot partitions

mkfs.fat -F32 ${DEVICE}${PART1}
mkfs.ext4 ${DEVICE}${PART2}

Create the ZFS pool and datasets

Create a pool

zpool create $POOL -m none ${DEVICE}${PART3} -o ashift=12 -o cachefile=none

This is a very simple layout that has no redundancy. For a production system, you would create a mirror, raidz array or some combination. These topics are covered on many websites such as ZFS Without Tears.

If for some reason you want to keep using a system with one device, adding the following option to zpool create will give you 2x redundancy (and half the space):

-O copies=2

Set pool properties

zfs set compression=on $POOL
zfs set atime=off $POOL

Re-import the pool so devices are identified by UUIDs

zpool export $POOL
udevadm trigger --settle
zpool import $POOL -d /dev/disk/by-uuid -o altroot=/target -o cachefile=none

Create datasets

zfs create $POOL/fedora -o xattr=sa -o acltype=posixacl
zfs create $POOL/fedora/var -o exec=off -o setuid=off -o canmount=off
zfs create $POOL/fedora/var/cache
zfs create $POOL/fedora/var/log
zfs create $POOL/fedora/var/spool
zfs create $POOL/fedora/var/lib -o exec=on
zfs create $POOL/fedora/var/tmp -o exec=on
zfs create $POOL/www -o exec=off -o setuid=off
zfs create $POOL/home -o setuid=off
zfs create $POOL/root

The motivation for using multiple datasets is similar to the reason more conventional systems use multiple LVM volumes:

To preserve user data between operating systems.

To assign special properties to selected datasets and their children.

To isolate user accounts and enforce quotas.

To avoid mixing user data with operating system files.

To segregate static and dynamic operating system files.

To control the snapshot process.

Set ZFS mountpoints

zfs set mountpoint=/ $POOL/fedora
zfs set mountpoint=/var $POOL/fedora/var
zfs set mountpoint=/var/www $POOL/www
zfs set mountpoint=/home $POOL/home
zfs set mountpoint=/root $POOL/root

The reason for using ZFS mountpoints during installation is to avoid modifying the host system's fstab and to smooth the transition to the chroot environment for the final installation steps.

Later we'll switch to legacy mountpoints. During Fedora updates or upgrades, files sometimes get saved in mountpoint directories before ZFS gets around to mounting the datasets at boot time. This is a catastrophe because datasets can't be mounted on non-empty directories: the stray files become invisible and the system will fail to boot or exhibit bizarre symptoms. Fedora's update scripts know about fstab and make sure things are mounted at the right time. Hence we must accommodate.
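If you want to check for this condition before rebooting, a small helper like the following can warn you. This is a hypothetical helper of my own, not part of the required steps; run it against each legacy mountpoint directory while the corresponding dataset is unmounted:

```shell
# warn_nonempty: complain if a mountpoint directory contains stray
# files, which would prevent the dataset from mounting at boot.
# (Sketch only - not part of the standard procedure.)
warn_nonempty() {
    dir=$1
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
        echo "WARNING: $dir is not empty"
        return 1
    fi
    return 0
}
```

For example, "warn_nonempty /var/www" run while $POOL/www is unmounted would flag exactly the situation described above.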

Don't snapshot useless data

zfs set com.sun:auto-snapshot=false $POOL/fedora/var/tmp
zfs set com.sun:auto-snapshot=false $POOL/fedora/var/cache

When com.sun:auto-snapshot=false, 3rd party snapshot software is supposed to exclude the dataset. Otherwise all datasets are included in snapshots.

This is an example of a user-created property. ZFS itself doesn't attach any meaning to such properties. They conventionally have "owned" names based on DNS to avoid conflicts.

Mount the boot partition

mkdir /target/boot
mount -U `lsblk -nr ${DEVICE}${PART2} -o UUID` /target/boot
rm -rf /target/boot/*

Mount the EFI partition

mkdir /target/boot/efi
mount -U `lsblk -nr ${DEVICE}${PART1} -o UUID` /target/boot/efi -o umask=0077,shortname=winnt
rm -rf /target/boot/efi/*

The "rm -rf" commands are there in case you're repeating these instructions on a previously partitioned device where an operating system was installed.

Install the operating system

Install a minimal Fedora system

dnf install -y --installroot=/target --releasever=$VER \
    @minimal-environment \
    kernel kernel-modules kernel-modules-extra \
    grub2-efi-x64 shim-x64 mactel-boot

Optional: Add your favorite desktop environment to the list e.g. @cinnamon-desktop.

Install ZFS

dnf install -y --installroot=/target --releasever=$VER \
    http://download.zfsonlinux.org/fedora/zfs-release.fc$VER.noarch.rpm
dnf install -y --installroot=/target --releasever=$VER \
    zfs zfs-dracut

Configure the target

Configure name resolver

cat > /target/etc/resolv.conf <<-EOF
search csparks.com
nameserver 192.168.1.2
EOF

(Be yourself.)

You may object that NetworkManager likes to use a symbolic link here that vectors off into NetworkManager Land. This concept has caused numerous boot failures on most of the systems I manage because of permission problems in the target directory. These can be corrected by hand, but I've had an easier life since I took over this file and used the traditional contents. Your mileage may vary. Someday Fedora will correct the problem. If you're in the mood to find out, don't create this file.

Show full path names in "zpool status"

cat > /target/etc/profile.d/grub2_zpool_fix.sh <<-EOF
export ZPOOL_VDEV_NAME_PATH=YES
EOF

[VM] Tell dracut to include the virtio_blk device

cat > /target/etc/dracut.conf.d/fs.conf <<-EOF
filesystems+=" virtio_blk "
EOF

Keep the spaces around virtio_blk!

Don't use zfs.cache

cat > /target/etc/default/zfs <<-EOF
ZPOOL_CACHE="none"
ZPOOL_IMPORT_OPTS="-o cachefile=none"
EOF

Set grub parameters

cat > /target/etc/default/grub <<-EOF
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=Fedora
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT=console
GRUB_DISABLE_RECOVERY=true
GRUB_DISABLE_OS_PROBER=true
GRUB_PRELOAD_MODULES=zfs
GRUB_ENABLE_BLSCFG=false
EOF

We're going to switch to BLS later.

Disable selinux

sed -i 's/SELINUX=enforcing/SELINUX=disabled/' /target/etc/selinux/config

Create a hostid file

chroot /target zgenhostid

Add user+password

chroot /target useradd $USER -c "$NAME" -G wheel
echo "$USER:$PASW" | chpasswd -R /target

Prepare for first boot

systemd-firstboot \
    --root=/target \
    --locale=C.UTF-8 \
    --keymap=us \
    --hostname=$POOL \
    --setup-machine-id

Create fstab for legacy mountpoints

cat > /target/etc/fstab <<-EOF
UUID=`lsblk -nr ${DEVICE}${PART2} -o UUID` /boot ext4 defaults 0 0
UUID=`lsblk -nr ${DEVICE}${PART1} -o UUID` /boot/efi vfat umask=0077,shortname=winnt 0 2
$POOL/fedora/var/cache /var/cache zfs defaults 0 0
$POOL/fedora/var/lib /var/lib zfs defaults 0 0
$POOL/fedora/var/log /var/log zfs defaults 0 0
$POOL/fedora/var/spool /var/spool zfs defaults 0 0
$POOL/fedora/var/tmp /var/tmp zfs defaults 0 0
$POOL/www /var/www zfs defaults 0 0
$POOL/home /home zfs defaults 0 0
$POOL/root /root zfs defaults 0 0
EOF

Switch to legacy mountpoints

zfs set mountpoint=legacy $POOL/fedora/var
zfs set mountpoint=legacy $POOL/www
zfs set mountpoint=legacy $POOL/home
zfs set mountpoint=legacy $POOL/root

Chroot into the target

zenter /target
mount -a

Prepare for grub2-mkconfig

source /etc/profile.d/grub2_zpool_fix.sh

Running grub2-mkconfig will fail without this definition. It will always be defined after logging into the target, but we're not there yet.

Configure boot loader

grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
grub2-switch-to-blscfg

Use import scanning instead of zfs cache:

systemctl disable zfs-import-cache
systemctl enable zfs-import-scan

Collect kernel and zfs version strings

kver=`rpm -q --last kernel | sed '1q' | sed 's/kernel-//' | sed 's/ .*$//'`
zver=`rpm -q zfs | sed 's/zfs-//' | sed 's/\.fc.*$//' | sed 's/-[0-9]//'`
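These pipelines are easier to trust when you can watch them run on a concrete string. A minimal sketch, using hypothetical sample strings in place of the live rpm output:

```shell
# Hypothetical rpm output (examples only, not live queries):
sample_kernel='kernel-5.8.10-200.fc32.x86_64 Thu 17 Sep 2020'
sample_zfs='zfs-0.8.4-1.fc32.x86_64'

# The same sed pipelines as above, applied to the samples:
kver=$(echo "$sample_kernel" | sed '1q' | sed 's/kernel-//' | sed 's/ .*$//')
zver=$(echo "$sample_zfs" | sed 's/zfs-//' | sed 's/\.fc.*$//' | sed 's/-[0-9]//')

echo "$kver"   # 5.8.10-200.fc32.x86_64
echo "$zver"   # 0.8.4
```

The kver pipeline keeps only the newest kernel line, strips the "kernel-" prefix, and drops the install date; the zver pipeline strips the "zfs-" prefix, the Fedora disttag, and the package release number.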

Patch zfs for kernel-5.8.x

dnf install -y patch wget
cd /usr/src/zfs-$zver
wget -q https://server.csparks.com/BootFedoraZFS/vmalloc.patch
patch -u -s -p1 < vmalloc.patch

This ugliness patches two lines in the zfs source code where the __vmalloc function is used. As of kernel-5.8.x, the function takes two parameters rather than three.

This topic is very active on the ZOL issue tracker and has already been fixed in the upstream project. I expect we'll see an update Real Soon Now.

Build and install zfs modules

dkms install -m zfs -v $zver -k $kver

Exit the chroot

umount /boot/efi
umount /boot
exit

Export the pool

zpool export $POOL

Boot the target

[RH] Reboot and select the new UEFI disk

It works!

[VM] Disconnect the virtual disk

qemu-nbd --disconnect /dev/nbd0

If you forget to disconnect the nbd device, the virtual machine won't be able to access the virtual disk.

[VM] Create a virtual machine

virt-install \
    --name=$POOL \
    --os-variant=fedora$VER \
    --vcpus=4 \
    --memory=32000 \
    --boot=uefi \
    --disk path=$IMAGE,format=qcow2 \
    --import \
    --noreboot \
    --noautoconsole \
    --wait=-1

You only need to do this once. By replacing the disk image file, other configurations can be tested on the same vm.

[VM] Startup

Use the VirtManager GUI or:

virsh start $POOL
virt-viewer $POOL

Additional configuration

Things to do after you've successfully logged in.

Set the timezone

timedatectl set-timezone America/Chicago
timedatectl set-ntp true

Give your system a nice name

hostnamectl set-hostname magoo

Complaints and suggestions

I detest superstitions, gratuitous complications, obscure writing, and bugs. If you get stuck or if your understanding exceeds mine, please share your thoughts. (I like to hear good news too.)

References

In the past, it was necessary to be vigilant when doing "dnf update" or a Fedora upgrade because a new kernel or zfs version made it necessary to run a fixup script before rebooting. In the dark ages before Fedora 31, this script was fairly complicated.

With the advent of BLS combined with other improvements by the kernel and zfs packagers, this is no longer necessary. After any update you can reboot with confidence that you'll never see the Black Screen Of Dracut or the Dread Prompt Of Grub.

If you're rash enough to be booting Fedora on ZFS in a production system, it's almost imperative that you maintain a simple virtual machine in parallel. When you see that updates are available, clone the VM and update that first. If it won't boot, attempt your fixes there. If all else fails, freeze kernel updates on your production system and wait for better times. (See Appendix - Freeze kernel updates.)

Appendix - Pure ZFS systems

With UEFI motherboards, the only way to "ZFS purity" is to put your EFI partition on a separate device, rather than on a partition of a device that also has all or part of a ZFS pool. It's also possible to do away with the ext4 /boot partition by keeping it in a dataset, but this will put you into contention with the "pool features vs grub supported features" typhoon of uncertainty. (See Grub-compatible pool creation.)

A better way, in my opinion, is to use a small SSD with both EFI and boot partitions. The ZFS pool for the rest of the operating system can be assembled from disks without partitions, "whole disks", which most ZFS pundits recommend. This example doesn't follow that advice because it's intended to be a simplified tutorial.

If you still want to have /boot on ZFS, it's necessary to add the grub2 zfs modules to the efi partition:

dnf install grub2-efi-x64-modules
mkdir -p /target/boot/efi/EFI/fedora/x86_64-efi
cp -a /target/usr/lib/grub/x86_64-efi/zfs* /target/boot/efi/EFI/fedora/x86_64-efi

The zfs.mod file in that collection does not support all possible pool features, but it will work if you find a compromise. Currently, the zfs.mod with Fedora 32 will handle a ZFS pool with default "compression=on" settings created using zfs-0.8.4.

Appendix - Fix boot problems

You'll need a thumb drive or other detachable device that has a linux system and ZFS support. Boot the device.

Import the pool

zpool import -f $POOL -o altroot=/target

Chroot into the system

zenter /target
mount -a

Rebuild the zfs modules

dnf reinstall zfs-dkms

If you see errors from dkms, you'll probably have to revert to an earlier kernel and/or version of zfs. Such problems are temporary and rare.

Rebuild the EFI partition

First make sure you're running in chroot (zenter) and that the right /boot/efi partition is mounted:

df -h

Next run:

rm -rf /boot/efi/*
dnf reinstall grub2-efi-x64 shim-x64 fwupdate-efi mactel-boot

Reinstall BLS

Edit /etc/default/grub and disable BLS:

...
GRUB_ENABLE_BLSCFG=false
...

Then run:

grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
grub2-switch-to-blscfg

Delete the Abominable Cache File

rm -f /etc/zfs/zpool.cache

This thing has a way of rising from the dead.

Rebuild the initramfs

kver=`rpm -q --last kernel | sed '1q' | sed 's/kernel-//' | sed 's/ .*$//'`
dracut -fv --kver $kver

After any or all of these interventions, exit with:

umount /boot/efi
exit
zpool export $POOL

Reboot

Learn from others

Visit the ZFS Issue Tracker and see what others discover. If your problem is unique, join up and post a question.

If you discover that you can't build the zfs modules for a new kernel, you'll have to use your recovery device and revert. (Or use a virtual machine to find out without blowing yourself up.)

Once you've got your system running again, you can "version lock" the kernel packages. This will allow other Fedora updates to proceed, but hold the kernel at the current version:

dnf versionlock add kernel-`uname -r`
dnf versionlock add kernel-core-`uname -r`
dnf versionlock add kernel-devel-`uname -r`
dnf versionlock add kernel-modules-`uname -r`
dnf versionlock add kernel-modules-extra-`uname -r`
dnf versionlock add kernel-headers-`uname -r`
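The six commands above can also be written as a loop, if you prefer. This is just a sketch; the package list is exactly the one used above:

```shell
# Lock every kernel package at the currently running version:
for pkg in kernel kernel-core kernel-devel \
           kernel-modules kernel-modules-extra kernel-headers; do
    dnf versionlock add "$pkg-$(uname -r)"
done
```

The same loop with "delete" in place of "add" releases the locks selectively, as shown later in this section.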

When it's safe to allow kernel updates, you can release all locks using the expression:

dnf versionlock clear

If you have locks on other packages and don't want to clear all of them, you can release only the previous kernel locks:

dnf versionlock delete kernel-`uname -r`
dnf versionlock delete kernel-core-`uname -r`
dnf versionlock delete kernel-devel-`uname -r`
dnf versionlock delete kernel-modules-`uname -r`
dnf versionlock delete kernel-modules-extra-`uname -r`
dnf versionlock delete kernel-headers-`uname -r`

Appendix - Stuck in the emergency shell

The screen is mostly black with plain text. You see:

[ OK ] Started Emergency Shell.
[ OK ] Reached target Emergency Mode.

This is the Black Screen Of Dracut.

You'll be invited to run journalctl which will list the whole boot sequence. Near the end, carefully inspect lines that mention ZFS. There are three common cases:

1) Journal entry looks like this:

systemd[1]: Failed to start Import ZFS pools by cache file.

You are a victim of the Abominable Cache File. The fix is easy. Boot your recovery device, enter the target, and follow the section that deals with getting rid of the cache file in Appendix - Fix boot problems.

2) Journal entry looks like this:

...
Starting Import ZFS pools by device scanning...
cannot import 'Magoo': pool was previously in use from another system.

You probably forgot to export the pool after tampering with it from another system. (Such as when you previously used the recovery device.) You can fix the problem from the emergency shell:

zpool import -f Magoo -N
zpool export Magoo
reboot

3) If you see messages about not being able to load the zfs modules, that may be normal because it takes several tries during the boot sequence. But if it ends up being unable to load the modules, try this:

modprobe zfs

If that fails, the zfs modules were never built or they were left out of the initramfs. To fix that, go through the entire sequence described in Appendix - Fix boot problems.

If you can execute the modprobe successfully, you should try the next fix:

Appendix - Work-around for a race condition

During boot, it's normal to see a few entries like this in the journal:

dracut-pre-mount[508]: The ZFS modules are not loaded.
dracut-pre-mount[508]: Try running '/sbin/modprobe zfs' as root to load them.

But if the zfs modules aren't loaded by the time dracut wants to mount the root filesystem, the boot will fail. This problem was reported in 2019: ZOL 0.8 Not Loading Modules or ZPools on Boot #8885. I never saw this until I tried to boot a fast flash drive on a slow computer. Since I knew the flash drive worked on other machines, I was surprised to see The Black Screen Of Dracut.

Here's a fix you can apply when your root-on-zfs device is mounted for repair on /target:

mkdir /target/etc/systemd/system/systemd-udev-settle.service.d
cat > /target/etc/systemd/system/systemd-udev-settle.service.d/override.conf <<-EOF
[Service]
ExecStartPre=/usr/bin/sleep 5
EOF

Appendix - Stuck in grub

A black screen with an enigmatic prompt:

grub>

This is the Dread Prompt Of Grub.

Navigating this little world merits a separate document Grub Expressions. A nearly-foolproof solution is to run through Appendix - Fix boot problems. Pay particular attention to the step where the entire /boot/efi partition is recreated.

Appendix - Enable swapping

Using a zvol for swapping is problematic (as of 2020-08, zfs 0.8.4). If you feel the urge to try, first read the swap deadlock thread.

Sooner or later, the issues will be fixed. (Maybe now?) Here's how to try it out:

Create a swap dataset

zfs create -V 4G $POOL/swap \
    -o volblocksize=4k \
    -o compression=zle \
    -o refreservation=4.13G \
    -o primarycache=metadata \
    -o secondarycache=none \
    -o logbias=throughput \
    -o sync=always \
    -o com.sun:auto-snapshot=false

Add the swap volume to fstab:

...
/dev/zvol/Magoo/swap none swap defaults 0 0
...

After you're running the target, enable swapping:

swapon -av

This setting is remembered so swapping will operate after reboot.

Don't enable hibernation. It tries to use swap space for the memory image but the dataset is not available early enough in the boot process.