Battle testing data integrity verification with ZFS, Btrfs and mdadm+dm-integrity

Posted on 2019-05-05. Last updated on 2020-01-23.

In this article I share the results of a home-lab experiment in which I threw some different problems at ZFS, Btrfs and mdadm+dm-integrity in a RAID-5 setup.

Introduction

Let me start by saying that this is a simple write-up; it wasn't originally intended to be anything but some personal notes, which I then decided to share.

I did my tests on and off during the course of about a week, but I have tried to be consistent. I have repeated many of the tests more than once, but did not document everything as the results were very similar.

My main interest was to see how the different systems would handle multiple breakdown situations in a RAID-5 setup. I also tested mirror (RAID-1) setups, but due to the length of the article I later decided not to include those.

I have used the word "pool" from the world of ZFS and Btrfs whenever I am dealing with the RAID-5 array.

Please forgive any shortcomings, missing parts, and mistakes in this write-up. Also, English is not my native language.

Now, on to the subject.

Whenever we use any kind of precautionary measures against data corruption, such as backup and/or filesystem data integrity verification, we need to test our setup with at least some simulated failures before we implement a solution. If we never "battle test" our solution, we have no real idea how it's going to handle a breakdown.

We need to ask questions like:

If my system breaks down right now do I have adequate measures in place or will I lose important data?

If I do have backup in place, will my backup suffice? Is it recent enough? Is it secure enough?

What if my backup solution breaks during restoration? Do I need multiple backup solutions?

What about bit rot?

Do I need running data integrity verification?

Do I need to backup everything or can I perhaps split data into important and non-important categories?

Do I need to automate some of the procedures?

Have I tested my solution?

Have I tested my solution?

Have I tested my solution?

Yes, we really need to test our solutions thoroughly :)

In any case, ZFS and Btrfs are both amazing open source filesystems with built-in data integrity verification.

I have more experience using ZFS, and the last time I tested Btrfs, it was not performing well. File transfers were slow and a situation did occur where I lost some files. However, this was a very long time ago. I have since looked through the Btrfs source code and commit logs, and Btrfs has received many fixes and improvements - especially during the last couple of years.

I therefore decided to put up a simple home-test environment with bare metal and throw some simulated problems against both ZFS and Btrfs and then try to deal with the problems in an as-identical-as-possible manner on both systems. Later I added mdadm+dm-integrity.

I managed to get along with all the systems using only their respective man pages, even though I think the Btrfs documentation could benefit a lot from some examples.

I used old and cheap hardware suitable for a home-lab.

The computer I used has only 4 SATA-II connectors and I decided to use one for the boot device itself. I then used the rest for a RAID-5 (RAID-Z in ZFS) with just three hard drives. I could have booted from a USB stick and then used four drives, but I wanted to speed up both the installation time and boot time.

A RAID-5 requires 3 or more physical drives. RAID-5 stores parity blocks distributed across each disk. In the event of a failed disk, these parity blocks are used to reconstruct the data on a replacement disk. RAID-5 can withstand the loss of one disk.
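As a toy illustration of how the parity works (a minimal Python sketch, not real RAID code), single-parity RAID-5 can be thought of as XOR across the data blocks:

```python
# Minimal sketch of RAID-5 style single parity (not real RAID code).
# The parity block is the XOR of the data blocks, so losing any one
# block - data or parity - lets us XOR the survivors to rebuild it.

def xor_blocks(blocks):
    """XOR a list of equal-sized byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

data1 = b"hello..."                   # block stored on disk 1
data2 = b"world..."                   # block stored on disk 2
parity = xor_blocks([data1, data2])   # parity block on disk 3

# Disk 1 dies: reconstruct its block from the surviving disks.
rebuilt = xor_blocks([data2, parity])
assert rebuilt == data1
```

Real implementations distribute the parity blocks across all disks rather than dedicating one disk to parity, which is what spreads the read and write load evenly.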

I know some people "frown" upon RAID-5, but a RAID-5 is a really great way to utilize both speed and space. In any case no kind of RAID setup is a replacement for proper backup. If your data is important to you, you should always back it up.

In the picture below I have set up two identical computers. During the main testing I always used the same machine and hardware, but for extensive and repeated testing I put the second machine to work too.

Both computers are equipped with an Intel Core 2 Duo E8400 3.00 GHz CPU with 8 GB of memory and an Intel Pro 1000 PT PCIe x1 Gigabit NIC. All the hard drives are some really amazing but old 1 TB Seagate Barracuda ES.2 drives from 2010 (with the latest firmware) that have all been through quite a lot of "beating" over the years. I believe I have about ten of these drives and (if I remember correctly) only one drive has failed, about a year ago. The rest are still going strong.

For ZFS I ran Debian Linux "Stretch" with kernel version 4.19.28 and zfs-dkms version 0.7.12-1 from backports. For Btrfs I ran Arch Linux with kernel version 5.0.9 and Btrfs version 4.20.2. For both ZFS and Btrfs I used Samba. In my experience Samba performs better than NFS even between Linux-only machines and even though NFS uses less resources, I prefer Samba for various reasons.

At some point during my testing with Btrfs I discovered dm-integrity. I therefore decided to set up a RAID-5 with mdadm+dm-integrity on the Arch Linux installation and repeat the tests.

During the process I sometimes jumped back and forth between different tests on the different systems. For example, I first tested ZFS, then repeated the tests with Btrfs, then began testing mdadm+dm-integrity, then went back and performed some more tests on both ZFS and Btrfs, etc. The article is therefore put together from the various tests, and the date and time in the different terminal outputs don't always match up. I also sometimes changed the disks in my setup, so the disk IDs occasionally change. Please ignore that.

Myths and misunderstandings

One thing that really bothers me is how much false information exists on the Internet regarding both ZFS and Btrfs.

Some misinformation has been spread due to inexperience, wrong expectations, and/or misunderstandings about the usage of these systems.

Let's get some of the myths and misunderstandings out of the way:

Myth: ZFS requires tons of memory!

This is one of the biggest misunderstandings about ZFS. The only situation in which ZFS requires lots of memory is if you specifically use de-duplication. I have run ZFS successfully using FreeBSD 12 on a Raspberry Pi 3 with two 1 TB USB disks attached to a single USB 3 hub. ZFS never used more than half the memory available during any kind of procedure and you can change the settings so it even runs with much less than that.

Myth: Red Hat has removed Btrfs because they consider it useless!

No, that is not why Red Hat removed Btrfs. A former Red Hat developer explains the situation on Hacker News.

Myth: ZFS and Btrfs require ECC memory!

ZFS or Btrfs without ECC memory is no worse than any other filesystem without ECC memory. Using ECC memory is recommended in situations where the strongest data integrity guarantees are required. Random bit flips caused by cosmic rays or by faulty memory can go undetected without ECC memory; any filesystem will write the damaged data from memory to disk and be unable to automatically detect the corruption. Also note that ECC memory is often not supported by consumer grade hardware, and it is more expensive. In any case, you can run ZFS and Btrfs without ECC memory - it's not a requirement.

Myth: Restoring a RAID-5 puts more stress on the drives!

Drives are not stressed! It's their job to read and write data! You are using your drives, not stressing them. It takes longer to restore a RAID-5 because the parity data needs to be calculated by the CPU, which is slower than simply copying data between disks in a mirror (RAID-1), but there is no stress involved.

Myth: Using USB disk devices with ZFS or Btrfs is okay!

Sometimes you can get away with it without any problems whatsoever, but many USB controllers and USB storage devices are really bad. If things break, you cannot blame the filesystem. On Btrfs a Parent transid verify failed error is often the result of a failed internal consistency check of the filesystem's metadata due to a bad USB storage device. Other issues such as automatic and sudden un-mounting, wrong file size, data corruption, sudden shutdown, and several other problems are often caused by a bad USB storage device and/or USB power issues.

Myth: Btrfs still has the write hole issue and is completely useless!

The myth part of this is that Btrfs is completely useless, not the problems with the write hole issue. As of writing, Btrfs still has some issues, but it is definitely not useless and you can even run RAID5/6 if you take some specific precautions. Check the RAID5/6 information. The "write hole" problem with Btrfs only potentially exists if you experience a power loss (an unclean shutdown) and then have a disk fail immediately thereafter (or possibly at the same time) - without running a scrub in between. These two distinct failures combined break the Btrfs RAID-5 redundancy. However, I was not able to reproduce the problem in any of my many tests with Btrfs. Update 2020-01-23: People have been emailing me with examples of the write hole problem persisting, where they have lost data, even with the Btrfs version in the 5.x kernel.

Myth: Btrfs is abandoned!

Btrfs is used in production worldwide. Btrfs is deployed by Facebook on millions of servers with significant efficiency gains. It is also used by many other companies and projects, and Btrfs keeps getting better and better.

Myth: mdadm+XYZ can replace ZFS or Btrfs!

No. They don't even compare.

Some advice

Most data loss reported on the mailing lists of ZFS, Btrfs, and mdadm, is down to user error while attempting to recover a failed array. Never use a trial-and-error approach when something goes wrong with your filesystem or backup solution!

Very often a really bad situation is caused by a trial-and-error approach to a problem. With Btrfs many people immediately use the btrfs check --repair command when they experience an issue, but this is actually the very last command you want to run.

Understand what you can expect from the filesystem you're using, how it works, and how each system implements a specific functionality. Don't blame the filesystem when it doesn't fulfill your wrong expectations.

ZFS RAID-Z

Let's begin the testing with ZFS.

The three disks are listed "by-id" and I'll create the ZFS pool using those IDs, as they also contain the serial number, which makes it very easy to identify each drive.

$ ls -gG /dev/disk/by-id/
ata-ST31000340NS_9QJ089LF -> ../../sdd
ata-ST31000340NS_9QJ0EQ1V -> ../../sdb
ata-ST31000340NS_9QJ0F2YQ -> ../../sdc

With a RAID-Z (RAID-5) I can stand to lose one drive and the pool will still function, however I need to "resilver" the pool as soon as possible with a replacement drive.

Resilvering is the same concept as rebuilding a RAID array. With most other RAID implementations, there is no distinction between which blocks are in use and which aren't. A typical rebuild therefore starts at the beginning of the disk and proceeds until it reaches the end of the disk - this is how mdadm works and it is extremely slow. But because ZFS knows about the structure of the RAID system and the metadata, it rebuilds only the blocks in use. The ZFS developers therefore coined the term "resilvering" rather than "rebuilding".
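The difference can be sketched with a toy model (hypothetical numbers, not actual ZFS or mdadm code): a blind rebuild must touch every block on the replacement disk, while a resilver only touches the blocks the filesystem knows are in use:

```python
# Toy model of blind rebuild vs. resilver (hypothetical numbers).
TOTAL_BLOCKS = 1_000_000           # blocks on the replacement disk
used_blocks = set(range(50_000))   # suppose the pool is only ~5% full

def blind_rebuild():
    # mdadm-style: reconstruct every block from start to end,
    # because the RAID layer has no idea which blocks hold data.
    return TOTAL_BLOCKS

def resilver(used):
    # ZFS-style: filesystem and RAID layer are integrated,
    # so only blocks with live data need to be reconstructed.
    return len(used)

print(blind_rebuild())        # blocks rewritten by a full rebuild
print(resilver(used_blocks))  # blocks rewritten by a resilver
```

On a mostly empty pool the resilver finishes in a small fraction of the time, which matches the roughly 3-minute resilver times seen later in this article.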

I'm going to create a pool using the -f option because ZFS will detect that the attached drives used to belong to an old pool and will not allow for it to be used in a new pool unless forced to do so (I have used the drives in a previous setup).

# zpool create -f -O xattr=sa -O dnodesize=auto -O atime=off -o ashift=12 pool1 raidz ata-ST31000340NS_9QJ0F2YQ ata-ST31000340NS_9QJ0EQ1V ata-ST31000340NS_9QJ089LF

I'm then going to create a ZFS dataset on the pool with lz4 compression enabled.

# zfs create -o compress=lz4 pool1/pub
# zfs list
NAME        USED  AVAIL  REFER  MOUNTPOINT
pool1       575K  1.75T   128K  /pool1
pool1/pub   128K  1.75T   128K  /pool1/pub

I have then exported the "pub" directory using Samba and will begin by copying some files over from a client computer using rsync.

$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/
1.pdf
     18,576,345 100%  196.49MB/s  0:00:00 (xfr#1, to-chk=6/8)
2.pdf
     30,255,102 100%   70.89MB/s  0:00:00 (xfr#2, to-chk=5/8)
3.pdf
     22,016,195 100%   23.28MB/s  0:00:00 (xfr#3, to-chk=4/8)
bar.mkv
 35,456,180,485 100%  112.92MB/s  0:04:59 (xfr#4, to-chk=3/8)
boo.iso
    625,338,368 100%   21.64MB/s  0:00:27 (xfr#5, to-chk=2/8)
foo.mkv
  1,548,841,922 100%  135.76MB/s  0:00:10 (xfr#6, to-chk=1/8)
moo.iso
    415,633,408 100%   25.86MB/s  0:00:15 (xfr#7, to-chk=0/8)

Number of files: 8 (reg: 7, dir: 1)
Number of created files: 8 (reg: 7, dir: 1)
Number of deleted files: 0
Number of regular files transferred: 7
Total file size: 38,116,841,825 bytes
Total transferred file size: 38,116,841,825 bytes
Literal data: 38,116,841,825 bytes
Matched data: 0 bytes
File list size: 0
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 38,126,148,150
Total bytes received: 202

sent 38,126,148,150 bytes  received 202 bytes  106,945,717.68 bytes/sec
total size is 38,116,841,825  speedup is 1.0

Now the ZFS pool has some data:

# zfs list
NAME        USED  AVAIL  REFER  MOUNTPOINT
pool1      35.5G  1.72T   128K  /pool1
pool1/pub  35.5G  1.72T  35.5G  /pool1/pub

ZFS - Power outage

I'll then add yet another file using rsync and then pull the power cord to the ZFS machine half way through the transfer.

I have then aborted the rest of the file transfer on the client and turned the ZFS machine back on.

$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/
sending incremental file list
zoo.mkv
  5,918,261,248  54%   64.88kB/s   21:11:16  ^C

Because ZFS is using transactional transfers the file is going to be lost, but nothing has happened to the files already on the system, and there will be no kind of damage to the filesystem and no kind of filesystem checking needs to be run.

Let's take a look at the ZFS documentation from Oracle regarding the Transactional Semantics:

ZFS is a transactional file system, which means that the file system state is always consistent on disk. Traditional file systems overwrite data in place, which means that if the system loses power, for example, between the time a data block is allocated and when it is linked into a directory, the file system will be left in an inconsistent state. Historically, this problem was solved through the use of the fsck command. This command was responsible for reviewing and verifying the file system state, and attempting to repair any inconsistencies during the process. This problem of inconsistent file systems caused great pain to administrators, and the fsck command was never guaranteed to fix all possible problems. More recently, file systems have introduced the concept of journaling. The journaling process records actions in a separate journal, which can then be replayed safely if a system crash occurs. This process introduces unnecessary overhead because the data needs to be written twice, often resulting in a new set of problems, such as when the journal cannot be replayed properly. With a transactional file system, data is managed using copy on write semantics. Data is never overwritten, and any sequence of operations is either entirely committed or entirely ignored. Thus, the file system can never be corrupted through accidental loss of power or a system crash. Although the most recently written pieces of data might be lost, the file system itself will always be consistent. In addition, synchronous data (written using the O_DSYNC flag) is always guaranteed to be written before returning, so it is never lost.

This is confirmed by a look at the status of the pool:

# zpool status
  pool: pool1
 state: ONLINE
  scan: none requested
config:

    NAME                           STATE     READ WRITE CKSUM
    pool1                          ONLINE       0     0     0
      raidz1-0                     ONLINE       0     0     0
        ata-ST31000340NS_9QJ0F2YQ  ONLINE       0     0     0
        ata-ST31000340NS_9QJ0EQ1V  ONLINE       0     0     0
        ata-ST31000340NS_9QJ089LF  ONLINE       0     0     0

errors: No known data errors

# zfs list
NAME        USED  AVAIL  REFER  MOUNTPOINT
pool1      35.5G  1.72T   128K  /pool1
pool1/pub  35.5G  1.72T  35.5G  /pool1/pub

And from the client's point of view:

$ ls -gG mnt/testbox/pub/tmp
total 37194300
-rwxr-xr-x 1    18576345 Apr 21 09:08 1.pdf
-rwxr-xr-x 1    30255102 Apr 21 09:08 2.pdf
-rwxr-xr-x 1    22016195 Apr 21 09:08 3.pdf
-rwxr-xr-x 1 35456180485 Apr 21 07:58 bar.mkv
-rwxr-xr-x 1   625338368 Mar  5  2018 boo.iso
-rwxr-xr-x 1  1548841922 Apr 15 23:50 foo.mkv
-rwxr-xr-x 1   415633408 Mar  5  2018 moo.iso
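The transactional behaviour quoted from the Oracle documentation can be illustrated in userspace terms with the classic write-then-rename pattern (a simplified analogy only; ZFS implements this with copy-on-write block trees, not renames):

```python
# Simplified analogy for copy-on-write commits: never overwrite the
# old state in place; write a complete new copy, then switch over
# atomically. A crash leaves either the old or the new state intact,
# never a torn mix of the two.
import os
import tempfile

def atomic_write(path, data):
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the new copy is on disk
        os.replace(tmp, path)     # the atomic "commit" step
    except BaseException:
        os.unlink(tmp)            # crash/error: old state is untouched
        raise

atomic_write("state.txt", b"new consistent state")
```

This is exactly why the interrupted zoo.mkv transfer above lost only the file being written: the already-committed state was never overwritten in place.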

ZFS - Drive failure

Now I want to simulate a simple drive failure. I'm going to remove one of the drives from the ZFS machine, then replace it with another drive, and then resilver the ZFS pool.

I have removed the drive:

# zpool status
  pool: pool1
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: none requested
config:

    NAME                           STATE     READ WRITE CKSUM
    pool1                          DEGRADED     0     0     0
      raidz1-0                     DEGRADED     0     0     0
        ata-ST31000340NS_9QJ0F2YQ  ONLINE       0     0     0
        1803500998269517419        UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST31000340NS_9QJ0EQ1V-part1
        ata-ST31000340NS_9QJ089LF  ONLINE       0     0     0

errors: No known data errors

# zfs list
NAME        USED  AVAIL  REFER  MOUNTPOINT
pool1      35.5G  1.72T   128K  /pool1
pool1/pub  35.5G  1.72T  35.5G  /pool1/pub

Even though the pool is in a degraded state, I can still mount the pool on the client and use the files.

$ mount mnt/testbox/pub
$ ls -gG mnt/testbox/pub/tmp
total 37194300
-rwxr-xr-x 1    18576345 Apr 21 09:08 1.pdf
-rwxr-xr-x 1    30255102 Apr 21 09:08 2.pdf
-rwxr-xr-x 1    22016195 Apr 21 09:08 3.pdf
-rwxr-xr-x 1 35456180485 Apr 21 07:58 bar.mkv
-rwxr-xr-x 1   625338368 Mar  5  2018 boo.iso
-rwxr-xr-x 1  1548841922 Apr 15 23:50 foo.mkv
-rwxr-xr-x 1   415633408 Mar  5  2018 moo.iso

I can also write to the pool.

$ echo Hello > mnt/testbox/pub/tmp/hello.txt
$ ls -gG mnt/testbox/pub/tmp/
total 37194304
-rwxr-xr-x 1    18576345 Apr 21 09:08 1.pdf
-rwxr-xr-x 1    30255102 Apr 21 09:08 2.pdf
-rwxr-xr-x 1    22016195 Apr 21 09:08 3.pdf
-rwxr-xr-x 1 35456180485 Apr 21 07:58 bar.mkv
-rwxr-xr-x 1   625338368 Mar  5  2018 boo.iso
-rwxr-xr-x 1  1548841922 Apr 15 23:50 foo.mkv
-rwxr-xr-x 1           6 Apr 24 23:11 hello.txt
-rwxr-xr-x 1   415633408 Mar  5  2018 moo.iso

Now I need to identify the new drive:

$ ls -l /dev/disk/by-id/
ata-ST31000340NS_9QJ0DVN2 -> ../../sdb

Then I need to replace the old drive with the new. The procedure, since the old drive is completely gone, is not to detach and then replace, but simply to replace with zpool replace pool old_device new_device.

# zpool replace pool1 ata-ST31000340NS_9QJ0EQ1V ata-ST31000340NS_9QJ0DVN2

ZFS will immediately and automatically begin the resilvering of the pool:

# zpool status
  pool: pool1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Apr 24 23:19:33 2019
    10.5G scanned out of 53.3G at 228M/s, 0h3m to go
    3.49G resilvered, 19.68% done
config:

    NAME                            STATE     READ WRITE CKSUM
    pool1                           DEGRADED     0     0     0
      raidz1-0                      DEGRADED     0     0     0
        ata-ST31000340NS_9QJ0F2YQ   ONLINE       0     0     0
        replacing-1                 DEGRADED     0     0     0
          1803500998269517419       UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST31000340NS_9QJ0EQ1V-part1
          ata-ST31000340NS_9QJ0DVN2 ONLINE       0     0     0  (resilvering)
        ata-ST31000340NS_9QJ089LF   ONLINE       0     0     0

errors: No known data errors

After about 3 minutes the pool is back up and ready for usage:

# zpool status
  pool: pool1
 state: ONLINE
  scan: resilvered 17.8G in 0h3m with 0 errors on Wed Apr 24 23:22:56 2019
config:

    NAME                           STATE     READ WRITE CKSUM
    pool1                          ONLINE       0     0     0
      raidz1-0                     ONLINE       0     0     0
        ata-ST31000340NS_9QJ0F2YQ  ONLINE       0     0     0
        ata-ST31000340NS_9QJ0DVN2  ONLINE       0     0     0
        ata-ST31000340NS_9QJ089LF  ONLINE       0     0     0

errors: No known data errors

And from the client:

$ ls -gG mnt/testbox/pub/tmp
total 37194304
-rwxr-xr-x 1    18576345 Apr 21 09:08 1.pdf
-rwxr-xr-x 1    30255102 Apr 21 09:08 2.pdf
-rwxr-xr-x 1    22016195 Apr 21 09:08 3.pdf
-rwxr-xr-x 1 35456180485 Apr 21 07:58 bar.mkv
-rwxr-xr-x 1   625338368 Mar  5  2018 boo.iso
-rwxr-xr-x 1  1548841922 Apr 15 23:50 foo.mkv
-rwxr-xr-x 1           6 Apr 24 23:11 hello.txt
-rwxr-xr-x 1   415633408 Mar  5  2018 moo.iso

Just to make sure all data has been resilvered without any errors during writing I'll perform a scrub and validate that everything is alright:

# zpool scrub pool1

And about 3 minutes later the scrub is finished:

# zpool status
  pool: pool1
 state: ONLINE
  scan: scrub repaired 0B in 0h3m with 0 errors on Thu Apr 24 23:56:01 2019
config:

    NAME                           STATE     READ WRITE CKSUM
    pool1                          ONLINE       0     0     0
      raidz1-0                     ONLINE       0     0     0
        ata-ST31000340NS_9QJ0F2YQ  ONLINE       0     0     0
        ata-ST31000340NS_9QJ0DVN2  ONLINE       0     0     0
        ata-ST31000340NS_9QJ089LF  ONLINE       0     0     0

errors: No known data errors

Since ZFS has only restored the used data blocks, not the entire disk, the procedure was very fast, as was the scrubbing.

ZFS - Drive failure during file transfer

Now I want to remove a disk in the middle of an active file transfer in order to simulate a total failure of a disk, but not a permanent failure. This might happen if the disk power cord managed to wiggle itself loose, or if the disk is located in a slot and hasn't been pushed all the way through, etc.

$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/
sending incremental file list
zoo.mkv
 10,867,033,488 100%  127.95MB/s  0:01:20 (xfr#1, to-chk=0/9)

Number of files: 9 (reg: 8, dir: 1)
Number of created files: 1 (reg: 1)
Number of deleted files: 0
Number of regular files transferred: 1
Total file size: 48,983,875,313 bytes
Total transferred file size: 10,867,033,488 bytes
Literal data: 10,867,033,488 bytes
Matched data: 0 bytes
File list size: 0
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 10,869,686,827
Total bytes received: 38

sent 10,869,686,827 bytes  received 38 bytes  102,062,787.46 bytes/sec
total size is 48,983,875,313  speedup is 4.5

I removed the drive by disconnecting the individual power cord to the drive. The ZFS machine reacted by halting the file transfer for about a second, then it resumed at full speed, and the client only experienced a momentary drop in the file transfer speed.

The file transfer was then completed without any problems on the client side.

On the ZFS machine the pool has now changed the state to DEGRADED:

# zpool status
  pool: pool1
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: none requested
config:

    NAME                           STATE     READ WRITE CKSUM
    pool1                          DEGRADED     0     0     0
      raidz1-0                     DEGRADED     0     0     0
        ata-ST31000340NS_9QJ0ES1V  ONLINE       0     0     0
        ata-ST31000340NS_9QJ0ET8D  ONLINE       0     0     0
        ata-ST31000340NS_9QJ0EZZC  UNAVAIL      0     0     0

errors: No known data errors

I powered down the machine in order to safely reattach the drive and then rebooted.

ZFS has detected the error:

# zpool status
  pool: pool1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: none requested
config:

    NAME                           STATE     READ WRITE CKSUM
    pool1                          ONLINE       0     0     0
      raidz1-0                     ONLINE       0     0     0
        ata-ST31000340NS_9QJ0ES1V  ONLINE       0     0     0
        ata-ST31000340NS_9QJ0ET8D  ONLINE       0     0     0
        ata-ST31000340NS_9QJ0EZZC  ONLINE       0     0    11

errors: No known data errors

This situation simulates a physical drive failure when the ZFS pool is under active use and it is probably one of the most common situations in real life.

In order to handle the problem correctly I would normally need to investigate the situation.

Has the drive physically failed and therefore needs a replacement?

Or is it perhaps a wire that has managed to wiggle itself loose?

Or is it perhaps the wire itself that is broken?

Or has the disk connector (both on the disk itself and on the motherboard) experienced any physical corrosion? (This actually happens).

It's important to remember that whether a disk is good or bad is not a simple yes or no question. A disk can be "mostly" good, with a few physical sectors that give errors. A disk can be bad for a few seconds, hours, or days, and then go back to working fine again for years.

Due to firmware issues, a disk may be able to do most operations fine, but certain operations don't work well. Disk problems are shaded, multi-dimensional and time-dependent!

Now, since this is just a simulation I know what to do, but in a real life situation you need to investigate the above questions as any of the above issues might be the cause of the problem.

If there aren't any physical problems with the setup, you might be able to get some useful information from S.M.A.R.T.

In my situation I have determined that the problem was caused by a system administrator who managed to pull the power cord from the disk "by mistake" so I don't need to replace the drive :)

The correct approach is therefore to do a scrub after the drive has been reattached. During a scrub ZFS will detect any checksum errors and will restore the data using the parity data.

# zpool scrub pool1
# zpool status
  pool: pool1
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 4.19G in 0h6m with 0 errors on Fri Apr 26 01:32:23 2019
config:

    NAME                           STATE     READ WRITE CKSUM
    pool1                          DEGRADED     0     0     0
      raidz1-0                     DEGRADED     0     0     0
        ata-ST31000340NS_9QJ0ES1V  ONLINE       0     0     0
        ata-ST31000340NS_9QJ0ET8D  ONLINE       0     0     0
        ata-ST31000340NS_9QJ0EZZC  DEGRADED     0     0 67.2K  too many errors

errors: No known data errors

After the scrubbing is done ZFS tells us that it has repaired 4.19GB of data with 0 errors.

Even though ZFS has managed to repair everything without any errors, it still keeps the pool in a degraded state because it is up to the system administrator to decide what needs to be done. This is important because even though ZFS has managed to rescue all data, we might still be dealing with an unhealthy device.

Had there been any unrecoverable errors during the scrubbing we would be facing a disk that is too damaged for ZFS to continue working with it.

Can we clear the log and bring the pool status into the ONLINE and healthy state? Or do we need to replace the drive anyway? Perhaps S.M.A.R.T has warned us that the drive is currently working but is experiencing occasional issues and soon needs to be fully replaced.

In this case we know that the drive is working fine so I'll just clear the log:

# zpool clear pool1
# zpool status
  pool: pool1
 state: ONLINE
  scan: scrub repaired 0B in 0h3m with 0 errors on Fri Apr 26 02:09:44 2019
config:

    NAME                           STATE     READ WRITE CKSUM
    pool1                          ONLINE       0     0     0
      raidz1-0                     ONLINE       0     0     0
        ata-ST31000340NS_9QJ0ES1V  ONLINE       0     0     0
        ata-ST31000340NS_9QJ0ET8D  ONLINE       0     0     0
        ata-ST31000340NS_9QJ0EZZC  ONLINE       0     0     0

errors: No known data errors

As a side note, I can mention that I have worked with tons of hardware over the past 25+ years, and I have seen several situations in which S.M.A.R.T has reported problems with drives that kept on going for many years after being reported as both old and worn out. Of course you cannot ignore such reports, but depending on the situation, a drive that needs to be replaced may still be usable in a less important capacity.

ZFS - Data corruption during file transfer

Now I want to simulate data corruption in the middle of a file transfer from the client. Not a drive failure, but some corruption of the data located on the pool.

I have removed the "zoo.mkv" file and while the rsync command is running again I'll do a couple of dd commands on the ZFS machine on one of the drives.

# dd if=/dev/urandom of=/dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V seek=100000 count=1000 bs=1k ...

While the transfer is still running, I'm checking the pool status:

# zpool status
  pool: pool1
 state: ONLINE
  scan: none requested
config:

    NAME                           STATE     READ WRITE CKSUM
    pool1                          ONLINE       0     0     0
      raidz1-0                     ONLINE       0     0     1
        ata-ST31000340NS_9QJ0ES1V  ONLINE       0     0     0
        ata-ST31000340NS_9QJ0ET8D  ONLINE       0     0     0
        ata-ST31000340NS_9QJ0EZZC  ONLINE       0     0     0

errors: No known data errors

ZFS shows a checksum issue which has been fixed. Neither dmesg nor the log currently provides any further information, but we can take a look at the zpool events -v command if we want more details:

# zpool events -v Apr 26 2019 23:05:59.990726744 ereport.fs.zfs.checksum class = "ereport.fs.zfs.checksum" ena = 0x18549e4f2ec00401 detector = (embedded nvlist) version = 0x0 scheme = "zfs" pool = 0x4cdea36f1d7afa7c vdev = 0x772f5157f66ae182 (end detector) pool = "pool1" pool_guid = 0x4cdea36f1d7afa7c pool_state = 0x0 pool_context = 0x0 pool_failmode = "wait" vdev_guid = 0x772f5157f66ae182 vdev_type = "disk" vdev_path = "/dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V-part1" vdev_ashift = 0xc vdev_complete_ts = 0x18549e3862a vdev_delta_ts = 0x19c5648 vdev_read_errors = 0x0 vdev_write_errors = 0x0 vdev_cksum_errors = 0x0 parent_guid = 0x3a01d1f81d93aaf8 parent_type = "raidz" vdev_spare_paths = vdev_spare_guids = zio_err = 0x34 zio_flags = 0x100080 zio_stage = 0x100000 zio_pipeline = 0xf80000 zio_delay = 0x0 zio_timestamp = 0x0 zio_delta = 0x0 zio_offset = 0x5cb6000 zio_size = 0x6000 zio_objset = 0x48 zio_object = 0x82 zio_level = 0x1 zio_blkid = 0x0 bad_ranges = 0x0 0x6000 bad_ranges_min_gap = 0x8 bad_range_sets = 0xcaa5 bad_range_clears = 0xb597 bad_set_histogram = 0x32c 0x32b 0x334 0x312 0x334 0x31c 0x306 0x31f 0x300 0x340 0x303 0x30d 0x330 0x318 0x324 0x2f0 0x304 0x32b 0x314 0x33c 0x339 0x2fd 0x33c 0x347 0x33c 0x379 0x33f 0x324 0x327 0x351 0x310 0x313 0x31f 0x31c 0x31e 0x334 0x354 0x32e 0x33e 0x312 0x32d 0x369 0x340 0x337 0x32a 0x330 0x32c 0x33a 0x319 0x328 0x30a 0x332 0x32a 0x320 0x333 0x333 0x34b 0x316 0x347 0x30c 0x34c 0x35a 0x34a 0x2ff bad_cleared_histogram = 0x2cc 0x2c3 0x2e2 0x2ca 0x29c 0x2fa 0x2f8 0x2d0 0x2e6 0x2cd 0x2d5 0x2c3 0x2bf 0x2d7 0x2d7 0x2fa 0x2c8 0x2d4 0x2d1 0x303 0x2ef 0x2fa 0x2f4 0x2c1 0x2a3 0x2b7 0x2b4 0x2e9 0x2e6 0x2c9 0x2d9 0x2eb 0x2c1 0x2b9 0x2e4 0x2d7 0x2c0 0x2ff 0x2c7 0x2dc 0x2e8 0x2bc 0x2c7 0x2d8 0x2ed 0x2db 0x2db 0x318 0x2e8 0x2c8 0x2db 0x2da 0x2de 0x2f7 0x2d0 0x2e6 0x2ae 0x2fb 0x2ca 0x2d5 0x2a9 0x2d2 0x2e2 0x2aa time = 0x5cc372b7 0x3b0d4a58 eid = 0x1f

The ZFS event payloads have never been fully documented, but we can see from the above output that a range of bad bits was detected and cleared, and that everything is back in order.
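If you want to catch checksum errors like this automatically, the CKSUM column of zpool status is easy to scan from a script. A minimal sketch, assuming the column layout shown above; an embedded sample stands in for the live command output:

```shell
# Print every vdev row from `zpool status` whose CKSUM column (field 5)
# is a nonzero number. In real use you would pipe `zpool status` in;
# here a sample of the output above is embedded for illustration.
zpool_status='        NAME                          STATE  READ WRITE CKSUM
        pool1                         ONLINE    0     0     0
          raidz1-0                    ONLINE    0     0     1
            ata-ST31000340NS_9QJ0ES1V ONLINE    0     0     0'

printf '%s\n' "$zpool_status" |
    awk '$5 ~ /^[0-9]+$/ && $5 > 0 { print $1 " has " $5 " checksum errors" }'
# prints: raidz1-0 has 1 checksum errors
```

The numeric guard on field 5 skips the header row, so the same one-liner works on a full status report with multiple vdevs.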

ZFS - The dd mistake

Have you ever made the mistake of running the rm -rf command as the root user on the / path of your disk? Or even worse, what about the dd command?

I want to extend the above test and see what happens if I mistakenly run the dd command and let it run for a while during a file transfer from the client.

I have deleted all the files, restarted rsync , and I am now letting the dd run:

# dd if=/dev/urandom of=/dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V bs=1k ^C348001+0 records in 348001+0 records out 356353024 bytes (356 MB, 340 MiB) copied, 47.1212 s, 7.6 MB/s

This should make a big mess of things.

Nothing noticeable has happened on the client:

$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/ sending incremental file list ./ 1.pdf 18,576,345 100% 178.63MB/s 0:00:00 (xfr#1, to-chk=7/9) 2.pdf 30,255,102 100% 76.33MB/s 0:00:00 (xfr#2, to-chk=6/9) 3.pdf 22,016,195 100% 28.68MB/s 0:00:00 (xfr#3, to-chk=5/9) bar.mkv 14,681,931,776 41% 112.62MB/s 0:03:00

ZFS has detected the errors:

# zpool status pool: pool1 state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://zfsonlinux.org/msg/ZFS-8000-9P scan: none requested config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ata-ST31000340NS_9QJ0ES1V ONLINE 0 0 1 ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC ONLINE 0 0 0 errors: No known data error

This demonstrates the remarkable resilience of ZFS. Even though I just ran dd against one of the drives, the filesystem keeps working and clients can still read and write from the pool.

All I need to do is to perform a scrub to fix the problems:

# zpool scrub pool1 # zpool status pool: pool1 state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://zfsonlinux.org/msg/ZFS-8000-9P scan: scrub in progress since Tue Apr 30 01:35:56 2019 2.24G scanned out of 68.5G at 209M/s, 0h5m to go 28K repaired, 3.28% done config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ata-ST31000340NS_9QJ0ES1V ONLINE 0 0 9 (repairing) ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC ONLINE 0 0 0 errors: No known data errors

And the result:

# zpool status pool: pool1 state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://zfsonlinux.org/msg/ZFS-8000-9P scan: scrub repaired 60K in 0h3m with 0 errors on Tue Apr 30 01:39:50 2019 config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 ata-ST31000340NS_9QJ0ES1V DEGRADED 0 0 17 too many errors ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC ONLINE 0 0 0 errors: No known data errors

Again we have to investigate in order to determine whether the disk that suffered checksum errors needs to be replaced, or whether we can simply clear the log.

ZFS has managed to repair everything with 0 errors and all the disks are back up and working fine, so I'll clear the log:

# zpool clear pool1 # zpool status pool: pool1 state: ONLINE scan: scrub repaired 0B in 0h3m with 0 errors on Tue Apr 30 01:59:34 2019 config: NAME STATE READ WRITE CKSUM pool1 ONLINE 0 0 0 raidz1-0 ONLINE 0 0 0 ata-ST31000340NS_9QJ0ES1V ONLINE 0 0 0 ata-ST31000340NS_9QJ0ET8D ONLINE 0 0 0 ata-ST31000340NS_9QJ0EZZC ONLINE 0 0 0 errors: No known data errors

ZFS - A second drive failure during a replacement

The most dreaded situation in any RAID-5 setup is a second drive failing during a restoration of the pool.

Let's see what's going to happen.

I have created a new pool with three disks and have transferred all the files from the client to the pool.

On the client:

# ls -gG /pool1/pub/tmp/ total 47803477 -rwxrw-r-- 1 18576345 Apr 21 09:08 1.pdf -rwxrw-r-- 1 30255102 Apr 21 09:08 2.pdf -rwxrw-r-- 1 22016195 Apr 21 09:08 3.pdf -rwxrw-r-- 1 35456180485 Apr 21 07:58 bar.mkv -rwxrw-r-- 1 625338368 Mar 5 2018 boo.iso -rwxrw-r-- 1 1548841922 Apr 15 23:50 foo.mkv -rwxrw-r-- 1 415633408 Mar 5 2018 moo.iso -rwxrw-r-- 1 10867033488 Apr 22 21:10 zoo.mkv

On the ZFS machine:

# zfs list NAME USED AVAIL REFER MOUNTPOINT pool1 45.6G 1.71T 128K /pool1 pool1/pub 45.6G 1.71T 45.6G /pool1/pub

I have then removed one of the drives from the pool to simulate the first break down:

# zpool status pool: pool1 state: DEGRADED status: One or more devices could not be used because the label is missing or invalid. Sufficient replicas exist for the pool to continue functioning in a degraded state. action: Replace the device using 'zpool replace'. see: http://zfsonlinux.org/msg/ZFS-8000-4J scan: none requested config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 ata-ST31000340NS_9QJ089LF ONLINE 0 0 0 ata-ST31000340NS_9QJ0DVN2 ONLINE 0 0 0 1368416530025724573 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V-part1 errors: No known data errors

I am going to begin a replace procedure, and while the resilvering of the new drive is running I am going to disconnect one of the working drives.

# zpool replace -f pool1 ata-ST31000340NS_9QJ0ES1V ata-ST31000340NS_9QJ0EQ1V

Let's check the status:

# zpool status pool: pool1 state: DEGRADED status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scan: resilver in progress since Fri May 3 23:23:37 2019 1.19G scanned out of 68.5G at 101M/s, 0h11m to go 404M resilvered, 1.74% done config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 ata-ST31000340NS_9QJ089LF ONLINE 0 0 0 ata-ST31000340NS_9QJ0DVN2 ONLINE 0 0 0 replacing-2 DEGRADED 0 0 0 1368416530025724573 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V-part1 ata-ST31000340NS_9QJ0EQ1V ONLINE 0 0 0 (resilvering) errors: No known data errors

The resilvering is running and I am now disconnecting a second drive by pulling its power cord. ZFS has not had time to fully resilver the new drive.

# zpool status pool: pool1 state: DEGRADED status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scan: resilver in progress since Fri May 3 23:23:37 2019 11.3G scanned out of 68.5G at 138M/s, 0h7m to go 2.38G resilvered, 16.53% done config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 22.2K raidz1-0 DEGRADED 0 0 44.5K ata-ST31000340NS_9QJ089LF DEGRADED 0 0 0 too many errors ata-ST31000340NS_9QJ0DVN2 UNAVAIL 0 0 0 replacing-2 DEGRADED 0 0 0 1368416530025724573 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V-part1 ata-ST31000340NS_9QJ0EQ1V ONLINE 0 0 0 (resilvering) errors: 22768 data errors, use '-v' for a list

The resilvering process ran to its end but could not complete successfully. ZFS not only informs us about the problem, it also lists the files that are now unrecoverable.

# zpool status -v pool: pool1 state: DEGRADED status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: http://zfsonlinux.org/msg/ZFS-8000-8A scan: resilvered 2.38G in 0h7m with 235402 errors on Fri May 3 23:30:48 2019 config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 230K raidz1-0 DEGRADED 0 0 461K ata-ST31000340NS_9QJ089LF DEGRADED 0 0 0 too many errors ata-ST31000340NS_9QJ0DVN2 UNAVAIL 0 0 0 replacing-2 DEGRADED 0 0 0 1368416530025724573 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V-part1 ata-ST31000340NS_9QJ0EQ1V ONLINE 0 0 0 errors: Permanent errors have been detected in the following files: /pool1/pub/tmp/bar.mkv /pool1/pub/tmp/foo.mkv /pool1/pub/tmp/moo.iso pool1/pub:<0x24> pool1/pub:<0x25> /pool1/pub/tmp/boo.iso /pool1/pub/tmp/zoo.mkv

In this situation trying to run any kind of repair process would be both futile and wrong. The filesystem itself isn't damaged and it doesn't require any kind of repairing.

The question is: What can we do to get as much data back from the broken pool as possible?

Let's run a scrub and see if by any chance we can salvage some files and then restore as much of the pool as possible:

# zpool scrub pool1

Let's check:

# zpool status -v pool: pool1 state: DEGRADED status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: http://zfsonlinux.org/msg/ZFS-8000-8A scan: scrub repaired 0B in 0h1m with 1277 errors on Fri May 3 23:40:12 2019 config: NAME STATE READ WRITE CKSUM pool1 DEGRADED 0 0 235K raidz1-0 DEGRADED 0 0 479K ata-ST31000340NS_9QJ089LF DEGRADED 0 0 0 too many errors ata-ST31000340NS_9QJ0DVN2 UNAVAIL 0 0 0 replacing-2 DEGRADED 0 0 0 1368416530025724573 UNAVAIL 0 0 0 was /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V-part1 ata-ST31000340NS_9QJ0EQ1V ONLINE 0 0 0 errors: Permanent errors have been detected in the following files: /pool1/pub/tmp/bar.mkv /pool1/pub/tmp/foo.mkv /pool1/pub/tmp/moo.iso pool1/pub:<0x24> pool1/pub:<0x25> /pool1/pub/tmp/boo.iso /pool1/pub/tmp/zoo.mkv

As expected this was a no-go; you cannot scrub a RAID-Z pool with only one original disk and a second one that hasn't been fully resilvered.

Without extensive debugging of the filesystem, the only thing left is to see if we can copy any of the healthy files from the pool to the client. ZFS has already told us which files are corrupted.

As a first attempt I want to see if I can mount the directory on the client and then grab files one at a time:

rsync -a --progress --stats mnt/testbox/pub/tmp/ tmp3/ sending incremental file list ./ sending incremental file list ./ 1.pdf 18,576,345 100% 109.84MB/s 0:00:00 (xfr#1, to-chk=7/9) 2.pdf 30,255,102 100% 67.89MB/s 0:00:00 (xfr#2, to-chk=6/9) 3.pdf 22,016,195 100% 33.92MB/s 0:00:00 (xfr#3, to-chk=5/9)

The file transfer halted at "3.pdf".

I then tried copying the files over one at a time, but I could not retrieve anything except the three PDF files - just as ZFS had already told me.

I got the following error on the client:

Cannot read source file. Bad file descriptor.
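Picking files one at a time like this can be scripted, so that a read error on a single file only logs a failure instead of stopping the whole salvage run. A minimal sketch; the directory names are placeholders, not from the setup above:

```shell
# salvage_files SRC DST: copy each regular file individually, logging
# failures instead of aborting, so unreadable files are simply skipped.
salvage_files() {
    src=$1
    dst=$2
    mkdir -p "$dst"
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        if cp "$f" "$dst"/; then
            echo "OK:   $f"
        else
            echo "FAIL: $f" >&2
        fi
    done
}
```

Run against the mounted pool, the OK/FAIL log doubles as an inventory of what survived.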

So these are the files that I managed to salvage from my broken RAID-5 pool:

ls -gG total 69196 -rwxr-xr-x 1 18576345 Apr 21 09:08 1.pdf -rwxr-xr-x 1 30255102 Apr 21 09:08 2.pdf -rwxr-xr-x 1 22016195 Apr 21 09:08 3.pdf

This means that I have reached the point where my RAID-Z pool has been destroyed and the files I am able to restore are very limited. This isn't a surprise, as ZFS is extremely good at spreading data and parity evenly across the drives in a RAID-Z. If you lose two drives in a RAID-Z, you almost always lose the entire pool.

Had the resilvering process managed to run for a longer time before the second drive "failed", perhaps I would have been able to salvage more files, but there really isn't anything more I can do now.

In my humble opinion RAID-Z2 (RAID-6) is the minimum for very important files, but RAID-5 is still extremely useful as long as you always keep backups of your important data no matter what RAID setup you're using. A RAID setup is never a substitute for backup!
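The capacity side of that trade-off is simple arithmetic: each level of parity costs roughly one drive's worth of space and buys tolerance for one more simultaneous failure. A quick sketch (metadata and slop space ignored):

```shell
# raid_usable DISKS SIZE_GB PARITY: approximate usable capacity of a
# striped-parity layout, where PARITY drives' worth of space is consumed
# by parity data. RAID-5/RAID-Z1 has parity=1 (survives 1 failure);
# RAID-6/RAID-Z2 has parity=2 (survives 2 failures).
raid_usable() {
    echo $(( ($1 - $3) * $2 ))
}

raid_usable 3 1000 1   # three 1000 GB drives in RAID-Z1 -> 2000 GB usable
raid_usable 4 1000 2   # four 1000 GB drives in RAID-Z2  -> 2000 GB usable
```

With four drives instead of three, RAID-Z2 offers the same usable space as the three-drive RAID-Z1 used in these tests while surviving the double failure simulated above.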

Alright, time to do some testing on Btrfs.

Btrfs RAID-5

According to the Btrfs wiki:

The parity RAID feature is mostly implemented, but has some problems in the case of power failure (or other unclean shutdown) which lead to damaged data. It is recommended that parity RAID be used only for testing purposes.

Let's setup a Btrfs RAID-5 system:

# mkfs.btrfs -f -m raid5 -d raid5 /dev/disk/by-id/ata-ST31000340NS_9QJ089LF /dev/disk/by-id/ata-ST31000340NS_9QJ0EQ1V /dev/disk/by-id/ata-ST31000340NS_9QJ0F2YQ btrfs-progs v4.20.2 See http://btrfs.wiki.kernel.org for more information. Label: (null) UUID: 520d615b-4151-4036-962a-ccc202e1f76c Node size: 16384 Sector size: 4096 Filesystem size: 2.73TiB Block group profiles: Data: RAID5 2.00GiB Metadata: RAID5 2.00GiB System: RAID5 16.00MiB SSD detected: no Incompat features: extref, raid56, skinny-metadata Number of devices: 3 Devices: ID SIZE PATH 1 931.51GiB /dev/disk/by-id/ata-ST31000340NS_9QJ089LF 2 931.51GiB /dev/disk/by-id/ata-ST31000340NS_9QJ0EQ1V 3 931.51GiB /dev/disk/by-id/ata-ST31000340NS_9QJ0F2YQ

Then enable lzo compression and mount the pool:

# mount -o noatime,compress=lzo /dev/disk/by-id/ata-ST31000340NS_9QJ089LF /pub/ # btrfs filesystem show -d Label: none uuid: 520d615b-4151-4036-962a-ccc202e1f76c Total devices 3 FS bytes used 128.00KiB devid 1 size 931.51GiB used 2.01GiB path /dev/sdc devid 2 size 931.51GiB used 2.01GiB path /dev/sdb devid 3 size 931.51GiB used 2.01GiB path /dev/sdd # btrfs device stats /pub/ [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0

Time to transfer the files from the client using rsync :

$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/ 1.pdf 18,576,345 100% 165.28MB/s 0:00:00 (xfr#1, to-chk=6/8) 2.pdf 30,255,102 100% 84.86MB/s 0:00:00 (xfr#2, to-chk=5/8) 3.pdf 22,016,195 100% 31.81MB/s 0:00:00 (xfr#3, to-chk=4/8) bar.mkv 35,456,180,485 100% 107.72MB/s 0:05:13 (xfr#4, to-chk=3/8) boo.iso 625,338,368 100% 21.36MB/s 0:00:27 (xfr#5, to-chk=2/8) foo.mkv 1,548,841,922 100% 131.10MB/s 0:00:11 (xfr#6, to-chk=1/8) moo.iso 415,633,408 100% 24.38MB/s 0:00:16 (xfr#7, to-chk=0/8) Number of files: 8 (reg: 7, dir: 1) Number of created files: 8 (reg: 7, dir: 1) Number of deleted files: 0 Number of regular files transferred: 7 Total file size: 38,116,841,825 bytes Total transferred file size: 38,116,841,825 bytes Literal data: 38,116,841,825 bytes Matched data: 0 bytes File list size: 0 File list generation time: 0.001 seconds File list transfer time: 0.000 seconds Total bytes sent: 38,126,148,151 Total bytes received: 202 sent 38,126,148,151 bytes received 202 bytes 102,078,041.11 bytes/sec total size is 38,116,841,825 speedup is 1.00

Compared to the ZFS RAID-Z1 transfer:

sent 38,126,148,150 bytes received 202 bytes 106,945,717.68 bytes/sec

On the Btrfs machine I receive a clear warning about some of the missing functionality of the RAID5/6 code, which is also described on the Btrfs wiki status page:

The write hole is the last missing part, preliminary patches have been posted but needed to be reworked. The parity not checksummed note has been removed.

# btrfs filesystem usage /pub WARNING: RAID56 detected, not implemented WARNING: RAID56 detected, not implemented WARNING: RAID56 detected, not implemented Overall: Device size: 2.73TiB Device allocated: 0.00B Device unallocated: 2.73TiB Device missing: 0.00B Used: 0.00B Free (estimated): 0.00B (min: 8.00EiB) Data ratio: 0.00 Metadata ratio: 0.00 Global reserve: 40.25MiB (used: 0.00B) Data,RAID5: Size:36.00GiB, Used:35.51GiB /dev/sdb 18.00GiB /dev/sdc 18.00GiB /dev/sdd 18.00GiB Metadata,RAID5: Size:2.00GiB, Used:40.44MiB /dev/sdb 1.00GiB /dev/sdc 1.00GiB /dev/sdd 1.00GiB System,RAID5: Size:16.00MiB, Used:16.00KiB /dev/sdb 8.00MiB /dev/sdc 8.00MiB /dev/sdd 8.00MiB Unallocated: /dev/sdb 912.50GiB /dev/sdc 912.50GiB /dev/sdd 912.50GiB

Btrfs - Power outage

I have then again added the "zoo.mkv" file to the files on the client and will begin the rsync transfer and pull the power cord to the Btrfs machine at about 50% of the transfer.

$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/ sending incremental file list zoo.mkv 5,887,590,400 54% 71.49kB/s 19:20:49 ^C

The power cord has been pulled. I have aborted the file transfer on the client and the Btrfs machine has been powered back up again:

# btrfs filesystem show -d Label: none uuid: 520d615b-4151-4036-962a-ccc202e1f76c Total devices 3 FS bytes used 35.55GiB devid 1 size 931.51GiB used 19.01GiB path /dev/sdc devid 2 size 931.51GiB used 19.01GiB path /dev/sdb devid 3 size 931.51GiB used 19.01GiB path /dev/sdd # btrfs device stats /pub [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0

Btrfs is also a transactional filesystem, and the pool is back up. There are no errors and everything is mountable from the client. As with the ZFS test, we have only lost the file that was being transferred.

Btrfs - Drive failure

Time to simulate a drive failure. I will remove the same drive as with ZFS, then afterwards attach a new drive and try to restore the pool.

# btrfs filesystem show -d warning, device 1 is missing checksum verify failed on 83820544 found C780E0CF wanted 23635D79 bad tree block 83820544, bytenr mismatch, want=83820544, have=65536 Label: none uuid: 520d615b-4151-4036-962a-ccc202e1f76c Total devices 3 FS bytes used 35.55GiB devid 2 size 931.51GiB used 19.01GiB path /dev/sdb devid 3 size 931.51GiB used 19.01GiB path /dev/sdd *** Some devices missing

Btrfs is informing us about the missing disk. Let's locate the new one and replace the old with it:

$ ls -gG /dev/disk/by-id ata-ST31000340NS_9QJ0DVN2 -> ../../sdc

I need to mount the pool in a degraded state with one of the working disks:

# mount -o noatime,compress=lzo,degraded /dev/disk/by-id/ata-ST31000340NS_9QJ0F2YQ /pub/

Then, because the "broken" device has been physically removed, I have to use the "devid" parameter in order to replace it. This is one place where the Btrfs documentation could benefit from an example.

The "devid" is the missing device's ID as reported by the btrfs filesystem show -d command, not the "by-id" path or the "uuid". Also, since the new disk already contains a filesystem from the previous test, I need to use the -f option to force the command.

So the command is basically btrfs replace start old_device new_device mount_point , where old_device is the "devid" number Btrfs has supplied us with:

# btrfs replace start -f 1 /dev/disk/by-id/ata-ST31000340NS_9QJ0DVN2 /pub

We can then check the status of the replacement:

# btrfs replace status -1 /pub 0.4% done, 0 write errs, 0 uncorr. read errs # iostat -dh /dev/disk/by-id/ata-ST31000340NS_9QJ0DVN2 Linux 5.0.9-arch1-1-ARCH (testbox) 04/25/2019 _x86_64_ (2 CPU) tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd Device 148.08 5.1k 11.9M 0.0k 5.1M 11.7G 0.0k sdc

ZFS completed the restoration in about 3 minutes, while Btrfs took a little more than twice as long:

# btrfs replace status -1 /pub Started on 25.Apr 01:39:20, finished on 25.Apr 01:46:59, 0 write errs, 0 uncorr. read errs

The pool is back up again with no missing files or any other problems:

# btrfs filesystem show -d Label: none uuid: 520d615b-4151-4036-962a-ccc202e1f76c Total devices 3 FS bytes used 35.55GiB devid 1 size 931.51GiB used 19.00GiB path /dev/sdc devid 2 size 931.51GiB used 20.03GiB path /dev/sdb devid 3 size 931.51GiB used 20.03GiB path /dev/sdd # btrfs filesystem usage /pub WARNING: RAID56 detected, not implemented WARNING: RAID56 detected, not implemented WARNING: RAID56 detected, not implemented Overall: Device size: 2.73TiB Device allocated: 0.00B Device unallocated: 2.73TiB Device missing: 0.00B Used: 0.00B Free (estimated): 0.00B (min: 8.00EiB) Data ratio: 0.00 Metadata ratio: 0.00 Global reserve: 40.25MiB (used: 0.00B) Data,RAID5: Size:36.00GiB, Used:35.51GiB /dev/sdb 18.00GiB /dev/sdc 18.00GiB /dev/sdd 18.00GiB Metadata,RAID5: Size:3.00GiB, Used:40.44MiB /dev/sdb 2.00GiB /dev/sdc 1.00GiB /dev/sdd 2.00GiB System,RAID5: Size:32.00MiB, Used:16.00KiB /dev/sdb 32.00MiB /dev/sdd 32.00MiB Unallocated: /dev/sdb 911.48GiB /dev/sdc 912.51GiB /dev/sdd 911.48GiB # btrfs device stats /pub [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0

In the above output I noticed that the data is no longer spread as evenly across the devices as before the simulated failure.

Before:

devid 1 size 931.51GiB used 19.01GiB path /dev/sdc devid 2 size 931.51GiB used 19.01GiB path /dev/sdb devid 3 size 931.51GiB used 19.01GiB path /dev/sdd

After:

devid 1 size 931.51GiB used 19.00GiB path /dev/sdc devid 2 size 931.51GiB used 20.03GiB path /dev/sdb devid 3 size 931.51GiB used 20.03GiB path /dev/sdd

But the usage command shows that this is only due to metadata:

# btrfs filesystem usage /pub ... Metadata,RAID5: Size:3.00GiB, Used:51.58MiB /dev/sdb 2.00GiB /dev/sdc 1.00GiB /dev/sdd 2.00GiB

Let's perform a scrub now and validate that everything is alright:

# btrfs scrub start /pub/ # btrfs scrub status -d /pub/ scrub status for 520d615b-4151-4036-962a-ccc202e1f76c scrub device /dev/sdc (id 1) history scrub started at Thu Apr 25 04:04:57 2019 and finished after 00:28:11 total bytes scrubbed: 15.23GiB with 0 errors scrub device /dev/sdb (id 2) history scrub started at Thu Apr 25 04:04:57 2019 and finished after 00:27:31 total bytes scrubbed: 15.23GiB with 0 errors scrub device /dev/sdd (id 3) history scrub started at Thu Apr 25 04:04:57 2019 and finished after 00:27:32 total bytes scrubbed: 15.23GiB with 0 errors

So far no problems.

Btrfs - Drive failure during file transfer

Now it's time to remove a drive during an active file transfer:

$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/ sending incremental file list zoo.mkv 10,867,033,488 100% 119.28MB/s 0:01:26 (xfr#3, to-chk=0/9) Number of files: 9 (reg: 8, dir: 1) Number of created files: 1 (reg: 1) Number of deleted files: 0 Number of regular files transferred: 3 Total file size: 48,983,875,313 bytes Total transferred file size: 10,919,304,785 bytes Literal data: 10,919,304,785 bytes Matched data: 0 bytes File list size: 0 File list generation time: 0.001 seconds File list transfer time: 0.000 seconds Total bytes sent: 10,921,970,963 Total bytes received: 76 sent 10,921,970,963 bytes received 76 bytes 96,228,819.73 bytes/sec total size is 48,983,875,313 speedup is 4.48

Btrfs reacted exactly the same way ZFS did. It momentarily halted the file transfer for about a second, then resumed the transfer without the client being able to notice anything other than the momentary drop in the file transfer speed.

On the Btrfs machine the pool has changed the state to a missing device:

# btrfs filesystem show -d /pub Label: none uuid: e4f04b17-c62b-4847-beeb-753bbb64c79a Total devices 3 FS bytes used 36.99GiB devid 1 size 931.51GiB used 21.01GiB path /dev/sdc devid 2 size 931.51GiB used 21.01GiB path /dev/sdd *** Some devices missing # btrfs filesystem usage /pub WARNING: RAID56 detected, not implemented WARNING: RAID56 detected, not implemented WARNING: RAID56 detected, not implemented Overall: Device size: 2.73TiB Device allocated: 0.00B Device unallocated: 2.73TiB Device missing: 931.51GiB Used: 0.00B Free (estimated): 0.00B (min: 8.00EiB) Data ratio: 0.00 Metadata ratio: 0.00 Global reserve: 51.75MiB (used: 0.00B) Data,RAID5: Size:48.00GiB, Used:45.63GiB /dev/sdb 24.00GiB /dev/sdc 24.00GiB /dev/sdd 24.00GiB Metadata,RAID5: Size:2.00GiB, Used:51.92MiB /dev/sdb 1.00GiB /dev/sdc 1.00GiB /dev/sdd 1.00GiB System,RAID5: Size:16.00MiB, Used:16.00KiB /dev/sdb 8.00MiB /dev/sdc 8.00MiB /dev/sdd 8.00MiB Unallocated: /dev/sdb 906.50GiB /dev/sdc 906.50GiB /dev/sdd 906.50GiB

As with ZFS I powered down the machine in order to safely reattach the drive and then rebooted.

# btrfs filesystem show -d Label: none uuid: e4f04b17-c62b-4847-beeb-753bbb64c79a Total devices 3 FS bytes used 35.38GiB devid 1 size 931.51GiB used 25.01GiB path /dev/sdc devid 2 size 931.51GiB used 25.01GiB path /dev/sdd devid 3 size 931.51GiB used 19.01GiB path /dev/sdb

The show command reveals that the pool is out of balance. To get more information I need to mount the pool and then use the device stats command.

The device stats command keeps a persistent record of several classes of IO-related errors. The current values are printed at mount time and updated during the filesystem's lifetime or by a scrub:

# btrfs device stats /pub/ [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0 [/dev/sdb].write_io_errs 16 [/dev/sdb].read_io_errs 1 [/dev/sdb].flush_io_errs 5 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0

The status report clearly shows write errors.
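Counters like these lend themselves to automated monitoring: any nonzero value in the btrfs device stats output is worth an alert. A minimal sketch, assuming the line format shown above; an embedded sample stands in for the live command output:

```shell
# Print every counter from `btrfs device stats` output whose value is
# nonzero. In real use you would pipe the command output in; here a
# sample of the output above is embedded for illustration.
stats='[/dev/sdc].write_io_errs   0
[/dev/sdb].write_io_errs   16
[/dev/sdb].read_io_errs    1
[/dev/sdb].corruption_errs 0'

printf '%s\n' "$stats" | awk '$2 > 0 { print $1 " = " $2 }'
# prints: [/dev/sdb].write_io_errs = 16
#         [/dev/sdb].read_io_errs = 1
```

Wired into cron or a monitoring agent, this turns the persistent counters into an early warning rather than something you only check after a failure.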

In the current situation the correct approach is to do a scrub:

# btrfs scrub start /pub/ scrub started on /pub/, fsid e4f04b17-c62b-4847-beeb-753bbb64c79a (pid=583)

As with ZFS, Btrfs is now running through the data and the checksums and is trying to repair the data.

After the scrubbing is done Btrfs tells us that it has repaired quite a lot of data, all with 0 uncorrectable errors.

# btrfs scrub status -d /pub/ scrub status for e4f04b17-c62b-4847-beeb-753bbb64c79a scrub device /dev/sdc (id 1) history scrub started at Fri Apr 26 03:18:57 2019 and finished after 00:49:34 total bytes scrubbed: 15.23GiB with 25452 errors error details: csum=25452 corrected errors: 25452, uncorrectable errors: 0, unverified errors: 0 scrub device /dev/sdd (id 2) history scrub started at Fri Apr 26 03:18:57 2019 and finished after 00:47:19 total bytes scrubbed: 15.23GiB with 27768 errors error details: csum=27768 corrected errors: 27768, uncorrectable errors: 0, unverified errors: 0 scrub device /dev/sdb (id 3) history scrub started at Fri Apr 26 03:18:57 2019 and finished after 00:49:34 total bytes scrubbed: 15.23GiB with 2316 errors error details: csum=2316 corrected errors: 2316, uncorrectable errors: 0, unverified errors: 0

Btrfs has managed to repair everything without any uncorrectable errors, but it still keeps the error counters in its log.

What is noticeable is that ZFS finished the scrubbing and repair in just about 6 minutes while Btrfs took about 50 minutes.

This is because Btrfs has also brought the pool into balance during the scrubbing and Btrfs is famous for being very slow at re-balancing drives:

# btrfs filesystem show -d Label: none uuid: e4f04b17-c62b-4847-beeb-753bbb64c79a Total devices 3 FS bytes used 45.68GiB devid 1 size 931.51GiB used 25.56GiB path /dev/sdc devid 2 size 931.51GiB used 25.56GiB path /dev/sdd devid 3 size 931.51GiB used 25.56GiB path /dev/sdb

Again it is up to the system administrator to decide what to do. And this is important, because even though Btrfs has managed to restore the pool we might still be dealing with an unhealthy device.

Can we clear the log? Or do we need to replace the drive anyway? Perhaps S.M.A.R.T has warned us that the drive is currently working, but it is experiencing occasional issues and soon needs to be fully replaced.

# btrfs device stats -c /pub/ [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0 [/dev/sdb].write_io_errs 16 [/dev/sdb].read_io_errs 1 [/dev/sdb].flush_io_errs 5 [/dev/sdb].corruption_errs 55536 [/dev/sdb].generation_errs 0

The result of the scrubbing showed zero uncorrectable errors and I know the drive is working fine so I'll just clear the log with the -z option:

# btrfs device stats -z /pub/ [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0

Btrfs - Data corruption during file transfer

Now I want to simulate the same disk corruption in the middle of a file transfer from the client as I did with ZFS.

I have removed the "zoo.mkv" file and while rsync is running I will use dd a couple of times on the Btrfs machine on one of the drives:

# dd if=/dev/urandom of=/dev/disk/by-id/ata-ST31000340NS_9QJ0ET8D seek=100000 count=1000 bs=1k

The device stats command did not show any problems:

# btrfs device stats -c /pub [/dev/sdc].write_io_errs 0 [/dev/sdc].read_io_errs 0 [/dev/sdc].flush_io_errs 0 [/dev/sdc].corruption_errs 0 [/dev/sdc].generation_errs 0 [/dev/sdd].write_io_errs 0 [/dev/sdd].read_io_errs 0 [/dev/sdd].flush_io_errs 0 [/dev/sdd].corruption_errs 0 [/dev/sdd].generation_errs 0 [/dev/sdb].write_io_errs 0 [/dev/sdb].read_io_errs 0 [/dev/sdb].flush_io_errs 0 [/dev/sdb].corruption_errs 0 [/dev/sdb].generation_errs 0

However, both dmesg and the log reveal something:

[ 1932.091249] BTRFS error (device sdc): csum mismatch on free space cache
[ 1932.091262] BTRFS warning (device sdc): failed to load free space cache for block group 42988470272, rebuilding it now
[ 1932.334063] BTRFS error (device sdc): csum mismatch on free space cache
[ 1932.334076] BTRFS warning (device sdc): failed to load free space cache for block group 47283437568, rebuilding it now
[ 2005.178214] BTRFS error (device sdc): space cache generation (17) does not match inode (19)
[ 2005.178222] BTRFS warning (device sdc): failed to load free space cache for block group 38693502976, rebuilding it now

Btrfs did detect the problem and automatically fixed it, but I had expected this kind of error to show up in the device stats result, perhaps as a corruption error count.
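Since the errors above were all about the free space cache, that cache can also be invalidated and rebuilt explicitly. A sketch using the documented clear_cache mount option (the device path here is a placeholder, not necessarily the one from my setup):

```shell
# Mounting once with clear_cache throws the free space cache away
# and lets Btrfs rebuild it on subsequent mounts (placeholder device path).
umount /pub
mount -o noatime,compress=lzo,clear_cache /dev/sdc /pub
```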

Btrfs - The dd mistake

Time to see what's going to happen if I, by mistake, run the dd command against one of the drives during a file transfer from the client.

As with the ZFS test, I have deleted all the files, restarted rsync, and then run dd directly against one of the drives:

# dd if=/dev/urandom of=/dev/sdb bs=1k
^C232089+0 records in
232089+0 records out
237659136 bytes (238 MB, 227 MiB) copied, 27.3843 s, 8.7 MB/s

Again the device stats command didn't show any problems:

# btrfs device stats -c /pub/
[/dev/sdc].write_io_errs    0
[/dev/sdc].read_io_errs     0
[/dev/sdc].flush_io_errs    0
[/dev/sdc].corruption_errs  0
[/dev/sdc].generation_errs  0
[/dev/sdb].write_io_errs    0
[/dev/sdb].read_io_errs     0
[/dev/sdb].flush_io_errs    0
[/dev/sdb].corruption_errs  0
[/dev/sdb].generation_errs  0
[/dev/sdd].write_io_errs    0
[/dev/sdd].read_io_errs     0
[/dev/sdd].flush_io_errs    0
[/dev/sdd].corruption_errs  0
[/dev/sdd].generation_errs  0

But dmesg did, after about a minute:

[ 867.808813] BTRFS error (device sdc): bad tree block start, want 53133312 have 17920600362259148199
[ 867.848391] BTRFS info (device sdc): read error corrected: ino 0 off 53133312 (dev /dev/sdb sector 32480)
[ 867.886255] BTRFS info (device sdc): read error corrected: ino 0 off 53137408 (dev /dev/sdb sector 32488)
[ 867.893746] BTRFS info (device sdc): read error corrected: ino 0 off 53141504 (dev /dev/sdb sector 32496)
[ 867.903079] BTRFS info (device sdc): read error corrected: ino 0 off 53145600 (dev /dev/sdb sector 32504)
[ 867.928986] BTRFS error (device sdc): bad tree block start, want 53100544 have 125614526405871379
[ 867.946912] BTRFS info (device sdc): read error corrected: ino 0 off 53100544 (dev /dev/sdb sector 32416)
[ 867.948135] BTRFS info (device sdc): read error corrected: ino 0 off 53104640 (dev /dev/sdb sector 32424)
[ 867.948793] BTRFS info (device sdc): read error corrected: ino 0 off 53108736 (dev /dev/sdb sector 32432)
[ 867.952210] BTRFS info (device sdc): read error corrected: ino 0 off 53112832 (dev /dev/sdb sector 32440)
[ 868.128686] BTRFS error (device sdc): bad tree block start, want 43614208 have 15420301482281005013
[ 868.130861] BTRFS error (device sdc): bad tree block start, want 43614208 have 15420301482281005013
[ 868.196118] BTRFS error (device sdc): bad tree block start, want 43614208 have 15420301482281005013
[ 868.296277] BTRFS error (device sdc): bad tree block start, want 43614208 have 15420301482281005013
[ 868.333942] BTRFS info (device sdc): read error corrected: ino 0 off 43614208 (dev /dev/sdb sector 23104)
[ 868.337820] BTRFS info (device sdc): read error corrected: ino 0 off 43618304 (dev /dev/sdb sector 23112)
[ 868.353572] BTRFS error (device sdc): bad tree block start, want 43630592 have 10676903441545527670
[ 868.378400] BTRFS error (device sdc): bad tree block start, want 43597824 have 485580186567037103
[ 868.531339] BTRFS error (device sdc): bad tree block start, want 46039040 have 1852668134064264900
[ 868.569488] BTRFS error (device sdc): bad tree block start, want 46055424 have 418370625237599952

On the client, as with ZFS, there is nothing noticeable going on during file transfer.

Time to run a scrub in order to correct the errors:

# btrfs scrub start /pub/
scrub started on /pub/, fsid 045b8eb9-267a-479b-92af-a996d9a27d12 (pid=468)
# btrfs scrub status -d /pub/
scrub status for 045b8eb9-267a-479b-92af-a996d9a27d12
scrub device /dev/disk/by-id/ata-ST31000340NS_9QJ089LF (id 1) status
    scrub started at Tue Apr 30 22:58:47 2019, running for 00:01:00
    total bytes scrubbed: 582.61MiB with 9 errors
    error details: csum=9
    corrected errors: 9, uncorrectable errors: 0, unverified errors: 0
scrub device /dev/sdb (id 2) status
    scrub started at Tue Apr 30 22:58:47 2019, running for 00:01:00
    total bytes scrubbed: 543.27MiB with 639 errors
    error details: csum=639
    corrected errors: 639, uncorrectable errors: 0, unverified errors: 0
scrub device /dev/sdd (id 3) status
    scrub started at Tue Apr 30 22:58:47 2019, running for 00:01:00
    total bytes scrubbed: 480.40MiB with 2 errors
    error details: csum=2
    corrected errors: 2, uncorrectable errors: 0, unverified errors: 0
WARNING: errors detected during scrubbing, corrected

Btrfs has detected the errors and fixed them:

scrub status for 045b8eb9-267a-479b-92af-a996d9a27d12
scrub device /dev/disk/by-id/ata-ST31000340NS_9QJ089LF (id 1) history
    scrub started at Tue Apr 30 22:58:47 2019 and finished after 00:25:08
    total bytes scrubbed: 15.23GiB with 9 errors
    error details: csum=9
    corrected errors: 9, uncorrectable errors: 0, unverified errors: 0
scrub device /dev/sdb (id 2) history
    scrub started at Tue Apr 30 22:58:47 2019 and finished after 00:25:06
    total bytes scrubbed: 15.23GiB with 639 errors
    error details: csum=639
    corrected errors: 639, uncorrectable errors: 0, unverified errors: 0
scrub device /dev/sdd (id 3) history
    scrub started at Tue Apr 30 22:58:47 2019 and finished after 00:25:16
    total bytes scrubbed: 15.23GiB with 2 errors
    error details: csum=2
    corrected errors: 2, uncorrectable errors: 0, unverified errors: 0
# btrfs device stats -c /pub/
[/dev/sdc].write_io_errs    0
[/dev/sdc].read_io_errs     0
[/dev/sdc].flush_io_errs    0
[/dev/sdc].corruption_errs  0
[/dev/sdc].generation_errs  0
[/dev/sdb].write_io_errs    0
[/dev/sdb].read_io_errs     0
[/dev/sdb].flush_io_errs    0
[/dev/sdb].corruption_errs  650
[/dev/sdb].generation_errs  0
[/dev/sdd].write_io_errs    0
[/dev/sdd].read_io_errs     0
[/dev/sdd].flush_io_errs    0
[/dev/sdd].corruption_errs  0
[/dev/sdd].generation_errs  0

Btrfs handled the problem just as well as ZFS. The only difference was the time it took to do the scrub.

Btrfs - The "write hole" issue

Since Btrfs still has warnings about the write hole issue I would like to see if it's possible to recreate the problem in this test.

Parity may be inconsistent after a crash (the "write hole"). The problem arises when a disk failure happens after "an unclean shutdown". These are two distinct failures, but together they break the Btrfs RAID-5 redundancy. If you run a scrub process after "an unclean shutdown" (with no disk failure in between), the data that match their checksums can still be read out, while the mismatched data are lost forever.

These two issues have to exist at the same time:

An unclean shutdown.

A disk failure.

So pulling the power cord to the machine during a file transfer and then simulating a disk failure by removing one of the drives should potentially re-create the issue.

I have removed the "zoo.mkv" file from the files on the Btrfs machine. I will pull the power cord during the transfer of that file, then remove a drive and see what happens.

$ rsync -a --progress --stats tmp/ mnt/testbox/pub/tmp/
sending incremental file list
zoo.mkv
  7,176,814,592  66%   58.80kB/s   17:25:53
^C

The Btrfs machine has now suffered an unclean shutdown. I have aborted the file transfer on the client and unmounted the Btrfs export. I have then physically changed one of the drives in the Btrfs machine and will now try to do a replacement.

# btrfs filesystem show -d
warning, device 2 is missing
checksum verify failed on 117506048 found E6CE304B wanted 022D8DFD
bad tree block 117506048, bytenr mismatch, want=117506048, have=65536
Couldn't setup extent tree
checksum verify failed on 117538816 found 151B2790 wanted F1F89A26
bad tree block 117538816, bytenr mismatch, want=117538816, have=65536
Couldn't setup device tree
Label: none  uuid: 045b8eb9-267a-479b-92af-a996d9a27d12
    Total devices 3 FS bytes used 38.61GiB
    devid    1 size 931.51GiB used 21.01GiB path /dev/sdc
    devid    3 size 931.51GiB used 21.01GiB path /dev/sdd
    *** Some devices missing

In the previous test, where I simulated a drive failure, I got the same error messages, except that this time Btrfs is also complaining that it "couldn't setup device tree".

I will now mount the pool in a degraded state, replace the faulty drive, and see if we can salvage any data from the pool. The mount has to be performed with a healthy drive:

# mount -o noatime,compress=lzo,degraded /dev/disk/by-id/ata-ST31000340NS_9QJ0 /pub

This time it is "devid" 2 I need to replace. The new disk is the "9QJ0ET8D" one:

# btrfs replace start -f 2 /dev/disk/by-id/ata-ST31000340NS_9QJ0ET8D /pub/

Let's check the status of the replacement:

# btrfs replace status -1 /pub/
0.4% done, 0 write errs, 0 uncorr. read errs

Then after a little while:

# btrfs replace status -1 /pub/
Started on 30.Apr 23:58:34, finished on 1.May 00:06:39, 0 write errs, 0 uncorr. read errs
# btrfs filesystem show -d
Label: none  uuid: 045b8eb9-267a-479b-92af-a996d9a27d12
    Total devices 3 FS bytes used 38.61GiB
    devid    1 size 931.51GiB used 22.04GiB path /dev/sdc
    devid    2 size 931.51GiB used 21.00GiB path /dev/sdb
    devid    3 size 931.51GiB used 22.04GiB path /dev/sdd
# btrfs device stats -c /pub
[/dev/sdc].write_io_errs    0
[/dev/sdc].read_io_errs     0
[/dev/sdc].flush_io_errs    0
[/dev/sdc].corruption_errs  0
[/dev/sdc].generation_errs  0
[/dev/sdb].write_io_errs    0
[/dev/sdb].read_io_errs     0
[/dev/sdb].flush_io_errs    0
[/dev/sdb].corruption_errs  0
[/dev/sdb].generation_errs  0
[/dev/sdd].write_io_errs    0
[/dev/sdd].read_io_errs     0
[/dev/sdd].flush_io_errs    0
[/dev/sdd].corruption_errs  0
[/dev/sdd].generation_errs  0
# ls -gG /pub/tmp/
total 37223496
-rwxrw-r-- 1    18576345 Apr 21 09:08 1.pdf
-rwxrw-r-- 1    30255102 Apr 21 09:08 2.pdf
-rwxrw-r-- 1    22016195 Apr 21 09:08 3.pdf
-rwxrw-r-- 1 35456180485 Apr 21 07:58 bar.mkv
-rwxrw-r-- 1   625338368 Mar  5  2018 boo.iso
-rwxrw-r-- 1  1548841922 Apr 15 23:50 foo.mkv
-rwxrw-r-- 1   415633408 Mar  5  2018 moo.iso

Everything has been restored nicely and all three drives are performing well. I didn't lose any files or suffer any parity issues that made the replacement a problem.

I have repeated the above test with the same result more than once.

Btrfs - A second drive failure during a replacement

Now I want to see what's going to happen with Btrfs when I lose a second drive during a replacement procedure.

I have removed one of the drives and am mounting the Btrfs pool in a degraded state in order to begin a replacement:

# btrfs filesystem show -d
Label: none  uuid: 045b8eb9-267a-479b-92af-a996d9a27d12
    Total devices 3 FS bytes used 38.61GiB
    devid    1 size 931.51GiB used 22.03GiB path /dev/sdc
    devid    3 size 931.51GiB used 22.03GiB path /dev/sdd
    *** Some devices missing
# mount -o noatime,compress=lzo,degraded /dev/disk/by-id/ata-ST31000340NS_9QJ089LF /pub

As I did with ZFS, while the replacement procedure is running I will disconnect one of the working drives.

# btrfs replace start -f 2 /dev/disk/by-id/ata-ST31000340NS_9QJ0ET8D /pub/

Let's check the status:

# btrfs replace status -1 /pub/
0.1% done, 0 write errs, 0 uncorr. read errs

I now disconnect a second drive by removing the power cord for the drive:

# btrfs replace status -1 /pub/
Started on 3.May 21:03:21, canceled on 3.May 21:04:12 at 0.0%, 0 write errs, 0 uncorr. read errs

Btrfs cancelled the replacement when the second drive went offline.

# ls -gG /pub/tmp/
ls: cannot access '/pub/tmp/boo.iso': Input/output error
ls: cannot access '/pub/tmp/foo.mkv': Input/output error
ls: cannot access '/pub/tmp/moo.iso': Input/output error
total 34694376
-rwxrw-r-- 1    18576345 Apr 21 09:08 1.pdf
-rwxrw-r-- 1    30255102 Apr 21 09:08 2.pdf
-rwxrw-r-- 1    22016195 Apr 21 09:08 3.pdf
-rwxrw-r-- 1 35456180485 Apr 21 07:58 bar.mkv
-????????? ? ?           ?            boo.iso
-????????? ? ?           ?            foo.mkv
-????????? ? ?           ?            moo.iso

We clearly have a problem.

I have attached a new drive and the Btrfs machine now only has one healthy drive in the pool and two new drives of which one has only been partly replaced.

# umount /pub
# btrfs filesystem show -d
warning, device 3 is missing
Label: none  uuid: 045b8eb9-267a-479b-92af-a996d9a27d12
    Total devices 3 FS bytes used 38.61GiB
    devid    1 size 931.51GiB used 22.03GiB path /dev/sdc
    *** Some devices missing
# mount -o noatime,compress=lzo,degraded /dev/disk/by-id/ata-ST31000340NS_9QJ089LF /pub/
# btrfs filesystem show -d
warning, device 3 is missing
checksum verify failed on 119832576 found B67B4ABD wanted A302A7B3
checksum verify failed on 119832576 found B67B4ABD wanted A302A7B3
bad tree block 119832576, bytenr mismatch, want=119832576, have=5117397648563945276
Label: none  uuid: 045b8eb9-267a-479b-92af-a996d9a27d12
    Total devices 3 FS bytes used 38.61GiB
    devid    1 size 931.51GiB used 22.03GiB path /dev/sdc
    devid    2 size 931.51GiB used 21.01GiB path /dev/sdd
    *** Some devices missing

The drive that went through the partial replacement is at least recognized as belonging to the pool.

Now, in this situation trying to run any kind of repair process would not only be futile, but it would also be very wrong. The filesystem isn't damaged and it doesn't require any kind of repairing.

Again I will try to replace the third disk and see if perhaps enough data and metadata is lying around to actually restore the pool without losing any data (as with ZFS, this is a very long shot):

Let's locate the new disk:

# ls -l /dev/disk/by-id/
ata-ST31000340NS_9QJ089LF -> ../../sdc
ata-ST31000340NS_9QJ0DVN2 -> ../../sdd
ata-ST31000340NS_9QJ0ES1V -> ../../sdb

"devid" 3 needs to be replaced with the "9QJ0ES1V" one:

# btrfs replace start -f 3 /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V /pub/

No errors. Let's check the status:

# btrfs replace status -1 /pub/
Started on 3.May 21:03:21, suspended on 1.May 00:06:39 at 0.2%, 0 write errs, 0 uncorr. read errs

Suspended!

Let's see what dmesg says:

[ 509.084144] BTRFS info (device sdc): use lzo compression, level 0
[ 509.084147] BTRFS info (device sdc): allowing degraded mounts
[ 509.084148] BTRFS info (device sdc): disk space caching is enabled
[ 509.084150] BTRFS info (device sdc): has skinny extents
[ 509.107081] BTRFS warning (device sdc): devid 3 uuid 9078bc78-a5ba-4178-96ca-53fb2e29b62c is missing
[ 509.167206] BTRFS info (device sdc): cannot continue dev_replace, tgtdev is missing
[ 509.167208] BTRFS info (device sdc): you may cancel the operation after 'mount -o degraded'

So a replacement is not possible.

With ZFS we get much better information from zpool status -v, both about the replacement status and about the specific files that cannot be restored.

Let's run a scrub and see if by any chance we can salvage some files and then restore as much of the pool as possible:

# btrfs scrub start /pub/
scrub started on /pub/, fsid 045b8eb9-267a-479b-92af-a996d9a27d12 (pid=497)

Let's check:

# btrfs scrub status -d /pub/
scrub status for 045b8eb9-267a-479b-92af-a996d9a27d12
scrub device /dev/sdc (id 1) history
    scrub started at Fri May  3 21:18:18 2019 and was aborted after 00:00:00
    total bytes scrubbed: 0.00B with 0 errors
scrub device /dev/sdd (id 2) history
    scrub started at Fri May  3 21:18:18 2019 and was aborted after 00:00:00
    total bytes scrubbed: 0.00B with 0 errors
scrub device /dev/sdd (id 3) history
    scrub started at Fri May  3 21:18:18 2019 and was aborted after 00:00:00
    total bytes scrubbed: 0.00B with 0 errors

Aborted.

This was a no-go; we cannot scrub a RAID-5 pool with only one original disk and a second one that hasn't been replaced correctly.

# ls -gG /pub/tmp/
total 37223496
-rwxrw-r-- 1    18576345 Apr 21 09:08 1.pdf
-rwxrw-r-- 1    30255102 Apr 21 09:08 2.pdf
-rwxrw-r-- 1    22016195 Apr 21 09:08 3.pdf
-rwxrw-r-- 1 35456180485 Apr 21 07:58 bar.mkv
-rwxrw-r-- 1   625338368 Mar  5  2018 boo.iso
-rwxrw-r-- 1  1548841922 Apr 15 23:50 foo.mkv
-rwxrw-r-- 1   415633408 Mar  5  2018 moo.iso

The only thing left is to see how many files I can salvage:

rsync -a --progress --stats mnt/testbox/pub/tmp/ tmp3/
sending incremental file list
./
1.pdf
     18,576,345 100%  111.22MB/s    0:00:00 (xfr#1, to-chk=6/8)
2.pdf
     30,255,102 100%   68.05MB/s    0:00:00 (xfr#2, to-chk=5/8)
3.pdf
     22,016,195 100%   33.97MB/s    0:00:00 (xfr#3, to-chk=4/8)
bar.mkv
     41,451,520   0%   39.18MB/s    0:14:42

Then it halted.

I then tried copying the files over, picking one at a time, and to my big surprise I actually managed to get all the files except the "bar.mkv" file!

ls -gG
total 2598328
-rwxr-xr-x 1   18576345 Apr 21 09:08 1.pdf
-rwxr-xr-x 1   30255102 Apr 21 09:08 2.pdf
-rwxr-xr-x 1   22016195 Apr 21 09:08 3.pdf
-rwxr-xr-x 1  625338368 Mar  5  2018 boo.iso
-rwxr-xr-x 1 1548841922 Apr 15 23:50 foo.mkv
-rwxr-xr-x 1  415633408 Mar  5  2018 moo.iso

During the attempt to transfer the "bar.mkv" file the following errors showed up on the Btrfs machine:

# dmesg
[ 4177.376785] BTRFS error (device sdc): bad tree block start, want 38944768 have 7071809559058736496
[ 4177.378494] BTRFS error (device sdc): bad tree block start, want 38961152 have 16350034114213725736
[ 4177.378718] BTRFS error (device sdc): bad tree block start, want 38977536 have 8392528330119265768
[ 4177.379183] BTRFS error (device sdc): bad tree block start, want 38928384 have 6084014255993522895
[ 4181.808743] BTRFS critical (device sdc): corrupt node: root=7 block=39124992 slot=0, unaligned pointer, have 12335186693368 should be aligned to 4096
[ 4181.808757] BTRFS info (device sdc): no csum found for inode 261 start 52690944
[ 4181.808856] BTRFS critical (device sdc): corrupt node: root=7 block=39124992 slot=0, unaligned pointer, have 12335186693368 should be aligned to 4096
[ 4181.808866] BTRFS info (device sdc): no csum found for inode 261 start 52695040
[ 4181.808955] BTRFS critical (device sdc): corrupt node: root=7 block=39124992 slot=0, unaligned pointer, have 12335186693368 should be aligned to 4096
[ 4181.808965] BTRFS info (device sdc): no csum found for inode 261 start 52699136
[ 4181.809051] BTRFS critical (device sdc): corrupt node: root=7 block=39124992 slot=0, unaligned pointer, have 12335186693368 should be aligned to 4096
...

Btrfs has the "btrfs restore" command, which is used to try to salvage files from a damaged filesystem and restore them somewhere else. The man page explains:

btrfs restore could be used to retrieve file data, as far as the metadata are readable. The checks done by restore are less strict and the process is usually able to get far enough to retrieve data from the whole filesystem. This comes at a cost that some data might be incomplete or from older versions if they’re available. There are several options to attempt restoration of various file metadata type. You can try a dry run first to see how well the process goes and use further options to extend the set of restored metadata.

I have 129G available on the boot disk so I can try to restore files to that drive.

I'm going to use "sdc" first, which is the healthy and original working drive, followed by "sdd", which is the disk that was partly replaced. The last disk, "sdb", is useless.

# mkdir /restored-files
# umount /pub
# btrfs restore -D /dev/sdc /restored-files/
warning, device 3 is missing
checksum verify failed on 115867648 found E486C552 wanted 006578E4
bad tree block 115867648, bytenr mismatch, want=115867648, have=65536
Could not open root, trying backup super
warning, device 3 is missing
checksum verify failed on 38895616 found 69CF6F65 wanted 8D2CD2D3
checksum verify failed on 38895616 found 69CF6F65 wanted 8D2CD2D3
bad tree block 38895616, bytenr mismatch, want=38895616, have=65536
checksum verify failed on 115867648 found E486C552 wanted 006578E4
bad tree block 115867648, bytenr mismatch, want=115867648, have=65536
Could not open root, trying backup super
warning, device 3 is missing
checksum verify failed on 38895616 found 69CF6F65 wanted 8D2CD2D3
checksum verify failed on 38895616 found 69CF6F65 wanted 8D2CD2D3
bad tree block 38895616, bytenr mismatch, want=38895616, have=65536
checksum verify failed on 115867648 found E486C552 wanted 006578E4
bad tree block 115867648, bytenr mismatch, want=115867648, have=65536
Could not open root, trying backup super
# btrfs restore -D /dev/sdd /restored-files/
warning, device 3 is missing
checksum verify failed on 22020096 found 7DCD7CC1 wanted 28699DE8
checksum verify failed on 22020096 found 7DCD7CC1 wanted 28699DE8
bad tree block 22020096, bytenr mismatch, want=22020096, have=899525736547221204
ERROR: cannot read chunk root
Could not open root, trying backup super
warning, device 3 is missing
warning, device 1 is missing
bad tree block 22020096, bytenr mismatch, want=22020096, have=0
ERROR: cannot read chunk root
Could not open root, trying backup super
warning, device 3 is missing
warning, device 1 is missing
bad tree block 22020096, bytenr mismatch, want=22020096, have=0
ERROR: cannot read chunk root
Could not open root, trying backup super

Removing the useless disk in order to try to run on two disks only doesn't work because it is a RAID-5 which needs at least three disks:

# btrfs device remove missing 3 /pub
ERROR: error removing device 'missing': unable to go below two devices on raid5
ERROR: error removing devid 3: unable to go below two devices on raid5

Adding a new disk in order to try to have Btrfs re-balance fails as expected:

# btrfs balance start -v /pub/
Dumping filters: flags 0x7, state 0x0, force is off
  DATA (flags 0x0): balancing
  METADATA (flags 0x0): balancing
  SYSTEM (flags 0x0): balancing
WARNING: Full balance without filters requested. This operation is very intense and takes potentially very long. It is recommended to use the balance filters to narrow down the scope of balance. Use 'btrfs balance start --full-balance' option to skip this warning. The operation will start in 10 seconds. Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting balance without any filters.

The balance ends prematurely:

# dmesg
[ 1179.816473] BTRFS info (device sdc): balance: resume -dusage=90 -musage=90 -susage=90
[ 1179.816732] BTRFS info (device sdc): relocating block group 48524951552 flags data|raid5
[ 1180.074942] BTRFS info (device sdc): relocating block group 47451209728 flags metadata|raid5
[ 1180.391206] BTRFS info (device sdc): found 12 extents
[ 1180.632952] BTRFS info (device sdc): relocating block group 47384100864 flags system|raid5
[ 1180.894086] BTRFS info (device sdc): found 1 extents
[ 1181.132700] BTRFS info (device sdc): relocating block group 42988470272 flags data|raid5
[ 1211.850068] BTRFS info (device sdc): found 13 extents
[ 1213.063935] BTRFS error (device sdc): bad tree block start, want 65650688 have 13914138350834705721
[ 1213.072832] BTRFS: error (device sdc) in btrfs_run_delayed_refs:3011: errno=-5 IO failure
[ 1213.072834] BTRFS info (device sdc): forced readonly
[ 1213.072859] BTRFS info (device sdc): balance: ended with status: -30

I was actually very surprised at the number of files that I managed to salvage with Btrfs.

This means that either all of the files, except the missing one, were located physically on that single healthy drive, or parts of the files plus the needed parity data were all located on that single healthy drive plus the second drive that was partially replaced.

Does this mean that Btrfs perhaps isn't very good at balancing data and parity data evenly across multiple drives in a RAID-5 setup so that I ended up having most of the data needed on only one drive?

Or does this mean that with Btrfs sometimes you just "get lucky" and stand a greater chance at getting your files back even when two drives fail in a RAID-5 setup?

I decided to re-test this in order to see if I would get the same results again, this time by pulling the "sdc" disk, which was healthy before. Of course, I might just get the same results because Btrfs is now using another disk in the same way.

I have created a completely fresh RAID-5 pool and mounted it:

# mkfs.btrfs -f -m raid5 -d raid5 /dev/disk/by-id/ata-ST31000340NS_9QJ089LF /dev/disk/by-id/ata-ST31000340NS_9QJ0DVN2 /dev/disk/by-id/ata-ST31000340NS_9QJ0EZZC
btrfs-progs v4.20.2
See http://btrfs.wiki.kernel.org for more information.

Label:              (null)
UUID:               226b366f-64f0-447e-87eb-31c91e5992b6
Node size:          16384
Sector size:        4096
Filesystem size:    2.73TiB
Block group profiles:
  Data:             RAID5             2.00GiB
  Metadata:         RAID5             2.00GiB
  System:           RAID5            16.00MiB
SSD detected:       no
Incompat features:  extref, raid56, skinny-metadata
Number of devices:  3
Devices:
   ID        SIZE  PATH
    1   931.51GiB  /dev/disk/by-id/ata-ST31000340NS_9QJ089LF
    2   931.51GiB  /dev/disk/by-id/ata-ST31000340NS_9QJ0DVN2
    3   931.51GiB  /dev/disk/by-id/ata-ST31000340NS_9QJ0EZZC
# mount -o noatime,compress=lzo /dev/disk/by-id/ata-ST31000340NS_9QJ089LF /pub

Then I have transferred all the files from the client again:

# ls -gG /pub/tmp/
total 37223496
-rwxrw-r-- 1    18576345 Apr 21 09:08 1.pdf
-rwxrw-r-- 1    30255102 Apr 21 09:08 2.pdf
-rwxrw-r-- 1    22016195 Apr 21 09:08 3.pdf
-rwxrw-r-- 1 35456180485 Apr 21 07:58 bar.mkv
-rwxrw-r-- 1   625338368 Mar  5  2018 boo.iso
-rwxrw-r-- 1  1548841922 Apr 15 23:50 foo.mkv
-rwxrw-r-- 1   415633408 Mar  5  2018 moo.iso
# btrfs filesystem df /pub/
Data, RAID5: total=36.00GiB, used=35.51GiB
System, RAID5: total=16.00MiB, used=16.00KiB
Metadata, RAID5: total=2.00GiB, used=40.39MiB
GlobalReserve, single: total=40.20MiB, used=0.00B

I have now removed the device that was "sdc".

# btrfs filesystem show -d
warning, device 1 is missing
checksum verify failed on 85508096 found A0A8052D wanted 444BB89B
bad tree block 85508096, bytenr mismatch, want=85508096, have=65536
Couldn't read tree root
Label: none  uuid: 663b05c8-c9b3-4c88-a450-36b5e25a39c2
    Total devices 3 FS bytes used 35.55GiB
    devid    2 size 931.51GiB used 19.01GiB path /dev/sdb
    devid    3 size 931.51GiB used 19.01GiB path /dev/sdd
    *** Some devices missing

I am then mounting the Btrfs pool in a degraded state and beginning a replacement, then I will remove the next drive from the pool during the replacement:

# mount -o noatime,compress=lzo,degraded /dev/disk/by-id/ata-ST31000340NS_9QJ0DVN2 /pub/
# btrfs replace start -f 1 /dev/disk/by-id/ata-ST31000340NS_9QJ0ES1V /pub/
# btrfs replace status -1 /pub
0.2% done, 0 write errs, 0 uncorr. read errs

This time I am experiencing a crash:

# dmesg
[ 581.184298] kernel BUG at fs/btrfs/raid56.c:1910!
[ 581.184304] invalid opcode: 0000 [#3] PREEMPT SMP PTI
[ 581.184309] CPU: 1 PID: 366 Comm: kworker/u8:0 Tainted: G D I 5.0.10-arch1-1-ARCH #1
[ 581.184315] Hardware name: Hewlett-Packard HP Compaq dc7900 Small Form Factor/3031h, BIOS 786G1 v01.08 08/25/2008
[ 581.184351] Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs]
[ 581.184385] RIP: 0010:__raid_recover_end_io+0x37e/0x450 [btrfs]
[ 581.184390] Code: 00 ff ff ff ff 85 c0 74 47 83 f8 02 0f 85 e3 00 00 00 48 83 c4 10 48 89 df 31 f6 5b 5d 41 5c 41 5d 41 5e 41 5f e9 f2 ee ff ff <0f> 0b 4c 8d a3 98 00 00 00 4c 89 e7 e8 51 73 f1 d7 f0 80 8b b0 00
[ 581.184399] RSP: 0018:ffff9eb141347e18 EFLAGS: 00010213
[ 581.184403] RAX: ffff92d37c72a800 RBX: ffff92d37f03d800 RCX: 0000000000000000
[ 581.184408] RDX: 0000000000000002 RSI: 0000000000000010 RDI: 0000000000000003
[ 581.184412] RBP: 0000000000000000 R08: 0000000000000008 R09: ffff92d391a0a000
[ 581.184417] R10: 0000000000000008 R11: 000000000000000c R12: 0000000000000003
[ 581.184426] R13: 0000000000000000 R14: 0000000000000001 R15: ffff92d384525e80
[ 581.184435] FS: 0000000000000000(0000) GS:ffff92d393a80000(0000) knlGS:0000000000000000
[ 581.184440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 581.184445] CR2: 000055fdae19b1c8 CR3: 000000020f3ca000 CR4: 00000000000406e0
[ 581.184449] Call Trace:
[ 581.184484]  normal_work_helper+0xbd/0x350 [btrfs]
[ 581.184491]  process_one_work+0x1eb/0x410
[ 581.184496]  worker_thread+0x2d/0x3d0
[ 581.184501]  ? process_one_work+0x410/0x410
[ 581.184506]  kthread+0x112/0x130
[ 581.184511]  ? kthread_park+0x80/0x80
[ 581.184516]  ret_from_fork+0x35/0x40
[ 581.184521] Modules linked in: snd_hda_codec_analog i915 snd_hda_codec_generic ledtrig_audio kvmgt vfio_mdev mdev btrfs vfio_iommu_type1 vfio i2c_algo_bit snd_hda_intel drm_kms_helper snd_hda_codec coretemp drm snd_hda_core libcrc32c syscopyarea kvm snd_hwdep sysfillrect snd_pcm sysimgblt xor fb_sys_fops irqbypass snd_timer input_leds snd raid6_pq joydev tpm_infineon psmouse tpm_tis soundcore hp_wmi tpm_tis_core intel_agp sparse_keymap mei_wdt iTCO_wdt e1000e mei_me tpm intel_gtt iTCO_vendor_support rfkill pcspkr mei gpio_ich agpgart wmi_bmof evdev rng_core mac_hid lpc_ich wmi pcc_cpufreq acpi_cpufreq ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic usbhid hid sd_mod serio_raw uhci_hcd atkbd libps2 ahci libahci ata_generic pata_acpi libata ehci_pci ehci_hcd scsi_mod floppy i8042 serio

The replacement has also stalled, so I rebooted the Btrfs machine.

Now, I cannot mount the filesystem:

# mount -o noatime,compress=lzo,degraded /dev/disk/by-id/ata-ST31000340NS_9QJ0DVN2 /pub/
mount: /pub: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error.

I have tried btrfs device scan and mounting in "recovery" mode, I have tried using btrfs restore, and I have tried btrfs rescue zero-log, but nothing worked.

Have I just now hit one of the RAID-5 bugs? The wiki does say:

The parity RAID code has multiple serious data-loss bugs in it. It should not be used for anything other than testing purposes.

The answer is actually no. Well, the crash is a bug, but the mount issue is not a bug.

The simple fact is that you cannot expect to survive a two-drive failure in a RAID-5 setup no matter what filesystem you are using.

Sometimes, as in the first attempt, you might get away with restoring some files. At other times you will simply lose the entire pool. Expect the latter with both ZFS and Btrfs!

Enough Btrfs for now. Time to test mdadm+dm-integrity.

mdadm+dm-integrity RAID-5

UPDATE 2019-08-27: It has come to my attention (thank you Philip!) that I made an unfortunate mistake in my tests of mdadm+dm-integrity. When I tested for data integrity errors I wrote to /dev/mapper/sdb which also updates the dm-integrity checksum. Later when I do the sync-action check, the errors are not the dm-integrity checksum errors, but rather the RAID parity errors. The correct test should have been to write random data to /dev/sdb . At the end of the mdadm-dm+integrity section I have copy/pasted the result of a test Philip send me by email which contains an example of how the test should have been run. I have also updated the article with a note each time I made the mistake.

I stumbled upon dm-integrity as I was doing some of the tests with Btrfs and I haven't used it before. I therefore thought that it would be interesting to see how mdadm+dm-integrity handles the same problems that I have just tested ZFS and Btrfs with.

mdadm is used for administering pure software RAID using plain block devices, but it does not provide any kind of data integrity verification. If a read error is encountered, mdadm calculates the block in error and writes it back. If the pool is a mirror, mdadm can't calculate the correct data, so it takes the data from the first available drive, assumes it is correct, and writes it to the other drive. If the pool is a degraded RAID pool, mdadm will terminate immediately without doing anything, as it cannot recalculate the faulty data.
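The check-and-rewrite behaviour described above is driven through the md sysfs interface. A minimal sketch, assuming the array is /dev/md0:

```shell
# Start a verification pass over the whole array (md0 is an assumption).
echo check > /sys/block/md0/md/sync_action
# Follow progress, then read the mismatch counter once it finishes.
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt
# 'repair' makes md rewrite inconsistent stripes instead of only counting them.
echo repair > /sys/block/md0/md/sync_action
```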

dm-integrity itself has nothing to do with any kind of RAID setup. dm-integrity will just return an EILSEQ error (instead of EIO) when it encounters a data integrity error; it is then up to the RAID driver, mdadm in this case, to handle integrity errors properly. dm-integrity also does not require encryption, which is useful when encryption is not desired for technical or other reasons.

From the documentation:

The dm-integrity target can also be used as a standalone target, in this mode it calculates and verifies the integrity tag internally. In this mode, the dm-integrity target can be used to detect silent data corruption on the disk or in the I/O path. To guarantee write atomicity, the dm-integrity target uses journal, it writes sector data and integrity tags into a journal, commits the journal and then copies the data and integrity tags to their respective location.

If you combine dm-integrity with an mdadm RAID (RAID-1/mirror, RAID-5, or any other redundant level) you get both disk redundancy and error detection with error correction: dm-integrity raises checksum errors when it encounters invalid data, which mdadm notices and then repairs with correct data from the redundancy.
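This repair cycle can be exercised with the corrected test methodology from the update note above: corrupt the raw disk underneath dm-integrity, then let mdadm repair it. A hedged sketch, assuming /dev/sdb is one of the backing disks and the array is md127; the dd offsets are arbitrary illustrative values:

```shell
# Overwrite a region on the RAW disk, below the dm-integrity layer,
# so the stored checksums are NOT updated (writing to /dev/mapper/sdb
# would update the checksums too and hide the corruption).
dd if=/dev/urandom of=/dev/sdb bs=1M count=10 seek=100 conv=notrunc

# Reads through the integrity layer now fail with EILSEQ for the
# corrupted sectors; an md check lets mdadm rebuild them from parity.
echo check > /sys/block/md127/md/sync_action

# dm-integrity logs the checksum failures to the kernel log.
dmesg | grep -i integrity
```

Note that a raw overwrite like this may also hit dm-integrity metadata or journal areas rather than data sectors, which is fine for a destructive test but is another reason never to do this outside a lab.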

With mdadm, you specify the RAID device to create, the RAID level (raid0, raid1, raid10, raid5, raid6, etc.) and the member devices. mdadm is very well documented and it contains tons of options with examples as well, but it is also very easy to make mistakes with mdadm.

If you just want simple data integrity verification without any of the extra functionality that ZFS or Btrfs offers, then dm-integrity alone can do the job - you just need to run regular scrubs of the filesystem and then make sure you have adequate backup to handle any potential integrity problems.
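Standalone dm-integrity has no scrub command of its own, but since every read is checksum-verified, a full-device read serves the same purpose. A minimal sketch, with the device name assumed:

```shell
# Force a read of the entire mapped device; dm-integrity verifies the
# checksum of every sector it returns, so silent corruption surfaces
# as read errors (EILSEQ) instead of going unnoticed.
dd if=/dev/mapper/sdb of=/dev/null bs=1M status=progress

# Failed checksums are also logged by the kernel:
dmesg | grep -i integrity
```

Running this from cron or a systemd timer would give you periodic "scrubs" comparable in spirit to those of ZFS and Btrfs, though with detection only, no repair.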

Since I'm not using encryption in these tests I will use the integritysetup command instead of the cryptsetup command to format the disks. However, it is worth noting that dm-integrity is best integrated with dm-crypt+LUKS for disk encryption.

By default, integritysetup uses "crc32", which is relatively fast and requires just 4 bytes per block. This gives a probability of about 1 in 2^32 that a random corruption goes undetected. This is then on top of any silent corruption on the hard drive itself.
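To put the 4-byte figure in perspective, the space cost of the tags is easy to estimate. A sketch assuming the default 512-byte sector size (the actual tag and sector sizes of a formatted device can be confirmed with integritysetup dump):

```shell
# Rough space cost of the default integrity tags: 4 bytes of crc32
# per 512-byte sector (illustrative values, not measured output).
sector_bytes=512
tag_bytes=4

# 4/512 works out to roughly 0.78% of the device.
echo "tag overhead: $((tag_bytes * 10000 / sector_bytes)) / 10000 of the device"
```

A sha256 tag, as used in the format command below, is 32 bytes per sector and correspondingly more expensive in both space and CPU.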

With dm-integrity the devices need to be wiped during format in order to avoid invalid checksums. As this takes an extremely long time with the 1 TB disks, I have switched to three old 160 GB disks, and I am just going to use the shorthand sdX for the device names (never do that in production; I am only doing it for the sake of the test. Always use device names with serial numbers for easy identification).

# integritysetup format --integrity sha256 /dev/sdb

WARNING!
========
This will overwrite data on /dev/sdb irrevocably.

Are you sure? (Type uppercase yes): YES
WARNING: Device /dev/sdb already contains a 'dos' partition signature.
Formatted with tag size 4, internal integrity sha256.
Wiping device to initialize integrity checksum.
You can interrupt this by pressing CTRL+c (rest of not wiped device will contain invalid checksum).
Progress: 2.0%, ETA 49:12, 2991 MiB written, speed 50.3 MiB/s

Then opening the devices:

# integritysetup open --integrity sha256 /dev/sdb sdb
# integritysetup open --integrity sha256 /dev/sdc sdc
# integritysetup open --integrity sha256 /dev/sdd sdd

And creating the mdadm RAID-5 system:

# mdadm --create --verbose --assume-clean --level=5 --raid-devices=3 /dev/md/raid5 /dev/mapper/sdb /dev/mapper/sdc /dev/mapper/sdd
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: chunk size defaults to 512K
mdadm: size set to 154882048K
mdadm: automatically enabling write-intent bitmap on large pool
mdadm: Defaulting to version 1.2 metadata
mdadm: pool /dev/md/raid5 started.

Then create the ext4 filesystem on top of that:

# mkfs.ext4 /dev/md/raid5

In the above I have just used the defaults; I did not calculate the correct stripe width and stride for an mdadm RAID-5 setup.
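For completeness, the stride and stripe width could have been derived from the mdadm defaults reported above (512 KiB chunk) and the ext4 defaults (4 KiB block). A sketch of the calculation, not a command I actually ran:

```shell
# Stride/stripe-width for mkfs.ext4 on a 3-disk RAID-5 with the
# mdadm default 512 KiB chunk size and 4 KiB ext4 blocks.
chunk_kib=512
block_kib=4
data_disks=2   # 3 disks in RAID-5 = 2 data + 1 parity per stripe

stride=$((chunk_kib / block_kib))      # blocks per chunk = 128
stripe_width=$((stride * data_disks))  # blocks per full stripe = 256

echo "mkfs.ext4 -E stride=$stride,stripe-width=$stripe_width /dev/md/raid5"
```

These extended options let ext4 align its allocations to the RAID stripes, which mainly matters for write performance, not for the integrity behavior tested here.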

Time to get some status information:

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md127 : acti