Linux users increasingly rely on massive file systems to house their business-critical data sets. When I started working in IT and first entered data centers, colleagues would point at a rack containing a storage system and respectfully explain that it could hold one terabyte of data. Storage resources have grown by orders of magnitude over the last few years, now often spanning many terabytes or reaching petabyte scale.

What is bit rot?

Of course, we also expect to reliably read back what we wrote to storage. Hard disks and SSDs have an impressively low probability of returning different data to the upper layers than was written. With increasing storage capacity, however, the chance of getting wrong data back is rising. If a storage device, for example a spinning disk, cannot read a sector, it reports an I/O error to the upper layers.

When we get different data back from the disk than we wrote, that is called bit rot. What happens in that case? Will Red Hat Enterprise Linux notice? How can we deal with this situation?

The following steps use a RHEL 8 system. Even a small KVM guest is enough so you can try out these commands for yourself.

Will RHEL file systems detect bit rot?

By definition, with bit rot we get different data from the block device than we wrote. Thus, if an application like a database uses the block device directly, without a file system layer, it has to deal with bit rot itself.

Let’s look at bit rot on a block device with XFS, the default file system on RHEL 7 and RHEL 8. Instead of using a real hard disk and waiting for bit rot, we will change a single bit to simulate it.

Let’s start by generating a 128MB file consisting of zeros. We will then use losetup to make the file available as a block device, create an XFS file system on it, and mount it.

    # dd if=/dev/zero of=rawfile bs=1M count=128
    # MYLOOP=$(losetup -f --show rawfile)
    # echo $MYLOOP
    /dev/loop27
    # mkfs.xfs $MYLOOP
    # mount $MYLOOP /mnt

We will now store a single file on the file system, filled with the ASCII character y and newline characters. After unmounting the file system, we will look at some bytes at the offset of 50MB.

    # yes >/mnt/infile
    yes: standard output: No space left on device
    # md5sum /mnt/infile
    66e48b263b313703fce56a8a5a848eef  /mnt/infile
    # umount /mnt
    # losetup -d $MYLOOP
    # dd if=rawfile bs=1 count=10 skip=$((50*1024*1024)) 2>/dev/null|hexdump -vC
    00000000  79 0a 79 0a 79 0a 79 0a  79 0a                    |y.y.y.y.y.|
    0000000a

At offset 50MB, we have the character code for y, written in hexadecimal as 0x79 or in binary as 01111001. We will now flip the last bit to 0, which turns this into hex 0x78, the ASCII code for x. After changing the character, we verify what we wrote with hexdump.
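The arithmetic behind this bit flip can be checked in the shell. The following is only a small sketch; the variable names (orig, flipped, flipped_char, offset) are illustrative and not part of the walkthrough itself.

```shell
# ASCII code of 'y' (POSIX printf: a leading quote yields the character's code)
orig=$(printf '%d' "'y")

# Flip the least significant bit with XOR
flipped=$(( orig ^ 1 ))

# Convert the flipped code back to a character via an octal escape
flipped_char=$(printf "\\$(printf '%03o' "$flipped")")

printf 'original: 0x%x  flipped: 0x%x (%s)\n' "$orig" "$flipped" "$flipped_char"

# Byte offset that the dd commands use for "50MB"
offset=$(( 50 * 1024 * 1024 ))
printf 'byte offset: %d\n' "$offset"
```

Because yes writes y and newline alternately, the byte at this even offset is a y, which is why overwriting it with x changes exactly one bit.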

    # echo 'x' | dd of=rawfile bs=1 count=1 seek=$((50*1024*1024)) conv=notrunc
    # dd if=rawfile bs=1 count=10 skip=$((50*1024*1024)) 2>/dev/null|hexdump -vC
    00000000  78 0a 79 0a 79 0a 79 0a  79 0a                    |x.y.y.y.y.|
    0000000a

Now we can mount the file as a file system again and check whether the changed data gets noticed.

    # MYLOOP=$(losetup -f --show rawfile)
    # mount $MYLOOP /mnt/
    # md5sum /mnt/infile
    adf92be755d095e898281b1146be72ce  /mnt/infile

The checksum changed. In other words, we are getting different data from our underlying block device. Nothing at this point hints at the data having changed: the supported file systems in RHEL, like Ext4 and XFS, do not checksum file data. As of RHEL 8, both protect their metadata, the structures they use for their own "housekeeping", with checksums, but this does not cover the actual data.

For more sophisticated error injection, dm-dust can be considered. It emulates the behaviour of bad sectors. It is currently not in RHEL, but it is available in Fedora, for example.
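Since the file system does not checksum file data, one option is to record checksums at the application level and verify them later. The sketch below is an illustration under assumptions, not something RHEL does for us: it works on a small scratch file in a temporary directory, and the names workdir and datafile are made up for the example.

```shell
workdir=$(mktemp -d)
datafile="$workdir/datafile"

# Create an 8 KiB sample file filled with the byte 'y'
head -c 8192 /dev/zero | tr '\0' 'y' > "$datafile"

# Record a checksum while the data is known to be good
sha256sum "$datafile" > "$datafile.sha256"

# Simulate bit rot: overwrite a single byte in place
printf 'x' | dd of="$datafile" bs=1 count=1 seek=4096 conv=notrunc 2>/dev/null

# Verify the file against the recorded checksum
if sha256sum -c "$datafile.sha256" >/dev/null 2>&1; then
    result=intact
else
    result=corrupt
fi
echo "datafile is $result"
```

This only tells us that something in the file changed, and only when we explicitly run the check. It neither locates the damaged block nor protects reads made by other programs.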

How can I detect bit rot?

With RHEL 8 and later, we can detect bit rot thanks to the dm-integrity kernel code, which uses checksums to detect changed data. Let’s repeat our bit rot experiment with dm-integrity as an additional layer.

    # dd if=/dev/zero of=rawfile bs=1M count=128
    # MYLOOP=$(losetup -f --show rawfile)
    # integritysetup format $MYLOOP
    # integritysetup open $MYLOOP mydata
    # mkfs.xfs /dev/mapper/mydata
    # mount /dev/mapper/mydata /mnt
    # yes >/mnt/infile
    # md5sum /mnt/infile
    13e14c50aaf2054d987663ed31b5f786  /mnt/infile
    # umount /mnt/
    # losetup -d $MYLOOP
    # integritysetup close mydata
    # dd if=rawfile bs=1 count=10 skip=$((50*1024*1024)) 2>/dev/null|hexdump -vC
    00000000  79 0a 79 0a 79 0a 79 0a  79 0a                    |y.y.y.y.y.|
    0000000a
    # echo 'x' | dd of=rawfile bs=1 count=1 seek=$((50*1024*1024)) conv=notrunc
    # dd if=rawfile bs=1 count=10 skip=$((50*1024*1024)) 2>/dev/null|hexdump -vC
    00000000  78 0a 79 0a 79 0a 79 0a  79 0a                    |x.y.y.y.y.|
    0000000a

Now let’s see how we do with dm-integrity as an additional layer:

    # MYLOOP=$(losetup -f --show rawfile)
    # integritysetup open $MYLOOP mydata
    # mount /dev/mapper/mydata /mnt
    # md5sum /mnt/infile
    md5sum: /mnt/infile: Input/output error

What happened here? dm-integrity noticed that the underlying data had changed and notified the layer above with an I/O error. Without dm-integrity, the changed bit went unnoticed. With it, we get an I/O error. When using cp, we will see that it copies the first part of the file and stops when it receives the I/O error. With the following command, we can ask dd to continue reading the file despite I/O errors:
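To get a feeling for how checksums can pin a change down to a single block, here is a userspace sketch. It is only an illustration: dm-integrity does this work in the kernel with its own on-disk format, whereas this example keeps SHA-256 sums in a plain text file. The 4096-byte block size, the 8-block image and all names (img, sums, block_sums) are assumptions chosen for the example.

```shell
workdir=$(mktemp -d)
img="$workdir/blocks.img"
sums="$workdir/blocks.sums"
bs=4096
nblocks=8

# Build an 8-block image filled with the byte 'y'
head -c $(( bs * nblocks )) /dev/zero | tr '\0' 'y' > "$img"

# Print one checksum per block of the given image
block_sums() {
    i=0
    while [ "$i" -lt "$nblocks" ]; do
        dd if="$1" bs="$bs" skip="$i" count=1 2>/dev/null | sha256sum | cut -d' ' -f1
        i=$(( i + 1 ))
    done
}

# Record the checksums while the data is known to be good
block_sums "$img" > "$sums"

# Simulate bit rot inside block 5
printf 'x' | dd of="$img" bs=1 count=1 seek=$(( 5 * bs + 10 )) conv=notrunc 2>/dev/null

# Re-read every block and compare it with the stored checksum
bad_blocks=$(block_sums "$img" | paste -d' ' - "$sums" | awk '$1 != $2 { print NR - 1 }')
echo "blocks failing verification: $bad_blocks"
```

A layer that verifies each block on read like this can refuse to hand out the damaged block, which is what we observe as the I/O error above.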

    # dd if=/mnt/infile of=/tmp/infile.copy conv=noerror

The ddrescue utility, which is not part of RHEL, can be used to create a copy of the intact data, leaving a "hole" in the destination file that is exactly as big as the data that could not be read. ddrescue works on sectors, so in our example the destination file has a hole of 4096 bytes.

How is bit rot handled, in general?

The bit rot topic is known across the industry and approached in various software projects.

dm-integrity is able to detect bit rot. As a device mapper layer, it can be used with various other layers on top, like file systems, LUKS, LVM or compression layers.

File systems like Btrfs and ZFS also address bit rot. Btrfs checksums data and can therefore detect bit rot, but for data stored without a redundant copy it cannot correct the error automatically.

Ceph is a distributed storage system. With the BlueStore backend, all data and metadata written to BlueStore are protected by one or more checksums by default. Data and metadata read from disk are verified before being handed to the user. So corruption is detected and reported when data is read, but it is not fixed automatically.

Summary and the future

In this post, we have looked at what bit rot is and how to detect it with dm-integrity. File systems like Btrfs and ZFS also aim at dealing with bit rot, but are, for various reasons, not available in RHEL.

The checksumming done by dm-integrity leads to more I/O on the underlying block device and to less usable space, because the checksums have to be stored as well. As an alternative to setting it up with integritysetup, dm-integrity can be used together with LUKS, which also provides encryption.

In this post, we have not looked at fixing bit rot-induced errors. For this, data should be stored multiple times, for example with dm-integrity directly on top of disks, and on top of that RAID1. If bit rot is detected, the healthy half of the mirror can be used.
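The idea of using the healthy half of a mirror can be sketched in userspace. This is not how mdadm or LVM implement RAID1 on dm-integrity. It is a simplified illustration in which two plain files stand in for the mirror halves and a text file holds per-block SHA-256 sums. The block size and all names (copy_a, copy_b, block_sum) are assumptions made for the example.

```shell
workdir=$(mktemp -d)
bs=4096
nblocks=8
copy_a="$workdir/mirror_a.img"
copy_b="$workdir/mirror_b.img"

# Two identical copies, standing in for the two halves of a mirror
head -c $(( bs * nblocks )) /dev/zero | tr '\0' 'y' > "$copy_a"
cp "$copy_a" "$copy_b"

# Checksum of block number $2 of file $1
block_sum() {
    dd if="$1" bs="$bs" skip="$2" count=1 2>/dev/null | sha256sum | cut -d' ' -f1
}

# Record the per-block checksums while the data is known to be good
i=0
while [ "$i" -lt "$nblocks" ]; do
    block_sum "$copy_a" "$i"
    i=$(( i + 1 ))
done > "$workdir/sums"

# Simulate bit rot in block 3 of the first copy
printf 'x' | dd of="$copy_a" bs=1 count=1 seek=$(( 3 * bs )) conv=notrunc 2>/dev/null

# Verify each block of copy A; on a mismatch, restore the block
# from copy B, but only if B's block still matches the checksum
repaired=0
i=0
while read -r expected; do
    if [ "$(block_sum "$copy_a" "$i")" != "$expected" ] &&
       [ "$(block_sum "$copy_b" "$i")" = "$expected" ]; then
        dd if="$copy_b" of="$copy_a" bs="$bs" skip="$i" seek="$i" count=1 conv=notrunc 2>/dev/null
        repaired=$(( repaired + 1 ))
    fi
    i=$(( i + 1 ))
done < "$workdir/sums"
echo "repaired blocks: $repaired"
```

The checksum is what makes the repair safe: without it, a mirror only knows that the two halves differ, not which of them is correct.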

Regarding support status: dm-integrity is not labeled as Technology Preview; it is supported. Stacks with RAID1 on top, built with mdadm or LVM RAID, are not yet widely tested or recommended for production.

Related to our topic are DIF and DIX, the Data Integrity Field and Data Integrity Extension. DIF is a standard for computing a checksum and storing it on the disk, so that the integrity of the stored data can be verified. DIX uses checksums to protect data while it traverses the various storage layers of a system. It is intended to be implemented by the storage at the bottom and the application at the top, to detect issues in between. This kbase solution has details.

Detecting bit rot solves half the problem, but fixing it is more than half of the work. What's left is to build an automated way to use RAID to fix corrupted data, using a known good copy (that is, one with a valid checksum) to recreate the corrupted segments on a different part of the disk. This part is still being worked on, at Red Hat and in the upstream communities.