btrfs Past

This post is about XFS but it’s about features that first hit Linux in btrfs, so we need to talk about btrfs for a bit first.

For a long time now, btrfs has had a useful feature called reflinks. Basically this is exposed as cp --reflink=always and takes advantage of extents and copy-on-write in order to do a quick copy of data by merely adding another reference to the extents that the data is currently using, rather than having to read all the data and write it out again, as would be the case in other filesystems.

Here’s an excerpt from the man page for cp:

When --reflink[=always] is specified, perform a lightweight copy, where the data blocks are copied only when modified. If this is not possible the copy fails, or if --reflink=auto is specified, fall back to a standard copy.

Without reflinks a common technique for making a quick copy of a file is the hardlink. Hardlinks have a number of disadvantages though, mainly due to the fact that since there is only one inode all hardlinked copies must have the same metadata (owner, group, permissions, etc.). Software that might modify the files also needs to be aware of hardlinks: naive modification of a hardlinked file modifies all copies of the file.
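The hardlink pitfalls are easy to reproduce. Here is a minimal sketch, assuming GNU coreutils (for stat -c) and a scratch directory from mktemp; the file names a and b are made up for illustration.

```shell
#!/bin/bash
# Sketch: hardlinked "copies" share one inode, so metadata changes and
# naive in-place writes through one name show up under every name.
# Assumes GNU coreutils (stat -c); the paths are throwaway examples.
set -eu
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT

echo "original" > "$dir/a"
ln "$dir/a" "$dir/b"            # hardlink: a second name for the same inode

# Both names report the same inode number and a link count of 2.
inode_a=$(stat -c %i "$dir/a")
inode_b=$(stat -c %i "$dir/b")
links=$(stat -c %h "$dir/a")
echo "inodes: $inode_a $inode_b, link count: $links"

# Metadata lives in the inode, so chmod on one name affects the other.
chmod 600 "$dir/a"
mode_b=$(stat -c %a "$dir/b")
echo "mode of b after chmod on a: $mode_b"

# A naive in-place write via one name is visible via the other.
echo "modified" > "$dir/a"
content_b=$(cat "$dir/b")
echo "content of b after writing a: $content_b"
```

Software that wants to change only one of the names has to write a new file and rename it into place, which is the "awareness" mentioned above.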

With reflinks, life becomes much easier:

Each copy has its own inode so can have different metadata. Only the data extents are shared.

The filesystem ensures that any write causes a copy-on-write, so applications don’t need to do anything special.

Space is saved on a per-extent basis so changing one extent still allows all the other extents to remain shared. A change to a hardlinked file requires a new copy of the whole file.
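The same experiment with a reflink-style copy behaves quite differently. This is only a sketch, assuming GNU cp; it uses --reflink=auto so that it also runs on filesystems without reflink support (where it silently falls back to a standard copy), and the file names are again made up.

```shell
#!/bin/bash
# Sketch: a reflink-style copy gets its own inode, so metadata and later
# writes are independent of the original.
# Assumes GNU cp. --reflink=auto falls back to a normal copy where
# reflinks are unsupported; use --reflink=always to insist on sharing.
set -eu
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT

echo "original" > "$dir/a"
cp --reflink=auto "$dir/a" "$dir/b"

inode_a=$(stat -c %i "$dir/a")
inode_b=$(stat -c %i "$dir/b")
echo "inodes: $inode_a $inode_b"      # two different inode numbers

chmod 600 "$dir/b"                    # only the metadata of b changes
echo "modified" > "$dir/b"            # only the data of b changes
content_a=$(cat "$dir/a")
echo "content of a after writing b: $content_a"
```

No special handling is needed in the application: the write to b simply lands in new blocks while a keeps its own data.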

Another feature that extents and copy-on-write allow is block-level out-of-band deduplication.

Deduplication – the technique of finding and removing duplicate copies of data.

Block-level – operating on the blocks of data on storage, not just whole files.

Out-of-band – something that happens only when triggered or scheduled, not automatically as part of the normal operation of the filesystem.

btrfs has an ioctl that a userspace program can use—presumably after finding a sequence of blocks that are identical—to tell the kernel to turn one into a reference to the other, thus saving some space.

It’s necessary that the kernel does it so that any IO going on at the same time that may modify the data can be dealt with. Modifications after the data is reflinked will just cause a copy-on-write. If you tried to do it all in a userspace app then you’d risk something else modifying the files at the same time, but with the kernel doing it, in theory it becomes completely safe to do at any time. The kernel also checks that the sequence of extents really are identical.
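From the shell, one way to issue such a request by hand is the dedupe command of xfs_io (part of xfsprogs). The sketch below is an assumption about how one might wrap it, not how any particular deduplication tool works: dedupe_whole_file is a hypothetical helper name, and the cmp pre-check is only there to avoid pointless requests, since the kernel does its own comparison regardless.

```shell
#!/bin/bash
# Sketch: ask the kernel to share the extents of two whole files, but only
# bother asking when a cheap userspace comparison says they look identical.
# dedupe_whole_file is a hypothetical helper. It assumes GNU stat, xfs_io
# from xfsprogs, a filesystem with reflink support, and paths without
# whitespace (xfs_io splits its -c argument on spaces).
dedupe_whole_file() {
    local src=$1 dst=$2 len
    # Userspace pre-check only. It can race with concurrent writers, which
    # is exactly why the kernel re-verifies the ranges before sharing them.
    cmp -s "$src" "$dst" || return 1
    len=$(stat -c %s "$src")
    # In $dst, map $len bytes at offset 0 onto the blocks backing $src at
    # offset 0. The kernel refuses if the contents differ.
    xfs_io -c "dedupe $src 0 0 $len" "$dst"
}
```

It would be called as dedupe_whole_file /path/to/original /path/to/duplicate. A real tool would work on smaller ranges and check how much the kernel actually deduplicated on each call.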

In-band deduplication is a feature that’s being worked on in btrfs. It already exists in ZFS though, and there it is rarely recommended for use as it requires a huge amount of memory for keeping hashes of data that has been written. It’s going to be the same story with btrfs, so out-of-band deduplication is still something that will remain useful. And it exists as a feature right now, which is always a bonus.

XFS Future

So what has all this got to do with XFS?

Well, in recognition that there might be more than one Linux filesystem with extents and so that reflinks might be more generally useful, the extent-same ioctl got lifted up to be in the VFS layer of the kernel instead of just in btrfs. And the good news is that XFS recently became able to make use of it.

When I say “recently” I do mean really recently. I mean like kernel release 4.9.1 which came out on 2017-01-04. At the moment it comes with massive EXPERIMENTAL warnings, requires a new filesystem to be created with a special format option, and will need an xfsprogs compiled from recent git in order to have a mkfs.xfs that can create such a filesystem.

So before going further, I’m going to assume you’ve compiled a new enough kernel and booted into it, then compiled up a new enough xfsprogs. Both of these are quite simple things to do; for example, the Debian documentation for building kernel packages from upstream code works fine.

XFS Reflink Demo

Make yourself a new filesystem, with the reflink=1 format option.

    # mkfs.xfs -L reflinkdemo -m reflink=1 /dev/xvdc
    meta-data=/dev/xvdc              isize=512    agcount=4, agsize=3276800 blks
             =                       sectsz=512   attr=2, projid32bit=1
             =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=1
    data     =                       bsize=4096   blocks=13107200, imaxpct=25
             =                       sunit=0      swidth=0 blks
    naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
    log      =internal log           bsize=4096   blocks=6400, version=2
             =                       sectsz=512   sunit=0 blks, lazy-count=1
    realtime =none                   extsz=4096   blocks=0, rtextents=0
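If you later want to check whether an existing filesystem was made with this option, the same geometry text can be searched for reflink=1. A small sketch follows; has_reflink is a hypothetical helper name, and the sample line is the relevant one from the mkfs.xfs output above.

```shell
#!/bin/bash
# Sketch: decide from xfs_info / mkfs.xfs geometry text whether the
# filesystem was created with reflink support.
# has_reflink is a hypothetical helper; it reads the geometry on stdin.
has_reflink() {
    grep -Eq '(^|[ ,])reflink=1([ ,]|$)'
}

# Example input: the crc= line from the mkfs.xfs output.
sample='         =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=1'
if printf '%s\n' "$sample" | has_reflink; then
    result="reflink enabled"
else
    result="reflink not enabled"
fi
echo "$result"
```

On a mounted filesystem the intended use would be along the lines of xfs_info /mnt/xfs | has_reflink, assuming an xfsprogs new enough to print the reflink field.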

Put it in /etc/fstab for convenience, and mount it somewhere.

    # echo "LABEL=reflinkdemo /mnt/xfs xfs relatime 0 2" >> /etc/fstab
    # mkdir -vp /mnt/xfs
    mkdir: created directory ‘/mnt/xfs’
    # mount /mnt/xfs
    # df -h /mnt/xfs
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvdc        50G  339M   50G   1% /mnt/xfs

Create a few files with random data.

    # mkdir -vp /mnt/xfs/reflink
    mkdir: created directory ‘/mnt/xfs/reflink’
    # chown -c andy: /mnt/xfs/reflink
    changed ownership of ‘/mnt/xfs/reflink’ from root:root to andy:andy
    # exit
    $ for i in {1..5}; do
    > echo "Writing $i…"; dd if=/dev/urandom of=/mnt/xfs/reflink/$i bs=1M count=1024;
    > done
    Writing 1…
    1024+0 records in
    1024+0 records out
    1073741824 bytes (1.1 GB) copied, 4.34193 s, 247 MB/s
    Writing 2…
    1024+0 records in
    1024+0 records out
    1073741824 bytes (1.1 GB) copied, 4.33207 s, 248 MB/s
    Writing 3…
    1024+0 records in
    1024+0 records out
    1073741824 bytes (1.1 GB) copied, 4.33527 s, 248 MB/s
    Writing 4…
    1024+0 records in
    1024+0 records out
    1073741824 bytes (1.1 GB) copied, 4.33362 s, 248 MB/s
    Writing 5…
    1024+0 records in
    1024+0 records out
    1073741824 bytes (1.1 GB) copied, 4.32859 s, 248 MB/s
    $ df -h /mnt/xfs
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvdc        50G  5.4G   45G  11% /mnt/xfs
    $ du -csh /mnt/xfs
    5.0G    /mnt/xfs
    5.0G    total

Copy a file and as expected usage will go up by 1GiB. And it will take a little while, even on my nice fast SSDs.

    $ time cp -v /mnt/xfs/reflink/{,copy_}1
    ‘/mnt/xfs/reflink/1’ -> ‘/mnt/xfs/reflink/copy_1’

    real    0m3.420s
    user    0m0.008s
    sys     0m0.676s
    $ df -h /mnt/xfs; du -csh /mnt/xfs/reflink
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvdc        50G  6.4G   44G  13% /mnt/xfs
    6.0G    /mnt/xfs/reflink
    6.0G    total

So what about a reflink copy?

    $ time cp -v --reflink=always /mnt/xfs/reflink/{,reflink_}1
    ‘/mnt/xfs/reflink/1’ -> ‘/mnt/xfs/reflink/reflink_1’

    real    0m0.003s
    user    0m0.000s
    sys     0m0.004s
    $ df -h /mnt/xfs; du -csh /mnt/xfs/reflink
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvdc        50G  6.4G   44G  13% /mnt/xfs
    7.0G    /mnt/xfs/reflink
    7.0G    total

The apparent usage went up by 1GiB but the amount of free space as shown by df stayed the same. No more actual storage was used because the new copy is a reflink. And the copy got done in 3ms as opposed to 3,420ms.

Can we tell more about how these files are laid out? Yes, we can use the filefrag -v command to tell us more.

    $ filefrag -v /mnt/xfs/reflink/{,copy_,reflink_}1
    Filesystem type is: 58465342
    File size of /mnt/xfs/reflink/1 is 1073741824 (262144 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
    /mnt/xfs/reflink/1: 1 extent found
    File size of /mnt/xfs/reflink/copy_1 is 1073741824 (262144 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..  262143:     917508..   1179651: 262144:             last,eof
    /mnt/xfs/reflink/copy_1: 1 extent found
    File size of /mnt/xfs/reflink/reflink_1 is 1073741824 (262144 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
    /mnt/xfs/reflink/reflink_1: 1 extent found

What we can see here is that all three files are composed of a single extent which is 262,144 4KiB blocks in size, but it also tells us that /mnt/xfs/reflink/1 and /mnt/xfs/reflink/reflink_1 are using the same range of physical blocks: 1572884..1835027.
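Comparing those ranges by eye gets tedious with more than a few extents, so here is a rough sketch of pulling them out mechanically. physical_ranges is a hypothetical helper name; it assumes the filefrag -v column layout shown above, and the two sample lines are copied from that output.

```shell
#!/bin/bash
# Sketch: extract the physical block ranges from `filefrag -v` output so
# that two files can be compared mechanically.
# physical_ranges is a hypothetical helper; it reads filefrag -v output on
# stdin and prints one "start-end" line per extent.
physical_ranges() {
    awk '/^ *[0-9]+:/ {
        gsub(/\.\.|:/, " ")      # turn the ".." and ":" separators into spaces
        print $4 "-" $5          # fields: ext lstart lend pstart pend length
    }'
}

# Sample extent lines for 1 and copy_1, taken from the output above.
orig='   0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof'
copy='   0:        0..  262143:     917508..   1179651: 262144:             last,eof'

range_orig=$(printf '%s\n' "$orig" | physical_ranges)
range_copy=$(printf '%s\n' "$copy" | physical_ranges)
echo "1:      $range_orig"
echo "copy_1: $range_copy"
```

Against real files this would be fed with filefrag -v FILE | physical_ranges; two files whose output is identical are using the same physical blocks.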

XFS Deduplication Demo

We’ve demonstrated that you can use cp --reflink=always to take a cheap copy of your data, but what about data that may already be duplicates without your knowledge? Is there any way to take advantage of the extent-same ioctl for deduplication?

There are a couple of software solutions for out-of-band deduplication in btrfs, but one I know that also works on XFS is duperemove. You will need to use a git checkout of duperemove for this to work.

A quick reminder of the storage use before we start.

    $ df -h /mnt/xfs; du -csh /mnt/xfs/reflink
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvdc        50G  6.4G   44G  13% /mnt/xfs
    7.0G    /mnt/xfs/reflink
    7.0G    total
    $ filefrag -v /mnt/xfs/reflink/{,copy_,reflink_}1
    Filesystem type is: 58465342
    File size of /mnt/xfs/reflink/1 is 1073741824 (262144 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
    /mnt/xfs/reflink/1: 1 extent found
    File size of /mnt/xfs/reflink/copy_1 is 1073741824 (262144 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..  262143:     917508..   1179651: 262144:             last,eof
    /mnt/xfs/reflink/copy_1: 1 extent found
    File size of /mnt/xfs/reflink/reflink_1 is 1073741824 (262144 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
    /mnt/xfs/reflink/reflink_1: 1 extent found

Run duperemove.

    # duperemove -hdr --hashfile=/var/tmp/dr.hash /mnt/xfs/reflink
    Using 128K blocks
    Using hash: murmur3
    Gathering file list...
    Adding files from database for hashing.
    Loading only duplicated hashes from hashfile.
    Using 2 threads for dedupe phase
    Kernel processed data (excludes target files): 4.0G
    Comparison of extent info shows a net change in shared extents of: 1.0G
    $ df -h /mnt/xfs; du -csh /mnt/xfs/reflink
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvdc        50G  5.4G   45G  11% /mnt/xfs
    7.0G    /mnt/xfs/reflink
    7.0G    total
    $ filefrag -v /mnt/xfs/reflink/{,copy_,reflink_}1
    Filesystem type is: 58465342
    File size of /mnt/xfs/reflink/1 is 1073741824 (262144 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
    /mnt/xfs/reflink/1: 1 extent found
    File size of /mnt/xfs/reflink/copy_1 is 1073741824 (262144 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
    /mnt/xfs/reflink/copy_1: 1 extent found
    File size of /mnt/xfs/reflink/reflink_1 is 1073741824 (262144 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
    /mnt/xfs/reflink/reflink_1: 1 extent found

The output of du remained the same, but df says that there’s now 1GiB more free space, and filefrag confirms that what’s changed is that copy_1 now uses the same extents as 1 and reflink_1. The duplicate data in copy_1, which in theory we did not know was there, has been discovered and safely reference-linked to the extent from 1, saving us 1GiB of storage.

By the way, I told duperemove to use a hash file because otherwise it will keep its hashes in RAM. For the sake of 7 files that won’t matter, but it will if I have millions of files, so it’s a habit I’ve got into. It also uses that hash file to avoid having to repeatedly re-hash files that haven’t changed.

All that has been demonstrated so far though is whole-file deduplication, as copy_1 was just a regular copy of 1. What about when a file is only partially composed of duplicate data? Well okay.

    $ cat /mnt/xfs/reflink/{1,2} > /mnt/xfs/reflink/1_2
    $ ls -lah /mnt/xfs/reflink/{1,2,1_2}
    -rw-r--r-- 1 andy andy 1.0G Jan 10 15:41 /mnt/xfs/reflink/1
    -rw-r--r-- 1 andy andy 2.0G Jan 10 16:55 /mnt/xfs/reflink/1_2
    -rw-r--r-- 1 andy andy 1.0G Jan 10 15:41 /mnt/xfs/reflink/2
    $ df -h /mnt/xfs; du -csh /mnt/xfs/reflink
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvdc        50G  7.4G   43G  15% /mnt/xfs
    9.0G    /mnt/xfs/reflink
    9.0G    total
    $ filefrag -v /mnt/xfs/reflink/{1,2,1_2}
    Filesystem type is: 58465342
    File size of /mnt/xfs/reflink/1 is 1073741824 (262144 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
    /mnt/xfs/reflink/1: 1 extent found
    File size of /mnt/xfs/reflink/2 is 1073741824 (262144 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..  262127:         20..    262147: 262128:
       1:   262128..  262143:    2129908..   2129923:     16:     262148: last,eof
    /mnt/xfs/reflink/2: 2 extents found
    File size of /mnt/xfs/reflink/1_2 is 2147483648 (524288 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..  262127:     262164..    524291: 262128:
       1:   262128..  524287:     655380..    917539: 262160:     524292: last,eof
    /mnt/xfs/reflink/1_2: 2 extents found

I’ve concatenated 1 and 2 together into a file called 1_2 and as expected, usage goes up by 2GiB. filefrag confirms that the physical extents in 1_2 are new. We should be able to do better because this 1_2 file does not contain any new unique data.

    $ duperemove -hdr --hashfile=/var/tmp/dr.hash /mnt/xfs/reflink
    Using 128K blocks
    Using hash: murmur3
    Gathering file list...
    Adding files from database for hashing.
    Using 2 threads for file hashing phase
    Kernel processed data (excludes target files): 4.0G
    Comparison of extent info shows a net change in shared extents of: 3.0G
    $ df -h /mnt/xfs; du -csh /mnt/xfs/reflink
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvdc        50G  5.4G   45G  11% /mnt/xfs
    9.0G    /mnt/xfs/reflink
    9.0G    total

We can. Apparent usage stays at 9GiB but real usage went back to 5.4GiB which is where we were before we created 1_2.

And the physical layout of the files?

    $ filefrag -v /mnt/xfs/reflink/{1,2,1_2}
    Filesystem type is: 58465342
    File size of /mnt/xfs/reflink/1 is 1073741824 (262144 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..  262143:    1572884..   1835027: 262144:             last,shared,eof
    /mnt/xfs/reflink/1: 1 extent found
    File size of /mnt/xfs/reflink/2 is 1073741824 (262144 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..  262127:         20..    262147: 262128:             shared
       1:   262128..  262143:    2129908..   2129923:     16:     262148: last,shared,eof
    /mnt/xfs/reflink/2: 2 extents found
    File size of /mnt/xfs/reflink/1_2 is 2147483648 (524288 blocks of 4096 bytes)
     ext:     logical_offset:        physical_offset: length:   expected: flags:
       0:        0..  262143:    1572884..   1835027: 262144:             shared
       1:   262144..  524271:         20..    262147: 262128:    1835028: shared
       2:   524272..  524287:    2129908..   2129923:     16:     262148: last,shared,eof
    /mnt/xfs/reflink/1_2: 3 extents found

It shows that 1_2 is now made up from the same extents as 1 and 2 combined, as expected.

Less of the urandom

These synthetic demonstrations using a handful of 1GiB blobs of data from /dev/urandom are all very well, but what about something a little more like the real world?

Okay well let’s see what happens when I take ~30GiB of backup data created by rsnapshot on another host.

rsnapshot is a backup program which makes heavy use of hardlinks. It runs periodically and compares the previous backup data with the new. If they are identical then instead of storing an identical copy it makes a hardlink. This saves a lot of space but does have a lot of limitations as discussed previously.
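To make the idea concrete, here is a toy sketch of hardlink-based snapshotting. It is not rsnapshot’s actual code or algorithm, just an illustration of "hardlink if identical, otherwise store a new copy"; the directory names daily.0, daily.1 and incoming are made up for the example.

```shell
#!/bin/bash
# Toy sketch of hardlink-based snapshotting. This is NOT how rsnapshot is
# implemented; it only illustrates the "link if identical" idea.
# Directory names are illustrative.
set -eu
snap=$(mktemp -d)
trap 'rm -rf "$snap"' EXIT
mkdir "$snap/daily.1" "$snap/daily.0" "$snap/incoming"

echo "unchanged" > "$snap/daily.1/file"      # previous backup
echo "unchanged" > "$snap/incoming/file"     # freshly fetched data

if cmp -s "$snap/daily.1/file" "$snap/incoming/file"; then
    ln "$snap/daily.1/file" "$snap/daily.0/file"     # identical: share the inode
else
    cp "$snap/incoming/file" "$snap/daily.0/file"    # changed: store new content
fi

links=$(stat -c %h "$snap/daily.0/file")
echo "link count: $links"
```

Note that the decision is all-or-nothing per file: a file that differs by a single byte gets a complete second copy, which is the gap that extent-level deduplication can fill.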

This won’t be the best example. In some ways more duplication than usual is to be expected, because this data is composed of multiple backups of the same file trees. On the other hand there shouldn’t be as much, because any truly identical files have already been hardlinked together by rsnapshot. But it is a convenient source of real-world data.

So, starting state:

(I deleted all the reflink files)

    $ df -h /mnt/xfs; sudo du -csh /mnt/xfs/rsnapshot
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvdc        50G   30G   21G  59% /mnt/xfs
    29G     /mnt/xfs/rsnapshot
    29G     total

A small diversion about how rsnapshot lays out its backups may be useful here. They are stored like this:

rsnapshot_root / [iteration a] / [client foo] / [directory structure from client foo]

rsnapshot_root / [iteration a] / [client bar] / [directory structure from client bar]

…

…

rsnapshot_root / [iteration b] / [client foo] / [directory structure from client foo]

rsnapshot_root / [iteration b] / [client bar] / [directory structure from client bar]

The iterations are commonly things like daily.0, daily.1 … daily.6. As a consequence, the paths:

rsnapshot/daily.*/client_foo

would be backups only from host foo, and:

rsnapshot/daily.0/*

would be backups from all hosts but only the most recent daily sync.
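The two patterns can be tried out safely on an empty mock of that layout. This is only a sketch: the client names foo and bar and the three daily.N iterations are illustrative, and the directories are created under a scratch directory from mktemp.

```shell
#!/bin/bash
# Sketch: build an empty mock of the rsnapshot layout and show what the two
# glob patterns select. Client and iteration names are illustrative only.
set -eu
root=$(mktemp -d)
trap 'rm -rf "$root"' EXIT
for i in 0 1 2; do
    mkdir -p "$root/rsnapshot/daily.$i/foo" "$root/rsnapshot/daily.$i/bar"
done
cd "$root"

one_client=$(echo rsnapshot/daily.*/foo)   # every iteration, one client
one_iter=$(echo rsnapshot/daily.0/*)       # one iteration, every client
echo "$one_client"
echo "$one_iter"
```

The first pattern expands to the foo directory of each of the three iterations, the second to both client directories under daily.0.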

Let’s first see what the savings would be like in looking for duplicates in just one client’s backups.

Here are the backups I have in this blob of data. The names of the clients are completely made up, though they are real backups.

    Client    Size (MiB)
    darbee        14,504
    achorn        11,297
    spader         2,612
    reilly         2,276
    chino          2,203
    audun          2,184

So let’s try deduplicating all of the backups of the biggest one, darbee:

    $ df -h /mnt/xfs
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvdc        50G   30G   21G  59% /mnt/xfs
    # time duperemove -hdr --hashfile=/var/tmp/dr.hash /mnt/xfs/rsnapshot/*/darbee
    Using 128K blocks
    Using hash: murmur3
    Gathering file list...
    Kernel processed data (excludes target files): 8.8G
    Comparison of extent info shows a net change in shared extents of: 6.8G
    9.85user 78.70system 3:27.23elapsed 42%CPU (0avgtext+0avgdata 23384maxresident)k
    50703656inputs+790184outputs (15major+20912minor)pagefaults 0swaps
    $ df -h /mnt/xfs
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvdc        50G   25G   26G  50% /mnt/xfs

3m27s of run time, somewhere between 5 and 6.8GiB saved. Even taking the lower figure, that’s about 35% of darbee’s 14,504MiB!

Now to deduplicate the lot.

    # time duperemove -hdr --hashfile=/var/tmp/dr.hash /mnt/xfs/rsnapshot
    Using 128K blocks
    Using hash: murmur3
    Gathering file list...
    Kernel processed data (excludes target files): 5.4G
    Comparison of extent info shows a net change in shared extents of: 3.4G
    29.12user 188.08system 5:02.31elapsed 71%CPU (0avgtext+0avgdata 34040maxresident)k
    34978360inputs+572128outputs (18major+45094minor)pagefaults 0swaps
    $ df -h /mnt/xfs
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvdc        50G   23G   28G  45% /mnt/xfs

5m02s of run time this time, and another 2–3.4GiB saved.

Since the actual deduplication does take some time (the kernel having to read the extents, mainly), and most of it was already done in the first pass, a full pass would more likely take the sum of the times, i.e. more like 8m29s.

Still, a total of about 7GiB was saved, which is 23% of the 30GiB we started with.
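For anyone wanting to check the percentages, the arithmetic can be sketched as below. It assumes that the 35% quoted for darbee is the lower-bound 5GiB saving measured against darbee’s 14,504MiB, and that the 23% is 7GiB measured against the 30GiB that df reported at the start; the variable names are made up.

```shell
#!/bin/bash
# Worked arithmetic for the two percentages, using figures from the text.
# awk is used because shell arithmetic is integer-only.
# Assumption: 35% = 5GiB out of 14,504MiB; 23% = 7GiB out of 30GiB.
darbee_pct=$(awk 'BEGIN { printf "%.0f", 5 * 1024 / 14504 * 100 }')
total_pct=$(awk 'BEGIN { printf "%.0f", 7 / 30 * 100 }')
echo "darbee: ${darbee_pct}%  whole store: ${total_pct}%"
```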

It would be very interesting to try this on one of my much larger backup stores.

Why Not Just Use btrfs?

Using a filesystem that already has all of these features would certainly seem easier, but I personally don’t think btrfs is stable enough yet. I use it at home in a relatively unexciting setup (8 devices, raid1 for data and metadata, no compression or deduplication) and I wish I didn’t. I wouldn’t dream of using it in a production environment yet.

I’m on the btrfs mailing list and there are way too many posts regarding filesystems that give ENOSPC and become unavailable for writes, or systems that were unexpectedly powered off and when powered back on the btrfs filesystem is completely lost.

I expect the reflink feature in XFS to become non-experimental before btrfs is stable enough for production use.

ZFS is great. It doesn’t have out-of-band deduplication or reflinks though, and they don’t plan to any time soon.