How much space is this file taking from my hard drive? How much free space do I have? How many more files can I fit in the remaining free space?

The answer to these questions seems obvious. We all have an instinctive understanding of how filesystems work, and we often picture storing files in disk space in the same way as physically fitting apples inside a basket.

In modern Linux systems though, this intuition can be misleading. Let us see why.

File size

What is the size of a file? This one seems easy: the number of bytes of content from the beginning of the file until the end of the file.

We often picture all file contents laid out one byte after another until the end of the file.

And that’s how we commonly think about file size. We can get it with

ls -l file.c

, or the stat command that makes use of the stat() system call.

stat file.c

Inside the Linux kernel, the in-memory structure that represents a file is the inode. The metadata that we can access through the stat command lives in the inode:

struct inode {
    /* excluded content */
    loff_t              i_size;     /* file size */
    struct timespec     i_atime;    /* access time */
    struct timespec     i_mtime;    /* modification time */
    struct timespec     i_ctime;    /* change time */
    unsigned short      i_bytes;    /* bytes used (for quota) */
    unsigned int        i_blkbits;  /* block size = 1 << i_blkbits */
    blkcnt_t            i_blocks;   /* number of blocks used */
    /* excluded content */
};

We can see some familiar attributes, such as the access and modification timestamps, and also we can see i_size, which is the file size as we defined earlier.

Thinking in terms of the file size is intuitive, but we are more interested in how space is actually used.

Blocks and block size

Regarding how the file is stored internally, the filesystem divides storage into blocks. Traditionally the block size was 512 bytes, and more recently 4 KiB. This value is chosen based on the page size supported by typical MMU hardware.

The filesystem inserts our chunked file into those blocks, and keeps track of them in the metadata.

This ideally looks like this

, but in practice files are constantly created, resized and destroyed, and it looks more like this

This is known as external fragmentation, and traditionally results in performance degradation because the read/write head of a spinning hard drive has to jump around gathering the fragments, which is a slow operation. Classic defragmentation tools try to keep this problem at bay.

What happens with files smaller than 4 KiB? What happens with the contents of the last block after we have cut our file into pieces? Naturally there is going to be wasted space there; we call that phenomenon internal fragmentation. This is an undesirable side effect that can render a lot of free space unusable, much more so when we have a large number of very small files.
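The arithmetic behind internal fragmentation can be sketched in a few lines of Python. This is only an illustration that assumes a 4 KiB block size and a filesystem that always hands out whole blocks; the function names are made up for this example.

```python
BLOCK_SIZE = 4096  # assumed block size in bytes (4 KiB)

def allocated_bytes(file_size, block_size=BLOCK_SIZE):
    """Bytes occupied on disk when a file always gets whole blocks."""
    blocks = -(-file_size // block_size)  # integer ceiling division
    return blocks * block_size

def wasted_bytes(file_size, block_size=BLOCK_SIZE):
    """Slack left unused in the last block (internal fragmentation)."""
    return allocated_bytes(file_size, block_size) - file_size

for size in (1, 4096, 4097, 10_000):
    print(size, allocated_bytes(size), wasted_bytes(size))
```

Under these assumptions a one-byte file wastes 4095 bytes, and a file of 4097 bytes needs two blocks, leaving almost a full block unused.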

We can see the real disk usage of the file with stat, or

ls -ls file.c

, or

du file.c

For example, this one-byte file still uses 4 KiB of disk space.

$ echo "" > file.c
$ ls -l file.c
-rw-r--r-- 1 nacho nacho 1 Apr 30 20:42 file.c
$ ls -ls file.c
4 -rw-r--r-- 1 nacho nacho 1 Apr 30 20:42 file.c
$ du file.c
4 file.c
$ dutree file.c
[ file.c 1 B ]
$ dutree -u file.c
[ file.c 4.00 KiB ]
$ stat file.c
  File: file.c
  Size: 1 Blocks: 8 IO Block: 4096 regular file
Device: 2fh/47d Inode: 2185244 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ nacho) Gid: ( 1000/ nacho)
Access: 2018-04-30 20:41:58.002124411 +0200
Modify: 2018-04-30 20:42:24.835458383 +0200
Change: 2018-04-30 20:42:24.835458383 +0200
 Birth: -

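Note that stat reports 8 blocks because it counts in 512-byte units: 8 × 512 B = 4 KiB, that is, a single 4 KiB filesystem block. We are therefore looking at two quantities, file size and blocks used. We tend to think in terms of the former, but we should think in terms of the latter.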
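Both quantities are also available programmatically through the same stat() call. The following Python sketch assumes Linux, where st_blocks is counted in 512-byte units; the helper name size_and_usage is invented for this example, and the allocated figure it prints depends on the filesystem holding the temporary file.

```python
import os
import tempfile

def size_and_usage(path):
    """Return (apparent size, allocated bytes) for a file.

    On Linux, st_blocks is expressed in 512-byte units regardless of the
    filesystem block size.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\n")          # same content as: echo "" > file.c
    tmp.flush()
    os.fsync(tmp.fileno())    # ask the kernel to write the data out
    path = tmp.name

size, used = size_and_usage(path)
print(f"size={size} B, allocated={used} B")
os.remove(path)
```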

Filesystem specific features

In addition to the actual contents of the file, the kernel needs to store all sorts of metadata. We have seen some of the metadata in the inode already, and there is more that is familiar to any Unix user, such as mode, ownership (uid and gid), flags, and ACLs.

struct inode {
    /* excluded content */
    umode_t             i_mode;
    unsigned short      i_opflags;
    kuid_t              i_uid;
    kgid_t              i_gid;
    unsigned int        i_flags;
    /* excluded content */
};

There are also other structures such as the superblock that represents the filesystem itself, vfsmount that represents the mountpoint, redundancy information, namespaces and more. Some of this metadata can also take up some significant space, as we’ll see.

Block allocation metadata

This depends heavily on the filesystem that we are using, as each one keeps track of which blocks correspond to a file in its own way. The traditional ext2 way of doing this is through the i_block table of direct and indirect blocks.

, which can be found in the following on-disk structure

/*
 * Structure of an inode on the disk
 */
struct ext2_inode {
    __le16  i_mode;         /* File mode */
    __le16  i_uid;          /* Low 16 bits of Owner Uid */
    __le32  i_size;         /* Size in bytes */
    __le32  i_atime;        /* Access time */
    __le32  i_ctime;        /* Creation time */
    __le32  i_mtime;        /* Modification time */
    __le32  i_dtime;        /* Deletion Time */
    __le16  i_gid;          /* Low 16 bits of Group Id */
    __le16  i_links_count;  /* Links count */
    __le32  i_blocks;       /* Blocks count */
    __le32  i_flags;        /* File flags */
    /* excluded content */
    __le32  i_block[EXT2_N_BLOCKS]; /* Pointers to blocks */
    /* excluded content */
};
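To get a feeling for how much bookkeeping the direct and indirect scheme needs, here is a rough Python sketch. It assumes 4 KiB blocks, 4-byte block pointers, and the classic layout of 12 direct pointers followed by one single, one double and one triple indirect pointer. The function name is hypothetical and the model ignores everything else stored in the inode.

```python
def indirect_blocks_needed(data_blocks, block_size=4096, ptr_size=4):
    """Count the extra pointer blocks needed to map `data_blocks` data blocks
    with 12 direct pointers plus single, double and triple indirection."""
    ptrs = block_size // ptr_size          # pointers that fit in one block
    remaining = max(0, data_blocks - 12)   # direct pointers live in the inode
    overhead = 0
    # Levels 1, 2 and 3 can map ptrs, ptrs**2 and ptrs**3 data blocks.
    for level in (1, 2, 3):
        if remaining == 0:
            break
        covered = min(remaining, ptrs ** level)
        # A tree of depth `level` needs ceil(covered / ptrs**k) blocks
        # at each depth k = 1..level.
        for k in range(1, level + 1):
            overhead += -(-covered // ptrs ** k)
        remaining -= covered
    if remaining:
        raise ValueError("file too large for this scheme")
    return overhead

# A 1 GiB file is 262144 blocks of 4 KiB.
print(indirect_blocks_needed(262144))
```

In this model a file that fits in the 12 direct pointers needs no extra blocks, while the 1 GiB file needs 257 pointer blocks, roughly 1 MiB of metadata just to locate its data.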

As files get bigger, this scheme can produce a huge overhead because we have to track thousands of blocks for a single file. There is also a size limitation: with its 32-bit block numbers, ext3 can only address about 8 TiB using this mechanism. The ext3 developers have been keeping up with the times by supporting 48-bit block numbers, and by introducing extents.

struct ext3_extent {
    __le32  ee_block;       /* first logical block extent covers */
    __le16  ee_len;         /* number of blocks covered by extent */
    __le16  ee_start_hi;    /* high 16 bits of physical block */
    __le32  ee_start;       /* low 32 bits of physical block */
};

The concept is really simple: allocate contiguous blocks on disk and just annotate where the extent starts and how big it is. This way, we can allocate big groups of blocks to a file using far less metadata, and also benefit from faster sequential access and locality.
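The saving is easy to see with a toy example. The Python sketch below collapses a sorted list of physical block numbers into (start, length) pairs. It is a simplification for illustration only: a real extent also records the logical block it covers, as ee_block does above, and the function name is made up.

```python
def blocks_to_extents(blocks):
    """Collapse a sorted list of physical block numbers into
    (start, length) extents of contiguous blocks."""
    extents = []
    for block in blocks:
        if extents and block == extents[-1][0] + extents[-1][1]:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)   # extend the current run
        else:
            extents.append((block, 1))          # start a new extent
    return extents

print(blocks_to_extents([100, 101, 102, 103, 500, 501]))
print(len(blocks_to_extents(list(range(1000, 2000)))))
```

Six block pointers shrink to two extents, and a thousand contiguous blocks are described by a single one.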

For the curious, ext4 is backwards compatible, so it supports both methods: indirect method and extents method. To see how space is allocated, we can look at a write operation. Writes don’t go straight to storage, but first they land in the file cache for performance reasons. At some point, the cache writes back the information to persistent storage.

The filesystem cache is represented by the struct address_space, and the writepages operation will be called on it. The sequence looks like this

(cache writeback) ext4_aops-> ext4_writepages() -> ... -> ext4_map_blocks()

, at which point ext4_map_blocks() will call either ext4_ext_map_blocks() or ext4_ind_map_blocks(), depending on whether we are using extents or not. If we look at the former in extents.c, we’ll notice references to the notion of holes that we will cover in the next section.

Checksums

The latest generation of filesystems also stores checksums for the data blocks in order to fight silent data corruption. This lets them detect these random errors and, when a redundant copy is available, correct them. Of course, this also comes with a toll in disk usage that is proportional to the file size.

Only more modern systems such as BTRFS and ZFS support data checksums, but some older ones like ext4 have included metadata checksums.
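The idea of per-block checksums can be sketched as follows. This Python example uses zlib.crc32 as a stand-in; it is not the algorithm or on-disk layout of any particular filesystem, and the 4 KiB block size and function names are assumptions for illustration.

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size

def block_checksums(data, block_size=BLOCK_SIZE):
    """One CRC-32 value per block of data."""
    return [zlib.crc32(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def find_corrupt_blocks(data, checksums, block_size=BLOCK_SIZE):
    """Indexes of blocks whose current checksum differs from the stored one."""
    current = block_checksums(data, block_size)
    return [i for i, (now, stored) in enumerate(zip(current, checksums))
            if now != stored]

data = bytes(range(256)) * 64          # 16 KiB, four blocks
stored = block_checksums(data)

damaged = bytearray(data)
damaged[5000] ^= 0x01                  # flip one bit inside block 1
print(find_corrupt_blocks(bytes(damaged), stored))
```

With a 4-byte checksum per 4 KiB block, as in this sketch, the space overhead is about 0.1% of the data. Detecting the bad block is all a checksum can do on its own; repairing it requires a second good copy.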

The journal

ext3 added journaling capabilities to ext2. The journal is a circular log that records transactions in progress in order to provide enhanced resilience against power failures. By default it only applies to metadata, but it can also be enabled for data with the data=journal option, at some performance cost.

The journal is a special hidden file, normally at inode number 8, with a typical size of 128 MiB, as the official documentation explains:

Introduced in ext3, the ext4 filesystem employs a journal to protect the filesystem against corruption in the case of a system crash. A small continuous region of disk (default 128MiB) is reserved inside the filesystem as a place to land “important” data writes on-disk as quickly as possible. Once the important data transaction is fully written to the disk and flushed from the disk write cache, a record of the data being committed is also written to the journal. At some later point in time, the journal code writes the transactions to their final locations on disk (this could involve a lot of seeking or a lot of small read-write-erases) before erasing the commit record. Should the system crash during the second slow write, the journal can be replayed all the way to the latest commit record, guaranteeing the atomicity of whatever gets written through the journal to the disk. The effect of this is to guarantee that the filesystem does not become stuck midway through a metadata update.

Tail packing

Also known as block suballocation, filesystems with this feature will make use of the tail space at the end of the last block, and share it between different files, effectively packing the tails in a single block.

While this is a nice feature that can save a lot of space, especially when we have a large number of small files (as explained above), it makes existing tools report disk usage inaccurately. We cannot just add up the used blocks of all our files to obtain the real disk usage.

Among the filesystems commonly used on Linux, only BTRFS and ReiserFS support this feature.
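To see how much tail packing could save, and why per-file block counts stop adding up, consider this idealized Python model. It assumes 4 KiB blocks and that all tails can be packed together perfectly, which is a best case rather than what any real filesystem achieves; the function names are hypothetical.

```python
def blocks_without_packing(sizes, block_size=4096):
    """Every file rounds up to whole blocks on its own."""
    return sum(-(-size // block_size) for size in sizes)

def blocks_with_packing(sizes, block_size=4096):
    """Full blocks stay per file; all the tails share blocks (best case)."""
    full = sum(size // block_size for size in sizes)
    tails = sum(size % block_size for size in sizes)
    return full + -(-tails // block_size)

small_files = [100] * 1000   # one thousand files of 100 bytes each
print(blocks_without_packing(small_files), blocks_with_packing(small_files))
```

In this model the thousand small files drop from 1000 blocks to 25, yet each of them still appears to use part of a block when inspected individually.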

Sparse files

Most modern filesystems have supported sparse files for a while. Sparse files can have holes in them that are not actually allocated to them and therefore don’t occupy any space. This time, the file size will be bigger than the block usage.

This can be really useful for things like generating “big” files really fast, or providing free space for a VM's virtual hard drive on demand. Weird things can now happen for the first time, such as running out of space on the host while we are using the virtual hard drive from within the virtual machine.

In order to slowly create a 10GiB file that uses around 10GiB of disk space we can do

$ dd if=/dev/zero of=file bs=2M count=5120

In order to create the same big file instantly we can just write the last byte, or even just

$ dd of=file-sparse bs=2M seek=5120 count=0

We can also use the truncate command like so

$ truncate -s 10G file-sparse
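The same effect can be reproduced from Python, which makes it easy to compare the apparent size with the space actually allocated. This is a sketch that assumes a Linux filesystem with sparse file support; it uses 64 MiB instead of 10 GiB to stay harmless if the filesystem does not create holes, and the helper names are invented for the example.

```python
import os
import tempfile

def make_sparse(path, size):
    """Create a file of `size` bytes without writing any data,
    much like `truncate -s`."""
    with open(path, "wb") as f:
        f.truncate(size)

def report(path):
    """Return (apparent size, allocated bytes); st_blocks is in 512-byte units."""
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "file-sparse")
    make_sparse(path, 64 * 1024 * 1024)
    apparent, allocated = report(path)

print(f"apparent={apparent} B, allocated={allocated} B")
```

On a filesystem that supports holes, the allocated figure should be a tiny fraction of the apparent size.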

We can modify disk space allocated to a file with the fallocate command that uses the fallocate() system call. With this syscall we can do more advanced things such as

Preallocate space for the file inserting zeroes. This will increase both disk usage and file size.

Deallocate space. This will dig a hole in the file, thus making it sparse and reducing disk usage without affecting file size.

Collapse space, making the file size and usage smaller.

Insert space, by inserting a hole into the file and shifting the existing data after it. This increases file size without affecting disk usage.

Zero a range. Where possible, this converts the range into unwritten extents, so that reads return zeroes without the zeroes being physically written.

For instance, we can dig holes in a file, thus making it sparse in place with

$ fallocate -d file
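Of the operations listed above, plain preallocation is the easiest one to try from a script. The Python sketch below uses os.posix_fallocate(), which wraps the portable posix_fallocate() function rather than the Linux-specific fallocate() system call, so the hole punching, collapsing and zeroing modes are not covered. It assumes a Unix system, and the helper name preallocate is made up.

```python
import os
import tempfile

def preallocate(path, length):
    """Reserve `length` bytes for the file starting at offset 0."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.posix_fallocate(fd, 0, length)
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "prealloc")
    preallocate(path, 1024 * 1024)
    st = os.stat(path)
    prealloc_size, prealloc_used = st.st_size, st.st_blocks * 512
    with open(path, "rb") as f:
        first_bytes = f.read(16)      # preallocated space reads back as zeroes

print(f"size={prealloc_size} B, allocated={prealloc_used} B")
```

Unlike the sparse file created with truncate, here the allocated figure is expected to grow together with the file size, although the exact number reported depends on the filesystem.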

The cp command supports working with sparse files. It tries to detect if the source file is sparse by some simple heuristics and then it makes the destination file sparse as well. We can copy a non-sparse file into a sparse copy with

$ cp --sparse=always file file_sparse

, or conversely make a solid copy of a sparse file with

$ cp --sparse=never file_sparse file

If you are convinced that you like working with sparse files, you can add this alias to your terminal environment

alias cp='cp --sparse=always'
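What a sparse-aware copy does can be approximated in a few lines. The following Python sketch is not how cp is implemented; it simply skips over blocks that contain only zeroes instead of writing them, assuming a 4 KiB block size. Whether the skipped regions really become holes depends on the destination filesystem.

```python
import os
import tempfile

def sparse_copy(src, dst, block_size=4096):
    """Copy src to dst, seeking over all-zero blocks instead of writing them."""
    zero = bytes(block_size)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(block_size)
            if not chunk:
                break
            if chunk == zero[:len(chunk)]:
                fout.seek(len(chunk), os.SEEK_CUR)   # leave a hole
            else:
                fout.write(chunk)
        fout.truncate(fin.tell())   # fix the size if the file ends in a hole

payload = b"A" * 4096 + bytes(8192) + b"B" * 100 + bytes(5000)
with tempfile.TemporaryDirectory() as tmpdir:
    src = os.path.join(tmpdir, "file")
    dst = os.path.join(tmpdir, "file_sparse")
    with open(src, "wb") as f:
        f.write(payload)
    sparse_copy(src, dst)
    with open(dst, "rb") as f:
        copied = f.read()

print(copied == payload)
```

The copy is byte-for-byte identical when read back, even though the zero-only blocks were never written.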

When processes read bytes in the hole sections the filesystem will provide zeroed pages to them. For instance, we can analyze what happens when the file cache reads from the filesystem in a hole region in ext4. In this case, the sequence in readpage.c looks something like this

(cache read miss) ext4_aops-> ext4_readpages() -> ... -> zero_user_segment()

After this, the memory segment that the process is trying to access through the read() system call will efficiently obtain zeroes straight from fast memory.
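From userspace, the behavior is easy to observe. This short Python sketch creates a file with a 1 MiB hole at the beginning and reads from inside it; the file name is arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "holey")
    with open(path, "wb") as f:
        f.seek(1024 * 1024)       # skip 1 MiB without writing anything
        f.write(b"end")
    with open(path, "rb") as f:
        hole = f.read(4096)       # read from inside the unwritten region
        f.seek(1024 * 1024)
        tail = f.read()

print(hole == bytes(4096), tail)
```

The unwritten region reads back as zeroes, and the data written after it is intact.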

COW filesystems

The next generation of filesystems after the ext family brings some very interesting features. Probably the most game changing feature from filesystems like ZFS or BTRFS is their COW or copy-on-write abilities.

When we perform a copy-on-write operation, or a clone, or a reflink, or a shallow copy, we are not really duplicating extents. We are just making a metadata annotation in the newly created file, where we reference the same extents from the original file in the new file and we tag the extent as shared. Userspace is now under the illusion that there are two distinct files that can be modified separately. Whenever a process wants to write to a shared extent, the kernel will first create a copy of the extent and annotate it as belonging exclusively to that file, at least for now. After this, both files are a bit more different from one another, but they can still share many extents. In other words, in a COW filesystem extents can be shared between files, and the filesystem is in charge of creating new extents only when it is necessary.

We can see that cloning is a very fast operation that doesn’t require doubling the space that we use, as a regular copy does. This is really powerful, and it is the technology behind the instant snapshot abilities of BTRFS and ZFS. You can literally clone (or take a snapshot of) your whole root filesystem in under a second. This is useful, for instance, right before upgrading your packages in case something breaks.
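The mechanism described above can be modeled with a toy Python class. Everything here is invented for illustration: files are lists of extent ids, extents carry a reference count, and an extent is only duplicated when somebody writes to it while it is shared. Unlike a real COW filesystem, this model overwrites exclusive extents in place.

```python
class CowStore:
    """Toy model of shared extents with copy-on-write."""

    def __init__(self):
        self.extents = {}    # extent id -> bytes
        self.refs = {}       # extent id -> number of references
        self.files = {}      # file name -> list of extent ids
        self._next_id = 0

    def _new_extent(self, data):
        eid = self._next_id
        self._next_id += 1
        self.extents[eid] = data
        self.refs[eid] = 1
        return eid

    def create(self, name, chunks):
        self.files[name] = [self._new_extent(c) for c in chunks]

    def clone(self, src, dst):
        """Reflink: copy only the extent list and bump the refcounts."""
        self.files[dst] = list(self.files[src])
        for eid in self.files[dst]:
            self.refs[eid] += 1

    def write(self, name, index, data):
        """Replace one extent of a file, un-sharing it first if needed."""
        eid = self.files[name][index]
        if self.refs[eid] > 1:
            self.refs[eid] -= 1
            self.files[name][index] = self._new_extent(data)
        else:
            self.extents[eid] = data

    def read(self, name):
        return b"".join(self.extents[e] for e in self.files[name])

    def stored_bytes(self):
        return sum(len(d) for d in self.extents.values())

store = CowStore()
store.create("a", [b"x" * 4096] * 3)
before = store.stored_bytes()
store.clone("a", "b")
after_clone = store.stored_bytes()     # cloning stores nothing new
store.write("b", 1, b"y" * 4096)
after_write = store.stored_bytes()     # only the modified extent is added
print(before, after_clone, after_write)
```

In this model the clone costs no extra data space, and the later write adds only the one extent that diverged, while file "a" still reads back unchanged.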

BTRFS supports two ways of creating shallow copies. The first one applies to subvolumes and uses the btrfs subvolume snapshot command. The second one applies to individual files and uses the cp --reflink command. You may find this alias useful to make fast shallow copies by default.

alias cp='cp --reflink=auto --sparse=always'

Going one step further, if we have non-shallow copies of a file, or even files with duplicated extents, we can deduplicate them to make them reflink those common extents and free up space. One tool that can be used for this is duperemove, but beware that this will naturally lead to higher file fragmentation.
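The first half of deduplication, finding identical blocks, can be sketched by hashing. This Python example is only loosely in the spirit of such tools and does not reflect how duperemove is actually implemented; the in-memory file contents, the 4 KiB block size and the function name are all assumptions for illustration. The second half, asking the filesystem to share the extents, is not shown.

```python
import hashlib

def duplicate_blocks(files, block_size=4096):
    """Map each block digest to the (file, block index) places where it
    occurs, keeping only digests that were seen more than once."""
    seen = {}
    for name, data in files.items():
        for offset in range(0, len(data), block_size):
            digest = hashlib.sha256(data[offset:offset + block_size]).hexdigest()
            seen.setdefault(digest, []).append((name, offset // block_size))
    return {d: places for d, places in seen.items() if len(places) > 1}

files = {
    "a": b"A" * 4096 + b"B" * 4096,
    "b": b"C" * 4096 + b"B" * 4096,
}
dups = duplicate_blocks(files)
print(list(dups.values()))
```

Here the second block of both files is identical, so it is the only candidate for sharing.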

Now things really start getting complicated if we are trying to discover how our disk is being used by our files. Tools such as du or dutree will just count used blocks without being aware that some of them might be shared, so they will report more space than is really being used.

Similarly, in BTRFS we should avoid using the df command as it will report space that is allocated by the BTRFS filesystem as free, so it is better to use btrfs filesystem usage.

$ sudo btrfs filesystem usage /media/disk1
Overall:
    Device size:            2.64TiB
    Device allocated:       1.34TiB
    Device unallocated:     1.29TiB
    Device missing:         0.00B
    Used:                   1.27TiB
    Free (estimated):       1.36TiB (min: 731.10GiB)
    Data ratio:             1.00
    Metadata ratio:         2.00
    Global reserve:         512.00MiB (used: 0.00B)
Data,single: Size:1.33TiB, Used:1.26TiB
    /dev/sdb2    1.33TiB
Metadata,DUP: Size:6.00GiB, Used:3.48GiB
    /dev/sdb2   12.00GiB
System,DUP: Size:8.00MiB, Used:192.00KiB
    /dev/sdb2   16.00MiB
Unallocated:
    /dev/sdb2    1.29TiB

In order to learn what part of our files is exclusive or shared in BTRFS, we can use btrfs filesystem du . In this example, I want to check how much of my Nextcloud logs is shared between snapshots, which is most of it. Still it is hard to tell how the shared extents are distributed.

# btrfs filesystem du ncp-snapshots/*/nextcloud.log
    Total  Exclusive  Set shared  Filename
 76.88MiB    4.00KiB    76.87MiB  daily_2018-06-07_091703/nextcloud.log
 76.88MiB    4.00KiB    76.87MiB  daily_2018-06-08_091703/nextcloud.log
 76.88MiB    4.00KiB    76.88MiB  daily_2018-06-09_091703/nextcloud.log
 76.92MiB      0.00B    76.92MiB  daily_2018-06-10_091703/nextcloud.log
 76.92MiB    4.00KiB    76.92MiB  daily_2018-06-11_101704/nextcloud.log
 76.93MiB    4.00KiB    76.92MiB  daily_2018-06-12_111712/nextcloud.log
 76.97MiB      0.00B    76.97MiB  daily_2018-06-13_121703/nextcloud.log
 76.97MiB      0.00B    76.97MiB  hourly_2018-06-13_081701/nextcloud.log
 76.97MiB      0.00B    76.97MiB  hourly_2018-06-13_091701/nextcloud.log
 76.97MiB      0.00B    76.97MiB  hourly_2018-06-13_101701/nextcloud.log
 76.97MiB      0.00B    76.97MiB  hourly_2018-06-13_111701/nextcloud.log
 76.97MiB      0.00B    76.97MiB  hourly_2018-06-13_121702/nextcloud.log
 76.97MiB      0.00B    76.97MiB  hourly_2018-06-13_131701/nextcloud.log
 76.98MiB    4.00KiB    76.98MiB  hourly_2018-06-13_141701/nextcloud.log
 77.01MiB    4.00KiB    77.01MiB  hourly_2018-06-13_151701/nextcloud.log
 77.02MiB    4.00KiB    77.02MiB  hourly_2018-06-13_161701/nextcloud.log
 77.03MiB    4.00KiB    77.03MiB  hourly_2018-06-13_171701/nextcloud.log
 77.04MiB    4.00KiB    77.04MiB  hourly_2018-06-13_181701/nextcloud.log
 77.05MiB    4.00KiB    77.05MiB  hourly_2018-06-13_191701/nextcloud.log
 77.06MiB      0.00B    77.06MiB  hourly_2018-06-13_201701/nextcloud.log
 77.06MiB      0.00B    77.06MiB  hourly_2018-06-13_211701/nextcloud.log
 77.06MiB      0.00B    77.06MiB  hourly_2018-06-13_221702/nextcloud.log
 77.06MiB      0.00B    77.06MiB  hourly_2018-06-13_231702/nextcloud.log
 77.06MiB      0.00B    77.06MiB  hourly_2018-06-14_011701/nextcloud.log
 77.06MiB      0.00B    77.06MiB  hourly_2018-06-14_021701/nextcloud.log
 77.06MiB      0.00B    77.06MiB  hourly_2018-06-14_031701/nextcloud.log
 77.06MiB      0.00B    77.06MiB  hourly_2018-06-14_041701/nextcloud.log
 77.06MiB      0.00B    77.06MiB  hourly_2018-06-14_051701/nextcloud.log
 77.06MiB      0.00B    77.06MiB  hourly_2018-06-14_061701/nextcloud.log
 77.06MiB      0.00B    77.06MiB  hourly_2018-06-14_071701/nextcloud.log
 77.06MiB      0.00B    77.06MiB  hourly_2018-06-14_081701/nextcloud.log
 76.32MiB    4.00KiB    76.32MiB  monthly_2018-03-28_091703/nextcloud.log
 76.65MiB    4.00KiB    76.65MiB  monthly_2018-04-27_091706/nextcloud.log
 76.84MiB    4.00KiB    76.83MiB  monthly_2018-05-27_101701/nextcloud.log
 76.82MiB    4.00KiB    76.82MiB  weekly_2018-05-23_121702/nextcloud.log
 76.85MiB    4.00KiB    76.85MiB  weekly_2018-05-30_121704/nextcloud.log
 76.87MiB    4.00KiB    76.87MiB  weekly_2018-06-06_121705/nextcloud.log
 76.97MiB      0.00B    76.97MiB  weekly_2018-06-13_131703/nextcloud.log
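The difference between what a block-counting tool adds up and what is really stored can be illustrated with a toy accounting function. This Python sketch is not how btrfs computes its figures; it assumes each file is just a list of extent ids of a fixed 4 KiB size, and all names are hypothetical.

```python
from collections import Counter

def usage_report(files, extent_size=4096):
    """files maps a name to its list of extent ids. Returns the per-file
    (total, exclusive, shared) bytes, the naive sum, and the real usage."""
    # Count in how many files each extent appears.
    refs = Counter(e for extents in files.values() for e in set(extents))
    report = {}
    for name, extents in files.items():
        unique = set(extents)
        exclusive = sum(1 for e in unique if refs[e] == 1)
        shared = len(unique) - exclusive
        report[name] = (len(unique) * extent_size,
                        exclusive * extent_size,
                        shared * extent_size)
    naive = sum(total for total, _, _ in report.values())   # du-style sum
    real = len(refs) * extent_size                          # distinct extents
    return report, naive, real

snapshots = {"snap1": [1, 2, 3], "snap2": [1, 2, 4], "snap3": [1, 2, 4]}
report, naive, real = usage_report(snapshots)
print(report)
print(naive, real)
```

Adding up the per-file totals gives 36864 bytes here, while only four distinct extents, 16384 bytes, are actually stored.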

At the subvolume level we can get a rough idea from tools such as btrfs-du of what amount of data is exclusive to a snapshot and what is shared between snapshots.

References

https://en.wikipedia.org/wiki/Comparison_of_file_systems

https://lwn.net/Articles/187321/

https://ext4.wiki.kernel.org/index.php/Main_Page