
Overview

TLDR: Add crc32c to ext4 superblock, inode, block and inode bitmap, extent tree, directory block, htree block, MMP block, journal, and extended attribute objects with as few disk layout adjustments as possible.

Regular: As much as we wish our storage hardware was 100% reliable, it is still quite possible for data to be corrupted on disk, corrupted during transfer over a wire, or written to the wrong places. To protect against this sort of non-hostile corruption, it is desirable to store checksums of metadata objects on the filesystem to prevent broken metadata from shredding the filesystem. In theory, btrfs has stronger guarantees against corruption (uniform checksums on _all_ metadata blocks, redundant copies of all metadata, etc.) but this retrofit to ext4 will provide stronger protections for users who desire to stay with or refuse to migrate off of ext4, and at the fairly low cost of a single tune2fs/e2fsck.

This code started going upstream in mid-2012. As of October 2012 it is not all upstream yet.

How to Use

TL;DR

Install Linux 3.6+ and e2fsprogs 1.43-WIP.

modprobe crc32c-intel

mkfs.ext4 -O metadata_csum,64bit /dev/path/to/disk

mount /dev/path/to/disk /mountpoint -o journal_checksum

Detailed Instructions

The metadata checksumming code started going into mainline in Linux 3.5, and as of 3.7-rc1 it is undergoing some user testing. This code is not yet rock solid; be careful, and please report problems to the mailing list at linux-ext4@vger.kernel.org. Support for checksums is sitting in the e2fsprogs WIP tree, which implies that it might appear in e2fsprogs 1.43.

In a perfect world, one could simply enable metadata checksums and that would be the end of it. Unfortunately, there isn't enough space in some of the data structures on 32-bit ext4 filesystems to hold full 32-bit checksums, which reduces the strength of the checksum on those filesystems. Therefore, it is best to start by formatting a fresh filesystem with 64-bit support enabled, since it is not possible to upgrade a 32-bit filesystem to a 64-bit filesystem. In any case, existing filesystems (32 or 64-bit) can be upgraded to have checksums.

To create a fresh filesystem, simply specify the metadata_csum (and hopefully 64bit) features when running mkfs.ext4, e.g. mkfs.ext4 -O metadata_csum,64bit /dev/path/to/disk.

To enable checksums on an existing filesystem, first make sure that the filesystem will pass fsck (or back up your data). Then, simply turn on metadata_csum via tune2fs, e.g. tune2fs -O metadata_csum /dev/path/to/disk. The tune2fs program will try to amend all the metadata structures to include checksums. However, it is possible that it will find a directory block that is completely full, in which case it will advise you to run e2fsck -D to have the directory files rebuilt and reindexed. If you ignore this step, then that directory block will not be protected by checksums!

To mount a checksum-enabled ext4 filesystem, use the regular mount command. It is not necessary to provide any mount options to enable the feature.

If you mount your ext4 filesystem with either async_commit or journal_checksum, then enabling the metadata_csum feature will cause all journal data structures to gain checksums as well.

To disable checksums on an existing filesystem, ensure that the filesystem will pass fsck. Then turn off metadata_csum via tune2fs, e.g. tune2fs -O ^metadata_csum /dev/path/to/disk.

Some recent CPUs (Intel and SPARC) provide a hardware-accelerated CRC32c implementation. For best performance, ensure that the corresponding kernel module (e.g. crc32c-intel) is loaded before mounting any ext4 filesystems with checksums.

Algorithm

The popular sentiment is that a CRC will suffice to detect bit flips and various other kinds of corruption. The existing block group checksum uses the ANSI CRC16 polynomial (0x8005), which probably suffices for 32-byte block group descriptors. However, this crc16 may not be the most desirable function for the other metadata objects; longer CRCs are generally better at detecting errors when the data being checksummed gets large. That is the case here, since the bitmaps and the directory blocks are generally 4KiB in size.

The CRC32c polynomial (0x1EDC6F41) seems to have stronger error detection abilities than regular CRC32 (0x04C11DB7). It is implemented in hardware on Intel Core i7 CPUs and can be made to run reasonably quickly on other processors. Therefore, it seems desirable to use it. Further study is required to determine which CRCs (and which implementations) are fastest.
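As a point of reference, here is a minimal bit-at-a-time sketch of CRC32c in Python. It assumes the conventional parameters (reflected bit order, initial value and final XOR of 0xFFFFFFFF); the seeds and conventions that the ext4 code actually uses may differ, and the function names are made up for illustration.

```python
# Bit-at-a-time CRC32c (Castagnoli), reflected form.
# Assumption: conventional init/final-XOR of 0xFFFFFFFF.

def reflect32(value: int) -> int:
    """Reverse the bit order of a 32-bit value."""
    return int(f"{value:032b}"[::-1], 2)

# The reflected algorithm uses the bit-reversed form of 0x1EDC6F41.
CRC32C_POLY_REFLECTED = reflect32(0x1EDC6F41)

def crc32c(data: bytes, crc: int = 0) -> int:
    """Return the CRC32c of data; pass a previous result as crc to continue."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

if __name__ == "__main__":
    print(hex(CRC32C_POLY_REFLECTED))
    print(hex(crc32c(b"123456789")))
```

The string "123456789" is the customary test vector for CRC implementations; a block can also be checksummed in pieces by feeding the previous result back in as the second argument.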

CRC Stuffing

For the space-constrained block group descriptors (at least in standard 32-bit mode), it has been suggested that, because CRC16 is implemented only in software, we should find a way to use the fast crc32c function yet somehow shrink the checksum to fit in 16 bits.
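The text does not say how the 32-bit value would be shrunk. Two obvious candidates are sketched below: keeping only the low 16 bits, and XOR-folding the two halves together. Both helper names are hypothetical, and neither is claimed here to be the scheme that was eventually chosen.

```python
# Two hypothetical ways to "stuff" a crc32c into a 16-bit field.

def crc32c(data: bytes) -> int:
    """Bit-at-a-time CRC32c with the conventional 0xFFFFFFFF init/final XOR."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def stuff_truncate(crc32: int) -> int:
    """Keep the low 16 bits and discard the rest."""
    return crc32 & 0xFFFF

def stuff_fold(crc32: int) -> int:
    """XOR the high half into the low half."""
    return (crc32 >> 16) ^ (crc32 & 0xFFFF)

if __name__ == "__main__":
    descriptor = bytes(range(32))  # stand-in for a 32-byte group descriptor
    full = crc32c(descriptor)
    print(hex(full), hex(stuff_truncate(full)), hex(stuff_fold(full)))
```

Either way the stored value has only 16 bits of detection strength; the gain is in speed, not in coverage.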

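For the bitmap checksums it seems possible to take advantage of the linearity property crc32(a ^ b) = crc32(a) ^ crc32(b), which holds for equal-length inputs when the CRC is computed with a zero initial value and no final inversion.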
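The property can be checked directly. The sketch below assumes equal-length inputs and shows both the "raw" CRC (zero initial value, no final XOR), where the identity holds as written, and the conventional CRC, where an extra constant term appears. The flip_bit_checksum helper is a hypothetical illustration of how a bitmap checksum could be updated from the old value when a single bit changes.

```python
# Checking the XOR property of CRC32c on equal-length inputs.

def crc32c(data: bytes, init: int = 0xFFFFFFFF, xorout: int = 0xFFFFFFFF) -> int:
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ xorout

def crc32c_raw(data: bytes) -> int:
    """CRC32c with zero init and no final XOR: a purely linear function."""
    return crc32c(data, init=0, xorout=0)

def xor_bytes(a: bytes, b: bytes) -> bytes:
    assert len(a) == len(b)
    return bytes(x ^ y for x, y in zip(a, b))

def flip_bit_checksum(old_crc: int, length: int, bit: int) -> int:
    """New conventional CRC after flipping one bit of a length-byte bitmap."""
    delta = bytearray(length)
    delta[bit // 8] |= 1 << (bit % 8)
    return old_crc ^ crc32c_raw(bytes(delta))

if __name__ == "__main__":
    a = bytes(range(64))
    b = bytes((i * 7 + 3) & 0xFF for i in range(64))
    print(crc32c_raw(xor_bytes(a, b)) == crc32c_raw(a) ^ crc32c_raw(b))
    # With the conventional init/final XOR the leftover term is crc32c(zeros).
    print(crc32c(xor_bytes(a, b)) ^ crc32c(a) ^ crc32c(b) == crc32c(bytes(64)))
```

In this naive form the update still walks a buffer as long as the bitmap, so it demonstrates the algebra rather than a speedup.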

Benchmarking

I culled crc code from the Linux kernel and e2fsprogs, and linked it all into a big dumb program that crcs a large block of data. Following are bandwidth results (K/s) from various machines and a block size of 512MB:

| machine | clock | crc16 | crc16-t10dif | crc32-kern-be | crc32-kern-le | crc32-e2fs-be | crc32-e2fs-le | crc32c | crc32c-intel | crc32c-by8-be | crc32c-by8-le | crc32c-intelby8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Xeon X5650 | 2.67GHz | 381,856 | 293,951 | 1,039,389 | 1,059,679 | 454,377 | 454,133 | 419,964 | 4,447,431 | 1,684,071 | 1,698,309 | 1,843,101 |
| Core i7-950 | 3.06GHz | 363,599 | 279,431 | 996,363 | 994,851 | 429,477 | 428,275 | 398,382 | 4,131,127 | 1,573,210 | 1,593,776 | 1,714,893 |
| Pentium 4 (Nocona) | 3.6GHz | 391,433 | 345,666 | 915,502 | 925,717 | 512,564 | 511,035 | 437,946 | n/a | 1,097,068 | 1,146,004 | 1,099,856 |
| Core2 6700 | 2.67GHz | 332,726 | 320,891 | 933,688 | 937,826 | 453,658 | 453,377 | 390,229 | n/a | 1,653,483 | 1,652,018 | 1,362,838 |
| POWER5+ | 1.5GHz | 160,096 | 111,396 | 285,927 | 314,650 | 169,446 | 169,447 | 160,106 | n/a | 620,102 | 624,184 | 599,048 |
| POWER5+ | 1.9GHz | 202,266 | 140,844 | 360,555 | 396,713 | 214,207 | 214,202 | 202,224 | n/a | 807,723 | 808,243 | 775,657 |
| Athlon64 X2 4200+ | 2.2GHz | 261,927 | 298,252 | 767,435 | 767,469 | 392,507 | 392,520 | 337,204 | n/a | 1,193,278 | 1,102,328 | 1,136,237 |
| Pentium 4 Xeon (Nocona) | 3GHz | 360,264 | 307,781 | 793,679 | 790,873 | 421,749 | 421,491 | 393,766 | n/a | 935,662 | 942,952 | 910,220 |
| Pentium 3 Xeon | 1GHz | 67,448 | 68,429 | 157,668 | 157,609 | 116,705 | 116,294 | 107,558 | n/a | | | |
| VIA C7 | 2GHz | 133,243 | 132,670 | 296,732 | 296,757 | 228,180 | 228,417 | 153,906 | n/a | 339,504 | 343,237 | 327,777 |
| VIA C7 | 800MHz | 52,759 | 52,765 | 118,037 | 118,832 | 90,874 | 90,483 | 60,962 | n/a | 138,600 | 137,445 | 132,069 |
| Opteron 8218 | 2.6GHz | 304,453 | 346,510 | 888,013 | 890,044 | 454,597 | 454,210 | 391,157 | n/a | 1,189,312 | 1,176,844 | 1,176,380 |
| Xeon E5450 | 3GHz | 405,184 | 326,124 | 1,052,806 | 1,055,434 | 511,349 | 510,867 | 421,542 | n/a | 1,675,781 | 1,686,921 | 1,816,082 |
| P4 Xeon MP | 2.7GHz | 174,024 | 150,326 | 267,248 | 267,390 | 175,788 | 176,342 | 185,110 | n/a | 319,609 | 320,717 | 270,821 |
| Xeon E3110 | 3GHz | 406,181 | 326,324 | 1,055,929 | 1,057,013 | 518,032 | 516,353 | 422,631 | n/a | 1,676,384 | 1,696,455 | 1,831,592 |
| Pentium III | 500MHz | 34,034 | 34,778 | 93,968 | 96,528 | 62,248 | 62,896 | 55,315 | n/a | 121,693 | 121,570 | 116,931 |
| Core2 T7400 | 2.16GHz | 277,295 | 261,794 | 758,097 | 758,311 | 367,066 | 366,937 | 316,754 | n/a | 1,329,832 | 1,328,357 | 1,088,756 |
| Core2 T2300 | 1.66GHz | 210,691 | 232,884 | 586,950 | 587,660 | 298,031 | 297,973 | 239,845 | n/a | 855,838 | 855,600 | 763,868 |
| Core2 T7500 | 2.2GHz | 304,027 | 286,315 | 835,736 | 836,694 | 400,011 | 400,388 | 348,750 | n/a | 1,465,904 | 1,467,464 | 1,181,531 |
| Xeon X5550 | 2.67GHz | 385,203 | 296,862 | 1,053,178 | 1,054,078 | 455,272 | 455,312 | 422,926 | 4,351,392 | 1,667,016 | 1,676,230 | 1,822,632 |
| PowerMac G5 | 2GHz | 212,214 | 147,982 | 377,590 | 417,308 | 225,339 | 225,339 | 212,190 | n/a | 738,237 | 736,327 | 728,993 |
| Xeon X5570 | 2.93GHz | 384,908 | 259,286 | 855,428 | 855,416 | 421,520 | 421,524 | 406,596 | 4,283,526 | 1,818,824 | 1,818,756 | 1,632,126 |
| Xeon X7560 | 2.3GHz | 197,739 | 140,100 | 427,931 | 427,931 | 213,622 | 224,348 | 204,148 | 2,143,132 | 898,852 | 889,125 | 863,381 |
| Opteron 8354 | 2.2GHz | 257,997 | 258,429 | 650,962 | 650,855 | 369,342 | 367,794 | 337,798 | n/a | 984,548 | 984,264 | 996,814 |
| Core i7?? | 2.6GHz | 241,697 | 193,481 | 597,500 | 597,550 | 267,273 | 267,266 | 264,275 | 3,249,929 | 1,257,160 | 1,257,236 | 1,219,009 |
| Xeon E5335 | 2GHz | 268,751 | 216,926 | 696,639 | 695,757 | 344,071 | 342,600 | 280,610 | n/a | 1,115,085 | 1,124,472 | 1,217,095 |
| P7? | 3.3GHz | 321,760 | 205,227 | 417,772 | 460,069 | 257,957 | 258,019 | 320,922 | n/a | 933,644 | 902,453 | 929,237 |
| P6? | 4GHz | 409,852 | 388,417 | 815,203 | 910,645 | 471,649 | 486,732 | 431,875 | n/a | 1,233,702 | 1,290,849 | 1,239,037 |
| Pentium II | 400MHz | 28,969 | 29,538 | 80,999 | 81,057 | 53,753 | 53,777 | 48,480 | n/a | 103,129 | 102,818 | 99,590 |

Four machines list only eight results each, and it is not recorded which of the eleven columns they belong to, so their values are given in the order listed:

| machine | clock | results (K/s), in the order listed |
|---|---|---|
| Core i7-3320M | 3.3GHz | 395,838; 352,502; 1,127,426; 437,862; 453,601; 11,386,583; 2,212,699; 1,779,745 |
| Xeon E5-2680v2 | 3.6GHz | 415,274; 346,793; 1,137,524; 478,612; 432,502; 9,611,379; 2,248,240; 2,125,233 |
| Xeon X5680 | 3.6GHz | 420,597; 341,541; 1,246,439; 535,867; 497,779; 5,156,318; 2,144,441; 2,168,978 |
| Core(TM) i7-3615QM | 3.3GHz | 382,565; 319,508; 1,143,186; 440,756; 453,099; 7,887,647; 1,785,770; 1,776,152 |

Here is a description of the various CRC implementations tested:

| algorithm | description |
|---|---|
| crc16 | ANSI CRC16 algorithm in kernel (Sarwate) |
| crc16-t10dif | T10 CRC16 used for DIF in kernel (Sarwate) |
| crc32-kern-be | BE CRC32 in kernel (slice by 4) |
| crc32-kern-le | LE CRC32 in kernel (slice by 4) |
| crc32-e2fs-be | BE CRC32 in e2fsprogs 1.41 (slice by 4) |
| crc32-e2fs-le | LE CRC32 in e2fsprogs 1.41 (slice by 4) |
| crc32c | Default CRC32C in kernel (Sarwate) |
| crc32c-intel | Accelerated CRC32C on Intel Core i7 |
| crc32c-by8-be | Bob Pearson's updated BE CRC32 algorithm, but with CRC32C polynomial (slice by 8) |
| crc32c-by8-le | Bob Pearson's updated LE CRC32 algorithm, but with CRC32C polynomial (slice by 8) |
| crc32c-intelby8 | Intel's CRC32C algorithm http://prdownloads.sourceforge.net/slicing-by-8/ (slice by 8) |

At a 4K block size the time slices are so tiny that it's difficult to identify any clear trends.

It is well known that Sarwate's algorithm has been superseded (performance-wise) by the bit slicing implementations; these results support that conclusion. All slice-by-N implementations had #define'd a polynomial, making it trivially easy to port the code to the "default" CRC32C implementation. Obviously, the hardware solution eats all the others for lunch, though it only exceeds the slice-by-8 algorithm by a factor of ~2.5x and the slice-by-4 algorithms by a factor of ~4x. Either way, 1.5GB/s of _metadata_ updates is quite a lot, so the performance hit might not be too hard provided we can replace the current software crc32c code with one of these slice-by algorithms.
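To make the difference between the two families concrete, here is a small Python sketch of a Sarwate (one table lookup per byte) loop next to a slice-by-4 loop (four lookups per 32-bit word), both using the CRC32c polynomial. It is not the kernel or e2fsprogs code; the names are hypothetical, and Python timings say nothing about the C implementations benchmarked above.

```python
# Sarwate versus slice-by-4 table-driven CRC32c (little-endian, reflected).
POLY = 0x82F63B78  # bit-reversed CRC32c polynomial

def make_tables(count: int):
    """Build count slicing tables; table 0 is the classic Sarwate table."""
    t0 = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
        t0.append(crc)
    tables = [t0]
    for _ in range(1, count):
        prev = tables[-1]
        tables.append([(v >> 8) ^ t0[v & 0xFF] for v in prev])
    return tables

T0, T1, T2, T3 = make_tables(4)

def crc32c_sarwate(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc = (crc >> 8) ^ T0[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF

def crc32c_slice4(data: bytes) -> int:
    crc = 0xFFFFFFFF
    whole = len(data) // 4 * 4
    for i in range(0, whole, 4):
        # Fold in four bytes at once, then do four independent lookups.
        crc ^= int.from_bytes(data[i:i + 4], "little")
        crc = (T3[crc & 0xFF] ^ T2[(crc >> 8) & 0xFF]
               ^ T1[(crc >> 16) & 0xFF] ^ T0[crc >> 24])
    for byte in data[whole:]:  # leftover bytes, one at a time
        crc = (crc >> 8) ^ T0[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF

if __name__ == "__main__":
    print(hex(crc32c_sarwate(b"123456789")), hex(crc32c_slice4(b"123456789")))
```

Only the POLY constant ties the tables to CRC32c, which mirrors the remark above that the slice-by-N code was easy to port because the polynomial was a #define.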

As a side note, it is also desirable to optimize the crc16-t10dif algorithm, not for ext4 but for DIF disks.

Also, I hear that the upcoming SPARC T4 will have hardware CRC32c acceleration.

Existing Metadata Checksumming

Block Groups

The block group descriptor is protected by a CRC16. On a 64-bit filesystem, it may be possible either to extend the field to 32 bits, or to stuff a 32-bit crc into 16 bits per the "CRC Stuffing" section above.

Journal

jbd2 has a (probably infrequently used) journal_checksum feature that ensures the integrity of the journal contents. The format allows for CRC32, MD5, or SHA1 checksums, though as of Linux 3.0 only CRC32 seems to be supported. This can be easily switched over to CRC32c.

On-Disk Structure Modifications

Darrick will try to implement this without requiring an on-disk format change. Basically, that means that we have to find places where checksums can be crammed into existing data structures.

Superblock

Andi Kleen posted a patch to checksum the superblock. Darrick plans to massage this patch a little bit; the crc32c will be pasted into the superblock at offset 0x24C.

Inodes

Inode checksums are only supported on Linux. The checksum is a crc32c field at offset 0x7C, which puts it in the middle of osd2.linux2. The checksum covers the inode and everything else that follows it (afaik in-inode extended attribute blocks).

Inode/Block Bitmap (64-bit)

Each bitmap has its own crc32c checksum; both checksums are stored in the block group descriptor. The inode bitmap checksum is at offset 0x18, and the block bitmap checksum is at offset 0x38. This only works if the 64bit feature is set, unfortunately.

Inode/Block Bitmap (32-bit)

For 32-bit filesystems, Darrick is considering using the 16-bit fields in the block group descriptor at offset 0x18 and 0x20 to store either crc16 or stuffed crc32c values of the inode and block bitmaps. It's probably better to have a slow crc16 over no crc at all.

Extent Tree

Filesystem blocks are always 1024, 2048, or 4096 bytes, and the extent tree header and entry structures are both 12 bytes long. Therefore, because 2^n % 12 >= 4, there is sufficient space to store a crc32c just past the end of the last struct ext4_extent. The checksum is computed only over the part of the extent block that is in use.
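The arithmetic can be spelled out with a short Python sketch. The only inputs are the 12-byte record size and the block sizes named above; the function name is made up for illustration.

```python
# How much room is left after packing 12-byte extent records into a block?
EXTENT_RECORD_SIZE = 12  # header and entries are each 12 bytes
CHECKSUM_SIZE = 4        # a crc32c

def extent_block_layout(block_size: int):
    """Return (max_entries, tail_offset, slack_bytes) for one extent block."""
    max_entries = (block_size - EXTENT_RECORD_SIZE) // EXTENT_RECORD_SIZE
    tail_offset = EXTENT_RECORD_SIZE * (1 + max_entries)  # header + entries
    slack = block_size - tail_offset
    return max_entries, tail_offset, slack

if __name__ == "__main__":
    for size in (1024, 2048, 4096):
        entries, offset, slack = extent_block_layout(size)
        print(size, entries, offset, slack, slack >= CHECKSUM_SIZE)
```

For 1024-, 2048-, and 4096-byte blocks the leftover space is 4, 8, and 4 bytes respectively, so a 4-byte checksum fits without giving up an extent entry.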

Directory Blocks

Regular directory leaf blocks (i.e. blocks that are not secretly htree nodes) are a semi-packed array of variable-length records. A 12-byte directory entry is created at the end of the block with an inode of 0, to make the entry look unused to old ext4 drivers; a name_len of 0; and a rec_len large enough to hold a crc32c. In a cursory analysis of 250,000 directories, just 29 had blocks that did not have sufficient space to hold the 12-byte tail. tune2fs will advise users to run e2fsck -D to rebuild all directories so that all directory blocks may have a checksum.
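A rough Python sketch of such a tail follows. The field order (inode, rec_len, name_len, file type, checksum) and little-endian packing follow the usual ext4 directory entry shape, but the file-type byte value, the choice to checksum only the bytes in front of the tail, and the absence of any seed are assumptions made for illustration; the function names are hypothetical.

```python
# Sketch of a 12-byte fake directory entry carrying a crc32c.
import struct

TAIL_FORMAT = "<IHBBI"   # inode, rec_len, name_len, file_type, checksum
TAIL_SIZE = struct.calcsize(TAIL_FORMAT)  # 12 bytes

def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def add_checksum_tail(body: bytes, file_type: int = 0) -> bytes:
    """Append a tail whose inode and name_len are 0 so it looks unused."""
    tail = struct.pack(TAIL_FORMAT, 0, TAIL_SIZE, 0, file_type, crc32c(body))
    return body + tail

def verify_checksum_tail(block: bytes) -> bool:
    body, tail = block[:-TAIL_SIZE], block[-TAIL_SIZE:]
    inode, rec_len, name_len, _ftype, stored = struct.unpack(TAIL_FORMAT, tail)
    looks_like_tail = inode == 0 and rec_len == TAIL_SIZE and name_len == 0
    return looks_like_tail and stored == crc32c(body)

if __name__ == "__main__":
    block = add_checksum_tail(bytes(1024 - TAIL_SIZE))
    print(len(block), verify_checksum_tail(block))
```

A block whose last real entry leaves fewer than 12 free bytes has no room for this tail, which is the case that e2fsck -D is meant to fix by repacking the directory.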

HTree

The htree root and internal nodes do not hide a checksum in a fake dirent at the end of the block because that would require the removal of two struct dx_entry from each htree block. Instead, the limit count is decreased by 1 and the crc32c stored at the end of the block. Again, tune2fs will advise users to run e2fsck -D to rebuild all directories and perform any necessary htree rebalancing.

Unfortunately, in adding htree checksums to a very very large directory, it is possible to overflow the htree.

Extended Attributes (EAs)

For EAs stored in a separate disk block (i.e. not stored after the inode), there is sufficient space to store a crc32c directly in the header.

For EAs stored in the extra space after the inode, Darrick thought incorrectly that the h_magic field was never checked. That turned out to be untrue, so his new proposal is to follow Andreas Dilger's suggestion simply to extend the inode checksum to cover the extra space after the inode structure. That will require a fair amount of changes to e2fsprogs, but not a lot for the kernel.

Metadata Not Being Upgraded

Direct/indirect/triple-indirect block maps are not targeted for checksums, as this results in a totally incompatible disk format change and reduces the maximum file size considerably. Files should be converted to extents via chattr +e for increased safety and less overhead.

A user should be able to turn on this feature at mke2fs time simply by specifying -O metadata_csum. Because the 64bit feature allows block group descriptors that are large enough to hold crc32c values for the bitmaps, mke2fs should warn the user if the feature set is metadata_csum,^64bit once the 64bit feature has been tested thoroughly.

It should be possible to convert existing filesystems with a simple tune2fs -O metadata_csum. tune2fs will apply checksums to all metadata structures that can trivially take them, and tell the user to run e2fsck -D if necessary. e2fsck will gain the ability to reorganize directory tree blocks to accommodate the checksum fields. Obviously, 64bit mode cannot (currently) be enabled on existing filesystems.

It should be possible to disable metadata checksumming on an existing filesystem with tune2fs -O ^metadata_csum, with the same conditions outlined for enabling checksums on an existing filesystem.

debugfs should try to display checksums whenever possible.

It should NOT be possible for old fs code to write to a filesystem with metadata checksums enabled. The metadata_csum flag is implemented as a ROCOMPAT flag, which should keep (non-malicious) old programs from messing things up.
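The way a ROCOMPAT flag protects the filesystem can be sketched as a simple bitmask check. The flag value 0x0400 and the function name below are assumptions for illustration; the point is only that a driver which does not recognize a read-only-compatible bit may still mount the filesystem read-only but must refuse to write.

```python
# Sketch of read-only-compatible feature gating.
RO_COMPAT_METADATA_CSUM = 0x0400  # assumed flag value, for illustration

def allowed_mount_mode(fs_ro_compat: int, supported_ro_compat: int,
                       want_write: bool) -> str:
    """Decide how a driver may mount, given the ro_compat bits it understands."""
    unknown = fs_ro_compat & ~supported_ro_compat
    if unknown and want_write:
        return "refuse read-write"  # would write metadata without checksums
    return "read-write" if want_write else "read-only"

if __name__ == "__main__":
    old_driver = 0x0000  # knows nothing about metadata_csum
    new_driver = RO_COMPAT_METADATA_CSUM
    print(allowed_mount_mode(RO_COMPAT_METADATA_CSUM, old_driver, True))
    print(allowed_mount_mode(RO_COMPAT_METADATA_CSUM, old_driver, False))
    print(allowed_mount_mode(RO_COMPAT_METADATA_CSUM, new_driver, True))
```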

Stuff Darrick Hasn't Thought Hard Enough About