On Fri, Aug 10, 2018 at 03:40:23AM +0200, erentheti...@mail.de wrote:
> I am searching for more information regarding possible bugs related to
> BTRFS Raid 5/6. All sites I could find are incomplete and information
> contradicts itself:
>
> The Wiki Raid 5/6 Page (https://btrfs.wiki.kernel.org/index.php/RAID56)
> warns of the write hole bug, stating that your data remains safe
> (except data written during power loss, obviously) upon unclean
> shutdown, unless your data gets corrupted by further issues like
> bit-rot, drive failure etc.

The raid5/6 write hole bug exists on btrfs (as of 4.14-4.16) and there
are no mitigations to prevent or avoid it in mainline kernels.

The write hole results from allowing a mixture of old (committed) and
new (uncommitted) writes to the same RAID5/6 stripe, i.e. a group of
blocks consisting of one related data or parity block from each disk in
the array, such that writes to any of the data blocks affect the
correctness of the parity block and vice versa. If the writes were not
completed and one or more of the data blocks are not online, the data
blocks reconstructed by the raid5/6 parity algorithm will be corrupt.

If all disks are online, the write hole does not immediately damage
user-visible data, because the old data blocks can still be read
directly; however, should a drive fail later, old data may not be
recoverable because the parity block is no longer correct for
reconstructing the missing data block. A scrub can fix write hole
errors while all disks are online, so a scrub should be performed after
any unclean shutdown to recompute parity data.

The write hole always puts both old and new data at risk of damage;
however, due to btrfs's copy-on-write behavior, only the old damaged
data can be observed after power loss. The damaged new data has no
references committed to disk before the power failure, so there is no
way to reach it through the filesystem. Not every interrupted write
damages old data, but some do.

Two possible mitigations for the write hole are:

 - modify the btrfs allocator to prevent writes to partially filled
   raid5/6 stripes (similar to what the ssd mount option does, except
   with the correct parameters to match RAID5/6 stripe boundaries), and
   advise users to run btrfs balance much more often to reclaim free
   space in partially occupied raid stripes

 - add a stripe write journal to the raid5/6 layer (either in btrfs
   itself, or in a lower RAID5 layer)

There are assorted other ideas (e.g.
copy the RAID-Z approach from zfs to btrfs, or dramatically increase
the btrfs block size) that also solve the write hole problem, but they
are more invasive and less practical for btrfs.

Note that the write hole also affects btrfs on top of other similar
raid5/6 implementations (e.g. mdadm raid5 without a stripe journal).
The btrfs CoW layer does not know how to allocate data so as to avoid
RMW raid5 stripe updates that put committed data at risk, and this
limitation applies to every combination of unjournalled raid5/6 and
btrfs.

> The Wiki Gotchas Page (https://btrfs.wiki.kernel.org/index.php/Gotchas)
> warns of possible incorrigible "transid" mismatch, not stating which
> versions are affected or what transid mismatch means for your data. It
> does not mention the write hole at all.

Neither raid5 nor the write hole is required to produce a transid
mismatch failure. A transid mismatch usually occurs due to a lost
write. Write hole is a specific case of lost write, but write hole does
not usually produce transid failures (it produces header or csum
failures instead). During real disk failure events, multiple distinct
failure modes can occur concurrently, i.e. both transid failure and
write hole can occur at different places in the same filesystem as a
result of attempting to use a failing disk over a long period of time.

A transid verify failure is metadata damage. It will make the
filesystem readonly and make some data inaccessible, as described
below.

> This Mail Archive
> (https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg55161.html)
> states that scrubbing BTRFS Raid 5/6 will always repair Data
> Corruption, but may corrupt your Metadata while trying to do so -
> meaning you have to scrub twice in a row to ensure data integrity.

Simple corruption (without write hole errors) is fixed by scrubbing as
of the last...at least six months?
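Mechanically, a scrub amounts to: read every copy of every block,
verify each copy against its stored checksum, and rewrite any bad copy
from a copy that still verifies. A toy mirror-style sketch in plain
Python (illustrative only, nothing here is btrfs code; the block size
and data layout are invented):

```python
import zlib

BLOCK = 4096  # toy block size, not btrfs's


def make_mirror(blocks):
    # Each mirror stores (data, stored_csum) pairs, standing in for
    # data blocks plus the csums kept in the btrfs metadata trees.
    return [(bytes(b), zlib.crc32(bytes(b))) for b in blocks]


def scrub(mirrors):
    """Verify every block on every mirror; rewrite bad copies from a
    copy that still matches its checksum. Returns blocks repaired."""
    repaired = 0
    for i in range(len(mirrors[0])):
        good = None
        for m in mirrors:
            data, csum = m[i]
            if zlib.crc32(data) == csum:
                good = (data, csum)
                break
        if good is None:
            raise IOError(f"block {i}: no good copy, unrecoverable")
        for m in mirrors:
            if zlib.crc32(m[i][0]) != m[i][1]:
                m[i] = good        # repair the bad copy in place
                repaired += 1
    return repaired


blocks = [bytes([n]) * BLOCK for n in range(8)]
a, b = make_mirror(blocks), make_mirror(blocks)
b[3] = (b"\xff" * BLOCK, b[3][1])  # simulate bit-rot on one mirror
assert scrub([a, b]) == 1          # one bad copy found and repaired
assert b[3][0] == blocks[3]        # restored from the good mirror
```

The same verify-and-rewrite idea extends to parity profiles, where the
"good copy" is reconstructed from the surviving blocks plus parity
instead of read from a mirror.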
Kernel v4.14.xx and later can definitely do it these days, for both
data and metadata.

If the metadata is damaged in any way (corruption, write hole, or
transid verify failure) and btrfs cannot use the metadata raid profile
to recover the damaged data, the filesystem is usually forever
readonly, and anywhere from 0 to 100% of the filesystem may be readable
depending on where in the metadata tree structure the error occurs (the
closer to the root, the more data is lost). This is the same for dup,
raid1, raid5, raid6, and raid10 profiles. raid0 and single profiles are
not a good idea for metadata if you want a filesystem that can persist
across reboots (some use cases don't require persistence, so they can
use -msingle/-mraid0 btrfs as a large-scale tmpfs).

For all metadata raid profiles, recovery can fail due to risks
including RAM corruption, multiple drives with defects in the same
locations, or multiple drives with identically-behaving firmware bugs.
For raid5/6 metadata there is the *additional* risk of the write hole
bug preventing recovery of metadata.

If the filesystem is -draid5 -mraid1 then the metadata is not
vulnerable to the write hole, but data is. In this configuration you
can determine with high confidence which files you need to restore from
backup, and the filesystem remains writable so the restored data can be
written back, because raid1 does not have the write hole bug.

More than one scrub for a single write hole event won't help (and never
did). If the first scrub doesn't fix all the errors then your kernel
probably also has a race condition bug or regression that will
permanently corrupt the data (this was true in 2016 when the referenced
mailing list post was written). Current kernels don't have such bugs:
if the first scrub can correct the data, it does, and if it can't, then
all future scrubs will produce identical results.
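To make the write hole concrete, here is a toy XOR-parity model in
plain Python (illustrative only; real raid5/6 stripes are wider and
nothing about btrfs's on-disk format is modeled):

```python
from functools import reduce


def xor(*blocks):
    # Byte-wise XOR across blocks: this is both RAID5 parity
    # generation and RAID5 reconstruction of a missing block.
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))


BLOCK = 16
d0_old = b"A" * BLOCK            # old, committed data
d1_old = b"B" * BLOCK            # old, committed data, same stripe
parity = xor(d0_old, d1_old)     # parity consistent with both

# RMW update of d1 in the same stripe; power fails after the data
# write but before the parity write -- the write hole window.
d1_new = b"C" * BLOCK
# parity is now stale: it no longer matches (d0_old, d1_new)

# While all disks are online, d0_old can still be read directly.
# Now the disk holding d0 fails; reconstruct d0 from d1 and parity:
d0_rebuilt = xor(d1_new, parity)
assert d0_rebuilt != d0_old      # old, committed data comes back corrupt

# Had parity been updated atomically with d1 (e.g. via a stripe
# journal), reconstruction would have worked:
parity_ok = xor(d0_old, d1_new)
assert xor(d1_new, parity_ok) == d0_old
```

Note that d0 was never written during the interrupted transaction, yet
it is the block that becomes unrecoverable, which is why the write hole
damages *old* data.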
Older kernels (2016) had problems reconstructing data during read()
operations but could fix data during scrub or balance operations.
These bugs, as far as I am able to test, were fixed by v4.17 and
backported to v4.14.

> The Bugzilla Entry
> (https://bugzilla.kernel.org/buglist.cgi?component=btrfs) contains
> mostly unanswered bugs, which may or may not still count (2013 - 2018).

I find that any open bug over three years old on b.k.o can be safely
ignored, because either it has already been fixed or there is not
enough information in the report to understand what is going on.

> This Spinics Discussion
> (https://www.spinics.net/lists/linux-btrfs/msg76471.html) states
> that the write hole can even damage old data eg. data that was not
> accessed during unclean shutdown, the opposite of what the Raid5/6
> Status Page states!

Correct; in fact the write hole can *only* damage old data, as
described above.

> This Spinics comment
> (https://www.spinics.net/lists/linux-btrfs/msg76412.html) informs that
> hot-plugging a device will trigger the write hole. Accessed data will
> therefore be corrupted. In case the earlier statement about old data
> corruption is true, random data could be permanently lost. This is
> even more dangerous if you are connecting your devices via USB, as
> USB can disconnect due to external influence, eg. touching the
> cables, shaking...

Hot-unplugging a device can cause many lost write events at once, and
each lost write event is very bad. btrfs does not eject a device from a
raid array and resynchronize it when a write to the device fails
(unlike every other working RAID implementation on Earth...). If the
device reconnects, btrfs will read a mixture of old and new data and
rely on checksums to determine which blocks are out of date (as opposed
to treating the departed disk as entirely out of date and initiating a
disk replace operation when it reconnects).

A scrub after a momentary disconnect can reconstruct most missing data,
but not all.
CRC32 lets one error through per 16 TB or so of corrupted blocks, and
all nodatasum/nodatacow files modified while a drive was offline will
be corrupted without detection or recovery by btrfs. Device replace is
currently the best recovery option from this kind of failure. Ideally
btrfs would implement something like mdadm's write-intent bitmaps, so
that only the block groups modified while the device was offline would
need to be replaced, but this is the btrfs we want, not the btrfs we
have.

> Lastly, this Superuser question
> (https://superuser.com/questions/1325245/btrfs-transid-failure#1344494)
> assumes that the transid mismatch bug could toggle your system
> unmountable. While it might be possible to restore your data using
> sudo BTRFS Restore, it is still unknown how the transid mismatch is
> even toggled, meaning that your file system could fail at any time!

Note that transid failure risk applies to all btrfs configurations; it
is not specific to raid5/6. The write hole errors from raid5/6 will
typically produce a header or csum failure (from reading garbage), not
a transid failure (from reading an old, valid, but deleted metadata
block).

transid mismatch is pretty simple: one of your disk drives, or some
caching or translation layer between btrfs and your disk drives,
dropped a write (or, less likely, read from or wrote to the wrong
sector address). btrfs detects this by embedding transids into all
data structures where one object points to another object in a
different block.

transid mismatch is also hard: you then have to figure out which layer
of your possibly quite complicated storage setup is dropping writes,
and make it stop. This process almost never involves btrfs itself.
Sometimes it's the bottom layer (i.e. the drives themselves), but the
more layers you add, the more candidates must be eliminated before the
cause can be found. Sometimes it's a *power supply* (i.e. the drive
controller CPU browns out and forgets it was writing something, or
corrupts its embedded RAM).
Sometimes it's host RAM going bad, corrupting and breaking everything
it touches.

I have a variety of test setups, and the correlation between hardware
model (especially drive model, but also some SATA controller models)
and total filesystem loss due to transid verify failure is very strong.
Out of 10 drive models from 5 vendors, 2 models can't keep a filesystem
intact for more than a few months, while the other models average 3
years of service and still hold the first btrfs filesystem they were
formatted with. Disabling drive write caching sometimes helps, but some
hardware eats a filesystem every few months no matter what settings I
change; if the problem is a broken SATA controller or cable, then
changing drive settings won't help.

It's fun and/or scary to put known good and bad hardware in the same
RAID1 array and watch btrfs autocorrect the bad data after every other
power failure; however, the bad hardware is clearly not sufficient to
implement any sort of reliable data persistence, and arrays containing
bad hardware will eventually fail. The bad drives can still contribute
to society as media cache servers or point-of-sale terminals, where the
only response to any data integrity issue is a full reformat and image
reinstall. This seems to be the target market that low-end consumer
drives are aiming for, as they seem to be useless for anything else.

Adopt a zero-tolerance policy for drive resets after the array is
mounted and active. A drive reset means a potential lost write leading
to a transid verify failure. Swap out both drive and SATA cable the
first time a reset occurs during a read or write operation, and
consider swapping out the SATA controller, changing drive model, and
upgrading the power supply if it happens twice.

> Do you know of any comprehensive and complete Bug list?
...related to raid5/6:

 - no write hole mitigation (at least two viable strategies available)

 - no device bouncing mitigation (mdadm had this working 20 years ago)

 - probably slower than it could be

 - no recovery strategy other than raid itself (btrfs check --repair is
   useless on non-trivial filesystems, and a single-bit uncorrected
   metadata error makes the filesystem unusable)

> Do you know more about the stated Bugs?
>
> Do you know further Bugs that are not addressed in any of these sites?

My testing on raid5/6 filesystems is producing pretty favorable results
these days; there do not seem to be many bugs left. I have one test
case where I write millions of errors into a raid5/6 array and the
filesystem recovers every single one transparently while verifying SHA1
hashes of the test data. After years of rebuilding busted
ext3-on-mdadm-raid5 filesystems, watching btrfs do it all automatically
is just...beautiful.

I think once the write hole and device bouncing mitigations are in
place, I'll start looking at migrating -draid1/-mraid1 setups to
-draid5/-mraid1, assuming the performance isn't too painful.
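For reference, the verification loop described above (write test data,
record hashes, inject errors at the device level, re-verify through the
filesystem) can be sketched like this in plain Python; the paths and
file layout are placeholders, and the corruption-injection step is only
a comment because a real test needs a scratch btrfs raid5/6 array:

```python
import hashlib
import os
import tempfile


def sha1_of(path):
    # Stream the file so large test files don't need to fit in RAM.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()


def write_corpus(root, nfiles=4, size=1 << 12):
    """Write random test files under root, return {path: sha1}."""
    expect = {}
    for i in range(nfiles):
        p = os.path.join(root, f"test{i}.dat")
        with open(p, "wb") as f:
            f.write(os.urandom(size))
        expect[p] = sha1_of(p)
    return expect


def verify_corpus(expect):
    """Re-read every file; return the paths whose content changed."""
    return [p for p, h in expect.items() if sha1_of(p) != h]


# On a real test rig, between write and verify you would inject errors
# into the raid5/6 member devices underneath the filesystem and rely on
# csums + parity to repair them on read or scrub; here the files just
# sit in an ordinary temp directory, so nothing gets corrupted.
with tempfile.TemporaryDirectory() as root:
    expect = write_corpus(root)
    assert verify_corpus(expect) == []
```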
