* btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible)
@ 2019-06-23 20:45 Zygo Blaxell
  2019-06-24  0:46 ` Qu Wenruo
  2019-06-24  2:45 ` Remi Gauvin
  0 siblings, 2 replies; 10+ messages in thread

From: Zygo Blaxell @ 2019-06-23 20:45 UTC (permalink / raw)
To: Qu Wenruo; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 19143 bytes --]

On Thu, Jun 20, 2019 at 01:00:50PM +0800, Qu Wenruo wrote:
> On 2019/6/20 上午7:45, Zygo Blaxell wrote:
> > On Sun, Jun 16, 2019 at 12:05:21AM +0200, Claudius Winkel wrote:
> >> What should I do now ... to use btrfs safely? Should i not use it with
> >> DM-crypt
> >
> > You might need to disable write caching on your drives, i.e. hdparm -W0.
>
> This is quite troublesome.
>
> Disabling write cache normally means performance impact.

The drives I've found that need write cache disabled aren't particularly fast to begin with, so disabling write cache doesn't harm their performance very much. All the speed gains of write caching are lost when someone has to spend time doing a forced restore from backup after a transid-verify failure. If you really do need performance, there are drives with working firmware available that don't cost much more.

> And disabling it normally would hide the true cause (if it's something
> btrfs' fault).

This is true; however, even if a hypothetical btrfs bug existed, disabling write caching is an immediately deployable workaround, and there's currently no solution other than avoiding drives with bad firmware.

There could be improvements possible for btrfs to work around bad firmware...if someone's willing to donate their sanity to get inside the heads of firmware bugs, and can find a way to fix it that doesn't make things worse for everyone with working firmware.

> > I have a few drives in my collection that don't have working write cache.
> > They are usually fine, but when otherwise minor failure events occur (e.g.
> > bad cables, bad power supply, failing UNC sectors) then the write cache
> > doesn't behave correctly, and any filesystem or database on the drive
> > gets trashed.
>
> Normally this shouldn't be the case, as long as the fs has correct
> journal and flush/barrier.

If you are asking the question:

	"Are there some currently shipping retail hard drives that are
	orders of magnitude more likely to corrupt data after simple
	power failures than other drives?"

then the answer is:

	"Hell, yes! How could there NOT be?"

It wouldn't take very much capital investment or time to find this out in lab conditions. Just killing power every 25 minutes while running a btrfs stress test should do it--or have a UPS hardware failure in ops; the effect is the same. Bad drives will show up in a few hours; good drives take much longer--long enough that, statistically, the good drives will probably fail outright before btrfs gets corrupted.

> If it's really the hardware to blame, then it means its flush/fua is not
> implemented properly at all, thus the possibility of a single power loss
> leading to corruption should be VERY VERY high.

That exactly matches my observations. Only a few disks fail at all, but the ones that do fail do so very often: 60% of corruptions at 10 power failures or less, 100% at 30 power failures or more.
When a failure occurs, we break the affected system apart and place its components into other systems or test machines to isolate which component is causing the failure (e.g. a failing power supply could create RAM corruption events and disk failure events, so we move the hardware around to see where the failure goes). If the same component is involved in repeatable failure events, the correlation jumps out of the data and we know that component is bad. We can also do correlations by attributes of the components, i.e. vendor, model, size, firmware revision, manufacturing date, and correlate vendor-model-size-firmware to btrfs transid verify failures across a fleet of different systems.

I can go to the data and get a list of all the drive model and firmware revisions that have been installed in machines with 0 "parent transid verify failed" events since 2014, and are still online today:

	Device Model: CT240BX500SSD1 Firmware Version: M6CR013
	Device Model: Crucial_CT1050MX300SSD1 Firmware Version: M0CR060
	Device Model: HP SSD S700 Pro 256GB Firmware Version: Q0824G
	Device Model: INTEL SSDSC2KW256G8 Firmware Version: LHF002C
	Device Model: KINGSTON SA400S37240G Firmware Version: R0105A
	Device Model: ST12000VN0007-2GS116 Firmware Version: SC60
	Device Model: ST5000VN0001-1SF17X Firmware Version: AN02
	Device Model: ST8000VN0002-1Z8112 Firmware Version: SC61
	Device Model: TOSHIBA-TR200 Firmware Version: SBFA12.2
	Device Model: WDC WD121KRYZ-01W0RB0 Firmware Version: 01.01H01
	Device Model: WDC WDS250G2B0A-00SM50 Firmware Version: X61190WD
	Model Family: SandForce Driven SSDs Device Model: KINGSTON SV300S37A240G Firmware Version: 608ABBF0
	Model Family: Seagate IronWolf Device Model: ST10000VN0004-1ZD101 Firmware Version: SC60
	Model Family: Seagate NAS HDD Device Model: ST4000VN000-1H4168 Firmware Version: SC44
	Model Family: Seagate NAS HDD Device Model: ST8000VN0002-1Z8112 Firmware Version: SC60
	Model Family: Toshiba 2.5" HDD MK..59GSXP (AF) Device Model: TOSHIBA MK3259GSXP Firmware Version: GN003J
	Model Family: Western Digital Gold Device Model: WDC WD101KRYZ-01JPDB0 Firmware Version: 01.01H01
	Model Family: Western Digital Green Device Model: WDC WD10EZRX-00L4HB0 Firmware Version: 01.01A01
	Model Family: Western Digital Re Device Model: WDC WD2000FYYZ-01UL1B1 Firmware Version: 01.01K02
	Model Family: Western Digital Red Device Model: WDC WD50EFRX-68MYMN1 Firmware Version: 82.00A82
	Model Family: Western Digital Red Device Model: WDC WD80EFZX-68UW8N0 Firmware Version: 83.H0A83
	Model Family: Western Digital Red Pro Device Model: WDC WD6002FFWX-68TZ4N0 Firmware Version: 83.H0A83

So far so good. The drive model-vendor-firmware combinations above have collectively had hundreds of drive-power-failure events in the last 5 years, so we have been giving the firmware a fair workout [1].

Now let's look for some bad stuff. How about a list of drives that were involved in parent transid verify failure events occurring within 1-10 power cycles after mkfs events:

	Model Family: Western Digital Green Device Model: WDC WD20EZRX-00DC0B0 Firmware Version: 80.00A80

Change the query to 1-30 power cycles, and we get another model with the same firmware version string:

	Model Family: Western Digital Red Device Model: WDC WD40EFRX-68WT0N0 Firmware Version: 80.00A80

Removing the upper bound on power cycle count doesn't find any more.

The drives running 80.00A80 are all in fairly similar condition: no errors in SMART, and the drive was apparently healthy at the time of failure (no unusual speed variations, no unexpected drive resets, nor any of the other things that happen to these drives as they age and fail but that are not reported as official errors on the models without TLER). There are multiple transid-verify failures logged in multiple very different host systems (e.g. Intel 1U server in a data center, AMD desktop in an office, hardware ages a few years apart). This is a consistent and repeatable behavior that does not correlate to any other attribute.
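The queries themselves are nothing exotic. A toy sketch of the power-cycle-window query (the flat event log and the field names here are hypothetical stand-ins, not our actual tooling):

```python
from collections import defaultdict

# Toy event log: (timestamp, drive_id, event_type).  The real logs carry far
# more context (host, model, firmware, SMART counters); this shape is a
# hypothetical stand-in.
EVENTS = [
    (100, "wd-green-1", "mkfs"),
    (105, "wd-green-1", "power_cycle"),
    (110, "wd-green-1", "parent_transid_verify_failed"),
    (100, "ironwolf-1", "mkfs"),
    (105, "ironwolf-1", "power_cycle"),
]

def drives_failing_within(events, lo=1, hi=10):
    """Drives with a transid-verify failure within lo..hi power cycles
    of the most recent mkfs event."""
    by_drive = defaultdict(list)
    for ts, drive, ev in sorted(events):
        by_drive[drive].append(ev)
    suspects = set()
    for drive, evs in by_drive.items():
        cycles, counting = 0, False
        for ev in evs:
            if ev == "mkfs":
                counting, cycles = True, 0
            elif ev == "power_cycle" and counting:
                cycles += 1
            elif ev == "parent_transid_verify_failed" and counting:
                if lo <= cycles <= hi:
                    suspects.add(drive)
    return suspects
```

Join the suspect set against the drive inventory to get the model-vendor-firmware strings; dropping the lower bound to 0 power cycles is the variant of the query discussed next.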
Now, if you've been reading this far, you might wonder why the previous two ranges were lower-bounded at 1 power cycle, and the reason is that I have another firmware in the data set with _zero_ power cycles between mkfs and failure:

	Model Family: Western Digital Caviar Black Device Model: WDC WD1002FAEX-00Z3A0 Firmware Version: 05.01D05

These drives have 0 power fail events between mkfs and "parent transid verify failed" events, i.e. it's not necessary to have a power failure at all for these drives to unrecoverably corrupt btrfs. In all cases the failure occurs on the same days as "Current Pending Sector" and "Offline UNC sector" SMART events. The WD Black firmware seems to be OK with write cache enabled most of the time (there are years in the log data without any transid-verify failures), but the WD Black will drop its write cache when it sees a UNC sector, and btrfs notices the failure a few hours later.

Recently I've been asking people on IRC who present btrfs filesystems with transid-verify failures (excluding those with obvious symptoms of host RAM failure) which drives they are using. So far all the users who have participated in this totally unscientific survey have WD Green 2TB and WD Black hard drives with the same firmware revisions as above. The most recent report was this week. I guess there are a lot of drives with these firmware revisions still in inventories out there.

The data says there are at least 2 firmware versions in the wild which account for 100% of the btrfs transid-verify failures. These are only 8% of the total fleet of disks in my data set, but they are punching far above their weight in terms of failure event count.

I first observed these correlations back in 2016. We had a lot of WD Green and Black drives in service at the time--too many to replace or upgrade them all early--so I looked for a workaround to force the drives to behave properly.
Since it looked like a write ordering issue, I disabled the write cache on drives with these firmware versions, and found that the transid-verify filesystem failures stopped immediately (they had been bi-weekly events with write cache enabled).

That was 3 years ago, and there are no new transid-verify failures logged since then. The drives are still online today with filesystems mkfsed in 2016.

One bias to be aware of from this data set: it goes back further than 5 years, and we use the data to optimize hardware costs including the cost of ops failures. You might notice there are no Seagate Barracudas[2] in the data, while there are the similar WD models. In an unbiased sample of hard drives, there are likely to be more bad firmware revisions than found in this data set. I found 2, and that's a lower bound on the real number out there.

> Your idea on hardware's faulty FLUSH/FUA implementation could definitely
> cause exactly the same problem, but the last time I asked similar
> problem to fs-devel, there is no proof for such possibility.

Well, correlation isn't proof, it's true; however, if a behavior looks like a firmware bug, and quacks like a firmware bug, and is otherwise indistinguishable from a firmware bug, then it's probably a firmware bug.

I don't know if any of these problems are really device firmware bugs or Linux bugs, particularly in the WD Black case. That's a question for someone who can collect some of these devices and do deeper analysis. In particular, my data is not sufficient to rule out either of these two theories for the WD Black:

	1. Linux doesn't use FLUSH/FUA correctly when there are IO errors
	/ drive resets / other things that happen around the times that
	drives have bad sectors, but it is OK as long as there are no
	cached writes that need to be flushed, or

	2. It's just a bug in one particular drive firmware revision,
	Linux is doing the right thing with FLUSH/FUA and the firmware
	is not.
For the bad WD Green/Red firmware it's much simpler: those firmware revisions fail while the drive is not showing any symptoms of defects. AFAIK there's nothing happening on these drives for Linux code to get confused about that doesn't also happen on every other drive firmware.

Maybe it's a firmware bug WD already fixed back in 2014, and it just takes a decade for all the old drives to work their way through the supply chain and service lifetime.

> The problem is always a ghost to chase, extra info would greatly help us
> to pin it down.

This lack of information is a bit frustrating. It's not particularly hard or expensive to collect this data, but I've had to collect it myself because I don't know of any reliable source I could buy it from.

I found two bad firmwares by accident when I wasn't looking for bad firmware. If I'd known where to look, I could have found them much faster: I had the necessary failure event observations within a few months after starting the first btrfs pilot projects, but I wasn't expecting to find firmware bugs, so I didn't recognize them until there were double-digit failure counts.

WD Green and Black are low-cost consumer hard drives under $250. One drive of each size in both product ranges comes to a total price of around $1200 on Amazon. Lots of end users will have these drives, and some of them will want to use btrfs, but some of the drives apparently do not have working write caching. We should at least know which ones those are, and maybe make a kernel blacklist to disable the write caching feature on some firmware versions by default.

A modestly funded deliberate search project could build a map of firmware reliability in currently shipping retail hard drives from all three big vendors, and keep it updated as new firmware revisions come out. Sort of like Backblaze's hard drive reliability stats, except you don't need a thousand drives to test firmware--one or two will suffice most of the time [3].
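Pending a kernel-side blacklist, the same policy is easy to prototype in userspace. A hedged sketch--the table holds just the firmware revisions identified above, and the helper is hypothetical, not an existing tool:

```python
# Hypothetical userspace stand-in for the proposed kernel blacklist: emit an
# "hdparm -W0" command for each attached drive whose (model, firmware) pair
# is known to misbehave with write cache enabled.  Feed it tuples scraped
# from "smartctl -i" output and run the emitted commands at boot.
BAD_WRITE_CACHE_FIRMWARE = {
    ("WDC WD20EZRX-00DC0B0", "80.00A80"),   # WD Green 2TB
    ("WDC WD40EFRX-68WT0N0", "80.00A80"),   # WD Red 4TB
    ("WDC WD1002FAEX-00Z3A0", "05.01D05"),  # WD Caviar Black
}

def write_cache_disable_commands(drives):
    """drives: iterable of (device_node, model, firmware) tuples."""
    return [
        f"hdparm -W0 {dev}"
        for dev, model, fw in sorted(drives)
        if (model, fw) in BAD_WRITE_CACHE_FIRMWARE
    ]
```

A kernel blacklist would be strictly better (it would protect users who never installed the script), but this is the kind of table it would contain.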
The data can probably be scraped from end user reports (if you have enough of them to filter out noise) and existing ops logs (better, if their methodology is sound) too.

> Thanks,
> Qu

[1] Pedants will notice that some of these drive firmwares range in age from 6 months to 7 years, and neither of those numbers is 5 years, and the power failure rate is implausibly high for a data center environment. Some of the devices live in offices and laptops, and the power failures are not evenly distributed across the fleet. It's entirely possible that some newer device in the 0-failures list will fail horribly next week. Most of the NAS and DC devices and all the SSDs have not had any UNC sector events in the fleet yet, and they could still turn out to be ticking time bombs like the WD Black once they start to grow sector defects. The data does _not_ say that all of those 0-failure firmwares are bug free under identical conditions--it says that, in a race to be the first ever firmware to demonstrate bad behavior, the firmwares in the 0-failures list haven't left the starting line yet, while the 2 firmwares in the multi-failures list both seem to be trying to _win_.

[2] We had a few surviving Seagate Barracudas in 2016, but over 85% of those built before 2015 had failed by 2016, and none of the survivors are still online today. In practical terms, it doesn't matter if a pre-2015 Barracuda has correct power-failing write-cache behavior when the drive hardware typically dies more often than the host's office has power interruptions.

[3] OK, maybe it IS hard to find WD Black drives to test at the _exact_ moment they are remapping UNC sectors...tap one gently with a hammer, maybe, or poke a hole in the air filter to let a bit of dust in?

> > After turning off write caching, btrfs can keep running on these problem
> > drive models until they get too old and broken to spin up any more.
> > With write caching turned on, these drive models will eat a btrfs every
> > few months.
> > > > > >> Or even use ZFS instead... > >> > >> Am 11/06/2019 um 15:02 schrieb Qu Wenruo: > >>> > >>> On 2019/6/11 下午6:53, claudius@winca.de wrote: > >>>> HI Guys, > >>>> > >>>> you are my last try. I was so happy to use BTRFS but now i really hate > >>>> it.... > >>>> > >>>> > >>>> Linux CIA 4.15.0-51-generic #55-Ubuntu SMP Wed May 15 14:27:21 UTC 2019 > >>>> x86_64 x86_64 x86_64 GNU/Linux > >>>> btrfs-progs v4.15.1 > >>> So old kernel and old progs. > >>> > >>>> btrfs fi show > >>>> Label: none uuid: 9622fd5c-5f7a-4e72-8efa-3d56a462ba85 > >>>> Total devices 1 FS bytes used 4.58TiB > >>>> devid 1 size 7.28TiB used 4.59TiB path /dev/mapper/volume1 > >>>> > >>>> > >>>> dmesg > >>>> > >>>> [57501.267526] BTRFS info (device dm-5): trying to use backup root at > >>>> mount time > >>>> [57501.267528] BTRFS info (device dm-5): disk space caching is enabled > >>>> [57501.267529] BTRFS info (device dm-5): has skinny extents > >>>> [57507.511830] BTRFS error (device dm-5): parent transid verify failed > >>>> on 2069131051008 wanted 4240 found 5115 > >>> Some metadata CoW is not recorded correctly. > >>> > >>> Hopes you didn't every try any btrfs check --repair|--init-* or anything > >>> other than --readonly. > >>> As there is a long exiting bug in btrfs-progs which could cause similar > >>> corruption. 
> >>> > >>> > >>> > >>>> [57507.518764] BTRFS error (device dm-5): parent transid verify failed > >>>> on 2069131051008 wanted 4240 found 5115 > >>>> [57507.519265] BTRFS error (device dm-5): failed to read block groups: -5 > >>>> [57507.605939] BTRFS error (device dm-5): open_ctree failed > >>>> > >>>> > >>>> btrfs check /dev/mapper/volume1 > >>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115 > >>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115 > >>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115 > >>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115 > >>>> Ignoring transid failure > >>>> extent buffer leak: start 2024985772032 len 16384 > >>>> ERROR: cannot open file system > >>>> > >>>> > >>>> > >>>> im not able to mount it anymore. > >>>> > >>>> > >>>> I found the drive in RO the other day and realized somthing was wrong > >>>> ... i did a reboot and now i cant mount anmyore > >>> Btrfs extent tree must has been corrupted at that time. > >>> > >>> Full recovery back to fully RW mountable fs doesn't look possible. > >>> As metadata CoW is completely screwed up in this case. > >>> > >>> Either you could use btrfs-restore to try to restore the data into > >>> another location. > >>> > >>> Or try my kernel branch: > >>> https://github.com/adam900710/linux/tree/rescue_options > >>> > >>> It's an older branch based on v5.1-rc4. > >>> But it has some extra new mount options. > >>> For your case, you need to compile the kernel, then mount it with "-o > >>> ro,rescue=skip_bg,rescue=no_log_replay". > >>> > >>> If it mounts (as RO), then do all your salvage. > >>> It should be a faster than btrfs-restore, and you can use all your > >>> regular tool to backup. > >>> > >>> Thanks, > >>> Qu > >>> > >>>> > >>>> any help > [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread

* Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible) 2019-06-23 20:45 btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible) Zygo Blaxell @ 2019-06-24 0:46 ` Qu Wenruo 2019-06-24 4:29 ` Zygo Blaxell 2019-06-24 17:31 ` Chris Murphy 2019-06-24 2:45 ` Remi Gauvin 1 sibling, 2 replies; 10+ messages in thread From: Qu Wenruo @ 2019-06-24 0:46 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs [-- Attachment #1.1: Type: text/plain, Size: 21201 bytes --] On 2019/6/24 上午4:45, Zygo Blaxell wrote: > On Thu, Jun 20, 2019 at 01:00:50PM +0800, Qu Wenruo wrote: >> On 2019/6/20 上午7:45, Zygo Blaxell wrote: >>> On Sun, Jun 16, 2019 at 12:05:21AM +0200, Claudius Winkel wrote: >>>> What should I do now ... to use btrfs safely? Should i not use it with >>>> DM-crypt >>> >>> You might need to disable write caching on your drives, i.e. hdparm -W0. >> >> This is quite troublesome. >> >> Disabling write cache normally means performance impact. > > The drives I've found that need write cache disabled aren't particularly > fast to begin with, so disabling write cache doesn't harm their > performance very much. All the speed gains of write caching are lost > when someone has to spend time doing a forced restore from backup after > transid-verify failure. If you really do need performance, there are > drives with working firmware available that don't cost much more. > >> And disabling it normally would hide the true cause (if it's something >> btrfs' fault). > > This is true; however, even if a hypothetical btrfs bug existed, > disabling write caching is an immediately deployable workaround, and > there's currently no other solution other than avoiding drives with > bad firmware. 
> > There could be improvements possible for btrfs to work around bad > firmware...if someone's willing to donate their sanity to get inside > the heads of firmware bugs, and can find a way to fix it that doesn't > make things worse for everyone with working firmware. > >>> I have a few drives in my collection that don't have working write cache. >>> They are usually fine, but when otherwise minor failure events occur (e.g. >>> bad cables, bad power supply, failing UNC sectors) then the write cache >>> doesn't behave correctly, and any filesystem or database on the drive >>> gets trashed. >> >> Normally this shouldn't be the case, as long as the fs has correct >> journal and flush/barrier. > > If you are asking the question: > > "Are there some currently shipping retail hard drives that are > orders of magnitude more likely to corrupt data after simple > power failures than other drives?" > > then the answer is: > > "Hell, yes! How could there NOT be?" > > It wouldn't take very much capital investment or time to find this out > in lab conditions. Just kill power every 25 minutes while running a > btrfs stress-test should do it--or have a UPS hardware failure in ops, > the effect is the same. Bad drives will show up in a few hours, good > drives take much longer--long enough that, statistically, the good drives > will probably fail outright before btrfs gets corrupted. Now it sounds like we really need a good way to do such tests: something more controlled and elegant than just random power failures. > >> If it's really the hardware to blame, then it means its flush/fua is not >> implemented properly at all, thus the possibility of a single power loss >> leading to corruption should be VERY VERY high. > > That exactly matches my observations. Only a few disks fail at all, > but the ones that do fail do so very often: 60% of corruptions at > 10 power failures or less, 100% at 30 power failures or more. 
> >>> This isn't normal behavior, but the problem does affect >>> the default configuration of some popular mid-range drive models from >>> top-3 hard disk vendors, so it's quite common. >> >> Would you like to share the info and test methodology to determine it's >> the device to blame? (maybe in another thread) > > It's basic data mining on operations failure event logs. > > We track events like filesystem corruption, data loss, other hardware > failure, operator errors, power failures, system crashes, dmesg error > messages, etc., and count how many times each failure occurs in systems > with which hardware components. When a failure occurs, we break the > affected system apart and place its components into other systems or > test machines to isolate which component is causing the failure (e.g. a > failing power supply could create RAM corruption events and disk failure > events, so we move the hardware around to see where the failure goes). > If the same component is involved in repeatable failure events, the > correlation jumps out of the data and we know that component is bad. > We can also do correlations by attributes of the components, i.e. vendor, > model, size, firmware revision, manufacturing date, and correlate > vendor-model-size-firmware to btrfs transid verify failures across > a fleet of different systems. 
> > I can go to the data and get a list of all the drive model and firmware > revisions that have been installed in machines with 0 "parent transid > verify failed" events since 2014, and are still online today: > > Device Model: CT240BX500SSD1 Firmware Version: M6CR013 > Device Model: Crucial_CT1050MX300SSD1 Firmware Version: M0CR060 > Device Model: HP SSD S700 Pro 256GB Firmware Version: Q0824G > Device Model: INTEL SSDSC2KW256G8 Firmware Version: LHF002C > Device Model: KINGSTON SA400S37240G Firmware Version: R0105A > Device Model: ST12000VN0007-2GS116 Firmware Version: SC60 > Device Model: ST5000VN0001-1SF17X Firmware Version: AN02 > Device Model: ST8000VN0002-1Z8112 Firmware Version: SC61 > Device Model: TOSHIBA-TR200 Firmware Version: SBFA12.2 > Device Model: WDC WD121KRYZ-01W0RB0 Firmware Version: 01.01H01 > Device Model: WDC WDS250G2B0A-00SM50 Firmware Version: X61190WD > Model Family: SandForce Driven SSDs Device Model: KINGSTON SV300S37A240G Firmware Version: 608ABBF0 > Model Family: Seagate IronWolf Device Model: ST10000VN0004-1ZD101 Firmware Version: SC60 > Model Family: Seagate NAS HDD Device Model: ST4000VN000-1H4168 Firmware Version: SC44 > Model Family: Seagate NAS HDD Device Model: ST8000VN0002-1Z8112 Firmware Version: SC60 > Model Family: Toshiba 2.5" HDD MK..59GSXP (AF) Device Model: TOSHIBA MK3259GSXP Firmware Version: GN003J > Model Family: Western Digital Gold Device Model: WDC WD101KRYZ-01JPDB0 Firmware Version: 01.01H01 > Model Family: Western Digital Green Device Model: WDC WD10EZRX-00L4HB0 Firmware Version: 01.01A01 > Model Family: Western Digital Re Device Model: WDC WD2000FYYZ-01UL1B1 Firmware Version: 01.01K02 > Model Family: Western Digital Red Device Model: WDC WD50EFRX-68MYMN1 Firmware Version: 82.00A82 > Model Family: Western Digital Red Device Model: WDC WD80EFZX-68UW8N0 Firmware Version: 83.H0A83 > Model Family: Western Digital Red Pro Device Model: WDC WD6002FFWX-68TZ4N0 Firmware Version: 83.H0A83 At least there are a lot of GOOD 
disks, what a relief. > > So far so good. The above list of drive model-vendor-firmware have > collectively had hundreds of drive-power-failure events in the last 5 > years, so we have been giving the firmware a fair workout [1]. > > Now let's look for some bad stuff. How about a list of drives that were > involved in parent transid verify failure events occurring within 1-10 > power cycles after mkfs events: > > Model Family: Western Digital Green Device Model: WDC WD20EZRX-00DC0B0 Firmware Version: 80.00A80 > > Change the query to 1-30 power cycles, and we get another model with > the same firmware version string: > > Model Family: Western Digital Red Device Model: WDC WD40EFRX-68WT0N0 Firmware Version: 80.00A80 > > Removing the upper bound on power cycle count doesn't find any more. > > The drives running 80.00A80 are all in fairly similar condition: no errors > in SMART, the drive was apparently healthy at the time of failure (no > unusual speed variations, no unexpected drive resets, or any of the other > things that happen to these drives as they age and fail, but that are > not reported as official errors on the models without TLER). There are > multiple transid-verify failures logged in multiple very different host > systems (e.g. Intel 1U server in a data center, AMD desktop in an office, > hardware ages a few years apart). This is a consistent and repeatable > behavior that does not correlate to any other attribute. > > Now, if you've been reading this far, you might wonder why the previous > two ranges were lower-bounded at 1 power cycle, and the reason is because > I have another firmware in the data set with _zero_ power cycles between > mkfs and failure: > > Model Family: Western Digital Caviar Black Device Model: WDC WD1002FAEX-00Z3A0 Firmware Version: 05.01D05 > > These drives have 0 power fail events between mkfs and "parent transid > verify failed" events, i.e. 
it's not necessary to have a power failure > at all for these drives to unrecoverably corrupt btrfs. In all cases the > failure occurs on the same days as "Current Pending Sector" and "Offline > UNC sector" SMART events. The WD Black firmware seems to be OK with write > cache enabled most of the time (there's years in the log data without any > transid-verify failures), but the WD Black will drop its write cache when > it sees a UNC sector, and btrfs notices the failure a few hours later. > > Recently I've been asking people on IRC who present btrfs filesystems > with transid-verify failures (excluding those with obvious symptoms of > host RAM failure). So far all the users who have participated in this > totally unscientific survey have WD Green 2TB and WD Black hard drives > with the same firmware revisions as above. The most recent report was > this week. I guess there are lot of drives with these firmwares still > in inventories out there. > > The data says there's at least 2 firmware versions in the wild which > have 100% of the btrfs transid-verify failures. These are only 8% > of the total fleet of disks in my data set, but they are punching far > above their weight in terms of failure event count. > > I first observed these correlations back in 2016. We had a lot of WD > Green and Black drives in service at the time--too many to replace or > upgrade them all early--so I looked for a workaround to force the > drives to behave properly. Since it looked like a write ordering issue, > I disabled the write cache on drives with these firmware versions, and > found that the transid-verify filesystem failures stopped immediately > (they had been bi-weekly events with write cache enabled). So the worst-case scenario really does happen in the real world: badly implemented flush/fua in firmware. Btrfs has no way to fix such a low-level problem. BTW, have you seen any corruption using the bad drives (with write cache enabled) with a traditional journal-based fs like XFS/EXT4? 
Btrfs relies more heavily on the hardware implementing barrier/flush properly, or its CoW can easily be ruined. If the firmware is only tested (if it is tested at all) against such filesystems, that may be the vendor's problem. > > That was 3 years ago, and there are no new transid-verify failures > logged since then. The drives are still online today with filesystems > mkfsed in 2016. > > One bias to be aware of from this data set: it goes back further than 5 > years, and we use the data to optimize hardware costs including the cost > of ops failures. You might notice there are no Seagate Barracudas[2] in > the data, while there are the similar WD models. In an unbiased sample > of hard drives, there are likely to be more bad firmware revisions than > found in this data set. I found 2, and that's a lower bound on the real > number out there. > >> Your idea on hardware's faulty FLUSH/FUA implementation could definitely >> cause exactly the same problem, but the last time I asked similar >> problem to fs-devel, there is no proof for such possibility. > > Well, correlation isn't proof, it's true; however, if a behavior looks > like a firmware bug, and quacks like a firmware bug, and is otherwise > indistinguishable from a firmware bug, then it's probably a firmware bug. > > I don't know if any of these problems are really device firmware bugs or > Linux bugs, particularly in the WD Black case. That's a question for > someone who can collect some of these devices and do deeper analysis. > > In particular, my data is not sufficient to rule out either of these two > theories for the WD Black: > > 1. Linux doesn't use FLUSH/FUA correctly when there are IO errors > / drive resets / other things that happen around the times that > drives have bad sectors, but it is OK as long as there are no > cached writes that need to be flushed, or > > 2. It's just a bug in one particular drive firmware revision, > Linux is doing the right thing with FLUSH/FUA and the firmware > is not. 
> For the bad WD Green/Red firmware it's much simpler: those firmware > revisions fail while the drive is not showing any symptoms of defects. > AFAIK there's nothing happening on these drives for Linux code to get > confused about that doesn't also happen on every other drive firmware. > > Maybe it's a firmware bug WD already fixed back in 2014, and it just > takes a decade for all the old drives to work their way through the > supply chain and service lifetime. > >> The problem is always a ghost to chase, extra info would greatly help us >> to pin it down. > > This lack of information is a bit frustrating. It's not particularly > hard or expensive to collect this data, but I've had to collect it > myself because I don't know of any reliable source I could buy it from. > > I found two bad firmwares by accident when I wasn't looking for bad > firmware. If I'd known where to look, I could have found them much > faster: I had the necessary failure event observations within a few > months after starting the first btrfs pilot projects, but I wasn't > expecting to find firmware bugs, so I didn't recognize them until there > were double-digit failure counts. > > WD Green and Black are low-cost consumer hard drives under $250. > One drive of each size in both product ranges comes to a total price > of around $1200 on Amazon. Lots of end users will have these drives, > and some of them will want to use btrfs, but some of the drives apparently > do not have working write caching. We should at least know which ones > those are, maybe make a kernel blacklist to disable the write caching > feature on some firmware versions by default. To me, the problem isn't finding someone to test these drives, but how convincing the test methodology is and how accessible the test devices would be. Your statistics carry a lot of weight, but it took years and tons of disks to expose the problem; that's not something that can be reproduced easily. 
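The property to verify is at least easy to state in software. A toy model (completely hypothetical, just to show what a flush/fua test harness must check) of a drive that lies about FLUSH:

```python
import random

class ToyDrive:
    """Toy model of a drive with a volatile write cache.  If honors_flush is
    False, FLUSH is acknowledged without draining the cache, so a power cut
    can keep a later write (the superblock) while losing an earlier one (the
    tree block it points to) -- the transid-verify failure pattern."""
    def __init__(self, honors_flush):
        self.honors_flush = honors_flush
        self.media = {}   # durable storage: lba -> data
        self.cache = []   # volatile write cache: (lba, data)

    def write(self, lba, data):
        self.cache.append((lba, data))

    def flush(self):
        if self.honors_flush:
            for lba, data in self.cache:
                self.media[lba] = data
            self.cache.clear()
        # else: lie -- acknowledge the flush but keep the data volatile

    def power_cut(self):
        # Whatever was still volatile lands on media as an arbitrary
        # subset; model that with a coin flip per cached write.
        for lba, data in self.cache:
            if random.random() < 0.5:
                self.media[lba] = data
        self.cache.clear()

def commit_then_cut(drive):
    """One CoW-style commit: write a tree block, FLUSH, write a superblock
    pointing at it, FLUSH, then cut power.  Returns True iff consistent:
    the superblock must never point at a tree block missing from media."""
    drive.write(1000, "tree@gen2")
    drive.flush()
    drive.write(0, "super->tree@gen2")
    drive.flush()
    drive.power_cut()
    sb_on_media = drive.media.get(0) == "super->tree@gen2"
    tree_on_media = drive.media.get(1000) == "tree@gen2"
    return (not sb_on_media) or tree_on_media
```

A drive that honors FLUSH passes this check for any number of power cuts; a drive that lies fails within a handful of tries, which matches the "VERY VERY high" single-power-loss corruption probability discussed above. The hard part, as you say, is driving a real disk through the power cut, not the check itself.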
On the other hand, if we're going to reproduce power failure quickly and reliably in a lab environment, then how? A software-based SATA power cutoff? Or a hardware-controllable SATA power cable? And how to make sure it's the flush/fua not implemented properly? It may take us quite some time to start a similar project (maybe need extra hardware development). But indeed, a project to do 3rd-party SATA hard disk testing looks very interesting for my hackweek project next year. Thanks, Qu > > A modestly funded deliberate search project could build a map of firmware > reliability in currently shipping retail hard drives from all three > big vendors, and keep it updated as new firmware revisions come out. > Sort of like Backblaze's hard drive reliability stats, except you don't > need a thousand drives to test firmware--one or two will suffice most of > the time [3]. The data can probably be scraped from end user reports > (if you have enough of them to filter out noise) and existing ops logs > (better, if their methodology is sound) too. > > > >> Thanks, >> Qu > > [1] Pedants will notice that some of these drive firmwares range in age > from 6 months to 7 years, and neither of those numbers is 5 years, and > the power failure rate is implausibly high for a data center environment. > Some of the devices live in offices and laptops, and the power failures > are not evenly distributed across the fleet. It's entirely possible that > some newer device in the 0-failures list will fail horribly next week. > Most of the NAS and DC devices and all the SSDs have not had any UNC > sector events in the fleet yet, and they could still turn out to be > ticking time bombs like the WD Black once they start to grow sector > defects. 
The data does _not_ say that all of those 0-failure firmwares > are bug free under identical conditions--it says that, in a race to > be the first ever firmware to demonstrate bad behavior, the firmwares > in the 0-failures list haven't left the starting line yet, while the 2 > firmwares in the multi-failures list both seem to be trying to _win_. > > [2] We had a few surviving Seagate Barracudas in 2016, but over 85% of > those built before 2015 had failed by 2016, and none of the survivors > are still online today. In practical terms, it doesn't matter if a > pre-2015 Barracuda has correct power-failing write-cache behavior when > the drive hardware typically dies more often than the host's office has > power interruptions. > > [3] OK, maybe it IS hard to find WD Black drives to test at the _exact_ > moment they are remapping UNC sectors...tap one gently with a hammer, > maybe, or poke a hole in the air filter to let a bit of dust in? > >>> After turning off write caching, btrfs can keep running on these problem >>> drive models until they get too old and broken to spin up any more. >>> With write caching turned on, these drive models will eat a btrfs every >>> few months. >>> >>> >>>> Or even use ZFS instead... >>>> >>>> Am 11/06/2019 um 15:02 schrieb Qu Wenruo: >>>>> >>>>> On 2019/6/11 下午6:53, claudius@winca.de wrote: >>>>>> HI Guys, >>>>>> >>>>>> you are my last try. I was so happy to use BTRFS but now i really hate >>>>>> it.... >>>>>> >>>>>> >>>>>> Linux CIA 4.15.0-51-generic #55-Ubuntu SMP Wed May 15 14:27:21 UTC 2019 >>>>>> x86_64 x86_64 x86_64 GNU/Linux >>>>>> btrfs-progs v4.15.1 >>>>> So old kernel and old progs. 
>>>>> >>>>>> btrfs fi show >>>>>> Label: none uuid: 9622fd5c-5f7a-4e72-8efa-3d56a462ba85 >>>>>> Total devices 1 FS bytes used 4.58TiB >>>>>> devid 1 size 7.28TiB used 4.59TiB path /dev/mapper/volume1 >>>>>> >>>>>> >>>>>> dmesg >>>>>> >>>>>> [57501.267526] BTRFS info (device dm-5): trying to use backup root at >>>>>> mount time >>>>>> [57501.267528] BTRFS info (device dm-5): disk space caching is enabled >>>>>> [57501.267529] BTRFS info (device dm-5): has skinny extents >>>>>> [57507.511830] BTRFS error (device dm-5): parent transid verify failed >>>>>> on 2069131051008 wanted 4240 found 5115 >>>>> Some metadata CoW is not recorded correctly. >>>>> >>>>> Hopes you didn't every try any btrfs check --repair|--init-* or anything >>>>> other than --readonly. >>>>> As there is a long exiting bug in btrfs-progs which could cause similar >>>>> corruption. >>>>> >>>>> >>>>> >>>>>> [57507.518764] BTRFS error (device dm-5): parent transid verify failed >>>>>> on 2069131051008 wanted 4240 found 5115 >>>>>> [57507.519265] BTRFS error (device dm-5): failed to read block groups: -5 >>>>>> [57507.605939] BTRFS error (device dm-5): open_ctree failed >>>>>> >>>>>> >>>>>> btrfs check /dev/mapper/volume1 >>>>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115 >>>>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115 >>>>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115 >>>>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115 >>>>>> Ignoring transid failure >>>>>> extent buffer leak: start 2024985772032 len 16384 >>>>>> ERROR: cannot open file system >>>>>> >>>>>> >>>>>> >>>>>> im not able to mount it anymore. >>>>>> >>>>>> >>>>>> I found the drive in RO the other day and realized somthing was wrong >>>>>> ... i did a reboot and now i cant mount anmyore >>>>> Btrfs extent tree must has been corrupted at that time. >>>>> >>>>> Full recovery back to fully RW mountable fs doesn't look possible. 
>>>>> As metadata CoW is completely screwed up in this case. >>>>> >>>>> Either you could use btrfs-restore to try to restore the data into >>>>> another location. >>>>> >>>>> Or try my kernel branch: >>>>> https://github.com/adam900710/linux/tree/rescue_options >>>>> >>>>> It's an older branch based on v5.1-rc4. >>>>> But it has some extra new mount options. >>>>> For your case, you need to compile the kernel, then mount it with "-o >>>>> ro,rescue=skip_bg,rescue=no_log_replay". >>>>> >>>>> If it mounts (as RO), then do all your salvage. >>>>> It should be a faster than btrfs-restore, and you can use all your >>>>> regular tool to backup. >>>>> >>>>> Thanks, >>>>> Qu >>>>> >>>>>> >>>>>> any help >> > > > [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread

* Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible) 2019-06-24 0:46 ` Qu Wenruo @ 2019-06-24 4:29 ` Zygo Blaxell 2019-06-24 5:39 ` Qu Wenruo 2019-06-24 17:31 ` Chris Murphy 1 sibling, 1 reply; 10+ messages in thread From: Zygo Blaxell @ 2019-06-24 4:29 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 14745 bytes --] On Mon, Jun 24, 2019 at 08:46:06AM +0800, Qu Wenruo wrote: > On 2019/6/24 上午4:45, Zygo Blaxell wrote: > > On Thu, Jun 20, 2019 at 01:00:50PM +0800, Qu Wenruo wrote: > >> On 2019/6/20 上午7:45, Zygo Blaxell wrote: [...] > So the worst scenario really happens in real world, badly implemented > flush/fua from firmware. > Btrfs has no way to fix such low level problem. > > BTW, do you have any corruption using the bad drivers (with write cache) > with traditional journal based fs like XFS/EXT4? Those filesystems don't make full-filesystem data integrity guarantees like btrfs does, and there's no ext4 equivalent of dup metadata for self-repair (even metadata csums in ext4 are a recent invention). Ops didn't record failure events when e2fsck quietly repairs unexpected filesystem inconsistencies. On ext3, maybe data corruption happens because of drive firmware bugs, or maybe the application just didn't use fsync properly. Maybe two disks in md-RAID1 have different contents because they had slightly different IO timings. Who knows? There's no way to tell from passive ops failure monitoring. On btrfs with flushoncommit, every data anomaly (e.g. backups not matching origin hosts, obviously corrupted files, scrub failures, etc) is a distinct failure event. Differences between disk contents in RAID1 arrays are failure events. We can put disks with two different firmware versions in a RAID1 pair, and btrfs will tell us if they disagree, use the correct one to fix the broken one, or tell us they're both wrong and it's time to warm up the backups. 
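The arbitration described above — trust whichever mirror matches the expected checksum, repair the other, and give up only when both are wrong — can be sketched abstractly. This is a conceptual model, not btrfs's actual repair path (btrfs uses crc32c for metadata; plain crc32 stands in here):

```python
import zlib

def read_mirrored(copy_a, copy_b, expected_csum):
    """Return (data, verdict) for one mirrored block, checksum-arbitrated."""
    ok_a = zlib.crc32(copy_a) == expected_csum
    ok_b = zlib.crc32(copy_b) == expected_csum
    if ok_a and ok_b:
        return copy_a, "mirrors agree"
    if ok_a:
        return copy_a, "repair mirror B from A"
    if ok_b:
        return copy_b, "repair mirror A from B"
    # Neither copy matches the checksum stored in the (checksummed) parent.
    return None, "both mirrors bad: time to warm up the backups"

good = b"metadata block, generation 4240"
bad  = b"metadata block, generation 5115"
csum = zlib.crc32(good)
data, verdict = read_mirrored(good, bad, csum)
print(verdict)
```

Because the verdict is computed per block against an independently stored checksum, every disagreement between mirrors becomes a distinct, attributable failure event — which is exactly what makes the passive ops monitoring described here possible.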
In 2013 I had some big RAID10 arrays of WD Green 2TB disks using ext3/4 and mdadm, and there were a *lot* of data corruption events. So many events that we didn't have the capacity to investigate them before new ones came in. File restore requests for corrupted data were piling up faster than they could be processed, and we had no systematic way to tell whether the origin or backup file was correct when they were different. Those problems eventually expedited our migration to btrfs, because btrfs let us do deeper and more uniform data collection to see where all the corruption was coming from. While changing filesystems, we moved all the data onto new disks that happened to not have firmware bugs, and all the corruption abruptly disappeared (well, except for data corrupted by bugs in btrfs itself, but now those are fixed too). We didn't know what was happening until years later when the smaller/cheaper systems had enough failures to make noticeable patterns. I would not be surprised if we were having firmware corruption problems with ext3/ext4 the whole time those RAID10 arrays existed. Alas, we were not capturing firmware revision data at the time (only vendor/model), and we only started capturing firmware revisions after all the old drives were recycled. I don't know exactly what firmware versions were in those arrays...though I do have a short list of suspects. ;) > Btrfs is relying more the hardware to implement barrier/flush properly, > or CoW can be easily ruined. > If the firmware is only tested (if tested) against such fs, it may be > the problem of the vendor. [...] > > WD Green and Black are low-cost consumer hard drives under $250. > > One drive of each size in both product ranges comes to a total price > > of around $1200 on Amazon. Lots of end users will have these drives, > > and some of them will want to use btrfs, but some of the drives apparently > > do not have working write caching. 
We should at least know which ones > > those are, maybe make a kernel blacklist to disable the write caching > > feature on some firmware versions by default. > > To me, the problem isn't for anyone to test these drivers, but how > convincing the test methodology is and how accessible the test device > would be. > > Your statistic has a lot of weight, but it takes you years and tons of > disks to expose it, not something can be reproduced easily. > > On the other hand, if we're going to reproduce power failure quickly and > reliably in a lab enivronment, then how? > Software based SATA power cutoff? Or hardware controllable SATA power cable? You might be overthinking this a bit. Software-controlled switched PDUs (or if you're a DIY enthusiast, some PowerSwitch Tails and a Raspberry Pi) can turn the AC power on and off on a test box. Get a cheap desktop machine, put as many different drives into it as it can hold, start writing test patterns, kill mains power to the whole thing, power it back up, analyze the data that is now present on disk, log the result over the network, repeat. This is the most accurate simulation, since it replicates all the things that happen during a typical end-user's power failure, only much more often. Hopefully all the hardware involved is designed to handle this situation already. A standard office PC is theoretically designed for 1000 cycles (200 working days over 5 years) and should be able to test 60 drives (6 SATA ports, 10 sets of drives tested 100 cycles each). The hardware is all standard equipment in any IT department. You only need special-purpose hardware if the general-purpose stuff is failing in ways that aren't interesting (e.g. host RAM is corrupted during writes so the drive writes garbage, or the power supply breaks before 1000 cycles). Some people build elaborate hard disk torture rigs that mess with input voltages, control temperature and vibration, etc. 
to try to replicate the effects of aging, but these setups aren't representative of typical end-user environments and the results will only be interesting to hardware makers. We expect most drives to work and it seems that they do most of the time--it is the drives that fail most frequently that are interesting. The drives that fail most frequently are also the easiest to identify in testing--by definition, they will reproduce failures faster than the others. Even if there is an intermittent firmware bug that only appears under rare conditions, if it happens with lower probability than drive hardware failure then it's not particularly important. The target hardware failure rate for hard drives is 0.1% over the warranty period according to the specs for many models. If one drive's hardware is going to fail with p < 0.001, then maybe the firmware bug makes it lose data at p = 0.00075 instead of p = 0.00050. Users won't care about this--they'll use RAID to contain the damage, or just accept the failure risks of a single-disk system. Filesystem failures that occur after the drive has degraded to the point of being unusable are not interesting at all. > And how to make sure it's the flush/fua not implemented properly? Is it necessary? The drive could write garbage on the disk, or write correct data to the wrong physical location, when the voltage drops at the wrong time. The drive electronics/firmware are supposed to implement measures to prevent that, and who knows whether they try, and whether they are successful? The data corruption that results from the above events is technically not a flush/fua failure, since it's not a write reordering or a premature command completion notification to the host, but it's still data corruption on power failure. Drives can fail in multiple ways, and it's hard (even for hard disk engineering teams) to really know what is going on while the power supply goes out of spec. 
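The difference between an honored and an ignored FLUSH is easy to see in a toy model. This is a deliberate simplification (one tree block, one superblock, a bare transid as the generation counter), not btrfs code — but it reproduces the "parent transid verify failed" symptom exactly when the cache lies:

```python
class Drive:
    """Toy drive: a volatile write cache in front of stable media."""
    def __init__(self, honors_flush=True):
        self.media, self.cache = {}, {}
        self.honors_flush = honors_flush

    def write(self, lba, data):    # acknowledged from cache, not media
        self.cache[lba] = data

    def flush(self):               # buggy firmware acknowledges a no-op
        if self.honors_flush:
            self.drain()

    def drain(self):               # idle-time destaging to the platter
        self.media.update(self.cache)
        self.cache.clear()

    def power_cut(self):
        if self.cache:
            # Model write reordering: only the most recently written block
            # happens to reach the platter before the power dies.
            last = list(self.cache)[-1]
            self.media[last] = self.cache[last]
        self.cache.clear()

def commit(drive, transid):
    drive.write("tree", transid)    # CoW: write the new tree first...
    drive.flush()                   # ...barrier...
    drive.write("super", transid)   # ...then point the superblock at it
    drive.flush()

def mount_check(drive):
    want, found = drive.media.get("super"), drive.media.get("tree")
    if want != found:
        return "parent transid verify failed: wanted %s found %s" % (want, found)
    return "ok"

for honors in (True, False):
    d = Drive(honors_flush=honors)
    commit(d, 5115)     # one clean commit...
    d.drain()           # ...whose cache empties during idle time
    commit(d, 5116)     # power dies right after this commit "completes"
    d.power_cut()
    print("honors_flush=%-5s -> %s" % (honors, mount_check(d)))
```

With an honest FLUSH the superblock can never get ahead of the tree it points at; with the no-op FLUSH, the superblock survives while the new tree is lost, and the next mount sees exactly the transid mismatch reported at the start of this thread.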
To an end user, it doesn't matter why the drive fails, only that it does fail. Once you have *enough* drives, some of them are always failing, and it just becomes a question of balancing the different risks and mitigation costs (i.e. pick a drive that doesn't fail so much, and a filesystem that tolerates the failure modes that happen to average or better drives, and maybe use RAID1 with a mix of drive vendors to avoid having both mirrors hit by a common firmware bug). To make sure btrfs is using flush/fua correctly, log the sequence of block writes and fua/flush commands, then replay that sequence one operation at a time, and make sure the filesystem correctly recovers after each operation. That doesn't need or even want hardware, though--it's better work for a VM that can operate on block-level snapshots of the filesystem. > It may take us quite some time to start a similar project (maybe need > extra hardware development). > > But indeed, a project to do 3rd-party SATA hard disk testing looks very > interesting for my next year hackweek project. > > Thanks, > Qu > > > > > A modestly funded deliberate search project could build a map of firmware > > reliability in currently shipping retail hard drives from all three > > big vendors, and keep it updated as new firmware revisions come out. > > Sort of like Backblaze's hard drive reliability stats, except you don't > > need a thousand drives to test firmware--one or two will suffice most of > > the time [3]. The data can probably be scraped from end user reports > > (if you have enough of them to filter out noise) and existing ops logs > > (better, if their methodology is sound) too. > > > > > > > >> Thanks, > >> Qu > > > > [1] Pedants will notice that some of these drive firmwares range in age > > from 6 months to 7 years, and neither of those numbers is 5 years, and > > the power failure rate is implausibly high for a data center environment. 
> > Some of the devices live in offices and laptops, and the power failures > > are not evenly distributed across the fleet. It's entirely possible that > > some newer device in the 0-failures list will fail horribly next week. > > Most of the NAS and DC devices and all the SSDs have not had any UNC > > sector events in the fleet yet, and they could still turn out to be > > ticking time bombs like the WD Black once they start to grow sector > > defects. The data does _not_ say that all of those 0-failure firmwares > > are bug free under identical conditions--it says that, in a race to > > be the first ever firmware to demonstrate bad behavior, the firmwares > > in the 0-failures list haven't left the starting line yet, while the 2 > > firmwares in the multi-failures list both seem to be trying to _win_. > > > > [2] We had a few surviving Seagate Barracudas in 2016, but over 85% of > > those built before 2015 had failed by 2016, and none of the survivors > > are still online today. In practical terms, it doesn't matter if a > > pre-2015 Barracuda has correct power-failing write-cache behavior when > > the drive hardware typically dies more often than the host's office has > > power interruptions. > > > > [3] OK, maybe it IS hard to find WD Black drives to test at the _exact_ > > moment they are remapping UNC sectors...tap one gently with a hammer, > > maybe, or poke a hole in the air filter to let a bit of dust in? > > > >>> After turning off write caching, btrfs can keep running on these problem > >>> drive models until they get too old and broken to spin up any more. > >>> With write caching turned on, these drive models will eat a btrfs every > >>> few months. > >>> > >>> > >>>> Or even use ZFS instead... > >>>> > >>>> Am 11/06/2019 um 15:02 schrieb Qu Wenruo: > >>>>> > >>>>> On 2019/6/11 下午6:53, claudius@winca.de wrote: > >>>>>> HI Guys, > >>>>>> > >>>>>> you are my last try. I was so happy to use BTRFS but now i really hate > >>>>>> it.... 
> >>>>>> > >>>>>> > >>>>>> Linux CIA 4.15.0-51-generic #55-Ubuntu SMP Wed May 15 14:27:21 UTC 2019 > >>>>>> x86_64 x86_64 x86_64 GNU/Linux > >>>>>> btrfs-progs v4.15.1 > >>>>> So old kernel and old progs. > >>>>> > >>>>>> btrfs fi show > >>>>>> Label: none uuid: 9622fd5c-5f7a-4e72-8efa-3d56a462ba85 > >>>>>> Total devices 1 FS bytes used 4.58TiB > >>>>>> devid 1 size 7.28TiB used 4.59TiB path /dev/mapper/volume1 > >>>>>> > >>>>>> > >>>>>> dmesg > >>>>>> > >>>>>> [57501.267526] BTRFS info (device dm-5): trying to use backup root at > >>>>>> mount time > >>>>>> [57501.267528] BTRFS info (device dm-5): disk space caching is enabled > >>>>>> [57501.267529] BTRFS info (device dm-5): has skinny extents > >>>>>> [57507.511830] BTRFS error (device dm-5): parent transid verify failed > >>>>>> on 2069131051008 wanted 4240 found 5115 > >>>>> Some metadata CoW is not recorded correctly. > >>>>> > >>>>> Hopes you didn't every try any btrfs check --repair|--init-* or anything > >>>>> other than --readonly. > >>>>> As there is a long exiting bug in btrfs-progs which could cause similar > >>>>> corruption. > >>>>> > >>>>> > >>>>> > >>>>>> [57507.518764] BTRFS error (device dm-5): parent transid verify failed > >>>>>> on 2069131051008 wanted 4240 found 5115 > >>>>>> [57507.519265] BTRFS error (device dm-5): failed to read block groups: -5 > >>>>>> [57507.605939] BTRFS error (device dm-5): open_ctree failed > >>>>>> > >>>>>> > >>>>>> btrfs check /dev/mapper/volume1 > >>>>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115 > >>>>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115 > >>>>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115 > >>>>>> parent transid verify failed on 2069131051008 wanted 4240 found 5115 > >>>>>> Ignoring transid failure > >>>>>> extent buffer leak: start 2024985772032 len 16384 > >>>>>> ERROR: cannot open file system > >>>>>> > >>>>>> > >>>>>> > >>>>>> im not able to mount it anymore. 
> >>>>>> > >>>>>> > >>>>>> I found the drive in RO the other day and realized somthing was wrong > >>>>>> ... i did a reboot and now i cant mount anmyore > >>>>> Btrfs extent tree must has been corrupted at that time. > >>>>> > >>>>> Full recovery back to fully RW mountable fs doesn't look possible. > >>>>> As metadata CoW is completely screwed up in this case. > >>>>> > >>>>> Either you could use btrfs-restore to try to restore the data into > >>>>> another location. > >>>>> > >>>>> Or try my kernel branch: > >>>>> https://github.com/adam900710/linux/tree/rescue_options > >>>>> > >>>>> It's an older branch based on v5.1-rc4. > >>>>> But it has some extra new mount options. > >>>>> For your case, you need to compile the kernel, then mount it with "-o > >>>>> ro,rescue=skip_bg,rescue=no_log_replay". > >>>>> > >>>>> If it mounts (as RO), then do all your salvage. > >>>>> It should be a faster than btrfs-restore, and you can use all your > >>>>> regular tool to backup. > >>>>> > >>>>> Thanks, > >>>>> Qu > >>>>> > >>>>>> > >>>>>> any help > >> > > > > > > > [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread

* Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible) 2019-06-24 4:37 ` Zygo Blaxell @ 2019-06-24 5:27 ` Zygo Blaxell 0 siblings, 0 replies; 10+ messages in thread From: Zygo Blaxell @ 2019-06-24 5:27 UTC (permalink / raw) To: Remi Gauvin; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 3427 bytes --] On Mon, Jun 24, 2019 at 12:37:51AM -0400, Zygo Blaxell wrote: > On Sun, Jun 23, 2019 at 10:45:50PM -0400, Remi Gauvin wrote: > > On 2019-06-23 4:45 p.m., Zygo Blaxell wrote: > > > > > Model Family: Western Digital Green Device Model: WDC WD20EZRX-00DC0B0 Firmware Version: 80.00A80 > > > > > > Change the query to 1-30 power cycles, and we get another model with > > > the same firmware version string: > > > > > > Model Family: Western Digital Red Device Model: WDC WD40EFRX-68WT0N0 Firmware Version: 80.00A80 > > > > > > > > > > > These drives have 0 power fail events between mkfs and "parent transid > > > verify failed" events, i.e. it's not necessary to have a power failure > > > at all for these drives to unrecoverably corrupt btrfs. In all cases the > > > failure occurs on the same days as "Current Pending Sector" and "Offline > > > UNC sector" SMART events. The WD Black firmware seems to be OK with write > > > cache enabled most of the time (there's years in the log data without any > > > transid-verify failures), but the WD Black will drop its write cache when > > > it sees a UNC sector, and btrfs notices the failure a few hours later. > > > > > > > First, thank you very much for sharing. I've seen you mention several > > times before problems with common consumer drives, but seeing one > > specific identified problem firmware version is *very* valuable info. > > > > I have a question about the Black Drives dropping the cache on UNC > > error. If a transid id error like that occurred on a BTRFS RAID 1, > > would BTRFS find the correct metadata on the 2nd drive, or does it stop > > dead on 1 transid failure? 
> > Well, the 2nd drive has to have correct metadata--if you are mirroring > a pair of disks with the same firmware bug, that's not likely to happen. OK, I forgot the Black case is a little complicated... I guess if you had two WD Black drives and they had all their UNC sector events at different times, then the btrfs RAID1 repair should still work with write cache enabled. That seems kind of risky, though--what if something bumps the machine and both disks get UNC sectors at once? Alternatives in roughly decreasing order of risk: 1. Disable write caching on both Blacks in the pair 2. Replace both Blacks with drives in the 0-failure list 3. Replace one Black with a Seagate Firecuda or WD Red Pro (any other 0-failure drive will do, but these have similar performance specs to Black) to ensure firmware diversity 4. Find some Black drives with different firmware that have UNC sectors and see what happens with write caching during sector remap events: if they behave well, enable write caching on all drives with matching firmware, disable if not 5. Leave write caching on for now, but as soon as any Black reports UNC sectors or reallocation events in SMART data, turn write caching off for the remainder of the drive's service life. > There is a bench test that will demonstrate the transid verify self-repair > procedure: disconnect one half of a RAID1 array, write for a while, then > reconnect and do a scrub. btrfs should self-repair all the metadata on > the disconnected drive until it all matches the connected one. Some of > the data blocks might be hosed though (due to CRC32 collisions), so > don't do this test on data you care about. > > > > > [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread

* Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible) 2019-06-24 4:29 ` Zygo Blaxell @ 2019-06-24 5:39 ` Qu Wenruo 0 siblings, 0 replies; 10+ messages in thread From: Qu Wenruo @ 2019-06-24 5:39 UTC (permalink / raw) To: Zygo Blaxell; +Cc: linux-btrfs [-- Attachment #1.1: Type: text/plain, Size: 7291 bytes --] On 2019/6/24 下午12:29, Zygo Blaxell wrote: [...] > >> Btrfs is relying more the hardware to implement barrier/flush properly, >> or CoW can be easily ruined. >> If the firmware is only tested (if tested) against such fs, it may be >> the problem of the vendor. > [...] >>> WD Green and Black are low-cost consumer hard drives under $250. >>> One drive of each size in both product ranges comes to a total price >>> of around $1200 on Amazon. Lots of end users will have these drives, >>> and some of them will want to use btrfs, but some of the drives apparently >>> do not have working write caching. We should at least know which ones >>> those are, maybe make a kernel blacklist to disable the write caching >>> feature on some firmware versions by default. >> >> To me, the problem isn't for anyone to test these drivers, but how >> convincing the test methodology is and how accessible the test device >> would be. >> >> Your statistic has a lot of weight, but it takes you years and tons of >> disks to expose it, not something can be reproduced easily. >> >> On the other hand, if we're going to reproduce power failure quickly and >> reliably in a lab enivronment, then how? >> Software based SATA power cutoff? Or hardware controllable SATA power cable? > > You might be overthinking this a bit. Software-controlled switched > PDUs (or if you're a DIY enthusiast, some PowerSwitch Tails and a > Raspberry Pi) can turn the AC power on and off on a test box. 
Get a > cheap desktop machine, put as many different drives into it as it can > hold, start writing test patterns, kill mains power to the whole thing, > power it back up, analyze the data that is now present on disk, log the > result over the network, repeat. This is the most accurate simulation, > since it replicates all the things that happen during a typical end-user's > power failure, only much more often. To me, this is not as good a methodology as one might expect. It simulates the most common real-world power-loss case, but I'd say it's less reliable in pinning down the incorrect behavior. (And extra time is wasted on POST, booting into the OS, and things like that.) My idea is an SBC-based controller controlling the power cable of the disk. And another system (or the same SBC if it supports SATA) running a regular workload, with dm-log-writes recording every write operation. Then kill the power to the disk. Then compare the data on-disk against dm-log-writes to see how the data differs. From the viewpoint of an end user, this is definitely overkill, but at least to me, this could prove how bad the firmware is, leaving no excuse for the vendor to dodge the bullet, and maybe do them a favor by pinning down the sequence leading to corruption. Although there are a lot of untested things which can go wrong: - How does the kernel handle an unresponsive disk? - Will dm-log-writes record and handle errors correctly? - Is there anything special the SATA controller will do? But at least this is going to be a very interesting project. I already have a rockpro64 SBC with SATA PCIE card, just need to craft a GPIO-controlled switch to kill SATA power. > Hopefully all the hardware involved > is designed to handle this situation already. A standard office PC is > theoretically designed for 1000 cycles (200 working days over 5 years) > and should be able to test 60 drives (6 SATA ports, 10 sets of drives > tested 100 cycles each). The hardware is all standard equipment in any > IT department. 
> > You only need special-purpose hardware if the general-purpose stuff > is failing in ways that aren't interesting (e.g. host RAM is corrupted > during writes so the drive writes garbage, or the power supply breaks > before 1000 cycles). Some people build elaborate hard disk torture > rigs that mess with input voltages, control temperature and vibration, > etc. to try to replicate the effects effects of aging, but these setups > aren't representative of typical end-user environments and the results > will only be interesting to hardware makers. > > We expect most drives to work and it seems that they do most of the > time--it is the drives that fail most frequently that are interesting. > The drives that fail most frequently are also the easiest to identify > in testing--by definition, they will reproduce failures faster than > the others. > > Even if there is an intermittent firmware bug that only appears under > rare conditions, if it happens with lower probability than drive hardware > failure then it's not particularly important. The target hardware failure > rate for hard drives is 0.1% over the warranty period according to the > specs for many models. If one drive's hardware is going to fail > with p < 0.001, then maybe the firmware bug makes it lose data at p = > 0.00075 instead of p = 0.00050. Users won't care about this--they'll > use RAID to contain the damage, or just accept the failure risks of a > single-disk system. Filesystem failures that occur after the drive has > degraded to the point of being unusable are not interesting at all. > >> And how to make sure it's the flush/fua not implemented properly? > > Is it necessary? The drive could write garbage on the disk, or write > correct data to the wrong physical location, when the voltage drops at > the wrong time. The drive electronics/firmware are supposed to implement > measures to prevent that, and who knows whether they try, and whether > they are successful? 
The data corruption that results from the above > events is technically not a flush/fua failure, since it's not a write > reordering or a premature command completion notification to the host, > but it's still data corruption on power failure. > > Drives can fail in multiple ways, and it's hard (even for hard disk > engineering teams) to really know what is going on while the power supply > goes out of spec. To an end user, it doesn't matter why the drive fails, > only that it does fail. Once you have *enough* drives, some of them > are always failing, and it just becomes a question of balancing the > different risks and mitigation costs (i.e. pick a drive that doesn't > fail so much, and a filesystem that tolerates the failure modes that > happen to average or better drives, and maybe use RAID1 with a mix of > drive vendors to avoid having both mirrors hit by a common firmware bug). > > To make sure btrfs is using flush/fua correctly, log the sequence of block > writes and fua/flush commands, then replay that sequence one operation > at a time, and make sure the filesystem correctly recovers after each > operation. That doessn't need or even want hardware, though--it's better > work for a VM that can operate on block-level snapshots of the filesystem. That's already what we're doing, dm-log-writes. And we failed to expose major problems. All the fsync-related bugs, like what Filipe is always fixing, can't be easily exposed by a random workload even with dm-log-writes. Most of these bugs need a special corner case to hit, but IIRC so far no transid problem is caused by it. But anyway, thanks for your info, we see some hope in pinning down the problem. Thanks, Qu [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread

* Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible) 2019-06-24 0:46 ` Qu Wenruo 2019-06-24 4:29 ` Zygo Blaxell @ 2019-06-24 17:31 ` Chris Murphy 2019-06-26 2:30 ` Zygo Blaxell 2019-07-02 13:32 ` Andrea Gelmini 1 sibling, 2 replies; 10+ messages in thread From: Chris Murphy @ 2019-06-24 17:31 UTC (permalink / raw) To: Qu Wenruo; +Cc: Zygo Blaxell, Btrfs BTRFS On Sun, Jun 23, 2019 at 7:52 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > > On 2019/6/24 上午4:45, Zygo Blaxell wrote: > > I first observed these correlations back in 2016. We had a lot of WD > > Green and Black drives in service at the time--too many to replace or > > upgrade them all early--so I looked for a workaround to force the > > drives to behave properly. Since it looked like a write ordering issue, > > I disabled the write cache on drives with these firmware versions, and > > found that the transid-verify filesystem failures stopped immediately > > (they had been bi-weekly events with write cache enabled). > > So the worst scenario really happens in real world, badly implemented > flush/fua from firmware. > Btrfs has no way to fix such low level problem. Right. The questions I have: should Btrfs (or any file system) be able to detect such devices and still protect the data? i.e. for the file system to somehow be more suspicious, without impacting performance, and go read-only sooner so that at least read-only mount can work? Or is this so much work for such a tiny edge case that it's not worth it? Arguably the hardware is some kind of zombie saboteur. It's not totally dead, it gives the impression that it's working most of the time, and then silently fails to do what we think it should in an extraordinary departure from specs and expectations. Are there other failure cases that could look like this and therefore worth handling? 
As storage stacks get more complicated with ever more complex firmware, and firmware updates in the field, it might be useful to have at least one file system that can detect such problems sooner than others and go read-only to prevent further problems? > BTW, do you have any corruption using the bad drives (with write cache) > with traditional journal based fs like XFS/EXT4? > > Btrfs is relying more on the hardware to implement barrier/flush properly, > or CoW can be easily ruined. > If the firmware is only tested (if tested) against such fs, it may be > the problem of the vendor. I think we can definitely say this is a vendor problem. But the question still is whether the file system has a role in at least disqualifying hardware when it knows it's acting up before the file system is thoroughly damaged? I also wonder how ext4 and XFS will behave. In some ways they might tolerate the problem without noticing it for longer, where instead of kernel space recognizing it, it's actually user space / application layer that gets confused first, if it's bogus data that's being returned. Filesystem metadata is a relatively small target for such corruption when the file system mostly does overwrites. I also wonder how ZFS handles this. Both in the single device case, and in the RAIDZ case. -- Chris Murphy ^ permalink raw reply [flat|nested] 10+ messages in thread

* Re: btrfs vs write caching firmware bugs (was: Re: BTRFS recovery not possible) 2019-06-24 17:31 ` Chris Murphy @ 2019-06-26 2:30 ` Zygo Blaxell 2019-07-02 13:32 ` Andrea Gelmini 1 sibling, 0 replies; 10+ messages in thread From: Zygo Blaxell @ 2019-06-26 2:30 UTC (permalink / raw) To: Chris Murphy; +Cc: Qu Wenruo, Btrfs BTRFS [-- Attachment #1: Type: text/plain, Size: 9715 bytes --] On Mon, Jun 24, 2019 at 11:31:35AM -0600, Chris Murphy wrote: > On Sun, Jun 23, 2019 at 7:52 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > > > > > > On 2019/6/24 上午4:45, Zygo Blaxell wrote: > > > I first observed these correlations back in 2016. We had a lot of WD > > > Green and Black drives in service at the time--too many to replace or > > > upgrade them all early--so I looked for a workaround to force the > > > drives to behave properly. Since it looked like a write ordering issue, > > > I disabled the write cache on drives with these firmware versions, and > > > found that the transid-verify filesystem failures stopped immediately > > > (they had been bi-weekly events with write cache enabled). > > > > So the worst scenario really happens in real world, badly implemented > > flush/fua from firmware. > > Btrfs has no way to fix such low level problem. > > Right. The questions I have: should Btrfs (or any file system) be able > to detect such devices and still protect the data? i.e. for the file > system to somehow be more suspicious, without impacting performance, > and go read-only sooner so that at least read-only mount can work? Part of the point of UNC sector remapping, especially in consumer hard drives, is that filesystems _don't_ notice it (health monitoring daemons might notice SMART events, but it's intentionally transparent to applications and filesystems). The alternative is that one bad sector throws an application that is not prepared to handle it, or forces the filesystem RO, or triggers a full-device RAID data rebuild. 
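The SMART events such a health monitoring daemon watches can be pulled from the attribute table that `smartctl -A` prints. A minimal parsing sketch follows; the sample table is abridged and hypothetical, and the fixed-column assumption matches common smartmontools output but is not guaranteed for every drive.

```python
def pending_and_reallocated(smartctl_a_output):
    """Extract raw values for the attributes a monitoring daemon watches.

    Input is the attribute table printed by `smartctl -A` (smartmontools
    format); returns {attribute_name: raw_value}.
    """
    watched = {"Reallocated_Sector_Ct", "Current_Pending_Sector",
               "Offline_Uncorrectable"}
    found = {}
    for line in smartctl_a_output.splitlines():
        fields = line.split()
        # attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED
        #                 WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] in watched:
            found[fields[1]] = int(fields[9])
    return found

# Abridged, hypothetical sample of a `smartctl -A` attribute table:
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   253   000    Old_age   Offline      -       0
"""

counts = pending_and_reallocated(SAMPLE)
print(counts)
```

Nonzero pending or reallocated counts are exactly the "minor failure events" under discussion: invisible to the filesystem, but a useful early warning for an ops team tracking which drives to distrust.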
Of course that all goes sideways if the firmware loses its mind (and write cache) during UNC sector remapping. > Or is this so much work for such a tiny edge case that it's not worth it? > > Arguably the hardware is some kind of zombie saboteur. It's not > totally dead, it gives the impression that it's working most of the > time, and then silently fails to do what we think it should in an > extraordinary departure from specs and expectations. > Are there other failure cases that could look like this and therefore > worth handling? In some ways firmware bugs are just another hardware failure. Hard disks are free to have any sector unreadable at any time, or one day the entire disk could just decide not to spin up any more, or non-ECC RAM in the embedded controller board could flip some bits at random. These are all standard failure modes that btrfs detects (and, with an intact mirror available, automatically corrects). Firmware bugs are different quantitatively: they turn common-but-recoverable failure events into common-and-catastrophic failure events. Most people expect catastrophic failure events to be less common, but manufacturing is hard, and sometimes they are not. Entire production runs of hard drives can die early due to a manufacturing equipment miscalibration or a poor choice of electrical component. > As storage stacks get more complicated with ever more > complex firmware, and firmware updates in the field, it might be > useful to have at least one file system that can detect such problems > sooner than others and go read-only to prevent further problems? I thought we already had one: btrfs. Probably ZFS too. The problem with parent transid verify failure is that the problem is detected after the filesystem is already damaged. It's too late to go RO then, you need a time machine to get the data back. 
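For readers unfamiliar with the failure mode being named here, a toy model of the parent transid check may help (this is illustrative, not btrfs's real tree structures): each parent block records the generation it expects to find in each child, so a write that the drive silently dropped surfaces later as a child carrying a stale generation.

```python
class Node:
    """Toy metadata block: a generation number plus child pointers."""
    def __init__(self, transid, data=None):
        self.transid = transid
        self.data = data
        self.children = []         # list of (expected_transid, Node)

    def link(self, child):
        self.children.append((child.transid, child))


def verify(node):
    """Toy version of the parent transid check: every child must carry
    exactly the generation its parent recorded for it."""
    for expected, child in node.children:
        if child.transid != expected:
            return "parent transid verify failed: want %d found %d" % (
                expected, child.transid)
        bad = verify(child)
        if bad:
            return bad
    return None


# Transaction 7 rewrites a leaf and the parent that points at it...
leaf = Node(transid=7, data="extent records")
root = Node(transid=7)
root.link(leaf)
assert verify(root) is None

# ...but the drive silently dropped the leaf write, so a later read
# sees the *old* copy of that block, still carrying generation 5.
stale_leaf = Node(transid=5, data="old extent records")
root.children[0] = (7, stale_leaf)
print(verify(root))
```

This is why the detection is inherently after the fact: the check can only fire when the stale block is read back, which may be thousands of transactions after the write was lost.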
We could maybe make some more pessimistic assumptions about how stable new data is so that we can recover from damage in new data far beyond what flush/fua expectations permit. AFAIK the Green only fails during a power failure, so btrfs could keep the last N filesystem transid trees intact at all times, and during mount btrfs could verify the integrity of the last transaction and roll back to an earlier transid if there was a failure. This has been attempted before, and it has various new ENOSPC failure modes, and it requires modifications to some already very complex btrfs code, but if we waved a magic wand and a complete, debugged implementation of this appeared with a reasonable memory and/or iops overhead, it would work on the Green drives. The WD Black is a different beast: some sequence of writes is lost when a UNC sector is encountered, but the drive doesn't report the loss immediately (if it did, btrfs would already go RO before the end of the transaction, and the metadata tree would remain intact). The loss is only detected some time after, during reads which might be thousands of transids later. Both of these approaches have a problem: when the workaround is used, the filesystem rolls back to an earlier state, including user data. In some cases that might not be a good thing, e.g. rolling back 1000 transids on a mail store or OLTP database, or rolling back datacow files while _not_ rolling back nodatacow files. btrfs already writes two complete copies of the metadata with dup metadata, but firmware bugs can kill both copies. btrfs could hold the last 256MB of metadata writes in RAM (or whatever amount of RAM is bigger than the drive cache), and replay those writes or verify the metadata trees whenever a bad sector is reported or the drive does a bus reset. 
This would work if the write cache is dropped during a read, but if the firmware silently drops the write cache while remapping a UNC sector then btrfs will not be able to detect the event and would not know to replay the write log. This kind of solution seems expensive, and maybe a little silly, and might not even work against all possible drive firmware bugs (what if the drive indefinitely postpones some writes, so 256MB isn't enough RAM for the log?). Also, a more meta observation: we don't know this is what is really happening in the firmware. There are clearly problems observed when multiple events occur concurrently, but there are several possible mechanisms that could lead to the behavior, and nowhere in my data is enough information to determine which one is correct. So if a drive has a firmware bug that just redirects a cache write to an entirely random address on the disk (e.g. it corrupts or overruns an internal RAM buffer) the symptoms will match the observed behavior, but none of these workaround strategies will work. You'd need to have a RAID1 mirror on a different disk to protect against arbitrary data loss anywhere in a single drive--and btrfs can already support that because it's a normal behavior for all hard drives. The cost of these workarounds has to be weighed against the impact (how many drives are out there with these firmware bugs) and compared with the cost of other solutions that already exist. A heterogeneous RAID1 solves this problem--unless you are unlucky and get two different firmwares with the same bug. It may be possible that the best workaround is also the simplest, and also works for all filesystems at once: turn the write cache off for drives where it doesn't work. CoW filesystems write in big contiguous sorted chunks, and that gets most of the benefit of write reordering before the drive gets the data, so there is less to lose if the drive cannot reorder.
An overwriting filesystem writes in smaller, scattered chunks with more seeking, and can get more benefit from write caching in the drive. > > BTW, do you have any corruption using the bad drives (with write cache) > > with traditional journal based fs like XFS/EXT4? > > > > Btrfs is relying more on the hardware to implement barrier/flush properly, > > or CoW can be easily ruined. > > If the firmware is only tested (if tested) against such fs, it may be > > the problem of the vendor. > > I think we can definitely say this is a vendor problem. But the > question still is whether the file system has a role in at least > disqualifying hardware when it knows it's acting up before the file > system is thoroughly damaged? How does a filesystem know the device is acting up without letting the device damage the filesystem first? i.e. how do you do this without maintaining a firmware revision blacklist? Some sort of extended self-test during mkfs? Or something an admin can run online, like a balance or scrub? That would not catch the WD Black firmware revisions that need a bad sector to make the bad behavior appear. > I also wonder how ext4 and XFS will behave. In some ways they might > tolerate the problem without noticing it for longer, where instead of > kernel space recognizing it, it's actually user space / application > layer that gets confused first, if it's bogus data that's being > returned. Filesystem metadata is a relatively small target for such > corruption when the file system mostly does overwrites. The worst case on those filesystems is less bad than btrfs (for the filesystem--the user data is trashed in ways that are not reported and might be difficult to detect). btrfs checks everything--metadata and user data--and stops when unrecoverable failure is detected, so the logical result is that btrfs stops on firmware bugs. That's a design feature or horrible flaw, depending on what the user's goals are.
ext4 optimizes for availability and performance (simplicity ended with ext3) and intentionally ignores some possible failure modes (ext4 makes no attempt to verify user data integrity at all, and even metadata checksums are optional). XFS protects itself similarly, but not user data. > I also wonder how ZFS handles this. Both in the single device case, > and in the RAIDZ case. > > > -- > Chris Murphy [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread
