EDIT: (see the end of this question) after more digging this appears to be a system USB issue, not ZFS, that's causing the drives to be kicked. I'll leave this question up for posterity, because I'm still curious if there's an answer, but in the meantime if people have advice on FreeBSD USB devices getting forcefully removed, I'm all ears!

Please approach this question with a sense of humor and don't just downvote because it's a bad idea; sometimes (very rarely!) a user is totally ok with data loss and just needs help loading their footgun! After all, ZFS provides other benefits beyond data integrity, and I'd still rather use it for my bad drives than ext4. If you're the type of sysadmin who reads this with a sly smile and remembers the time they lost data by doing exactly this, this question is for you.

I'm running a pool with some USB drives on a non-critical server with non-critical data, and I don't care if it gets corrupted. I'm trying to set it up so that ZFS does not force-remove USB drives when they experience checksum errors (just like how ext4 or FAT handle this scenario, by not noticing/caring about data loss).

Disclaimer:

To readers landing here via Google trying to fix their ZFS pool: do not attempt anything described in this question or its answers; you will lose your data!

Because the ZFS police love to yell at people who are using USB drives or have any other non-standard setup: for the sake of this discussion, assume it's cat videos that I have backed up in 32 other physically remote places on 128 redundant SSDs. I fully acknowledge that I will lose 100% of my data unrecoverably on this pool (many times over) if I try to do this. I'm directing this question to the people who are curious about just how bad an environment ZFS is capable of running in (the people who like pushing systems to their breaking points and beyond, just for fun).

So here's the setup:

HP EliteDesk server running FreeNAS-11.2-U5

2x WD Elements 8TB drives connected via USB 3.0

unreliable power environment, server and drives are often force rebooted/disconnected with no warning. (yes I have a UPS, no I don't want to use it, I want to break this server, didn't you read the disclaimer 😉?)

one mirror pool hdd with the two drives (with failmode=continue set)

one drive is stable, even after multiple reboots and force-disconnects, it never seems to report checksum errors or any other issues in ZFS

one drive is unreliable, with occasional checksum errors during normal operation (even when not disconnected unexpectedly), the errors seem to appear unrelated to the bad power environment, as it'll be running fine for 10+ hours and suddenly get ejected from the pool due to checksum failures

I've confirmed that the unreliable drive's errors are due to a software issue or a hardware issue with the USB bus on the server, and not an unreliable cable or a physical problem with the drive itself. I confirmed this by plugging it into my MacBook with known-good USB ports, then zeroing the entire drive, writing random data to it, and verifying it (done 3 times, 100% success each time). The drive is almost new, with no SMART indicators below 100% health. However, even if the drive were failing gradually and losing a few bits here and there, I'd be ok with that.

Here's the problem:

Whenever the bad drive has checksum errors, ZFS removes it from the pool (Edit: this turned out to be an incorrect assumption; the system kicked it, not ZFS). Unfortunately, FreeNAS does not allow me to re-add it to the pool without physically rebooting, or unplugging and reconnecting both the USB cable and the drive's power supply. This means I can't script the re-adding process or do it remotely: without rebooting the entire server, I'd have to be physically present to unplug things, or have an internet-connected Arduino and a relay wired into both cables.

Possible solutions

I've already done quite a bit of research on whether this sort of thing is possible, and it's been difficult because every time I find a relevant thread, the data integrity police jump in and convince the asker to abandon their unreliable setup instead of ignoring the errors or working around them. I'm resorting to asking here because I haven't been able to find documentation or other answers on how to accomplish this.

turning off checksums entirely with zfs set checksum=off hdd; I haven't done this yet because I'd ideally like to keep checksums so I know when the drive is misbehaving, I just want to ignore the failures

a flag that keeps checksumming but ignores checksum errors / attempts to repair them without removing the drive from the pool

a ZFS flag that raises the maximum allowable checksum error limit before the drive gets removed (currently the drive gets booted after ~13 errors)

a FreeBSD/FreeNAS command that allows me to force-online the device after it got removed, without having to reboot the entire server
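For reference, the zpool side of re-onlining is scriptable today; the hard part is that it can only succeed if the device node has come back. A minimal sketch, assuming the pool name hdd from this question, and parsing the kicked vdev's GUID out of zpool status (the removed_guid helper is my own, not a FreeNAS tool):

```shell
#!/bin/sh
# Hedged sketch: try to re-online a REMOVED vdev without rebooting.
# This only works if the kernel has re-created the device node; if the
# USB stack never re-enumerates the disk, 'zpool online' will fail.

POOL=hdd

# Pull the GUID of the REMOVED vdev out of zpool status. A kicked
# drive shows up as a bare GUID with state REMOVED in the status
# output, e.g. 11823196300981694957.
removed_guid() {
    zpool status "$1" 2>/dev/null | awk '$2 == "REMOVED" { print $1; exit }'
}

GUID=$(removed_guid "$POOL")
if [ -n "$GUID" ]; then
    # Try to bring the vdev back and reset its error counters.
    zpool online "$POOL" "$GUID" && zpool clear "$POOL"
fi
```

This could be dropped into a cron job or a devd hook, but again: it stands or falls with the USB re-enumeration, not with ZFS.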

a FreeBSD/FreeNAS kernel option to force this drive to never be allowed to be removed

a FreeBSD sysctl option that magically fixes the USB bus issue causing errors/timeouts on only this drive (unlikely)

a ZFS on linux option that does the same thing (I'd be willing to move these drives to my Ubuntu box if I know it's possible to do there)

running zpool clear hdd in a loop every 500ms to clear checksum errors before they reach the threshold
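That loop is trivially scriptable. A minimal sketch (the clear_loop wrapper and its optional iteration count are my own additions; FreeBSD's sleep(1) accepts fractional seconds, so 0.5 works). Note this races against whatever fault logic does the kicking, so it's best-effort at most, and it also wipes the evidence of real errors:

```shell
#!/bin/sh
# Hedged sketch: reset checksum error counters on the pool every
# 500 ms so they (hopefully) never accumulate to the removal threshold.

POOL=hdd
INTERVAL=0.5    # FreeBSD sleep(1) accepts fractional seconds

clear_loop() {
    n=${1:--1}                      # -1 means run forever
    while [ "$n" -ne 0 ]; do
        zpool clear "$POOL" 2>/dev/null || true
        sleep "$INTERVAL"
        if [ "$n" -gt 0 ]; then n=$((n - 1)); fi
    done
}

# Invoke with no argument to run forever:
# clear_loop
```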

I'm also experimenting with setting hw.usb.xhci.use_polling=1 to fix the USB reconnection failure after disconnect, but I don't have conclusive results yet
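If the polling tunable does turn out to help, persisting it would look something like this on stock FreeBSD (on FreeNAS the supported route is a "sysctl"-type Tunable in the System GUI, since config files can be regenerated on upgrade):

```shell
# Apply at runtime for testing (stock FreeBSD, run as root):
sysctl hw.usb.xhci.use_polling=1

# Persist across reboots on stock FreeBSD:
echo 'hw.usb.xhci.use_polling=1' >> /etc/sysctl.conf
```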

I'm really trying to avoid resorting to ext4 or another filesystem that doesn't force-remove drives after USB errors, because I want all the other ZFS features like snapshots, datasets, send/recv, etc. I'm just trying to ignore/repair data integrity errors without drives getting disconnected.

Relevant logs

This is the dmesg output whenever the drive misbehaves and gets removed

Jul 7 04:10:35 freenas-lemon ZFS: vdev state changed, pool_guid=13427464797767151426 vdev_guid=11823196300981694957
Jul 7 04:10:35 freenas-lemon ugen0.8: <Western Digital Elements 25A3> at usbus0 (disconnected)
Jul 7 04:10:35 freenas-lemon umass4: at uhub2, port 20, addr 7 (disconnected)
Jul 7 04:10:35 freenas-lemon da4 at umass-sim4 bus 4 scbus7 target 0 lun 0
Jul 7 04:10:35 freenas-lemon da4: <WD Elements 25A3 1021> s/n 5641474A4D56574C detached
Jul 7 04:10:35 freenas-lemon (da4:umass-sim4:4:0:0): Periph destroyed
Jul 7 04:10:35 freenas-lemon umass4: detached
Jul 7 04:10:46 freenas-lemon usbd_req_re_enumerate: addr=9, set address failed! (USB_ERR_IOERROR, ignored)
Jul 7 04:10:52 freenas-lemon usbd_setup_device_desc: getting device descriptor at addr 9 failed, USB_ERR_TIMEOUT
Jul 7 04:10:52 freenas-lemon usbd_req_re_enumerate: addr=9, set address failed! (USB_ERR_IOERROR, ignored)
Jul 7 04:10:58 freenas-lemon usbd_setup_device_desc: getting device descriptor at addr 9 failed, USB_ERR_TIMEOUT
Jul 7 04:10:58 freenas-lemon usb_alloc_device: Failure selecting configuration index 0:USB_ERR_TIMEOUT, port 20, addr 9 (ignored)
Jul 7 04:10:58 freenas-lemon ugen0.8: <Western Digital Elements 25A3> at usbus0
Jul 7 04:10:58 freenas-lemon ugen0.8: <Western Digital Elements 25A3> at usbus0 (disconnected)

This is the zpool status hdd output after the bad drive gets kicked.

  pool: hdd
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 0 days 00:53:45 with 0 errors on Sun Jul 7 17:19:41 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        hdd                                             DEGRADED     0     0     0
          mirror-0                                      DEGRADED     0     0     0
            gptid/6a8016b8-a08d-11e9-8e1c-ecb1d765a86d  ONLINE       0     0     0
            11823196300981694957                        REMOVED      0     0     0  was /dev/gptid/6c3950c1-a08d-11e9-8e1c-ecb1d765a86d

errors: No known data errors

Edit:

After some more digging it looks like other people have experienced this sort of error too. It appears to be either a kernel bug or a USB hardware/software problem with some drives, and not a problem at the ZFS level. The system is kicking the drives, which then causes the ZFS checksum errors, and not the other way around. ZFS has no problem re-importing the drives after reboot, and it happily fixes the errors and reports no data loss. The USB issues may be related to power management features or other USB commands not being supported by the drive, but I'm still skeptical because the two drives are practically identical WD Elements drives bought only a year apart. I'm not sure how to fix it, since camcontrol rescan all doesn't even find the USB device after it gets disconnected; it really takes a full reboot, and often a full power cycle of the external drive in addition to the reboot.
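Since camcontrol works at the CAM layer and the periph is already destroyed by that point, the remaining lever is the USB layer itself. FreeBSD's usbconfig(8) can reset or power-cycle a device; whether that actually brings the drive back is exactly what fails in the logs above, so treat this as something to experiment with, not a fix (ugen0.8 is the address from my dmesg; it can change after a failed re-enumeration, and per-port power switching depends on the hub):

```shell
#!/bin/sh
# Hedged sketch: try to kick the USB device back to life from the USB
# layer instead of the CAM layer. DEV is ugen0.8 from the dmesg above;
# run 'usbconfig list' first to confirm the current address.
DEV=ugen0.8

usb_kick() {
    # A plain device reset first; if that fails, try power-cycling
    # the port (only works if the hub supports per-port power).
    usbconfig -d "$1" reset 2>/dev/null ||
        { usbconfig -d "$1" power_off 2>/dev/null
          sleep 2
          usbconfig -d "$1" power_on 2>/dev/null; }
}

usb_kick "$DEV" || true
```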

dmesg output during the failure: