One of the data disks in my home machine has been increasingly problematic for, well, a while. I eventually bought a replacement HD, then even more eventually put it the machine along side the current two data disks, partitioned it, and added it as a third mirror to my software RAID partitions. After running everything as a three-way mirror for a while, I decided that problems on the failing disk were affecting system performance enough that I'd take the main software RAID partition on the disk out of service.

I did this as, roughly:

mdadm --manage /dev/md53 --fail /dev/sdd4 mdadm --manage /dev/md53 --remove /dev/sdd4 mdadm --manage /dev/md53 --raid-devices 2

(I didn't save the exact commands, so this is an approximation. The failing drive is sdd.)

The main software RAID device immediately stopped using /dev/sdd4 and everything was happy (and my Prometheus monitoring of disk latency no longer showed drastic latency spikes for sdd). The information in /proc/mdstat said that md53 was fine, with two out of two mirrors.

Then, today, my home machine locked up and rebooted (because it's the first significantly cold day in Toronto and I have a little issue with that). When it came back, I took a precautionary look at /proc/mdstat to see if any of my RAID arrays had decided to resync themselves. To my very large surprise, mdstat reported that md53 had two out of three failed devices and the only intact device was the outdated /dev/sdd4.

(The system then then started the outdated copy of the LVM volume group that sdd4 held, mounted outdated copies of the filesystems in it, and let things start writing to them as if they were the right copy of those filesystems. Fortunately I caught this very soon after boot and could immediately shut the system down to avoid further damage.)

This was not a disk failure; all of my other software RAID arrays on those disks showed three out of three devices, spanning the old sdc and sdd drives and the new sde drive. But rather than assemble the two-device new version of md53 with both mirrors fully available on sdc4 and sde4, the Fedora udev boot and software RAID assembly process had decided to assemble the old three-device version visible only on sdd4 with one out of three mirrors. Nor is this my old case of not updating my initramfs to have the correct number of RAID devices, because I never updated either the real /etc/mdadm.conf or the version in the initramfs to claim that any of my RAID arrays had three devices instead of two.

As I said on Twitter, I'm sufficiently used to ZFS's reliable behavior on device removal that I never even imagined that this could happen with software RAID. I can sort see how it did (for a start, I expect that marking a device as failed leaves its RAID superblock untouched), but I don't know why and the logs I have available contain no clues from udev and mdadm about its decision process for which array component to pick.

The next time I do this sort of device removal, I guess I will have to explicitly erase the software RAID superblock on the removed device with ' mdadm --zero-superblock '. I don't like doing this because if I make a mistake in the device name (and it is only a letter or a number away from something live), I've probably just blown things up.

The obvious conclusion is that mdadm should have an explicit way to say 'take this device out of service in this disk array', one that makes sure to update everything so that this can't happen even if the device remains physically present in the system. I don't care whether that involves adding a special mark to the device's RAID superblock or erasing it; I just want it to work. Perhaps what I did should already work in theory; if so, I regret to say that it didn't in practice.

(My short term solution is to physically disconnect sdd, the failing disk drive. This reduces the other three-way mirrors to two-way ones and I don't know what I'll do with the pulled sdd; it's probably not safe to let my home machine see it in any state at any time in the future. But at least this way I have working software RAID arrays.)

Sidebar: Why mdadm's --replace is not a solution for me

I explicitly wanted to run my new drive along side the existing two drives for a while, in case of infant mortality. Thus I wanted to run with three-way mirrors, instead of replacing one disk in a two-way mirror with another one.