This how-to describes how to replace a failing drive in a software RAID array managed by the mdadm utility, using a RAID 6 array as the example.

Let us look at this process in more detail by walking through an example.

Identify the problem

To identify which disk is failing within the RAID array, run:

[root@server loc]# cat /proc/mdstat

Or:

[root@server loc]# mdadm --query --detail /dev/md2

The failing disk will be reported as faulty or removed. For example:

[root@server loc]# mdadm --query --detail /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Mon Jun 22 08:47:09 2015
     Raid Level : raid6
     Array Size : 5819252736 (5549.67 GiB 5958.91 GB)
  Used Dev Size : 2909626368 (2774.84 GiB 2979.46 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon Oct 15 11:55:06 2018
          State : clean, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

Consistency Policy : bitmap

 Rebuild Status : 3% complete

           Name : localhost.localdomain:2
           UUID : 54404ab5:4450e4f3:aba6c1fb:93a4087e
         Events : 1046292

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       36        1      active sync   /dev/sdc4
       2       8       52        2      active sync   /dev/sdd4
       3       8       68        3      active sync   /dev/sde4

Get details from the RAID array

To examine the RAID array's state and identify the state of a disk within the RAID:

[root@server loc]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md2 : active raid6 sdb4[4](F) sdd4[2] sdc4[1] sde4[3]
      5819252736 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/3] [_UUU]
      [>....................]  recovery =  3.4% (100650992/2909626368) finish=471.5min speed=99278K/sec
      bitmap: 2/22 pages [8KB], 65536KB chunk

unused devices: <none>

As we can see, the device /dev/sdb4 has failed in the RAID.
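The (F) suffix in the mdstat output is what marks a failed member. As a rough illustration (a sketch, not part of mdadm; the failed_members helper is a hypothetical name, and the pattern assumes the mdstat layout shown above), the flagged device names can be pulled out with standard text tools:

```shell
# Hypothetical helper: list members flagged (F) in mdstat-format text on stdin.
# Assumes entries look like "sdb4[4](F)", as in the output above.
failed_members() {
  grep -oE '[a-z]+[0-9]+\[[0-9]+\]\(F\)' | sed -E 's/\[[0-9]+\]\(F\)//'
}

echo 'md2 : active raid6 sdb4[4](F) sdd4[2] sdc4[1] sde4[3]' | failed_members
# prints: sdb4
```

On the live system you would feed it the real status: failed_members < /proc/mdstat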

Having identified the failed disk as /dev/sdb4 (which was the case on this server), we need to get the disk's serial number using smartctl:

[root@server loc]# smartctl --all /dev/sdb | grep -i 'Serial'

The above command is important: matching the serial number against the disk's physical label tells you which drive to pull from the server.
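Note that the RAID member is a partition (/dev/sdb4), while smartctl queries the whole disk (/dev/sdb). A small sketch for deriving one from the other (the whole_disk helper is a hypothetical name and assumes plain sdXN device names, not NVMe-style names such as nvme0n1p1):

```shell
# Hypothetical helper: strip the trailing partition number from an sdXN name.
whole_disk() { echo "$1" | sed -E 's/[0-9]+$//'; }

whole_disk /dev/sdb4
# prints: /dev/sdb
```

It can then be combined with the smartctl command above: smartctl --all "$(whole_disk /dev/sdb4)" | grep -i 'Serial'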

Remove the failing disk from the RAID array

It is important to remove the failing disk from the array so the array retains a consistent state and is aware of every change, like so:

[root@server loc]# mdadm --manage /dev/md2 --remove /dev/sdb4

On a successful removal, a message like the following is returned:

mdadm: hot removed /dev/sdb4 from /dev/md2

Check the state of /proc/mdstat once again:

[root@server loc]# cat /proc/mdstat

You can see that /dev/sdb4 is no longer visible.
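If you prefer a scriptable check over eyeballing the output, something like the following works (a sketch; absent_from_md is a hypothetical name, and the function reads mdstat-format text from stdin):

```shell
# Hypothetical check: succeed only if the named member no longer appears
# in mdstat-format text read from stdin.
absent_from_md() { ! grep -q "$1"; }

echo 'md2 : active raid6 sdd4[2] sdc4[1] sde4[3]' | absent_from_md sdb4 && echo 'sdb4 removed'
# prints: sdb4 removed
```

On the live system: absent_from_md sdb4 < /proc/mdstat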

Shut down the machine and replace the disk

Now it’s time to shut down the system and replace the faulty disk with a new one. Before shutting down, comment /dev/md2 out of your /etc/fstab file so the system can boot cleanly without the degraded array's filesystem. See the example below:

[root@server loc]# cat /etc/fstab
#
# /etc/fstab
# Created by anaconda on Fri May 20 13:12:25 2016
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/centos-root /                         xfs     defaults        0 0
UUID=1300b86d-2638-4a9f-b366-c5e67e9ffa4e /boot   xfs     defaults        0 0
#/dev/mapper/centos-home /home                    xfs     defaults        0 0
/dev/mapper/centos-swap swap                      swap    defaults        0 0
#/dev/md2               /var/loc                  xfs     defaults        0 0
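The commenting-out can also be scripted with sed. Here is a sketch demonstrated on a throwaway copy (on the real system you would run the sed command against /etc/fstab itself, after taking a backup; the mount point /var/loc is taken from the example above):

```shell
# Demonstrate on a throwaway copy rather than the real /etc/fstab.
fstab=$(mktemp)
printf '%s\n' '/dev/mapper/centos-root / xfs defaults 0 0' \
              '/dev/md2 /var/loc xfs defaults 0 0' > "$fstab"

# Prefix the /dev/md2 line with '#' (GNU sed in-place edit).
sed -i 's|^/dev/md2|#/dev/md2|' "$fstab"

grep md2 "$fstab"
# prints: #/dev/md2 /var/loc xfs defaults 0 0
```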

Partition the new disk

Since we have other working disks within the RAID array, it is easy and convenient to copy the partition schema of a working disk onto the new disk. This task is accomplished with the sgdisk utility, which is provided by the gdisk package.

Install gdisk like this (adjust this command for your distribution):

[root@server loc]# yum install gdisk

Using sgdisk, we will pass the -R option (which stands for Replicate). Mind the argument order: the destination (new) disk is the argument to -R, and the source (working) disk is given last, so be careful not to reverse them. In our situation, the new disk is /dev/sdb and the working disks are /dev/sdc, /dev/sdd, and /dev/sde.

Now, to replicate the partition schema of a working disk (say /dev/sdc ) to the new disk /dev/sdb , the following command is needed:

[root@server loc]# sgdisk -R /dev/sdb /dev/sdc

To prevent GUID conflicts with other drives, we’ll need to randomize the GUID of the new drive using:

[root@server loc]# sgdisk -G /dev/sdb
The operation has completed successfully.

Next, verify the output of /dev/sdb using the parted utility:

[root@server loc]# parted /dev/sdb print

Add the new disk to the RAID array



After completing the partition schema replication to the new drive, we now can add the drive to the RAID array:

[root@server loc]# mdadm --manage /dev/md2 --add /dev/sdb4
mdadm: added /dev/sdb4

Verify recovery

To verify the RAID recovery, use the following:

[root@server loc]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md2 : active raid6 sdb4[4] sdd4[2] sdc4[1] sde4[3]
      5819252736 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/3] [_UUU]
      [==>.................]  recovery = 12.2% (357590568/2909626368) finish=424.1min speed=100283K/sec
      bitmap: 0/22 pages [0KB], 65536KB chunk

unused devices: <none>

Or:

[root@server loc]# mdadm --query --detail /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Mon Jun 22 08:47:09 2015
     Raid Level : raid6
     Array Size : 5819252736 (5549.67 GiB 5958.91 GB)
  Used Dev Size : 2909626368 (2774.84 GiB 2979.46 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon Oct 15 12:37:37 2018
          State : clean, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

Consistency Policy : bitmap

 Rebuild Status : 12% complete

           Name : localhost.localdomain:2
           UUID : 54404ab5:4450e4f3:aba6c1fb:93a4087e
         Events : 1046749

    Number   Major   Minor   RaidDevice State
       4       8       20        0      spare rebuilding   /dev/sdb4
       1       8       36        1      active sync   /dev/sdc4
       2       8       52        2      active sync   /dev/sdd4
       3       8       68        3      active sync   /dev/sde4

From the above output, we can see that /dev/sdb4 is rebuilding and that four working devices are available: three active plus the rebuilding spare. The rebuilding process might take a while, depending on your total disk size and disk type (i.e., traditional or solid-state).
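To keep an eye on progress without rerunning the command by hand, `watch -n 5 cat /proc/mdstat` refreshes the status every five seconds. The finish= field can also be converted to hours with a quick pipeline (a sketch; eta_hours is a hypothetical name, and the pattern assumes the recovery-line format shown above):

```shell
# Hypothetical helper: convert mdstat's "finish=NNN.Nmin" field into hours.
eta_hours() {
  grep -oE 'finish=[0-9.]+min' | sed -E 's/finish=([0-9.]+)min/\1/' |
    awk '{printf "%.1f hours\n", $1 / 60}'
}

echo 'recovery = 12.2% (357590568/2909626368) finish=424.1min speed=100283K/sec' | eta_hours
# prints: 7.1 hours
```

On the live system: eta_hours < /proc/mdstat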

Celebrate