Hewlett Packard Enterprise (HPE) is once again warning its customers that certain Serial-Attached SCSI solid-state drives will fail after 40,000 hours of operation, unless a critical patch is applied.

The company made a similar announcement in November 2019, when a firmware defect caused certain SAS SSDs to fail after 32,768 hours of operation.

Affected drives

The current issue affects drives used in HPE server and storage products such as HPE ProLiant, Synergy, Apollo 4200, Synergy Storage Modules, D3000 Storage Enclosure, and StoreEasy 1000 Storage.

| HPE Model Number | HPE SKU | HPE SKU Description | HPE Spare Part SKU | HPE Firmware Fix Date |
|---|---|---|---|---|
| EK0800JVYPN | 846430-B21 | HPE 800GB 12G SAS WI-1 SFF SC SSD | 846622-001 | 3/20/2020 |
| EO1600JVYPP | 846432-B21 | HPE 1.6TB 12G SAS WI-1 SFF SC SSD | 846623-001 | 3/20/2020 |
| MK0800JVYPQ | 846432-B21 | HPE 800GB 12G SAS MU-1 SFF SC SSD | 846624-001 | 3/20/2020 |
| MO1600JVYPR | 846436-B21 | HPE 1.6TB 12G SAS MU-1 SFF SC SSD | 846625-001 | 3/20/2020 |

The company says this is a comprehensive list of the impacted SSDs it sells. However, the issue is not unique to HPE and may be present in drives from other manufacturers.

If the SSDs in these HPE products run a firmware version older than HPD7, they will fail after being powered on for 40,000 hours. That translates to 4 years, 206 days, and 16 hours, about half a year shorter than the extended warranty available for some of them.
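The conversion from the 40,000-hour failure point to that calendar span is simple arithmetic, assuming 24-hour days and 365-day years:

```python
# Converting the 40,000-hour failure threshold into calendar time
# (24-hour days, 365-day years).
hours = 40_000
days, leftover_hours = divmod(hours, 24)   # 1666 days, 16 hours
years, leftover_days = divmod(days, 365)   # 4 years, 206 days
print(f"{years} years, {leftover_days} days, {leftover_hours} hours")
# → 4 years, 206 days, 16 hours
```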

When the failure point is reached, neither the data nor the drive can be recovered. Such a disaster can be averted only in environments with data backups in place.

HPE learned about the firmware bug from an SSD manufacturer and warns that SSDs installed and put into service at the same time are likely to fail almost concurrently.

“Restoration of data from backup will be required in non-fault tolerance modes (e.g., RAID 0) and in fault tolerance RAID mode if more drives fail than what is supported by the fault tolerance RAID mode logical drive [e.g. RAID 5 logical drive with two failed SSDs]” - HPE advisory
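The fault-tolerance math behind HPE's warning can be sketched in a few lines. This is an illustration, not HPE's tooling; the tolerance values are the typical maximum drive failures each common RAID level survives:

```python
# Illustration of the advisory's point: a RAID 5 logical drive survives one
# failed member, so two near-simultaneous SSD failures force a restore from
# backup. Tolerance values below are the standard ones for each level.
RAID_TOLERANCE = {
    "RAID 0": 0,  # no redundancy
    "RAID 1": 1,  # mirrored pair
    "RAID 5": 1,  # single parity
    "RAID 6": 2,  # double parity
}

def needs_restore(level, failed_drives):
    """True when failures exceed what the RAID level can absorb."""
    return failed_drives > RAID_TOLERANCE[level]

print(needs_restore("RAID 5", 2))  # → True: two failures exceed RAID 5's tolerance
print(needs_restore("RAID 6", 2))  # → False: RAID 6 still limps along
```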

The new firmware can be installed by using the online flash component for VMware ESXi, Windows, and Linux.

Last month, Dell EMC released new firmware to correct a bug causing nine SanDisk SSDs in its portfolio to fail "after approximately 40,000 hours of usage." Dell identified the following models to be impacted:

LT0200MO

LT0400MO

LT0800MO

LT1600MO

LT0200WM

LT0400WM

LT0800WM

LT0800RO

LT1600RO

The update corrects a faulty validation of the index used for the logging circular buffer. "Assert had a bad check to validate the value of circular buffer's index value. Instead of checking the max value as N, it checked for N-1," Dell's advisory explains.
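Dell's description points at a classic off-by-one in a sanity check. The firmware source is not public, so the sketch below is a hypothetical illustration of the bug class: an assertion that compares a circular buffer index against N-1 instead of N, rejecting the last legal slot once the index grows to it:

```python
# Hypothetical sketch of the off-by-one Dell describes; names and sizes are
# invented for illustration, not taken from the SanDisk firmware.
BUF_SIZE = 8  # N: number of slots in a circular log buffer

def validate_index_buggy(index):
    # Wrong: rejects index == BUF_SIZE - 1, which is the last valid slot.
    return index < BUF_SIZE - 1

def validate_index_fixed(index):
    # Right: any index in [0, N) is a valid circular-buffer position.
    return index < BUF_SIZE

last_slot = BUF_SIZE - 1
print(validate_index_buggy(last_slot))  # → False: the assert fires on a legal index
print(validate_index_fixed(last_slot))  # → True: the patched check accepts it
```

In firmware, a failed assert of this kind typically halts the controller, which is consistent with drives dying once the logging index finally reaches that slot after roughly 40,000 hours.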

Customers who were shipped one or more of the affected SSD models were informed about this "potentially critical issue" and advised to apply the update immediately.

Not as bad as last time

There is some good news, though. Judging by HPE's shipping dates and the 40,000-hour failure threshold, no affected SSDs have yet failed because of this firmware bug.

HPE estimates that unpatched SSDs will begin to fail as early as October 2020, which gives admins plenty of time to apply the corrected firmware.

Back in November, reports about storage drive failures came pouring in on social media and forums, with users complaining about devices failing in bulk, minutes apart.

Finding out the uptime of an affected drive is possible with the Smart Storage Administrator (SSA) utility, which offers the power-on time for every drive installed on the system.

Alternatively, users can run scripts that check whether the firmware on their SSDs has the 40,000 power-on-hours failure issue. The scripts work for certain HPE SAS SSDs and are available for Linux, VMware, and Windows.
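Outside of HPE's own tooling, the power-on hours of a drive can also be read with smartmontools. The sketch below (not HPE's script) parses `smartctl -a /dev/sdX` output and flags drives approaching the 40,000-hour mark; the two regexes cover the common SAS and SATA line formats, though real output varies by drive and smartmontools version:

```python
import re

# Rough sketch: extract power-on hours from smartctl output and flag drives
# nearing the 40,000-hour failure point. Line formats are assumptions based on
# typical smartmontools output, not guaranteed for every drive.
FAILURE_HOURS = 40_000

def power_on_hours(smartctl_output):
    # SAS drives report e.g. "  number of hours powered up = 38500.75"
    m = re.search(r"number of hours powered up\s*=\s*([\d.]+)", smartctl_output)
    if m:
        return int(float(m.group(1)))
    # SATA drives report a Power_On_Hours attribute with the raw value last
    m = re.search(r"Power_On_Hours.*?(\d+)\s*$", smartctl_output, re.MULTILINE)
    if m:
        return int(m.group(1))
    return None

# Demo on a fabricated SAS-style line; in practice, feed it the captured
# output of `smartctl -a` for each drive.
sample = "  number of hours powered up = 38500.75\n"
hours = power_on_hours(sample)
print(hours, hours >= FAILURE_HOURS)  # → 38500 False
```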

Update March 25, 09:05 EDT: Article updated with details about some SanDisk SSDs that could also fail after 40,000 hours of operation time.

h/t JohnC_21 (comment below)