A Series of Unfortunate Events

When we first received the server, we couldn’t boot into the TDM. The TDM is a tool built into IBM/Lenovo servers that lets you do a variety of things, including configuring the RAID array. After some phone troubleshooting, IBM dispatched a tech to replace the 720ix RAID card. Once the card had been replaced, the TDM started working.

The tech left and I configured the RAID6 array. I wanted to be sure there were no further issues with the machine, so I let it do a full RAID6 initialization, which took about 12 hours.

The following morning, I went to load an operating system. This is when I discovered that I couldn’t boot into anything.

These servers come with a utility called the TSM. This is a web-based management tool that runs independently of the server (it functions even when the server is off or bricked). I noticed, through the TSM, that the CPU temperature was reporting in the negatives, which is, quite frankly, not possible.

IBM dispatched a technician to replace the motherboard. This fixed the booting issue, but it didn’t fix the CPU temperature issue. We tried 3 different CPU temperature monitoring tools and they all reported in the negatives.

What we didn’t realize at the time was that the new motherboard they had installed was still on the old 1.53 BIOS. This is why booting worked and we were able to install an operating system.

The server was working, but I still wanted to fix this CPU temperature issue. Fan speeds and other important metrics, such as CPU throttling, all depend on an accurate reading of the CPU temperature. I opened yet another ticket with IBM.
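Since so much depends on that one reading, it can help to flag values that cannot be right before anything acts on them. Below is a minimal sketch of such a plausibility check. The bounds and function names are my own assumptions for illustration, not anything taken from IBM/Lenovo tooling or the TSM.

```python
# Hypothetical plausibility bounds in degrees Celsius.
# These are assumed values for illustration, not vendor specifications.
MIN_PLAUSIBLE_C = 0.0
MAX_PLAUSIBLE_C = 110.0


def is_plausible_cpu_temp(celsius: float) -> bool:
    """Return True if the reading falls inside the assumed plausible range."""
    return MIN_PLAUSIBLE_C <= celsius <= MAX_PLAUSIBLE_C


def implausible_readings(readings: dict[str, float]) -> dict[str, float]:
    """Return only the sensor readings that fail the plausibility check."""
    return {
        name: temp
        for name, temp in readings.items()
        if not is_plausible_cpu_temp(temp)
    }
```

Feeding it a reading such as `{"cpu0": -12.0}` would return that entry as suspect, which is the kind of value we kept seeing.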

Unsure at this point what else to try, IBM sent a tech out to replace the CPU. He replaced it, but unfortunately, the temperature problem persisted. So, we checked the BIOS, TDM and TSM versions. At that time, we had the following:

BIOS: 1.53.0

TDM: 1.2.10

TSM: 3.31.84

The TDM and TSM were up to date, but the BIOS was still on the old 1.53.0. Not realizing that our boot issue had been solved by going back to BIOS 1.53 with the new motherboard, the tech upgraded the BIOS to 3.34.0. Afterwards, as you can imagine, ESXi wouldn’t boot.

My systems engineer, the IBM tech and I all sat around a table trying to figure out what the hell had happened. Suddenly, everything started to make sense: the only common denominator was the bad BIOS version!

The tech attempted to downgrade the BIOS. Unfortunately, this ended up bricking the server. During POST, if your server locks up at “Initializing PCI devices,” the BIOS is most likely corrupt.

The TSM web management tool was still functional. We tried to check the BIOS version through the TSM, but it reported as “N/A.” We even tried to force a BIOS update through the TSM, but it would not install.

The tech then attempted to use the jumper on the motherboard to recover the BIOS, but that also failed. He made the decision at that point to swap out the motherboard. Fortunately, he had one in his truck.

The new motherboard (the third one we’ve had so far) had been flashed with the bad 3.34 BIOS at the factory. As you can imagine, nothing would boot. So, we once again attempted the downgrade, which failed and bricked another board.

That was around 11pm last night. I’m currently waiting to hear back from IBM while they work to find a motherboard with the old BIOS installed. They still have no answer or explanation for why the CPU temperature is reading in the negative. Despite their technician agreeing that the BIOS is the problem, Lenovo has not officially admitted the BIOS is bad. I will update this post if that changes.