Debian Bug report logs - #900399

memtest86+: very probably kills system controller on Lenovo Thinkpad T500 laptop

Reported by: Sergey Kogan <kogan@bit-integro.ru>
Date: Wed, 30 May 2018 08:36:02 UTC
Severity: normal
Found in version memtest86+/5.01-3

Report forwarded to debian-bugs-dist@lists.debian.org, kogan@bit-integro.ru, Yann Dirson <dirson@debian.org> :

Bug#900399 ; Package memtest86+ . (Wed, 30 May 2018 08:36:04 GMT) (full text, mbox, link).

Acknowledgement sent to Sergey Kogan <kogan@bit-integro.ru> :

New Bug report received and forwarded. Copy sent to kogan@bit-integro.ru, Yann Dirson <dirson@debian.org> . (Wed, 30 May 2018 08:36:04 GMT) (full text, mbox, link).

Message #5 received at submit@bugs.debian.org (full text, mbox, reply):

From: Sergey Kogan <kogan@bit-integro.ru>
To: Debian Bug Tracking System <submit@bugs.debian.org>
Subject: memtest86+: very probably kills system controller on Lenovo Thinkpad T500 laptop
Date: Wed, 30 May 2018 14:58:33 +0600

Package: memtest86+
Version: 5.01-3
Severity: critical
Justification: breaks the whole system

Hi!

There is a situation I believe should be reported ASAP.

We have two Lenovo T500 laptops completely dead after overnight testing with memtest86+. The notebooks do not power on, and do not even light the 'external power' LED when the AC adapter is plugged in.

The whole story is:

29-May-2018
Two Lenovo T500 laptops were upgraded with 4 GB memory sticks. After the upgrade the laptops powered on with no problems and were used until the evening.

In the evening of 29-May-2018 both laptops were rebooted into Memtest86+ and set up for an overnight RAM test with default memtest settings.

In the morning of 30-May-2018 both laptops were found with a few passes completed and zero memory errors reported, but they were not responding to keyboard commands. The laptops were turned off with a long press of the power button and then refused to start.

All the usual tricks were tried, including:
- Removing and replacing the RAM sticks
- Removing the battery and AC power for a period of time
- The "press the power button ten times, then hold it" advice found on the internet
- CMOS battery removal

Still, the two Lenovo laptops show no signs of life.

We are going to send one laptop for service today (maybe they will diagnose the issue better), but it seems very likely that memtest86+ somehow killed the firmware of a motherboard system controller.

Until the problem is identified, I recommend issuing a warning and/or preventing installation or running of memtest86+ on Lenovo T500 laptops.

I will post updates when more information is available.

-- System Information:
Debian Release: 9.1
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: i386 (i686)

Kernel: Linux 4.9.0-3-686-pae (SMP w/2 CPU cores)
Locale: LANG=en_US, LC_CTYPE=ru_RU.UTF8 (charmap=UTF-8) (ignored: LC_ALL set to en_US.UTF8), LANGUAGE=en_US:en (charmap=UTF-8) (ignored: LC_ALL set to en_US.UTF8)
Shell: /bin/sh linked to /bin/bash
Init: systemd (via /run/systemd/system)

Versions of packages memtest86+ depends on:
ii  debconf [debconf-2.0]  1.5.61

memtest86+ recommends no packages.

Versions of packages memtest86+ suggests:
ii  grub-pc              2.02~beta3-5
pn  hwtools              <none>
pn  kernel-patch-badram  <none>
pn  memtest86            <none>
pn  memtester            <none>
pn  mtools               <none>

-- debconf-show failed

Information forwarded to debian-bugs-dist@lists.debian.org, Yann Dirson <dirson@debian.org> :

Bug#900399 ; Package memtest86+ . (Wed, 06 Jun 2018 09:54:08 GMT) (full text, mbox, link).

Acknowledgement sent to Сергей Коган <kogan@bit-integro.ru> :

Extra info received and forwarded to list. Copy sent to Yann Dirson <dirson@debian.org> . (Wed, 06 Jun 2018 09:54:08 GMT) (full text, mbox, link).

Message #10 received at 900399@bugs.debian.org (full text, mbox, reply):

From: Сергей Коган <kogan@bit-integro.ru>
To: 900399@bugs.debian.org
Subject: It's confirmed: memtest86+ can kill lenovo mainboard
Date: Wed, 6 Jun 2018 15:35:36 +0600

Hi!

Good news and bad news. Both T500 laptops were examined. One was (almost) repaired. One is dead.

One-line summary: Yes, memtest86+ killed them. No, it is not related to the embedded controller. It's a short circuit in a power hub IC.

Details:

- It's important to note that the power management logic in Lenovo ThinkPad laptops is quite sophisticated. The embedded controller provides a high-level signal, while special ICs issue signals to various gates to power up or power down specific parts of the system.

- One of those low-level ICs is RINKAN (U61 on Lenovo schematics). The important part of the IC is the VCC3SW micro-power LDO (dc/dc converter). It provides a limited 3.3 V power supply for the power button detection circuit, the thermal protection logic and a power hub IC.

- The power hub PMH_7 (U28) is more intelligent than RINKAN, and has an SPI connection to the EC. It controls a lot of clocks and power signals on the main board. Note that the PMH is used across different Lenovo products, so some of its outputs are left unused. It is common practice to tie unused IC outputs to ground or VCC instead of leaving them unconnected.

- Coreboot developers discovered a method of accessing the internal registers of the PMH. The protocol is simple: write a register address to one memory-mapped EC address, then read or write the desired value at the other EC address.

    outb(reg, EC_LENOVO_PMH7_ADDR);
    val = inb(EC_LENOVO_PMH7_DATA);

    outb(reg, EC_LENOVO_PMH7_ADDR);
    outb(val | (1 << bit), EC_LENOVO_PMH7_DATA);

- Now we leave the ground of hard facts and start speculating.

- It seems to be the case that either the BIOS does not list the memory-mapped EC registers as a reserved memory area, or memtest86+ fails to process this reservation correctly.

- The pattern of memory writes issued by memtest is (unfortunately) 100% compatible with the PMH internal register access protocol.

- It is very possible that by writing moving ones and zeros or random bytes, memtest pulled an unused PMH pin (tied to ground or VCC) high or low, thereby creating a short circuit on the VCC3SW line.

- This short circuit would tend to overheat the RINKAN LDO, as its output transistor is in active mode and is easily overloaded by a PMH output transistor (which is in conduction mode with a resistance of milliohms). It seems that RINKAN has no over-current or thermal protection built in.

- A VCC3SW malfunction is not critical while the main board 3.3V/9A and 5V/8A buses are powered by the TPS51221 (U41) IC. Most components draw power from the main buses and not from VCC3SW. But once the laptop is powered off, there is no VCC3SW bus to initiate the power-on process. The laptop is bricked.

Findings:

Both laptops were disassembled and the main boards examined using a multimeter and an oscilloscope. The main boards were of different revisions (and different types: one with discrete graphics, one without), but on both the VCC3SW power bus had malfunctioned.

The first laptop provided around 1.2 V on VCC3SW, and the measured resistance from VCC3SW to GND was around 400 Ohm. After cutting the VCC3SW pin on the RINKAN IC and providing external power to the VCC3SW line, the laptop powered up and attempted to boot. We ended up wiring in an external micro-power LDO (LP2930-3.3) to provide the power permanently. This laptop still has some minor problems (like refusing to power up unless the battery is removed and AC-IN is plugged in), but is still usable.

On the second T500, RINKAN was not providing any power to the VCC3SW bus, and the measured resistance was only ~50 Ohms. We had to cut both the VCC3SW (output) and VREGIN20 (input) RINKAN pins to remove an over-current condition. After that we observed power on the main 3.3V and 5V buses, but RINKAN/PMH7 do not issue 'POWER GOOD' signals, which prevents the system from becoming usable. No repair is possible.

It looks like the T6x, T400/500, T410/510 and T420/520 laptop families could be affected by this problem. Starting from the T430/530 series, the communication protocol with the EC was changed - breaking the tp_smapi driver and fixing the described problem as a side effect.

I have a "revived" T500 on hand and I would be happy to provide any information to confirm or correct my findings.

I still think that it's appropriate to warn Lenovo users of the possibility of bricking their laptops with a mere memory test.

---
Sincerely yours,
Sergey Kogan
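
The index/data access pattern quoted in the message above can be tried out without touching real hardware. The following is only a minimal sketch under stated assumptions: the port numbers are placeholders invented for the simulation (the real values are in the coreboot sources and are not given in this report), and `mock_outb`, `mock_inb`, `pmh7_read`, `pmh7_write` and `pmh7_set_bit` are hypothetical helper names, not coreboot functions.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: placeholder port numbers for the simulation only. */
#define EC_LENOVO_PMH7_ADDR 0x10
#define EC_LENOVO_PMH7_DATA 0x11

/* Simulated PMH7: 256 internal registers plus the latched register index. */
uint8_t sim_regs[256];
uint8_t sim_index;

/* Mock of outb(value, port): ADDR latches an index, DATA writes through it. */
void mock_outb(uint8_t value, uint16_t port)
{
    if (port == EC_LENOVO_PMH7_ADDR)
        sim_index = value;
    else if (port == EC_LENOVO_PMH7_DATA)
        sim_regs[sim_index] = value;
}

/* Mock of inb(port): reading DATA returns the currently selected register. */
uint8_t mock_inb(uint16_t port)
{
    if (port == EC_LENOVO_PMH7_DATA)
        return sim_regs[sim_index];
    if (port == EC_LENOVO_PMH7_ADDR)
        return sim_index;
    return 0xff; /* unclaimed ports read as all ones in this mock */
}

/* Select a register, then read its value. */
uint8_t pmh7_read(uint8_t reg)
{
    mock_outb(reg, EC_LENOVO_PMH7_ADDR);
    return mock_inb(EC_LENOVO_PMH7_DATA);
}

/* Select a register, then write a new value. */
void pmh7_write(uint8_t reg, uint8_t val)
{
    mock_outb(reg, EC_LENOVO_PMH7_ADDR);
    mock_outb(val, EC_LENOVO_PMH7_DATA);
}

/* Read-modify-write, as in the quoted snippet: set one bit, keep the rest. */
void pmh7_set_bit(uint8_t reg, unsigned bit)
{
    assert(bit < 8);
    uint8_t val = pmh7_read(reg);
    pmh7_write(reg, (uint8_t)(val | (1u << bit)));
}
```

In this model any two back-to-back writes, first to the ADDR location and then to the DATA location, change an internal register whether or not the writer intended it. That is the mechanism the report speculates about; the sketch does not show that it happens on a real T500.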

Information forwarded to debian-bugs-dist@lists.debian.org, Yann Dirson <dirson@debian.org> :

Bug#900399 ; Package memtest86+ . (Thu, 07 Jun 2018 13:30:03 GMT) (full text, mbox, link).

Acknowledgement sent to Сергей Коган <kogan@bit-integro.ru> :

Extra info received and forwarded to list. Copy sent to Yann Dirson <dirson@debian.org> . (Thu, 07 Jun 2018 13:30:04 GMT) (full text, mbox, link).

Message #15 received at 900399@bugs.debian.org (full text, mbox, reply):

From: Сергей Коган <kogan@bit-integro.ru>
To: 900399@bugs.debian.org
Subject: More good news
Date: Thu, 7 Jun 2018 19:26:43 +0600

Hi!

Let's lower the severity of this bug and flag it as unverified.

Given the datasheet for the TB62501 and the actual board layout of the T500, the described scenario (a short from VCC3SW to GND caused by a stray write to the PMH register) is highly improbable:

- The LDO inside the RINKAN has over-current protection set as low as 55 mA, which should prevent any damage even if VCC3SW is shorted. After a single over-current/under-voltage event, the RINKAN LDO is locked in the OFF state and requires a complete power-off to restart.

- Unused pins of the PMH are in fact floating.

- Some RINKAN batches do show a tendency to malfunction for no apparent reason. The main board temperature could be a contributing factor.

So we have to seriously consider the possibility that the two laptops died at the same time just by coincidence.

We do plan to run memtest on the restored laptop using a current measuring/limiting circuit on the VCC3SW bus. If no excessive current consumption is detected, memtest has nothing to do with the issue. If excessive current is observed during the test, that would give us a direction in which to resume the investigation.

---
Sincerely yours,
Sergey Kogan
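
For a rough sense of scale, the figures reported in this thread can be put through Ohm's law. This is a back-of-the-envelope sketch only: it assumes the measured VCC3SW-to-GND resistance behaves as a plain resistive load at the nominal 3.3 V, which a damaged IC may well not do, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Figures taken from the messages in this thread. */
#define VCC3SW_NOMINAL_V    3.3   /* nominal VCC3SW supply voltage */
#define RINKAN_OCP_LIMIT_MA 55.0  /* over-current limit quoted from the datasheet */

/* Current in mA drawn by a purely resistive load (ASSUMPTION: ohmic load). */
double load_current_ma(double volts, double ohms)
{
    assert(ohms > 0.0);
    return volts / ohms * 1000.0;
}

/* True if the estimated current is above the quoted over-current limit. */
bool exceeds_ocp_limit(double current_ma)
{
    return current_ma > RINKAN_OCP_LIMIT_MA;
}
```

With these assumptions, the ~400 Ohm reading from the first board corresponds to about 8 mA at 3.3 V, below the 55 mA limit, and the ~50 Ohm reading from the second board to about 66 mA, above it. The planned current measurement on the restored laptop would replace these estimates with real data.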

Information forwarded to debian-bugs-dist@lists.debian.org, Yann Dirson <dirson@debian.org> :

Bug#900399 ; Package memtest86+ . (Tue, 03 Jul 2018 12:36:03 GMT) (full text, mbox, link).

Acknowledgement sent to Tomas Janousek <tomi@nomi.cz> :

Extra info received and forwarded to list. Copy sent to Yann Dirson <dirson@debian.org> . (Tue, 03 Jul 2018 12:36:03 GMT) (full text, mbox, link).

Message #20 received at 900399@bugs.debian.org (full text, mbox, reply):

From: Tomas Janousek <tomi@nomi.cz>
To: Сергей Коган <kogan@bit-integro.ru>, 900399@bugs.debian.org
Subject: Re: Bug#900399: It's confirmed: memtest86+ can kill lenovo mainboard
Date: Tue, 3 Jul 2018 14:24:28 +0200

Hi,

On Wed, Jun 06, 2018 at 03:35:36PM +0600, Сергей Коган wrote:
> [...]
> It looks like T6x, T400/500, T410/510, T420/520 laptop families could be
> affected by this problem. Starting from the T430/530 series, a communication
> protocol with the EC was changed - breaking tp_smapi driver and fixing the
> described problem as a side effect.
> [...]

This may be completely unrelated, but it seems somewhat relevant:

When pressing and holding a key during memtest86+ on an otherwise perfectly working T420, there are errors due to a different value being read than was written. Initially I thought my memory/motherboard is faulty and the keyboard pressure is triggering this, but the patterns are totally deterministic: the same key always does the same "damage" to the bits. Perhaps there is indeed something mapped into the memory... :-)

-- 
Tomáš Janoušek, a.k.a. Pivník, a.k.a. Liskni_si, http://work.lisk.in/
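
One way to check whether such errors are deterministic is to XOR each written value with the value read back and compare the resulting bit masks across samples. The sketch below is an illustration only, with hypothetical function names; it tests for identical flipped bits, so a stuck-at fault (bits forced to 0 or 1 regardless of the data written) would need a different check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bits that differ between the value written and the value read back. */
uint32_t error_bits(uint32_t written, uint32_t read_back)
{
    return written ^ read_back;
}

/* True if every sample shows the same non-zero mask of flipped bits. */
bool same_error_mask(const uint32_t *written, const uint32_t *read_back, size_t n)
{
    assert(written != NULL && read_back != NULL);
    if (n == 0)
        return false;

    uint32_t first = error_bits(written[0], read_back[0]);
    if (first == 0)
        return false; /* the first sample shows no error at all */

    for (size_t i = 1; i < n; i++) {
        if (error_bits(written[i], read_back[i]) != first)
            return false;
    }
    return true;
}
```

Feeding it the expected and actual values that memtest86+ prints for each error, collected while holding one key, would show whether that key always disturbs the same bits.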

Information forwarded to debian-bugs-dist@lists.debian.org, Yann Dirson <dirson@debian.org> :

Bug#900399 ; Package memtest86+ . (Sat, 14 Jul 2018 01:27:03 GMT) (full text, mbox, link).

Acknowledgement sent to Dmitry Smirnov <onlyjob@debian.org> :

Extra info received and forwarded to list. Copy sent to Yann Dirson <dirson@debian.org> . (Sat, 14 Jul 2018 01:27:03 GMT) (full text, mbox, link).

Message #25 received at 900399@bugs.debian.org (full text, mbox, reply):

From: Dmitry Smirnov <onlyjob@debian.org>
To: 900399@bugs.debian.org
Cc: 900399-submitter@bugs.debian.org
Subject: Re: #900399 memtest86+: very probably kills system controller on Lenovo Thinkpad T500 laptop
Date: Sat, 14 Jul 2018 11:22:10 +1000

IMHO the inflated severity of this bug is unjustified.

Generally speaking, memtest86+ is exposing a hardware problem, which is exactly what it is designed to do and seems to be doing well - therefore this bug seems to be targeted against memtest86+'s primary function.

Let me use a hypothetical example: suppose I'm stress testing a notebook continuously for many hours. But the notebook is not designed with the same thermal properties as a server, so during testing it is overheated beyond its thermal specifications for too long and eventually breaks. Fair enough: arguably memtest86+ exposed a flaw in the thermal design, which is exactly what's expected. It is unfortunate if hardware ends up damaged, but it is not a bug in memtest86+.

Isn't it common sense that any burn-in test is not without risk of damage to hardware?

Maybe this bug should be forwarded to the notebook vendor?

What action do you expect from the Debian maintainer? Incorporating a warning appears to be a task for upstream developers.

For what it's worth, I've used memtest86+ to extensively test two different models of T520 and T410 Thinkpads without breaking them...

-- 
All the best,
 Dmitry Smirnov.

---

Lies are the social equivalent of toxic waste: Everyone is potentially harmed by their spread.
        -- Sam Harris

Message sent on to Sergey Kogan <kogan@bit-integro.ru> :

Bug#900399. (Sat, 14 Jul 2018 01:27:05 GMT) (full text, mbox, link).

Information forwarded to debian-bugs-dist@lists.debian.org, Yann Dirson <dirson@debian.org> :

Bug#900399 ; Package memtest86+ . (Sun, 12 Aug 2018 00:24:03 GMT) (full text, mbox, link).

Acknowledgement sent to ydirson@free.fr :

Extra info received and forwarded to list. Copy sent to Yann Dirson <dirson@debian.org> . (Sun, 12 Aug 2018 00:24:03 GMT) (full text, mbox, link).

Message #33 received at 900399@bugs.debian.org (full text, mbox, reply):

From: ydirson@free.fr
To: Сергей Коган <kogan@bit-integro.ru>, 900399@bugs.debian.org
Cc: control@bugs.debian.org
Subject: Re: Bug#900399: More good news
Date: Sun, 12 Aug 2018 02:21:36 +0200 (CEST)

severity 900399 normal
thanks

I suggest you get some advice from the forum[1] and, as Dmitry mentioned, bring the issue to Lenovo.

[1] http://forum.canardpc.com/forums/73-Memtest86-Official-forum?s=1407c99a4da914ef85e60c32c658ba16

----- Original message -----
> From: "Сергей Коган" <kogan@bit-integro.ru>
> To: 900399@bugs.debian.org
> Sent: Thursday, 7 June 2018 15:26:43
> Subject: Bug#900399: More good news
>
> Hi!
>
> Let's lower the severity of this bug and flag it as unverified.
>
> Given the datasheet for the TB62501 and actual board layout of the T500
> - the described scenario (short from the VCC3SW to GND caused by a stray
> write to the PMH register) is highly improbable:
>
> - The LDO inside the RINKAN has an over-current protection set as low as
> 55mA and should prevent any damage even if the VCC3SW is shorted. After
> the single over-current/under-voltage event, RINKAN LDO is locked in the
> OFF state and requires a complete power-off to restart.
>
> - Unused pins of the PMH are in fact floating
>
> - Some RINKAN batches do show tendency to malfunction with no apparent
> reasons. The main board temperature could be a contributing factor.
>
> So, we have to seriously consider the possibility that two laptops died
> at the same time just by a coincidence.
>
> We do plan to run a memtest on the restored laptop using a current
> measuring/limiting circuit on the VCC3SW bus. If no excessive current
> consumption would be detected - the memtest has nothing to do with the
> issue. If an excessive current during the test would be observed, it
> would get us a direction to resume the investigation.
>
> ---
> Sincerely yours,
> Sergey Kogan
>

Severity set to 'normal' from 'critical' Request was from ydirson@free.fr to control@bugs.debian.org . (Sun, 12 Aug 2018 00:24:04 GMT) (full text, mbox, link).
