Hi everyone,

While most of you are probably excited about the possibilities of the recently announced “Librem 5” phone, today I am sharing a technical progress report about our existing laptops, particularly findings about getting coreboot to be “production-ready” on the Skylake-based Librem 13 and 15, where you will see one of the primary reasons we experienced a delay in shipping last month (and how we solved the issue).

TL;DR: Shortly before we began shipping from inventory, the coreboot port was considered done, but we found some weird SATA issues at the last minute, and those needed to be fixed before shipping those orders.

The bug would sometimes prevent any operating system from booting, which is why it became a blocker for shipments.

I haven’t found the “perfect” fix yet; I simply worked around the problem. The workaround corrects the behavior without any major consequences for users, other than warnings showing up during boot with the Linux kernel, and it allowed us to resume shipments.

Once I come up with the proper/perfect fix, an update will be made available for users to update their coreboot install after the fact. So, for now, do not worry if you see ATA errors during boot (or in dmesg) on your new Librem laptop shipped this summer: it is normal, harmless, and will hopefully be fixed soon.

The SATA-killer Chronicles

I previously considered the coreboot port “done” for the new Skylake-based laptops, and as I went to the coreboot conference, I thought I’d come back home finally free to take care of the other items on my ever-growing TODO list. But when I came back, I received an email from Zlatan (who was at our distribution center that week) saying that some machines couldn’t boot, throwing errors such as:

Read Error

…in SeaBIOS, or

error: failure reading sector 0x802 from 'hd0'

or

error: no such partition. entering rescue mode

…in GRUB before dropping into the GRUB rescue shell.

That was odd, as I had never encountered those issues except once, very early in the development of the coreboot port, when we were seeing some ATA error messages in dmesg; that was fixed, and neither Matt nor I had seen such errors since. So of course, I didn’t believe Zlatan at first, thinking that maybe the OS was not installed properly… but the issue was definitely occurring on multiple machines that were being prepared to ship out. Zlatan then booted into the PureOS Live USB and re-installed the original AMI BIOS; after that he had no more issues booting into his SSD, but when he’d flash coreboot back, it would fail to boot.

The ever changing name of the wind

Intrigued, I tested on my machine again with the “final release” coreboot image I had sent them and I couldn’t boot into my OS either. Wait—What!? It was working fine just before I went to the coreboot conference.

Did something change recently? No, I remember specifically sending the image that I had been testing for weeks, and I hadn’t rebased coreboot because I very specifically wanted to avoid any potential new bug being introduced “at the last minute” from the latest coreboot git base.

Just to be sure, I went back to an even older image I had saved (one that was also known to work), and the issue occurred there as well; so it was not a compilation-related problem either.

I asked Matt to test on his machine, and when he booted it, it failed for him with the same error. He hadn’t even flashed a new coreboot image! It was still the same image he had had on the laptop for the past few weeks, which had been working perfectly for him… until now, as it refused to boot.

Madness? THIS—IS—SATA!

After extensive testing, we finally came to the conclusion that whether or not the machine would manage to boot was entirely dependent on the following conditions:

The time of day

The current phase of the moon

The alignment of the planets in some distant galaxy

The mood of my neighbor’s cat

The most astonishing (and frustrating) thing is that during the three weeks when Matt and I had previously been working on the coreboot port, we never encountered a single “can’t boot” scenario, and we were rebooting those machines probably 10 times per hour or more… but now, we were suddenly both getting those errors, pretty consistently.

After a day or two of debugging, it suddenly started working without any errors again for a couple of hours, then the errors returned. On my end, the problem typically seemed to happen with SATA SSDs on the M.2 port (I didn’t get any issues when using a 2.5″ HDD, and Matt was in the same situation). However, Zlatan was having the same issues with a 2.5″ HDD that we were seeing on the M.2 connector.

So the good news was that we were at least able to trigger the error pretty frequently now; the bad news was that Purism couldn’t ship its newest laptops until this issue was fixed, and we had promised the laptops would be shipping out in droves by that time! Y’know, just to add a bit of stress to the mix.

The Eolian presents: DTLE

When I was doing the v1 port, I had a more or less similar issue with the M.2 SATA port, but it was much more stable: it would always fail with “Read Error”, instead of failing with a different error on every boot and “sometimes failing, sometimes working”. Some of you may remember my explanation of how I fixed the issue on the v1 in February: back then, I had to set the DTLE setting on the IOBP register of the SATA port. What this means is anyone’s guess, but I found this article explaining that “DTLE” means “Discrete Time Linear Equalization”, and that having the wrong DTLE values can cause the drives to “run slower than intended, and may even be subject to intermittent link failures”. Intermittent link failures! Well! Doesn’t that sound familiar?

Unfortunately, I don’t know how to set the DTLE setting on the Skylake platform, since coreboot doesn’t have support for it. The IOBP registers that were on the Broadwell platform do not exist in Skylake (they have been replaced by a P2SB—Primary to SideBand—controller), and the DTLE setting does not exist in the P2SB registers either, according to someone with access to the NDA’ed datasheet.

When the computer was booting, some ATA errors would appear in dmesg, looking something like this:

ata3: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x10 frozen
ata3.00: failed command: READ FPDMA QUEUED
ata3.00: cmd 60/04:00:d4:82:85/00:00:1f:00:00/40 tag 0 ncq 2048 in
         res 40/00:18:d3:82:85/00:00:1f:00:00/40 Emask 0x4 (timeout)
ata3.00: status: { DRDY }

Everywhere I found this error referenced, such as in forums, the final conclusion was typically “the SATA connector is defective”, or “it’s a power-related issue” where the errors disappeared after upgrading the power supply, etc. That more or less lines up with a wrong DTLE setting causing a similar issue.

It also looks strikingly similar to Ubuntu bug #550559, where there is no insight into the cause other than “disabling NCQ in the kernel fixes it”… but the original (AMI) BIOS does not disable NCQ support in the controller, and disabling NCQ doesn’t fix the DTLE setting itself.

Chasing the wind

So, not knowing what to do exactly and not finding any information in datasheets, I decided to try and figure it out using some good old reverse engineering.

First, I needed to see what the original BIOS did… but when I opened it in UEFIExtract, it turned out there was a bunch of “modules” in it. By “a bunch” I mean about 1581 modules in the AMI UEFI BIOS, from what I could count. Yep. And “somewhere” in one of those, the answer must lie. I didn’t know what to look for; some modules are named, some aren’t, so I obviously started with the file called “SataController”. I thought I’d find the answer in it quickly enough simply by opening it up with IDA, but nope: that module pretty much doesn’t do anything. I also tried “PcieSataController” and “PcieSataDynamicSetup”, but those weren’t of much help either.

I then looked at the code in coreboot to see how exactly it initializes the SATA controller, and found this bit of code:

/* Step 1 */
sir_write(dev, 0x64, 0x883c9003);

I don’t really know what this does, but to me it looks suspiciously like a “magic number”: for some reason that value needs to be written to that register for the SATA controller to be initialized. So I searched for that magic value in all of the UEFI modules and found one module containing it, called “PchInitDxe”. Progress! But the code was complex, and I quickly realized it would take me a long time to reverse engineer it all, and time was something I didn’t have. Remember, shipments were blocked by this, and customers were asking us daily about their order status!

The RAM in storm

One realization I had was that the error is always about this “READ FPDMA QUEUED” command… which means it’s somehow related to DMA, and therefore to RAM. So, could there be RAM corruption occurring? Obviously, I tested the RAM with memtest and no issues turned up. And since we had finally received the hardware, I could push for receiving the schematics from the motherboard designer (I had previously been told it would be a distraction to pursue schematics when there were so many logistical issues to fix first).

When I finally received the schematics and started studying them, I found some discrepancies between the RComp resistor values in the schematics and what I had set in coreboot, so I fixed that… but it made no difference.

I thought that maybe the issue was with the DQ/DQS settings of the RAM initialization (which are meant for synchronization), but I didn’t have the DQ/DQS settings for this motherboard and I couldn’t figure them out from the schematics. So what I did was simply hexdump the entire set of UEFI modules and grep for “79 00 51”, which is the little-endian 16-bit value of 121 followed by the first byte of the 16-bit value of 81, two of the RComp resistor values. That allowed me to find two modules containing the RComp resistor values for this board, and from there, I was able to find the DQ and DQS settings stored in the same module, just a few bytes above the RComp values, as expected. I tested with these new values, and… it made no difference. No joy.

A night with no moon

What else could I do? “If only there was a way to run the original BIOS in an emulator and catch every I/O it does to initialize the SATA controller!”

Well, there is something like that: it’s called SerialICE, and it’s part (sort of?) of the coreboot umbrella project. I was very happy to find it, but after a while I realized I can’t make use of it, at least not easily. It requires replacing the BIOS with the SerialICE shell, a very, very minimal firmware that basically only initializes the UART lines; you then “connect” to it over the serial port from a host running QEMU, send it the BIOS you want to run, and while SerialICE runs that BIOS it outputs all the I/O accesses over the serial port back to you… That’s great, and exactly what I need. Unfortunately:

the Librems do not have a serial port that I can use for that;

looking at the schematics, the only UART pad available is TX (so we could receive data from the machine), not RX (so we can’t send data to the machine);

I can’t find the TX pad on the motherboard, so I can’t even use that.

Thankfully, I was told that there is a way to use the xHCI USB debugging capabilities even on Skylake, and Nico Huber wrote libxhcidbg, a library implementing the xHCI USB debug features. So, all I would need to make SerialICE work would be to:

port coreboot to use libxhcidbg to have the USB debugging feature, then test it and make sure it all works, or…

port my previous flashconsole work to SerialICE, then find a way to somehow send/bundle the AMI BIOS inside SerialICE, or put it somewhere in the flash so SerialICE can grab it directly without me needing to feed it in through serial.

Another issue is that for USB debug to work, USB needs to be initialized, and there is no way for me to know whether the AMI BIOS initializes the SATA controller before or after the USB controller, so all that yak shaving might not even be helpful.

The other solution (using flashconsole) might not work either: we have 16MB of flash, and I expect a log of all I/O accesses would take a lot more space than that.

And even if one or both of these approaches actually worked, sifting through thousands of I/O accesses to find just the right one would be like looking for a needle in a haystack.

Considering the amount of work involved, the uncertainty of whether or not it would even work, and the fact that I really didn’t have time for such animal cruelty (remember: shipments on hold until this is fixed!), I needed to find a quicker solution.

The anger of a gentle man

At that point, I was starting to lose hope for a quick solution and I couldn’t find any more tables to flip:

“This issue is so weird! I can’t figure out the cause, nothing makes sense, and there’s no easy way to track down what needs to be done in order to get it fixed.”

And then I noticed something. While it would sometimes fail to boot, sometimes boot without issues, sometimes trigger ATA errors in dmesg, and sometimes stay silent… one thing was consistent: once Linux boots, we don’t experience any issues. There was no kernel panic because the disk couldn’t be accessed, no “input/output error” when reading files… there is no real visible issue other than the few ATA errors we see in dmesg at the beginning when booting Linux, and those errors don’t reappear later.

After doing quite a few tests, I noticed that whenever the ATA errors happened a few times, the Linux kernel ended up dropping the ATA link speed to 3Gbps instead of the default 6Gbps, and that once it did, no more errors happened afterwards. I eventually came to the conclusion that those ATA errors have the same cause as the boot errors from SeaBIOS/GRUB, and that they only happened when the controller was set up to use 6Gbps speeds.

What if I was wrong about the DTLE setting, and potential RAM issues? What if all of this is because of a misconfiguration of the controller itself? What if all AMI does is to disable the 6Gbps speed setting on the controller so it can’t be used?!

So, of course, I checked, and nope, it’s not disabled; when booting Linux from the AMI BIOS, the link was set up at 6Gbps and had no issues… so it must be something else, related to that. I dumped every configuration of the SATA controller (not only the PCI address space, but also the AHCI ABAR memory-mapped registers, and any other registers I could find related to the SATA/AHCI controller), and I made sure they matched exactly between the AMI BIOS and coreboot, and… still nothing. It made even less sense! If the entire SATA PCI address space and all the AHCI registers were exactly the same, then why wouldn’t it work?

I gave up!

…ok, I actually didn’t. I temporarily gave up trying to fix the problem’s root cause, but only because I had an idea for a workaround that could yield a quick win instead: if Linux is able to drop the link speed to 3Gbps and stop having any issues, why can’t I do the same in coreboot? Then both SeaBIOS and GRUB would stop having issues reading from the drive, ensuring the machine boots properly.

I decided I would basically do the same thing as Linux, but do it deliberately in coreboot, instead of it being done in Linux after errors start appearing.

While not the “ideal fix”, such a workaround would at least let the Skylake-based Librems boot reliably for all users, allowing us to release the shipments so customers can start receiving their machines as soon as possible, after which I would be able to take the time to devise the “ideal” fix, and provide it as a firmware update.

Sleeping under the wagon: an overnight workaround

I put my plan in motion:

I looked at the datasheet for how to configure the controller’s speed and found that I could indeed disable the 6Gbps speed, but for some reason, that didn’t work.

Then I tried to make it switch to 3Gbps, and that still didn’t work.

I went into the Linux kernel’s SATA driver to see what it does exactly, and realized that I hadn’t done the switch to 3Gbps correctly. Once I fixed my code in coreboot, the machines started booting again. I also learned what exactly happens in the Linux kernel: when there’s an error reading the drive, it retries a couple of times; if the error keeps happening over and over again, it drops the speed to 3Gbps, otherwise it keeps it as-is. That explains why we sometimes see only one ATA error, sometimes 3, and other times 20 or more; it all depends on whether the retries worked or not. Once I changed the speed of the controller to 3Gbps, I stopped having trouble booting into the system, because both SeaBIOS and GRUB were working at 3Gbps and had no issues reading the data. However, once Linux boots, it resets the controller, which cancels out the changes I made, and Linux starts using the drive at 6Gbps. That’s not really a problem, because I know Linux will retry any reads and drop to 3Gbps on its own once errors start happening, but it has the side effect that users will see these ATA error messages on their boot screen or in dmesg.



The next chapter: probably less than 10 years from now

As you can see, small issues like this are a real puzzle, and they’re the kind of thing that can make you spend a month of work just to “get it working” (let alone “find the perfect fix”). This is why I typically don’t give time estimates for this sort of work. We are committed, though, to getting you the best experience with your machines, so we’re still actively working on everything.

Here’s a summary of the current situation:

You may see errors on your boot screen, but it’s not a problem, since Linux will work around it.

It’s not a hardware issue, since it doesn’t happen with the AMI BIOS; we just need to figure out what to configure to make it work.

There is nothing to be worried about, and I expect to fix it in a future coreboot firmware update, which we’ll release to everyone once it’s available (we’re working on integration with fwupd, so maybe we’ll release it through that, I don’t know yet).

It’s taken me much longer than anticipated to write this blog post (2 months exactly), as other things kept getting in the way—avalanches of emails, other bugs to fix, patches to test/verify, scripts to write, and a lot of things to catch up on from the one month of intense debugging during which I had neglected all my other responsibilities.

While I was writing this status report, I didn’t make much progress on the issue. I’ve had 3 or 4 moments of enlightenment where I thought I had suddenly figured it all out, only to end up at a dead end once again. Well, once I do figure it out, I will let you all know! Thanks for reading, and thanks for your patience.