AMD’s Ryzen 7 has been generally well-received by the enthusiast community, but there’s been one low-level problem that we’ve been watching but haven’t previously reported on. In early June, Ryzen users running Linux began reporting segmentation faults when running multiple concurrent compilation workloads across several different versions of GCC. LLVM/Clang was not affected, and the issue appears confined to Linux. Moreover, it apparently wasn’t common, even among Linux users: Michael Larabel, of Phoronix.com, reported that his own test rigs had been absolutely solid, even under heavy workloads.

Like the Pentium FDIV bug of yesteryear, this was a real issue, but one that realistically only impacted a fraction of a fraction of buyers. AMD had previously said it was investigating the problem (which isn’t present on any Epyc or Threadripper CPUs) and it’s now announced a solution: CPU replacement.

Phoronix reports AMD provided them with a new Ryzen 7 1800X CPU and that this chip has refused to crash, even when running a “kill Ryzen” script that was written to provoke a compiler segmentation fault and previously did so. While some users thought the issue was caused by RAM, the motherboard, or the BIOS, Phoronix’s testing points to the CPU itself. Swap the new Ryzen 7 1800X for an older part, and the problem reappears. Switch back to the new chip, and it vanishes. Larabel has tentatively concluded that the issue appears confined to Ryzen CPUs manufactured before Week 25 of this year (the new chip was built in Week 30), but no other details on what caused it are available.
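For readers who want a feel for what this kind of test involves, here is a minimal sketch of a concurrent stress harness. It is not the “kill Ryzen” script Phoronix used. The function names (`stress`, `run_once`, `looks_like_segfault`) are hypothetical, the default job and round counts are arbitrary, and the assumption that a crashed compiler either dies from SIGSEGV or mentions a segmentation fault on stderr is ours. It simply runs the same command several times in parallel and counts runs that look like a segfault.

```python
import signal
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def looks_like_segfault(proc):
    """Return True if a finished process appears to have hit a segmentation fault."""
    # On POSIX, subprocess reports death-by-signal as a negative return code.
    if proc.returncode == -signal.SIGSEGV:
        return True
    # Assumption: a compiler driver may only report a child's crash as text on stderr.
    return "segmentation fault" in proc.stderr.lower()


def run_once(cmd):
    """Run one copy of the workload and capture its output."""
    return subprocess.run(cmd, capture_output=True, text=True, errors="replace")


def stress(cmd, jobs=4, rounds=2):
    """Run `cmd` `jobs` at a time for `rounds` rounds; return the suspected segfault count."""
    crashes = 0
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        for _ in range(rounds):
            results = pool.map(run_once, [cmd] * jobs)
            crashes += sum(looks_like_segfault(p) for p in results)
    return crashes


if __name__ == "__main__":
    # Hypothetical usage: pass a real build command on the command line.
    # With no arguments, a harmless placeholder command is run instead.
    cmd = sys.argv[1:] or [sys.executable, "-c", "pass"]
    print("suspected segfaults:", stress(cmd))
```

A clean run proves little on its own, since by all accounts the fault was intermittent. What made Phoronix’s result persuasive was the A/B swap between chips on otherwise identical hardware.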

The good news is, AMD is replacing the CPUs of anyone who has this issue. Again, while the issue is real, it appears to trigger only in an extremely small number of cases, when running a Linux workload under specific circumstances.

CPU Errata Are the Rule, Not the Exception

We tend to think of CPU errata as being show-stopping phenomena that occur only occasionally, but the opposite is true. The summary table of errata within Intel’s sixth-generation Core family is eight pages long. Most of these bugs are minor issues or relate to corner cases, but larger issues can break through. Intel’s original Atom architecture had a major FPU bug in which trying to perform two back-to-back x87 operations would double the execution time. CPU analyst Agner Fog writes (Page 162 / 233):

Whenever there are two consecutive x87 instructions, the two instructions fail to pair and instead cause an extra delay of one clock cycle due to problems in the decoders. This gives a throughput of only one instruction every two clock cycles, while a similar code using XMM registers would have a maximum throughput of two instructions per clock cycle. This applies to all x87 instructions (names beginning with F), even the FNOP. For example, a sequence of 100 consecutive FNOP instructions takes 200 clock cycles to execute in my tests. If the 100 FNOPs are interspersed by 100 NOPs then the sequence takes only 100 clock cycles. It is therefore important to avoid consecutive x87 instructions.

The Skylake Hyper-Threading bug that froze systems when executing certain workloads is included in the 6th Generation list described above. AMD, of course, has had other problems of its own, including Piledriver’s poor handling of 256-bit AVX instructions (the penalty for using these was severe), and the infamous TLB bug that limited the scaling and performance of the original Phenom / Barcelona processors.

Unless you’re absolutely certain that you’re having a problem related to this bug, you probably aren’t. But we’re glad to see AMD offering replacement CPUs for those affected by the issue. CPU errata may be nothing new, but how companies respond to them still impacts how the issue is perceived by the IT community.