The revelations by Google’s Project Zero team earlier this year of the Spectre and Meltdown speculative execution vulnerabilities in most of the processors that have powered servers and PCs for the past couple of decades shook the industry, as Intel and other chip makers scrambled to mitigate the threats in the short term and then made plans to build those mitigations into future versions of their silicon. The Google team discovered the vulnerabilities in June 2017 and worked behind the scenes with chip makers and software vendors to close as many of the holes as possible before disclosing the threats in January.

The already-complex efforts were further complicated over the following months as more variants of Spectre and Meltdown were detected and efforts to mitigate those threats were put in motion. The vulnerabilities initially were seen as an Intel problem – probably given the company’s dominance in the server and PC chip markets – and Meltdown (also known as Variant 3) does specifically target Intel Xeon and Core processors. But Variants 1 and 2, which collectively are known as Spectre, impact not only Intel and AMD X86 chips but also those based on the IBM Power, Arm, and Sparc architectures, all of which use speculative execution to ramp up processor performance. Other vulnerabilities have cropped up since, including L1TF – also known as Foreshadow – which was disclosed in August.

However, as component, system, and software makers continue to respond to every variant that emerges, Spectre and Meltdown also have the industry looking farther down the road, considering how hardware architecture will need to change to ensure that such vulnerabilities don’t arise again and wrestling with the thorny issue of whether there is a happy medium between security and performance. That was the focus of a keynote presentation at the recent Hot Chips 2018 show, where the speakers agreed that changes need to be made, but the question quickly became what those changes would be and whether they could be implemented without sacrificing the performance that users demand.

“That growth in complexity has simply outstripped our security mechanisms, but there is a change here,” said John Hennessy, board chairman of Google parent company Alphabet, former president of Stanford University, and a legend in the IT industry. “Most attacks are traditionally software-focused – buffer overflow being the classic example – and they still are the majority of attacks. But things have really changed with Spectre and Meltdown. In Spectre and Meltdown, it is a hardware flaw. The architects are the people who did it, in some sense, although there is a lot of subtlety about what is right and wrong in terms of this. There are now multiple versions of them and there’s lots of suspicion that they’re just the first of many variants of this form that exploit the holes created primarily by speculation.”

Hennessy said that what it comes down to is, “we simply can’t design hardware that has security flaws. That’s an obligation of everybody that works on computer architecture. No matter how much performance can be gained, we can’t open up a security hole. We’ve got to go fix those problems; the fixes can be difficult and the fixes can sometimes negate the performance advantages that are gained by the hardware mechanism that we plug in. Unfortunately, some of us missed this problem for somewhere between 15 and 20 years, so there are lots of processors out there that have these holes in various forms. We’ve also got to build better mechanisms.”

However, that won’t mean getting rid of speculative execution, said Mark Hill, professor of computer science at the University of Wisconsin in Madison. The industry needs to consider what he calls Architecture 2.0, and there are options on the table, although none yet stands out as the only solution. But the performance gains that speculative execution techniques allow in all these processors cannot simply be given up in the wake of Spectre and Meltdown.

“So you just want to get rid of it? Well, it’s going to cost you many integers of performance to just get rid of speculation,” Hill explained. “It’s like moving back to a 200 MHz processor. Let’s not argue about the number because we’re not going to do this. It’s just not viable for any general-purpose processor product. We’re going to have to figure out ways to work creatively to mitigate these types of timing-channel vulnerabilities.”

We at The Next Platform have done a number of deep dives into how the Spectre and Meltdown vulnerabilities work and the work industry players have done to mitigate the threats. The vulnerabilities enable applications to read the data of other apps when running on the same server and in the same system memory pool, and they impact datacenters everywhere. Speculative execution techniques have been used for decades and side-channel attacks that pose the risk through Spectre and Meltdown have been known about even longer, Hennessy said. The assumption until last year was that such techniques could not lead to security issues.

Speculative execution techniques were long seen as a panacea, according to Paul Turner, software engineer with Google’s Project Zero.

“If speculation is correct, we can retire a bunch of instructions with really great accuracy at no cost,” Turner said during the keynote, echoing the assumptions of the past. “And if the prediction is not correct the speculation’s discarded. It can’t be observed and there are no side effects. Turns out that’s not true. This is kind of the problem. This is when Pandora’s Box was opened. This is the new now. Previously, speculative execution was thought to be data-free side effects.”
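The broken assumption Turner describes can be illustrated with a toy model. The sketch below is not real exploit code and does not model any actual processor; the `ToyCPU` class, its set-based "cache," and the `leak_byte` helper are all hypothetical names invented for illustration. It assumes a bounds-checked read that the branch predictor guesses will pass, so the out-of-bounds load runs speculatively, is squashed architecturally, and still leaves a cache line warm for an attacker to find.

```python
"""Toy model of a Spectre-style leak: squashed speculation still warms the cache."""


class ToyCPU:
    def __init__(self, memory, array_len):
        self.memory = memory        # flat memory: public array followed by a secret
        self.array_len = array_len  # architectural bound on the public array
        self.cache = set()          # probe-array lines that are currently cached

    def victim_read(self, index, predict_in_bounds=True):
        """Bounds-checked read; returns None for out-of-bounds indexes."""
        in_bounds = index < self.array_len
        if predict_in_bounds or in_bounds:
            # Runs architecturally, or speculatively on a mispredicted branch.
            value = self.memory[index]
            self.cache.add(value)   # dependent load touches probe line `value`
        if not in_bounds:
            return None             # speculation squashed: no architectural result
        return value

    def flush(self):
        """Attacker evicts every probe line before triggering the victim."""
        self.cache.clear()

    def probe(self):
        """Stand-in for timing each probe line; cached lines would read fast."""
        return sorted(self.cache)


def leak_byte(cpu, index):
    """Recover one out-of-bounds byte purely from cache state."""
    cpu.flush()
    assert cpu.victim_read(index) is None  # nothing is returned architecturally
    hits = cpu.probe()
    return hits[0] if hits else None


public = [1, 2, 3, 4]
secret = [ord(c) for c in "key"]
cpu = ToyCPU(public + secret, array_len=len(public))
leaked = "".join(chr(leak_byte(cpu, len(public) + i)) for i in range(len(secret)))
print(leaked)  # prints "key"
```

Calling `victim_read` with `predict_in_bounds=False` in this model leaves the cache empty, which is the behavior the old "no side effects" assumption took for granted in every case.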

The concerns raised by Spectre and Meltdown are heightened by the rapid changes going on in the industry today, he said. More personal data is online; computing systems have become more complex, expanding the attack surface; cloud environments mean that strangers are sharing the same hardware resources; and security attacks are increasing and becoming more sophisticated, thanks in part to their growing use by nation-states and organized crime. Throw in the continuing demand for improved system performance every year and the slowing of Moore’s Law, and the challenges of addressing vulnerabilities caused by speculative execution become obvious.

“On the hardware side, people that bought computers expected them to get faster every year,” said Jon Masters, chief Arm architect for Red Hat, who argued that software and hardware engineers will now have to work together more closely in the wake of Spectre and Meltdown. “We get this slow initial ramp and then we get this 52 percent year-on-year growth in performance, and then it’s been leveling off for the past few years because we got all the cheap wins. Many of those wins came because we have many of those capabilities like out-of-order execution, speculative engines, and so on, and we treated these like black boxes. Many software people until recently have never heard of speculation and a lot of design choices that we have today happened because we made assumptions.”

How the threat of Spectre and Meltdown and other hardware-based vulnerabilities will be dealt with long-term is unclear, though all the panelists said changes are needed. The answer will require a combination of hardware and software, but the push and pull between performance and security is the key challenge. Red Hat’s Omega team spent 10,000 hours of engineer time developing mitigations for the initial variants, but each mitigation came with some degree of performance hit. Masters said there are options in microcode and millicode in hardware, but software also will play a role.

“This is going to be with us for a really, really long time, so we need to really think about these new classes of attacks and put the research required into long-term initiatives,” he said. “Changes in how we design software is going to be a problem as well. The onus can’t all be on the hardware community. We obviously will have to make some changes in how we build our software. Open source can help in this process, but open source won’t magically solve all of our security problems.”

He added that RISC-V is a “very promising technology, but replacing all of our processors won’t solve the problem. However, having open implementations on which we can all collaborate, where we can all work together, that is very promising.”

Hill said the idea of bifurcation as part of fixing the microarchitecture should be considered. Users could choose between performance and security depending on the workloads they’re running. They could run fast cores or safe cores, for example.

“What we want is high performance and we want safety,” Hill said. “Are we going to be able to get it in the same thing? People are hoping there’s this happy medium. I’m skeptical. I think if we try to get both of these things, we get neither of these things. And this is a particularly bad time to hurt performance because what if we’re not getting those wonderful doublings of performance that we’ve had for the last couple of decades? So what to do? One of the things is bifurcation. If you can’t have it all, choose. So you have a mode where you’re faster and a mode where you’re safer. The same with speculation: not always, when you want to be safe. None of these are perfect, but they might be better than looking for the magical solution that have both high performance and that are safe.”
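The fast-or-safe choice Hill describes can be made concrete with a back-of-the-envelope model. This is a sketch under loudly stated assumptions, not a description of any real design: the `Mode` enum, the `run` function, and both cycle constants are hypothetical, and the numbers are picked only to make the trade-off visible. A fast core guesses past each branch and pays a penalty only on a wrong guess; a safe core stalls on every branch until it resolves and therefore never executes wrong-path work.

```python
from enum import Enum


class Mode(Enum):
    FAST = "fast"  # speculate past unresolved branches
    SAFE = "safe"  # stall until each branch resolves


# Hypothetical cycle costs, chosen only for illustration.
RESOLVE_STALL = 10       # cycles a safe core waits for a branch to resolve
MISPREDICT_PENALTY = 15  # cycles a fast core loses on a wrong guess


def run(mode, branch_outcomes, predicted=True):
    """Return (cycles, transient_ops) for a sequence of branch outcomes."""
    cycles = 0
    transient_ops = 0  # wrong-path work that could leave microarchitectural traces
    for taken in branch_outcomes:
        cycles += 1  # the branch instruction itself
        if mode is Mode.SAFE:
            cycles += RESOLVE_STALL  # never runs ahead of the branch
        elif taken != predicted:
            cycles += MISPREDICT_PENALTY
            transient_ops += 1  # speculated down the wrong path
    return cycles, transient_ops


outcomes = [True] * 9 + [False]  # nine of ten branches match the static guess
fast = run(Mode.FAST, outcomes)
safe = run(Mode.SAFE, outcomes)
print(fast, safe)  # prints "(25, 1) (110, 0)"
```

In this toy setup the safe mode eliminates the transient work that side-channel attacks feed on, at the price of several times the cycle count, which is the shape of the choice Hill is describing; real ratios would depend entirely on the workload and the microarchitecture.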

Hill listed various options, as shown below, but said that the long-term answers aren’t clear yet.

“In the short run we’re going to repair the microarchitecture, but the long-run question is: How are we going to define this right so that we can potentially eliminate the problem?” he said. “Or are we forced to just make it like a crime thing where we’re always mitigating? It will take a long time to resolve where that comes down.”