One of the benefits of virtualization is security; applications running in separate virtual machines are isolated from each other and, ideally, it is very hard for a compromised guest to damage other virtual machines running on the same host. The hypervisor itself is the place where most attacks on a virtualization system will be aimed. At the 2014 KVM Forum, Andrew Honig presented his analysis of which parts of KVM are more likely to have problems, and proposed ways to limit the attack surface.

An insecure hypervisor does not provide much security to the virtual machines it hosts. Luckily, hypervisors are typically small pieces of software; a smaller size means a reduced attack surface and makes it more feasible to audit the code. In the case of KVM, the hypervisor runs in the same address space as the rest of the Linux kernel, including device drivers and the network stack, but only a small amount of code deals with untrusted input from the virtual machine.

Therefore, the Linux kernel is substantially insulated from possible malicious behavior of the virtual machine. Device drivers in the virtual machine talk to a user-space process (typically QEMU), and this process talks to the kernel through the regular system-call interface or through special devices such as /dev/tap. QEMU is exposed to all the evil that could come from a malicious virtual machine, but only limited, low-level interfaces can be used to attack it. This makes it hard to use QEMU as a vector to exploit kernel vulnerabilities in the host. And, since QEMU is a user-space program, Linux Security Modules (LSMs) such as SELinux or AppArmor can be used to substantially mitigate the effect of arbitrary code execution if QEMU itself is subverted.

This makes the hypervisor much more interesting to attack than QEMU is, so there was a great deal of interest in Honig's talk, "Security Hardening of KVM" (slides [PDF], video [YouTube]), at the KVM Forum, which was held in Düsseldorf, Germany, in October. Honig has been working on hypervisor security for about ten years. He used to try to break VMware, finding six CVEs along the way, but his attention has shifted to KVM since he switched employers. He now works at Google, where his team takes care of securing Google Compute Engine (GCE), a cloud platform that uses KVM as its hypervisor. Interestingly, the user-space part of GCE is not QEMU; Google wrote its own.

So far, the team has found nine vulnerabilities in KVM. That is not all that many compared to the effort that he and his team are putting into breaking it. In Honig's words, few other parts of Linux have probably had as many "engineer-hours per line of code" spent looking for security problems. Still, forty thousand lines of C code can certainly be expected to have bugs.

Vulnerability types

What kinds of vulnerabilities can one encounter? Privilege escalation or denial of service (DoS) in the host can happen, of course, since hypervisors expose a relatively rich ioctl() API to user space; this kind of vulnerability is not really specific to hypervisors. It is slightly more interesting to have a bug that lets an unprivileged program running in the guest crash the whole virtual machine. A bug of this kind was fixed recently.
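For orientation, here is a minimal sketch of how user space reaches that ioctl() API; it assumes an x86 Linux host with /dev/kvm available and elides all error handling:

    /* Minimal sketch: each layer of the KVM API is a file descriptor
     * with its own set of ioctls; error handling is omitted. */
    #include <fcntl.h>
    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    int main(void)
    {
        int kvm  = open("/dev/kvm", O_RDWR);       /* top-level device */
        int vm   = ioctl(kvm, KVM_CREATE_VM, 0);   /* one fd per virtual machine */
        int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);  /* one fd per virtual CPU */
        (void)vcpu;
        return 0;
    }

Each of those descriptors accepts dozens of further ioctls, which is what makes the kernel-side API "relatively rich" from an attacker's point of view.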

Crashing the host is worse; it mostly happens because of null pointer dereferences, which become full host panics when the panic_on_oops=1 setting is in effect. In some rare cases, a hypervisor bug can even facilitate privilege escalation for an unprivileged program running within a guest. Which of these is worse? For a cloud provider such as Google, crashing the host is; its customers, however, might care more about the integrity of their own virtual machines.

Higher up in the rankings are vulnerabilities that let guests read data from other guests or from the hypervisor. The recently discovered Xen vulnerability, XSA-108, let guests read a few kilobytes of hypervisor memory. Despite being hard to exploit, and despite the existence of worse kinds of hypervisor vulnerabilities, it received considerable press and forced major cloud providers to reboot all of their hosts.

Of course, the worst bugs of all happen when the guest can write to hypervisor memory and, in all likelihood, execute arbitrary code in hypervisor context soon after. Of the fifteen CVEs that Honig mentioned, five were of this kind: two in KVM and three in VMware.

In order to find these bugs, Honig's team resorts to fuzzing and a lot of code review. They have gained some experience and, by now, they know what to look for, and where, every time they upgrade GCE to a newer hypervisor.

Most of the problems stem from either race conditions or buffer overflows, and some are downright embarrassing. In one case for KVM, the code used an ASSERT() macro to verify the validity of an index in an array:

    u32 redir_index = (ioapic->ioregsel - 0x10) >> 1;
    u64 redir_content;

    ASSERT(redir_index < IOAPIC_NUM_PINS);
    redir_content = ioapic->redirtbl[redir_index].bits;

Unfortunately, the bounds check is buried inside the ASSERT() call, which is compiled out by default; that means the guest can read arbitrary host memory. Or, if you choose to enable it, as is the case for debug builds, an assertion failure will crash the host: pick your poison.

The code above is part of the emulation of the IOAPIC, an interrupt controller device. It turns out that device emulation is the area where Google has reported the most bugs, but it is not the only one.
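For reference, the fix for this kind of bug is straightforward: perform the bounds check unconditionally, in both debug and release builds, and fall back to a harmless value. A sketch along the lines of what went upstream (not the exact patch):

    u32 redir_index = (ioapic->ioregsel - 0x10) >> 1;
    u64 redir_content;

    /* Validate the guest-controlled index unconditionally and
     * return a benign value for out-of-range accesses. */
    if (redir_index < IOAPIC_NUM_PINS)
        redir_content = ioapic->redirtbl[redir_index].bits;
    else
        redir_content = ~0ULL;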

Improving KVM security

The main task of the hypervisor is to drive execution of the virtual CPUs. Some actions of the virtual CPUs, such as reads and writes of model-specific registers (MSRs) and I/O registers, cannot be completed by the processor; the hypervisor must then either emulate the operation itself or ask a user-space process to complete it. MSRs, right now, are always handled in kernel space, and are one source of bugs. Performance-critical devices such as interrupt controllers and timers are also handled in kernel space; the IOAPIC is not really performance-critical anymore, but it was for the operating systems of 2007-2008, when KVM was being developed.
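That split between kernel and user space is visible in the KVM API itself. Below is a minimal sketch of the user-space side of the vCPU run loop; vcpu_fd is assumed to be an open vCPU file descriptor, run points at its mmap()ed struct kvm_run, and handle_pio()/handle_mmio() are hypothetical device-emulation helpers, not part of the KVM API:

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    void handle_pio(struct kvm_run *run);   /* hypothetical helpers */
    void handle_mmio(struct kvm_run *run);

    static void vcpu_loop(int vcpu_fd, struct kvm_run *run)
    {
        for (;;) {
            if (ioctl(vcpu_fd, KVM_RUN, 0) < 0)    /* run until the next exit */
                return;

            switch (run->exit_reason) {
            case KVM_EXIT_IO:      /* a port I/O access the kernel did not handle */
                handle_pio(run);
                break;
            case KVM_EXIT_MMIO:    /* a memory-mapped I/O access, likewise */
                handle_mmio(run);
                break;
            default:
                return;
            }
        }
    }

Anything the kernel handles before reaching this loop, such as MSR accesses or the in-kernel interrupt controllers, never becomes visible here; that is exactly the code Honig would like to shrink.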

In order to process loads and stores to I/O registers, KVM includes a small x86 instruction emulator. The emulator actually has a second purpose: it is needed to handle processor states that are not supported by older Intel processors, such as the so-called "big real mode" and hardware task switching. The good news is that this second purpose is becoming obsolete, as newer processors can do almost all of this in hardware. The bad news is that, unlike RISC architectures where only a handful of instructions have to be emulated, x86 has dozens of instructions that can access memory-mapped I/O registers, and KVM has to recognize and execute them all. Thus, the emulator consists of roughly 5,000 lines of code, and has its own share of bugs.
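To see why, consider that almost any ordinary memory access in guest code can land on a memory-mapped I/O register. Each of the following C statements can compile to a different x86 instruction, and the emulator must decode and execute whichever one the guest happened to use (the address is a made-up example):

    #include <stdint.h>

    void poke_device(void)
    {
        /* Hypothetical MMIO register mapped into the guest. */
        volatile uint32_t *reg = (volatile uint32_t *)0xfebf0000;

        *reg = 1;       /* may compile to: movl $1, (%rax)         */
        *reg |= 0x10;   /* may compile to: orl  $0x10, (%rax)      */
        (*reg)++;       /* may compile to: addl $1, (%rax) or incl */
    }

Multiply that by string instructions, operand sizes, and addressing modes, and the emulator's 5,000 lines start to look inevitable.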

The more these parts can be moved to user space, the more the attack surface can be reduced, Honig said. As mentioned earlier, user space is naturally confined, and it offers a wealth of mitigation techniques that do not apply to the kernel.

Since newer processors include more and more virtualization features, Google is targeting only fairly new Intel processors, and high-end ones at that. In particular, the Xeon E5 v2, also known as Ivy Bridge-E, supports big-real-mode virtualization and can also virtualize large parts of the local APIC inside the processor.

In a perfect world, everything else would then move to user space. In practice, parts of the local APIC support will almost certainly remain in the kernel. For example, inter-processor interrupts (IPIs) are performance-critical and, in general, not virtualized by the CPU. The only accelerated special case is the "self-IPI", that is, an IPI sent to the same processor that triggered it. This sounds weird, but it is used extensively by Windows.
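One reason this narrow case lends itself to acceleration is that, in x2APIC mode, a self-IPI is a single write to a dedicated MSR carrying nothing but the vector number. A rough illustration (the helper is made up, and must run at CPL 0, e.g. inside a guest kernel):

    /* IA32_X2APIC_SELF_IPI: writing a vector number here interrupts
     * the local CPU; an operation narrow enough for hardware to
     * virtualize directly, unlike arbitrary IPIs sent through the
     * full interrupt command register. */
    #define MSR_X2APIC_SELF_IPI 0x83f

    static inline void send_self_ipi(unsigned int vector)
    {
        asm volatile("wrmsr" : /* no outputs */
                     : "c"(MSR_X2APIC_SELF_IPI), "a"(vector), "d"(0));
    }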

Still, this means the emulator, the legacy i8259 interrupt controller, the legacy i8254 programmable timer, and the almost-legacy IOAPIC would no longer be part of the hypervisor's attack surface. Most MSR emulation could also move to user space. Honig stated a fairly ambitious goal: to reduce the attack surface by 50% (measured in lines of code and "number of pages of the Intel manual" emulated in the kernel) with at most 0.1% performance impact on macro-benchmarks.

The team's plan has been to start with everything in user space, then re-enable kernel acceleration only as much as needed to meet that goal. This makes sense for a research project, but it is backwards compared to how this maintainer would like to see the work pushed upstream. As far as I am concerned, in fact, it would be preferable to receive many small series, each one moving a piece of KVM out of the kernel. In addition, since Google uses neither QEMU nor kvmtool for the user-space part of the work, the team will have to develop patches for one of them before its improvements can be accepted upstream.

That said, this kind of hurdle should probably be expected, and it did not make the presentation any less interesting. Compared to containers, one of the strengths of virtualization is (or should be) its smaller attack surface. It is important that hypervisors live up to that promise, and Honig's ideas are definitely a step in the right direction.
