Our last post on the benefits of virtualisation technology raised hackles with some readers in discussion forums, particularly regarding flexibility and performance. The flexibility we talk about is purely in the context of being a service provider; pointing out that you can use all your resources immediately on a bare-metal box misses the point. We sell virtual private servers, so it’s VPS upgrades that matter.

That’s not really interesting though, so instead we’re going to talk about performance optimisations for virtual machines (VMs). In recent years we’ve seen various hardware extensions introduced to reduce the overheads incurred by virtualisation, with the trend being to push virtualisation functionality into the CPU and hardware subsystems, relieving the hypervisor of much of the magic required to make things work transparently.

We’ll probably get into some fairly deep terminology, so this will be most meaningful if you’re familiar with low-level details of operating systems in general. For the record, we’re dealing exclusively with x86-type systems. While AMD also does virtualisation, we only run Intel hardware, so we can’t say much about AMD’s functionality.

VT-x

The first round of optimisations arrived in 2005–2006, named VT-x and AMD-V by Intel and AMD respectively. The major improvements were two-fold:

Virtualisation of privileged system access: The short version of this story is that the CPU becomes context-aware: it knows whether it’s the hypervisor or a VM performing privileged actions that would affect the state of the system. When hardware assistance isn’t present, the hypervisor needs to catch these actions and emulate the expected behaviour. With hardware assistance, it doesn’t have to.

Intel describes their virtualisation efforts pretty well in a tech journal publication, which is worth a read if you’re interested.
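To make the trap-and-emulate contrast concrete, here’s a toy Python sketch. The classes and the “cr3” write are entirely made up for illustration – real VM exits involve the CPU’s VMX machinery, not method calls – but the shape of the cost is the same: without hardware assistance, every privileged operation means a round trip through the hypervisor.

```python
# Toy contrast: trap-and-emulate vs hardware-assisted privileged ops.
# All names here are illustrative, not real VMX interfaces.

class Hypervisor:
    def __init__(self):
        self.traps = 0

    def emulate(self, vm, op, value):
        # Software virtualisation: every privileged op exits to the
        # hypervisor, which emulates the effect on the VM's state.
        self.traps += 1
        vm.state[op] = value

class VM:
    def __init__(self, hypervisor, hw_assist):
        self.state = {}
        self.hypervisor = hypervisor
        self.hw_assist = hw_assist

    def privileged_write(self, op, value):
        if self.hw_assist:
            # VT-x style: the CPU knows a guest is running and applies
            # the op to the guest's own copy of the state, no exit needed.
            self.state[op] = value
        else:
            self.hypervisor.emulate(self, op, value)

hv = Hypervisor()
legacy = VM(hv, hw_assist=False)
modern = VM(hv, hw_assist=True)

for _ in range(1000):
    legacy.privileged_write("cr3", 0x1000)
    modern.privileged_write("cr3", 0x1000)

print(hv.traps)  # prints 1000: only the legacy VM paid the exit cost
```

Both VMs end up in the same state; only the legacy one paid for a thousand exits to get there.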

Another layer of page tables for memory access: Intel calls their version Extended Page Tables (EPT), while AMD has Rapid Virtualization Indexing. They’re roughly the same thing: the idea is to hand the VM full control of its own page tables, and encapsulate that in another page table on the outside.

Because the VM is now sandboxed, it can safely do memory management as usual without the hypervisor needing to intervene and emulate the correct behaviour. The hypervisor gives the VM a chunk of memory and says “go nuts”, and the processor insulates the VM from the real world.
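As a rough sketch of that two-stage translation, here’s a toy Python model. The “page tables” are single-level dicts and all the page numbers are made up – real x86 tables are multi-level radix trees – but the division of labour is the point: the guest owns the first table, the hypervisor owns the second, and the CPU walks both.

```python
# Toy model of two-stage address translation under EPT.
# Single-level dict "page tables" and made-up page numbers, purely
# for illustration of who owns which layer.

PAGE = 4096  # 4 KiB pages

def translate(page_table, addr):
    """Translate an address through one level of page tables."""
    page, offset = divmod(addr, PAGE)
    if page not in page_table:
        raise MemoryError(f"page fault at {hex(addr)}")
    return page_table[page] * PAGE + offset

# The guest manages this table itself, with no hypervisor involvement:
guest_pt = {0: 2, 1: 5}        # guest-virtual page -> guest-physical page

# The hypervisor sets this up once, then butts out:
ept = {2: 100, 5: 101}         # guest-physical page -> host-physical page

def guest_access(vaddr):
    gpa = translate(guest_pt, vaddr)   # stage 1: guest's own page tables
    return translate(ept, gpa)         # stage 2: EPT, enforced by the CPU

print(hex(guest_access(0x10)))  # prints 0x64010
```

The guest can remap `guest_pt` however it likes; nothing it does there can produce a host-physical address outside what the EPT grants it.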

VT-x was significant because it meant that you could virtualise pretty much anything without specific workarounds in the hypervisor or VM, which is pretty cool. You’d be hard pressed to buy a CPU without VT-x nowadays – even the lowly $42 Celerons have it. That’s great for users who want to run multiple operating systems without multibooting, or test different software environments.

VT-d

That’s the easy stuff out of the way. The next step up is VT-d (or AMD-Vi on the other side of the fence), which applies the MMU concept to your memory-mapped I/O devices.

This was a bit tough for us to wrap our heads around, so we hit up one of our hardware geek contacts to help us make sense of it all.

Anchor: G’day Virgil, thanks for agreeing to come in to the studio today to help us out.

Virgil: Not at all, the pleasure’s all mine.

A: What have you been up to recently?

V: I spend most of my day working on a security-enhanced microkernel. Then at home I poke embedded systems for fun.

A: Sounds like you’re qualified to explain virtualisation extensions to us; what can you tell us about VT-d and this IOMMU thing?

V: Right, so VT-d is primarily two things. One, it’s an interrupt-remapping engine, which is awesome for a variety of reasons. It solves some problems that you could, in theory, deal with in the hypervisor, but only at a massive performance cost.

Two, more importantly, it’s an IOMMU – this is the bit you’re interested in – and it’s mostly implemented on the chipset instead of the CPU. It means you can stick page tables in front of all memory accesses by devices. I’ll let that sink in for a moment.

A: …

V: Okay, the IOMMU is a DMA remapping engine. It lets you, for example, wall off a network card into its own memory space and prevent it from touching memory it’s not allowed to. It simply can’t touch that memory because it’s outside its address space. It’d be like a segfault for normal software.
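[Ed: to make Virgil’s point concrete, here’s a toy Python model of an IOMMU as a per-device DMA remapping table. The device name, page numbers and addresses are all made up; real VT-d uses multi-level tables keyed by PCI requester IDs.]

```python
# Toy model of an IOMMU: each device gets its own DMA address space,
# and any access outside it faults instead of reaching RAM.
# Device names and page numbers are invented for illustration.

PAGE = 4096

class IOMMU:
    def __init__(self):
        self.domains = {}          # device -> {io-virtual page: physical page}

    def attach(self, device, mapping):
        self.domains[device] = dict(mapping)

    def dma(self, device, iova):
        """Translate a device's DMA address, faulting if unmapped."""
        page, offset = divmod(iova, PAGE)
        mapping = self.domains.get(device, {})
        if page not in mapping:
            # Outside the device's address space: the access never
            # reaches RAM -- the "segfault for hardware" case.
            raise PermissionError(f"DMA fault: {device} -> {hex(iova)}")
        return mapping[page] * PAGE + offset

iommu = IOMMU()
iommu.attach("nic0", {0: 500})     # nic0 may only touch physical page 500

print(iommu.dma("nic0", 0x20))     # fine: lands inside page 500

try:
    iommu.dma("nic0", 5 * PAGE)    # outside its one allowed page
except PermissionError as fault:
    print(fault)                   # the hardware "segfault"
```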

A: Hey, that’s pretty cool! If we had that for our RAID cards it would’ve picked up those dodgy memory accesses that we saw when fixing the megaraid_sas driver bug in the kernel.

V: Yep, you’re getting it now. Now let me connect this to the VT-x stuff for you.

You’ve got the EPT feature that Intel introduced with the Nehalem processors, another layer of memory address translation on top of the normal one. It’s controlled by the hypervisor. So the hypervisor crafts a “guest-physical” address space, feeds it to the processor’s MMU, starts the VM, and then butts out until it’s called again.

The hypervisor is confident that the VM’s view of memory is wholly described by the EPT layer. The guest can do whatever it wants: manage page tables, mess with virtual memory, whatever. The hypervisor doesn’t need to know, care, or (more importantly) be invoked or maintain shadow page tables. Say hello to performance!

Are you with me so far?

A: Yeah, we’re all safely sandboxed, and a huge chunk of performance overhead is gone.

V: Good. Now you’ve got a tech that lets you put the processor in a sandbox using page tables: EPT. And you’ve got a tech that lets you put devices into individual sandboxes using page tables: VT-d. Those page tables have a common format, so long as you’re a bit careful.

The hypervisor can craft a set of pagetables describing the whole VM, hand it over to the memory-management hardware, then let the VM run wild and free. The virtualisation overhead for setting up DMA transactions becomes essentially zero, and as an added bonus, the guest is incapable of causing corruption in the hypervisor.

Interrupt remapping gives you near-zero virtualisation overhead for handling interrupts as well. The obvious example is assigning a fast network card, say gig-eth or 10gig-eth, entirely to a guest. The guest can process incoming packets at pretty much native speed, save for the page table walks, which Intel is good at. The same goes for storage adapters as well, or even graphics cards if you wanted to.

A: For all those bitcoins that I’m mining in VMs, right?

V: Right.

I/O Virtualisation

A: That’s well and good, but uh… we can’t stick a network card into our hypervisors for every single VM. Even if we wanted to, there aren’t enough PCI-e slots.

V: Indeed. And there’s a solution for that too, but it’s getting into pretty new territory for hardware and software support. It’s based on SR-IOV, or “Single-Root I/O Virtualisation”.

A: Explain?

V: Intel’s leading the charge at the moment; their solution is to create virtual PCI-e devices at the hardware level. VT-d allowed for “PCI passthrough” to VMs, so it stands to reason that we just need to create some more PCI-e devices. *grins*

A: Woah…

V: Yeah. Intel published a little doco about this a while ago; it has some really good diagrams showing how it all fits together. [Ed: URL at the bottom of the page]

A: “Virtual Functions”?

V: *shrugs* It’s just what Intel calls the virtual network cards. Anyway, it’s the same pattern we’ve seen for the other virtualisation extensions: make the hardware aware of the VMs, then palm the hard work off to the hardware.
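[Ed: here’s a toy Python sketch of the physical-function/virtual-function relationship. The class names, PCI addresses and VF limit are invented for illustration; on real hardware the card itself spawns the VFs when you ask for them, e.g. via the kernel’s sriov_numvfs knob.]

```python
# Toy model of SR-IOV: one physical function (PF) exposing several
# virtual functions (VFs), each assignable to a different VM.
# All names and numbers here are made up.

class VirtualFunction:
    def __init__(self, pci_addr):
        self.pci_addr = pci_addr
        self.owner = None          # which VM this VF is passed through to

    def assign(self, vm):
        self.owner = vm            # that VM then loads the VF driver

class PhysicalFunction:
    """A toy SR-IOV capable NIC."""
    def __init__(self, pci_addr, max_vfs=8):
        self.pci_addr = pci_addr
        self.max_vfs = max_vfs
        self.vfs = []

    def enable_vfs(self, n):
        # Each VF shows up as its own PCI-e device with its own queues.
        if n > self.max_vfs:
            raise ValueError("card doesn't support that many VFs")
        self.vfs = [VirtualFunction(f"{self.pci_addr}.{i + 1}")
                    for i in range(n)]
        return self.vfs

pf = PhysicalFunction("0000:01:00.0")
for i, vf in enumerate(pf.enable_vfs(4)):
    vf.assign(f"vm{i}")            # one virtual NIC per guest

print([vf.pci_addr for vf in pf.vfs])
```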

A: So, lemme make sure I’ve got this right, the VM loads the driver for a real physical network card, and it works?

V: Close enough, yes. It actually loads a slightly different driver, igbvf instead of the regular igb, because it’s not a real-real physical network card, and goes on its merry way.

A: I reckon we’ll be buying some new network cards with our next hypervisor chassis…

V: Assuming your current version of KVM supports it, you’re good to go.

A: Well I think that about wraps it up. Thanks for dropping by, it’s been most enlightening.

V: Not at all, thank you. Glad I could help!

Multiqueue support

Our chat with Virgil was very helpful, but it reminded us that our older VMs won’t be able to benefit from these enhancements – our older hypervisors probably don’t have hardware that’s capable, and SR-IOV support in KVM is very recent, only arriving in RHEL 6.3. Not all is necessarily lost though.

We recently managed to hit a performance ceiling on one of our loadbalancer VMs: the virtio_net device used by VMs has a single Tx queue and a single Rx queue, and hence a single source of interrupts. In contrast, real physical NICs tend to have multiple interrupt queues that can be mapped to multiple physical CPU cores.

As a result, a virtio_net device can’t scale up to a greater number of vCPUs in a VM, while a physical NIC can make use of the extra CPU grunt afforded by having more CPU cores. We believe we were saturating a single vCPU with the interrupt load caused by a large number of small requests.

Work on this in KVM is ongoing. The idea is to add support for multiple queues in the virtio_net device. This is then extended out towards the hardware, by adding complementary support to KVM/qemu, and to the tun/tap driver on the hypervisor side.
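As a hand-wavy illustration of why this matters, here’s a toy Python model where each queue’s interrupts are pinned to one vCPU. The flow counts, the hashing and the queue-to-vCPU mapping are all made up; the point is only that one queue can never spread its interrupt load beyond one vCPU.

```python
# Toy model: interrupt load per vCPU with one queue vs several.
# Flow counts and the queue-to-vCPU pinning are illustrative only.

from collections import Counter

def interrupt_load(flows, n_queues, n_vcpus):
    """Count interrupts per vCPU, with queue i pinned to vCPU i."""
    load = Counter()
    for flow in flows:
        queue = hash(flow) % n_queues   # NIC hashes each flow to a queue
        load[queue % n_vcpus] += 1      # that queue's IRQ hits one vCPU
    return load

flows = range(10000)                    # stand-in for 10000 distinct flows

single = interrupt_load(flows, n_queues=1, n_vcpus=4)
multi = interrupt_load(flows, n_queues=4, n_vcpus=4)

print(dict(single))   # {0: 10000}: every interrupt lands on vCPU 0
print(dict(multi))    # evenly spread across the four vCPUs
```

With one queue, the first vCPU saturates while the other three idle – which matches what we believe we saw on the loadbalancer.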

This doesn’t solve all the problems, but it does push the problem out of the VM and opens the door for improvement.

It’s tricky to get right – multiqueue support could be added naively, but it’s likely that interrupts would hit vCPUs at random, without regard for process locality. This would, by our understanding, require frequent inter-processor interrupts in the VM, which means trapping out to the hypervisor and then back again. That’s a big performance hit, and could end up worse than not having multiqueue support at all.

We hope you’ve enjoyed our discussion on hardware-assisted virtualisation. If you’ve got any questions, comments, or think we’ve got it just plain wrong, let us know in the comments.

If you’re interested in the paper that Virgil referred to, have a read of the Intel SR-IOV Primer.