Virtualization is a key enabling technology for the modern datacenter. Without virtualization, tricks like load balancing and multitenancy wouldn't be available from datacenters that use commodity x86 hardware to supply the on-demand compute cycles and networked storage that power the current generation of cloud-based Web applications.

Even though it has been used pervasively in datacenters for the past few years, virtualization isn't standing still. Rather, the technology is still evolving, and with the launch of I/O virtualization support from Intel and AMD it's poised to reach new levels of performance and flexibility. Our past virtualization coverage looked at the basics of what virtualization is, and how processors are virtualized. The current installment will take a close look at how I/O virtualization is used to boost the performance of individual servers by better virtualizing parts of the machine besides the CPU.

Part 1 described three ways in which a component might be virtualized: emulation, "classic" virtualization, and paravirtualization. Part 2 described in more detail how each of these methods is used in CPU virtualization. But the CPU is not the only part of a computer that can use these techniques; although hardware devices are quite different from a CPU, similar approaches are equally useful.

I/O basics: the case of PCI and PCIe

Before looking at how I/O devices are virtualized, it's important to know in broad terms how they work. These days most PC hardware is, from an electronic and software perspective, PCI or PCI Express (PCIe); although many devices (disk controllers, integrated graphics, on-board networking) are not physically PCI or PCIe—they don't plug into a slot on the motherboard—the way in which they are detected, identified, and communicated with is still via PCI or PCIe.

In PCI, each device is identified by a bus number, a device number, and a function number. A given computer might have several PCI buses, which might be linked (one bus used to extend another bus, joined through a PCI bridge), independent (several buses all attached to the CPU), or some combination of the two. Generally, large high-end machines with lots of I/O expansion have more complicated PCI topologies than smaller or cheaper systems. Each device on a bus is assigned a device number by the PCI controller, and each device exposes one or more numbered functions. For example, many graphics cards offer integrated sound hardware for use with HDMI; typically the graphics capability will be function 0 and the sound will be function 1. Only one device can use the bus at any given moment, which is why high-end machines often have multiple independent buses: this allows multiple devices to be active simultaneously.
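To make the bus/device/function addressing concrete, here is a minimal Python sketch. It assumes the conventional packing of 8 bits for the bus, 5 for the device, and 3 for the function, which is not spelled out above; the names `PciAddress`, `encode_bdf`, `decode_bdf`, and `format_bdf` are purely illustrative and do not come from any real library.

```python
from typing import NamedTuple


class PciAddress(NamedTuple):
    bus: int       # assumed 8 bits: 0-255
    device: int    # assumed 5 bits: 0-31
    function: int  # assumed 3 bits: 0-7


def encode_bdf(addr: PciAddress) -> int:
    """Pack bus/device/function into a single 16-bit identifier."""
    if not (0 <= addr.bus <= 0xFF
            and 0 <= addr.device <= 0x1F
            and 0 <= addr.function <= 0x7):
        raise ValueError(f"out-of-range PCI address: {addr}")
    return (addr.bus << 8) | (addr.device << 3) | addr.function


def decode_bdf(value: int) -> PciAddress:
    """Unpack a 16-bit identifier back into bus/device/function."""
    return PciAddress((value >> 8) & 0xFF, (value >> 3) & 0x1F, value & 0x7)


def format_bdf(addr: PciAddress) -> str:
    """Render the address in the common bus:device.function hex notation."""
    return f"{addr.bus:02x}:{addr.device:02x}.{addr.function:x}"


# The graphics-card example from the text: one device, two functions.
# The bus and device numbers here are made up for illustration.
graphics = PciAddress(bus=1, device=0, function=0)
hdmi_audio = PciAddress(bus=1, device=0, function=1)
```

In this sketch the two functions of the card differ only in the lowest three bits of the identifier, which is why software can treat them as separate logical devices that share one physical slot.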

PCIe operates similarly. PCIe is a point-to-point architecture rather than a bus architecture; rather than all devices (and all hardware slots) on the same bus being electrically connected, in PCIe there are no connections between devices. Instead, each device is connected solely to the controller. Each connection between device and controller is regarded as its own bus; devices are still assigned numbers, but because there can only be one device on each "bus," this number will always be zero. This approach allows software to treat PCIe as if it were PCI, allowing for easier migration from PCI to PCIe. This point-to-point topology alleviates the bus contention problem in PCI—since there is no bus sharing, there are fewer restrictions on concurrent device activity.

Data transfer to and from the device can use three mechanisms: system memory, x86 I/O ports, and PCI configuration space. x86 I/O ports are there to provide legacy compatibility, and PCI configuration space is used primarily for configuration. The main way that the OS communicates with PCIe devices is through system memory; this is the only mechanism that allows for large, general-purpose transfers. (With I/O ports, reads and writes are limited to 32 bits, and the CPU must take action after every single read or write, making communication slow and processor-intensive. PCI configuration space is limited to 256 bytes and is used only for device configuration.) Each device is assigned a block of system memory to which it can read and write directly ("DMA," direct memory access). For I/O devices requiring bulk transfers (disk controllers, network adaptors, video cards), this is the primary communication mechanism, as each of these devices performs regular large transfers.
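The cost difference between port I/O and DMA can be illustrated with some back-of-the-envelope arithmetic. The sketch below is a toy cost model, not a measurement: it counts how many times the CPU has to act for a transfer of a given size. The 32-bit port width comes from the text; the figure of two CPU actions per DMA transfer (one to describe the buffer, one to handle completion) is an assumption chosen for illustration, and both function names are hypothetical.

```python
PORT_WIDTH_BYTES = 4  # 32 bits per port read or write, as described above


def port_io_cpu_actions(transfer_bytes: int) -> int:
    """CPU actions when every 32-bit chunk goes through an I/O port."""
    if transfer_bytes < 0:
        raise ValueError("transfer size must be non-negative")
    # Ceiling division: a partial final chunk still costs one access.
    return -(-transfer_bytes // PORT_WIDTH_BYTES)


def dma_cpu_actions(transfer_bytes: int, setup_actions: int = 2) -> int:
    """CPU actions when the device copies to or from memory by itself.

    setup_actions is an assumed constant (describe the buffer, then
    handle the completion notice); real hardware will differ.
    """
    if transfer_bytes < 0:
        raise ValueError("transfer size must be non-negative")
    return setup_actions if transfer_bytes > 0 else 0


# A single 4 KB disk block under each scheme.
block = 4096
port_cost = port_io_cpu_actions(block)  # 1024
dma_cost = dma_cpu_actions(block)       # 2
```

Under these assumptions the port I/O cost grows linearly with the transfer size, while the DMA cost stays flat, which is the reason bulk-transfer devices rely on DMA.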

When software wants to tell a PCI device to do something, the host delivers a command to the bus. Each device inspects the command and acts on it if necessary. When the device wants to tell the CPU something (either because it has completed a command or because it has received some data), it interrupts the CPU, which in turn executes the device driver. PCI interrupts are generally delivered using four physical interrupt lines. These lines are shared between all devices on the same bus, so each device driver must examine the interrupt to determine whether its own device raised it. PCIe interrupts do not use dedicated physical lines; instead, a message is sent to the device driver by writing to the block of memory assigned to the device. In other words, PCIe uses the same system for interrupts as it does for data transfer. This avoids the need to share interrupt lines, because each interrupt can be directed specifically and solely to the driver that needs it.
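The difference between the two interrupt styles can be sketched as a toy dispatcher. This is a simplified model under stated assumptions, not how any real interrupt controller is programmed: `Device`, `dispatch_shared_line`, and `dispatch_message` are hypothetical names, and the "vector table" is just a Python dictionary standing in for whatever mapping the platform keeps.

```python
class Device:
    """A toy device that can have an interrupt pending."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.pending = False

    def raise_event(self) -> None:
        self.pending = True


def dispatch_shared_line(devices):
    """Shared line: the CPU only knows that *some* device interrupted,
    so every driver on the line has to check its own device."""
    handled, checks = [], 0
    for dev in devices:
        checks += 1
        if dev.pending:
            dev.pending = False
            handled.append(dev.name)
    return handled, checks


def dispatch_message(vector_table, vector):
    """Message-style interrupt: the message itself identifies the
    target, so exactly one driver runs."""
    dev = vector_table[vector]
    dev.pending = False
    return [dev.name], 1


disk, nic, sound = Device("disk"), Device("nic"), Device("sound")

nic.raise_event()
shared_result = dispatch_shared_line([disk, nic, sound])

nic.raise_event()
message_result = dispatch_message({0: disk, 1: nic, 2: sound}, vector=1)
```

In this model the shared line costs three driver checks to find the one device that interrupted, while the message-style path goes straight to the right driver.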

Virtualizing PCI and PCIe

So, how do these things get virtualized? The first approach is emulation. Just as CPU emulation requires an entire virtual CPU to be run "in software," the same is true of device emulation. Generally, the approach taken is for the virtualization software to emulate well-known real-world devices. All the PCI infrastructure—device enumeration and identification, interrupts, DMA—is replicated in software. These software models respond to the same commands, and do the same thing as their hardware counterparts. The guest OS will write to its virtualized device memory (whether it be system memory, x86 I/O, or PCI configuration space), and trigger interrupts, and the VMM software will respond as if it were real hardware. Even this interrupt signalling uses emulation; one of the emulated devices is an interrupt controller.

This "response" generally means making an equivalent call to the host OS. So, for example, to write some data to disk, the guest OS will use its driver to write the data to the disk controller's device memory, which sits inside a device model (a kind of virtual controller) along with the PCI configuration space and a virtual version of the controller chip. Then, using an interrupt sent via the VM's virtual interrupt controller, the guest OS commands the VMM's virtual disk controller to write the data to a particular location on the disk. In turn, the VMM's disk controller will tell the host OS to write the data to a particular spot in a file (or, when used with so-called raw disks, to a particular spot on disk). The host OS then does the same thing the guest OS did: it copies the data to the disk controller's device memory via its driver and signals an interrupt.
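The disk-write path just described can be sketched as a toy device model. Everything here is an illustrative assumption rather than the interface of any real VMM: the register offsets, the `CMD_WRITE` value, the 512-byte sector size, and the `EmulatedDiskController` class are all invented for the example. The sketch also simplifies the signalling by treating a write to a command register as the trigger, and an in-memory `io.BytesIO` stands in for the host file that backs the virtual disk.

```python
import io

SECTOR_SIZE = 512  # assumed sector size for this sketch


class EmulatedDiskController:
    """Toy device model: guest register writes become host file writes."""

    REG_SECTOR = 0x00   # hypothetical register: target sector number
    REG_COMMAND = 0x04  # hypothetical register: command to execute
    CMD_WRITE = 0x01    # hypothetical command code

    def __init__(self, backing_file) -> None:
        self.backing_file = backing_file          # host-side disk image
        self.sector = 0
        self.data_buffer = bytearray(SECTOR_SIZE)  # emulated device memory
        self.interrupt_pending = False

    def guest_write_buffer(self, data: bytes) -> None:
        """The guest driver fills the controller's device memory."""
        if len(data) != SECTOR_SIZE:
            raise ValueError("this sketch writes exactly one sector")
        self.data_buffer[:] = data

    def guest_write_register(self, offset: int, value: int) -> None:
        """Called when the VMM intercepts a guest register write."""
        if offset == self.REG_SECTOR:
            self.sector = value
        elif offset == self.REG_COMMAND and value == self.CMD_WRITE:
            self._write_to_host()
        else:
            raise ValueError(f"unsupported register write: {offset:#x}")

    def _write_to_host(self) -> None:
        # The "equivalent call to the host OS": an ordinary file write
        # at the offset that corresponds to the requested sector.
        self.backing_file.seek(self.sector * SECTOR_SIZE)
        self.backing_file.write(bytes(self.data_buffer))
        # Completion would be reported through the virtual interrupt
        # controller; here it is just a flag.
        self.interrupt_pending = True


# Guest-side view: fill the buffer, pick a sector, issue the command.
disk_image = io.BytesIO()
controller = EmulatedDiskController(disk_image)
controller.guest_write_buffer(b"A" * SECTOR_SIZE)
controller.guest_write_register(EmulatedDiskController.REG_SECTOR, 3)
controller.guest_write_register(EmulatedDiskController.REG_COMMAND,
                                EmulatedDiskController.CMD_WRITE)
```

Note that the guest never touches the host file directly. It only pokes at what looks like hardware, and the device model translates each action, which is where the overhead of full emulation comes from.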

A software-emulated device

In the diagram above, you can see that there's an entire virtual device and a virtual interrupt controller in the VM, and then another pair of these in the VMM. That's two layers of emulation before you get to the hardware. (The one element of the diagram that's probably not self-explanatory is the little tab with gears on it beneath the OS. That's the device driver, and the device model in the VMM uses it to interface with the hardware.)