Just like the OS creates virtual reality for user processes, the hypervisor creates virtual reality for operating systems. In my last blog I talked about virtual memory. I’ll continue by describing how the hypervisor further virtualizes the memory seen by the operating system.

The operating system is responsible for allocating physical pages and mapping them to virtual addresses. When the operating system boots, it asks the hardware how much physical memory there is. If your machine has 4GB of RAM, the OS will know that it can allocate a million or so 4kB pages starting at offset 0. If you want to run more than one OS on the same hardware, the hypervisor will have to provide each of the guest OSs with the illusion of a more or less contiguous block of physical memory starting at address 0, because that’s what all OSs expect.

The simplest way to do that is to introduce one more level of virtualization — one more level of mapping between what the guest OS considers a physical address (it’s called guest physical address) and the actual physical address (which is called host physical address). In other words, a virtual address is first translated to the guest physical address, and then to the host physical address — the latter translation controlled by the hypervisor.

There are several ways of implementing this scheme.

Nested Page Tables

Address translation on the x86 is implemented in hardware, with page tables stored in regular memory that is directly accessible to the OS kernel. It’s not easy to hook an additional level of indirection into this machinery. In recent years Intel and AMD started producing chips that provide hardware support for the second layer of virtualization. AMD calls its technology nested page tables (NPT), and Intel calls theirs extended page tables (EPT). Unfortunately those two differ in some important details — e.g., EPT doesn’t support accessed/dirty bits.

The usual page table walk, which happens when the mapping is not present in the TLB, goes through the guest page tables and continues into host page tables, until the physical address if found. This address then goes into the TLB to speed up future lookups. This happens in hardware without any intervention from the hypervisor.

The hypervisor has to intervene only when a page fault happens. It’s a two stage process: First the operating system processes the fault. It allocates guest physical memory from its (fake) pool and stores it in the guest page table. Remember, the OS knows nothing about the second level of indirection — it’s just your regular Linux or Windows. When the user instruction that caused the fault is restarted, address translation progresses half way, and faults again when it can’t find the guest-to-host mapping. This time the fault is vectored into the hypervisor, which allocates actual physical memory and patches the rest of the page tables.

Old generations chips don’t support NTP so a different set of software tricks is used.

Shadow Page Tables

In the absence of nested page table support, there is another cheat that’s used by virtual machines. The idea is to create a second copy of page tables called shadow page tables that map virtual addresses directly to host physical addresses (actual DRAM), and let the processor use them for address translation (so the CR3 register points to those, rather than to guest, or primary page tables).

The structure of shadow page tables that are kept by the hypervisor reflects the structure of primary page tables that are kept by the operating system. There are several approaches to keeping those two structures in sync.

The naive approach would be to let the OS freely modify its page tables, e.g., in response to a page fault. The hypervisor would only intercept the privileged instruction INVLPG that the OS uses to invalidate a TLB entry. At that point the hypervisor would make a corresponding change to its shadow page tables, substituting the guest physical address with the corresponding host physical address (possibly allocating a physical page from its own pool). Unfortunately, this doesn’t work because some operating systems (I won’t be pointing fingers) occasionally forget to invalidate TLB entries.

The fallback approach is for the hypervisor to write-protect all guest page tables. Whenever the OS tries to modify them, a page fault occurs — the so called tracing fault — which is vectored into the hypervisor. That gives the hypervisor the opportunity to immediately update its shadow page tables. In this scheme the two structures always mirror each other (modulo physical addresses) but at the cost of a large number of page faults.

A common optimization is to allow shadow page tables to be lazily updated. This is possible if not all primary page tables are write protected. If an entry is present in the primary page table but is missing from the shadow page table, a page fault will occur when the hardware walks it for the first time. This is a so called hidden page fault and it’s serviced by the hypervisor, which at that point creates an entry in its shadow page tables based on the entry in the primary page tables.

Since the hypervisor must be involved in the processing of some of the page faults, it must install its own page handler. That makes all faults, real and hidden, vector into the hypervisor. A real page fault happens when the entry is missing (or is otherwise protected) in the primary page tables. Such faults are reflected back to the OS — the OS knows what to do with them. Faults that are caused by the OS manipulating page tables, or by the absence of an entry in the shadow page tables are serviced by the hypervisor.

An additional complication is related to the A/D (accessed and dirty) bits stored in page tables. Normally, whenever a page of memory is accessed, the processor sets the A bit in the corresponding PTE (Page Table Entry); and when the page is written, it also sets the D bit. These bits are used by the OS in its page replacement policy. For instance, if a page hasn’t been dirtied, it doesn’t have to be written back to disk when it’s unmapped. In a virtualized system, the processor will of course mark the shadow PTEs, and those bits have to somehow propagate to the primary page tables for the use by the OS. This can be accomplished by always starting host physical pages in the read-only mode. The hypervisor allocates a host physical page but it sets the read-only bit in its shadow PTE, which is used by the processor for address translation. Reading from this page proceeds normally, but as soon as the processor tries to write to it, there is a protection fault. The hypervisor intercepts this fault: It checks the guest page tables to make sure the page was supposed to be writable; if so, it sets the dirty bit in the guest PTE, and removes the write protection from the host PTE. After that, write access to this page may proceed without further faults.

The hypervisor keeps separate shadow page tables for each process. When a context switch occurs, the OS tries to modify the CR3 register to point to the primary page directory for the new process. CR3 modification is a protected instruction and it traps into the hypervisor and gives it the opportunity to switch the shadow tables and point the CR3 at them. The hypervisor might also chose to discard shadow page tables upon context switch, and later refill them on demand using hidden page faults.

Guest/Host Transitions

All these hypervisor tricks rely on the ability to trap certain operations performed by the operating system. The virtualization of virtual memory requires the trapping of page faults, TLB flushes, and CR3 register modifications. I’ll describe various methods to achieve this in my next blog post.

I’d like to thank David Dunn and Pete Godman for sharing their expertise with me.

Bibliography