I understand that there might be a good reason for Intel to add virtualization extensions to their CPU architecture. Instead of fixing the x86 architecture to (optionally) make it Popek-Goldberg compliant and have all critical instructions trap if not run in Ring 0, they added non-root mode, a very big hammer that allows me to switch my CPU state completely to that of the guest and switches back to my original host state on a certain event in the guest. Well, it’s a great toy for people who want to play with CPU internals.

Therefore Intel had to add the VMCS, a 4 KB block in memory that holds the complete CPU state of both the host and the guest (segment registers, GDT and IDT pointer, certain MSRs etc.) as well as some control bits (for example, when to exit).

I also understand that Intel doesn’t allow me to just read and write memory in the VMCS, but abstracts accessing the virtualization state using a vmread/vmwrite interface. This way, the actual layout of this 4 KB page is an implementation detail and can be changed on later CPUs. It also allows for field indexes that are more spread out and encode what kind of field it is.

So I understand very well why Intel encodes into the VMCS field index whether it’s a control field (0), a read-only field (1), part of the guest state (2) or part of the host state (3), and whether it’s a 16 bit (0), 32 bit (2), 64 bit (1) or native-sized (3) field. This way, for example, all 16 bit guest state fields (like the guest’s CS) have indexes starting at 0x0800, and all 64 bit host state fields (like the host’s EFER MSR) start at 0x2C00.
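As a rough sketch of this convention: the bit positions below (width in bits 14:13, type in bits 11:10, per-field index in bits 9:1) are taken from the field-encoding appendix of the Intel manual, while the function and table names are made up for illustration.

```python
# Category and width codes as listed in the text above.
FIELD_TYPES = {0: "control", 1: "read-only", 2: "guest state", 3: "host state"}
FIELD_WIDTHS = {0: "16 bit", 1: "64 bit", 2: "32 bit", 3: "native"}


def vmcs_encode(field_type, width, index=0):
    """Compose a field encoding from its category, width and per-group index."""
    return (width << 13) | (field_type << 10) | (index << 1)


def vmcs_decode(encoding):
    """Split a field encoding into (category name, width name, index)."""
    field_type = (encoding >> 10) & 0x3
    width = (encoding >> 13) & 0x3
    index = (encoding >> 1) & 0x1FF
    return FIELD_TYPES[field_type], FIELD_WIDTHS[width], index


print(hex(vmcs_encode(2, 0)))  # first 16 bit guest state field: 0x800
print(hex(vmcs_encode(3, 1)))  # first 64 bit host state field: 0x2c00
print(vmcs_decode(0x0802))     # guest CS selector
```

Both base values quoted above fall out of the two shifts, so the convention itself is easy enough to follow.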

Now what I don’t understand is why it is so hard to be consistent with this convention (Intel Manual 3B, Appendix H).

VMCS Link Pointer (0x2800): In the first revision of VT, it had already been decided that there should be a mechanism for having a second 4 KB page in case later versions of VT need more than 4 KB of state. For this, there is the “VMCS Link Pointer”, which is a 64 bit physical address. Guess what category this belongs to? Guest state.

“Guest Address Space Size” bit in the “VM Entry Controls” Field (0x4012): This is clearly guest state and not a control field.

“Host Address Space Size” bit in the “VM Exit Controls” Field (0x400C): This is clearly host state and not a control field.

VMX-preemption timer value (0x482E): This timer controls after how many ticks execution of the guest should end and control should be returned to the hypervisor. Intel put this into the “guest state” bucket: All other guest state fields are properties of the i386/x86_64 architecture that need to be switched, but not this one. This should really be a control field.
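To see where the encoding files each of these four fields, one can pull out the category bits (bits 11:10 of the field index). This is only a small illustration; the helper name and the dictionary are invented here, the field indexes are the ones quoted above.

```python
# Category names indexed by the value of bits 11:10 of a field encoding.
TYPE_NAMES = ("control", "read-only", "guest state", "host state")


def encoded_type(encoding):
    """Return the category the field index itself claims to belong to."""
    return TYPE_NAMES[(encoding >> 10) & 0x3]


# Field indexes as quoted in the list above.
FIELDS = {
    "VMCS Link Pointer": 0x2800,
    "VM Entry Controls": 0x4012,
    "VM Exit Controls": 0x400C,
    "VMX-preemption timer value": 0x482E,
}

for name, encoding in FIELDS.items():
    print(f"{name} ({encoding:#06x}): {encoded_type(encoding)}")
```

The link pointer and the preemption timer come out as guest state, and the two address space size bits live inside fields that come out as control.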

And here is another favorite of mine: the “Primary Execution Controls” field. The 32 bits specify which events in the guest will exit guest execution and trap into the hypervisor (Table 21-6). These events are, among others:

exit on HLT

exit on INVLPG

exit on MOV CR3

exit on PAUSE

Setting these bits to 1 enables the traps. So if you set all bits to 0, you basically have an unrestricted guest, and if you set all bits to 1, you have the most controlled guest, and you get a notification about every event in the guest. Or so you might think. Actually, there are two bits in the field that don’t work like this:

Use MSR bitmaps

Use I/O bitmaps

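If these bits are set to 1, the CPU consults a bitmap to decide whether a given MSR or I/O access traps, so only the accesses you selected cause an exit. If “Use MSR bitmaps” is set to 0, all MSR accesses trap (with “Use I/O bitmaps” set to 0, the decision for I/O falls to the separate “Unconditional I/O exiting” bit). So setting these bits to 1 gives you fewer exits, not more. Compared to all other bits, that’s backwards. Oh great.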
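The asymmetry can be shown with a toy model of the exit decision. The bit positions (7 for HLT exiting, 28 for “Use MSR bitmaps”) and the layout of the 4 KB MSR bitmap page follow the Intel manual; the function names are hypothetical, and real hardware has additional constraints (reserved bits, for instance) that this sketch ignores.

```python
HLT_EXITING = 1 << 7        # bit 7 of the primary execution controls
USE_MSR_BITMAPS = 1 << 28   # bit 28 of the primary execution controls


def hlt_exits(controls):
    """An ordinary bit: 1 means the guest's HLT traps, 0 means it does not."""
    return bool(controls & HLT_EXITING)


def msr_access_exits(controls, bitmap_page, msr, is_write):
    """The backwards bit: with it cleared, every MSR access traps."""
    if not controls & USE_MSR_BITMAPS:
        return True
    if msr <= 0x1FFF:
        offset, bit = 0, msr                      # low MSR range
    elif 0xC0000000 <= msr <= 0xC0001FFF:
        offset, bit = 1024, msr - 0xC0000000      # high MSR range
    else:
        return True                               # not covered by the bitmap
    if is_write:
        offset += 2048                            # write bitmaps follow the read bitmaps
    return bool(bitmap_page[offset + bit // 8] & (1 << (bit % 8)))


page = bytearray(4096)  # all zero: no MSR in the covered ranges is selected
print(hlt_exits(0))                                          # False
print(msr_access_exits(0, page, 0x10, False))                # True
print(msr_access_exits(USE_MSR_BITMAPS, page, 0x10, False))  # False
```

With all control bits at 0, HLT runs untouched but every RDMSR traps, which is exactly the opposite of what the other bits suggest.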

Since Steve Jobs seems to be happy to explain his personal opinion on everything lately, I wrote him an email asking him about this, and he replied:

Return-path: <sjobs@apple.com>
Received: from bulkin002-bge351000.mac.com ([unknown] [10.150.69.129])
 by ms231.mac.com (Sun Java(tm) System Messaging Server 7u3-12.01 64bit (built Oct 15 2009))
 with ESMTP id <0L2X00HTAZ3Q6GF1@ms231.mac.com> for XXX@mac.com;
 Mon, 24 May 2010 13:47:50 -0700 (PDT)
Original-recipient: rfc822;XXX@mac.com
Received: from relay13.apple.com ([17.128.113.29])
 by bulkin002.mac.com (Sun Java(tm) System Messaging Server 6.3-7.02 (built Jun 27 2008; 32bit))
 with ESMTP id <0L2X001EVZ3QKED0@bulkin002.mac.com> for XXX@mac.com (ORCPT XXX@mac.com);
 Mon, 24 May 2010 13:47:50 -0700 (PDT)
X-AuditID: 1180721d-b7c17fe00000693e-19-4bfae5f6545a
Received: from [17.201.27.84]
 (using TLS with cipher AES128-SHA (AES128-SHA/128 bits))
 (Client did not present a certificate)
 by relay13.apple.com (Apple SCV relay) with SMTP id DB.14.26942.6F6EAFB4;
 Mon, 24 May 2010 13:47:50 -0700 (PDT)
From: Steve Jobs <sjobs@apple.com>
Content-type: text/plain
Content-transfer-encoding: 7bit
Subject: Re: Intel VT VMCS Layout
Date: Mon, 24 May 2010 13:47:48 -0700
Message-id: <3E789F1B-7E13-FFD2-80F6-8E8D4CDDE7FB@apple.com>
To: Michael Steil <XXX@mac.com>
MIME-version: 1.0 (Apple Message framework v1077)
X-Mailer: Apple Mail (2.1077)
X-Brightmail-Tracker: AAAAAQAAAZE=

The whole VMCS is a big mess, I hate it.

> Hi Steve, what do you think about the ordering of the VMCS fields in
> Intel's VT extenions?
>
> Michael