NO EXECUTE! A weekly look at personal computer technology issues.

(c) 2007 by Darek Mihocka, founder, Emulators.com. November 19 2007

The Arithmetic Flags Challenge

Let me show you a piece of code from a very simple C test program, which I call T1.C:

static int foo(int i)
{
    return(i+1);
}

int main(void)
{
    int i;
    int t = 0;

    for(i = 0; i < 100000000; i++)
        t += foo(i);
}

This is actually a test program I've used for about 10 years. I compile this test program using various compilers, and use it for two different purposes: to compare the code generation of different compilers and their optimization switches on the same platform, and to generate test programs for use in correctness testing and benchmarking of various emulators. For example, this was the first piece of compiled PowerPC test code I got running back when I started developing my never-ending PowerPC emulator. I like this code because it is simple, always takes exactly the same number of instructions to execute, and exercises many common features such as arithmetic operations, loop index incrementing and checking, function calls, memory reads and writes and store-forwarding issues, stack pushes and stack adjusts. These are all extremely common elements of any typical C or C++ program.

I compile two versions of this code: T1FAST.EXE is compiled with maximum compiler optimization turned on, and T1SLOW.EXE is compiled with compiler optimizations disabled, which tends to generate many more memory references and thus slower code. You would be surprised how much Windows code out there ships without full compiler optimizations, so it is necessary to test both variants of the code. The full source code and the two compiled x86 Windows-compatible test programs can be downloaded here (nx11_t1.zip).

Download that ZIP file if you will and try running those two executables from a Windows command prompt. You should find that T1FAST.EXE takes about 3/10 of a second to execute on most current Windows machines, and T1SLOW.EXE will be as fast or perhaps a tenth of a second slower. The Intel Core 2 is actually quite impressive in that it executes both versions in about the same time of under 300 milliseconds, indicating that the internal CPU pipeline is doing a lot of clever tricks to make memory accesses appear to be as fast as ordinary register operations. On most other architectures such as the Pentium III and Pentium 4, there is a fixed extra execution time of about 100 milliseconds for the unoptimized executable, due to a slightly larger real cost in accessing memory over registers, even when that memory is known to be cached in the L1 data cache.

There is another reason why I like this simple test program. Each iteration of the loop executes about a dozen instructions (plus or minus, depending on the compiler), of which 6 either use or modify the arithmetic flags, which I began describing to you last week. On x86, these are the 6 bits within the EFLAGS register which record the Sign flag, the Zero flag, the Carry flag, the Overflow flag, the Adjust flag, and the Parity flag, written in shorthand as SF ZF CF OF AF and PF. The first four I listed - Sign, Zero, Carry, Overflow - are common on most CPU architectures. On the 68000 and 68040, they exist as the "NZVC" bits (Negative, Zero, oVerflow, Carry) in the Condition Codes Register, which is the Motorola equivalent of the EFLAGS register. On PowerPC, these exist as five bits - Less Than, Greater Than, Equal, Overflow, and Carry - but convey much the same information. The other two flags - Adjust (also known as Auxiliary Carry) and Parity - are unique to the x86 architecture but no less important. Windows would not boot correctly without these flags.
Any kind of "if" statement in C, C++, Java, or other language, generally compiles into code which first executes one or more arithmetic instructions to set these condition flags, followed by a conditional branch instruction such as JE (Jump If Equal). Conditional flags are as fundamental to computer code as memory reads and writes. One of the great challenges in developing any efficient virtual machine which emulates real legacy CPU hardware is the accurate and efficient simulation of arithmetic condition flags. All too often I have seen otherwise good emulators slow to a crawl due to a poor implementation of the flags simulation. I believe that this is due to most programmers, even those who develop virtual machines, not having the basic understanding of exactly how condition flags are calculated and therefore they do not know how to efficiently simulate them. This shall be this week's topic.

Flags - The Emulator Litmus Test

Two ongoing debates in the virtual machine community center on these questions:

Direct execution or binary translation?

Interpretation or dynamic recompilation?

If you have been reading this series in detail you should already have a good idea of what each of these terms means:

Direct execution is the concept of allowing code that is in a sandbox to be executed natively by the host CPU, whether that code is inside of a virtual machine or just user-mode code in a multi-tasking protected operating system. For example, when you run a Windows application, the CPU switches to user mode, or what is called "Ring 3", and executes your Windows application natively. The hardware (in theory at least) catches any attempt by the sandboxed code to do things it is not supposed to - writing beyond the end of the stack, accessing memory that does not belong to it, etc. Direct execution is utilized when the virtual machine's guest bytecode is the same as the bytecode of the host CPU, and is found in most "PC-on-PC" emulation products such as Microsoft's Virtual PC 2007 and VMware Fusion.

Binary translation is the opposite of that, and covers a wide range of non-native code execution mechanisms. A Java virtual machine for example has to use binary translation since Java bytecode is not understood directly by the x86 architecture. Within the realm of binary translation are a variety of techniques, ranging from the trivial interpretation of bytecode (as I have discussed previously with the interpreter loop example), to full dynamic recompilation (also called "Just-In-Time" compilation or "jitting"), and various hybrid schemes in between. Binary translation is necessary when the guest CPU and host CPU are not identical, such as when emulating a 68040 or PowerPC Macintosh computer on a Windows PC.

In some virtual machines there may be a mixture of direct execution and binary translation techniques in use. For example, in one older technique used by PC-on-PC virtual machines, kernel mode code (or "Ring 0" code) is sandboxed via some form of binary translation such that accesses to hardware and other protected resources can be properly controlled in software, while user mode code (that "Ring 3" code I mentioned previously) is allowed to execute directly and is protected by the hardware itself. A related technique, called ring compression, directly executes both user and kernel mode code, but kernel mode code is pushed down to "Ring 1", which you can think of as a kernel mode running with user-mode privileges. When code in Ring 1 attempts to access a protected resource, it triggers a hardware exception which is caught by the virtual machine's monitor (a.k.a. "hypervisor"), which actually executes at Ring 0. Within the Ring 0 exception handler the offending Ring 1 instruction is usually interpreted, and then control is given directly back to the next instruction in Ring 1.

Ring compression fell out of favor recently due to its complexity - some code executes directly, but other code throws exceptions which then require it to be interpreted. Older versions of Virtual PC, Virtual Server, VMware, and Xen all used variants of ring compression or other hybrid schemes.

Today the hot technology is "hardware virtualization", called "Vanderpool" and "VT" by Intel and "Pacifica" and "AMD-V" by AMD. While I agree that ring compression is an overly complex and error prone solution, I disagree with the direction that the industry has taken to move away from all forms of binary translation and go to almost purely direct execution schemes based on hardware virtualization.

If anything, I view the decision to use VT or even ring compression as a cop-out by the developer of the virtual machine; an admission that they really didn't know how to solve performance issues without relying on additional hardware. Microsoft recently dropped support for ring compression, choosing to force its VM customers to upgrade to Intel and AMD's latest VT-enabled chips (http://blogs.technet.com/jhoward/archive/2006/04/28/426703.aspx). Forcing customers to buy new hardware is being spun as a "feature" when in fact it is an admission of utter failure.

If you ask most developers to rank the various virtual machine techniques in order of performance, from fastest to slowest, they will likely give you a list similar to this:

Direct execution using VT hardware virtualization

Direct execution of ring 3 and ring compression

Direct execution of ring 3 and binary translation based jitting of ring 0

Binary translation based jitting of both user and kernel mode code

Interpretation of both user and kernel mode code

Traditionally, virtual machines are first developed as a pure interpreter. Once that is working they are optimized using jitting. When the guest and host architectures are the same, the virtual machine is further optimized to use one of the forms of direct execution. People naively assume that each step up the list equates to a large performance speedup. This is not always the case, as the performance of any virtual machine is only as good as its implementation. Current research from both VMware and people involved with the QEMU emulator indicates that today's VT technology from both Intel and AMD is actually slower than older ring compression or binary translation techniques! I even confirmed this with the little test program I discussed in Part 4 of this series. And in the Macintosh emulation world, I have seen third-party jitting-based Mac emulators get trounced by purely interpretation-based implementations written by myself.

So you might understand why I still care about discussing interpreters, and why I tend to dislike the sexy-sounding but misguided efforts to move the whole world to VT. The data just does not back up people's assumptions.

Take the example of the QEMU virtual machine, which bills itself as a fast x86 emulator which uses jitting (http://fabrice.bellard.free.fr/qemu/about.html). Fair enough, I would expect that to be fast. However, QEMU also has an accelerated mode of operation that uses direct execution techniques. This would seem to imply that the jitted mode is slow enough to warrant a direct execution alternative.

So earlier this summer, I took my little T1.C test program and ran it on a variety of Windows virtual machines running on my Intel Core 2-based Mac Pro to get an idea of what these slowdowns are like compared to natively running the test program.

VM and Technique                                T1FAST.EXE time    T1SLOW.EXE time
                                                in seconds         in seconds

Native                                          0.26               0.26
VPC2007 using VT                                0.27               0.27
VPC2007 using ring compression                  0.27               0.27
QEMU 0.9.0 using jitting                        10.5               12
Bochs 2.3.5 using interpretation
    with inline ASM for flags                   25                 31
Bochs 2.3.5 using interpretation
    with lazy flags code in C++                 34                 42
My current Bochs sources using
    interpretation and new flags code in C++    14                 18

Table 11-1: Timings of the simple integer benchmark.

As you can see from the above data, QEMU's dynamic recompilation is a grossly deficient implementation of jitting. I had expected QEMU's slowdown to be on the order of 2x to 5x slower than native execution. It was actually slower by an order of magnitude!

So what is the bottleneck? What is so difficult in emulating the x86 architecture that makes the jitter in QEMU drop to its knees and barely outperform an x86 interpreter?

One significant factor, one that I consider to be the litmus test of a good virtual machine implementation, is the simulation of the arithmetic condition flags - those 6 little bits I described above. As of two months ago, QEMU was only about 200% faster than Bochs 2.3.5 at running my CPU-intensive test program. As of today, I have optimized Bochs down to a point where it is barely 50% slower than QEMU.

What did I do? And what are these "lazy flags" mentioned in the table?

WARNING! As I did last week, I will be discussing sequences of C++ code derived from the Bochs 2.3.5 source code, which is covered by the LGPL (the GNU Lesser General Public License). If you have an allergy to the GPL or to looking at open source code, I am afraid you will need to stop reading now.

There Is Lazy, And Then There Is Lazy

Arithmetic condition flags are one of those ubiquitous features of any microprocessor machine language. While most programmers quite easily understand the concept that a CMP EAX,EBX instruction sets the Zero flag when the contents of registers EAX and EBX are equal, few really understand how the flags are really set. For example, if EAX=3 and EBX=-47, what bit values does CMP EAX,EBX set for the Parity flag, the Carry flag, and the Overflow flag?
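
Here is a minimal C sketch that answers that exact question from first principles; these are the standard x86 flag definitions, not code from any particular emulator:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t op1 = 3, op2 = (uint32_t)-47;
    uint32_t result = op1 - op2;      /* CMP computes op1 - op2; result = 0x00000032 */

    /* CF: set on an unsigned borrow, i.e. when op1 < op2 as unsigned values */
    unsigned cf = (op1 < op2);

    /* OF: set when the operands' signs differ and the result's sign differs from op1's */
    unsigned of = ((op1 ^ op2) & (op1 ^ result)) >> 31;

    /* PF: set when the low 8 bits of the result contain an even number of 1 bits */
    unsigned ones = 0;
    for (uint32_t b = result & 0xFF; b != 0; b >>= 1)
        ones += b & 1;
    unsigned pf = ((ones & 1) == 0);

    printf("CF=%u OF=%u PF=%u\n", cf, of, pf);    /* prints CF=1 OF=0 PF=0 */
    return 0;
}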

In many virtual machine implementations, that question is side-stepped by the fact that simulating most arithmetic operations requires executing those exact same operations. Huh? In other words, if one is simulating a PowerPC "ADDCO." instruction (which sets the five PowerPC arithmetic condition flags), the host virtual machine can simply execute a native ADD instruction. The same is true the other way around, say, when emulating a 68040 on PowerPC as Apple once did in its Power Macintosh computers, or when emulating PowerPC on x86 as Apple does today in its current Intel-based Mac products.

Most instruction sets, x86 included, provide some means to transfer the arithmetic condition flags into a register. On the 68040 for example, one uses the MOVE CCR,D0 instruction to move the N Z V C bits into the lower 4 bits of the D0 register. On PowerPC one uses a similar instruction called mfcr ("Move From Condition Register"). And on x86, there are two common methods: push the flags register to the stack and then pop the stack into a register, using a code sequence such as PUSHF / POP AX, or copy 5 of the arithmetic condition flags using the LAHF instruction and then read the 6th flag using SETO. What? What? What? Ok, let me explain this in a way that makes sense.

Neither Intel's IA32 nor AMD's AMD64 instruction set provides an instruction to directly transfer all six condition flags to a general purpose register, as is commonly found on other architectures. The shortest and most common x86 code sequence for such purposes is to use the Push Flags instruction PUSHF, which pushes the flags to the stack. A POP instruction can then store those flags in a register or in a memory location. This is an inefficient code sequence, as it involves at least two memory operations (a write to the stack followed by a read of that value just pushed to the stack). The PUSHF instruction requires several clock cycles, and worse, its counterpart POPF can require several dozen clock cycles. The worst virtual machine implementations I have seen involve using both PUSHF and POPF.
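
For illustration, here is a minimal sketch of that capture-the-flags-via-the-stack approach - my own code, not from any shipping emulator, and it assumes a 32-bit x86 build with GCC-style inline assembly:

#include <stdint.h>

/* Perform a native 32-bit ADD and capture the resulting host EFLAGS
   via the PUSHF/POP sequence described above. Note the write to the
   stack immediately followed by a read of that same value. */
static inline uint32_t add32_capture_eflags(uint32_t a, uint32_t b, uint32_t *eflags)
{
    uint32_t sum = a, flags;
    __asm__ __volatile__(
        "addl   %2, %0\n\t"     /* the guest ADD, executed natively   */
        "pushfl\n\t"            /* push EFLAGS (a memory write)       */
        "popl   %1"             /* pop into a register (memory read)  */
        : "+r"(sum), "=r"(flags)
        : "r"(b)
        : "cc");
    *eflags = flags;
    return sum;
}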

There are other techniques. The x86 instruction set of course has conditional branch instructions which query most of those flags. There is the JZ (Jump If Zero) instruction, also called JE (Jump If Equal), which branches if ZF is 1 and does not branch when it is 0. There are JC (Jump If Carry), JO (Jump If Overflow), and JS (Jump If Sign), which are pretty self explanatory. These instructions are sufficient to query and record the four common flags ZF CF OF SF, which in turn are sufficient to simulate the condition flags of a 68040. In fact, earlier versions of my Gemulator and SoftMac products did use JZ and JS instructions to record the Zero and Sign flags, and JC and JO to record the Carry and Overflow flags.

The 32-bit x86 instruction set IA32 also contains an LAHF ("Load AH With Flags") instruction, which is a little weird. It copies five of the six arithmetic flags to the AH register, which occupies bits 15..8 of the larger EAX register. I have no clue why the LAHF instruction exists in the form it does, since it is rather awkward and does not fully do the job, and AMD unfortunately removed this instruction from the 64-bit AMD64 instruction set. That was a terrible mistake, as LAHF is actually a handy instruction in cases where the state of the Overflow flag is known by other means. For example, common instructions such as AND and XOR are defined to clear both the Carry and Overflow flags. So for those instructions, which thankfully are quite common, one does not need the full brute force of a PUSHF instruction, and can rely on LAHF. Recent versions of Gemulator and SoftMac do in fact use LAHF to record the Zero, Sign, and Carry flags in cases where Overflow is known to be zero.
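
Again as a minimal sketch under the same assumptions (32-bit x86 build, GCC-style inline assembly; this is my illustration, not the actual Gemulator or SoftMac code), the LAHF approach for an operation like XOR looks something like this:

#include <stdint.h>

/* XOR clears CF and OF by definition, so LAHF alone recovers everything
   of interest: it copies SF ZF AF PF CF into AH. (AF is undefined after
   a logical operation, so its captured value is meaningless but harmless.) */
static inline uint32_t xor32_capture_lahf(uint32_t a, uint32_t b, uint8_t *lahf_bits)
{
    uint32_t res = a;
    uint8_t flags;
    __asm__ __volatile__(
        "xorl   %2, %0\n\t"     /* the guest XOR, executed natively  */
        "lahf\n\t"              /* SF ZF AF PF CF -> AH              */
        "movb   %%ah, %1"       /* move AH out to another byte register */
        : "+r"(res), "=q"(flags)
        : "r"(b)
        : "eax", "cc");
    *lahf_bits = flags;
    return res;
}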

What I am describing here is a technique called "lazy flags". In most direct execution based virtual machines, the host flags are kept in sync with the guest flags state. There is no need to use PUSHF or LAHF or other schemes, as the state of the host CF ZF OF SF AF and PF always reflects the state of the code being simulated. In lazy schemes, that flags state is instead stored away somewhere for later retrieval, whether in a register or in memory. When a particular arithmetic condition flag needs to be known, that register or memory location is tested in some way in order to recreate the necessary flag.

In the case of Bochs 2.3.5, there are two lazy flags schemes in use. The first one I already showed you last week, for when the host CPU is also of the x86 architecture. If you look at the Bochs source file cpu\hostasm.h, you will see that there are inline assembly versions of all of the x86 arithmetic operations, with descriptive inline function names such as asmShr16, asmTest32, asmXor8, etc. Each of these functions performs the arithmetic operation, stores away the result of the operation, executes PUSHF and POP to read the host CPU's EFLAGS register, and stores that flags state away in memory. I don't like this particular scheme, because it introduces very CPU-specific inline ASM code into what is otherwise a very portable emulator, and because the PUSHF/POP code sequence is really not that efficient.

The second lazy flags mechanism that the Bochs 2.3.5 source code uses is almost the opposite of the first: instead of using PUSHF to capture the state of the flags as soon as possible, the evaluation of those flags is deferred as long as possible. What this portable C++ based method does is store away 5 values for each arithmetic operation - the two input values, the result, the size of the operation (whether it be 8-bit, 16-bit, 32-bit, or 64-bit), and an integer which specifies which operation was performed. If you look at the Bochs 2.3.5 source files cpu\cpu.h, cpu\lazy_flags.h, and cpu\lazy_flags.cpp, you will see this mechanism at work.

Let's look at a typical ADD operation and how it works using this very lazy flags algorithm. We'll look at the code for a common 32-bit register-to-register ADD operation. The code for that is found in the Bochs 2.3.5 source file cpu\arith32.cpp, and I show here an edited down version of the code:

void BX_CPU_C::ADD_GdEGd(bxInstruction_c *i)
{
  Bit32u op1_32, op2_32, sum_32;
  unsigned nnn = i->nnn();

  op1_32 = BX_READ_32BIT_REG(nnn);        // destination register
  op2_32 = BX_READ_32BIT_REG(i->rm());    // source register
  sum_32 = op1_32 + op2_32;

  SET_FLAGS_OSZAPC_32(op1_32, op2_32, sum_32, BX_INSTR_ADD32);

  BX_WRITE_32BIT_REGZ(nnn, sum_32);
}

So what does this mysterious "SET_FLAGS_OSZAPC_32" macro do? If you look back at cpu\cpu.h, the various SET_FLAGS* macros boil down to these five lines of code:

oszapc.op1##size    = lf_op1;
oszapc.op2##size    = lf_op2;
oszapc.result##size = lf_result;
oszapc.instr        = ins;
lf_flags_status     = BX_LF_MASK_OSZAPC;

There is a data structure oszapc which holds the two input operands, the result, and an enumeration which specifies the x86 instruction which was just simulated. An lf_flags_status variable holds a bitmask which corresponds to the flag bits which still need to be lazily evaluated. This scheme is thus very lazy in that this first portion never actually calculates the values of the flags. It simply stores 5 values to memory for later use.
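
Pieced together from the macro above, the saved state looks roughly like this. This is my own simplified sketch of the 32-bit case, not the actual Bochs declaration; the ##size token-pasting implies sibling fields for the 8-bit, 16-bit, and 64-bit sizes as well:

struct lazy_flags_state {
    Bit32u   op1_32;      // first input operand
    Bit32u   op2_32;      // second input operand
    Bit32u   result_32;   // the truncated result
    unsigned instr;       // which operation, e.g. BX_INSTR_ADD32
};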

When the flags actually need to be evaluated, such as when simulating an x86 conditional branch instruction, there are 6 lazy evaluation functions in cpu\lazy_flags.cpp, each corresponding to one of the arithmetic condition flags. The stripped down code for evaluating the state of the Zero flag for the above 32-bit ADD instruction is this:

bx_bool BX_CPU_C::get_ZFLazy(void)
{
  unsigned zf;

  zf = (oszapc.result_32 == 0);
  lf_flags_status &= 0xff0fff;   // mark ZF as no longer pending lazy evaluation

  eflags.val32 &= ~0x40;
  eflags.val32 |= zf<<6;         // zf is always exactly 0 or 1

  return(zf);
}

What this code does is evaluate ZF, and then also propagate that state of ZF into the eflags state variable so as to avoid needing to lazily re-evaluate ZF again later. Code that actually needs ZF does not call get_ZFLazy directly; instead it calls an inline method called get_ZF which expands out to this code:

bx_bool get_ZF(void)
{
  if ((lf_flags_status & 0x00f000) == 0)
    return (eflags.val32 & 0x40);   // ZF is already valid in eflags
  else
    return get_ZFLazy();            // recompute ZF from the saved lazy state
}

So far so good; this looks fairly clean and efficient. The majority of x86 instructions update the arithmetic flags, and so they simply need to do the 5 stores to the temporary state in oszapc and lf_flags_status. A smaller percentage of x86 instructions (such as conditional branches) read the arithmetic flags, and thus call one of the get_*F() functions, which in turn may end up calling one of the get_*FLazy() functions.
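
The Zero flag is the easy case, since it depends only on the saved result. To see why the input operands are saved as well, consider how the Carry and Overflow flags of that same 32-bit ADD can be recovered lazily. The following is a sketch of the standard flag definitions using the Bochs typedefs, not the literal code from cpu\lazy_flags.cpp:

// Carry out of an unsigned 32-bit ADD: the truncated sum wrapped around,
// which happens exactly when the result is smaller than either operand.
// (op2 is not needed for ADD's carry, but other operations do need it.)
bx_bool get_CFLazy_add32(Bit32u op1, Bit32u op2, Bit32u result)
{
  return (result < op1);
}

// Signed overflow of a 32-bit ADD: both operands had the same sign and
// the result's sign differs, detectable from the top bit of this expression.
bx_bool get_OFLazy_add32(Bit32u op1, Bit32u op2, Bit32u result)
{
  return ((op1 ^ result) & (op2 ^ result)) >> 31;
}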

This appears to be efficient, and is certainly very portable to any non-x86 architecture that has a C++ compiler, but Table 11-1 shows it to be about 30% slower than the Bochs inline asm technique. As I found out, part of this slowdown is actually due to compiler differences between GCC (which was used to build the release version of Bochs 2.3.5) and the VC7.1 C++ compiler in Visual Studio 2003 which I used.