(with apologies to Drew McDermott)

Although the right people were in the same room at the right time, we did not have time to participate in this year's ICFP Programming Contest. The warm-up task was to implement the Universal Machine (UM), which was then used to run a program containing the real riddles.

However, a few days later, after reading that many participants had abandoned their first VMs (in their languages of choice) for speed reasons in favor of C reimplementations, I spent a train ride (and the subsequent Sunday afternoon) implementing and benchmarking the UM specification in Common Lisp.

With only fourteen instructions, the interpreter loop is a fairly straightforward transcription of the specification:

(loop
  (setf insn (aref program (offset pc))
        opcode (ldb (byte 4 28) insn)
        a (ldb (byte 3 6) insn)
        b (ldb (byte 3 3) insn)
        c (ldb (byte 3 0) insn))
  (incf pc)
  (ecase opcode
    (#x0 (unless (zerop (reg c)) (setf (reg a) (reg b))))
    (#x1 (setf (reg a) (aref (platter-array (reg b)) (offset (reg c)))))
    (#x2 (setf (aref (platter-array (reg a)) (offset (reg b))) (reg c)))
    (#x3 (setf (reg a) (logand #xFFFFFFFF (+ (reg b) (reg c)))))
    (#x4 (setf (reg a) (logand #xFFFFFFFF (* (reg b) (reg c)))))
    (#x5 (setf (reg a) (logand #xFFFFFFFF (truncate (reg b) (reg c)))))
    (#x6 (setf (reg a) (logand #xFFFFFFFF (lognand (reg b) (reg c)))))
    (#x7 (return))
    (#x8 (setf (reg b) (platter-array-allocate (reg c))))
    (#x9 (platter-array-free (reg c)))
    (#xA (write-char (code-char (logand #xff (reg c))))
         (when (= (reg c) #.(char-code #\Newline))
           (finish-output)))
    (#xB (setf (reg c) (handler-case (char-code (read-char))
                         (end-of-file () #xFFFFFFFF))))
    (#xC (when (plusp (reg b))
           (setf program (copy-seq (platter-array (reg b)))))
         (setf pc (reg c)))
    (#xD (setf (reg (ldb (byte 3 25) insn))
               (ldb (byte 25 0) insn)))))

Noteworthy details:

With (logand #xFFFFFFFF ...) we make explicit that we implement 32-bit modular arithmetic as per the UM specification. Helpfully, SBCL knows about modular arithmetic, too, and generates efficient machine code for it.
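As a small illustration of this point (not code from the UM itself), SBCL's modular arithmetic optimization kicks in when the argument and result types are known, so a masked addition compiles down to a plain machine-word add rather than going through bignums:

```lisp
;; Sketch: with 32-bit types declared, SBCL compiles the masked
;; addition to a single 32-bit machine add (the function name and
;; declaim are for illustration only).
(declaim (ftype (function ((unsigned-byte 32) (unsigned-byte 32))
                          (unsigned-byte 32))
                add32))
(defun add32 (b c)
  (declare (optimize speed))
  (logand #xFFFFFFFF (+ b c)))
```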

(ldb (byte size position) integer) is a nice way to write what would require bit-fiddling operations in other languages. Good examples are the decoding of the opcode and registers from instructions (see the top of the loop), or how the byte order is swapped in RUN.
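To give a flavor of the byte-order swapping mentioned above, here is a hypothetical helper in the spirit of RUN (not the actual code): LDB extracts each byte and DPB deposits it at the mirrored position, with no explicit shifting or masking:

```lisp
;; Illustrative sketch: swap the byte order of a 32-bit platter using
;; LDB/DPB. (swap32 #x12345678) => #x78563412
(defun swap32 (x)
  (declare (type (unsigned-byte 32) x))
  (dpb (ldb (byte 8 0) x) (byte 8 24)
       (dpb (ldb (byte 8 8) x) (byte 8 16)
            (dpb (ldb (byte 8 16) x) (byte 8 8)
                 (ldb (byte 8 24) x)))))
```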

The really interesting opcodes are the ones dealing with arrays ("platters"): #x1, #x2, #x8, #x9, and #xC. A few trial runs quickly make it obvious that the given UM programs allocate and free arrays extremely often, many of them small.

The provided SANDmark benchmark and the actual application (the uncompressed codex) place different demands on the VM. Just making SANDmark fast does not necessarily help with, e.g., the UM's adventure game or the QvickBasic compiler.

We are benchmarking GC performance here, as well as code generator quality for tight loops with integer arithmetic.

Trust in Garbage Collection

Frederic Jolliton's UM implementation (also in CL) uses malloc-like memory management for arrays. While I did not go that far, I tried several different schemes (reusing freed arrays instead of allocating new ones, handling small arrays separately, etc.). They all had in common that they did not win a lot over a simple strategy that offloads freeing to CL's garbage collector; often they performed even worse. Unsurprisingly, the implementation also got a lot shorter that way.

I keep a map of UM array indices to CL arrays, and a free list to keep track of reusable indices. The latter is crucial for keeping the size of the index map down: arrays are allocated tens of millions of times, but, e.g., fewer than 50,000 array indices are active at any time in the SANDmark benchmark.
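The scheme described above might be sketched as follows (the variable and function names are made up for illustration and are not necessarily those of the actual implementation):

```lisp
;; Sketch of the allocation scheme: a growable vector maps UM array
;; indices to CL arrays; freed indices go on a free list and are
;; reused before the map is grown. Freeing just drops the reference
;; and lets the GC reclaim the storage.
(defvar *arrays* (make-array 0 :adjustable t :fill-pointer t))
(defvar *free-indices* '())

(defun platter-array-allocate (size)
  (let ((new (make-array size :initial-element 0)))
    (cond (*free-indices*
           ;; Reuse a previously freed slot to keep the map small.
           (let ((index (pop *free-indices*)))
             (setf (aref *arrays* index) new)
             index))
          (t
           (vector-push-extend new *arrays*)
           (1- (fill-pointer *arrays*))))))

(defun platter-array-free (index)
  (setf (aref *arrays* index) nil)  ; the GC does the actual freeing
  (push index *free-indices*))
```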

For a short time I toyed with a copy-on-write strategy for arrays for opcode #xC (Load Program), but there was no real gain, so the idea was abandoned. I did not optimize I/O performance either this time, as it seemed less critical.

I now DECLARE this bazaar open

I was not patient enough to let the UM finish SANDmark without any added declarations. However, just telling SBCL that the parameter to EXECUTE, the program, is actually of type (simple-array platter (*)) improves things a lot: from way too slow we get down to an acceptable 229.0 seconds.

Next, we declare the index map's array of arrays to be of type (simple-array t (*)), which brings us down to 174.3 seconds (shaving off another 54.7 seconds).

We can put the compiler into a more aggressive mode with the declaration (optimize (speed 3)): down to 170.9 seconds. Declaring some variables to be of the (user-defined) type array-index also gets rid of some compiler notes, and we are at 165.5 seconds. By declaring (optimize (safety 0)), we allow the compiler to trust us and omit safety checks at run time (also silencing most of the remaining compiler notes, which were about fixnum checks), finally arriving at 145.2 seconds. A far cry from where we started, and all this with just four more lines of declarations.
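Put together, the declarations might look roughly like this (a sketch under assumptions: the exact placement inside EXECUTE and the definition of the post's array-index type are guesses on my part):

```lisp
;; Illustrative only: the type definitions and their placement are
;; assumed, not taken from the actual source.
(deftype platter () '(unsigned-byte 32))
(deftype array-index () '(mod #.array-dimension-limit))

(defun execute (program)
  (declare (type (simple-array platter (*)) program)
           (optimize (speed 3) (safety 0)))
  ;; ... interpreter loop from above ...
  )
```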

Finishing Touch

We now get quite nice times:

SANDmark complete.
Evaluation took:
  145.229 seconds of real time
  142.4289 seconds of user run time
  0.568036 seconds of system run time
  0 page faults and
  2,651,641,176 bytes consed.

However, we can still do better. If we examine the generated machine code (SBCL 0.9.13) with DISASSEMBLE, we find that the ECASE statement in the instruction decoding loop is translated poorly:

; 249: E905040000    JMP L38
; 24E: L9: 83F804    CMP EAX, 4
; 251: 0F846E030000  JEQ L34
; 257: 83F808        CMP EAX, 8
; 25A: 0F8430030000  JEQ L32
; 260: 83F80C        CMP EAX, 12
; 263: 0F8405030000  JEQ L31
; 269: 83F810        CMP EAX, 16
; 26C: 0F84D8020000  JEQ L30
; 272: 83F814        CMP EAX, 20
; 275: 0F84A3020000  JEQ L29
[...]

GCC uses all kinds of tricks for large switch statements, like tree partitioning and jump tables. With a little macrology, we can partition the case statement similarly:

(defmacro ecase/tree (keyform &body cases)
  (labels ((%case/tree (keyform cases)
             (if (<= (length cases) 4)
                 `(ecase ,keyform ,@cases)
                 (loop for rest-cases on cases
                       repeat (truncate (length cases) 2)
                       collect (first rest-cases) into first-half
                       finally (return
                                 `(if (< ,keyform ,(caar rest-cases))
                                      ,(%case/tree keyform first-half)
                                      ,(%case/tree keyform rest-cases)))))))
    (let (($keyform (gensym "CASE/TREE-")))
      `(let ((,$keyform ,keyform))
         ,(%case/tree $keyform (sort (copy-list cases) #'< :key #'first))))))
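For illustration, here is what ECASE/TREE does to a small case statement (the expansion shown is approximate and the example keys are made up):

```lisp
;; Eight clauses exceed the leaf size of four, so the macro first
;; branches on the midpoint key, then falls back to plain ECASE over
;; each half:
(ecase/tree x
  (0 :a) (1 :b) (2 :c) (3 :d)
  (4 :e) (5 :f) (6 :g) (7 :h))
;; expands (roughly) into:
;; (let ((#:case/tree-1 x))
;;   (if (< #:case/tree-1 4)
;;       (ecase #:case/tree-1 (0 :a) (1 :b) (2 :c) (3 :d))
;;       (ecase #:case/tree-1 (4 :e) (5 :f) (6 :g) (7 :h))))
```

With fourteen opcodes this turns a linear chain of up to fourteen compares into a couple of tree branches followed by a short ECASE.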

Using the ECASE/TREE macro improves the quality of the generated code somewhat. The performance gain on SANDmark is a factor of about 1.2 over the original ECASE version:

SANDmark complete.
Evaluation took:
  120.372 seconds of real time
  117.30733 seconds of user run time
  0.680042 seconds of system run time
  0 page faults and
  2,651,634,776 bytes consed.

Running the same program with CMUCL, we get:

SANDmark complete.
; Evaluation took:
;   131.0 seconds of real time
;   126.16389 seconds of user run time
;   3.84424 seconds of system run time
;   208,528,252,530 CPU cycles
;   [Run times include 17.8 seconds GC run time]
;   0 page faults and
;   2,654,874,728 bytes consed.

On a final note, g++ still generates tighter code for the corresponding switch statement, but we are not that far off.

Comparison

Compared to the (then) fastest C(++) UM implementation I was aware of, this implementation has one third less code, even taking ECASE/TREE into account. With my benchmark setup, Li Peng's implementation runs in 104 seconds:

% time ./peng-um sandmark.umz
[...]
SANDmark complete.
./peng-um sandmark.umz  102.58s user 0.20s system 98% cpu 1:44.75 total

Execution is about a factor of 1.15 faster than my implementation in CL. Not bad, and it again seems to support Gabriel's hypothesis that one can get CL programs to within twenty percent of the performance of their C counterparts.

Also, Ranjit Mathew reported on using array pointers as indices:

"One of the wicked ideas from the mailing list did help however - using the pointer to an array's memory as its index instead of maintaining an "array of arrays", yielded a 20% boost in the performance of my UM implementation (as measured by SANDmark) but reduced its "portability" to only those machines where both integers and pointers are 32 bits."

The array index trick is not easily transferable to CL, at least not portably across different implementations. Then again, it also precludes running the UM on 64-bit architectures, while the CL implementation presented here works fine on 64-bit hardware.

Quo Vadis?

The next step, if UM speed is still not considered sufficient, would probably be to translate UM byte-code directly to Lisp, possibly as a JIT compiler. And indeed, people seem to be going the JIT route. I will save this for another time, though.
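To give a taste of what that translation step might look like (a sketch of the general idea only, not an actual JIT), each platter could be turned into a Lisp form, with a whole program array becoming one big form handed to COMPILE:

```lisp
;; Hypothetical sketch: translate one UM instruction into a Lisp form,
;; assuming the REG accessor from the interpreter above. Only two
;; opcodes are shown; the rest would follow the interpreter clauses.
(defun translate-insn (insn)
  (let ((a (ldb (byte 3 6) insn))
        (b (ldb (byte 3 3) insn))
        (c (ldb (byte 3 0) insn)))
    (ecase (ldb (byte 4 28) insn)
      (#x3 `(setf (reg ,a)
                  (logand #xFFFFFFFF (+ (reg ,b) (reg ,c)))))
      (#x4 `(setf (reg ,a)
                  (logand #xFFFFFFFF (* (reg ,b) (reg ,c)))))
      ;; ... remaining opcodes analogous to the interpreter above ...
      )))
```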

Unless noted otherwise, all times are wall clock times and were measured with (time (run #p"sandmark.umz")) from the REPL within SLIME, with SBCL 0.9.13 on a 1.6 GHz Turion64 laptop with 450MB RAM, running Linux 2.6.15. Additional tests were conducted with SBCL 0.9.15, CMUCL 19a, and LispWorks 4.4 Personal Edition.

If anybody manages to get this code fast on OpenMCL, LispWorks, Allegro CL, etc., drop me a line.



