LuaJIT is a high-performance virtual machine for Lua. I’ve recently started playing with it and this blog post talks about what I’ve discovered so far.

LuaJIT is a tracing JIT as opposed to a method JIT — it works not by discovering and optimizing hot methods as a whole but by discovering and optimizing hot traces or paths of execution. LuaJIT has a rather interesting interpreter as well, but I won’t talk about it here except in context of the JIT.

The meat of the interpreter lies in the vm_<arch>.dasc files, which contain a specification of the interpreter (and other assembly) stubs in an assembly-like language. So, for instance, vm_x86.dasc contains a function like:

// Generate assembly code for a specific Lua bytecode. static void build_ins(BuildCtx *ctx, BCOp op, int defop) { switch (op) { case BC_HHGTTG: // hypothetical bytecode | mov eax, 42 // hypothetical implementation break; } }

vm_x86.dasc is first lowered by dynasm.lua into a C program. The lines beginning with | s are mapped to calls to dasm_put in a way that the C functions produced now emit these assembly opcodes. This new C program is then linked with buildvm.c and run at compile time to generate the required assembly stubs. For instance, calling build_ins with BC_HHGTTG (after lowering) will emit the assembly opcodes for moving 42 to eax (whatever the semantic implication might be). This assembly is emitted into lj_vm.s and linked in with the rest of the Lua JIT.

Detecting hot traces

Now, with that bit of detail out of our way, let us look at how Lua actually picks up the traces it wishes to compile. LuaJIT maintains the hotness of relevant instructions (jumps, calls) in a hashtable but a catch — it doesn’t resolve collisions. Since hotness is a heuristic and not dense (i.e. it is unlikely that all jumps in a program will be hot) this approach works quite well in practice. The following macro, used to access a the hotness counter for a particular instruction from C, should make things clear:

// gg is the global state and pc points to the bytecode // instruction whose hotcount we want. #define hotcount_get(gg, pc) (gg)->hotcount[(u32ptr(pc)>>2) & (HOTCOUNT_SIZE-1)]

The interpreter updates and checks the hotness counters when executing a jump or a call. This is done using the hotloop and hotcall macros in vm_<arch>.dasc (we’ll only be looking at the x86 architecture so <arch> is x86 ):

|.macro hotloop, reg | mov reg, PC | shr reg, 1 | and reg, HOTCOUNT_PCMASK | sub word [DISPATCH+reg+GG_DISP2HOT], HOTCOUNT_LOOP | jb ->vm_hotloop |.endmacro

and later, in various places::

case BC_FORL: |.if JIT | hotloop RB |.endif

dynasm lowers the macro invocations in a straightforward way — BC_FORL subtracts HOTCOUNT_LOOP ticks from the corresponding counter and then jumps to vm_hotloop once the counter under-flows:

|->vm_hotloop: // Hot loop counter underflow. |.if JIT | mov LFUNC:RB, [BASE-8] // Same as curr_topL(L). | mov RB, LFUNC:RB->pc | movzx RD, byte [RB+PC2PROTO(framesize)] | lea RD, [BASE+RD*8] | mov L:RB, SAVE_L | mov L:RB->base, BASE | mov L:RB->top, RD | mov FCARG2, PC | lea FCARG1, [DISPATCH+GG_DISP2J] | mov aword [DISPATCH+DISPATCH_J(L)], L:RBa | mov SAVE_PC, PC | call extern lj_trace_hot@8 // (jit_State *J, const BCIns *pc) | jmp <3 |.endif

dynasm can be made aware of C struct s and offsets of their fields using a .type directive. In the body of vm_hotloop , LFUNC:RB tells dynasm to treat the value in RB as an LFUNC . RB and LFUNC are defined earlier:

|.define RB, ebp // ... |.type LFUNC, GCfuncL

Of course, treating RB as an LFUNC in LFUNC:RB doesn’t do anything, it is essentially documentation. In the next instruction ( mov RB, LFUNC:RB->pc ), however, this annotation allows us to say LFUNC:RB->pc and have dynasm automatically figure out the correct offset. Let’s step through vm_hotloop , noting that BASE points to the top of Lua’s evaluation stack, RB and RD are registers, FCARG1 and FCARG2 are the registers that are used as the first two arguments in the calling convention used when transitioning from Lua to C and that SAVE_L and SAVE_PC are stack slots.

First, we pop the GCfuncL off the top of the stack and read pc , which points to the beginning of the bytecode for the closure:

| mov LFUNC:RB, [BASE-8] | mov RB, LFUNC:RB->pc

LuaJIT follows a common convention of storing function literals or prototypes separately from function closures. This allows the VM to save space by sharing the byte-code between two closures of the same function prototype. In V8, for instance, you have SharedFunctionInfo s holding the information specific to a function literal and Function s representing actual, executable, closures.

In LuaJIT, function literals are represented using GCproto . In memory, a GCproto object is followed by the bytecode for the function literal (something shared by all closures, represented by GCfuncL s). Thus, given a GCfuncL , we can extract the corresponding GCproto by subtracting sizeof(GCproto) from the pointer to the beginning of the bytecode array. PC2PROTO uses this technique to access the framesize field in the GCproto structure and uses it to compute the first free slot in the stack:

| movzx RD, byte [RB+PC2PROTO(framesize)] | lea RD, [BASE+RD*8]

Then we fill up the fields in lua_State (defined in lj_obj.h ):

| mov L:RB, SAVE_L | mov L:RB->base, BASE | mov L:RB->top, RD

set the argument registers, save few things in stack slots and call into lj_trace_hot :

| mov FCARG2, PC | lea FCARG1, [DISPATCH+GG_DISP2J] | mov aword [DISPATCH+DISPATCH_J(L)], L:RBa | mov SAVE_PC, PC | call extern lj_trace_hot@8 | jmp <3

which puts LuaJIT into recording mode.

Recording Traces

A trace is the linear sequence of bytecodes encountered when executing a particular code path. Traces are recorded (the recording begins when lj_trace_hot is called) by coordinating with the interpreter. The interpreter uses a vector of pointers to assembly stubs, indexed by the bytecode instructions they implement — in principle, interpreting a Lua program proceeds by decoding the bytecodes one by one and jumping to the corresponding assembly stubs. Traces are recorded by making each element in the dispatch vector point to lj_vm_record ( dynasm adds the lj_ prefix, in vm_x86.dasc , the symbol is just vm_record ). A simplified vm_record looks like this:

|->vm_record: | mov L:RB, SAVE_L | mov L:RB->base, BASE | mov FCARG2, PC | mov FCARG1, L:RB | call extern lj_dispatch_ins@8 | mov BASE, L:RB->base | movzx RA, PC_RA | movzx OP, PC_OP | movzx RD, PC_RD | jmp aword [DISPATCH+OP*8+GG_DISP2STATIC] |.endif

We see that it basically boils down to a call to lj_dispatch_ins followed by a jump to whatever [DISPATCH+OP*8+GG_DISP2STATIC] points to. LuaJIT saves a backup copy of the dispatch vectors at a distance of GG_DISP2STATIC from the original dispatch vector (which now only contains pointers to lj_vm_record ), and lj_vm_record uses this backup vector to jump to the real assembly stub.

Let’s get off the armchair now and look at some actual code and understand some subtleties.

Once a loop or a call is measured to be hot, the interpreter calls lj_trace_hot which is the entry point into recording, compiling and installing a trace. lj_trace_hot sets state of the central jit_State object to LJ_TRACE_START and hands off control to lj_trace_ins .

The tracing subsystem of LuaJIT has five states: START , RECORD , END , ASM and ERR . lj_trace_ins is a finite automaton which moves between these states based on the bytecode instructions read off the execution trace. The overall scheme is simple:

START -------------------> RECORD ---> ERR set up dispatch | vector | [IDLE] v / END -----> ASM

Of course, this doesn’t happen all at once — the state is remembered and transitions are made (or not) as calls to lj_dispatch_ins uncover more and more of the currently executing trace. The bytecode stream the tracer sees is translated into an intermediate SSA representation which is then optimized and compiled into assembly.

Tracing gives us only one linear path through the code, disregarding other, equally valid paths that may have been taken. A tracing JIT usually deals with this by installing guards which check assumptions and bail out from the trace (to the interpreter, in LuaJIT’s case) on violated assumptions. For instance, consider this simple Lua program:

total = 0 function test() if total < 500 then total = total + 1 end end for i=1,501 do test() end

This compiles to the following bytecode (stripping out some parts non-essential to our understanding):

;; function test() 0001 GGET 0 0 0002 KSHORT 1 500 0003 ISGE 0 1 0004 JMP 0 => 0008 0005 GGET 0 0 0006 ADDVN 0 0 0 0007 GSET 0 0 0008 => RET0 0 1 ;; body (instructions 0001 to 0007 removed) 0008 FORI 0 => 0012 0009 => GGET 4 2 ; "test" 0010 CALL 4 1 1 0011 FORL 0 => 0009 0012 => RET0 0 1

(You can get luajit to emit bytecode in this form using -b -l and have it dump all the trace information using -jdump .)

Nothing unexpected here — you have a for loop (delimited by FORI and FORL ) running 501 times calling test in each iteration. LuaJIT extracts the following (again, trimmed down) trace (in SSA form):

;; Instructions 0001 to 0013 removed 0014 > p32 HREFK 0013 "total" @11 0015 > num HLOAD 0014 0016 > num LT 0015 +500 0017 + num ADD 0015 +1 0018 num HSTORE 0014 0017 0019 + int ADD 0001 +1 0020 > int LE 0019 +501 0021 ------ LOOP ------------ 0022 > num LT 0017 +500 0023 + num ADD 0017 +1 0024 num HSTORE 0014 0023 0025 + int ADD 0019 +1 0026 > int LE 0025 +501 0027 int PHI 0019 0025 0028 num PHI 0017 0023

The instructions following LOOP are unsurprising: a normal loop with the usual SSA phi nodes. Note that (as is usual in tracing JITs) a version of test has been inlined completely into a tight loop. Instruction 0016 is rather interesting; the > means that a particular instruction is a guard and that the trace bails out to the interpreter if the condition doesn’t hold. In this case, we bail out if 0015 (which is the value of total , as you can make out from instructions 0014 and 0015 ) is greater or equal to 500 . This is interesting because the tracing compiler doesn’t try to be a smartass and infer to not do anything when total is >= 500 , which also is correct behavior. All it knows is that when total is < 500 , the correct behaviour is to add 1 to total , because that is what it has observed in the trace. Notably, if the total >= 500 path gets taken often enough, it would be marked hot and be compiled into a side-trace (you should try reproducing this).

lj_record_ins in lj_record.c records each bytecode instruction into SSA before it is executed. One architectural subtlety is knowing which guard to emit — given a condition IF A < B THEN {0} ELSE {1} END do you emit a guard for A < B or for NOT (A < B) ? We get that information only after the condition is evaluated, if A < B is true during the trace recording phase, then we need a guard asserting (A < B) and vice versa. Instead of implementing a partial interpreter in the tracer, LuaJIT does this by postponing the insertion of the guard instructions till before lj_record_ins is called with the next bytecode. It knows the result of evaluating the condition by then.

Another subtlety involves tracing across already-compiled traces. LuaJIT keep track of compiled traces by replacing the normal loop or call bytecodes with special marker bytecodes. These special bytecodes signal the existence and provide the location of a compiled trace beginning at that point. However, when tracing, we’d like to be able to see the function bytecodes. This is done by temporarily patching the special JIT marker bytecode ( BC_JFUNCF or BC_JLOOP , for instance) with the original bytecode, tracing through the original bytecode and patching back the JIT marker bytecode once the tracing is done. To see this happening, have a look at rec_func_jit (in lj_record.c ) and trace_pendpatch in lj_trace.c .

Snapshots

Any VM that involves transitioning between two (or more!) modes of execution faces the problem of converting between unoptimized and optimized stack layouts. This is true of LuaJIT too, with a slightly different meaning of stack. We’d like observational equivalence between a interpreted trace and the same trace compiled. We’d like the compiled trace to have the same side effects and map the operand stack the same way as the interpreted trace. However, this equivalence need not hold inside the compiled trace, we only need to make sure that this holds in the trace exits (guarded or not).

LuaJIT solves this by maintaining a mapping from stack location to SSA IR instructions. Such mappings are called snapshots in the LuaJIT codebase. Using a snapshot, it can then reconstruct the operand stack just like how it would have been had the trace been interpreted ( snap_restoreval and ln_snap_restore in lj_snap.c ).

This is slow, but quoting Mike Pall: "State restoration using this data-driven approach is slow of course. But repeatedly taken side exits quickly trigger the generation of side traces. The snapshot is used to initialize the IR of the side trace with the necessary state using pseudo-loads. These can be optimized together with the remainder of the side trace. The pseudo-loads are unified with the machine state of the parent trace by the backend to enable zero-cost linking to side traces."