Subject : Re: [ANN] llvm-lua 1.0

: From : Mike Pall <mikelu-0906@...>

: Mike Pall <mikelu-0906@...> Date: Wed, 3 Jun 2009 05:26:12 +0200

Robert G. Jakabosky wrote: > One way llvm-lua could improve the speed of static compiled code would be to > add support for recording hints during runtime and writting those hints to a > file that would then be passed to the static compiler. Sure, feedback-directed optimization (FDO) would probably help a lot. But static compilers face some inherent difficulties when compiling a dynamic language: - Dynamic languages generate tons of code paths, even for simple operations. A static compiler has to compile *all* code paths. Apart from the overhead, this pessimizes the fast code paths, too (esp. PRE and register allocation). A JIT compiler only generates code for paths that are actually executed (this may also differ between runs). - Even with FDO, static compilers have to be very conservative about inlining and specialization to limit code growth. A JIT compiler can be more aggressive, because it has to deal with a much lower number of code paths. - Not even the best C compilers are able to hoist complex control structures out of loops. Once you lower (say) a load from a hash table (a loop following the hash chain) to the LLVM IR, you lose the original semantics. This usually means you cannot optimize it or eliminate it, either. That's not to discourage you from your work -- just to point out some difficulties you'll be facing with llvm-lua. The following trivial Lua function is a nice optimization challenge: function f(x, n) for i=1,n do x = math.abs(x) end return x end The inner loop has two hash table loads (_G["math"], math["abs"]), a function dispatch and a call to a standard library function. Here's the corresponding checklist for a Lua compiler in increasing order of complexity: LJ1 LJ2 yes yes Type-specializes the hash table loads (object and key type). yes* yes Inlines both hash table loads (* partially inlined by LJ1). no yes Specializes the loads to the hash slot (omit chain loop). yes yes Specializes to a monomorphic function dispatch for abs(). yes yes Inlines the abs() function. no yes Hoists both hash table loads out of the loop. no yes Hoists the abs() out of the loop (abs(abs(x)) ==> abs(x)). no yes Emits an empty loop (could be optimized away, too). Now try to get 8x yes with a static compiler, while preserving the full Lua semantics (e.g. somebody might do math.abs = math.sin or _G might have a metatable etc.). Good luck! :-) > Below are the results of running scimark-2008-01-22.lua [1] for a performance > comparison. > > # taskset -c 1 /usr/bin/lua scimark-2008-01-22.lua large The large dataset is an out-of-cache problem and cache thrashing (esp. for FFT) dominates the timings. IMHO the small dataset is better suited to compare the performance of the different VMs. Anyway, for the large dataset LuaJIT 1.1.x gets 89.80 and Lua 5.1.4 gets 13.87 on my box. And here's a preliminary result for LJ2: FFT 85.91 [1048576] SOR 821.95 [1000] MC 73.14 SPARSE 324.05 [100000, 1000000] LU 820.51 [1000] SciMark 425.11 [large problem sizes] (i.e. 31x faster than plain Lua) For comparison, GCC 4.3.2 -m32 -march=core2 -O3 -fomit-frame-pointer: FFT 93.00 [1048576] SOR 883.53 [1000] MC 180.16 SPARSE 716.08 [100000, 1000000] LU 1149.43 [1000] SciMark 604.44 [large problem sizes] (I've reformatted the C program output to match the Lua output.) Ok, so LJ2 is getting closer to C's performance. It's already on par with C for simpler benchmarks (like mandelbrot or spectralnorm). But there still remains a lot to do ... --Mike