guile 3 update, september edition

From: Andy Wingo
Subject: guile 3 update, september edition
Date: Mon, 17 Sep 2018 10:25:34 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)

Hi! This is an update on progress towards Guile 3.

In our last update, we saw the first bits of generated code:

  https://lists.gnu.org/archive/html/guile-devel/2018-08/msg00005.html

Since then, the JIT is now feature-complete. It can JIT-compile *all* code in Guile, including delimited continuations, dynamic-wind, all that. It runs automatically, in response to a function being called a lot, and it can also tier up from within hot loops.

The threshold at which Guile will automatically JIT-compile is set from the GUILE_JIT_THRESHOLD environment variable. By default it is 50000. If you set it to -1, you disable the JIT; if you set it to 0, *all* code will be JIT-compiled. The test suite passes at GUILE_JIT_THRESHOLD=0, indicating that all features in Guile are supported by the JIT. Set the GUILE_JIT_LOG environment variable to 1 or 2 to see JIT progress.

For debugging (single-stepping, tracing, breakpoints), Guile will fall back to the bytecode interpreter (the VM) for the thread that has debugging enabled. Once debugging is no longer enabled (no more hooks active), that thread can return to JIT-compiled code.

Right now the JIT-compiled code exactly replicates what the bytecode interpreter does: the same stack reads and writes, and so on. There is some specialization when a bytecode has immediate operands, of course. However, the choice to do debugging via the bytecode interpreter -- effectively, to always have bytecode around -- will allow machine code (compiled either just-in-time or ahead-of-time) to do register allocation. The JIT will probably do a simple block-local allocation; an AOT compiler is free to do something smarter.

As far as I can tell, with the default setting of GUILE_JIT_THRESHOLD=50000, the JIT does not increase startup latency for any workload, and always increases throughput. More benchmarking is needed, though.
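The "compile when a function is called a lot" mechanism can be sketched roughly as a per-function call counter checked against the threshold. This is an illustrative sketch only, not Guile's actual implementation: the names (`vm_function`, `maybe_tier_up`, `jit_compile`) and the counter-in-the-function-object layout are assumptions.

```c
#include <stddef.h>

/* Hypothetical per-function data: a call counter and a code pointer. */
struct vm_function {
  unsigned long call_count;   /* bumped on each entry via the VM */
  void *mcode;                /* NULL until JIT-compiled */
};

/* Stands in for GUILE_JIT_THRESHOLD: -1 disables, 0 compiles everything. */
static long jit_threshold = 50000;

static void *jit_compile(struct vm_function *f) {
  /* Stand-in for the real compiler: just return a non-NULL "code" pointer. */
  static int compiled_stub;
  (void) f;
  return &compiled_stub;
}

/* Called on function entry by the bytecode interpreter. */
static void maybe_tier_up(struct vm_function *f) {
  if (jit_threshold < 0)
    return;                                       /* JIT disabled */
  if (f->mcode == NULL
      && ++f->call_count > (unsigned long) jit_threshold)
    f->mcode = jit_compile(f);                    /* hot: hand off to JIT */
}
```

Note how a threshold of 0 compiles on the very first call and -1 disables the JIT entirely, matching the environment-variable semantics described above.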
Using GNU Lightning has been useful, but in the long term I don't think it's the library that we need, for a few reasons:

 * When Lightning does a JIT compilation, it builds a graph of operations, does some minor optimizations, and then emits code. But the graph phase takes time and memory. I think we just need a library that emits code directly. That would lower the cost of JIT and allow us to lower the default GUILE_JIT_THRESHOLD.

 * The register-allocation phase in Lightning exists essentially for calls. However we have a very restricted set of calls that we need to do, and we can do the allocation by hand on each architecture. (We don't use CPU call instructions for Scheme function calls because we use the VM stack. We might be able to revise this in the future, but again Lightning is in the way.) Doing it by hand would bring a few benefits:

   * Hand allocation would free up more temporary registers. Right now Lightning reserves all registers used as part of the platform calling convention; they are unavailable to the JIT.

   * Sometimes when Lightning needs a temporary register, it can clobber one that we're using as part of an internal calling convention. I believe this is fixed for x86-64 but I can't be sure for other architectures! See commit 449ef7d9755b553cb0ad2629bca3bc42c5913e88.

   * We need to do our own register allocation; having Lightning also do it is a misfeature.

 * Sometimes we know that we can get better emitted code, but the Lightning abstraction doesn't let us do it. We should allow ourselves to punch through that abstraction. The platform-specific Lightning files basically expose most of the API we need; we could consider incrementally punching through lightning.h to reach those files. Something to think about for the future.

Finally, as far as performance goes -- we're generally somewhere around 80% faster than 2.2. Sometimes more, sometimes less, but always faster, AFAIK.
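To make the "library that just emits code directly" point above concrete: such an emitter is essentially an append-only byte buffer, with no operation graph built first. This is a hypothetical sketch, not a Guile or Lightning API; a real emitter would also grow its buffer and handle labels and relocations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A direct emitter: instruction bytes are appended in order, nothing more. */
struct emitter {
  uint8_t buf[256];
  size_t len;
};

static void emit_u8(struct emitter *e, uint8_t b) {
  assert(e->len < sizeof e->buf);
  e->buf[e->len++] = b;               /* append directly: no graph phase */
}

/* Example: emit x86-64 "mov eax, imm32; ret" -- six bytes, in order. */
static void emit_return_imm(struct emitter *e, uint32_t imm) {
  emit_u8(e, 0xb8);                                    /* mov eax, imm32 */
  for (int i = 0; i < 4; i++)
    emit_u8(e, (uint8_t) ((imm >> (8 * i)) & 0xff));   /* little-endian */
  emit_u8(e, 0xc3);                                    /* ret */
}
```

The cost per emitted instruction is a handful of stores, which is the point: no intermediate graph to allocate, traverse, and free.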
As an example, here's a simple fib.scm:

$ cat /tmp/fib.scm
(define (fib n)
  (if (< n 2)
      1
      (+ (fib (- n 1)) (fib (- n 2)))))

Now let's use eval-in-Scheme to print the 35th Fibonacci number. For Guile 2.2:

$ time /opt/guile-2.2/bin/guile -c \
    '(begin (primitive-load "/tmp/fib.scm") (pk (fib 35)))'

;;; (14930352)

real    0m9.610s
user    0m10.547s
sys     0m0.040s

But with Guile from the lightning branch, we get:

$ time /opt/guile/bin/guile -c \
    '(begin (primitive-load "/tmp/fib.scm") (pk (fib 35)))'

;;; (14930352)

real    0m5.299s
user    0m6.167s
sys     0m0.064s

Meaning that "eval" in Guile 3 is somewhere around 80% faster than in Guile 2.2 -- because "eval" is now JIT-compiled. (Otherwise it's the same program.) This improves bootstrap times, though Guile 3's compiler will generally make more CPS nodes than Guile 2.2 for the same expression, which takes more time and memory, so the gain isn't earth-shattering.

Incidentally, as a comparison, Guile 2.0 (whose "eval" is slower for various reasons) takes 70s real time for the same benchmark. Guile 1.8, whose eval was written in C, takes 4.536 seconds real time. It's still a little faster than Guile 3's eval-in-Scheme, but it's close, and we're catching up :)

I have also tested with ecraven's r7rs-benchmarks, and we make a nice jump past the 2.2 results, but we're not yet at Racket or Chez levels. I think we need to tighten up our emitted code. There's another 2x of performance that we should be able to get with incremental improvements, though for the last bit we will need global register allocation, I think.

I think I'm ready to merge to "master"; the work is currently just in the "lightning" branch. You can disable the JIT by passing --disable-jit to configure. Tests welcome from non-x86-64 architectures.

Happy hacking,

Andy
