Compiling to Java

In my advanced compilers class, we cover a range of intermediate and target languages. C is a popular target language for performance reasons, but Java has often-overlooked advantages. Some of Java's less-utilized features conspire to make it an easy target for high-level language constructs, e.g., lexically scoped closures become anonymous classes. It takes roughly a third to half the time (and code) to write a compiler that targets Java instead of C. In fact, we can compile the essential core of Scheme to Java with purely local transforms. (Scheme's macro system can easily desugar full Scheme into core Scheme.) As a result, the reference compiler at the bottom of this page is barely over 400 lines of highly-commented code. (The companion compiler into C is just over 1000 lines of code.)

Helpful resources:

The official Java Language Specification covers the nooks and crannies of the language which, while rarely or never used in ordinary programming, become extremely convenient during compilation.


Pros and cons

On top of these implementation benefits, targeting Java gives the language access to the Java library system, the portability of the JVM, just-in-time compilation/optimization and garbage collection. The lack of tail-call optimization in Java is a downside of the direct compilation strategy, but a less direct compilation into Java can recover proper tail calls through trampolining.
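A trampolined sketch of that less direct strategy, with hypothetical Thunk/trampoline names of my own choosing: a compiled tail call returns a thunk instead of calling, and a driver loop bounces until a final value appears, so the Java stack never grows.

```java
// Hypothetical names: Thunk and trampoline are illustrative, not from the post.
interface Thunk {
    Object run();  // returns either another Thunk or a final value
}

class Trampoline {
    // Drives a chain of thunks until a non-thunk value appears.
    static Object trampoline(Object o) {
        while (o instanceof Thunk) {
            o = ((Thunk) o).run();
        }
        return o;
    }

    // A tail-recursive countdown compiled to thunks: each "call" returns
    // a Thunk instead of growing the Java stack.
    static Object countdown(final long n) {
        if (n == 0) return "done";
        return new Thunk() {
            public Object run() { return countdown(n - 1); }
        };
    }

    public static void main(String[] args) {
        // Deep enough to overflow the stack under direct recursion.
        System.out.println(trampoline(countdown(1_000_000)));  // prints done
    }
}
```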

More broadly, in my class, I teach that language implementation is about trade-offs between implementation complexity and performance:

I argue that first, you should try an interpreter for your language. If that's not fast enough, try an SICP-style optimizing interpreter. If that's not good enough, try compiling to Java. If that's still not fast enough, try compiling to C. If that's still too slow, try continuation-passing-style compilation to assembly. If you need more speed, start doing basic compiler optimizations. If you're still out of luck, start doing static-analysis-driven compiler optimizations.

Each time you step from technique n to technique n + 1 on this ladder, performance goes up, but implementation code size and complexity roughly double. It turns out that Java occupies a sweet spot: relatively low implementation complexity, but the biggest percentage-wise gain in performance.

Core Scheme with Sugar

The compiler I created is for a core Scheme, but it would not be hard to apply these same techniques to compiling a language like Python or Ruby:

 <exp> ::= <const>
        |  <prim>
        |  <var>
        |  (lambda (<var> ...) <exp>)
        |  (if <exp> <exp> <exp>)
        |  (set! <var> <exp>)
        |  (let ((<var> <exp>) ...) <exp>)
        |  (letrec ((<var> (lambda (<var> ...) <exp>))) <exp>)
        |  (begin <exp> ...)
        |  (<exp> <exp> ...)

 <const> ::= <int>

A Scheme compiler could easily use macros to desugar full Scheme into this language, or in fact, an even simpler one. Many real Scheme compilers do exactly that.

Encoding Scheme values in Java

The first task in compiling to Java is to come up with a Java-level encoding of the corresponding Scheme values. To do that, I created the Java interface Value, and had all Scheme values inherit from it. The sub-classes and -interfaces of Value include VoidValue, BooleanValue, IntValue, ProcValue and Primitive. The Java stub code also provides a RuntimeEnvironment class, which binds all of the top-level primitives like addition, subtraction and I/O to Java names. Compiled programs are supposed to inherit from RuntimeEnvironment.
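A minimal sketch of this encoding; the class bodies below are my own assumptions, since the post names the classes but does not show the stub code:

```java
// Minimal stand-ins for the value hierarchy; the actual stub code is richer.
interface Value {}

class IntValue implements Value {
    final int value;
    IntValue(int value) { this.value = value; }
    public String toString() { return Integer.toString(value); }
}

class BooleanValue implements Value {
    final boolean value;
    BooleanValue(boolean value) { this.value = value; }
}

interface ProcValue extends Value {}

// One interface per arity; ProcValue1 is the one-argument case.
interface ProcValue1 extends ProcValue {
    Value apply(Value arg);
}

class ValueDemo {
    public static void main(String[] args) {
        // Scheme-level 3 becomes a boxed IntValue.
        Value three = new IntValue(3);
        System.out.println(three);  // prints 3
    }
}
```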

The top-level compile function has a typically Schemish dispatching feel:

; java-compile-exp : exp -> string
(define (java-compile-exp exp)
  (cond
    ; core forms:
    ((const? exp)   (java-compile-const exp))
    ((prim? exp)    (java-compile-prim exp))
    ((ref? exp)     (java-compile-ref exp))
    ((lambda? exp)  (java-compile-lambda exp))
    ((if? exp)      (java-compile-if exp))
    ((set!? exp)    (java-compile-set! exp))

    ; syntactic sugar:
    ((let? exp)     (java-compile-exp (let=>lambda exp)))
    ((letrec1? exp) (java-compile-exp (letrec1=>Y exp)))
    ((begin? exp)   (java-compile-exp (begin=>let exp)))

    ; applications:
    ((app? exp)     (java-compile-app exp))))

Integers

Compilation breaks down construct by construct.

Integers compile into IntValue objects, rather than into themselves. For example, Scheme-level 3 compiles to new IntValue(3) in Java.

Primitives

Primitives are looked up in a table and translated into their RuntimeEnvironment names:

(define (java-compile-prim p)
  (cond
    ((eq? '+ p)       "sum")
    ((eq? '- p)       "difference")
    ((eq? '* p)       "product")
    ((eq? '= p)       "numEqual")
    ((eq? 'display p) "display")
    (else (error "unhandled primitive " p))))
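The RuntimeEnvironment side of this table is not shown in the post, so here is a hypothetical sketch of how a primitive like sum might be bound; ProcValue2 and the class bodies are my assumptions:

```java
// Minimal stand-ins; the actual stub code's definitions are not shown in the post.
interface Value {}

class IntValue implements Value {
    final int n;
    IntValue(int n) { this.n = n; }
    public String toString() { return Integer.toString(n); }
}

// Assumed two-argument arity interface.
interface ProcValue2 extends Value { Value apply(Value a, Value b); }

class RuntimeEnvironmentSketch {
    // Hypothetical binding: Scheme + becomes the Java name sum.
    static final ProcValue2 sum = new ProcValue2() {
        public Value apply(Value a, Value b) {
            return new IntValue(((IntValue) a).n + ((IntValue) b).n);
        }
    };

    public static void main(String[] args) {
        // Compiled (+ 3 4) becomes sum.apply(new IntValue(3), new IntValue(4)).
        System.out.println(sum.apply(new IntValue(3), new IntValue(4)));  // prints 7
    }
}
```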

Variable references

Variable references have to be name-mangled, since Scheme identifiers are richer than Java identifiers. Mutable variables (those which are set!'d) are also handled differently: because variables captured by anonymous functions in Java have to be marked final, mutable variables are wrapped in ValueCell objects and prefixed with m_. A pre-compilation code walk finds all of the mutable variables.
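ValueCell is named above but its body is not shown; a one-field box like the following would suffice (the Thunk helper and all bodies here are illustrative assumptions). It shows why boxing works: the reference m_x stays final while its contents change.

```java
// Minimal stand-ins; ValueCell's one-field body is an assumption.
interface Value {}

class IntValue implements Value {
    final int n;
    IntValue(int n) { this.n = n; }
    public String toString() { return Integer.toString(n); }
}

class ValueCell {
    Value value;
    ValueCell(Value initial) { this.value = initial; }
}

class MutableDemo {
    interface Thunk { Value run(); }  // hypothetical helper for the demo

    public static void main(String[] args) {
        // A set!'d variable x is boxed: the final box m_x may cross into an
        // anonymous class even though its contents change.
        final ValueCell m_x = new ValueCell(new IntValue(1));
        Thunk setter = new Thunk() {
            public Value run() {
                m_x.value = new IntValue(42);   // compiled (set! x 42)
                return m_x.value;
            }
        };
        System.out.println(setter.run());  // prints 42
    }
}
```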

Lambda terms

Lambda terms are compiled into anonymous classes. For example, (lambda (v1 ... vN) exp) becomes:

new NullProcValueN () {
  public Value apply (final Value [mangle v1], ..., final Value [mangle vN]) {

    // for each mutable formal vi:
    final ValueCell m_[mangle vi] = new ValueCell([mangle vi]) ;

    return [compile exp] ;
  }
}

There is a NullProcValueN class for each procedure arity. The NullProcValue classes provide default definitions for some of the methods defined in Value.

Here, [mangle v] stands for the mangled name of the variable v, and [compile exp] stands for the compiled text of exp.
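As a concrete, hand-assembled instance of this scheme, (lambda (x) x) applied to 3 might compile as follows; the Value, ProcValue1 and NullProcValue1 stand-ins are minimal assumptions about the stub code:

```java
// Minimal stand-ins for the runtime classes (assumed shapes).
interface Value {}

interface ProcValue1 extends Value {
    Value apply(Value x);
}

abstract class NullProcValue1 implements ProcValue1 {}

class IntValue implements Value {
    final int n;
    IntValue(int n) { this.n = n; }
    public String toString() { return Integer.toString(n); }
}

class LambdaDemo {
    public static void main(String[] args) {
        // Compiled form of (lambda (x) x): an anonymous class.
        Value id = new NullProcValue1() {
            public Value apply(final Value x) { return x; }
        };
        // Compiled form of ((lambda (x) x) 3): cast, then apply.
        System.out.println(((ProcValue1) id).apply(new IntValue(3)));  // prints 3
    }
}
```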

Conditionals

The form (if exp1 exp2 exp3) actually compiles into the ternary operator ?: instead of if () {} else {} :

([compile exp1]) ? ([compile exp2]) : ([compile exp3])

Mutable variables

The construct (set! var exp) relies on the compilation of lambda terms and variable references to wrap var in a ValueCell, so that it compiles to:

VoidValue.Void(m_[mangle var].value = [compile exp])

Variable binding

The let construct desugars into the application of a lambda term. That is, (let ((v e) ...) body) becomes:

((lambda (v ...) body) e ...)

Recursion

Letrec can be desugared into "lets and sets" or the Y combinator. I opted for the Y combinator just to show that it can be done without using side effects. Actually, the compiler generates a new Y combinator on the fly so that it matches the arity of the recursive procedure:

; xargs : nat -> list[symbol]
(define (xargs n)
  (if (<= n 0)
      '()
      (cons (string->symbol (string-append "x" (number->string n)))
            (xargs (- n 1)))))

; Yn generates the Y combinator for n-arity procedures.
(define (Yn n)
  `((lambda (h) (lambda (F)
      (F (lambda (,@(xargs n)) (((h h) F) ,@(xargs n))))))
    (lambda (h) (lambda (F)
      (F (lambda (,@(xargs n)) (((h h) F) ,@(xargs n))))))))

In Java, the Y combinator for one-argument procedures ends up as:

((ProcValue1)(new NullProcValue1 () {
  public Value apply(final Value h) {
    return new NullProcValue1 () {
      public Value apply(final Value F) {
        return ((ProcValue1)(F)).apply(new NullProcValue1 () {
          public Value apply(final Value x) {
            return ((ProcValue1)(((ProcValue1)(((ProcValue1)(h)).apply(h))).apply(F))).apply(x) ;
          }}) ;
      }} ;
  }})).apply(new NullProcValue1 () {
  public Value apply(final Value h) {
    return new NullProcValue1 () {
      public Value apply(final Value F) {
        return ((ProcValue1)(F)).apply(new NullProcValue1 () {
          public Value apply(final Value x) {
            return ((ProcValue1)(((ProcValue1)(((ProcValue1)(h)).apply(h))).apply(F))).apply(x) ;
          }}) ;
      }} ;
  }})
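To see that the generated combinator really ties the recursive knot, here is a runnable sketch that applies it to a factorial functional; the Value/ProcValue1/NullProcValue1 stand-ins are minimal assumptions, not the actual stub code:

```java
// Minimal stand-ins for the runtime classes (assumed shapes).
interface Value {}
interface ProcValue1 extends Value { Value apply(Value x); }
abstract class NullProcValue1 implements ProcValue1 {}

class IntValue implements Value {
    final int n;
    IntValue(int n) { this.n = n; }
    public String toString() { return Integer.toString(n); }
}

class YDemo {
    // The generated one-argument Y combinator, in the shape shown above.
    static final Value Y1 = ((ProcValue1)(new NullProcValue1() {
        public Value apply(final Value h) {
            return new NullProcValue1() {
                public Value apply(final Value F) {
                    return ((ProcValue1) F).apply(new NullProcValue1() {
                        public Value apply(final Value x) {
                            return ((ProcValue1)(((ProcValue1)(((ProcValue1) h).apply(h))).apply(F))).apply(x);
                        }});
                }};
        }})).apply(new NullProcValue1() {
        public Value apply(final Value h) {
            return new NullProcValue1() {
                public Value apply(final Value F) {
                    return ((ProcValue1) F).apply(new NullProcValue1() {
                        public Value apply(final Value x) {
                            return ((ProcValue1)(((ProcValue1)(((ProcValue1) h).apply(h))).apply(F))).apply(x);
                        }});
                }};
        }});

    // F = (lambda (self) (lambda (n) (if (= n 0) 1 (* n (self (- n 1)))))),
    // written directly against IntValue.
    static final Value F = new NullProcValue1() {
        public Value apply(final Value self) {
            return new NullProcValue1() {
                public Value apply(final Value n) {
                    int k = ((IntValue) n).n;
                    if (k == 0) return new IntValue(1);
                    int rec = ((IntValue)((ProcValue1) self).apply(new IntValue(k - 1))).n;
                    return new IntValue(k * rec);
                }};
        }};

    static final Value fact = ((ProcValue1) Y1).apply(F);

    public static void main(String[] args) {
        System.out.println(((ProcValue1) fact).apply(new IntValue(5)));  // prints 120
    }
}
```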

Sequencing

Sequencing statements are desugared into let-bindings of unused variables. That is, (begin e1 ... eN) becomes:

(let ((_ e1)) (begin e2 ... eN))

Code

External resources