I’ve been reading about and playing with VMs for a couple of months now. I’ve quietly created my own Ruby VM in C, running YARV bytecode. It was mostly an excuse to better understand the internals of Ruby, since the VM doesn’t have a parser, emitter or GC yet. I might blog about some of my findings later, but today I want to write about another experiment I’ve been working on.

LLVM is an awesome tool and I’ve been wanting to build something with it for a while. So I set out to create a little toy language, and it turned out to be a lot easier than I thought it would be. Luckily for me, Tom Bagby created Ruby bindings for LLVM, so I could code in Ruby instead of some damn ugly and painful C++ (I hate C++).

Here’s how I created my lil toy compiler called Orange.

The Parser

I used Treetop to create the parser. Writing the Treetop grammar was the most time-consuming task. Also, because Treetop can’t express context-sensitive grammars, you’re limited in the kind of syntax you can provide.

My goal was to have a syntax as close to Ruby as possible, but without OO. Something like this:

    def test()
      x = 1
      y = "ohaie"
      printf("%d, %s", x, y)
    end

Through trial and error I ended up with this: grammar.tt.

The Nodes

The parser takes some text as input and outputs AST nodes, one for each grammar rule matching your input. Parsing a = 1 would produce these nodes:

    Expression
      Assign
        Var (a)
        Lit (1)
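To make the shape of that tree concrete, here is a minimal sketch in plain Ruby. It doesn't use Treetop at all: the `parse_assignment` helper is made up for this illustration, it only understands a single `name = integer` line, and it represents nodes as nested arrays rather than real node objects.

```ruby
# Hypothetical toy parser, NOT the Treetop grammar: it only understands a
# single "name = integer" assignment and returns the tree as nested arrays.
def parse_assignment(source)
  match = source.match(/\A\s*([a-z_]\w*)\s*=\s*(\d+)\s*\z/)
  raise ArgumentError, "cannot parse: #{source.inspect}" unless match

  name  = match[1]
  value = match[2].to_i

  # Same nesting as the tree above: Expression > Assign > Var, Lit
  [:expression, [:assign, [:var, name], [:lit, value]]]
end

p parse_assignment("a = 1")
# prints [:expression, [:assign, [:var, "a"], [:lit, 1]]]
```

A real grammar handles nesting, several statements and error reporting; the point here is only the parent/child layout of the nodes.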

The idea is to generate some machine code from those nodes. If every node knows how to generate code for itself, then generating code for the whole tree is just a matter of generating code for the top-level node, Expression in this case.

Treetop provides a nice mechanism to subclass those nodes.

    rule call
      func:var arglist <Call>
    end

The <Call> part tells Treetop to instantiate the Call class for this type of node.

    class Call < Node
      def codegen(g)
        # ...
      end
    end

Here I added my own codegen method that will do all the magic of generating the machine code.

    class Expression < Node
      def codegen(g)
        statements.map { |s| s.codegen(g) }
      end
    end

The top-level Expression node just calls codegen on each node it contains, and each of those does the same for its own children. This is a very simple way to handle complex trees of nodes.
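Here is a small self-contained sketch of that recursion, without Treetop or LLVM. The `RecordingGenerator` class and the constructors on the nodes are invented for the example; the generator just records what it is asked to do, where the real one would call into LLVM.

```ruby
# Hypothetical generator that records instructions instead of calling LLVM.
class RecordingGenerator
  attr_reader :instructions

  def initialize
    @instructions = []
  end

  def assign(name, value)
    @instructions << [:assign, name, value]
  end
end

class Node; end

class Lit < Node
  def initialize(value)
    @value = value
  end

  # A literal has nothing to emit; it just hands back its value.
  def codegen(_g)
    @value
  end
end

class Assign < Node
  def initialize(name, value)
    @name = name
    @value = value
  end

  # Generate the right-hand side first, then ask the generator to store it.
  def codegen(g)
    g.assign(@name, @value.codegen(g))
  end
end

class Expression < Node
  def initialize(statements)
    @statements = statements
  end

  # The top-level node only delegates to its children.
  def codegen(g)
    @statements.map { |s| s.codegen(g) }
  end
end

g = RecordingGenerator.new
tree = Expression.new([Assign.new("x", Lit.new(1)), Assign.new("y", Lit.new("ohaie"))])
tree.codegen(g)
p g.instructions
# prints [[:assign, "x", 1], [:assign, "y", "ohaie"]]
```

One call on the root is enough: no node needs to know anything about the tree beyond its own children.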

The Code Generator

The only thing missing is the part that does the actual work of generating machine code that can be run.

Since the Orange language is very close to the actual LLVM intermediate language (on purpose, to keep things simple), the generator mostly serves as a way to abstract calls to the LLVM-Ruby API. With a much more advanced language, the generator would have a lot more to take care of.

For example, the Generator will translate a function call to this:

    # Create a new module to hold the code
    # (named mod here because module is a reserved word in Ruby)
    mod = LLVM::Module.new("orange")

    # Create the main function that will be called
    main = mod.get_or_insert_function("main", Type.function(INT, [INT, Type.pointer(PCHAR)]))

    # Create a block of code to build machine code within
    block = main.create_block.builder

    # Find the function we're calling in the module
    func = mod.get_function("myfunc")

    # Call the function
    block.call(func, arg)
    block.return(0.llvm)

This is roughly equivalent to the following C code:

    int main(int argc, char const *argv[])
    {
        myfunc(arg);
        return 0;
    }

This will then generate magic machine code for you.
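For a feel of what the generator ends up describing to LLVM, here is a rough sketch that writes a similar main/myfunc program as textual LLVM IR instead of going through LLVM-Ruby. The `IRTextGenerator` class and its methods are made up for this illustration and are not part of Orange or LLVM-Ruby. It also assumes that `myfunc` takes and returns a 32-bit int, and it gives `main` no arguments to keep things short.

```ruby
# Hypothetical generator: builds LLVM IR as text rather than calling LLVM-Ruby.
class IRTextGenerator
  def initialize
    @declarations = []
    @body = []
    @temp = 0
  end

  # Declare an external function taking `arity` i32 arguments and
  # returning i32 (an assumption made for this sketch).
  def declare(name, arity)
    params = (["i32"] * arity).join(", ")
    @declarations << "declare i32 @#{name}(#{params})"
  end

  # Emit a call with integer arguments; returns the result register name.
  def call(name, *args)
    @temp += 1
    result = "%t#{@temp}"
    arglist = args.map { |a| "i32 #{a}" }.join(", ")
    @body << "  #{result} = call i32 @#{name}(#{arglist})"
    result
  end

  # Emit the return instruction of main.
  def ret(value)
    @body << "  ret i32 #{value}"
  end

  # Assemble declarations and the body of main into one IR listing.
  def to_ir
    lines = @declarations + ["", "define i32 @main() {", "entry:"] + @body + ["}"]
    lines.join("\n") + "\n"
  end
end

g = IRTextGenerator.new
g.declare("myfunc", 1)
g.call("myfunc", 42)
g.ret(0)
puts g.to_ir
```

The LLVM-Ruby calls build the same kind of structure in memory (a module, a function, a block, a call, a return) without ever going through text.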

(If you want more info on how LLVM-Ruby works, go read the blog: http://llvmruby.org/wordpress-llvmruby/)

Putting All the Pieces Together

Now we have all the pieces: the parser takes your text and translates it to nodes, you pass the code generator to the nodes so they can generate machine code, and finally you run that machine code.

    generator = Orange::Generator.new
    parser = OrangeParser.new

    # Parse some code and get a top level node
    node = parser.parse(code)

    # Pass the generator to get the machine code
    node.codegen(generator)

    # Magic!
    generator.run

The cool thing with LLVM is that it can run as a JIT compiler, running stuff on the fly, but it can also compile and optimize code ahead of time, à la GCC.