Making language implementations easier

I've noticed that the greater population of working programmers seems to consider designing and implementing a new programming language a black art that only large corporations or super-gurus know how to do. And really, it has traditionally been the case that an enormous amount of work is involved.

In the past, you had to know how to use scanning and parsing tools like lex and yacc (or you built your own recursive descent parser from scratch). You had to use tree transforms to build an intermediate representation. Then you either had to build a virtual machine that could run the intermediate representation, or you had to compile it down to machine code. And, of course, this isn't even counting things like the various forms of optimization and type validation.

Even if you understand the theory involved very well, building out a language implementation like this takes a ton of work. Fortunately for potential language designers, new tools and technologies have started making things easier.

Step 1: Reusable Virtual Machines

The existence of widespread, reusable VMs like the JVM and the .Net CLR has made life a lot easier for language implementers. Instead of having to design your own VM and implement things like garbage collection, JIT compilation, and virtual method dispatch yourself, you can just use a pre-existing implementation that has the virtues of being well-tested and highly optimized.

The advent of these VMs has led to a great variety of language implementations for them, such as Scala, Groovy, IronPython, JRuby, and F#. But even with a virtual machine, there's still a fair bit of work left for a language implementer to do. You still have to do scanning and parsing, and you still have to generate all that bytecode from your abstract syntax tree.
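For a sense of what generating bytecode by hand looks like on the CLR, here's a minimal sketch that uses System.Reflection.Emit to emit the IL for a function returning its argument plus 2. The class and method names (HandEmittedAdd, BuildAddTwo) are made up for illustration. A real compiler would walk its abstract syntax tree and emit instructions like these for every node.

```csharp
using System;
using System.Reflection.Emit;

public static class HandEmittedAdd
{
    // Hypothetical helper: emits IL equivalent to "int AddTwo(int x) { return x + 2; }".
    public static Func<int, int> BuildAddTwo()
    {
        var method = new DynamicMethod("AddTwo", typeof(int), new[] { typeof(int) });
        ILGenerator il = method.GetILGenerator();

        il.Emit(OpCodes.Ldarg_0);   // push the argument x
        il.Emit(OpCodes.Ldc_I4_2);  // push the constant 2
        il.Emit(OpCodes.Add);       // pop both, push x + 2
        il.Emit(OpCodes.Ret);       // return the value on top of the stack

        return (Func<int, int>)method.CreateDelegate(typeof(Func<int, int>));
    }
}
```

Four instructions for x + 2 is manageable, but the bookkeeping grows quickly once you add locals, branches, and method calls.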

Step 2: New Parsing Technologies

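Of course, if you've been following the parsing world, you know there are a lot of cool technologies that take much of the pain out of writing parsers. Parser combinator libraries such as Parsec let you build a parser by composing small parsing functions directly in your host language, with no separate generator step.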
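To give a flavor of the combinator style, here's a minimal hand-rolled sketch in C#. It is not Parsec or FParsec, and every name in it (Result, Parser, Satisfy, Many, Many1, Then, Map, Evaluate) is made up for illustration. It parses and evaluates sums of integers such as 1+20+300.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A successful parse: the value produced and the position just past it.
public sealed class Result<T>
{
    public readonly T Value;
    public readonly int Next;
    public Result(T value, int next) { Value = value; Next = next; }
}

// A parser reads from input starting at pos; it returns null on failure.
public delegate Result<T> Parser<T>(string input, int pos);

public static class Combinators
{
    // Matches a single character that satisfies the predicate.
    public static Parser<char> Satisfy(Func<char, bool> pred) =>
        (input, pos) => pos < input.Length && pred(input[pos])
            ? new Result<char>(input[pos], pos + 1)
            : null;

    // Runs p zero or more times and collects the values.
    public static Parser<List<T>> Many<T>(Parser<T> p) => (input, pos) =>
    {
        var items = new List<T>();
        for (var r = p(input, pos); r != null && r.Next > pos; r = p(input, pos))
        {
            items.Add(r.Value);
            pos = r.Next;
        }
        return new Result<List<T>>(items, pos);
    };

    // Like Many, but fails unless p matched at least once.
    public static Parser<List<T>> Many1<T>(Parser<T> p) => (input, pos) =>
    {
        var r = Many(p)(input, pos);
        return r.Value.Count > 0 ? r : null;
    };

    // Transforms the value of a successful parse.
    public static Parser<U> Map<T, U>(this Parser<T> p, Func<T, U> f) => (input, pos) =>
    {
        var r = p(input, pos);
        return r == null ? null : new Result<U>(f(r.Value), r.Next);
    };

    // Runs p, then q, and combines the two values.
    public static Parser<V> Then<T, U, V>(this Parser<T> p, Parser<U> q, Func<T, U, V> combine) =>
        (input, pos) =>
        {
            var first = p(input, pos);
            if (first == null) return null;
            var second = q(input, first.Next);
            return second == null
                ? null
                : new Result<V>(combine(first.Value, second.Value), second.Next);
        };

    // Grammar: sum := number ('+' number)*
    static readonly Parser<char> Plus = Satisfy(c => c == '+');
    static readonly Parser<int> Number =
        Many1(Satisfy(char.IsDigit)).Map(ds => int.Parse(new string(ds.ToArray())));
    static readonly Parser<int> SumExpr =
        Number.Then(Many(Plus.Then(Number, (plus, n) => n)),
                    (first, rest) => first + rest.Sum());

    // Returns the evaluated sum, or null unless the whole input parses.
    public static int? Evaluate(string input)
    {
        var r = SumExpr(input, 0);
        return r != null && r.Next == input.Length ? r.Value : (int?)null;
    }
}
```

Note how the definition of SumExpr reads almost like the grammar rule in the comment above it. That closeness between grammar and code is the main appeal of the approach.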

In fact, there are quite a few ways to put together a parser fairly easily these days. I saw several last week at Lang.Net 2008. Roman Ivantsov presented a very cool new toolkit called Irony that makes writing a combined parser/scanner about as easy as writing a BNF grammar. Harry Pierson presented some cool work he'd been doing with Parsing Expression Grammars in F#, and also was nice enough to suggest I go check out FParsec, an F# port of the Haskell Parsec library.

I'm planning on spending some time trying out all of these over the coming months, to see which is the best fit for how I work. I'll try to blog about them as I work through them.

So that simplifies two out of the three hard parts, but what about building bytecode?

Step 3: The DLR is awesome for generating bytecode

The DLR contains a general-purpose system of expression trees (a superset of the LINQ expression trees, if you're curious) that allow you to define not only simple expressions like x + 2, but statements such as variable assignment, function definition, control flow (loops and branches), and so on.

Once you construct a DLR expression tree out of these common constructs, it handles all the work of rendering your code to IL and just-in-time compiling it. In other words, once you've translated your abstract syntax tree into a DLR expression tree, you're done.
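Here's a rough sketch of what building and compiling such a tree looks like. One loud assumption: it uses the System.Linq.Expressions API as it exists in .NET 4 and later, which absorbed the statement-level nodes, rather than the MSAst namespace from the IronPython 2.0 alpha discussed in this post, so the exact factory names in the alpha may differ. The class and method names (ExpressionTreeSketch, BuildAddTwo, BuildFactorial) are made up for illustration.

```csharp
using System;
using System.Linq.Expressions;

public static class ExpressionTreeSketch
{
    // A simple expression: x + 2.
    public static Func<int, int> BuildAddTwo()
    {
        ParameterExpression x = Expression.Parameter(typeof(int), "x");
        Expression body = Expression.Add(x, Expression.Constant(2));
        return Expression.Lambda<Func<int, int>>(body, x).Compile();
    }

    // Statements: a local variable, assignment, a loop, and a branch.
    // Roughly: result = 1; while (n > 1) { result *= n; n--; } return result;
    public static Func<int, int> BuildFactorial()
    {
        ParameterExpression n = Expression.Parameter(typeof(int), "n");
        ParameterExpression result = Expression.Variable(typeof(int), "result");
        LabelTarget done = Expression.Label(typeof(int), "done");

        Expression body = Expression.Block(
            new[] { result },                                   // declare the local
            Expression.Assign(result, Expression.Constant(1)),  // result = 1
            Expression.Loop(
                Expression.IfThenElse(
                    Expression.GreaterThan(n, Expression.Constant(1)),
                    Expression.Block(
                        Expression.MultiplyAssign(result, n),   // result *= n
                        Expression.PostDecrementAssign(n)),     // n--
                    Expression.Break(done, result)),            // exit loop with result
                done));

        // Compile() turns the tree into IL and hands back a callable delegate.
        return Expression.Lambda<Func<int, int>>(body, n).Compile();
    }
}
```

There is no IL anywhere in this code. The tree describes what the program means, and the library decides how to emit it.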

Martin Maly from the DLR team has started publishing a lot of useful information on his blog on how to work with the DLR, but the easiest entry point may be to download the latest IronPython 2.0 alpha (which includes the DLR) and play with the ToyScript sample. IronPython, the DLR, and ToyScript are all released under the open source Microsoft Permissive License, so feel free to reuse anything you see for your own purposes.

To give you some idea of what I'm talking about, here's the ToyScript source for taking a binary operation expression off the parse tree and constructing an equivalent DLR expression tree:

    protected internal override MSAst.Expression Generate(ToyGenerator tg) {
        MSAst.Expression left = _left.Generate(tg);
        MSAst.Expression right = _right.Generate(tg);

        Operators op;
        switch (_op) {
            // Binary
            case Operator.Add: op = Operators.Add; break;
            case Operator.Subtract: op = Operators.Subtract; break;
            case Operator.Multiply: op = Operators.Multiply; break;
            case Operator.Divide: op = Operators.Divide; break;

            // Comparisons
            case Operator.LessThan: op = Operators.LessThan; break;
            case Operator.LessThanOrEqual: op = Operators.LessThanOrEqual; break;
            case Operator.GreaterThan: op = Operators.GreaterThan; break;
            case Operator.GreaterThanOrEqual: op = Operators.GreaterThanOrEqual; break;
            case Operator.Equals: op = Operators.Equals; break;
            case Operator.NotEquals: op = Operators.NotEquals; break;

            default: throw new System.InvalidOperationException();
        }

        return Ast.Action.Operator(op, typeof(object), left, right);
    }

Operator is a ToyScript enumeration of all of the supported operators, while Operators is the DLR enumeration of all supported operators. And, of course, if your language supports some exotic operators that don't exist in the DLR, you can just generate a function call.
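As a sketch of that last point, suppose your language had a three-way comparison operator with no direct counterpart in the expression tree operators. You could lower it to a call into a runtime helper. This example again assumes the System.Linq.Expressions API from .NET 4 and later rather than the alpha-era MSAst API, and the operator, the Spaceship helper, and the class name are all hypothetical.

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;

public static class ExoticOperatorSketch
{
    // Hypothetical runtime helper implementing a three-way comparison:
    // -1 if a < b, 1 if a > b, 0 if they are equal.
    public static int Spaceship(int a, int b)
    {
        return a < b ? -1 : (a > b ? 1 : 0);
    }

    // Builds the tree for "left <=> right" as a call to the helper.
    public static Func<int, int, int> BuildSpaceship()
    {
        ParameterExpression left = Expression.Parameter(typeof(int), "left");
        ParameterExpression right = Expression.Parameter(typeof(int), "right");

        MethodInfo helper = typeof(ExoticOperatorSketch).GetMethod(nameof(Spaceship));
        Expression call = Expression.Call(helper, left, right);

        return Expression.Lambda<Func<int, int, int>>(call, left, right).Compile();
    }
}
```

The operator's semantics live in ordinary C# code, and the generator only has to know which helper to call.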

Doesn't get much simpler than that, does it?

So if you have a cool idea for a new language and you're not allergic to the .Net platform, consider trying out the DLR. You'll spend a lot less time implementing the back end for your language, which will leave you more time to concentrate on whatever interesting new characteristics you want your language to have.
