How to Write an Interpreter in One Day

by Daniel Franke

On a Friday evening I was chatting with my Programming Language Principles instructor, Professor Manuel Bermúdez. That afternoon he had introduced the class to an obscure language called RPAL, the Right-reference Pedagogical Algorithmic Language, designed by Wozencraft and Evans for the purpose of teaching functional programming. I had never even heard of RPAL before, and that was more-or-less the point — he wanted us to rely on the language specs rather than on previously-acquired intuition. I had come to his office because I had noticed a bug in his RPAL implementation and wanted to get a brain-dump from him on it so I could track it down. He replied that he wasn't sure what was causing the bug, but that it was one of many and he really wished he had a more robust implementation.

I spent a few minutes thinking out loud about how to go about writing one. Then I announced, "I'll have an implementation for you tomorrow". I'm not sure whether or not he believed me. On that note, I departed his office and got to work. I hacked through the night until I finished the parser, got a couple hours of sleep, and then went back to hacking. By Saturday night I had a working version 0.1.0, sans the documentation and autoconf scripts.

For others in the class, a subset of this will be a semester's worth of work. Part of the difference is artificial: RPAL's grammar is LALR(1) but we'll be expected to convert it to LL and write a recursive descent parser by hand, which is obviously extra work. However, that doesn't account for a factor of 100. The real reason is that almost everyone [1] will use either Java or C++, and I used Scheme.

Tools

The interpreter's frontend is written in Lex and Yacc. The backend/runtime is written in Guile. Yacc sucks, but the alternatives suck even more [2], so I went with it anyway. The main reason that it sucks is that it doesn't support regular right-parts, but I already had a grammar that didn't use them much, so it worked out. My favorite Scheme implementation is MzScheme but I chose Guile because (a) it has an excellent set of C bindings for interacting with Yacc, and (b) it is already installed on most UNIX boxen.

The Frontend

Trees are just a special case of lists where the car is an atom, so I implemented the abstract syntax tree as a list. Every token returned by the lexer is a list with three elements. The car is a symbol representing the token type, the cadr is the lexeme (i.e. the text of the token), and the caddr is the line on which the token occurs (just for inclusion in error messages). Every semantic rule in the parser follows the same format: make a list of the lists returned by the terminals/non-terminals in the right-part, which become subtrees, and then cons a symbol onto the beginning, which becomes the parent node of those subtrees. Here's a sample:

ExprWhere: Tuple T_WHERE DefRec
    { $$ = scm_list_3(scm_str2symbol("where"), $1, $3); }

If I wrote the semantic part of that directly in Scheme rather than using Guile's C API it might look something like

(lambda (Tuple T_WHERE DefRec) (list 'where Tuple DefRec))
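
To make the representation concrete, here is a minimal sketch in plain Scheme of three-element tokens and the kind of tree a rule like the one above would build. The helper names (make-token, token-type and friends), the token type symbols, and the sample lexemes are all hypothetical; the real lexer and parser build these lists through Guile's C API, and a real where node would have full subtrees as children rather than bare tokens.

```scheme
;; Hypothetical constructor and accessors for the three-element token list.
(define (make-token type lexeme line)
  (list type lexeme line))

(define (token-type tok)   (car tok))    ; symbol naming the token type
(define (token-lexeme tok) (cadr tok))   ; the text of the token
(define (token-line tok)   (caddr tok))  ; source line, used in error messages

;; Two sample leaves (the type symbols are made up for illustration).
(define x-tok   (make-token 'identifier "x" 1))
(define one-tok (make-token 'integer "1" 1))

;; What a semantic rule does: list the children, then put the
;; parent symbol at the front.
(define where-tree (list 'where x-tok one-tok))
```

Because a node and a token share the same shape (a symbol in the car, data in the cdr), any code that walks the tree can dispatch on the car without caring whether it is looking at a leaf or an interior node.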

The Backend

All the code after the parsing phase is written in Guile. It translates the abstract syntax tree into Scheme and then evals it. The function that does the translating is called genst. Its first argument is the AST to be translated; the second is the symbol table, which is structured as an assoc list. The keys are strings representing the identifiers that were parsed from the RPAL code. The values are symbols bound to the Scheme objects that represent the data to which those identifiers are bound. Except for the symbols for intrinsic functions, all of these symbols are gensyms.

Initially I call genst with the entire AST and a symbol table containing only intrinsics. It makes numerous recursive calls with pieces of the tree and a table augmented with the symbols bound within the lexical environment of those pieces.
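
A symbol table of this shape might be sketched as follows. This is an illustration under assumptions, not the interpreter's actual code: the procedure names extend-table and lookup-identifier and the intrinsic symbols rpal-print and rpal-order are invented for the example.

```scheme
;; Assoc list from RPAL identifier strings to Scheme symbols.
;; The symbols on the right are placeholders, not the real names.
(define intrinsics
  '(("Print" . rpal-print)
    ("Order" . rpal-order)))

;; Return a new table in which name is bound to a fresh gensym.
;; The old table is not modified, so enclosing scopes are unaffected.
(define (extend-table tab name)
  (cons (cons name (gensym)) tab))

;; Find the symbol for an identifier. assoc returns the first match,
;; so the innermost binding shadows any outer one.
(define (lookup-identifier name tab)
  (let ((entry (assoc name tab)))
    (if entry
        (cdr entry)
        (error "Unbound RPAL identifier:" name))))

;; A scope that binds x, and a nested scope that rebinds it.
(define outer (extend-table intrinsics "x"))
(define inner (extend-table outer "x"))
```

Since extending the table only conses onto the front, each recursive call can pass its own augmented table downward and simply drop it on return, which is what makes lexical scoping fall out of the data structure.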

Every node type in the AST has a corresponding procedure which generates the code for it. An assoc list called nodes, defined inside genst, maps each node type to its handler. Modulo some error handling, the body of genst is then simply

(apply (cdr (assoc (car tree) nodes)) (cdr tree))
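
To show how that dispatch line fits into a whole, here is a toy genst with only two handlers. It is a sketch under assumptions: the node types integer and plus, their argument lists, and the error message are invented for illustration and may not match the real interpreter's node names.

```scheme
;; A toy translator: AST in, Scheme code out.
(define (genst tree tab)
  (define nodes
    (list
     ;; Leaf (integer "42" line): translate to the number itself.
     (cons 'integer
           (lambda (lexeme line) (string->number lexeme)))
     ;; (plus s t): translate both operands and emit an addition.
     (cons 'plus
           (lambda (s t) `(+ ,(genst s tab) ,(genst t tab))))))
  ;; Look up the handler by node type and apply it to the children.
  (let ((entry (assoc (car tree) nodes)))
    (if entry
        (apply (cdr entry) (cdr tree))
        (error "Unknown AST node:" (car tree)))))

(define sample-ast   '(plus (integer "40" 1) (integer "2" 1)))
(define sample-code  (genst sample-ast '()))   ; the list (+ 40 2)
(define sample-value (eval sample-code (interaction-environment)))
```

The handlers are closures defined inside genst, so they can refer to tab and call genst recursively without any extra plumbing.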

A typical node-handling function looks like this:

(lambda (s t)
  `(let ((x ,(genst s tab))
         (y ,(genst t tab)))
     (if (and (number? x) (number? y))
         (quotient x y)
         (fail #f "Arguments to / are not integers" x y))))

That generates the code for the division operator. For some language constructs, the handler first performs a tree transformation to express the construct in terms of other constructs, and then returns the result of calling genst on the modified tree. For example, the @ operator is syntactic sugar which allows you to call a function of two arguments using infix notation. The handler for @ transforms it into prefix syntax:

(lambda (s t u) (genst `(gamma (gamma ,t ,s) ,u) tab))
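
The transformation itself can be looked at in isolation. In the sketch below, desugar-at is a hypothetical name for just the quasiquote part of that handler, and the leaf shapes are made up for the example; gamma is RPAL's function application node.

```scheme
;; Rewrite the children of an @ node, "s @t u", into nested applications:
;; first apply t to s, then apply the result to u.
(define (desugar-at s t u)
  `(gamma (gamma ,t ,s) ,u))

;; The tree for something like  1 @add 2  (leaf shapes are illustrative).
(define at-tree
  (desugar-at '(integer "1" 1)
              '(identifier "add" 1)
              '(integer "2" 1)))
;; at-tree is now
;; (gamma (gamma (identifier "add" 1) (integer "1" 1)) (integer "2" 1))
```

Since the rewritten tree contains only gamma nodes, the @ handler needs no code generation of its own; the gamma handler does all the work.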

That's All, Folks

genst is fewer than 200 lines long, and the only other Scheme code in the program is the definitions of RPAL's intrinsic functions, most of which are just a call to a Scheme library function plus a type check. The parser definition is 150 lines, the lexer is 100, and the remaining 250 or so is boilerplate fluff like GPL headers and command line parsing.
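
An intrinsic of the "library call plus type check" kind might look roughly like this. The sketch assumes an integer-to-string intrinsic in the style of RPAL's ItoS; the name rpal-itos and the wording of the error are invented, and the real code may report errors differently.

```scheme
;; Hypothetical intrinsic: convert an integer to its string form.
;; The body is one library call guarded by one type check.
(define (rpal-itos n)
  (if (integer? n)
      (number->string n)
      (error "Argument to ItoS is not an integer:" n)))

(define itos-result (rpal-itos 42))
```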

Speed

It's slow. It's an interpreter for a slow language written in a slow interpreted language. But it serves its purpose (teaching functional programming by interpreting toy programs) just fine, and the brain cycles saved by getting it done in a day are worth an awful lot of CPU cycles. It's also quite a bit easier to maintain than the C implementation, and less buggy.

RPAL home page

Notes:

[1] When Professor Bermúdez was explaining what functional programming meant, he asked for a show of hands to see who knew Lisp. About 10% of us raised our hands, and we were all sitting in the same corner of the room. I'm working on a cure. Stay tuned.

Copyright (C) 2006 Daniel Franke