How to make interesting little languages.

A bit over a year ago I posted this little gem on Hacker News, where I define a simple programming language in 450 lines of JavaScript or so (that includes comments). I made a mental note at that moment to make a more detailed post explaining how it worked, but I never got around to it. So I'm going to go ahead and do it now.

That language was a nameless throwaway experiment, but for the purposes of this post I should probably give it a name. I'm going to go with "Teacup", an unassuming, adorable name, fit for the small and unassuming language we're going to make.

If you don't like it and would rather give it a mightier name like, I don't know, TeaRex or MegaLang3000, you can edit it below:

So without further ado, dear readers, let us write the Teacup programming language.

The tutorial is interactive, which means that besides changing the language's name to something that will make you giggle like a schoolgirl, you can edit the code in each box and run it with Ctrl+Enter. You can use the log function to pretty-print objects in the area to the right of the text editor. However, this is only going to be practical on a desktop.

Overview

We will construct Teacup as a series of transformations, going from the initial source code (a string) to the result of the computation. We will use JavaScript for the implementation, because it is widely known and runs in browsers, but it should be straightforward to translate to your favorite language.

The steps will be as follows:

We start with the source code, a string.

The lexer transforms a string into a list of tokens.

The parser transforms a list of tokens into an abstract syntax tree.

The interpreter transforms an abstract syntax tree into a result.

Here we define a Pipeline class that will make our work easier by composing a series of transformations, passed as a parameter:
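A minimal sketch of such a class (the process method name and the step signature are assumptions, not necessarily the article's exact interface):

```javascript
// Pipeline: composes a series of transformation steps, each a function
// that takes the previous step's output and returns the next value.
class Pipeline {
  constructor(...steps) {
    this.steps = steps;
  }
  process(input) {
    return this.steps.reduce((value, step) => step(value), input);
  }
}

// Toy usage: trim a string, then upper-case it.
const shout = new Pipeline(s => s.trim(), s => s.toUpperCase());
// shout.process("  hello  ") === "HELLO"
```

For Teacup, the steps will be the lexer, the parser and the interpreter.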

Lexer

The job of a lexer is to split the input text into a list of discrete tokens (atomic parts like numbers, strings, an opening parenthesis, and so on). Not all parsing methodologies have this step, but for the method we are going to use, it simplifies things.

First we will define a regular expression for each of our token types: we'll have numbers, words (variables), operators, strings (with double quotes) and comments (starting with # and up to the end of the line):
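The exact patterns below are assumptions that approximate what the article describes:

```javascript
// One regular expression (as a string) per token type. Order matters:
// number must come before word, since word's pattern also matches digits.
const tokenTypes = {
  number:   "\\d+(?:\\.\\d+)?",                 // 123, 3.14
  word:     "\\w+",                              // variables and keywords
  string:   "\"(?:[^\"\\\\]|\\\\.)*\"",          // double-quoted, with escapes
  operator: "[-+*/^.,;=<>%!?:]+|[\\[\\](){}]",   // symbol runs, single brackets
  comment:  "#[^\\n]*",                          // # up to end of line
};
```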

The order is important: the regular expression for word is a superset of the expression for number, for example, which is why number comes before word. Whatever is not covered by these regular expressions (whitespace, for instance) will be ignored by the lexer.

Also, the reason I define them as strings rather than RegExp objects is that I'm going to concatenate them into one big regular expression, which is much easier to do with strings.

Here's the code for transforming a string into a joyous list of tokens:

You can change testCode to something else you want to test (hit Ctrl+Enter to run).

Essentially, we build a large regular expression with one matching group for each token type, then we use String.split to get our list of tokens, which is a mix of matching strings and the value undefined (when there is no match). The type of each token can be figured out from its position in the list. Then we filter out the spurious tokens: those that are comments, have a null type, or have undefined in the text field.

We also store start and end positions for each token. It is very important to keep track of these numbers so that when things go wrong, we know where they went wrong.
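Putting the pieces together, a lexer along those lines might look like this (function and field names are assumptions):

```javascript
// We build one big regex with a capturing group per token type, then
// String.split interleaves unmatched text with one entry per group
// (undefined for the groups that did not match).
function lex(text) {
  // Group order: number, word, string, operator, comment.
  const types = [null, "number", "word", "string", "operator", "comment"];
  const re = new RegExp(
    "(\\d+(?:\\.\\d+)?)|(\\w+)|(\"(?:[^\"\\\\]|\\\\.)*\")|" +
    "([-+*/^.,;=<>%!?:]+|[\\[\\](){}])|(#[^\\n]*)"
  );
  const tokens = [];
  let pos = 0;
  text.split(re).forEach((part, i) => {
    const type = types[i % types.length]; // position in the cycle gives the type
    if (part === undefined) return;       // a group that did not match
    if (type !== null && type !== "comment")
      tokens.push({ type, text: part, start: pos, end: pos + part.length });
    pos += part.length;                   // track positions for error reporting
  });
  return tokens;
}

// lex("1 + 2") yields three tokens: number "1", operator "+", number "2"
```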

Parser

Alright, so we have a list of tokens. Now we need to organize them into a structure that resembles some kind of program. For Teacup we are going to use a very simple but elegant system: an operator precedence grammar.

OPGs are more limited than the LL or LR grammars that are used for most languages, but I think these are overkill anyway. OPGs have some interesting properties that in my view make up for the limitations. For instance, the parsing rules are highly local, which means they are, in principle, amenable to parallel parsing.

Also, nobody ever talks about them, so I can differentiate my offering in the unending flood of terrible tutorials that plague the Internet.

Now, the common understanding of operator precedence is that each operator has a precedence number and is either left-associative or right-associative. But forget about this system. It's rubbish. Let me show you a better one:

Operator precedence grammars

Instead, imagine that you have a precedence matrix, with a cell for each (ordered) pair of operators. So if you have the expression (a + b * c), instead of going: "Okay, so + has priority 10, and * has priority 20, so * has higher priority and we must parenthesize it like this: (a + (b * c))", you look up the matrix entry M(+, *). There are three possibilities:

M(+, *) = -1: the operator on the left (+) has precedence, so the expression should be parenthesized as ((a + b) * c).

M(+, *) = 1: the operator on the right (*) has precedence, so the expression should be parenthesized as (a + (b * c)).

M(+, *) = 0: both operators are equal, so b "belongs" to both of them. If you want a better intuition as to what this means: you are effectively making the ternary operator +* and b is its middle operand.

If the expression, instead, was (a * b + c), then we would consult M(*, +), which may or may not be the opposite of M(+, *). In fact:

M(+, +) = -1 means that + is left-associative.

M(^, ^) = 1 means that ^ is right-associative.

For a ternary operator, for example (a ? b : c), M(?, :) = 0.

The same idea works for brackets: M([, ]) = 0.

So you can see this is pretty powerful. You get associativity, you get ternary operators, brackets, and so on, for free. You could also define a complex operator like if a then b elseif c ... else d end if you wanted to (and we will want to do this).

Precedence definitions

Granted, defining a precedence matrix sounds like work. If you have ten operators, the matrix is ten by ten, which means a hundred entries. If you have twenty, well… yeah. We don't want to fill that manually.

There are many solutions to this problem, but a tutorial can only afford so much sophistication, so what we're going to do is this: for each operator op, we will have two numeric priorities: a left-priority L(op) for when the operator is on the left and a right-priority R(op) for when it is on the right. And then we will define M(op1, op2) using L and R as follows:

M(op1, op2) = sign(R(op2) - L(op1))

Under this scheme, an operator is left-associative if L(op) > R(op) and right-associative if R(op) > L(op). Unlike the one-priority scheme, this preserves our ability to define ternary operators. What follows are the priorities for Teacup. Some of them will seem strange at first glance, but I will explain everything.

Some clarifications:

lassoc defines a left-associative operator, which means its left-priority is slightly greater than its right-priority. rassoc defines a right-associative operator (higher right-priority). xassoc defines equal left and right-priorities.
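A sketch of how these helpers might be defined (the numbers and the two-element [left, right] representation are assumptions):

```javascript
// Sketch: each operator maps to a pair [leftPriority, rightPriority].
const priorities = {};
function register(ops, left, right) {
  for (const op of ops.split(" ")) priorities[op] = [left, right];
}
// Left-associative: left-priority slightly greater than right-priority.
const lassoc = (ops, p) => register(ops, p, p - 1);
// Right-associative: right-priority slightly greater.
const rassoc = (ops, p) => register(ops, p - 1, p);
// Equal priorities on both sides (ternary-style chains, brackets).
const xassoc = (ops, p) => register(ops, p, p);

lassoc("+ -", 100);
lassoc("* /", 200);
rassoc("^", 300);

// Check: M(+, +) = sign(R(+) - L(+)) = sign(99 - 100) = -1, left-associative.
```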

A prefix operator is just an operator with a very high right-priority. Imagine for example that the expression (a + * b) is in fact (a + null * b), with an implicit "null" token between the two operators. A prefix operator will have a null left operand, whereas a suffix operator will have a null right operand.

Since we compare L(+) to R(*), if R(*) is very high, it's always going to "win" the comparison. It will always take the null, so it will always be prefix. Likewise, a suffix operator would have a very high left-priority.

So:

Opening brackets and keywords like if (which begin special forms) are defined as prefix operators with very low priority.

Closing brackets and the end keyword are defined as suffix operators with a very low priority that matches the opening bracket/keyword.

Keywords that are part of a special form, like else, have a very low priority that matches the opening and closing forms.

What about the entries for word, number and string? These are not operators, right? Well, to simplify the code, we will assume that they are operators, but these operators are nullary (they are both prefix and suffix).

Note that despite being prefix, ( can take an operand on its left, for instance if one writes f(x), because R(() is not quite as high as L(f).

What follows is the code to get a token's priority and the order method, which calculates the matrix entry corresponding to two tokens:
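A sketch of what that code might look like (the priorities map and numbers are illustrative assumptions):

```javascript
// Priorities here stand in for the full Teacup table.
const priorities = {
  "+": [100, 99],   // lassoc
  "*": [200, 199],  // lassoc
  "^": [299, 300],  // rassoc
};

// Words, numbers and strings act as nullary operators:
// maximal priority on both sides.
function getPriority(token) {
  if (token.type === "operator" && priorities[token.text])
    return priorities[token.text];
  return [10000, 10000];
}

// order(a, b) computes the matrix entry M(a, b) = sign(R(b) - L(a)).
function order(a, b) {
  return Math.sign(getPriority(b)[1] - getPriority(a)[0]);
}

const plus = { type: "operator", text: "+" };
const times = { type: "operator", text: "*" };
// order(plus, times) === 1: to the right of +, * binds tighter
```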

Algorithm

Now that we have a priority system for our operators, here is the meat: the operator precedence parser. It is not a long algorithm, nor is it very complex. The most difficult part is visualizing its operation properly, but you can uncomment the log line in the code to get a useful trace of what the algorithm is doing!
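Here is a compact sketch of such a parser. It is not the article's exact code: tokens are plain strings, the priorities are hard-coded assumptions, and location tracking and error handling are omitted.

```javascript
// Every token is treated as an operator; ORDER(a, b) is the matrix
// entry M(a, b) = sign(R(b) - L(a)).
const PRIO = { "+": [100, 99], "*": [200, 199] };
const prio = tok => PRIO[tok] || [10000, 10000]; // nullary otherwise
function ORDER(a, b) {
  if (b === null) return -1; // virtual end-of-input token
  return Math.sign(prio(b)[1] - prio(a)[0]);
}

function parse(tokens, finalize) {
  const stack = [];   // frames: {token, left, same}
  let current = null; // the operand built so far

  function reduce() { // pop one full handle off the stack
    const frames = [];
    let f;
    do { f = stack.pop(); frames.unshift(f); } while (f.same);
    const handle = [];
    for (const fr of frames) handle.push(fr.left, fr.token);
    handle.push(current);
    current = finalize(handle);
  }

  for (const t of [...tokens, null]) { // null marks the end of input
    while (stack.length && ORDER(stack[stack.length - 1].token, t) === -1)
      reduce();
    if (t === null) break;
    // same === true when M(top, t) === 0: t extends the current handle
    const same = stack.length > 0 &&
      ORDER(stack[stack.length - 1].token, t) === 0;
    stack.push({ token: t, left: current, same });
    current = null;
  }
  return current;
}

// With the identity finalize, ["a", "+", "b", "*", "c"] yields
// [[null,"a",null], "+", [[null,"b",null], "*", [null,"c",null]]]
const ast = parse(["a", "+", "b", "*", "c"], x => x);
```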

In the right pane, you can see the output of the raw algorithm, which is a deeply nested list. It may seem oddly organized at first glance, but it is actually very neat: notice that in each list, which we will also call a "handle", the odd indexes (0-based) are the operator tokens, and the even indexes are the operands. For instance:

The atom a (which, as I explained before, is parsed as a nullary operator) corresponds to the list [null, a, null]. The operator is a (in position 1), and it has two null operands (in positions 0 and 2).

a + b corresponds to the list [a, +, b]: operator + in position 1, operands a and b in positions 0 and 2.

If you try the code (a), it will produce the list [null, (, a, ), null]: operators in positions 1 and 3, operands in positions 0, 2 and 4.

Another neat property of this output is that it is order preserving: if you flatten it, you end up with the original list of tokens, with a null between each token pair.

You have certainly noticed that parse takes an additional argument, a function called finalize (we use the identity function in the example above). That is the subject of the next section:

Finalizing the parse

Although neat, the raw output of parse is not necessarily the one we want. The finalize function is given a handle after it is completed and must return the "value" associated with that handle. If you are writing a simple calculator, that can be an evaluation function; if you are writing a compiler or a more elaborate interpreter, it can be a node of your abstract syntax tree (or AST).

Here we will define a finalize function that transforms our infix AST nodes into prefix ones and computes location information. We will do this somewhat cleverly, for reasons that will become apparent later (note that in what follows, "a" is shorthand for the token object for a ):

Our transformation rules:

[null, "a", null] ==> "a"

[a, "+", b] ==> ["E + E", a, b]

[null, "+", b] ==> ["_ + E", b]

[null, "(", a, ")", null] ==> ["_ ( E ) _", a]

In other words, we will eliminate nullary operators, and aggregate the operators that we find in odd positions into an expression that describes which of the operands are not null.

(Also, we are not really going to return an expression in prefix notation; instead we will put the function's "signature" in the type field and the arguments in the args field. That simplifies things a little: our AST will have our old token nodes as leaves, with types word, number and string. And now we're going to add inner nodes with types like E + E, so that later on we can simply dispatch on type.)
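A sketch of such a finalize function (token objects are assumed to have type and text fields; location computation is omitted):

```javascript
// Turn a raw handle into an AST node. Handles alternate
// operand, operator, operand, operator, ..., operand.
function finalize(handle) {
  // Nullary case: [null, token, null] ==> token itself.
  if (handle.length === 3 && handle[1].type !== "operator")
    return handle[1];
  const parts = [];
  const args = [];
  handle.forEach((x, i) => {
    if (i % 2 === 1) parts.push(x.text);    // operator token
    else if (x === null) parts.push("_");   // missing operand
    else { parts.push("E"); args.push(x); } // real operand
  });
  return { type: parts.join(" "), args };
}

// finalize([a, "+", b]) has type "E + E" and args [a, b];
// finalize([null, "(", a, ")", null]) has type "_ ( E ) _" and args [a].
```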

Interpreter

Now that we have our AST, there are a few things we could do. We could write a compiler for Teacup, either to JavaScript, or to a different language, or machine code. Or we could write a simple interpreter. We are going to do the latter, because it is a bit more interesting than compiling to JavaScript (and not a pain in the ass like compiling to machine code).

Environment

In our interpreter we will implement variables and lexical scoping. This means we need some kind of structure to map variable names to values: an environment. For this purpose we will use a simple JavaScript object with a null prototype (which means it has no built-in fields like toString that could interfere with symbol resolution). Then, to implement nested scopes, all we need to do is to call Object.create on the parent environment. JS's prototype inheritance will do the rest for us.

So we define the makeEnvironment function that takes an optional bindings argument (for initial bindings). It will also populate the environment with basic functionality that we want to have in Teacup, like truth, falsehood, arithmetic… that kind of nonsense.

Notice that we populate the environment with some symbolic names like "+". We will make it so that operator applications like (a + b) will use these bindings.

The and and or functions are lazy: instead of receiving their arguments directly, these functions will receive functions that evaluate to the arguments, so they may choose to evaluate only a subset. The lazy wrapper we define here just sets the lazy property to true, we will deal with it explicitly later (see the runCall function).
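A sketch of makeEnvironment along those lines (the exact set of builtins is an assumption):

```javascript
// Mark a function as lazy: it receives thunks instead of values.
function lazy(fn) { fn.lazy = true; return fn; }

function makeEnvironment(bindings = {}) {
  const env = Object.create(null); // no inherited fields like toString
  Object.assign(env, {
    true: true,
    false: false,
    "+": (a, b) => a + b,
    "*": (a, b) => a * b,
    "<": (a, b) => a < b,
    // Lazy: evaluate only the thunks that are needed.
    and: lazy((a, b) => a() ? b() : false),
    or:  lazy((a, b) => { const v = a(); return v ? v : b(); }),
  }, bindings);
  return env;
}

// Nested scope: Object.create chains lookup to the parent environment.
const root = makeEnvironment({ x: 1 });
const child = Object.create(root);
child.y = 2;
// child.x === 1 (inherited), child.y === 2 (own)
```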

Utilities

Before going any further we will write some utilities to manipulate the AST a bit more easily (their functionality and purpose should be clear enough from the comments):

Algorithm

Now comes the time to write our interpreter for Teacup, which will evaluate a source string into a value. It will have two parameters:

A parser to produce the AST.

A mapping from node "type" (word, number, E + E, etc.) to a function that handles that kind of node (handlers).

One caveat for the mapping is that we will also map regular expressions to handlers. For instance, we may have a node of "type" _ if X elseif X elseif X else X end _. We will therefore want to be able to match an arbitrary number of elseif, and a regular expression will do just fine.

We will define an eval method on the interpreter which will take two arguments: a source code string or an AST node, and the environment we defined previously.
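A sketch of the interpreter shell (using an array of {key, handler} pairs, since object keys cannot be regular expressions; names are assumptions):

```javascript
// Dispatch on node.type; a handler key may be a string or a RegExp.
function makeInterpreter(parser, handlers) {
  const interpreter = {
    eval(node, env) {
      if (typeof node === "string") node = parser(node); // source string
      for (const { key, handler } of handlers) {
        if (key instanceof RegExp ? key.test(node.type) : key === node.type)
          return handler(node, env, interpreter);
      }
      throw new Error("Unknown node type: " + node.type);
    }
  };
  return interpreter;
}

// Tiny demonstration with two handlers:
const handlers = [
  { key: "number", handler: n => parseFloat(n.text) },
  { key: /^E \+ E$/, handler: (n, env, it) =>
      it.eval(n.args[0], env) + it.eval(n.args[1], env) },
];
const it = makeInterpreter(null, handlers);
const node = { type: "E + E",
               args: [{ type: "number", text: "1" },
                      { type: "number", text: "2" }] };
// it.eval(node, {}) === 3
```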

That's it, really. All we are missing are the handlers, which are admittedly the most important part, because at the moment our interpreter literally can't interpret anything.

As we define them, we will push the handlers into the handlers variable.

Basic handlers

The basic handlers we need are as follows:

Type                         Example                   Description
word                         xyz                       Variable lookup
number                       123                       Numbers
string                       "hohoho"                  Strings
_ ( E ) _                    (x)                       Grouping
E ( E ) _                    f(x)                      Function calls
_ [ E ] _                    [x, y, z]                 List building
E [ E ] _                    x[y]                      Indexing
E . E                        x.y                       Dot notation for indexing
E + E                        x + y                     Operator application
E * E                        x * y                     (and so on for the other operators)
_ if E then E else E end _   if x then y else z end    Conditional statement
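To make this concrete, here are sketches of a few of these handlers (node shapes follow the earlier sections; names and signatures are assumptions):

```javascript
// word: variable lookup in the environment.
function wordHandler(node, env, evaluate) {
  if (!(node.text in env)) throw new Error("Undefined variable: " + node.text);
  return env[node.text];
}

// E + E, E * E, ...: the operator itself is looked up in the environment.
function operatorHandler(node, env, evaluate) {
  const op = node.type.split(" ")[1]; // "+", "*", ...
  const args = node.args.map(a => evaluate(a, env));
  return env[op](...args);
}

// _ if E then E else E end _
function ifHandler(node, env, evaluate) {
  const [test, yes, no] = node.args;
  return evaluate(test, env) ? evaluate(yes, env) : evaluate(no, env);
}

// Mini demonstration with a hand-rolled dispatcher:
const env = { x: 2, "+": (a, b) => a + b };
const evaluate = (n, e) =>
  n.type === "word"   ? wordHandler(n, e, evaluate) :
  n.type === "number" ? parseFloat(n.text) :
  operatorHandler(n, e, evaluate);
const sum = { type: "E + E",
              args: [{ type: "word", text: "x" },
                     { type: "number", text: "3" }] };
// evaluate(sum, env) === 5
```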

Variables and functions

Now that the basics are there, we will add the capability to define variables and functions, as well as simple list comprehension syntax:

Type                      Example                      Description
E -> E                    x -> x + 1                   Anonymous function
_ let E in E end _        let x = 1 in x * x end       Variable binding
_ for E in E do E end _   for x in list do x * x end   List comprehension
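Sketches of how these three handlers might look (same assumed node shapes as before):

```javascript
// Chain a child environment off the parent for a nested scope.
const makeScope = parent => Object.create(parent);

// E -> E: the left side is the parameter name, the right the body.
function arrowHandler(node, env, evaluate) {
  const [param, body] = node.args;
  return arg => {
    const scope = makeScope(env);
    scope[param.text] = arg;
    return evaluate(body, scope);
  };
}

// _ let E in E end _: args are the binding (an "E = E" node) and the body.
function letHandler(node, env, evaluate) {
  const [binding, body] = node.args;
  const [name, value] = binding.args;
  const scope = makeScope(env);
  scope[name.text] = evaluate(value, env);
  return evaluate(body, scope);
}

// _ for E in E do E end _: list comprehension.
function forHandler(node, env, evaluate) {
  const [param, listExpr, body] = node.args;
  return evaluate(listExpr, env).map(item => {
    const scope = makeScope(env);
    scope[param.text] = item;
    return evaluate(body, scope);
  });
}
```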

Well, look at that! Teacup is almost complete!

Adjustments

The one small detail we have to fix is unary -, which, unfortunately, doesn't work. With our current setup, (10 + -2) will parse as ((10 +) - 2).

There is an easy fix: we will write a pass that looks for sequences of infix tokens and makes every operator except the first one prefix. We will sandwich that pass between the lexer and the parser.
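A sketch of that pass (the prefix flag is an assumed mechanism; the parser would give flagged tokens a very high right-priority):

```javascript
// In a run of consecutive operator tokens, every operator after the
// first is rewritten as a prefix operator.
function fixPrefix(tokens) {
  return tokens.map((tok, i) => {
    const prev = tokens[i - 1];
    if (tok.type === "operator" && prev && prev.type === "operator")
      return { ...tok, prefix: true };
    return tok;
  });
}

// In "10 + - 2", the "-" follows another operator, so it becomes prefix.
```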

There. We're done :)

Play pen

You can try the final result here. I have added syntax highlighting and automatic indent for your convenience. The highlighting depends only on the regular expressions defined in the lexer section, so it will adapt to whatever changes you make. Fun!

Note: if the play pen is empty, that's because you evaluated code in some other editor on your way here. If you refresh the page, you will see an example.

Hit Ctrl+Enter to evaluate.

Conclusion

We made Teacup! Yay! That wasn't too hard.

There's a bunch of stuff we could do to tweak or improve it, and which may be the subject of future posts, if there is interest:

Static type annotations and inference.

Interesting special forms, like pattern matching.

Simple macro system.

Incremental parser.

A git repository with the code is available here.

I also made a JSFiddle for your convenience.

Related links

The Earl Grey language which I designed using some of the principles in this tutorial.

"Screw it, I'll make my own!", an article about my experience designing a programming language.
