So this article is my usual attempt to explain the topic to myself, making sure that when I forget how it works in a couple of months, I will have a simple way of remembering.

I encountered Crockford's article in the Beautiful Code book, but found it hard to understand. I could follow the code, but had a hard time grasping why the thing works. Recently I became interested in the topic again, tried to read the article once more, and again was stumped. Finally, by reading Pratt's original paper and Frederick Lundh's excellent Python-based piece , I understood the algorithm.

An alternative parsing algorithm was discovered by Vaughan Pratt in 1973. Called Top Down Operator Precedence, it shares some features with the modified RD parser, but promises to simplify the code, as well as provide better performance. Recently it was popularized again by Douglas Crockford in his article , and employed by him in JSLint to parse Javascript.

The third article describes a method that combines RD parsing with a different algorithm for parsing expressions to achieve better results. This method is actually used in the real-world, for example in GCC and Parrot ( source ).

Recursive-descent parsers have always interested me, and in the past year and a half I wrote a few articles on the topic. Here they are in chronological order:

To accommodate this difference, TDOP allows for different treatment of tokens in prefix and infix contexts. In TDOP terminology the handler of a token as prefix is called nud (for "null denotation") and the handler of a token as infix is called led (for "left denotation").

To parse the traditional infix-notation expression languages , we have to differentiate between the prefix form and infix form of tokens. The best example is the minus operator ( - ). In its infix form it is subtraction:

If we want to express the convention of multiplication having a higher precedence than addition, let's define the binding power ( bp ) of * to be 20 and that of + to be 10 (the numbers are arbitrary, what's important is that bp(*) > bp(+) ). Thus, by the definition we've made above, the 2 will be associated with * , since its binding power is higher than that of + .

Here it is once again with A, E, B identified:

Consider a substring AEB where A takes a right argument, B a left, and E is an expression. Does E associate with A or with B? We define a numeric binding power for each operator. The operator with the higher binding power "wins" - gets E associated with it . Let's examine the expression:

Operator precedence and associativity is a fundamental topic to be handled by parsing techniques. TDOP handles this issue by assigning a "binding power" to each token it parses.

Top down operator precedence parsing (TDOP from now on) is based on a few fundamental principles:

The TDOP algorithm

Here's a basic TDOP parser:

def expression (rbp= 0 ): global token t = token token = next() left = t.nud() while rbp < token.lbp: t = token token = next() left = t.led(left) return left class literal_token ( object ): def __init__ ( self , value): self .value = int (value) def nud ( self ): return self .value class operator_add_token ( object ): lbp = 10 def led ( self , left): right = expression( 10 ) return left + right class operator_mul_token ( object ): lbp = 20 def led ( self , left): return left * expression( 20 ) class end_token ( object ): lbp = 0

We only have to augment it with some support code consisting of a simple tokenizer and the parser driver:

import re token_pat = re.compile( "\s*(?:(\d+)|(.))" ) def tokenize (program): for number, operator in token_pat.findall(program): if number: yield literal_token(number) elif operator == "+" : yield operator_add_token() elif operator == "*" : yield operator_mul_token() else : raise SyntaxError( 'unknown operator: %s' , operator) yield end_token() def parse (program): global token, next next = tokenize(program).next token = next() return expression()

And we have a complete parser and evaluator for arithmetic expressions involving addition and multiplication.

Now let's figure out how it actually works. Note that the token classes have several attributes (not all classes have all kinds of attributes):

lbp - the left binding power of the operator. For an infix operator, it tells us how strongly the operator binds to the argument at its left.

- the left binding power of the operator. For an infix operator, it tells us how strongly the operator binds to the argument at its left. nud - this is the prefix handler we talked about. In this simple parser it exists only for the literals (the numbers)

- this is the prefix handler we talked about. In this simple parser it exists only for the literals (the numbers) led - the infix handler.

The key to enlightenment here is to notice how the expression function works, and how the operator handlers call it, passing in a binding power.

When expression is called, it is provided the rbp - right binding power of the operator that called it. It consumes tokens until it meets a token whose left binding power is equal or lower than rbp . Specifically, it means that it collects all tokens that bind together before returning to the operator that called it.

Handlers of operators call expression to process their arguments, providing it with their binding power to make sure it gets just the right tokens from the input.

Let's see, for example, how this parser handles the expression:

3 + 1 * 2 * 4 + 5

Here's the call trace of the parser's functions when parsing this expression:

<<expression with rbp 0>> <<literal nud = 3>> <<led of "+">> <<expression with rbp 10>> <<literal nud = 1>> <<led of "*">> <<expression with rbp 20>> <<literal nud = 2>> <<led of "*">> <<expression with rbp 20>> <<literal nud = 4>> <<led of "+">> <<expression with rbp 10>> <<literal nud = 5>>

The following diagram shows the calls made to expression on various recursion levels:

The arrows show the tokens on which each execution of expression works, and the number above them is the rbp given to expression for this call.

Apart from the initial call (with rbp=0 ) which spans the whole input, expression is called after each operator (by its led handler) to collect the right-side argument. As the diagram clearly shows, the binding power mechanism makes sure expression doesn't go "too far" - only as far as the precedence of the invoking operator allows. The best place to see it is the long arrow after the first + , that collects all the multiplication terms (they must be grouped together due to the higher precedence of * ) and returns before the second + is encountered (when the precedence ceases being higher than its rbp ).

Another way to look at it: at any point in the execution of the parser, there's an instance of expression for each precedence level that is active at that moment. This instance awaits the results of the higher-precedence instance and keeps going, until it has to stop itself and return its result to its caller.

If you understand this example, you understand TDOP parsing. All the rest is really just more of the same.