This tokenizer is split into two parts: a language-specific parser that turns the source program into a stream of literals, names, and operators, and a second part that turns those into token instances. The latter checks both operators and names against the symbol table (to handle keyword operators), and uses a pseudo-symbol (“(name)”) for all other names.

You could combine the two tasks into a single function, but the separation makes it a bit easier to test the parser, and also makes it possible to reuse the second part for other syntaxes.
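The split can be sketched as follows. This is a minimal illustration, not the article's exact code: the helper names `tokenize_python` and `classify` and the `KEYWORDS` set are made up here, and the second stage yields plain token ids instead of token instances.

```python
import io
import tokenize as _t

def tokenize_python(program):
    # part 1: language-specific; classify raw tokens from Python's
    # tokenize module as literals, names, or operators
    kinds = {_t.NUMBER: "literal", _t.STRING: "literal",
             _t.NAME: "name", _t.OP: "operator"}
    for tok in _t.generate_tokens(io.StringIO(program).readline):
        if tok.type in kinds:
            yield kinds[tok.type], tok.string
    yield "end", "end"

KEYWORDS = {"and", "or", "not", "in", "is", "if", "else", "lambda"}

def classify(program):
    # part 2: language-independent; operators and keyword names get
    # their own token id, everything else becomes the pseudo-symbols
    # "(name)" and "(literal)"
    for kind, value in tokenize_python(program):
        if kind == "operator" or (kind == "name" and value in KEYWORDS):
            yield value
        elif kind == "literal":
            yield "(literal)"
        elif kind == "name":
            yield "(name)"
        else:
            yield "(end)"
```

In the real tokenizer, the second stage looks the ids up in the symbol table and instantiates the corresponding token classes; the classification logic is the same.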

We can test the new tokenizer with the old parser definition:

>>> parse("1+2")
(+ (literal 1) (literal 2))
>>> parse("1+2+3")
(+ (+ (literal 1) (literal 2)) (literal 3))
>>> parse("1+2*3")
(+ (literal 1) (* (literal 2) (literal 3)))
>>> parse("1.0*2+3")
(+ (* (literal 1.0) (literal 2)) (literal 3))
>>> parse("'hello'+'world'")
(+ (literal 'hello') (literal 'world'))

The new tokenizer supports more literals, so our parser does that too, without any extra work. And we’re still using the 10-line expression implementation we introduced at the beginning of this article.

The Python Expression Grammar #

So, let’s do something about the grammar. We could figure out the correct expression grammar from the grammar snippet shown earlier, but there’s a more practical description in the section “Evaluation order” in Python’s language reference. The table in that section lists all expression operators in precedence order, from lowest to highest. Here are the corresponding definitions (starting at binding power 20):

symbol("lambda", 20)
symbol("if", 20)
infix_r("or", 30); infix_r("and", 40); prefix("not", 50)
infix("in", 60); infix("not", 60)
infix("is", 60)
infix("<", 60); infix("<=", 60)
infix(">", 60); infix(">=", 60)
infix("<>", 60); infix("!=", 60); infix("==", 60)
infix("|", 70); infix("^", 80); infix("&", 90)
infix("<<", 100); infix(">>", 100)
infix("+", 110); infix("-", 110)
infix("*", 120); infix("/", 120); infix("//", 120)
infix("%", 120)
prefix("-", 130); prefix("+", 130); prefix("~", 130)
infix_r("**", 140)
symbol(".", 150); symbol("[", 150); symbol("(", 150)

These 16 lines define the syntax for 35 operators, and also provide behaviour for most of them.

However, tokens defined by the symbol helper have no intrinsic behaviour; to make them work, additional code is needed. There are also some intricacies caused by limitations in Python’s tokenizer; more about those later.

But before we start working on those symbols, we need to add behaviour to the pseudo-tokens too:

symbol("(literal)").nud = lambda self: self
symbol("(name)").nud = lambda self: self
symbol("(end)")

We can now do a quick sanity check:

>>> parse("1+2")
(+ (literal 1) (literal 2))
>>> parse("2<<3")
(<< (literal 2) (literal 3))

Parenthesized Expressions #

Let’s turn our focus to the remaining symbols, and start with something simple: parenthesized expressions. They can be implemented by a “nud” method on the “(” token:

def nud(self):
    expr = expression()
    advance(")")
    return expr
symbol("(").nud = nud

The “advance” function used here is a helper function that checks that the current token has a given value, before fetching the next token.

def advance(id=None):
    global token
    if id and token.id != id:
        raise SyntaxError("Expected %r" % id)
    token = next()

The “)” token must be registered; if not, the tokenizer will report it as an invalid token. To register it, just call the symbol function:

symbol( ")" )

Let’s try it out:

>>> parse("1+2*3")
(+ (literal 1) (* (literal 2) (literal 3)))
>>> parse("(1+2)*3")
(* (+ (literal 1) (literal 2)) (literal 3))

Note that the “nud” method returns the inner expression, so the “(” node won’t appear in the resulting syntax tree.

Also note that we’re cheating here, for a moment: the “(” prefix has two meanings in Python; it can either be used for grouping, as above, or to create tuples. We’ll fix this below.

Ternary Operators #

Most custom methods look more or less exactly like their recursive-descent counterparts, and the code for inline if-else is no different:

def led(self, left):
    self.first = left
    self.second = expression()
    advance("else")
    self.third = expression()
    return self
symbol("if").led = led

Again, we need to register the extra token before we can try it out:

symbol("else")

>>> parse("1 if 2 else 3")
(if (literal 1) (literal 2) (literal 3))

Attribute and Item Lookups #

To handle attribute lookups, the “.” operator needs a “led” method. For convenience, this version verifies that the period is followed by a proper name token (this check could be made at a later stage as well):

def led(self, left):
    if token.id != "(name)":
        raise SyntaxError("Expected an attribute name.")
    self.first = left
    self.second = token
    advance()
    return self
symbol(".").led = led

>>> parse("foo.bar")
(. (name foo) (name bar))

Item access is similar; just add a “led” method to the “[” operator. And since “]” is part of the syntax, we need to register that symbol as well.

symbol("]")

def led(self, left):
    self.first = left
    self.second = expression()
    advance("]")
    return self
symbol("[").led = led

>>> parse("'hello'[0]")
([ (literal 'hello') (literal 0))

Note that we’re ending up with lots of code of the form:

def led(self, left):
    ...
symbol(id).led = led

which is a bit inconvenient, if nothing else because it violates the “don’t repeat yourself” rule (the method name appears three times). A simple decorator solves this:

def method(s):
    assert issubclass(s, symbol_base)
    def bind(fn):
        setattr(s, fn.__name__, fn)
    return bind

This decorator picks up the function name, and attaches that to the given symbol. This puts the symbol name before the method definition, and only requires you to write the method name once.

@method(symbol(id))
def led(self, left):
    ...

We’ll use this in the following examples. The other approach isn’t much longer, so you can still use it if you need to target Python 2.3 or older. Just watch out for typos.

Function Calls #

A function call consists of an expression followed by a comma-separated expression list, in parentheses. By treating the left parenthesis as a binary operator, parsing this construct is straightforward:

symbol(")"); symbol(",")

@method(symbol("("))
def led(self, left):
    self.first = left
    self.second = []
    if token.id != ")":
        while 1:
            self.second.append(expression())
            if token.id != ",":
                break
            advance(",")
    advance(")")
    return self

>>> parse("hello(1,2,3)")
(( (name hello) [(literal 1), (literal 2), (literal 3)])

This is a bit simplified; keyword arguments and the “*” and “**” forms are not supported by this version. To handle keyword arguments, look for an “=” after the first expression, and if that’s found, check that the subtree is a plain name, and then call expression again to get the default value. The other forms could be handled by “nud” methods on the corresponding operators, but it’s probably easier to handle these too in this method.

Lambdas are also quite simple. Since the “lambda” keyword is a prefix operator, we’ll implement it using a “nud” method:

symbol(":")

@method(symbol("lambda"))
def nud(self):
    self.first = []
    if token.id != ":":
        argument_list(self.first)
    advance(":")
    self.second = expression()
    return self

def argument_list(list):
    while 1:
        if token.id != "(name)":
            raise SyntaxError("Expected an argument name.")
        list.append(token)
        advance()
        if token.id != ",":
            break
        advance(",")

>>> parse("lambda a, b, c: a+b+c")
(lambda [(name a), (name b), (name c)] (+ (+ (name a) (name b)) (name c)))

Again, the argument list parsing is a bit simplified; it doesn’t handle default values and the “*” and “**” forms. See above for implementation hints. Also note that there’s no scope handling at the parser level in this implementation. See Crockford’s article for more on that topic.

Constants can be handled as literals; the following “nud” method changes the token instance to a literal node, and inserts the token itself as the literal value:

def constant(id):
    @method(symbol(id))
    def nud(self):
        self.id = "(literal)"
        self.value = id
        return self

constant("None")
constant("True")
constant("False")

>>> parse("1 is None")
(is (literal 1) (literal None))
>>> parse("True or False")
(or (literal True) (literal False))

Python has two multi-token operators, “is not” and “not in”, but our parser doesn’t quite treat them correctly:

>>> parse("1 is not 2")
(is (literal 1) (not (literal 2)))

The problem is that the standard tokenize module doesn’t understand this syntax, so it happily returns these operators as two separate tokens:

>>> list(tokenize("1 is not 2"))
[(literal 1), (is), (not), (literal 2), ((end))]

In other words, “1 is not 2” is handled as “1 is (not 2)”, which isn’t the same thing:

>>> 1 is not 2
True
>>> 1 is (not 2)
False

One way to fix this is to tweak the tokenizer (e.g. by inserting a combining filter between the raw Python parser and the token instance factory), but it’s probably easier to fix this with custom “led” methods on the “is” and “not” operators:

@method(symbol("not"))
def led(self, left):
    if token.id != "in":
        raise SyntaxError("Invalid syntax")
    advance()
    self.id = "not in"
    self.first = left
    self.second = expression(60)
    return self

@method(symbol("is"))
def led(self, left):
    if token.id == "not":
        advance()
        self.id = "is not"
    self.first = left
    self.second = expression(60)
    return self

>>> parse("1 in 2")
(in (literal 1) (literal 2))
>>> parse("1 not in 2")
(not in (literal 1) (literal 2))
>>> parse("1 is 2")
(is (literal 1) (literal 2))
>>> parse("1 is not 2")
(is not (literal 1) (literal 2))

This means that the “not” operator handles both unary “not” and binary “not in”.
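For comparison, the tokenizer-filter approach mentioned above can be sketched as a small generator that sits between the raw tokenizer and the token instance factory. This is a hypothetical helper, not part of the article's parser, and it works on plain token id strings rather than token objects:

```python
def combine_filter(tokens):
    # merge the two-word operators "is not" and "not in" into single
    # token ids; `tokens` is any iterable of token id strings, in the
    # order they come out of the raw tokenizer
    tokens = iter(tokens)
    prev = next(tokens, None)
    for tok in tokens:
        if prev == "is" and tok == "not":
            yield "is not"
            prev = next(tokens, None)
        elif prev == "not" and tok == "in":
            yield "not in"
            prev = next(tokens, None)
        else:
            yield prev
            prev = tok
    if prev is not None:
        yield prev
```

The real filter would pass token objects through and build a combined token for the merged pair, but the one-token-lookahead pairing logic is the same.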

Tuples, Lists, and Dictionary Displays #

As noted above, the “(” prefix serves two purposes in Python; it’s used for grouping, and to create tuples (it’s also used as a binary operator, for function calls). To handle tuples, we need to replace the “nud” method with a version that can distinguish between tuples and a plain parenthesized expression.

Python’s tuple-forming rules are simple: if a pair of parentheses is empty, or contains at least one comma, it’s a tuple; otherwise, it’s a parenthesized expression. In other words:

() is a tuple

(1) is a parenthesized expression

(1,) is a tuple

(1, 2) is a tuple

Here’s a “nud” replacement that implements these rules:

@method(symbol("("))
def nud(self):
    self.first = []
    comma = False
    if token.id != ")":
        while 1:
            if token.id == ")":
                break
            self.first.append(expression())
            if token.id != ",":
                break
            comma = True
            advance(",")
    advance(")")
    if not self.first or comma:
        return self # tuple
    else:
        return self.first[0]

>>> parse("()")
(()
>>> parse("(1)")
(literal 1)
>>> parse("(1,)")
(( [(literal 1)])
>>> parse("(1, 2)")
(( [(literal 1), (literal 2)])

Lists and dictionaries are a bit simpler; they’re just plain lists of expressions or expression pairs. Don’t forget to register the extra tokens.

symbol("]")

@method(symbol("["))
def nud(self):
    self.first = []
    if token.id != "]":
        while 1:
            if token.id == "]":
                break
            self.first.append(expression())
            if token.id != ",":
                break
            advance(",")
    advance("]")
    return self

>>> parse("[1, 2, 3]")
([ [(literal 1), (literal 2), (literal 3)])

symbol("}"); symbol(":")

@method(symbol("{"))
def nud(self):
    self.first = []
    if token.id != "}":
        while 1:
            if token.id == "}":
                break
            self.first.append(expression())
            advance(":")
            self.first.append(expression())
            if token.id != ",":
                break
            advance(",")
    advance("}")
    return self

>>> parse("{1: 'one', 2: 'two'}")
({ [(literal 1), (literal 'one'), (literal 2), (literal 'two')])

Note that Python allows you to use optional trailing commas when creating lists, tuples, and dictionaries; an extra if-statement at the beginning of the collection loop takes care of that case.

At roughly 250 lines of code (including the entire parser machinery), there are still a few things left to add before we can claim to fully support the Python 2.5 expression syntax, but we’ve covered a remarkably large part of the syntax with very little work.

And as we’ve seen throughout this article, parsers using this algorithm and implementation approach are readable, easy to extend, and, as we’ll see in a moment, surprisingly fast. While this article has focused on expressions, the algorithm can easily be extended for statement-oriented syntaxes. See Crockford’s article for one way to do that.

All in all, Pratt’s parsing algorithm is a great addition to the Python parsing toolbox, and the implementation strategy outlined in this article is a simple way to quickly implement such parsers.

As we’ve seen, the parser makes only a few Python calls per token, which means that it should be pretty efficient (or as Pratt put it, “efficient in practice if not in theory”).

To test practical performance, I picked a 456 character long Python expression (about 300 tokens) from the Python FAQ, and parsed it with a number of different tools. Here are some typical results under Python 2.5:

topdown parse (to abstract syntax tree): 4.0 ms

built-in parse (to tuple tree): 0.60 ms

built-in compile (to code object): 0.68 ms

compiler parse (to abstract syntax tree): 4.8 ms

compiler compile (to code object): 18 ms

If we tweak the parser to work on a precomputed list of tokens (obtained by running “list(tokenize_python(program))”), the parsing time drops to just under 0.9 ms. In other words, only about one fourth of the time for the full parse is spent on token instance creation, parsing, and tree building; the rest is almost entirely spent in Python’s tokenize module. With a faster tokenizer, this algorithm would get within 2x or so of Python’s built-in tokenizer/parser.
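To run this kind of measurement yourself, a simple harness along the following lines works on a modern Python. The sample expression, repeat count, and the `time_it` helper are illustrative, not the ones used for the numbers above:

```python
import io
import time
import tokenize

# an arbitrary sample expression; any reasonably long expression will do
expr = "1+2*(3-4)/5 + spam[0].egg(x, y) if flag else [n for n in range(10)]"

def time_it(func, repeat=1000):
    # return the average time per call, in milliseconds
    t0 = time.perf_counter()
    for _ in range(repeat):
        func()
    return (time.perf_counter() - t0) / repeat * 1000.0

# tokenizing alone vs. Python's built-in compile, for comparison
tokenize_ms = time_it(
    lambda: list(tokenize.generate_tokens(io.StringIO(expr).readline)))
compile_ms = time_it(lambda: compile(expr, "<expr>", "eval"))
```

Timing the topdown parser itself, with and without a precomputed token list, follows the same pattern.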

The built-in parse test is in itself quite interesting; it uses Python’s internal tokenizer and parser (both written in C), and the parser module (also written in C) to convert the internal syntax tree object to a tuple tree. This is fast, but results in a remarkably undecipherable low-level tree:

>>> parser.st2tuple(parser.expr("1+2"))
(258, (326, (303, (304, (305, (306, (307, (309, (310, (311, (312, (313,
(314, (315, (316, (317, (2, '1'))))), (14, '+'), (314, (315, (316, (317,
(2, '2')))))))))))))))), (4, ''), (0, ''))

(In this example, 2 means number, 14 means plus, 4 is newline, and 0 is end of program. The 3-digit numbers represent intermediate rules in the Python grammar.)

The compiler parse test uses the parse function from the compiler package instead; this function uses Python’s internal tokenizer and parser, and then turns the resulting low-level structure into a much nicer abstract tree:

>>> import compiler
>>> compiler.parse("1+2", "eval")
Expression(Add((Const(1), Const(2))))

This conversion (done in Python) turns out to be more work than parsing the expression with the topdown parser; with the code in this article, we get an abstract tree in about 85% of the time, despite using a really slow tokenizer.

Code Notes #

The code in this article uses global variables to hold parser state (the “token” variable and the “next” helper). If you need a thread-safe parser, these should be moved to a context object. This will result in a slight performance hit, but there are some surprising ways to compensate for that, by trading a little memory for performance. More on that in a later article.
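Such a context object could be wrapped along these lines. This is a hypothetical sketch, with names and details that differ from the article's code, and with plain strings standing in for token instances:

```python
class ParserContext:
    # holds the state the article keeps in module-level globals:
    # the current token and the token source
    def __init__(self, tokens):
        self._next = iter(tokens).__next__
        self.token = self._next()

    def advance(self, id=None):
        # same contract as the global advance() helper; a real token
        # stream would end with an "(end)" token, so _next would not
        # run dry in normal use
        if id and self.token != id:
            raise SyntaxError("Expected %r" % id)
        self.token = self._next()
```

The nud and led methods would then take the context as an extra argument (or be bound to it), instead of touching globals.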

All code for the interpreters and translators shown in this article is included in the article itself. Assorted code samples are also available from: