Reminder - recursive descent (RD) parsers Here's an article I wrote on the subject a few months ago. It provides a good introduction on how RD parsers are constructed and what grammars they can parse. Here I want to focus on a couple of problems with the RD parser developed in that article, and propose solutions.

Problem #1: operator associativity If you recall from the previous article, the expr rule of the parser looks like this (BNF notation): <expr> : <term> + <expr> | <term> - <expr> | <term> It's built this way ( expr on the right-hand side of the expression, term on the left-hand side), to avoid left-recursion in the grammar, which can crash a RD parser by sending it wheeling in an infinite loop. But as I hinted in the footnotes (and some readers caught on in the comments), this injects an associativity problem into the grammar. Let's see why. Wikipedia is much better than me at explaining what operator associativity is, so I'll assume you've read and understood it. In short, however, left associativity of the minus operator means that 5 - 1 - 2 = (5 - 1) - 2 and not 5 - (1 - 2) (which returns a different result). But if you run 5 - 1 - 2 in the parser with the above BNF for expr , you'll get 6 instead of 2. So what went wrong? The problem is in the grammar definition (BNF) itself. The way the expr rule is defined makes it inherently right-associative instead of left-associative. The hierarchy of the rules implicitly defines their associativity, because it defines what will be grouped together. To understand it better, perhaps the code implementing the expr rule will help: def _expr ( self ): lval = self ._term() if self .cur_token.type == '+' : self ._match( '+' ) op = lambda a, b: a + b elif self .cur_token.type == '-' : self ._match( '-' ) op = lambda a, b: a - b else : print 'returning lval = %s' % lval return lval rval = self ._expr() print 'lval = %s, rval = %s, res = %s' % ( lval, rval, op(lval, rval)) return op(lval, rval) Note that the first term is parsed, and then the rule recursively calls itself for the next one. So the expression is being built from right to left, and this causes its right-associativity. As you can see, I've added a couple of printouts to better show what's going on. When run on the expression 5 - 1 - 2 , this prints: returning lval = 2 lval = 1, rval = 2, res = -1 lval = 5, rval = -1, res = 6 We clearly see the problem here. The actual returns are done from right to left because of the recursion. Note that this grammar evaluates addition, multiplication, subtraction and division in a right-associative way. This causes problems for both subtraction and division, but not for addition and multiplication, because these operations compute the same whether right-to-left or left-to-right .

A solution for the associativity problem I suppose the problem can be solved by rewriting the BNF rules in some sophisticated way that makes them both left-associative and not left-recursive , but I'll pick another way. BNF is somewhat limiting, since it doesn't really allow much options when defining rules. All the rules must have a very strict structure, and if you want to customize something you must resort to defining sub-rules and referencing them recursively. Enter EBNF. It was developed to fix some of the deficiencies of plain BNF. One of those is the addition of repetition of sub-rules. For instance, we can write the expr rule in EBNF as follows: <expr> : <term> {+ <term>} | <term> {- <term>} Note the braces { ... } . In EBNF, these mean "repeated 0 or more times". This is still a LL(1) grammar, but now it's expressed a bit more comfortably. Such a representation is very suitable for coding, because the repetition can be expressed naturally with a loop. Here's a re-implementation of the expr rule using this idiom: def _expr ( self ): lval = self ._term() while ( self .cur_token.type == '+' or self .cur_token.type == '-' ): if self .cur_token.type == '+' : self ._match( '+' ) lval += self ._term() elif self .cur_token.type == '-' : self ._match( '-' ) lval -= self ._term() return lval Note the while loop "eating up" all successive terms in the expression and accumulating the result in the expected left-to-right manner. Now the computation 5 - 1 - 2 will correctly produce 2 .

The code This is a good place to refer to the code. In here you will find the source of both the old (BNF-based) parser and the new (EBNF-based) one, along with the lexer module that implements the tokenizer. Each of the parsers is self contained and can be used separately. Note that they were developed and tested with Python 2.5

Right-associative operators Some operators are inherently right-associative. Exponentiation, for example. 2^3^2 = 2^(3^2) = 512 , and not (2^3)^2 (which equals 64). We can leave these operators defined as before, using a recursive rule that naturally results in right-associativity. Here's the code of the power rule that was added to the EBNF-based parser to support exponentiation: # <power> : <factor> ** <power> # | <factor> # def _power ( self ): lval = self ._factor() if self .cur_token.type == '**' : self ._match( '**' ) lval **= self ._power() return lval

Intermission We now have a correct recursive descent parser that uses EBNF-based rules to parse expressions with the desired associativity for each operator. This parser can be readily employed to parse simple languages - it is production-use ready. The next "problem" I present only has to do with the parser's efficiency, so it is probably of no concern unless performance is crucial.