Once every few years I end up, for one reason or another, having to implement an operator precedence parser. Getting it right is a bit fiddly. I once found a great article about how you do it and tend to just follow that but a while ago that article disappeared.

Now, again, I find myself having to implement an operator precedence parser, in yet another language, and so I dug up the old article on archive.org and decided: since I find it super useful and it’s completely gone from the web, maybe I should just host a copy here. So here it is. It’s a design doc from the sugar parser library written by Douglas Gregor. So, to be clear, I didn’t write this doc I’m just reposting it here (with a few minimal formatting changes) because it’s good and that way it’s at least available somewhere.

If anyone, particularly Douglas Gregor, has an opinion or an objection please leave a comment.

Shift/Reduce Expression Parsing

Introduction

The Sugar expression parser is similar to the class of operator-precedence parsers, but has been extended to support common requirements when parsing expressions, such as function application, confix (grouping) operators, and operator name disambiguation. Additionally, Sugar is intended to be usable without any precompiling phase, making it ideal for rapid or on-the-fly construction of expression parsers.

Expressions

For the purposes of this document, an expression is a sequence of operators and operands, where operators fall into one of the following categories:

Type Arity Placement Examples Prefix Unary Prior to operand Unary minus Postfix Unary After operand Factorial Infix Binary Between operands Addition, multiplication, and division Confix Unary Surrounding operand Parentheses, half-open ranges Function application Binary After first operand and surrounding second operand Mathemetical functions (sin(x)), array indexes(a[5])

The confix and function application operators are essentially split into their component parts, an open symbol and a close symbol, during the parsing phase. The “open” symbol will occur on the left-hand side and the “close” symbol will occur on the right-hand side.

Constructing the parser

The expression parser is a shift/reduce parser with zero lookahead that utilizes two separate stacks: one for operators and one for operands. Any operands in the input stream are immediately shifted onto the operator stack; operators are immediately shifted onto the operator stack only if the operator stack is empty. Otherwise, the following table determines the action of the parser depending on the type of the operator on top of the operator stack and on the type of the current operator token.

Current operator Prefix Postfix Infix Confix Open Confix/Function Close Function Open End of Input Top

of

Stack Prefix shift precedence precedence shift reduce precedence reduce Postfix – reduce reduce – reduce reduce reduce Infix shift precedence precedence/associativity shift reduce precedence reduce Confix Open shift shift shift shift shift shift reduce Confix/Function Close reduce reduce reduce reduce reduce reduce reduce Function Open shift shift shift shift shift shift reduce

Description of parsing actions

A shift operation pushes the current operator token onto the operator stack.

A reduce operation pops the operator token off the top of the operator stack, and then pops the appropriate number of operands from the operand stack. Then the operator is applied to the operand(s) and the result is pushed back on the operand stack. Reduction of confix operators and of function application requires popping two operators off the operator stack.

A precedence operation compares determines the relative precedence of the operator on top of the operator stack (top) and the current operator (current). If top has a lower precedence than current, shift. If top has a higher precedence than current, reduce.

A precedence/associativity operation first compares the precedence according to the precedence operation: if the precedence is equivalent, associativity is considered: If top associates left of current, reduce. If top associates right of current, shift.



Rejecting Invalid Expresions

Operator-precedence parsers are often not used because they accept invalid strings. The shift-reduce parser as specified above will consider the expressions x + x, + x x, and x x + equivalent, even though only the first form is correct. This weakness is easily remedied with the use of the following state machine to track what type of operator or operand is expected at any given point in time.

Disambiguation of Operator Names

The state machine contains three states: the pre-operand state where we collect confix open and prefix operators while waiting for an operand, the post-operand state where we have received an operand and are applying postfix operators to it and closing confix operators or finishing function calls, and finally an error state that will be entered when an invalid parse is detected.

Within many domains, certain operators are reused in different contexts. Several obvious examples are the unary and binary minus operators that use the same symbol ‘-‘, the absolute-value confix operator that uses the symbol ‘|’ as both its open and close symbol, and the ‘+’ operator for regular expressions that is both a postfix positive closure operator and an infix operator for specifying alternatives.

Disambiguation of operator names is in many cases directly related to the state machine used to identify invalid sequences. Given any operator name, we determine the set of operator types that it may belong to. We then intersect this with the set of operator types that are valid at our current state within the state machine to determine role(s) this operator may play in this context. Several cases are left ambiguous by this intersection. These cases are considered below with either a specific resolution or are considered impossible by this class of parser.

Disambiguation at this phase requires lookahead of one additional token, and is also based on the state machine. Disambiguation is possible when the possible meanings of the operator differ in the states that will result from their interpretation. For instance, if a given operator is both postfix and infix, the postfix interpretation would remain in the post-operand state whereas the infix interpretation would transfer to the pre-operand state. Looking ahead one symbol, we can determine if the next symbol would be valid in either state: if it is valid in only one of the resulting states, we can disambiguate the prior (non-lookahead) symbol to ensure that the appropriate state is reached so that the lookahead symbol will not result in a parse error.

{prefix, confix open} : ambiguous (requires arbitrary lookahead).

: ambiguous (requires arbitrary lookahead). {postfix, infix} : single lookahead disambiguation based on state.

: single lookahead disambiguation based on state. {confix/function close, infix} : single lookahead disambiguation based on state.

: single lookahead disambiguation based on state. {function open, infix} : ambiguous (requires arbitrary lookahead).

: ambiguous (requires arbitrary lookahead). {postfix, confix/function close} : ambiguous (requires arbitrary lookahead).

: ambiguous (requires arbitrary lookahead). {postfix, function open} : single lookahead disambiguation based on state.

: single lookahead disambiguation based on state. {function open, function/confix close}: single lookahead disambiguation based on state.

Parsing Examples

Mathematical Expressions

Parse the expression x * |y+z| + -3^x^y using the standard mathematical rules for precedence and associativity.

State Operand Stack Operator Stack Token Token type Action Pre x operand shift Post x * infix operator shift Pre x * | confix open or confix close disambiguate as confix open, shift Pre x * (confix open |) y operand shift Post x y * (confix open |) + infix or prefix operator disambiguate as infix, shift Pre x y * (confix open |) + z operand shift Post x y z * (confix open |) (infix +) | confix open or confix close disambiguate as close, reduce Post x (y+z) * (confix open |) | confix open or confix close disambiguate as close, reduce Post x (|y+z|) * + infix or prefix disambiguate as infix, compare precedence, reduce Post (x * (|y+z|)) + infix or prefix disambiguate as infix, shift Pre (x * (|y+z|)) (infix +) – infix or prefix disambiguate as prefix, shift Pre (x * (|y+z|)) (infix +) (prefix -) 3 operand shift Post (x * (|y+z|)) 3 (infix +) (prefix -) ^ infix compare precedence, shift Pre (x * (|y+z|)) 3 (infix +) (prefix -) ^ x operand shift Post (x * (|y+z|)) 3 x (infix +) (prefix -) ^ ^ infix compare precedence, compare associativity, shift Pre (x * (|y+z|)) 3 x (infix +) (prefix -) ^ ^ y operand shift Post (x * (|y+z|)) 3 x y (infix +) (prefix -) ^ ^ end end reduce Post (x * (|y+z|)) 3 (x^y) (infix +) (prefix -) ^ end end reduce Post (x * (|y+z|)) (3^(x^y)) (infix +) (prefix -) end end reduce Post (x * (|y+z|)) (-(3^(x^y))) (infix +) end end reduce Post ((x * (|y+z|)) + (-(3^(x^y)))) empty end end accept

Douglas Gregor

Last modified: Sat Aug 18 12:46:13 EDT 2001