Constructing human-grade parsers

In the traditional conception of a parser, there's a grammar, there are strings that do or don't match the grammar, and there are actions to take in response to inputs that match the grammar, but often little attention is paid to what to do with inputs that don't match. Most of us have probably encountered a tool that produces obtuse, grammar-oriented errors of the "expected ';' or '∑' or 'blörp'" sort. Diagnostic quality aside, when a tool only produces one error at a time like this, the user has to go through a tedious loop of fixing one error, re-running the tool, fixing the next error, and so on, until they get a successful parse. Stopping at the first error is the right thing to do for non-user-facing parsers, such as those for serialization formats and wire protocols, when they encounter strings that don't match their grammar. But for tools that work with text that gets directly edited by human beings, like source code and configuration files, parse errors are much more likely, and working with a naive parser that can only report localized parse errors one at a time quickly gets irritating. A tool is much more user-friendly if each run produces as many diagnostics as possible, allowing the user to fix them all in one editing pass and have a higher likelihood of a successful run next time. For a user-facing tool, parse errors should be recoverable, and the parser should be able to keep going to collect multiple errors.

Programming language compilers also usually run multiple passes over their input. After parsing, compilers perform type checking and other semantic analysis on the parse result to check additional invariants. In order to maximize the diagnostic coverage a compiler can achieve in a single run, parsing should be able to produce partial output even in the face of errors, so that analysis passes after parsing can process the well-formed parts of the file. This matters when running a compiler from the command line, but it's even more critical when a compiler is used interactively as part of an IDE. When a user is actively editing a source file, the compiler is constantly going to be exposed to inconsistent states, but the user still expects accurate information about the parts of the file they aren't actively editing. Features like code completion especially need to be able to use semantic information from the well-formed parts of a file in order to assist with editing the incomplete parts.

So we want to build parsers that can recover from parse errors and produce partial output with the well-formed parts. Thinking about it a different way, we want parsing to always succeed at producing some kind of structured result. The result can contain error nodes inside it, but the error nodes don't have to replace the entire result. How do we make a parser that always succeeds, and how exactly do we recover when we find a parse error? We can look at both problems from the perspective of designing the grammar. Effectively, we want to take a grammar and extend it to make it total, so that every string matches a rule, by adding rules for erroneous inputs.

Doing this well is an art, and requires some human judgment and iteration to get the best results. To get an idea of how we might go about it, let's consider a toy example. We'll look at a language for arithmetic expressions, with the basic operators "+", "-", "*", and "/", where "*" and "/" bind more tightly than "+" and "-", and parentheses "()" can be used to group expressions against their normal precedence. The grammar might look something like this in EBNF:

    expr     ::= mul-expr (add-op mul-expr)*
    mul-expr ::= term (mul-op term)*
    term     ::= number | '(' expr ')'
    add-op   ::= '+' | '-'
    mul-op   ::= '*' | '/'
    number   ::= add-op? [0-9]+

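For example, this grammar matches the string "1+(2-3)*4/5". The only characters it accepts are the digits "0" through "9", the operators "+", "-", "*", and "/", and the parentheses "()". A first step toward making the grammar total is to add rules that accept every other character as an error: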

    term        ::= number | '(' expr ')'
    number      ::= add-op? ([0-9] | error-alnum)+
    error-alnum ::= ([:Letter:] | [:Number:])+
    mul-op      ::= '*' | '/' | error-op
    error-op    ::= [:Mark:] | [:Punctuation:] | [:Symbol:]

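Here, unrecognized letters and numerals are absorbed into number through error-alnum, and any unrecognized mark, punctuation, or symbol character is treated as an operator at the same precedence level as "*" and "/". We can go further and make operators, operands, and parentheses optional wherever they would otherwise be required: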
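As a rough illustration of the character classes above, here is a minimal lexer sketch in which every character maps to some token, so lexing cannot fail. The Token type and the tokenize function are hypothetical names, not part of the original grammar. Skipping whitespace and the catch-all TUnknown case are assumptions added so the function is total, and sign handling is left to the parser.

```haskell
import Data.Char (isDigit, isLetter, isMark, isNumber, isPunctuation, isSpace, isSymbol)

-- Hypothetical token type for illustration only.
data Token
  = TNumber String      -- a run made only of the ASCII digits 0-9
  | TInvalidNum String  -- a run containing letters or non-ASCII numerals (error-alnum)
  | TOp Char            -- one of + - * /
  | TInvalidOp Char     -- any other mark, punctuation, or symbol (error-op)
  | TOpen               -- '('
  | TClose              -- ')'
  | TUnknown Char       -- assumption: catch-all for anything else, e.g. control characters
  deriving (Eq, Show)

-- Total lexer: every input character is either skipped (whitespace, an
-- assumption not covered by the grammar) or turned into a token.
tokenize :: String -> [Token]
tokenize [] = []
tokenize input@(c : cs)
  | isSpace c = tokenize cs
  | isAlnum c =
      let (run, rest) = span isAlnum input
          tok = if all isDigit run then TNumber run else TInvalidNum run
      in tok : tokenize rest
  | c `elem` "+-*/" = TOp c : tokenize cs
  | c == '(' = TOpen : tokenize cs
  | c == ')' = TClose : tokenize cs
  | isMark c || isPunctuation c || isSymbol c = TInvalidOp c : tokenize cs
  | otherwise = TUnknown c : tokenize cs
  where
    isAlnum x = isLetter x || isNumber x
```

The valid operators and parentheses are checked before the error-op guard because they are themselves punctuation or symbol characters; the order of the guards is what keeps them from being reported as errors.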

    expr     ::= mul-expr (add-op mul-expr?)*     -- error if mul-expr is missing
    mul-expr ::= term (mul-op? term?)*            -- error if term or mul-op is missing
               | mul-op term? (mul-op? term?)*    -- error with missing first term
    term     ::= number
               | '(' expr? ')'?                   -- error if empty or closing paren missing
               | expr? ')'                        -- error if opening paren is missing
    number   ::= add-op? ([0-9] | error-alnum)+
               | add-op                           -- error with missing number after sign

    data ArithExpr
      = Add ArithExpr ArithExpr               -- '1 + 2'
      | Sub ArithExpr ArithExpr               -- '1 - 2'
      | Mul ArithExpr ArithExpr               -- '1 * 2'
      | Div ArithExpr ArithExpr               -- '1 / 2'
      | Parens ArithExpr                      -- '(1)'
      | Num Double                            -- '1.23'
      -- invalid productions:
      | InvalidOp String ArithExpr ArithExpr  -- '1?2' (with invalid char '?')
      | InvalidNum String                     -- 'xyz' (with invalid text 'xyz')
      | MissingOp ArithExpr ArithExpr         -- '(1)(2)' (no op in between)
      | MissingNum                            -- '' (no number where expected)
      | MissingOpeningParen ArithExpr         -- '1)'
      | MissingClosingParen ArithExpr         -- '(1'

    -- always returns an ArithExpr, no Maybe or Either
    parseArithExpr :: String -> ArithExpr

    parseArithExpr "1+2*3"
      -- Add (Num 1) (Mul (Num 2) (Num 3))
    parseArithExpr "(1+2)*3"
      -- Mul (Parens (Add (Num 1) (Num 2))) (Num 3)
    parseArithExpr "1&2&3"
      -- InvalidOp "&" (InvalidOp "&" (Num 1) (Num 2)) (Num 3)
    parseArithExpr "1+2+a/0*"
      -- Add (Add (Num 1) (Num 2))
      --     (Mul (Div (InvalidNum "a") (Num 0)) MissingNum)
    parseArithExpr "4)()(5"
      -- MissingOp (MissingOp (MissingOpeningParen (Num 4)) (Parens MissingNum))
      --           (MissingClosingParen (Num 5))
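One possible way to implement parseArithExpr is a small recursive-descent parser over the raw characters. The sketch below is only one reading of the grammar, under several assumptions: whitespace is skipped, signed numbers are left out, anything that is not alphanumeric and not one of "+-*/()" is treated as an invalid operator, and the two paren error nodes carry the sub-expression they wrap (as the example outputs suggest). The helper names (Parsed, expr, continueExpr, addLoop, mulLoop, term) are made up for this illustration.

```haskell
import Data.Char (isAlphaNum, isDigit, isSpace)

data ArithExpr
  = Add ArithExpr ArithExpr | Sub ArithExpr ArithExpr
  | Mul ArithExpr ArithExpr | Div ArithExpr ArithExpr
  | Parens ArithExpr | Num Double
  | InvalidOp String ArithExpr ArithExpr | InvalidNum String
  | MissingOp ArithExpr ArithExpr | MissingNum
  | MissingOpeningParen ArithExpr | MissingClosingParen ArithExpr
  deriving (Eq, Show)

-- Each helper returns the tree built so far plus the unconsumed input.
type Parsed = (ArithExpr, String)

parseArithExpr :: String -> ArithExpr
parseArithExpr = top . expr . filter (not . isSpace)
  where
    -- A ')' left over at the top level has no matching '('.
    top (e, ')' : rest) = top (continueExpr (MissingOpeningParen e) rest)
    top (e, _) = e

expr :: String -> Parsed
expr = uncurry continueExpr . term

-- Resume operator parsing after an already-parsed left operand.
continueExpr :: ArithExpr -> String -> Parsed
continueExpr lhs = uncurry addLoop . mulLoop lhs

addLoop :: ArithExpr -> String -> Parsed
addLoop lhs (c : rest)
  | c == '+' = step Add
  | c == '-' = step Sub
  where
    step node =
      let (operand, afterOperand) = term rest
          (rhs, rest') = mulLoop operand afterOperand
      in addLoop (node lhs rhs) rest'
addLoop lhs rest = (lhs, rest)

mulLoop :: ArithExpr -> String -> Parsed
mulLoop lhs input@(c : rest)
  | c == '*' = step (Mul lhs) rest
  | c == '/' = step (Div lhs) rest
  | c == '(' || isAlphaNum c = step (MissingOp lhs) input  -- operand with no operator
  | c `notElem` "+-)" = step (InvalidOp [c] lhs) rest      -- unrecognized operator
  where
    step node remaining =
      let (rhs, rest') = term remaining in mulLoop (node rhs) rest'
mulLoop lhs rest = (lhs, rest)

term :: String -> Parsed
term ('(' : rest) =
  case expr rest of
    (inner, ')' : rest') -> (Parens inner, rest')
    (inner, rest') -> (MissingClosingParen inner, rest')
term input =
  case span isAlphaNum input of
    ("", _) -> (MissingNum, input)  -- consume nothing, report the gap
    (run, rest)
      | all isDigit run -> (Num (fromInteger (read run)), rest)
      | otherwise -> (InvalidNum run, rest)
```

Every branch either consumes at least one character or returns to its caller, so the parser terminates on any input, and there is no failure case to propagate: errors exist only as nodes in the returned tree.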

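Because the parser always returns a tree, later passes can still pattern-match on the well-formed parts. For instance, a check for division by a literal zero can match Div _ (Num 0) even when the left operand is an error node, as in the "1+2+a/0*" example above.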
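To make the idea of a later pass concrete, here is a hedged sketch of a single tree walk that gathers both the parse errors embedded in the tree and one semantic check, division by a literal zero. The diagnostics function and its message strings are hypothetical. The ArithExpr type is repeated so the snippet stands alone, again assuming the paren error nodes wrap their sub-expression.

```haskell
data ArithExpr
  = Add ArithExpr ArithExpr | Sub ArithExpr ArithExpr
  | Mul ArithExpr ArithExpr | Div ArithExpr ArithExpr
  | Parens ArithExpr | Num Double
  | InvalidOp String ArithExpr ArithExpr | InvalidNum String
  | MissingOp ArithExpr ArithExpr | MissingNum
  | MissingOpeningParen ArithExpr | MissingClosingParen ArithExpr
  deriving (Eq, Show)

-- Walk the whole tree and report every problem found, in source order.
diagnostics :: ArithExpr -> [String]
diagnostics e = case e of
  Add a b -> both a b
  Sub a b -> both a b
  Mul a b -> both a b
  Div a (Num 0) -> diagnostics a ++ ["division by zero"]  -- semantic check
  Div a b -> both a b
  Parens a -> diagnostics a
  Num _ -> []
  InvalidOp op a b -> diagnostics a ++ ["invalid operator " ++ show op] ++ diagnostics b
  InvalidNum s -> ["invalid number " ++ show s]
  MissingOp a b -> diagnostics a ++ ["missing operator"] ++ diagnostics b
  MissingNum -> ["missing number"]
  MissingOpeningParen a -> "missing '('" : diagnostics a
  MissingClosingParen a -> diagnostics a ++ ["missing ')'"]
  where
    both a b = diagnostics a ++ diagnostics b
```

Applied to the tree for "1+2+a/0*", this reports the invalid number, the division by zero, and the missing operand together, which is the "many diagnostics per run" behavior the article argues for.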

Real world languages usually have grammars a lot more involved than this example, and thinking through every possible combination of tokens to give them specific treatment quickly becomes intractable. There's a more general fallback technique we can use to recover from parse errors. Language grammars tend to have natural synchronization points, intentional or not, that unambiguously delimit (or at least have a high likelihood of delimiting) independent subcomponents. For instance, C always ends statements with semicolons, and it groups expressions in matching pairs of parentheses, square brackets, or curly braces. It also reserves keywords like "struct", "if", "while", and so on that always introduce a new declaration or statement. Absent any more specific recovery, a good fallback strategy when a parse error is encountered is to scan ahead to one of these synchronization point tokens. Once a synchronizing token is found, the parser can unwind to the level of syntax the token indicates, filling in the incomplete tree underneath as best as possible. For instance, if a C parser encounters an error in the middle of an expression, it can start scanning, and if it finds a closing paren ")", it can match that to the previous "(" and complete the surrounding paren expression or argument list and continue parsing the expression grammar from there. Otherwise, if the parser finds a semicolon, it can complete the current partial statement and start parsing a new statement. If it finds a closing curly brace, it can match that to the previous opening brace, complete the compound statement, and start parsing a new statement or declaration afterward.
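The scan-ahead strategy can be sketched on a deliberately tiny, made-up statement language, where each statement has the form "name = number ;" and the semicolon is the only synchronization token. The Token and Stmt types and the parseStmts function are hypothetical and far simpler than C. The point is only the shape of the fallback case: skip to the next semicolon, record what was skipped as an error node, and resume with the next statement.

```haskell
-- Hypothetical tokens for a toy "name = number ;" statement language.
data Token = TIdent String | TNum Int | TEquals | TSemi
  deriving (Eq, Show)

data Stmt
  = Assign String Int    -- a well-formed statement
  | InvalidStmt [Token]  -- the tokens skipped while resynchronizing
  deriving (Eq, Show)

parseStmts :: [Token] -> [Stmt]
parseStmts [] = []
-- The one valid statement shape.
parseStmts (TIdent name : TEquals : TNum n : TSemi : rest) =
  Assign name n : parseStmts rest
-- Fallback: scan ahead to the next ';', keep the skipped tokens in an error
-- node, and continue after the ';'. Input is non-empty here, so either some
-- tokens are skipped or a leading ';' is dropped; the parser always advances.
parseStmts toks =
  let (skipped, rest) = break (== TSemi) toks
  in InvalidStmt skipped : parseStmts (drop 1 rest)
```

With the tokens for "x = 1; y = = ; z = 3;", the malformed middle statement becomes a single InvalidStmt node while the statements on either side still parse normally. A fuller parser would apply the same idea at several nesting levels, using closing brackets and statement keywords as additional synchronization tokens.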

Next time you need to write a parser for a configuration or programming language, hopefully this article gives you some ideas on how to architect it so that it can recover after parse errors, produce partial results when errors occur, and thereby allow your tools to give a higher-quality user experience. If you're designing a grammar from scratch, it's also good to think about how your grammar can be parsed in a recoverable way, by considering what kinds of errors or incomplete edits users may make, and what kinds of synchronization points you can design into the grammar so that a parser can recover from malformed input.