Parsing list comprehensions is hard

I have a toy problem that I like to test on anyone who thinks they've “solved” parsing: Haskell list comprehensions. These are given by the grammar:

LISTCOMP ::= "[" EXPR "|" CLAUSE ["," CLAUSE]* "]"
CLAUSE ::= EXPR | PAT "<-" EXPR

The problem is that when parsing a CLAUSE, until you see a “<-”, you don't know whether you've been parsing a pattern PAT or an expression EXPR. This is extra hard because patterns and expressions overlap: (x, Left 2) could be either, but (x, Left (2+3)) is definitely an expression. You can get arbitrarily deep into parsing a pattern before you realize it's actually an expression!

Neither LL nor LR(1) parsers can handle this. In fact, GHC's parser uses a hack: it parses patterns as expressions, and only later checks that the expression it parsed was a valid pattern! This works so long as patterns are a subset of expressions. But if some patterns aren't valid expressions (e.g. OCaml's or-patterns p|q), then you need to get clever.2,3
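To make the "parse as an expression, check later" trick concrete, here is a minimal sketch in Python. The AST classes (Var, Lit, Con, Tup, BinOp) and the function is_valid_pattern are hypothetical names made up for illustration; they are not GHC's actual data types or functions.

```python
from dataclasses import dataclass

# Hypothetical mini-AST for a toy expression language (not GHC's real types).
@dataclass
class Var:
    name: str

@dataclass
class Lit:
    value: int

@dataclass
class Con:        # constructor application, e.g. Left 2
    name: str
    args: list

@dataclass
class Tup:        # tuple, e.g. (x, Left 2)
    items: list

@dataclass
class BinOp:      # e.g. 2 + 3; only valid in expressions
    op: str
    left: object
    right: object

def is_valid_pattern(e):
    """Post-parse check: is this expression tree also a valid pattern?"""
    if isinstance(e, (Var, Lit)):
        return True
    if isinstance(e, Con):
        return all(is_valid_pattern(a) for a in e.args)
    if isinstance(e, Tup):
        return all(is_valid_pattern(i) for i in e.items)
    return False  # BinOp (and anything else) is expression-only

# (x, Left 2) could be either a pattern or an expression.
either = Tup([Var("x"), Con("Left", [Lit(2)])])
# (x, Left (2+3)) is definitely an expression.
expr_only = Tup([Var("x"), Con("Left", [BinOp("+", Lit(2), Lit(3))])])
```

The parser only ever builds expression trees; whenever it later sees “<-”, it runs the check on whatever it just parsed and reports an error if the check fails.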

Naïve recursive descent parsers, which are basically LL(1), can't handle this, but they can resort to a classic trick: backtracking! First, try parsing PAT "<-" EXPR; if that fails, rewind and try parsing a plain EXPR. (With committed choice, as in Parsec or PEGs, the order matters: tried first, EXPR would happily accept the x of x <- xs, and the parser would never go back to reconsider.) Parsec permits this via the try combinator, and PEGs do it by default. One worry here is that backtracking can lead to exponential explosion. I think this isn't a problem for list comprehensions, because expressions can't nest inside patterns.4 (Packrat parsers for PEGs duck the exponential blowup by memoising, anyway.)
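Here is a rough sketch of a hand-rolled backtracking parser in Python, for a deliberately tiny stand-in language: names, integers, tuples, juxtaposition for application, and “+” in expressions only. The token set, the tuple-based tree shapes, and every name in it (Parser, ParseError, parse, and so on) are assumptions for illustration. The clause method attempts PAT "<-" first and rewinds to a plain EXPR if that fails.

```python
import re

TOKEN = re.compile(r"\s*(<-|[A-Za-z_]\w*|\d+|[\[\]|,()+])")

class ParseError(Exception):
    pass

def tokenize(src):
    src, tokens, pos = src.strip(), [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise ParseError(f"bad character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.toks, self.i = tokens, 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def expect(self, tok):
        if self.peek() != tok:
            raise ParseError(f"expected {tok!r}, got {self.peek()!r}")
        self.i += 1

    def starts_atom(self):
        t = self.peek()
        return t is not None and (t == "(" or t.isdigit()
                                  or t[0].isalpha() or t[0] == "_")

    def atom(self, inner):
        # `inner` parses whatever may appear inside parentheses.
        if not self.starts_atom():
            raise ParseError(f"unexpected {self.peek()!r}")
        t = self.peek()
        self.i += 1
        if t != "(":
            return t
        items = [inner()]
        while self.peek() == ",":
            self.i += 1
            items.append(inner())
        self.expect(")")
        return ("tuple", *items) if len(items) > 1 else items[0]

    def app(self, inner):
        # One or more atoms in a row: a lone atom, or an application.
        parts = [self.atom(inner)]
        while self.starts_atom():
            parts.append(self.atom(inner))
        return parts[0] if len(parts) == 1 else ("app", *parts)

    def pat(self):
        return self.app(self.pat)      # patterns have no operators

    def expr(self):
        left = self.app(self.expr)
        while self.peek() == "+":
            self.i += 1
            left = ("+", left, self.app(self.expr))
        return left

    def clause(self):
        start = self.i
        try:
            p = self.pat()
            self.expect("<-")
        except ParseError:
            self.i = start             # backtrack: rewind, forget the attempt
            return ("guard", self.expr())
        return ("gen", p, self.expr())

    def listcomp(self):
        self.expect("[")
        head = self.expr()
        self.expect("|")
        clauses = [self.clause()]
        while self.peek() == ",":
            self.i += 1
            clauses.append(self.clause())
        self.expect("]")
        return (head, clauses)

def parse(src):
    return Parser(tokenize(src)).listcomp()
```

On a clause like (x, Left (2+3)), the pattern attempt gets as far as the “+” before giving up and re-reading the same tokens as an expression. The ParseError from the abandoned attempt is simply discarded, which is where the weaker error messages come from.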

Still, it's tricky to reason about the behavior of backtracking. Backtracking in the wrong place can over-eagerly commit to a wrong parse. And because it “forgets” branches it backtracks out of, it can give sub-par error messages.5 Even so, it's probably the best solution if you're hand-rolling a parser.

Python also has list comprehensions, but neatly sidesteps this issue:

LISTCOMP ::= "[" EXPR STMT* "]"
STMT ::= "if" EXPR | "for" PAT "in" EXPR

The unique prefixes if and for prevent ambiguity; this is LL(1) and easily parsed with recursive descent. This is both pragmatic and ergonomic, at least for list comprehensions; I'd find writing monadic do-notation in this style a bit tedious.
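As a sketch of how little machinery the Python-style grammar needs, here is a recursive descent parser for a cut-down version of it, written in Python. The simplified EXPR and PAT (names, integers, tuples, and “+”) and all the names in the code are assumptions for illustration; this is not how CPython's own parser is written. Note that the parser never rewinds: one token of lookahead picks the branch.

```python
import re

KEYWORDS = {"if", "for", "in"}
TOKEN = re.compile(r"\s*([A-Za-z_]\w*|\d+|[\[\](),+])")

def tokenize(src):
    src, tokens, pos = src.strip(), [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"bad character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.toks, self.i = tokens, 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def next(self):
        t = self.peek()
        self.i += 1
        return t

    def expect(self, tok):
        t = self.next()
        if t != tok:
            raise SyntaxError(f"expected {tok!r}, got {t!r}")

    def atom(self, inner, numbers_ok):
        t = self.next()
        if t == "(":
            items = [inner()]
            while self.peek() == ",":
                self.next()
                items.append(inner())
            self.expect(")")
            return ("tuple", *items) if len(items) > 1 else items[0]
        if t is not None and t not in KEYWORDS:
            if t[0].isalpha() or t[0] == "_" or (numbers_ok and t.isdigit()):
                return t
        raise SyntaxError(f"unexpected {t!r}")

    def pat(self):
        return self.atom(self.pat, numbers_ok=False)

    def expr(self):
        left = self.atom(self.expr, numbers_ok=True)
        while self.peek() == "+":
            self.next()
            left = ("+", left, self.atom(self.expr, numbers_ok=True))
        return left

    def stmt(self):
        # Only called when the next token is "if" or "for".
        if self.next() == "if":
            return ("if", self.expr())
        target = self.pat()        # after "for" we know a pattern comes next
        self.expect("in")
        return ("for", target, self.expr())

    def listcomp(self):
        self.expect("[")
        head = self.expr()
        stmts = []
        while self.peek() in ("if", "for"):   # one token decides the branch
            stmts.append(self.stmt())
        self.expect("]")
        return (head, stmts)

def parse(src):
    return Parser(tokenize(src)).listcomp()
```

Because “for” announces a pattern before the parser has read any of it, the pattern-or-expression question never comes up.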

Of course, the original Haskell-style grammar can be parsed by all-purpose parsing algorithms like GLR, GLL, parsing with derivatives, Earley, and CYK. I'm not sure how efficiently they do it, but Jeffrey Kegler claims that his Marpa parser (a modernized Earley variant) handles practically any unambiguous grammar in linear time. Sounds great! Maybe in the future we won't need fiddly hacks just to parse Haskell-style list comprehensions.
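For a taste of the general-purpose route, here is a small textbook-style Earley recognizer in Python, run on a toy version of the Haskell-style grammar. It is a sketch under assumptions: the grammar is my own simplification (no empty productions, “+” as the only operator), it only answers yes or no rather than building a tree, and it is plain Earley, not Marpa, so it says nothing about Marpa's performance.

```python
import re

# Toy grammar; NAME and NUM are token classes, other non-keys are literals.
GRAMMAR = {
    "LISTCOMP": [("[", "EXPR", "|", "CLAUSES", "]")],
    "CLAUSES":  [("CLAUSE",), ("CLAUSES", ",", "CLAUSE")],
    "CLAUSE":   [("EXPR",), ("PAT", "<-", "EXPR")],
    "EXPR":     [("APP",), ("EXPR", "+", "APP")],
    "APP":      [("ATOM",), ("APP", "ATOM")],
    "ATOM":     [("NAME",), ("NUM",), ("(", "EXPRS", ")")],
    "EXPRS":    [("EXPR",), ("EXPRS", ",", "EXPR")],
    "PAT":      [("PATOM",), ("PAT", "PATOM")],
    "PATOM":    [("NAME",), ("NUM",), ("(", "PATS", ")")],
    "PATS":     [("PAT",), ("PATS", ",", "PAT")],
}

TOKEN = re.compile(r"\s*(<-|[A-Za-z_]\w*|\d+|[\[\]|,()+])")

def tokenize(src):
    src, tokens, pos = src.strip(), [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise ValueError(f"bad character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def matches(sym, tok):
    if sym == "NAME":
        return tok[0].isalpha() or tok[0] == "_"
    if sym == "NUM":
        return tok.isdigit()
    return sym == tok

def recognize(tokens, start="LISTCOMP"):
    # chart[k] holds items (lhs, rhs, dot, origin) that end at position k.
    chart = [set() for _ in range(len(tokens) + 1)]
    chart[0] = {(start, rhs, 0, 0) for rhs in GRAMMAR[start]}
    for k in range(len(tokens) + 1):
        work = list(chart[k])
        while work:
            lhs, rhs, dot, origin = work.pop()
            new = []
            if dot < len(rhs) and rhs[dot] in GRAMMAR:       # predict
                new = [(rhs[dot], r, 0, k) for r in GRAMMAR[rhs[dot]]]
            elif dot < len(rhs):                             # scan
                if k < len(tokens) and matches(rhs[dot], tokens[k]):
                    chart[k + 1].add((lhs, rhs, dot + 1, origin))
            else:                                            # complete
                new = [(l, r, d + 1, o) for (l, r, d, o) in chart[origin]
                       if d < len(r) and r[d] == lhs]
            for item in new:
                if item not in chart[k]:
                    chart[k].add(item)
                    work.append(item)
    return any(l == start and d == len(r) and o == 0
               for (l, r, d, o) in chart[-1])
```

The recognizer tracks the PAT and EXPR readings of a clause side by side in the same chart, so there is no choice to commit to and nothing to rewind. For example, recognize(tokenize("[x | (x, Left 2) <- xs, (x, Left (2+3))]")) accepts, while the same input with (x, Left (2+3)) to the left of “<-” is rejected.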

Footnotes