Parsing is a surprisingly challenging problem. No wonder I often see simple parsing problems as interview questions. In my own projects, I’ve tortured myself trying to find robust and efficient ways to scrape data from websites. I couldn’t find much help online except for people saying that using regular expressions is a bad approach.

In retrospect, this was one of those times where I simply didn’t know the right keywords to search. I finally feel like I’ve figured it all out, but it was a long journey filled with academic jargon that was hard to understand and often misused. The purpose of this article is to make the theory and practice of parsers more accessible.

I’m going to start out with some theory about formal grammars, because I found it very useful to have when people start throwing around fancy terms like context-free grammar. In the second half of this article, we will build a parser from scratch so you can knock it out of the park in your next interview. This isn’t a quick read, so make sure you have a nice cup of joe and a fresh mind before proceeding.

Theory of Formal Grammars

The theory behind parsers has its roots in Noam Chomsky’s seminal 1956 paper, Three Models for the Description of Language. In this paper, he describes the Chomsky Hierarchy of four classes of formal grammars. Each type is a subset of the type numbered below it (Type 3 ⊂ Type 2 ⊂ Type 1 ⊂ Type 0), distinguished by the power of the automaton required to recognize it. We’ll go through them one by one.

Type 3 — Regular languages

Type 3 languages are called regular languages. Anything you can match with a regular expression is a perfect example of a regular language — hence the name “regular” expression. Any regular language can be recognized by a deterministic finite automaton (DFA). For those less familiar, a DFA is a program that can be represented as a fixed set of states (nodes) and transitions (edges). There are some pretty awesome visualization tools out there for regular expressions based on this insight. For example, check out this one:

A deterministic finite automaton representing the following regular expression: /-?[0-9]+\.?[0-9]*/
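To make the state/transition idea concrete, here is a hand-rolled DFA for that same number regex, as a minimal Python sketch (the state names are mine):

```python
# A hand-built DFA for /-?[0-9]+\.?[0-9]*/, matched against the full string.
# States: "start" (nothing read yet), "sign" (read a minus), "int" (reading
# the integer part), "frac" (reading the fractional part).
DIGITS = set("0123456789")

def matches_number(s: str) -> bool:
    state = "start"
    for ch in s:
        if state == "start":
            if ch == "-":
                state = "sign"
            elif ch in DIGITS:
                state = "int"
            else:
                return False
        elif state == "sign":
            if ch in DIGITS:
                state = "int"
            else:
                return False
        elif state == "int":
            if ch == ".":
                state = "frac"
            elif ch not in DIGITS:
                return False
        elif state == "frac":
            if ch not in DIGITS:
                return False
    return state in ("int", "frac")  # the accepting states

print(matches_number("-12.5"))  # True
print(matches_number("1.2.3"))  # False
```

Notice there is no backtracking anywhere: a DFA reads each character exactly once, which is why matching a regular language can run in linear time.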

John: Some regular expression engines can actually match more than just regular languages. The Oniguruma engine that Ruby uses is a perfect example. Me: Whoa!

Type 2 — Context-Free languages

Type 2 languages are called context-free (a CFG is a context-free grammar) and can be recognized by a pushdown automaton, which is an automaton that maintains extra state on a stack. This part of the hierarchy gets the most attention because most programming languages and domain-specific languages are context-free. A perfect example of a language that is context-free but not regular is “n 0’s followed by n 1’s, for any n”. Mathematicians would define this language with the following notation:

{ 0^n 1^n | n >= 0 }

If we were to try to write this as a regular expression, we’d start by writing something like 0{n}1{n}. But that is not a valid regular expression, because we’d have to specify exactly what n is; this is why the language is not regular. However, we can verify this grammar with a pushdown automaton using the following procedure:

1. If you see a 0, push it onto the stack and repeat this step.
2. If you see a 1, pop an item from the stack and repeat this step. If the stack is already empty, or you see a 0 after a 1, the string is not in the language.
3. When you reach the end of the input, if the stack is empty, we’ve verified that the input string is “in the language”.
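Here is that procedure as a minimal Python sketch (the function name is mine). A plain counter would work too, but the explicit stack mirrors the pushdown automaton:

```python
def in_0n1n(s: str) -> bool:
    """Membership check for { 0^n 1^n | n >= 0 } using an explicit stack."""
    stack = []
    i = 0
    # Step 1: push a marker for every leading 0.
    while i < len(s) and s[i] == "0":
        stack.append("0")
        i += 1
    # Step 2: pop one marker for every 1.
    while i < len(s) and s[i] == "1":
        if not stack:
            return False  # more 1's than 0's
        stack.pop()
        i += 1
    # Accept only if we consumed the whole input and the stack is empty.
    return i == len(s) and not stack

print(in_0n1n("000111"))  # True
print(in_0n1n("0101"))    # False
```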

John: I have had many situations where I needed to parse balanced parentheses, and I think that’s a more relatable case for programmers. Me: Good point! The 0^n 1^n language is essentially a simplified case of the balanced parentheses problem.
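Balanced parentheses use the same stack trick. Here is a minimal sketch handling three bracket kinds (the helper name is mine):

```python
# Map each closing bracket to the opener it must match.
PAIRS = {")": "(", "]": "[", "}": "{"}

def is_balanced(s: str) -> bool:
    stack = []
    for ch in s:
        if ch in "([{":
            stack.append(ch)          # remember the opener
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False          # wrong or missing opener
    return not stack                  # no unclosed openers left

print(is_balanced("([]{})"))  # True
print(is_balanced("([)]"))    # False
```

With only one bracket kind, a single counter would suffice; it’s the mix of bracket kinds that genuinely needs the stack.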

A common way of specifying context-free grammars is with Backus-Naur Form (BNF). This is an excellent article explaining how BNF works, common extensions to BNF (typically called EBNF or ABNF), and how top-down (LL) and bottom-up (LR) parsers work. Some other common terms you might hear are SLR, CLR, and LALR, and this StackOverflow comment does a good job of clarifying those.

You might also hear about something called a parsing expression grammar (PEG). A PEG looks like a CFG, but its choice operator is ordered and greedy, which makes it unambiguous by construction. If you are a programmer rather than a mathematician, a PEG is most often what you are picturing when you think of a grammar.
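The “ordered and greedy” part is easiest to see in code. Here is a toy sketch of PEG-style parsing in Python (illustrative only, not a real PEG library): each parser takes the input and a position, and returns the new position, or None on failure.

```python
def literal(lit):
    """Parser that matches an exact string at the current position."""
    def parse(text, pos):
        return pos + len(lit) if text.startswith(lit, pos) else None
    return parse

def choice(*parsers):
    """PEG ordered choice: the FIRST alternative that succeeds wins,
    and later alternatives are never reconsidered."""
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse

# A <- "ab" / "a"   matches all of "ab" ...
print(choice(literal("ab"), literal("a"))("ab", 0))  # 2
# ... but A <- "a" / "ab" commits to "a" and never sees the longer match.
print(choice(literal("a"), literal("ab"))("ab", 0))  # 1
```

In a CFG both orderings describe the same language; in a PEG the order of the alternatives changes what the grammar matches. That is what “unambiguous by construction” buys you.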

Ambiguity is painful. Markdown is a good example of an ambiguous grammar and this is the primary reason Markdown parsers do not have a formal BNF grammar definition. The following examples are the expected parse results based on the latest CommonMark specification.

***bold**italic* => <em><strong>bold</strong>italic</em>
***italic*bold** => <strong><em>italic</em>bold</strong>
*italic**not bold* => <em>italic**not bold</em>

A mathematician might define a CFG for Markdown using BNF like this:
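One plausible shape for such a grammar (the rules here are illustrative, not from any official spec):

```
inline   ::= emphasis | strong | text | inline inline
emphasis ::= "*" inline "*"
strong   ::= "**" inline "**"
text     ::= any run of characters not containing "*"
```

Notice that nothing here says which * pairs with which; a string like ***bold**italic* has several valid derivations, which is exactly what makes the grammar ambiguous.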

The mathematician would say this is a valid CFG definition and verify that the three examples above are “in the language”. However, if you were a programmer trying to write a Markdown parser, this CFG definition is pretty much useless to you. If you interpreted the definition directly as a parser, it would fail, because a program cannot be ambiguous: it has to commit to one parse. A PEG eagerly matches tokens the way a real program would, whereas the mathematician doesn’t actually care about that.

John: A Markdown parser is a great example, and the ambiguity of the grammar is a pain. Markdown has become a wishy-washy standard. Me: I think it might actually be the other way around. Check out the latest spec for how bold and italics work. I dare you to spend 15 minutes reading that spec, trying to understand it, and writing a simple parser for bold and emphasis that satisfies examples 328–455. Maybe the problem is that CommonMark is so over-specified and ridiculous that no one actually conforms to it.

For unambiguous context-free languages like C, there are all kinds of tools. The most popular are ANTLR, Bison, and YACC. They’re actually called compiler-compilers because they don’t just verify the grammar, they provide tools for generating compilers for those grammars as well. There’s a repo with a bunch of ANTLR grammar examples that are pretty cool to check out. In the JavaScript world, you can also check out PEG.js and Jison.

Type 1 — Context-Sensitive languages

Type 1 languages are called context-sensitive. They can be recognized by an automaton that uses no more memory than the length of the input (a linear bounded automaton). The following language distinguishes Type 1 from Type 2:

{ a^n b^n c^n | n >= 0 }

We can verify that a string is in this language with such an automaton using the following procedure:

1. Read the first a and overwrite it with an x.
2. Move to the first b and overwrite it with an x.
3. Move to the first c and overwrite it with an x. Then move back to the beginning and go to step 1.
4. If you go looking for the first a and find only x’s, then we’ve verified the string.

The thing to recognize here, compared to the previous example, is that this automaton maintains its state by mutating the input memory itself, which is more than just a stack.
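Here is that procedure as a minimal Python sketch (the function name is mine). One detail the prose glosses over: we must also check that the symbols appear in order, all a’s, then b’s, then c’s, which a simple scan (even a DFA) can do; the crossing-out then verifies that the counts match:

```python
def in_anbncn(s: str) -> bool:
    """Membership check for { a^n b^n c^n | n >= 0 }."""
    tape = list(s)  # our "input memory", mutated in place
    # Pre-check: the symbols must appear as an a-block, b-block, c-block.
    expected = "a" * tape.count("a") + "b" * tape.count("b") + "c" * tape.count("c")
    if "".join(tape) != expected:
        return False
    # Cross out one a, one b, and one c per pass.
    while "a" in tape:
        tape[tape.index("a")] = "x"
        if "b" not in tape or "c" not in tape:
            return False  # ran out of b's or c's
        tape[tape.index("b")] = "x"
        tape[tape.index("c")] = "x"
    # Accept only if every cell has been crossed out.
    return all(c == "x" for c in tape)

print(in_anbncn("aabbcc"))  # True
print(in_anbncn("aabbc"))   # False
```

A true linear bounded automaton would do the pre-check by scanning the tape rather than building new strings, but the idea is the same: the only working memory is the input itself.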

Type 0 — Recursively Enumerable languages

Type 0 languages are called recursively enumerable: anything a Turing machine can recognize, even if it runs forever on strings outside the language. (The classic example of a language outside Type 0 is the complement of the halting problem.) A simple yet annoyingly math-heavy example of a language that is Type 0 but not Type 1 is the language of pairs of regular expressions that describe the same language. This equivalence problem is a well-known EXPSPACE problem, so it clearly cannot fit in the linear memory available to a Type 1 automaton.

An Unsolved Conjecture

A common question is: how can we parse ambiguous context-free grammars? Well, if you can prove that some context-free language has no PEG, you’ll be a decorated mathematician!

“It is conjectured that there exist context-free languages that cannot be parsed by a PEG, but this is not yet proven.” — Wikipedia, ACM

Otherwise, for all of your parsing needs, parser combinators are an elegant, one-size-fits-all solution, and that’s what we’re going to talk about next.