Let me start by saying: I was surprised how easy it was to write a grammar for an Earley parser. I have been using regular expressions for over a decade, and I am used to parsing things with them. It's fragile, it's not always possible, etc. But it is fast and for the most part it serves the purpose.

Familiarising myself with parsing algorithms changed this attitude forever.

This is a long article. Therefore, I have used the He-Man meme to keep you entertained throughout the journey. I promise you an all-powerful weapon at the end of the article.

Enjoy the journey ;-)

The reason for writing a parser

I am working on a declarative HTML scraper. The syntax of the scraper depends on a custom DSL, which is an extension of the CSS3 selector specification.

Here is an example of a scraper manifest used to declare what to parse, and how to validate and format the resulting data:

selector: body
properties:
  title: title
  articles:
    selector: article {0,}
    properties:
      body: .body::property(innerHTML)
      summary: .body p {0,}[0]
      imageUrl: img::attribute(src)
      title: .title::text()::test(/foo/)::format(upperCase)

There is a bunch of stuff going on there that isn’t part of the CSS spec:

A quantifier expression ( {0,} )

An accessor expression ( [0] )

An XPath-esque function invocation ( ::test(/foo/)::format(upperCase) )

I needed a way to parse it.

Picking the right tool for the job

My first thought was to use regular expressions. In fact, I used regular expressions to write a prototype of the parser. After all, in the early stages of program design, you need to be able to quickly prototype the solution; the prototype stage is not the time to obsess over edge cases.

This is not to say that regular expressions cannot be used for parsing at all. Regular expressions can parse regular languages; however, CSS selectors form a context-free language, so a regular expression alone cannot fully parse them. By the way, if the terms “context-free” or “regular language” do not make much sense, I recommend reading The true power of regular expressions (5 minutes worth of reading).
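To make the distinction concrete, here is a small illustration of my own (not code from the scraper): a single regular expression can match one level of parentheses, but it cannot count arbitrary nesting, which is exactly the kind of structure context-free languages allow.

```javascript
// A regexp that matches exactly one level of parentheses.
// It cannot be extended to arbitrary nesting depth.
const oneLevel = /^\([^()]*\)$/;

console.log(oneLevel.test('(a)'));   // true
console.log(oneLevel.test('((a))')); // false: nested parens defeat the pattern

// A simple counter handles any depth -- something no single regexp can do.
function balanced(s) {
  let depth = 0;
  for (const ch of s) {
    if (ch === '(') depth++;
    if (ch === ')') depth--;
    if (depth < 0) return false; // a ')' appeared before its '('
  }
  return depth === 0;
}

console.log(balanced('((a))')); // true
```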

However, for the production releases, I needed a strict, extendable parser.

I started looking for parsers in JavaScript and I found Jison and PEG.js. However, neither of their algorithms supports left-recursion. I wanted a parser that supports left-recursion!

The shark-like tank, with powerful jaws.

I kid you not: I didn’t even know what left-recursion was at the time of making this decision. However, I found it odd that these algorithms were singled out as not supporting it. Nevertheless, it was a good hunch. As I learned later, left-recursion allows you to keep the parser grammar simple, and it can make the parser a lot more performant.
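For context, a left-recursive rule is one whose production begins with the nonterminal itself. A classic example (not part of the scraper grammar), in the same BNF-like notation used later in this article:

expression -> expression "+" digit | digit

A naive recursive-descent or PEG parser expanding expression would immediately recurse into expression again without consuming any input, looping forever. An Earley parser handles such rules without trouble.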

Long story short, on the second page of Google search results for “JavaScript parser” I found http://nearley.js.org/, an implementation of the Earley parsing algorithm.

The author describes it as:

Earley parsers are awesome, because they will parse anything you give them. Depending on the algorithm specified, popular parsers such as lex/yacc, flex/bison, Jison, PEGjs, and Antlr will break depending on the grammar you give it. And by break, I mean infinite loops caused by left recursion, crashes, or stubborn refusals to compile because of a “shift-reduce error”.

– Better Earley than never (http://hardmath123.github.io/earley.html)

It sounded like a skill I want to learn.

Who needs all of these other parsers when there is He-Man.

So I continued reading.

Setup

Install nearley package.

$ npm install nearley

nearley consists of the main package (the parser API) and several CLI programs:

$ ls -1 ./node_modules/.bin

nearley-railroad

nearley-test

nearley-unparse

nearleyc

These programs are:

nearley-railroad is used to generate railroad diagrams.

nearley-test is used to test an arbitrary input against the compiled grammar.

nearley-unparse is used to generate random strings that satisfy the grammar.

nearleyc is used to compile Nearley grammar to JavaScript.

To make these programs available in your shell, add ./node_modules/.bin to your $PATH ( export PATH=./node_modules/.bin:$PATH ) or install nearley with the --global option.

We are only going to be using nearleyc and nearley-test .

Parsing “1+2+3”

A parser needs a grammar to parse the input.

The Earley algorithm parses a string based on a grammar in Backus-Naur Form (BNF). A BNF grammar consists of a set of production rules, which are expansions of nonterminals.

A grammar to parse “1+2+3” input is:

expression -> "1+2+3"

In layman’s terms, this grammar says: match “1+2+3” as “expression”.

A nonterminal is a construction of the language. A nonterminal has a name ( expression ) and a list of production rules. A production rule defines what to match. A production rule consists of a series of other nonterminals or strings ( "1+2+3" is a production rule consisting of a single terminal).

Note: expression is an arbitrary name. It does not have a semantic meaning.
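To illustrate how production rules compose, here is a slightly richer grammar of my own (a hypothetical variation, not the one used in this article), in which one nonterminal refers to others:

expression -> digit "+" digit

digit -> "1" | "2"

Here digit has two production rules (the alternatives separated by | ), and the expression rule is a series of a nonterminal, a terminal and another nonterminal. It would match inputs such as “1+2”.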

Testing grammar

To test it, compile the grammar using nearleyc :

$ cat <<'EOF' > ./grammar.ne

expression -> "1+2+3"

EOF

$ nearleyc ./grammar.ne --out ./grammar.js

Instruct nearley-test to use the resulting ./grammar.js to parse an input:

nearley-test ./grammar.js --input '1+2+3'

Table length: 6

Number of parses: 1

Parse Charts

Chart: 0

0: {expression → ● expression$string$1}, from: 0

1: {expression$string$1 → ● "1" "+" "2" "+" "3"}, from: 0

Chart: 1

0: {expression$string$1 → "1" ● "+" "2" "+" "3"}, from: 0

Chart: 2

0: {expression$string$1 → "1" "+" ● "2" "+" "3"}, from: 0

Chart: 3

0: {expression$string$1 → "1" "+" "2" ● "+" "3"}, from: 0

Chart: 4

0: {expression$string$1 → "1" "+" "2" "+" ● "3"}, from: 0

Chart: 5

0: {expression$string$1 → "1" "+" "2" "+" "3" ● }, from: 0

1: {expression → expression$string$1 ● }, from: 0

Parse results:

[ [ '1+2+3' ] ]

Hooray! Our program parsed the string literal “1+2+3”.

Table of partial parsings

It is important to understand the output of nearley-test , because this is the tool you will be using to debug the grammars.

Earley works by producing a table of partial parsings, i.e. the nth column of the table contains all possible ways to parse s[:n] , the first n characters of s .

Chart: 0

0: {expression → ● expression$string$1}, from: 0

1: {expression$string$1 → ● "1" "+" "2" "+" "3"}, from: 0

In the above example, “Chart: 0” is the first column. It shows that we have a single nonterminal expression that can expand to expression$string$1 .

As far as I understand, expression$string$1 is just a temporary variable used to represent terminal structures to avoid repeating them. More about that later.

● is a marker used to denote how far we have parsed. At the moment, we are at the beginning of the string.

As we progress, we continue to match the terminal character by character.

Chart: 1

0: {expression$string$1 → "1" ● "+" "2" "+" "3"}, from: 0

Chart: 2

0: {expression$string$1 → "1" "+" ● "2" "+" "3"}, from: 0

Chart: 3

0: {expression$string$1 → "1" "+" "2" ● "+" "3"}, from: 0

Chart: 4

0: {expression$string$1 → "1" "+" "2" "+" ● "3"}, from: 0

Chart: 5

0: {expression$string$1 → "1" "+" "2" "+" "3" ● }, from: 0

If the entire terminal is matched, the program produces a token for the match.
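To demystify how those chart columns are produced, here is a toy Earley recognizer in plain JavaScript. This is my own illustration under simplified assumptions, not nearley’s actual implementation; the grammar is shaped like the compiled one above, with expression expanding to a helper rule standing in for expression$string$1.

```javascript
// A toy Earley recognizer (an illustration, not nearley's code).
// A symbol is a nonterminal if some rule defines it; otherwise it is a terminal.
const rules = [
  { lhs: 'expression', rhs: ['string1'] },            // mirrors expression -> expression$string$1
  { lhs: 'string1', rhs: ['1', '+', '2', '+', '3'] }, // the inlined string terminals
];
const isNonterminal = sym => rules.some(r => r.lhs === sym);

function earleyChart(input, start) {
  // chart[n] holds all partial parses of the first n characters of the input
  const chart = Array.from({ length: input.length + 1 }, () => []);
  const add = (col, item) => {
    // skip duplicate items within a column
    if (!chart[col].some(i => i.rule === item.rule && i.dot === item.dot && i.from === item.from)) {
      chart[col].push(item);
    }
  };
  // seed column 0 with the start rules; "dot" marks how far we have parsed
  rules.filter(r => r.lhs === start).forEach(r => add(0, { rule: r, dot: 0, from: 0 }));
  for (let col = 0; col <= input.length; col++) {
    for (let i = 0; i < chart[col].length; i++) {
      const { rule, dot, from } = chart[col][i];
      const next = rule.rhs[dot];
      if (next === undefined) {
        // COMPLETE: this rule finished; advance items that were waiting for it
        chart[from]
          .filter(p => p.rule.rhs[p.dot] === rule.lhs)
          .forEach(p => add(col, { rule: p.rule, dot: p.dot + 1, from: p.from }));
      } else if (isNonterminal(next)) {
        // PREDICT: expecting a nonterminal, so add its rules starting here
        rules.filter(r => r.lhs === next).forEach(r => add(col, { rule: r, dot: 0, from: col }));
      } else if (input[col] === next) {
        // SCAN: the next input character matches the expected terminal
        add(col + 1, { rule, dot: dot + 1, from });
      }
    }
  }
  return chart;
}

const chart = earleyChart('1+2+3', 'expression');
// The input is accepted if a start rule is complete in the last column
const accepted = chart[5].some(
  i => i.rule.lhs === 'expression' && i.dot === i.rule.rhs.length && i.from === 0
);
console.log(chart.length, accepted); // 6 true (matches "Table length: 6" above)
```

The six columns this sketch builds correspond to the six charts printed by nearley-test, with the dot advancing one character per column.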