Beyond RegEx

Writing a parser in JavaScript

JavaScript problems (XKCD)

Regular expressions are a great tool for some things. For others, like parsing HTML, not so much. Here is some of what Stack Overflow has to say about that:

You can’t parse [X]HTML with regex. Because HTML can’t be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. […] dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) […]

There is more, but I think you get the idea. While working on Breezy, an isomorphic HTML5 view engine, I luckily didn’t make the mistake of trying to parse HTML with regular expressions. But I also learned that it takes much less than that to run into a situation where they are not the best tool for the job.

The regular expression rabbit hole

It all started with allowing simple property replacements in a string using double curly braces, for example {{property}}, and nested paths like {{path.to.property}}. This is fairly straightforward using a regular expression and some JavaScript logic:

var regex = /\{{2}\s*((\w|\.)+)\s*\}{2}/g;

var text = "{{hello}} from {{some.nested}}";
var data = {
  some: {
    nested: 'nested content'
  },
  hello: 'Hello'
};

var result = text.replace(regex, function(match, group) {
  // Walk the data object along the dot-separated path.
  var path = group.split('.');
  var current = data;
  while (path.length && current) {
    current = current[path.shift()];
  }

  // Fall back to the unmodified match if the path did not resolve.
  return current || match;
});

console.log(result); // "Hello from nested content"

Next, I wanted to allow method calls like {{path.to.method first second}}, which is still doable (just split the content on any number of whitespace characters). Things got more complicated with trying to support truthy and falsy sections, similar to the ternary operator, e.g. {{isEqual name otherName ? yes : no}}. The biggest problem, however, didn’t come up until supporting single- and double-quoted strings, ideally with JSON-style escaping. Now you couldn’t just split on whitespace anymore but actually had to figure out where strings begin and end, and also support escaping (of double quotes, newlines etc.). And all that wasn’t even considering meaningful error messages yet (what if someone forgot a closing quote?).
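To illustrate the problem, here is a quick sketch of how naive whitespace splitting falls apart once quoted strings are involved (the expression below is just an example, not Breezy’s actual tokenizer):

```javascript
// Naive tokenization: split the expression body on whitespace.
var expression = "isEqual name 'John Doe' ? yes : no";
var tokens = expression.split(/\s+/);

// The quoted string is torn apart into two tokens,
// and escaping hasn't even entered the picture yet.
console.log(tokens);
// [ "isEqual", "name", "'John", "Doe'", "?", "yes", ":", "no" ]
```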

That one person (XKCD)

While anybody programming computers for a living probably knows that one person who can (or will at least try to) solve any problem using regular expressions, I remembered from my university days that this is exactly the kind of problem where a parser is the more appropriate tool of choice:

A parser is a software component that takes input data (frequently text) and builds a data structure — often some kind of parse tree, abstract syntax tree or other hierarchical structure — giving a structural representation of the input, checking for correct syntax in the process.

PEG.js — A parser generator for JavaScript

Parsers can be automatically generated from a grammar definition, and the first search result for “JavaScript parser generator” is PEG.js. It has a simple grammar syntax that also allows running JavaScript code when a rule matches. There is a helpful online version to try it out right away, as well as some good example grammars for CSS, JavaScript and JSON (conveniently just what I needed). The first step in defining the grammar was to figure out what expressions should actually look like, in a pseudo-definition like this:

path[.to.method] [args... ] [? truthy] [: falsy]

path can be a single variable name or a dot-separated accessor of variable names (just like JavaScript’s path.to.property). There can be any number of args (in case the property is a function) and each argument can be either a path or a single- or double-quoted string. Finally, there are the optional ? truthy and : falsy sections, which can each again be either a string or a path. The path and arguments are separated by whitespace, which makes a good first rule in our grammar:

ws "whitespace" = [ \t]

This defines a rule called ws with a display name of “whitespace”, matching either a single space or a tab character. The display name is used in error messages, so you get something like “Expected whitespace but saw…” instead of “Expected [ \t] but saw…”. Next we can define some other basic rules:

escape_character = "\\"
doublequote "double quote" = '"'
singlequote "single quote" = "'"
unescaped = [\x20-\x21\x23-\x5B\x5D-\u10FFFF]
HEXDIG = [0-9a-f]i

unescaped and HEXDIG are copied right out of the JSON example grammar. We will need double- and single quotes and the backslash character to escape characters within strings. The next rule defines valid variable names:

variable = $([0-9a-zA-Z_\$]+)

This tells the parser to match one or more (+) numbers, letters, underscores or dollar signs. Wrapping the rule in $() means to return the actual matched text (instead of an array of matched tokens). Next we can also copy the JSON string character definition from the example:
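The same character class can be sanity-checked in plain JavaScript, which is a handy way to see what the rule will and won’t accept (a small illustration, not part of the grammar):

```javascript
// JavaScript regex equivalent of the `variable` rule's character class.
var variablePattern = /^[0-9a-zA-Z_$]+$/;

console.log(variablePattern.test('$el'));       // true
console.log(variablePattern.test('some_var1')); // true
console.log(variablePattern.test('not-valid')); // false (hyphen not allowed)
```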

character =
  unescaped
  / escape_sequence

escape_sequence "escape sequence" = escape_character sequence:(
    doublequote
  / singlequote
  / "\\"
  / "/"
  / "b" { return "\b"; }
  / "f" { return "\f"; }
  / "n" { return "\n"; }
  / "r" { return "\r"; }
  / "t" { return "\t"; }
  / "u" digits:$(HEXDIG HEXDIG HEXDIG HEXDIG) {
      return String.fromCharCode(parseInt(digits, 16));
    }
)
{ return sequence; }

This allows escaping of special characters like newlines, double quotes, single quotes, unicode characters etc. and tells the parser to, for example, return an actual newline character when the sequence \n is matched.
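The unicode branch in particular mirrors what you would do by hand in JavaScript: take the four hex digits of a \uXXXX escape and turn them into the character they encode (a standalone illustration of that action block):

```javascript
// What the "u" branch of escape_sequence does with its four hex digits:
var digits = '0041';
var decoded = String.fromCharCode(parseInt(digits, 16));

console.log(decoded); // "A"
```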

Now that we have all the characters defined, we can add the rule for a string itself. It looks a little different from the strings in the JSON grammar because we want to allow both double- and single-quoted strings:

string "string" =
  doublequote text:(doublequote_character*) doublequote {
    return { type: 'string', value: text.join('') };
  }
  / singlequote text:(singlequote_character*) singlequote {
    return { type: 'string', value: text.join('') };
  }

Here we are basically defining two alternatives: match any number of characters enclosed in either double or single quotes, assign that match (which will be an array of characters) the variable name text, and then execute the JavaScript code in curly brackets, which returns an object representation of the matched string in the form of:

{
  "type": "string",
  "value": "matched string text here"
}

There are two different character matching rules for double- and single-quoted characters. In both cases we want to match any character that is not the quote the string was opened with (because that would end it). In a PEG.js grammar this negative lookahead is done using the ! operator:

doublequote_character =
  (!doublequote) c:character { return c; }

singlequote_character =
  (!singlequote) c:character { return c; }

For this rule we need to return the matched character explicitly because otherwise we’d get an array of matches like ['', 'character here'] (the first entry is the match for the !doublequote part, which will always be empty).

Now that we can parse strings and variables we also need to define a rule for dot-separated nested properties which I call path:

path =
  first:variable rest:("." s:variable { return s; })* {
    return {
      type: 'path',
      value: [first].concat(rest)
    };
  }

It matches at least one variable name and then any number of additional dot-separated variable names. The nice thing here is that the parser already breaks the path into an array of variable names for us. For example, for path.to.property we get:

{
  "type": "path",
  "value": [ "path", "to", "property" ]
}

Arguments (for function parameters and truthy and falsy blocks) can either be a path or a string:

argument = path / string

A function call parameter is an argument preceded by at least one whitespace character:

parameter = ws+ a:argument { return a; }

Truthy and falsy sections consist of a question-mark or colon separated by at least one whitespace and an argument. As the result we will return the matched argument object. Both are optional, so we also add a rule to match an empty string:

truthy =
  ws+ "?" ws+ arg:argument { return arg; }
  / ""

falsy =
  ws+ ":" ws+ arg:argument { return arg; }
  / ""

And these are all the rules we need to parse an expression. The main rule comes together almost naturally from the pseudo-definition above:

expression =
  main:path args:parameter* truthy:truthy falsy:falsy {
    return {
      type: 'expression',
      path: main.value,
      args: args,
      truthy: truthy || null,
      falsy: falsy || null
    };
  }

Altogether, an expression consists of a path (called main), zero or more parameters (called args), a truthy block and a falsy block. The rule returns an object combining everything our individual rules matched, parses strings with escaping, is easily extensible and throws useful error messages (like in which column the error happened and what was expected instead). For example, the expression

helpers.equal name 'David' ? "This is David"

will return an object like this:

{
  "type": "expression",
  "path": [ "helpers", "equal" ],
  "args": [
    {
      "type": "path",
      "value": [ "name" ]
    },
    {
      "type": "string",
      "value": "David"
    }
  ],
  "truthy": {
    "type": "string",
    "value": "This is David"
  },
  "falsy": null
}
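The parser only builds this object; evaluating it is plain JavaScript. A minimal evaluator could look like the sketch below. It assumes that path arguments are resolved against a data object and that a resolved function is called with the resolved arguments; the data and helper names are hypothetical, and a real view engine would handle more cases:

```javascript
// Resolve a path array like ["helpers", "equal"] against a data object.
function resolvePath(data, path) {
  var current = data;
  for (var i = 0; i < path.length && current; i++) {
    current = current[path[i]];
  }
  return current;
}

// Resolve one argument node ({ type: 'path' | 'string', value: ... }).
function resolveArgument(data, node) {
  return node.type === 'string' ? node.value : resolvePath(data, node.value);
}

// Evaluate a parsed expression object against a data object.
function evaluate(data, expr) {
  var target = resolvePath(data, expr.path);
  var result = typeof target === 'function'
    ? target.apply(data, expr.args.map(function(a) { return resolveArgument(data, a); }))
    : target;

  // With a truthy/falsy section, pick the branch based on the result.
  if (expr.truthy || expr.falsy) {
    var branch = result ? expr.truthy : expr.falsy;
    return branch ? resolveArgument(data, branch) : '';
  }
  return result;
}

// Hypothetical data matching the example expression above.
var data = {
  name: 'David',
  helpers: {
    equal: function(a, b) { return a === b; }
  }
};

var expr = {
  type: 'expression',
  path: ['helpers', 'equal'],
  args: [
    { type: 'path', value: ['name'] },
    { type: 'string', value: 'David' }
  ],
  truthy: { type: 'string', value: 'This is David' },
  falsy: null
};

console.log(evaluate(data, expr)); // "This is David"
```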

The last step is to embed those expressions into arbitrary other text. For this we need three more basic rules: two for the opening and closing tags around expressions and one for matching pretty much any character:

open = "{{"
close = "}}"
any = .

A rule for an expression that is enclosed by those tags:

enclosedexpression =
  open ws* e:expression ws* close { return e; }

Then a rule for matching any other text which means anything that does not open an enclosed expression:

text =
  characters:$((!open) c:any)+ {
    return {
      type: 'text',
      value: characters
    };
  }

And finally the top level start rule that will match any number of enclosed expressions or normal text:

start = (text / enclosedexpression)*

This means we will now get an array of text blocks and expressions. You can look at the complete grammar in this Gist and then try it yourself by copy & pasting it into the PEGJS online version.
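Whatever the grammar returns, consuming it is straightforward. Assuming the start rule’s output shape (an array of text and expression nodes as shown above), a simple renderer could walk that array like this. This is a sketch: the sample parse result is hand-written, and only plain path expressions are handled:

```javascript
// Resolve a path array like ["user", "name"] against a data object.
function resolvePath(data, path) {
  var current = data;
  for (var i = 0; i < path.length && current; i++) {
    current = current[path[i]];
  }
  return current;
}

// Render a parse result (array of text and expression nodes) against data.
function render(nodes, data) {
  return nodes.map(function(node) {
    if (node.type === 'text') return node.value;
    // Expression node: resolve the main path (args/truthy/falsy omitted).
    var value = resolvePath(data, node.path);
    return value != null ? value : '';
  }).join('');
}

// A result array like the parser would produce for "Hello {{user.name}}!".
var parsed = [
  { type: 'text', value: 'Hello ' },
  { type: 'expression', path: ['user', 'name'], args: [], truthy: null, falsy: null },
  { type: 'text', value: '!' }
];

console.log(render(parsed, { user: { name: 'David' } })); // "Hello David!"
```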

I hope this small walkthrough of how and when to use a parser in JavaScript makes it easier for you to decide next time whether a regular expression is the right tool for the job or whether you are heading down a rabbit hole.