Disclaimer: I’m going to use JavaScript for the examples, but the concepts apply to any language.

Let’s start from the answer, literally:

Now, if we want to do anything with this program (executing it, analyzing it, formatting it…) we have to transform it into a data structure that we can work with. The first step towards this goal is usually called tokenization, i.e. identifying the smallest sequences of characters (tokens) that carry some meaning:

Tokens represent the alphabet of a language: they can’t be broken into smaller parts and they can be combined together to form a program.
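To make tokenization concrete, here is a minimal sketch of a tokenizer. It assumes the program is a one-line declaration such as `let answer = 42;` (my stand-in, not necessarily the original snippet), and the token type names and the `tokenize` helper are made up for illustration; real tokenizers handle far more cases.

```javascript
// Token patterns, tried in order. The type names are illustrative only.
const TOKEN_PATTERNS = [
  ["Keyword", /^(?:let|const|var)\b/],
  ["Identifier", /^[A-Za-z_$][A-Za-z0-9_$]*/],
  ["Numeric", /^\d+(?:\.\d+)?/],
  ["Punctuator", /^[=;]/],
];

function tokenize(source) {
  const tokens = [];
  let rest = source;
  while (rest.length > 0) {
    // Whitespace separates tokens but is not a token itself.
    const space = /^\s+/.exec(rest);
    if (space) {
      rest = rest.slice(space[0].length);
      continue;
    }
    // Pick the first pattern that matches at the start of the remaining input.
    const hit = TOKEN_PATTERNS
      .map(([type, pattern]) => [type, pattern.exec(rest)])
      .find(([, match]) => match !== null);
    if (!hit) {
      throw new SyntaxError(`Unexpected character: ${rest[0]}`);
    }
    const [type, match] = hit;
    tokens.push({ type, value: match[0] });
    rest = rest.slice(match[0].length);
  }
  return tokens;
}

const tokens = tokenize("let answer = 42;");
console.log(tokens.map((t) => t.value));
```

For the assumed input this logs the five token values `let`, `answer`, `=`, `42` and `;`: none of them can be split further without losing its meaning.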

Intuitively, not every combination of valid tokens produces a valid program. Consider for instance:

That is not a valid program. So there must be something that dictates the valid ways of combining tokens; this is usually called a grammar. A grammar defines the relationships between tokens by grouping them into intermediate structures that can be recursively combined together.

We can also say that a grammar describes the syntax of a language.
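One quick way to see a grammar at work is to ask the JavaScript engine itself. The sketch below uses `new Function(source)`, which compiles (but does not run) its argument as a function body and throws a `SyntaxError` when the grammar rejects it. The two input strings and the `isSyntacticallyValid` helper are my own illustration: same tokens, different order.

```javascript
// Returns true when `source` parses as a function body under the engine's grammar.
// new Function() only compiles the code here; the resulting function is never called.
function isSyntacticallyValid(source) {
  try {
    new Function(source);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error; // anything else is unexpected, so let it surface
  }
}

console.log(isSyntacticallyValid("let answer = 42;")); // true
console.log(isSyntacticallyValid("= 42 let answer ;")); // false
```

Every token in the second string is valid on its own; it is only the arrangement that the grammar refuses.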

For instance:

Here the grammar tells us that this is a valid VariableDeclaration, which is composed of one or more VariableDeclarators, each of which has a left-hand side that is an Identifier and a right-hand side that can be any expression; in this case it is simply a NumericLiteral.

You may have noticed that these structures are arranged in a tree structure, and since they represent the syntax of a language it is natural to call them Syntax Trees.
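As a rough illustration, here is that tree written by hand as a plain JavaScript object, assuming the program is `let answer = 42;`. The node names follow the ones used in this post; the property names (`declarations`, `id`, `init`) follow the ESTree-style convention, and the `outline` helper is just something I made up to print the shape of the tree.

```javascript
// Hand-written tree for an assumed program: let answer = 42;
const ast = {
  type: "VariableDeclaration",
  kind: "let",
  declarations: [
    {
      type: "VariableDeclarator",
      id: { type: "Identifier", name: "answer" }, // left-hand side
      init: { type: "NumericLiteral", value: 42 }, // right-hand side
    },
  ],
};

// Collect the node types depth-first, indenting by depth to show the tree shape.
function outline(node, depth = 0, lines = []) {
  lines.push("  ".repeat(depth) + node.type);
  for (const value of Object.values(node)) {
    const children = Array.isArray(value) ? value : [value];
    for (const child of children) {
      // Only objects carrying a string `type` are treated as tree nodes.
      if (child && typeof child === "object" && typeof child.type === "string") {
        outline(child, depth + 1, lines);
      }
    }
  }
  return lines;
}

console.log(outline(ast).join("\n"));
```

The printed outline nests VariableDeclarator under VariableDeclaration, with Identifier and NumericLiteral as its two leaves.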

We have only one question left: why Abstract?

Let’s consider a few more variants of the previous example:

What is the syntax tree of these variants? It turns out that it’s the same as the original example. Things like spaces, formatting and semicolons are usually ignored in these tree representations because they don’t generally carry useful information.

And that’s why these trees are called Abstract: they are not a faithful concrete representation of the original source code, rather an abstraction that discards some details to focus on the syntactic structure instead.
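Here is a small sketch of that idea. It is a toy parser that only understands declarations shaped like `let <name> = <number>`, with an optional trailing semicolon; the `parseDeclaration` helper and the three input variants are my own assumptions, not the original examples. The point is only that differently formatted sources collapse into one and the same tree.

```javascript
// Identifiers and keywords, integers, and the two punctuators this toy grammar needs.
const TOKEN = /[A-Za-z_$][\w$]*|\d+|[=;]/g;

function parseDeclaration(source) {
  // Anything left after removing tokens and whitespace is a character we do not know.
  if (source.replace(TOKEN, "").trim() !== "") {
    throw new SyntaxError(`Unexpected character in: ${source}`);
  }
  // Matching token by token drops spaces and line breaks right away.
  const [kind, name, equals, number, ...rest] = source.match(TOKEN) || [];
  const tailOk = rest.length === 0 || (rest.length === 1 && rest[0] === ";");
  const ok =
    ["let", "const", "var"].includes(kind) &&
    /^[A-Za-z_$]/.test(name || "") &&
    equals === "=" &&
    /^\d+$/.test(number || "") &&
    tailOk;
  if (!ok) throw new SyntaxError(`Cannot parse: ${source}`);
  // The semicolon and the layout never make it into the tree.
  return {
    type: "VariableDeclaration",
    kind,
    declarations: [
      {
        type: "VariableDeclarator",
        id: { type: "Identifier", name },
        init: { type: "NumericLiteral", value: Number(number) },
      },
    ],
  };
}

const variants = [
  "let answer = 42;",
  "let answer=42",
  "let   answer\n  =\n      42 ;",
];
const trees = variants.map((source) => JSON.stringify(parseDeclaration(source)));
console.log(trees.every((tree) => tree === trees[0])); // true
```

One caveat: real parsers usually attach source positions to every node, and those do differ between variants; it is the structure of the tree that stays the same.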

A word about JavaScript AST(s)

If you’re specifically interested in JavaScript, at this point you may be wondering: ok, where is the definition for the JavaScript AST? Good question, and the answer is: there’s more than one.

It turns out that there are many existing and competing ASTs of JavaScript (and variants), used by several different parsers (as a refresher, a parser is a program that produces an AST given a source file).

I’ve tried to summarize them in a table:

You’ll notice that ESTree (which started as the specification of Firefox's internal AST) is pretty much the default for every parser out there.

babylon (the parser that powers babel) uses the Babylon AST, which is also a variant of ESTree, with a few deviations.

flow (a static type checker for JavaScript) uses ESTree as well, extending it with custom nodes for type annotations and type-related structures.

TypeScript is the only one using an independently developed AST, most likely because its development predates the de-facto standardization of ESTree as a lingua franca.

This “tree mismatch” is the reason why it took some effort to add TypeScript compatibility to existing tools such as prettier and ESLint: they both work with a completely different data structure (ESTree).
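To give a feel for the mismatch, here is a heavily simplified sketch of what bridging the two trees involves. The input object is a hand-written stand-in for the TypeScript tree of `let answer = 42;`: the property names (`declarationList`, `declarations`, `name`, `initializer`, `text`) mirror the TypeScript compiler's node shapes as far as I know, but real nodes store `kind` as a numeric `SyntaxKind` rather than a string. The `toESTree` converter is hypothetical and covers only these four node kinds.

```javascript
// Simplified, hand-written stand-in for the TypeScript tree of: let answer = 42;
// Real TypeScript nodes store `kind` as a number; strings are used here for readability.
const tsLikeNode = {
  kind: "VariableStatement",
  declarationList: {
    kind: "VariableDeclarationList",
    declarations: [
      {
        kind: "VariableDeclaration",
        name: { kind: "Identifier", text: "answer" },
        initializer: { kind: "NumericLiteral", text: "42" },
      },
    ],
  },
};

// Hypothetical converter covering only the node kinds used above.
function toESTree(node) {
  switch (node.kind) {
    case "VariableStatement":
      return {
        type: "VariableDeclaration",
        kind: "let", // assumed here; the real tree encodes let/const/var in node flags
        declarations: node.declarationList.declarations.map(toESTree),
      };
    case "VariableDeclaration":
      return {
        type: "VariableDeclarator",
        id: toESTree(node.name),
        init: toESTree(node.initializer),
      };
    case "Identifier":
      return { type: "Identifier", name: node.text };
    case "NumericLiteral":
      // Plain ESTree has a single Literal node type for numbers, strings, etc.
      return { type: "Literal", value: Number(node.text), raw: node.text };
    default:
      throw new Error(`Unsupported node kind: ${node.kind}`);
  }
}

console.log(JSON.stringify(toESTree(tsLikeNode), null, 2));
```

Even in this tiny case the node names, the nesting and the way values are stored all differ, which hints at why a full conversion layer is a non-trivial piece of work.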

This concludes our tour of ASTs, with a dive into the JavaScript world of multiple parsers and AST specifications.

If you are asking yourself “why should I care about ASTs at all?”, don’t worry! I’ll cover some practical applications of this AST knowledge in a future blog post. Stay tuned ;-).

—

PS

If you want to play around with ASTs (as Jordan Dichev suggested in the comments) I highly recommend http://astexplorer.net. It’s a wonderful playground and it has support for many different languages (including Scala!). Definitely check it out!
