February 16, 2009 at 19:40 Tags Articles , Compilation

CSTs - Concrete Syntax Trees (a.k.a. Parse Trees) and ASTs - Abstract Syntax Trees (a.k.a. Syntax Trees) are two terms that come up a lot in compiler & parser nomenclature. However, it is not always understood what is the difference between the two, because they are closely related.

In this article I attempt to explain the difference, at least for myself.

Concrete Syntax Trees CSTs are a representation of grammars in a tree-like form. Here's a definition from the Dragon book : A parse tree pictorially shows how the start symbol of a grammar derives a string in the language. That is, the CST is a one-to-one mapping from the grammar to a tree-form. An example will make this easier to understand. Consider parsing the C statement return a + 2; . Following the ANSI C grammar defined in K&R2 and appearing in its YACC representation in various places on the web, the parse tree of this statement looks thus: And this is not even the complete tree. The path between conditional_expression and additive_expression goes through the sequence: logical_or_expression , logical_and_expression , inclusive_or_expression , exclusive_or_expression , and_expression , equality_expression , relational_expression and shift_expression . I kid you not! There are two more nodes between unary_expression and constant as well. Indeed, this is a very formal representation that strictly shows how the parser understands the statement return a + 2; , but it is not a very useful representation to work with, is it? This is where ASTs come in.

Abstract Syntax Trees ASTs are simplified syntactic representations of the source code, and they're most often expressed by the data structures of the language used for implementation . Quoting from the Dragon book again: Abstract syntax trees, or simply syntax trees, differ from parse trees because superficial distinctions of form, unimportant for translation, do not appear in syntax trees. ASTs don't show the whole syntactic clutter, but represent the parsed string in a structured way, discarding all information that may be important for parsing the string, but isn't needed for analyzing it. Here's an AST for return a + 2; , as generated by pycparser: Looks like something much more manageable, doesn't it? Note that an AST conveys information in the internal nodes as well, and not only at the leaves. So, instead of going through a long list of expression nodes just to represent the type of the expression, a simple op attribute in a BinOp node does the job. Also, the AST doesn't hold syntactic nodes like ';' , because these are useless for analysis. ASTs are usually the last product of the front-end of a compiler. They represent the structure of the code in a compact and useful way that facilitates convenient analysis and further processing.

From a CST to an AST Parsers usually either construct ASTs directly in their actions, or first construct CSTs and then convert them to ASTs. It's interesting to note that CSTs are trivial to build once you have a grammar written , and some tools will build it automatically. After all, a CST is just another representation of the grammar. But beware of tools that promise to automatically build ASTs - this can work in a general sense only for trivial grammars, and what you'll get is most likely a CST rather than an AST. For some languages, it is far from trivial to construct an AST from a CST. A good example is the syntax of type declarations in C, which is known to be tricky. Consider the declaration int* arp[5]; - an array of 5 pointers to int . Here's its parse tree, according to the ANSI C grammar: Clutter aside, there are two other problems here that make analysis difficult: Note that the pointer node is actually above the array, so working through the CST directly, it can be thought that we've declared a pointer to an array! This is the "reverse logic of declarations" problem of C. Where is the array name? It's buried deep down in the tree, which isn't very convenient. Because of these problems, it would be very difficult to work directly with a CST for C, without converting it to AST. pycparser converts to AST on the fly (without generating an explicit CST), and here's the AST that represents int* arp[5]; : Both problems are solved. The pointer is now clearly below the array, and the variable name was raised to the top-most node. This AST can be just read out loud to produce the correct interpretation of the declaration: "arp is an array of 5 pointers to int".