Be sure to check out Part 2 first! If you're encountering this series for the first time, you can find the rest of the posts at the bottom of this article.

The Big Picture

Grammars

Parser

In the context of parsing, parser can refer both to the software that performs the whole process and also just to the proper parser that analyzes the tokens produced by the lexer. This is simply a consequence of the fact that the parser takes care of the most important and difficult part of the whole process of parsing. By most important, we mean what the user cares about the most and will actually see. In fact, as we said, the lexer works as an helper to facilitate the work of the parser.

In any sense, the parser's output is an organized structure of the code — usually, a tree. The tree can be a parse tree or an abstract syntax tree. They are both trees, but they differ in how closely they represent the actual code written and the intermediate elements defined by the parser. The line between the two can be blurry at times. We are going to see their differences a bit better in a later paragraph.

The form of a tree is chosen because it is an easy and natural way to work with the code at different levels of detail. For example, a class in C# has one body, this body is made of one statement, the block statement, that is a list of statements enclosed in a pair of curly brackets, and so on…

Syntactic vs. Semantic Correctness

A parser is a fundamental part of a compiler, or an interpreter, but of course can be part of a variety of other software. For example, in this article, we parsed C# files to produce diagrams.

The parser can only check the syntactic correctness of piece of code, but the compiler can use its output in the process of checking the semantic validity of the same piece of code.

Let's see an example of code that is syntactically correct, but semantically incorrect.

int x = 10 int sum = x + y

The problem is that one variable ( y ) is never defined and thus, if executed, the program will fail. However, the parser has no way of knowing this because it does not keep track of variables — it just looks at the structure of the code.

A compiler instead would typically traverse the parse tree a first time and keep a list of all the variables that are defined. Then, it traverses the parse tree a second time and checks whether the variables that are used are all properly defined. In this example, they are not, and it will throw an error. That is one way the parse tree can also be used to check the semantics by the compiler.

Scannerless Parser

A scannerless parser, or more rarely, a lexerless parser, is a parser that performs the tokenization (i.e. the transformation of sequence of characters in tokens) and proper parsing in a single step. In theory, having a separate lexer and parser is preferable because it allows a clearer separation of objectives and the creation of a more modular parser.

A scannerless parser is a better design for a language where a clear distinction between the lexer and parser is difficult or unnecessary. An example is a parser for a markup language, where special markers are inserted in a sea of text. It can also facilitates the handling of languages where traditional lexing is difficult, like C. That is because a scannerless parser can more easily deal with complex tokenizations.

Issues With Parsing Real Programming Languages

In theory, contemporary parsing is designed to handle real programming languages. In practice, there are challenges with some real programming languages. It might be harder to parse them using normal parsing generator tools.

Context-Sensitive Parts

Parsing tools are traditionally designed to handle context-free languages, but sometimes, the languages are context-sensitive. This might be done to simplify the life of programmers or simply because of bad design. I remember reading about a programmer that thought it could produce a parser for C in a week, but then it found so many corner cases that a year later he was still working on it…

A typical example of context-sensitive elements are soft keywords, i.e. strings that can be considered keywords in certain places but otherwise can be used as identifiers.

Whitespace

Whitespace plays a significant role in some languages. The most well-known example is Python, where the indentation of a statement indicates if it is part of a certain block of code.

In most places, whitespace is irrelevant — even in Python, spaces between words or keywords do not matter. The real problem is the indentation that is used to identify blocks of code. The simplest way to deal with it is to check the indentation at the start of the line and transform in the proper token, i.e. create a token when the indentation changes from the previous line.

In practice, a custom function in the lexer produces INDENT and DEDENT tokens when the indentation increases or decreases. These tokens play the role that, in C-like languages, is played by curly brackets — they indicate the start and end of code blocks.

This approach makes lexing context-sensitive instead of context-free. This complicates parsing and you normally, would not want to do it, but you are forced to do in this case.

Multiple Syntaxes

Another common issue is dealing with the fact that a language might actually include a few different syntaxes. In other words, the same source file may contain sections of code that follow a different syntax. In the context of parsing, the same source file contains different languages. The most famous example is probably the C or C++ preprocessor, which is actually a fairly complicated language on its own and can magically appear inside any random C code.

An easier case to deal with is annotations, which are present in many contemporary programming languages. Among other things, they can be used to process code before it arrives to the compiler. They can command the annotation processor to transform the code in some way; for instance, to execute a specific function before executing the annotated one. They are easier to manage because they can appear only in specific places.

Dangling Else

The dangling else is a common problem in parsing linked to the if-then-else statement. Since the else clause is optional, a sequence of if statements could be ambiguous — for example, this one:

if one then if two then two else ???

It is unclear if the else belongs to the first if or the second one.

To be fair, this is, for the most part, a problem of language design. Most solutions don't really complicate parsing that much; for example, requiring the use of an endif or requiring the use of blocks to delimit the if statement when it includes an else clause.

However, there are also languages that offer no solution; that is to say, they're designed ambiguously — for example, you guessed it, C. The conventional approach is to associate the else to the nearest if statement, which makes the parsing context-sensitive.

Parsing Tree and Abstract Syntax Tree

There are two terms that are related and sometimes they are used interchangeably: parse tree and abstract syntax tree (AST). Technically, the parse tree could also be called a concrete syntax tree (CST) because it should reflect more concretely the actual syntax of the input, at least compared to the AST.

Conceptually, they are very similar. They are both trees; there is a root that has nodes representing the whole source code. The roots have children nodes that contain subtrees representing smaller and smaller portions of code, until single tokens (terminals) appears in the tree.

The difference is in the level of abstraction. A parse tree might contains all the tokens that appeared in the program and possibly, a set of intermediate rules. The AST, instead, is a polished version of the parse tree, in which only the information relevant to understanding the code is maintained. We are going to see an example of an intermediate rule in the next section.

Some information might be absent both in the AST and the parse tree. For instance, comments and grouping symbols (i.e. parentheses) are usually not represented. Things like comments are superfluous for a program and grouping symbols are implicitly defined by the structure of the tree.

From Parse Tree to Abstract Syntax Tree

A parse tree is a representation of the code closer to the concrete syntax. It shows many details of the implementations of the parser. For instance, usually, each rule corresponds to a specific type of a node. A parse tree is usually transformed in an AST by the user, possibly with some help from the parser generator. A common help allows you to annotate some rule in the grammar in order to exclude the corresponding nodes from the generated tree. Another one is an option to collapse some kinds of nodes if they have only one child.

This makes sense because the parse tree is easier to produce for the parser since it is a direct representation of the parsing process. However, the AST is simpler and easier to process through the steps of the program. They typically include all the operations that you may want to perform on the tree: code validation, interpretation, compilation, etc.

Let's look at a simple example to show the difference between a parse tree and an AST. Let’s start with a look at an example grammar.

In this grammar, we can define a sum using both the symbol plus ( + ) or the string plus as operators. Imagine that you have to parse the following code. These could be the resulting parse tree and abstract syntax tree.

In the AST the indication of the specific operator is disappeared and all that remains is the operation to be performed. The specific operator is an example of an intermediate rule.

Graphical Representation of a Tree

The output of a parser is a tree, but the tree can also be represented in graphical ways. That is to allow an easier understanding to the developer. Some parsing generator tools can output a file in the DOT language, a language designed to describe graphs (a tree is a particular kind of graph). Then this file is fed to a program that can create a graphical representation starting from this textual description (i.e. Graphviz).

Let’s see a DOT text based on the previous sum example.

digraph sum { sum -> 10; sum -> 21; }

The appropriate tool can create the following graphical representation.

If you want to see a bit more of DOT, you can read this article, where we show how to create a Visual Studio Code plugin that can handle DOT files.

You can check out the rest of the series below: