Developers work with structured data, and from time to time they need to convert it into a format that a machine or another developer can understand and act on. This task is usually laborious and time-consuming, so tools that take a formal description of the data as input and return a syntax tree you can keep working with are very useful, and sometimes a real lifesaver. Today I'd like to tell you about one of these magic parser boxes, called ANTLR.

ANTLR, or ANother Tool for Language Recognition, was created by Terence Parr, a professor of computer science at the University of San Francisco with a great sense of humor, who has been working on language tools since 1989.

I got acquainted with ANTLR at Binary Studio Academy Pro. The main idea of the course was to give basic knowledge of ANTLR and its main features. There were four interesting lectures with lots of practical examples.

In the first lecture, the lecturer explained what ANTLR can do and where it is used, as well as what the lexer and the parser are for. After that we built a simple grammar and tried to parse strings in Visual Studio. Then he asked us to write a simple calculator parser by hand as homework, so that we could later see how much simpler it becomes with ANTLR.

In the second lecture we talked a lot about grammar rules and Extended Backus-Naur Form. At the end of the lecture we built a JSON parser. The next homework was much more pleasant, because writing a calculator parser with ANTLR is much simpler than writing it by hand.

The third lecture was the hardest. Sasha talked a lot about parser structure and ANTLR error handling. But this lecture proved that ANTLR is a really cool thing, especially its ability to recover after failures.

The last lecture was about ANTLR's more advanced features, for example recursive rules, semantic predicates, actions, and many others.

So, what is ANTLR? As Parr himself puts it, it is “a powerful parser generator for reading, processing, executing or translating structured text or binary files”. It works as follows: first you write a formal description, or grammar; then ANTLR generates recognizer classes for the target programming language (for example C# or Java); then you feed your data to those classes and get back a syntax tree built according to your description. You can see this process in figure 1.

Figure 1 – ANTLR data recognition

The input to ANTLR is a context-free grammar written in a notation based on Extended Backus-Naur Form (EBNF). EBNF is a set of rules that define the relations between terminal and nonterminal symbols. Terminal symbols are the minimal elements that have no grammatical structure of their own, for example digits or words. Nonterminal symbols are named grammatical elements that have their own structure, for example a math operation.
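To make the distinction concrete, here is a minimal Python sketch (plain Python, not ANTLR). It stores a toy arithmetic grammar as a dictionary and classifies its symbols: anything that has a rule of its own is a nonterminal, everything else is a terminal. The dictionary layout and the names `GRAMMAR` and `classify` are assumptions made for this illustration.

```python
# Toy grammar: each nonterminal maps to a list of alternatives,
# and each alternative is a sequence of symbols.
# This layout is an assumption of the sketch, not an ANTLR format.
GRAMMAR = {
    "statement": [["expr", "NEWLINE"], ["ID", "=", "expr", "NEWLINE"]],
    "expr": [["expr", "*", "expr"], ["expr", "+", "expr"], ["INT"], ["ID"]],
}


def classify(grammar):
    """Split all symbols into (nonterminals, terminals)."""
    nonterminals = set(grammar)  # symbols that have their own rule
    symbols = {s for alts in grammar.values() for alt in alts for s in alt}
    terminals = symbols - nonterminals  # symbols with no inner structure
    return nonterminals, terminals


nonterminals, terminals = classify(GRAMMAR)
print(sorted(nonterminals))  # ['expr', 'statement']
print(sorted(terminals))     # ['*', '+', '=', 'ID', 'INT', 'NEWLINE']
```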

Let's look at the recognition process in more detail using a simple example. We will build a simple calculator and find the answer to the children's riddle 2+2*2.

First of all, we should write the following grammar:

    grammar Calculator;

    statement: expr NEWLINE
        | ID '=' expr NEWLINE
        | NEWLINE
        | ID;

    expr: a=expr op=('*' | '/') b=expr
        | a=expr op=('+' | '-') b=expr
        | INT
        | ID;

    MUL : '*' ;
    DIV : '/' ;
    ADD : '+' ;
    SUB : '-' ;
    ID : [a-zA-Z]+ ;        // match identifiers
    INT : [0-9]+ ;          // match integers
    NEWLINE : '\r'? '\n' ;  // return newlines to parser (is end-statement signal)
    WS : [ \t]+ -> skip ;   // toss out whitespace

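As you can see, the main rule is statement: it can be an expression (2+2*2), an assignment to an identifier (a=2), or a use of identifiers (a+b). An expression is either a single INT or ID, or two operands, labeled a and b, with an arithmetic operation between them. Note that in expr the alternative for '*' and '/' is listed before the alternative for '+' and '-'. This is how precedence is implemented: when the alternatives of a left-recursive rule are ambiguous, ANTLR 4 prefers the one listed first, so multiplication and division bind tighter than addition and subtraction. The order in which the MUL, DIV, ADD and SUB tokens are declared does not affect operator precedence. The syntax diagram of this rule is shown in figure 2.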

Figure 2 – Syntax diagram of expr rule

When ANTLR receives a string, it splits it into tokens. In this example there will be five tokens: three '2' tokens of type INT and two operator tokens of types ADD and MUL. This step is called lexical analysis and is performed by the lexer. In the grammar, lexer rule names start with an uppercase letter.
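The lexing step can be imitated with a few lines of Python. This is a hand-written sketch built on regular expressions, not the lexer that ANTLR generates; the token names follow the Calculator grammar, while `ASSIGN` (for the '=' literal, which the grammar leaves unnamed), `TOKEN_SPEC`, `MASTER` and `tokenize` are names assumed for the illustration.

```python
import re

# Token types mirror the lexer rules of the Calculator grammar.
# The regexes are a hand-written approximation, not ANTLR output.
TOKEN_SPEC = [
    ("MUL", r"\*"),
    ("DIV", r"/"),
    ("ADD", r"\+"),
    ("SUB", r"-"),
    ("ASSIGN", r"="),         # hypothetical name for the unnamed '=' literal
    ("ID", r"[a-zA-Z]+"),
    ("INT", r"[0-9]+"),
    ("NEWLINE", r"\r?\n"),
    ("WS", r"[ \t]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (type, text) pairs, dropping whitespace like `-> skip`."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("2+2*2"))
# [('INT', '2'), ('ADD', '+'), ('INT', '2'), ('MUL', '*'), ('INT', '2')]
```

Running it on "2+2*2" yields the same five tokens described above: three INT tokens, one ADD and one MUL.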

Then the tokens go on to syntax analysis, which is performed by the parser. Parser rule names start with a lowercase letter. The linear sequence of input tokens is matched against the grammar, and the output is a syntax tree like the one in figure 3.

Figure 3 – Syntax tree of expression “2+2*2”

As you can see, if we evaluate this tree in a calculator, the multiplication is performed first, so we get 6 rather than 8, which is the right answer.
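To see why the shape of the tree decides the answer, here is a small hand-written recursive-descent sketch in Python. It is not the parser ANTLR generates and it handles only integers and the four operators, but it builds the same kind of tree, with the multiplication nested below the addition, and then evaluates it. All function names, the tuple-based tree and the choice of floor division for '/' are assumptions of this sketch.

```python
import operator
import re

# Integer arithmetic only; mapping "/" to floor division is an assumption.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.floordiv}


def tokenize(text):
    # Keep integers and the four operators; anything else is ignored here.
    return re.findall(r"[0-9]+|[-+*/]", text)


def parse_atom(tokens, pos):
    if pos >= len(tokens) or not tokens[pos].isdigit():
        raise SyntaxError(f"expected an integer at token {pos}")
    return int(tokens[pos]), pos + 1


def parse_binary(tokens, pos, operators, parse_operand):
    # Left-associative chain: operand (op operand)*
    left, pos = parse_operand(tokens, pos)
    while pos < len(tokens) and tokens[pos] in operators:
        op = tokens[pos]
        right, pos = parse_operand(tokens, pos + 1)
        left = (op, left, right)
    return left, pos


def parse_term(tokens, pos):
    # '*' and '/' bind tighter, so they end up deeper in the tree.
    return parse_binary(tokens, pos, ("*", "/"), parse_atom)


def parse_expr(tokens, pos):
    # '+' and '-' bind looser, so they end up nearer the root.
    return parse_binary(tokens, pos, ("+", "-"), parse_term)


def parse(text):
    tokens = tokenize(text)
    tree, pos = parse_expr(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


def evaluate(node):
    if isinstance(node, int):
        return node
    op, left, right = node
    return OPS[op](evaluate(left), evaluate(right))


tree = parse("2+2*2")
print(tree)            # ('+', 2, ('*', 2, 2))
print(evaluate(tree))  # 6
```

A tree built strictly left to right, ('*', ('+', 2, 2), 2), would evaluate to 8, which is exactly the mistake the riddle is designed to provoke.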

This was a very simple example of using ANTLR. It has a lot of goodies: a powerful mechanism for recovering from errors, good handling of left recursion, and the ability to plug in your own error handlers. So if you want to create your own language, or write a program with user-friendly dialogs between the user and your app, ANTLR deserves your attention.

Working with grammars is much simpler when you use a GUI editor in which you can debug your rules. ANTLR has its own GUI editor, but it doesn’t work ideally, so one of the students wrote a plugin. It highlights grammar syntax and visualizes the syntax tree, which is very useful for debugging grammars. You can find the plugin with installation instructions by following the GitHub link.