Readers of this website will know that ANTLR is a great tool for quickly creating parsers, whether you are working with an existing language or creating your own DSL. While the tool itself is written in Java, it can also be used to generate parsers in several other languages, for instance Python, C# or JavaScript (with more languages supported by the recently released 4.6 version).

If you want to use C#, there are two options: one is the official version of ANTLR, the other is the special C#-optimized version of ANTLR by Sam Harwell. There are two options because in the past the official ANTLR tool could not generate C#, so you had to use the second option. In this tutorial we are going to use the second option because it offers better integration with Visual Studio. Changing between the two options is simple, but not without issues: the generated C# parsers are not compatible and there are a few differences in the API.

You can integrate ANTLR in your favorite IDE. The runtime itself also works on Mono, and can be used standalone. You can also look at the issues for the C#-optimized version of ANTLR 4 to see if you can make it work with other setups. However, the easiest way is to use Visual Studio and the provided extension to integrate the generation of the parser from the grammar into your C# project.

Setup

At the time this article was first published, the extension was not supported on Visual Studio 2017. Since May 2018 the extension supports the latest version of Visual Studio. Check the ANTLR Language Support page on the official Visual Studio Marketplace for details. In any case, the lack of support for a given Visual Studio version only means that you will have to create the .g4 file manually; you can still use the ANTLR NuGet package. We keep this notice in case you read this article after a future release of Visual Studio that the extension does not support.



The first step is to install the ANTLR Language Support extension for Visual Studio: search for it from within Visual Studio by going to Tools → Extensions and Updates. The extension lets you integrate ANTLR into your workflow by automatically generating the parser and, optionally, the listener and visitor from your grammar. Now you can add a new ANTLR 4 Combined Grammar or an ANTLR 4 Lexer/Parser in the same way you add any other new item [1]. Then, for each one of your projects, you must add the NuGet package for ANTLR4. If you want to manage options and, for instance, disable the visitor/listener generation, see the documentation of the GitHub project.

Even if you cannot use the Language Support extension, you can manually integrate the compilation of ANTLR grammar files into the normal Visual Studio build workflow by following the instructions for the C#-optimized version.

Create the Grammar

We are going to create a grammar that parses two lines of text that represent a chat between two people. This could be the basis for a chat program or for a game in which whoever says the shortest word gets beaten up with a thesaurus. This is not relevant for the grammar itself, because the grammar handles only the recognition of the various elements of the text. What you choose to do with these elements is handled in normal code.

Add a new ANTLR 4 Combined Grammar with the name Speak [1]. You will see that there is already some text in the new file; delete it all and replace it with the following text.

    grammar Speak;

    /*
     * Parser Rules
     */

    chat                : line line EOF ;
    line                : name SAYS opinion NEWLINE;
    name                : WORD ;
    opinion             : TEXT ;

    /*
     * Lexer Rules
     */

    fragment A          : ('A'|'a') ;
    fragment S          : ('S'|'s') ;
    fragment Y          : ('Y'|'y') ;

    fragment LOWERCASE  : [a-z] ;
    fragment UPPERCASE  : [A-Z] ;

    SAYS                : S A Y S ;
    WORD                : (LOWERCASE | UPPERCASE)+ ;
    TEXT                : '"' .*? '"' ;
    WHITESPACE          : (' '|'\t')+ -> skip ;
    NEWLINE             : ('\r'? '\n' | '\r')+ ;

While you may create separate lexer and parser grammars, for a simple project you will want to use a combined grammar and put the parser rules before the lexer rules. It is important to put the more specific tokens first and the generic ones, like WORD or ID, later. That is because the lexer prefers the rule that matches the most input and, when two rules match the same text, it picks the one that is defined first. In this example, if we had inverted SAYS and WORD, SAYS would have been hidden by WORD. Another thing to notice is that you cannot use fragments outside of lexer rules.
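The effect of rule order can be imitated outside ANTLR. The following is only a sketch and not how ANTLR is implemented: TinyLexer is a hypothetical class, invented for this illustration, that uses .NET regular expressions to mimic the choice between SAYS and WORD. It prefers the longest match and, on a tie, the rule listed first.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of ANTLR: it only imitates how a lexer
// chooses between rules that match at the start of the input.
public static class TinyLexer
{
    // Returns the name of the winning rule, or null if no rule matches.
    // The longest match wins; on a tie the rule listed first wins.
    public static string FirstToken(string input, IList<KeyValuePair<string, string>> rules)
    {
        string winner = null;
        int longest = 0;
        foreach (var rule in rules)
        {
            // ^ forces the match to begin at the start of the input.
            Match m = Regex.Match(input, "^(?:" + rule.Value + ")", RegexOptions.IgnoreCase);
            // Strictly greater: an earlier rule keeps the win on a tie.
            if (m.Success && m.Length > longest)
            {
                longest = m.Length;
                winner = rule.Key;
            }
        }
        return winner;
    }

    public static void Main()
    {
        var saysFirst = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("SAYS", "says"),
            new KeyValuePair<string, string>("WORD", "[a-z]+"),
        };
        var wordFirst = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("WORD", "[a-z]+"),
            new KeyValuePair<string, string>("SAYS", "says"),
        };

        Console.WriteLine(FirstToken("says", saysFirst));         // SAYS
        Console.WriteLine(FirstToken("says", wordFirst));         // WORD
        Console.WriteLine(FirstToken("saysomething", saysFirst)); // WORD
    }
}
```

With WORD listed first, the text "says" is never reported as SAYS, which is the hiding effect described above. The last call shows that a longer match wins regardless of the order.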

Having said that, the lexer part is pretty straightforward:

we identify a SAYS, which can be written in uppercase or lowercase;

a WORD, which can be composed of any uppercase or lowercase letters;

a TEXT, which includes everything between two double-quote marks (");

a NEWLINE.

Any text that is WHITESPACE, that is, spaces and tabs, is simply ignored. While this is clearly a simple case, lexer rules will rarely be much more complicated than this.

The only slightly complicated rule is the one for TEXT. The beginning and end symbols (i.e., the double quotes) are clear, while the stuff in the middle is composed of two parts: the dot (.) and the *? part. The dot means that any single character is allowed, while the *? means that the preceding element is repeated zero or more times in a non-greedy way, so the rule stops as soon as it finds whatever is on the right. In this case the double quote is on the right, so the end result is that everything between the two nearest double quotes is included in TEXT.
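.NET regular expressions have the same *? operator, so the difference between a greedy and a non-greedy match can be tried without ANTLR. The following is a small sketch by analogy, not ANTLR code; LazyQuantifierDemo and FirstMatch are names made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustration by analogy with .NET regular expressions; not ANTLR code.
public static class LazyQuantifierDemo
{
    // Returns the first match of the pattern in the input, or null.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        string input = "\"hello\" and \"world\"";

        // Non-greedy: stops at the first closing double quote.
        Console.WriteLine(FirstMatch(input, "\".*?\"")); // "hello"

        // Greedy: runs up to the last double quote in the input.
        Console.WriteLine(FirstMatch(input, "\".*\""));  // "hello" and "world"
    }
}
```

The greedy version would swallow two separate quoted texts as one, which is why the TEXT rule uses the non-greedy form.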

Usually the worst thing that could happen is having to use semantic predicates. These are essentially expressions that evaluate to true or false and, when they are false, disable the rule or alternative they guard. For instance, you may want to use a ‘/’ as the beginning of a comment, but only if it is the first character of a line; otherwise it should be considered an arithmetic operator.

The parser is usually where things get more complicated, although that’s not the case this time. Every document given to the Speak grammar must contain a chat, which in turn is equal to two line rules followed by an End Of File marker. The line must contain a name, the SAYS keyword and an opinion, followed by a NEWLINE. Name and opinion are similar rules, in the sense that they both correspond to individual lexer rules. However, they have different names because they correspond to different concepts, and they could easily change in a real program. For example, you may want the concept of a name to change to a username that could contain underscores (_).

Visiting the tree

Just like we have seen for Roslyn, ANTLR will automatically create a tree and base visitor (and/or listener). We can create our own visitor class and change what we need. Let’s see an example.

    public class SpeakVisitor : SpeakBaseVisitor<object>
    {
        public List<SpeakLine> Lines = new List<SpeakLine>();

        public override object VisitLine(SpeakParser.LineContext context)
        {
            SpeakParser.NameContext name = context.name();
            SpeakParser.OpinionContext opinion = context.opinion();

            SpeakLine line = new SpeakLine() { Person = name.GetText(), Text = opinion.GetText().Trim('"') };
            Lines.Add(line);

            return line;
        }
    }

The first line shows how to create a class that inherits from the SpeakBaseVisitor class, which is automatically generated by ANTLR. If you need to, you can restrict the type: for a calculator grammar, for instance, you could use something like int or double.

SpeakLine (not shown) is a custom class that contains two properties: Person and Text. The VisitLine method shows how to override the function that visits the specific type of node you are interested in: you just need to use the appropriate type for the context, which contains the information provided by the parser generated by ANTLR.
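Since SpeakLine is not shown, here is a guess at what a minimal version could look like. Only the class name and the two property names come from the code above; everything else, including the SpeakLineDemo class, is an assumption.

```csharp
using System;

// Assumed shape of the SpeakLine class, which the article does not show.
public class SpeakLine
{
    // Name of the person speaking, taken from the name rule.
    public string Person { get; set; }

    // What the person said, without the surrounding double quotes.
    public string Text { get; set; }
}

// Hypothetical demo class, only here to make the sketch runnable.
public static class SpeakLineDemo
{
    public static void Main()
    {
        var line = new SpeakLine { Person = "john", Text = "hello" };
        Console.WriteLine("{0} has said {1}", line.Person, line.Text); // john has said hello
    }
}
```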

At the end of VisitLine we return the SpeakLine object that we just created. This is unusual, and it is useful for the unit tests that we will create later. Usually you would want to return base.VisitLine(context) so that the visitor can continue its journey across the tree.
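To see what it means for a visitor to stop or continue its journey, consider this miniature model. It is a sketch that has nothing to do with the classes generated by ANTLR: Node, BaseVisitor and StoppingVisitor are hypothetical names, and the default Visit plays the role that base.VisitLine(context) plays in a real visitor.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical miniature tree, unrelated to ANTLR's generated classes.
public class Node
{
    public string Label;
    public List<Node> Children = new List<Node>();

    public Node(string label, params Node[] children)
    {
        Label = label;
        Children.AddRange(children);
    }
}

public class BaseVisitor
{
    public List<string> Seen = new List<string>();

    // Default behaviour: record the node, then keep walking into the children.
    public virtual object Visit(Node node)
    {
        Seen.Add(node.Label);
        object result = null;
        foreach (Node child in node.Children)
        {
            result = Visit(child);
        }
        return result;
    }
}

public class StoppingVisitor : BaseVisitor
{
    // Returns its own value for "line" nodes without calling base.Visit,
    // so the children of a line are never visited.
    public override object Visit(Node node)
    {
        if (node.Label == "line")
        {
            Seen.Add(node.Label);
            return "custom result";
        }
        return base.Visit(node);
    }
}

public static class VisitorDemo
{
    public static Node BuildChat()
    {
        return new Node("chat",
            new Node("line", new Node("name"), new Node("opinion")),
            new Node("line", new Node("name"), new Node("opinion")));
    }

    public static void Main()
    {
        var walker = new BaseVisitor();
        walker.Visit(BuildChat());
        Console.WriteLine(walker.Seen.Count);  // 7

        var stopper = new StoppingVisitor();
        stopper.Visit(BuildChat());
        Console.WriteLine(stopper.Seen.Count); // 3
    }
}
```

In our SpeakVisitor stopping at the line is fine, because VisitLine reads the name and the opinion directly from the context and has no need to visit them.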

This code simply populates a list of SpeakLine objects, each holding the name of a person and the opinion they have spoken. The Lines field will be used by the main program.

Putting It All Together

    private static void Main(string[] args)
    {
        try
        {
            string input = "";
            StringBuilder text = new StringBuilder();
            Console.WriteLine("Input the chat.");

            // to type the EOF character and end the input: use CTRL+D, then press <enter>
            while ((input = Console.ReadLine()) != null && input != "\u0004")
            {
                text.AppendLine(input);
            }

            AntlrInputStream inputStream = new AntlrInputStream(text.ToString());
            SpeakLexer speakLexer = new SpeakLexer(inputStream);
            CommonTokenStream commonTokenStream = new CommonTokenStream(speakLexer);
            SpeakParser speakParser = new SpeakParser(commonTokenStream);

            SpeakParser.ChatContext chatContext = speakParser.chat();
            SpeakVisitor visitor = new SpeakVisitor();
            visitor.Visit(chatContext);

            foreach (var line in visitor.Lines)
            {
                Console.WriteLine("{0} has said {1}", line.Person, line.Text);
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine("Error: " + ex);
        }
    }

As you can see, there is nothing particularly complicated. The four statements after the input loop show how to create the lexer and then the parser. The subsequent lines show how to launch the visitor that you have created: you have to get the context for whichever starting rule you use, in our case chat, which gives you the tree, and then order the visitor to visit the tree from that node.

The program itself simply outputs the information contained in the tree. It would be trivial to modify the grammar to allow any number of lines to be added (for example, by changing the chat rule to line+ EOF); neither the visitor nor the main program would need to be changed.

Unit Testing

Testing is useful in all cases, but it is absolutely crucial when you are creating a grammar, to check that everything is working correctly. If you are creating a grammar for an existing language, you probably want to check many working source files. In any case, you want to start by unit testing the single rules.

Luckily, since the creation of the Community edition of Visual Studio, there is a free version of Visual Studio that includes a unit testing framework. All you have to do is create a new Test Project, add all the necessary NuGet packages and add a reference to the project assembly you need to test.

    [TestClass]
    public class ParserTest
    {
        private SpeakParser Setup(string text)
        {
            AntlrInputStream inputStream = new AntlrInputStream(text);
            SpeakLexer speakLexer = new SpeakLexer(inputStream);
            CommonTokenStream commonTokenStream = new CommonTokenStream(speakLexer);
            SpeakParser speakParser = new SpeakParser(commonTokenStream);

            return speakParser;
        }

        [TestMethod]
        public void TestChat()
        {
            SpeakParser parser = Setup("john says \"hello\" \n michael says \"world\" \n");

            SpeakParser.ChatContext context = parser.chat();
            SpeakVisitor visitor = new SpeakVisitor();
            visitor.Visit(context);

            Assert.AreEqual(2, visitor.Lines.Count);
        }

        [TestMethod]
        public void TestLine()
        {
            SpeakParser parser = Setup("john says \"hello\" \n");

            SpeakParser.LineContext context = parser.line();
            SpeakVisitor visitor = new SpeakVisitor();
            SpeakLine line = (SpeakLine) visitor.VisitLine(context);

            Assert.AreEqual("john", line.Person);
            Assert.AreEqual("hello", line.Text);
        }

        [TestMethod]
        public void TestWrongLine()
        {
            SpeakParser parser = Setup("john sayan \"hello\" \n");

            var context = parser.line();

            Assert.IsInstanceOfType(context, typeof(SpeakParser.LineContext));
            Assert.AreEqual("john", context.name().GetText());
            Assert.IsNull(context.SAYS());
            Assert.AreEqual("johnsayan\"hello\"\n", context.GetText());
        }
    }

There is nothing unexpected in these tests. One observation is that we can create a test to check the single line visitor or we can test the matching of the rule itself. You obviously should do both.

You may wonder how the last test works, since we are trying to match input that doesn’t match the rule, but we still get the correct type of context as a return value and some correctly matched values. This happens because ANTLR is quite robust in the face of errors and we are checking only one rule. There are no alternatives, and since the input starts the correct way it is considered a match, although a partial one.

Summary

Integrating an ANTLR grammar in a C# project is quite easy with the provided Visual Studio extension and NuGet packages. This makes it the best way to quickly create a parser for your DSL with ANTLR in C#. Finally, you can easily use a better alternative to piles of fragile regular expressions, but don’t forget to implement testing.

1. Read the note in the Setup section

Original written in December 2016 – Revised and updated in January 2018