A quick intro to writing a parser with Treetop

Treetop is a Ruby library that allows you to create parsers easily by describing them using a Parsing Expression Grammar (PEG). Writing a parser using Treetop is a fairly pain-free process, but getting started can be non-trivial, especially if you’re not familiar with PEGs, so this is going to be a fairly short ‘getting started’ guide. To keep things simple our example parser is going to process a subset of S-Expression syntax like so:

1 ( define test ( lambda () 2 ( begin 3 ( display "something" ) 4 ( display 1 ) 5 ( display 3 . 08 )))) 6 7 ( test )

It will be a fairly simplistic example, but it should be enough to get you started!

Installing Treetop

Provided that you already have Ruby and RubyGems installed, installing Treetop is a very simple process. Simply enter this at your command line:

1 gem install treetop

If you do not have Ruby and RubyGems installed then you should have a look at the Ruby website for installation instructions.

Getting Started

There are going to be three main parts to our parser:

The Ruby class Parser which will contain the API for interacting with the parser

which will contain the API for interacting with the parser The Treetop grammar file in which we will create all the rules that define how our parser will behave

A series of simple classes that subclass Treetop::Runtime::SyntaxNode , these will allow us to easily describe our parsed structure by having each type of entity map directly to a custom syntax node class

The first thing we need to do is create skeletons for the three files involved:

1 # In file parser.rb 2 require 'treetop' 3 4 # Find out what our base path is 5 base_path = File . expand_path ( File . dirname ( __FILE__ )) 6 7 # Load our custom syntax node classes so the parser can use them 8 require File . join ( base_path , 'node_extensions.rb' ) 9 10 class Parser 11 12 # Load the Treetop grammar from the 'sexp_parser' file, and 13 # create a new instance of that parser as a class variable 14 # so we don't have to re-create it every time we need to 15 # parse a string 16 Treetop . load ( File . join ( base_path , 'sexp_parser.treetop' )) 17 @@parser = SexpParser . new 18 19 end

1 # In file sexp_parser.treetop 2 grammar Sexp 3 4 end

1 # In file node_extensions.rb 2 module Sexp 3 4 end

Basic Parser API & Error Reporting

Our API is going to be very simple, the Parser class will only have only public method: parse . Error reporting for the moment will also be minimal, however if you are making a parser that will be used widely then this is definitely not an area you should skimp on. Inscrutable error messages can make an otherwise well-written parser very frustrating to deal with.

1 # In file parser.rb 2 class Parser 3 4 . . . 5 6 def self . parse ( data ) 7 # Pass the data over to the parser instance 8 tree = @@parser . parse ( data ) 9 10 # If the AST is nil then there was an error during parsing 11 # we need to report a simple error message to help the user 12 if ( tree . nil? ) 13 raise Exception , "Parse error at offset: #{ @@parser . index } " 14 end 15 16 return tree 17 end 18 end

Abstract Syntax Tree Nodes

In order to be able to build a meaningful tree structure from our input we need to be able to distinguish between different types of nodes, for example:

Expression eg: (1 2 3)

eg: (1 2 3) Identifier eg: foo, bar, other_thing

eg: foo, bar, other_thing IntegerLiteral eg: 1, 2, 99

eg: 1, 2, 99 StringLiteral eg: “test”

and so on

1 # In file node_extensions.rb 2 module Sexp 3 class IntegerLiteral < Treetop :: Runtime :: SyntaxNode 4 end 5 6 class StringLiteral < Treetop :: Runtime :: SyntaxNode 7 end 8 9 class FloatLiteral < Treetop :: Runtime :: SyntaxNode 10 end 11 12 class Identifier < Treetop :: Runtime :: SyntaxNode 13 end 14 15 class Expression < Treetop :: Runtime :: SyntaxNode 16 end 17 18 class Body < Treetop :: Runtime :: SyntaxNode 19 end 20 end

The Parser Rules

Now that we have all of our other components in place we can start actually defining our PEG rules. We’ll start with the smallest parts: Identifier , FloatLiteral , StringLiteral , space and IntegerLiteral

1 # In file sexp_extensions.rb 2 grammar Sexp 3 rule integer 4 ( '+' / '-' )? [ 0 - 9 ]+ < IntegerLiteral > 5 end 6 7 rule float 8 ( '+' / '-' )? [ 0 - 9 ]+ (( '.' [ 0 - 9 ]+ ) / ( 'e' [ 0 - 9 ]+ )) < FloatLiteral > 9 end 10 11 rule string 12 '"' ( [^ " \\ ] / " \\ " . )* '" ' < StringLiteral > 13 end 14 15 rule identifier 16 [ a - zA - Z \ = \ *] [ a - zA - Z0 - 9 _ \ = \ *]* < Identifier > 17 end 18 19 rule space 20 [ \ s ]+ 21 end 22 end

Each rule above comes in two parts: first the regular expression that defines what this entity looks like, and second what type of syntax node it should be turned into. Treetop looks inside the <...> and matches the name up with the custom syntax nodes that we defined earlier.

Rules in Treetop are written using what amounts to a superset of regular expressions, more info can be found at the Treetop website.

You might be wondering why we have a PEG rule for space … This is so we can allow whitespace in between entities in the input string, if we did not have a rule for this we would have to be very strict about how source files were laid out.

Next let’s define the rule for expressions and bodies (the inner part of an expression), these will build upon the rules we have defined already and are an example of what makes Treetop so great: Treetop allows you to write rules by easily composing sets of smaller rules to make a larger one. This code re-use makes parsers based on PEGs like Treetop very easy to maintain and expand.

1 # In file sexp_parser.treetop 2 rule expression 3 space? '(' body ')' space? < Expression > 4 end 5 6 rule body 7 ( expression / identifier / float / integer / string / space ) * < Body > 8 end

At this point we should be able to parse the entire of our example S-Expression shown above, let’s try it with a simple example and see what the result is:

1 Parser . parse ( '(this "is" a test( 1 2.0 3))' ) 2 3 => Expression + Expression0 offset = 0 , "...s \" a test( 1 2.0 3))" ( body ): 4 SyntaxNode offset = 0 , "" 5 SyntaxNode offset = 0 , "(" 6 Body offset = 1 , "...is \" a test( 1 2.0 3)" : 7 Identifier + Identifier0 offset = 1 , "this" : 8 SyntaxNode offset = 1 , "t" 9 SyntaxNode offset = 2 , "his" : 10 SyntaxNode offset = 2 , "h" 11 SyntaxNode offset = 3 , "i" 12 SyntaxNode offset = 4 , "s" 13 SyntaxNode offset = 5 , " " : 14 SyntaxNode offset = 5 , " " 15 StringLiteral + String1 offset = 6 , " \" is \" " : 16 SyntaxNode offset = 6 , " \" " 17 . . . 18 . . .

You can see we have created a tree from our input, but it has a lot of extraneous nodes that we don’t need or care about. Let’s put in a system for cleaning up the tree:

1 # In file parser.rb 2 class Parser 3 4 def self . parse 5 . . . 6 . . . 7 8 self . clean_tree ( tree ) 9 10 return tree 11 end 12 13 private 14 15 def self . clean_tree ( root_node ) 16 return if ( root_node . elements . nil? ) 17 root_node . elements . delete_if { | node | node . class . name == "Treetop::Runtime::SyntaxNode" } 18 root_node . elements . each { | node | self . clean_tree ( node ) } 19 end 20 21 end

This method is going to traverse the entire tree and strip out any nodes that are not one of our custom classes, that way we are only left with nodes we care about. Let’s try our example again and see what we get:

1 Parser . parse ( '(this "is" a test( 1 2.0 3))' ) 2 3 => Expression + Expression0 offset = 0 , "...s \" a test( 1 2.0 3))" ( body ): 4 Body offset = 1 , "...is \" a test( 1 2.0 3)" : 5 Identifier + Identifier0 offset = 1 , "this" 6 StringLiteral + String1 offset = 6 , " \" is \" " 7 Identifier + Identifier0 offset = 11 , "a" 8 Identifier + Identifier0 offset = 13 , "test" 9 Expression + Expression0 offset = 17 , "( 1 2.0 3)" ( body ): 10 Body offset = 18 , " 1 2.0 3" : 11 IntegerLiteral + Integer0 offset = 19 , "1" 12 FloatLiteral + Float2 offset = 21 , "2.0" 13 IntegerLiteral + Integer0 offset = 25 , "3"

Much better! Now we have a nice clean representation of our input. But to be honest for most cases this is still more information than we need. Let’s alter our custom syntax node classes so they can output a simpler representation of themselves using simple nested sets of arrays and native Ruby data-types, we can achieve this in a simple fashion by giving each node a to_array method and using the delegation pattern:

1 # In file node_extensions.rb 2 module Sexp 3 class IntegerLiteral < Treetop :: Runtime :: SyntaxNode 4 def to_array 5 return self . text_value . to_i 6 end 7 end 8 9 class StringLiteral < Treetop :: Runtime :: SyntaxNode 10 def to_array 11 return eval self . text_value 12 end 13 end 14 15 class FloatLiteral < Treetop :: Runtime :: SyntaxNode 16 def to_array 17 return self . text_value . to_f 18 end 19 end 20 21 class Identifier < Treetop :: Runtime :: SyntaxNode 22 def to_array 23 return self . text_value . to_sym 24 end 25 end 26 27 class Expression < Treetop :: Runtime :: SyntaxNode 28 def to_array 29 return self . elements [ 0 ]. to_array 30 end 31 end 32 33 class Body < Treetop :: Runtime :: SyntaxNode 34 def to_array 35 return self . elements . map { | x | x . to_array } 36 end 37 end 38 end

You can see here that each complex node ( Expression , Body ) delegates the job of producing the array down to it’s children, and each primitive child node ( StringLiteral , FloatLiteral , Identifier ) knows how to return a simple version of itself. Let’s add some code to the Parser class to use this and then try it out:

1 # In file parser.rb 2 class Parser 3 4 def self . parse 5 . . . 6 . . . 7 8 return tree . to_array 9 end 10 11 end

1 Parser . parse ( '(this "is" a test( 1 2.0 3))' ) 2 #=> [:this, "is", :a, :test, [1, 2.0, 3]]

Perfect!

While the example here is fairly simplistic it touches all the parts needed to make a more complex parser using Treetop. You can have a look at the completed code from this example at GitHub, and if you want to have a look at a more involved example you can check out the reference parser for the Koi programming language which is also written using Treetop.