Writing an S-Expression parser in Ruby

S-Expressions are a very versatile syntax for defining data structures and program code that are used by most Lisp-derived programming languages.

When experimenting with new languages S-Expressions are fantastic because they are extremely flexible but only require minimal effort to parse. In this article I am going to cover the basic principles of creating a robust S-Expression parser in Ruby.

Note: if you just want a complete and working S-Expression parser library for Ruby please check out Sexpistol

Note: Since writing this article I have found a more performant and concise parsing method using StringScanner. I will be writing an updated post shortly. If you would like to to have a look at the completed code using the new method check it out at GitHub. The code in this article is still relevant as an introduction.

Our parser

One of the first things we should have a look at is a simple example of some program code written as an S-Expression:

1 (define test (lambda () ( 2 begin 3 (print "test") 4 (print 1) 5 ))) 6 7 (test)

The first step in most parsers is to break the input text up into ‘tokens’, and in order to do that we have to know what the possible tokens are in our target grammar. In our case we have it easy as there are only 5 distinct tokens that we care about initially:

Opening parentheses (

Closing parentheses )

Symbols define begin print etc...

String literals "test" "foo" etc...

Integer literals 2 5 99 etc...

The first thing we should notice is that one of these tokens is a ‘special case’ in that it can contain other sequences of characters that we would normally perceive as tokens in of themselves. This complicated token is the String literal.

In order to simplify the problem for ourselves the first thing we are going to do is find all of the string literals, copy them into an array, and then replace them with a special placeholder string:

1 def extract_string_literals ( string ) 2 string_literal_pattern = /"([^"\\]|\\.)*"/ 3 string_replacement_token = "___+++STRING_LITERAL+++___" 4 # Find and extract all the string literals 5 string_literals = [] 6 string . gsub ( string_literal_pattern ) { | x | string_literals << x } 7 # Replace all the string literals with our special placeholder token 8 string = string . gsub ( string_literal_pattern , string_replacement_token ) 9 # Return the modified string and the array of string literals 10 return [ string , string_literals ] 11 end

After running a string through this method we have a new string in which we do not have to worry about any special cases, there is no possibility a parentheses is going to be detected inside a string literal which makes our life much easier.

Next let’s split up the input into its individual tokens. We can do this quite simply by adding spaces around each opening or closing parentheses and then splitting the string up on whitespace like so:

1 def tokenize_string ( string ) 2 string = string . gsub ( "(" , " ( " ) 3 string = string . gsub ( ")" , " ) " ) 4 token_array = string . split ( " " ) 5 return token_array 6 end

Now that we have an array of tokens we actually need to add our string literals back into their correct places in the array, we can do this by iterating over our array of tokens and detecting our string replacement token ___+++STRING_LITERAL+++___ :

1 def restore_string_literals ( token_array , string_literals ) 2 return token_array . map do | x | 3 if ( x == '___+++STRING_LITERAL+++___' ) 4 # Since we've detected that a string literal needs to be replaced we 5 # will grab the first available string from the string_literals array 6 string_literals . shift 7 else 8 # This is not a string literal so we need to just return the token as it is 9 x 10 end 11 end 12 end

Finally we have a complete and clean array of tokens from our S-Expression! The next step is parsing these tokens into whatever form we require. In this case what we are going to do is turn each token into an object of it’s relevant Ruby class. Symbols to Symbol , integers to Fixnum , and strings to String . To do this we need to first be able to detect each type of token, we will define four simple methods to help us with this:

1 # A helper method to take care of the repetitive stuff for us 2 def is_match? ( string , pattern ) 3 match = string . match ( pattern ) 4 return false unless match 5 # Make sure that the matched pattern consumes the entire token 6 match [ 0 ]. length == string . length 7 end 8 9 # Detect a symbol 10 def is_symbol? ( string ) 11 # Anything other than parentheses, single or double quote and commas 12 return is_match? ( string , /[^\"\'\,\(\)]+/ ) 13 end 14 15 # Detect an integer literal 16 def is_integer_literal? ( string ) 17 # Any number of numerals optionally preceded by a plus or minus sign 18 return is_match? ( string , /[\-\+]?[0-9]+/ ) 19 end 20 21 # Detect a string literal 22 def is_string_literal? ( string ) 23 # Any characters except double quotes 24 # (except if preceded by a backslash), surrounded by quotes 25 return is_match? ( string , /"([^"\\]|\\.)*"/ ) 26 end

Now we are able to detect all of the tokens in our grammar! Next we need to convert them to our target Ruby objects:

1 def convert_tokens ( token_array ) 2 converted_tokens = [] 3 token_array . each do | t | 4 converted_tokens << "(" and next if ( t == "(" ) 5 converted_tokens << ")" and next if ( t == ")" ) 6 converted_tokens << t . to_i and next if ( is_integer_literal? ( t ) ) 7 converted_tokens << t . to_sym and next if ( is_symbol? ( t ) ) 8 converted_tokens << eval ( t ) and next if ( is_string_literal? ( t ) ) 9 # If we haven't recognized the token by now we need to raise 10 # an exception as there are no more rules left to check against! 11 raise Exception , "Unrecognized token: #{ t } " 12 end 13 return converted_tokens 14 end

Now we have a nice array of tokens that represent our data! Unfortunately it is still a flat array, so our last step needs to be to re-create the structure defined in the input text using arrays. To do this in an elegant way we are going to use a recursive method:

1 def re_structure ( token_array , offset = 0 ) 2 struct = [] 3 while ( offset < token_array . length ) 4 if ( token_array [ offset ] == "(" ) 5 # Multiple assignment from the array that re_structure() returns 6 offset , tmp_array = re_structure ( token_array , offset + 1 ) 7 struct << tmp_array 8 elsif ( token_array [ offset ] == ")" ) 9 break 10 else 11 struct << token_array [ offset ] 12 end 13 offset += 1 14 end 15 return [ offset , struct ] 16 end

Now that we have all the pieces we can parse S-Expressions by simply chaining the methods together!

1 def parse ( string ) 2 string , string_literals = extract_string_literals ( string ) 3 token_array = tokenize_string ( string ) 4 token_array = restore_string_literals ( token_array , string_literals ) 5 token_array = convert_tokens ( token_array ) 6 s_expression = re_structure ( token_array ) [ 1 ] 7 return s_expression 8 end

Feeding in an S-Expression like:

1 (this (is a number 1( example "s-expression")))

produces:

1 [[ :this , [ :is , :a , :number , 1 , [ :example , "s-expression" ]]]]

Obviously we would like to encapsulate this code into a class for neatness and convenience, if you don’t want to do this yourself you can check out Sexpistol, a pre-implemented and tested S-Expression parser library for Ruby.

For the full example code please see this Gist