For one of the final biology courses of my undergraduate years, I’m taking a class on phylogenetic methods, which has been really interesting. For my final project I’m implementing one such method, which tests for correlated evolution of discrete characters along a phylogeny, using my new favorite programming language, Julia. This post is about a silly hack in which I used Julia’s homoiconicity to make parsing the data this method works on really, really easy.

As with most software projects, this one involves some amount of drudgery that must be be undertaken before anything interesting to do with the actual problem at hand can happen. In this case the drudgery takes the form of parsing Newick strings, which are the standard in the field for serializing phylogenetic trees. Why not use a format for which parsers are ubiquitous, you might ask? Well, Newick strings were developed in 1984, long before XML or JSON were even glints in their respective standards committees’ eyes. Everyone still uses it today, so for better or worse, we are stuck with it.

Of course many languages (at least those used for scientific computing) already have Newick parsers. This includes Julia, in the form of Ben Ward’s Phylogenetics.jl. However, it’s a port of an existing R package and I’m not particularly a fan of the API, so I decided to write my own. Newick is a pretty darn simple format so this would not have been particularly difficult in any case, but after squinting at some examples for a bit I noticed a way it could be made drop dead simple.

A typical Newick string looks something like this:

( A :0 .1 , B :0 .2 ,( C :0 .3 , D :0 .4 ) :0 .5 );

Each set of nested parentheses is a clade (a group of organisms all descended from a single common ancestor), the numbers following the colons represent branch lengths, and the letters are the names of the tips. So this string might represent the tree:

The silly hack I’m about to tell you about stems from the observation that after some quick text munging, a Newick string is actually a valid Julia expression whose structure mirrors the structure of the tree we want to represent. This means that we can trick the Julia parser into doing most of the work for us. First let’s declare a type to represent a node on a phylogeny.

type PhyloNode label :: String children :: Vector { PhyloNode } length :: Float64 end

Simple enough; a node has a label (name), a branch length, and some children, which are also PhyloNode s. Now we’re going to turn our example Newick string into a tree of PhyloNode s by asking Julia to parse it and manipulating the resulting Expr data structure. Imagine the example string above is in a variable called example . As a preliminary, let’s chomp off the trailing semicolon and replace all the colons with plus signs (the reasons for this will become clear in a moment).

example = rstrip ( example , ';' ) example = replace ( example , ":" , "+" ) parsed_example = parse ( example )

Let’s take a look at parsed_example using the dump function:

julia > dump ( parsed_example ) Expr head : Symbol tuple args : Array ( Any ,( 3 ,)) 1 : Expr head : Symbol call args : Array ( Any ,( 3 ,)) 1 : Symbol + 2 : Symbol A 3 : Float64 0.1 typ : Any 2 : Expr head : Symbol call args : Array ( Any ,( 3 ,)) 1 : Symbol + 2 : Symbol B 3 : Float64 0.2 typ : Any 3 : Expr head : Symbol call args : Array ( Any ,( 3 ,)) 1 : Symbol + 2 : Expr head : Symbol tuple args : Array ( Any ,( 2 ,)) typ : Any 3 : Float64 0.5 typ : Any typ : Any

This is Julia’s post-parsing internal representation of the example string. Notice how its structure reflects the tree we are trying to build. Do you see why we replaced each : with a + ? Since + is an infix operator, the parser has associated each node with it’s branch length by making them both arguments of a call to the + function! Really any infix operator would have done here, I just chose + arbitrarily. OK, now let’s write a function that will take this parsed expression and build the tree we want!

function parsenewick ( newick :: Expr ) if newick . head == : tuple # this is a node without a length children = [ parsenewick ( child ) for child in newick . args ] name = "" length = - 1 elseif newick . head == : call && newick . args [ 1 ] == : + # + indicates node with length length = newick . args [ 3 ] if typeof ( newick . args [ 2 ]) == Expr # internal node name = "" children = [ parsenewick ( child ) for child in newick . args [ 2 ] . args ] elseif typeof ( newick . args [ 2 ]) == Symbol # tip name = string ( newick . args [ 2 ]) children = PhyloNode [] end end PhyloNode ( name , children , length ) end

We recursively descend the expression, generating the appropriate node. A call to + indicates a node with a length, a tuple indicates a node without a branch length (note that I’m using a length of -1 to represent this). Let’s try it out:

julia > parsenewick ( parsed_example ) PhyloNode ( "" ,[ PhyloNode ( "A" ,[], 0.1 ), PhyloNode ( "B" ,[], 0.2 ), PhyloNode ( "" ,[ PhyloNode ( "C" ,[], 0.3 ), PhyloNode ( "D" ,[], 0.4 )], 0.5 )], - 1 )

It works! That was easy. We now have a parser for 90% of the Newick strings in the wild in about 20 lines of code.

Of course this is not nearly a complete parser. Most notably it will fail if we have internal nodes with names, which are represented as:

( A :0 .1 , B :0 .2 ,( C :0 .3 , D :0 .4 ) E :0 .5 ) F ;