No jumps

When I began to work on EasyExtend in 2006 I grabbed a Python parser from the web, written by Jonathan Riehl ( it doesn’t seem to be available anymore ). It was a pure Python parser following closely the example of CPython’s pgen. The implementation was very dense and the generated parser probably as fast as a Python parser could be. It was restricted to LL(1) though which was a severe limitation when I stepped deeper into the framework.

In mid 2007 I created a parse tree checker. The problem was that a parse tree could be the return value of a transformation of another parse tree: T : P -> P. How do we know that P is still compliant with a given syntax? This can be easily be solved by chasing NFAs of the target grammar, both horizontally i.e. within an NFA as well as vertically: calling checkers recursively for each parse tree node which belong to a non-terminal. This checker generator was only a tiny step apart from a parser generator which I started to work on in summer 2007.

What I initially found when I worked on the parse tree checker was that horizontal NFA chasing never has to take into account that there are two alternative branches in rules like this

R: a* b | a* c

R: a* ( b | c )

From rules to grammars

The algorithm never checks out the first branch, runs through a sequence of ‘a’ until it hits ‘b’ and when this fails, jumps back and checks out the other branch. There was no backtracking involved, also no backtracking with memoization. There was simply never any jump. Instead both branches are traversed simultaneously until they become distinct. It’s easy to express this on grammar level by applying left factoring to the ruleHowever there was never any rule transformation to simplify the problem.

It’s actually an old approach to regular expression matching which is attributed to Ken Thompson. Russ Cox refreshed the collective memory about it a few years ago. This approach never seemed to make the transition from regular expressions to context free grammars – or it did and was given up again, I don’t know. I wanted a parser generator based on the algorithms I worked out for parse tree checkers. So I had to invent a conflict resolution strategy which is specific for CFGs. Take the following grammar

R: A | B A: a* c B: a* d

A

B

Again we have two branches, marked by the names of the non-terminalsandand we want to decide late which one to choose.

First we turn the grammar into a regular expression:

R: a* c | a* d

R: a* c <A>| a* d <B>

Reconstruction

but now we have lost context/structural information which needs to be somehow added:The symbols and do not match a character or token. They merely represent the rules which would have been used when the matching algorithm scans beyond ‘c’ or ‘d’. So once the scanner enters it will be finally decided that rule A was used. The same is true for . Our example grammar is LL() and in order to figure out if either A or B is used we need, in principle at least, infinite lookahead. This hasn’t been changed through rule embedding but now we can deal with the LL() grammar as-if it was an LL(1) grammar + a small context marker.

What is lacking in the above representation is information about the precise scope of A and B once they are embedded into R. We rewrite the grammar slightly by indexing each of the symbols on the RHS of a rule by the name of the rule:

R: A[R] | B[R] A: a[A]* c[A] B: a[B]* d[A]

R: a[A]* c[A] <A[R]>| a[B]* d[B] <B[R]>

{a[A], a[B]}, {a[A], a[B]}, {d[B]}, {<B[R]>}

Now we can embed A and B into R while being able to preserve the context:Matching now the stringyields the following sequence of sets of matching symbols:All of the indexed symbols in a set matches the same symbol. The used index has no impact on the matching behavior, so a[X], a[Y], … will alway match

Constructing a parse tree from the above set-sequence is done by reading the sequence from right to left and interpret it appropriately.

We start the interpretation by translating the rightmost symbol

<B[R]> -> [R,[B, .]]

[R, [B, .]]

[R, [B, ., d]]

[R, [B, ., a, d]]

[R, [B, ., a, a, d]]

[R, [B, a, a, d]]

Drawbacks

The dot ‘.’ is a placeholder for a sequence of symbols indexed with B. It remains adjacent to B and is removed when the construction is completed:

We can read the embedding process as ’embed rules A and B into R’ or dually ‘expand R using rules A and B’. I’ve chosen the latter expression for the Trail parser generator because an expanded rule R has its own characteristics and is distinguished from an unexpanded rule.

The drawback of this method is that its implementation turns out to be rather complicated. It is also limited because it may run into cyclic embeddings which need to be detected. Finally successive embeddings can blow up the expanded rule to an extent that it makes sense to artificially terminate the process and fall back to a more general and less efficient solution. So we have to mess with it. Finally isn’t there are performance penalty due to the process of reconstruction?

Performance

To my surprise I found that an LL(*) grammar that uses expansion quite heavily ( expanded NFAs are created with about 1000 states ) performs slightly better than a simple LL(1) grammar without any expansion in CPython. For comparison I used a conservative extension language P4D of Python i.e. a superset of Python: every string accepted by Python shall also be accepted by P4D.

In order to measure performance I created the following simple script

import time import decimal import langscape text = open ( decimal .__file__. replace ( ".pyc" , ".py" ) ) . read ( ) print "textlen" , len ( text ) python = langscape. load_langlet ( "python" ) p4d = langscape. load_langlet ( "p4d" ) def test ( langlet ) : tokens = langlet. tokenize ( text ) a = time . time ( ) for i in range ( 10 ) : langlet. parse ( tokens ) tokens. reset ( ) print "parser" , langlet. config . langlet_name , ( time . time ( ) - a ) / 10 test ( python ) test ( p4d )

It imports a reasonably big Python module ( decimal.py ) and parses it with two different parsers generated by Trail. Running it using CPython 2.7 yields the following result:

parser python 2.39329998493 parser p4d 2.25759999752

This shows that P4D is about 5% faster on average! Of course the overall performance is abysmal, but keep in mind that the parser is a pure Python prototype implementation and I’m mainly interested in qualitative results and algorithms at this point.

I’ve also checked out the script with PyPy, both with activated and deactivated JIT.

PyPy with option –JIT off:

parser python 6.5631000042 parser p4d 5.66440000534

Now the LL(*) parser of P4D is about 13-14 % faster than the LL(1) parser, which is much clearer. Activating the JIT reverses the pattern though and intense caching of function calls will pay of:

PyPy with JIT:

parser python 0.791500020027 parser p4d 1.06089999676

Here the Python parser is about 1/3 faster than the P4D parser.

The result of the competition depends on the particular implementation and the compiler/runtime optimizations or the lack thereof. The counter-intuitive result that an LL(*) parser is faster than an LL(1) parser could not be stabilized but also not clearly refuted. It’s still an interesting hypothesis though and rule expansion may turn out to be a valid optimization technique – also for LL(1) parsers which do not require it as a conflict resolution strategy. I will examine this in greater detail once I’ve implemented an ANSI C version of Trail.