Marpa: Parsing Revolution

While ago I wrote that PyPy threw my projects bottom-up. This time it was the marpa library and the algorithm it pivots on.

Similar to my LR&GLL studies I were looking for parsing techniques that could be applied across applications such as compilers and IDEs. Though this time I didn't expect that one single parser could cover every case I have.

Getting used to the idea of writing this stuff in python, I finished writing an Earley variant of a parser very similar to marpa. I didn't execute this with top score, but I still ended up with much more than I anticipated for.

The algorithm I used originates from "Practical Earley Parsing (Aycock & Horspool 2002)", additionally I tried to implement Leo's algorith, but I only got it to work if I use my parser as a recognizer. I'm sure I would get it to work once I actually need right recursive grammars.

Key to projectional editing

After I got the parser to work, I ended up thinking it would solve the user interface problems I had with structure editing. Since my previous editor didn't entirely fit the design of the parser, I decided to pick some Qt4 code from the eco editor and apply it to create a testbench. The little list editor implements structure editing in much smaller form and without all the features that'd be crucial for production ready editor that textended was supposed to be.

I picked relatively simple approach to embedding the parser. Editing operations would destroy the structure leaving flags around for parser to detect and reconstruct into semantically valid structures. Every text cell user filled would be classified into token classes in the grammar. This turned out to work surprisingly well. Though, it breaks down on ambiguous grammars and does not always produce entirely expectable results. The source code turned to resemble remarkably lot for the description of incremental parsing in harmonia research project.

Here's the key to plausibly good structure editing, but the discovery itself is demotivating me to look further into the subject, because it greatly extends what I can do with plain text parsing.

Pyllisp parsing engine

I'm writing to replace a hand-written parser in pyllisp with bytecode and the same parsing engine I used in the little list editor. I'm going to shape it into a grammar file such as this one:

file => expr expr => binop: expr plus:"+" symbol expr => symbol

And into a module that uses the grammar, which will look something like this:

import grammarlang def post_binop(env, lhs, op, rhs): pass # will contain code to # compile bytecode for # a binary operation. parser = grammarlang.load_parser("pyllisp.grammar") def build(source): env = Env() parser(globals(), env, source) return env.close()

It will let me design a language similar to the language pyllisp already parsed, but I've got far much greater control over how to extend the language. In fact this kind of construct lets me design the grammar and the compiler in sync! This is a kind of revolutionary find:

It's more pleasant for the user, because if my parser recognizes the structure, then it means that the structure is compiled.

The grammar may be decomposed into pieces, to allow users to be gradually taught into using the language. It will help tremendously with documentation and teaching. For example, the teacher may intentionally shadow some of the grammar rules, so he won't cut cornels with the students.

I may implement the bytecode compiler in plain python instead of rpython and then translate it to pyllisp when appropriate. It's feasible because the grammar file remains exactly same and lets easier reuse of the original compiler if it's ever needed.

The actual compiling part of the language can entirely concentrate on compiling, because the parser does the full job and doesn't leave ambiguities between anything such as left-hand and right-hand expressions in the assignment expression.

I can implement about every grammar rule I've wished for as long as they're not ambiguous together. This may seem minor thing unless you realise how many unambigous and nondeterministic grammars you could create. You need deterministic grammar when you LR/LL or hand-parse without backtracking.

Once I have implemented it in pyllisp, I can support customized languages and extensions into the grammar. Bit similar to how Racket is doing it, except that you get to implement such grammars and modules, that may be used interchangeably.

The syntax files might also allow code generation according to the attributes. At least translation between language might be possible if they use the same attributes with same meanings.

I can probably come up with some more bullet points, but I just got tired from writing them up.