Quick summary: Brzozowski's derivative

Brzozowski defined the derivative of a regular language in 1964.

The derivative of a language L with respect to a character c, written D_c(L), works in two phases--filter and chop:

Find all of the strings in L that start with c. Chop the c off all those strings.

Formally, D_c(L) = { w | cw ∈ L }.
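
On a finite language represented as a plain set of strings, the filter-and-chop reading can be sketched directly (a Python illustration of the definition, not code from the post):

```python
# Derivative of a finite language, literally "filter, then chop":
# keep the strings beginning with c, and drop that first character.
def derive(c, language):
    return {w[1:] for w in language if w.startswith(c)}

print(derive("f", {"foo", "frak", "bar"}))  # {'oo', 'rak'} (set order varies)
```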

For example, D_f((foo|frak|bar)*) = (oo|rak)(foo|frak|bar)*.

Matching with derivatives

It's easy to use the derivative to recognize whether a string is in a language.

Just compute the derivative with respect to the first character, then the second, then the third and so on.

If the final language contains the empty string, then the original string was in the original language. That's all there is to it.
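
On the finite-set representation, the whole matching loop is a few lines (again a Python illustration, not the post's code):

```python
def matches(word, language):
    # Derive with respect to each character in turn; accept iff the
    # final language contains the empty string.
    for c in word:
        language = {w[1:] for w in language if w.startswith(c)}
    return "" in language

print(matches("frak", {"foo", "frak", "bar"}))  # True
```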

For example, D_o(D_o(D_f((foo|frak|bar)*))) = (foo|frak|bar)*, which clearly matches the empty string.

Mathematical notation for regular expressions

We'll use mathematical notation for regular expressions.

If you're used to Unix-style regular expressions, here's the translation guide:

Math    Unix            Meaning
∅       no equivalent   empty set; matches no strings
ε       no equivalent   empty/null string; matches ""
c       c               matches character c
A ○ B   AB              concatenation/sequence
A ∪ B   A|B             union/alternative/choice

In examples, we'll still use juxtaposition to denote concatenation.

The derivative for regular languages

The useful property of the derivative from an implementation point of view is that it has a recursive definition over regular languages:

D_c(∅) = ∅

D_c(ε) = ∅

D_c(c) = ε

D_c(c') = ∅ if c is not c'

D_c(A ○ B) = D_c(A) ○ B if A does not contain the empty string

D_c(A ○ B) = D_c(A) ○ B ∪ D_c(B) if A contains the empty string

D_c(A ∪ B) = D_c(A) ∪ D_c(B)
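
The recursive definition transcribes almost line for line. Here's a minimal sketch in Python (my own encoding, not the paper's code; the star case D_c(A*) = D_c(A) ○ A* is added for the running example):

```python
# Languages as tagged tuples; derive and nullable follow the equations
# case by case.
EMPTY = ("empty",)   # the empty set, ∅
EPS = ("eps",)       # the empty-string language, ε

def char(c):   return ("char", c)
def cat(a, b): return ("cat", a, b)   # A ○ B
def alt(a, b): return ("alt", a, b)   # A ∪ B
def star(a):   return ("star", a)     # A*

def nullable(l):
    tag = l[0]
    if tag in ("empty", "char"): return False
    if tag in ("eps", "star"):   return True
    if tag == "cat": return nullable(l[1]) and nullable(l[2])
    if tag == "alt": return nullable(l[1]) or nullable(l[2])

def derive(c, l):
    tag = l[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "char":
        return EPS if l[1] == c else EMPTY
    if tag == "alt":
        return alt(derive(c, l[1]), derive(c, l[2]))
    if tag == "cat":
        if nullable(l[1]):   # A contains the empty string
            return alt(cat(derive(c, l[1]), l[2]), derive(c, l[2]))
        return cat(derive(c, l[1]), l[2])
    if tag == "star":        # D_c(A*) = D_c(A) ○ A*
        return cat(derive(c, l[1]), l)

def word(w):
    # A literal word as a chain of single-character languages.
    r = EPS
    for ch in reversed(w):
        r = cat(char(ch), r)
    return r

def matches(l, s):
    for c in s:
        l = derive(c, l)
    return nullable(l)

# (foo|frak|bar)*
r = star(alt(alt(word("foo"), word("frak")), word("bar")))
print(matches(r, "foofrakbar"), matches(r, "fo"))  # True False
```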

The derivative for context-free languages

Since regular expressions aren't recursive structures, the naive implementation of the derivative terminates.

That's not the case for context-free grammars, which might be described as "recursive regular expressions."

Consider the classic context-free grammar, the language of balanced parens:

S = S ( S ) ∪ ε

Consider the derivative with respect to an open paren:

D_( (S) = [D_( (S)] ( S ) ∪ S )

It's infinitely recursive.

Fine for math. Bad for implementation.

Derivatives from laziness, memoizing and fixed points

Fortunately, it's not hard to stop this infinite recursion.

If we compute lazily, then the derivative terminates.

But, because the grammar is left-recursive, taking another derivative will force the computation when it tries to check whether the derived language contains the empty string.

So, that nullability check causes non-termination.

But, that's easy to fix too: just memoize the derivative.

In fact, the hardest part about computing the derivative is figuring out whether or not a language contains the empty string. The definition for nullability, δ(L), is also structurally recursive:

δ(∅) = false

δ(ε) = true

δ(c) = false

δ(A ○ B) = δ(A) and δ(B)

δ(A ∪ B) = δ(A) or δ(B)

Laziness and memoization don't work here.

Instead, the function δ has to be computed as a least fixed point.
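
For a concrete picture, here's a Python sketch (my own code, not the paper's) of that least fixed point for the balanced-paren grammar S = S ( S ) ∪ ε: start every nonterminal at false (the bottom element) and iterate until nothing changes.

```python
# Nullability by Kleene iteration: productions map nonterminals to lists
# of alternative sequences; terminals are tagged ('t', c).
def nullable_map(grammar):
    nullable = {n: False for n in grammar}   # bottom: nothing nullable
    changed = True
    while changed:
        changed = False
        for n, prods in grammar.items():
            # A nonterminal is nullable if some production is a sequence
            # of nullable symbols (terminals are never nullable).
            value = any(all(nullable[s] if isinstance(s, str) else False
                            for s in prod)
                        for prod in prods)
            if value and not nullable[n]:
                nullable[n] = True
                changed = True
    return nullable

# S = S ( S ) ∪ ε
grammar = {"S": [["S", ("t", "("), "S", ("t", ")")], []]}
print(nullable_map(grammar))  # {'S': True}
```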

So, even though Brzozowski defined the derivative with regular languages in mind, it works unaltered for context-free languages if you toss in laziness, memoization and fixed points.

The draft describes two ways to generate parse forests from this idea.

Origins and motivation

The inspiration for the draft came from my advanced compilers course.

My teaching assumes that to understand is to implement.

For instance, to teach lexical analysis, I had the class implement a lex-like DSEL using Brzozowski's derivatives.

When we moved to parsers, I wanted to have the class build a DSEL that provides the functionality of a tool like ANTLR or yacc.

Wanted: An easy way to do general parsing

I'm against parsing tools that don't protect the user from surprises.

LALR(1) generators are out because they don't accept all grammars.

(Imagine a regex library that didn't accept all regexes!)

Adding shift/reduce annotations to a specification feels wrong to me.

You don't have to know how a car works to drive one.

Why shouldn't we expect the same usability out of parsing tools?

What about parser combinators?

Parser combinators are great because they're in the language itself. That avoids the transaction cost of setting up and learning an external tool.

That lowered cost alone makes people more likely to use them instead of hopelessly flogging regular expressions beyond their design limits.

But, parser combinators are tricky to implement if you want to avoid surprises and handle things like left recursion.

I needed something undergrads could learn and do in a week.

(If you've taken my classes, you know I consider ease of implementation one of the most important attributes of an algorithm.)

Derivatives to the rescue?

It wasn't initially obvious to me that derivatives could work for parsing, or that they would enable the abbreviated implementation of a parsing library.

Eventually David and I came to the conclusion that derivatives actually do work. In fact, there are two distinct ways to make them work!

One approach applies the derivative to parser combinators; the other approach applies the derivative to CFGs to create a small-step parsing machine that emits parse strings as it runs.

David produced a working implementation in Haskell in a couple days.

It took me about three to do it in Scala.

What shocked us was just how simple and flexible the implementation was.

It seemed absurd that in a couple hundred lines of code, we could be generating parse forests for any CFG--left recursive, right recursive or even infinitely recursive.

It was such a fun experience that we felt like sharing it with the community.

So, the draft was born.

ESOP rejection

The draft was rejected by ESOP 2010.

The reviews are not cursory, and they contain good points.

To summarize, the three main complaints about the paper are:

1. Failure to qualitatively compare with previous work in parsing.
2. Failure to characterize the complexity of the algorithms.
3. Failure to benchmark the implementations against existing tools.

These are fair criticisms of the paper.

An implicit failure of the paper is that we didn't convince the peer reviewers that this was a fun, easy way to implement a general parsing library.

One year later: Posted on arXiv

My research area is static analysis, so I didn't have a lot of time to devote to resuscitating a paper on parsing.

But, David needed to cite it for his Ph.D. school applications this year, so we tossed it on arXiv as it stood when it was rejected a year ago.

Had we known the attention it was about to receive, we'd have incorporated reviewer feedback and updated the paper with what we've learned since.

What we've learned

Over the past year, even in the little time we've had to work on the paper, we've learned a lot more about parsing with derivatives.

In the week after the community found it, you all taught us ten times more than that. Thank you!

To highlight just a few interactions:

It's been an inspiring (if accidental) experiment in "naked science."

Naive complexity: Exponential

We had inklings that the worst-case complexity of the naive implementation was exponential, and we've since been able to build pathological grammars that exhibit exponential behavior.

It is possible to double the size of the grammar with each derivative.

But, since exponentiality wasn't a problem for us in practice, we haven't been overly concerned with it.

As Alan Perlis once remarked, "for every polynomial-time algorithm you have, there is an exponential algorithm that I would rather run."

Parsing with derivatives might just be one of those exponential algorithms.

Average complexity: Linear?

Daniel Spiewak's model for complexity states that the cost of parsing with the derivative is equal to the sum of the cost of the n derivatives taken during parsing.

The cost of an individual derivative is bounded by the size of the grammar. (In practice, the cost is roughly constant; it takes a pathological grammar to impact every single nonterminal.)

Let G_i be the ith grammar to appear during parsing.

Under Daniel's model, the cost of parsing is proportional to:

|G_0| + |G_1| + |G_2| + ... + |G_n|

During our measurements, we found that the derived grammars remain roughly the same size as the original grammar after compaction, regardless of how many derivatives are taken.

Since the cost of compaction is also bounded by the size of the grammar, the average complexity of parsing with derivatives is O(n|G|).

Optimized complexity: Cubic?

Our intuition has always been that with the right caching and structural reductions, parsing with derivatives ought to be no worse than cubic in the worst case.

In practice, we've found it's easier to exploit structural sharing in the parsing machine approach than in the parser combinator approach, but our intuition says that whatever can be done for one is doable for the other.

David has developed a powerful approach to pruning, weeding and compacting derived grammars using a version of the algorithm written in continuation-passing style.

He's now working on a new version of that algorithm that uses zippers and fixed points; our intuition tells us that this approach will perform the fewest possible number of grammar reductions to produce a parse tree.

Daniel Spiewak believes it can be made cubic with a different technique.

Latest implementations

There are several implementations now:

David's latest Haskell version, with many unpublished optimizations, and the ability to print out intermediate grammars as dot files;

a newer Scala version I wrote about six months ago, which interacts with a derivative-based lexer I also wrote;

a Racket version I wrote on Sunday;

a Common Lisp implementation by Gregory Pfeil;

a very small Scala implementation by Daniel Spiewak; and

an implementation in Python that uses decorators.

As usual, anything David or I have done is released under the CRAPL.

If you create an implementation, please let me know.

Implementation in Racket

The Racket implementation is meant to be instructive. By default, it omits even the simple optimizations made in the original Scala implementation.

The implementation is less than 240 lines of commented code.

It is the most faithful to the paper, because I added define/memoize and define/fix forms using macros. (The implementation is 130 lines without the definitions of these forms.)

All of the side effects hide behind these two forms, which export a purely functional interface, so in some sense, this implementation is also purely functional.

Some complained that parsing with derivatives uses "too much" laziness to be practical in other languages, so I used explicit promises to implement laziness in Racket, and then, only where strictly necessary.

This version, for instance, does not compute lazy parse forests.

It returns all parse trees at once.

The parse-derive procedure looks exactly like the math:

(define/memoize (parse-derive c l)
  #:order ([l #:eq] [c #:equal])
  (match l
    [(empty)     (empty)]
    [(eps)       (empty)]
    [(token pred class)
     ; =>
     (if (pred c) (eps* (set c)) (empty))]
    [(orp l1 l2)
     ; =>
     (alt (parse-derive c l1)
          (parse-derive c l2))]
    [(seqp (and (nullablep?) l1) l2)
     ; =>
     (alt (cat (eps* (parse-null l1)) (parse-derive c l2))
          (cat (parse-derive c l1) l2))]
    [(seqp l1 l2)
     ; =>
     (cat (parse-derive c l1) l2)]
    [(redp l f)
     ; =>
     (red (parse-derive c l) f)]))

The #:order keyword tells the memoizer in which order to cache the arguments, and which equality test to use.

The nullability procedure appears to be infinitely recursive:

(define/fix (nullable? l)
  #:bottom #f
  (match l
    [(empty)      #f]
    [(eps)        #t]
    [(token _ _)  #f]
    [(orp l1 l2)  (or (nullable? l1) (nullable? l2))]
    [(seqp l1 l2) (and (nullable? l1) (nullable? l2))]
    [(redp l1 _)  (nullable? l1)]))

But, define/fix saves it by computing the least fixed point if it detects the function recurring over a cyclic graph instead of a tree. The #:bottom argument specifies where to begin the fixed point computation.
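
To see the trick outside Racket, here's a hypothetical Python analogue of a define/fix-style decorator (names and structure are mine, not the Racket implementation): recursion into a node already on the current path answers with a cached value, initially the bottom, and the outermost call re-runs the body until the answers stop changing.

```python
def fix(bottom):
    def wrap(f):
        cache = {}
        visited = set()
        changed = [False]
        running = [False]
        def fixed(x):
            if running[0]:
                if x in visited:
                    # Recursion over a cycle: answer with the current
                    # approximation instead of looping forever.
                    return cache.get(x, bottom)
                visited.add(x)
                old = cache.get(x, bottom)
                new = f(x)
                if new != old:
                    cache[x] = new
                    changed[0] = True
                return new
            # Outermost call: iterate to the least fixed point.
            running[0] = True
            try:
                while True:
                    changed[0] = False
                    visited.clear()
                    result = fixed(x)
                    if not changed[0]:
                        return result
            finally:
                running[0] = False
        return fixed
    return wrap

# Nullability over a cyclic grammar graph: S = S ( S ) ∪ ε.
class Eps: pass
class Tok:
    def __init__(self, c): self.c = c
class Alt:
    def __init__(self): self.l1 = self.l2 = None
class Cat:
    def __init__(self, l1, l2): self.l1, self.l2 = l1, l2

@fix(False)
def nullable(l):
    if isinstance(l, Eps): return True
    if isinstance(l, Tok): return False
    if isinstance(l, Alt): return nullable(l.l1) or nullable(l.l2)
    if isinstance(l, Cat): return nullable(l.l1) and nullable(l.l2)

S = Alt()
S.l1 = Cat(S, Cat(Tok("("), Cat(S, Tok(")"))))  # S ( S )
S.l2 = Eps()                                     # ε
print(nullable(S))  # True
```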

The auxiliary procedure parse-null parses a grammar with respect to the empty string, also using a fixed point:

(define/fix (parse-null l)
  #:bottom (set)
  (match l
    [(empty)     (set)]
    [(eps* S)    S]
    [(eps)       (set l)]
    [(token _ _) (set)]
    [(orp l1 l2) (set-union (parse-null l1) (parse-null l2))]
    [(seqp l1 l2)
     (for*/set ([t1 (parse-null l1)]
                [t2 (parse-null l2)])
       (cons t1 t2))]
    [(redp l1 f)
     (for/set ([t (parse-null l1)])
       (f t))]))

The implementation of parse is short and sweet:

(define (parse l s)
  (cond
    [(stream-null? s) (parse-null l)]
    [else (parse (parse-derive (stream-car s) l)
                 (stream-cdr s))]))

Ambiguity: Not a problem

Russ Cox claimed in his post that the following ambiguous grammar would blow up our implementation:

S ::= S + S | 1

His argument was that our algorithm was somehow doing backtracking.

When we tried it out (valid and invalid inputs), it worked fine:

(define good-input '(N + N + N ... + N))
(define bad-input  '(N + N + N ... + + N))

(display (format "good: ~s~n" (length good-input)))
(display (format "bad: ~s~n" (length bad-input)))

(time (recognize? S (list->stream good-input)))
(time (recognize? S (list->stream bad-input)))

returns:

good: 101
bad: 102
cpu time: 73 real time: 76 gc time: 0
#t
cpu time: 43 real time: 44 gc time: 0
#f

Times are measured in milliseconds.

Update: To be clear, Russ is right that trying to non-lazily compute all the parse trees is inherently exponential.

What this example shows is that derivatives can still accept/reject without considering all possible parse trees.

And, if parse trees are computed lazily, you don't have to consider them all.

(Or, you can use Ruzzo's 1979 algorithm to convert a recognizer into a parser with only a logarithmic penalty in the size of the input.)

Parsing with derivatives is different enough from other methods that in order to understand it, you really have to play around with it.

Even my intuitions are still trumped by experience with the implementation.

The Racket and Haskell implementations each include a dot-file renderer that allows insight-building visualizations of derived grammars.

Compaction

The biggest development since the draft is compaction.

Compaction shrinks the size of the grammar, and it can eliminate recursive references which would otherwise stick around, hogging memory.

The draft does mention simplifying reductions on grammars.

For example, ∅ ○ A = ∅ and B ∪ ∅ = B.
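
These reductions are easy to sketch. Here's a recognition-only illustration in Python (my own tuple encoding, not the Racket compactor, which must also preserve parse trees via reductions):

```python
# Simplifying reductions applied bottom-up: ∅ annihilates concatenation
# and disappears from unions; ε collapses out of concatenation (safe
# when only recognizing, since no parse trees are being tracked).
EMPTY = ("empty",)   # ∅
EPS = ("eps",)       # ε

def simplify(l):
    tag = l[0]
    if tag == "cat":
        a, b = simplify(l[1]), simplify(l[2])
        if a == EMPTY or b == EMPTY:
            return EMPTY          # ∅ ○ A = ∅  and  A ○ ∅ = ∅
        if a == EPS:
            return b              # ε ○ B = B
        if b == EPS:
            return a              # A ○ ε = A
        return ("cat", a, b)
    if tag == "alt":
        a, b = simplify(l[1]), simplify(l[2])
        if a == EMPTY:
            return b              # ∅ ∪ B = B
        if b == EMPTY:
            return a              # B ∪ ∅ = B
        return ("alt", a, b)
    return l

print(simplify(("alt", ("cat", EMPTY, ("char", "a")), ("char", "b"))))
# ('char', 'b')
```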

If these simplifying reductions are performed recursively and memoized, you end up with a tight compactor for grammars:

; Note: nullp matches languages which are exactly
; the empty string--not languages which just contain
; the empty string.  nullablep matches those.
(define/memoize (compact [l #:eq])
  (match l
    [(empty)     l]
    [(eps)       l]
    [(emptyp)    (empty)]
    [(nullp)     (eps* (parse-null l))]
    [(token p c) l]
    [(orp (emptyp) l2) (compact l2)]
    [(orp l1 (emptyp)) (compact l1)]
    [(seqp (nullp t) l2)
     (red (compact l2) (lambda (w2) (cons t w2)))]
    [(seqp l1 (nullp t))
     (red (compact l1) (lambda (w1) (cons w1 t)))]
    [(orp l1 l2)  (alt (compact l1) (compact l2))]
    [(seqp l1 l2) (cat (compact l1) (compact l2))]
    [(redp (and e (nullp)) f)
     ; =>
     (eps* (for/set ([t (parse-null e)]) (f t)))]
    [(redp (seqp (nullp t) l2) f)
     ; =>
     (red (compact l2) (lambda (w2) (f (cons t w2))))]
    [(redp (redp l f) g)
     ; =>
     (red (compact l) (lambda (w) (g (f w))))]
    [(redp l f) (red (compact l) f)]))

Interleaving compaction and derivation keeps grammars roughly the same size throughout parsing.

Without compaction, these are the sizes of the grammar (and memory usage) after each derivative on a simple list of about 100 tokens:

size: 18; mem: 35411644
size: 34; mem: 35425164
size: 56; mem: 35443372
size: 84; mem: 35469044
size: 118; mem: 35504188
size: 158; mem: 35553100
size: 204; mem: 35611372
// ... about 60 derivatives later.
size: 261650; mem: 129317020
size: 272359; mem: 133152924
size: 283338; mem: 137075028
size: 294590; mem: 141042592
size: 306118; mem: 144652724
size: 317925; mem: 148514052
size: 330014; mem: 152762344
size: 342388; mem: 157118524
size: 355050; mem: 161558660
^Cuser break

I had to quit after about 75 derivatives, because my machine was locked up.

With compaction turned on, these are the sizes of the grammar (and memory usage) after each derivative:

size: 12; mem: 217035240
size: 15; mem: 217079664
size: 18; mem: 217115032
size: 21; mem: 217154032
size: 24; mem: 217202808
size: 27; mem: 217256200
size: 30; mem: 217317536
size: 33; mem: 217392216
size: 36; mem: 217471024
size: 39; mem: 217565616
size: 42; mem: 217657380
size: 45; mem: 217760612
size: 43; mem: 217877292
size: 40; mem: 218010620
size: 37; mem: 218113580
size: 34; mem: 218194388
size: 31; mem: 218271924
size: 28; mem: 218337300
size: 25; mem: 218406196
size: 22; mem: 218455836
size: 19; mem: 218500164
size: 16; mem: 218536804
size: 16; mem: 218589524
size: 17; mem: 218640748
size: 17; mem: 218703388
// ... (stays 17 for 94 more derivatives)
size: 17; mem: 222544772
size: 17; mem: 222585892
size: 19; mem: 222620636
size: 22; mem: 222661852
size: 25; mem: 222707660
size: 23; mem: 222758316
size: 20; mem: 222810180
size: 17; mem: 222849524
size: 13; mem: 222897964

And, this time, it returned interactively.

Derivatives not only add a lot of structure to a grammar; they also invalidate a lot of structure, making it available for simplification.

Since the net gain or loss in nodes is usually small, this suggests optimizations that reuse recently deactivated nodes.

In particular, the start node often appears only at the top level.

So, much of the time, it should be safe to destructively transform it into the next derivative, thereby avoiding allocation and deallocation altogether.

A port of this technique to non-garbage-collected languages will certainly want to exploit this fact.

Related work

Total Parser Combinators by Nils Danielsson appeared at ICFP 2010.

Danielsson also uses Brzozowski's derivative, but somewhat differently, and toward a different end: proving termination for parsers.

The key differences between our work and Danielsson's are:

- Danielsson's approach requires manual annotation of grammars with delays (sharps) and forces (flats) to guarantee termination. We are viscerally against annotations, like shift/reduce or sharp/flat, that require the user to understand how a parsing library was implemented. We require no such annotations.

- Danielsson doesn't handle all grammars. (Some kinds of left recursion don't work.) We handle all grammars, even infinitely recursive ones.

- Danielsson doesn't simplify or compact derived grammars, leading to terrible performance in practice.

- Because of sharp/flat annotations, Danielsson's definition of the derivative for sequences is markedly more complex.

One ESOP reviewer, commenting on what was then an unpublished draft of what would later evolve into Danielsson's submission to ICFP, noted that there should be room for both papers should Danielsson's be published. We fully agree, which is why we'll be trying again with ours.

How you can help kill yacc

What we need now are more implementations in other languages and "realistic" benchmarks in those implementations.

If you put one together or build some benchmarks, please send it our way.

In our own implementations, we've noticed that execution becomes dominated by garbage collection time as inputs grow larger.

Our diagnosis is a lack of sharing among derived grammars.

The same sub-grammar will reappear frequently during parsing, but we fail to discover this automatically. Instead, our implementations duplicate and release lots of tiny data structures, which stresses the garbage collector.

We (David, me, and now also Daniel Spiewak) are working on several solutions to the "sharing" issue, but we'd love to hear from you if you have an idea as well.

More resources