Earley parsing in Idris (Part 1)

In this post series we write a certified Earley recognizer, and I tell how it went. I would have preferred not to write a multi-part blog post, but this time it seems I have to.

In this first post we find out what we have to prove and write an uncertified Earley recognizer in Idris.

Programming in a real dependently typed language can be gratifying: it gives you programs with a certainty that they do what you expect them to do, to the point that you know they work. This experience of "finishing" is usually absent when programming in C# or Python. Especially in C#.

In addition the programming environment is capable of guiding your programming.

But I have to tell you one thing first.

Idris does suck

Idris can really get pissy when you're working with it. It has a lot of sharp corners, and it's still more of a programming language than a theorem proving system.

The idris-vim plugin runs for a few seconds before producing results. Every time you ask it to complete some statement, be prepared to wait a couple of seconds. The plugin is especially useful because it can fill in patterns and show the type environment at unfilled holes. The frustrating part is that because the system is so useful, you actually end up waiting for its results most of the time.

Most of the time the autocompletion features in Idris produce exactly what you want, but give them anything more complex and you get garbage.

Eventually, when working with code, you end up with expressions Idris cannot unify, and it doesn't handle this well. This usually happens when you're querying for the environment: instead of the environment you get a message that unification failed, and it's tedious to pinpoint from the error what actually happened.

Some language features, like implicits, can be incredibly useful, but they break in weird, quirky ways. Also, why can't I return implicit proofs?

On the other hand, Idris doesn't have built-in sets, graphs, maps or relation types with provable properties. That's odd, because these are common data structures you work with over and over again.

Also, you can't reorder the definitions in your module as you'd like; each definition has to come before its uses. I don't like the compiler deciding the order of my definitions.

Idris is unable to represent the idea that two values share the same constructor. This makes it incredibly annoying to write the contradictions that are important for writing correct software. Just look at the decEq later in this post!

Installing Idris was a frigging little quest of its own.

Ok. On to the code.

Describing what a recognizer is

A recognizer is a program that takes in:

- A grammar
- Accept symbol(s)
- Input symbols

Then it tells you whether the input symbols can be generated from the grammar, starting from one of the given symbols.

In Idris we could describe this kind of a program like this:

recognize : (g:Grammar s) -> (lhs:s) -> (input:List s) -> Dec (Generates g lhs input)

The Dec (Generates g lhs input) means that the program is only correct if it either proves or refutes that the string is generated by the grammar.

If we can fill in this function, it means the program works correctly.
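As a sketch of how such a recognizer would be consumed, assuming we eventually implement the Dec-returning recognize above (the helper name isGenerated is mine, not part of the module): a caller can always collapse the decision into a plain Bool, but unlike a Bool-returning recognizer, the Yes/No branches carry actual evidence.

```idris
-- Hypothetical helper, assuming `recognize` with the Dec type above exists.
isGenerated : (g:Grammar s) -> (lhs:s) -> (input:List s) -> Bool
isGenerated g lhs input = case recognize g lhs input of
  Yes prf   => True   -- prf : Generates g lhs input
  No contra => False  -- contra : Generates g lhs input -> Void
```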

But first we have to get some sort of useful definitions of grammars and what it means to generate strings with them.

Definition of context-free grammars

Context-free grammars are a bunch of production rules. Each rule has one symbol on the left-hand side and some symbols on the right-hand side. I require that production rules are non-empty, as it makes the Earley algorithm simpler.

infixl 10 ::=

data Rule : Type -> Type where
  (::=) : s -> (xs:List s) -> {auto nonempty:NonEmpty xs} -> Rule s

total Grammar : Type -> Type
Grammar s = List (Rule s)

Non-empty production rules are not a bad limitation, as you can work around it by rewriting grammars and mapping between them.
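To illustrate the workaround with a hypothetical example (these symbols are mine, not the test grammar used later): a grammar like "Start -> Opt Tok, Opt -> ε | Tok" has an empty alternative, but substituting the nullable Opt away yields an equivalent grammar where every rule is representable:

```idris
data Sym = Start | Opt | Tok

-- "Start -> Opt Tok | Tok, Opt -> Tok": the empty alternative of Opt has
-- been folded into an extra alternative for Start, so no rule is empty.
rewritten : Grammar Sym
rewritten = [ Start ::= [Opt, Tok], Start ::= [Tok], Opt ::= [Tok] ]
```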

Some small utility functions help when working with production rules. Mainly we want to take rules apart and apply the nonempty property.

lhs : Rule s -> s
lhs (x ::= xs) = x

rhs : Rule s -> List s
rhs (x ::= xs) = xs

rhs_not_empty : (r:Rule s) -> NonEmpty (rhs r)
rhs_not_empty ((::=) x xs {nonempty}) = nonempty

rhs_head : Rule s -> s
rhs_head (x ::= xs) = head xs

So then we'd like to understand: what does it mean for a grammar to generate some string?

mutual
  data Generates : Grammar s -> s -> List s -> Type where
    Trivial : Generates g s [s]
    Derive : Indexing r g -> Seq g (rhs r) ys
          -> {auto zl:(lhs r = lhs_r)}
          -> Generates g lhs_r ys

  data Seq : Grammar s -> List s -> List s -> Type where
    Nil : Seq g [] []
    (::) : Generates g s xs -> Seq g ss ys
        -> {auto zz:(xs ++ ys = zs)}
        -> Seq g (s::ss) zs

Expressions such as xs ++ ys and lhs r are neutral forms. They cause difficult-to-handle unification errors when they appear directly in a constructor's indices. To avoid this we enclose them in equality proofs that can be applied when needed.

The above system allows us to describe the relations between grammars and valid input strings:

- Trivial means that if we expect just a single symbol, and we get that symbol, then we accept it.
- Derive means that a symbol generates a string if the right-hand side of one of its production rules generates that string as a sequence.

A Seq represents a concatenation of generated strings.

Indexers

The code frequently ends up indexing lists. I made a bunch of abstractions to make working with these structures convenient.

An Indexer is a natural number that relates to some specific list in such a way that it can retrieve an item, or a tail, from the list.

data Indexer : (xs : List a) -> Type where
  Z : Indexer (x :: xs)
  S : Indexer xs -> Indexer (x :: xs)

Every nonempty list has a zero indexer. We could also prove that empty lists do not have any indexers, but I haven't needed it so far.

z_ndex : (list:List s) -> NonEmpty list -> Indexer list
z_ndex (_ :: _) IsNonEmpty = Z
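As an aside, the proof mentioned above, that empty lists have no indexers, is a quick impossibility analysis (my sketch, not in the module): neither constructor of Indexer can target the empty list.

```idris
-- Neither Z nor S can produce an Indexer [], since both constructors
-- target a list of the form (x :: xs).
no_indexer_for_empty : Indexer [] -> Void
no_indexer_for_empty Z impossible
no_indexer_for_empty (S _) impossible
```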

This means that rules also always have the first index.

rule_first_index : (r:Rule s) -> Indexer (rhs r)
rule_first_index r = z_ndex (rhs r) (rhs_not_empty r)

Next I describe a way to select items from a list such that we can refer to each item by its index. Note that the type doesn't guarantee that this one is correct.

lemma_select : {p:List a} -> (i:List a)
            -> (Indexer i -> Indexer p)
            -> (Indexer p -> Maybe b) -> List b
lemma_select [] _ _ = []
lemma_select (x :: xs) f g =
  let zs = lemma_select xs (f . S) g in
  case g (f Z) of
    Just z => z :: zs
    Nothing => zs

select : (p:List a) -> (Indexer p -> Maybe b) -> List b
select xs f = lemma_select xs (\a => a) f

Likewise, we have another definition that describes what it means to index with an indexer.

total ndex : {list:List a} -> Indexer list -> a
ndex {list = []} Z impossible
ndex {list = []} (S _) impossible
ndex {list = (a :: _)} Z = a
ndex {list = (_ :: xs)} (S x) = ndex {list=xs} x

data Indexing : a -> List a -> Type where
  Ndex : (ix:Indexer list) -> Indexing (ndex ix) list
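A few tiny sanity checks (my additions, not from the module): ndex with S Z retrieves the second element, the same indexer packages into an Indexing value, and select combined with ndex just reproduces the list.

```idris
-- Indexing the second element of a concrete list.
ndex_second : ndex {list=[10,20,30]} (S Z) = 20
ndex_second = Refl

-- The same indexer packaged as an Indexing relation.
second_in_list : Indexing 20 [10,20,30]
second_in_list = Ndex (S Z)

-- Selecting with the indexing function itself keeps every element.
select_all : select [10,20,30] (\ix => Just (ndex ix)) = [10,20,30]
select_all = Refl
```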

And finally, I have written a function that increments an index.

lemma_next_ndex : (xs:List a) -> (ys:List a)
               -> ((Indexer ys) -> Indexer xs)
               -> (Indexer ys) -> Maybe (Indexer xs)
lemma_next_ndex xs (y::[]) f Z = Nothing
lemma_next_ndex xs (y::(y2 :: ys)) f Z = Just (f (S Z))
lemma_next_ndex xs (y::ys) f (S s) = lemma_next_ndex xs ys (f . S) s

next_ndex : Indexer xs -> Maybe (Indexer xs)
next_ndex {xs} ix = lemma_next_ndex xs xs (\a => a) ix

Note that definitions aren't necessarily bad even if they are complex. You can sometimes prove that they have the properties you expect. Actually, let me show.

This is a simple expression that checks that we get Nothing if we're at the last index of the list.

next_ndex_empty_tail : Dec (xs=[]) -> Dec (next_ndex (Z {xs=xs}) = Nothing)
next_ndex_empty_tail (Yes Refl) = Yes Refl
next_ndex_empty_tail {xs = []} (No contra) = Yes Refl
next_ndex_empty_tail {xs = (x :: xs)} (No contra) =
  No (\assum => case assum of Refl impossible)

Here's another one. It tells us that we get (S Z) as the next index if we ask for the next index of Z.

next_ndex_is_next : (next_ndex (Z {xs=(x::(y::xs))}) = Just (S Z {xs=(x::(y::xs))}))
next_ndex_is_next = Refl

All right. Next we're interested in whether the generation rules actually let us prove something about grammars and symbol strings.

The test grammar

Let's define a bunch of symbols and a test grammar over them.

data Symbol = A | B | C | D

test_grammar : Grammar Symbol
test_grammar = [
  D ::= [A,B,C],
  D ::= [B,C] ]

This next definition is just bullshit. It gives decidable equality for our symbols.

DecEq Symbol where
  decEq A A = Yes Refl
  decEq A B = No (\pf => case pf of Refl impossible)
  decEq A C = No (\pf => case pf of Refl impossible)
  decEq A D = No (\pf => case pf of Refl impossible)
  decEq B B = Yes Refl
  decEq B A = No (\pf => case pf of Refl impossible)
  decEq B C = No (\pf => case pf of Refl impossible)
  decEq B D = No (\pf => case pf of Refl impossible)
  decEq C C = Yes Refl
  decEq C A = No (\pf => case pf of Refl impossible)
  decEq C B = No (\pf => case pf of Refl impossible)
  decEq C D = No (\pf => case pf of Refl impossible)
  decEq D D = Yes Refl
  decEq D A = No (\pf => case pf of Refl impossible)
  decEq D B = No (\pf => case pf of Refl impossible)
  decEq D C = No (\pf => case pf of Refl impossible)

But seriously, this would be short if I had a way to say that two values have the same constructor.

x -> y -> DecSameConstructor x y

With that I could then proceed to prove it like this:

decEq x y = case sameConstructor x y of
  Yup A A => Yes Refl
  Yup B B => Yes Refl
  Yup C C => Yes Refl
  Yup D D => Yes Refl
  Nope contra => No (\assum => case assum of Refl => contra Refl)

Ok, time to try this thing out with that grammar.

Trying the definitions out

Even with the worst definition of string generation, we should be able to prove that we can generate strings that consist of a single production rule.

generates_abc : Generates Parsing.test_grammar D [A,B,C]
generates_abc = Derive (Ndex Z) [Trivial, Trivial, Trivial]
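The same works through the second rule; here's a proof (my addition, mirroring generates_abc) that the grammar also generates [B,C], this time indexing rule number one:

```idris
generates_bc : Generates Parsing.test_grammar D [B,C]
generates_bc = Derive (Ndex (S Z)) [Trivial, Trivial]
```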

The next interesting thing to prove is that our grammars indeed cannot generate empty strings, because the production rules are not empty!

First we need a little lemma saying that emptiness transfers over concatenation:

lemma_empty_lists : {xs:List s} -> {ys:List s}
                 -> (xs ++ ys = []) -> (xs = [], ys = [])
lemma_empty_lists {xs = []} {ys = []} prf = (Refl, Refl)
lemma_empty_lists {xs = []} {ys = (_ :: _)} Refl impossible
lemma_empty_lists {xs = (_ :: _)} {ys = []} Refl impossible
lemma_empty_lists {xs = (_ :: _)} {ys = (_ :: _)} Refl impossible

Then we can use this lemma to conclude that indeed, no grammar generates the empty string, and that nonempty sequences don't either:

mutual
  total no_empty_seq : Seq g t [] -> NonEmpty t -> Void
  no_empty_seq [] IsNonEmpty impossible
  no_empty_seq ((::) x z {zz}) IsNonEmpty =
    let (Refl, Refl) = lemma_empty_lists zz in
    does_not_generate_empty x

  total does_not_generate_empty : {g:Grammar x} -> Generates g s [] -> Void
  does_not_generate_empty {g} (Derive {r} in_grammar y) =
    no_empty_seq y (rhs_not_empty r)

We're trying to prove that we can decide whether a grammar generates a string, because that's what we really care about! So it must at least be possible to disprove that a string is generated by the grammar.

b_must_be_b : Generates [D ::= [A, B, C], D ::= [B, C]] B xs -> (xs = [B])
b_must_be_b Trivial = Refl
b_must_be_b (Derive (Ndex Z) y {zl=Refl}) impossible
b_must_be_b (Derive (Ndex (S Z)) y {zl=Refl}) impossible

s_abc_rule_lemma : Seq Parsing.test_grammar (rhs (index 0 Parsing.test_grammar)) [A, A, A] -> Void
s_abc_rule_lemma ((::) Trivial ((::) x y {zz=ww}) {zz=Refl}) =
  case (b_must_be_b x, ww) of
    (Refl, Refl) impossible
s_abc_rule_lemma ((Derive (Ndex Z) z {zl=Refl}) :: y) impossible
s_abc_rule_lemma ((Derive (Ndex (S Z)) z {zl=Refl}) :: y) impossible

s_bc_rule_lemma : Seq Parsing.test_grammar (rhs (index 1 Parsing.test_grammar)) [A, A, A] -> Void
s_bc_rule_lemma ((::) Trivial _ {zz=Refl}) impossible
s_bc_rule_lemma ((Derive (Ndex Z) z {zl=Refl}) :: y) impossible
s_bc_rule_lemma ((Derive (Ndex (S Z)) z {zl=Refl}) :: y) impossible

does_not_generate_aaa : Generates Parsing.test_grammar u [A,A,A] -> Void
does_not_generate_aaa (Derive x y {r} {zl}) =
  case (x,zl) of
    (Ndex Z, Refl) => s_abc_rule_lemma y
    (Ndex (S Z), Refl) => s_bc_rule_lemma y

These proofs consumed a lot of compilation time, so I commented most of them out for the rest of the time I was working on the module.

Doing such proofs helped ensure that the definitions work well enough to be useful.

Since that was possible, we can start describing some things that are specific to Earley parsers.

Definitions of items and item sets

Next it's time to describe what an Earley parser is built of. The heart of it is the item sets. Each item set represents the current parsing state.

Let me show some state transitions as they appear at the Idris prompt:

step_0 : (List (Nat, Symbol), List (List (Symbol, Item Parsing.test_grammar False)))
step_0 = ([], [predict D])

step_1 : (List (Nat, Symbol), List (List (Symbol, Item Parsing.test_grammar False)))
step_1 = accept (snd step_0) A

step_2 : (List (Nat, Symbol), List (List (Symbol, Item Parsing.test_grammar False)))
step_2 = accept (snd step_1) B

step_3 : (List (Nat, Symbol), List (List (Symbol, Item Parsing.test_grammar False)))
step_3 = accept (snd step_2) C

So we basically step through the string A,B,C here.

This is the step_0 :

([], [[(A, Incomplete 0 (Ndex Z) Z), (B, Incomplete 0 (Ndex (S Z)) Z)]])

The step_1 :

([(0, A)], [[(B, Incomplete 1 (Ndex Z) (S Z))], [(A, Incomplete 0 (Ndex Z) Z), (B, Incomplete 0 (Ndex (S Z)) Z)]])

The step_2 :

([(0, B)], [[(C, Incomplete 2 (Ndex Z) (S (S Z)))], [(B, Incomplete 1 (Ndex Z) (S Z))], [(A, Incomplete 0 (Ndex Z) Z), (B, Incomplete 0 (Ndex (S Z)) Z)]])

The step_3 :

([(0, C), (3, D)], [[], [(C, Incomplete 2 (Ndex Z) (S (S Z)))], [(B, Incomplete 1 (Ndex Z) (S Z))], [(A, Incomplete 0 (Ndex Z) Z), (B, Incomplete 0 (Ndex (S Z)) Z)]])

At the third step it tells you it has recognized a D in the input string.

To make that single-stepping easier to express, I provided the function accept :

accept : DecEq s => {g : Grammar s}
      -> (st:List (List (s, Item g False))) -> s
      -> (List (Nat, s), List (List (s, Item g False)))
accept st x = let (c, p) = completion st [] [(0,x)] in (c, p :: st)

Now it's time to look at how the parser is put together. Let's approach it top-down, as it's easier to explain things that way.

The definition of the recognizer

The recognizer routine is more of a "user interface". It helps when trying the parser out.

Although there is some detail to it.

lemma_recognize : DecEq s => {g : Grammar s} -> s -> Nat
               -> (st : List (List (s, Item g False)))
               -> (input : List s) -> Bool
lemma_recognize lhs i {g} st [] = False
lemma_recognize lhs i {g} st (x :: []) =
  let (c,p) = completion st [] [(0, x)] in
  case mark (i,lhs) c of
    Nothing => True
    (Just _) => False
lemma_recognize lhs i {g} st (x :: xs) =
  let (c,p) = completion st [] [(0, x)]
      p2 = prediction [] (map fst p) {g=g}
  in lemma_recognize lhs (S i) ((p ++ p2) :: st) xs {g=g}

--recognize : (g : Grammar s) -> (lhs:s) -> (input:List s)
--         -> Dec (Generates g lhs input)
recognize : DecEq s => (g : Grammar s) -> (lhs:s) -> (input:List s) -> Bool
recognize g lhs input =
  let ist = prediction [] [lhs] {g=g}
  in lemma_recognize lhs 1 [ist] input {g=g}

Note that the goal recognizer signature is commented out there; the desired condition has been replaced by a boolean. Filling it in is the content of the next post, whenever I get to it.

When the recognizer starts up, it predicts the expected symbol that will be looked for. The prediction initiates every rule that might appear in the derivation tree.

When a symbol is consumed, the completion step takes over and produces a shifted state from the previous one. In the shifted state every item has stepped forward, and items that reached a reduction trigger shifts in earlier item sets. The Earley parser keeps a chain of item sets as its parsing state while it runs.

If there's a next symbol after the completion step is done, prediction runs again on every symbol that might participate in the parse tree.

Both prediction and shifting need to keep track of already predicted symbols. This is achieved by keeping a "page" where you mark whether an item has already been seen.

mark : DecEq s => s -> List s -> Maybe (List s)
mark x [] = Just [x]
mark x (y :: xs) = case decEq x y of
  (Yes prf) => Nothing
  (No contra) => mark x xs >>= Just . (y::)
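Two small checks of mark's behaviour (my additions, using Nats since mark works for any DecEq type): marking a new symbol appends it to the page, while marking an already present symbol fails.

```idris
-- 1 is not on the page yet, so it gets appended.
mark_new : mark 1 [2,3] = Just [2,3,1]
mark_new = Refl

-- 2 is already on the page, so marking fails.
mark_seen : mark 2 [2,3] = Nothing
mark_seen = Refl
```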

Items in item sets are divided by whether they have completed or not. The completed items are separated out, and the shifting repeats on them with an earlier symbol.

data Item : (g:Grammar s) -> Bool -> Type where
  Completed : Nat -> Indexing r g -> Item g True
  Incomplete : Nat -> Indexing r g -> Indexer (rhs r) -> Item g False

The prediction is made from two smaller pieces. First we have a way to select a rule if it happens to match the expected lhs symbol:

lemma_predict : DecEq s => {g:Grammar s} -> s -> (Indexer g) -> Maybe (s, Item g False)
lemma_predict x ix = case decEq x (lhs$ndex ix) of
  (No contra) => Nothing
  (Yes _) => Just (rhs_head (ndex ix),
                   Incomplete Z (Ndex ix) (rule_first_index (ndex ix)))

Then we extend it over the whole grammar, so that we can do a single step of prediction.

predict : DecEq s => {g : Grammar s} -> s -> List (s, Item g False)
predict {g} x = select g (lemma_predict x)

Finally we extend the prediction step into a transitive closure, bounded by the marking mechanism described earlier. The predictor only needs to track which symbols have already been visited.

prediction : DecEq s => {g : Grammar s}
          -> (visited:List s) -> (completed:List s)
          -> List (s, Item g False)
prediction visited [] = []
prediction visited (x :: xs) = case mark x visited of
  Nothing => prediction visited xs
  Just visited_x =>
    let p = predict x
    in p ++ (prediction visited_x (map fst p ++ xs))

The completion step is divided similarly. At the lowest layer we describe what shifting an item means.

lemma_shift : {g : Grammar s} -> Nat -> Item g False
           -> Either (Item g True) (s, Item g False)
lemma_shift off (Incomplete k x y) = case (next_ndex y) of
  Nothing => Left (Completed (k+off) x)
  (Just y2) => Right (ndex y2, Incomplete (k+off) x y2)

Then we describe what it means to shift an item set with some symbol. This already divides the item sets into completed and incomplete ones. Note that they're indexed by a boolean to make that possible.

shift : DecEq s => {g : Grammar s} -> Nat -> s
     -> List (s, Item g False)
     -> (List (Item g True), List (s, Item g False))
shift k x [] = ([], [])
shift k x ((a,item) :: xs) =
  let (c,p) = shift k x xs in
  case decEq x a of
    (Yes Refl) => case lemma_shift k item of
      (Left item) => (item :: c, p)
      (Right pitem) => (c, pitem :: p)
    (No contra) => (c,p)

Next we need a small helper to convert a completed item into an index and a symbol, used for marking visited items during completion. This forgets which rule caused the completion.

extractComplete : {g:Grammar s} -> Item g True -> (Nat, s)
extractComplete (Completed ix _ {r}) = (ix, lhs r)

The completion lemma ensures we don't index too far, though this check ought to be superfluous. If we had proven the recognizer correct, we wouldn't have to do it.

lemma_completion : DecEq s => {g : Grammar s} -> (i:Nat) -> s
                -> (st:List (List (s, Item g False)))
                -> (List (Item g True), List (s, Item g False))
lemma_completion k x xs = case inBounds k xs of
  (Yes prf) => shift (S k) x (index k xs)
  (No contra) => ([],[]) -- For now handling the boundaries here.

Finally we track the already completed states, as before, forming a similar "transitive closure" for completions as there was for predictions.

completion : DecEq s => {g : Grammar s}
          -> (st:List (List (s, Item g False)))
          -> (visited:List (Nat, s)) -> (completed:List (Nat, s))
          -> (List (Nat, s), List (s, Item g False))
completion st visited [] = (visited, [])
completion st visited (x :: xs) = case mark x visited of
  Nothing => completion st visited xs
  Just visited_x =>
    let (c,p) = lemma_completion (fst x) (snd x) st
        (c2,p2) = completion st visited_x (map extractComplete c++xs)
    in (c2,p++p2)

That's it. It's not a total function, and it's not proven correct yet.

Here's the download link for this Parsing.idr module. I'll perhaps place it into a repository once it has grown a bit.

What use could this be?

Idris can be compiled to several targets, including JavaScript, Node.js and C. So the parser itself could be useful to somebody once we can verify that it works. It's trivial to extend the recognizer algorithm to produce parse trees.

Parsers are nice user interfacing components. Look up inverse parsing.

Earley parsing in itself is a nice benchmark for correct-by-construction programming environments. We did smaller proofs about a whole bunch of small structures that appear everywhere, such as indexers and sets.

Once you have good enough, precise definitions of interfaces, there will be little reason to change them afterwards. They can be used to write software that works if and only if it can be plugged together.

Once you have valid models of programs, suitable interactive tooling should make it possible to produce highly optimized versions of those programs that implement their models exactly.