Parsing, CFGs, and Type Hacking

This is what I have been playing with for the last day or so.

Haskell has a very nice monadic parser library for predictive parsing (parsec), and a decent lex/yacc-style parser and lexer generator suite (happy and alex). That said, though, it’s more fun and educational to write code than to worry about what’s already been written, I set out to do something similar. In particular, my goals are:

Write an LR parser, like that generated by parser generators.

Write readable grammars directly in Haskell source files, with no external parser generator utilities.

Let the type checker do its job by checking and inferring types for semantic rules.

I haven’t written the parsing code yet, but I did squeeze out something that I’m nearly happy with for the first two goals.

{-# LANGUAGE MultiParamTypeClasses #-} {-# LANGUAGE UndecidableInstances #-} {-# LANGUAGE NoMonomorphismRestriction #-}

All the language extensions I’ll be using. This is the bare-bones list; the original list was eight or nine lines.. So, if you were wonder whether this is a good post for new Haskellers just learning the language, there’s your answer!

I need some way to represent variables (in the CFG sense). In order to ensure that everything is well-typed, I need some way to keep track of the type of the semantic value associated with each variable. Here’s what I did.

data Var a = Var String

And a right-hand side of each rule will have a sequence of variables and terminals. Again, to keep the type information around, I’ll need a sort of cons operator at the type level. Here is that. I defined a type, and also an operator that makes the type easier to use.

data RHS a b = RHS a b ( & ) = RHS -- a convenient operator infixr 5 &

And next, things get hard. I’m using a multiparameter type class, in the fine tradition of Haskell type hacking, to represent a relation between types. My relation is defined in the following comment:

{- (Action a b c) means the following: A production with a right hand side of type: a may be associated with a semantic rule of type: b to produce a rule with semantic result type: c -} class Action a b c | a b -> c , a c -> b

In other words, the Action class will be used to ensure that the result type of a grammar production, the right-hand side of the production, and the type of the associated semantic rule are all consistent with each other. The functional dependencies simply assert that if you know the types of the right-hand side and the semantic rule, this is enough to determine what the result will be after applying the semantic rule; and that if you know the right-hand side and the result type, this is enough to determine what the type of the semantic rule needs to be.

There are a three base cases for this relation:

instance Action (Var x ) ( x -> y ) y

This says that if a production has the form A -> B , where A has semantic values of type y , and B has semantic values of type x , then the semantic rule must have type (x -> y) . If you think about it, this should make sense.

instance Action Char y y

This rule says that if the right-hand side of a production is a signle character (a terminal, not a variable), the semantic rules should be a constant that matches the semantic type for the left-hand variable.

instance Action () y y

This describes the situation for empty productions (sometimes called epsilon or lambda productions). Since leaving out any terms on the right-hand side of a rule isn’t an option, I use () , called “unit” to represent an empty right-hand side.

Those are the base cases. (As a side comment, only the last one is strictly necessary; the first two are basically just syntactic convenience. See below.) Here’s how rules with more than one symbol on the right-hand side are handled.

instance (Action a b c ) => Action (RHS (Var x ) a ) ( x -> b ) c

The RHS operator defined earlier is used to build a list of sorts. This rule says that adding a variable to the beginning of the right-hand side of a rule requires adding a parameter to the beginning of the semantic action, and that the result type stays the same. This case handles right-hand sides that begin with a variable.

instance (Action a b c ) => Action (RHS Char a ) b c

Finally, this case handles right-hand sides that begin with a terminal (a character). The types of the semantic rule and result don’t change, since a terminal is known in advance, so there’s no need for it to carry semantic information.

Some more syntactic convenience makes it easier to write grammars. Here I abuse monads to take advantage of the special syntax.

{- The RuleSet monad is used to define rules of a grammar in a convenient 'do' notation. -} data Rule = forall a b c . (Action a b c ) => Rule (Var c , a , b ) data RuleSet x = RuleSet ([Rule], x ) instance Monad RuleSet where a >>= k = let RuleSet ( r1 , v1 ) = a RuleSet ( r2 , v2 ) = k v1 in RuleSet ( r1 ++ r2 , v2 ) return x = RuleSet ([], x )

So a rule consists of a left-hand side, a right-hand side, and a semantic rule. They are constrained to match each other by the Action class defined above. A RuleSet is basically just a writer monad for lists of rules, but I defined it by hand just for the fun of it.

Now time to define an operator for building rules inside the monad:

( ==> ) :: (Action a b c ) => Var c -> a -> b -> RuleSet () ( ==> ) lhs rhs sem = RuleSet ([Rule ( lhs , rhs , sem )], ()) infixr 4 ==>

It took a while to pick this. All the good arrow-like operators seems to be taken! Nevertheless, it does the job we want fairly well. Notice that even though I’m using an infix operator, there are three operands. The normal usage looks like this:

lefthand ==> righthand $ semanticrule

You’ll see examples coming up.

The formal definition of a context-free grammar includes four things: a set of variables, a set of terminals, a set of productions, and a special start variable. We’ve got three: variables are values of type Var a . Terminals are values of type Char . Productions are values of type Rule . Next, I need a start symbol. This is defined once, outside of the monadic environment in which rules are defined. At the same time, I through away the result value of the monad, which is useless since I was just exploiting the syntax rather than building a real monad.

{- Grammar. A grammar is a set of rules together with a start symbol. -} data Grammar a = Grammar (Var a ) [Rule] grammar s (RuleSet ( rs , x )) = Grammar s rs

And that’s it! Here’s a sample grammar so we can see it work.

{- Sample grammar. The parentheses in the let bindings are required: they force the rules to be monomorphic, which is needed for type checking to work properly. -} g = let ( expr ) = Var "expression" ( term ) = Var "term" ( termmore ) = Var "term operator" ( fact ) = Var "factor" ( factmore ) = Var "factor operator" ( digit ) = Var "digit" ( digits ) = Var "digits" in grammar expr $ do expr ==> term & termmore $ \\ x y -> y x termmore ==> () $ \\ x -> x termmore ==> '+' & term & termmore $ \\ x y z -> y ( x + z ) termmore ==> '-' & term & termmore $ \\ x y z -> y ( x - z ) term ==> fact & factmore $ \\ x y -> y x factmore ==> () $ \\ x -> x factmore ==> '*' & fact & factmore $ \\ x y z -> y ( x * z ) factmore ==> '/' & fact & factmore $ \\ x y z -> y ( x / z ) fact ==> '(' & expr & ')' $ \\ x -> x fact ==> digit & digits $ \\ x y -> y x digits ==> () $ \\ x -> x digits ==> digit & digits $ \\ x y z -> y ( 10* x + z ) digit ==> '0' $ 0 digit ==> '1' $ 1 digit ==> '2' $ 2 digit ==> '3' $ 3 digit ==> '4' $ 4 digit ==> '5' $ 5 digit ==> '6' $ 6 digit ==> '7' $ 7 digit ==> '8' $ 8 digit ==> '9' $ 9

(This is a modified grammar I had laying around from a set of compiler course notes. It happens to have left recursion removed, but that’s immaterial really.) There are no type declarations in the entire grammar. Ask GHCi for the type of g , though, and it answers.

g :: (Fractional a ) => Grammar a

It correctly inferred that the result type of expr must be Fractional . How? Because the third production for factmore uses the / operator. This means that factmore must be Fractional , and the type ripples upward all the way to the start variable of expr .

The only thing I don’t like at the moment is the need to use an explicit monomorphic binding (the parentheses) to declare non-terminals. If that’s not done, then the compiler thinks non-terminals can have different result types when used in different places, and the types it infers tend to be several pages long! A nice solution to that would be good, but I’m happy with everything else!

Please comment if there’s something that can be improved.