November 27, 2017 at 05:28 Tags Haskell , Recursive descent parsing

This post follows the construction of parsers described in Graham Hutton's "Programming in Haskell" (2nd edition). It's my attempt to work through chapter 13 in this book and understand the details of applicative and monadic combination of parsers presented therein.

Basic definitions for the Parser type A parser parameterized on some type a is: newtype Parser a = P ( String -> [( a , String )]) It's a function taking a String and returning a list of (a,String) pairs, where a is a value of the parameterized type and String is (by convention) the unparsed remainder of the input. The returned list is potentially empty, which signals a failure in parsing . It might have made more sense to define Parser as a type alias for the function, but type s can't be made into instances of typeclasses; therefore, we use netwype with a dummy constructor named P . With this Parser type, the act of actually parsing a string is expressed with the following helper function. It's not strictly necessary, but it helps make code cleaner by hiding P from users of the parser. parse :: Parser a -> String -> [( a , String )] parse ( P p ) inp = p inp The most basic parsing primitive plucks off the first character from a given string: item :: Parser Char item = P ( \ inp -> case inp of [] -> [] ( x : xs ) -> [( x , xs )]) Here's how it works in practice: > parse item "foo" [('f',"oo")] > parse item "f" [('f',"")] > parse item "" []

Parser as a Functor We'll start by making Parser an instance of Functor : instance Functor Parser where -- fmap :: (a -> b) -> Parser a -> Parser b fmap g p = P ( \ inp -> case parse p inp of [] -> [] [( v , out )] -> [( g v , out )]) With fmap we can create a new parser from an existing parser, with a function applied to the parser's output. For example: > parse (fmap toUpper item) "foo" [('F',"oo")] > parse (fmap toUpper item) "" [] Let's check that the functor laws work for this definition. The first law: fmap id = id Is fairly obvious when we substitute id for g in the definition of fmap . We get: fmap id p = P (\inp -> case parse p inp of [] -> [] [(v,out)] -> [(id v,out)]) Which takes the parse result of p and passes it through without modification. In other words, it's equivalent to p itself, and hence the first law holds. Verifying the second law: fmap (g . h) = fmap g . fmap h ... is similarly straightforward and is left as an exercise to the reader. While it's not obvious why a Functor instance for Parser is useful in its own right, it's actually required to make Parser into an Applicative , and also when combining parsers using applicative style.

Parser as an Applicative Consider parsing conditional expressions in a fictional language: if <expr> then <expr> else <expr> To parse such expressions we'd like to say: Parse the token if

Parse an <expr>

Parse the token then

Parse an <expr>

Parse the token else

Parse an <expr>

If all of this was successful, combine all the parsed expressions into some sort of result, like an AST node. Such sequences, along with alternation (an expression is either this or that) are two of the critical basic blocks of constructing non-trivial parsers. Let's see a popular way to accomplish this in Haskell (for a complete example demonstrating how to construct a parser for this particular conditional expression, see the last section in this post). Parser combinators is a popular technique for constructing complex parsers from simpler parsers, by means of higher-order functions. In Haskell, one of the ways in which parsers can be elegantly combined is using applicative style. Here's the Applicative instance for Parser . instance Applicative Parser where -- pure :: a -> Parser a pure v = P ( \ inp -> [( v , inp )]) -- <*> :: Parser (a -> b) -> Parser a -> Parser b pg <*> px = P ( \ inp -> case parse pg inp of [] -> [] [( g , out )] -> parse ( fmap g px ) out ) Recall how we created a parser that applied toUpper to its result using fmap ? We can now do the same in applicative style: > parse (pure toUpper <*> item) "foo" [('F',"oo")] Let's see why this works. While not too exciting on its own, this application of a single-argument function is a good segue to more complicated use cases. Looking at the Applicative instance, pure toUpper translates to P (\inp -> [(toUpper,inp)] - a parser that passes its input through unchanged, returning toUpper as a result. Now, substituting item into the definition of <*> we get: pg <*> item = P (\inp -> case parse pg inp of [] -> [] [(g,out)] -> parse (fmap g item) out) ... pg is (pure toUpper), the parsing of which always succeeds, returning [(toUpper,inp)] pg <*> item = P (\inp -> parse (fmap toUpper item) inp) In other words, this is exactly the example we had for Functor by fmap -ing toUpper onto item . The more interesting case is applying functions with multiple parameters. Here's how we define a parser that parses three items from the input, dropping the middle result: dropMiddle :: Parser ( Char , Char ) dropMiddle = pure selector <*> item <*> item <*> item where selector x y z = ( x , z ) Following the application of nested <*> operators is tricky because it builds a run-time chain of functions referring to other functions. This chain is only collapsed when the parser is used to actually parse some input, so it is necessary to keep a lot of context "on the fly". To better understand how this works, we can break the definition of dropMiddle into parts as follows (since <*> is left-associative): dropMiddle = (( pure selector <*> item ) <*> item ) <*> item where selector x y z = ( x , z ) Applying the first <*> : pg <*> item = P (\inp -> case parse pg inp of [] -> [] [(g,out)] -> parse (fmap g item) out) ... pg is (pure selector), the parsing of which always succeeds, returning [(selector,inp)] pg <*> item = P (\inp -> parse (fmap selector item) inp) --= app1 Let's call this parser app1 and apply the second <*> in the sequence. app1 <*> item = P (\inp -> case parse app1 inp of [] -> [] [(g,out)] -> parse (fmap g item) out) --= app2 We'll call this app2 and move on. Similarly, applying the third <*> in the sequence produces: app2 <*> item = P (\inp -> case parse app2 inp of [] -> [] [(g,out)] -> parse (fmap g item) out) This is dropMiddle . It's a chain of parsers expressed as a compbination of higher-order functions (closures, actually). To see how this combined parser actually parses input, let's trace through the execution of: > parse dropMiddle "pumpkin" [(('p','m'),"pkin")] dropMiddle is app2 <*> item , so we have: -- parse dropMiddle parse P (\inp -> case parse app2 inp of [] -> [] [(g,out)] -> parse (fmap g item) out) "pumpkin" .. substituting "pumpkin" into inp case parse app2 "pumpkin" of [] -> [] [(g,out)] -> parse (fmap g item) out Now parse app2 "pumpkin" is going to be invoked; app2 is app1 <*> item : -- parse app2 case parse app1 "pumpkin" of [] -> [] [(g,out)] -> parse (fmap g item) out Similarly, we get to parse app1 "pumpkin" : -- parse app1 parse (fmap selector item) "pumpkin" .. following the definition of fmap parse P (\inp -> case parse item inp of [] -> [] [(v,out)] -> [(selector v,out)]) "pumpkin" .. Since (parse item "pumpkin") returns [('p',"umpkin")], we get: [(selector 'p',"umpkin")] Now going back to parse app2 , knowing what parse app1 "pumpkin" returns: parse (fmap (selector 'p') item) "umpkin" .. following the definition of fmap parse P (\inp -> case parse item inp of [] -> [] [(v,out)] -> [(selector 'p' v,out)]) "umpkin" [(selector 'p' 'u',"mpkin")] Finally, dropMiddle : app2 <*> item = P (\inp -> case parse app2 inp of [] -> [] [(g,out)] -> parse (fmap g item) out) .. Since (parse app2 "pumpkin") returns [(selector 'p' 'u',"mpkin")] parse (fmap (selector 'p' "u") item) "mpkin" .. If we follow the definition of fmap again, we'll get: [(selector 'p' 'u' 'm',"pkin")] This is the final result of applying dropMiddle to "pumpkin", and when selector is invoked we get [(('p','m'),"pkin")] , as expected.

Parser as a Monad Parsers can also be expressed and combined using monadic style. Here's the Monad instance for Parser : instance Monad Parser where -- return :: a -> Parser a return = pure -- (>>=) :: Parser a -> (a -> Parser b) -> Parser b p >>= f = P ( \ inp -> case parse p inp of [] -> [] [( v , out )] -> parse ( f v ) out ) Let's take the simple example of applying toUpper to item again, this time using monadic operators: > parse (item >>= (\x -> return $ toUpper x)) "foo" [('F',"oo")] Substituting in the definition of >>= : item >>= (\x -> return $ toUpper x) = P (\inp -> case parse item inp of [] -> [] [(v,out)] -> parse (return $ toUpper v) out) ... if item succeeds, this is a parser that will always succeed with the upper-cased result of item When writing in monadic style, however, we won't typically be using the >>= operator explicitly; instead, we'll use the do notation. Recall that in the general multi-parameter case, this: m1 >>= \x1 -> m2 >>= \x2 -> ... mn >>= \xn -> f x1 x2 ... xn Is equivalent to this: do x1 <- m1 x2 <- m2 ... xn <- mn f x1 x2 ... xn So we can also rewrite our example as: > parse (do x <- item; return $ toUpper x) "foo" [('F',"oo")] The do notation starts looking much more attractive for multiple parameters, however. Here's dropMiddle in monadic style written directly : dropMiddleM :: Parser (Char,Char) dropMiddleM = item >>= \x -> item >>= \_ -> item >>= \z -> return (x,z) And now rewritten using do : dropMiddleM' :: Parser (Char,Char) dropMiddleM' = do x <- item item z <- item return (x,z) Let's do a detailed breakdown of what's happening here to better understand the monadic sequencing mechanics. I'll be using the direct style ( dropMiddleM ) to unravel the applications of >>= : item >>= \x -> item >>= \_ -> item >>= \z -> return (x,z) .. applying the first >>=, calling the right-hand side rhsX P (\inp -> case parse item inp of [] -> [] [(v,out)] -> parse (rhsX v) out) .. the result of parsing the first item is passed in as the argument to rhsX, which then returns the next application of >>=; As usual, we acknowledge the error propagation and ignore it for simplicity. P (\inp -> case parse item inp of [] -> [] [(v,out)] -> parse (rhsY v) out) ... and similarly for rhsZ; the final result is invoking "parse return (x,z)" where x is the result of parsing the first item and z the result of parsing the third.