24 days of Hackage, 2015: day 14: Earley: a promising newer parser library for Haskell

Table of contents for the whole series

A table of contents is at the top of the article for day 1.

Day 14

(Reddit discussion)

On day 10, I showed how to use S-expressions to avoid having to write a custom parser. But writing parsers isn’t too bad in Haskell, or is it? The popular parsec library has many problems, because it requires hand-hacked backtracking that causes weird error messages and difficulty in reasoning about your grammar. There’s an improved fork of parsec called megaparsec , but it’s still the same kind of technology. How about something completely different.

The recent Earley is intriguing and I’ve begun using it for new projects where I don’t need the monadic power of something like parsec but are OK with an applicative API instead and don’t need the performance of something like attoparsec . Apart from good error messages, it allows handles online parsing and ambiguity.

Today I’ll give two small examples of using Earley .

Installation

Since Stackage LTS is behind right now, and Earley keeps moving, I decided to use the latest version of Earley by modifying our stack.yaml :

- Earley- 0.10 . 1.0

Stackage LTS 4 has caught up, no no more need for this modification.

Parsing into day 10’s AST

Let’s go back to the symbolic differentiation problem on day 10, and create a math-like infix syntax to parse.

Tests

Here are some HSpec/QuickCheck tests to illustrate what we want when parsing a string into an Exp .

Imports

Text.Earley is the main module of the Earley package; Report is used for return a report on the progress of the parse.

{-# LANGUAGE QuasiQuotes #-} {-# LANGUAGE ScopedTypeVariables #-} {-# LANGUAGE LambdaCase #-} module SymbolicDifferentiation.EarleySpec where import SymbolicDifferentiation.AlphaSyntax ( Exp ( N , V , Plus , Times )) import qualified SymbolicDifferentiation.Earley as Earley import Text.Earley ( Report (..)) import Test.Hspec ( Spec , hspec, describe, it, shouldBe, shouldSatisfy) import Test.Hspec.QuickCheck (prop) import Test.QuickCheck ( NonNegative (..)) import Data.String.Here (i)

QuickCheck tests

Some QuickCheck tests that show that some sample expressions such as x*a + y*b * (z+c) parse into the expected ASTs:

spec :: Spec spec = describe "Custom syntax for expression parsed by Earley" $ do -- For simplicity, don't support negative numeric literals now. prop "x + a" $ \( NonNegative (a :: Int )) -> fst ( Earley .parses [i|x + ${a}|]) `shouldBe` [ Plus ( V "x" ) ( N a)] prop "x*a + y*b * (z+c)" $ \( NonNegative (a :: Int )) ( NonNegative (b :: Int )) ( NonNegative (c :: Int )) -> fst ( Earley .parses [i|x*${a} + y*${b} * (z+${c})|]) `shouldBe` [ Plus ( Times ( V "x" ) ( N a)) ( Times ( Times ( V "y" ) ( N b)) ( Plus ( V "z" ) ( N c)))]

Expected parse errors

Finally, one example of how to check for expected parse errors. The error tokens are user-defined and attached to grammar productions, as we will see.

it "x + y * + 5" $ Earley .parses "x + y * + 5" `shouldSatisfy` \ case ( [] , Report { position = 8 , expected = [ "number" , "identifier" , "(" ] , unconsumed = "+ 5" }) -> True _ -> False

Implementation

The implementation involves Applicative idioms that will be familiar to you if you have used parsec .

Imports

{-# LANGUAGE ScopedTypeVariables #-} {-# LANGUAGE RecursiveDo #-} {-# LANGUAGE FlexibleContexts #-} module SymbolicDifferentiation.Earley where import SymbolicDifferentiation.AlphaSyntax ( Exp ( N , V , Plus , Times )) import qualified Text.Earley as E import Text.Earley ((<?>)) import Control.Applicative (many, some, (<|>)) import qualified Data.Char as Char import Control.Monad.ST ( ST ) import Data.ListLike ( ListLike ) -- | What to report for something expected. type Expected = String

The <?> operator is used to attach an expectation (which we have decided to specify as a string, with type synonym Expected ) to a production.

Drivers

What we want for our particular problem is a parser that takes a string as input and expects to fully parse it. We construct it from a more generic parser that comes from processing our grammar.

-- | Return a list of all possible `Exp` parses, and also a status report -- regardless of how many successes. parses :: String -> ([ Exp ], E . Report Expected String ) parses = E .fullParses expParser -- | Parser created from the grammar. expParser :: ListLike input Char => ST state (input -> ST state ( E . Result state Expected input Exp )) expParser = E .parser grammar

Grammar

Our grammar is straightforward. Earley uses a monad to maintain its internal state, and we use the RecursiveDo GHC extension (covered in a 2014 Day of GHC Extensions) in order to be able to refer to a rule within the grammar recursively. Note that left recursion in the grammar is just fine for Earley .

Prod is the type constructor for a production, and you build up productions using combinators such as satisfy and symbol .

-- | Basically taken from <https://github.com/ollef/Earley/blob/master/examples/Expr2.hs Earley example expression parser> grammar :: forall r. E . Grammar r ( E . Prod r Expected Char Exp ) grammar = mdo whitespace <- E .rule $ many $ E .satisfy Char .isSpace let token :: E . Prod r Expected Char a -> E . Prod r Expected Char a token p = whitespace *> p sym x = token $ E .symbol x <?> [x] ident = token $ ( : ) <$> E .satisfy Char .isAlpha <*> many ( E .satisfy Char .isAlphaNum) <?> "identifier" num = token $ some ( E .satisfy Char .isDigit) <?> "number" -- For now, just handle unsigned numeric literals. atom <- E .rule $ ( N . read) <$> num <|> V <$> ident <|> sym '(' *> term <* sym ')' factor <- E .rule $ Times <$> factor <* sym '*' <*> atom <|> atom term <- E .rule $ Plus <$> term <* sym '+' <*> factor <|> factor return $ term <* whitespace

For more examples of grammars, see the examples directory in the Earley GitHub repo.

The inverse of parsing is pretty-printing, covered on day 17.

For fun: solving the “number word” problem

The ability to handle ambiguity and return all possible parses is a useful one in many situations. Here I show a solution to the “number word” problem. In the past, I have managed ambiguity using Happy’s GLR support, but I don’t like writing parsers using Happy.

The “number word” problem:

Given a positive integer, return all the ways that the integer can be represented by letters using the mapping 1 -> A, 2 -> B, ..., 26 -> Z. For instance, the number 1234 can be represented by the words ABCD, AWD and LCD.

This is a toy version of an actually serious problem, that of segmentation in natural language.

Test

The test reflects the problem statement:

module EarleyExampleSpec where import EarleyExample (grammar, NumberWord , Expected ) import qualified Text.Earley as E import qualified Data.List.NonEmpty as NonEmpty import Test.Hspec ( Spec , hspec, describe, it, shouldMatchList) spec :: Spec spec = describe "EarleyExample" $ do it "returns all possible parses of number words" $ do let (result, _ ) = parseNumberWord 1234 map NonEmpty .toList result `shouldMatchList` [ "ABCD" , "AWD" , "LCD" ] parseNumberWord :: Integer -> ([ NumberWord ], E . Report Expected String ) parseNumberWord = E .fullParses ( E .parser grammar) . show

Note that I am using a NonEmpty list of Char because an empty string is not a valid solution to the “number word” problem. (I covered NonEmpty on day 7.)

Solution

The solution is just to write a grammar that tries to pick off valid consecutive digits to make a letter. We create a production for each possible letter that we care about, using numberLetterFor , combine those productions with alternation using asum to get a composite production numberLetter , then use that for numberWord which is the grammar.

{-# LANGUAGE RecursiveDo #-} module EarleyExample where import qualified Text.Earley as E import Text.Earley ((<?>)) import Control.Applicative ((<|>)) import qualified Data.Foldable as Foldable import qualified Data.Char as Char import qualified Data.List.NonEmpty as NonEmpty import Data.List.NonEmpty ( NonEmpty ((:|))) -- | Result wanted. type NumberWord = NonEmpty NumberLetter -- | 'A' to 'Z'. type NumberLetter = Char -- | What to report for something expected. type Expected = String grammar :: E . Grammar r ( E . Prod r Expected Char NumberWord ) grammar = mdo numberWord <- E .rule $ NonEmpty .cons <$> numberLetter <*> numberWord <|> ( :| [] ) <$> numberLetter return numberWord numberLetter :: E . Prod r Expected Char NumberLetter numberLetter = ( Foldable .asum . map numberLetterFor) [ 'A' .. 'Z' ] <?> "number" -- | Return a production for a given letter. -- -- 1 is 'A', 2 is 'B', .. 26 is 'Z'. numberLetterFor :: NumberLetter -> E . Prod r Expected Char NumberLetter numberLetterFor c = c <$ E .word (show (toNumber c)) <?> [c] -- | 'A' is 1, ... 'Z' is 26 toNumber :: NumberLetter -> Int toNumber c = ( Char .ord c - Char .ord 'A' ) + 1

Conclusion

I only recently discovered the Earley parser library and started using it. I’m pretty excited by its friendliness.

All the code

All my code for my article series are at this GitHub repo.