Lexical analysis with parser combinators

When writing a programming language parser in Haskell, the usual dilemma is whether to use lexer/parser generators (Alex+Happy) or to write a single parser (using a parser combinator library such as Parsec) without an explicit tokenizer.

In this article, I’ll explain why you may want to use a separate lexer built with applicative parser combinators, and how you might go about writing one.

Many of the ideas described in this article have since been implemented in lexer-applicative.

Alex

Alex, the Haskell lexer generator, is a nice tool, but it has a weakness. With Alex, you cannot observe the internal structure of a token.

The most common instance of this problem is parsing string interpolation; think "Hello, ${name}!". Such a string is a single token, yet it has internal structure. To analyze this with Alex, you’ll probably need two different lexers, one running inside the other.

Another instance is date literals (useful for many DSLs), such as 2015-01-02 (or d"2015-01-02", to avoid confusion with infix subtraction). You can recognize such a literal with the Alex regular expression $d+\-$d+\-$d+ . However, you won’t be able to get hold of the individual $d+ pieces, because Alex regexps lack capture groups. So you’ll need a separate date-parsing function to parse the same string a second time.

Parsec

On the other hand, we have parser combinators that solve this problem nicely. You define parsers for numbers; then you glue them with dashes, and you get a parser for dates.

However, using Parsec (or similar) without a separate lexer is not the best choice:

- Dealing with whitespace and comments is awkward in the parser; you need to wrap everything in a token combinator. (If you decide to do that, at least use a free applicative functor to ensure that you don’t forget to consume that whitespace.)
- Separating parsing into lexical and syntax analysis is simply good engineering practice. It reduces the overall complexity through «divide and conquer». The two phases usually have well-defined responsibilities and a simple interface — why not take advantage of that?
- If a language needs the maximal munch rule, it’s hard or impossible to encode with Parsec or similar libraries.
- Tokens enable the syntax analyzer to report better errors, because it can tell which token it didn’t expect. A Char-based Parsec parser can only tell which character (or an arbitrary number of characters) it didn’t expect, because it doesn’t know what constitutes a token.
- Potential performance considerations: if a parser has to try several syntax tree alternatives, it reparses low-level lexical tokens anew every time. From this perspective, the lexical analyzer provides a natural «caching layer».
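To make the first point above concrete, here is a sketch of the usual lexeme discipline in a scannerless Parsec parser (the names lexeme and symbol are mine, though many tutorials use them):

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- every terminal must remember to consume its own trailing
-- whitespace; forget the wrapper once, and parsing quietly
-- breaks wherever whitespace appears
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

symbol :: String -> Parser String
symbol = lexeme . string
```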

regex-applicative

My regex-applicative library provides applicative parser combinators with regexp semantics. We can write a simple lexer on top of it that would give us the benefits of both Alex and Parsec.

```haskell
{-# LANGUAGE OverloadedStrings, OverloadedLists #-}
import qualified Data.Text as T
import qualified Data.HashSet as HS
import Text.Regex.Applicative
import Text.Regex.Applicative.Common (decimal)
import Data.Char
import Data.Time
```

For example, here’s a parser for dates. (For simplicity, this one doesn’t check dates for validity.)

```haskell
pDate :: RE Char Day
pDate = fromGregorian <$> decimal <* "-" <*> decimal <* "-" <*> decimal
```
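As a quick sanity check (this usage snippet is my addition, assuming the definitions above are in scope), match from Text.Regex.Applicative runs a regex against an entire input string:

```haskell
-- `match` succeeds only if the regex consumes the whole input
checkDate :: Maybe Day
checkDate = match pDate "2015-01-02"
-- Just 2015-01-02
```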

And here’s a parser for templates — strings that may include interpolated variables and escaping.

```haskell
pTemplate :: RE Char [Either T.Text T.Text] -- Left = text, Right = variable
pTemplate = "\"" *> many piece <* "\""
  where
    -- piece of text or interpolated variable
    piece =
      (Left . T.pack <$> some ch) <|>
      (Right <$> var)

    -- interpolated variable
    var = "${" *> pVar <* "}"

    -- normal character, plain or escaped
    ch =
      psym (\c -> not $ c `HS.member` escapable) <|>
      ("\\" *> psym (\c -> c `HS.member` escapable))

    -- set of escapable characters
    escapable = ['"', '\\', '$']

    pVar = T.pack <$> some (psym isAlpha)
```
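Again as a sanity check (my example, not part of the original code), the template from the introduction splits into text and variable pieces:

```haskell
-- the quoted string contains one interpolated variable
checkTemplate :: Maybe [Either T.Text T.Text]
checkTemplate = match pTemplate "\"Hello, ${name}!\""
-- Just [Left "Hello, ",Right "name",Left "!"]
```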

Individual parsers are merged into a single pToken parser:

```haskell
data Token
  = Template [Either T.Text T.Text]
  | Date Day
  -- | ...

pToken :: RE Char Token
pToken =
  (Template <$> pTemplate) <|>
  (Date <$> pDate)
  -- <|> ...
```

Finally, a simple tokenizing function might look something like this:

```haskell
tokens :: String -> [Token]
tokens "" = []
tokens s =
  let re = (Just <$> pToken) <|> (Nothing <$ some (psym isSpace))
  in
    case findLongestPrefix re s of
      Just (Just tok, rest) -> tok : tokens rest
      Just (Nothing, rest) -> tokens rest -- whitespace
      Nothing -> error "lexical error"
```
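Putting the pieces together on a made-up input (my example): findLongestPrefix implements maximal munch, whitespace is discarded, and each maximal prefix becomes a token:

```haskell
-- a Template token followed by a Date token;
-- the space between them is dropped
example :: [Token]
example = tokens "\"Hello, ${name}!\" 2015-01-02"
```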

The resulting token list can be further parsed with Parsec or Happy.
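For instance, a simplified bridge from the [Token] stream to Parsec might look like the sketch below. The names satisfyTok and pDateTok are hypothetical, and a real lexer would attach source positions to tokens for `advance` to use:

```haskell
import Data.Time (Day)
import Text.Parsec hiding (tokens)

type TokParser = Parsec [Token] ()

-- lift a partial function over tokens into a Parsec parser
satisfyTok :: (Token -> Maybe a) -> TokParser a
satisfyTok test = tokenPrim describe advance test
  where
    describe _ = "<token>"   -- placeholder token pretty-printer
    advance pos _ _ = pos    -- no position tracking in this sketch

-- accept exactly one Date token
pDateTok :: TokParser Day
pDateTok = satisfyTok $ \t -> case t of
  Date d -> Just d
  _      -> Nothing
```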

This simple function can be extended to do a lot more: