The Haskell standard library comes with a small but competent parser combinator library: Text.ParserCombinators.ReadP. Whenever you need to write your own parser to consume some kind of data, this is what you should reach for first. Forget splitting strings. Forget regexes. ReadP is where it is at from now on.

There are a bunch of parser combinator libraries for Haskell, including but not limited to Attoparsec, Parsec, Megaparsec, Turtle.Pattern and Earley. These are all good for different things. The reason I recommend ReadP is that it’s good, of course, but also that if you have ghc installed, you already have ReadP on your computer!

Oh, but you’d rather learn Parsec or some other more “batteries included” parser combinator library? No problem, the things you learn in this tutorial are things you can use in other parser combinator libraries too. In fact, I’ve decided to try to teach only functions and operators which have the same name across multiple parser combinator libraries. So you can totally read this tutorial and then use the things you learned with Parsec or Attoparsec or something along those lines.

To get acquainted with parser combinators, let’s start with the simplest parser I can think of: we’ll parse a single vowel.

Enter:

import Text.ParserCombinators.ReadP

isVowel :: Char -> Bool
isVowel char = any (char ==) "aouei"

vowel :: ReadP Char
vowel = satisfy isVowel

We make the helper function isVowel which simply returns True for any character that is a vowel. It does this by checking if the argument character is equal to any character in the string "aouei".
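As a side note, the same check is often spelled with elem from the Prelude. A minimal sketch; the primed name isVowel' is made up here only so that it does not clash with the definition above.

```haskell
-- Same behaviour as isVowel, written with elem instead of any.
-- The name isVowel' is hypothetical, chosen only to avoid a name clash.
isVowel' :: Char -> Bool
isVowel' c = c `elem` "aouei"
```

Which spelling you prefer is a matter of taste; both answer the same question for every character.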

isVowel is then used in the parser we name vowel, through the satisfy function from the ReadP library, one of our staples. This function is important, so let’s look at its type signature.

Enter:

satisfy :: (Char -> Bool) -> ReadP Char

It takes any function Char -> Bool and returns a parser that parses any character that passes the test function we give it. In this case, we give it isVowel, so it will return a parser that parses a single vowel. You could just as well imagine the satisfy isDigit parser that parses a single digit. Or a satisfy (== ' ') parser that parses only a single space character and will fail on anything else.
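Written out, those two imagined parsers might look like the sketch below. The names digit and space are made up for this illustration (they are not something ReadP exports); isDigit comes from Data.Char.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

-- Parses exactly one decimal digit character, such as '7'.
-- (digit is a hypothetical name, not part of ReadP.)
digit :: ReadP Char
digit = satisfy isDigit

-- Parses exactly one space character and fails on anything else.
-- (space is a hypothetical name, not part of ReadP.)
space :: ReadP Char
space = satisfy (== ' ')
```

The only thing that changes between vowel, digit and space is the test function handed to satisfy.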

Oh, and in case it is not evident yet, a value of type ReadP Char is a parser that parses characters and returns a Char value. A parser of type ReadP Float also parses characters (all parsers do) but returns a Float value. Any time you see the type ReadP something, you can mentally read it as “parser of something”.
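To make the “parser of something” reading concrete, here is a sketch of a parser that consumes a character but hands back an Int. The name digitValue is made up for illustration, and fmap is ordinary Functor machinery rather than anything specific to ReadP.

```haskell
import Data.Char (digitToInt, isDigit)
import Text.ParserCombinators.ReadP

-- Consumes one digit character from the input, but the value it
-- returns is an Int (e.g. the character '7' becomes the number 7).
-- (digitValue is a hypothetical name used only in this example.)
digitValue :: ReadP Int
digitValue = fmap digitToInt (satisfy isDigit)
```

The input is still made of characters; only the type of the returned value has changed.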

However, a parser is not itself a function that takes input. It needs to be “run” on some input by another function. In the case of ReadP, this is done by the confusingly named readP_to_S function, which takes a parser and an input and runs the parser on the input. We can test our vowel parser with that. This is its type signature, when it has been “demystified”:

Enter:

readP_to_S :: ReadP a -> String -> [(a, String)]

The output of readP_to_S might look a little odd at first, but by looking at several examples of it you will get a sense of what it means. In essence, readP_to_S returns a list of successful parses, where “a parse” loosely means the two-tuple (parsedValue, unparsedRemainderOfString). If the parser fails (i.e. could not parse anything at the beginning of the input) it will return the empty list. In action:

Enter:

λ> readP_to_S vowel "e"
[('e',"")]
λ> readP_to_S vowel "k"
[]
λ> readP_to_S vowel "another one bites the dust"
[('a',"nother one bites the dust")]
λ> readP_to_S vowel "did you see that"
[]

The first element of the tuple is the successful parse, the second element of the tuple is the unparsed remainder of the string.
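If you look up readP_to_S in the library documentation, its type is written ReadP a -> ReadS a. ReadS is a Prelude type synonym for String -> [(a, String)], which is all the “demystification” amounts to. A small sketch showing that the two spellings are interchangeable; runVowel and runVowelExpanded are names invented for this example.

```haskell
import Text.ParserCombinators.ReadP

isVowel :: Char -> Bool
isVowel c = c `elem` "aouei"

vowel :: ReadP Char
vowel = satisfy isVowel

-- The Prelude defines
--   type ReadS a = String -> [(a, String)]
-- so the two signatures below describe the same kind of function.
-- (Both names are hypothetical, used only for this illustration.)
runVowel :: ReadS Char
runVowel = readP_to_S vowel

runVowelExpanded :: String -> [(Char, String)]
runVowelExpanded = runVowel
```

The fact that the second definition type checks is the whole point: the compiler treats the two types as identical.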

If the string does not start with a vowel, the parser fails entirely. The parser will not automatically skip irrelevant characters, but leaves that decision up to the one who writes the parser. This greater control, while sometimes inconvenient, is normally useful.
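If you do want to skip leading junk, you have to say so yourself. One way, sketched here, is to put a skipMany in front of the parser. Neither skipMany nor *> has been introduced in this tutorial: skipMany is exported by ReadP, and *> is the standard Applicative operator meaning “run both, keep the right-hand result”. The name firstVowel is invented for this example.

```haskell
import Text.ParserCombinators.ReadP

isVowel :: Char -> Bool
isVowel c = c `elem` "aouei"

vowel :: ReadP Char
vowel = satisfy isVowel

-- Throw away any leading non-vowel characters, then parse one vowel.
-- (firstVowel is a hypothetical name used only in this example.)
firstVowel :: ReadP Char
firstVowel = skipMany (satisfy (not . isVowel)) *> vowel

-- readP_to_S firstVowel "did you see that"  gives  [('i',"d you see that")]
-- readP_to_S firstVowel "k"                 gives  []
```

The skipping is an explicit decision written into the parser, which is exactly the kind of control the paragraph above describes.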

However, reading just one vowel is not as interesting as reading several of them. Since readP_to_S returns the unparsed remainder of the input, we can imagine writing a function to chain together parsers.

Enter:

atLeastOne :: ReadP Char -> String -> [(String, String)]
atLeastOne parser input =
  case readP_to_S parser input of
    -- Empty list means failed parse, so this parser
    -- should fail too
    [] -> []
    -- Successfully parsed at least one character, so
    -- try parsing a few more by recursively calling
    -- atLeastOne
    [(char, remainder)] ->
      case atLeastOne parser remainder of
        -- After a successful parse, it failed when
        -- trying to do it again. Return the single
        -- successful parse
        [] -> [(char : "", remainder)]
        -- The recursive call was successful. Append
        -- our results to the rest of them, and return
        -- whatever is left of the input
        [(str, finalRemainder)] -> [(char : str, finalRemainder)]

While this works, as demonstrated below, it is a very bad idea.

Enter:

λ> atLeastOne vowel "aouibcdef"
[("aoui","bcdef")]
λ> atLeastOne vowel "gjshifu"
[]

Why is atLeastOne not good? For one, it is brittle (its case expressions only handle parsers that return at most one result) and it does not quite follow the expectations we have of parser combinators. Moreover, it is hugely inconvenient to write, and not very clear at all when trying to read it later.

This is where the combinator part of parser combinators comes in. Our atLeastOne function dealt with parsed results, while the combinator functions we want to use work with parsers.

For instance, there is the many1 combinator function in Text.ParserCombinators.ReadP which does exactly what we want. The type signature of it looks like

Enter:

many1 :: ReadP a -> ReadP [a]

In other words, it takes a parser that parses a single a (which in our case is Char) and returns a parser that parses several as. By “several”, I mean at least one, but potentially infinitely many.

With this, we can create

Enter:

atLeastOneVowel :: ReadP [Char]
atLeastOneVowel = many1 vowel

and behold! This might not be what you expected.

Enter:

λ> readP_to_S atLeastOneVowel "aouibcdef"
[("a","ouibcdef")
,("ao","uibcdef")
,("aou","ibcdef")
,("aoui","bcdef")]

Now we see why readP_to_S returns a list. “At least one vowel” can mean just one vowel. It can also mean two, or three, or four of them. So many1 accounts for these possibilities by simply giving back all possible parses, and lets you pick whichever one you wanted.
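Picking one of the parses is ordinary list processing. As a sketch, here is one possible helper that keeps the parse which consumed the most input, that is, the one with the shortest remainder. The name longestParse is made up for this example and is not part of ReadP.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)
import Text.ParserCombinators.ReadP

isVowel :: Char -> Bool
isVowel c = c `elem` "aouei"

vowel :: ReadP Char
vowel = satisfy isVowel

-- Run a parser and keep only the result with the shortest unparsed
-- remainder, or Nothing if the parser failed.
-- (longestParse is a hypothetical helper, not a ReadP function.)
longestParse :: ReadP a -> String -> Maybe (a, String)
longestParse parser input =
  case readP_to_S parser input of
    [] -> Nothing
    results -> Just (minimumBy (comparing (length . snd)) results)

-- longestParse (many1 vowel) "aouibcdef"  gives  Just ("aoui","bcdef")
-- longestParse (many1 vowel) "gjshifu"    gives  Nothing
```

Comparing remainders by length means the helper does not rely on the order in which readP_to_S happens to list its results.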

This may look problematic, but it turns out that often it does not matter, because most of the time there is only one possible parse anyway.
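One common situation with only a single possible parse is when the parser must consume the whole input. A sketch, assuming the eof parser that ReadP exports (it has not been introduced in this tutorial, and other libraries may spell it differently) together with the standard <* operator, which means “run both, keep the left-hand result”. The name onlyVowels is invented here.

```haskell
import Text.ParserCombinators.ReadP

isVowel :: Char -> Bool
isVowel c = c `elem` "aouei"

vowel :: ReadP Char
vowel = satisfy isVowel

-- One or more vowels, and then the input must be finished.
-- Shorter alternatives from many1 are discarded because eof
-- fails whenever there is input left over.
-- (onlyVowels is a hypothetical name used only in this example.)
onlyVowels :: ReadP String
onlyVowels = many1 vowel <* eof

-- readP_to_S onlyVowels "aoui"   gives  [("aoui","")]
-- readP_to_S onlyVowels "aouib"  gives  []
```

Here the ambiguity from many1 disappears without any picking on our part, because only one of its alternatives can be followed by the end of the input.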