foo=bar

offset=17

a=1 b=2 c=3

normalPair = do name

char '='

value

return (name, value)

notSpaces = many1 (satisfy (not . isSpace))

string

string=text with embedded spaces and newlines

string=my damn string length=14 hurt=bad

string=trojan length=6 nukes=launch_em_all length=35

anyPair = try stringPair

normalPair

stringPair

try

normalPair

stringPair

= do s

char '='

(len, str)

return (s, take len str)

stringPair

scanForLength

scanForLength = do r

return (r, [])

do x

(r, xs)

return (r, (x:xs))

scanForLength

manyTill

lengthPair

lengthPair = string "length=" >> many1 digit >>= return . read

For the past number of years, whenever I’ve needed to write a parser for a language, I’ve turned to Terence Parr’s ANTLR . It’s a wonderful piece of software, far more capable than the old standbys lex and yacc. Just as appealing is the fact that it generates recursive-descent parsers that are easy to read and debug. Compare an ANTLR-generated parser with the pages of pushdown automaton state transition muck that yacc and its descendants generate and you’ll quickly appreciate the clean nature of ANTLR’s code. In the Haskell world, a close approximation to ANTLR is available in the form of Parsec , a library for building parsers using combinators. Where ANTR has its own special syntax, and you run a tool that generates code in some target language, you construct a Parsec parser entirely in Haskell. The Parsec documentation is excellent, with a great tutorial that makes it clear how easy it is to use Parsec to build a parser. In my case, I wanted to parse a simple language that has some structure, but no formal grammar, and a few nasty corners. This language makes heavy use of key/value pairs, in which the keys and values are usually just contiguous non-whitespace strings. In other words, a typical pair might be ““, or ““. Sequences of pairs are separated by white space, so three pairs might look like ““. It’s easy to recognise one of these pairs.One of the lurking nasty bits of this language lies in how it represents string literals. A key in this case will be the word "", and the value will be the text of the string. Unfortunately, the text of the string isn't escaped in any way, so a literal might look like "". Oh, no! Where did my nice regular language go? Fortunately, a string literal is always followed immediately by a pair that tells me how long it was, with a key of "length", and a value that's an integer. To further complicate matters, there can be an arbitrary amount of white space between the end of the string literal and the start of the length pair; some of this white space can belong to the string, while the remainder is padding. So in order to recognise a string properly, I have to identify the start of the string, scan for the length pair, parse the number for the length, then use that to chop out the actual string to return as the value of this production.(Aside: The astute reader will notice that we're sunk if a string literal contains a piece of text that looks like a valid length pair. This would allow an attacker to inject key/value pairs that could lead us off into the weeds, depending on what we are trying to do with all of this data. Consider the example below, where the "real" string is in italics, but it contains an embedded string that we can't distinguish.Kids, remember what mom said! When a stranger comes up to you and suggests that you cruft together a simple file format, just say "no" to dumping unquoted literal strings out! You never know who'll be watching!) If I want to be able to recognise a normal pair or one of these bizarre string creatures, it's easy enough.If(yet to be defined) doesn't match,will "unconsume" the characters it ate trying, and thenwill try to useto match.The key tois that it usesto recognise the "length=n" string. That function returns both the accumulated string that it scanned and the value of the length pair:This definition ofis very similar to Parsec's standardcombinator. Finally, let's provide a definition ofthat avoids "do" notation, just because I could:What I like about Parsec is the ability to do things in a freewheeling manner when necessary, like parse these nasty string literals; this simple task is rather more painful with a traditional parser/lexer generator. Oh, and I haven't mentioned the error messages that Parsec generates. If a parse fails, it does a nice job of telling you where the problem occurred, and what it was expecting to match at the time. Compared to trying to produce meaningful errors from yacc, it's rather a lot nicer. There are a few downsides to using Parsec, compared to more traditional tools. The biggest is that a normal parser generator will tell you if your production rules will lead to nontermination (e.g. via a production whose RHS matches its LHS, like foo -> foo). Since Parsec isn't a parser generator, it doesn't have the ability to inspect productions; you'll get an infinite loop at runtime. However, the Parsec documentation does a good job of explaining its pitfalls, and how you can work around them. It's a lovely piece of software.