[Haskell-cafe] ANNOUNCE: Megaparsec – an improved and actively maintained fork of Parsec

Hello! I'd like to announce the release of Megaparsec, a fork of Parsec that has been in the works for the past 2 months. There's a lot of improvements and bugfixes under the hood – as well as a new test suite with 128 Quickcheck tests covering 80% of code (Parsec has only 3 tests, by the way) – but first I'd like to explain why a fork was needed, since forking a popular library is a pretty drastic measure and should be accompanied by an explanation. (A disclaimer: I've been given permission to announce the library, but I'm neither the author nor an expert on parsing.) Hackage: https://hackage.haskell.org/package/megaparsec Changelog (including a list of differences from Parsec): https://hackage.haskell.org/package/megaparsec/changelog Github: https://github.com/mrkkrp/megaparsec If you ever had any ideas about what Parsec should've done differently, or what amazing new combinators it should include, etc., post your ideas here: https://github.com/mrkkrp/megaparsec/issues Why fork Parsec instead of writing a new library ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ There are many parsing libraries on Hackage; Parsec was one of the first to appear, and it's still the one that is recommended to beginners when they ask about doing parsing in Haskell. Most other libraries aren't exactly trying to compete with Parsec – instead they explore new directions. Like it or not, it remains a fact that a lot of people are being recommended Parsec, a lot of people are using Parsec, and a lot of people will probably continue to use Parsec since there's no clear alternative to it. Writing a new and *different* library probably won't change it. Even new and *similar* libraries (trifecta and attoparsec are similar enough) haven't removed the need for Parsec, only mitigated it somewhat. Perhaps one day trifecta or something else will completely replace Parsec, but right now we still have to put up with Parsec. So, what we need (or at least what would be nice to have) is simply a better Parsec. What's wrong with original Parsec ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Parsec's development has stagnated long ago; you'd have a hard time getting your pull request merged, even if the only thing it does is fixing a typo (I'm not even talking about anything more controversial than that). Parsec isn't perfect or bug-free, and there *is* a need for those pull requests – just look at its Github page, where you'll find 8 unmerged PRs (and 9 open issues): https://github.com/aslatter/parsec In addition to things that could be fixed but simply aren't, there are some inconveniences (and bugs!) that are hard to fix without breaking backwards compatibility: * “notFollowedBy eof” will just silently not do what you expect it to do (a bug old enough to be considered an undocumented feature) * “<|>” and “many” are redefined and so importing Text.Parsec clashes with Control.Applicative (this can't be trivially fixed in Parsec because its “<|>” has different precedence from Control.Applicative's “<|>”) * you can't wrap Parser into monad transformers (I'm talking about things like “WriterT [String] Parser a” – if this was possible, there'd be no need for “user state” baked into ParsecT) * “Text.Parsec.Token” is not flexible enough for many needs; if you depend on it, one day you may find that you have to copy the whole module to get the behavior you want (look at https://github.com/aslatter/parsec/issues/15 and https://github.com/aslatter/parsec/issues/24, for example). In short, there may be a lot of value in improving Parsec – and that's where Megaparsec comes in. What is Megaparsec and how is it different ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Megaparsec “reboots” the development of Parsec. It's not backwards compatible, but it *is* compatible enough to avoid having to rewrite all Parsec tutorials, and code written for Parsec can be converted to use Megaparsec pretty mechanically. Given that, I would recommend using Megaparsec instead of Parsec from now on, unless you need compatibility with GHCs older than 7.10. Here is a detailed account of some of the bigger changes: === Error reporting === Megaparsec's errors messages are significantly more accurate than Parsec's in some cases, as the following demonstration shows. Parsec: > parseTest (try (string "let") <|> string "lexical") "le" parse error at (line 1, column 1): unexpected end of input expecting "lexical" Megaparsec: > parseTest (try (string "let") <|> string "lexical") "le" parse error at line 1, column 1: unexpected "le" expecting "let" or "lexical" And here's another one, showing off how sometimes “try a <|> b” is a bit less harmful than usual. Just in case, I'm talking about this blog post: http://blog.ezyang.com/2014/05/parsec-try-a-or-b-considered-harmful/ If you don't want to read it in full, the gist of it is that if you use “try” too liberally, Parsec's error messages become much worse than they could be. In this example, we're trying to parse a simple version of Haskell's import statements, which can look either as “import Foo” or “import qualified Foo as B”. If you write a parser like this: try pQualifiedImport <|> pImport you will get an uninformative error message on the following input: import qualified Foo s B Specifically, the “try pQualifiedImport” branch will fail, and then the “pImport” branch will stumble upon “qualified” – while the actual error isn't there, but in misspelled “as”. The advice given in the blog post (“The scope of backtracking try should be minimized”) is good, and applies to Megaparsec just as well as it applies to Parsec. However, a curious thing is that in this particular case Parsec *should* be able to tell you that the error is in misspelled “as”, because Parsec implements the longest match rule – errors that occur later in text are given precedence over errors that occur earlier. The code is there, but somehow it doesn't do what it's supposed to do (and Megaparsec fixes that). Here's the code from the article, rewritten for Megaparsec: import Text.Megaparsec import qualified Text.Megaparsec.Lexer as L data Stmt = QualifiedImport String String | Import String deriving (Show) pStmt = try pQualifiedImport <|> pImport pImport = do keyword "import" Import <$> upperCaseIdentifier pQualifiedImport = do keyword "import" keyword "qualified" QualifiedImport <$> upperCaseIdentifier <*> (keyword "as" *> upperCaseIdentifier) upperCaseIdentifier = lexeme $ (:) <$> upperChar <*> many (alphaNumChar <|> oneOf "_.") lexeme = L.lexeme (hidden space) keyword = L.symbol (hidden space) And here are the error messages it produces. Megaparsec: > parseTest (pStmt >> eof) "import qualified Foo s B" parse error at line 1, column 22: unexpected 's' expecting "as" Parsec (with minor modifications): > parseTest (pStmt >> eof) "import qualified Foo s B" parse error at (line 1, column 8): unexpected "q" expecting uppercase letter === Integration with monad transformers === The key type of Parsec is “ParsecT” (others, such as “Parsec” and “Parser”, are just type synonyms). It lets you use parsers with other monads – for instance, if you use “ParsecT String () IO”, you can have IO in your parsers. This is the reason, by the way, why “char” has the type char :: Stream s m Char => Char -> ParsecT s u m Char instead of a simpler char :: Char -> Parser Char (If it was the latter, you wouldn't be able to use “char” and IO in the same parser.) However, even this type isn't general enough for all things you might want to do. Imagine that you want some parsers to generate warnings when they are run, and later you want to collect those warnings and do something with them. This sounds like what Writer was invented for, so you try to use it. Now all your parsers have this type: ParsecT String () (Writer [Warning]) (Where “Warning” is, say, a synonym for String.) Unfortunately, this is not how you should've composed ParsecT and Writer, because this way you don't get backtracking – in other words, if you try to do optional (try someParser) and someParser generates warnings but then fails, “optional” won't be able to make them disappear – they will still be recorded. Same with State – if you do x <* notFollowedBy y and “y” changes the state, this change will be recorded even tho it's not what you want most of the time. (You can use Parsec's internal state and it *will* work, but it doesn't help you when you want to use Writer or something else instead of State.) What you want in such situations is WriterT [Warning] Parser () StateT YourState Parser () ... but you can't get it because “char” and all other primitive parsers simply don't have those types. With Parsec, the only solution is to apply “lift” to all parsers you want to use, which is pretty annoying. Megaparsec solves this by introducing an mtl-style “MonadParsec” class, making primitive parsers members of this class, and providing instances of MonadParsec for various monad transformers. (If you have ever used the ‘parsers’ library, you may recognise this approach.) I think being able to get backtracking behavior without relying on inelegant ways like “Parsec user state” is pretty neat, even if it's not something every Parsec user needs. === Lexing === Parsec has a lexing module (Text.Parsec.Token): http://hackage.haskell.org/package/parsec/docs/Text-Parsec-Token.html If you're not familiar with lexing, the idea is as follows. When you are parsing a programming language, you often have to solve the same set of problems – parsing numbers, string literals (with all those escaping rules), identifiers/operators/keywords, comments, making all parsers handle whitespace (since there can be whitespace between pretty much any 2 tokens), and so on. With “Text.Parsec.Token” you could just specify what counts as whitespace, as a comment, as an identifier character, etc. and get a set of parsers “for free”. Parsec achieves this by defining a huge structure called “GenLanguageDef” that contains the specification of your language: data GenLanguageDef s u m = LanguageDef { commentStart :: String, commentEnd :: String, commentLine :: String, nestedComments :: Bool, identStart :: ParsecT s u m Char, identLetter :: ParsecT s u m Char, opStart :: ParsecT s u m Char, opLetter :: ParsecT s u m Char, reservedNames :: [String], reservedOpNames :: [String], caseSensitive :: Bool } Then you use “makeTokenParser” on it to generate another huge structure containing lots of useful parsers: makeTokenParser :: GenLanguageDef ... -> GenTokenParser ... data GenTokenParser s u m = TokenParser { identifier :: ParsecT s u m String, operator :: ParsecT s u m String, charLiteral :: ParsecT s u m Char, stringLiteral :: ParsecT s u m String, natural :: ParsecT s u m Integer, integer :: ParsecT s u m Integer, float :: ParsecT s u m Double, lexeme :: forall a. ParsecT s u m a -> ParsecT s u m a, parens :: forall a. ParsecT s u m a -> ParsecT s u m a, braces :: forall a. ParsecT s u m a -> ParsecT s u m a, comma :: ParsecT s u m String, colon :: ParsecT s u m String, ... } What's the problem with this approach? It's very inflexible – the moment you want to change something that wasn't supposed to be changed, you're on your own. Do you want to special-case “-- |” comments to use them as doc strings? The easiest solution is to copy the whole module into your own project: https://github.com/aslatter/parsec/issues/15 Do you want to handle newlines by yourself (for instance, to allow them to be expression separators)? The easiest solution is to fork Parsec: https://github.com/aslatter/parsec/issues/24 (Even if those turn out not to be the easiest solutions, it's still somewhat telling that they were what the authors ended up with.) Megaparsec uses a simpler, more flexible approach. See the docs here: https://hackage.haskell.org/package/megaparsec/docs/Text-Megaparsec-Lexer.html The linked module provides 3 categories of functions: * “integer”, “decimal”, “float”, “charLiteral”, etc are generic parsers that you can use to parse... well, things that they are named after. Some of those – like “integer” – are occasionally useful even if you're not parsing any languages (how often did you have to write “read <$> many1 digit”? now you don't have to). * “skipLineComment” and “skipBlockComment” generate comment parsers, and “space” combines them together. You can make comment parsers arbitrarily complex before passing them to “space”, or you can write your own space-skipping combinator. * “lexeme”, “symbol”, and “symbol'” make lexemes out of things; a lexeme, by convention, expects no leading whitespace and skips all trailing whitespace. So, instead of having parsers being passed to each other under the hood, you now have to pass them by yourself – except that you don't actually have to do much passing, because you can just write import qualified Text.Megaparsec.Lexer as L lexeme = L.lexeme space symbol = L.symbol space and off you go: keyword = label "keyword" . symbol parens = between (symbol "(") (symbol ")") ... etc ... How you can help ~~~~~~~~~~~~~~~~ * Take a project, modify it to use Megaparsec, file an issue if you've encountered any difficulties. * Take a Parsec tutorial and rewrite it for Megaparsec. (And then please send an email so that a link to your tutorial could be added to the README file.) * As I've already mentioned before, if you have any ideas about what could be changed/improved in Parsec, they likely apply to Megaparsec as well – propose them on the issue tracker. * Report typos in documentation (same goes for mistakes, unclear phrasing, etc). If you're shy (like me) and don't like opening issues, just ping me on IRC (I'm “indiagreen” on Freenode) if you spot anything.