Parsec Parser Testing with QuickCheck July 29, 2007

I must admit I’ve liked the concept of QuickCheck ever since I first found it. I tend to write tests that are “too nice” to my code, often missing corner cases and errors caused by strange input. QuickCheck helps me avoid this by generating test data for me. Even in trivial projects it has helped me detect bugs I wouldn’t have found through unit tests written myself.

In this entry I begin testing the parser for HST – A Smalltalk interpreter developed in Haskell (the code developed so far for the parser can be found here and here). To do this I’m going to make use of QuickCheck. I couldn’t find much information on testing a Parsec based parser using QuickCheck, but I’ll have a go and see what happens.

The approach I’ve decided to try here is to have QuickCheck randomly generate an instance of the Abstract Syntax Tree (AST) I’ve developed. I’ll develop a pretty printer that can convert this to a String . Parsing the String should then result in the originally generated AST. I’m not entirely sure this will work, but at least it should give me a good starting point.

Pretty Printing

First I must develop a pretty printer for my AST type. For this I’m going to use the pretty printing library that is part of Haskell’s standard library. It appears in the package Text.PrettyPrint.HughesPJ. Writing the pretty printer requires me to write a method converting each type to the libraries Doc type.

This turns out to be (mostly) quite straight forward for the limited AST that’s been implemented so far. The trickiest part is StringLiteral . I need to deal with single quote characters, as they are the string delimiters in Smalltalk. To include a single quote character in a string literal the programmer uses two single quotes (e.g., The string “Here’s a quote” would be programmed as 'Here''s a quote' ). To account for this a string escaping function is written.

identifier (Identifier i) = text i literal (StringLiteral s) = text $ "\\'" ++ (concatMap stringEscape s) ++ "\\'" stringEscape c | c == '\\'' = "''" | otherwise = [c] primary (PrimaryLiteral p) = literal p primary (PrimaryIdentifier i) = identifier i keyword (Keyword k) = text $ k ++ ":"

The <+> function allows us to specify that two Doc s are ‘beside’ each other and separated by a space. This is useful for those syntax constructs that are combinations of others.

message (KeywordMessage k p) = keyword k <+> primary p expression (BasicExpression p m) = primary p <+> message m

Testing in ghci:



> case (parse basicExpression "" "Transcript show: 'Hello World'") of Right e -> expression e Transcript show: 'Hello World'

So far, so good.

Testing

First up, I’m going to define the QuickCheck property that’s going to be tested. That will get the easy step out of the way.

checkExpressionParser ast = case (parse basicExpression "" (render $ expression ast)) of Left _ -> False Right a -> ast == a where types = ast :: BasicExpression

This results in a compilation error. This message is that there is no instance of Arbtrary for Expression . This was expected, as I have not defined how QuickCheck is to generate these types.

For those types that are composed of others, defining an instance of Arbitrary is easy.

instance Arbitrary Expression where arbitrary = liftM2 BasicExpression arbitrary arbitrary instance Arbitrary Message where arbitrary = liftM2 KeywordMessage arbitrary arbitrary

Primary is slightly more involved, as it could result in a PrimaryLiteral or PrimaryIdentifier . QuickCheck provides the oneof function which can be used for this.

instance Arbitrary Primary where arbitrary = oneof [ liftM PrimaryLiteral arbitrary , liftM PrimaryIdentifier arbitrary ]

For keywords and identifiers I’ll create a common generator, as a keyword in a keyword message is simply an identifier followed by a colon. This generator is required to generate a letter, followed by a sequence of letters or digits. The definition of letter can be implementation defined, but for now I’m just going to worry about the minimums, which is the lowercase and uppercase alphabet along with underscore.

I begin by defining lists that can be passed to QuickCheck’s frequency list for letters or digits. This is followed by the generators for letters and digits themselves. Finally these can be used to create an generator for identifiers. After this the instances of Arbitrary for Identifier and Keyword become trivial.

alphaFreqList = [ (26, choose ('a', 'z')) , (26, choose ('A', 'Z')) , (1, elements ['_']) ] digitFreqList = [ (10, choose ('0', '9')) ] letter = frequency alphaFreqList letterOrDigit = frequency $ alphaFreqList ++ digitFreqList identifierGenerator = liftM2 (:) letter $ sized (\

-> replicateM n letterOrDigit) instance Arbitrary Identifier where arbitrary = liftM2 Identifier identifierGenerator instance Arbitrary Keyword where arbitrary = liftM2 Keyword identifierGenerator

For StringLiteral we allow any ASCII character and also zero length strings.

instance Arbitrary StringLiteral where arbitrary = liftM StringLiteral $ sized (\

-> replicateM n $ choose ('\\0', '\\255'))

Running the Tests

I’ve defined a method that runs the check. I will run this from ghci

runChecks = quickCheck checkExpressionParser

Running this produces the following output (note that this may be different on each run, as the test data is generated randomly):

> runChecks Falsifiable, after 0 tests: BasicExpression (PrimaryIdentifier (Identifier "fkY")) (KeywordMessage (Keyword "y4d") (PrimaryIdentifier (Identifier "U0_")))

This corresponds to the Smalltalk code " fkY y4d: U0_ ". Getting a failure here is a relief, as I knew the parser wasn't complete yet. This test is failing because of the digits in the keyword and identifier. I can add in this feature with the following parser code:

stLetter = choice [letter, char '_', digit] identifierString = many1 stLetter keyword = Keyword `liftM` (identifierString >>~ char ':') identifier = Identifier `liftM` identifierString

Now, the parser is able to make it through 39 tests before QuickCheck reports and error. This time, it is with escaped quote characters in string literals. Again this is something left out of the original parser, so it is good that QuickCheck finds this. Modifying the parser again:

stringLiteral = StringLiteral `liftM` (quote >> stString >>~ quote) quote = char '\\'' stString = many $ choice [noneOf ['\\''], try $ string "''" >> return '\\'']

Now, all QuickCheck tests pass.

It is worth noting that this testing currently has short comings. The two most notable are:

No test for ensuring that a syntactially invalid program fails parsing,

Whitespace variations are not tested (e.g., tabs, line breaks etc.).

I'm not sure of how these can be tested using QuickCheck properties yet.

Conclusion

I must admit, I was expecting this step of developing HST to take longer and be more difficult. I am impressed with how quickly and easily Parsec, QuickCheck, and Text.PrettyPrint.HughesPJ have allowed me to develop this. Although the testing shown here does have shortcomings I'm very pleased with the testing that is in place and I think I have a good base from which to continue developing.