Welcome to the beginning of the compiler proper! If you haven't yet, check out part 0 for a description of the project and help setting up the development environment.

A note about the presentation structure: we will go through each phase of the compiler mostly in its entirety before moving on to the next phase. Most other tutorials and books don't do this; instead, they build up the language by implementing one feature at a time at each stage of the pipeline. When building your own compiler, I strongly recommend that incremental approach rather than trying to write the complete parser, or the complete semantic analyzer, in one go, as having a working pipeline, however minimal, to test with is invaluable for debugging. I've decided to present a fairly complete version of the language at every pass for better code locality, and because I've already written the whole thing and it would be a lot of work to split it back up into separate features. In future parts, we'll explore adding features to an already complete pipeline, but they'll be ones that I haven't implemented yet.

Defining the language

First, we'll encode the C abstract syntax tree in Haskell types. C expressions can be literal integers, floats, characters, strings (we'll say much more about strings later), booleans, null (I'm sorry), and other, more complicated forms.

module Microc.Ast where

import Data.Text ( Text )
import Data.Text.Prettyprint.Doc
import Data.Char ( chr )

-- Binary operators
data Op = Add
        | Sub
        | Mult
        | Div
        -- I've added an exponentiation operator ** for fun
        | Power
        | Equal
        | Neq
        | Less
        | Leq
        | Greater
        | Geq
        | And
        | Or
        | BitAnd
        | BitOr
        deriving (Show, Eq)

-- Unary operators
-- (no post/pre-fix increment or decrement for simplicity)
data Uop = Neg | Not
  deriving (Show, Eq)

data Expr = Literal Int
          | StrLit Text
          | CharLit Int        -- ^ Chars get lowered to ints during codegen
          | Fliteral Double
          | BoolLit Bool       -- ^ true | false
          | Null
          | Id Text            -- ^ variables
          | Binop Op Expr Expr -- ^ binary operators
          | Unop Uop Expr      -- ^ unary operators (just - and !)
          | Call Text [Expr]   -- ^ f(x, y, z)
          | Cast Type Expr     -- ^ (int *)malloc(100)
          | Access Expr Expr   -- ^ struct access
          | Deref Expr         -- ^ *pointer
          | Addr Expr          -- ^ &expression
          | Assign Expr Expr
          | Sizeof Type
          | Noexpr             -- ^ used for dangling if's
          deriving (Show, Eq)
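To make the encoding concrete, here's a minimal, self-contained sketch of how a C fragment maps onto these constructors. It uses only a subset of the Expr type above, with String standing in for Text so the snippet has no dependencies outside of base:

```haskell
-- Subset of the Expr type from Microc.Ast, with String instead of Text
-- so the snippet needs no packages beyond base.
data Op = Add deriving (Show, Eq)

data Expr = Literal Int
          | Id String
          | Binop Op Expr Expr
          | Deref Expr
          | Assign Expr Expr
          deriving (Show, Eq)

-- The C fragment `*p = x + 1` as an AST value:
example :: Expr
example = Assign (Deref (Id "p")) (Binop Add (Id "x") (Literal 1))

main :: IO ()
main = print example
```

Running this prints `Assign (Deref (Id "p")) (Binop Add (Id "x") (Literal 1))`, which is exactly the tree the parser will need to produce later.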

Statements can be naked expressions, blocks, returns, or control flow. Our language will only have if, for, and while. At some later time, I might experiment with adding goto, break, and continue, as they're quite difficult/interesting to generate llvm for, but for now we have enough to worry about.

data Statement = Expr Expr
               | Block [Statement]
               | Return Expr
               | If Expr Statement Statement
               | For Expr Expr Expr Statement
               | While Expr Statement
               deriving (Show, Eq)

The notable omissions from C's type system are unions, arrays, function pointers, and typedefs. Out of C's vast array of numeric types, we just support int (32-bit) and float (actually double precision), as implementing the rest of them isn't spectacularly interesting. We will, however, have normal pointers and structs, which are plenty powerful on their own.

data Type = Pointer Type
          | TyInt
          | TyBool
          | TyFloat
          | TyChar
          | TyVoid
          | TyStruct Text
          deriving (Show, Eq)

Finally, with struct, global variable, and function definitions, we can represent whole (single file) Micro C programs.

data Bind = Bind { bindType :: Type, bindName :: Text }
  deriving (Show, Eq)

data Struct = Struct { structName :: Text, structFields :: [Bind] }
  deriving (Show, Eq)

data Function = Function
  { typ     :: Type
  , name    :: Text
  , formals :: [Bind]
  , locals  :: [Bind]
  , body    :: [Statement]
  }
  deriving (Show, Eq)

data Program = Program [Struct] [Bind] [Function]
  deriving (Eq, Show)

(Full source of the AST here)

Using the excellent prettyprinter library, we can define Pretty instances for the AST types which will allow us to serialize our AST back to valid C source. They are elided in this post, as they're very mechanical to write, but they are at the end of the linked source, for anyone interested.

Parsing the language

Parser theory is a vast field, and one of the oldest in computer science. After all, we needed to be able to parse computer languages in order to stop having to write assembly and move on to higher level problems. Writing parsers is also a favorite pastime of the haskell community; there are 210 libraries in the "Parsing" category on hackage alone, ranging from parsec-style combinator libraries, to lex/yacc style parser generators, to implementations of Earley's algorithm, and more. I cannot claim to give even a remotely thorough treatment of various parsing strategies here, but I can at least demonstrate two different methods of parsing and compare them.

Alex/Happy

First, we'll discuss using haskell's lex/yacc style parser generator libraries. Alex is a lexer generator, similar to Lex, which, given a set of rules with regular expressions, transforms the text of a source file into a series of tokens. The top of the Alex file declares the generated module name and necessary imports.

{
module Microc.Scanner.Generator where
import Microc.Ast
}

Then, we specify the "wrapper" type and some regex variables.

%wrapper "basic"

$alpha = [a-zA-Z]
$digit = 0-9
$newline = [\r\n]

The meat of the lexer is the rule for how to transform characters into tokens. Code inside braces denotes a haskell function of type String -> Lexeme which is called on whatever matches the regex rule on the left. A semicolon tells Alex to ignore the matching text, which is how comments are implemented.

tokens :-

$white+ ;
"/*" ( $newline | [^\*] | \*+ ($newline | [^\/]) )* "*/" ;
"//" [^$newline]* $newline ;
\(       { const LPAREN }
\)       { const RPAREN }
\{       { const LBRACE }
\}       { const RBRACE }
\;       { const LSemi }
\,       { const LComma }
\+       { const LAdd }
\-       { const LSub }
\*       { const LMul }
\/       { const LDiv }
\=       { const LAssign }
\=\=     { const LEqual }
\!\=     { const LNeq }
\<       { const LLess }
\<\=     { const LLeq }
\>       { const LGreater }
\>\=     { const LGeq }
\&\&     { const LAnd }
\|\|     { const LOr }
\!       { const LNot }
\&       { const LBitAnd }
\|       { const LBitOr }
\*\*     { const LPow }
\.       { const LDot }
\-\>     { const LArrow }
"if"     { const LIf }
"else"   { const LElse }
"for"    { const LFor }
"while"  { const LWhile }
"return" { const LRet }
"int"    { const $ LType TyInt }
"float"  { const $ LType TyFloat }
"bool"   { const $ LType TyBool }
"char"   { const $ LType TyChar }
"void"   { const $ LType TyVoid }
"struct" { const LStruct }
"true"   { const $ LBool True }
"false"  { const $ LBool False }
"NULL"   { const LNull }
"sizeof" { const LSizeof }

Variables and numeric and string literals are the most interesting pieces of the scanner, as we need to handle floating point conventions and string parsing. Fortunately, we can abuse haskell's read function to do most of the heavy lifting for us.

$digit+                                    { LInt . read }
$digit+ \. $digit* ([eE] [\+\-]? $digit+)? { LFloat . read }
$alpha [$alpha $digit \_]*                 { LId }
\" [^\"]* \"   { LStrLit . read
                 -- this doesn't handle quote escaping
               }
\' [^\'\\] \'  { LCharLit . ord . head . init . tail }
\'\\$digit+\'  { LCharLit . read . init . drop 2 }
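Since the point-free pipelines in those rules are easy to misread, here is a small standalone demonstration (plain Haskell, no alex required) of what read, ord, init, tail, and drop do to the matched text:

```haskell
import Data.Char (ord)

main :: IO ()
main = do
  -- $digit+ rule: read parses decimal integers
  print (read "42" :: Int)
  -- float rule: read also handles scientific notation
  print (read "3.5e2" :: Double)
  -- char literal rule: strip the surrounding quotes, then take the char code
  print ((ord . head . init . tail) "'a'")
  -- escaped numeric char literal like '\10': drop the leading '\ and the
  -- trailing quote, then read the remaining digits
  print ((read . init . drop 2) "'\\10'" :: Int)
```

This prints 42, 350.0, 97, and 10, matching the Lexeme payloads the real rules would produce.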

Finally, we have the definition of the Lexeme type.

{
data Lexeme = LInt Int
            | LFloat Double
            | LStrLit String
            | LCharLit Int
            | LId String
            | LType Type
            | LStruct
            | LBool Bool
            | LNull
            | LRet
            | LAssign
            | LComma
            | LSemi
            | LPAREN
            | RPAREN
            | LBRACE
            | RBRACE
            | LBRACK
            | RBRACK
            | LFor
            | LWhile
            | LIf
            | LElse
            | LAdd
            | LSub
            | LMul
            | LDiv
            | LEqual
            | LNeq
            | LLess
            | LLeq
            | LGreater
            | LGeq
            | LAnd
            | LOr
            | LNot
            | LBitAnd
            | LBitOr
            | LPow
            | LDot
            | LArrow
            | LSizeof
}

(Full source of the alex lexer here.)

After scanning the source and generating the token list, we pass it to happy, which generates an LR bottom up parser for our language. As with alex, the beginning of the file is the haskell module and import specification.

{
module Microc.Parser.Generator where
import Microc.Scanner.Generator
import Microc.Ast
import Data.Text ( pack )
import Prelude hiding ( fst, snd )
}

Then, we declare the name of the generated parsing function, the token type it will act on, and what to do in case of errors.

%name parse
%tokentype { Lexeme }
%error { parseError }

Happy also requires us to re-declare all of the token types from our lexer generator. The reason for this is unclear to me, as similar tools in other languages don't have this step. However, at least we can use symbols to denote operators instead of writing LBRACE everywhere. The $$'s represent the arguments of the Lexeme constructors.

%token
  int    { LInt $$ }
  float  { LFloat $$ }
  id     { LId $$ }
  ptype  { LType $$ }
  char   { LCharLit $$ }
  string { LStrLit $$ }
  bool   { LBool $$ }
  null   { LNull }
  return { LRet }
  struct { LStruct }
  sizeof { LSizeof }
  '='    { LAssign }
  ','    { LComma }
  ';'    { LSemi }
  '('    { LPAREN }
  ')'    { RPAREN }
  '{'    { LBRACE }
  '}'    { RBRACE }
  for    { LFor }
  while  { LWhile }
  if     { LIf }
  else   { LElse }
  '+'    { LAdd }
  '-'    { LSub }
  '*'    { LMul }
  '/'    { LDiv }
  '=='   { LEqual }
  '!='   { LNeq }
  '<'    { LLess }
  '<='   { LLeq }
  '>'    { LGreater }
  '>='   { LGeq }
  '&&'   { LAnd }
  '||'   { LOr }
  '!'    { LNot }
  '&'    { LBitAnd }
  '|'    { LBitOr }
  '**'   { LPow }
  '.'    { LDot }
  '->'   { LArrow }

Next, we have the precedence rules for our operators going from lowest to highest. We include a dummy NOELSE token to solve ambiguities arising from a possible dangling else.

%nonassoc NOELSE
%nonassoc else
%right '='
%left '|'
%left '&'
%left '||'
%left '&&'
%left '==' '!='
%left '<' '>' '<=' '>='
%left '+' '-'
%left '*' '/'
%right '**'
%right '!' NEG
%left '.' '->'

Now, the important part of the parser: the recursive parsing rules. Care needs to be taken to make sure that, as much as possible, recursive calls to rules happen at the left of a production, as bottom up parsers can deal with these in constant space, whereas right recursion can blow up with sufficiently nested files. This means that when parsing lists of items, we'll need to reverse at some point. As with alex, code in braces is valid haskell, with $1, $2, etc. representing productions from the 1st, 2nd, etc. rules on the left. The %% is required by happy to indicate the beginning of the production rules.

Proceeding top-down, a program consists of a list of struct declarations, global variables, and functions.

%%

program : decls { Program (reverse $ fst $1) (reverse $ snd $1) (reverse $ thd $1) }

decls : {- empty -} { ([], [], []) }
      | decls sdecl { (($2 : fst $1), (snd $1), (thd $1)) }
      | decls vdecl { ((fst $1), ($2 : snd $1), (thd $1)) }
      | decls fdecl { ((fst $1), (snd $1), ($2 : thd $1)) }

A function declaration is a return type followed by a left paren, followed by 0 or more arguments, followed by a right paren, followed by a list of local variables and statements all enclosed in curly braces.

fdecl : typ id '(' formals_opt ')' '{' vdecl_list stmt_list '}'
  { Function { typ     = $1
             , name    = pack $2
             , formals = $4
             , locals  = reverse $7
             , body    = reverse $8 } }

formals_opt : {- empty -} { [] }
            | formal_list { reverse $1 }

formal_list : typ id                 { [Bind $1 (pack $2)] }
            | formal_list ',' typ id { Bind $3 (pack $4) : $1 }

Types are either 0 or more levels of indirection to a raw type or struct. I added pointers to the language after I added the exponentiation operator, so I didn't realize the problems it would cause with pointer dereferencing, but we can use a clever trick to "untokenize" the exponentiation operator back into two dereferencing operators when appropriate.

typ : ptype stars     { foldr (const Pointer) $1 $2 }
    | struct id stars { foldr (const Pointer) (TyStruct (pack $2)) $3 }

stars : { [] }
      | stars '*'  { $2 : $1 }
      -- A hack to get around the power operator clashing with
      -- the dereferencing operator
      | stars '**' { $2 : $2 : $1 }
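The foldr (const Pointer) idiom in the typ rule is compact enough to deserve a demonstration. Each element of the stars list (its contents don't matter, only its length) wraps the base type in one more level of Pointer. A standalone sketch of the idea, with a stub Type and a hypothetical applyStars helper standing in for the grammar action:

```haskell
-- Stub subset of the Type definition from the Ast module
data Type = Pointer Type | TyInt deriving (Show, Eq)

-- Wrap the base type once per star; `const Pointer` ignores the
-- list element itself and just applies another Pointer layer.
applyStars :: Type -> [a] -> Type
applyStars = foldr (const Pointer)

main :: IO ()
main = print (applyStars TyInt "**")  -- int ** becomes Pointer (Pointer TyInt)
```

This is also why the '**' alternative pushes two elements onto the list: two stars must mean two levels of indirection.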

Struct declarations are the keyword struct followed by a list of variable declarations enclosed in curly braces.

sdecl : struct id '{' vdecl_list '}' ';' { Struct (pack $2) (reverse $4) }

vdecl_list : {- empty -}      { [] }
           | vdecl_list vdecl { $2 : $1 }

vdecl : typ id ';' { Bind $1 (pack $2) }

Now we have the rules for parsing statements and expressions. The statement code is fairly straightforward except that we have to decorate an if statement with no else block with %prec NOELSE in order to avoid ambiguity in the grammar. We also have a good demonstration of the usefulness of Noexpr as a placeholder for empty slots in for loops or naked returns.

stmt_list : {- empty -}    { [] }
          | stmt_list stmt { $2 : $1 }

stmt : expr ';'                                        { Expr $1 }
     | return ';'                                      { Return Noexpr }
     | return expr ';'                                 { Return $2 }
     | '{' stmt_list '}'                               { Block (reverse $2) }
     | if '(' expr ')' stmt %prec NOELSE               { If $3 $5 (Block []) }
     | if '(' expr ')' stmt else stmt                  { If $3 $5 $7 }
     | for '(' expr_opt ';' expr ';' expr_opt ')' stmt { For $3 $5 $7 $9 }
     | while '(' expr ')' stmt                         { While $3 $5 }

expr_opt : {- empty -} { Noexpr }
         | expr        { $1 }

The expression rule is also fairly straightforward, except for the interaction of the power operator with pointer dereferencing. We also use the %prec NEG directive to force prefix operators to bind as tightly as possible.

expr : int                        { Literal $1 }
     | float                      { Fliteral $1 }
     | char                       { CharLit $1 }
     | string                     { StrLit (pack $1) }
     | bool                       { BoolLit $1 }
     | null                       { Null }
     | id                         { Id (pack $1) }
     | expr '+' expr              { Binop Add $1 $3 }
     | expr '-' expr              { Binop Sub $1 $3 }
     | expr '*' expr              { Binop Mult $1 $3 }
     | expr '/' expr              { Binop Div $1 $3 }
     | expr '==' expr             { Binop Equal $1 $3 }
     | expr '!=' expr             { Binop Neq $1 $3 }
     | expr '<' expr              { Binop Less $1 $3 }
     | expr '<=' expr             { Binop Leq $1 $3 }
     | expr '>' expr              { Binop Greater $1 $3 }
     | expr '>=' expr             { Binop Geq $1 $3 }
     | expr '&' expr              { Binop BitAnd $1 $3 }
     | expr '|' expr              { Binop BitOr $1 $3 }
     | expr '&&' expr             { Binop And $1 $3 }
     | expr '||' expr             { Binop Or $1 $3 }
     | expr '**' expr             { Binop Power $1 $3 }
     | '-' expr %prec NEG         { Unop Neg $2 }
     | '*' expr %prec NEG         { Deref $2 }
     -- A hack to get around having the power operator
     | '**' expr %prec NEG        { Deref (Deref $2) }
     | '&' expr %prec NEG         { Addr $2 }
     | '!' expr                   { Unop Not $2 }
     | expr '=' expr              { Assign $1 $3 }
     | id '(' actuals_opt ')'     { Call (pack $1) $3 }
     | '(' typ ')' expr %prec NEG { Cast $2 $4 }
     | expr '.' expr              { Access $1 $3 }
     | expr '->' expr             { Access (Deref $1) $3 }
     | sizeof '(' typ ')'         { Sizeof $3 }
     | '(' expr ')'               { $2 }

actuals_opt : {- empty -}  { [] }
            | actuals_list { reverse $1 }

actuals_list : expr                  { [$1] }
             | actuals_list ',' expr { $3 : $1 }

Finally, the footer contains the parseError function as well as some utilities.

{
parseError _ = error "Unable to parse tokens"

fst (a, _, _) = a
snd (_, b, _) = b
thd (_, _, c) = c
}

Cabal automatically knows how to call alex and happy to generate the relevant haskell modules which will provide functions alexScanTokens :: String -> [Lexeme] and parse :: [Lexeme] -> Program .

(Full source of the happy parser here.)

Megaparsec

Now, we'll write the same scanner and parser using megaparsec. We could actually reuse the tokens generated by alex with megaparsec and skip scanning, but it's instructive to see how megaparsec handles the whole process (I also had the megaparsec version working long before I even considered using a parser generator). For further reading, I recommend the official megaparsec tutorial.

For the scanner, we have our standard import boilerplate as well as the definition of our parser type as the vanilla Parsec monad with no custom state component consuming Text .

module Microc.Scanner.Combinator where

import Data.Void
import Data.Char
import Text.Megaparsec
import Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L
import Data.Text ( Text )
import qualified Data.Text as T
import Control.Monad ( void )
import Data.String.Conversions

type Parser = Parsec Void Text

We then define our "space consumer," which we use to create higher order functions to deal with lexemes and symbols so that they intelligently handle whitespace.

sc :: Parser ()
sc = L.space space1 lineCmnt blockCmnt
 where
  lineCmnt  = L.skipLineComment "//"
  blockCmnt = L.skipBlockComment "/*" "*/"

lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc

symbol :: Text -> Parser Text
symbol = L.symbol sc

Then, we create some convenience functions to handle paired delimiters and common symbols.

parens :: Parser a -> Parser a
parens = between (symbol "(") (symbol ")")

braces :: Parser a -> Parser a
braces = between (symbol "{") (symbol "}")

dquotes :: Parser a -> Parser a
dquotes = between (single '"') (single '"')

squotes :: Parser a -> Parser a
squotes = between (single '\'') (single '\'')

semi :: Parser ()
semi = void $ symbol ";"

comma :: Parser ()
comma = void $ symbol ","

star :: Parser ()
star = void $ symbol "*"

To handle reserved words, we attempt to parse a given string, not followed by anything except for whitespace. If we fail, we backtrack the parser so that it doesn't get stuck in the middle of a word.

rword :: Text -> Parser ()
rword w = (lexeme . try) (string w *> notFollowedBy alphaNumChar)

rws :: [Text] -- list of reserved words
rws =
  [ "if"
  , "then"
  , "else"
  , "while"
  , "true"
  , "false"
  , "for"
  , "int"
  , "bool"
  , "char"
  , "float"
  , "void"
  , "return"
  , "struct"
  , "NULL"
  , "sizeof"
  ]

Like with alex, we take advantage of haskell's read function to parse strings and character literals.

strlit :: Parser Text
strlit = do
  content <- dquotes $ takeWhileP Nothing (/= '"')
  pure $ T.pack (read ('"' : cs content ++ "\""))

charlit :: Parser Int
charlit =
  squotes $ (ord <$> satisfy (`notElem` ['\\', '\''])) <|> (single '\\' >> int)

Identifiers are a little more cumbersome in megaparsec than the $alpha [$alpha $digit \_]* rule in alex because we have to also check that they aren't the same as any reserved words.

identifier :: Parser Text
identifier = (lexeme . try) (p >>= check)
 where
  p = fmap T.pack $ (:) <$> letterChar <*> many (alphaNumChar <|> single '_')
  check x = if x `elem` rws
    then fail $ "keyword " <> show x <> " cannot be an identifier"
    else return x

Lexing int 's and float 's, however, is already implemented for us by the library, so we just wrap the definitions in our lexeme combinator.

int :: Parser Int
int = lexeme L.decimal

float :: Parser Double
float = lexeme L.float

(Full source for the megaparsec scanner here.)

The parser begins similar to the scanner.

module Microc.Parser.Combinator
  ( programP
  , runParser
  , errorBundlePretty
  )
where

import Microc.Ast
import Microc.Scanner.Combinator
import Text.Megaparsec
import Control.Monad.Combinators.Expr
import Control.Applicative ( liftA2, liftA3 )
import Data.Either

To handle operator precedence, we construct a value of type [[Operator Parser Expr]] which we will later pass to makeExprParser. Ordering is from highest to lowest precedence, with operators in the same list having equal priority. Each constructor of the Operator type takes a function of type Parser (Expr -> Expr -> Expr) or Parser (Expr -> Expr), depending on whether the operator is binary or unary, and dispatches said parser appropriately for each operator.

Some contortions are necessary to handle chained prefix operators like multiple pointer dereferences, double negation, and operators that are prefixes of operators, like | for bitwise or and || for logical or. Had we reused the tokens from alex, this would not be a problem.

opTable :: [[Operator Parser Expr]]
opTable =
  [ [ InfixL $ Access <$ symbol "."
    , InfixL $ (\lhs rhs -> Access (Deref lhs) rhs) <$ symbol "->"
    ]
  , [ unary (Unop Neg) "-"
    , unary (Unop Not) "!"
    , unary Deref "*"
    , unary Addr "&"
    ]
  , [infixR Power "**"]
  , [infixL Mult "*", infixL Div "/"]
  , [infixL Add "+", infixL Sub "-"]
  , [infixL Leq "<=", infixL Geq ">=", infixL Less "<", infixL Greater ">"]
  , [infixL' Equal "==", infixL Neq "!="]
  , [infixL' BitAnd "&"]
  , [infixL' BitOr "|"]
  , [infixL' And "&&"]
  , [infixL' Or "||"]
  , [InfixR $ Assign <$ symbol "="]
  ]
 where
  -- Megaparsec doesn't support multiple prefix operators by default,
  -- but we need this in order to parse things like double negatives,
  -- nots, and dereferences
  unary op sym = Prefix $ foldr1 (.) <$> some (op <$ symbol sym)
  infixL op sym = InfixL $ Binop op <$ symbol sym
  -- Primed infixL' is useful for operators which are prefixes of other operators
  infixL' op sym = InfixL $ Binop op <$ operator sym
  infixR op sym = InfixR $ Binop op <$ symbol sym
  operator sym = lexeme $ try (symbol sym <* notFollowedBy opChar)
  opChar = oneOf ("!#$%&*+./<=>?@\\^|-~" :: [Char])
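The trickiest line above is the unary helper: some collects one or more occurrences of the prefix symbol, and foldr1 (.) composes the resulting constructors into a single function. Here is a standalone sketch of just that composition step, with a stub Expr type and a hypothetical chain helper in place of the parser machinery:

```haskell
-- Stub subset of Expr, just enough to show the composition
data Expr = Id String | Deref Expr deriving (Show, Eq)

-- What `foldr1 (.) <$> some (op <$ symbol sym)` produces after matching,
-- say, three stars: one function that applies the constructor three times.
chain :: [Expr -> Expr] -> Expr -> Expr
chain = foldr1 (.)

main :: IO ()
main = print (chain [Deref, Deref, Deref] (Id "p"))
```

This prints `Deref (Deref (Deref (Id "p")))`, which is why `***p` parses correctly even though megaparsec's Prefix constructor only applies its parser once per term.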

We can now write the rest of our expression parser.

termP :: Parser Expr
termP =
  try (Cast <$> parens typeP <*> exprP)
    <|> parens exprP
    <|> Null <$ rword "NULL"
    <|> try (Fliteral <$> float)
    <|> Literal <$> int
    <|> BoolLit <$> (True <$ rword "true" <|> False <$ rword "false")
    <|> Sizeof <$> (rword "sizeof" *> parens typeP)
    <|> try (Call <$> identifier <*> parens (exprP `sepBy` comma))
    <|> CharLit <$> charlit
    <|> StrLit <$> strlit
    <|> Id <$> identifier

exprP :: Parser Expr
exprP = makeExprParser termP opTable

exprMaybe :: Parser Expr
exprMaybe = option Noexpr exprP

Unlike in our parser generator, we need to be very careful about the ordering of our alternatives and where to include the try combinator so that our parser doesn't end up in an invalid state or match on characters too eagerly.

Struct, type, and variable declarations are all straightforward.

structP :: Parser Struct
structP =
  Struct <$> (rword "struct" *> identifier) <*> braces (many vdeclP) <* semi

typeP :: Parser Type
typeP = do
  baseType <-
    TyInt <$ rword "int"
    <|> TyBool <$ rword "bool"
    <|> TyFloat <$ rword "float"
    <|> TyChar <$ rword "char"
    <|> TyVoid <$ rword "void"
    <|> TyStruct <$> (rword "struct" *> identifier)
  foldr (const Pointer) baseType <$> many star

vdeclP :: Parser Bind
vdeclP = Bind <$> typeP <*> identifier <* semi

Statements and function definitions are very similar to what we wrote in happy. For dangling else's, we don't have to do anything particularly special, as megaparsec isn't the kind of parser that checks for ambiguities.

statementP :: Parser Statement
statementP =
  Expr <$> exprP <* semi
    <|> Return <$> (rword "return" *> exprMaybe <* semi)
    <|> Block <$> braces (many statementP)
    <|> ifP
    <|> forP
    <|> whileP

ifP :: Parser Statement
ifP = liftA3 If (rword "if" *> parens exprP) statementP maybeElse
  where maybeElse = option (Block []) (rword "else" *> statementP)

forP :: Parser Statement
forP = do
  rword "for"
  (e1, e2, e3) <- parens
    $ liftA3 (,,) (exprMaybe <* semi) (exprP <* semi) exprMaybe
  For e1 e2 e3 <$> statementP

whileP :: Parser Statement
whileP = liftA2 While (rword "while" *> parens exprP) statementP

fdeclP :: Parser Function
fdeclP =
  Function
    <$> typeP
    <*> identifier
    <*> formalsP
    <*> (symbol "{" *> many vdeclP)
    <*> (many statementP <* symbol "}")

formalsP :: Parser [Bind]
formalsP = parens $ formalP `sepBy` comma
  where formalP = liftA2 Bind typeP identifier

Finally, a full program is potentially some whitespace followed by a list of structs and global variables followed by a list of functions and the end of the file.

programP :: Parser Program
programP = between sc eof $ do
  structsOrGlobals <- many $ try (Left <$> structP) <|> (Right <$> try vdeclP)
  let structs = lefts structsOrGlobals
      globals = rights structsOrGlobals
  Program structs globals <$> many fdeclP
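The lefts and rights functions from Data.Either do the separating here: each top-level declaration is tagged Left (a struct) or Right (a global), they're parsed in whatever interleaved order the source file uses, and then split into two lists. A quick illustration, with strings standing in for the parsed declarations:

```haskell
import Data.Either (lefts, rights)

main :: IO ()
main = do
  -- Structs and globals may be interleaved in the source file
  let decls = [Left "struct A", Right "int x", Left "struct B", Right "float y"]
  print (lefts decls)   -- the structs, in source order
  print (rights decls)  -- the globals, in source order
```

Both functions preserve the relative order of their elements, so no reversing is needed, unlike in the happy grammar.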

(Full source of the megaparsec parser here.)

Wiring everything together

Now that we've written our parsers, presumably we'll want to run them on some actual source files. First, we'll create a Microc.hs file in src/ that exports all of the Microc modules that we've written so far.

module Microc
  ( module X
  )
where

import Microc.Ast as X
import Microc.Scanner.Combinator as X
import Microc.Parser.Combinator as X
import Microc.Scanner.Generator as X
import Microc.Parser.Generator as X

In app/Main.hs , we'll set up an options parser using optparse-applicative so we can tell our executable to pretty print the parsed AST, the semantically checked AST, or the LLVM output, or compile the executable. The run option is just a convenience that will compile the executable, run it and then cleanup intermediate files.

module Main where

-- Microc parser conflicts with Options.Applicative parser
import Microc hiding ( Parser )
import Options.Applicative
import LLVM.Pretty
import Data.String.Conversions
import qualified Data.Text as T
import qualified Data.Text.IO as T
import Text.Pretty.Simple
import Data.Text.Prettyprint.Doc
import Data.Text.Prettyprint.Doc.Render.Text

data Action = Ast -- TODO
            | Sast
            | LLVM
            | Compile FilePath
            | Run

data ParserType = Combinator -- ^ megaparsec
                | Generator  -- ^ alex/happy

data Options = Options { action :: Action
                       , infile :: FilePath
                       , parser :: ParserType }

We want the behavior of our executable to be:

mcc --ast <file>              # prints the ast of the file
mcc --ast --generator <file>  # parses using alex/happy and prints the ast
mcc --sast <file>             # prints the semantically checked ast of the file
mcc --llvm <file>             # prints the generated llvm bytecode
mcc --compile <file> -o a.out # produces an executable a.out
mcc <file>                    # compiles the file and prints the output from running it
# etc.

This can be mostly accomplished using the flag' combinator which allows the use of --options as toggles.

actionP :: Parser Action
actionP =
  flag' Ast (long "ast" <> short 'a' <> help "Pretty print the ast")
    <|> flag' Sast (long "sast" <> short 's' <> help "Pretty print the sast")
    <|> flag' LLVM (long "llvm" <> short 'l' <> help "Pretty print the generated llvm")
    <|> flag' Compile (long "compile" <> short 'c' <> help "Compile to an executable")
    <*> strOption (short 'o' <> value "a.out" <> metavar "FILE")
    -- running the file to see the expected output is default
    <|> pure Run

parserP :: Parser ParserType
parserP =
  flag' Combinator
        (long "combinator" <> help "Use the megaparsec parser implementation (default).")
    <|> flag' Generator
              (long "generator" <> short 'g' <> help "Use alex and happy to parse.")
    <|> pure Combinator -- default to megaparsec

optionsP :: Parser Options
optionsP =
  Options
    <$> actionP
    <*> strArgument (help "Source file" <> metavar "FILE")
    <*> parserP

The main logic lives in runOpts :: Options -> IO (), which is called by main in conjunction with the standard invocation of execParser from optparse-applicative.

main :: IO ()
main = runOpts =<< execParser (optionsP `withInfo` infoString)
 where
  withInfo opts desc = info (helper <*> opts) $ progDesc desc
  infoString =
    "Run the mcc compiler on the given file. \
    \Passing no flags will compile the file, execute it, and print the output."

runOpts :: Options -> IO ()
runOpts (Options action infile ptype) = do
  program <- T.readFile infile
  let parseTree = case ptype of
        Combinator -> runParser programP infile program
        Generator  -> Right $ parse . alexScanTokens $ T.unpack program
  case parseTree of
    Left  err -> putStrLn $ errorBundlePretty err
    Right ast -> case action of
      Ast -> putDoc $ pretty ast <> "\n"
      _   -> error "Not yet implemented"

(Full source for app/Main.hs here. Note that it is for the complete compiler including the passes that we haven't yet discussed.)

Testing the compiler

For our compiler, we will primarily test with golden tests provided by tasty-golden. Golden tests are essentially files containing the expected outputs of running various commands. In our case, we will have two test directories: tests/pass for programs that are expected to run successfully and produce output, and tests/fail for programs with semantic errors for which we hope to emit moderately helpful error messages. Each directory contains pairs of files, cool_program.mc and cool_program.golden, with the source and the expected output of running mcc cool_program.mc, respectively.

However, since we have yet to write the semantic analyzer or code generation passes, we can't test them yet. All we can test at this stage is that our combinator and generator parsers produce identical ASTs when successful and that neither parser succeeds where the other one fails. We won't attempt to test whether they return the same error messages on failure because rewriting the parser generator to emit real error messages is very time consuming and would require an overhaul of its entire structure. Instead, we resort to lifting the parser generator functions into IO where we can catch exceptions (yes, haskell does allow this, although it's rarely done).
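The try . evaluate idiom that makes this possible is worth a small demonstration: evaluate forces the pure value inside IO, so the exception thrown by error is raised at a point where try can catch it. A standalone sketch, using a hypothetical fakeParse function in place of the generated parser and catching ErrorCall (the exception type that error throws):

```haskell
import Control.Exception (try, evaluate, ErrorCall)

-- A stand-in for a parser that dies with `error` on bad input
fakeParse :: String -> Int
fakeParse "good" = 42
fakeParse _      = error "Unable to parse tokens"

main :: IO ()
main = do
  ok  <- try (evaluate (fakeParse "good")) :: IO (Either ErrorCall Int)
  bad <- try (evaluate (fakeParse "oops")) :: IO (Either ErrorCall Int)
  print ok
  case bad of
    Left _  -> putStrLn "caught the parse failure"
    Right _ -> putStrLn "unexpectedly succeeded"
```

Without evaluate, the lazily-thrown error could escape the try and crash the test runner when the result is later forced.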

We construct a test tree by going through the pass directory and running both parsers on each file.

module Main where

import Test.Tasty ( defaultMain, TestTree, testGroup )
import Test.Tasty.Golden
import Test.Tasty.HUnit
import System.FilePath ( takeBaseName, replaceExtension )
import Microc
import Data.String.Conversions
import qualified Data.Text.IO as T
import Data.Text ( Text )
import Data.Text.Prettyprint.Doc
import Data.Text.Prettyprint.Doc.Render.Text
import Control.Monad
import Control.Exception

parsing :: IO TestTree
parsing = do
  files <- concat <$> mapM (findByExtension [".mc"]) ["tests/pass", "tests/fail"]
  fmap (testGroup "parsing") $ forM files $ \file -> do
    input      <- T.readFile file
    combinator <- pure $ runParser programP file input
    generator  <-
      try . evaluate . parse . alexScanTokens $ cs input
        :: IO (Either IOError Program)
    pure . testCase file $ case (combinator, generator) of
      (Right ast, Right ast') -> assertEqual file ast ast'
      (Left _, Left _)        -> pure ()
      _                       -> assertFailure file

main :: IO ()
main = defaultMain =<< parsing

Once we've implemented more of the compiler, we'll extend the test runner with more test groups. Using Tasty's testGroup functionality, we can restrict our test run to only some of the tests by running them with nix-shell --pure --run "cabal new-run testall -- --pattern 'parsing'" to only run the 'parsing' tests, for example.

(Full source for tests/Testall.hs here. Same as with Main , contains material to be covered in future posts.)

Addendum: My informal comparison of parser combinators vs. generators

| Combinators | Generators |
|---|---|
| Easier to write - vanilla Haskell | Custom, weird syntax that no one ever remembers offhand |
| Decent default error messages | Good parse errors are possible, but they take significant work |
| Lots of tutorials and blog posts | Scarce documentation aside from the official manuals |
| Works on String, strict/lazy ByteString, and Text | Only String and lazy ByteString |
| Grammar sensitive to ordering of alternatives - need to be careful about exploding runtime with try | Robust to ordering of alternatives - automatically uses longest match |
| No way to tell if grammar is ambiguous | You'll get a lovely shift-reduce conflict if your grammar has any ambiguities |
| Used by hnix, Idris | Used by GHC |

Performance wise, it's unclear to me which one is faster, as I haven't benchmarked anything. My guess is that megaparsec can run very quickly with careful optimization and minimizing the use of try whereas alex/happy is pretty fast by default. In the current state of the codebase, it would be very unfair to compare the two, as megaparsec reads Text whereas alex reads String . A fair comparison would have to use ByteString .

Anyway, thanks for reading until the end! Stay tuned for part 2, semantic analysis.