In part 2, we completed the user facing part of the compiler with semantic analysis. Now we're free to focus on the backend, generating LLVM. To do so, we'll use the llvm-hs-pure library, which embeds LLVM into a Haskell ADT, allowing us to manipulate it without having to set up any FFI. We will then pretty print the generated bytecode with llvm-hs-pretty and call clang on it to generate machine code for our preferred target. Some things to keep in mind when writing LLVM generating code:

There is very little type safety. The usual haskell guarantees of "if it typechecks it probably works" do not apply here at all. After all, we are writing basically writing assembly, so even if we get our code through the LLVM type checker, there's still a good chance that it won't do what we want it to. This is where having a comprehensive test suite is invaluable.

Documentation for LLVM in general is quite scant. The official language reference manual is comprehensive, but can be quite terse and provides little in the way of examples, although this gets better all the time and is not nearly as bad now as it was a few years ago. Documentation for llvm-hs specifically is even scarcer: basically just some tiny examples and the haskell port of the kaleidoscope tutorial, which is by now very outdated. This post will hopefully help matters somewhat but it leaves many parts of the LLVM IR and ecosystem unexplored.

Let clang help you. You can run it with -emit-llvm on a C file to see what it generates. This is helpful even when you aren't writing a C-like language as you can see how certain high-level language concepts map onto LLVM.

LLVM Basics

The following is a simplified version of how LLVM works in order to create a basic mental model. There are many omissions.

LLVM programs are split into modules containing toplevel definitions like functions and global variables. For our purposes, we are concerned with emitting a single module containing all of our functions, global variables, and typedefs for our structs. Within functions, code is contained in basic blocks. A basic block is a list of sequential instructions in SSA (Single Static Assignment) form that is terminated in a conditional or unconditional branch to another block or in a return instruction.

LLVM, unlike many forms of assembly, is typed. Its type system resembles C's in many ways, but there are some notable differences. Instead of C's confusing and target dependent integer types short , long , long long , etc., LLVM has arbitrary N bit width integers denoted iN . Also, void* is banned in LLVM, so char* or i8* is used instead. LLVM also has native vector types for SIMD instructions, but those won't factor into our discussion. There are other exotic features in the type system that help optimization like poison values but we won't discuss them here.

Codegen.hs

With that out of the way, we can start writing code. This module has quite a few imports.

{-# LANGUAGE RecursiveDo #-} {-# LANGUAGE TupleSections #-} {-# LANGUAGE FlexibleContexts #-} -- We need these to write a ConvertibleStrings instance for -- ShortByteString {-# LANGUAGE MultiParamTypeClasses #-} {-# OPTIONS_GHC -fno-warn-orphans #-} module Microc.Codegen ( codegenProgram ) where import qualified LLVM.AST.IntegerPredicate as IP import qualified LLVM.AST.FloatingPointPredicate as FP import LLVM.AST ( Operand ) import qualified LLVM.AST as AST import qualified LLVM.AST.Type as AST import qualified LLVM.AST.Constant as C import LLVM.AST.Name import LLVM.AST.Typed ( typeOf ) import qualified LLVM.IRBuilder.Module as L import qualified LLVM.IRBuilder.Monad as L import qualified LLVM.IRBuilder.Instruction as L import qualified LLVM.IRBuilder.Constant as L import LLVM.Prelude ( ShortByteString ) import qualified Data.Map as M import Control.Monad.State import Data.String ( fromString ) import Microc.Sast import Microc.Ast ( Type ( .. ) , Op ( .. ) , Uop ( .. ) , Bind ( .. ) , Struct ( .. ) ) import Data.String.Conversions import qualified Data.Text as T import Data.Text ( Text ) import Data.Word ( Word32 ) import Data.List ( find )

In llvm-hs , values that can be passed as arguments to LLVM instructions have type Operand . This includes basically all variables as well as declared functions. Just as in Semant , we define an Env type to hold information about our generated operands and how they correspond to the names of their corresponding source variables. Besides the operands, we need to keep track of all struct declarations and all string literals in the source, as we emit a unique global variable for each unique string literal.

data Env = Env { operands :: M . Map Text Operand , structs :: [ Struct ] , strings :: M . Map Text Operand } deriving ( Eq , Show ) registerOperand :: MonadState Env m => Text -> Operand -> m () registerOperand name op = modify $ \ env -> env { operands = M . insert name op (operands env) }

Utilities

Working with LLVM bindings in other languages usually involves passing mutable builder and module context objects to all instruction-emitting functions in order to ensure that all variables have unique names and to maintain the integrity of the module. This is important to ensure that code remains in SSA form.

Of course, since we're not working in other languages, this approach of passing around mutable objects would be severely un-ergonomic, at best. Fortunately, llvm-hs provides us with monads that emulate this behavior, ModuleBuilderT for the module context, and IRBuilderT for the builder object. We'll establish two type synonyms, LLVM for generating top level entities and Codegen for generating basic blocks.

type LLVM = L . ModuleBuilderT ( State Env ) type Codegen = L . IRBuilderT LLVM

We'll also write some utilities to query struct fields defined in our Env , to convert from MicroC types to LLVM types, and to calculate the sizes of MicroC types. For structs, we emit packed fields, which is pretty bad for performance, but makes calculating sizes very easy. Note that by this phase of the compiler, we no longer report errors to the user, so if anything goes wrong, we'll just crash.

getFields :: MonadState Env m => Text -> m [ Bind ] getFields name = do ss <- gets structs case find ( \ s -> structName s == name) ss of Nothing -> error "Internal error - struct not found" Just ( Struct _ binds) -> pure binds charStar :: AST . Type charStar = AST . ptr AST . i8 -- llvm-hs uses ShortByteString for names, but we want -- easy conversion to Text with cs from Data.String.Conversions instance ConvertibleStrings Text ShortByteString where convertString = fromString . T . unpack ltypeOfTyp :: MonadState Env m => Type -> m AST . Type ltypeOfTyp = \ case TyVoid -> pure AST . void TyInt -> pure AST . i32 TyChar -> pure AST . i8 TyFloat -> pure AST . double TyBool -> pure AST . i1 -- (void *) is invalid LLVM Pointer TyVoid -> pure $ charStar -- special case to handle recursively defined structures -- TODO: add real cycle checking so that improperly defined -- recursive types case the compiler to hang forever Pointer ( TyStruct n) -> pure $ AST . ptr ( AST . NamedTypeReference (mkName $ cs ( "struct." <> n))) Pointer t -> fmap AST . ptr (ltypeOfTyp t) TyStruct n -> do fields <- getFields n typs <- mapM (ltypeOfTyp . bindType) fields -- Packed structs aren't great for performance but very easy to code for now pure $ AST . StructureType { AST . isPacked = True , AST . elementTypes = typs } sizeof :: MonadState Env m => Type -> m Word32 sizeof = \ case TyBool -> pure 1 TyChar -> pure 1 TyInt -> pure 4 TyFloat -> pure 8 TyVoid -> pure 0 Pointer _ -> pure 8 TyStruct n -> do fields <- getFields n sizes <- mapM (sizeof . bindType) fields pure (sum sizes)

Expressions

LVals

Now, we're ready to generate code for expressions. First, LVal s. When generating an LVal , we generate an Operand corresponding to the address of the value. That way, we can use it as an argument to the store instruction. For variables, we simply look up the variable name in the Env .

codegenLVal :: LValue -> Codegen Operand codegenLVal ( SId name) = gets (( M .! name) . operands)

Since we are generating addresses, dereferencing is essentially the inverse of this, so we just generate code for the underlying expression.

codegenLVal ( SDeref e ) = codegenSexpr e

For struct access, we get to use the fascinating (read: confusing) getelementptr instruction. The instruction only calculates addresses, it doesn't load memory, so it's a perfect fit for the semantics of codegenLVal . We generate the address of the left hand side of the access and then have to pass two arguments to gep , a zero to access the memory pointed to by the address we just calculated, then the offset of the struct field we want to access, which we calculated in semant. Note that getelementptr handles calculating alignment, so we don't need to do it ourselves.

codegenLVal ( SAccess e i) = do e' <- codegenLVal e let zero = L . int32 0 offset = L . int32 (fromIntegral i) L . gep e' [zero, offset]

Literals

Most literals, as usual, are straightforward.

codegenSexpr :: SExpr -> Codegen Operand codegenSexpr ( TyInt , SLiteral i ) = pure $ L . int32 (fromIntegral i) codegenSexpr ( TyFloat , SFliteral f) = pure $ L . double f codegenSexpr ( TyBool , SBoolLit b ) = pure $ L . bit ( if b then 1 else 0 ) codegenSexpr ( TyChar , SCharLit c ) = pure $ L . int8 (fromIntegral c)

Strings, however, are not. We look up the string literal in the Env to see if we've generated a global variable for it before. If so, we just return that. Otherwise, we use globalStringPtr to generate a pointer to a global string variable. We name each variable "0.str", "1.str" etc., since mkName crashes with non-ASCII input, which we haven't explicitly forbidden in our string literals. Note that globalStringPtr returns a Constant which is distinct from an Operand , so we need to promote it with AST.ConstantOperand .

codegenSexpr ( Pointer TyChar , SStrLit s ) = do -- Generate a new unique global variable for every string literal we see strs <- gets strings case M . lookup s strs of Nothing -> do let nm = mkName (show ( M . size strs) <> ".str" ) op <- L . globalStringPtr (cs s) nm modify $ \ env -> env { strings = M . insert s ( AST . ConstantOperand op) strs } pure ( AST . ConstantOperand op) Just op -> pure op

Null pointers are generated with inttoptr .

codegenSexpr (t, SNull ) = L . inttoptr ( L . int64 0 ) =<< ltypeOfTyp t

Sizeof is calculated with the sizeof function we wrote earlier.

codegenSexpr ( TyInt , SSizeof t) = L . int32 . fromIntegral <$> sizeof t

The & operator finds the address of an LVal , which is already taken care of by codegenLVal .

codegenSexpr ( _ , SAddr e) = codegenLVal e

Binary operators

For assignment, we calculate the address of the left hand side, the value of the right hand side, and then store said value at the address, returning the value.

codegenSexpr ( _ , SAssign lhs rhs) = do rhs' <- codegenSexpr rhs lhs' <- codegenLVal lhs L . store lhs' 0 rhs' return rhs'

For the Binop constructor, we begin by generating code for the left and right and sides.

codegenSexpr (t, SBinop op lhs rhs) = do lhs' <- codegenSexpr lhs rhs' <- codegenSexpr rhs case op of

For addition on int s and float s, we simply generate the corresponding machine instruction. For pointer addition, getElementPtr takes care of calculating the offset for each pointer type so we don't have to worry about it.

Add -> case (fst lhs, fst rhs) of ( Pointer _ , TyInt ) -> L . gep lhs' [rhs'] ( TyInt , Pointer _ ) -> L . gep rhs' [lhs'] ( TyInt , TyInt ) -> L . add lhs' rhs' ( TyFloat , TyFloat ) -> L . fadd lhs' rhs' _ -> error "Internal error - semant failed"

For pointer subtraction, we do actually have to calculate the pointer width ourselves and divide the difference in addresses by it.

Sub -> let zero = L . int64 0 in case (fst lhs, fst rhs) of ( Pointer typ, Pointer typ') -> if typ' /= typ then error "Internal error - semant failed" else do lhs'' <- L . ptrtoint lhs' AST . i64 rhs'' <- L . ptrtoint rhs' AST . i64 diff <- L . sub lhs'' rhs'' width <- L . int64 . fromIntegral <$> sizeof typ L . sdiv diff width

Subtracting int s from pointers is similar to adding them, except that we negate the int before passing it to getElementPtr .

( Pointer _ , TyInt ) -> do rhs'' <- L . sub zero rhs' L . gep lhs' [rhs'']

For int s and float s, we again dispatch to the corresponding machine instruction.

( TyInt , TyInt ) -> L . sub lhs' rhs' ( TyFloat , TyFloat ) -> L . fsub lhs' rhs' _ -> error "Internal error - semant failed"

Multiplication and division are easy.

Mult -> case t of TyInt -> L . mul lhs' rhs' TyFloat -> L . fmul lhs' rhs' _ -> error "Internal error - semant failed" Div -> case t of TyInt -> L . sdiv lhs' rhs' TyFloat -> L . fdiv lhs' rhs' _ -> error "Internal error - semant failed"

For the exponentiation operator, all remaining cases are raising int s to int s. We can take this opportunity to write some non-trivial LLVM and implement exponentiation as repeated multiplication directly in the IR. In haskell, the algorithm would be

-- We can obviously be more terse but this form maps better onto LLVM raise lhs rhs = go 1 rhs where go acc expt = if expt == 0 then acc else let nextAcc = lhs * acc nextExpt = expt - 1 in go nextAcc nextExpt

In order to marry SSA with conditionals, LLVM uses phi nodes. Phi nodes must all appear at the very beginning of a basic block. There cannot be any non-phi instructions preceding them. The phi instruction takes a list of pairs. The first element of each pair is a value and the second element is the label of a basic block which has an outgoing branch to the block with phi nodes.

First, we need to get the label of the enclosing block so that we can start our new block. We then set acc and expt to phi nodes, such that if control flow proceeds into the loop_pow block from the enclosing scope, they are initialized to 1 and rhs , respectively, and if control flow is from continue , they are set to nextAcc and nextExpt . The if clause is handled by issuing a condBr if the expt has reached 0, at which point we either return the acc or jump back to loop_pow .

Note that we use mdo , courtesy of {-# LANGUAGE RecursiveDo #-} , instead of do , as we need to forward-reference the doneBlock and continueBlock s in our branch instruction. We can't define our blocks with L.block and then branch to them because calling L.block ends the current block and starts a new one. When using other LLVM bindings, one usually has to create all of the blocks and then manually position the builder at the correct location before emitting instructions. However, haskell's laziness allows us to avoid this inelegance and write branching code much more naturally.

Power -> mdo enclosing <- L . currentBlock L . br loop loop <- L . block ` L . named` "loop_pow" acc <- L . phi [( L . int32 1 , enclosing), (nextAcc, continueBlock)] ` L . named` "acc" expt <- L . phi [(rhs', enclosing), (nextExpt, continueBlock)] ` L . named` "expt" done <- L . icmp IP . EQ expt ( L . int32 0 ) L . condBr done doneBlock continueBlock continueBlock <- L . block ` L . named` "continue" nextAcc <- L . mul acc lhs' ` L . named` "next_acc" nextExpt <- L . sub expt ( L . int32 1 ) ` L . named` "next_expt" L . br loop doneBlock <- L . block ` L . named` "done" pure acc

(It is left as an exercise for the reader to implement a more efficient exponentiation algorithm in LLVM.)

The remaining binary operators all map directly onto their LLVM counterparts.

Equal -> case fst lhs of TyInt -> L . icmp IP . EQ lhs' rhs' TyBool -> L . icmp IP . EQ lhs' rhs' TyChar -> L . icmp IP . EQ lhs' rhs' Pointer _ -> L . icmp IP . EQ lhs' rhs' TyFloat -> L . fcmp FP . OEQ lhs' rhs' _ -> error "Internal error - semant failed" Neq -> case fst lhs of TyInt -> L . icmp IP . NE lhs' rhs' TyBool -> L . icmp IP . NE lhs' rhs' TyChar -> L . icmp IP . NE lhs' rhs' Pointer _ -> L . icmp IP . NE lhs' rhs' TyFloat -> L . fcmp FP . ONE lhs' rhs' _ -> error "Internal error - semant failed" Less -> case fst lhs of TyInt -> L . icmp IP . SLT lhs' rhs' TyBool -> L . icmp IP . SLT lhs' rhs' TyChar -> L . icmp IP . ULT lhs' rhs' TyFloat -> L . fcmp FP . OLT lhs' rhs' _ -> error "Internal error - semant failed" Leq -> case fst lhs of TyInt -> L . icmp IP . SLE lhs' rhs' TyBool -> L . icmp IP . SLE lhs' rhs' TyChar -> L . icmp IP . ULE lhs' rhs' TyFloat -> L . fcmp FP . OLE lhs' rhs' _ -> error "Internal error - semant failed" Greater -> case fst lhs of TyInt -> L . icmp IP . SGT lhs' rhs' TyBool -> L . icmp IP . SGT lhs' rhs' TyChar -> L . icmp IP . UGT lhs' rhs' TyFloat -> L . fcmp FP . OGT lhs' rhs' _ -> error "Internal error - semant failed" Geq -> case fst lhs of TyInt -> L . icmp IP . SGE lhs' rhs' TyBool -> L . icmp IP . SGE lhs' rhs' TyChar -> L . icmp IP . UGE lhs' rhs' TyFloat -> L . fcmp FP . OGE lhs' rhs' _ -> error "Internal error - semant failed" And -> L . and lhs' rhs' Or -> L . or lhs' rhs' BitAnd -> L . and lhs' rhs' BitOr -> L . or lhs' rhs'

Unary operators

There aren't any negation intrinsics in LLVM, but it's easy enough to implement them ourselves.

codegenSexpr (t, SUnop op e) = do e' <- codegenSexpr e case op of Neg -> case t of TyInt -> L . sub ( L . int32 0 ) e' TyFloat -> L . fsub ( L . double 0 ) e' _ -> error "Internal error - semant failed" Not -> case t of TyBool -> L . xor e' ( L . bit 1 ) _ -> error "Internal error - semant failed"

Function calls

For function calls, we generate code for each argument, look up the function in our Env , and then emit the call instruction. Note that we add an empty list to each argument. LLVM allows us to emit parameter attributes attached to each argument, which we don't really care about.

codegenSexpr ( _ , SCall fun es) = do es' <- mapM (fmap (, [] ) . codegenSexpr) es f <- gets (( M .! fun) . operands) L . call f es'

Casts

For casts from type t to t' , we simply use the corresponding instruction.

codegenSexpr ( _ , SCast t' e @ (t, _ )) = do e' <- codegenSexpr e llvmType <- ltypeOfTyp t' case (t', t) of ( Pointer _ , Pointer _ ) -> L . bitcast e' llvmType ( Pointer _ , TyInt ) -> L . inttoptr e' llvmType ( TyInt , Pointer _ ) -> L . ptrtoint e' llvmType -- Signed Int to Floating Point ( TyFloat , TyInt ) -> L . sitofp e' llvmType _ -> error "Internal error - semant failed"

Finally, for SNoexpr we just generate a 0, and if something got by the semantic checker, we crash.

codegenSexpr ( _ , SNoexpr ) = pure $ L . int32 0 -- Final catchall codegenSexpr sx = error $ "Internal error - semant failed. Invalid sexpr " <> show sx

Statements

Codegen for statements isn't too bad. In the case of naked expressions, returns, and blocks, we simply reuse the work from codegenSexpr .

codegenStatement :: SStatement -> Codegen () codegenStatement ( SExpr e) = void $ codegenSexpr e codegenStatement ( SReturn e) = case e of ( TyVoid , SNoexpr ) -> L . retVoid _ -> L . ret =<< codegenSexpr e codegenStatement ( SBlock ss) = mapM_ codegenStatement ss

For conditionals, we follow a similar strategy as we did in implementing integer exponentiation. We generate the condition, branch on it, generate statements for each alternative in the correct block, and then issue an unconditional branch to a merge block. One subtlety that we have to keep in mind is the possibility of a return inside of the if statement. LLVM only allows one kind of terminator in a block, so we can use the mkTerminator helper to check if that is the case and if so, do nothing.

codegenStatement ( SIf p cons alt) = mdo bool <- codegenSexpr p L . condBr bool thenBlock elseBlock thenBlock <- L . block ` L . named` "then" codegenStatement cons mkTerminator $ L . br mergeBlock elseBlock <- L . block ` L . named` "else" codegenStatement alt mkTerminator $ L . br mergeBlock mergeBlock <- L . block ` L . named` "merge" return ()

For do while loops, we immediately branch into the while block, generate the code for the condition and the body, then conditionally branch into either the while or merge blocks.

codegenStatement ( SDoWhile p body) = mdo L . br whileBlock whileBlock <- L . block ` L . named` "while_body" codegenStatement body continue <- codegenSexpr p mkTerminator $ L . condBr continue whileBlock mergeBlock mergeBlock <- L . block ` L . named` "merge" return ()

Functions

To generate function code, we use the function function (who said haskellers were bad at naming?!). We actually need to insert it into the Env before generating code for it in case it calls itself recursively. Fortunately, we have our trusty mdo . After generating the body, we have to re-insert all the strings we encountered back into the global Env so that they can be reused across functions (this is ugly and should be refactored.)

codegenFunc :: SFunction -> LLVM () codegenFunc f = mdo registerOperand (sname f) function (function, strs) <- locally $ do retty <- ltypeOfTyp (styp f) params <- mapM mkParam (sformals f) fun <- L . function name params retty genBody strings' <- gets strings pure (fun, strings') modify $ \ e -> e { strings = strs }

The L.function call merits further discussion. It has type

:: MonadModuleBuilder m => Name -> [(Type, ParameterName)] -> Type -> ([Operand] -> IRBuilderT m () -> m Operand)

which specializes to

:: Name -> [(Type, ParameterName)] -> Type -> ([Operand] -> Codegen ()) -> LLVM Operand (Now we understand why Codegen and LLVM are defined the way they are.)

The name is easy.

where name = mkName (cs $ sname f)

To make parameters, we just find the corresponding LLVM type and suggest the name as it appears in the source file.

mkParam ( Bind t n) = (,) <$> ltypeOfTyp t <*> pure ( L . ParameterName (cs n))

To generate the body, we first create an entry block.

genBody :: [ Operand ] -> Codegen () genBody ops = do _entry <- L . block ` L . named` "entry"

Then, for each of the Operand s that the function takes, we allocate space on the stack with alloca , store that Operand in that memory, and register the memory in our Env .

forM_ (zip ops (sformals f)) $ \ (op, Bind _ n) -> do -- typeOf is defined in LLVM.AST.Typed addr <- L . alloca (typeOf op) Nothing 0 L . store addr 0 op registerOperand n addr

Local variables are treated similarly, except that we can leave them as uninitialized memory.

forM_ (slocals f) $ \ ( Bind t n) -> do ltype <- ltypeOfTyp t addr <- L . alloca ltype Nothing 0 registerOperand n addr

Finally, we generate the body of the function.

codegenStatement (sbody f)

For built in functions, we can use extern to indicate to the linker to insert them into the final program.

emitBuiltIn :: ( String , [ AST . Type ], AST . Type ) -> LLVM () emitBuiltIn (name, argtys, retty) = do func <- L . extern (mkName name) argtys retty registerOperand (cs name) func -- Printf has varargs so we treat it separately builtIns :: [( String , [ AST . Type ], AST . Type )] builtIns = [ ( "printbig" , [ AST . i32] , AST . void) , ( "llvm.pow.f64" , [ AST . double, AST . double], AST . double) , ( "llvm.powi.i32" , [ AST . double, AST . i32] , AST . double) , ( "malloc" , [ AST . i32] , AST . ptr AST . i8) , ( "free" , [ AST . ptr AST . i8] , AST . void) ]

Globals

For global variables, we simply call global with a dummy 0 initial value and add the variable to our Env .

codegenGlobal :: Bind -> LLVM () codegenGlobal ( Bind t n) = do let name = mkName $ cs n initVal = C . Int 0 0 typ <- ltypeOfTyp t var <- L . global name typ initVal registerOperand n var

For toplevel structs, we register typedef s at the module level with L.typedef .

emitTypeDef :: Struct -> LLVM AST . Type emitTypeDef ( Struct name _ ) = do typ <- ltypeOfTyp ( TyStruct name) L . typedef (mkName (cs ( "struct." <> name))) ( Just typ)

Finally, we generate the entire SProgram by emitting all of the builtin functions, toplevel structs, global variables, and functions sequentially.

codegenProgram :: SProgram -> AST . Module codegenProgram (structs, globals, funcs) = flip evalState ( Env { operands = M . empty, structs = structs, strings = M . empty }) $ L . buildModuleT "microc" $ do printf <- L . externVarArgs (mkName "printf" ) [charStar] AST . i32 registerOperand "printf" printf mapM_ emitBuiltIn builtIns mapM_ emitTypeDef structs mapM_ codegenGlobal globals mapM_ codegenFunc funcs

(Full source for Codegen.hs here.)

The runtime

There's really very little runtime to speak of, but for demonstration purposes there's a runtime.c file that we link with all of our executables to provide the builtin functions.

#include <stdio.h> #include <stdlib.h> void printbig ( int c) { // elided... }

(Full source for runtime.c here).

Linking

Our "linker" is just a thin wrapper around clang . We'll create a Microc.Toplevel module to handle the details.

module Microc.Toplevel where import LLVM.AST import LLVM.Pretty import Data.String.Conversions import Data.Text ( Text ) import qualified Data.Text.IO as T import System.IO import System.Directory import System.Process import System.Posix.Temp import Control.Exception ( bracket )

The compile function, given a Module , generates an executable at the supplied path. We call ppllvm from llvm-hs-pretty to dump the textual representation of the Module into a file so that we can call clang on it. We use bracket to make sure that the build artifacts get cleaned up properly.

compile :: Module -> FilePath -> IO () compile llvmModule outfile = bracket (mkdtemp "build" ) removePathForcibly $ \ buildDir -> withCurrentDirectory buildDir $ do -- create temporary file for "output.ll" (llvm, llvmHandle) <- mkstemps "output" ".ll" let runtime = "../src/runtime.c" -- write the llvmModule to a file T . hPutStrLn llvmHandle (cs $ ppllvm llvmModule) hClose llvmHandle -- link the runtime with the assembly callProcess "clang" [ "-Wno-override-module" , "-lm" , llvm, runtime, "-o" , "../" <> outfile]

We also provide a run function that simply generates an executable, reads its output, then deletes it.

run :: Module -> IO Text run llvmModule = do compile llvmModule "./a.out" result <- cs <$> readProcess "./a.out" [] [] removePathForcibly "./a.out" return result

(Full source for Toplevel.hs here.)

Now we can finally finish writing runOpts in Main.hs .

runOpts :: Options -> IO () runOpts ( Options action infile ptype) = do program <- T . readFile infile let parseTree = case ptype of Combinator -> runParser programP infile program Generator -> Right $ parse . alexScanTokens $ T . unpack program case parseTree of Left err -> putStrLn $ errorBundlePretty err Right ast -> case action of Ast -> putDoc $ pretty ast <> "

" _ -> case checkProgram ast of Left err -> putDoc $ pretty err <> "

" Right sast -> let llvm = codegenProgram sast in case action of Sast -> pPrint sast LLVM -> T . putStrLn . cs . ppllvm $ llvm Compile outfile -> compile llvm outfile Run -> run llvm >>= T . putStr Ast -> error "unreachable"

(Full source for Main.hs here.)

Testing

For testing, we write a similar runner that takes a filepath and returns what would be the result of calling mcc <path> on it.

runFile :: FilePath -> IO Text runFile infile = do program <- T . readFile infile let parseTree = runParser programP (cs infile) program case parseTree of Left e -> return . cs $ errorBundlePretty e Right ast -> case checkProgram ast of Left err -> return . renderStrict $ layoutPretty defaultLayoutOptions (pretty err) Right sast -> run (codegenProgram sast)

We now have enough code to write passing tests for our compiler. They look very similar to the failing tests from part 2.

passing :: IO TestTree passing = do mcFiles <- findByExtension [ ".mc" ] "tests/pass" return $ testGroup "pass" [ goldenVsString (takeBaseName mcFile) outfile (cs <$> runFile mcFile) | mcFile <- mcFiles , let outfile = replaceExtension mcFile ".golden" ] main :: IO () main = defaultMain =<< goldenTests goldenTests :: IO TestTree goldenTests = testGroup "all" <$> sequence [passing, failing, parsing]

(Full source for Testall.hs here.)

Conclusion and Acknowledgments

The compiler is finished! In just 1500 or so lines of haskell, we've implemented a significant amount of the C programming language! Thanks to everyone who's been reading along. I've had fun writing and revisiting my old code. In particular, I'd like to thank