In this post, I'm going to implement a simple markup language including a parser and type checker. The language's syntax will be based on Haskell's syntax, and the type system will look a little bit like System F.

Why am I interested in implementing such a markup language? Well, markup languages like JSON are good at describing the values in a dataset, but not their types, and XML extends this by allowing you to specify a schema, which corresponds to a set of typing rules. Languages like XAML allow you to extend this idea even futher by using an actual type system as a sort of global schema, and then extending the expressiveness of the language with things like subclass polymorphism.

In my opinion, XAML is great for quickly bootstrapping a DSL without having to worry about parsing, but the lack of generics makes it impossible to encode some simple constraints in the schema. I often find myself wanting a markup language which looks like Haskell and which uses similar typing rules.

The simplicity of the language below allows for a very simple implementation: typing is a simple bottom-up pass over a term.

With that, let's include some modules:

module Markup where import Text.Parsec import qualified Text.Parsec.Token as P import qualified Text.Parsec.Language as L import qualified Text.Parsec.Expr as E import Data.Char import Data.List (groupBy, nub, (\\)) import Data.Function (on) import Control.Arrow ((***)) import Control.Applicative ((<$>), (<*>)) import Control.Monad (guard, zipWithM) import qualified Control.Monad.State as S

Language Definition and Examples

I would like separate languages for type declarations and values. Here are some examples of the sorts of things I would like to be able to write (note, not all of these are expected to pass the typechecker):

exampleSchema = unlines [ "Nil :: List Z a;" , "Singleton :: a -> List (S Z) a;" , "Cons :: a -> List n a -> List (S n) a;" , "Reverse :: List n a -> List n a;" , "Zip :: List n a -> List n b -> List n (Pair a b);" , "Tip :: Tree Z a;" , "Branch :: Tree n a -> a -> Tree n a -> Tree (S n) a" ] example1 = "x = Zip (Cons 1 $ Cons 2 Nil) (Cons 1 Nil)" example2 = "x = Zip (Cons 1 $ Reverse $ Cons 2 $ Cons 3 Nil) (Cons \"a\" $ Cons \"b\" $ Singleton \"c\")" example3 = "x = Branch (Branch Tip 1 Tip) 2 (Branch Tip 3 Tip)" example4 = "x = Branch (Branch Tip 1 Tip) 2 Tip" example5 = "x = Cons (Branch Tip 1 Tip) Nil" example6 = "x = 1 `Cons` (2 `Cons` (3 `Cons` Nil))" example7 = "x = 1 `Cons` Nil `Zip` (2 `Cons` Nil)" example8 = "x = Nil `Cons` Nil `Zip` ((Cons 1 $ Nil) `Cons` Nil)" example9 = unlines [ "x = Nil;" , "y = Cons x Nil;" , "z = Cons y Nil" ]

Notice that the type definition language supports things like type constructors and phantom types to allow me to encode simple constraints in the type of a declaration. For example, I can enforce the fact that the Zip constructor takes two lists of the same length. Also, there are no higher order functions, since there are no user-defined functions other than constructors, and no data declarations: all types are open for extension.

The value language copies several ideas from Haskell: infix constructors using backticks, and application using both juxtaposition and $ . Other than that, the syntax is very simple.

Here are the data types which represent terms and types, along with their pretty printing functions:

data Tm = TmName String | TmAtom String | TmStr String | TmInt Integer | TmApp Tm Tm deriving (Show, Eq) prettyTm :: Tm -> String prettyTm (TmName s) = s prettyTm (TmAtom s) = s prettyTm (TmStr s) = show s prettyTm (TmInt n) = show n prettyTm (TmApp t1 t2@(TmApp _ _)) = prettyTm t1 ++ " (" ++ prettyTm t2 ++ ")" prettyTm (TmApp t1 t2) = prettyTm t1 ++ " " ++ prettyTm t2 data Ty = TyCon String | TyApp Ty Ty | TyArr Ty Ty | TyVar String deriving (Show, Eq) prettyTy :: Ty -> String prettyTy (TyCon s) = s prettyTy (TyVar v) = v prettyTy (TyApp t1 t2@(TyCon _)) = prettyTy t1 ++ " " ++ prettyTy t2 prettyTy (TyApp t1 t2) = prettyTy t1 ++ " (" ++ prettyTy t2 ++ ")" prettyTy (TyArr t1 t2) = prettyTy t1 ++ " -> (" ++ prettyTy t2 ++ ")"

Parsing Types and Values Using Parsec

Now let's implement a parser. To keep things simple, I'm going to use the lexer generator from Text.Parsec.Token .

languageDef :: P.LanguageDef st languageDef = L.haskellStyle { P.opStart = fail "Operators are not supported" , P.opLetter = fail "Operators are not supported" , P.reservedNames = [] , P.reservedOpNames = [] , P.caseSensitive = True } lexer = P.makeTokenParser languageDef identifier = P.identifier lexer whiteSpace = P.whiteSpace lexer stringLiteral = P.stringLiteral lexer integer = P.integer lexer symbol = P.symbol lexer lexeme = P.lexeme lexer parens = P.parens lexer semi = P.semi lexer identu :: Parsec String u String identu = lookAhead upper >> identifier identl :: Parsec String u String identl = lookAhead lower >> identifier

There are parsers for terms:

tmName :: Parsec String u Tm tmName = lexeme $ identl >>= return . TmName tmAtom :: Parsec String u Tm tmAtom = lexeme $ identu >>= return . TmAtom tmStr :: Parsec String u Tm tmStr = lexeme $ stringLiteral >>= return . TmStr tmInt :: Parsec String u Tm tmInt = lexeme $ integer >>= return . TmInt tmBracket :: Parsec String u Tm tmBracket = lexeme $ parens tm tm :: Parsec String u Tm tm = E.buildExpressionParser [ [ E.Infix (return TmApp) E.AssocLeft ] , [ E.Infix (((.) TmApp . TmApp) <$> between (symbol "`") (symbol "`") tmAtom) E.AssocLeft ] , [ E.Infix (symbol "$" >> return TmApp) E.AssocRight ] ] (tmBracket <|> tmStr <|> tmInt <|> try tmName <|> tmAtom)

and parsers for types:

tyVar :: Parsec String u Ty tyVar = lexeme $ identl >>= return . TyVar tyCon :: Parsec String u Ty tyCon = lexeme $ identu >>= return . TyCon tyBracket :: Parsec String u Ty tyBracket = parens ty ty :: Parsec String u Ty ty = E.buildExpressionParser [ [ E.Infix (return TyApp) E.AssocLeft ] , [ E.Infix (symbol "->" >> return TyArr) E.AssocRight ] ] (tyBracket <|> try tyVar <|> tyCon)

The parsers tm and ty are then used to construct top-level parsers schema for type declarations and doc for value declarations:

valueDecl :: Parsec String u (String, Tm) valueDecl = do name <- identl symbol "=" t <- tm return (name, t) doc :: Parsec String u [(String, Tm)] doc = do whiteSpace ds <- sepBy valueDecl (lexeme semi) eof return ds typeDecl :: Parsec String u (String, Ty) typeDecl = do name <- identu symbol "::" t <- ty return (name, t) schema :: Parsec String u [(String, Ty)] schema = do whiteSpace ds <- sepBy typeDecl (lexeme semi) eof return ds

There is quite a bit going on here, but for the most part, the parser is assembled from reusable pieces in Text.Parsec.Token and some utility functions such as sepBy . The parser serves as a fairly good definition of the language syntax: the tm and ty parsers are just constructed using buildExpressionParser from the precedence tables for the term and type constructors respectively.

Implementing a Typechecker

The next step is to write the typechecker. This is actually quite simple since there are no function types other than constructors, and all quantification over free type variables implicitly happens at the top level.

The first step is to implement a unification function. This is very easy: types are either equal modulo variable substitution, or they do not unify at all:

unify :: Ty -> Ty -> Either String [(String, Ty)] unify (TyVar s) (TyVar t) = return [] unify (TyVar s) t = return [(s, t)] unify t (TyVar s) = return [(s, t)] unify (TyCon ct1) (TyCon ct2) | ct1 == ct2 = return [] unify (TyApp ty1 ty2) (TyApp ty3 ty4) = (++) <$> unify ty1 ty3 <*> unify ty2 ty4 unify (TyArr ty1 ty2) (TyArr ty3 ty4) = (++) <$> unify ty1 ty3 <*> unify ty2 ty4 unify t1 t2 = Left $ "Cannot unify " ++ prettyTy t1 ++ " with " ++ prettyTy t2

Type variable substitution is also a simple pattern match:

subst :: [(String, Ty)] -> Ty -> Ty subst vs v@(TyVar s) = maybe v id $ lookup s vs subst vs (TyApp t1 t2) = TyApp (subst vs t1) (subst vs t2) subst vs (TyArr t1 t2) = TyArr (subst vs t1) (subst vs t2) subst _ t = t

One important function which is used in the typechecker is specialize , which is used to apply a polymorphic function to an argument. The idea is that specialize f x should return the most general type of f x . The helper function isFunction is used to ensure that each type variable was unified with at most one type.

isFunction :: (Eq dom, Eq cod) => [(dom, cod)] -> Bool isFunction = all ((== 1) . length . nub . map snd) . groupBy ((==) `on` fst) specialize :: Ty -> Ty -> Either String Ty specialize (TyArr t1 t2) t3 = do vs <- unify t1 t3 guard $ isFunction vs return $ subst vs t2 specialize t _ = Left $ "Expected a function, found " ++ prettyTy t

We need one more helper function, which is used to rename free variables in a polymorphic type. This is used to make sure type variables do not overlap when a function application gets typed.

The renameAll function gives new names to all of the free type variables appearing in a type, using an association list to store the currently-generated names:

renameAll :: (Eq a) => a -> Ty -> S.State [((String, a), String)] Ty renameAll _ t@(TyCon _) = return t renameAll key (TyVar v) = do m <- S.get case lookup (v, key) m of Nothing -> do let name = unusedName (map snd m) S.modify $ (:) ((v, key), name) return $ TyVar name Just name -> return $ TyVar name renameAll key (TyApp t1 t2) = TyApp <$> renameAll key t1 <*> renameAll key t2 renameAll key (TyArr t1 t2) = TyArr <$> renameAll key t1 <*> renameAll key t2 unusedName :: [String] -> String unusedName used = head $ map (("t" ++) . show) [0..] \\ used

We can now write the typechecker. It is a simple application of the specialize and renameAll helper functions:

typeOf :: [(String, Ty)] -> Tm -> Either String Ty typeOf ctx (TmName s) = maybe (Left $ "Unknown name " ++ s) Right $ lookup s ctx typeOf ctx (TmAtom s) = maybe (Left $ "Unknown constructor " ++ s) Right $ lookup s ctx typeOf _ (TmInt _) = return (TyCon "Int") typeOf _ (TmStr _) = return (TyCon "String") typeOf ctx (TmApp t1 t2) = do tyArg <- typeOf ctx t2 tyFun <- typeOf ctx t1 flip S.evalState [] $ do tyFun' <- renameAll True tyFun tyArg' <- renameAll False tyArg return $ specialize tyFun' tyArg' typesOf :: [(String, Ty)] -> [(String, Tm)] -> Either String [(String, Ty)] typesOf _ [] = return [] typesOf ctx ((name, tm):tms) = do ty <- typeOf ctx tm let first = (name, ty) rest <- typesOf (first:ctx) tms return (first:rest)

Let's test the typechecker and parser on some of the examples at the top of the post:

parseErr p = either (Left . show) Right . parse p "" test sch val = do s <- parseErr schema sch d <- parseErr doc val tys <- typesOf s d return $ map (id *** prettyTy) tys

ghci> test exampleSchema example1 Left "Cannot unify S Z with Z" ghci> test exampleSchema example2 Right [("x","List (S (S (S Z))) (Pair Int String)")] ghci> test exampleSchema example3 Right [("x","Tree (S (S Z)) Int")]

This is good - the ill-typed expressions returned a typing error, and the well-typed expressions returned the correct types, including the correctly inferred placeholders for phantom types.

There are a few other small additions to the language that I would like to make such as infix operators, kind checking, better error messages and support for other primitive types.

A Mini Language For Relational Queries

As a more useful example, I've put together a minimal DSL for partially typed relational queries. The type system is used to make sure pairs of columns equated in joins have the same type, and to track the tables used in a query. Try replacing some of the constructors to see the ways in which the type system is used!

relationSchema = unlines [ "UserID :: Column User PK;" , "UserName :: Column User String;" , "UserCompanyID :: Column User PK;" , "CompanyID :: Column Company PK;" , "CompanyName :: Column Company String;" , "Table :: HList (Column t) -> Query (Table t);" , "Join :: Query r1 -> Query r2 -> Expr (Cross r1 r2) Bool -> Query (Cross r1 r2);" , "Column :: Column t ty -> Expr (Table t) ty;" , "Left :: Expr t1 ty -> Expr (Cross t1 t2) ty;" , "Right :: Expr t2 ty -> Expr (Cross t1 t2) ty;" , "Eq :: Expr t ty -> Expr t ty -> Expr t Bool;" , "Nil :: HList f;" , "Cons :: f t -> HList f -> HList f" ] relationExample = unlines [ "query = Join" , " (Table (Cons UserName $ Cons UserCompanyID Nil))" , " (Table (Cons CompanyID $ Cons CompanyName Nil))" , " ((Left $ Column UserCompanyID) `Eq` (Right $ Column CompanyID))" ]

ghci> test relationSchema relationExample Right [("query","Query (Cross (Table User) (Table Company))")]

Implementing a Kind Checker

Ensuring terms are well-typed is pretty useless if the type language is itself typeless, so let's implement a kind checker for type definitions in a schema.

The kind checker below is a modified version of a type checker from an earlier post. It works by generating kind constraints by performing a bottom-up traversal of a type, and then solving tose constraints by substitution.

Here is the type of kinds, including placeholders for unknown kinds:

type Unknown = Int data Kind = KindUnknown Unknown | KindStar | KindArr Kind Kind deriving (Show, Eq) prettyKind :: Kind -> String prettyKind (KindUnknown n) = "k" ++ show n prettyKind KindStar = "*" prettyKind (KindArr k1@(KindArr _ _) k2) = "(" ++ prettyKind k1 ++ ") -> " ++ prettyKind k2 prettyKind (KindArr k1 k2) = prettyKind k1 ++ " -> " ++ prettyKind k2

I'll start as usual with some basic utility functions which will be used during kind checking. Two crucial functions are the replacement of an unknown with another kind, and an occurs-check function:

replace :: Unknown -> Kind -> Kind -> Kind replace i x u@(KindUnknown j) | i == j = x | otherwise = u replace _ _ KindStar = KindStar replace i x (KindArr k1 k2) = KindArr (replace i x k1) (replace i x k2) occursCheck :: Unknown -> Kind -> Either String () occursCheck i (KindUnknown j) | i == j = return () occursCheck i x = occursCheck' i x where occursCheck' i (KindUnknown j) | i == j = fail "Occurs check failed during kind checking" occursCheck' i (KindArr k1 k2) = occursCheck' i k1 >> occursCheck' i k2 occursCheck' _ _ = return ()

Constraint generation happens by unification. The unify function takes two kinds and generates the constraints imposed by their equality, or returns a unification error.

type KindConstraint = (Unknown, Kind) type SolutionSet = Unknown -> Kind unifyKinds :: Kind -> Kind -> Either String [KindConstraint] unifyKinds (KindUnknown i) x = return [(i, x)] unifyKinds x (KindUnknown i) = return [(i, x)] unifyKinds KindStar KindStar = return [] unifyKinds (KindArr k1 k2) (KindArr k3 k4) = (++) <$> unifyKinds k1 k3 <*> unifyKinds k2 k4 unifyKinds k1 k2 = fail $ "Cannot unify " ++ show k1 ++ " with " ++ show k2

The general constraint generation function works bottom-up over the structure of a type:

generateConstraints :: (Monad m, S.MonadState ([(String, Unknown)], Unknown) m) => Ty -> m [KindConstraint] generateConstraints ty = do (cs, n) <- generateConstraints' ty return $ (n, KindStar) : cs where generateConstraints' :: (Monad m, S.MonadState ([(String, Unknown)], Unknown) m) => Ty -> m ([KindConstraint], Unknown) generateConstraints' (TyVar v) = do (m, _) <- S.get case lookup v m of Nothing -> do n <- newUnknown S.modify $ (:) (v, n) *** id return ([], n) Just unk -> return ([], unk) generateConstraints' (TyCon s) = do (m, _) <- S.get case lookup s m of Nothing -> do n <- newUnknown S.modify $ (:) (s, n) *** id return ([], n) Just unk -> return ([], unk) generateConstraints' (TyApp ty1 ty2) = do (c1, n1) <- generateConstraints' ty1 (c2, n2) <- generateConstraints' ty2 thisName <- newUnknown let newConstraint = (n1, KindArr (KindUnknown n2) (KindUnknown thisName)) return $ ((newConstraint:c1) ++ c2, thisName) newUnknown :: (Monad m, S.MonadState (a, Unknown) m) => m Unknown newUnknown = S.modify (id *** (+1)) >> S.get >>= return . snd

Constraints are solved by substitution using replace :

solve :: [KindConstraint] -> Either String SolutionSet solve cs = solve' cs KindUnknown where solve' [] ss = return ss solve' (c@(i, x):cs) ss = do occursCheck i x cs' <- substConstraints c cs let ss' = substSolutionSet c ss solve' cs' ss' substConstraints c [] = return [] substConstraints (i, x) ((j, y): cs) | i == j = (++) <$> substConstraints (i, x) cs <*> unifyKinds x y | otherwise = (:) (j, replace i x y) <$> substConstraints (i, x) cs substSolutionSet (i, x) = (.) (replace i x)

Now we can write the kind checker. The applications function splits an arrow type into parts, so that kind checking can work on each part.

applications :: Ty -> [Ty] applications t = applications' t [] where applications' (TyArr t1 t2) = applications' t1 . applications' t2 applications' t = (:) t kindCheck :: [Ty] -> Either String [(String, Kind)] kindCheck = uncurry kindCheck' . (solve *** id) . (concat *** fst) . flip S.runState ([], 0) . flip forM generateConstraints . concatMap applications where kindCheck' :: Either String SolutionSet -> [(String, Unknown)] -> Either String [(String, Kind)] kindCheck' ss unks = do ss' <- ss return $ map (id *** ss') unks

Let's try it out.