This Lambda the Ultimate post (dated 2010) describes a rather universal problem faced by compiler writers: how does one go about adding “extra information” (e.g. types) to an AST? (The post itself divides the problem into three components: adding the information to the data types, using the information to inform the construction of the node, and using the information to inform the destruction of a node—but I’m really only interested in the question of how you define your data type, not do things to it.) In this post, I want to sum up ways of solving the problem which were described in this post, and also take a look at what some real world compilers do. The running example lambda calculus looks like the following:

data Exp = Num Int | Bool Bool | Var Var | If Exp Exp Exp | Lambda Var Exp | App Exp Exp data Type = TyInt | TyBool | TyArrow Type Type

Separate IR where nodes are decorated with types The low-tech solution: if you need a new version of the IR with more information, just define a new IR type where each node can also carry the information. A trick to make these definitions more concise is to make a mutually recursive data structure. [1] type TExp = (TExp', Type) data TExp' = TNum Int | TBool Bool | TVar Var | TIf TExp TExp TExp | TLambda Var TExp | TApp TExp TExp Despite (or perhaps because of) it’s simplicity, this approach is extremely popular among many compilers, especially in the ML community. A few examples include OCaml (parsetree/typedtree), MLton (AST/CoreML) and Ikarus Scheme. Part of the reason for this is that the transition from frontend language to typed language also comes with some other changes, and when a new AST is defined those changes can be combined in too.

Nullable field The unprincipled solution: use one AST, but have an optional field in which you can slot in the information. [2] type TExp = (TExp', Maybe Type) data TExp' = TNum Int | TBool Bool | TVar Var | TIf TExp TExp TExp | TLambda Var TExp | TApp TExp TExp Presented without further comment.

Explicit typing While closely related to the separate IR solution, an explicitly typed IR takes the approach of not decorating each node with a type, but arranging that the type of any given node can be quickly computed using only local information. [3] data TExp = TNum Int | TBool Bool | TVar Var | TIf TExp TExp TExp | TLambda Var Type TExp | TApp TExp TExp Here, the difference between TExp and Exp is very slight; the TLambda is annotated with an explicit type for the binder. As far as type-checking is concerned, this makes a world of difference: we no longer need to look outside a lambda to figure out what the binder could be. Forcing your IR to be explicitly typed is often a good idea for metatheoretic reasons, as complicated type systems often don’t have decidable inference algorithms. Both GHC’s core IR, Ur/Web's core and Coq are explicitly typed in this way.

Two-level types By deferring when you tie the knot of a recursive data-structure, you can arrange for the base functor to do double-duty for the untyped and typed representations. [4] data ExpF a = Num Int | Bool Bool | Var Var | If a a a | Lambda Var a | App a a newtype Exp = Exp (ExpF Exp) newtype TExp = TExp (ExpF TExp, Type) The Coq kernel uses this to define its expression type, although it doesn’t use it to define an untyped variant.

(Lazy) Attribute grammars I don’t claim to understand this approach too well, but essentially it is a programming model distinct from usual algebraic data types which associates attributes over nodes of a tree. In some sense, it can be thought as a memoized function from AST nodes to the attributes. Many compilers do utlize maps, but only for top-level declarations. [5]