The G-machine In Detail, or How Lazy Evaluation Works

Posted on January 31, 2020

This post has several interactive components that won’t work without JavaScript. These will be clearly indicated. Regardless, I hope that you can still appreciate the prose and code.

With Haskell now more popular than ever, a great many programmers deal with lazy evaluation in their daily lives. They’re aware of the pitfalls of lazy I/O, know not to use foldl, and are masters at introducing bang patterns in the right place. But very few programmers know the magic behind lazy evaluation: graph reduction.

This post is an abridged adaptation of Simon Peyton Jones’ and David R. Lester’s book, “Implementing Functional Languages: a tutorial”, itself a refinement of SPJ’s previous work, 1987’s “The Implementation of Functional Programming Languages”. The newer book doesn’t cover as much material as its predecessor: it focuses mostly on the evaluation of functional programs, and indeed that is our focus today as well. For this, it details three abstract machines: the G-machine, the Three Instruction Machine (affectionately called Tim), and a parallel G-machine.

In this post we’ll take a look first at a stack-based machine for reducing arithmetic expressions. Armed with the knowledge of how typical stack machines work, we’ll take a look at the G-machine, and how graph reduction works (and where the name comes from in the first place!)

This post is written as a Literate Haskell source file, with CPP conditionals to enable/disable each section. To compile a specific section, use GHC like this:

ghc -XCPP -DSection1 2020-01-09.lhs

module StackArith where

Section 1: Evaluating Arithmetic with a Stack

Stack machines are the basis for all of the computation models we’re going to explore today. To get a better feel for how they work, the first model of computation we’re going to describe is stack-based arithmetic, better known as reverse Polish notation. This machine also forms the basis of the programming language FORTH. First, let us define a data type for arithmetic expressions, including the four basic operators (addition, multiplication, subtraction and division).

data AExpr
  = Lit Int
  | Add AExpr AExpr
  | Sub AExpr AExpr
  | Mul AExpr AExpr
  | Div AExpr AExpr
  deriving (Eq, Show, Ord)

This language has an ‘obvious’ denotation, which can be realised using an interpreter function, such as aInterpret below.

aInterpret :: AExpr -> Int
aInterpret (Lit n)     = n
aInterpret (Add e1 e2) = aInterpret e1 + aInterpret e2
aInterpret (Sub e1 e2) = aInterpret e1 - aInterpret e2
aInterpret (Mul e1 e2) = aInterpret e1 * aInterpret e2
aInterpret (Div e1 e2) = aInterpret e1 `div` aInterpret e2

Alternatively, we can implement the language through its operational behaviour, by compiling it to a series of instructions that, when executed in an appropriate machine, leave it in a final state from which we can extract the expression’s result.

Our abstract machine for arithmetic will be a stack-based machine with only a handful of instructions. The type of instructions is AInstr.

data AInstr
  = Push Int
  | IAdd
  | IMul
  | ISub
  | IDiv
  deriving (Eq, Show, Ord)

The state of the machine is simply a pair, containing an instruction stream and a stack of values. By our compilation scheme, the machine is never in a state where more values are required on the stack than there are values present; This would not be the case if we let programmers directly write instruction streams.

We can compile a program into a sequence of instructions recursively.

aCompile :: AExpr -> [AInstr]
aCompile (Lit i)     = [Push i]
aCompile (Add e1 e2) = aCompile e1 ++ aCompile e2 ++ [IAdd]
aCompile (Mul e1 e2) = aCompile e1 ++ aCompile e2 ++ [IMul]
aCompile (Sub e1 e2) = aCompile e1 ++ aCompile e2 ++ [ISub]
aCompile (Div e1 e2) = aCompile e1 ++ aCompile e2 ++ [IDiv]

And we can write a function to represent the state transition rules of the machine.

aEval :: ([AInstr], [Int]) -> ([AInstr], [Int])
aEval (Push i : xs, st)       = (xs, i : st)
aEval (IAdd : xs, x : y : st) = (xs, (x + y) : st)
aEval (IMul : xs, x : y : st) = (xs, (x * y) : st)
aEval (ISub : xs, x : y : st) = (xs, (y - x) : st)
aEval (IDiv : xs, x : y : st) = (xs, (y `div` x) : st)

A state is said to be final when it has an empty instruction stream and a single result on the stack. To run a program, we simply repeat aEval until a final state is reached.

aRun :: [AInstr] -> Int
aRun is = go (is, []) where
  go st | Just i <- final st = i
  go st = go (aEval st)

  final ([], [n]) = Just n
  final _         = Nothing

A very important property linking our compiler, abstract machine and interpreter together is that of compiler correctness. That is:

forall x. aRun (aCompile x) == aInterpret x

As an example, the arithmetic expression 2 + 3 × 4 produces the following code sequence:

[ Push 2 , Push 3 , Push 4 , IMul , IAdd ]
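As a sketch of how one might spot-check the correctness property, here is a self-contained restatement of the Section 1 machine together with a small check. The sample expressions (and the helper names samples and correct) are my own, not from the post:

```haskell
-- Self-contained restatement of the Section 1 machine, used to check
-- that aRun . aCompile agrees with aInterpret on a few sample expressions.
data AExpr
  = Lit Int | Add AExpr AExpr | Sub AExpr AExpr
  | Mul AExpr AExpr | Div AExpr AExpr

data AInstr = Push Int | IAdd | IMul | ISub | IDiv

aInterpret :: AExpr -> Int
aInterpret (Lit n)     = n
aInterpret (Add e1 e2) = aInterpret e1 + aInterpret e2
aInterpret (Sub e1 e2) = aInterpret e1 - aInterpret e2
aInterpret (Mul e1 e2) = aInterpret e1 * aInterpret e2
aInterpret (Div e1 e2) = aInterpret e1 `div` aInterpret e2

aCompile :: AExpr -> [AInstr]
aCompile (Lit i)     = [Push i]
aCompile (Add e1 e2) = aCompile e1 ++ aCompile e2 ++ [IAdd]
aCompile (Sub e1 e2) = aCompile e1 ++ aCompile e2 ++ [ISub]
aCompile (Mul e1 e2) = aCompile e1 ++ aCompile e2 ++ [IMul]
aCompile (Div e1 e2) = aCompile e1 ++ aCompile e2 ++ [IDiv]

-- Note the operand order: the top of the stack holds the value pushed
-- last, i.e. the *second* operand, so ISub computes y - x.
aEval :: ([AInstr], [Int]) -> ([AInstr], [Int])
aEval (Push i : xs, st)       = (xs, i : st)
aEval (IAdd : xs, x : y : st) = (xs, (y + x) : st)
aEval (IMul : xs, x : y : st) = (xs, (y * x) : st)
aEval (ISub : xs, x : y : st) = (xs, (y - x) : st)
aEval (IDiv : xs, x : y : st) = (xs, (y `div` x) : st)

aRun :: [AInstr] -> Int
aRun is = go (is, []) where
  go ([], [n]) = n
  go st        = go (aEval st)

-- 2 + 3 × 4, plus two non-commutative cases that exercise operand order.
samples :: [AExpr]
samples =
  [ Add (Lit 2) (Mul (Lit 3) (Lit 4))
  , Sub (Lit 10) (Lit 4)
  , Div (Lit 9) (Lit 2)
  ]

correct :: Bool
correct = all (\e -> aRun (aCompile e) == aInterpret e) samples
```

A proper treatment would state this as a QuickCheck property over arbitrary expressions; a handful of hand-picked samples only makes the claim plausible, not proven.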

You can interactively follow the execution of this program with the tool below. Pressing the Step button is equivalent to aEval . The stack is drawn in boxes to the left, and the instruction sequence is presented on the right, where the > marks the currently executing instruction (the “program counter”, if you will).

You seem to have opted out of the interactive visualisations :(


Section 1.75: A Functional Program

In the previous section, we looked at how stack machines can be used to implement arithmetic. This is nothing exciting, though: FORTH is from the late 1960s! In this section, we’re going to look at a much more modern idea, only 30-something years old, which uses stack machines to implement functional languages via lazy graph reduction.

But first, we need to understand what that technobabble means in the first place. We define a functional language to be one in which the evaluation of a program expression is the same as evaluating a mathematical function: When you’re executing a “function application”, substitute the actual value of the argument wherever the parameter appears in the body of the function, then reduce any reducible expressions.

(λx. x + 2) 5

Evaluation of a functional program starts by identifying a reducible expression, that is, an expression that isn’t “done” evaluating yet. By convention, we call reducible expressions redexes for short, and expressions that are done evaluating are called head-normal forms. Every application is a reducible expression. Here, reduction proceeds by substituting 5 in the place of every mention of x. Substituting an expression E₂ in place of the variable v, in a bigger expression E₁, is notated E₁[E₂/v] (read “E₁ with E₂ for v”).

(x + 2)[5/x]

This step of the evaluation isn’t exactly an expression, but it serves to illustrate what reducing a λ expression does: replacing the bound variable (or the “formal parameter”, in fancy-pants speak; I’ll stick to bound variable).

(5 + 2)

By this step, the function has disappeared entirely. The expression has been replaced entirely with addition between numbers. Of course, addition, when both sides have been evaluated to a number, is itself a redex. This program isn’t done yet.

7

Replacing the addition by its value, our original program has reached its end: the number 7, and indeed any other number, is a head-normal form.
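To make the substitution notation concrete, here is a small sketch of my own (not code from the post): a minimal expression type where subst e2 v e1 computes E₁[E₂/v]. Variable capture is ignored, which is fine here because the example is closed:

```haskell
-- Minimal expression type for illustrating beta reduction by substitution.
data Expr
  = Var String | N Int
  | Add Expr Expr
  | Lam String Expr
  | Ap Expr Expr
  deriving (Eq, Show)

-- subst e2 v e1 computes E1[E2/v]: e2 replaces every mention of v in e1.
subst :: Expr -> String -> Expr -> Expr
subst e2 v e1 = case e1 of
  Var x | x == v    -> e2
        | otherwise -> Var x
  N i               -> N i
  Add a b           -> Add (subst e2 v a) (subst e2 v b)
  Lam x b | x == v    -> Lam x b  -- v is shadowed; leave the body alone
          | otherwise -> Lam x (subst e2 v b)
  Ap f a            -> Ap (subst e2 v f) (subst e2 v a)

-- One reduction of a redex: (λx. x + 2) 5 becomes (x + 2)[5/x].
reduceOnce :: Expr -> Expr
reduceOnce (Ap (Lam x b) a) = subst a x b
reduceOnce e                = e

example :: Expr
example = Ap (Lam "x" (Add (Var "x") (N 2))) (N 5)
```

Running reduceOnce example yields Add (N 5) (N 2), matching the (5 + 2) step of the walkthrough.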

This all sounds good when described on paper, but how does one actually wire up (or, well, program) a computer to reduce functional programs?

Among the first and most comprehensive answers to this question was the G-machine, whose G stands for “Graph”. More specifically, the G-machine is an implementation of graph reduction: The expression to be reduced is represented as a graph that might have some redexes.

Once the machine has identified some particular redex to reduce, it’ll evaluate exactly as much as is needed to reach a head-normal form, and replace (or update) the graph so that the old redex points to its normal form.

To explore the workings of the G-machine, we’ll need to choose a functional language. Any will do, but simpler is better. Since I’ve already written a lazy ML, Rio, that compiles as described in this post, we’ll go with that.

Rio’s core language is a very simple functional language, notable only in that it doesn’t have λ-abstractions. All functions are defined at top level, in the form of supercombinators.

A supercombinator is a function that only refers to its arguments or other supercombinators.

There’s a data type for terms:

data Term
  = Let [(Var, Term)] Term
  | Letrec [(Var, Term)] Term
  | App Term Term
  | Ref Var
  | Num Integer
  deriving Show

And one for supercombinators:

data SC
  = SC { name :: Var, args :: [Var], body :: Term }
  deriving Show

Consider the reduction of this functional program:

double x = x + x
main = double (double 4)

Here, double and main are the supercombinators that constitute the program. By convention, execution starts with the supercombinator main .

The initial graph is the trivial graph containing only the node main and no edges. Since the node points directly to a supercombinator, we can replace it by a copy of its body:

Now starts the actual work. There are many strategies for selecting a redex, and all of them are equally good, with the caveat that some may not terminate. However, if any evaluation strategy terminates, then so does “always choose the outermost redex”. This is called normal order evaluation. It’s what the G-machine implements.

The outermost redex here is the outer application of double , so that’s where reduction will happen. To reduce an application, update the redex with a copy of the supercombinator body, and replace the bound variables with pointers to the arguments.

Observe that, since the subexpression double 4 has two edges leading into it, the tree representing the program has degenerated into a general graph. However, this isn’t a bad thing: it means that the work to evaluate double 4 will only be needed once.

The application of + isn’t reducible yet because it requires its arguments to be evaluated, so the next reducible expression down the chain is the application node representing double 4. The expansion there is similarly simple.

Here, it’s a bit hard to see what’s actually going on, so I’ll highlight in blue the whole next redex, 4 + 4 .

The state of the graph after reduction of double 4 . … with the entirety of the next redex highlighted for clarity.

But, wait. That redex has two application nodes, but the expression it represents is just 4 + 4 (with the 4s shared, so more like let x = 4 in x + x, but still). What gives?

Most formal treatments of functional languages, this included (to the extent that you can call Rio and a blog post “formal”), use currying to represent functions of multiple arguments. That is, instead of having built-in support for things like

let x = function (x, y) { /* takes two arguments (arity = 2) */ }

We encode a function of many arguments using nested lambda expressions, as in λx. λy. x + y. That’s why the application 4 + 4, or, better stated, (+) 4 4, has two application nodes.
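Using the Term type from earlier, the curried application (+) 4 4 is exactly two nested App nodes. This sketch is my own illustration: it restates the type so it stands alone, and it assumes Var is a synonym for String, which the post doesn’t actually show:

```haskell
-- Assumed for this sketch; the post doesn't show Var's definition.
type Var = String

data Term
  = Let [(Var, Term)] Term
  | Letrec [(Var, Term)] Term
  | App Term Term
  | Ref Var
  | Num Integer
  deriving Show

-- (+) 4 4 as a tree: the inner App applies (+) to the first 4, and the
-- outer App applies that partial application to the second 4.
fourPlusFour :: Term
fourPlusFour = App (App (Ref "+") (Num 4)) (Num 4)

-- Count the application nodes in a term.
appNodes :: Term -> Int
appNodes (App f x) = 1 + appNodes f + appNodes x
appNodes _         = 0
```

appNodes fourPlusFour is 2, the two @ nodes in the highlighted redex. Note that a Term tree can’t express the sharing of the two 4s; that’s precisely what the graph representation adds.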

With that in mind, the entire blue subgraph can be zapped away to become the number 8.

And finally, the last redex, 8 + 8 , can be zapped entirely into the number 16 .

module Gm where

import qualified Data.Map.Strict as Map
import Data.Map.Strict (Map, (!))

import qualified Data.Set as Set
import Data.Set (Set)

import Data.Maybe (mapMaybe)

Section 2: The G-machine

After seeing in detail the reduction of a simple expression, one might start to form in their head an idea of an algorithm to reduce a functional program. As SPJ put it:

1. Find the next redex.
2. Reduce it.
3. Update the root of the redex with its reduct.

With these three easy steps, functional programs can be evaluated!

Of course, that glosses over three major difficulties:

1. How does one find the next redex?
2. How does one reduce it?
3. How does one update the graph?

Of these, only the answer to 3 is simple: “Overwrite it with an indirection”. (We’ll get there). To do the latter efficiently, we’re going to use an abstract machine: The G-machine.

What’s an abstract machine? An abstract machine isn’t, as the similar-sounding name might imply, a virtual machine. Indeed, these concepts are so easily confused that the most popular abstract machine in existence has “virtual machine” in its name. I’m talking about LLVM, of course. Abstract machines are simply formalisms used to aid in the implementation of compilers. Of course, one might write an execution engine for such a machine (a “simulator”, one could say), and even use that as an actual execution model for your language (like OCaml uses the ZINC machine). In this, they are more closely related to intermediate languages than virtual machines.

Let’s tackle these problems in turn.

How does one find the next redex?

Consider the following expression graph. It has an interesting feature in that it (almost certainly) constitutes a redex. How do we know that?

Well, I’ve used the least subtle blue possible to highlight the spine of the expression graph. By starting at the root (the topmost node), and following every left pointer until reaching a supercombinator, one can find the spine of the graph.

Moreover, if we use a stack to remember the addresses that we visited on our way down, we’ll have unwound the spine.
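As a tree-level sketch of that process (my own illustration, using a stripped-down term type rather than heap addresses): starting at the root and repeatedly following the left edge of each application node, while remembering every node visited, mimics the stack built by unwinding:

```haskell
-- Stripped-down terms: applications, references, numbers.
data Term = App Term Term | Ref String | Num Int
  deriving (Eq, Show)

-- Walk down the spine, collecting every node visited, root first. The
-- result plays the role of the unwound stack (with nodes standing in for
-- addresses).
unwindSpine :: Term -> [Term]
unwindSpine t@(App f _) = t : unwindSpine f
unwindSpine t           = [t]

-- f applied to two arguments: the spine has three entries, with the
-- supercombinator reference at the tip.
example :: Term
example = App (App (Ref "f") (Num 1)) (Num 2)
```

Here unwindSpine example visits the root application, the inner application, and finally Ref "f": three pointers on the stack, with a supercombinator at the end, which is just the shape the redex test below looks for.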

A note on stack addressing Following x86 convention, our stack grows downwards, so that the first element in the diagram above would be the one pointing to f .

The third address in the stack is the root of the redex, and the first address points to a supercombinator. If the number of pointers on the stack is greater than or equal to the number of arguments the supercombinator expects (plus one, to account for the supercombinator node itself), we’ve spotted a redex.

How does one reduce it?

This depends on the nature of the redex, of course; Reducing a supercombinator is not the same as reducing an arithmetic function, for example.

Supercombinator redexes are easy enough. If the stack has enough arguments, then we can just replace the root of the redex (in our addressing model, this coincides with the stack pointer used to fetch the last argument) with a copy of the body of the supercombinator, with its arguments substituted in the correct places.

Constant applicative forms, or CAFs, are supercombinators with no arguments. Their reduction is much the same as with a normal supercombinator, except that when the time comes to update the graph, we need to update the supercombinator itself with an indirection.

Primitive redexes, on the other hand, will require a bit more machinery. For instance, what should we do in the situation above, where the argument to + was itself a redex?

There needs to be a way to evaluate the argument double 4 to head normal form then continue reducing the application of + . Every programming language has to deal with this, and our solution is more of the same: use a stack.

The G-machine already has a stack, though, so we need another one. A stack of stacks, and of return addresses, called the dump. When a primitive operation needs the value of one of its arguments, it first saves that argument from the stack, then pushes the stack pointer and program counter onto the dump (this is the G-machine’s concept of return address); The saved argument is pushed onto an empty stack, and the graph is unwound starting from that argument.

When unwinding encounters a node in head-normal form, and there’s a saved return address on the dump, we pop that, restore the stack pointers, and jump to the saved program counter.

The idea behind the G-machine is that we can teach each supercombinator to make an instance of its own body by compiling it to a series of small, atomic instructions. This solves the hardest problem in implementing functional languages, which is the whole “replacing the root of the redex with a copy of the supercombinator body” I glossed over.

An Example

Let’s consider the (fragment of a) functional program below.

f g x = K (g x)

Compiling it into G-machine instructions results in the following instructions:

Push (Arg 1)
Push (Arg 3)
Mkap
Push (Global K)
Mkap
Slide 3
Unwind

These diagrams show how the code for f would execute.

Fig. 1: Diagram of the stack and the heap (“graph”) after entering the f supercombinator.
Fig. 2: State of the machine after executing Push (Arg 1).
Fig. 3: State of the machine after executing Push (Arg 3).
Fig. 4: State of the machine after executing Mkap.
Fig. 5: State of the machine after executing Push (Global K).
Fig. 6: State of the machine after executing Mkap.
Fig. 7: State of the machine after executing Slide 3.

When jumping to the code of f , the stack would look as it does in figure 1. The expression graph has been unwound, and the stack has pointers to the application nodes that we’ll use to fetch the actual arguments.

The first thing we do is take pointers to the arguments g and x from their application nodes and put them on the stack. This is shown in figures 2 and 3.

Keep in mind that Arg 0 would refer to the bottom-most stack location, so (on entry to the function) Arg 1 refers to the first argument. However, when we push onto the stack, the offsets to reach the argument shift by one, and so what would be Arg 2 has to become Arg 3 .

The instruction Mkap takes the two newest pointers and makes an application node, denoted @ , from them. The newest value on the stack is taken as the function (the node’s left edge) and the value above that is the argument (the node’s right edge).

By figure 4, we’re not done yet. Push (Global K) has the sole effect of pushing a pointer to the supercombinator K onto the stack, as shown in figure 5; After yet another Mkap , we’ve finished building the body of f .

The G-machine presented above, unlike the one implemented in Rio, is not lazy; The abrupt transition between figures 6 and 7 shows that, instead of updating the graph, we just discard the old stuff that was there with a Slide 3 instruction.

“Slide” is a weird little instruction that doesn’t correspond to any common stack operation. Its effect (save the newest value on the stack, discard the n values following it, then push the saved value back) is best described by the Haskell function below:

slide n (x : xs) = x : drop n xs

Implementing the G-machine

First and foremost we’ll need a type for our machine’s instructions. GmVal represents anything that can be pushed onto the stack, and only exists to avoid having four different Push instructions.

data GmVal
  = Global String
  | Value Int
  | Arg Int
  | Local Int
  deriving (Eq, Show, Ord)

The addressing mode Global is only used for statically-allocated supercombinator nodes; Value is used for integer constants, and allocates an integer node on the heap. Arg and Local push a pointer from the stack back onto the stack, the difference being that Arg expects the indexed value to point to an application node, and pushes the right pointer of that node.

data GmInst
  = Push GmVal
  | Slide Int
  | Cond [GmInst] [GmInst]
  | Mkap
  | Eval
  | Add | Sub | Mul | Div | Equ
  | Unwind
  deriving (Eq, Show, Ord)

Here’s a quick summary of what the instructions do, in order:

Push adds something to the stack in one of the ways described above.

Slide n does the “save top item, pop n items, push top item” transformation described above.

Cond code_then code_else expects the top of the stack to be a pointer to an integer node. If the value pointed to is 0, it’ll load code_then into the program counter; otherwise, it’ll load code_else.

Mkap makes an application node out of the two topmost values on the stack, and pushes that node’s address back onto the stack.

Eval is one of the most complicated instructions. First, it must save the topmost element of the stack. In a compiled implementation, this would be in a scratch register, but in this simulator it’s saved as a local Haskell variable. It then saves the stack pointer and program counter onto the dump, allocates a fresh stack with only the saved value, and loads [Unwind] as the program.

Add, Sub, Mul, Div, and Equ are all self-explanatory. They all expect the two topmost values on the stack to be numbers in WHNF.

Unwind is the most complicated instruction in the machine. In a compiled implementation, like Rio, the sensible thing to do for Unwind would be to emit a jump to a precompiled procedure. The behaviour of unwinding depends on what’s currently at the top of the stack. Unwinding an application node pushes the left pointer (the function pointer) of the application node onto the stack and continues unwinding.

Unwinding a supercombinator node must check that the stack has enough pointers to satisfy the combinator’s arity. Namely, for a combinator of arity N, the stack must have at least N + 1 pointers.

Unwinding a number with a non-empty dump must pop the stack pointer and program counter from the top of the dump and continue executing, with the number pushed on top of the restored stack.

Unwinding a number with an empty dump means the machine is done.

For our simulator, we need to define what the state of the machine comprises, and implement state transitions corresponding to each of the instructions above.

type Addr = Int

data GmNode
  = App Addr Addr
  | SCo String Int [GmInst]
  | Num Int
  deriving (Eq, Show, Ord)

type GmHeap    = Map Addr GmNode
type GmGlobals = Map String Addr
type GmCode    = [GmInst]
type GmDump    = [(GmStack, GmCode)]
type GmStack   = [Addr]

The state of the machine is the pairing (quintupling?) of heap, globals, code, dump and stack.

Support functions for the heap and the state type

data GmState
  = GmState { heap    :: GmHeap
            , globals :: GmGlobals
            , stack   :: GmStack
            , code    :: GmCode
            , dump    :: GmDump
            }
  deriving (Eq, Show, Ord)

alloc :: GmNode -> GmHeap -> (Addr, GmHeap)
alloc node heap =
  let (last, _) = Map.findMax heap
   in (last + 1, Map.insert (last + 1) node heap)

num :: GmNode -> Int
num (Num i) = i
num x = error $ "Not a number: " ++ show x

binop :: (Int -> Int -> Int) -> GmState -> GmState
binop fun st@GmState{..} =
  let a : b : xs = stack
      a' = num (heap Map.! a)
      b' = num (heap Map.! b)
      (addr, heap') = alloc (Num (b' `fun` a')) heap
   in st { heap = heap', stack = addr : xs }

reify :: GmState -> GmNode
reify GmState{ stack = addr : _, heap } = heap Map.! addr

graphToDOT :: GmState -> String
graphToDOT GmState{..} = unlines $
  "digraph {" : concatMap go (Map.toList heap)
    ++ [ "stack[color=red]; stack -> " ++ nde (head stack) ++ "; }" ]
  where
    go (n, node) =
      case node of
        Num i -> [ nde n ++ "[label=" ++ show i ++ "]; " ]
        SCo name _ code ->
          (nde n ++ "[label=" ++ name ++ "]; ") : mapMaybe (codeEdge n) code
        App n' n'' ->
          [ nde n ++ "[label=\"@\"]"
          , nde n ++ " -> " ++ nde n'
          , nde n ++ " -> " ++ nde n''
          ]

    nde i = 'N' : show i

    codeEdge i (Push (Global g')) = Just (nde i ++ " -> " ++ nde (globals Map.! g'))
    codeEdge i _ = Nothing

Armed with a definition for the machine state, we can implement the main function run , which takes a state to a list of successor states. If the program represented by some state initial terminates, then last (run initial) is the terminal state, containing the single number which is the result of the program.

run :: GmState -> [GmState]
run state = state : rest where
  rest
    | final state = []
    | otherwise   = run nextState

  nextState = step state

What does it mean for a state to be final, or terminal? Well, if the machine has no more code to execute, or it’s reached WHNF for a value and has nowhere to return, execution can not proceed. These are the final states of our G-machine.

final :: GmState -> Bool
final GmState{..} = null code || (null dump && whnf) where
  whnf = case stack of
    [addr] -> isNum (heap Map.! addr)
    _      -> False

  isNum (Num _) = True
  isNum _       = False

Now we can define the stepper function that takes one step to its successor:

step :: GmState -> GmState
step state@GmState{ code = [] }     = error "step final state"
step state@GmState{ code = i : is } =
  instruction i state{ code = is }

instruction :: GmInst -> GmState -> GmState

The many cases of the instruction function represent the various transition rules for each instruction we detailed above.

instruction (Push val) st@GmState{..} =
  case val of
    Global str -> st { stack = globals Map.! str : stack }
    Local i    -> st { stack = (stack !! i) : stack }
    Arg i      -> st { stack = getArg (heap Map.! (stack !! (i + 1))) : stack }
    Value i    ->
      let (addr, heap') = alloc (Num i) heap
       in st { stack = addr : stack, heap = heap' }
  where getArg (App _ x) = x

Remember that in the Push (Arg _) case, the offset points us to an application node unwound from the spine, so we have to look through it to find the actual argument.

instruction Mkap st@GmState{..} =
  let (addr, heap') = alloc (App f x) heap
      x : f : xs    = stack
   in st { heap = heap', stack = addr : xs }

instruction (Slide n) st@GmState{..} =
  let a : as = stack
   in st { stack = a : drop n as }

Mkap and Slide are very straightforward indeed.

instruction (Cond t e) st@GmState{..} =
  let a : as = stack
      Num i  = heap Map.! a
   in if i == 0
        then st { code = t ++ code, stack = as }
        else st { code = e ++ code, stack = as }

For the Cond instruction, we mimic the effect of control flow “joining up” after an if statement by concatenating the given code, instead of replacing it. Since Unwind acts almost like a return statement, one can skip this by adding an Unwind in either branch.

instruction Add st = binop (+) st
instruction Sub st = binop (-) st
instruction Mul st = binop (*) st
instruction Div st = binop div st

instruction Equ st@GmState{..} =
  let a : b : xs    = stack
      Num a'        = heap Map.! a
      Num b'        = heap Map.! b
      (addr, heap') = alloc (Num equal) heap
      equal         = if a' == b' then 0 else 1
   in st { heap = heap', stack = addr : xs }

I included Equ here as a representative example for all the binary operations; The rest are defined in terms of a binop combinator I hid in a <details> tag way back when the state type was defined.

The Eval instruction needs to save the stack and the code onto the dump and begin unwinding the top of the stack.

instruction Eval st@GmState{..} =
  let a : as = stack
   in st { dump = (as, code) : dump, code = [Unwind], stack = [a] }

Unwind is, by far, the most complicated instruction. We start by dispatching on the head of the stack.

instruction Unwind st@GmState{..} =
  case heap Map.! head stack of

If there’s a number, we also have to inspect the dump. If we have somewhere to return to, we continue there. Otherwise, we’re done.

    Num _ -> case dump of
      (stack', code') : dump' ->
        st { stack = head stack : stack', code = code', dump = dump' }
      [] -> st { code = [] }

Application nodes are more interesting. We put the function part of the app node onto the stack and keep unwinding.

    App fun _ -> st { stack = fun : stack, code = [Unwind] }

Supercombinator nodes do the arity test and load their code onto the state if there are enough arguments.

    SCo _ arity code
      | length stack + 1 >= arity -> st { code = code }
    SCo name _ _ ->
      error $ "Not enough arguments for supercombinator " ++ name

Here’s the code for a factorial program, if you’d like to see it. You can print the (very non-exciting) result using the functions reify and run like this:

main = print . reify . last . run $ factorial10

G-machine code for 10!, and factorial10_dumb

Note: The code below is much better than what I can realistically implement a compiler for in the space of a blog post. It was hand-tuned to do the least amount of evaluation necessary. It could, however, be improved by being made tail-recursive.

Exercise: Make the implementation below tail-recursive. That is, compile the following program:

fac 0 acc = acc
fac n acc = fac (n - 1) (acc * n)
main = fac 10 1

factorial10 :: GmState
factorial10 = GmState
  { code    = [ Push (Global "main"), Unwind ]
  , globals = globals
  , stack   = []
  , heap    = heap
  , dump    = []
  }
  where
    heap = Map.fromList . zip [0..] $
      [ SCo "fac" 1
          [ Push (Arg 0), Eval, Push (Local 0), Push (Value 0), Equ
          , Cond [ Push (Value 1), Slide 3, Unwind ] []
          , Push (Global "fac")
          , Push (Local 1), Push (Value 1), Sub
          , Mkap, Eval
          , Push (Local 1), Mul
          , Slide 2, Unwind
          ]
      , SCo "main" 0
          [ Push (Global "fac"), Push (Value 10), Mkap, Slide 1, Unwind ]
      ]
    globals = Map.fromList [ ("fac", 0), ("main", 1) ]

What you could expect from Rio is more along the lines of this crime against humanity:

factorial10_dumb :: GmState
factorial10_dumb = GmState
  { code    = [ Unwind ]
  , globals = globals
  , stack   = [5]
  , heap    = heap
  , dump    = []
  }
  where
    heap = Map.fromList . zip [0..] $
      [ SCo "if" 3
          [ Push (Arg 0), Eval, Cond [ Push (Arg 1) ] [ Push (Arg 2) ], Slide 4, Unwind ]
      , SCo "mul" 2
          [ Push (Arg 0), Eval, Push (Arg 2), Eval, Mul, Slide 3, Unwind ]
      , SCo "sub" 2
          [ Push (Arg 0), Eval, Push (Arg 2), Eval, Sub, Slide 3, Unwind ]
      , SCo "equ" 2
          [ Push (Arg 0), Eval, Push (Arg 2), Eval, Equ, Slide 3, Unwind ]
      , SCo "fac" 1
          [ Push (Global "if"), Push (Global "equ"), Push (Arg 2), Mkap, Push (Value 0), Mkap
          , Mkap, Push (Value 1), Mkap, Push (Global "mul"), Push (Arg 2), Mkap, Push (Global "fac")
          , Push (Global "sub"), Push (Arg 4), Mkap, Push (Value 1), Mkap, Mkap, Mkap
          , Mkap, Slide 2, Unwind
          ]
      , SCo "main" 0
          [ Push (Global "fac"), Push (Value 10), Mkap, Slide 1, Unwind ]
      ]
    globals = Map.fromList [ ("if", 0), ("mul", 1), ("sub", 2), ("equ", 3), ("fac", 4) ]

The G-machine, with no garbage collector, has a tendency to produce ridiculously large graphs consisting mostly of garbage. For instance, the graph at the end of reducing factorial10_dumb has 271 nodes, only one of which isn’t garbage. Ouch!

Those two red nodes? That’s the result of the program, and the top of the stack pointing to it. Yup.

Thankfully, the G-machine makes it easy to write a garbage collector. Well, in theory, at least. The roots can be found on the stack, and all the stacks saved on the dump. Each live supercombinator can also keep other supercombinators alive by referencing them in Push (Global _) instructions.

Since traversing each supercombinator every GC cycle to identify global references is expensive, they can each be augmented with a “static reference table”, or SRT for short. In our simulator, this would be a Set of Addrs that each supercombinator keeps alive.

```haskell
liveAddrs :: GmState -> Set Addr
liveAddrs GmState{..} = roots <> foldMap explore roots where
  roots = Set.fromList stack <> foldMap (Set.fromList . fst) dump

  explore i = Set.insert i $
    case heap Map.! i of
      App x y      -> explore x <> explore y
      SCo _ _ code -> foldMap globalRefs code
      _            -> mempty

  globalRefs (Push (Global i)) = Set.singleton (globals Map.! i)
  globalRefs _                 = mempty
```
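The SRT itself could be precomputed once, when a supercombinator is loaded, instead of walking its instruction list on every collection. Here is a minimal sketch of that precomputation; the `Operand`, `Inst`, and `srt` names are simplified stand-ins of my own, not the simulator’s actual types:

```haskell
import Data.Set (Set)
import qualified Data.Set as Set

-- Simplified stand-ins for the simulator's instruction types. Only
-- Push (Global _) matters for the SRT; everything else is ignored.
data Operand = Global String | Value Int | Arg Int
data Inst    = Push Operand | Mkap | Slide Int | Unwind

-- Collect every global name a supercombinator's code can reference.
-- Resolving these names to heap addresses at load time yields the SRT.
srt :: [Inst] -> Set String
srt = foldr step Set.empty where
  step (Push (Global g)) acc = Set.insert g acc
  step _                 acc = acc
```

With the SRT in hand, a collector only needs a set lookup per live supercombinator rather than a traversal of its code.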

With the set of live addresses in hand, we can write code to get rid of all the others, and re-number them all. This is a toy moving garbage collector, since we allocate an entirely new heap to get rid of the old one.

```haskell
scavenge :: GmState -> GmState
scavenge st@GmState{..} =
  st { heap = Map.filterWithKey (\k _ -> is_live k) heap }
  where
    live      = liveAddrs st
    is_live x = x `Set.member` live
```

Running scavenge on the final state of factorial10_dumb gets us a much better looking graph:

Possible Extensions

Data structures. This is covered in the book, but I didn’t have space/time to cover it here. The core idea is that the graph gets a new kind of node, Constr Int [Addr], that stores a tag and some fixed amount of addresses. Pattern-matching case expressions can then take apart these Constr nodes and branch based on the integer tag.

Support I/O. By threading an explicit state variable, a guaranteed order of effects can be achieved even in lazy code. Let me tell you a secret: This is what GHC does.

```haskell
newtype IO a = IO { runIO# :: State# RealWorld -> (# a, State# RealWorld #) }
```

The State# RealWorld value is consumed by each foreign function, i.e. everything that actually does I/O, looking a lot like a state monad; in reality, the RealWorld is made of lies. State# has return kind TYPE (TupleRep '[]), i.e., it takes up no bits at runtime. However, by having every foreign function be strict in some variable, no matter how fake it is, we can guarantee the order of effects: each function depends directly on the function “before” it.

Parallelism. Lazy graph reduction lends itself nicely to parallelism. One could envision a machine where a number of worker threads are each working on a different redex. To prevent weird parallelism issues from cropping up, graph nodes would need to be lockable. However, only @ nodes will ever be locked, so that might lead to an optimisation. As an alternative to a regular lock, the implementation could replace each node under evaluation by a black hole, which doesn’t keep any other values alive (thus possibly getting rid of some space leaks). Each black hole would maintain a queue of threads that tried to evaluate it, to be woken up once the result is available.
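For the data-structures extension, the new node kind and the tag-based branching might look something like the sketch below. The constructor names and the `selectAlt` helper are my own naming for illustration, not the book’s:

```haskell
type Addr = Int

-- The graph gets one new kind of node: a constructor carrying an
-- integer tag and a fixed number of field addresses.
data Node
  = NNum Int
  | NApp Addr Addr
  | NConstr Int [Addr]

-- A case expression becomes a jump table keyed on the tag: given the
-- scrutinee (already reduced to a Constr node), pick the code for the
-- matching alternative.
selectAlt :: Node -> [(Int, alt)] -> alt
selectAlt (NConstr tag _) alts =
  case lookup tag alts of
    Just alt -> alt
    Nothing  -> error "selectAlt: no matching alternative"
selectAlt _ _ = error "selectAlt: scrutinee is not a constructor"
```

After selecting an alternative, the matched node’s fields would be pushed onto the stack so the alternative’s code can reach them like ordinary arguments.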

Conclusion

This post was long. And it still didn’t cover a lot of stuff about the G-machine, such as how to compile to the G-machine (expect a follow-up post on that) and how to compile from the G-machine (expect a follow-up post on that too!)

Assembling G-machine instructions is actually simpler than it seems. With the exception of Eval and Unwind , which are common and large enough to warrant pre-assembled helpers, all G-machine instructions assemble to no more than a handful of x86 instructions. As an entirely contextless example, here’s how Cond instructions are assembled in Rio:

```haskell
compileGInst (Cond c_then c_else) = do
  pop rbx
  cmp (int64 0) (intv_off `quadOff` rbx)
  rec
    jne else_label
    traverse_ compileGInst c_then
    jmp exit_label
    else_label <- genLabel
    traverse_ compileGInst c_else
    exit_label <- genLabel
  pure ()
```

This is one of the most complicated instructions to assemble, since the compiler has to do the impedance matching between the G-machine abstraction of “instruction lists” and the assembler’s labels. Other instructions, such as Pop (not documented here), have a much clearer translation:

```haskell
compileGInst (Pop n) = add (int64 (n * 8)) rsp
```

Keep in mind that the x86 stack grows downwards, so adding to rsp corresponds to popping. The only difference between the G-machine and the actual machine here is that the former works in terms of addresses while the latter works in terms of bytes.

The code to make an App node is similarly simple, using Haskell almost as a macro assembler. The variable hp is defined in the code generator and RTS headers to be r10 , such that both the C support code and the generated assembly can agree on where the heap is.

```haskell
compileGInst Mkap = do
  mov (int8 tag_AP) (tag_off `byteOff` hp)
  pop (arg_off `quadOff` hp)
  pop (fun_off `quadOff` hp)
  push hp
  hp += int64 valueSize
```

Allocating in Rio is as simple as writing the value you want, saving hp somewhere, then bumping it by the size of a value. We can do this because the amount a given supercombinator allocates is statically known, so we can do a heap satisfaction check once, at the start of the combinator, and then just build our graphs free of worry.

A function to count how much a supercombinator allocates is easy to write using folds.

```haskell
entry :: Foldable f => f GmCode -> BlockBuilder ()
entry code
  | bytes_alloced > 0 = do
      lea (bytes_alloced `quadOff` hp) r10
      cmp hpLim r10
      ja (Label "collect_garbage")
  | otherwise = pure ()
  where
    bytes_alloced = foldl' cntBytes 0 code

    cntBytes x Mkap             = valueSize + x
    cntBytes x (Push (Value _)) = valueSize + x
    cntBytes x (Alloc n)        = n * valueSize + x
    cntBytes x (Cond xs ys)     = foldl' cntBytes 0 xs + foldl' cntBytes 0 ys + x
    cntBytes x _                = x
```

To sum up, hopefully without dragging around a huge chain of thunks in memory, I’d like to thank everyone who made it to the end of this grueling, exceedingly niche article. If you liked it, and were perhaps inspired to write a G-machine of your own, please let me know!