Haskell to .Net compiler series: part 1


We’ve started exploring Haskell to .Net compilation as part of a larger project and decided to share what we learn, float ideas and ask for feedback as we progress. We at Superstring Solutions firmly believe that functional programming in general, and Haskell as its “flag bearer” in particular, offers a much better approach to developing complex real-world systems than the predominantly imperative approach used in the industry. There are two big hurdles to its adoption: a lack of comprehensible learning resources, especially at school or undergrad level, and a lack of GUI / Web integration (even though there’s significant horsepower available today for both, with functional reactive programming, various web frameworks and haskell-2-js compilers). To address the latter, we think that having Haskell compile to .Net, with easy and seamless interoperability with existing .Net libraries, would significantly boost Haskell’s popularity with “mainstream” developers and help with wider industry adoption.

Welcome to the Haskell to .Net compiler series! It will be useful to all of you who want to hack GHC, write your own compiler or are generally interested in advanced functional programming topics. In this post, we:

describe the STG language and its data types in some detail, with reference links

show how to get to STG from Haskell source via GHC plugins and in a standalone program using the GHC API

sketch .Net compilation approaches

To whet your appetite, here is the current status of compiling the map function to the pseudo-C# code we use in this very early design exploration stage:

Haskell → STG

Pseudo-C#

The compiler is (and will be) based on GHC (duh!), whose compilation pipeline is described from top level here. In short, it goes from Haskell source to Core language to STG language to C-minus-minus to various backend code generators (llvm and native), with desugaring, typechecking and a bunch of optimizations applied along the way. Most of this functionality is readily available via GHC API, which unfortunately has a couple of annoying problems:

It is not documented very well (do read the source files if you want to get serious, comments there are very helpful)

The API changes slightly from version to version, which regularly makes existing resources on how to use it obsolete (e.g., we used Stephen Diehl’s post series, which is excellent like everything by this author, but it is already quite outdated and required a lot of dancing around with types to work with the current GHC version)

We will compile to .Net from STG, not from Core and not from Cmm, for a number of reasons that will be touched upon later on; hence our focus on STG here. In short, Core is too high-level and Cmm is too low-level for the .Net CLR, but there are other considerations.

Everything below is based on GHC 8.6.5, but we will do our best to explain the general principles, so that keeping up with the changes is easy.

STG and Basic Design

STG stands for Spineless Tagless Graph (reduction) machine. It is both a very small functional language and an approach to compiling functional languages to real hardware (probably state of the art, and certainly the most studied and production-tested one, since it is central to GHC). It is described by Simon Peyton Jones and others in several papers, of which you may want to read two: “Spineless Tagless G-machine” and “Push/Enter vs Eval / Apply”. Curiously, in the first STG machine function calls were implemented via push/enter, but after the second paper the GHC implementation was changed to eval/apply, which turned out to be the better approach. All of Haskell gets compiled via this tiny language.

Very briefly summarizing key STG features that make it an excellent source for compilation:

It’s a very small purely functional language, which also has explicitly defined operational semantics, which makes mapping to hardware or low-level virtual machines (e.g., .Net or Java) pretty straightforward

All arguments of function applications and constructor applications are simple variables or constants, while all data constructor applications and operator calls are fully saturated

All pattern matching is performed by case statements, and evaluation (or graph reduction) is driven also only by them; heap is allocated only by let statements

All objects are uniformly represented at runtime as an info pointer, code pointer and payload, and come in three main kinds: FUN, a function object (so, a function value); CON, a data constructor application (so, a data value); and THUNK, an as yet unevaluated suspension (these are ubiquitous due to the laziness of Haskell). More on this below
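To make the "arguments are only variables or constants" property concrete, here is a minimal sketch of how a nested call such as f (g x) 1 could be flattened into that shape. The types Src, Atom and Stg and the functions toStg, go and atomize are hypothetical toy definitions invented for this illustration; they are not GHC's data types, and real GHC does this work in its Core preparation and Core-to-STG passes. The generated names t0, t1, ... are assumed not to clash with source variables.

```haskell
-- Toy source language: arguments may be arbitrary expressions.
data Src = SVar String | SLit Int | SApp String [Src]
  deriving (Eq, Show)

-- Toy STG-like target: arguments are atoms only.
data Atom = AVar String | ALit Int
  deriving (Eq, Show)

data Stg
  = App String [Atom]   -- call (or variable reference) with atomic arguments
  | Lit Int             -- literal
  | Let String Stg Stg  -- let x = <suspended rhs> in body; the only allocation site
  deriving (Eq, Show)

-- Flatten a source expression, naming every non-atomic argument with a let.
toStg :: Src -> Stg
toStg = fst . go 0

-- The Int is a counter used to generate fresh names (hypothetical scheme).
go :: Int -> Src -> (Stg, Int)
go n (SVar v) = (App v [], n)
go n (SLit i) = (Lit i, n)
go n (SApp f args) =
  let (binds, atoms, n') = atomize n args
      wrap (x, rhs) body = Let x rhs body
  in (foldr wrap (App f atoms) binds, n')

-- Turn each argument into an atom, collecting the let bindings needed.
atomize :: Int -> [Src] -> ([(String, Stg)], [Atom], Int)
atomize n [] = ([], [], n)
atomize n (a : rest) = case a of
  SVar v -> let (bs, ats, n') = atomize n rest in (bs, AVar v : ats, n')
  SLit i -> let (bs, ats, n') = atomize n rest in (bs, ALit i : ats, n')
  _ ->
    let name = "t" ++ show n
        (rhs, n1) = go (n + 1) a
        (bs, ats, n2) = atomize n1 rest
    in ((name, rhs) : bs, AVar name : ats, n2)
```

In this toy model, toStg (SApp "f" [SApp "g" [SVar "x"], SLit 1]) yields a Let that binds t0 to the suspended call g x and then calls f with the atoms t0 and 1, which is the shape the bullet points above describe.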

STG is easy to understand and both papers referenced above are very accessible, but the first thing that will jump out at you after you’ve read them and understood the STG principles is this (comments added by me):

    data GenStgExpr bndr occ
      = StgApp occ [GenStgArg occ]                -- function application
      | StgLit Literal                            -- literal
      | StgConApp DataCon [GenStgArg occ] [Type]  -- data constructor application
      | StgOpApp StgOp [GenStgArg occ] Type       -- primitive operator or foreign call application
      -- only used during core to stg passes:
      | StgLam (NonEmpty bndr) StgExpr
      -- case expressions:
      | StgCase (GenStgExpr bndr occ) bndr AltType [GenStgAlt bndr occ]
      -- let bindings:
      | StgLet (GenStgBinding bndr occ) (GenStgExpr bndr occ)
      | StgLetNoEscape (GenStgBinding bndr occ) (GenStgExpr bndr occ)
      | StgTick (Tickish bndr) (GenStgExpr bndr occ)

… the current main data type encoding STG expressions. It is significantly more cluttered and confusing than the STG description given in the papers, and then there are bindings and closures with their own data types to deal with on top of it. However, here we are primarily interested in four constructors: StgApp (function application), StgConApp (constructor application, which maps to CON objects at runtime), StgCase (case expressions, which drive evaluation and pattern matching inside function definitions) and StgLet (let bindings, which allocate heap objects for closures).
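As an illustration of how these four constructors work together, here is a hand-written sketch of the body of map in a heavily simplified toy AST. The types Atom, Expr and Alt and the helpers mapBody, countLets and countCases are hypothetical names made up for this example; the real GHC types carry binders of type Var, alternative types and more, and the real STG that GHC emits for map will differ in details.

```haskell
-- Toy stand-ins for GenStgArg / GenStgExpr / GenStgAlt (not the GHC types).
data Atom = Var String | Lit Int
  deriving (Eq, Show)

data Expr
  = App String [Atom]     -- like StgApp: function applied to atoms
  | ConApp String [Atom]  -- like StgConApp: saturated constructor application
  | Case Expr [Alt]       -- like StgCase: forces evaluation of the scrutinee
  | Let String Expr Expr  -- like StgLet: allocates a thunk for the bound expression
  deriving (Eq, Show)

-- Constructor name, bound variables, right-hand side.
data Alt = Alt String [String] Expr
  deriving (Eq, Show)

-- map f xs = case xs of { [] -> []; y : ys -> let h = f y; t = map f ys in h : t }
mapBody :: Expr
mapBody =
  Case (App "xs" [])
    [ Alt "[]" [] (ConApp "[]" [])
    , Alt ":" ["y", "ys"]
        (Let "h" (App "f" [Var "y"])
          (Let "t" (App "map" [Var "f", Var "ys"])
            (ConApp ":" [Var "h", Var "t"])))
    ]

-- Number of heap allocation sites (lets) in an expression.
countLets :: Expr -> Int
countLets (App _ _) = 0
countLets (ConApp _ _) = 0
countLets (Case scrut alts) = countLets scrut + sum [countLets e | Alt _ _ e <- alts]
countLets (Let _ rhs body) = 1 + countLets rhs + countLets body

-- Number of evaluation points (cases) in an expression.
countCases :: Expr -> Int
countCases (App _ _) = 0
countCases (ConApp _ _) = 0
countCases (Case scrut alts) = 1 + countCases scrut + sum [countCases e | Alt _ _ e <- alts]
countCases (Let _ rhs body) = countCases rhs + countCases body
```

In this sketch the single case is the only place where xs gets evaluated, and the two lets are the only places where thunks (for f y and for the recursive call) are allocated.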

Let’s look at it and dissect the types top to bottom, step by step, to be able to write our compiler.

So what is an STG program?

The full type hierarchy to unpack when analyzing an STG program is: GenStgTopBinding → GenStgBinding (bindings of variables to ..) → GenStgRhs (closures, which in turn contain ..) → GenStgExpr (expressions). All of these types are currently parametrized by the Var type for both the bndr and occ type variables.

An STG program is merely a list of top-level bindings of closures to variables, given by the type [GenStgTopBinding bndr occ] (here and below, links to the documentation are given right above the code). You can get an STG program from a Core program by calling the coreToStg function; we describe exactly how to get there from Haskell source code in the next section, so for now let’s continue looking at the types. This type, like all other STG types at the time of writing, is parametrized by the Var type for both the bndr and occ type variables. Var encodes a variable’s name and type, so de facto the program is of type [GenStgTopBinding Var Var].

Now, all of the types in STG have some constructors that we will ignore below, because they are not essential for understanding how to manipulate STG or even how to write basic compilers. They are mostly used for Cmm-specific code generation optimizations, and we omit them to reduce clutter.

Inside GenStgTopBinding there is StgTopLifted (GenStgBinding bndr occ) constructor, which we want to simply unpack to get to the actual type of bindings:

    data GenStgBinding bndr occ
      = StgNonRec bndr (GenStgRhs bndr occ)
      | StgRec [(bndr, GenStgRhs bndr occ)]

In essence, StgNonRec var rhs binds the closure rhs::GenStgRhs Var Var to the variable var::Var, while StgRec does the same for a group of (mutually) recursive bindings. Pretty straightforward.
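A compiler pass typically starts by walking this list of bindings. Here is a minimal sketch of such a walk that collects all bound names, written against a toy Binding type that only mirrors the shape of GenStgBinding. Binding, bindersOf, programBinders and example are hypothetical names for this illustration, and the right-hand sides are plain placeholder strings rather than real closures.

```haskell
-- Toy mirror of GenStgBinding, generic in the binder and right-hand-side types.
data Binding b rhs
  = NonRec b rhs     -- a single, non-recursive binding
  | Rec [(b, rhs)]   -- a group of (mutually) recursive bindings
  deriving (Eq, Show)

-- All variables bound by one binding group, in source order.
bindersOf :: Binding b rhs -> [b]
bindersOf (NonRec b _) = [b]
bindersOf (Rec pairs) = map fst pairs

-- All variables bound at the top level of a program.
programBinders :: [Binding b rhs] -> [b]
programBinders = concatMap bindersOf

-- Placeholder program: names and right-hand sides are made up.
example :: [Binding String String]
example =
  [ NonRec "one" "<constructor application>"
  , Rec [("isEven", "<function closure>"), ("isOdd", "<function closure>")]
  ]
```

With the real GHC types the same traversal would first unpack StgTopLifted and then pattern match on StgNonRec and StgRec in the same way.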

Closures

Now, closures are more interesting — as mentioned above, they are the only type of heap objects in the STG machine, and here is how they are represented:

    data GenStgRhs bndr occ
      = StgRhsClosure
          CostCentreStack        -- CCS to be attached (default is CurrentCCS)
          StgBinderInfo          -- Info about how this binder is used (see below)
          [occ]                  -- non-global free vars; a list, rather than
                                 -- a set, because order is important
          !UpdateFlag            -- ReEntrant | Updatable | SingleEntry
          [bndr]                 -- arguments; if empty, then not a function;
                                 -- as above, order is important.
          (GenStgExpr bndr occ)  -- body

      | StgRhsCon
          CostCentreStack        -- CCS to be attached (default is CurrentCCS).
                                 -- Top-level (static) ones will end up with
                                 -- DontCareCCS, because we don't count static
                                 -- data in heap profiles, and we don't set CCCS
                                 -- from static closure.
          DataCon                -- Constructor. Never an unboxed tuple or sum, as those
                                 -- are not allocated.
          [GenStgArg occ]        -- Args

Remember that bndr and occ are always of type Var as of now.

StgRhsCon represents a “normal” value: the application of a data constructor DataCon to a list of GenStgArg Var. GenStgArg is the type that describes arguments to function, operator and constructor applications; it is a very simple sum type of a Var and a literal. Remember that constructor applications in STG are always saturated, so you don’t need to deal with partial applications, and at runtime this translates to CON heap objects. For example, Just 4 translates to something like StgRhsCon <'Just' DataCon rep> [StgLitArg 4] (not actual code!)

StgRhsClosure ccs binfo freeVars::[Var] flag::UpdateFlag vars::[Var] expr::GenStgExpr Var Var describes everything else, that is, a function or a thunk. Ignore the cost centre and binder info fields; they are used in code generation optimizations. What remains of the signature above encodes STG code similar to {xf1 ... xfn} \r {x1 ... xm} -> expr, where xfi are free variables, xj are arguments and expr::GenStgExpr Var Var is the actual code.
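The two right-hand-side forms can be sketched in a toy model that drops the cost centre and binder info fields and renders closures in the notation used above. The types and functions below (Rhs, RhsClosure, RhsCon, render, kind and the sample values) are hypothetical stand-ins and not the GHC definitions. The closure body is kept as plain text to keep the example short, and the letters r, u and s for the update flags are an assumption based on how the notation above uses \r.

```haskell
import Data.List (intercalate)

data UpdateFlag = ReEntrant | Updatable | SingleEntry
  deriving (Eq, Show)

-- Toy stand-in for GenStgArg: a variable or a literal.
data Atom = VarArg String | LitArg Int
  deriving (Eq, Show)

-- Toy stand-in for GenStgRhs without profiling and binder info fields.
data Rhs
  = RhsClosure [String] UpdateFlag [String] String  -- free vars, flag, args, body text
  | RhsCon String [Atom]                            -- constructor name, saturated args
  deriving (Eq, Show)

flagChar :: UpdateFlag -> Char
flagChar ReEntrant = 'r'
flagChar Updatable = 'u'
flagChar SingleEntry = 's'

showAtom :: Atom -> String
showAtom (VarArg v) = v
showAtom (LitArg n) = show n

-- Render in the "{free vars} \flag {args} -> body" notation.
render :: Rhs -> String
render (RhsClosure fvs flag args body) =
  "{" ++ unwords fvs ++ "} \\" ++ [flagChar flag] ++ " {" ++ unwords args ++ "} -> " ++ body
render (RhsCon con args) =
  con ++ " [" ++ intercalate ", " (map showAtom args) ++ "]"

-- Which kind of heap object a right-hand side becomes, following the rule
-- "if the argument list is empty, then it is not a function".
kind :: Rhs -> String
kind (RhsCon _ _) = "CON"
kind (RhsClosure _ _ [] _) = "THUNK"
kind (RhsClosure _ _ _ _) = "FUN"

justFour, mapClosure, suspended :: Rhs
justFour = RhsCon "Just" [LitArg 4]
mapClosure = RhsClosure [] ReEntrant ["f", "xs"] "case xs of ..."
suspended = RhsClosure ["f", "y"] Updatable [] "f y"
```

Here render suspended produces {f y} \u {} -> f y, a closure with two free variables and no arguments, which kind classifies as a THUNK, while mapClosure has arguments and no free variables and is classified as a FUN.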

This closure representation is key to understanding how STG machine functions, so we look at it in more detail.

This closure form has several interesting properties:

Free variables freeVars of the code represented by expr are referenced right here together with the code, so at runtime there’s no need to look them up in some global table; they are passed explicitly (for details please refer to the papers mentioned above)

The update flag can be one of the following:

ReEntrant, which encodes a “normal” function with lambda-bound arguments vars and which may or may not have free variables freeVars (top level functions never have free variables)

Updatable, which encodes a THUNK: a lazy, as yet unevaluated suspension that normally has free variables and code to execute ( expr ) but does not have arguments vars (otherwise it would be a function). At runtime, a THUNK in STG is represented as a self-updating structure: when we need its value, it is entered and evaluated, and the pointer is updated with the calculated value for much faster subsequent access. Hence the name call-by-need that is used to describe Haskell’s evaluation strategy (as opposed to call-by-name, which re-evaluates the expression on every use, or strict call-by-value, which evaluates arguments even when they are never used)

SingleEntry, which is also a THUNK , but one that the compiler was able to prove will be entered at most once, so it does not need to be updated. We simply evaluate it (to weak head normal form) when its value is needed and then let the garbage collector collect it, which is a very nice optimization that increases overall code efficiency
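The self-updating behaviour of an Updatable thunk can be imitated in ordinary Haskell with a mutable cell. This is only a sketch of the idea, not how GHC's runtime implements it: there the update overwrites the heap object itself via its info pointer. The names Cell, Thunk, suspend, force and demo are hypothetical and introduced only for this illustration.

```haskell
import Data.IORef

-- A cell is either a suspended computation or its already computed value.
data Cell a = Suspended (IO a) | Value a

newtype Thunk a = Thunk (IORef (Cell a))

-- Allocate a thunk for a computation without running it.
suspend :: IO a -> IO (Thunk a)
suspend compute = Thunk <$> newIORef (Suspended compute)

-- Enter the thunk: evaluate on first use, then overwrite the cell with the
-- value so that later uses skip the computation (call-by-need).
force :: Thunk a -> IO a
force (Thunk ref) = do
  cell <- readIORef ref
  case cell of
    Value v -> pure v
    Suspended compute -> do
      v <- compute
      writeIORef ref (Value v)
      pure v

-- Force the same thunk twice and report both results together with the
-- number of times the suspended computation actually ran.
demo :: IO (Int, Int, Int)
demo = do
  runs <- newIORef (0 :: Int)
  t <- suspend (modifyIORef runs (+ 1) >> pure (6 * 7))
  a <- force t
  b <- force t
  n <- readIORef runs
  pure (a, b, n)
```

In demo the suspended computation runs only once even though the thunk is forced twice. A SingleEntry thunk corresponds to skipping the writeIORef step, which is safe only when the compiler has proven there will be no second force.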