



HaSKI

An FPGA-based SKI calculus evaluator written in Haskell/Cλash

HaSKI (github) is my attempt at building a reasonably simple hardware-based evaluator for the dead-simple, Turing-complete SKI combinator calculus.

Background: Circuit design

Currently, the vast majority of hardware projects are written using a "Hardware Description Language", or HDL. These are text-based languages that allow designers to describe the behavior of circuits.

However, I really don't like any of the industry-standard HDLs. They were designed a long time ago, and not designed very well. Thankfully, we have some modern alternatives.

The functional programming language Haskell has very consistent, well-defined, and reasonable semantics, which means that you can do a lot of interesting things with Haskell programs that you couldn't possibly do with most other languages. In particular, Haskell is statically typed, pure and lazy. It turns out that these attributes are really great for modeling circuit behavior, and a large subset of Haskell programs can be directly compiled to circuits that do the exact same thing.

One such compiler, and the one I used, is Cλash. For more details, see the slides from my (informal) AHUG talk on Cλash. Hopefully that should explain a bit about why the traditional HDLs are no good and how we can use Haskell instead.

In my experience, writing hardware using Haskell is vastly easier than using Verilog or VHDL. In particular, it is much easier to write correct hardware. Debugging hardware is substantially more difficult and time-consuming than debugging software, so we want to avoid making mistakes if at all possible. Haskell's powerful type system (together with the extensions used by Cλash) allows for very robust static verification of hardware, which leads to fewer bugs in the first place.

If we do end up creating a bug, we can simply run our circuit as a standard Haskell program, which means we can use software debugging techniques to debug our hardware design! For example, we can have useful error messages in our Haskell circuit description. These will show up normally if we run the design as a program and there's an error. Cλash knows about error messages, and it simply removes them when we actually compile to hardware.

I should mention that Cλash is not the only project with the goal of replacing the standard HDLs. There is also Lambda-CCC, Chisel, etc. Cλash worked best for me, but it is by no means the only solution out there.

A word of warning, as well; these are all very new projects, so they are likely to have bugs.

Background: SKI combinator calculus

The SKI calculus is defined by three terms ( S , K , and I ). You can follow that link for a rigorous description. For a less rigorous description:

a, b, c, d are arbitrary SKI expressions (e.g. b could be "(SIIK)").

| Program before evaluation step | Program after evaluation step |
|---|---|
| (Ia)b | ab |
| (Kab)c | ac |
| (Sabc)d | (ac(bc))d |

You just keep applying the evaluation step to run your program.

It turns out that this is all you need for a Turing-complete language with branching, recursion, etc.
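For instance (a worked example of my own, using just the rules above), the expression "SKK" behaves exactly like I when applied to anything:

```
S K K a
→ K a (K a)    -- S rule (the three terms after S are K, K, and a)
→ a            -- K rule (drop the second argument)
```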

These terms don't have any side effects, so it's hard to make a pure SKI program do anything interesting. Therefore, I added an L (for "Literal") term, which is just like I except that it causes the evaluator to emit a value when evaluated.

Background: Haskell

If you don't know any Haskell, you may find some of this a bit hard to follow. The most important thing is that data Foo = Bar Int Float means that I am defining a new data structure. Its type is Foo , and its only constructor is Bar . A Bar holds an Int and a Float . By analogy to C,

```haskell
data Foo = Bar Int Float

x = Bar 10 3.14
```

Is roughly like

```c
typedef struct {
    int first_value;
    float second_value;
} foo;

foo x = {.first_value = 10, .second_value = 3.14};
```

Haskell also supports multiple constructors for a single type. For example, we can define data EitherAnIntOrAFloat = AnInt Int | AFloat Float . This just means that, at runtime, an EitherAnIntOrAFloat can have one of two possible structures. We have to inspect a given EitherAnIntOrAFloat to figure out which structure it has. By analogy to C,

```haskell
data EitherAnIntOrAFloat = AnInt Int | AFloat Float

y = AFloat 2.71
```

is roughly like

```c
typedef enum {int_constructor, float_constructor} constructor;

typedef struct {
    constructor which_constructor;
    union {
        int int_val;
        float float_val;
    } data;
} eitherAnIntOrAFloat;

eitherAnIntOrAFloat y = {.which_constructor = float_constructor, .data.float_val = 2.71};
```
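In Haskell, figuring out which constructor a value has is done with pattern matching. A small sketch (the `describe` function is my own illustration, not from HaSKI):

```haskell
data EitherAnIntOrAFloat = AnInt Int | AFloat Float

-- Pattern matching inspects which constructor was used,
-- like switching on which_constructor in the C version.
describe :: EitherAnIntOrAFloat -> String
describe (AnInt n)  = "an Int: " ++ show n
describe (AFloat f) = "a Float: " ++ show f
```

For example, `describe (AFloat 2.71)` evaluates to `"a Float: 2.71"`.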

If you'd like to know more, I wrote a reasonably concise Haskell crash course that might help out.

A simple evaluator

There are multiple ways to build an SKI evaluator. I chose to do a spine-traversal stack-machine based evaluator. I haven't managed to find much literature on evaluator techniques, but I was able to figure this one out after a friend mentioned seeing this general idea in a paper.

My SKI terms are defined as follows in Model.Model :

```haskell
data SKI = S | K | I | T SKI SKI | L Char
```

That is to say that an SKI term is an S , a K , an I (the three standard terms), a T (two terms next to each other), or an L (an output term).

For example, the SKI expression "IKS", equivalent to "(IK)S" would be represented as T (T I K) S , while "I(KS)" would be represented as T I (T K S) .

The evaluator is defined as follows, in Model.StackMachine :

```haskell
data State = State {stack :: [SKI], current :: SKI}
           | Terminal

step :: State -> (State, Maybe Char)
step (State stack       (T a b)) = (State (b : stack) a,           Nothing)
step (State (s:stack)   I      ) = (State stack s,                 Nothing)
step (State (x:y:stack) K      ) = (State stack x,                 Nothing)
step (State (x:y:z:stack) S    ) = (State (z : (T y z) : stack) x, Nothing)
step (State (s:stack)   (L l)  ) = (State stack s,                 Just l)
step (State []          (L l)  ) = (Terminal,                      Just l)
```

The step function takes a state and generates the next state. It also returns a Char if there was any output for that step (i.e. the evaluated term was an L ).

The way this works is pretty simple. If we come across a T (two terms next to each other), we push the rightmost term onto the stack and evaluate the leftmost term. This is why I call it a "spine-traversal" evaluator. It traverses the spine (the leftmost side) of the tree representing the SKI program.

So if we start with the program "abc", equivalent to T (T a b) c , the evaluator states are as follows:

1. State [] (T (T a b) c)
2. State [c] (T a b)
3. State [b,c] a
4. ...

If the current term is an I and the stack has s (where s is S , K , I , T a b , or L c ) on the tip, that conceptually means that we just evaluated the term T I s , which is equivalent to the SKI expression "Is". As we can tell from the rules of the SKI calculus, this should be reduced to just "s". Therefore, we pull the s off the stack and start evaluating it.

K is similar. State [x,y,...] K means that the expression we're currently evaluating starts with T (T K x) y , equivalent to "Kxy". We know from the rules that we should drop y and evaluate x .

With S , we need to take the three terms following it (which will be on the stack) and replace them according to the rules. So "Sxyz" becomes "xz(yz)". This is equivalent to T (T x z) (T y z) . However, we already know what effect evaluating that will have on the stack, so we just go ahead and put the correct values on the stack directly. In particular, State [x,y,z,...] S becomes State [z,(T y z),...] x .

With L , we do the exact same thing as I , except we also return an output character.

If we come across an L and there's nothing left on the stack, it's assumed that we've reached the end of the program and we enter the Terminal state.

See the Readme for info on how to play around with the model evaluator.

"SIIa" is a simple program that will output "aa". "SII" duplicates the term that follows it, which in this case is "a".

The steps of the evaluator are as follows:

```haskell
State {stack = [], current = T (T (T S I) I) (L 'a')}
State {stack = [L 'a'], current = T (T S I) I}
State {stack = [I,L 'a'], current = T S I}
State {stack = [I,I,L 'a'], current = S}
State {stack = [L 'a',T I (L 'a')], current = I}
State {stack = [T I (L 'a')], current = L 'a'}   -- Outputs "a"
State {stack = [], current = T I (L 'a')}
State {stack = [L 'a'], current = I}
State {stack = [], current = L 'a'}              -- Outputs "a"
Terminal
```
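The whole model evaluator is small enough to reproduce and run as ordinary Haskell. The sketch below mirrors the definitions above; the `run` driver and the catch-all "stuck" case are my own additions for illustration:

```haskell
data SKI = S | K | I | T SKI SKI | L Char

data State = State {stack :: [SKI], current :: SKI} | Terminal

step :: State -> (State, Maybe Char)
step (State stack       (T a b)) = (State (b : stack) a,         Nothing)
step (State (s:stack)   I      ) = (State stack s,               Nothing)
step (State (x:_:stack) K      ) = (State stack x,               Nothing)
step (State (x:y:z:stack) S    ) = (State (z : T y z : stack) x, Nothing)
step (State (s:stack)   (L l)  ) = (State stack s,               Just l)
step (State []          (L l)  ) = (Terminal,                    Just l)
step _                           = (Terminal,                    Nothing)  -- stuck term; halt

-- Drive `step` until Terminal, collecting the output characters.
run :: SKI -> String
run prog = go (State [] prog)
  where
    go Terminal = ""
    go st       = let (st', out) = step st
                  in maybe id (:) out (go st')

-- "SIIa" duplicates its argument:
example :: String
example = run (T (T (T S I) I) (L 'a'))  -- evaluates to "aa"
```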

Porting it to hardware

The described evaluator is very simple to implement in a high-level, garbage-collected, boxed-value language like Haskell. Unfortunately, we don't even get RAM (let alone garbage collection) in hardware.

Therefore, we have to do a lot more work to get the same result. We're starting at the transistor level.

The key differences are as follows:

We no longer have implicit built-in RAM, except for that which we can access through a memory bus. This means that the evaluator state must have a fixed size, and we have to offload any other state to RAM.

No built-in RAM means that all pointers must be explicit. Therefore, T terms can no longer "contain" values, but instead must hold pointers to values.

No built-in RAM means that we can't keep the entire stack at our disposal. We have to read in and write out the stack as it's used (although we can cache a small part of it).

We will sometimes have to dynamically allocate space for new values. If we are pushing T a b onto the stack, we must have a pointer to a and b . If a or b is a new term we have never seen before, we have to store it in memory first, so we need an explicit heap.

The Evaluator

The new SKI term definitions are in Hardware.Model :

```haskell
-- 30-bit pointers to 64-bit words
-- Why? We want to fit two pointers plus 3 tag bits in a word.
-- This way, a whole SKI term fits in a word.
newtype Ptr = Ptr (Unsigned 30)

data SKI = S | K | I | T Ptr Ptr | L Output

-- 32-bit output values
data Output = Output (Unsigned 32)
```

As you can see, we now have explicit pointers in the T terms. We also replaced output Char s with 32-bit unsigned integers, because the hardware representation of Char s is not well-defined.
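The comment above implies a layout of two 30-bit pointers plus a 3-bit tag in one 64-bit word. The actual encoding is chosen by Cλash and isn't specified here, but a hypothetical packing could look like this (the tag value, `packT`, and `unpackT` are all assumptions, not HaSKI code):

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word64)

tagT :: Word64
tagT = 3  -- hypothetical tag value for T terms

mask30 :: Word64
mask30 = (1 `shiftL` 30) - 1

-- Pack a T term: tag in bits 0-2, pointer a in bits 3-32, pointer b in bits 33-62.
packT :: Word64 -> Word64 -> Word64
packT a b = tagT .|. ((a .&. mask30) `shiftL` 3) .|. ((b .&. mask30) `shiftL` 33)

-- Recover the two pointers from a packed word.
unpackT :: Word64 -> (Word64, Word64)
unpackT w = ((w `shiftR` 3) .&. mask30, (w `shiftR` 33) .&. mask30)
```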

The new, explicit evaluator state is defined in Hardware.StackMachine :

```haskell
data SKIs = NoSKIs | OneSKI SKI | TwoSKIs SKI SKI | ThreeSKIs SKI SKI SKI

data Stack = Stack {cache :: SKIs, base :: Ptr, count :: Unsigned 30}

data Heap = Heap {tip :: Ptr}

data State = Initializing
           | State {stack :: Stack, heap :: Heap, current :: SKI}
           | Terminal
```

The new State has several additions.

First, the stack is more explicit. It contains a cache of as many as 3 values to allow for fast evaluation (because S uses the top 3 stack values at once). It also contains a pointer to the next stack element and a count of how many stack items are in memory.

The heap is now made explicit. For simplicity, we are using a straightforward (but wasteful) non-freeing heap. It would be an interesting project to add garbage collection or refcounting.
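A non-freeing heap is just a bump allocator: allocation hands out the current tip and advances it, and nothing is ever reclaimed. A minimal sketch (plain Haskell, with Word32 standing in for Unsigned 30; `alloc` is my guess at the semantics, not the actual HaSKI code):

```haskell
import Data.Word (Word32)

-- Word32 stands in for the real 30-bit pointer type.
newtype Ptr = Ptr Word32 deriving (Eq, Show)

data Heap = Heap { tip :: Ptr } deriving (Eq, Show)

-- Bump allocation: hand out the current tip and advance it.
-- Nothing is ever freed, which is why this heap is wasteful.
alloc :: Heap -> (Ptr, Heap)
alloc (Heap (Ptr t)) = (Ptr t, Heap (Ptr (t + 1)))
```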

Same as before, we have a current term, which is the term being evaluated.

We still have the Terminal state, which means the program finished, but now we also have Initializing . We need to add Initializing because we have to load the first term from RAM before we can start evaluating it.

The logic of the stack machine is broken up into two functions. The first is

```haskell
step1 :: State -> MemRequest
```

which takes the current state and generates the memory read/write request necessary to evaluate the current term.

The second function is

```haskell
step2 :: State -> MemResponse -> State
```

which takes the response from the memory unit and generates the next state.

We also have outputOf :: State -> Maybe Output and terminal :: State -> Bool , which are used to determine if we should emit an output or halt the processor.

The Memory Unit

Because we now have to deal with actual RAM, we need a device for processing memory requests. This is defined in Hardware.MMU .

The relevant functions are

- initiate :: MemRequest -> Pending takes a MemRequest and turns it into a series of steps that the MMU must follow (i.e. a pending request).
- next :: Pending -> RAMAction takes a pending request and determines the next step (i.e. should we read or write?).
- service :: Pending -> RAMStatus -> Pending takes a response from the RAM module (which could be that a write or read completed, or that nothing happened) and updates the pending request.
- check :: Pending -> Maybe MemResponse checks if the memory request has completely finished, and returns a MemResponse if it has.

We have to break it up into these little steps because RAM modules don't do things in a nice, timely fashion. It might take 100 cycles for a write to go through, so we must be able to deal with events occurring far apart in time.
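To see how the four functions fit together, here is a toy model of a single read request going through that lifecycle. All the types below are simplified stand-ins of my own, not the real Hardware.MMU definitions:

```haskell
data MemRequest  = ReadWord Int                   deriving (Eq, Show)
data RAMAction   = DoRead Int | DoNothing         deriving (Eq, Show)
data RAMStatus   = NothingHappened | ReadDone Int deriving (Eq, Show)
data MemResponse = GotWord Int                    deriving (Eq, Show)
data Pending     = WaitingForRead Int | Done Int  deriving (Eq, Show)

initiate :: MemRequest -> Pending
initiate (ReadWord addr) = WaitingForRead addr

next :: Pending -> RAMAction
next (WaitingForRead addr) = DoRead addr
next (Done _)              = DoNothing

-- The RAM may report NothingHappened for many cycles before the read completes.
service :: Pending -> RAMStatus -> Pending
service (WaitingForRead _) (ReadDone w) = Done w
service p                  _            = p

check :: Pending -> Maybe MemResponse
check (Done w) = Just (GotWord w)
check _        = Nothing
```

Until the RAM reports ReadDone, `check` keeps returning Nothing; afterwards it yields the completed MemResponse.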

The Processor

We have a stack machine (for evaluating SKI terms) and we have an MMU (for communicating with RAM). Now, we need to tie them together. This is done in Hardware.CPU .

The state of the entire CPU (both evaluator and MMU) is contained in a CPUState .

```haskell
-- Are we waiting for a single memory action (read/write/etc.) to complete?
data Waiting = No | Yes

-- We need to keep track of the evaluator state as well as the MMU state.
data CPUState = CPU State Pending Waiting
```

If you recall, State is the stack machine state and Pending is what the MMU is currently doing. The Waiting value just keeps track of if we're currently waiting for the RAM module to do something interesting (like finish a read or write).

The meat of the CPU logic is defined in

```haskell
step :: CPUState -> RAMStatus -> (CPUState, RAMAction, Maybe Output)
```

This function takes the current CPU state and an update from the RAM module. It uses this information to generate the next CPU state (which might be the same, if nothing has happened), a RAM action (which will be nothing if there's an active RAM action that hasn't finished yet), and maybe an Output .

Everything until now has been entirely pure. That is, we've defined a bunch of state types and transition functions, but we've never detailed how this state should actually be stored in the processor. All the details of this are in

```haskell
cpu :: Signal RAMStatus -> Signal (RAMAction, Maybe Output, Halt)
```

The Signal type indicates that this function operates on streams of values, i.e. the values that flow through the circuit as it churns away.

Every cycle (corresponding to a single value in a Signal stream), cpu does the following things. In a compiled Haskell binary, they occur in no particular order, and in hardware they all occur at the same time.

Based on the current state and the incoming RAM update:

1. Generate a new state.
2. Emit a RAM action (read/write/do nothing).
3. (Possibly) emit an Output .
4. Check if we should halt and emit a DoHalt or Don'tHalt .

Then update the stored state to the new state.

This completely describes the behavior of the CPU.
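You can get a feel for what a Signal is by modeling it as a (conceptually infinite) list of per-cycle values and the circuit as a Mealy machine (state + input → new state + output). This is an illustration of the concept, not Cλash's actual implementation:

```haskell
-- One output per input "cycle"; the state is threaded through internally,
-- just like the register holding the CPU state in the real circuit.
mealy :: (s -> i -> (s, o)) -> s -> [i] -> [o]
mealy _ _ []     = []
mealy f s (i:is) = let (s', o) = f s i
                   in o : mealy f s' is

-- A tiny example "circuit": a running sum of its inputs.
runningSum :: [Int] -> [Int]
runningSum = mealy (\acc x -> (acc + x, acc + x)) 0
```

For example, `runningSum [1,2,3]` produces `[1,3,6]`, one output value per input value, just as the hardware produces one output per clock cycle.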

Integrating with FPGA hardware

We've described the behavior of the stack machine, MMU, and glue logic.

Now, we need to deal with actually integrating with the hardware on the FPGA.

This is detailed in Hardware .

First, we have four functions, ramstatus , ramaction , output , and halt , which convert between abstract Haskell values (for which the Cλash compiler is allowed to generate arbitrary hardware representations) and well-defined bit-vector representations. For example, Cλash could encode DoHalt as 0 and Don'tHalt as 1, but we want the opposite, so we have to manually define a function that does this conversion for us. Note that Cλash might happen to represent a given type in the way we want, but it's not guaranteed to do so.
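As a sketch of the kind of conversion involved, a halt function that pins down the encoding we want might look like this (the Int return type and the name haltBit are my simplifications; the real code produces a bit vector):

```haskell
data Halt = DoHalt | Don'tHalt

-- Force the encoding we want at the hardware boundary:
-- DoHalt is 1 and Don'tHalt is 0, regardless of what Cλash would pick.
haltBit :: Halt -> Int
haltBit DoHalt    = 1
haltBit Don'tHalt = 0
```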

Next, we have cpuHardware . This is exactly like cpu , except that its inputs and outputs have been wrapped in those four conversion functions, so that they have a well-defined bit-level representation.

Now, take a look at ramHardware . In a "real world" SKI computer (if, for some strange reason, you wanted to make one), you would probably use a standard DRAM module for holding all your data. However, FPGA DRAM integration is really messy and varies wildly across dev boards. In the interest of making this design plug-and-play, I used Verilog code designed to be synthesized to block RAM, which is efficient single-cycle RAM present on most FPGAs. My cheap FPGA only has a few hundred kilobits, so the default RAM is pretty small.

If you'd like, you could implement a ramHardware using an external RAM chip for your particular FPGA board. Because we use 30-bit pointers and 64-bit words, you could use up to 8 GiB of RAM.

The RAM described by ramHardware is initialized with the default RAM contents specified in Hardware.MemoryEmulator.Default .

Finally, we have topEntity . topEntity is the function that actually gets compiled to a circuit. In this case, all it does is connect up cpuHardware and ramHardware . We get a stream of OutputBits and HaltBits .

The resulting circuit has two inputs (one for the clock signal, one for the reset signal) and three outputs (an output valid bit, output data bits, and a halt bit).

You can simulate this hardware as described in the README.