The Awesome Power of Theory

Explorations in the untyped lambda calculus

Ron Garret

June 2014

0. TL;DR

This is a very long, very geeky post about using an extremely simple computational formalism to do something much more complicated and useful than you would think it was capable of. Specifically, it's about using the pure untyped lambda calculus to compute the factorial of 10. This post assumes some basic knowledge of Lisp.

1. Introduction

I want to show you a little technological miracle. This, believe it or not, is the factorial function:

(λ (n f) (n (λ (c i) (i (c (λ (f x) (i f (f x))))))(λ x f) (λ x x)))

The reason you might be skeptical that this is the factorial function is that there are no mathematical operations in it. You are probably used to seeing the factorial written something more like this:

(defun fact (n) (if (<= n 1) 1 (* n (fact (1- n)))))

Or maybe this:

int fact(int n) { return (n<=1 ? 1 : fact(n-1)); }

Regardless of the syntax, you’d expect a factorial function to include some math, and there doesn’t seem to be any math there. In fact, there doesn’t seem to be much of anything there, so you would be rightly skeptical about my claim that this is in fact the factorial function. So to prove it, here it is running on a real computer, computing the factorial of 10:

? (λ ()

((λ (f s) (f (s (s (s (s (f (s (s (s (λ (f x) x)))))))))))

(λ (n f) (n (λ (c i) (i (c (λ (f x) (i f (f x))))))

(λ x f) (λ x x)))

(λ (n f x) ((n f) (f x)))

'1+ 0))

3628800

Actually, what this code does is compute the factorial of 3, add four to that, and then compute the factorial of the result. The whole thing takes less than a second to run on my MacBook Pro. If you want to try this yourself, here's the definition of λ in Common Lisp:

(defmacro λ (args body)

(if (and args (atom args)) (setf args (list args)))

(if (and (consp body) (consp (car body))) (push 'funcall body))

(if (null args)

body

`(lambda (&rest args1)

(let ((,(first args) (first args1)))

(declare (ignorable ,(first args)))

(flet ((,(first args) (&rest args2)

(apply ,(first args) args2)))

(if (rest args1)

(apply (λ ,(rest args) ,body) (rest args1))

(λ ,(rest args) ,body)))))))

This code is complete and self-contained. You can cut-and-paste this code into any Common Lisp and it should work. Note that there are no calls to any Common Lisp math functions (except 1+) and no numbers other than 0.

This is more than just a parlor trick. This version of the factorial is the last step in a long process that contains deep insights into how programming languages work under the hood. Those twelve lines of code that define the λ macro are actually a compiler for a programming language called the untyped lambda calculus. The λ macro compiles the lambda calculus into Common Lisp, and from there it gets compiled into machine code. This lambda calculus is closely related to Lisp (which is why it is so easy to compile it into Lisp), but it isn’t Lisp. For starters, it doesn’t have lists. It doesn’t have anything except the ability to create functions and call them. No lists, no math, no primitive operations of any kind. It turns out you don’t need them.

The following sections will walk you step-by-step through the process of building a factorial function out of the lambda calculus. It is the closest thing I know to literally building something from (almost) nothing.

2. A quick introduction to the lambda calculus

I'm going to use Lisp notation because we're going to actually implement the lambda calculus in Lisp. The lambda calculus consists of three kinds of expressions:

1. Identifiers

2. Function applications

3. Lambda expressions

Identifiers are simply letters or words, like ‘a’, ’x’, or ‘foo’. They have no semantic meaning, and we could just as well use pictures instead of letters and words. But using descriptive words can sometimes make it easier to understand what is going on.

Function applications consist of two expressions surrounded by a matched pair of parentheses, i.e.:

([expression1] [expression2])

Here are some examples:

(f x)

(sin a)

(cos (sqrt x))

((f g) (h (y z)))

Note that “sin”, “cos” and “sqrt” don’t have any special meaning in the lambda calculus. There are no math operations or primitives of any kind. There are no numbers, nor any other data types (except functions).

The third kind of expression, lambda expressions, look like this:

(λ [identifier] [expression])

So lambda expressions look similar to function applications, but they have three elements instead of two, and the first is always λ and the second is always a single identifier. (Later we will extend the notation to allow multiple arguments, but this will just be a notational shorthand. It won’t actually change the language.)

A lambda expression denotes a function whose argument is [identifier] and whose value is [expression]. There’s a little more to it than that, but for our purposes that’s all really you need to know. So, for example, this is the identity function:

(λ x x)

And that’s it. That’s all there is to the lambda calculus. There are no primitives, no numbers, no strings, no characters, no data types of any kind, only functions, and only functions of exactly one argument. How could you possibly ever build anything useful out of that?

The trick to wringing useful behavior out of the lambda calculus is that functions are first-class entities. That means that you can pass them as arguments and return them as values. So, for example, here is the identity function applied to itself:

((λ a a) (λ a a))

Here the identity function (λ (a) a) plays the role of both f and a in the construct (f a). The result of actually calling the identity function on itself is, of course, the identity function.

Now, if you actually try to run this code in Common Lisp using the λ macro above (and you can, and I encourage you to do so) you will probably (depending on which Common Lisp you are using) see something like this:

? ((λ a a) (λ b b))

#<Anonymous Function #x302000F5726F>

This is because the functions we build using λ are actually converted into Common Lisp functions by the λ macro, and thence compiled by the Common Lisp compiler into executable machine code. For example, here's the result of compiling the identity function:

? (disassemble (λ a a))

;;; (λ (a) a

L0

(leaq (@ (:^ L0) (% rip)) (% fn)) ; [0]

(movl (% nargs) (% imm0.l)) ; [7]

(subq ($ 24) (% imm0)) ; [9]

(jle L30) ; [13]

(movq (% rbp) (@ 8 (% rsp) (% imm0))) ; [15]

(leaq (@ 8 (% rsp) (% imm0)) (% rbp)) ; [20]

(popq (@ 8 (% rbp))) ; [25]

(jmp L34) ; [28]

L30

(pushq (% rbp)) ; [30]

(movq (% rsp) (% rbp)) ; [31]

L34

(leaq (@ (:^ L53) (% fn)) (% temp2)) ; [34]

(nop) ; [41]

(nop) ; [44]

(jmpq (@ .SPHEAP-REST-ARG)) ; [46]

L53

(leaq (@ (:^ L0) (% rip)) (% fn)) ; [53]

(pushq (% save0)) ; [60]

(movq (@ -8 (% rbp)) (% save0)) ; [62]

(movl (% save0.l) (% imm0.l)) ; [66]

(andl ($ 7) (% imm0.l)) ; [69]

(cmpl ($ 3) (% imm0.l)) ; [72]

(jne L177) ; [75]

(pushq (@ 5 (% save0))) ; [77]

(movl (% save0.l) (% imm0.l)) ; [81]

(andl ($ 7) (% imm0.l)) ; [84]

(cmpl ($ 3) (% imm0.l)) ; [87]

(jne L185) ; [90]

(movq (@ -3 (% save0)) (% temp0)) ; [92]

(cmpb ($ 11) (% temp0.b)) ; [96]

(je L159) ; [99]

(movq (@ -24 (% rbp)) (% arg_z)) ; [101]

(pushq (% arg_z)) ; [105]

(movl (% save0.l) (% imm0.l)) ; [106]

(andl ($ 7) (% imm0.l)) ; [109]

(cmpl ($ 3) (% imm0.l)) ; [112]

(jne L193) ; [115]

(movq (@ -3 (% save0)) (% arg_z)) ; [117]

(movq (@ -32 (% rbp)) (% temp0)) ; [121]

(xorl (% nargs) (% nargs)) ; [125]

(leaq (@ (:^ L141) (% fn)) (% temp2)) ; [127]

(jmpq (@ .SPSPREADARGZ)) ; [134]

L141

(leaq (@ (:^ L0) (% rip)) (% fn)) ; [141]

(movq (@ -16 (% rbp)) (% save0)) ; [148]

(jmpq (@ .SPTFUNCALLGEN)) ; [152]

L159

(movq (@ -24 (% rbp)) (% arg_z)) ; [159]

(addq ($ 8) (% rsp)) ; [163]

(popq (% save0)) ; [167]

(leaveq) ; [169]

(retq) ; [170]

L177

(uuo-error-reg-not-list (% save0)) ; [177]

L185

(uuo-error-reg-not-list (% save0)) ; [185]

L193

(uuo-error-reg-not-list (% save0)) ; [193]

That seems like an awful lot of work to do exactly nothing. The reason the identity function turns into so much code will become clear later. For now, we just need a better way to keep track of what is going on. To do that, we're going to give ourselves the ability to assign names to the functions we build, and also to keep track of some extra bookkeeping information:

(defmacro name (thing &optional name)

(let* ((val (eval thing))

(class (class-name (class-of val))))

`(progn

(defmethod print-object ((x (eql ',val)) stream)

(format stream "#<~A ~A>" ',class ',(or name thing)))

,thing)))

(defmacro define (name value)

`(progn

(setf (get ',name 'λ) ',value)

(defun ,name (&rest args) args) ; Suppress warnings

(setf (symbol-value ',name) ,value)

(define-symbol-macro ,name (symbol-value ',name))

(defun ,name (&rest args) (apply ,name args))

(name ,name)))

DEFINE is not part of the lambda calculus, it's just a convenient debugging tool to let us peer under the hood and see what is going on. So, for example:

? (define _id (λ a a))

#<FUNCTION _ID>

? (_id _id)

#<FUNCTION _ID>

I'm (mostly) going to use the convention of starting the names of lambda calculus functions with an underscore to distinguish them from Common Lisp functions. Remember, we are "compiling" the lambda calculus to Common Lisp with the λ macro, so we can freely mix lambda calculus and Common Lisp functions and data types if we choose to:

? (_id 123)

123

But this is cheating because 123 (and all numbers) are part of Common Lisp, not the lambda calculus.

(NOTE: At this point you might object that we are "cheating" in the factorial example because we're using the number 0 and the Common Lisp function 1+. But, as we shall see, those aren't actually used in the calculation itself. They are just used at the end to convert the result of the calculation into a format we can read.)

3. Closures

So how are we going to do anything useful with no primitives? We can build an identity function, but it's hard to see how we can do anything much more than that without "cheating" and using some Common Lisp primitives.

The key idea, the thing that gives the lambda calculus all of its computational power, is that of a closure. Because functions are first-class entities, we can write a function that returns a newly constructed function as its value. Here's the simplest possible example:

(λ a (λ b a))

This looks a lot like a function that returns the identity function, but notice that the function being returned is not the identity function. The value returned by the inner function (λ (b) a) is not its own argument b, it is the argument of the outer function, a. The upshot of this is that when you call this outer function, what you get back is a new function that has captured the value that you passed in, to the outer function. When you call this new function, you get back the captured value. In effect, we have built a memory location.

Here it is in action. It will be easier to understand if we use DEFINE to give these functions names:

? (define f1 (λ a (λ b a)))

#<FUNCTION F1>

? (define f2 (f1 123))

#<FUNCTION F2>

? (f2 456)

123

The value we pass 123 in to f1 it gets stored in the returned function f2. When we call f2 we get the stored value back. (Exercise: what happens to the value we pass in to f2 (i.e. the 456)?)

4. Cons cells, multi-argument functions, and Currying

Now that we have built a storage location using a closure we are tantalizingly close to being able to build one of the key primitives that will let us build a fully-fledged Lisp: the cons cell. A cons cell is an aggregate of two storage locations, which historically were called the CAR and the CDR, but which we will call LEFT and RIGHT (or LHS and RHS for left-hand-side and right-hand-side). Besides being easier to read, if we try to name them CAR and CDR then we will have naming conflicts with our underlying Common Lisp and the compiler will complain. For the same reason I'm going to use the term "pair" instead of "cons cell".

So what we want to do is to build a function that takes two values and stores them both in the same way that we stored a single value in the previous section. So we want to do something like:

(define _pair (λ (left right) ...? ))

But now we have a problem: lambda calculus functions only take one argument. So how can we possibly write _pair? By definition, it has to take two arguments. It would seem that we're stuck. (It's worthwhile stopping at this point to see if you can figure out how to solve this problem before continuing.)

One possibility is to build up our pairs in stages. Recall our single storage cell that we built using a closure:

(λ a (λ b a))

Notice that the inner function takes an argument that gets discarded, never to be seen again. What if instead of discarding this value, and instead of just returning a, we instead returned another closure that has access to both a and b? Something like:

(define _pair (λ a (λ b (λ c ...? ))))

That seems promising, but this lack of primitives is vexing. How could we possibly extract a and b from the inner closure? What we would like to do is to have c play the role of a selector to choose which of the two captured values, a or b, gets returned, but how do we do that? We don't have an IF statement. We don't even have any actual data types that we can use to represent which value we want. All we have is functions (and only functions of one argument). So how do we build a selector?

We're going to do this in two stages. First, we are going to build multi-argument functions, because if we don't we are going to drown in a sea of nested parentheses. And then we will build our selector, and from that we will build a pair.

It turns out we don’t actually need "real" multi-argument functions. Instead we can simply introduce the following notational convention (where ::= means "is defined as"):

(λ (a b) ...) ::= (λ a (λ b ...))

(f a b) ::= ((f a) b)

So if we were to re-do our original closure example, in our original "pure" notation it would look like this:

? (((λ a (λ b a)) 123) 456)

123

And in our new "compactified" notation it looks like this:

? ((λ (a b) a) 123 456)

123

Because multi-argument functions are "really" nested single-argument functions, we don't have to pass all the arguments in at once. We can do "partial evaluation":

? (define f ((λ (a b) a) 123))

#<COMPILED-LEXICAL-CLOSURE F>

? (f 456)

123

This "partial application" of a function is also called "currying", after Haskell Curry, who invented the idea. Notice that the natural consequence of this definition is that a “function of zero arguments” doesn't have to be "called". Such a function is equivalent to its value, i.e. (λ () x) ::= x.

Before we go on to build pairs, let's see how we can use currying to build a conditional construct, an "if" function. We want to define the following functions:

(define _if (λ (condition then else) ...))

(define _true (λ (...) ...))

(define _false (λ (...) ...))

in such a way that the following holds:

(_if _true then else) ➔ then

(_if _false then else) ➔ else

We accomplish this by making _true and _false functions that take then and else as arguments and return the appropriate one (i.e. _true and _false are selectors):

(define _true (λ (then else) then))

(define _false (λ (then else) else))

Our condition argument is now a selector, so all _if has to do is call the condition on the then and else arguments:

(define _if (λ (condition then else) (condition then else)))

Let's see if it works:

? (_if _true 123 456)

123

? (_if _false 123 456)

456

Looks good. Now we can use the same basic technique to build a pair:

(define _pair (λ (l r) (λ (selector) (selector l r))))

(define _lhs (λ (pair) (pair (λ (l r) l))))

(define _rhs (λ (pair) (pair (λ (l r) r))))

The selectors themselves are embedded inside the _lhs and _rhs functions, and notice that they are in fact exactly the same as _true and _false. In fact, we could have written:

(define _pair (λ (l r) (λ (selector) (selector l r))))

(define _lhs (λ (pair) (pair _true)))

(define _rhs (λ (pair) (pair _false)))

Let's see if it works:

? (_lhs (_pair 123 456))

123

? (_rhs (_pair 123 456))

456

Cool! Now that we have pairs (a.k.a. cons cells) we are one step away from being able to build a linked list, and once we can do that we can build Lisp. The only thing missing is the empty list, or NIL, which we will need to mark the end of a list. So we need two more functions:

(define _nil (λ (a) ...? ))

(define _null? (λ (thing) ...? ))

With the property that:

(_null? _nil) ➔ _true

(_null? anything_else) ➔ _false

See if you can figure out how to make this work before reading on. It's quite challenging, but you have all the tools you need to figure it out at this point.

5. Lazy vs eager evaluation

We are starting to converge on all the things we need to build a factorial function. The basic strategy is that we are going to use linked lists to represent numbers. The integer N will be represented as a linked list of length N. So adding two numbers will boil down to appending two lists, which we already know how to do (assuming we can write _nil and _null). We just take the traditional Lisp definition of APPEND:

(defun append (l1 l2)

(if (null l1) l2 (cons (car l1) (append (cdr l1) l2))))

and translate it into the lambda calculus:

(define _append

(λ (l1 l2)

(_if (_null? l1)

l2

(_pair (_lhs l1) (_append (_rhs l1) l2)))))

But if we try to run this, we will find we have two problems. First, we haven't defined _nil and _null? so here they are:

(define _nil (λ (selector) _true))

(define _null? (λ (pair) (pair (λ (l r) _false))))

Let's verify that this works:

? (_null? _nil)

#<FUNCTION _TRUE>

? (_null? (_pair _nil _nil))

#<FUNCTION _FALSE>

The second problem is more serious:

? (_append (_pair 1 _nil) (_pair 2 _nil))

> Error: Stack overflow on temp stack.

Why did this happen? It's because _if is a function, but in order to actually run this code it has to be a control construct. Here is an illustration of the problem:

? (_if _true (print 123) (print 456))

123

456

123

Because we are compiling this code into Common Lisp, and _if is a function, both branches get evaluated before the choice is made of which value to return. As a result of this, our recursive definition of _append never "bottoms out". It's an infinite loop, because the "else" clause of the _if always gets evaluated, so it always calls _append again.

There are two ways to solve this problem. The first is to make λ lazy, that is, to only evaluate arguments that are actually returned as values. This is what languages like Haskell do. And we could write a lazy λ in Common Lisp. But it turns out we won't actually need it in the end. So instead, for now, we're simply going to cheat, and define a __if macro (with two underscores) that turns our _if function into a control construct:

(defmacro __if (condition then else)

`(if (_if ,condition t nil) ,then ,else))

This uses the _if function to translate a lambda-calculus condition into the Common Lisp values T or NIL, and then uses Common Lisp's IF control construct to choose whether to evaluate the THEN or the ELSE clause. When we rewrite _append using __if instead of _if it works:

? (define _append

(λ (l1 l2)

(__if (_null? l1)

l2

(_pair (_lhs l1) (_append (_rhs l1) l2)))))

#<FUNCTION _APPEND>

? (define my-list (_append (_pair 1 _nil) (_pair 2 _nil)))

#<COMPILED-LEXICAL-CLOSURE MY-LIST>

? (_lhs my-list)

1

? (_lhs (_rhs my-list))

2

? (_null? (_rhs (_rhs my-list)))

#<FUNCTION _TRUE>

6. The Y combinator

We still have one serious "cheat" left in our code: our _append function recursively calls itself by name. This is only possible because we've given it a name using our DEFINE macro, but DEFINE is not part of the lambda calculus. Can we make a recursive function without DEFINE? The answer is yes. The way it's done is to use something called the Y combinator. This is a topic unto itself, and again, it will turn out we don't actually need it in the end, but if you want to take a deep dive into it at this point (and it's not at all a bad idea) I recommend this tutorial.

For our purposes, I'm simply going to tell you that there's this mysterious thing called the Y combinator, and it's defined like this:

(define y (λ (f) ((λ (g) (g g)) (λ (h x) ((f (h h)) x)))))

What Y does is take a function that is "almost recursive but not quite" (because it has no way to refer to itself) and "feed it back into itself" so that it can call itself recursively. Here's an illustration of how the y combinator is used:

? (define almost-factorial

(λ (f) (λ (n) (if (zerop n) 1 (* n (f (1- n)))))))

#<FUNCTION ALMOST-FACTORIAL>

? (funcall (y almost-factorial) 10)

3628800

ALMOST-FACTORIAL takes a function F and assumes that F is the factorial function. It then uses F to compute the factorial. (Note that we're cheating by using Common Lisp to do our math for us.) What Y does is essentially solve the following mathematical equation:

FACTORIAL = (ALMOST-FACTORIAL FACTORIAL)

If you want to know the details read the tutorial linked to above. Like I said, we will ultimately not use the Y combinator at all, and we have many other fish to fry.

7. Pairnums

We are finally at the point where we can take a first cut at writing a factorial function in pure lambda calculus. Our strategy is to represent numbers as linked lists, where an integer N is represented as a list of length N. We can then build up arithmetic like so:

(define pn_0 _nil)

(define pn_zero? _null?)

(define pn_add1 (λ (n) (_pair _nil n)))

(define pn_sub1 (λ (n) (_rhs n)))

PN stands for Pair-Number (pairnum for short), i.e. a number represented as lambda-calculus pairs. Zero (i.e. pn_0) in this representation is _nil, the empty list. We add one by creating a new pair that contains the pairnum that we added one to. To subtract one we simply reverse this process.

It is now straightforward to build addition, multiplication, and ultimately, the factorial function. We start by defining addition, first recursively:

(define pn_+ (λ (n1 n2)

(__if (pn_zero? n2)

n1

(pn_+ (pn_add1 n1) (pn_sub1 n2)))))

Then we convert this to a "non-cheating" version using the Y combinator:

(define pn_+ (y (λ (f)

(λ (n1 n2)

(__if (pn_zero? n2)

n1

(f (pn_add1 n1) (pn_sub1 n2)))))))

And likewise for multiplication, which is recursive addition:

(define pn_* (λ (n1 n2)

((y (λ (f)

(λ (n1 n2 product)

(__if (pn_zero? n2)

product

(f n1 (pn_sub1 n2)

(pn_+ n1 product))))))

n1 n2 pn_0)))

Note that multiplication is not exactly the same as addition. In the case of addition we could just add one to one number and subtract one from the other until we got to zero. But in the case of addition we need a third variable to accumulate the result.

Finally, the factorial:

(define pn_fact (y (λ (f)

(λ (n)

(__if (pn_zero? n)

(pn_add1 pn_0)

(pn_* n (f (pn_sub1 n))))))))

To make it easier to see what's going on at this point, let us define two utility functions to convert back and forth between pairnums and ordinary Lisp integers. Because these are just for debugging, we’ll simply define them in Lisp:

(defun pn (n &optional (pn pn_0))

(if (zerop n) pn (pn (1- n) (pn_add1 pn))))

(defun ppn (n &optional (cnt 0))

(__if (pn_zero? n) cnt (ppn (pn_sub1 n) (1+ cnt))))

PN converts a Lisp integer into a pairnum, and PPN (Print Pair Num) does the opposite:

? (ppn (pn 10))

10

We can use these to test our factorial:

? (ppn (pn_fact (pn 6)))

720

It works! But why did we only do the factorial of 6? Weren't we going to compute the factorial of 10? Here’s the reason:

? (time (ppn (pn_fact (pn 6))))

(PPN (PN_FACT (PN 6)))

took 135,980 microseconds (0.135980 seconds) to run.

77,331 microseconds (0.077331 seconds, 56.87%) of which was spent in GC.

During that period, and with 4 available CPU cores,

146,451 microseconds (0.146451 seconds) were spent in user mode

8,753 microseconds (0.008753 seconds) were spent in system mode

39,398,912 bytes of memory allocated.

801 minor page faults, 0 major page faults, 0 swaps.

720

? (time (ppn (pn_fact (pn 7))))

(PPN (PN_FACT (PN 7)))

took 4,744,137 microseconds (4.744137 seconds) to run.

2,522,207 microseconds (2.522207 seconds, 53.16%) of which was spent in GC.

During that period, and with 4 available CPU cores,

4,363,321 microseconds (4.363321 seconds) were spent in user mode

431,969 microseconds (0.431969 seconds) were spent in system mode

1,636,124,272 bytes of memory allocated.

26,995 minor page faults, 0 major page faults, 0 swaps.

5040

Notice that it took about 35 times longer to compute the factorial of 7 than the factorial of 6. This isn’t too surprising. We’ve gone out of our way to hamstring ourselves and make our code about as inefficient as it could possibly be. At best, computing the factorial of N this way would require N! steps (since that is how long it would take to construct the answer directly).

At this point you might reasonably conclude that this theory is little more than an interesting academic curiosity. I mean, come on, if you can’t even compute the factorial of 8 in non-geological time?! But bear with me, I promise you it gets better.

Before we proceed, though, let us take a moment to reflect on what we have actually done here. The code we've written looks like ordinary Lisp code, but in fact it is all built exclusively out of a single primitive: λ. In fact, we can expand our definition of pn_fact to see what it would look like if we actually wrote it all out in terms of λ. To do that manually would be very tedious, so we'll write a little helper function (in Lisp):

(defun lc-expand (expr &optional expand-lambdas)

(if (atom expr)

(if (get expr 'λ)

(lc-expand (get expr 'λ))

expr)

(if (and expand-lambdas (eq (car expr) 'λ))

(lc-expand (macroexpand-1 expr))

(mapcar 'lc-expand expr))))

This uses the raw lambda-calculus expressions that the DEFINE macro saves for us to expand a lambda-calculus function into raw lambdas. This is what our factorial function ends up looking like when expanded:

((λ (f) ((λ (g) (g g)) (λ (h x) ((f (h h)) x))))

(λ (f)

(λ (n)

(__if ((λ (pair) (pair (λ (l r) (λ (then else) else)))) n)

((λ (n)

((λ (l r) (λ (selector) (selector l r)))

(λ (selector) (λ (then else) then)) n))

(λ (selector) (λ (then else) then)))

((λ (n1 n2)

(((λ (f) ((λ (g) (g g)) (λ (h x) ((f (h h)) x))))

(λ (r)

(λ (f i l)

(__if ((λ (pair) (pair (λ (l r) (λ (then else) else)))) l)

i

(r f (f i ((λ (pair) (pair (λ (l r) l))) l))

((λ (pair) (pair (λ (l r) r))) l))))))

(λ (product n2)

((λ (n1 n2)

(((λ (f) ((λ (g) (g g)) (λ (h x) ((f (h h)) x))))

(λ (r)

(λ (f i l)

(__if ((λ (pair)

(pair (λ (l r) (λ (then else) else)))) l)

i

(r f (f i ((λ (pair) (pair (λ (l r) l))) l))

((λ (pair) (pair (λ (l r) r))) l))))))

(λ (total n)

((λ (n)

((λ (l r) (λ (selector) (selector l r)))

(λ (selector) (λ (then else) then)) n))

total))

n1 n2))

product n1))

(λ (selector) (λ (then else) then)) n2))

n (f ((λ (n) ((λ (pair) (pair (λ (l r) r))) n)) n)))))))

Wow, what a mess! No wonder it’s the very definition of horrible inefficiency. But notice that it's nothing but lambdas (and __if's)! And it actually works (for some value of “works”). But how do we get from this awful mess to the tiny compact version at the top of this post?

NOTE: If you actually try running lc-expand yourself you will not see the nicely formatted code that appears above. Instead you will see this:

((Λ F ((Λ G (G G)) (Λ (H X) ((F (H H)) X)))) (Λ F (Λ N (__IF ((Λ PAIR (PAIR (Λ (L R) (Λ (THEN ELSE) ELSE)))) N) ((Λ N ((Λ (L R) (Λ (SELECTOR) (SELECTOR L R))) (Λ SELECTOR (Λ (THEN ELSE) THEN)) N)) (Λ SELECTOR (Λ (THEN ELSE) THEN))) ((Λ (N1 N2) (((Λ F ((Λ G (G G)) (Λ (H X) ((F (H H)) X)))) (Λ R (Λ (F I L) (__IF ((Λ PAIR (PAIR (Λ (L R) (Λ (THEN ELSE) ELSE)))) L) I (R F (F I ((Λ PAIR (PAIR (Λ (L R) L))) L)) ((Λ PAIR (PAIR (Λ (L R) R))) L)))))) (Λ (PRODUCT N2) ((Λ (N1 N2) (((Λ F ((Λ G (G G)) (Λ (H X) ((F (H H)) X)))) (Λ R (Λ (F I L) (__IF ((Λ PAIR (PAIR (Λ (L R) (Λ (THEN ELSE) ELSE)))) L) I (R F (F I ((Λ PAIR (PAIR (Λ (L R) L))) L)) ((Λ PAIR (PAIR (Λ (L R) R))) L)))))) (Λ (TOTAL N) ((Λ N ((Λ (L R) (Λ (SELECTOR) (SELECTOR L R))) (Λ SELECTOR (Λ (THEN ELSE) THEN)) N)) TOTAL)) N1 N2)) PRODUCT N1)) (Λ SELECTOR (Λ (THEN ELSE) THEN)) N2)) N (F ((Λ N ((Λ PAIR (PAIR (Λ (L R) R))) N)) N)))))))

The Λ symbol is an upper-case lambda, because Common Lisp by default translates all symbols into upper-case. The nicely formatted version was produced by this code:

(defun lc-expand-pretty (expr)

(string-downcase

(let ((*print-pretty* t)

(*print-right-margin* 100))

(princ-to-string (lc-expand expr)))))

8. Reduce

If you look at the expanded pairnum factorial above you will see a lot of repetitions of the Y combinator. This is because this definition contains embedded versions of the addition and multiplication functions, each of which uses a separate Y combinator to produce recursion.

Note, however, that all of our recursions have a similar pattern: they all traverse a linked list (because that’s how we’re representing numbers) while performing some operation and accumulating a result. For addition, that operation is adding 1. For multiplication, that operation is adding one of the multiplicands. We can “abstract away” the traversal of the list so that we don’t have to do use so many Y combinators. The way this is done is to define a function called REDUCE. (And yes, this is the same “reduce” referred to in the “map-reduce” idiom made famous by Google. You will shortly see why “reduce” is a big deal.)

This is what the REDUCE function is going to look like when we’re done with it:

(define _reduce (λ (f i l)

(__if (_null? l)

i

(_reduce f (f i (_lhs l)) (_rhs l)))))))

_reduce takes a function f, an initial-value i, and a linked list l, and repeatedly calls f on i and the left-hand-side (i.e. the first element) of l. By passing pn_add1 as f we can build an addition function, and by passing that addition function as f we can build multiplication. But before we can do that we have to define _reduce properly. At the moment we have cheated by having _reduce call itself recursively, and we’re not allowed to do that. So we have to fix that first by defining an almost-finished version of _reduce and then transforming that into a recursive version using a Y combinator:

(define almost_reduce

(λ reduce (λ (f i l)

(__if (_null? l)

i

(reduce f (f i (_lhs l)) (_rhs l)))))))

(define _reduce (Y almost-reduce))

or, if we expand it out:

(define _reduce

(y (λ r (λ (f i l)

(__if (_null? l) i (r f (f i (_lhs l)) (_rhs l)))))))

Now we can redefine addition and multiplication in terms of _reduce:

(define pn_+ (λ (n1 n2)

(_reduce (λ (total n) (pn_add1 total)) n1 n2)))

(define pn_* (λ (n1 n2)

(_reduce (λ (product n2) (pn_+ product n1)) pn_0 n2)))

Let’s see where this has brought us:

? (time (ppn (pn_fact (pn 6))))

(PPN (PN_FACT (PN 6)))

took 14,669 microseconds (0.014669 seconds) to run.

12,796 microseconds (0.012796 seconds, 87.23%) of which was spent in GC.

During that period, and with 4 available CPU cores,

15,942 microseconds (0.015942 seconds) were spent in user mode

325 microseconds (0.000325 seconds) were spent in system mode

1,420,912 bytes of memory allocated.

2 minor page faults, 0 major page faults, 0 swaps.

720

My, that looks promising. A factor of 10 performance improvement. Let’s see how far we can go now:

? (time (ppn (pn_fact (pn 7))))

(PPN (PN_FACT (PN 7)))

took 75,560 microseconds (0.075560 seconds) to run.

61,365 microseconds (0.061365 seconds, 81.21%) of which was spent in GC.

During that period, and with 4 available CPU cores,

84,667 microseconds (0.084667 seconds) were spent in user mode

2,380 microseconds (0.002380 seconds) were spent in system mode

9,371,120 bytes of memory allocated.

129 minor page faults, 0 major page faults, 0 swaps.

5040

Looking good! From 4.7 seconds down to 0.075, a 60x speedup. Apparently our speedup is not just a constant factor, we have achieved a reduction in our big-O complexity. Let’s keep going…

? (time (ppn (pn_fact (pn 8))))

(PPN (PN_FACT (PN 8)))

took 508,120 microseconds (0.508120 seconds) to run.

408,459 microseconds (0.408459 seconds, 80.39%) of which was spent in GC.

During that period, and with 4 available CPU cores,

507,662 microseconds (0.507662 seconds) were spent in user mode

14,311 microseconds (0.014311 seconds) were spent in system mode

72,110,448 bytes of memory allocated.

677 minor page faults, 0 major page faults, 0 swaps.

40320

From not being able to compute the factorial of 8 in any reasonable amount of time down to being able to do so in half a second. Not bad!

? (time (ppn (pn_fact (pn 9))))

> Error: Stack overflow on temp stack.

Ooh! So close! But again, this is not too surprising. We’re building up numbers as linked lists (represented as closures!) and so unless everything is properly tail recursive we’re going to grow the stack at least as large as our result. The factorial of 9 is 362880 so it’s not surprising that we blew our stack.

Still, we’ve come a long way with a single optimization. Maybe there’s another big win out there?

9. Church numerals

The next optimization we’re going to make is to change the representation of numbers. Instead of representing the number N as a linked list of length N (which in turn is represented as a series of closures) we’re going to use a representation which is more “native” to the lambda calculus. The Lambda calculus looks a lot like Lisp, and the design of Lisp was heavily influenced by it of course, but Lisp is fundamentally based on linked lists and S-expressions while the lambda calculus is fundamentally based on functions. We can build linked lists out of functions as we have seen, but it’s awkward and inefficient. The real brilliance of Lisp is that it gives you CONS, CAR and CDR as primitives. We can build those out of functions as we have seen, but just because we can doesn’t mean we should.

Instead, we are going to “go native” and represent a number as a repeated application of a function. In other words, the number N will be represented as a function that looks sort of like this:

(λ x (f (f (f … [N times] … (f x) …)

What will the function f be? Anything we want! We can just make it a parameter:

(λ (f x) (f (f (f … [N times] … (f x) …)

So the number 0 is:

(λ (f x) x)

i.e. a function that takes an argument x and a function f and applies f to x zero times. The number 1 is:

(λ (f x) (f x))

2 is:

(λ (f x) (f (f x)))

and so on. In general, a number n will be a function defined by the following equation:

(n f x) ::= (f (f (f … [ n times ] … x)))

Numbers represented this way are called Church numerals, after Alonzo Church who invented them along with the lambda calculus. It’s not immediately apparent how this change in representation is going to help, but trust me, it will.

At this point we’re just building these things manually. If we want to compute we need to build arithmetic. As we saw in the case of pairnums, it suffices to write functions that can add and subtract 1. Everything else can be built from that. So to add 1 we need to do this:

(define cn_add1 (λ n (λ (f x) (f (n f x)))))

The CN in CN_ADD1 stands for Church Numeral of course. It is easy to see that this function does the Right Thing if you remember the defining equation above. If (n f x) is f applied n times, then (f (n f x)) must be f applied n+1 times, i.e. the Church numeral representation of n+1.

Notice that Church numerals are, essentially, little loops that iterate N times. We can use this fact to do all kinds of spiffy things. For example, we can call a Church numeral and pass in the Lisp function 1+ and an initial value of 0 to convert a Church numeral into a Lisp integer, e.g.:

? (define cn_three (cn_add1 (cn_add1 (cn_add1 cn_0))))

#<COMPILED-LEXICAL-CLOSURE THREE>

? (cn_three '1+ 0)

3

We can use this property of Church numerals to define arithmetic on them without using a Y combinator:

(define cn_+ (λ (m n) (m cn_add1 n)))

(define cn_* (λ (m n) (m (cn_+ n) cn_0)))

Notice how “natural” these definitions feel: adding M and N is just M iterations of CN_ADD1 applied to N. Multiplying M and N is just M iterations of adding N applied to zero.

Let’s also define two little utilities to convert back and forth between Churchnums and Lisp integers:

(defun cn (n) (if (zerop n) cn_0 (cn_add1 (cn (1- n)))))

(defun pcn (cn) (funcall cn '1+ 0))

Now let’s test our Churchnum math:

? (pcn (cn_+ (cn 7) (cn 9)))

16

? (pcn (cn_* (cn 7) (cn 9)))

63

We could at this point go on to define a recursive factorial using Church numerals in the same way that we did using pairnums, but we still have two missing pieces: we need to be able to subtract 1, and we need to be able to test for zero. Figuring out how to do these things is not trivial, and don’t really need them anyway because we can compute a factorial by counting up to N instead of down from N, and that is the natural way to do things using Church numerals anyway. But you might want to see if you can figure out how to do it yourself. Your lambda-fu will be stronger if you go through this exercise. But this tutorial is already getting pretty long, so I’m going to skip that part.

Instead, let’s try to construct a function F such that (N F 1) is the factorial of N. What makes this tricky is that we have to keep track of two intermediate values, not just one: the partial factorial that we have computed so far, and the value of N that we need to multiply in at the current step. But lambda-calculus functions can only return a single value. Fortunately, we have already built pairs back in section 4. We can use pairs to package two values as one. So we will put the value of N in the LHS of a pair, and N! in the RHS:

(define cn_factorial_step

(λ (pair)

(_pair (cn_add1 (_lhs pair))

(cn_* (_lhs pair) (_rhs pair)))))

(define cn_0 (λ (x f) x))

(define cn_1 (cn_add1 cn_0))

(define cn_factorial

(λ (n) (_rhs (n cn_factorial_step (_pair cn_1 cn_1)))))

Again, notice how “natural” this all feels. CN_FACTORIAL is (the right hand side of) N repetition of CN_FACTORIAL_STEP, which takes a pair (N, N!) and returns the pair (N+1, N*N!).

Let’s expand that out to see what we have wrought:

(λ (n)

((λ pair (pair (λ (l r) r)))

(n

(λ (pair)

((λ (l r) (λ (selector) (selector l r)))

((λ n (λ (f x) (f (n f x))))

((λ pair (pair (λ (l r) l))) pair))

((λ (m n)

(m ((λ (m n) (m (λ n (λ (f x) (f (n f x)))) n)) n)

(λ (f x) x)))

((λ pair (pair (λ (l r) l))) pair)

((λ pair (pair (λ (l r) r))) pair))))

((λ (l r) (λ (selector) (selector l r)))

((λ n (λ (f x) (f (n f x)))) (λ (f x) x))

((λ n (λ (f x) (f (n f x)))) (λ (f x) x))))))

That’s still pretty messy, but it’s a whole lot better than before. Notice that all of our Y combinators are gone; with Church numerals we don’t need them any more. Furthermore, all of our __if cheats are gone too! This really is a pure lambda calculus factorial! But does it work? Well, let’s see:

? (time (pcn (cn_fact (cn 7))))

(PCN (CN_FACT (CN 7)))

took 4,430 microseconds (0.004430 seconds) to run.

2,702 microseconds (0.002702 seconds, 60.99%) of which was spent in GC.

During that period, and with 4 available CPU cores,

6,208 microseconds (0.006208 seconds) were spent in user mode

363 microseconds (0.000363 seconds) were spent in system mode

1,381,040 bytes of memory allocated.

33 minor page faults, 0 major page faults, 0 swaps.

5040

Looking promising: 4ms to compute the factorial of 7, compared to 24 using pairnums, so another factor of 6 performance improvement. Alas…

? (pcn (cn_fact (cn 8)))

> Error: Stack overflow on temp stack.

One step forward, one step back. And unfortunately we have also gotten to the point where the next step is not so easy to explain or understand. This particular problem took me several hours to figure out and fix. The problem turns out to be in the definition of CN_ADD1:

(define cn_add1 (λ n (λ (f x) (f (n f x)))))

Because we are applying the last f after the first n f’s, this version of cn_add1 is not tail-recursive. We can make it tail-recursive by applying the extra f first instead of last:

(define cn_add1 (λ n (λ (f x) (n f (f x)))))

If you didn’t understand that, don’t worry about it. Tail recursion is another topic that would take us fair afield, and it doesn’t really matter. Just take my word for it for now.

In any event, having made this fix, we can now for the first time compute the factorial of 10:

? (time (pcn (cn_fact (cn 10))))

(PCN (CN_FACT (CN 10)))

took 4,598,543 microseconds (4.598543 seconds) to run.

2,715,376 microseconds (2.715376 seconds, 59.05%) of which was spent in GC.

During that period, and with 4 available CPU cores,

4,324,991 microseconds (4.324991 seconds) were spent in user mode

332,731 microseconds (0.332731 seconds) were spent in system mode

1,427,101,264 bytes of memory allocated.

65,028 minor page faults, 0 major page faults, 0 swaps.

3628800

A tad on the slow side still, but personally I think the fact that it works at all is pretty amazing.

In the next section we’re going to take this to a whole ‘nuther level.

10. Extreme optimization

To give credit where credit is due, I did not figure out any of the content of this section on my own. Everything from here on out is due to Bertram Felgenhauer.

First, let us recall our current baseline, which I have rewritten here as a single block of code:

(λ n

(_rhs

(n (λ (pair)

(_pair (cn_add1 (_lhs pair))

(cn_* (_lhs pair) (_rhs pair))))

(_pair (cn_add1 cn_0) (cn_add1 cn_0)))))

Our first optimization is to use the fact that we have implemented a pair as a function that calls a selector function with the lhs and rhs as arguments. Because we want to operate on both the lhs and the rhs at the same time, we can bypass these selectors and just call the pair on a function that does the computation we want to perform:

(λ n

(_rhs

(n (λ (pair)

(pair (lambda (l r) (_pair (cn_add1 l) (cn_* l r)))))

(_pair (cn_add1 cn_0) (cn_add1 cn_0))))))

Next, we switch to continuation-passing style so that we don’t have to create pairs at all. Instead, we pass a multi-argument continuation in as an argument to the loop:

(λ n

(n (λ (c i m) (c (cn_add1 i) (cn_* i m)))

(λ (i m) m)

(cn_add1 cn_0) (cn_add1 cn_0))))

Next we factor out cn_add1:

(funcall

(λ (add1 n)

(n (λ (c i m) (c (add1 i) (cn_* i m)))

(λ (i m) m)

(add1 cn_0) (add1 cn_0)))

cn_add1))

Because we are going for extreme compactness now, I’m going to rename the variable “add1” to “s” (for “successor”):

(funcall

(λ (s n)

(n (λ (c i m) (c (s i) (cn_* i m)))

(λ (i m) m)

(s cn_0) (s cn_0)))

cn_add1))

Now we expand out the references to cn_* and cn_add1. We could just do this directly, but it’s a bit tedious because cn_* is defined in terms of cn_+ which is defined in terms of cn_add1. Instead, I’m going to use a devilishly clever way of multiplying Church numerals:

(define cn_* (λ (m n) (λ (f) (m (n f)))))

Convincing yourself that this works is left as an exercise. The result is this:

(funcall

(λ (s n)

(n (λ (c i m) (c (s i) ((λ (m n) (λ (f) (m (n f)))) i m)))

(λ (i m) m)

(s cn_0) (s cn_0)))

(λ n (λ (f x) ((n f) (f x)))))

There is one last piece of low-lying fruit here, and that is that we can expand out the call to cn_*:

(funcall

(λ (s n)

(n (λ (c i m) (c (s i) (λ (f) (i (m f)))))

(λ (i m) m)

(s cn_0) (s cn_0)))

(λ n (λ (f x) ((n f) (f x)))))

And finally, we substitute the definition of cn_0 (and factor it out while we’re at it):

((λ (zero s n)

(n (λ (c i m)(c (s i) (λ (f) (i (m f)))))

(λ (i m) m)

(s zero) (s zero)))

(λ (f x) x)

(λ n (λ (f x) ((n f) (f x)))))

This is quite a respectable showing. It’s pure lambda calculus, it’s pretty small, and it computes the factorial of 10 in about 3.5 seconds. One might think that we are approaching the limits of what can be accomplished, but no. There is one final optimization we can make that will improve our performance by a factor of 5! (N.B.: the exclamation point is there for emphasis. It a factor of 5, not 5 factorial. )

Notice that the base case for our iteration is a pair of ones. We are constructing our ones by calling the successor function on the Church numeral zero. In other words:

one ::= (cn_add1 cn_0) == ((λ n (λ (f x) (n f (f x))))) (λ (f x) x)))

But we don’t need all this generality. We can just define one directly:

(define one (λ (f x) (f x)))

That will give us a performance improvement of about 1.5. The final factor of 3 (or so) comes from recalling that:

(λ (f x) (f x)) ::= (λ f (λ x (f x)))

In other words, ONE is not really a function of two arguments, it’s a function of one argument (f) that returns a closure which applies f to a second argument (x). But a closure that applies f to its argument is the same as the function f itself, so our definition of ONE is actually equivalent to the identity function! So we can substitute the identity function for the number one and build our base case from that:

((λ (one s n)

(n (λ (c i m)(c (s i) (λ (f) (i (m f)))))

(λ (i m) m)

one one))

(λ x x)

(λ n (λ (f x) ((n f) (f x)))))

And that version computes 10! in 680ms on my Macbook Pro.

We’re still not quite done, though at this point we are getting very near the end. Felgenhauer’s final version improves on this last one by only about 20% in terms of run-time performance, and the optimizations become progressively more difficult to explain. But just for completeness, I will include the final steps here as they were explained to me by Felgenhauer:

At this point, note that inside the step function, we have access to three values: the current values of 'c' and 'i', and also the return value of 'c (succ i) (mul i m)', which equals n!. This seems wasteful, so the next step is to accumulate the product in the return value instead. This works out because multiplication is commutative and associative. The result looks like this:

(λ n (n (λ (c i) (mul i (c (succ i))))

(λ i one)

one)))

The size [of the code] is now [quite small], but the transformation was non-trivial. The final step is a peephole optimization (after inlining 'mul') that exploits the shape of Church numerals; rather than having 'c' return a Church numeral, we return a repeated iteration of a fixed 'f'. This results in the final [form of the code]:

(λ (n f) (n (λ (c i) (i (c (succ i))))

(λ (i) f)

one)))

And, of course, having come this far, we have to take it up to 11:

? (time (pcn (fac (cn 11))))

(PCN (FAC (CN 11)))

took 6,771,754 microseconds (6.771754 seconds) to run.

3,017,167 microseconds (3.017167 seconds, 44.56%) of which was spent in GC.

During that period, and with 4 available CPU cores,

6,331,324 microseconds (6.331324 seconds) were spent in user mode

516,300 microseconds (0.516300 seconds) were spent in system mode

3,257,955,856 bytes of memory allocated.

5 minor page faults, 0 major page faults, 0 swaps.

39916800

11. Conclusions

Think about what we have done here: we have taken a mathematical formalism that at first glance looks like it would not be capable of doing anything practical and used it to compute the factorial of 10 (and even 11!) on a real machine in a few seconds. Not only that, but we have done this by compiling code in this formalism into x86 machine code, and we’ve done all that in a few dozen lines of code. All this was made possible by two key insights:

1. We can make storage locations and multi-argument functions out of closures

2. We can produce extreme performance improvements by abstracting away recursion through constructs like REDUCE and Church numerals.

But the real lesson here is that functional programming is not just an academic curiosity suitable only for the ivory tower. The theory of functional programming can produce practical results on real-world problems. This is not to say that computing factorials is all that useful, but if you can do factorials in the pure lambda calculus, imagine what you can do with an industrial-strength functional language like Haskell.