In 2001, my favorite programming language was Python. In 2008, my favorite programming language was Scheme. In 2014, my favorite programming language is x64 assembly. For some reason, that progression tends to surprise people. Come on a journey with me.

Python

In this article, we’re going to consider a very simple toy problem: recursively summing up a list of numbers.

def sum_list(list):
    if len(list) == 0:
        return 0
    else:
        return list[0] + sum_list(list[1:])

>>> sum_list(range(101))
5050

Young Carl Gauss would be proud.

>>> sum_list(range(1001))
RuntimeError: maximum recursion depth exceeded

Oops.

Young programmers often learn from this type of experience that recursion sucks. (Or, as a modern young programmer might say, it doesn’t scale.) If they Google around a bit, they might find the following “solution”:

>>> import sys
>>> sys.setrecursionlimit(1500)
>>> sum_list(range(1001))
500500

If they have a good computer science teacher, though, they’ll learn that the real solution is to use something called tail recursion. This is a somewhat mysterious, seemingly arbitrary concept. If the result of your recursive call gets returned immediately, without any intervening expressions, then somehow it “doesn’t count” toward the equally arbitrary recursion depth limit. Our example above isn’t tail-recursive because we add list[0] to sum_list(list[1:]) before returning the result. In order to make sum_list tail-recursive, we have to add an accumulator variable, which represents the sum of those numbers we’ve looked at already. We’ll call this version sum_sublist, and wrap it in a new sum_list function which calls sum_sublist with the initial accumulator 0 (initially, we haven’t looked at any numbers yet, so their sum is 0).

def sum_list(list):
    def sum_sublist(accum, sublist):
        if len(sublist) == 0:
            return accum
        else:
            return sum_sublist(accum + sublist[0], sublist[1:])
    return sum_sublist(0, list)

>>> sum_list(range(101))
5050

So far, so good.

>>> sum_list(range(1001))
RuntimeError: maximum recursion depth exceeded

Wait, what?

On Wednesday, April 22, 2009, Guido van Rossum wrote:

> A side remark about not supporting tail recursion elimination (TRE)
> immediately sparked several comments about what a pity it is that Python
> doesn’t do this, including links to recent blog entries by others trying to
> “prove” that TRE can be added to Python easily. So let me defend my position
> (which is that I don’t want TRE in the language). If you want a short
> answer, it’s simply unpythonic. Here’s the long answer: [snipped]
>
> Third, I don’t believe in recursion as the basis of all programming. This is
> a fundamental belief of certain computer scientists, especially those who
> love Scheme… [snipped]
>
> Still, if someone was determined to add TRE to CPython, they could modify
> the compiler roughly as follows…

In other words, the only reason this doesn’t work is that Guido van Rossum prefers it that way. Guido, I respect your right to your opinion, but the reader and I are switching to Scheme.

Scheme

Here’s a line-by-line translation:

(define (sum_list list)
  (define (sum_sublist accum sublist)
    (cond ((null? sublist)   ; tests if sublist has length 0
           accum)            ; don't need return statement in Scheme
          (else
           (sum_sublist (+ accum (car sublist))
                        (cdr sublist)))))
  (sum_sublist 0 list))

guile> (sum_list (iota 1001))
500500

Phew! Let’s make sure that we aren’t just getting lucky with a bigger recursion limit:

guile> (sum_list (iota 10000001))
50000005000000

Well, isn’t that neat? If we go much bigger, it’ll take a long time, but as long as the output fits into memory, we’ll get the right answer.

Named Let

In our last two versions of sum_list, we defined a helper function (sum_sublist), and the rest of the body of sum_list was just a single invocation of that helper function. This is an inelegant pattern, which Scheme has a construct to address.

(define (sum_list list)
  (let sum_sublist ((accum 0) (sublist list))  ; the named let!
    (cond ((null? sublist)
           accum)
          (else
           (sum_sublist (+ accum (car sublist))
                        (cdr sublist))))))

Named let creates a function and invokes it (with the provided initial values) in one step. It is decidedly my favorite control structure of all time. You can have your while loops and your for loops, and your do … until loops too. I’ll take named let any day, because it provides the abstraction barrier of recursion without compromising the conciseness and efficiency of iteration. In case you’re not sufficiently impressed, I discuss the delightful properties of using recursion instead of non-recursive loops below.
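There is no named let in Python, but for comparison, here’s a rough sketch of the same shape: bind initial values, then loop by “re-calling” with new values. The `named_let` helper below is my own invention for illustration, not anything standard.

```python
def named_let(step, **bindings):
    """Crude stand-in for Scheme's named let: keep re-applying `step`
    to the current bindings until it returns a final (non-dict) result."""
    while True:
        result = step(**bindings)
        if not isinstance(result, dict):
            return result      # the "base case" value falls out here
        bindings = result      # otherwise, loop with the new bindings

def sum_list(lst):
    def sum_sublist(accum, sublist):
        if not sublist:
            return accum                        # final answer
        return dict(accum=accum + sublist[0],   # "re-call" with new bindings
                    sublist=sublist[1:])
    return named_let(sum_sublist, accum=0, sublist=lst)
```

Because `named_let` drives the loop itself rather than recursing, this version doesn’t consume stack and sails past Python’s recursion limit.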

Assembly

Named let style translates amazingly naturally into assembly.

bits 64

; macros for readability
%define list rdi        ; by calling convention, argument shows up here
%define accum rax       ; accumulator (literally!)
%define sublist rdx

global sum_list
sum_list:
        mov accum, 0            ; these are the let-bindings!
        mov sublist, list
.sum_sublist:
        test sublist, sublist   ; is it NULL?
        jnz .else               ; if not, goto else
        ret                     ; accum (because return value is rax by calling convention)
.else:
        add accum, [sublist]     ; ~ accum = accum + car(sublist);
        mov sublist, [sublist+8] ; ~ sublist = cdr(sublist);
        jmp .sum_sublist         ; tail-recurse

> sum_list(from(1,100))
5050
> sum_list(from(1,10000000))
50000005000000

(Sadly, my assembler doesn’t come with its own REPL; we’re borrowing the LuaJIT REPL instead.)

In fact, if I weren’t so comfortable with named let, I doubt I’d be an effective assembly coder, because assembly doesn’t really have any other iteration constructs. But I don’t miss them. What would they look like, anyway?

In the next installment of Python to Scheme to Assembly, we will look at call-with-current-continuation.

Addendum: C

In this addendum, we’re going to look at the assembly for iteration, non-tail recursion, and tail recursion, as emitted by gcc , and get to the bottom of what the difference is anyway.

At the top of each C file here, we have the following:

#include <stdint.h>

struct number_list {
    uint64_t number;
    struct number_list *next;
};

Iteration

If I were solving this problem in the context of a C program, this is how I would do it.

uint64_t sum_list(struct number_list *list)
{
    uint64_t accum = 0;
    while (list) {
        accum += list->number;
        list = list->next;
    }
    return accum;
}

Here’s the generated assembly, translated to nasm syntax and commented.

global sum_list
sum_list:
        xor eax, eax        ; equivalent to "mov rax, 0" but faster
        ; in C it's fine to clobber rdi instead of copying it first
        test rdi, rdi       ; <- same as ours
        jz .done            ; here the "if NULL" case is at the bottom
.else:
        add rax, [rdi]      ; <- same as ours
        mov rdi, [rdi+8]    ; <- same as ours
        test rdi, rdi       ; <- same as ours, but duplicated
        jnz .else           ; <- same as ours
.done:
        rep ret             ; equivalent to "ret", but faster on old AMD chips for no good reason

This is almost identical to the assembly that I wrote, except that it clobbers one of its inputs (which is perfectly allowed by the C calling convention), it uses xor instead of mov to load 0 (a solid optimization), it uses rep ret (less compact and of no benefit on Intel chips), and it shuffles the instructions around such that two tests are needed (almost certainly not helpful with modern branch prediction and loop detection). I haven’t run benchmarks on this, but my guess is that it would come out about even. (Both versions are eight instructions long.) I also think the shuffling makes this “iterative” version more opaque and difficult to reason about (not least because of the duplicated test) than my “named let”-style code.

Non-Tail Recursion

uint64_t sum_list(struct number_list *list)
{
    if (!list) {
        return 0;
    } else {
        return list->number + sum_list(list->next);
    }
}

gcc -O3 can almost completely convert this version to iteration, so let’s look at the generated assembly from gcc -O1 to get a better sense of what it might look like in a language implementation for which the necessary optimizations are too complex to be made automatically.

global sum_list
sum_list:
        push rbx            ; preserve the current value of rbx on the stack
        mov rbx, rdi        ; replace rbx by the argument to the function, list
        mov eax, 0          ; set up 0 in the result register
        test rdi, rdi       ; check if rdi is NULL
        jz .else            ; if so, go to else
        mov rdi, [rdi+8]    ; ~ list = list->next;
        call sum_list       ; sum_list(list) -> result register (rax)
        add rax, [rbx]      ; add list->number (preserved across the function call) to rax
.else:
        pop rbx             ; restore the state of rbx
        ret                 ; return rax

We can see immediately that some new instructions (push, pop, and call) have been introduced. These are all stack manipulation instructions. If we carefully pretend to be the CPU running this program, we can see that it pushes the address of every number in the linked list, and then dereferences and adds them up as it pops them from the stack. This is not good; if we wanted our entire data structure to be replicated on the stack, we would have passed it by value! It’s generally the amount of memory set aside for the stack that we’ve actually run out of when we hit a recursion depth exceeded error.
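As an illustration (the function name here is mine, and this is a sketch, not what gcc emits), the CPU’s behavior can be imitated in Python by making the hidden stack explicit:

```python
def sum_list_via_stack(lst):
    # phase 1: walk the list, pushing each number -- this is what the
    # recursive calls do implicitly, one stack frame per element
    stack = []
    while lst:
        stack.append(lst[0])
        lst = lst[1:]
    # phase 2: pop and add on the way "back up" the recursion
    accum = 0
    while stack:
        accum += stack.pop()
    return accum
```

The explicit `stack` grows to the full length of the input, just as the call stack does in the non-tail-recursive version; the only difference is that Python lists grow on the heap, so we don’t crash.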

Tail Recursion

What about translating the tail-recursive version into C? Like Scheme and Python, gcc supports nested function definitions (as a GNU extension to C), so this is no problem:

uint64_t sum_list(struct number_list *list)
{
    uint64_t sum_sublist(uint64_t accum, struct number_list *sublist)
    {
        if (!sublist) {
            return accum;
        } else {
            return sum_sublist(accum + sublist->number, sublist->next);
        }
    }
    return sum_sublist(0, list);
}

Here’s what gcc -O1 gives us (translated and commented as before):

sum_sublist.1867:               ; a random constant has been added to avoid polluting the namespace -- not the best solution, but okay
        sub rsp, 8              ; decrement the stack pointer by one 8-byte machine word -- seems unnecessary...
        mov rax, rdi            ; copy first argument (rdi/"accum") into result register (rax)
        test rsi, rsi           ; test second argument (rsi/"sublist") for nullity
        jz .else                ; if null, goto else
        add rdi, [rsi]          ; ~ accum = accum + sublist->number;
        mov rsi, [rsi+8]        ; ~ sublist = sublist->next;
        call sum_sublist.1867   ; recurse; result appears in rax, ready to pass along (as the return value) to the next caller on the stack
.else:
        add rsp, 8              ; seems unnecessary
        ret                     ; return rax

sum_list:
        sub rsp, 8
        mov rsi, rdi            ; first argument (rdi/"list") of sum_list becomes second argument (rsi/"sublist") of sum_sublist
        mov rdi, 0              ; first argument (rdi/"accum") of sum_sublist is 0
        call sum_sublist.1867   ; call sum_sublist!
        add rsp, 8
        ret


In this mode, the tail call is not being eliminated – although we’re no longer pushing rbx, we’re still pushing rip to the stack with every call, and eventually we’ll run out of stack that way. The only way to get around this is to replace each call with a jmp: since we’re just going to take the return value of the next recursive invocation and then immediately ret back to the previous caller on the stack, there’s no point in even inserting our own address on the stack (as call does); we can just set up the next guy to pass the return value straight back to the previous guy, and quietly disappear.

gcc -O3 does this. In fact, somewhat surprisingly, it generates exactly the same assembly, line for line, for this version as for the purely iterative version above. That’s “tail call optimization” (TCO) or “tail recursion elimination” (TRE) in its most aggressive form: it literally just gets rid of all the calls and recursion and replaces them with an equivalent iteration (complete with the duplicated test).
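In a higher-level language, this call-to-jmp rewrite is exactly the familiar conversion of a tail-recursive function into a loop. As a Python sketch (an illustration of the transformation, not gcc’s output):

```python
def sum_list(lst):
    # each loop iteration replaces what would have been a tail call:
    # rebind the "arguments" and jump back to the top
    accum, sublist = 0, lst
    while sublist:
        accum, sublist = accum + sublist[0], sublist[1:]
    return accum
```

The rebinding of `accum` and `sublist` plays the role of setting up the next call’s argument registers, and the `while` plays the role of the jmp.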

The upshot of all this is that not only does Scheme’s “named let” recursion form translate neatly into assembly, it provides – penalty-free – a better abstraction than either iteration (while-loop imitation) or stack-driven recursion, the two options gcc appears to pick from when dealing with various ways to code a list traversal.

Actually, the real point I’m trying to make here is that, unlike in C, I can naturally do named let directly in assembly, and that’s one of the many reasons working in assembly makes me happy.

Appendix: What’s so great about recursion, anyway?

For me, the most important point in favor of a recursive representation of loops is that I find it easier to reason about correctness that way.

Any function we define ought to implement some ideal mathematical function that maps inputs to outputs. If our code truly does implement that ideal function, we say that the code is correct. Generally, we can break down the body of a function as a composition of smaller functions; even in imperative languages, we can think of every statement as pulling in a state of the world, making well-defined changes, and passing the new state of the world into the next statement. At each step, we ask ourselves, “are the outputs of this function going to be what I want them to be?” For loops, though, this gets tricky.
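The standard way to reason about a loop directly is to find a loop invariant: a fact that holds at the top of every iteration. As a sketch for our running example (the invariant is written out as a runtime check purely for illustration; a real proof would discharge it once, on paper):

```python
def sum_list(lst):
    accum, remaining = 0, lst
    while remaining:
        # loop invariant: what we've accumulated plus what's left
        # always equals the total we're after
        assert accum + sum(remaining) == sum(lst)
        accum += remaining[0]
        remaining = remaining[1:]
    # invariant + empty `remaining` gives accum == sum(lst)
    return accum
```

Finding the right invariant is the hard part, and it has to be done afresh for every loop; that is the burden the recursive formulation lifts.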

What recursion does for us as aspiring writers of correct functions is automatic translation of the loop verification problem into the much nicer problem of function verification. Intuitively, you can simply assume that all invocations of a recursive function within its own body are going to Do The Right Thing, ensure that the function as a whole Does The Right Thing under that assumption, and then conclude that the function Does The Right Thing in general. If this sounds like circular reasoning, it does; but it turns out to be valid anyway.

There are many ways to justify this procedure formally, all of which are truly mind-bending. But once you’ve justified this procedure once, you never have to do it again (unlike ad-hoc reasoning about loops). I’ve determined that the most elegant way to explain it is by expanding our named let example into a non-recursive function, which just happens to accept as a parameter a correct version of itself.

(define (sum_list list)
  (define (sum_sublist_nonrec f_correct accum sublist)
    (cond ((null? sublist)
           accum)
          (else
           (f_correct f_correct
                      (+ accum (car sublist))
                      (cdr sublist)))))
  (sum_sublist_nonrec sum_sublist_nonrec 0 list))

Now, sum_sublist_nonrec is an honest-to-goodness non-recursive function, and we can check that it is correct. Given a correct function f_correct (which takes as inputs a correct version of itself, a number, and a list, and correctly returns the sum of all the elements in the list plus the number), a number, and a list, does sum_sublist_nonrec correctly return the sum of all elements in the list plus the number? Why yes, it does. (Constructing a formal proof tree for this claim is left as an exercise for the self-punishing reader.) Note that since f_correct is assumed to already be correct, the correct version of it is still just f_correct, so we can safely pass it to itself without violating our assumptions or introducing new ones. So, sum_sublist_nonrec is correct.

Now let’s consider the correctness of sum_list. It’s supposed to add up all the numbers in list. What it actually does is to apply the (correct) function sum_sublist_nonrec, passing in a correct version of itself (check! it’s already correct), a number to add the sum of the list to (check! adding zero to the sum of the list won’t change it), and the list (check! that’s what we’re supposed to sum up).

We’ve just proved our program correct! The magic of named let is that it generates this clumsy form with a bunch of f_corrects from a compact and elegant form. In so doing, it lets us get away with much less formal reasoning while still having the confidence that it can be converted into something like what we just slogged through. Rest assured that no matter what you do with named let, no matter how complicated the construct you create, this “assume it does the right thing” technique still applies!

With one tiny caveat. We haven’t proved that the program terminates. If this technique proved termination, then we could just write

(define (do-the-right-thing x)
  (let does-the-right-thing ((x x))
    (does-the-right-thing x)))

and it would be totally correct, no matter what thing we want it to do.

Technically, everywhere I’ve said “correct”, what I mean is partially correct: if it terminates, then the output is correct. (Equivalently, it definitely won’t return something incorrect.) do-the-right-thing is, in fact, partially correct: it never returns at all, so it won’t give you any incorrect outputs!

Termination proofs of recursive functions can usually be handled by structural induction on possible inputs: you establish that it terminates for minimal elements (e.g. the empty list) and that termination for any non-minimal element is dependent only on termination for some set of smaller elements (e.g. the tail of the list). The structure that you need in order to think about termination this way is also much clearer with recursion than with iteration constructs.
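As a sketch of that argument for our running example, the decreasing measure can be made explicit (the assert is just for illustration; in a proof it would be an observation, not a runtime check):

```python
def sum_list(lst):
    if not lst:                      # minimal element: terminates immediately
        return 0
    tail = lst[1:]
    assert len(tail) < len(lst)      # the measure strictly decreases
    # termination for lst depends only on termination for the smaller tail
    return lst[0] + sum_list(tail)
```

Since every chain of recursive calls walks down a strictly decreasing, non-negative measure, it must bottom out at the empty list, so the function terminates on every finite input.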