This is re-worked from an excerpt of an answer of mine to an SO question, which made me discover some interesting things about Python: it turns out it’s not easy to make recursion faster by doing tail call elimination, even by hand.

We mean here to look at methods of tail call elimination and their effect on performance, demonstrated within the limits of the language, with merge sort as the running example. Tail call elimination aims at turning tail-recursive calls into iterative loops, providing the benefit of not having to manage a recursive call stack, and thus, hopefully, of making things a bit faster. It is also used to deal with functions that are mutually recursive, or, more generally, whose number of recursive calls makes the call stack grow very fast.
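As a minimal illustration of what that transformation does (a toy example of mine, not from the benchmark; the names `sum_tail` and `sum_loop` are made up), here is a tail-recursive sum and the loop it can be mechanically rewritten into:

```python
# A tail-recursive function: the recursive call is the very last thing done,
# so the current frame carries no pending work.
def sum_tail(lst, acc=0):
    if not lst:
        return acc
    return sum_tail(lst[1:], acc + lst[0])  # tail call

# Tail call elimination turns it into this loop: the arguments of the
# "next call" simply overwrite the current ones, and the stack never grows.
def sum_loop(lst, acc=0):
    while lst:
        lst, acc = lst[1:], acc + lst[0]
    return acc
```

Both compute the same result; only the second one runs in constant stack space.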

Python is an interesting example for demonstrating tail call elimination within the language itself, because it does not have tail call optimization built in. Some languages (generally functional ones) have built-in tail call elimination, meaning every tail call you make gets compiled to an iterative loop. If you want an iterative merge sort in one of those languages, all you need to do is rewrite the classic version so that its recursive calls are tail calls, using a CPS translation, as we will show below.

Of course, if what you want is a fast sort in Python, the native sort, numpy’s alternative, or even an in-place quicksort are better choices. Even more impressively, faster implementations of a sort that closely mimics the recursive pattern of merge sort have been proposed on reddit in response to this post. And finally, if you wish to break the limits of what the language intends “natively”, it should be interesting to turn to one of the famous tricks for achieving TCO in Python. Our goal here is to apply tail call elimination within language boundaries, and to see whether we obtain a TCO’ed function faster than the function it was derived from.

The exhaustive timeit benchmark of all implementations discussed in this post (including bubblesort) is posted as a gist here. As a starting point, I find the following results for the average duration of a single run:
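For reference, a timing harness along these lines can be written with timeit (a sketch only; the exact setup lives in the gist, and `average_run_time` is a name of my own):

```python
import random
import timeit

def average_run_time(sort_fn, n_runs=100, size=10000, max_val=10000000):
    """Average duration of a single run of sort_fn, in seconds."""
    data = [random.randint(0, max_val) for _ in range(size)]
    # Each run sorts a fresh copy, so every call sees an unsorted list.
    total = timeit.timeit(lambda: sort_fn(list(data)), number=n_runs)
    return total / n_runs
```

Called as `average_run_time(sorted)`, it reproduces the kind of numbers quoted below.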

Python’s native (Tim)sort: 0.0144600081444

Bubblesort: 26.9620819092

Those are seconds (but since only the relative speed matters, this isn’t very important). As expected, Python’s native sort is much faster. And to think that Timsort is supposed to be a merge sort!

For starters, we can write a relatively efficient, non-tail-recursive merge sort in the following way:

```python
def mergeSort(array):
    n = len(array)
    if n <= 1:
        return array
    left = array[:n//2]
    right = array[n//2:]
    return merge(mergeSort(left), mergeSort(right))

def merge(array1, array2):
    merged_array = []
    while array1 or array2:
        if not array1:
            merged_array.append(array2.pop())
        elif (not array2) or array1[-1] > array2[-1]:
            merged_array.append(array1.pop())
        else:
            merged_array.append(array2.pop())
    merged_array.reverse()
    return merged_array
```

It runs, as expected, quite a bit faster than the bubblesort. Average run time: 0.126505711079.

Now, our example may not be the best test of the difficulty of managing a huge call stack depth. The number of recursive calls R(n) on a list of size n is 2n − 2: for $k \geq 2$, $R(k) = 2 + 2R(k/2)$, and $R(1) = 0$. But in depth, the stack blowup of the recursive merge sort algorithm is in fact modest: as soon as you get out of the leftmost path in the array-slicing recursion pattern, the algorithm starts returning (and removing frames). So for 10K-sized lists, the function stack at any given time has size at most $\log_2(10000) \approx 14$ … pretty modest. But we hope it might still give some glimpse of a hint as to what our TCO methods entail.
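One can sanity-check the $R(n) = 2n - 2$ count by instrumenting the recursion with a throwaway counter (a quick check of mine, not part of the benchmarked code):

```python
def count_calls(n):
    """Count the recursive mergeSort-style calls made on a list of length n."""
    calls = [0]
    def go(array):
        m = len(array)
        if m <= 1:
            return
        calls[0] += 2          # this level issues two recursive calls
        go(array[:m//2])
        go(array[m//2:])
    go(list(range(n)))
    return calls[0]
```

For instance, `count_calls(10)` gives 18, i.e. 2 × 10 − 2, matching the recurrence.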

The call to merge is not tail-recursive; the function calls both mergeSort(left) and mergeSort(right) recursively while there is remaining work in the call (the merge). But we can make it tail-recursive by using CPS. Note that this would only be useful as-is if we were using a language that does tail call optimization; without it, it’s a bad idea. Indeed, this is only workable in Python if we do further work on this intermediary function, which, by itself, runs out of stack space for even modest lists:

```python
def cps_merge_sort(array):
    return cpsmergeSort(array, lambda x: x)

def cpsmergeSort(array, continuation):
    n = len(array)
    if n <= 1:
        return continuation(array)
    left = array[:n//2]
    right = array[n//2:]
    return cpsmergeSort(left,
        lambda leftR: cpsmergeSort(right,
            lambda rightR: continuation(fastmerge(leftR, rightR))))
```

Once this is done, we can do TCO by hand, deferring the call stack management done by recursion to the while loop of a normal function (trampolining, explained e.g. here or there, a trick originally due to Guy Steele). Trampolining and CPS work great together. We write a thunking function that “records” and delays application: it takes a function and its arguments, and returns a function that returns (that original function applied to those arguments):

```python
thunk = lambda name, *args: lambda: name(*args)
```

We then write a trampoline that manages calls to thunks: it applies a thunk until the thunk returns a result (as opposed to another thunk):

```python
def trampoline(bouncer):
    while callable(bouncer):
        bouncer = bouncer()
    return bouncer
```

Then all that’s left is to “freeze” (thunk) all the recursive calls from the original CPS function, to let the trampoline unwrap them in proper sequence.
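Before applying this to the merge sort, the mechanism can be seen on a toy tail-recursive factorial (an illustrative example of mine; the thunk/trampoline pair is repeated so the snippet stands alone):

```python
thunk = lambda name, *args: lambda: name(*args)

def trampoline(bouncer):
    while callable(bouncer):
        bouncer = bouncer()
    return bouncer

# A thunked tail-recursive factorial: each "call" returns a closure
# instead of growing the Python call stack.
def fact(n, acc=1):
    if n <= 1:
        return acc
    return thunk(fact, n - 1, acc * n)
```

`trampoline(fact(5))` bounces through five thunks and returns 120, with a call stack of depth one throughout.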
The function now returns a thunk, without recursion (and discarding its own frame), at every call:

```python
def tco_cpsmergeSort(array, continuation):
    n = len(array)
    if n <= 1:
        return continuation(array)
    left = array[:n//2]
    right = array[n//2:]
    return thunk(tco_cpsmergeSort, left,
        lambda leftR: thunk(tco_cpsmergeSort, right,
            lambda rightR: continuation(fastmerge(leftR, rightR))))

mycpomergesort = lambda l: trampoline(tco_cpsmergeSort(l, lambda x: x))
```

This gives us a recursive merge sort in which the call stack has size at most one at any given time: by returning a thunk, our originally recursive procedure hands back the “address” of the next function to call. But this does not go that fast (recursive mergesort: 0.126505711079, this trampolined version: 0.170638551712). Indeed, each recursive call needs to create a closure (the thunk), and the trampoline loop has to test for a closure (callable) continuously, so naive trampolining is overall pretty expensive, unless you would otherwise reach the end of the call stack, in which case thunks may prove of some use.
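That last point can be demonstrated directly (a quick experiment of mine, with the thunk/trampoline machinery repeated so the snippet is self-contained): a plain recursive countdown blows the interpreter’s recursion limit, while its trampolined twin handles the same depth in constant stack space.

```python
import sys

thunk = lambda name, *args: lambda: name(*args)

def trampoline(bouncer):
    while callable(bouncer):
        bouncer = bouncer()
    return bouncer

def countdown_rec(n):
    return 0 if n == 0 else countdown_rec(n - 1)

def countdown_tco(n):
    return 0 if n == 0 else thunk(countdown_tco, n - 1)

depth = sys.getrecursionlimit() + 100

# The plain recursive version exceeds the recursion limit at this depth.
# (RecursionError subclasses RuntimeError, so this catch works on old
# Pythons too, where the limit raised a plain RuntimeError.)
try:
    countdown_rec(depth)
    blew_up = False
except RuntimeError:
    blew_up = True

# The trampolined version bounces through the same depth without issue.
result = trampoline(countdown_tco(depth))
```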

We can do a slightly more involved, stack-based TCO elimination along the lines of this SO answer (rehashed here). The idea is to treat the recursion as storage of work to be done and work accomplished so far, with tags to differentiate between the two. Reproducing a recursive pattern then just consists in managing the order in which work units (here, slices of our array) are treated. This gives the iterative function:

```python
def leftcomb(l):
    maxn, leftcomb = len(l), []
    n = maxn//2
    while maxn > 1:
        leftcomb.append((l[n:maxn], False))
        maxn, n = n, n//2
    return l[:maxn], leftcomb

def tcomergesort(l):
    l, stack = leftcomb(l)
    while stack:  # l sorted, stack contains tagged slices
        i, ordered = stack.pop()
        if ordered:
            l = fastmerge(l, i)
        else:
            stack.append((l, True))  # store return call
            rsub, ssub = leftcomb(i)
            stack.extend(ssub)  # recurse
            l = rsub
    return l
```

This goes a tad faster than the inefficient trampolined version above (trampolined mergesort: 0.170638551712, this stack-based version: 0.144994809628). That is not surprising. The interesting fact is that this final iterative version goes slower than the basic, non-tail-recursive, run-of-the-mill merge sort you would have written as CS 101 homework. Now this is cool: CPython’s function stack manipulations are faster than my explicit array manipulations. This is more a statement on the size of the recursive call stack (mentioned above) and the efficiency of CPython than on the quality of my CS homework (indeed, an even faster recursive version has been suggested on reddit). But still: it’s surprising that it would take an iterative function that does not merely perform the basic, systematic recursion-to-stack transformation above to beat the run-of-the-mill recursive version.
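To see concretely what leftcomb builds, here is its output on a small list (the definition is repeated so the snippet runs on its own; the trace comments are my own reading of it):

```python
def leftcomb(l):
    maxn, leftcomb = len(l), []
    n = maxn//2
    while maxn > 1:
        leftcomb.append((l[n:maxn], False))
        maxn, n = n, n//2
    return l[:maxn], leftcomb

# The leftmost leaf of the array-slicing recursion tree comes back as the
# seed, and the stack holds the right-hand slices still waiting to be
# sorted, tagged False ("not yet ordered"), deepest slice pushed last.
seed, work = leftcomb([5, 2, 9, 1])
# seed == [5]
# work == [([9, 1], False), ([2], False)]
```

The `True` tag is only ever added by tcomergesort itself, when it parks an already-sorted slice pending a later merge.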

The final result summary? On my machine (Ubuntu Natty’s stock Python 2.7.1+), the average run timings (over 100 runs, except for bubblesort; lists of size 10000, containing random integers between 0 and 10000000) are: