A tale of two sorts

Suppose you want to choose between mergesort and a simple quicksort algorithm — which one is faster? This is a good question to learn from because the algorithms are intuitively clear, yet their analysis is not completely obvious.

Here’s a simple quicksort in Python:

def quicksorted(arr):

n = len(arr)

if n < 2: return arr # Find left & right lists so left < pivot <= right.

pivot, left, right = arr[0], [], []

for item in arr[1:n]:

if item < pivot: left.append(item)

else: right.append(item)

return quicksorted(left) + [pivot] + quicksorted(right)

Quicksort uses a pivot element from the array to filter out two subarrays called left and right so that every element L in left, and every element R in right follows the rule:

L < pivot ≤ R.

From here, left and right are recursively sorted to provide the full sort.

Next is a Python mergesort:

def mergesorted(arr):

n = len(arr)

if n < 2: return arr # 1. Split in half and recursively sort.

mid = int(n / 2)

left = mergesorted(arr[0:mid])

right = mergesorted(arr[mid:n]) # 2. Merge together.

sorted, i, j = [], 0, 0

while len(sorted) < n:

if left[i] < right[j]:

sorted.append(left[i])

i += 1

else:

sorted.append(right[j])

j += 1

if i == len(left): sorted += right[j:]

if j == len(right): sorted += left[i:]

return sorted

This time, the array is split in half using a mid-point set at index int(n/2). These halves are recursively sorted, and a little work is spent merging them together to arrive at the fully sorted array.

Which one is faster?

This is such a common problem that you might have the answer memorized. It’s worthwhile to pretend it’s a new problem to you, and to appreciate that the answer is not obvious at first glance. After all, the main application of big-oh notation for coders is to understand the behavior of new algorithms.

Revealing graphs

Let’s compare the speed of the algorithms. As a unit of measurement, we can look at the number of comparisons used by each graph on similar inputs. Specifically, we’ll measure the number of times these lines are called:

if item < pivot: # in quicksort

or

if left[i] < right[j]: # in mergesort.

This unit of measurement is proportional to the total time each sort algorithm will take, is easy to measure experimentally, and is independent of any particular hardware it may run on.

The graph below shows the number of comparisons needed to sort every possible ordering of [1, 2, 3]. The height of each blue bar gives the number of comparisons used for a particular input. The input is represented by the smaller gray bars beneath the blue bar; for example, the left-most bar represents input [1, 2, 3] to quicksort taking 3 comparisons, while the right-most bar represents input [3, 2, 1] to mergesort. The n=3 label indicates the length of the input.

Both algorithms use either 2 or 3 comparisons in all cases — and neither one is clearly faster than the other for a random size-3 input; they both use 3 comparisons on 4 possible inputs, and 2 comparisons on the others.

Let’s move on to larger inputs. This time we’ll consider all 24 possible orderings of [1, 2, 3, 4] as inputs:

In this graph we can see that mergesort’s slowest case — the maximum number of comparisons—is better than quicksort’s slowest case.

Usually we won’t know the length of the input array ahead of time. With this in mind, we can look for a pattern by extending the above graphs for several larger values of n. For an input array of size n, there are n! different array orderings — that is, n(n-1)(n-2)…1 possibilities.

The bars in the next graph are drawn in order of height since otherwise they’d be difficult to visualize. The last row, with n=10, represents 10! = 3,628,800 different array orderings, so the graph is a compressed but accurate representation.

Something curious is happening in the quicksort column: a little spike has been forming, pushing up the maximum number of comparisons.

Let’s focus on the maximum value within each of these bar graphs. We can use this one number to summarize the performance of each algorithm at each input size n in the following graph:

The earlier bar graphs made it look as if quicksort was often about as fast as mergesort. Suddenly, this last graph is making quicksort look much worse. What’s going on?

Worst-case time complexity

Let’s dive into precise terminology for general algorithms — we’re ready to define the time complexity of an algorithm.

It’s nice to have a single function t(n) that expresses the time an algorithm takes in terms of n, the size of the algorithm’s input. It’s often true that there are many inputs of a single size — for example, many lists with the same length — so we have to decide how to represent all the running times for these inputs with a single number.

In the analysis of quicksort and mergesort so far, we focused on the maximum time used for a given list size. In general, the function

t(n) = max #time units used for an input of size n

is called the worst-case time complexity of an algorithm because it assumes the slowest possible performance based on knowing n.

An alternative to worst-case is the average-case time complexity, which instead measures the average number of time units used for a random size-n input.

Finding t(n)

The bar graphs above were made by running the algorithms on many inputs. This only works for finite n. Let’s try to find an expression for t(n) that doesn’t require any code to compute.

t(n) for quicksort

Below is an example of a quicksort run on input [4, 6, 2, 3, 5, 1]. The orange bars are pivot elements; gray bars separate different input sets in the recursive calls. The new variable nc(k) is the number of comparisons performed at each recursion depth k.

The variable nc(k) is useful since

total #comparisons = nc(0) + nc(1) + nc(2) + …

A recursive call with input s uses len(s)-1 comparisons, excluding comparisons made indirectly with deeper recursive calls. So nc(k) is the sum of len(s)-1 over all sublists s at that depth.

Let’s add two new variables that will help us precisely analyze quicksort:

ne(k) = #elements at depth k (e.g. the non-gray bars in the image above),

ns(k) = #sublists at depth k.

Here’s the above example with the variables for each step k:

Using these, we can rewrite the above nc(k) equation as:

nc(k) = ne(k) - ns(k).

This works because, at depth k, there’s one comparison per non-pivot element. And there are ns(k) pivot elements, so subtracting ns(k) from ne(k) gives us the number of non-pivot elements.

Notice that quicksort recursively calls itself on left and right where len(left) + len(right) = #comparisons performed before recursion; in other words, exactly one element is passed recursively down for each comparison performed. We can express this as ne(k) = nc(k-1). Combine this with our last expression for nc(k), and we arrive at

nc(k) = nc(k-1) - ns(k).

Our goal is to find the maximum #comparisons used by a run of quicksort — that is, the maximum sum of nc(k) over all recursion depths k. This last equation tells us that maximizing nc(k) is the same as minimizing ns(k), so let’s consider the smallest possible values of ns(k).

If ns(k)=0, then nc(k)=0 since no comparisons are done when there are no sublists. So the maximum value of nc(k) is nc(k-1)-1, which happens when ns(k)=1.

If we could make ns(k)=1 for as many k as possible, then we’d have nc(0)=n-1, nc(1)=n-2, etc, with nc(k)=n-k-1 at depth k. And we can make ns(k)=1 for as long as possible by sending in an already-sorted input such as [1, 2, 3, 4, 5, 6]:

This last example gives the maximum #comparisons for any input of length 6. So t(6) = #comparisons on [1, 2, 3, 4, 5, 6] = 5 + 4 + 3 + 2 + 1 = 15.

This argument shows that the maximum #comparisons for any n is found when we run quicksort on any already-sorted input, which would preserve ns(k)=1 for as many recursion depths as possible. This means that

This formula gives us the exact worst-case time complexity of quicksort. Not only did we find quicksort’s speed on already-sorted inputs, but we also carefully argued that no input could use more comparisons.

t(n) for mergesort

Mergesort has two stages: splitting and merging. All the comparisons happen during merging. As a reminder, here’s the merging section of the code:

def mergesorted(arr):

...

# 2. Merge together (left/right have been recursively sorted).

sorted, i, j = [], 0, 0

while len(sorted) < n:

if left[i] < right[j]:

sorted.append(left[i])

i += 1

else:

sorted.append(right[j])

j += 1

if i == len(left): sorted += right[j:]

if j == len(right): sorted += left[i:]

return sorted

A merge ends when either left or right runs out, triggering one of the last two lines of the while loop. Until that happens, one element is appended to sorted for every comparison. At most len(arr)-1 comparisons happen during a merge, corresponding to a single element remaining in either left or right when the merge ends.

Can this theoretical maximum actually happen at every recursion depth for a single input?

The answer turns out to be yes — but we need to do some work to prove it. We’ll construct slow-to-mergesort arrays using a function called antisorted.

The following code accepts an input array arr of distinct elements and returns a new array with the same elements arranged so that every recursive depth of a mergesorted call takes as long as possible:

def antisorted(arr):

if len(arr) < 2: return arr

sortedarr = sorted(arr)

i = 0

left, right = [], []

while True:

right.append(sortedarr[i])

i += 1

if i == len(arr): break

left.append(sortedarr[i])

i += 1

if i == len(arr): break

return antisorted(left) + antisorted(right)

It works by choosing exactly which elements will end up in the left and right subarrays so that, during a merge, the smallest element is chosen from right, then from left, then from right, etc. The right side is filled first in the loop since, when n=len(arr) is odd, right has one more element than left in a run of mergesorted.

Intuitively, the first pass of antisorting — before any recursive calls — performs an operation like this:

The last line of antisorted makes sure that every recursive level of mergesorted also uses a maximal number of comparisons. We can mess with the ordering of elements within left and right in the last line of antisorted because, by the time the merge starts during mergesort, all recursion is done, and left and right have been restored to their sorted selves.

Antisorting examples

Here’s a picture of what antisorting arr=[1, 2, 3, …, 64] looks like:

Each step in the above image represents another level of recursion in the call to antisorted. The next image shows what mergesort does to antisorted([1, .., 8]). Each merge step has to zipper together alternating elements from the arrays being merged:

Antisorting maximizes the number of comparisons done at all recursion levels. As a result, the maximum total comparisons is exactly the sum of len(arr)-1 over all recursion levels. We can summarize this result with the recurrence relations below. The left pair of brackets around n/2 mean to round down, and the right pair mean to round up, like the floor and ceil functions. This is a nice way to handle both even and odd values of n in one equation.

Sadly, it’s not easy to turn this into a nicer, non-recursive expression. For now we’ll deviate to an approximation t’(n) based on the picture below, where each horizontal layer indicates a recursion level in a mergesort. This picture introduces the function lg(n) which is the base-2 logarithm of n.

Each blue bar represents a portion of the array being considered by a mergesort call; the values like n or n/2 give a size estimate for the input to that mergesort call. This estimate won’t be accurate whenever the previous call has an odd-sized input, and our value of h is clearly wrong when lg(n) is not an integer, but this picture can still give us some intuition about how many comparisons are occurring.

Notice that each horizontal layer has exactly n elements, and the layers have 1, 2, 4,... blue bars each. So the number of comparisons for a single layer is n-#bars. Adding up powers of 2 is nice since we can use the equation

We can use that fact to add up the number of comparisons, that is, n-#bars, across all layers:

Let’s plot actual discrete values of t(n) using the recurrence relation against the approximate function t’(n):

Looks pretty good!

Unfortunately, we have no careful proof of a connection between the approximation t’(n) and the exact function t(n). Can we regain proven information about mergesort without having to solve the above recurrence relation?

A sweet spot for big-oh

The motivation for big-oh notation has just ambushed us. It’s perfect for our mergesort analysis — it gives us a way to briefly summarize t(n) without finding a non-recursive expression for each value.

The important fact about t(n) that we want to verify is something like

which can be written more formally as

At each recursion level of mergesort, all of the n elements have been split up into sublists to be sorted. So the number of comparisons at any fixed level is always ≤ n.

The longest sublist after the first split — call this level k=1 — has length ≤ n/2 + 1/2. At level 2, the longest sublist has length ≤ n/4 + 1/4 + 1/2. In general, we cut the last longest sublist in half and add 1/2 to allow for splits of odd numbers. This means the longest sublist at recursion level k is

By the time k=lg(n), the longest sublist is < 2. So no comparisons can occur after recursion level h=lg(n).

We can summarize these ideas in the following picture:

In summary,

Since lg(n) = log(n) / log(2), we can set C = 1/log(2) to get

which satisfies the definition of big-oh; we can re-write it as

Nice.

Next let’s find a big-oh expression for quicksort’s t(n).

Earlier, we found that

Let’s turn that into an equality that can be used in a big-oh expression:

We’re now able to beautifully summarize the worst-case running time of our sorting methods:

By this measure, mergesort is the clearly superior algorithm.

However, there are slight modifications to quicksort that can improve its speed in some cases, and I’ll mention one below.

Average-case complexity

Remember the distribution of running times we found earlier:

It looks like the worst-case for quicksort is isolated to a small subset of inputs. It would be nice if we could give quicksort some credit for being as good as mergesort most of the time. Average-case complexity allows us to overlook slow-but-rare inputs.

Worst-case time complexity was defined like this:

<worst-case> t(n) = max(#time units needed for any length-n input)

Average-case complexity is defined analogously:

<avg-case> t(n) = avg(#time units needed for any length-n input)

An interesting technical question pops up: What is the probability distribution on the inputs? For example, for a string algorithm, should we consider all valid UTF-8 strings as equally-likely inputs — or are natural language strings more likely? The definition of average-case complexity doesn’t specify your probability distribution. In other words, it’s up to you to choose the distribution that makes the most sense, though it’s common to assume that all inputs of the same size are equally likely.

For sorting inputs of size n, we can use permutations of 1, 2, …, n to represent each possible ordering of an input, and treat each permutation as equally likely. With that distribution, the average-case time complexity of quicksort is O(n log n), though it’s a bit of work to figure that out.

Randomized algorithms

Quicksort’s average time is O(n log n), but it’s still slow on presorted input. This can be a problem in practice. An easy improvement is to choose the pivot element randomly.

The modified quicksort would insert these lines:

pivotIndex = random.randint(0, n — 1)

arr[0], arr[pivotIndex] = arr[pivotIndex], arr[0] # Item swap.

just before the line that sets the pivot:

pivot, left, right = arr[0], [], []

Now the running time on any input — including a presorted list — is itself random. Well, pseudorandom in practice.

Since the running time is random even for a fixed input, it’s reasonable to look at the expected running time. In other words, t(n) now represents an simultaneous average over both all length-n inputs and all possible pseudorandom parameters — in this case, our pivot choice. In the case of random-pivot quicksort, the expected running time is the same as the average-case time for the non-random version — O(n log n).

Amortized analysis

We’ve talked about two probabilistic ideas so far: averaging over many inputs, and about algorithms using random parameters. A third variation of average-case complexity is amortized analysis, where averages are taken over a group of consecutive function calls.

Suppose we have an array of integers held consecutively in memory; we can add new integers to the end of the array with an add function. This function takes one fixed-size integer as input, and appends it to the end of the current list of integers. Our array pre-allocates a little extra space to anticipate some add calls. When add is called and no extra space remains, a new block of twice as much memory is allocated to make extra room for incoming elements, and the old elements are copied to the new location:

In this section, we’ll use the variable n to indicate how many add calls have already occurred. Since we didn’t bother to make a remove function, this means there are exactly n integers stored in the array.

Adding an element to a full, no-extra-space array with n elements requires n+1 memory writes, which is our measure of time here. In other words, a single add call could take arbitrarily long, even though it has only one fixed-size input!

But! We can still say something meaningful about the running time. Calling add n times in a row, starting with an empty array, results in this many memory writes:

The amounts in parentheses capture the number of writes to copy the existing elements to a new location, plus one more write for the new element. We can rewrite this expression as

Let T(n) denote the number of memory writes used in n consecutive add calls. We can summarize the last result as T(n) ≤ 3n.

Now let t(n) be the average number of memory writes per add call, after n calls to add. This is simply t(n) = T(n)/n ≤ 3. We can summarize this as

which we could phrase like so: The amortized running time of add is constant. It’s called constant time because the equation t(n) = O(1) means there must be some constant C so that t(n) is always ≤ C for all n after a certain point.

Now we’ve seen a few key examples of how big-oh can be used, as well as explored different ways of choosing a single value t(n) that represents an algorithm’s efficiency. Let’s step back and review some general properties of the notation.