I’ve had the ML implementation of a heap around for a bit, but I hadn’t had time to post it. So this blog post will go through it and some of the main ideas. As usual, the complete .sml can be found at the symfun github repo.

What is a heap?

Let’s get started by discussing the heap, which we briefly mentioned (but didn’t really cover) when we talked about priority queues. First off: we are discussing the data structure called a heap, not the name given to a portion of memory.

You can start thinking about the heap as a tree. Each node can have at most 2 children (we can call these the left and right children) and, with the exception of the root node, exactly 1 parent. Now, the interesting part about heaps is that these nodes aren’t just placed willy-nilly, but rather maintain what is called the “heap property”. There are 2 kinds of heaps, min-heaps and max-heaps, each with a slight variant of the property. The max-heap property states that a parent node must be greater than or equal to each of its children, and similarly, the min-heap property states that a parent node must be less than or equal to each of its children. Now, something we had not yet mentioned, but that is critical, is that the children of a given node are themselves heaps, and thus guaranteed to hold the heap property (as long as we are careful about insertion etc.).

So we get a couple of nice things “for free” in a heap. For example, we obtain some information regarding order statistics, meaning we have some sense of where the largest, second largest, third largest elements and so on (in the case of a max-heap) lie within our heap. Let’s assume we have a max-heap. We know that the largest element MUST be at the root. Why is that? For the heap property to hold, every parent must be at least as large as its children. Given that the children are heaps as well, this in turn implies that an ancestor node is larger than all of its descendants.

Let’s assume we number our nodes as follows: our root node is 0, its left child is 1 and its right child is 2. We can guarantee that the second largest element in the heap is in position 1 or 2, but we cannot state for certain which one. Unlike a binary search tree, the heap imposes no restrictions between sibling nodes. As long as the relationship to the parent node holds, the children can be equal, or one can be larger than the other, with no guaranteed position. This in turn implies that while we have ordering information along a path in the heap, we have no information to compare 2 separate branches.

Things are slightly more interesting if we want to think about the possible positions for the third largest element. We know for sure it cannot be at position 0, as that is reserved for the largest element. Now, by the argument we made in the prior paragraph, the third largest element could be in position 1 or 2 (and the other position would be occupied by the second largest element). Note that the third largest element could also be a child of the second largest element (which means it could be either child of node 1 or either child of node 2). More generally, every ancestor of a node is larger than it, and the nth largest element can have at most n - 1 larger ancestors, so the nth largest element must lie within the first n - 1 “levels” of our tree (the root being considered level 0). For n = 1, the root is its only possible position.

Why are heaps interesting?

The interesting part about heaps is that they allow certain operations to take place with a reduced complexity when compared to other data structures. For example, we can add elements to our heap and remove elements from it in O(log N), which can be faster than what is achievable in other structures (for example, recall that our linked-list priority queue implementation had linear time insertion). How does the heap accomplish this? Well, it is mainly the result of 2 of its characteristics.

First: our heap is structured as a tree where each node (with the exception of the leaves) has at most 2 children. Now, what does this mean in practical terms? Consider the following: let’s say we number our tree starting from the root (0) and progressing in a breadth-first manner, meaning we number elements at each “level” before progressing to the next level. So the children of 0 would be 1 and 2. Now the children of 1 would be 3/4 and those of 2 would be 5/6, and so forth.

Now, let’s say we had a linked list and wanted to reach element number 6, beginning our search at the first element (the root, 0). How many steps does it take to reach our target? 0->1->2->…->6, that is 6 steps. Now, let’s assume our data is structured not as a list but as a tree (as it is in a heap). How many steps does it take now? 0->2->6, that’s 2 steps! That is a hell of a speed-up. This of course assumes that our tree is balanced, which informally means we didn’t create a super deep branch but rather “spread” the data out across the branches, filling each “level” before we progress to the next.
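To make the numbering concrete, here is a small helper (not from the post’s .sml; the name depth is mine) that counts the steps from node i back to the root under this breadth-first numbering, using the fact that the parent of node i sits at (i - 1) div 2:

```sml
(* hypothetical helper: depth of node i under breadth-first numbering,
   where the root is 0 and the children of i are 2*i+1 and 2*i+2 *)
fun depth 0 = 0
  | depth i = 1 + depth ((i - 1) div 2);
```

depth 6 evaluates to 2 (the path 0 -> 2 -> 6), while walking a list to the same element takes 6 steps.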

Second: we now have information about the underlying data in our structure. We know that each parent is larger than its children. We know the largest element is at the root (in a max-heap).

The effect of these 2 factors will be clear when we walk through the ML implementation.

ML: heaps heaps heaps

We take advantage of ML’s rich type system to create our own type for heaps. We do so using a datatype called heap, which has 2 data constructors: Empty and H of … Our H data constructor takes 4 components: an int, an element for our heap, and 2 children, which are themselves heaps.

datatype 'a heap = Empty | H of int * 'a * 'a heap * 'a heap;

We have that int at the start to keep track of the number of nodes in the tree rooted at a given node. We need this to make sure we keep our tree balanced. In an imperative implementation (as we will do in Java), our heap might be an array and we could simply keep track of the next open slot where we should be putting elements; in our functional implementation we don’t have this, and must instead use the size information as we traverse our heap to find an empty slot.

We have a simple “constructor” for a heap node. It takes an element e and wraps it in our data constructor.

(* construct a node for the heap *)
fun make_node e = H(1, e, Empty, Empty);

We include a simple function to extract the count of descendants from a node (the int in our H of quadruple). While we could have simply used pattern matching for this in our code, having a quick function made the implementation cleaner.

(* how many elements are in the tree rooted at x *)
fun get_size Empty = 0
  | get_size (H(n, _, _, _)) = n;

Let’s start with an insertion.

Under a traditional, array-based implementation, one possible way of inserting an element is to put it into the next available slot (which you keep track of as you add elements). You then let the element “float up”, meaning you swap it with its parent if the heap property does not hold, and you continue to do so until it arrives at a spot where it holds. The complexity of this operation is O(log N). Why is that? Consider that if you for some reason inserted the largest possible element (or smallest possible, in the case of a min-heap), you might have to swap all the way up to the root! How many steps would that be? Well, we know that each parent can have 2 children, right? So if we are at node index X, we can count the number of steps to the root by figuring out what “level” of the tree that node is in. Under our numbering, the parent of node i lives at (i - 1) div 2, so we simply divide repeatedly by 2 until we reach the root, and we have a quick way of stating how many times that is: log base 2! I confess that I didn’t really “get” log until I started dealing with trees, and then it became apparent how useful it is for daily activities.
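As a sketch of the array-based idea (hedged: this is not the post’s code, and the name float_up is made up), the “float up” step looks roughly like this, using the same lte convention as the rest of the post, i.e. lte (child, parent) should hold:

```sml
(* hypothetical sketch of array-based "float up": swap the element at
   index i with its parent while the heap property is violated *)
fun float_up lte arr i =
  if i = 0 then ()                      (* reached the root: done *)
  else
    let
      val p = (i - 1) div 2             (* parent index *)
      val parent = Array.sub (arr, p)
      val child  = Array.sub (arr, i)
    in
      if lte (child, parent) then ()    (* property holds: done *)
      else ( Array.update (arr, p, child)
           ; Array.update (arr, i, parent)
           ; float_up lte arr p )
    end;
```

Each recursive call moves one level up, so there are at most about log base 2 of N swaps.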

Our functional implementation starts at the root instead, as we have no “marker” for the next available spot. We then float the element down to an empty slot at a leaf, performing any swaps necessary along the way. Since we must always travel from the root to a leaf, our insertion always takes log N steps, not just in the worst-case scenario.

(* insert a value into the heap *)
fun insert lte e Empty = make_node e
  | insert lte e (node as H(n, r, left, right)) =
      let
        (* insert into branch with less elems, to keep balance *)
        val bal_ins = insert0 (get_size left <= get_size right) lte
      in
        if lte (e, r)
        then bal_ins e node
        else bal_ins r (H(n, e, left, right))
      end
and insert0 in_left lte e (H(n, root, left, right)) =
      if in_left
      then H(n + 1, root, insert lte e left, right)
      else H(n + 1, root, left, insert lte e right);

Our function takes a predicate lte (so that we are not restricted to a specific type), which tests whether its first argument is less than or equal to its second (we can simply provide a greater-than operation if we want a min-heap instead). It also takes the element we wish to insert and a heap into which to insert it. We define a local operation called bal_ins (short for balanced insert) which decides whether we will insert into the left or the right branch by comparing the number of nodes underneath each and choosing the one with fewer nodes (or the left if both have an equal number). This makes use of a mutually recursive function called insert0. Finally, we compare the new element to the root to make sure our insert maintains the heap property: we continue inserting e down the tree if the heap property holds, otherwise we make e the new root and recursively insert the old root instead.

The next natural operation would be to take something out of our heap. We will implement a remove function that returns the root element and a heap without that root.

fun rem lte Empty = (NONE, Empty)
  | rem lte (heap as H(_, r, _, _)) = (SOME r, move_up lte heap);

In the case of an empty heap we return NONE, as we have done before, along with the empty heap. If the heap is non-empty, then we return the root and a new heap which has been manipulated to make sure the heap property still holds. In a traditional imperative implementation, one would store the root value in a variable, swap the last element in the heap with the root, decrease the valid size of the heap (i.e. move our “next slot” marker) and then apply a recursive procedure at the root to restore the heap property. This recursive procedure has a worst-case complexity of O(log N), as we might have moved up the smallest element in a max-heap and need to let it “float down” to a leaf.
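That imperative “float down” procedure might look roughly like this (hedged: hypothetical code, not the post’s, and float_down is a made-up name), again with the convention that lte (child, parent) should hold:

```sml
(* hypothetical sketch of array-based "float down" over the first
   `size` slots: swap with the winning child while the property fails *)
fun float_down lte arr size i =
  let
    val l = 2 * i + 1                   (* left child index *)
    val r = 2 * i + 2                   (* right child index *)
    fun beats (c, b) =                  (* should slot c sit above slot b? *)
      c < size andalso not (lte (Array.sub (arr, c), Array.sub (arr, b)))
    val best = if beats (l, i) then l else i
    val best = if beats (r, best) then r else best
  in
    if best = i then ()                 (* property holds: done *)
    else
      let
        val tmp = Array.sub (arr, i)
      in
        Array.update (arr, i, Array.sub (arr, best));
        Array.update (arr, best, tmp);
        float_down lte arr size best
      end
  end;
```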

Our functional implementation uses a similar idea; however, it always takes log N steps, not just in the worst-case scenario, as we effectively “move” one of the branches up, swapping up one node per level.

(* if we remove the root of a subnode, we need to move one of the branches up to replace *)
fun move_up lte Empty = Empty
  | move_up lte (H(_, _, Empty, Empty)) = Empty
  | move_up lte (H(n, _, left as H(_, le, _, _), Empty)) =
      H(n - 1, le, move_up lte left, Empty)
  | move_up lte (H(n, _, Empty, right as H(_, re, _, _))) =
      H(n - 1, re, Empty, move_up lte right)
  | move_up lte (H(n, _, left as H(_, le, _, _), right as H(_, re, _, _))) =
      if lte (le, re)
      then H(n - 1, re, left, move_up lte right)
      else H(n - 1, le, move_up lte left, right);

If our heap is empty or a leaf, the result of “moving up” a branch is just the empty heap. If our heap has only 1 branch, we simply move that branch up. Note that we decrease the count of nodes underneath. If we have 2 non-empty children, we compare the values of the left and right nodes, move up the larger of the 2 (if we’re working with a max-heap) to become the root, and recursively move up its descendants.

Now making a heap is quite easy: we start with an empty heap and fold over a list, inserting each element using our insert function.

fun make_heap lte ls = foldl (fn (e, h) => insert lte e h) Empty ls;

One might expect the complexity of this function to be N*log N, and here that bound is essentially tight: we perform N insertions and, as noted above, each insertion in our implementation walks from the root to a leaf, costing log N. (The famous O(N) bound for building a heap applies to the bottom-up “heapify” construction over an array, not to repeated insertion.)

Heap sort, in turn, consists of repeatedly removing the root element while maintaining the heap.

fun heapsort0 lte Empty = []
  | heapsort0 lte heap =
      let
        val (SOME e, new_heap) = rem lte heap
      in
        e :: heapsort0 lte new_heap
      end;

fun heapsort lte ls = heapsort0 lte (make_heap lte ls);

We build the heap and then apply the “heavy lifting” function heapsort0. heapsort0 takes the root element, gives us a new heap without that element, recursively applies itself to the new heap, and “conses” the element onto the result. This means that our heapsort returns a list in decreasing order for a max-heap and in increasing order for a min-heap.

- heapsort op>= [4,7,1,0, ~10];
val it = [~10,0,1,4,7] : int list
- heapsort op<= [4,7,1,0, ~10];
val it = [7,4,1,0,~10] : int list

Finally, for funsies, we have a quick print_heap function that performs a depth-first traversal and prints out the element at each node of the heap. It takes a toString function that creates a string representation of our heap elements (Int.toString does just fine for an int heap) and a heap.

fun print_heap0 spaces toString Empty = TextIO.print (spaces ^ "*\n")
  | print_heap0 spaces toString (H(_, root, left, right)) =
      let
        val next_print = print_heap0 (spaces ^ "  ") toString
      in
        (TextIO.print (spaces ^ toString root ^ "\n");
         next_print left;
         next_print right)
      end;

fun print_heap toString heap = print_heap0 "" toString heap;

Note that there are various pitfalls to this implementation. Namely, our comparison function is provided anew to each (or most) of the functions relating to heaps. This is not only a nuisance but also dangerous: there is no guarantee that we pass the same function in each call, and we could thus wreak havoc on the structure. In real life, we would implement heaps using a functor that takes a type and a comparison function, and make sure to use that comparison in all functions of the resulting structure. Additionally, we might want to hide the data constructors associated with our heap datatype: we only want valid heaps to be built with those constructors, and exposing them lets others use them arbitrarily.
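As a sketch of what that functor might look like (hedged: all the names here — ORDERED, HeapFn, IntMax — are my own, and the body simply repackages the functions from the post with the comparison fixed once and the constructors hidden behind an opaque signature):

```sml
(* illustrative functor, not from the post's .sml *)
signature ORDERED =
sig
  type t
  val lte : t * t -> bool
end

functor HeapFn (O : ORDERED) :>
sig
  type heap
  val empty  : heap
  val insert : O.t -> heap -> heap
  val rem    : heap -> O.t option * heap
end =
struct
  datatype heap = Empty | H of int * O.t * heap * heap

  val empty = Empty

  fun get_size Empty = 0
    | get_size (H(n, _, _, _)) = n

  fun insert e Empty = H(1, e, Empty, Empty)
    | insert e (node as H(n, r, left, right)) =
        let
          (* insert into branch with fewer elems, to keep balance *)
          val bal_ins = insert0 (get_size left <= get_size right)
        in
          if O.lte (e, r) then bal_ins e node
          else bal_ins r (H(n, e, left, right))
        end
  and insert0 in_left e (H(n, root, left, right)) =
        if in_left then H(n + 1, root, insert e left, right)
        else H(n + 1, root, left, insert e right)
    | insert0 _ e Empty = H(1, e, Empty, Empty)

  fun move_up Empty = Empty
    | move_up (H(_, _, Empty, Empty)) = Empty
    | move_up (H(n, _, l as H(_, le, _, _), Empty)) =
        H(n - 1, le, move_up l, Empty)
    | move_up (H(n, _, Empty, r as H(_, re, _, _))) =
        H(n - 1, re, Empty, move_up r)
    | move_up (H(n, _, l as H(_, le, _, _), r as H(_, re, _, _))) =
        if O.lte (le, re) then H(n - 1, re, l, move_up r)
        else H(n - 1, le, move_up l, r)

  fun rem Empty = (NONE, Empty)
    | rem (heap as H(_, r, _, _)) = (SOME r, move_up heap)
end

(* with op<=, the "losing" comparison floats the larger element to
   the root, i.e. this is a max-heap of ints *)
structure IntMax = HeapFn (struct type t = int val lte = (op <=) end)
```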

I really like how clear the ML code is (or at least clearer than other implementations). The polymorphism is great (we can put anything we want in our heap), and it was quite simple to write.

That’s it for now. I’ve coded up the Java version but haven’t cleaned it up so that might have to wait a week or so.