This is an anecdote describing the process of creating a patch I came up with for the python-dateutil module which resulted in significant performance gains. It’s a good demonstration of the heapq module included with Python.

For the new version of my RTA bus schedule app, I decided to use the dateutil module for all my time calculations. This is a fantastic module that makes finding things like “the 3rd instance into the month of one of Tuesday, Wednesday or Thursday, for the next 3 months” easy. Since bus schedules repeat weekly—with special schedules for weekends, from the schedule’s effective date until the next effective date, with exceptions for holidays—I figured this would make my job much, much easier. And for the most part, I was right. Each time listed in a bus schedule’s time table can be represented as a recurrence rule, and the resulting set of rrules can be combined into an rruleset which defines the whole bus schedule.

But after asking for the next bus time from one of my rrulesets, I was a bit surprised to be waiting more than ten minutes for the result. In fact one particular query on a rruleset still hadn’t terminated even after 40 minutes. I tried looping over the individual rrules myself to get the answer I wanted, instead of using the rruleset, and that worked in under a minute. Obviously dateutil’s rruleset implementation was doing something silly. So I profiled it.

I queried a rruleset with about 700 rrules starting in June 2006 by asking for the next time in the recurrence after today (in December 2007). These were the most expensive calls reported by cProfile1 (output trimmed to points of interest):

261917242 function calls in 686.251 CPU seconds ncalls tottime filename:lineno(function) 152824 513.894 {method 'sort' of 'list' objects} 129590630 83.894 rrule.py:842(__cmp__) 129590630 82.439 {cmp} 153900 1.296 rrule.py:399(_iter) 304393/151570 0.963 rrule.py:102(_iter_cached) 57358 0.630 rrule.py:780(wdayset) 151570 0.510 rrule.py:864(_iter) 152127 0.232 rrule.py:836(next)

That pretty much says it all. Of the 686 seconds (11½ minutes) it spent finding the right time, 513 seconds (8½ minutes) were spent calling list.sort more than 150,000 times. The remaining time was mostly spent performing comparisons, presumably for all that sorting. Now, timsort may be good, but that amount of sorting is just excessive.

Let’s look at the code calling list.sort . The relevant portion of rruleset._iter looks like this:

lastdt = None total = 0 while rlist: ritem = rlist[0] if not lastdt or lastdt != ritem.dt: while exlist and exlist[0] < ritem: exlist[0].next() exlist.sort() if not exlist or ritem != exlist[0]: total += 1 yield ritem.dt lastdt = ritem.dt ritem.next() rlist.sort()

What’s going on here? This code yields each date generated by all the rrules in order, the earliest date being first, after checking that it has not been excluded by another rrule (one that generates holidays, for instance). rlist and exlist are both Python lists consisting of iterators that provide dates on behalf of the rrules via next() . Both lists operate in this way: each iterator in the list uses its own last retrieved date as its comparison key, which changes on every call to next() , which is why sort() is called again on every iteration—note where these calls are in the code, for later. The lists start out sorted (before the while loop).

The iterators take care of removing themselves from the list when they are exhausted:

def next(self): try: self.dt = self.gen() except StopIteration: self.genlist.remove(self)

Clearly we can do better than sorting the list on every iteration. One approach would be to start out sorting the list in O(n log n) time, then when one of the iterators is advanced (changing its comparison key), remove it from the list and put it back with an insertion sort in O(n) time.

It turns out we can do better than that. You may recall that one of many batteries Python includes is the heapq module, which implements what is also known as a priority queue. This gives us the following properties: the list can be “heapified” in linear time, the first item in the list will always be the smallest, and items can be replaced in near-constant time. This is a perfect fit for the code above: we only ever care about the smallest item (the earliest date), and it may need to adjust its position in the list after every iteration. What follows are the very simple changes I made to the code to use heap queues.

First, import it:

import heapq

Then, starting before the while loop over rlist , heapify the lists:

heapq.heapify(rlist) heapq.heapify(exlist)

Then, wherever the list is sorted after calling next() , instead check to see if the item has removed itself (recall that this is the responsibility of the iterator), and if it’s still there, use heapreplace to replace the item with itself—in effect, moving it to the correct position given its new comparison key. In the new while loop, just the calls to sort() have been changed:

while rlist: ritem = rlist[0] if not lastdt or lastdt != ritem.dt: while exlist and exlist[0] < ritem: exitem = exlist[0] exitem.next() if exlist and exlist[0] is exitem: heapq.heapreplace(exlist, exitem) if not exlist or ritem != exlist[0]: total += 1 yield ritem.dt lastdt = ritem.dt ritem.next() if rlist and rlist[0] is ritem: heapq.heapreplace(rlist, ritem)

There’s one other place where the lists are modified: where the iterators remove themselves. This needs to be updated too, since it could mess up the properties of the heap if they just go modifying it in arbitrary places:

def next(self): try: self.dt = self.gen() except StopIteration: if self.genlist[0] is self: heapq.heappop(self.genlist) else: self.genlist.remove(self) heapq.heapify(self.genlist)

In this case, we can use the constant-time heappop if the exhausted item is the smallest, otherwise the list will need to be reheapified after removing it in the same way as the old version. So what does performance look like now for exactly the same operation?

5765127 function calls in 13.211 CPU seconds ncalls tottime filename:lineno(function) 152137 5.911 {_heapq.heapreplace} 154000 1.226 rrule.py:400(_iter) 1514270 1.021 rrule.py:847(__cmp__) 1514270 0.860 {cmp} 304414/151581 0.743 rrule.py:103(_iter_cached) 57378 0.601 rrule.py:781(wdayset) 151580 0.440 rrule.py:869(_iter)

Yup, thirteen seconds—more than a 52X speedup. Not bad for a net growth of 8 lines of code.

DATA STRUCTURES: They work!