Some friends of mine run a business, and want to visualize where their customers are by United States ZIP code. Starting from an Excel file with ZIP codes (customer locations), the program would plot them over a base map. They had previously worked with software that did the same, but used pin markers, which they found visually messy. The goal is an application where they can load a spreadsheet and visualize datasets in different sheets (i.e. a simple interactive GUI rather than a system that generates a set of PNG files).

I thought it would be interesting, so I took the project on. I'm an intermediate-skill Python programmer, with zero experience with GIS or GUI programming. After some false starts, I settled on CartoPy as the drawing tool. I have seen complaints that CartoPy is slow, but I figured it would be "fast enough" (the program is interactive, but not animated, so 5-10 seconds per plot should be fine). I used Carl Colglazier's nice plot of the US Midterm Election Results as my starting point. I downloaded the ZIP code shape files from the US Census Bureau.

Version 1: Individual Plot Method

My first pass at this was pretty simple: create a base plot for the United States, then draw each ZIP code on top of it. It works well, and I could plot a set of ZIP codes. The problem was, it was slow. Like... really slow. My first pass took around an hour to plot every ZIP code in the United States. I know (a) Matplotlib and CartoPy aren't speed demons and (b) 34k shapes is a fair bit of data, but that seemed excessive.

After some tweaks based on Bastian Bechtold's animation speed experiments, I was able to get this down to a few minutes.

    for code in codes:
        zip_shape = self.shape_database.shape_for_zipcodes([code])
        state_abr = self.zipcode_database.state_from_zipcode(code)
        ax = self._ax_from_state(state_abr)
        ax.add_feature(zip_shape,
                       color=ZipPlotter._highlight_color,
                       linewidth=0.00,
                       edgecolor='w')

Running the code with the first method gets the following speeds on my laptop:

Plotting first pass in Small-to-Large order using Individual-Shapes method.

1658 ZIP codes required 13.95 seconds.

6629 ZIP codes required 47.99 seconds.

33144 ZIP codes required 246.42 seconds.



Better, but still too slow for an interactive application.
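To see how the Individual-Shapes method scales, it helps to divide each reported time by the number of ZIP codes drawn. The short calculation below uses only the three timings listed above; the function name `ms_per_zip` is made up for this illustration.

```python
# Timings reported above for the Individual-Shapes method (first pass):
# (number of ZIP codes, seconds)
timings = [(1658, 13.95), (6629, 47.99), (33144, 246.42)]


def ms_per_zip(count, seconds):
    """Average cost of drawing one ZIP code, in milliseconds."""
    return 1000.0 * seconds / count


per_zip = [ms_per_zip(count, seconds) for count, seconds in timings]

for (count, _), cost in zip(timings, per_zip):
    print(f"{count:>6} ZIP codes: {cost:.2f} ms each")
```

The per-ZIP cost stays in the 7 to 8.5 ms range across all three sizes, so the total time grows roughly linearly with the number of shapes. That suggests a fixed cost per shape rather than something that gets worse as the plot fills up.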

Version 2: Group Plot Method

My next idea was that, instead of creating a new Feature for every ZIP code, I could create a single Feature covering every ZIP code at once. Theoretically it's the same result, but maybe something inside would benefit from treating the different shape files as a group.

    # Separate into groups
    codes_co, codes_ak, codes_hi = self.zipcode_database.group_by_region(codes)

    # Create a single shape for each group
    shape_co = self.shape_database.shape_for_zipcodes(codes_co)
    shape_ak = self.shape_database.shape_for_zipcodes(codes_ak)
    shape_hi = self.shape_database.shape_for_zipcodes(codes_hi)

    # Plot them
    self.ax_continental.add_feature(shape_co,
                                    color=ZipPlotter._highlight_color,
                                    linewidth=0.00,
                                    edgecolor='w')
    self.ax_alaska.add_feature(shape_ak,
                               color=ZipPlotter._highlight_color,
                               linewidth=0.00,
                               edgecolor='w')
    self.ax_hawaii.add_feature(shape_hi,
                               color=ZipPlotter._highlight_color,
                               linewidth=0.00,
                               edgecolor='w')
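The body of `group_by_region` isn't shown in the post. As a rough sketch of what such a split might look like, the version below takes a plain dictionary from ZIP code to state abbreviation. That dictionary (`state_of`) is an assumption standing in for the real ZIP code database, and skipping unknown codes is a choice made here, not necessarily what the actual program does.

```python
def group_by_region(codes, state_of):
    """Split ZIP codes into (continental, Alaska, Hawaii) lists.

    `state_of` is a hypothetical dict mapping ZIP code -> state abbreviation.
    Codes missing from the mapping are skipped.
    """
    continental, alaska, hawaii = [], [], []
    for code in codes:
        state = state_of.get(code)
        if state == "AK":
            alaska.append(code)
        elif state == "HI":
            hawaii.append(code)
        elif state is not None:
            continental.append(code)
    return continental, alaska, hawaii


# Tiny illustrative lookup table; "00000" is deliberately not in it.
state_of = {"99501": "AK", "96813": "HI", "10001": "NY", "94105": "CA"}
groups = group_by_region(["10001", "99501", "94105", "96813", "00000"], state_of)
print(groups)
```

Each of the three lists then becomes one shape on its own axes, which is why the plotting code needs only three `add_feature` calls.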

This isn't ideal, because you're stuck with the same color for every ZIP code. That meets the current spec, but missions creep, and they'll soon want something like a heatmap. Anyway, the resulting speeds were an improvement, but not the order-of-magnitude change I was hoping for:

Plotting first pass in Small-to-Large order using Grouped-Shapes method.

1658 ZIP codes required 7.01 seconds.

6629 ZIP codes required 18.70 seconds.

33144 ZIP codes required 94.58 seconds.



Version 2a: Save and Load

Then one time I accidentally plotted the data set a second time in the same program instance, and noticed something interesting:

Plotting second pass in Small-to-Large order using Grouped-Shapes method.

1658 ZIP codes required 0.54 seconds.

6629 ZIP codes required 1.04 seconds.

33144 ZIP codes required 3.45 seconds.



Whoa! The second time these are drawn, the time drops dramatically, finally in the range I was hoping to achieve. Further experiments showed: The first time a given shape file is plotted it is slow, the second time it is fast. I did some profiling and noticed the first iteration was spent in a CartoPy projection step that's not present in the second pass. My suspicion is that CartoPy somehow archives the projections for use later. As long as the projections don't change, it can skip this in further plots. I'm not sure this is what's happening, but that's my suspicion.
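For anyone wanting to repeat the profiling comparison, here is a minimal sketch using the standard library's `cProfile`. Nothing in it touches CartoPy: `project_shape` and `plot_pass` are hypothetical stand-ins, and `lru_cache` plays the part of whatever caching is suspected above.

```python
import cProfile
import pstats
from functools import lru_cache


@lru_cache(maxsize=None)
def project_shape(code):
    """Hypothetical stand-in for an expensive projection step (cached)."""
    return sum(i * i for i in range(20_000)) + code


def plot_pass(codes):
    """Hypothetical stand-in for one plotting pass over a set of ZIP codes."""
    return [project_shape(code) for code in codes]


def profile_pass(codes):
    """Run one pass under the profiler and return its statistics."""
    profiler = cProfile.Profile()
    profiler.enable()
    plot_pass(codes)
    profiler.disable()
    return pstats.Stats(profiler)


def was_called(stats, func_name):
    """True if a Python function with this name shows up in the profile."""
    return any(key[2] == func_name for key in stats.stats)


codes = list(range(200))
first = profile_pass(codes)
second = profile_pass(codes)

print("first pass ran project_shape: ", was_called(first, "project_shape"))
print("second pass ran project_shape:", was_called(second, "project_shape"))
```

In this toy version the expensive function shows up in the first profile and is absent from the second, which is the same kind of signature described above. For a real run, `first.sort_stats("cumulative").print_stats(10)` lists the most expensive calls.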

Trying to capture this stored data, I decided to create a plot and save it using pickle. Then I could reload and get my fast execution times. There are some hiccups to this, specifically (1) you have to restore the canvas, and (2) for some reason, CartoPy changes the axis extents on save and load, so you have to reset them:

    def _set_axis_extents(self):
        self.ax_continental.set_extent([-125, -66.5, 20, 50])
        self.ax_hawaii.set_extent([-155, -165, 20, 15])
        self.ax_alaska.set_extent([-185, -130, 70, 50])

    def restore_canvas(self):
        """
        Restore the canvas after save and load
        :return: None
        """
        if not self.fig:
            return
        dummy = plt.figure()
        new_manager = dummy.canvas.manager
        new_manager.canvas.figure = self.fig
        self.fig.set_canvas(new_manager.canvas)
        self._set_axis_extents()

Except... it didn't help:

Saving plotter and re-loading.

Plotting third pass in Small-to-Large order using Grouped-Shapes method after save and load.

1658 ZIP codes required 6.33 seconds.

6629 ZIP codes required 18.47 seconds.

33144 ZIP codes required 98.06 seconds.



We're right back to our original timing. Whatever is saved in the plot process is lost in the pickle archiving. Maybe setting the extents kills the saved data? No idea.
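One possible explanation, offered strictly as a guess: if the saved work is cached against the geometry objects themselves rather than their contents, a pickle round trip would defeat it, because unpickling builds brand-new objects. The toy below shows only that general effect. The cache, the `project` function, and the shapes are all invented for illustration, and whether CartoPy's internals actually behave this way is an assumption that hasn't been checked here.

```python
import pickle

projected = {}                       # toy cache: id(shape) -> projected data
lookups = {"hit": 0, "miss": 0}


def project(shape):
    """Return the cached projection, computing it on a cache miss."""
    key = id(shape)                  # keyed on object identity, not contents
    if key in projected:
        lookups["hit"] += 1
    else:
        lookups["miss"] += 1
        projected[key] = [(2 * x, 2 * y) for x, y in shape]
    return projected[key]


# Two made-up shapes, each a list of (x, y) points.
shapes = [[(0, 0), (1, 0), (1, 1)], [(2, 2), (3, 2), (3, 3)]]

for shape in shapes:                 # first pass: every lookup misses
    project(shape)
for shape in shapes:                 # second pass: every lookup hits
    project(shape)

# Save and load. The originals stay alive, so the copies get new identities.
restored = pickle.loads(pickle.dumps(shapes))
for shape in restored:               # third pass: misses again
    project(shape)

print(lookups)
```

The restored shapes compare equal to the originals, yet every lookup after the round trip is a miss. If something similar is going on, resetting the extents would be a red herring.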

Version 3: Color Change Method

Well, so much for that. One last thing I wanted to try: The plot-by-group method prevents more interesting use of color (e.g. heat maps). What if we ate the startup cost and just plotted every ZIP code in white, then toggled each ZIP code color to/from background/highlight as the individual plot needs. Startup time is a beast, of course, but we're just changing the color on an existing plot after that. It should be at least as fast as that sweet, sweet second plot. And hey, maybe we can figure out the startup time in some other way (magical thinking).

    # Change the colors according to the new codes
    for code, zip_plot in self._zip_plots.items():
        new_color = ZipPlotter._highlight_color if code in codes else ZipPlotter._background_color

        # Has this color changed?
        if zip_plot._kwargs['facecolor'] == new_color:
            continue

        # Update the color for this ZIP code
        zip_plot._kwargs['edgecolor'] = new_color
        zip_plot._kwargs['facecolor'] = new_color
        zip_plot.pchanged()
        zip_plot.stale = True
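The loop above visits every stored plot and compares colors one by one. A hypothetical alternative is to remember the previous selection and use set differences to find only the ZIP codes whose color has to flip. The helper below (`color_updates`, a name made up here) covers just that bookkeeping. It does not touch any plot objects, and it says nothing about how long the redraw itself takes.

```python
def color_updates(previous, current):
    """Return (to_highlight, to_clear) for moving between two selections.

    to_highlight: codes newly selected, which need the highlight color.
    to_clear:     codes no longer selected, which go back to the background.
    """
    previous, current = set(previous), set(current)
    return current - previous, previous - current


# "94105" is in both selections, so it needs no update at all.
to_highlight, to_clear = color_updates(["10001", "94105"], ["94105", "99501"])
print("highlight:", sorted(to_highlight))
print("clear:    ", sorted(to_clear))
```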

Sadly, no dice:

Testing Color-Change-Zipcode plot.

Plotting first pass in Small-to-Large order using Individual-Shapes with color change method.

1658 ZIP codes required 340.24 seconds.

6629 ZIP codes required 271.93 seconds.

33144 ZIP codes required 278.57 seconds.



First plot takes 6 minutes, which isn't a surprise. What is surprising is that subsequent plots take as long to change the color on existing drawings as drawing the whole thing fresh.

Conclusions

So, here's where I'm at: Barring some new revelation, Version 2 is the best I can do. The first big plot is going to take a couple of minutes, but everything after that should be fast enough. Unfortunately, this means I'm stuck with a single color (or a few specific colors) per plot. Bummer.

Even bigger bummer: If I want to extend this program, I might have to switch to C++. Or maybe a different Python GIS system would solve the problems? It's a big world out there.

I realize I said right at the top "I have seen complaints that CartoPy is slow", so maybe I shouldn't be surprised. However, the second-plot time shows that I wasn't crazy: the speed is in there somewhere, I'm just not sure how to unlock it.

Notes on the code:



Execution requires the matplotlib, cartopy, and uszipcode packages.

The first time the code runs it builds up databases of shapes and ZIP codes, which takes a couple of minutes. This data is then archived using pickle so that further runs only take a few seconds to start up.

The story above and code in the repository are a distillation of three months of experimentation.

The code encapsulates all three plotting variants.

Each run plots three ZIP code groups of increasing size (to test scaling): ~2k ZIP codes, ~7k ZIP codes, and the full set (~34k).

Each plot is performed three times: First plot is just normal plotting. After the first round, the program builds the plots again to show difference in second access times. Finally, the plot data is pickled, cleared, and reloaded to demonstrate possibilities of saving and loading.
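The build-once, pickle-afterward startup mentioned in these notes follows a common pattern. A minimal sketch of it is below, assuming a single cache file. `load_or_build`, `build_zip_database`, and the file name are hypothetical and are not taken from the repository.

```python
import os
import pickle
import tempfile


def load_or_build(path, build):
    """Return the object pickled at `path`, building and saving it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    data = build()
    with open(path, "wb") as handle:
        pickle.dump(data, handle)
    return data


build_calls = []


def build_zip_database():
    """Hypothetical stand-in for the slow first-run database build."""
    build_calls.append(1)
    return {"10001": "NY", "99501": "AK"}


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "zip_database.pickle")
    first = load_or_build(cache_path, build_zip_database)   # builds and saves
    second = load_or_build(cache_path, build_zip_database)  # loads from disk

print("builds performed:", len(build_calls))
```

The slow build runs only when the cache file is missing, so deleting the file is enough to force a rebuild.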

Fun fact #1: ZIP codes are lines, not polygons.

Fun fact #2: The "full" dataset only covers 33144 of the 42000 ZIP codes in existence. Apparently there isn't an easy way to get shapefiles for all of them. Also, some are for US territories that aren't covered in this application.

Notes on the results



The Individual-Plot method also benefits from the second-plot speedup, but not as much (around 2x, not order-of-magnitude). The repo code demonstrates this.

You also see the second-plot speedup if you plot the big cases first. For example, if you reverse the order in Version 2, the small cases operate at second-plot speeds because the shapes were already plotted once.

You might notice that the first plot in Version 3 takes longer than the largest plot in Version 1, even though they seem like the same amount of work (drawing every ZIP code). However, Version 3 first creates an all-white ZIP code plot, then changes the color for the 1658 ZIP codes in that plot. We could combine these steps for a speed boost, but I think it's a dead end. The whole method seems too slow to get the speed I'm seeking.

If you're wondering what exactly got the plot times from hours to minutes in the early stages... so am I. The changes were made months ago when I was just fiddling around with things, and hadn't bothered with a repository yet. I think I was accidentally re-drawing after every ZIP code, but I can't verify.

TL;DR: Plotting large sets of data is slow in CartoPy. Drawing in groups improves speed, and drawing a shape a second time is much faster still. However, the second-plot speedup is lost when a figure is saved and reloaded.