Python Geocoder

Python offers a plethora of libraries to analyse geospatial data. Common ones are shapely, geopy and geopandas. We use shapely and geopandas quite a bit to perform geocoding using GeoJSON shapefiles. Shapefiles can also be analysed and visualised using geopandas; see the illustration below.

Visualisation of the UK County GeoJSON using GeoPandas
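To give a flavour of how such a plot is produced, here is a minimal sketch with geopandas. The two toy polygons are invented stand-ins for real county boundaries, which would normally be loaded from a GeoJSON shapefile with gpd.read_file; plotting requires matplotlib.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Two toy "county" polygons, standing in for boundaries that would be
# read from a GeoJSON shapefile with gpd.read_file(...)
gdf = gpd.GeoDataFrame(
    {"name": ["county_a", "county_b"]},
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
              Polygon([(1, 0), (2, 0), (2, 1), (1, 1)])],
    crs="EPSG:4326",
)

# Render the county boundaries
ax = gdf.plot(edgecolor="black")
```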

The following code snippet gives an example of how we use shapely. Notice that the code is similar to PostGIS: the underlying geometry engine (GEOS) is the same, and spatial joins are performed using R-trees. The same spatial join can also be performed in geopandas, using the geopandas.tools.sjoin function.
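As an illustration of the core operation, here is a minimal point-in-polygon sketch with shapely. The polygon and point are invented for the example; real code would load county polygons from the GeoJSON shapefile.

```python
from shapely.geometry import Point, Polygon

# Toy county boundary; in practice the polygons would come from a
# GeoJSON shapefile
county = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
home = Point(1.0, 1.5)

# Point-in-polygon test, evaluated by the GEOS engine
print(county.contains(home))  # True
```

geopandas.tools.sjoin batches exactly this containment test across two GeoDataFrames, using an R-tree index to prune the candidate pairs before the exact GEOS test is run.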

Another library that’s handy for geocoding is geopy. It relies on web APIs such as Google’s and OSM Nominatim, and can therefore be slow, with the added risk of request throttling. A further limitation of all of these libraries is that they are single-threaded and therefore do not scale well to large inputs.

All of these limitations led me to build a library that works offline and is parallelised. It improves on an existing one built by Richard Penman. Rather than using GeoJSON shapefiles, this library ships with a built-in database of the coordinates of cities around the world, stored in CSV format. The data comes from GeoNames and covers cities with a population greater than 1,000. The purpose of the library is to let you do quick geospatial analysis on your own machine, by exploiting its multi-core capability.

Since the built-in database contains point data of cities that are disjoint, the k-d tree data structure is used to represent it. R-trees would be overkill for the purposes of geocoding since we only need to do a simple nearest neighbour search to find the point (city) closest to the input coordinate. The Euclidean distance function is used to find the nearest neighbour. scipy has a nice single-threaded k-d tree implementation, which I’ve extended to create my own parallelised version. In this k-d tree, the built-in database of cities is stored in shared memory. The list of input coordinates is then split into chunks and separate processes are spawned to query the k-d tree with those inputs. Multiprocessing is used rather than threading to avoid limitations due to the GIL. All the processes write their outputs into shared memory and once they’re all done, the final geocoded output is returned.

The following code snippet shows how the library can be used.

Here’s a performance comparison of the single-threaded k-d tree with the parallelised version on a MacBook Pro with 8 cores. The parallelised k-d tree comes into its own for really large inputs: for 10 million coordinates, it runs twice as fast!

Since its launch, the library has been downloaded over 12,000 times and has received over 1,260 stars on GitHub. It received a huge boost when it got featured on Hacker News (#mymomentinthesun). A C++ wrapper has been written for it, and the library has also been ported to Rust. Here’s me presenting it at the PyData conference in London, circa 2015.

The library has some limitations, since it uses the Euclidean distance for the nearest-neighbour search and a limited database of cities around the world. These problems can be circumvented by loading your own custom data source, provided it’s in the right format. I’ve written a separate blog post on this, here.