If you’ve done any data analysis in Python, you’ve probably run across Pandas, a fantastic analytics library written by Wes McKinney. By bringing dataframe analysis to Python, Pandas has effectively put Python on the same footing as some of the more established analysis tools, such as R or SAS.

Unfortunately, Pandas gained a nasty reputation early on for being “slow”. It’s true that your Pandas code is unlikely to reach the calculation speeds of, say, fully optimized raw C code. The good news is that for most applications, well-written Pandas code is fast enough, and what Pandas lacks in speed it makes up for in being powerful and user-friendly.

In this post, we’ll compare the efficiency of several methods for applying a function to a Pandas DataFrame, from slowest to fastest:

1. Crude looping over DataFrame rows using indices

2. Looping with iterrows()

3. Looping with apply()

4. Vectorization with Pandas series

5. Vectorization with NumPy arrays
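To make the five approaches concrete, here is a rough sketch of what each one looks like on a tiny, made-up DataFrame. The function `scale_and_shift` and the columns `x` and `y` are hypothetical stand-ins chosen only for illustration; they are not the function or data used in the rest of this post.

```python
import numpy as np
import pandas as pd


def scale_and_shift(x, y):
    """Hypothetical example function; any elementwise arithmetic would do."""
    return 2 * x + y


# Made-up data, purely for illustration.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})

# 1. Crude looping over rows using positional indices
crude = [scale_and_shift(df.iloc[i]["x"], df.iloc[i]["y"]) for i in range(len(df))]

# 2. Looping with iterrows(), which yields (index, row) pairs
via_iterrows = [scale_and_shift(row["x"], row["y"]) for _, row in df.iterrows()]

# 3. apply() along axis=1 calls the function once per row
via_apply = df.apply(lambda row: scale_and_shift(row["x"], row["y"]), axis=1)

# 4. Vectorization: pass whole Pandas Series to the function
via_series = scale_and_shift(df["x"], df["y"])

# 5. Vectorization: pass the underlying NumPy arrays instead
via_numpy = scale_and_shift(df["x"].to_numpy(), df["y"].to_numpy())
```

All five produce the same values. The first three call the function once per row, while the last two call it once on entire columns.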

For our example function, we’ll use the Haversine (or Great Circle) distance formula. Our function takes the latitude and longitude of two points, accounts for Earth’s curvature, and calculates the great-circle distance between them, that is, the shortest distance along the Earth’s surface rather than a straight line through it. The function looks something like this:
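A minimal sketch of a Haversine function, assuming NumPy, inputs in decimal degrees, a result in miles, and an approximate mean Earth radius of 3,959 miles. The name `haversine` and its argument order are choices made for this sketch:

```python
import numpy as np


def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points in decimal degrees."""
    earth_radius_miles = 3959  # approximate mean radius (assumption)

    # Trigonometric functions expect radians, so convert first.
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])

    dlat = lat2 - lat1
    dlon = lon2 - lon1

    # Haversine formula: a is the squared half-chord length between the
    # points, c is the angular distance in radians.
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    c = 2 * np.arcsin(np.sqrt(a))

    return earth_radius_miles * c
```

Because it uses only NumPy arithmetic, a function written this way accepts plain floats, Pandas Series, or NumPy arrays, so the same definition can be reused with every approach in the list above.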

To test our function on real data, we’ll use a dataset containing the coordinates of all hotels in New York state, sourced from Expedia’s developer site. We’ll calculate the distance between each hotel and a sample set of coordinates (which happen to belong to a fantastic little shop called the Brooklyn Superhero Supply Store in NYC).

You can download the dataset, and the Jupyter notebook containing the functions used in this blog, here.

This post is based on my PyCon talk, which you can watch here.