03 Jan 2018

I often use R’s dplyr package for exploratory data analysis and data manipulation. In addition to providing a consistent set of functions that one can use to solve the most common data manipulation problems, dplyr also allows one to write elegant, chainable data manipulation code using pipes.

Now, Python is my main language and pandas is my Swiss Army knife for data analysis, yet I often wished there was a Python package that allowed dplyr-style data manipulation directly on pandas DataFrames. I searched the Internet and found a package called dfply, developed by Kiefer Katovich. Like dplyr, dfply also allows chaining of multiple operations with pipe operators.

This post will focus on the core functions of the dfply package and show how to use them to manipulate pandas DataFrames. The complete source code and dataset are available on GitHub.

Getting Started

The first thing we need to do is install the package using pip.

pip install dfply

According to the project’s GitHub repo, dfply only works with Python 3, so ensure you have the right version of Python installed.

Data

To explore the functionality of dfply, we will use the same data used by the Introduction to dplyr vignette. The data is from the Bureau of Transportation Statistics, and it contains information about all 336,776 flights that departed from New York City in 2013.

from dfply import *
import pandas as pd

flight_data = pd.read_csv('nycflights13.csv')
flight_data.head()

year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
0 2013 1 1 517.0 515 2.0 830.0 819 11.0 UA 1545 N14228 EWR IAH 227.0 1400 5 15 2013-01-01 05:00:00
1 2013 1 1 533.0 529 4.0 850.0 830 20.0 UA 1714 N24211 LGA IAH 227.0 1416 5 29 2013-01-01 05:00:00
2 2013 1 1 542.0 540 2.0 923.0 850 33.0 AA 1141 N619AA JFK MIA 160.0 1089 5 40 2013-01-01 05:00:00
3 2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 B6 725 N804JB JFK BQN 183.0 1576 5 45 2013-01-01 05:00:00
4 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL 461 N668DN LGA ATL 116.0 762 6 0 2013-01-01 06:00:00

Piping

Let’s say you want to perform n discrete transformation operations on your dataset before outputting the final result. The most common way is to perform the operations step by step and store the result of each step in a variable. The variable holding the intermediate result is then used in the next step of the transformation pipeline. Let’s take a look at an abstract example.

# 'original_data' could be a pandas DataFrame.
result_1 = transformation_1(original_data, *args, **kwargs)
result_2 = transformation_2(result_1, *args, **kwargs)
result_3 = transformation_3(result_2, *args, **kwargs)
...
final_result = transformation_n(result_n-1, *args, **kwargs)

This isn’t very elegant code and it can get confusing and messy to write. This is where piping comes to the rescue. Piping allows us to rewrite the above code without needing those intermediate variables.

final_result = original_data -->
    transformation_1(*args, **kwargs) -->
    transformation_2(*args, **kwargs) -->
    transformation_3(*args, **kwargs) -->
    ... -->
    transformation_n(*args, **kwargs)

Magic?! No, it isn’t. Piping works by implicitly making the output of one stage the input of the following stage. In other words, each transformation step works on the transformed result of its previous step.
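To see that there really is no magic involved, here is a minimal sketch of how such a pipe operator can be built in plain Python with operator overloading. The Pipeable class and the add/times helpers below are hypothetical illustrations, not dfply's actual implementation.

```python
class Pipeable:
    """Wraps a function so it can be applied with the >> operator."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __rrshift__(self, data):
        # 'data >> Pipeable(f, ...)' calls f(data, ...): the left-hand
        # side becomes the first argument of the wrapped function.
        return self.func(data, *self.args, **self.kwargs)

def add(n):
    return Pipeable(lambda x, n: x + n, n)

def times(n):
    return Pipeable(lambda x, n: x * n, n)

result = 5 >> add(3) >> times(2)  # (5 + 3) * 2
print(result)  # 16
```

Each stage receives the output of the previous stage, which is exactly the behaviour the abstract pipeline above describes.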

Piping with dfply

dfply allows chaining multiple operations on a pandas DataFrame with the >> operator. One can chain operations and assign the final output (a pandas DataFrame, since dfply works directly on DataFrames) to a variable. In dfply, the DataFrame result of each step of a chain of operations is represented by X .

For example, if you want to select three columns from a DataFrame in one step, drop the third column in the next step, and then show the first three rows of the final DataFrame, you could do something like this:

# 'data' is the original pandas DataFrame
(data >>
 select(X.first_col, X.second_col, X.third_col) >>
 drop(X.third_col) >>
 head(3))

select and drop are both dfply transformation functions, while X represents the result of each transformation step.
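For comparison, the same three-step chain can be written in plain pandas method-chaining style. The DataFrame below is a made-up stand-in with the hypothetical column names from the example above.

```python
import pandas as pd

# A made-up DataFrame matching the column names in the example.
data = pd.DataFrame({'first_col': range(5), 'second_col': range(5),
                     'third_col': range(5), 'fourth_col': range(5)})

result = (data[['first_col', 'second_col', 'third_col']]  # like select()
          .drop(columns=['third_col'])                    # like drop()
          .head(3))
print(result.shape)  # (3, 2)
```

pandas can chain too, but each step uses a different access pattern; dfply's appeal is that every step reads the same way.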

Exploring some of dfply’s transformation methods

dfply provides a set of functions for selecting and dropping columns, subsetting and filtering rows, grouping data, and reshaping data, to name a few.

Select and drop columns with select() and drop()

Occasionally, you will work on datasets with a lot of columns, of which only a subset will be of interest; select() allows you to select these columns.

For example, to select the origin , dest , and hour columns in the flight_data DataFrame we loaded earlier, we do:

(flight_data >> select(X.origin, X.dest, X.hour))

origin dest hour
0 EWR IAH 5
1 LGA IAH 5
2 JFK MIA 5
3 JFK BQN 5
4 LGA ATL 6

drop() is the inverse of select() . It returns all the columns except those passed in as arguments.

For example, to get all the columns except the year , month , and day columns:

(flight_data >> drop(X.year, X.month, X.day))
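In plain pandas, select() and drop() correspond roughly to column indexing and DataFrame.drop. The small DataFrame below is a made-up stand-in for flight_data, for illustration only.

```python
import pandas as pd

# Tiny stand-in for flight_data (illustration only).
df = pd.DataFrame({
    'year': [2013, 2013],
    'month': [1, 1],
    'day': [1, 1],
    'origin': ['EWR', 'LGA'],
    'dest': ['IAH', 'ATL'],
    'hour': [5, 6],
})

# select(X.origin, X.dest, X.hour) in plain pandas:
selected = df[['origin', 'dest', 'hour']]

# drop(X.year, X.month, X.day) in plain pandas:
dropped = df.drop(columns=['year', 'month', 'day'])

print(list(dropped.columns))  # ['origin', 'dest', 'hour']
```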

You can also drop columns inside select() by putting a tilde ~ in front of the column(s) you wish to drop.

For example, to select all but the hour and minute columns in the flight_data DataFrame:

(flight_data >> select(~X.hour, ~X.minute))

Filter rows with mask()

mask() allows you to select a subset of rows in a pandas DataFrame based on logical criteria. It returns all the rows where the criteria are true.

For example, to select all flights longer than 10 hours that originated from JFK airport on January 1:

(flight_data >> mask(X.month == 1, X.day == 1, X.origin == 'JFK', X.hour > 10))

year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
151 2013 1 1 848.0 1835 853.0 1001.0 1950 851.0 MQ 3944 N942MQ JFK BWI 41.0 184 18 35 2013-01-01 18:00:00
258 2013 1 1 1059.0 1100 -1.0 1210.0 1215 -5.0 MQ 3792 N509MQ JFK DCA 50.0 213 11 0 2013-01-01 11:00:00
265 2013 1 1 1111.0 1115 -4.0 1222.0 1226 -4.0 B6 24 N279JB JFK BTV 52.0 266 11 15 2013-01-01 11:00:00
266 2013 1 1 1112.0 1100 12.0 1440.0 1438 2.0 UA 285 N517UA JFK SFO 364.0 2586 11 0 2013-01-01 11:00:00
272 2013 1 1 1124.0 1100 24.0 1435.0 1431 4.0 B6 641 N590JB JFK SFO 349.0 2586 11 0 2013-01-01 11:00:00
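In plain pandas, the same filter is a boolean-indexing expression in which the criteria are combined with &. Sketched on a made-up stand-in DataFrame:

```python
import pandas as pd

# Tiny stand-in for flight_data (illustration only).
df = pd.DataFrame({
    'month': [1, 1, 2],
    'day': [1, 2, 1],
    'origin': ['JFK', 'JFK', 'JFK'],
    'hour': [11, 12, 13],
})

# Equivalent of mask(X.month == 1, X.day == 1, X.origin == 'JFK', X.hour > 10):
subset = df[(df['month'] == 1) & (df['day'] == 1) &
            (df['origin'] == 'JFK') & (df['hour'] > 10)]
print(len(subset))  # 1
```

Note that each comparison must be parenthesised, because & binds more tightly than == in Python.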

Sort rows with arrange()

arrange() allows you to order rows based on one or multiple columns; the default behaviour is to sort the rows in ascending order.

For example, to sort by distance and then by the number of hours the flights take, we do:

(flight_data >> arrange(X.distance, X.hour))

year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
275945 2013 7 27 NaN 106 NaN NaN 245 NaN US 1632 NaN EWR LGA NaN 17 1 6 2013-07-27 01:00:00
3083 2013 1 4 1240.0 1200 40.0 1333.0 1306 27.0 EV 4193 N14972 EWR PHL 30.0 80 12 0 2013-01-04 12:00:00
3901 2013 1 5 1155.0 1200 -5.0 1241.0 1306 -25.0 EV 4193 N14902 EWR PHL 29.0 80 12 0 2013-01-05 12:00:00
3426 2013 1 4 1829.0 1615 134.0 1937.0 1721 136.0 EV 4502 N15983 EWR PHL 28.0 80 16 15 2013-01-04 16:00:00
10235 2013 1 12 1613.0 1617 -4.0 1708.0 1722 -14.0 EV 4616 N11150 EWR PHL 36.0 80 16 17 2013-01-12 16:00:00

To sort in descending order, you set the ascending keyword argument of arrange() to False , like this:

(flight_data >> arrange(X.distance, X.hour, ascending=False))
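In plain pandas, arrange() corresponds to DataFrame.sort_values, which takes the same ascending flag. A sketch on a made-up stand-in DataFrame:

```python
import pandas as pd

# Tiny stand-in for flight_data (illustration only).
df = pd.DataFrame({'distance': [80, 17, 80], 'hour': [16, 1, 12]})

# arrange(X.distance, X.hour) in plain pandas (ascending by default):
asc = df.sort_values(['distance', 'hour'])

# arrange(X.distance, X.hour, ascending=False):
desc = df.sort_values(['distance', 'hour'], ascending=False)

print(list(asc['hour']))   # [1, 12, 16]
print(list(desc['hour']))  # [16, 12, 1]
```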

Add new columns with mutate()

mutate() allows you to create new columns in the DataFrame. The new columns can be composed from existing columns.

For example, let’s create two new columns: one by dividing the distance column by 1000 , and the other by concatenating the carrier and origin columns. We will name these new columns new_distance and carrier_origin respectively.

(flight_data >> mutate(new_distance=X.distance / 1000, carrier_origin=X.carrier + X.origin))

year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier ... tailnum origin dest air_time distance hour minute time_hour carrier_origin new_distance
0 2013 1 1 517.0 515 2.0 830.0 819 11.0 UA ... N14228 EWR IAH 227.0 1400 5 15 2013-01-01 05:00:00 UAEWR 1.400
1 2013 1 1 533.0 529 4.0 850.0 830 20.0 UA ... N24211 LGA IAH 227.0 1416 5 29 2013-01-01 05:00:00 UALGA 1.416
2 2013 1 1 542.0 540 2.0 923.0 850 33.0 AA ... N619AA JFK MIA 160.0 1089 5 40 2013-01-01 05:00:00 AAJFK 1.089
3 2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 B6 ... N804JB JFK BQN 183.0 1576 5 45 2013-01-01 05:00:00 B6JFK 1.576
4 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL ... N668DN LGA ATL 116.0 762 6 0 2013-01-01 06:00:00 DLLGA 0.762

The newly created columns will be at the end of the DataFrame.
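In plain pandas, mutate() corresponds roughly to DataFrame.assign, which also appends the new columns at the end. A sketch on a made-up stand-in DataFrame:

```python
import pandas as pd

# Tiny stand-in for flight_data (illustration only).
df = pd.DataFrame({'carrier': ['UA', 'DL'],
                   'origin': ['EWR', 'LGA'],
                   'distance': [1400, 762]})

# mutate(new_distance=..., carrier_origin=...) in plain pandas:
out = df.assign(new_distance=df['distance'] / 1000,
                carrier_origin=df['carrier'] + df['origin'])

print(list(out['carrier_origin']))  # ['UAEWR', 'DLLGA']
```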

Group and ungroup data with group_by() and ungroup()

group_by() allows you to group the DataFrame by one or multiple columns. Functions chained after group_by() are applied on the group until the DataFrame is ungrouped with the ungroup() function. For example, to group the data by the originating airport, we do:

(flight_data >> group_by(X.origin))

Summarise data using summarize()

summarize() is typically used together with group_by() to reduce each group into a single row summary. In other words, the output will have one row for each group. For example, to calculate the mean distance for flights originating from every airport, we do:

(flight_data >>
 group_by(X.origin) >>
 summarize(mean_distance=X.distance.mean()))

origin mean_distance
0 EWR 1056.742790
1 JFK 1266.249077
2 LGA 779.835671
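In plain pandas, the group_by()/summarize() pair corresponds to groupby followed by an aggregation. A sketch on a made-up stand-in DataFrame:

```python
import pandas as pd

# Tiny stand-in for flight_data (illustration only).
df = pd.DataFrame({'origin': ['EWR', 'EWR', 'JFK'],
                   'distance': [1400, 1416, 1089]})

# group_by(X.origin) >> summarize(mean_distance=X.distance.mean())
# in plain pandas, using named aggregation:
summary = (df.groupby('origin', as_index=False)
             .agg(mean_distance=('distance', 'mean')))
print(summary)
```

Named aggregation keeps the one-row-per-group shape and lets you name the summary column, just as summarize() does.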

Bringing it all together with pipes

Let’s say you want to perform the following operations on the flights data:

[Step 1]: Filter out all flights less than 10 hours

[Step 2]: Create a new column, speed, using the formula [distance / (air time * 60)]

[Step 3]: Calculate the mean speed for flights originating from each airport

[Step 4]: Sort the result by mean speed in descending order

We will write the operations using dfply’s piping operator >>. We won’t have to use intermediate variables to save the result of each step.

(flight_data >>
 mask(X.hour > 10) >>                               # step 1
 mutate(speed=X.distance / (X.air_time * 60)) >>    # step 2
 group_by(X.origin) >>                              # step 3a
 summarize(mean_speed=X.speed.mean()) >>            # step 3b
 arrange(X.mean_speed, ascending=False))            # step 4

origin mean_speed
0 EWR 0.109777
1 JFK 0.109427
2 LGA 0.107362

If we use pandas data manipulation functions instead of dfply’s, our code will look something like this:

flight_data.loc[flight_data['hour'] > 10, 'speed'] = flight_data['distance'] / (flight_data['air_time'] * 60)
result = flight_data.groupby('origin', as_index=False)['speed'].mean()
result.sort_values('speed', ascending=False)

I find the dfply version easier to read and understand than the pandas version.

Conclusion

This is by no means an exhaustive coverage of the functionality of the dfply package. The package documentation is really good and I advise that you check it out to learn more.

If you have suggestions or questions, please drop a comment in the comment section below. You can also send me an email at hello [at] allenkunle [dot] me or tweet at me @allenakinkunle, and I will reply as soon as I can.

The complete source code for this blog post is available on GitHub. Thank you for reading, and please don’t forget to share.