I recently discovered the high-performance pandas library for Python while doing data munging in a machine learning project. Using simple examples, I want to highlight my favorite (and sometimes hard to find) features.

Apart from serving as a quick reference, I hope this post will help new users to quickly start extracting value from pandas. For a good overview of pandas and its advanced features, I highly recommend Wes McKinney’s Python for Data Analysis book and the documentation on the website.

Here is my top 10 list:

Example DataFrame

I will use a simple data frame for explanation.

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: from pandas import DataFrame, Series

In [4]: df = DataFrame({'int_col': [1, 2, 6, 8, -1],
   ...:                 'float_col': [0.1, 0.2, 0.2, 10.1, None],
   ...:                 'str_col': ['a', 'b', None, 'c', 'a']})

In [5]: df
Out[5]:
   float_col  int_col str_col
0        0.1        1       a
1        0.2        2       b
2        0.2        6    None
3       10.1        8       c
4        NaN       -1       a

Indexing

Selecting a subset of columns

This is one of the simplest features, but it was surprisingly difficult to find. Suppose you want to index using only the columns float_col and int_col; label-based indexing with loc (the old ix accessor used in the original post has since been removed from pandas) works elegantly, as shown below.

In [6]: df.loc[:, ['float_col', 'int_col']]
Out[6]:
   float_col  int_col
0        0.1        1
1        0.2        2
2        0.2        6
3       10.1        8
4        NaN       -1

EDIT Suggestion by Dan in the comments below. Another (probably more elegant) syntax for selecting multiple columns is given below.

In [9]: df[['float_col', 'int_col']]
Out[9]:
   float_col  int_col
0        0.1        1
1        0.2        2
2        0.2        6
3       10.1        8
4        NaN       -1
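As a side note, when the goal is simply "all numeric columns", the select_dtypes method avoids listing names by hand. A minimal sketch, recreating the example frame so it runs standalone:

```python
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2, 6, 8, -1],
                   'float_col': [0.1, 0.2, 0.2, 10.1, None],
                   'str_col': ['a', 'b', None, 'c', 'a']})

# Select columns by dtype rather than by name; 'number'
# matches both the int and float columns, skipping str_col.
numeric = df.select_dtypes(include='number')
```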

Conditional indexing

One can also select rows using boolean indexing.

In [7]: df[df['float_col'] > 0.15]
Out[7]:
   float_col  int_col str_col
1        0.2        2       b
2        0.2        6    None
3       10.1        8       c

In [8]: df[df['float_col'] == 0.1]
Out[8]:
   float_col  int_col str_col
0        0.1        1       a

EDIT Suggestion by Roby Levy in the comments below. Boolean conditions can be combined with the operators | (or), & (and), and ~ (not), and grouped with parentheses.

In [10]: df[(df['float_col'] > 0.1) & (df['int_col'] > 2)]
Out[10]:
   float_col  int_col str_col
2        0.2        6    None
3       10.1        8       c

In [11]: df[(df['float_col'] > 0.1) | (df['int_col'] > 2)]
Out[11]:
   float_col  int_col str_col
1        0.2        2       b
2        0.2        6    None
3       10.1        8       c

In [12]: df[~(df['float_col'] > 0.1)]
Out[12]:
   float_col  int_col str_col
0        0.1        1       a
4        NaN       -1       a
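The same filters can also be written as strings with the query method, which some people find more readable for long conditions. A small sketch on a recreated example frame:

```python
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2, 6, 8, -1],
                   'float_col': [0.1, 0.2, 0.2, 10.1, None]})

# query() expresses the boolean filter as a string; the words
# 'and'/'or' replace the &/| operators and their parentheses.
result = df.query('float_col > 0.1 and int_col > 2')
```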

Renaming columns

Use the rename method to rename columns. It returns a new DataFrame with the data copied, leaving the original untouched.

In [9]: df2 = df.rename(columns={'int_col': 'some_other_name'})

In [10]: df2
Out[10]:
   float_col  some_other_name str_col
0        0.1                1       a
1        0.2                2       b
2        0.2                6    None
3       10.1                8       c
4        NaN               -1       a

Set the inplace=True flag in case you want to modify the existing DataFrame instead. Note that the call then returns None, so display the DataFrame afterwards to see the result.

In [11]: df2.rename(columns={'some_other_name': 'int_col'}, inplace=True)

In [12]: df2
Out[12]:
   float_col  int_col str_col
0        0.1        1       a
1        0.2        2       b
2        0.2        6    None
3       10.1        8       c
4        NaN       -1       a
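Besides a dict, rename also accepts a function that is applied to every label, which is handy for systematic changes such as upper-casing all column names. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2], 'float_col': [0.1, 0.2]})

# A callable passed to columns= is applied to each column label
df2 = df.rename(columns=str.upper)
```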

Handling missing values

Missing values can be handled beautifully using pandas.

Drop missing values

The dropna method can be used to drop rows or columns with missing data (NaN). By default, it drops all rows containing any missing entry.

In [12]: df2
Out[12]:
   float_col  int_col str_col
0        0.1        1       a
1        0.2        2       b
2        0.2        6    None
3       10.1        8       c
4        NaN       -1       a

In [13]: df2.dropna()
Out[13]:
   float_col  int_col str_col
0        0.1        1       a
1        0.2        2       b
3       10.1        8       c
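When dropping every row with any missing entry is too aggressive, dropna can be restricted to particular columns via the subset parameter. A small sketch on a recreated frame:

```python
import pandas as pd

df = pd.DataFrame({'float_col': [0.1, 0.2, 0.2, 10.1, None],
                   'str_col': ['a', 'b', None, 'c', 'a']})

# Only rows whose str_col is missing are dropped; the NaN in
# float_col (row 4) is left alone.
only_str_checked = df.dropna(subset=['str_col'])
```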

Fill missing values

The fillna method, on the other hand, can be used to fill in missing data (NaN). The example below shows a simple replacement using the mean of the available values.

In [14]: df3 = df.copy()

In [15]: mean = df3['float_col'].mean()

In [16]: df3
Out[16]:
   float_col  int_col str_col
0        0.1        1       a
1        0.2        2       b
2        0.2        6    None
3       10.1        8       c
4        NaN       -1       a

In [17]: df3['float_col'].fillna(mean)
Out[17]:
0     0.10
1     0.20
2     0.20
3    10.10
4     2.65
Name: float_col
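fillna also accepts a dict mapping column names to fill values, so several columns can be filled differently in one call. A sketch on a small example frame:

```python
import pandas as pd

df = pd.DataFrame({'float_col': [0.1, 0.2, None],
                   'str_col': ['a', None, 'c']})

# Each column gets its own replacement: the column mean for the
# numeric column, a sentinel string for the text column.
filled = df.fillna({'float_col': df['float_col'].mean(),
                    'str_col': 'missing'})
```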

Map, Apply

Forget writing for loops when using pandas. One can do beautiful vectorized computation by applying functions over rows and columns using the map, apply, and applymap methods.

map

The map method applies a function to each element of a Series.

In [18]: df['str_col'].dropna().map(lambda x: 'map_' + x)
Out[18]:
0    map_a
1    map_b
3    map_c
4    map_a
Name: str_col

apply

apply is a pretty flexible method which, as the name suggests, applies a function along any axis of the DataFrame. The example shows the application of np.sum over the columns. (Thanks to Mindey in the comments below for suggesting np.sum instead of np.sqrt in the example.)

In [19]: df[['int_col', 'float_col']].apply(np.sum)
Out[19]:
int_col      16.0
float_col    10.6
dtype: float64

applymap

The applymap method can be used to apply a function to each element of the DataFrame.

In [39]: def some_fn(x):
   ....:     if type(x) is str:
   ....:         return 'applymap_' + x
   ....:     elif x:
   ....:         return 100 * x
   ....:     else:
   ....:         return
   ....:

In [40]: df.applymap(some_fn)
Out[40]:
  float_col  int_col     str_col
0        10      100  applymap_a
1        20      200  applymap_b
2        20      600        None
3      1010      800  applymap_c
4       NaN     -100  applymap_a

Vectorized mathematical and string operations

HT: @janschulz

One can perform vectorized calculations using simple operators and numpy functions.

In [4]: df = pd.DataFrame(data={"A": [1, 2], "B": [1.2, 1.3]})

In [5]: df["C"] = df["A"] + df["B"]

In [6]: df
Out[6]:
   A    B    C
0  1  1.2  2.2
1  2  1.3  3.3

In [7]: df["D"] = df["A"] * 3

In [8]: df
Out[8]:
   A    B    C  D
0  1  1.2  2.2  3
1  2  1.3  3.3  6

In [9]: df["E"] = np.sqrt(df["A"])

In [10]: df
Out[10]:
   A    B    C  D         E
0  1  1.2  2.2  3  1.000000
1  2  1.3  3.3  6  1.414214

Also, vectorized string operations are easy to use.

In [11]: df = pd.DataFrame(data={"A": [1, 2], "B": [1.2, 1.3], "Z": ["a", "b"]})

In [12]: df
Out[12]:
   A    B  Z
0  1  1.2  a
1  2  1.3  b

In [13]: df["F"] = df.Z.str.upper()

In [14]: df
Out[14]:
   A    B  Z  F
0  1  1.2  a  A
1  2  1.3  b  B

GroupBy

The groupby method lets you perform SQL-like grouping operations. The example below shows a grouping operation with the str_col column's entries as keys, used to calculate the mean of float_col for each key. For more details, please refer to the split-apply-combine description on the pandas website.

In [41]: grouped = df['float_col'].groupby(df['str_col'])

In [42]: grouped.mean()
Out[42]:
str_col
a     0.1
b     0.2
c    10.1
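A grouped object can also compute several statistics at once via agg, which saves repeated groupby calls. A small sketch on a recreated frame (note that rows whose key is missing are excluded from the groups by default):

```python
import pandas as pd

df = pd.DataFrame({'str_col': ['a', 'b', None, 'c', 'a'],
                   'float_col': [0.1, 0.2, 0.2, 10.1, None]})

# One pass over the groups, two statistics per key; the None
# key (row 2) is dropped, and NaN values are skipped by mean/count.
stats = df.groupby('str_col')['float_col'].agg(['mean', 'count'])
```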

New Columns = f(Existing Columns)

Generating new columns from existing columns in a data frame is an integral part of my workflow. This was one of the hardest parts for me to figure out. I hope these examples will save time and effort for other people.

I will try to illustrate it in a piecemeal manner – multiple columns as a function of a single column, single column as a function of multiple columns, and finally multiple columns as a function of multiple columns.

multiple columns as a function of a single column

I often have to generate multiple DataFrame columns as a function of a single column. This comes in handy especially when a method returns a tuple, as in the example below. Related Stack Overflow question

In [43]: df4 = df.copy()

In [44]: def two_three_strings(x):
   ....:     return x * 2, x * 3
   ....:

In [45]: df4['twice'], df4['thrice'] = zip(*df4['int_col'].map(two_three_strings))

In [46]: df4
Out[46]:
   float_col  int_col str_col  twice  thrice
0        0.1        1       a      2       3
1        0.2        2       b      4       6
2        0.2        6    None     12      18
3       10.1        8       c     16      24
4        NaN       -1       a     -2      -3
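An alternative to the zip(*...) idiom is to have the mapped function return a Series: pandas then expands the result into a DataFrame with one column per key. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2, 6, 8, -1]})

# Series.apply with a function returning a Series yields a
# DataFrame whose columns come from the Series labels.
expanded = df['int_col'].apply(
    lambda x: pd.Series({'twice': x * 2, 'thrice': x * 3}))
```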

single column as a function of multiple columns

It's sometimes useful to generate a single column as a function of multiple existing columns. Related Stack Overflow question

In [47]: df5 = df.copy()

In [48]: def sum_two_cols(series):
   ....:     return series['int_col'] + series['float_col']
   ....:

In [49]: df5['sum_col'] = df5.apply(sum_two_cols, axis=1)

In [50]: df5
Out[50]:
   float_col  int_col str_col  sum_col
0        0.1        1       a      1.1
1        0.2        2       b      2.2
2        0.2        6    None      6.2
3       10.1        8       c     18.1
4        NaN       -1       a      NaN
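For simple combinations like this sum, plain column arithmetic does the same job without apply and is vectorized, which is typically much faster on large frames. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2, 6, 8, -1],
                   'float_col': [0.1, 0.2, 0.2, 10.1, None]})

# Column-wise arithmetic is vectorized; NaN propagates
# automatically, matching the apply-based version.
df['sum_col'] = df['int_col'] + df['float_col']
```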

multiple columns as a function of multiple columns

Finally, a way to generate a new DataFrame with multiple columns based on multiple columns in an existing DataFrame. Related Stack Overflow question

In [52]: def int_float_squares(series):
   ....:     return pd.Series({'int_sq': series['int_col'] ** 2,
   ....:                       'flt_sq': series['float_col'] ** 2})
   ....:

In [53]: df.apply(int_float_squares, axis=1)
Out[53]:
   flt_sq  int_sq
0    0.01       1
1    0.04       4
2    0.04      36
3  102.01      64
4     NaN       1

Stats

Pandas provides nifty methods to understand your data. I am highlighting the describe, covariance, and correlation methods that I use to quickly make sense of my data.

describe

The describe method provides quick stats on all suitable columns.

In [54]: df.describe()
Out[54]:
       float_col   int_col
count    4.00000  5.000000
mean     2.65000  3.200000
std      4.96689  3.701351
min      0.10000 -1.000000
25%      0.17500  1.000000
50%      0.20000  2.000000
75%      2.67500  6.000000
max     10.10000  8.000000
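By default describe skips non-numeric columns; passing include='all' pulls them in too, reporting count, unique, top, and freq for object columns. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2, 6, 8, -1],
                   'str_col': ['a', 'b', None, 'c', 'a']})

# include='all' summarizes every column; numeric statistics are
# NaN for str_col, and vice versa for the categorical rows.
summary = df.describe(include='all')
```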

covariance

The cov method provides the covariance between suitable columns.

In [55]: df.cov()
Out[55]:
           float_col    int_col
float_col  24.670000  12.483333
int_col    12.483333  13.700000

correlation

The corr method provides the correlation between suitable columns.

In [56]: df.corr()
Out[56]:
           float_col   int_col
float_col   1.000000  0.760678
int_col     0.760678  1.000000

Merge and Join

Pandas supports database-like joins, which make it easy to link data frames.

I will use a simple example to highlight joins using the merge command.

In [57]: df
Out[57]:
   float_col  int_col str_col
0        0.1        1       a
1        0.2        2       b
2        0.2        6    None
3       10.1        8       c
4        NaN       -1       a

In [58]: other = DataFrame({'str_col': ['a', 'b'], 'some_val': [1, 2]})

In [59]: other
Out[59]:
   some_val str_col
0         1       a
1         2       b

The inner, outer, left, and right joins are shown below. The data frames are joined on the str_col keys.

In [60]: pd.merge(df, other, on='str_col', how='inner')
Out[60]:
   float_col  int_col str_col  some_val
0        0.1        1       a         1
1        NaN       -1       a         1
2        0.2        2       b         2

In [61]: pd.merge(df, other, on='str_col', how='outer')
Out[61]:
   float_col  int_col str_col  some_val
0        0.1        1       a         1
1        NaN       -1       a         1
2        0.2        2       b         2
3        0.2        6    None       NaN
4       10.1        8       c       NaN

In [62]: pd.merge(df, other, on='str_col', how='left')
Out[62]:
   float_col  int_col str_col  some_val
0        0.1        1       a         1
1        NaN       -1       a         1
2        0.2        2       b         2
3        0.2        6    None       NaN
4       10.1        8       c       NaN

In [63]: pd.merge(df, other, on='str_col', how='right')
Out[63]:
   float_col  int_col str_col  some_val
0        0.1        1       a         1
1        NaN       -1       a         1
2        0.2        2       b         2
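When debugging a join it can be hard to tell which rows actually matched. merge's indicator parameter adds a _merge column recording whether each row came from the left frame, the right frame, or both. A sketch on the same shape of data:

```python
import pandas as pd

df = pd.DataFrame({'str_col': ['a', 'b', None, 'c', 'a'],
                   'int_col': [1, 2, 6, 8, -1]})
other = pd.DataFrame({'str_col': ['a', 'b'], 'some_val': [1, 2]})

# indicator=True records the provenance of every output row;
# the None and 'c' keys have no match in `other`.
merged = pd.merge(df, other, on='str_col', how='outer', indicator=True)
```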

Plot

I was thoroughly surprised by the plotting capabilities of the pandas library. There are several plotting methods available. I am highlighting a couple of simple plots that I use the most.

Let’s start with a simple data frame to plot.

In [3]: plot_df = DataFrame(np.random.randn(1000, 2), columns=['x', 'y'])

In [4]: plot_df['y'] = plot_df['y'].map(lambda x: x + 1)

Plot

A simple plot command goes a long way.

In [5]: plot_df.plot()
Out[5]: <matplotlib.axes.AxesSubplot at 0x10e6cad10>

Histograms

I really enjoy histograms to get a quick idea about the distribution of the data.

In [6]: plot_df.hist()
Out[6]:
array([[Axes(0.125,0.1;0.336957x0.8), Axes(0.563043,0.1;0.336957x0.8)]], dtype=object)

Scikit-learn conversion

This took me a non-trivial amount of time to figure out, and I hope others can avoid this mistake. According to the pandas documentation, the ndarray obtained via the values attribute has object dtype if the values contain more than float and integer dtypes. Even if you slice the str columns away, the resulting array will still have object dtype and might not play well with libraries such as scikit-learn, which expect a float dtype. Explicitly converting the type works well in this scenario.

In [18]: df
Out[18]:
   float_col  int_col str_col
0        0.1        1       a
1        0.2        2       b
2        0.2        6     NaN
3       10.1        8       c
4        NaN       -1       a

In [25]: df.values[:, :-1]
Out[25]:
array([[0.1, 1],
       [0.2, 2],
       [0.2, 6],
       [10.1, 8],
       [nan, -1]], dtype=object)

In [26]: df.values[:, :-1].astype(np.float32)
Out[26]:
array([[  0.1       ,   1.        ],
       [  0.2       ,   2.        ],
       [  0.2       ,   6.        ],
       [ 10.10000038,   8.        ],
       [         nan,  -1.        ]], dtype=float32)
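Newer pandas versions also offer to_numpy, which accepts a dtype directly and avoids the object-dtype detour when applied to the numeric columns. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'float_col': [0.1, 0.2, None],
                   'int_col': [1, 2, 6]})

# to_numpy with an explicit dtype returns a clean float array;
# the missing value comes out as np.nan.
arr = df.to_numpy(dtype=np.float64)
```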

EDIT HT: Wes Turner via comments. The sklearn-pandas library looks great for bridging pandas and scikit-learn.

Summary

I hope these examples will help new users quickly extract a lot of value out of pandas and serve as a useful quick reference for the pandas pros.

Happy munging!