This is a repository for short and sweet examples and links for useful pandas recipes. We encourage users to add to this documentation.

Adding interesting links and/or inline examples to this section is a great First Pull Request.

Simplified, condensed, new-user friendly, in-line examples have been inserted where possible to augment the Stack-Overflow and GitHub links. Many of the links contain expanded information, above what the in-line examples offer.

Pandas (pd) and Numpy (np) are the only two abbreviated imported modules. The rest are kept explicitly imported for newer users.

These examples are written for python 3.4. Minor tweaks might be necessary for earlier python versions.

Idioms¶ These are some neat pandas idioms if-then/if-then-else on one column, and assignment to another one or more columns: In [1]: df = pd . DataFrame ( ...: { 'AAA' : [ 4 , 5 , 6 , 7 ], 'BBB' : [ 10 , 20 , 30 , 40 ], 'CCC' : [ 100 , 50 , - 30 , - 50 ]}); df ...: Out[1]: AAA BBB CCC 0 4 10 100 1 5 20 50 2 6 30 -30 3 7 40 -50 if-then...¶ An if-then on one column In [2]: df . ix [ df . AAA >= 5 , 'BBB' ] = - 1 ; df Out[2]: AAA BBB CCC 0 4 10 100 1 5 -1 50 2 6 -1 -30 3 7 -1 -50 An if-then with assignment to 2 columns: In [3]: df . ix [ df . AAA >= 5 ,[ 'BBB' , 'CCC' ]] = 555 ; df Out[3]: AAA BBB CCC 0 4 10 100 1 5 555 555 2 6 555 555 3 7 555 555 Add another line with different logic, to do the -else In [4]: df . ix [ df . AAA < 5 ,[ 'BBB' , 'CCC' ]] = 2000 ; df Out[4]: AAA BBB CCC 0 4 2000 2000 1 5 555 555 2 6 555 555 3 7 555 555 Or use pandas where after you’ve set up a mask In [5]: df_mask = pd . DataFrame ({ 'AAA' : [ True ] * 4 , 'BBB' : [ False ] * 4 , 'CCC' : [ True , False ] * 2 }) In [6]: df . where ( df_mask , - 1000 ) Out[6]: AAA BBB CCC 0 4 -1000 2000 1 5 -1000 -1000 2 6 -1000 555 3 7 -1000 -1000 if-then-else using numpy’s where() In [7]: df = pd . DataFrame ( ...: { 'AAA' : [ 4 , 5 , 6 , 7 ], 'BBB' : [ 10 , 20 , 30 , 40 ], 'CCC' : [ 100 , 50 , - 30 , - 50 ]}); df ...: Out[7]: AAA BBB CCC 0 4 10 100 1 5 20 50 2 6 30 -30 3 7 40 -50 In [8]: df [ 'logic' ] = np . where ( df [ 'AAA' ] > 5 , 'high' , 'low' ); df Out[8]: AAA BBB CCC logic 0 4 10 100 low 1 5 20 50 low 2 6 30 -30 high 3 7 40 -50 high Splitting¶ Split a frame with a boolean criterion In [9]: df = pd . DataFrame ( ...: { 'AAA' : [ 4 , 5 , 6 , 7 ], 'BBB' : [ 10 , 20 , 30 , 40 ], 'CCC' : [ 100 , 50 , - 30 , - 50 ]}); df ...: Out[9]: AAA BBB CCC 0 4 10 100 1 5 20 50 2 6 30 -30 3 7 40 -50 In [10]: dflow = df [ df . AAA <= 5 ] In [11]: dfhigh = df [ df . AAA > 5 ] In [12]: dflow ; dfhigh Out[12]: AAA BBB CCC 2 6 30 -30 3 7 40 -50 Building Criteria¶ Select with multi-column criteria In [13]: df = pd . DataFrame ( ....: { 'AAA' : [ 4 , 5 , 6 , 7 ], 'BBB' : [ 10 , 20 , 30 , 40 ], 'CCC' : [ 100 , 50 , - 30 , - 50 ]}); df ....: Out[13]: AAA BBB CCC 0 4 10 100 1 5 20 50 2 6 30 -30 3 7 40 -50 ...and (without assignment returns a Series) In [14]: newseries = df . loc [( df [ 'BBB' ] < 25 ) & ( df [ 'CCC' ] >= - 40 ), 'AAA' ]; newseries Out[14]: 0 4 1 5 Name: AAA, dtype: int64 ...or (without assignment returns a Series) In [15]: newseries = df . loc [( df [ 'BBB' ] > 25 ) | ( df [ 'CCC' ] >= - 40 ), 'AAA' ]; newseries ; ...or (with assignment modifies the DataFrame.) In [16]: df . loc [( df [ 'BBB' ] > 25 ) | ( df [ 'CCC' ] >= 75 ), 'AAA' ] = 0.1 ; df Out[16]: AAA BBB CCC 0 0.1 10 100 1 5.0 20 50 2 0.1 30 -30 3 0.1 40 -50 Select rows with data closest to certain value using argsort In [17]: df = pd . DataFrame ( ....: { 'AAA' : [ 4 , 5 , 6 , 7 ], 'BBB' : [ 10 , 20 , 30 , 40 ], 'CCC' : [ 100 , 50 , - 30 , - 50 ]}); df ....: Out[17]: AAA BBB CCC 0 4 10 100 1 5 20 50 2 6 30 -30 3 7 40 -50 In [18]: aValue = 43.0 In [19]: df . ix [( df . CCC - aValue ) . abs () . argsort ()] Out[19]: AAA BBB CCC 1 5 20 50 0 4 10 100 2 6 30 -30 3 7 40 -50 Dynamically reduce a list of criteria using a binary operators In [20]: df = pd . DataFrame ( ....: { 'AAA' : [ 4 , 5 , 6 , 7 ], 'BBB' : [ 10 , 20 , 30 , 40 ], 'CCC' : [ 100 , 50 , - 30 , - 50 ]}); df ....: Out[20]: AAA BBB CCC 0 4 10 100 1 5 20 50 2 6 30 -30 3 7 40 -50 In [21]: Crit1 = df . AAA <= 5.5 In [22]: Crit2 = df . BBB == 10.0 In [23]: Crit3 = df . CCC > - 40.0 One could hard code: In [24]: AllCrit = Crit1 & Crit2 & Crit3 ...Or it can be done with a list of dynamically built criteria In [25]: CritList = [ Crit1 , Crit2 , Crit3 ] In [26]: AllCrit = functools . reduce ( lambda x , y : x & y , CritList ) In [27]: df [ AllCrit ] Out[27]: AAA BBB CCC 0 4 10 100

Missing Data¶ The missing data docs. Fill forward a reversed timeseries In [85]: df = pd . DataFrame ( np . random . randn ( 6 , 1 ), index = pd . date_range ( '2013-08-01' , periods = 6 , freq = 'B' ), columns = list ( 'A' )) In [86]: df . ix [ 3 , 'A' ] = np . nan In [87]: df Out[87]: A 2013-08-01 -1.054874 2013-08-02 -0.179642 2013-08-05 0.639589 2013-08-06 NaN 2013-08-07 1.906684 2013-08-08 0.104050 In [88]: df . reindex ( df . index [:: - 1 ]) . ffill () Out[88]: A 2013-08-08 0.104050 2013-08-07 1.906684 2013-08-06 1.906684 2013-08-05 0.639589 2013-08-02 -0.179642 2013-08-01 -1.054874 cumsum reset at NaN values Replace¶ Using replace with backrefs

Timedeltas¶ The Timedeltas docs. Using timedeltas In [175]: s = pd . Series ( pd . date_range ( '2012-1-1' , periods = 3 , freq = 'D' )) In [176]: s - s . max () Out[176]: 0 -2 days 1 -1 days 2 0 days dtype: timedelta64[ns] In [177]: s . max () - s Out[177]: 0 2 days 1 1 days 2 0 days dtype: timedelta64[ns] In [178]: s - datetime . datetime ( 2011 , 1 , 1 , 3 , 5 ) Out[178]: 0 364 days 20:55:00 1 365 days 20:55:00 2 366 days 20:55:00 dtype: timedelta64[ns] In [179]: s + datetime . timedelta ( minutes = 5 ) Out[179]: 0 2012-01-01 00:05:00 1 2012-01-02 00:05:00 2 2012-01-03 00:05:00 dtype: datetime64[ns] In [180]: datetime . datetime ( 2011 , 1 , 1 , 3 , 5 ) - s Out[180]: 0 -365 days +03:05:00 1 -366 days +03:05:00 2 -367 days +03:05:00 dtype: timedelta64[ns] In [181]: datetime . timedelta ( minutes = 5 ) + s Out[181]: 0 2012-01-01 00:05:00 1 2012-01-02 00:05:00 2 2012-01-03 00:05:00 dtype: datetime64[ns] Adding and subtracting deltas and dates In [182]: deltas = pd . Series ([ datetime . timedelta ( days = i ) for i in range ( 3 ) ]) In [183]: df = pd . DataFrame ( dict ( A = s , B = deltas )); df Out[183]: A B 0 2012-01-01 0 days 1 2012-01-02 1 days 2 2012-01-03 2 days In [184]: df [ 'New Dates' ] = df [ 'A' ] + df [ 'B' ]; In [185]: df [ 'Delta' ] = df [ 'A' ] - df [ 'New Dates' ]; df Out[185]: A B New Dates Delta 0 2012-01-01 0 days 2012-01-01 0 days 1 2012-01-02 1 days 2012-01-03 -1 days 2 2012-01-03 2 days 2012-01-05 -2 days In [186]: df . dtypes Out[186]: A datetime64[ns] B timedelta64[ns] New Dates datetime64[ns] Delta timedelta64[ns] dtype: object Another example Values can be set to NaT using np.nan, similar to datetime In [187]: y = s - s . shift (); y Out[187]: 0 NaT 1 1 days 2 1 days dtype: timedelta64[ns] In [188]: y [ 1 ] = np . nan ; y Out[188]: 0 NaT 1 NaT 2 1 days dtype: timedelta64[ns]

Aliasing Axis Names¶ To globally provide aliases for axis names, one can define these 2 functions: In [189]: def set_axis_alias ( cls , axis , alias ): .....: if axis not in cls . _AXIS_NUMBERS : .....: raise Exception ( "invalid axis [ %s ] for alias [ %s ]" % ( axis , alias )) .....: cls . _AXIS_ALIASES [ alias ] = axis .....: In [190]: def clear_axis_alias ( cls , axis , alias ): .....: if axis not in cls . _AXIS_NUMBERS : .....: raise Exception ( "invalid axis [ %s ] for alias [ %s ]" % ( axis , alias )) .....: cls . _AXIS_ALIASES . pop ( alias , None ) .....: In [191]: set_axis_alias ( pd . DataFrame , 'columns' , 'myaxis2' ) In [192]: df2 = pd . DataFrame ( np . random . randn ( 3 , 2 ), columns = [ 'c1' , 'c2' ], index = [ 'i1' , 'i2' , 'i3' ]) In [193]: df2 . sum ( axis = 'myaxis2' ) Out[193]: i1 0.239786 i2 0.259018 i3 0.163470 dtype: float64 In [194]: clear_axis_alias ( pd . DataFrame , 'columns' , 'myaxis2' )