Introduction As many of you know, pandas released version 0.17.0 on October 9th. In typical pandas fashion there are a bunch of updates, bug fixes and new features which I encourage you to read all about here. I do not plan to go through all of the changes but there are a couple of key things that I think will be useful to me in my daily work that I will explore briefly in this article. In addition, I am including a couple of other tips and tricks for pandas that I use on a frequent basis and hope will be useful to you.

Excel MultiIndex Strangely, one of the most exciting things about 0.17 is that MultiIndex supports to_excel again. I say again because it used to work just fine for me (prior to 0.16.2). However in that release (and up to 0.17) the method would raise a NotImplementedError . With 0.17, all works as expected. Yay! In order to illustrate in a little more detail what this supports, here is a short example: from __future__ import print_function import pandas as pd sales_df = pd . read_excel ( 'https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=true' ) sales_df [ "prod_group" ] = sales_df [ "sku" ] . str [ 0 : 2 ] print ( sales_df . head ()) Here is our Data: account number name sku quantity unit price ext price date prod_group 0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51 B1 1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47 S2 2 218895 Kulas Inc B1-69924 23 90.7 2086.1 2014-01-01 13:24:58 B1 3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22 S1 4 412290 Jerde-Hilpert S2-34077 6 83.21 499.26 2014-01-01 23:26:55 S2 We are using our simple sales data again to show purchase history over several months. The prod_group column is just a way to break down the products into different categories based on their SKU. Now, let’s create a pivot table of the data. The key here is that the pivot table creates a MultiIndex , which will cause problems when we try to save to excel. sales_summary = pd . pivot_table ( sales_df , index = [ "name" ], columns = [ "prod_group" ], values = [ "ext price" , "quantity" ]) ext price quantity prod_group B1 S1 S2 B1 S1 S2 name Barton LLC 1171.640278 1306.237600 1647.786190 20.777778 28.040000 28.190476 Cronin, Oberbrunner and Spencer 1387.616842 1342.598571 1302.787407 26.210526 25.428571 23.740741 Frami, Hills and Schmidt 1475.132143 1439.311875 1401.317857 25.071429 31.062500 25.142857 Fritsch, Russel and Anderson 1372.360286 1188.012857 1569.353200 23.714286 26.047619 29.400000 Halvorson, Crona and Champlin 1356.640000 1267.756667 983.963158 25.857143 22.333333 17.842105 All looks good until you try to save this using to_excel sales_summary . to_excel ( 'sampleout.xlsx' ) In prior versions of pandas you’ll see something like this: Traceback ( most recent call last ): File "pandas_version.py" , line 25 , in < module > sales_summary . to_excel ( 'sampleout.xlsx' ) File "/home/chris/miniconda3/envs/pbpython2/lib/python2.7/site-packages/pandas/core/frame.py" , line 1252 , in to_excel raise NotImplementedError ( "Writing as Excel with a MultiIndex is " NotImplementedError : Writing as Excel with a MultiIndex is not yet implemented . Thankfully, pandas 0.17 works again and you get a clean Excel file that you would expect. The other reason I wanted to specifically call this out is that some of the code in my older blog posts stopped working as a result of this change. Now, I can point people here to get the solution - upgrade to pandas 0.17!

Rounding Data One simple (but useful) new feature is the addition of a round method to a DataFrame. Prior to this release, if you wanted to round data in a column you had to use np.round or change the display using display.float_format . In this example, let’s just look at the first 5 rows of data and round two columns: sales_df = sales_df [ 0 : 5 ] sales_df . round ({ 'ext price' : 0 , 'unit price' : 0 }) account number name sku quantity unit price ext price date prod_group 0 740150 Barton LLC B1-20000 39 87 3381 2014-01-01 07:21:51 B1 1 714466 Trantow-Barrows S2-77896 -1 63 -63 2014-01-01 10:00:47 S2 2 218895 Kulas Inc B1-69924 23 91 2086 2014-01-01 13:24:58 B1 3 307599 Kassulke, Ondricka and Metz S1-65481 41 21 863 2014-01-01 15:05:22 S1 4 412290 Jerde-Hilpert S2-34077 6 83 499 2014-01-01 23:26:55 S2 I can control the number of decimal places to round as well as specify which columns to round by including in the dictionary of column names. This is pretty useful.

Troubleshooting Merges In my experience manipulating and combining data, pandas tends to just do what I would expect. One of the areas where I can find myself getting tripped up is when I do a merge of DataFrames. Sometimes I need to play around with whether or not I really want a left, right, outer or inner join. To help troubleshoot these sometimes pesky problems, there is a new indicator argument that can help you figure out why your merge may not be doing what you expect. For this example, let’s add in a description to our data. Here is the sales_groups DataFrame that looks like this: prod_group Desc 0 B1 Belt-Large 1 B2 Belt-Medium 2 S1 Shirt-Large 3 S2 Shirt-Medium 4 H1 Hat-Large If we want to merge the sales_groups data with our sales data, we could do something like this: pd . merge ( sales_df , sales_groups , on = 'prod_group' , how = 'left' , indicator = True ) Notice the _merge column shows that the data is only included if it is in both DataFrames. account number name sku quantity unit price ext price date prod_group Desc _merge 0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51 B1 Belt-Large both 1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47 S2 Shirt-Medium both 2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58 B1 Belt-Large both 3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22 S1 Shirt-Large both 4 412290 Jerde-Hilpert S2-34077 6 83.21 499.26 2014-01-01 23:26:55 S2 Shirt-Medium both Let’s see what happens when we do a right join: pd . merge ( sales_df , sales_groups , on = 'prod_group' , how = 'right' , indicator = True ) account number name sku quantity unit price ext price date prod_group Desc _merge 0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51 B1 Belt-Large both 1 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58 B1 Belt-Large both 2 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47 S2 Shirt-Medium both 3 412290 Jerde-Hilpert S2-34077 6 83.21 499.26 2014-01-01 23:26:55 S2 Shirt-Medium both 4 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22 S1 Shirt-Large both 5 NaN NaN NaN NaN NaN NaN NaN B2 Belt-Medium right_only 6 NaN NaN NaN NaN NaN NaN NaN H1 Hat-Large right_only Given this small dataset, the indicator is not required but you could imagine that if you had thousands of rows, this could be really helpful to make sure you are getting the results you expect.

Sorting API One warning you will probably see after upgrading to pandas 0.17 is something like this: FutureWarning : sort ( columns =.... ) is deprecated , use sort_values ( by =..... ) sales_df . sort ( columns = [ "name" , "sku" ]) To fix it: sales_df . sort_values ( by = [ "name" , "sku" ]) The old syntax will continue to work but it’s best to proactively make these changes now so that upgrades will be simpler. The next couple items are not related to this latest release but include some quick notes on working with pandas DataFrames and Series.

Saving Series to Excel Pandas has a couple of functions I use quite a bit that return a Series of data. One that I use is value_counts . What is sometimes frustrating is that there is no to_excel function for a series. Here is an example of using value_counts (I included all my data but truncated the results for brevity): sales_df [ "sku" ] . value_counts () S2-77896 73 S1-82801 60 S2-10342 59 S1-47412 58 S1-93683 57 B1-38851 56 S2-82423 56 S1-50961 55 S1-30248 55 B1-53636 53 S1-06532 53 S1-27722 53 B1-20000 53 S2-34077 51 S2-83881 51 ........ These results are useful and wouldn’t it be nice to dump it to Excel? Unfortunately when you try, you get a nasty AttributeError: 'Series' object has no attribute to_excel : sales_df [ "sku" ] . value_counts () . to_excel ( "values.xlsx" ) Traceback ( most recent call last ): File "pandas_version.py" , line 50 , in < module > sales_df [ "sku" ] . value_counts () . to_excel ( "values.xlsx" ) File "/home/chris/miniconda3/lib/python3.4/site-packages/pandas/core/generic.py" , line 2246 , in __getattr__ ( type ( self ) . __name__ , name )) AttributeError : 'Series' object has no attribute 'to_excel' A handy shortcut to fix this is to use to_frame to force it to a DataFrame which can be saved: sales_df [ "sku" ] . value_counts () . to_frame ( "frequency" ) . to_excel ( "values.xlsx" ) You can pass one argument to to_frame - the label to use for your data. It is optional but I find it handy to include.