Introduction

Every once in a while it is useful to take a step back and look at pandas' functions and see if there is a new or better way to do things. I was recently working on a problem and noticed that pandas had a Grouper function that I had never used before. I looked into how it can be used, and it turns out it is useful for the type of summary analysis I tend to do on a frequent basis. In addition to functions that have been around a while, pandas continues to provide new and improved capabilities with every release. The updated agg function is another very useful and intuitive tool for summarizing data. This article will walk through how and why you may want to use the Grouper and agg functions on your own data. Along the way, I will include a few tips and tricks on how to use them most effectively.

Grouping Time Series Data

Pandas' origins are in the financial industry, so it should not be a surprise that it has robust capabilities to manipulate and summarize time series data. Just look at the extensive time series documentation to get a feel for all the options. I encourage you to review it so that you are aware of the concepts.

In order to illustrate this particular concept better, I will walk through an example of sales data and some simple operations to get total sales by month, day, year, etc. For this example, I'll use my trusty transaction data that I've used in other articles. You can follow along in the notebook as well.

```python
import pandas as pd

df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True")
df["date"] = pd.to_datetime(df['date'])
df.head()
```

```
   account number                         name       sku  quantity  unit price  ext price                date
0          740150                   Barton LLC  B1-20000        39       86.69    3380.91 2014-01-01 07:21:51
1          714466              Trantow-Barrows  S2-77896        -1       63.16     -63.16 2014-01-01 10:00:47
2          218895                    Kulas Inc  B1-69924        23       90.70    2086.10 2014-01-01 13:24:58
3          307599  Kassulke, Ondricka and Metz  S1-65481        41       21.05     863.05 2014-01-01 15:05:22
4          412290                Jerde-Hilpert  S2-34077         6       83.21     499.26 2014-01-01 23:26:55
```

Before I go much further, it's useful to become familiar with Offset Aliases. These strings are used to represent various common time frequencies like days vs. weeks vs. years. I always forget what these are called and how to use the more esoteric ones, so make sure to bookmark the link!

For example, if you were interested in summarizing all of the sales by month, you could use the resample function. The tricky part about using resample is that it only operates on an index. In this data set, the data is not indexed by the date column, so resample would not work without restructuring the data. In order to make it work, use set_index to make the date column an index and then resample:

```python
df.set_index('date').resample('M')["ext price"].sum()
```

```
date
2014-01-31    185361.66
2014-02-28    146211.62
2014-03-31    203921.38
2014-04-30    174574.11
2014-05-31    165418.55
2014-06-30    174089.33
2014-07-31    191662.11
2014-08-31    153778.59
2014-09-30    168443.17
2014-10-31    171495.32
2014-11-30    119961.22
2014-12-31    163867.26
Freq: M, Name: ext price, dtype: float64
```

This is a fairly straightforward way to summarize the data, but it gets a little more challenging if you would like to group the data as well. If we would like to see the monthly results for each customer, then you could do this (results truncated to 20 rows):

```python
df.set_index('date').groupby('name')["ext price"].resample("M").sum()
```

```
name                             date
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
                                 2014-06-30     5379.67
                                 2014-07-31     6020.30
                                 2014-08-31     5399.58
Name: ext price, dtype: float64
```

This certainly works, but it feels a bit clunky. Fortunately Grouper makes this a little more streamlined. Instead of having to play around with reindexing, we can use our normal groupby syntax but provide a little more info on how to group the data in the date column:

```python
df.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum()
```

```
name                             date
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
                                 2014-06-30     5379.67
                                 2014-07-31     6020.30
                                 2014-08-31     5399.58
Name: ext price, dtype: float64
```

Since groupby is one of my standard functions, this approach seems simpler to me and it is more likely to stick in my brain. The nice benefit of this capability is that if you are interested in looking at data summarized in a different time frame, you can just change the freq parameter to one of the valid offset aliases. For instance, an annual summary using December as the last month would look like this:

```python
df.groupby(['name', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()
```

```
name                             date
Barton LLC                       2014-12-31    109438.50
Cronin, Oberbrunner and Spencer  2014-12-31     89734.55
Frami, Hills and Schmidt         2014-12-31    103569.59
Fritsch, Russel and Anderson     2014-12-31    112214.71
Halvorson, Crona and Champlin    2014-12-31     70004.36
Herman LLC                       2014-12-31     82865.00
Jerde-Hilpert                    2014-12-31    112591.43
Kassulke, Ondricka and Metz      2014-12-31     86451.07
Keeling LLC                      2014-12-31    100934.30
Kiehn-Spinka                     2014-12-31     99608.77
Koepp Ltd                        2014-12-31    103660.54
Kuhn-Gusikowski                  2014-12-31     91094.28
Kulas Inc                        2014-12-31    137351.96
Pollich LLC                      2014-12-31     87347.18
Purdy-Kunde                      2014-12-31     77898.21
Sanford and Sons                 2014-12-31     98822.98
Stokes LLC                       2014-12-31     91535.92
Trantow-Barrows                  2014-12-31    123381.38
White-Trantow                    2014-12-31    135841.99
Will LLC                         2014-12-31    104437.60
Name: ext price, dtype: float64
```

If your annual sales were on a non-calendar basis, the summary can be easily changed by modifying the freq parameter. I encourage you to play around with different offsets to get a feel for how it works.
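To make the non-calendar case concrete, here is a minimal sketch on a tiny invented frame (the company names and amounts are made up for illustration, not taken from the sample data set). The alias 'A-JUN' closes each annual bucket at the end of June, i.e. a July-to-June fiscal year:

```python
import pandas as pd

# Tiny invented stand-in for the article's sales data
demo = pd.DataFrame({
    "name": ["Barton LLC", "Barton LLC", "Will LLC", "Will LLC"],
    "date": pd.to_datetime(["2014-02-15", "2014-07-01",
                            "2014-03-20", "2014-09-09"]),
    "ext price": [100.0, 200.0, 50.0, 75.0],
})

# 'A-JUN' ends each annual period in June, so the 2014-07-01 and
# 2014-09-09 rows fall into the fiscal year ending 2015-06-30
fiscal = demo.groupby(['name', pd.Grouper(key='date', freq='A-JUN')])['ext price'].sum()
print(fiscal)
```

The same pattern works for quarters ('Q-JUN') or any other anchored offset alias.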
This is incredibly handy when summarizing time series data. To put it in perspective, try doing the same thing in Excel. It is certainly possible (using pivot tables and custom grouping), but I do not think it is nearly as intuitive as the pandas approach.
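As a quick sanity check of the equivalence described above, here is a small sketch on invented data showing that the set_index/resample route and the Grouper route produce the same monthly totals:

```python
import pandas as pd

# Three invented transactions: two in January, one in February
tiny = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-05", "2014-01-20", "2014-02-10"]),
    "ext price": [10.0, 20.0, 30.0],
})

# Route 1: make the date column the index, then resample monthly
via_resample = tiny.set_index('date').resample('M')['ext price'].sum()

# Route 2: keep the frame as-is and let Grouper do the bucketing
via_grouper = tiny.groupby(pd.Grouper(key='date', freq='M'))['ext price'].sum()

print(via_resample.equals(via_grouper))
```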

New and improved aggregate function

In pandas 0.20.1, a new agg function was added that makes it a lot simpler to summarize data in a manner similar to the groupby API. To illustrate the functionality, let's say we need to get the total of the ext price and quantity columns as well as the average of the unit price. The process is not very convenient:

```python
df[["ext price", "quantity"]].sum()
```

```
ext price    2018784.32
quantity       36463.00
dtype: float64
```

```python
df["unit price"].mean()
```

```
55.007526666666664
```

This works, but it's a bit messy. The new agg makes this simpler:

```python
df[["ext price", "quantity", "unit price"]].agg(['sum', 'mean'])
```

```
         ext price      quantity    unit price
sum   2.018784e+06  36463.000000  82511.290000
mean  1.345856e+03     24.308667     55.007527
```

The results are good, but including the sum of the unit price is not really that useful. Fortunately we can pass a dictionary to agg and specify what operations to apply to each column:

```python
df.agg({'ext price': ['sum', 'mean'],
        'quantity': ['sum', 'mean'],
        'unit price': ['mean']})
```

```
          quantity     ext price  unit price
mean     24.308667  1.345856e+03   55.007527
sum   36463.000000  2.018784e+06         NaN
```

I find this approach really handy when I want to summarize several columns of data. In the past, I would run the individual calculations and build up the resulting dataframe a row at a time. It was tedious. This is a much better approach.

As an added bonus, you can define your own functions. For instance, I frequently find myself needing to aggregate data with a mode-like function that works on text. I found a lambda function that uses value_counts to do what I need and frequently use this get_max function:

```python
get_max = lambda x: x.value_counts(dropna=False).index[0]
```

Then, if I want to include the most frequent sku in my summary table:

```python
df.agg({'ext price': ['sum', 'mean'],
        'quantity': ['sum', 'mean'],
        'unit price': ['mean'],
        'sku': [get_max]})
```

```
              quantity       sku     ext price  unit price
<lambda>           NaN  S2-77896           NaN         NaN
mean         24.308667       NaN  1.345856e+03   55.007527
sum       36463.000000       NaN  2.018784e+06         NaN
```

This is pretty cool, but there is one thing that has always bugged me about this approach: the row label says "<lambda>". Ideally I want it to say "most frequent." In the past I'd jump through some hoops to rename it. But, when working on this article, I stumbled on another approach: explicitly defining the name of the lambda function.

```python
get_max.__name__ = "most frequent"
```

Now, when I do the aggregation:

```python
df.agg({'ext price': ['sum', 'mean'],
        'quantity': ['sum', 'mean'],
        'unit price': ['mean'],
        'sku': [get_max]})
```

```
                   quantity       sku     ext price  unit price
most frequent           NaN  S2-77896           NaN         NaN
mean              24.308667       NaN  1.345856e+03   55.007527
sum            36463.000000       NaN  2.018784e+06         NaN
```

I get a much nicer label! It's a small thing, but I am definitely glad I finally figured it out.

As a final bonus, here's one other trick. The aggregate function using a dictionary is useful, but one challenge is that it does not preserve order. If you want to make sure your columns are in a specific order, you can use an OrderedDict:

```python
import collections

f = collections.OrderedDict([('ext price', ['sum', 'mean']),
                             ('quantity', ['sum', 'mean']),
                             ('sku', [get_max])])
df.agg(f)
```

```
                  ext price      quantity       sku
mean           1.345856e+03     24.308667       NaN
most frequent           NaN           NaN  S2-77896
sum            2.018784e+06  36463.000000       NaN
```
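One note on ordering: on Python 3.7 and later, a plain dict preserves insertion order, and recent pandas versions respect that order in agg, so the OrderedDict step is mostly a legacy concern. A small sketch on invented data (the column names mirror the article's, but the values are made up):

```python
import pandas as pd

# Invented mini transactions table
mini = pd.DataFrame({
    "ext price": [100.0, 200.0, 300.0],
    "quantity": [1, 2, 3],
    "sku": ["S2-77896", "S2-77896", "B1-20000"],
})

# Same mode-style helper as in the article, with a readable label
get_max = lambda x: x.value_counts(dropna=False).index[0]
get_max.__name__ = "most frequent"

# A plain dict keeps its insertion order on Python 3.7+,
# so the result columns come out as ext price, quantity, sku
summary = mini.agg({'ext price': ['sum', 'mean'],
                    'quantity': ['sum'],
                    'sku': [get_max]})
print(summary.columns.tolist())
```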