There are times when I want to use split-apply-combine to save the results of a groupby to a json file while preserving the resulting column values as a list. Before we start, let’s import Pandas and generate a dataframe with some example email data





Import Pandas and Create an Email DataFrame







import pandas as pd import numpy as np

df = pd . DataFrame ({ 'Sender' : [ 'Alice' , 'Alice' , 'Bob' , 'Carl' , 'Bob' , 'Alice' ], 'Receiver' : [ 'David' , 'Eric' , 'Frank' , 'Ginger' , 'Holly' , 'Ingrid' ], 'Emails' : [ 9 , 3 , 5 , 1 , 6 , 7 ] }) df







Emails Receiver Sender 0 9 David Alice 1 3 Eric Alice 2 5 Frank Bob 3 1 Ginger Carl 4 6 Holly Bob 5 7 Ingrid Alice







Purpose





Here, the goal is to generate a list of e-mail receivers for each sender. To accomplish this, we can first group the data by sender:





grouped = df . groupby ( 'Sender' )





And if we examine the groups for each sender then we’ll see the row indices that are associated with each sender:





grouped . groups

{ 'Alice' : [ 0 , 1 , 5 ], 'Bob' : [ 2 , 4 ], 'Carl' : [ 3 ]}





We usually follow a groupby call by aggregating data along the other columns. For example:





grouped . aggregate ( np . sum )







Emails Sender Alice 19 Bob 11 Carl 1





However, notice that the e-mail receiver column doesn’t get aggregated since it isn’t obvious how to sum up text together or what that would even mean. In this case, we could provide a custom aggregator that simply concatenates all of the receivers into a single string.





grouped [ 'Receiver' ]. agg ( lambda x : x . sum ())





But how can we end up with a list of receivers?





The Secret Sauce





Well, we could take the custom aggregator a step further and have it return a list instead of a concatenated string.





grouped [ 'Receiver' ]. agg ({ 'Receivers' :( lambda x : list ( x ))})







Receivers Sender Alice [David, Eric, Ingrid] Bob [Frank, Holly] Carl [Ginger]





But note that we’ve excluded the number of e-mails for clarity. Putting it all together, we can incorporate all of the data and do something like this:





grouped . agg ({ 'Receiver' : { 'Receiver_List' : ( lambda x : list ( x ))}, 'Emails' : { 'Total_Emails' : np . sum } })







Emails Receiver Total_Emails Receiver_List Sender Alice 19 [David, Eric, Ingrid] Bob 11 [Frank, Holly] Carl 1 [Ginger]





Above, the agg function takes on the format {column: {name: agg_func}} where column is the dataframe column, name is the column name of the resulting aggregation result, and agg_func is the name of the aggregation function to use. Note that we can simplify things by omitting the name completely and breaking free of the inner dict with:





out = grouped . agg ({ 'Receiver' : ( lambda x : list ( x )), 'Emails' : np . sum }) out







Emails Receiver Sender Alice 19 [David, Eric, Ingrid] Bob 11 [Frank, Holly] Carl 1 [Ginger]





Finally, the dataframe can be written to a json file





out . to_json ( 'easy_as_pie.json' )







{ "Emails" : { "Alice" : 19 , "Bob" : 11 , "Carl" : 1 }, "Receiver" : { "Alice" :[ "David" , "Eric" , "Ingrid" ], "Bob" :[ "Frank" , "Holly" ], "Carl" :[ "Ginger" ]} }







Easy-As-Pie!











Here are some ways to write this output (including when you have integers or floating point values instead of strings) to a jagged file with a space delimiter