## Introduction

As part of my continued exploration of pandas, I am going to walk through a real world example of how to use pandas to automate a process that could be very difficult to do in Excel. My business problem is that I have two Excel files that are structured similarly but have different data, and I would like to easily understand what has changed between the two files. Basically, I want an Excel diff tool.

Here is a snapshot of the type of data I'm looking at:

| account number | name | street | city | state | postal code |
|---|---|---|---|---|---|
| 935480 | Bruen Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118 |
| 371770 | Cruickshank-Boyer | 839 Lana Expressway Suite 234 | South Viviana | Alabama | 57838 |
| 548367 | Spencer, Grady and Herman | 65387 Lang Circle Apt. 516 | Greenholtbury | Alaska | 58394 |
| 296620 | Schamberger, Hagenes and Brown | 26340 Ferry Neck Apt. 612 | McCulloughstad | Alaska | 74052 |
| 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 62785 |

In this example, I have two customer address lists and I would like to understand:

- which customers are new
- which customers are removed
- which customers have changed information between the two files

You can envision this being fairly useful when auditing changes in a system, or potentially providing a list of changes so your sales team can contact new customers.
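Before diving in, it helps to notice that the three questions above are really set operations on the account numbers, plus a field-by-field comparison for the accounts the two files share. A rough pure-Python sketch of that framing, using a handful of account numbers borrowed from the sample data:

```python
# Hypothetical account-number sets standing in for the two Excel files
old_accounts = {935480, 371770, 548367, 296620}
new_accounts = {935480, 371770, 548367, 121213}

added = new_accounts - old_accounts    # only in the new file
removed = old_accounts - new_accounts  # only in the old file
common = old_accounts & new_accounts   # candidates for changed fields

print(sorted(added))    # [121213]
print(sorted(removed))  # [296620]
```

The pandas solutions below do essentially this, with the extra wrinkle of diffing individual fields for the accounts in `common`.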

## Research

My first thought was that I wanted to evaluate existing tools that could easily perform a diff on two Excel files. I did some Google searching and found a Stack Overflow discussion on the topic. There are some decent solutions in the thread but nothing that I felt would meet my requirements. One of my requirements is that I'd like to make it as easy as possible to replicate for someone who may not be very technically inclined. Before pandas, I might have created a script to loop through each file and do my comparison. However, I thought that I might be able to come up with a better solution using pandas. In hindsight, this was a useful exercise to help me understand more about working with pandas.

Once I decided to work with pandas, I did another search and found a Stack Overflow thread that looked like a good start.

## First Attempt

Like I did in my previous article, I am using an IPython notebook to test out my solution. If you would like to follow along, here are sample-address-1 and sample-address-2.

The first step is my normal imports:

```python
import pandas as pd
import numpy as np
```

Next, read both of our Excel files into DataFrames:

```python
df1 = pd.read_excel('sample-address-1.xlsx', 'Sheet1', na_values=['NA'])
df2 = pd.read_excel('sample-address-2.xlsx', 'Sheet1', na_values=['NA'])
```

Order by account number and reindex so that it stays this way:

```python
df1 = df1.sort(columns="account number")
df1 = df1.reindex()
df2 = df2.sort(columns="account number")
df2 = df2.reindex()
```

Create a diff function to show what the changes are:

```python
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)
```

Merge the two datasets together in a `Panel`. I will admit that I haven't fully grokked the panel concept yet, but the only way to learn is to keep pressing on!

```python
diff_panel = pd.Panel(dict(df1=df1, df2=df2))
```

Once the data is in a panel, we use the `report_diff` function to highlight all the changes. I think this is a very intuitive way (for this data set) to show changes. It is relatively simple to see what the old value is and the new one. For example, someone could easily check and see why the postal code changed for account number 880043.

```python
diff_output = diff_panel.apply(report_diff, axis=0)
diff_output.tail()
```

|    | account number | name | street | city | state | postal code |
|---|---|---|---|---|---|---|
| 95 | 677936 | Hodkiewicz-Koch | 604 Lemke Knoll Suite 661 | East Laurence | Wisconsin | 98576 |
| 96 | 880043 | Beatty Inc | 3641 Schaefer Isle Suite 171 | North Gardnertown | Wyoming | 64318 ---> 64918 |
| 97 | 899885 | Kessler and Sons | 356 Johnson Isle Suite 991 | Casiehaven | Wyoming | 37996 |
| 98 | 704567 | Yundt-Abbott | 8338 Sauer Highway | Jennyfort | Wyoming | 19932 |
| 99 | 880729 | Huels PLC | 695 Labadie Lakes Apt. 256 | Port Orland | Wyoming | 42977 |

One of the things we want to do is flag rows that have changes so they are easier to see. We will create a `has_change` function and use `apply` to run it against each row.

```python
def has_change(row):
    if "--->" in row.to_string():
        return "Y"
    else:
        return "N"

diff_output['has_change'] = diff_output.apply(has_change, axis=1)
diff_output.tail()
```

|    | account number | name | street | city | state | postal code | has_change |
|---|---|---|---|---|---|---|---|
| 95 | 677936 | Hodkiewicz-Koch | 604 Lemke Knoll Suite 661 | East Laurence | Wisconsin | 98576 | N |
| 96 | 880043 | Beatty Inc | 3641 Schaefer Isle Suite 171 | North Gardnertown | Wyoming | 64318 ---> 64918 | Y |
| 97 | 899885 | Kessler and Sons | 356 Johnson Isle Suite 991 | Casiehaven | Wyoming | 37996 | N |
| 98 | 704567 | Yundt-Abbott | 8338 Sauer Highway | Jennyfort | Wyoming | 19932 | N |
| 99 | 880729 | Huels PLC | 695 Labadie Lakes Apt. 256 | Port Orland | Wyoming | 42977 | N |

It is simple to show all the rows with a change:

```python
diff_output[(diff_output.has_change == 'Y')]
```

|    | account number | name | street | city | state | postal code | has_change |
|---|---|---|---|---|---|---|---|
| 24 | 595932 | Kuhic, Eichmann and West | 4059 Tobias Inlet ---> 4059 Tobias St | New Rylanfurt | Illinois | 89271 | Y |
| 30 | 558879 | Watsica Group | 95616 Enos Grove Suite 139 ---> 829 Big street | West Atlas ---> Smithtown | Iowa ---> Ohio | 47419 ---> 47919 | Y |
| 96 | 880043 | Beatty Inc | 3641 Schaefer Isle Suite 171 | North Gardnertown | Wyoming | 64318 ---> 64918 | Y |

Finally, let's write it out to an Excel file:

```python
diff_output[(diff_output.has_change == 'Y')].to_excel('my-diff.xlsx')
```

Here is a simple program that does what I've just shown:

```python
import pandas as pd
import numpy as np

# Define the diff function to show the changes in each field
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

# We want to be able to easily tell which rows have changes
def has_change(row):
    if "--->" in row.to_string():
        return "Y"
    else:
        return "N"

# Read in both excel files
df1 = pd.read_excel('sample-address-1.xlsx', 'Sheet1', na_values=['NA'])
df2 = pd.read_excel('sample-address-2.xlsx', 'Sheet1', na_values=['NA'])

# Make sure we order by account number so the comparisons work
df1 = df1.sort(columns="account number")
df1 = df1.reindex()
df2 = df2.sort(columns="account number")
df2 = df2.reindex()

# Create a panel of the two dataframes
diff_panel = pd.Panel(dict(df1=df1, df2=df2))

# Apply the diff function
diff_output = diff_panel.apply(report_diff, axis=0)

# Flag all the changes
diff_output['has_change'] = diff_output.apply(has_change, axis=1)

# Save the changes to excel but only include the columns we care about
diff_output[(diff_output.has_change == 'Y')].to_excel(
    'my-diff-1.xlsx', index=False,
    columns=["account number", "name", "street", "city", "state", "postal code"])
```
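A side note for anyone following along on a newer pandas release: `Panel` was deprecated and later removed, so the snippet above will not run as-is on recent versions. One possible Panel-free sketch of the same cell-by-cell diff, using small hypothetical inline frames in place of the Excel files:

```python
import pandas as pd

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

# Tiny stand-in frames; the article reads these from Excel instead
df1 = pd.DataFrame({'account number': [880043, 899885],
                    'postal code': ['64318', '37996']})
df2 = pd.DataFrame({'account number': [880043, 899885],
                    'postal code': ['64918', '37996']})

# Build the diff column by column by zipping the two aligned frames
diff_output = pd.DataFrame(
    {col: [report_diff(pair) for pair in zip(df1[col], df2[col])]
     for col in df1.columns})

print(diff_output)
```

This relies on the same assumption the Panel did: both frames have identical shape and row order, which is exactly the assumption that breaks down in the next section.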

## Scaling Up

I have to be honest, I was feeling pretty good, so I decided to run this on a more complex data set and see what happened. I'll spare you the steps but show you the output:

|    | account number | name | street | city | state | postal code |
|---|---|---|---|---|---|---|
| 19 | 878977.0 ---> 869125 | Swift PLC ---> Wiza LLC | 5605 Hodkiewicz Views ---> 9824 Noemi Harbors | Summerfurt ---> North Tristin | Vermont ---> Maine | 98029.0 ---> 98114 |
| 20 | 880043.0 ---> 875910 | Beatty Inc ---> Lowe, Tremblay and Bruen | 3641 Schaefer Isle Suite 171 ---> 3722 Tatyana… | North Gardnertown ---> Selmafurt | Wyoming ---> NorthDakota | 64318.0 ---> 17496 |
| 21 | 880729.0 ---> 878977 | Huels PLC ---> Swift PLC | 695 Labadie Lakes Apt. 256 ---> 5605 Hodkiewic… | Port Orland ---> Summerfurt | Wyoming ---> Vermont | 42977.0 ---> 98029 |
| 22 | nan ---> 880043 | nan ---> Beatty Inc | nan ---> 3641 Schaefer Isle Suite 171 | nan ---> North Gardnertown | nan ---> Wyoming | nan ---> 64318 |
| 23 | nan ---> 880729 | nan ---> Huels PLC | nan ---> 695 Labadie Lakes Apt. 256 | nan ---> Port Orland | nan ---> Wyoming | nan ---> 42977 |

Hmmm. This isn't going to work, is it? I am going to rethink this and see if I can come up with an approach that will scale on a bigger data set.
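The garbage above happens because this approach compares rows by position: as soon as one file gains or loses a row, every row after that point is paired with the wrong partner. A tiny sketch, using a few account numbers from the output above, shows the effect:

```python
# Account numbers from the old and new files; the new file gained two rows
old_rows = [878977, 880043, 880729]
new_rows = [869125, 875910, 878977, 880043, 880729]

# Pairing by position, the way the Panel diff effectively does,
# misaligns every account after the insertion point
for old_acct, new_acct in zip(old_rows, new_rows):
    print(old_acct, '--->', new_acct)
# 878977 ---> 869125
# 880043 ---> 875910
# 880729 ---> 878977
```

The leftover new rows with no positional partner become the `nan ---> …` entries. The fix, as the next section shows, is to align rows by account number rather than by position.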

## Second Attempt

I will use a similar approach, but build it out to show more details on the changes and make the solution more robust for bigger data sets. Here are the data sets for those interested: sample-address-new and sample-address-old.

Start with the standard imports:

```python
import pandas as pd
import numpy as np
```

We will define our `report_diff` function like we did in the previous exercise:

```python
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)
```

Read in the values in the two different sheets:

```python
old = pd.read_excel('sample-address-old.xlsx', 'Sheet1', na_values=['NA'])
new = pd.read_excel('sample-address-new.xlsx', 'Sheet1', na_values=['NA'])
```

Label the two data sets so that when we combine them, we know which is which:

```python
old['version'] = "old"
new['version'] = "new"
```

We can look at the data to see what the format looks like and how many records we ended up with.

```python
new.head()
```

|   | account number | name | street | city | state | postal code | version |
|---|---|---|---|---|---|---|---|
| 0 | 935480 | Bruen and Jones Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118 | new |
| 1 | 371770 | Cruickshank-Boyer | 839 Lana Expressway Suite 234 | South Viviana | Alabama | 57838 | new |
| 2 | 548367 | Spencer, Grady and Herman | 65387 Lang Circle Apt. 516 | Greenholtbury | Alaska | 58394 | new |
| 3 | 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 6278 | new |
| 4 | 985603 | Bosco-Upton | 89 Big Street | Small Town | Texas | 19033 | new |

```python
old.head()
```

|   | account number | name | street | city | state | postal code | version |
|---|---|---|---|---|---|---|---|
| 0 | 935480 | Bruen Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118 | old |
| 1 | 371770 | Cruickshank-Boyer | 839 Lana Expressway Suite 234 | South Viviana | Alabama | 57838 | old |
| 2 | 548367 | Spencer, Grady and Herman | 65387 Lang Circle Apt. 516 | Greenholtbury | Alaska | 58394 | old |
| 3 | 296620 | Schamberger, Hagenes and Brown | 26340 Ferry Neck Apt. 612 | McCulloughstad | Alaska | 74052 | old |
| 4 | 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 62785 | old |

The old file contains 22 records and the new one 24:

```python
len(old)
# 22
len(new)
# 24
```

We will add all the data together into a new table:

```python
full_set = pd.concat([old, new], ignore_index=True)
```

As expected, the full set includes 46 records:

```python
full_set.count()
```

```
account number    46
name              46
street            46
city              46
state             46
postal code       46
version           46
dtype: int64
```

```python
full_set.head(5)
```

|   | account number | name | street | city | state | postal code | version |
|---|---|---|---|---|---|---|---|
| 0 | 935480 | Bruen Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118 | old |
| 1 | 371770 | Cruickshank-Boyer | 839 Lana Expressway Suite 234 | South Viviana | Alabama | 57838 | old |
| 2 | 548367 | Spencer, Grady and Herman | 65387 Lang Circle Apt. 516 | Greenholtbury | Alaska | 58394 | old |
| 3 | 296620 | Schamberger, Hagenes and Brown | 26340 Ferry Neck Apt. 612 | McCulloughstad | Alaska | 74052 | old |
| 4 | 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 62785 | old |

```python
full_set.tail(5)
```

|    | account number | name | street | city | state | postal code | version |
|---|---|---|---|---|---|---|---|
| 41 | 869125 | Wiza LLC | 9824 Noemi Harbors | North Tristin | Maine | 98114 | new |
| 42 | 875910 | Lowe, Tremblay and Bruen | 3722 Tatyana Springs Apt. 464 | Selmafurt | NorthDakota | 17496 | new |
| 43 | 878977 | Swift PLC | 5605 Hodkiewicz Views | Summerfurt | Vermont | 98029 | new |
| 44 | 880043 | Beatty Inc | 3641 Schaefer Isle Suite 171 | North Gardnertown | Wyoming | 64318 | new |
| 45 | 880729 | Huels PLC | 695 Labadie Lakes Apt. 256 | Port Orland | Wyoming | 42977 | new |

We use `drop_duplicates` to get rid of the obvious rows where there has not been any change. Note that we keep the last one using `take_last=True` so we can tell which accounts have been removed in the new data set. One interesting note about `drop_duplicates`: you can specify which columns you care about. This functionality is really useful if you have extra columns (say, sales or notes) that you expect to change but don't really care about for these purposes.

```python
changes = full_set.drop_duplicates(
    subset=["account number", "name", "street", "city", "state", "postal code"],
    take_last=True)
```

We have cut down our data set to 28 records:

```python
len(changes)
# 28
```

Sort and take a look at what the data looks like. If you look at account number 132971, you can get an idea of how the data is structured.

```python
changes.sort(columns="account number").head(5)
```

|    | account number | name | street | city | state | postal code | version |
|---|---|---|---|---|---|---|---|
| 27 | 121213 | Bauch-Goldner | 7274 Marissa Common | Shanahanchester | California | 49681 | new |
| 4 | 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 62785 | old |
| 25 | 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 6278 | new |
| 28 | 214098 | Goodwin, Homenick and Jerde | 649 Cierra Forks Apt. 078 | Rosaberg | Colorado | 47743 | new |
| 3 | 296620 | Schamberger, Hagenes and Brown | 26340 Ferry Neck Apt. 612 | McCulloughstad | Alaska | 74052 | old |

Use the `get_duplicates` function to get a list of all the account numbers which are duplicated:

```python
dupe_accts = changes.set_index('account number').index.get_duplicates()
dupe_accts
# [132971, 935480, 985603]
```

Get all the dupes into one frame using `isin`:

```python
dupes = changes[changes["account number"].isin(dupe_accts)]
dupes
```

|    | account number | name | street | city | state | postal code | version |
|---|---|---|---|---|---|---|---|
| 0 | 935480 | Bruen Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118 | old |
| 4 | 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 62785 | old |
| 5 | 985603 | Bosco-Upton | 03369 Moe Way | Port Casandra | Arkansas | 86014 | old |
| 22 | 935480 | Bruen and Jones Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118 | new |
| 25 | 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 6278 | new |
| 26 | 985603 | Bosco-Upton | 89 Big Street | Small Town | Texas | 19033 | new |

We need two data frames of the same size, so split them into a new and an old version.
```python
change_new = dupes[(dupes["version"] == "new")]
change_old = dupes[(dupes["version"] == "old")]
```

Drop the version columns since we don't need them any more:

```python
change_new = change_new.drop(['version'], axis=1)
change_old = change_old.drop(['version'], axis=1)
change_old
```

|   | account number | name | street | city | state | postal code |
|---|---|---|---|---|---|---|
| 0 | 935480 | Bruen Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118 |
| 4 | 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 62785 |
| 5 | 985603 | Bosco-Upton | 03369 Moe Way | Port Casandra | Arkansas | 86014 |

Index on the account number:

```python
change_new.set_index('account number', inplace=True)
change_new
```

| account number | name | street | city | state | postal code |
|---|---|---|---|---|---|
| 935480 | Bruen and Jones Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118 |
| 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 6278 |
| 985603 | Bosco-Upton | 89 Big Street | Small Town | Texas | 19033 |

```python
change_old.set_index('account number', inplace=True)
change_old
```

| account number | name | street | city | state | postal code |
|---|---|---|---|---|---|
| 935480 | Bruen Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118 |
| 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 62785 |
| 985603 | Bosco-Upton | 03369 Moe Way | Port Casandra | Arkansas | 86014 |

Run our diff process like we did in our first attempt, now that we have the data structured the way we need:

```python
diff_panel = pd.Panel(dict(df1=change_old, df2=change_new))
diff_output = diff_panel.apply(report_diff, axis=0)
diff_output
```

| account number | name | street | city | state | postal code |
|---|---|---|---|---|---|
| 935480 | Bruen Group ---> Bruen and Jones Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118 |
| 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 62785 ---> 6278 |
| 985603 | Bosco-Upton | 03369 Moe Way ---> 89 Big Street | Port Casandra ---> Small Town | Arkansas ---> Texas | 86014 ---> 19033 |

Looks pretty good!
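For what it's worth, once the two frames share the account-number index like this, recent pandas versions (1.1+) can also do the side-by-side comparison natively with `DataFrame.compare`, which could stand in for the Panel step on versions where Panel no longer exists. A sketch with cut-down stand-ins for the frames above:

```python
import pandas as pd

# Cut-down stand-ins for change_old / change_new, indexed by account number
change_old = pd.DataFrame(
    {'postal code': ['62785', '86014']},
    index=pd.Index([132971, 985603], name='account number'))
change_new = pd.DataFrame(
    {'postal code': ['6278', '19033']},
    index=pd.Index([132971, 985603], name='account number'))

# Each differing cell becomes a (self, other) column pair
diff = change_old.compare(change_new)
print(diff)
```

Note the difference in presentation: `compare` only reports cells that differ, as paired "self"/"other" columns, rather than the single `old ---> new` strings that `report_diff` produces.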
We know our diff; now we need to figure out which accounts were removed in the new list. We need to find records from the "old" version that are no longer in the "new" version.

```python
changes['duplicate'] = changes["account number"].isin(dupe_accts)
removed_accounts = changes[(changes["duplicate"] == False) &
                           (changes["version"] == "old")]
removed_accounts
```

|   | account number | name | street | city | state | postal code | version | duplicate |
|---|---|---|---|---|---|---|---|---|
| 3 | 296620 | Schamberger, Hagenes and Brown | 26340 Ferry Neck Apt. 612 | McCulloughstad | Alaska | 74052 | old | False |

The final portion is figuring out which accounts are new. We will go back to the full set and take only the first duplicate row.

```python
new_account_set = full_set.drop_duplicates(
    subset=["account number", "name", "street", "city", "state", "postal code"],
    take_last=False)
new_account_set.head()
```

|   | account number | name | street | city | state | postal code | version |
|---|---|---|---|---|---|---|---|
| 0 | 935480 | Bruen Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118 | old |
| 1 | 371770 | Cruickshank-Boyer | 839 Lana Expressway Suite 234 | South Viviana | Alabama | 57838 | old |
| 2 | 548367 | Spencer, Grady and Herman | 65387 Lang Circle Apt. 516 | Greenholtbury | Alaska | 58394 | old |
| 3 | 296620 | Schamberger, Hagenes and Brown | 26340 Ferry Neck Apt. 612 | McCulloughstad | Alaska | 74052 | old |
| 4 | 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 62785 | old |

Add a duplicate column again:

```python
new_account_set['duplicate'] = new_account_set["account number"].isin(dupe_accts)
new_account_set.head()
```

|   | account number | name | street | city | state | postal code | version | duplicate |
|---|---|---|---|---|---|---|---|---|
| 0 | 935480 | Bruen Group | 5131 Nienow Viaduct Apt. 290 | Port Arlie | Alabama | 14118 | old | True |
| 1 | 371770 | Cruickshank-Boyer | 839 Lana Expressway Suite 234 | South Viviana | Alabama | 57838 | old | False |
| 2 | 548367 | Spencer, Grady and Herman | 65387 Lang Circle Apt. 516 | Greenholtbury | Alaska | 58394 | old | False |
| 3 | 296620 | Schamberger, Hagenes and Brown | 26340 Ferry Neck Apt. 612 | McCulloughstad | Alaska | 74052 | old | False |
| 4 | 132971 | Williamson, Schumm and Hettinger | 89403 Casimer Spring | Jeremieburgh | Arkansas | 62785 | old | True |

We want to find the accounts that aren't duplicated and are only in the new data set:

```python
added_accounts = new_account_set[(new_account_set["duplicate"] == False) &
                                 (new_account_set["version"] == "new")]
```

Let's look at all the new accounts we have added:

```python
added_accounts
```

|    | account number | name | street | city | state | postal code | version | duplicate |
|---|---|---|---|---|---|---|---|---|
| 27 | 121213 | Bauch-Goldner | 7274 Marissa Common | Shanahanchester | California | 49681 | new | False |
| 28 | 214098 | Goodwin, Homenick and Jerde | 649 Cierra Forks Apt. 078 | Rosaberg | Colorado | 47743 | new | False |
| 29 | 566618 | Greenfelder, Wyman and Harris | 17557 Romaguera Field | South Tamica | Colorado | 50037 | new | False |

Finally, we can save all of this into three different sheets in an Excel file:

```python
writer = pd.ExcelWriter("my-diff-2.xlsx")
diff_output.to_excel(writer, "changed")
removed_accounts.to_excel(writer, "removed", index=False,
                          columns=["account number", "name", "street", "city",
                                   "state", "postal code"])
added_accounts.to_excel(writer, "added", index=False,
                        columns=["account number", "name", "street", "city",
                                 "state", "postal code"])
writer.save()
```

Here is the full streamlined code example:

```python
import pandas as pd
import numpy as np

# Define the diff function to show the changes in each field
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

# Read in the two files, call the data old and new, and create columns to track
old = pd.read_excel('sample-address-old.xlsx', 'Sheet1', na_values=['NA'])
new = pd.read_excel('sample-address-new.xlsx', 'Sheet1', na_values=['NA'])
old['version'] = "old"
new['version'] = "new"

# Join all the data together and ignore indexes so it all gets added
full_set = pd.concat([old, new], ignore_index=True)

# Let's see what changes in the main columns we care about
changes = full_set.drop_duplicates(
    subset=["account number", "name", "street", "city", "state", "postal code"],
    take_last=True)

# We want to know where the duplicate account numbers are;
# that means there have been changes
dupe_accts = changes.set_index('account number').index.get_duplicates()

# Get all the duplicate rows
dupes = changes[changes["account number"].isin(dupe_accts)]

# Pull out the old and new data into separate dataframes
change_new = dupes[(dupes["version"] == "new")]
change_old = dupes[(dupes["version"] == "old")]

# Drop the temp columns - we don't need them now
change_new = change_new.drop(['version'], axis=1)
change_old = change_old.drop(['version'], axis=1)

# Index on the account numbers
change_new.set_index('account number', inplace=True)
change_old.set_index('account number', inplace=True)

# Now we can diff because we have two data sets of the same size with the same index
diff_panel = pd.Panel(dict(df1=change_old, df2=change_new))
diff_output = diff_panel.apply(report_diff, axis=0)

# Diffing is done; we need to get a list of removed items
# Flag all duplicated account numbers
changes['duplicate'] = changes["account number"].isin(dupe_accts)

# Identify non-duplicated items that are in the old version
# and did not show in the new version
removed_accounts = changes[(changes["duplicate"] == False) &
                           (changes["version"] == "old")]

# We have the old and the diff; we need to figure out which ones are new
# Drop duplicates but keep the first item instead of the last
new_account_set = full_set.drop_duplicates(
    subset=["account number", "name", "street", "city", "state", "postal code"],
    take_last=False)

# Identify dupes in this new dataframe
new_account_set['duplicate'] = new_account_set["account number"].isin(dupe_accts)

# Identify added accounts
added_accounts = new_account_set[(new_account_set["duplicate"] == False) &
                                 (new_account_set["version"] == "new")]

# Save the changes to excel but only include the columns we care about
writer = pd.ExcelWriter("my-diff-2.xlsx")
diff_output.to_excel(writer, "changed")
removed_accounts.to_excel(writer, "removed", index=False,
                          columns=["account number", "name", "street", "city",
                                   "state", "postal code"])
added_accounts.to_excel(writer, "added", index=False,
                        columns=["account number", "name", "street", "city",
                                 "state", "postal code"])
writer.save()
```

Here is the final output Excel file: my-diff-2
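As a closing aside for readers on newer pandas versions (where `take_last=True` is spelled `keep='last'` and `Index.get_duplicates` is gone), much of the added/removed bookkeeping above can be pushed onto `merge` with `indicator=True`. This is a hedged sketch of that alternative, not the article's method, and it uses small hypothetical inline frames in place of the Excel files:

```python
import pandas as pd

# Tiny stand-ins for the old and new address files
old = pd.DataFrame({'account number': [296620, 132971],
                    'postal code': ['74052', '62785']})
new = pd.DataFrame({'account number': [132971, 121213],
                    'postal code': ['6278', '49681']})

# An outer merge keyed on account number, with a _merge indicator column
merged = old.merge(new, on='account number', how='outer',
                   suffixes=('_old', '_new'), indicator=True)

# _merge classifies each account for us
added = merged[merged['_merge'] == 'right_only']
removed = merged[merged['_merge'] == 'left_only']
changed = merged[(merged['_merge'] == 'both') &
                 (merged['postal code_old'] != merged['postal code_new'])]

print(added['account number'].tolist())  # [121213]
```

The `_merge` column does in one pass what the `drop_duplicates`/`isin` dance does above, at the cost of getting the old and new fields as suffixed column pairs instead of a single diffed cell.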

## Conclusion

I would not be surprised if someone looks at this and finds a simpler way to do it. However, the final code is relatively straightforward and, with minimal tweaks, could be applied to your own custom data set. I also think this was a good exercise to walk through and learn more about the various pandas functions and how to use them to solve a real world problem. I hope it is as helpful to you as it was to me!