Introduction

There are many data analysis tools available to the python analyst, and it can be challenging to know which one to use in a particular situation. A useful (but somewhat overlooked) technique called association analysis attempts to find common patterns of items in large data sets. One specific application is often called market basket analysis. The most commonly cited example of market basket analysis is the so-called “beer and diapers” case. The basic story is that a large retailer was able to mine its transaction data and find an unexpected purchase pattern of individuals who were buying beer and baby diapers at the same time. Unfortunately, this story is most likely a data urban legend. However, it is an illustrative (and entertaining) example of the types of insights that can be gained by mining transactional data. While these types of associations are normally used for looking at sales transactions, the basic analysis can be applied to other situations like click stream tracking, spare parts ordering and online recommendation engines, just to name a few. If you have some basic understanding of the python data science world, your first inclination might be to look at scikit-learn for a ready-made algorithm. However, scikit-learn does not support this algorithm. Fortunately, the very useful MLxtend library by Sebastian Raschka has an implementation of the Apriori algorithm for extracting frequent item sets for further analysis. The rest of this article will walk through an example of using this library to analyze a relatively large online retail data set and try to find interesting purchase combinations. By the end of this article, you should be familiar enough with the basic approach to apply it to your own data sets.

Why Association Analysis?

In today’s world, there are many complex ways to analyze data (clustering, regression, Neural Networks, Random Forests, SVM, etc.). The challenge with many of these approaches is that they can be difficult to tune, challenging to interpret and require quite a bit of data prep and feature engineering to get good results. In other words, they can be very powerful but require a lot of knowledge to implement properly. Association analysis is relatively light on the math concepts and easy to explain to non-technical people. In addition, it is an unsupervised learning tool that looks for hidden patterns, so there is limited need for data prep and feature engineering. It is a good start for certain cases of data exploration and can point the way for a deeper dive into the data using other approaches. As an added bonus, the python implementation in MLxtend should be very familiar to anyone who has exposure to scikit-learn and pandas. For all these reasons, I think it is a useful tool to be familiar with, and it can help you with your data analysis problems. One quick note - technically, market basket analysis is just one application of association analysis. In this post though, I will use association analysis and market basket analysis interchangeably.

Association Analysis 101

There are a couple of terms used in association analysis that are important to understand. This chapter in Introduction to Data Mining is a great reference for those interested in the math behind these definitions and the details of the algorithm implementation.

Association rules are normally written like this: {Diapers} -> {Beer}, which means that there is a strong relationship between customers that purchased diapers and also purchased beer in the same transaction. In the above example, {Diapers} is the antecedent and {Beer} is the consequent. Both antecedents and consequents can have multiple items. In other words, {Diaper, Gum} -> {Beer, Chips} is a valid rule.

Support is the relative frequency that the rule appears. In many instances, you may want to look for high support in order to make sure it is a useful relationship. However, there may be instances where a low support is useful if you are trying to find “hidden” relationships.

Confidence is a measure of the reliability of the rule. A confidence of .5 in the above example would mean that in 50% of the cases where Diaper and Gum were purchased, the purchase also included Beer and Chips. For product recommendation, a 50% confidence may be perfectly acceptable, but in a medical situation, this level may not be high enough.

Lift is the ratio of the observed support to that expected if the two rules were independent (see wikipedia). The basic rule of thumb is that a lift value close to 1 means the rules were completely independent. Lift values > 1 are generally more “interesting” and could be indicative of a useful rule pattern.

One final note related to the data: this analysis requires that all the data for a transaction be included in 1 row and the items should be 1-hot encoded.
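To make these definitions concrete, here is a small sketch (not from the original article — the transactions and item names are invented for illustration) that computes support, confidence and lift by hand for a {Diapers} -> {Beer} rule:

```python
# Toy transactions, invented for illustration
transactions = [
    {"Diapers", "Beer", "Chips"},
    {"Diapers", "Beer"},
    {"Diapers", "Gum"},
    {"Beer", "Chips"},
    {"Milk", "Gum"},
]

n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / n

antecedent = {"Diapers"}
consequent = {"Beer"}

sup = support(antecedent | consequent)   # P(Diapers and Beer)
confidence = sup / support(antecedent)   # P(Beer | Diapers)
lift = confidence / support(consequent)  # observed vs. independence

# support ≈ 0.40, confidence ≈ 0.67, lift ≈ 1.11
print(sup, confidence, lift)
```

With a lift just above 1, this toy rule would be only mildly “interesting” by the rule of thumb above.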
The MLxtend documentation example is useful:

```
   Apple  Corn  Dill  Eggs  Ice cream  Kidney Beans  Milk  Nutmeg  Onion  Unicorn  Yogurt
0      0     0     0     1          0             1     1       1      1        0       1
1      0     0     1     1          0             1     0       1      1        0       1
2      1     0     0     1          0             1     1       0      0        0       0
3      0     1     0     0          0             1     1       0      0        1       1
4      0     1     0     1          1             1     0       0      1        0       0
```

The specific data for this article comes from the UCI Machine Learning Repository and represents transactional data from a UK retailer from 2010-2011. This mostly represents sales to wholesalers so it is slightly different from consumer purchase patterns but is still a useful case study.
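If your data starts as a list of transactions rather than a one-hot table, the reshaping can be sketched in plain python (MLxtend also ships a TransactionEncoder for this; the dependency-free version below uses transactions modeled on the first rows of the table above):

```python
# Transactions corresponding to the first three rows of the table above
transactions = [
    ["Milk", "Onion", "Nutmeg", "Kidney Beans", "Eggs", "Yogurt"],
    ["Dill", "Onion", "Nutmeg", "Kidney Beans", "Eggs", "Yogurt"],
    ["Milk", "Apple", "Kidney Beans", "Eggs"],
]

# Stable, sorted column order across all transactions
columns = sorted({item for t in transactions for item in t})

# One row per transaction: 1 if the item is present, else 0
rows = [[1 if item in t else 0 for item in columns] for t in transactions]
```

The resulting `rows` list matches the 0/1 pattern of the documentation table for the items that appear in these three transactions.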

Let’s Code

MLxtend can be installed using pip, so make sure that is done before trying to execute any of the code below. Once it is installed, the code below shows how to get it up and running. I have made the notebook available, so feel free to follow along with the examples below.

Get our pandas and MLxtend code imported and read the data:

```
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

df = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')
df.head()
```

```
  InvoiceNo StockCode                          Description  Quantity          InvoiceDate  UnitPrice  CustomerID         Country
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6  2010-12-01 08:26:00       2.55     17850.0  United Kingdom
1    536365     71053                  WHITE METAL LANTERN         6  2010-12-01 08:26:00       3.39     17850.0  United Kingdom
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8  2010-12-01 08:26:00       2.75     17850.0  United Kingdom
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6  2010-12-01 08:26:00       3.39     17850.0  United Kingdom
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6  2010-12-01 08:26:00       3.39     17850.0  United Kingdom
```

There is a little cleanup we need to do. First, some of the descriptions have spaces that need to be removed. We’ll also drop the rows that don’t have invoice numbers and remove the credit transactions (those with invoice numbers containing C).

```
df['Description'] = df['Description'].str.strip()
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
df = df[~df['InvoiceNo'].str.contains('C')]
```

After the cleanup, we need to consolidate the items into 1 transaction per row with each product 1 hot encoded. For the sake of keeping the data set small, I’m only looking at sales for France. However, in additional code below, I will compare these results to sales from Germany. Further country comparisons would be interesting to investigate.
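The same three cleanup rules (strip whitespace from descriptions, drop rows with no invoice number, exclude credits) can be sketched without pandas; the sample records here are invented for illustration:

```python
# Invented sample records mimicking the retail data's shape
records = [
    {"InvoiceNo": "536365", "Description": " WHITE METAL LANTERN "},
    {"InvoiceNo": None, "Description": "UNKNOWN ITEM"},       # no invoice
    {"InvoiceNo": "C536379", "Description": "RETURNED GIFT"}, # credit
]

cleaned = []
for r in records:
    if r["InvoiceNo"] is None:        # drop rows with no invoice number
        continue
    if "C" in str(r["InvoiceNo"]):    # drop credit transactions
        continue
    # strip stray whitespace from the description
    cleaned.append({**r, "Description": r["Description"].strip()})
```

Only the first record survives, with its description trimmed, mirroring what the pandas one-liners above do to the full data set.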
```
basket = (df[df['Country'] == "France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
```

Here’s what the first few columns look like (note, I added some numbers to the columns to illustrate the concept - the actual data in this example is all 0’s):

```
Description  10 COLOUR SPACEBOY PEN  12 COLOURED PARTY BALLOONS  12 EGG HOUSE PAINTED WOOD  12 MESSAGE CARDS WITH ENVELOPES
InvoiceNo
536370                         11.0                         0.0                        0.0                              0.0
536852                          0.0                         0.0                        0.0                              0.0
536974                          0.0                         0.0                        0.0                              0.0
537065                          0.0                         0.0                        0.0                              0.0
537463                          0.0                         0.0                        9.0                              0.0

Description  12 PENCIL SMALL TUBE WOODLAND  12 PENCILS SMALL TUBE RED RETROSPOT  12 PENCILS SMALL TUBE SKULL  12 PENCILS TALL TUBE POSY
InvoiceNo
536370                                 0.0                                  0.0                          0.0                        1.0
536852                                 5.0                                  0.0                          0.0                        0.0
536974                                 0.0                                  0.0                          0.0                        0.0
537065                                 0.0                                  7.0                          0.0                        0.0
537463                                 0.0                                  0.0                          0.0                        0.0
```

There are a lot of zeros in the data, but we also need to make sure any positive values are converted to a 1 and anything less than or equal to 0 is set to 0. This step will complete the one hot encoding of the data and remove the postage column (since that charge is not one we wish to explore):

```
def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

basket_sets = basket.applymap(encode_units)
basket_sets.drop('POSTAGE', inplace=True, axis=1)
```

Now that the data is structured properly, we can generate frequent item sets that have a support of at least 7% (this number was chosen so that I could get enough useful examples):

```
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)
```

The final step is to generate the rules with their corresponding support, confidence and lift:

```
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()
```

```
                          antecedants                          consequents   support  confidence      lift
0  (PLASTERS IN TIN WOODLAND ANIMALS)      (PLASTERS IN TIN CIRCUS PARADE)  0.170918    0.597015  3.545907
1     (PLASTERS IN TIN CIRCUS PARADE)   (PLASTERS IN TIN WOODLAND ANIMALS)  0.168367    0.606061  3.545907
2     (PLASTERS IN TIN CIRCUS PARADE)           (PLASTERS IN TIN SPACEBOY)  0.168367    0.530303  3.849607
3          (PLASTERS IN TIN SPACEBOY)      (PLASTERS IN TIN CIRCUS PARADE)  0.137755    0.648148  3.849607
4  (PLASTERS IN TIN WOODLAND ANIMALS)           (PLASTERS IN TIN SPACEBOY)  0.170918    0.611940  4.442233
```

That’s all there is to it! Build the frequent items using apriori, then build the rules with association_rules. Now, the tricky part is figuring out what this tells us. For instance, we can see that there are quite a few rules with a high lift value, which means that the combination occurs more frequently than would be expected given the number of transaction and product combinations. We can also see several where the confidence is high as well. This part of the analysis is where domain knowledge will come in handy. Since I do not have that, I’ll just look for a couple of illustrative examples. We can filter the dataframe using standard pandas code.
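Under the hood, Apriori works level by level: count single items, keep those meeting min_support, then extend only itemsets whose subsets were all frequent. A minimal pure-python sketch of that idea (my own simplified version over toy data, not the MLxtend implementation) might look like:

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset: support} for all frequent itemsets."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items]  # level 1 candidates
    frequent = {}
    k = 1
    while current:
        # Count support for this level's candidates
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Build size k+1 candidates whose k-subsets are all frequent (pruning)
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in level for s in combinations(union, k)
            ):
                candidates.add(union)
        current = list(candidates)
        k += 1
    return frequent

# Toy transactions, invented for illustration
transactions = [
    {"beer", "diapers"},
    {"beer", "diapers", "chips"},
    {"beer", "chips"},
    {"milk"},
]
itemsets = apriori_sketch(transactions, min_support=0.5)
```

With a 50% support threshold, this keeps three single items and two pairs; {diapers, chips} appears in only one of four transactions and is pruned.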
In this case, look for a large lift (6) and high confidence (.8):

```
rules[(rules['lift'] >= 6) &
      (rules['confidence'] >= 0.8)]
```

```
                                        antecedants                           consequents   support  confidence      lift
8                     (SET/6 RED SPOTTY PAPER CUPS)       (SET/6 RED SPOTTY PAPER PLATES)  0.137755    0.888889  6.968889
9                   (SET/6 RED SPOTTY PAPER PLATES)         (SET/6 RED SPOTTY PAPER CUPS)  0.127551    0.960000  6.968889
10                     (ALARM CLOCK BAKELIKE GREEN)            (ALARM CLOCK BAKELIKE RED)  0.096939    0.815789  8.642959
11                       (ALARM CLOCK BAKELIKE RED)          (ALARM CLOCK BAKELIKE GREEN)  0.094388    0.837838  8.642959
16  (SET/6 RED SPOTTY PAPER CUPS, SET/6 RED SPOTTY…  (SET/20 RED RETROSPOT PAPER NAPKINS)  0.122449    0.812500  6.125000
17  (SET/6 RED SPOTTY PAPER CUPS, SET/20 RED RETRO…       (SET/6 RED SPOTTY PAPER PLATES)  0.102041    0.975000  7.644000
18  (SET/6 RED SPOTTY PAPER PLATES, SET/20 RED RET…         (SET/6 RED SPOTTY PAPER CUPS)  0.102041    0.975000  7.077778
22                  (SET/6 RED SPOTTY PAPER PLATES)  (SET/20 RED RETROSPOT PAPER NAPKINS)  0.127551    0.800000  6.030769
```

In looking at the rules, it seems that the green and red alarm clocks are purchased together and the red paper cups, napkins and plates are purchased together in a manner that is higher than the overall probability would suggest. At this point, you may want to look at how much opportunity there is to use the popularity of one product to drive sales of another. For instance, we can see that we sell 340 Green Alarm Clocks but only 316 Red Alarm Clocks, so maybe we can drive more Red Alarm Clock sales through recommendations?

```
basket['ALARM CLOCK BAKELIKE GREEN'].sum()
340.0

basket['ALARM CLOCK BAKELIKE RED'].sum()
316.0
```

What is also interesting is to see how the combinations vary by country of purchase. Let’s check out what some popular combinations might be in Germany:

```
basket2 = (df[df['Country'] == "Germany"]
           .groupby(['InvoiceNo', 'Description'])['Quantity']
           .sum().unstack().reset_index().fillna(0)
           .set_index('InvoiceNo'))

basket_sets2 = basket2.applymap(encode_units)
basket_sets2.drop('POSTAGE', inplace=True, axis=1)

frequent_itemsets2 = apriori(basket_sets2, min_support=0.05, use_colnames=True)
rules2 = association_rules(frequent_itemsets2, metric="lift", min_threshold=1)

rules2[(rules2['lift'] >= 4) &
       (rules2['confidence'] >= 0.5)]
```

```
                        antecedants                          consequents   support  confidence      lift
7        (PLASTERS IN TIN SPACEBOY)   (PLASTERS IN TIN WOODLAND ANIMALS)  0.107221    0.571429  4.145125
9   (PLASTERS IN TIN CIRCUS PARADE)   (PLASTERS IN TIN WOODLAND ANIMALS)  0.115974    0.584906  4.242887
10    (RED RETROSPOT CHARLOTTE BAG)             (WOODLAND CHARLOTTE BAG)  0.070022    0.843750  6.648168
```

It seems that in addition to David Hasselhoff, Germans love Plasters in Tin Spaceboy and Woodland Animals. In all seriousness, an analyst who has familiarity with the data would probably have a dozen different questions that this type of analysis could drive. I did not replicate this analysis for additional countries or customer combos, but the overall process would be relatively simple given the basic pandas code shown above.
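To close the loop on where those rule tables come from: association_rules splits each frequent itemset into every antecedent/consequent pair and divides supports. Here is a pure-python sketch of that idea (my own simplified helper with made-up support values, not MLxtend's implementation):

```python
from itertools import combinations

def rules_from_itemsets(support, min_lift=1.0):
    """Turn a {frozenset: support} table into
    (antecedent, consequent, confidence, lift) rules."""
    out = []
    for itemset in support:
        if len(itemset) < 2:
            continue  # rules need at least one item on each side
        for r in range(1, len(itemset)):
            for combo in combinations(itemset, r):
                ante = frozenset(combo)
                cons = itemset - ante
                conf = support[itemset] / support[ante]
                lift = conf / support[cons]
                if lift >= min_lift:
                    out.append((ante, cons, conf, lift))
    return out

# Hypothetical support values for illustration only
supports = {
    frozenset({"SPACEBOY"}): 0.2,
    frozenset({"WOODLAND"}): 0.25,
    frozenset({"SPACEBOY", "WOODLAND"}): 0.1,
}
toy_rules = rules_from_itemsets(supports)
```

This yields one rule per direction of the pair, each with lift 2.0 but with different confidences, which is the same asymmetry visible in the rule tables above.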