1. Map over all inputs you care about; output customer_id → state_mapping
2. Shuffle all mapper outputs to collate by customer_id
3. Reduce customer_id → state_mapping to cohort_id → state_mapping
4. Re-Shuffle reducer output again to collate by cohort_id
5. Re-Reduce cohort_id → state_mapping to a single combined state_mapping for that day
6. Output the row of cohort data to CSV, etc.
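
The steps above can be sketched in memory, without a real MapReduce cluster. This is only a rough illustration, not the code I actually run: the event fields (`customer_id`, `day`, `source`, `state`), the state names in `STATES`, and every function name are assumptions made up for the example. The mapper's per-event "state_mapping" is modeled here as a small dict describing one observation about a customer.

```python
import json
from collections import Counter, defaultdict
from datetime import date

# Hypothetical lifecycle states, ordered from earliest to furthest along.
STATES = ["signed_up", "purchased", "purchased_twice"]


def shuffle(pairs):
    """Steps 2 and 4: collate (key, value) pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def map_event(event):
    """Step 1: emit customer_id -> one observation about that customer."""
    return event["customer_id"], event


def reduce_customer(events):
    """Step 3: reduce one customer's full history to cohort_id -> state_mapping."""
    events = sorted(events, key=lambda e: e["day"])
    first = events[0]  # earliest activity defines the cohort
    cohort_id = f"{first['day'].strftime('%m/%d/%y')}:{first['source']}"
    # The customer's exclusive state is the furthest one they ever reached.
    furthest = max((e["state"] for e in events), key=STATES.index)
    return cohort_id, json.dumps({furthest: 1})


def reduce_cohort(mappings):
    """Step 5: fold the per-customer JSON dicts into totals per state."""
    totals = Counter()
    for mapping in mappings:
        totals.update(json.loads(mapping))  # adds counts key by key
    return dict(totals)


def cohort_analysis(events):
    by_customer = shuffle(map_event(e) for e in events)
    per_customer = (reduce_customer(h) for h in by_customer.values())
    by_cohort = shuffle(per_customer)
    return {cid: reduce_cohort(ms) for cid, ms in by_cohort.items()}


# Made-up sample data, purely for illustration.
SAMPLE_EVENTS = [
    {"customer_id": "a", "day": date(2012, 10, 20), "source": "homepage", "state": "signed_up"},
    {"customer_id": "a", "day": date(2012, 10, 22), "source": "homepage", "state": "purchased"},
    {"customer_id": "b", "day": date(2012, 10, 20), "source": "homepage", "state": "signed_up"},
    {"customer_id": "c", "day": date(2012, 10, 21), "source": "blog", "state": "signed_up"},
]
```

Calling `cohort_analysis(SAMPLE_EVENTS)` returns one dict of state totals per cohort_id; each of those becomes a row for step 6. On a real cluster the two `shuffle` calls are done by the framework between jobs rather than by your own code.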

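| Cohort   | Source   | User signed up | User purchased | User purchased twice | ... |
|----------|----------|----------------|----------------|----------------------|-----|
| 10/20/12 | homepage | 3              | 3              | 4                    | ... |
| 10/21/12 |          | 8              | 7              | 1                    | ... |
| 10/21/12 | blog     | 10             | 5              | 0                    | ... |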
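
The chart built from this table is normalized to 100%, so each state is shown as a share of its cohort. Here is a minimal sketch of that arithmetic, assuming the three visible columns are the only states in a row (the real table has more, elided as "..."). The function name `normalize_row` is made up for the example.

```python
def normalize_row(counts):
    """Convert one cohort row of state counts into percentages of the cohort."""
    total = sum(counts.values())
    if total == 0:
        # An empty cohort has no meaningful shares; report zeros.
        return {state: 0.0 for state in counts}
    return {state: 100.0 * n / total for state, n in counts.items()}


# First row of the table above: 3 + 3 + 4 = 10 users in the cohort.
row = {"User signed up": 3, "User purchased": 3, "User purchased twice": 4}
shares = normalize_row(row)
# shares == {"User signed up": 30.0, "User purchased": 30.0, "User purchased twice": 40.0}
```

This only works because the states are mutually exclusive: each user is counted in exactly one column, so the row total is the cohort size.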









Been doing some cohort analysis recently. I had heard the term but never got it until I read this post. Hooray for simple explanations. The other popular example I could find is a narrow use-case, made no sense, and didn't motivate me.

The gist for the uninitiated: What you get from cohort analysis is a picture of how users/customers progress in your product as a function of when they first signed up. It lets you see how product changes, marketing pushes, network effects, press, etc. impact conversions and the funnel.

Anyways, to do cohort analysis using MapReduce, the pattern I'm using is the sequence of steps listed at the top of this post. Then I bring the output into a spreadsheet or something to further regroup, filter, and visualize the data.

The output table is the one shown above, right after the steps.

[Figure: the output graph, normalized to 100%]

[Figure: the multi-level mapreduce as a diagram with intermediates]

The most important parts:

- The row to select is the very first day they entered your system. This is the user's cohort. The reducer in step #3 (the first reduce) will have access to the user's full history in your system. There you'll need to sort the user's events by time and figure out the earliest date they were active.
- In #3 you also need to decide what the user's mutually exclusive state is. This should be the furthest point along in your product lifecycle that they've reached. For my product the states go from "sign-up" to "repeat customer".
- For the state_mapping intermediates above, I use JSON dictionaries mapping exclusive states to the number 1. Then I fold the dicts (summing the values key by key) in the reducer in step #5 (the re-reduce) to get the totals for each output row.
- Similar to tracking an advertising campaign, you can segment your cohorts and visualize them independently. For example, say one group of users signed up from a homepage flow, while another group signed up via a blog flow. These can be analyzed separately too. To do this I make the reducer in #3 output a cohort_id that contains the cohort date and the cohort group (e.g., "10/22/12:homepage").

Let me know if you have any suggestions! This is my shiny new hammer and everything looks like a nail~