Basic customer retention curves

Customer retention curves are essential to any business looking to understand its clients, and will go a long way towards explaining other things like sales figures or the impact of marketing initiatives. They are an easy way to visualize a key interaction between customers and the business, which is to say, whether or not customers return — and at what rate — after the first visit.

The first step to building a customer retention curve is to identify those who visited your business during the reference period, what I will call p1. It is important that the length of the period chosen is a reasonable one, and reflects expected frequency of visits.

Different types of businesses are going to expect their customers to return at different rates:

A coffee shop may choose to use an expected frequency of visits of once a week.

A supermarket may choose a longer period, perhaps 2 weeks or a month.

In the following example, I will use a month, and assume that we are looking at customer retention of customers who visited in January 2016 over the following year.

As previously stated, the first step is to identify the original pool of customers:

January_pool AS

(

SELECT DISTINCT cust_id

FROM dataset

WHERE month(transaction_date)=1

AND year(transaction_date)=2016)

Then, we look at how those customers behaved over time: for example, how many of them returned per month over the rest of the year?

SELECT Year(transaction_date),

Month(transaction_date),

count (distinct cust_id) AS number FROM dataset WHERE year(transaction_date)=2016 AND cust_id IN january_pool GROUP BY 1, 2

As you can see, the original SELECT function is included in this second step.

If we had 1000 unique customers in January, we can expect our results to look something like this:

The resulting graph would then look like this:

Data visualized with Statsbot

Evolution of customer retention over time

What is described above is obviously only the first step, as we would also like to see whether there are any trends in customer retention, i.e. are we getting any better at it?

So, one idea we might have is to say: of those who came in January, how many returned in February? Of those who came in February, how many returned in March? And other one-month intervals.

So, then we need to set up an iterative model, which can be built in a few simple steps. First, we need to create a table where each user’s visits are logged by month, allowing for the possibility that these will have occurred over multiple years since whenever our business started operations. I have assumed here that the start date is the year 2000, but you can adjust this as necessary.

Visit_log AS SELECT cust_id, datediff(month, ‘2000–01–01’, transaction_date) AS visit_month FROM dataset GROUP BY 1, 2 ORDER BY 1, 2

This will give us a view that looks like this:

We then need to reorganize this information in order to identify the time lapse between each visit. So, for each person and for each month, see when the next visit is.

Time_lapse AS SELECT cust_id, visit_month lead(visit_month, 1) over (partition BY cust_id ORDER BY cust_id, visit_month) FROM visit_log

We then need to calculate the time gaps between visits:

Time_diff_calculated AS SELECT cust_id, visit_month, lead, lead — visit_month AS time_diff FROM time_lapse

Now, a small reminder of what customer retention analysis measures: it is the proportion of customers who return after x lag of time. So, what we want to do is compare the number of customers visiting in a given month to how many of those return the next month. We also want to define those who return after a certain absence, and those who don’t return at all. In order to do that, we need to categorize the customers depending on their visit pattern.

Custs_categorized AS SELECT cust_id, visit_month, CASE WHEN time_diff=1 THEN ‘retained’, WHEN time_diff>1 THEN ‘lagger’, WHEN time_diff IS NULL THEN ‘lost’ END AS cust_type FROM time_diff_calculated

This will allow us, in a final step, to establish a count of the number of customers who visited in a given month, and how many of those return the next month.

SELECT visit_month, count(cust_id where cust_type=’retained’)/count(cust_id) AS retention FROM custs_categorized GROUP BY 1

This gives us, month by month, the proportion of customers who returned.

Data visualized with Statsbot

Other techniques for customer retention

There are, of course, other ways to think about customer retention analysis. In particular, we might want to look at how to deal with returning customers: if in p1 there are 100 customers and in p2 80 of them return, in p3 do we want to keep the original 100 or look only at the 80?

In the above example, I talked only about period-to-period retention, but I also classified those who were “laggers,” or who took more than one month to return. A business may want, through targeted marketing efforts, to reward customers who keep a monthly visit pattern, or perhaps, to encourage those who visit less than once a month to come back more often.

So, another way to look at it would be to look at what proportion of our visitors in any given month are retained, how many are returning, and how many are new. In this case, we would want to change our perspective slightly, and look not at the next visit, but at the previous one.

Time_lapse_2 AS SELECT cust_id, Visit_month, lag(visit_month, 1) over (partition BY cust_id ORDER BY cust_id, visit_month) FROM visit_log Time_diff_calculated_2 AS SELECT cust_id, visit_month, lag, visit_month — lag AS time_diff FROM time_lapse_2 Custs_categorized AS SELECT cust_id, visit_month, CASE WHEN time_diff=1 THEN ‘retained’, WHEN time_diff>1 THEN ‘returning’, WHEN time_diff IS NULL THEN ‘new’ END AS cust_type FROM time_diff_calculated_2

So then, it’s simple to count the number of each type of customer in any given month.

SELECT visit_month, cust_type, Count(cust_id) FROM custs_categorized GROUP BY 1, 2

With a simple export and graph, you will get something that looks like this:

Data visualized with Statsbot

Cohort tables

A popular way to analyze customer retention is through the use of cohorts, i.e. defining each user by their first visit, and then seeing how they return over time.

Ideally, our end product will be a table like this: