Following the previous posts on predicting email churn (here and here), we continued investigating different model approaches for predicting churn, utilizing some very constructive advice we received from our blog’s readers. Apart from the well-known general-purpose models we already evaluated, more purpose-specific models are also worthwhile. Again, our analysis focuses on mailing list churn prediction. The first vital step of this procedure is to define over new the exact meaning of term “Email Churn”. Having a well-defined problem to solve ensures that if you’re successful in your endeavor in building a predictive model once used, it will produce value.

This is post #3 in the series of mailing list churn prediction using data from MailChimp where I cover:

TIP: If you want to have the series of posts in a PDF you can always refer to, get our free ebook on how to predict email churn.

What is email churn?

As you well know, every business wants to keep its customers engaged. By persuading a recipient to remain in your list, you maintain an invaluable communication channel with him through which you can demonstrate your company’s products and field expertise. On the other hand, letting your potential clients slip away will have a negative impact on your company’s long-term profitability.

Constructing a formal definition of churn is not easy, even in the case of email marketing. While in most cases churned recipients are considered only those who unsubscribe from a mailing list, in fact, there are two “flavors” of list churn as described at Adstation:

Transparent or voluntary churn: Includes unsubscribes, hard bounces and spam complaints. These are people you cannot contact anymore.

Opaque churn or involuntary churn: Includes recipients who just aren’t seeing your emails for a long time. Either the original email is going to the spam or junk folder or possibly to an account that they rarely check, if ever. This group is open for re-engagement.

The critical point is that in the second case, in contrast to the first, we have no way to tell when recipients churn involuntarily for good. In this case, churn is guessed, not observed. Utilizing the recipients’ engagement measured by some metric like open rate during a time period can provide us with a model capable of detecting both voluntary and involuntary churn incidents as in both cases recipients would stop opening our emails.

How big of a problem is email churn?

As explained in the previous posts too, a framework that can be used to understand the process of a marketing campaign mathematically is a valuable tool in the hands of any marketer. Particularly in the case of email churn prediction, all decisions made based on this framework will directly affect the performance of your campaigns.

As far as the email marketing is concerned, while traditionally the growth of the mailing lists was considered to be the ultimate way of increasing profitability, it appears that the retention decrease is crucial too. According to some experts such as Tom Au from AT&T it is even more important than recipient acquisition, as “retaining existing customers is more profitable than acquiring new customers due primarily to savings on acquisition costs, the higher volume of service consumption and customer referrals”

To decrease the retention, we need a model which will inform us who is going to churn in the near future. This churn problem is usually referred as “Buy Till You Die”. In this case, as time goes by since a recipient last opened an email, the probability that this customer is churned increases from 0 to 1, representing the churn score.

Using models from the BTYD class like the NBD/Pareto we can estimate this score using the recipient’s previously monitored behavior.

Buy Till You Die: The NBD/Pareto model

The Pareto/NBD model is based on the following assumptions:

→ Customers go through two stages in their “lifetime” with a specific firm: they are “alive” for some period of time, then become permanently inactive.

→ While alive, the number of opens made by a customer follows a Poisson process with open rate λ. This is equivalent to assuming that the time between opens is distributed exponential with open rate λ.

When it comes to modelling recipients behavior, assuming having a distribution for the email open rate λ is more decent that choosing a single λ that is supposed to explain everyone’s behavior. For example, not all are interested in the same email topics. Some may be more interested in data analysis blog posts while others might prefer marketing updates. Furthermore, this distribution needs to have a few parameters in order to be flexible enough and take values in the positive real numbers. A widely used distribution which meets these requirements is the Gamma distribution. Moreover being well-studied and relatively easy to implement is extra bonus points.

→ A customer’s unobserved “lifetime” of length τ (after which he is viewed as being inactive) is exponentially distributed with dropout rate μ.

The constant small risk that at any time a recipient might churn can be modelled as an exponential lifetime where the longer it’s been since the recipient opened an email, the more probable it is to have already churned.

→ Heterogeneity in open rates across customers follows a gamma distribution with shape parameter r and scale parameter α.

→ Heterogeneity in dropout rates across customers follows a gamma distribution with shape parameters and scale parameter β.

Some recipients appear to be very committed to reading the emails they receive while others tend to interact with them quite occasionally. Rather than a single drop-out or interaction rate for everyone, it makes more sense to have a distribution of drop-out and interaction rates. So, we are after a flexible , well-established distribution with few parameters. In the Pareto/NBD case we use a Gamma, since the rate μ is in the positive real numbers.

→ The open rate λ and the dropout rate μ vary independently across customers.

Assumptions (2) and (4) give us the NBD model for the distribution of the number of opens while the customer is alive while assumptions (3) and (5) give us the Pareto.

Data Preparation

The data required to estimate the NBD/Pareto model parameters is surprisingly little and can easily be obtained from services like Mailchimp.

The initial dataset we are going to use is the form of an event log and consists of lines where each one of them corresponds to an open action performed by a recipient along with a recipient’s identifier.

This dataset is then converted into a “customer-by-sufficient-statistic-matrix” which includes three pieces of information for every person:

how many opens they made in the training period, also called calibration (frequency)

the time of their last transaction (recency)

the total time for which they were observed

In our example, we also merged actions triggered from the same user on the same day since our timing information is accurate to the day.

As with other models, we need to divide our data into two sets: Calibration data for model training and holdout data for model evaluation. The cutoff point for the separation is chosen in the middle of the total period in order to have sufficient amount of data for both calibration and holdout.

We then are going to construct a customer-by-time matrix. Each row represents a customer and the columns dates at which he opened an email. The value of each column corresponds to the times he opened an email on that particular day.

From the customer-to-time matrix, we can now create the customer-by-sufficient-statistics matrix we mentioned before.

The following plot presents the distribution of days between subsequent opens. The NBD/Pareto describe this quantity via a Gamma distribution across recipients, the shape of which is pretty close to the initial data.

Parameter Estimation

Using the CBS(customer by sufficient statistic) from the calibration period which includes information about frequency of opens, recency of last open and total time observed we can estimate the following parameters of the NBD/Pareto model:

r: gamma parameter for NBD open

Alpha: gamma parameter for NDB open

s: gamma parameter for Pareto(exponential gamma) dropout process

Beta: gamma parameter for Pareto(exponential gamma) dropout process

In order to estimate these parameters, we maximize the sum of log-likelihood for each customer and the finally the selected parameters are those determined by the algorithm as the best-fitting. The lower bound of the parameters we are going to estimate is always zero as NBD/Pareto parameters cannot be negative.

Before performing the above computation it is a good idea to initialize the parameters r and s at our best guess of the heterogeneity in the open and die rate of recipients so that the computation shall converge more easily to the right result. If no initial values are selected, BTYD R’s package will use (1,1,1,1) as a default.

For every pair of distributions’ shape parameters varying, we constructed a contour plot in order to graphically represent the algorithm’s choices regarding the parameters’ estimation best fitting. For each one of them, the non-varying parameters are kept constant to the predicted values. The values of the estimated by the algorithm parameters is indicated by the red lines.

Based on the previously estimated values for lambda, recipients’ propensity to perform an open action and gamma recipients’ propensity to drop out, the assumed distribution of heterogeneity in transaction and dropout rate are presented in the following plots. In fact, it is the graphical representation of the of the distributions the algorithm uses to model our problem.

Individual Predictions

New Customer

Using the estimated parameters of the NBD/Pareto model, we can calculate the expected number of repeated transactions that a randomly selected customer for whom we have no prior behavioral information, is expected to make in a given time period.

For example, for a time period of 1 year, a new customer is expected to perform 8 open actions.

Existing Customer

Taking into account both the NBD/Pareto estimated parameters and the recipients’ previous behavior we can calculate the expected open actions of an existing customer in a given time period.

For an existing customer, we can predict his expected behavior for a defined time period, conditional on his characteristics during the calibration period.

For example, for a time period of 1 year, the recipient with id 16 is expected to perform 4 opens.

Overall Prediction

Using the estimated parameters and the customers’ action behavior we can compute the probabilities that they are still alive at the end of the calibration period.

The value of the probability varies from 0 to 1 with values close to 0 interpreted as “probable churns”. As a rule of thumb, we can say that customers with a probability of remaining alive less that 0.5 can be considered to be “dangerous” while those with probability above 0.5 more secure.

Having binned recipients depending on the frequency of repeated open actions during the calibration period, the following plot was constructed for the comparison between predicted and actual data.

During calibration period we can see that the model performs satisfactorily on data binned by the number of repeated transactions. In all bins, the number of points predicted by the model is almost equal compared to the real one.

Model Assessment in Holdout Period

Having split the data into calibration and holdout periods as described before we can now assess the model’s performance on the holdout period.

The following plot represents the actual and conditional expected number of open actions made by customers in the holdout period, binned according to calibration period frequencies.

We can see that the constructed model tends to be pessimistic. This can be verified by the statistic summary of the difference (actual- predictions):

Min 1st Qu Median Mean 3rd Qu Max -67.150 -1.156 4.712 11.960 19.070 269.300

The difference between Median and Mean indicate that although in most cases the predictions are satisfactory, there are also a few cases where the prediction was very pessimistic compared to reality.

Conclusion

It appears that the model constructed using the NBD/Pareto approach can provide a marketer with reliable predictions regarding the future behavior of his mailing list recipients.

These conclusions can be very helpful in predicting both new customers’ for whom we have no previous data and existing customers with previously monitored behavior while at the same time provide an overview of the churn trends, leading to more informed marketing decisions.

The code used in this post is available on GitHub.