These days there is an incredible amount of attention for the algorithms side of data science. The term ‘deep learning’ is the absolute hype of the moment, with Random Forest and Gradient Boosting coming second and third as approaches that have earned many people fantastic scores on Kaggle.

One thing, however, is almost never talked about: what decisions are made to arrive at a training, testing and validation logic? How is the target variable defined? How are the features created? Of course, you hear people speak about feature creation, but typically that means the recombination or transformation of features that were initially chosen to be part of the dataset. How do you arrive at that initial dataset?

In this article, I’d like to discuss the many decisions behind the logic of constructing an analytic dataset, and show common approaches. To start off, consider the following: in Marketing, one frequently builds propensity (or classification) models. The target is created as ‘customer has purchased in the last three months’, and the predictors are static customer characteristics and roll-ups of the behavior prior to those three months. Let’s ask some questions about such a dataset:

If you predict the customers who are going to buy, doesn’t that mean that they will buy anyway? Wouldn’t you rather predict the customers who you need to reach out to in order to make them buy? Indeed, this is the uplift thought, for those who recognize it. An uplift model may solve this, but you may also rethink the original question and come up with a clever experiment such that the resulting data answers that exact question, rather than going with the default target ‘who bought in the last three months’.

If your historic modeling campaign window is three months, and a customer buys something on the first day of those three months, doesn’t their last observed predictor data contain different information than that of someone who buys at the end of the three months?

Why would you choose a three month campaign window in the first place?

If you score a new dataset, how do you handle the fact that a customer you want to score was also part of the training dataset (yet, potentially with an updated state)?

How do you create a model where the predictions change if a customer’s behavior changes, rather than a static model that only differentiates between customers?

Those all seem fairly straightforward questions, yet answers are hardly found in the literature. In this article, I will discuss the various approaches I have encountered in the wild. In addition, I will demonstrate the (visual) ways I use to make many of those choices discussable. Note that many examples revolve around purchasing, but the methods used are applicable in a wide range of domains (replace ‘purchase’ with your event of interest).

Phrasing the right question

A bank once approached me with the following question: we would like to predict the amount of cash taken out from ATMs, 30 days out, on a daily basis. Thinking about this question, the following comes to mind:

There will be periodic patterns, as people go to ATMs more when they get paid.

There will be weather patterns, as people don’t go to the ATM when it rains.

There will be location patterns, as rural locations will show different patterns than urban ones.

There will be calendar patterns, as (days prior to) celebration days have their own dynamic.

There will be ethnic patterns, as different ethnicities have their own celebrations and ways to spend money.

And so on. Since the word ‘time’ is mentioned, somehow there’s the immediate knee-jerk response: ‘time series’. So, this means, you go ahead and start building 10,000 time series, one for every ATM.

Wait! Full stop here! The question is: “predict the amount of cash taken out from ATMs, 30 days out, on a daily basis.” Why? How will you use the results? Why 30 days? The number 30 simply seems to have come to mind as a convenient number of days to look forward. Yet, their business process showed weekly money provisioning (money transports driving from ATM to ATM). So, why not start by turning the 30-days-out prediction into a 7-days-out prediction? Next question: why do you need a daily prediction? Again, it turns out that with a daily prediction, the bank imagined it could easily work out when an ATM was going to be empty. The only thing: the person asking the question had not considered that adding predictions together also means adding together the uncertainty around those predictions. Here we are: the word ‘predict’ is heavily misused by anyone who can use the word, unhindered by any knowledge of data science. On a side note: considering each ATM as a separate time series does not give you the benefit of sharing predictors across different ATMs; instead, I would consider a model (or dataset) that contains all ATMs (longitudinal data, accounting for time as an additional predictor).

What would be the right way to phrase the question? There are many different options:

Daily prediction of total money taken out (regression)

Daily total cash left in the ATM (regression)

Number of days till empty (survival)

Daily probability to hit empty (classification)

Summed amount of predicted individual cash take-outs (aggregation model)

Predict routes of the money transport (optimization)

As the 4th law of data mining says: there’s no free lunch for the data miner. Prior to experimenting, one cannot tell which of those models would result in the best approach for cash replenishment. One has an intuition, but more importantly, do not fix your model approach and think that the solution is to use the caret package in R (i.e. run 150 commonly used machine learning models). All machine learning will be automated in a matter of years, and hence, as a data scientist, you will no longer add value unless you focus on exactly the points I’m making here: it’s about creativity, about understanding how to phrase the right question, and about creating your training set accordingly.

These are my tips to give an initial ranking of things to try out:

Exotic models do not solve your problem; getting the question right is more important.

Try the simplest thing first.

Be prepared to rephrase your modeling experiment.

In phrasing your question, try to understand in what way, or via what mechanism, your predictors influence your targets.

Do not predict more than you need to solve the business challenge (for example: 7 days out vs. the earlier discussed 30 days out).

Understand really well how a resulting model can be implemented, and rethink that again.

Understand what data will be available at scoring time (especially the timing aspect; more on this later).

Understand how well you need to predict (i.e. quality of the prediction) in order to improve on the current business process.

Classification questions are often easier than regression questions (for example: predict if someone has a job or not vs. how much they earn).

Google recently broke records in face identification using a model they call Facenet. Their way of looking at the data was truly new and creative: instead of (the traditional way) trying to predict a face, they paired three faces, of which two were the same person and one was a different person. It was up to the model to distinguish the pairs. These are the innovative ways of looking at data that characterize a good data scientist.

Basic dataset creation

Let’s return to the most basic data science application in marketing: propensity modeling. This will allow us to set up a data selection approach and a way to communicate what is being done.

In Figure 1A, the most basic selection is displayed on a timeline. Between the moment ‘now’ and ‘t-3’, one watches for the event that needs to be modeled. In this case, the green dots indicate ‘buying’, and at the end of the three months, one determines whether a customer has bought or not. This becomes the target. Note that the moment ‘now’ refers to the most recent historic data that one can get their hands on. Prior to ‘t-3’ is predictor space. The small ‘+’ signs indicate aggregations (roll-ups) of customer behavior, up to ‘t-3’ (actually, one needs to account for the data delay as well; more on this later). It is crucially important not to take any predictors from beyond ‘t-3’, as this causes model leakage: your predictors will contain information about the target that you won’t have available at the time of scoring. This type of visualization makes it very clear how your selections take place, and it makes the often implicit choices explicit, and thus up for debate.
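The selection from Figure 1A can be sketched in code. Below is a minimal pandas sketch, not the original system: the column names (customer_id, event_date, amount) and the roll-up choices are illustrative assumptions.

```python
import pandas as pd

def build_propensity_dataset(transactions, customers, now, window_months=3):
    """Split history at the cutoff 't-3': everything before the cutoff becomes
    predictor roll-ups, purchases inside [cutoff, now) become the target."""
    cutoff = now - pd.DateOffset(months=window_months)

    # Target: did the customer buy inside the event window?
    in_window = transactions[
        (transactions["event_date"] >= cutoff) & (transactions["event_date"] < now)
    ]
    buyers = in_window["customer_id"].unique()

    # Predictors: roll-ups strictly before the cutoff (no leakage past 't-3').
    history = transactions[transactions["event_date"] < cutoff]
    predictors = history.groupby("customer_id")["amount"].agg(
        total_spent="sum", avg_spent="mean", n_purchases="count"
    )

    ds = customers.set_index("customer_id").join(predictors)
    ds[["total_spent", "avg_spent", "n_purchases"]] = ds[
        ["total_spent", "avg_spent", "n_purchases"]
    ].fillna(0)
    ds["bought"] = ds.index.isin(buyers).astype(int)
    return ds
```

The key property is that the target and predictor selections share one cutoff date, so a purchase inside the event window can never leak into the roll-ups.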

Note that ‘buying’ here refers to the interesting event one wants to model. It can be the purchase of a particular article, a category of articles, or even a characteristic of an article (such as color, brand, etc.). Multiple models, modeling different characteristics, can then be scored to finally map back to the ‘ideal’ product recommendation for a customer. As an alternative to purchase, the target can refer to a customer showing a particular behavior that seems interesting (upgrading, downgrading, browsing, requesting, etc.). Again, the selection logic is widely applicable, and only the imagination of the data scientist is the limit.

One particular thing that needs discussion in Figure 1A is the distance between the last available predictor moment and the moment of buying. For some customers this may be as little as one day (they buy on the first day of the three-month window). Likely, the closer your predictor data is to the target, the easier it is to predict. A customer buying a product online today may have looked at the item yesterday. The model will find this relationship (and hence this case contributes positively to the model performance), yet, how deployable is this? Scoring the model on new customers will point to campaigning to customers who just visited the website; but if they visited the website and didn’t buy the next day, would any campaign still help them purchase? One particular way of solving this is displayed in Figure 1B. Here, the last predictor moment is always three months prior to purchase (for those who purchased) or three months prior to the end of the observation window (for those who did not purchase). In this way, you prevent the model from looking performant while being non-deployable due to the timing issue. This also gives rise to experimenting with the time prior to purchase: although the observation window can still be, say, three months (i.e. ‘now’ – ‘t-3’), the last predictor date can be experimented with. This leads to building a series of models, from, say, six months prior to purchase, down in steps of one month, to one month prior to purchase. You expect the model performance to go up as the time difference between the event and the last predictor moment decreases. This method also allows you to test for model stability, and shows how much time prior to an event one really starts seeing a clear signal.

So far, we spoke about a three-month campaign window. Why three months? There are the following criteria to base this decision on. First of all, the event window is also how far you predict forward when you score the model on new data. Using three months, marketing has enough time to roll out campaigns. Imagine using a one-day campaign window: at the moment you score, you predict the customers who will buy the product tomorrow, and thus there is no time to send out a campaign. There’s another issue with taking a too-short campaign window: the number of purchases will be very small. Balancing your data, working with a cost structure or adding priors to the model are ways to deal with unbalanced data; however, general practice shows modeling becomes harder the more unbalanced the sample is. In many marketing applications, a three-month window results in a 2%-5% uptake rate, which seems to be a fair level of unbalance to still build valid models (with or without the balancing options; some models need it, some don’t). Widening the event window leads to other issues: although the percentage of purchasing customers increases, many of them will have predictor data far in the past. Given a uniform uptake, if your event window is one year, half of your customers will have their last predictor data from half a year ago or longer (in the case of the method from Figure 1A). I’ve modeled slow-moving automotive parts with a one-year window: there it seemed reasonable, since a particular car part may be sold only once in two years.

Summarizing, the event window depends on the expected number of positive examples in your training set (it should be reasonable to model, which is >1%; please see this as a rough guideline and not as a hard border), on how the resulting model will be used, and finally, on the industry dynamics.

When the take rate is low, rather than working with a wider window, another approach is to use a sliding window. The principle is the same as explained above; however, the selection is made for a number of consecutive months and those are stacked to form one training set. This is illustrated in Figure 2.
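A sliding window can be sketched by repeating the single-window selection at consecutive reference dates and stacking the snapshots. A minimal pandas sketch with illustrative column names (customer_id, event_date, amount); the roll-up is deliberately reduced to a single predictor:

```python
import pandas as pd

def sliding_window_dataset(transactions, reference_dates, window_months=3):
    """Repeat the single-window selection at several consecutive 'now'
    dates and stack the snapshots into one training set."""
    snapshots = []
    for now in reference_dates:
        cutoff = now - pd.DateOffset(months=window_months)
        history = transactions[transactions["event_date"] < cutoff]
        window = transactions[
            (transactions["event_date"] >= cutoff) & (transactions["event_date"] < now)
        ]
        snap = (
            history.groupby("customer_id")["amount"].sum().rename("total_spent").to_frame()
        )
        snap["bought"] = snap.index.isin(window["customer_id"]).astype(int)
        # Keep the reference date: a customer may appear in several snapshots.
        snap["snapshot"] = now
        snapshots.append(snap.reset_index())
    return pd.concat(snapshots, ignore_index=True)
```

Note that the same customer can legitimately occur multiple times, once per snapshot, with predictors and target both re-computed relative to that snapshot's cutoff.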

Figure 3 shows the data layout of a prepaid churn model. Churn in a prepaid scenario has a particular difficulty: you do not know when the churn took place (there is no end date of a contract, or a phone call from the customer to quit the service). Typically, churn is inferred from the customer not showing up on the network for a number of days (say, 60 days). This is the event window. In this example, the event window is separated from the campaign window in order to explicitly make room to conduct a campaign. Note the dot indicating ‘last seen active’. This shows the separation of the campaign window and the event window: this particular customer was still active on the last day of the campaign window, but not in the event window, and hence was classified as churn. Another new element here is the definition of the active customer: a customer is part of the training set if they are on the network prior to the start of the data window. The ‘last day in data window’ also plays an important role: customers need to be active at least once in the week ending at the ‘last day in data window’, in order to make sure they have not already churned. Including customers who are already inactive yields a seemingly very performant model which says: if you have not been active in the last week, you will likely churn. At scoring time, such a model will point to all customers who have not been active for a week, whom, very likely, you can’t reach using campaigns because they have already switched SIM cards. Lastly, in this figure, the data delay is made explicit. This can be an important point when the campaign window is small. If it takes a full week to get the data, then at the time of scoring, the predictor data is one week old, and hence the campaign window is shortened by one week.
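The labeling logic above (eligibility via activity in the final week of the data window, churn inferred from absence in the event window) can be sketched as follows; the column names and window lengths are illustrative assumptions, not the original implementation:

```python
import pandas as pd

def label_prepaid_churn(activity, last_day_data, event_days=60, active_window_days=7):
    """A customer enters the training set only if active in the final week of
    the data window; they are labeled churn if never seen on the network
    during the event window that follows."""
    last_day = pd.Timestamp(last_day_data)
    event_end = last_day + pd.Timedelta(days=event_days)

    # Eligibility: at least one activity in the week ending at last_day,
    # which prevents training on customers who have already churned.
    recent = activity[
        (activity["date"] > last_day - pd.Timedelta(days=active_window_days))
        & (activity["date"] <= last_day)
    ]
    eligible = recent["customer_id"].unique()

    # Churn: no activity at all inside the event window.
    seen_in_event = set(
        activity[
            (activity["date"] > last_day) & (activity["date"] <= event_end)
        ]["customer_id"]
    )
    return pd.DataFrame({
        "customer_id": eligible,
        "churned": [int(c not in seen_in_event) for c in eligible],
    })
```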

Figure 4 shows the data delay situation in a model that was tasked with predicting the number of containers arriving in a port in the next three eight-hour shifts (shifts A, B and C). Building a model to predict the number of containers at time t based on the activity at t-1 leads to a good model; however, at scoring time, the only data available during ‘t-1’ is the data from ‘t-2’. Luckily, by pointing out the importance of understanding data delay in an early phase, we never went into building a non-functioning model based on ‘t-1’ data. The visualizations used to communicate these points are displayed in Figure 4.

Training a model on samples

The method outlined above will result in a training set. To be clear in terms of terminology: a training set is used to train and tune the model. Once the final model is ready, I consider it good practice to conduct an out-of-time validation; this is discussed in the next paragraph. In this paragraph, I would like to discuss how to train a model efficiently. Frequently I see people argue that models need to be trained on as much data as possible (and here we are in the midst of the Big Data hype). In some complex cases this is true; however, for the majority of industry models, I see no point in this. Best practice is to test (using data, data scientists!) how large your training set needs to be. In most cases, there is more than enough data available (say, your training data has >1M cases) to do something smarter with your data than just throw it all into one model.

Figure 5 outlines the procedure. The data is sorted in random order and there’s a column available to quickly select percentiles of customers (an easy, repeatable way to achieve this is taking the last two digits of the customer ID, if you are working with customers). The first model is built using selection ‘Training 1’ and tested on ‘Validation 1’; next, the training set is increased (‘Training 2’) and again validated on ‘Validation 1’. This process is continued until the evaluation measure on ‘Validation 1’ doesn’t show an increase when increasing the training data (say, this is at ‘Training 3’). Now, since you have not used the whole dataset, you can take another partition of size ‘Training 3’ and build another model to test against ‘Validation 1’. Does this give the same performance? Do the same predictors come up? This tells a lot about the stability of the model. When done with training, the model can now be validated on the larger set ‘Validation 2’. Does the model still hold? And still, we might not have used the full dataset. The partition called ‘Other use’ can now be used to build an ensemble by mixing the models built so far, determining the combination weights (or non-linear combinations thereof) on yet more unseen data. This approach gives rise to much smarter use of your data; rather than waiting until the estimation of your 1M-row model completes, you now get the chance to quickly test, re-test and test again.
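The growing-training-slice procedure can be sketched as follows. This is a minimal sketch assuming scikit-learn; the slice boundaries, the logistic regression and AUC as the evaluation measure are all illustrative choices, and the last two digits of the customer ID serve as the repeatable percentile selector:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def incremental_training(X, y, customer_ids, train_steps=(10, 20, 30), val_slice=(90, 100)):
    """Grow the training slice step by step and track the validation score;
    stop growing once the score stops improving."""
    digits = customer_ids % 100  # stable, repeatable percentile selector
    val_mask = (digits >= val_slice[0]) & (digits < val_slice[1])
    scores = []
    for upper in train_steps:
        train_mask = digits < upper
        model = LogisticRegression(max_iter=1000).fit(X[train_mask], y[train_mask])
        auc = roc_auc_score(y[val_mask], model.predict_proba(X[val_mask])[:, 1])
        scores.append((upper, auc))
    return scores
```

Because the selector is a function of the customer ID, the same partitions can be reproduced in any later run, which is what makes the stability checks (a second model of size ‘Training 3’ on a fresh slice) meaningful.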

Out of time validation

Once you have a trained and tested model, you would like to bring it into production. However, the model was built on one time period and will be scored on a different time period. Due to changing circumstances, the relation between target and predictors may change over time (this is called drift), and this is something you would like to find out prior to bringing the model into production. In Figure 6, the out-of-time validation scheme is displayed. Given the three-month campaign window, the exact same selection is made, but now ranging from ‘t-3’ to ‘t-6’. The model is scored on this dataset and compared with the known results. The reason that the out-of-time validation is backward, rather than forward, is the following: if the model is trained on ‘t-3’-‘t-6’, it is validated one period further, and scored two periods further. Two periods means twice the drift, and likely the model performs more poorly than in the out-of-time validation. Assuming a constant drift, scoring one period backwards shows the same degradation in model performance as scoring one period forwards. Moreover, I feel it is a good idea to train the final scoring model on the most recent data available.
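The backward check can be sketched as follows: score the trained model on its own period and on the identical selection one period back, and treat the drop as an estimate of one period of drift. A minimal sketch assuming scikit-learn; AUC as the metric and the variable names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def out_of_time_check(model, X_recent, y_recent, X_prior, y_prior):
    """Compare performance on the training period with the same selection
    one period earlier; under constant drift, the degradation approximates
    what one period of forward scoring will cost."""
    auc_recent = roc_auc_score(y_recent, model.predict_proba(X_recent)[:, 1])
    auc_prior = roc_auc_score(y_prior, model.predict_proba(X_prior)[:, 1])
    return {"auc_recent": auc_recent, "auc_prior": auc_prior,
            "degradation": auc_recent - auc_prior}
```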

Scoring

Finally, we come to the whole purpose of the exercise: the moment of scoring. The predictors are calculated in exactly the same way as in the training set and the model is applied to that data. Out comes a prediction for the same time window as was trained on. Figure 7 shows an overview, and here it also becomes clear why, in the modeling phase, you cannot create predictors from the time frame between ‘now’ and ‘t-3’: this would imply that at the time of scoring, you also have data available from that time frame, which in this case is future data.

An often-heard complaint is that marketing (or any other business department) does not accept black-box models such as neural nets or support vector machines. I have found that using these simple visualizations greatly enhances the understanding and acceptance of how models can be valid. Business is more than willing to accept black-box models, provided that they can somehow follow the logic used to validate the models. Figure 8 shows the training, testing and validation logic for the earlier-mentioned container arrival example. The (black-box) bagged neural network outperformed the existing method, and Figure 8 helped to explain that the logic was sound.

Dynamic vs. static models

So far, the discussion has revolved around the data selection criteria, but not around the features. If a model only contains demographic data (which is usually fairly static), a customer will always get the same predicted score. In the early days of marketing this was acceptable; however, today’s models are expected to incorporate more dynamic features. Somehow, if a customer changes behavior, this needs to be picked up by the model and may result in a change in the prediction. This implies that one needs to capture behavior, and change thereof, as part of the predictors. A common way of doing this is to specify a larger set of time windows (‘last week’, ‘last month’, ‘last year’, ‘ever’) and, per window, aggregate behavior using summary functions such as max, mean, sum and standard deviation. Examples are ‘max time spent on website last week’, ‘mean time between purchases last month’, ‘total spent on services last year’, or ‘volatility (standard deviation) of balance last hour’. It is not uncommon to take those features and divide or difference them with each other to measure change in behavior. An example is ‘ratio of max time spent on website last week and max time spent on website last month’.
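These window roll-ups and change ratios can be sketched as follows; the window lengths, column names and the single behavior tracked (minutes spent on the website) are illustrative assumptions:

```python
import pandas as pd

def behavioral_features(events, now):
    """Aggregate one behavior over nested look-back windows, then derive a
    ratio feature that captures change in behavior."""
    now = pd.Timestamp(now)
    windows = {"last_week": 7, "last_month": 30, "last_year": 365}
    feats = {}
    for name, days in windows.items():
        w = events[events["date"] > now - pd.Timedelta(days=days)]
        g = w.groupby("customer_id")["minutes"]
        feats[f"max_{name}"] = g.max()
        feats[f"mean_{name}"] = g.mean()
    out = pd.DataFrame(feats).fillna(0)
    # Change in behavior: recent peak usage relative to longer-term peak usage.
    out["ratio_week_month"] = (out["max_last_week"] / out["max_last_month"]).where(
        out["max_last_month"] > 0, 0.0
    )
    return out
```

A score built on such features moves as soon as the most recent window moves, which is exactly the dynamic behavior the paragraph above asks for.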

Alternatively, one can create ‘histogram predictors’, by binning a continuous variable (from a particular time frame) and expressing the bins as counts or percentages. Examples are ‘percentage of expenses < 10’, ‘percentage of expenses 10-25’, ‘percentage of expenses 25-50’, etc. Subsequently, one can derive features that indicate the change over time for those bins.
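A sketch of such histogram predictors with pandas, using the expense bins from the example; the bin edges and column names are illustrative:

```python
import pandas as pd

def histogram_predictors(expenses, bins=(0, 10, 25, 50, float("inf"))):
    """Bin a continuous variable and express each customer's distribution
    over the bins as percentages ('percentage of expenses < 10', etc.)."""
    labels = [f"pct_{bins[i]}_{bins[i + 1]}" for i in range(len(bins) - 1)]
    # right=False gives [0, 10), [10, 25), [25, 50), [50, inf).
    binned = pd.cut(expenses["amount"], bins=list(bins), labels=labels, right=False)
    counts = pd.crosstab(expenses["customer_id"], binned)
    return counts.div(counts.sum(axis=1), axis=0) * 100
```

Computing the same table for two consecutive time frames and differencing them gives the change-over-time features the paragraph mentions.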

A model using these types of predictors (in combination with static predictors) allows the prediction to change once a certain event happens. Note that in order to allow combinations of demographics to have their 'own' effect of behavioral change, one needs to use models that incorporate interactions between features (i.e., not a main-effects-only logistic regression).

It provides a lot of insight to score the model for a set of prototypical data and use visualization to show how a propensity score changes when one or more dynamic features change.

Purchased in the last three months vs. ever purchased

Note that so far, the discussion dealt with making a selection based on a time window. I’ve encountered cases where there’s no time available, and one simply knows whether the customer has bought the product or not. Such a model is often referred to as a ‘look-alike’ model (although all models can be explained in terms of finding look-alikes). Modeling is still possible, but be aware that the predictors may occur later in time than the purchase event, which may result in predictor leakage. Also be aware that one may be modeling very historic behavior. For example, take a dataset where the target is ‘customer has mobile data versus not’, without a time period. A model will relate the mobile data uptake to a set of background characteristics; however, since there’s no timing in the data, the model cannot distinguish early adopters of mobile data from late adopters. The result is that at scoring time, the model will (partially) point at customers who look like early adopters, while, as we speak, mobile data is far beyond the early adoption stage. Often those things can be detected if one really thinks through the implications of the approach, rather than blindly trying to model the first dataset that comes to mind.

A transaction based approach

An entirely different approach is displayed in Figure 9. Rather than making a time selection, one recognizes that customers follow certain patterns in their subsequent transactions. The data is very close to an original transaction table. The table is sorted by customer ID and then ascending by purchase date (again, the example given is purchase, but the same logic is widely applicable). At the moment a customer is doing a transaction, this transaction represents the latest data available. In the historic data, however, one can look beyond that ‘current’ transaction and observe the next transaction. In order to predict the next transaction, one can bring that next transaction one line up and call it a target. As predictors for that target, the details of the current transaction can be used, or previous transaction details can be placed in the same row as the current transaction. The target can contain various characteristics: it can be the next product, the next price, the time to the next purchase (inter-purchase time), the next channel, the next payment method, etc. The predictor set can also be extended: rather than the previous product, one can take the previous price, the previous channel, or some lagging variable such as average spend over the last n purchases. Even Markov-like predictors can be derived here: given the current purchase, historically, what is the transition probability to the next product? One element that should be considered here is that many machine learning models implicitly assume independent data records. If this is violated, one underestimates standard errors, with the error proportional to the dependence between observations. One option is to use models that handle this dependency correctly, such as linear mixed models (random effects models). Alternatively, one could argue that if the right predictors are used, then conditional on the model, the observations are independent.
The interesting aspect of this model is the fact that there’s no selection logic needed, and the scoring is done at the moment a new transaction comes in.
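The bring-the-next-transaction-one-line-up construction can be sketched as follows; column names are illustrative, and only the next product, the inter-purchase time and the previous product are derived here out of the many possible targets and lags:

```python
import pandas as pd

def next_transaction_targets(transactions):
    """Sort per customer by purchase date, then pull the next transaction
    one row up so it becomes the target of the current one; lagged columns
    from the previous transaction become extra predictors."""
    tx = transactions.sort_values(["customer_id", "purchase_date"]).copy()
    tx["next_product"] = tx.groupby("customer_id")["product"].shift(-1)   # target
    tx["next_date"] = tx.groupby("customer_id")["purchase_date"].shift(-1)
    tx["days_to_next"] = (tx["next_date"] - tx["purchase_date"]).dt.days  # inter-purchase time
    tx["prev_product"] = tx.groupby("customer_id")["product"].shift(1)    # lagged predictor
    # The last observed transaction per customer has no 'next' to learn from.
    return tx.dropna(subset=["next_product"])
```

Scoring is then simply running the model on the newest incoming transaction row, with its lags filled from that customer's history.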

Concentration models vs. mechanism models

Say you like fishing, and after fishing in a variety of ponds, you simply establish that in some ponds you are more likely to catch fish. You remember those ponds by remembering the route to them; however, the route itself bears no explanatory power as to why you are more successful in those ponds. That is the principle behind a concentration model. Many (current) marketing models can be referred to as concentration models: within certain combinations of (simple) demographic variables, there seems to be a higher uptake. This has some stability over time, and hence one can make use of it when scoring. The lift of those models is high enough to make the marketing business case (say, a lift of 3-5 for the first 30%); however, with the typically low base rate of campaigns, the false positive rate can be as high as 90%.

In contrast, there is a type of model that can be referred to as a mechanism model. In a common predictive maintenance use case, one tries to predict the breakdown of equipment. Here, the breakdown of a machine is typically preceded by increased vibrations (caused by loose parts) or higher oil temperature (caused by soot deposits). A sensor picking up those measurements will yield high-quality predictors, with very accurate models as a result. The connection (vibrations - loose parts, or higher oil temperature - soot deposits) can be established by speaking with engineers. A way to look at this is to say that the engineers who developed the equipment knew that vibrations and oil temperature are important aspects to log; however, they lacked the ability to formally ‘convert’ the measurements into a propensity score.

In a way, this is the correlation-causation discussion applied to modeling. It shows that correlation can be extremely useful (and profitable); however, when a high model quality is required (or expected), one needs to look for data that may contain traces of mechanisms. As an example, consider a bank that tries to predict mortgage uptake from a set of financial summaries. In order to improve prediction quality, they started monitoring transaction details, and more specifically, whether customers would start or extend year contracts. Clearly, a mechanism starts coming into sight.

Businesses that are looking to advance in data science invariably expect models of ‘mechanism model’ quality, while their available data only gives rise to ‘concentration models’. Although the difference between a concentration model and a mechanism model is purely interpretational, using those terms helps when talking about models, putting model performance into realistic relation with the available data.

Concluding remarks

It is difficult to give a comprehensive overview of all types of dataset creation. For example, I recognize not having discussed the case where a product hierarchy is available and one can take product-category-level predictors to stabilize the predictions at the product level; and there are many other such cases. Those use cases may be too specific and require a better understanding of the challenge at hand.

Overall, I find that the topics discussed in this article are always assumed to be known and well understood; however, when probing deeper, the methods followed turn out to be based more on habit and imitation than on arguments or data.

Apart from the selection logic, I hope to have shown interesting visualizations that help to guide discussions around these topics. Although the algorithm side of machine learning is ever expanding with more complex approaches, the daily practice of a data scientist building and tuning those models is based on a surprisingly small number of (easy) principles. Those principles can and will be automated (once more: look at the caret package in R and the like). The data scientist who does not want to become obsolete will have to develop the creative skills to frame a business challenge in such a way that the resulting dataset is able to give new and valuable answers.

http://olavlaudy.com/index.php?title=Data_Science_Data_Logic