Golden Principle: Your test data and evaluation environment should be as close as possible to what you expect post-deployment. This principle should guide your choice of validation strategy.

With this in mind, let’s understand the nuances of different validation strategies. For each validation strategy, I’ll talk about the following:

Implementation of the strategy

Considerations while using this strategy

Confidence intervals

Validation strategies can be broadly divided into 2 categories: Holdout validation and cross validation.

Holdout validation

Within holdout validation we have 2 choices: Single holdout and repeated holdout.

a) Single Holdout

Implementation

The basic idea is to split our data into a training set and a holdout test set. Train the model on the training set and then evaluate model performance on the test set. We take only a single holdout—hence the name. Let’s walk through the steps:

Step 1: Split the labelled data into 2 subsets (train and test).

Step 2: Choose a learning algorithm (for ex: Random Forest) and fix the values of its hyperparameters. Train the model on the training set to learn the parameters.

Step 3: Predict on the test data using the trained model. Choose an appropriate metric for performance estimation (ex: accuracy for a classification task). Assess predictive performance by comparing predictions and ground truth.

Step 4: If the performance estimate computed in the previous step is satisfactory, combine the train and test subsets and retrain the model on the full data with the same hyperparameters.
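For concreteness, here is a minimal sketch of these four steps using scikit-learn. The dataset, test size, and hyperparameter values are illustrative assumptions, not prescriptions:

```python
# A minimal sketch of the single-holdout procedure with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Step 1: split the labelled data into train and test subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: fix the hyperparameters and train on the training set
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Step 3: estimate predictive performance on the held-out test set
preds = model.predict(X_test)
print(f"Holdout accuracy: {accuracy_score(y_test, preds):.3f}")

# Step 4: if the estimate is satisfactory, retrain on the full data
# with the same hyperparameters before deployment
final_model = RandomForestClassifier(n_estimators=100, random_state=42)
final_model.fit(X, y)
```

Note that the final model is retrained from scratch on the full data; the holdout estimate from Step 3 is what we report for it.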

Considerations

Some things that need to be taken into account while using this strategy:

Random splitting or not?

Whether to split the data randomly depends on the kind of data we have. If the observations are independent of each other, random splitting can be used. Where this assumption is violated, random splitting should be avoided. A typical case is time series data, where observations depend on each other. For example: today's stock price most likely depends on yesterday's price.

For time series data, splitting should be done chronologically, with the more recent data in the test set.

This also aligns with the golden principle: more recent data is more likely to resemble what we can expect in production.
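As a sketch, a chronological split can be as simple as sorting by timestamp and holding out the most recent slice. The DataFrame and the 80/20 split below are illustrative assumptions:

```python
# A minimal sketch of a chronological train/test split for time series data.
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=100, freq="D"),
    "price": range(100),  # placeholder target
})

# Sort chronologically, then hold out the most recent 20% as the test set
df = df.sort_values("date")
cutoff = int(len(df) * 0.8)
train, test = df.iloc[:cutoff], df.iloc[cutoff:]

print(train["date"].max(), "<", test["date"].min())  # no temporal leakage
```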

Stratified sampling

While splitting, we need to ensure that the distribution of the features as well as the target remains similar in the training and test sets.

For ex: Consider a problem where we're trying to classify an observation as fraudulent or not. While splitting, if the majority of fraud cases went to the test set, the model won't be able to learn the fraudulent patterns, as it doesn't have access to many fraud cases in the training data. In such cases, stratified sampling should be used, as it maintains the proportion of the different classes in the train and test sets.

Stratified sampling should be used for splitting the data almost always. However, if we expect the distribution in production to be very different from what we have in our present data, stratification may not be a good choice.
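Here is a minimal sketch of a stratified split with scikit-learn; the synthetic data and the ~1% fraud rate are illustrative assumptions:

```python
# A minimal sketch of a stratified split for an imbalanced fraud dataset.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.01).astype(int)  # ~1% fraud cases

# stratify=y keeps the fraud proportion the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"Fraud rate - train: {y_train.mean():.4f}, test: {y_test.mean():.4f}")
```

Passing `stratify=y` to `train_test_split` is what preserves the class proportions; a plain random split would occasionally leave the test set with almost no fraud cases.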

Choice of test size

Keeping aside a large amount of data for the test set leaves less data for training, which can result in an underestimation of predictive power (high bias**). On the other hand, the estimate will be more stable (low variance**), as shown in the figure below. This trade-off is more relevant for smaller datasets.

**Note: Here, bias and variance are w.r.t. the estimate of predictive power and not of the model itself.
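To observe this trade-off empirically, one rough sketch is to repeat random splits at several test sizes and look at the spread of the resulting accuracy estimates. The dataset and model below are illustrative assumptions:

```python
# A rough sketch: repeat random splits at different test sizes and
# compare the mean and spread of the accuracy estimates.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

for test_size in (0.1, 0.3, 0.5):
    scores = []
    for seed in range(30):  # repeated random splits
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed
        )
        model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        model.fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))
    print(f"test_size={test_size}: mean={np.mean(scores):.3f}, "
          f"std={np.std(scores):.3f}")
```

On most datasets, the standard deviation of the estimate shrinks as the test set grows, while the mean tends to drift downward because less data is left for training.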