Avoid Time Loops With Cross-Validation

Using time series cross-validation and catching lookahead bias

Groundhog Day is quite the entertaining movie. For those who have never seen it, Bill Murray’s character gets caught in a strange time loop where he relives the same day (February 2nd, Groundhog Day) over and over again. In fact, he relives it so many times that he’s able to use his knowledge of what’s to come to change his behavior for his own benefit.

Those of us who work in areas of data science that rely on time-dependent data can immediately see where this is going.

Time-dependent data in the stock market

At Apteo, we work with equity predictions and financial markets. As we discussed in our introductory post, we rely heavily on machine learning (specifically deep neural networks) to predict the future of individual equities. Like any task that forecasts far into the future, this requires data with a strong time-dependent structure.

Working with this type of data requires careful attention, because look-ahead bias can creep into the data science process in subtle ways. Here’s an example we recently ran into.

We train several neural networks to predict future stock returns over a range of time frames. When we first started out, we used a very standard data science approach to address this task. We ordered our dataset chronologically, used the first 70–80% of it as our training set, and used the remaining data as our test set.

The label for each instance was generated by taking the adjusted close of each stock at time t (some date in the past) and comparing that to the adjusted close of that stock N days afterwards (a date that is also in the past).
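The label construction described above can be sketched in a few lines. This is only an illustration under stated assumptions: the post does not say exactly how the two prices are compared, so the sketch assumes a simple percentage return, and the function name `forward_return_labels` and the sample prices are hypothetical.

```python
from typing import List, Optional


def forward_return_labels(adj_close: List[float], horizon: int) -> List[Optional[float]]:
    """Return the forward return over `horizon` rows for each day.

    Assumes `adj_close` is ordered chronologically with one row per trading
    day, and that the label is a simple percentage return (an assumption).
    """
    labels: List[Optional[float]] = []
    for t, price_now in enumerate(adj_close):
        future_index = t + horizon
        if future_index < len(adj_close):
            labels.append(adj_close[future_index] / price_now - 1.0)
        else:
            # The price `horizon` rows ahead does not exist yet,
            # so the most recent rows cannot be labeled.
            labels.append(None)
    return labels


# Hypothetical adjusted closes, purely for illustration.
prices = [100.0, 102.0, 101.0, 105.0, 110.0]
labels = forward_return_labels(prices, horizon=2)
```

Note that every label depends on a price that comes after the date of its own instance. That detail is what matters in the rest of this post.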

At first glance, this may seem acceptable. Our training data contains instances whose dates are before those of any instances in our test data, which means that when we evaluate our network, we evaluate it on unseen data. Right?

Of course not, otherwise we wouldn’t be publishing this post!

In fact, what was happening was that the process that we used to create our datasets introduced a subtle look-ahead bias.

An example may help to illustrate the issue.

Look-ahead bias in the evaluation process

Assume a one-year prediction horizon. If the last training instance in our dataset was for Apple on January 2, 2011 (since January 1st is a market holiday), that means that we would need to use the stock price of Apple from January 2, 2012 to create the label for that instance. Now what would happen if we used our trained network to predict the return on Apple’s stock for January 9, 2011?

Every training instance is dated before every test instance, but the training labels were built from prices as late as January 2, 2012. So during the evaluation stage, the network already has knowledge about Apple’s stock price on days after January 9, 2011, which it could then use to make predictions for Apple on January 9, 2011.

We illustrate this in the image below.

A visualization of how different points in time are used to create data for training and testing

This subtle issue is actually an example of lookahead bias creeping into the evaluation stage of our process, and it caused our network to appear to be more accurate than it actually was.
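The Apple example can be checked mechanically. The sketch below is not our production code. It assumes a one-year horizon measured as 365 calendar days (inferred from the dates in the example), and the helper names `label_date`, `leaks`, and `purge` are made up for illustration.

```python
from datetime import date, timedelta
from typing import List

HORIZON = timedelta(days=365)  # assumed one-year label horizon


def label_date(instance_date: date, horizon: timedelta = HORIZON) -> date:
    """Date of the future price needed to build the label for an instance."""
    return instance_date + horizon


def leaks(train_dates: List[date], test_dates: List[date],
          horizon: timedelta = HORIZON) -> bool:
    """True if any training label uses a price from the test period or later."""
    latest_label = max(label_date(d, horizon) for d in train_dates)
    return latest_label >= min(test_dates)


def purge(train_dates: List[date], test_dates: List[date],
          horizon: timedelta = HORIZON) -> List[date]:
    """Keep only training instances whose labels are fully known before testing starts."""
    first_test = min(test_dates)
    return [d for d in train_dates if label_date(d, horizon) < first_test]


train_dates = [date(2009, 12, 1), date(2010, 6, 1), date(2011, 1, 2)]
test_dates = [date(2011, 1, 9)]

has_leak = leaks(train_dates, test_dates)         # the naive split leaks
safe_train = purge(train_dates, test_dates)       # only the 2009 instance survives
still_leaks = leaks(safe_train, test_dates)
```

With these dates, the label for January 2, 2011 needs the price from January 2, 2012, which falls after the first test date, so the naive chronological split is flagged. Dropping training instances whose labels reach into the test period removes the overlap, at the cost of losing roughly one horizon's worth of training data.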

So, here we see that look-ahead bias can affect both the training and evaluation stages of a model, but that’s not the only gotcha to be aware of when it comes to time-dependent data, especially in the world of investing.

Changing regimes

Another issue is that the distribution of future data doesn’t always match the distribution of previous data. In finance, you’ll frequently hear this referred to as “changing regimes”. The idea here is that the market can quickly shift from choppy to bullish to bearish to recession to breakout, all in ways that have never been observed in the past.

When creating time-dependent predictions, this is problematic. Using only a single time period for testing our predictions may not capture the accuracy of our network in different historical regimes.

It’s possible for our network to be accurate today. However, one year ago, a network trained in the exact same way as the network trained today may have been extremely inaccurate.

If we had kept evaluating on a single test period, as we had been doing, we would never have seen that effect.

So the natural question is: how do we account for lookahead bias and changing regimes?

Our answer lies in walk-forward cross-validation.

Walk-forward cross-validation

In machine learning that does not have a strong time-dependency, it’s quite common to use k-fold cross-validation to evaluate a trained model. The idea behind this technique is fairly simple:

1. Select a value for k (for us this is often 10)
2. For each of k iterations, create a subset of the data that has 1/k of the data points and use that as the test set
3. Use the remaining data as the training set
4. Train and evaluate your model on the training and test datasets as normal
5. Keep track of the metrics on each of the test sets
6. When all k models are trained, average the metrics from each of the test sets together to get a final value for the entire model (at this point it would be necessary to train a model on the entire dataset to get the actual model to be used in production)
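For readers who prefer code, here is a minimal sketch of the k-fold procedure using only the Python standard library. The "model" is a toy stand-in that predicts the mean of the training labels, and the names `k_fold_indices`, `cross_validate`, `fit_mean`, and `mean_abs_error` are hypothetical, chosen only for this example.

```python
import random
from statistics import mean
from typing import Callable, List, Tuple


def k_fold_indices(n_samples: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle row indices and split them into k disjoint test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most 1
    splits = []
    for i, test_idx in enumerate(folds):
        # Everything outside the current test fold is training data.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


def cross_validate(xs: List[float], ys: List[float], k: int,
                   fit: Callable, score: Callable) -> float:
    """Train and score one model per fold, then average the fold metrics."""
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_scores.append(
            score(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx])
        )
    return mean(fold_scores)


# Toy stand-ins for a real model and metric (illustration only).
def fit_mean(train_x: List[float], train_y: List[float]) -> float:
    return mean(train_y)


def mean_abs_error(model: float, test_x: List[float], test_y: List[float]) -> float:
    return mean(abs(y - model) for y in test_y)


xs = [float(x) for x in range(20)]
ys = [float(x % 5) for x in range(20)]
cv_error = cross_validate(xs, ys, k=5, fit=fit_mean, score=mean_abs_error)
```

The shuffle is exactly what makes this unsuitable for time-dependent data: rows from the future end up in the training set for test rows from the past.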

For time-dependent data, the same idea is used. The difference, though, lies in how the test and training sets are created in each iteration. Instead of holding out an arbitrary 1/k of the data on each repetition, the end date of the training dataset is walked forward on each iteration, and the test set is the period that follows it. The network is trained from the beginning of the entire dataset up until that end date, which must sit far enough before the start of the test period that no training label depends on a price from the test period (this accounts for the lookahead bias we mentioned above).

We illustrate this in the image below.

A visualization of walk-forward cross-validation, accounting for the need to use data from the future to create labels that are associated with data instances in the past
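One way to generate such splits is sketched below. It assumes the rows are already in chronological order, that the training window always starts at the first row (an expanding window), and that the label horizon is expressed as a number of rows. The function name `walk_forward_splits` and the parameter `min_train` are hypothetical, not taken from our codebase.

```python
from typing import Iterator, Tuple


def walk_forward_splits(n_samples: int, k: int, horizon: int,
                        min_train: int) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) row ranges for k walk-forward iterations.

    A gap of `horizon` rows is left between the end of each training set
    and the start of its test set, so that no training label is built
    from a price inside the test period.
    """
    test_size = (n_samples - min_train - horizon) // k
    if test_size < 1:
        raise ValueError("not enough rows for the requested k, horizon and min_train")
    for i in range(k):
        test_start = min_train + horizon + i * test_size
        test_end = test_start + test_size
        train_end = test_start - horizon  # exclusive end of the training rows
        yield range(0, train_end), range(test_start, test_end)


# Example: 100 chronologically ordered rows, 4 iterations, 5-row label horizon.
splits = list(walk_forward_splits(n_samples=100, k=4, horizon=5, min_train=35))
for train_rows, test_rows in splits:
    print(f"train rows 0..{train_rows[-1]}, test rows {test_rows[0]}..{test_rows[-1]}")
```

In each iteration the last training row is `horizon` rows before the first test row, so the price used for its label comes from the row just before the test period begins.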

When all networks are done training, the metrics for each test set can be combined in a weighted average to give the cross-validated error for the model that was trained on the entire dataset.
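As a small worked example, the weighted average might look like the sketch below. The post does not specify the weights, so this assumes each test set is weighted by its number of instances. The function name and the numbers are illustrative only.

```python
from typing import Sequence


def weighted_average(fold_metrics: Sequence[float], fold_sizes: Sequence[int]) -> float:
    """Combine per-fold metrics, weighting each fold by its test-set size (an assumption)."""
    if len(fold_metrics) != len(fold_sizes):
        raise ValueError("one size is required per fold metric")
    total = sum(fold_sizes)
    return sum(m * n for m, n in zip(fold_metrics, fold_sizes)) / total


# Hypothetical per-fold errors and test-set sizes.
fold_errors = [0.10, 0.20, 0.40]
fold_sizes = [100, 100, 200]
cv_error = weighted_average(fold_errors, fold_sizes)  # (10 + 20 + 80) / 400
```

It is also worth looking at the individual fold errors before averaging them, since a large spread between folds is a sign that accuracy differs across regimes.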

Benefits and disadvantages

Using this strategy, we can account for different regimes and also avoid accidentally introducing lookahead bias into our evaluation process. This gives us a more trustworthy estimate of our models’ accuracy.

This approach does have its drawbacks, though. For one, training k additional networks takes additional time and CPU resources. In addition, the selection of k can be subjective and requires some understanding of historical data distributions (though it could be argued that a good data scientist would already care about this anyway).

Ultimately, we find that the increased time and computational requirements are small prices to pay for a better understanding of our network’s accuracy. After all, if we had no way to properly measure our baseline performance, we wouldn’t be able to truly understand if any of the work we were doing to improve the accuracy of our networks had any impact.

In this case, we fully believe in the saying “you can’t manage what you can’t measure.”

Acks

Many thanks to our CTO, Camron, for the great visualizations. Also thanks to Hardik, Gaurav, and the rest of the folks over at qplum for doing a great job of spreading information about walk-forward predictions and ML in finance. Here’s a great video from them that has even more interesting material.