How to invalidate a backtest

This follows an earlier post which described why a new entrant into quantitative finance should expect it to be difficult to find a strategy to trade.

Let us assume that the reader has chosen a data set, but hasn’t finalised the details. The next step would be to simulate the trades the program would have made historically, and see whether it would have made money. This is called a backtest, and it is an essential part of systematic trading.

The easiest way to do this badly is through inaccuracy: The backtest might not reflect how the strategy would actually have traded.

Inaccurate backtests

The most common mistakes to make in this scenario are the following:

Free rebalancing: The backtest allows the strategy to stay at, for instance, $1M long on Apple from day to day without incurring trading costs. You might think that this would not make much difference, but it does. It gives the strategy free access to a mean reversion alpha (if the price goes up, the strategy has to sell stock to rebalance, and vice versa), and it will make money if the strategy is long on average. Unless you treat rebalancing costs accurately, your backtest won’t be accurate. For the more technical reader, note that a simple regression is equivalent to taking a free-rebalancing strategy’s z-score, so treat simple regressions with suspicion.

Forward looking: The backtest uses data that would not be available until after the trade that depends on it. This is normally just a programming bug.

The universe was chosen based on current observables. For instance, picking the current largest 100 stocks as your universe. None of the current largest 100 stocks have suffered a calamitous collapse in the last few years, or they wouldn’t be in the largest 100, so you’ve post-selected a biased universe. This will make your strategy look like it would have made money when it wouldn’t have. This is a form of forward looking.

T+0: Backtests often use data available at the close of day x to trade at the close price of day x. This sounds like just a tiny change from an intraday strategy that trades with a 5 minute delay, but a lot of strategies appear to work at T+0 and not with more realistic assumptions. If the strategy isn’t evaluated with at least the delay that the trading strategy would have, then its predicted performance is not valid.

The cost model is wrong. People tend to underestimate trading costs. Often, the trading costs are assumed to be negligible, or it is assumed that once they’ve found a strategy that works, the way it trades can be manipulated to minimise the costs. That is comparable to perfecting the way that you drive 42km, and assuming that that’s a valuable way to prepare for a marathon. The only exception to that rule is ultra-long-term strategies.

These should be avoided, since they’re all problems that can be solved with care. Please note that they all normally make the backtest appear better than real life, not just different.
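Several of the mistakes above can be ruled out by construction in the backtest loop itself. The following is a minimal Python sketch, not a production backtester: the function name backtest, the lag and cost_rate parameters, the flat 5 basis point cost and the rule of rebalancing fully to target every day are all assumptions made for illustration.

```python
def backtest(signals, returns, lag=2, cost_rate=0.0005):
    """Daily net P&L of a strategy that targets signals[t] dollars.

    signals[t]: target dollar position decided from data up to the close of day t.
    returns[t]: close-to-close return from day t-1 to day t.
    lag: days between a signal and the first return it earns.
         lag=0 is a forward-looking bug, lag=1 is T+0 execution at the
         same close, lag=2 is execution one close later.
    cost_rate: assumed flat proportional trading cost (5 basis points).
    """
    if lag < 0 or len(signals) != len(returns):
        raise ValueError("lag must be >= 0 and inputs must have equal length")
    pnl = []
    held = 0.0  # dollars held after the previous day's price move
    for day in range(lag, len(returns)):
        target = signals[day - lag]           # signal that is `lag` days old
        trade = target - held                 # rebalancing trade, never free
        cost = abs(trade) * cost_rate
        pnl.append(target * returns[day] - cost)
        held = target * (1.0 + returns[day])  # position drifts with the price
    return pnl


# Made-up returns and a constant $1M long position.
returns = [0.01, -0.02, 0.015, -0.005, 0.01]
constant_long = [1_000_000.0] * len(returns)
net = backtest(constant_long, returns, lag=2, cost_rate=0.0005)
free = backtest(constant_long, returns, lag=2, cost_rate=0.0)
print("with costs:", round(sum(net), 2), "free rebalancing:", round(sum(free), 2))

# A signal built from today's return looks far better with lag=0 than it should.
momentum = [1_000_000.0 if r > 0 else -1_000_000.0 for r in returns]
for lag in (0, 1, 2):
    print("lag", lag, round(sum(backtest(momentum, returns, lag=lag, cost_rate=0.0))))
```

Setting cost_rate to zero reproduces the free-rebalancing assumption, and lowering lag to 0 or 1 shows how much of the apparent profit depends on timing that the live strategy would not have.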

What a backtest gives us

A backtest provides the reader with a graph of profit/loss (P&L) and other statistics relating to the strategy’s trading patterns, risk profile and performance. The most important statistic, though, is the level of statistical confidence that the average daily profit, after costs, is positive.

This concept can catch people out. It is not enough that the strategy had a positive average daily profit over the backtest, because the backtest P&L might have just been lucky. Instead we use the average daily profit divided by its uncertainty (standard error), and call that a z-score.
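As a minimal sketch, the calculation looks like this in Python. The function name z_score and the sample P&L figures are made up for illustration, and the standard error formula assumes that daily profits are independent of each other, which is an assumption rather than a property of real P&L.

```python
import math
import statistics


def z_score(daily_pnl):
    """Average daily profit divided by its standard error.

    Assumes independent days, so the standard error is stdev / sqrt(n).
    """
    n = len(daily_pnl)
    if n < 2:
        raise ValueError("need at least two days of P&L")
    standard_error = statistics.stdev(daily_pnl) / math.sqrt(n)
    return statistics.mean(daily_pnl) / standard_error


# Hypothetical after-cost daily P&L in dollars.
pnl = [120.0, -80.0, 60.0, -30.0, 90.0, -50.0, 40.0, 10.0]
print(round(z_score(pnl), 3))
```

The average here is positive ($20 a day), but the z-score is below 1, so the positive average on its own says very little.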

The z-score: why P-values are normally wrong

This z-score is often converted to a probability value, which can be interpreted as the probability that a similar but alpha-neutral strategy would do better. However, the conversion is usually done using dodgy assumptions. The usual way is to use “normal” or “Gaussian” statistics (and to assume no correlation, among other things). However, this is not the only way to convert from a z-score to a P-value, and if you use a more accurate way, you’ll get a different value, as shown on the right.

Basically if someone converts from a z-score to a P-value using the middle thermometer (normal distribution), they’re simply wrong. If they use a different thermometer (there are lots), then you have to get stuck in to all the assumptions and distributions that they’re using to make that conversion.
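To see how much the choice of distribution matters, here is a small Python sketch that converts the same z-score in two ways. The function names are hypothetical, and treating the z-score itself as Laplace distributed with unit variance is an assumption made for illustration; it may not match the exact scaling used for the thermometers.

```python
import math


def p_value_normal(z):
    """One-sided tail probability P(Z > z) for a standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def p_value_laplace(z):
    """One-sided tail probability for a Laplace distribution with unit variance."""
    # A Laplace distribution has variance 2 * b**2, so unit variance means
    # a scale of b = 1 / sqrt(2), and exp(-z / b) = exp(-sqrt(2) * z).
    if z >= 0:
        return 0.5 * math.exp(-math.sqrt(2.0) * z)
    return 1.0 - 0.5 * math.exp(math.sqrt(2.0) * z)


for z in (1.5, 2.0, 3.0, 4.0):
    print(f"z={z}: normal {p_value_normal(z):.5f}, Laplace {p_value_laplace(z):.5f}")
```

Under these assumptions the two conversions are close for small z-scores and drift apart quickly in the tail, which is exactly where the decision to trade gets made.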

The z-score game:

There are a few pictures below. Each is split into three. The first is a raw image. Either the 2nd or 3rd will also contain the raw image, but both will contain a lot of noise on top. Your task is to guess which has the raw image in it. We’ll start off with an easy example, with a z-score of 36.6.

You probably guessed correctly: the middle image has the attractor in it, the rightmost is just noise. Now try for these slightly harder examples:

This problem is not easy. In fact, for this task, it’s fairly likely that you got it wrong. The correct answers are at the end of the blog post. The challenge is slightly unfair: the kind of noise that I’ve added makes large z-scores somewhat more likely than the Laplace distribution would (right thermometer). But a non-technical reader should be able to see that if the noise is badly behaved, then surprisingly large z-scores can happen fairly often by chance.

So, to conclude the last two sections, P-values are normally wrong, and z-scores can be quite noisy: a z-score of 1.5 doesn’t necessarily mean anything.

The importance of this is that the z-score is both our ranking system and our measure of certainty that the strategy will make money.

In-sample and Out-sample.

Most strategies will not be optimal from the start, so most research programs allow for some tweaking of the strategy. For instance we might have a threshold for which we would try five different values, a holding period with four different possibilities, and a couple of other options, like take-profit / stop-loss conditions, and three cap weighting scenarios. The strategy doesn’t seem to work that well for our initial idea, but we try a few others. We have now tried the equivalent of almost 1000 different combinations. Using our image-hidden-in-noise example, we can also try 1000 noise layers, selecting the one with the highest z-score, and we are now faced with the following figure:

So, which is the “correct” one? Well we couldn’t tell before and we can’t really tell now. It might even be that neither has the mysterious attractor in it. If we use the z-score to get some measure of how likely it is that they contain the attractor, it says “yes” to both: they’re both quite impressive z-scores. The problem is that we know that they only appear to have a high z-score because we kept looking until we saw something that appeared to be good.
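The selection effect is easy to reproduce with pure noise. The sketch below is a hedged illustration in Python: it assumes that each of the 1000 combinations produces an independent, Gaussian z-score with no real edge, which is kinder than the noise discussed in this post, and the function name is made up.

```python
import random


def best_z_of_noise(n_trials, rng):
    """Largest z-score among n_trials strategies whose true edge is zero."""
    return max(rng.gauss(0.0, 1.0) for _ in range(n_trials))


rng = random.Random(1)  # fixed seed so the run is repeatable
repeats = 200
mean_single = sum(best_z_of_noise(1, rng) for _ in range(repeats)) / repeats
mean_best = sum(best_z_of_noise(1000, rng) for _ in range(repeats)) / repeats
print(f"one try: {mean_single:.2f}, best of 1000 tries: {mean_best:.2f}")
```

Even with well-behaved noise, the best of 1000 worthless strategies typically shows a z-score of around 3, which most people would be happy to trade.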

For this reason, an in-sample/out-sample approach is taken where the date range is split in two: the first half of the date range is chosen to be the in-sample period, and the strategy is optimised in that period. Later, when we want a final evaluation to see if we would trust our strategy to make money, we evaluate its performance in the second half. Since we didn’t fit to the second half, we can’t have overfitted to it either.
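Here is a minimal Python sketch of that procedure. Everything in it is hypothetical: the candidate parameter sets are pure noise with no edge, the labels such as params_0 are invented, and the z-score uses the simple independent-days standard error.

```python
import math
import random
import statistics


def z_score(pnl):
    return statistics.mean(pnl) / (statistics.stdev(pnl) / math.sqrt(len(pnl)))


def halves(pnl):
    """Split a daily P&L series into the in-sample and out-sample halves."""
    mid = len(pnl) // 2
    return pnl[:mid], pnl[mid:]


def evaluate_out_of_sample(candidate_pnls):
    """Choose the best candidate in-sample, then score it once out-sample.

    candidate_pnls maps a parameter label to that candidate's daily P&L.
    """
    best = max(candidate_pnls, key=lambda k: z_score(halves(candidate_pnls[k])[0]))
    in_sample, out_sample = halves(candidate_pnls[best])
    return best, z_score(in_sample), z_score(out_sample)


rng = random.Random(2)
# 200 candidate parameter sets, 500 days each, none with any real edge.
candidates = {
    f"params_{i}": [rng.gauss(0.0, 1.0) for _ in range(500)] for i in range(200)
}
best, z_in, z_out = evaluate_out_of_sample(candidates)
print(best, "in-sample z:", round(z_in, 2), "out-sample z:", round(z_out, 2))
```

The in-sample z-score of the winner is inflated by the search, while the out-sample z-score is an honest draw, because the second half played no part in choosing the parameters.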

A common mistake to make here, though, is to try the out-sample period several times: we try one strategy in the in-sample period, and it appears to do quite well, but doesn’t work out-sample. In that case, often, the strategy is abandoned, and the researcher goes back to the drawing board. A different idea is then chosen, and the research program is repeated. The problem now is that if the 2nd and 3rd research programs all use the same out-sample period, then we’ve overused the out-sample period. We might get this figure if we tried 5 research programs:

The key point is that they both look really good (even though at least one isn’t). We would trade either strategy with these z-scores.
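A rough way to put a number on this: if each research program has some chance of passing the out-sample test by luck, the chance that at least one of them passes grows with every attempt. The Python sketch below assumes independent programs with no real edge and uses the Gaussian conversion that this post warns about, so it should be read as an optimistic lower bound; the function name and the threshold of 2 are invented for the example.

```python
import math


def chance_of_false_pass(n_programs, z_threshold):
    """Probability that at least one edge-free program clears z_threshold out-sample."""
    p_single = 0.5 * math.erfc(z_threshold / math.sqrt(2.0))  # Gaussian tail
    return 1.0 - (1.0 - p_single) ** n_programs


for n in (1, 5, 20):
    print(n, "programs:", round(chance_of_false_pass(n, 2.0), 3))
```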

Another quick point is that strategies can just stop working, which means that even good statistics, properly done, don’t guarantee that the strategy will work.

Conclusion:

It is very important to backtest accurately if we want to avoid wasting time and money on strategies that don’t work in the real world. Backtests give you a z-score, which is a measure of certainty that the strategy makes money. The z-score is often converted to a P-value, and the conversion is almost always done wrong: z-scores have a much larger random range than -1.5 to +1.5. If you try to optimise your strategy, then be aware that you won’t be able to tell which is better, and that the optimised strategy will appear to be better than it is. For this reason, people split their date range into an in-sample period and an out-sample period. Unfortunately it’s almost impossible to avoid overusing your out-sample period, which invalidates it.

The good news:

There are several respects in which the picture has been painted as more bleak than it is.

The main one is that how unreliable a z-score is depends on the type of noise you get, and in finance, the noise (for instance the performance of a well-placed bet that turns out badly) isn’t as bad as presented above. It all depends on three things: how correlated the noise is (very correlated in the pictures above, but not that correlated from one day to the next in finance), how fat tailed the noise is (not very in the example above, but quite fat tailed in finance), and how much volatility clustering you have (plenty here and in strategy P&Ls). The z-scores in the images above tend to be wrong by around 4 either way, whereas in finance, it’s probably somewhere more like 2. The right-most (Laplace distribution) thermometer in the top graph is probably about right though.
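The first of those three effects can be illustrated with a short simulation. This Python sketch is an assumption-laden toy, not a model of the images or of any market: it generates noise with no edge using a simple AR(1) process (each day carries over a fraction rho of the previous day), computes the naive z-score many times, and measures how widely those z-scores scatter. The function names and parameter values are invented.

```python
import math
import random
import statistics


def naive_z(pnl):
    """z-score using the independent-days standard error."""
    return statistics.mean(pnl) / (statistics.stdev(pnl) / math.sqrt(len(pnl)))


def ar1_noise(n_days, rho, rng):
    """Zero-mean noise in which each day keeps a fraction rho of the day before."""
    values, prev = [], 0.0
    for _ in range(n_days):
        prev = rho * prev + rng.gauss(0.0, 1.0)
        values.append(prev)
    return values


def z_spread(rho, n_days=400, n_trials=300, seed=3):
    """Standard deviation of the naive z-score across many edge-free trials."""
    rng = random.Random(seed)
    zs = [naive_z(ar1_noise(n_days, rho, rng)) for _ in range(n_trials)]
    return statistics.stdev(zs)


spread_iid = z_spread(0.0)  # uncorrelated noise
spread_ar = z_spread(0.5)   # moderately correlated noise
print(f"uncorrelated: {spread_iid:.2f}, correlated: {spread_ar:.2f}")
```

With uncorrelated noise the z-scores scatter by about 1, as the textbook says they should. With correlated noise the same formula gives z-scores that scatter much more widely, so a given z-score is weaker evidence than it looks.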

The second one is that even if it’s not possible to be certain that a strategy makes money, a high z-score is still correlated to some extent with strategy performance: A z-score of 3 certainly doesn’t guarantee that the strategy will make money, but it’s more likely to make money than a strategy with a z-score of 0.

To reiterate a point in the previous blog post, it all comes down to the economics: if a strategy has a sound reason behind it, then that counts as additional evidence that the strategy will make money. You should still do backtests, and you should still pay attention to the results, but you should have both in order to have a well informed view of your chances.

Answers:

a: left, b: left, c: left, d: left, e: right, f: right, g: right