Validation Methods For Trading Strategy Development

The bean machine, a device invented by Francis Galton, can be called the first generator of normal random variables.

Several methods are used to validate trading strategies, and each has advantages and disadvantages. In this article we discuss four of them.

In trading strategy development, validation is required because it is easy to over-fit strategies to historical data and get fooled by randomness. Although the best validation method is actual performance, it is an expensive one, and most developers would like to assess a strategy's potential before employing it.

Trading strategy development suffers from p-hacking. After a failed validation test, the developer typically modifies the strategy or conceives a new one. This leads to the dangerous practice of reusing the data until a strategy that passes some validation tests is developed. There are some misconceptions about multiple comparisons, data-mining bias and p-hacking, mostly due to the way these have been presented by academics with good mathematical skills but little trading experience.

Most trading strategies developed via unsound methods fail at some point in real trading, some even immediately. However, high statistical significance, a subject favored by academics, is not equivalent to a good strategy. Strategies with high statistical significance may also fail, and they often do [Ref. 1], because of major changes in market conditions, also known as regime changes. Statistical significance is therefore not the panacea that some academic publications purport it to be, and developers should be aware that market regime changes also play an important role.

1. Out-of-sample tests

This is the most popular and also the most abused validation method. Briefly, out-of-sample tests require setting aside a portion of the data for testing the strategy after it is developed, with the aim of obtaining an unbiased estimate of future performance. However, out-of-sample tests have two drawbacks:

- They reduce the power of the tests because the sample is smaller.

- Their results are biased if the strategy is developed via multiple comparisons.

In other words, out-of-sample tests are useful only in the case of unique hypotheses. Using out-of-sample tests to accept strategies developed via data-mining shows a lack of understanding of the process: in this case the test can be used to reject strategies but not to accept any. In this sense the test is still useful, but trading strategy developers know that good out-of-sample performance for strategies developed via multiple comparisons is in most cases a random result.
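The effect of reusing one out-of-sample period can be illustrated with a small simulation. This is only a sketch under stated assumptions: every candidate "strategy" is pure Gaussian noise with zero expected return, the acceptance rule is a one-sided t-test at roughly the 5% level, and the function names and parameter values are hypothetical, not taken from the article or its references.

```python
import random
import statistics


def t_stat(returns):
    """t-statistic of the mean return against a zero-mean null."""
    n = len(returns)
    return statistics.mean(returns) / (statistics.stdev(returns) / n ** 0.5)


def attempts_until_oos_pass(n_out=250, threshold=1.645, max_attempts=1000, seed=0):
    """Keep proposing zero-edge strategies until one passes the same
    out-of-sample test; return the attempt count (None if none pass)."""
    rng = random.Random(seed)
    for attempt in range(1, max_attempts + 1):
        # Assumption: each candidate's out-of-sample returns are pure noise,
        # so any "pass" is a false discovery by construction.
        oos_returns = [rng.gauss(0.0, 0.01) for _ in range(n_out)]
        if t_stat(oos_returns) > threshold:
            return attempt
    return None


if __name__ == "__main__":
    for seed in range(5):
        print("seed", seed, "attempts needed:", attempts_until_oos_pass(seed=seed))
```

Because each attempt has about a 5% chance of passing by luck, a developer who keeps modifying the strategy will sooner or later obtain a "validated" result even though no candidate has any edge.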

A few methods have been proposed for correcting out-of-sample significance for the presence of multiple comparisons bias, but in almost all real cases the result is a non-significant strategy. Moreover, as we show in Ref. 1 with two examples that correspond to two major market regimes, strategies that remain highly significant even after corrections for bias are applied can also fail due to changing markets. Therefore, out-of-sample tests are unbiased estimates of future performance only if future returns are distributed identically to past returns. In other words, non-stationarity may invalidate any results of out-of-sample testing.

Conclusion: Out-of-sample tests apply only to unique hypotheses and assume stationarity. When these conditions hold they are useful, but when they are not met the tests can be quite misleading.

2. Robustness tests and stochastic modeling

Due to the high Type-II error (missed discoveries, i.e., rejection of strategies that have an edge) of other validation methods, practitioners have long resorted to robustness tests. Robustness tests fall under the more general subject of stochastic modeling.

Robustness tests usually involve small variations in parameters and/or in the entry and exit logic of a trading strategy. The objective is to determine whether the resulting distribution of some performance metric, usually the mean return or maximum drawdown, has low variance. If variance is high then it is assumed that the strategy is not robust. Different metrics can be used along with different methods for evaluating robustness.
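A parameter-variation test of this kind might be sketched as follows. Everything here is an illustrative assumption: the synthetic price series, the long-only moving-average rule, the window range of 40 to 60, and the choice of mean daily return as the metric are all hypothetical and are not the author's procedure.

```python
import random
import statistics


def make_prices(n=1500, seed=1):
    """Synthetic geometric random walk (assumption: stands in for real data)."""
    rng = random.Random(seed)
    prices = [100.0]
    for _ in range(n - 1):
        prices.append(prices[-1] * (1.0 + rng.gauss(0.0003, 0.01)))
    return prices


def sma_strategy_returns(prices, window):
    """Daily returns of a hypothetical rule: hold the next day only if
    today's close is above its simple moving average."""
    rets = []
    for t in range(window, len(prices) - 1):
        sma = sum(prices[t - window + 1 : t + 1]) / window
        position = 1.0 if prices[t] > sma else 0.0
        rets.append(position * (prices[t + 1] / prices[t] - 1.0))
    return rets


def robustness_spread(prices, windows):
    """Mean and standard deviation of the metric across parameter variations."""
    means = [statistics.mean(sma_strategy_returns(prices, w)) for w in windows]
    return statistics.mean(means), statistics.stdev(means)


if __name__ == "__main__":
    avg, spread = robustness_spread(make_prices(), range(40, 61))
    print("mean of metric:", avg, "spread across variations:", spread)
```

A small spread relative to the mean would conventionally be read as robustness; how small is small enough is a judgment the developer has to fix before looking at the result.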

Briefly, robustness tests in reality measure how closely a strategy is fitted to the data, and in most cases they indicate the opposite of what is desired, i.e., high robustness may be an indication of an excessive fit to historical data. For this reason it is appropriate to apply robustness tests to out-of-sample data, but as a result the power of the test is low.

More importantly, robustness tests, and stochastic modeling in general, are subject to data-snooping bias if used repeatedly with the same data. The probability that a random strategy passes all tests, even in the out-of-sample, increases as the number of hypotheses increases. However, for unique hypotheses, these practical tests can reveal certain useful properties, although these vary on a case-by-case basis.
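How fast this probability grows can be sketched with the standard formula 1 - (1 - alpha)^N, which assumes the N hypotheses are tested independently and each has the same false-positive rate alpha. Both assumptions are simplifications (strategies mined from the same data are usually correlated), and the function name is hypothetical.

```python
def prob_false_pass(alpha, n_hypotheses):
    """Probability that at least one of n independent zero-edge strategies
    passes a test whose per-test false-positive rate is alpha."""
    return 1.0 - (1.0 - alpha) ** n_hypotheses


if __name__ == "__main__":
    for n in (1, 10, 50, 100):
        print(n, round(prob_false_pass(0.05, n), 3))
```

With alpha = 0.05, the probability of at least one false pass is about 40% after 10 hypotheses and above 99% after 100.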

Conclusion: Robustness tests, and stochastic modeling in general, can assess over-fitting conditions, but Type-I error (false discoveries) is high, especially in the case of multiple comparisons, even when the tests are applied to out-of-sample data.

3. Portfolio backtests

The main idea of this method is to avoid out-of-sample tests and use the whole price history in developing trading strategies. By using a sufficiently high number of price series in a portfolio, the power of any test of significance increases.

A simple statistical hypothesis test may be used in this case based on the T-stat as follows [Ref. 1]:

T-stat = Sharpe × √(number of years)

The null hypothesis is that the strategy returns are drawn from a distribution with zero mean. The null hypothesis is rejected when there is a low probability of obtaining performance at least as good as the strategy's, given that the null hypothesis is true.
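As a rough illustration, the T-stat above can be turned into a one-sided p-value. The sketch assumes that "Sharpe" means the annualized Sharpe ratio and uses a standard normal approximation to the t-distribution, which is reasonable only when the number of return observations is large. The function names are hypothetical.

```python
import math


def t_stat_from_sharpe(annual_sharpe, years):
    """T-stat = Sharpe x sqrt(number of years), Sharpe assumed annualized."""
    return annual_sharpe * math.sqrt(years)


def one_sided_p_value(t):
    """P(Z >= t) under a standard normal approximation (an assumption)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


if __name__ == "__main__":
    t = t_stat_from_sharpe(1.0, 9)   # Sharpe of 1.0 over 9 years gives t = 3.0
    print("t-stat:", t, "p-value:", one_sided_p_value(t))
```

The same Sharpe ratio over a shorter history gives a smaller T-stat, which is why holding out part of the data for out-of-sample testing lowers the power of the test.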

Portfolio backtests, as well as tests on comparable securities [Ref. 2], suffer from at least two problems. The first is that these tests are too strict, so Type-II error (missed discoveries) is high. The other, and maybe more serious, problem is that strategies usually fail when market conditions change despite high significance [Ref. 1].

Statistical hypothesis testing has limited application in trading strategy development, despite offering grounds for publishing academic papers. In general, the logic of the strategy and the process used to derive it are more important than any statistical tests. Statistics cannot find gold where there is none to be found.

Note that it is easy to cheat portfolio backtests by selecting the securities to use post-hoc. This is often done by some trading strategy developers, in most cases due to ignorance. The securities to use in the portfolio must be selected ex-ante. Then, if the test fails, the strategy must be rejected. Any effort to improve the strategy and repeat the tests will introduce data-snooping bias and lead to p-hacking.

Conclusion: Portfolio tests and tests on comparable securities are useful under certain conditions, provided that they are not abused for p-hacking.

4. Monte Carlo simulation

Monte Carlo simulation is part of stochastic modeling but we list it here separately because of its popularity. These tests are the least robust and effective in trading strategy development and should be avoided except when the strategies fulfill certain requirements. For more details see Ref. 3.

Monte Carlo simulations are especially inapplicable in the case of data-mined strategies and multiple comparisons. Actually, if these simulations are used with data-mined strategies, it is a strong indication that the developer lacks experience. In a nutshell, over-fitted strategies usually generate good Monte Carlo results. When this method is used in a loop of multiple comparisons, it loses its significance completely.
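The point that over-fitted strategies tend to produce good Monte Carlo results can be sketched with one common variant of the method, bootstrap resampling of trade returns. The article does not specify which variant it has in mind, so this choice is an assumption, as are the function name and all parameter values. The candidates are pure noise with zero expected return per trade.

```python
import random


def overfit_then_bootstrap(n_candidates=200, n_trades=100, n_paths=1000, seed=0):
    """Select the best of many zero-edge strategies, then bootstrap its
    trades; return the fraction of resampled paths that end profitable."""
    rng = random.Random(seed)
    # Step 1: data-mine. Every candidate is noise, so the winner is over-fitted.
    candidates = [[rng.gauss(0.0, 1.0) for _ in range(n_trades)]
                  for _ in range(n_candidates)]
    best = max(candidates, key=sum)
    # Step 2: Monte Carlo by resampling the winner's trades with replacement.
    profitable = 0
    for _ in range(n_paths):
        path = rng.choices(best, k=n_trades)
        if sum(path) > 0.0:
            profitable += 1
    return profitable / n_paths


if __name__ == "__main__":
    print("fraction of profitable paths:", overfit_then_bootstrap())
```

The resampled paths inherit the inflated average trade of the selected strategy, so most of them look profitable even though the true edge is zero by construction.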

Conclusion: Use Monte Carlo tests only when it is appropriate to do so. For more details see Ref. 3.

Summary

In this article we briefly discussed four popular trading strategy validation methods. The choice of validation method depends on the nature of the strategy, and the application and interpretation of results is more of an art than a science. Most validation tests done by practitioners, and also by academics, either suffer from multiple comparisons bias or fail under changing market conditions. The nature of markets is such that there is no robust test to assess strategy robustness. This is what makes trading strategy development a very difficult but also an interesting and challenging task.

References

1. Harris, M (2016), Limitations of Quantitative Claims About Trading Strategy Evaluation, SSRN, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2810170

2. Harris, M (2015), Fooled by Technical Analysis: The perils of charting, backtesting and data-mining, Price Action Lab. Available at http://www.priceactionlab.com/Blog/the-book/

3. Harris, M (2017), Fooled By Monte Carlo Simulation, Medium.com, http://bit.ly/2y9gYBq, Last accessed: September 18, 2017.

This article was originally published in Price Action Lab Blog

If you have any questions or comments, happy to connect on Twitter:@mikeharrisNY

About the author: Michael Harris is a trader and best-selling author. Seventeen years ago he developed the first commercial software for identifying parameter-less patterns in price action. In the last seven years he has worked on the development of DLPAL, a software program that can be used to identify short-term anomalies in market data for use with fixed and machine learning models.
