Updates 2019: In this first Machine Learning for Trading post, we’ve added a section on feature selection using the Boruta package, equity curves of a simple trading system, and some Lite-C code that generates the training data. Don’t forget to download the code and data used throughout the Machine Learning for Trading series.



Way back in the day when I first got into the markets, one of the first books I read was David Aronson’s Evidence Based Technical Analysis. The nerdy engineer in me was hooked by the ‘Evidence Based’ part. This was soon after I had digested a trading book claiming a basis in chaos theory (a link which turned out to be BS). Apparently, using complex-sounding terms in trading book titles lends a boost of credibility…. and book sales. I’m a victim of marketing.

Evidence Based Technical Analysis promotes a scientific approach to trading, including a detailed method for the assessment of data-mining bias in your backtest results. There’s also a discussion around the reasons why many traders embrace subjective beliefs over objective methods. Having seen this first hand many, many times, it’s a fascinating read!

Regular readers know I’m super interested in using machine learning for trading applications. Imagine my delight when I discovered that David Aronson had co-authored a new book with Timothy Masters titled Statistically Sound Machine Learning for Algorithmic Trading of Financial Instruments – which I’ll now refer to as SSML. While it is intended as a companion to Aronson’s (free) software platform for trading strategy development, it has a bunch of practical tips for anyone using machine learning for trading in the financial markets. I’ve used most of his ideas in R.

So, Kris, how does this backstory of your reading habits benefit me?

Well, SSML was a survival guide of sorts during my early forays into machine learning for trading. I want to walk you through some of those early experiments, focusing on the more significant and practical learnings I picked up along the way. Maybe all this can be a source of inspiration for your research, or a cornerstone in your ML trading journey.

This first post will focus on feature engineering and also introduce the data mining approach. Machine Learning for Trading Part 2 will focus on algorithm selection and ensemble methods for combining the predictions of numerous learners.

Let’s get started!

The data mining approach

Data mining is just one approach to extracting profits from the markets and is different from a model-based approach.

Rather than constructing a mathematical representation of price, returns or volatility from first principles, data mining involves searching for patterns first and then fitting a model to those patterns after the fact. Both model-based and data mining approaches have pros and cons, and I contend that using both approaches can lead to a valuable source of portfolio diversity.

The Financial Hacker summed up the advantages and disadvantages of the data mining approach nicely:

The advantage of data mining is that you do not need to care about market hypotheses. The disadvantage: those methods usually find a vast amount of random patterns and thus generate a vast amount of worthless strategies. Since mere data mining is a blind approach, distinguishing real patterns – caused by real market inefficiencies – from random patterns is a challenging task. Even sophisticated reality checks can normally not eliminate all data mining bias. Not many successful trading systems generated by data mining methods are known today.

David Aronson himself cautions against putting blind faith in data mining methods:

Though data mining is a promising approach for finding predictive patterns in data produced by largely random complex processes such as financial markets, its findings are upwardly biased. This is the data mining bias. Thus, the profitability of methods discovered by data mining must be evaluated with specialized statistical tests designed to cope with the data mining bias.

I would add that the implicit assumption behind the data mining approach is that the patterns identified will continue to repeat in the future. Of course, the validity of this assumption is unlikely to ever be certain.

Data mining is a term that can mean different things to different people depending on the context. When I refer to a data mining approach to trading systems development, I am referring to the use of statistical learning algorithms to uncover relationships between feature variables and a target variable (in the regression context, these would be referred to as the independent and dependent variables, respectively). The feature variables are observations that are assumed to have some relationship to the target variable and could include, for example, historical returns, historical volatility, various transformations or derivatives of a price series, economic indicators, and sentiment barometers. The target variable is the object to be predicted from the feature variables and could be the future return (next day return, next month return etc), the sign of the next day’s return, or the actual price level (although the latter is not really recommended, for reasons that will be explained below).

Although I differentiate between the data mining approach and the model-based approach, the data mining approach can also be considered an exercise in predictive modelling. Interestingly, the model-based approaches that I have written about previously (for example ARIMA, GARCH, Random Walk etc) assume linear relationships between variables. Modelling non-linear relationships using these approaches is (apparently) complex and time consuming. On the other hand, some statistical learning algorithms can be considered ‘universal approximators’ in that they have the ability to model any linear or non-linear relationship. It was not my intention to get into a philosophical discussion about the differences between a model-based approach and a data mining approach, but clearly there is some overlap between the two.

In the near future, perhaps in a future Machine Learning for Trading post, I’ll write about my efforts to create a hybrid approach that attempts a synergistic combination of classical linear time series modelling and non-linear statistical learning – trust me, it is actually much more interesting than it sounds. Watch this space.

Variables and feature engineering

The prediction target

The first and most obvious decision to be made is the choice of target variable. In other words, what are we trying to predict? For one-day ahead forecasting systems, profit is the usual target. I used the next day’s return normalized to the recent average true range, the implication being that in live trading, position sizes would be inversely proportionate to the recent volatility. In addition, by normalizing the target variable in this way, we may be able to train the model on multiple markets, as the target will be on the same scale for each.
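To make the construction concrete, here’s a minimal sketch of an ATR-normalised target. The series’ own experiments used R and Lite-C; this is an illustrative Python version, with an assumed 20-period ATR window and a simple-moving-average ATR — both my choices, not necessarily the exact settings used in the research.

```python
# Sketch: next-day return normalised by the recent average true range.
# The 20-period window and SMA-style ATR are illustrative assumptions.
def true_range(high, low, prev_close):
    return max(high - low, abs(high - prev_close), abs(low - prev_close))

def atr(highs, lows, closes, n):
    """Simple moving average of the true range over the last n bars."""
    trs = [true_range(highs[i], lows[i], closes[i - 1]) for i in range(1, len(closes))]
    return sum(trs[-n:]) / n

def atr_normalised_target(highs, lows, closes, n=20):
    """Next-day return divided by the recent ATR, one value per bar t.
    Uses close[t+1], so the final bar has no target."""
    targets = []
    for t in range(n, len(closes) - 1):
        a = atr(highs[:t + 1], lows[:t + 1], closes[:t + 1], n)
        targets.append((closes[t + 1] - closes[t]) / a)
    return targets
```

Because the target is volatility-scaled, the same model can in principle be trained across several markets, as noted above.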

Choosing predictive variables

In SSML, Aronson states that the golden rule of feature selection is that the predictive power should come primarily from the features and not from the model itself. My research corroborated this statement, with many (but not all) algorithm types returning correlated predictions for the same feature set. I found that the choice of features had a far greater impact on performance than the choice of model. The implication is that spending considerable effort on feature selection and feature engineering is well and truly justified. I believe it is critical to achieving decent model performance.

Many variables will have little or no relationship with the target variable and including these will lead to overfitting or other forms of poor performance. Aronson recommends using chi-square tests and Cramer’s V to quantify the relationship between variables and the target. I actually didn’t use this approach, so I can’t comment on it. I used a number of other approaches, including ranking a list of candidate features according to their Maximal Information Coefficient (MIC) and selecting the highest ranked features, Recursive Feature Elimination (RFE) via the caret package in R, an exhaustive search of all linear models, and Principal Components Analysis (PCA). We’ll cover each of these below.

Some candidate features

Following is the list of features I investigated as part of this research. Most were derived from SSML. The list only consists of derivatives and transformations of the price series. I haven’t yet tested exogenous variables, such as economic indicators, the price histories of other instruments and the like, but I think these are deserving of attention too. The list is by no means exhaustive, but it provides a decent starting point:

1-day log return

Trend deviation: the logarithm of the closing price divided by the lowpass filtered price

Momentum: the price today relative to the price x days ago, normalized by the standard deviation of daily price changes.

ATR: the average true range of the price series

Velocity: a one-step-ahead linear regression forecast on closing prices

Linear forecast deviation: the difference between the most recent closing price and the closing price predicted by a linear regression line

Price variance ratio: the ratio of the variance of the log of closing prices over a short time period to that over a long time period.

Delta price variance ratio: the difference between the current value of the price variance ratio and its value x periods ago.

The Market Meanness Index: A measure of the likelihood of the market being in a state of mean reversion, created by the Financial Hacker.

MMI deviation: The difference between the current value of the Market Meanness Index and its value x periods ago.

The Hurst exponent

ATR ratio: the ratio of an ATR of a short (recent) price history to an ATR of a longer period.

Delta ATR ratio: the difference between the current value of the ATR ratio and the value x bars ago.

Bollinger width: the log ratio of the standard deviation of closing prices to the mean of closing prices, that is a moving standard deviation of closing prices relative to the moving average of closing prices.

Delta Bollinger width: the difference between the current value of the Bollinger width and its value x bars ago.

Absolute price change oscillator: the difference between a short and long lookback mean log price divided by a 100-period ATR of the log price.
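To illustrate, here are rough Python sketches of three features from the list above — momentum, the price variance ratio, and the Bollinger width. The lookback values are illustrative placeholders, not the exact settings used in the research.

```python
import math
import statistics as stats

def momentum(closes, x=10):
    """Price today relative to x days ago, normalised by the standard
    deviation of daily price changes over the same window."""
    changes = [closes[i] - closes[i - 1] for i in range(len(closes) - x + 1, len(closes))]
    return (closes[-1] - closes[-1 - x]) / stats.stdev(changes)

def price_variance_ratio(closes, short=10, long=50):
    """Variance of log closes over a short window relative to a long window."""
    logs = [math.log(c) for c in closes]
    return stats.variance(logs[-short:]) / stats.variance(logs[-long:])

def bollinger_width(closes, n=20):
    """Log ratio of the moving standard deviation of closes to their moving average."""
    window = closes[-n:]
    return math.log(stats.stdev(window) / stats.mean(window))
```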

Thus far I have only considered the most recent value of each variable. I suspect that the recent history of each variable would provide another useful dimension of data to mine. I left this out of the feature selection stage since it makes more sense to firstly identify features whose current values contain predictive information about the target variable before considering their recent histories. Incorporating this from the beginning of the feature selection stage would increase the complexity of the process by several orders of magnitude and would be unlikely to provide much additional value. I base that statement on a number of my own assumptions, not to mention the practicalities of the data mining approach, rather than any hard evidence.

Transforming the candidate features

In my experiments, the variables listed above were used with various cutoff periods (that is, the number of periods used in their calculation). Typically, I used values between 3 and 20 since Aronson states in SSML that lookback periods greater than about 20 will generally not contain information useful to the one period ahead forecast. Some variables (like the Market Meanness Index) benefit from a longer lookback. For these, I experimented with 50, 100, and 150 bars.

Additionally, it is important to enforce a degree of stationarity on the variables. David Aronson again:

Using stationary variables can have an enormous positive impact on a machine learning model. There are numerous adjustments that can be made in order to enforce stationarity such as centering, scaling, and normalization. So long as the historical lookback period of the adjustment is long relative to the frequency of trade signals, important information is almost never lost and the improvements to model performance are vast.

Aronson suggests the following approaches to enforcing stationarity:

Scaling: divide the indicator by the interquartile range (note, not by the standard deviation, since the interquartile range is not as sensitive to extremely large or small values).

Centering: subtract the historical median from the current value.

Normalization: both of the above. Roughly equivalent to traditional z-score standardization, but uses the median and interquartile range rather than the mean and standard deviation in order to reduce the impact of outliers.

Regular normalization: scales the data over the lookback period to the range 0 to 1 via (x − min)/(max − min), then re-centers the result to the range -1 to +1.

In my experiments, I generally adopted regular normalization using the most recent 50 values of the features.
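A rough sketch of these adjustments over a rolling lookback window (Python for illustration; the helper names and the crude quartile calculation are my own, and the 50-bar window follows the text):

```python
import statistics as stats

def robust_normalise(history):
    """'Normalization' above: centre the latest value by the median of the
    lookback window and scale by the interquartile range (crude quartiles)."""
    s = sorted(history)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]   # rough quartiles, fine for a sketch
    return (history[-1] - stats.median(history)) / (q3 - q1)

def minmax_normalise(history):
    """'Regular normalization': rescale the latest value to [-1, +1]
    using the min and max of the lookback window."""
    lo, hi = min(history), max(history)
    return 2 * (history[-1] - lo) / (hi - lo) - 1
```

In live use, `history` would be the most recent 50 values of the feature, recomputed each bar.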

Data Pre-Processing

If you’re following along with the code and data provided (see note in bold above), I used the data for the GBP/USD exchange rate (sampled daily at midnight UTC, for the period 2009-2016), but I also provided data for EUR/USD (same sampling regime) for further experimentation.


Removing highly correlated variables

It makes sense to remove variables that are highly correlated with other variables since they are unlikely to provide additional information that isn’t already contained elsewhere in the feature space. Keeping these variables will also add unnecessary computation time, increase the risk of overfitting and bias the final model towards the correlated variables.
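The idea can be sketched as a greedy filter: repeatedly drop the variable most correlated with the rest until no pair exceeds the cutoff. This Python version is a simplification of the heuristic that caret::findCorrelation() actually uses, and it assumes every feature has non-zero variance.

```python
import math

def pearson(x, y):
    """Pearson correlation; assumes both series have non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_filter(features, cutoff=0.3):
    """features: dict name -> list of values. Greedily drop the offending
    variable with the highest mean absolute correlation until no pair of
    remaining variables exceeds the cutoff."""
    names = list(features)
    while True:
        pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
                 if abs(pearson(features[a], features[b])) > cutoff]
        if not pairs:
            return names
        def mean_abs(n0):
            cs = [abs(pearson(features[n0], features[m])) for m in names if m != n0]
            return sum(cs) / len(cs)
        offenders = sorted({n0 for p in pairs for n0 in p})  # sorted for determinism
        names.remove(max(offenders, key=mean_abs))
```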

Using caret’s function for examining pairwise correlations between variables – caret::findCorrelation() – with a cutoff of 0.3, these are the remaining variables and their pairwise correlations:

Feature selection via Maximal Information Coefficient

The maximal information coefficient (MIC) is a non-parametric measure of two-variable dependence designed specifically for rapid exploration of many-dimensional data sets. While MIC is limited to univariate relationships (that is, it does not consider variable interactions), it does pick up non-linear relationships between dependent and independent variables. Read more about MIC here. I used the minerva package in R to rank my variables according to their MIC with the target variable (next day’s return normalized to the 100-period ATR). Here’s the code output:

### MIC RESULTS
# MMIFaster          0.09817869
# deltaPVR5          0.10107728
# bWidthSlow         0.10196236
# deltaATRrat10      0.10228334
# apc5               0.10346916
# deltaATRrat3       0.10473520
# mom10              0.10593616
# trend              0.10610100
# HurstMod           0.10703185
# HurstFast          0.10810217
# atrRatSlow         0.10818756
# deltaMMIFastest10  0.10863479
# bWdith3            0.11014629
# HurstFaster        0.11493763
# ATRSlow            0.12458435

These results show that none of the features have a particularly high MIC with respect to the target variable, which is what I would expect from noisy data such as daily exchange rates sampled at an arbitrary time. However, certain variables have a higher MIC than others. In particular, the long-term ATR, the 20-period Hurst exponent and the 3-period Bollinger width outperform the rest of the variables.

Recursive feature elimination

I also used recursive feature elimination (RFE) via the caret package in R to isolate the most predictive features from my list of candidates. RFE is an iterative process that involves constructing a model from the entire set of features, retaining the best performing features, and then repeating the process until all the features are eliminated. The model with the best performance is identified and the feature set from that model declared the most useful.
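The elimination loop can be sketched generically. Here `score()` stands in for whatever cross-validated performance estimate you supply; ties are broken arbitrarily, and real implementations such as caret’s refit model-based variable importances at each step rather than re-scoring every candidate drop as this toy version does.

```python
# Sketch of backward recursive feature elimination: repeatedly drop the
# feature whose removal hurts the score least, remembering the best subset.
def rfe(features, score):
    subset = list(features)
    best_subset, best_score = subset[:], score(subset)
    while len(subset) > 1:
        drop = max(subset, key=lambda f: score([g for g in subset if g != f]))
        subset.remove(drop)
        s = score(subset)
        if s > best_score:
            best_subset, best_score = subset[:], s
    return best_subset, best_score
```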

I performed cross-validated RFE using a random forest model. Here are the results:

#### Results
# The top 5 variables (out of 14):
#    ATRSlow, trend, HurstMod, deltaATRrat10, bWidthSlow

In this case, the RFE process has emphasized variables that describe volatility and trend, but has decided that the best performance is obtained by incorporating 14 of 15 variables into the model. Here’s a plot of the cross validated performance of the best feature set for various numbers of features (noting that k-fold cross validation may not be the ideal cross-validation method for financial time series):



I am tempted to take the results of the RFE with a grain of salt. My reasons are:

1. The RFE algorithm does not fully account for interactions between variables. For example, assume that two variables individually have no effect on model performance, but due to some relationship between them they improve performance when both are included in the feature set. RFE is likely to miss this predictive relationship.

2. The performance of RFE is directly related to the ability of the specific algorithm (in this case random forest) to uncover relationships between the variables and the target. At this stage of the process, we have absolutely no evidence that the random forest model is applicable in this sense to our particular data set.

3. Finally, the implementation of RFE that I used was the ‘out of the box’ caret version. This implementation uses root mean squared error (RMSE) as the objective function, however I don’t believe that RMSE is the best objective function for this data, due to the significant influence of extreme values on model performance. It is possible to have a low RMSE but poor overall performance if the model is accurate across the middle regions of the target space (corresponding to small wins and losses), but inaccurate in the tails (corresponding to big wins and losses).

In order to address (3) above, I implemented a custom summary function so that the RFE was performed such that the cross-validated absolute return was maximized. I also applied the additional criterion that only predictions with an absolute value greater than 5 would be considered, under the assumption that in live trading we wouldn’t enter positions unless the prediction exceeded this value. The results are as follows:

# The top 5 variables (out of 15):
#    ATRSlow, trend, HurstMod, bWidthSlow, atrRatSlow

The results are a little different to those obtained using RMSE as the objective function. The focus is still on the volatility and trend indicators, but in this case the best cross validated performance occurred when selecting only 2 out of the 15 candidate variables. Here’s a plot of the cross validated performance of the best feature set for various numbers of features:



The model clearly performs better in terms of absolute return for a smaller number of predictors. This is consistent with Aronson’s assertion that with this approach we should stick with at most 3-4 variables otherwise overfitting is almost unavoidable.

The performance profile of the model tuned on absolute return is very different to that of the model tuned on RMSE, which displays a consistent improvement as the number of predictors is increased. Using RMSE as the objective function (which seems to be the default in many applications I’ve come across) would result in a very sub-optimal final model in this case. This highlights the importance of ensuring that the objective function is a good proxy for the performance being sought in practice.
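The custom objective amounts to something like the following sketch: score a set of predictions by the absolute return they capture, trading only when the forecast magnitude exceeds the threshold. The threshold of 5 follows the text; the sign convention (long on positive forecasts, short on negative) is my assumption.

```python
# Sketch of an absolute-return objective for cross-validation, as a
# replacement for RMSE. Positions are taken only on confident forecasts.
def absolute_return_score(predictions, actual_returns, threshold=5.0):
    total = 0.0
    for p, r in zip(predictions, actual_returns):
        if abs(p) > threshold:
            total += r if p > 0 else -r   # long if positive forecast, short if negative
    return total
```

A model selection routine would then maximise this score across test windows instead of minimising RMSE.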

In the RFE example above, I used 5-fold cross validation, but I haven’t held out a test set of data or estimated performance with an inner cross validation loop. Note also that k-fold cross validation may not be ideal for financial time series thanks to the autocorrelations present.

Models with in-built feature selection

A number of machine learning algorithms have feature selection in-built. Max Kuhn’s website for the caret package contains a list of such models that are accessible through caret. I’ll apply several and compare the features selected to those selected with other methods. For this experiment, I used a diverse range of algorithms spanning various ensemble methods and both linear and non-linear models:

Bagged multivariate adaptive regression splines (MARS)

Boosted generalized additive model (bGAM)

Lasso

Spike and slab regression (SSR)

Regression tree

Stochastic gradient boosting (SGB)

For each model, I did only very basic (if any) hyperparameter tuning within caret, using time series cross validation with a train window length of 200 days and a test window length of 20 days. Maximization of absolute return was used as the objective function. Following cross-validation, caret actually trains a model on the full data set with the best cross-validated hyperparameters – but this is not what we want if we are to mimic actual trading behaviour. We are more interested in the aggregated performance across each test window, which caret very neatly allows us to access – details on this below when we investigate a trading system.
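The rolling train/test arrangement can be sketched as an index generator. The 200-day and 20-day window lengths follow the text; stepping forward by the test-window length, so that test sets don’t overlap, is my assumption about how the windows advance.

```python
# Sketch of rolling time series cross-validation: a 200-bar train window
# followed by a 20-bar test window, stepped forward through the data.
def time_series_splits(n, train_len=200, test_len=20):
    """Yield (train_indices, test_indices) pairs over n observations."""
    start = 0
    while start + train_len + test_len <= n:
        train = list(range(start, start + train_len))
        test = list(range(start + train_len, start + train_len + test_len))
        yield train, test
        start += test_len   # advance by the test window so test sets don't overlap
```

Aggregating predictions across the test windows then mimics the walk-forward behaviour described above.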

The table below shows the top 5 variables for each algorithm:

[Table: the top 5 variables selected by each algorithm]

We can see that 10-day momentum is included in the top 5 predictors for every algorithm I investigated except one, and was the top feature every time it was selected. The change in the ratio of the ATR lookbacks featured 7 times in total. The absolute price change oscillator was selected in one form or another several times (the 10- and 20-day variants once each, and the 100-day variant three times). The 5-day change in the price variance ratio was a notable mention, being included in the top variables 3 times. The table below summarizes the frequency with which each variable was selected:



9 of the 15 variables that passed the correlation filter were selected in the top 5 by at least one algorithm.

Model selection using glmulti

The glmulti package fits all possible unique generalized linear models from the variables and returns the ‘best’ models as determined by an information criterion (Akaike’s, in this case). The package is essentially a wrapper for the glm (generalized linear model) function that allows selection of the ‘best’ model or models, providing insight into the most predictive variables. By default, glmulti builds models from the main effects only, but there is an option to also include pairwise interactions between variables. This increases the computation time considerably, and I found that the resulting ‘best’ models were orders of magnitude more complex than those obtained using main effects only, while the results were on par.
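The core of what glmulti does — enumerate candidate models and rank them by an information criterion — can be sketched as follows. This is a Python illustration, not glmulti itself: the ordinary-least-squares helper (normal equations with Gaussian elimination) and the AIC parameter count are my own simplifications, and glmulti uses glm with AICc rather than plain AIC.

```python
import itertools
import math

def ols_rss(X, y):
    """Residual sum of squares for y ~ 1 + X, via the normal equations."""
    n = len(y)
    Z = [[1.0] + list(row) for row in X]          # prepend an intercept column
    p = len(Z[0])
    A = [[sum(Z[i][a] * Z[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    v = [sum(Z[i][a] * y[i] for i in range(n)) for a in range(p)]
    for c in range(p):                            # Gaussian elimination, partial pivoting
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        v[c], v[piv] = v[piv], v[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for k in range(c, p):
                A[r][k] -= f * A[c][k]
            v[r] -= f * v[c]
    beta = [0.0] * p
    for c in reversed(range(p)):                  # back-substitution
        beta[c] = (v[c] - sum(A[c][k] * beta[k] for k in range(c + 1, p))) / A[c][c]
    return sum((y[i] - sum(Z[i][k] * beta[k] for k in range(p))) ** 2 for i in range(n))

def best_subsets(features, y, max_size=2):
    """Rank every subset of features (up to max_size) by AIC of y ~ subset."""
    n = len(y)
    names = list(features)
    results = []
    for k in range(1, max_size + 1):
        for combo in itertools.combinations(names, k):
            X = list(zip(*(features[f] for f in combo)))
            rss = ols_rss(X, y)
            aic = n * math.log(rss / n) + 2 * (k + 2)  # +2: intercept and error variance
            results.append((aic, combo))
    return sorted(results)
```

With 15 candidate variables this brute force is cheap for main effects, which is why adding pairwise interactions blows the search up so quickly.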



### glmulti.analysis
# Method: h / Fitting: glm / IC used: aicc
# Level: 1 / Marginality: FALSE
# From 100 models:
# Best IC: 22669.8709723958
# Best model:
# [1] "target ~ 1 + trend + atrRatSlow"
# Evidence weight: 0.0323562049748589
# Worst IC: 22673.0396363133
# 23 models within 2 IC units.
# 92 models to reach 95% of evidence weight.

### model                                                            aicc     weights
# 1  target ~ 1 + trend + atrRatSlow                                 22669.87 0.03235620
# 2  target ~ 1 + apc5 + trend + atrRatSlow                          22670.72 0.02111157
# 3  target ~ 1 + MMIFaster + trend + atrRatSlow                     22670.73 0.02104022
# 4  target ~ 1 + bWdith3 + trend + atrRatSlow                       22670.87 0.01961736
# 5  target ~ 1 + trend + atrRatSlow + bWidthSlow                    22670.96 0.01878187
# 6  target ~ 1 + deltaMMIFastest10 + trend + atrRatSlow             22671.04 0.01799825
# 7  target ~ 1 + trend + atrRatSlow + ATRSlow                       22671.20 0.01667710
# 8  target ~ 1 + atrRatSlow                                         22671.23 0.01644049
# 9  target ~ 1 + apc5 + MMIFaster + trend + atrRatSlow              22671.40 0.01504698
# 10 target ~ 1 + deltaPVR5 + trend + atrRatSlow                     22671.44 0.01474030
# 11 target ~ 1 + MMIFaster + trend + atrRatSlow + bWidthSlow        22671.49 0.01439135
# 12 target ~ 1 + HurstFast + trend + atrRatSlow                     22671.55 0.01396476
# 13 target ~ 1 + HurstMod + trend + atrRatSlow                      22671.61 0.01357259
# 14 target ~ 1 + mom10 + trend + atrRatSlow                         22671.64 0.01337186
# 15 target ~ 1 + apc5 + bWdith3 + trend + atrRatSlow                22671.65 0.01329943
# 16 target ~ 1 + MMIFaster + deltaMMIFastest10 + trend + atrRatSlow 22671.66 0.01325762
# 17 target ~ 1 + deltaATRrat3 + trend + atrRatSlow                  22671.68 0.01309016
# 18 target ~ 1 + deltaATRrat10 + trend + atrRatSlow                 22671.69 0.01303342
# 19 target ~ 1 + HurstFaster + trend + atrRatSlow                   22671.70 0.01294480
# 20 target ~ 1 + bWdith3 + MMIFaster + trend + atrRatSlow           22671.72 0.01283320
# 21 target ~ 1 + apc5 + trend + atrRatSlow + bWidthSlow             22671.74 0.01268352
# 22 target ~ 1 + apc5 + deltaMMIFastest10 + trend + atrRatSlow      22671.83 0.01216110
# 23 target ~ 1 + bWdith3 + trend + atrRatSlow + bWidthSlow          22671.85 0.01205142


Generalized linear model with stepwise feature selection

Finally, I used a generalized linear model with stepwise feature selection:

### GLM Stepwise Feature Selection Results
# Coefficients:
# (Intercept)        trend   atrRatSlow
#      -1.907       -3.632       -5.100
#
# Degrees of Freedom: 2024 Total (i.e. Null);  2022 Residual
# Null Deviance:     8625000
# Residual Deviance: 8593000   AIC: 22670

The final model selected 2 of the 15 variables: the ratio of the 20- to 100-day ATR, and the difference between a short-term and long-term trend indicator.

Returning to the glmulti results: we retain the models whose AICc values are within two units of the ‘best’ model. Two units is a rule of thumb for identifying models that, for all intents and purposes, are likely to be on par in terms of their performance. Notice any patterns here? Many of the top models selected the ratio of the 20-day to 100-day ATRs, as well as the difference between a short-term and long-term trend indicator. Perhaps surprisingly sparse are the momentum variables. This is confirmed by the plot of the model averaged variable importance (averaged over the best 1,000 models). Note that these models only considered the main, linear effects of each variable on the target. Of course, there is no guarantee that any relationship is linear, if it exists at all. Further, there is the implicit assumption of stationary relationships amongst the variables, which is unlikely to hold. Still, this method provides some useful insight. One of the great things about glmulti is that it facilitates model-averaged predictions – more on this when I delve into ensembles in part 2 of this series.

Boruta: all relevant feature selection

Boruta finds relevant features by comparing the importance of the original features with the importance of random variables. The random variables are obtained by permuting the order of values of the original features. Boruta finds the minimum, mean and maximum importance of these permuted variables, and then compares these to the original features. Any original feature that is found to be more important than the maximum of the permuted variables is retained.

Boruta does not measure the absolute importance of individual features, rather it compares each feature to random permutations of the original variables and determines the relative importance. This theory very much resonates with me and I intuit that it will find application in weeding out uninformative features from noisy financial data. The idea of adding randomness to the sample and then comparing performance is analogous to the approach I use to benchmark my systems against a random trader with a similar trade distribution.
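In outline, the algorithm looks something like this sketch. Here `importance()` stands in for any importance measure you supply, and the keep rule — beat the best “shadow” feature in a majority of iterations — is my simplification of Boruta’s actual statistical test.

```python
import random

# Sketch of Boruta's core idea: compare each feature's importance with the
# best importance achieved by shuffled 'shadow' copies of the features.
def boruta_like(features, y, importance, n_iter=100, seed=42):
    rng = random.Random(seed)
    originals = {name: importance(x, y) for name, x in features.items()}
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        shadow_max = 0.0
        for x in features.values():
            sh = list(x)
            rng.shuffle(sh)                      # permute to destroy any real signal
            shadow_max = max(shadow_max, importance(sh, y))
        for name, imp in originals.items():
            if imp > shadow_max:
                hits[name] += 1
    # keep features that beat the best shadow in most iterations
    return [n for n, h in hits.items() if h > n_iter / 2]
```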

The box plots in the figure below show the results obtained when I ran the Boruta algorithm for the 15 filtered variables for 1,000 iterations. The blue box plots show the permuted variables of minimum, mean and maximum importance, the green box plots indicate the original features that ranked higher than the maximum importance of the random permuted variables, and the variables represented by the red box plots are discarded.



### Boruta all relevant feature selection
# Boruta performed 369 iterations in 5.385172 mins.
# 8 attributes confirmed important: apc5, atrRatSlow, ATRSlow, bWidthSlow,
#   deltaMMIFastest10 and 3 more;
# 7 attributes confirmed unimportant: bWdith3, deltaATRrat10, deltaATRrat3, deltaPVR5,
#   HurstFast and 2 more;

Discussion of feature selection methods

It is important to note that any feature selection process naturally invites a degree of selection bias. For example, from a large set of uninformative variables, a small number may randomly correlate with the target variable. The selection algorithm would then rank these variables highly. The error would only be (potentially) uncovered through cross validation of the selection algorithm or by using an unseen test or validation set. Feature selection is difficult and can often make predictive performance worse since it is easy to over-fit the feature selection criterion. It is all too easy to end up with a subset of attributes that works really well on one particular sample of data, but not necessarily on any other. There is a fantastic discussion of this at the Statistics Stack Exchange community that I have linked here because it is just so useful.

These results are largely consistent with the results obtained through other methods, perhaps with the exception of the inclusion of the MMI and Hurst variables. Surprisingly, the long-term ATR was the clear winner.

Side note: the developers state that “Boruta” means “Slavic spirit of the forest.” As something of a Slavophile myself, I did some googling and discovered that this description is quite a euphemism. Check out some of the items that pop up in a Google image search!

It is critical to take steps to minimize selection bias at every opportunity. The results of any feature selection process should be cross validated or tested on an unseen hold out set. If the hold out set selects a vastly different set of predictors, something has obviously gone wrong – or the features are worthless. The approach I took in this post was to cross validate the results of each test that I performed, with the exception of the Maximal Information Criterion and glmulti approaches. I’ve also selected features based on data for one market only. If the selected features are not robust, this will show up with poor performance when I attempt to build predictive models for other markets using these features.

I think it is useful to apply a wide range of feature selection methods and then look for patterns and consistencies across their results. Intuitively, this seems far more likely to yield useful information than drawing absolute conclusions from a single feature selection process. Applying this logic to the results above, the 10-day momentum, the ratio of the 10- to 20-day ATR, the trend deviation indicator, and the absolute price change oscillator are probably the most likely to yield useful information, since they show up consistently across most of the feature selection methods I investigated. Other variables that may be worth considering include the long-term ATR and the change in a responsive MMI.

In Machine Learning for Trading Part 2, I’ll describe how I built and combined various models based on these variables.

Principal Components Analysis

An alternative to feature selection is Principal Components Analysis (PCA), which attempts to reduce the dimensionality of the data while retaining the majority of the information. PCA is a linear technique: it transforms the data by linearly projecting it onto a lower dimension space while preserving as much of its variation as possible. Another way of saying this is that PCA attempts to transform the data so as to express it as a sum of uncorrelated components.
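As a rough illustration of the mechanics (not the exact implementation used in this analysis, which was done in R), a minimal PCA can be written as a centring step, an eigendecomposition of the covariance matrix, and a linear projection. The synthetic data below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def pca(X, k):
    """Minimal PCA sketch: centre the data, eigendecompose the covariance
    matrix, and linearly project onto the k directions of largest variance."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
    order = eigvals.argsort()[::-1]         # re-sort, largest variance first
    components = eigvecs[:, order[:k]]
    return Xc @ components, eigvals[order]

# 15 correlated inputs driven by only 3 underlying factors plus small noise
A = rng.normal(size=(15, 3))
X = rng.normal(size=(1000, 3)) @ A.T + 0.1 * rng.normal(size=(1000, 15))
scores, variances = pca(X, k=3)

# fraction of total variance captured by the first 3 components
print(variances[:3].sum() / variances.sum())
```

Because the eigenvectors are orthogonal, the resulting scores are uncorrelated components, and when the inputs share a low-dimensional structure, a handful of components captures nearly all of the variance.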

Again, note that PCA is limited to a linear transformation of the data, and there is no guarantee that a non-linear transformation wouldn’t be better suited. Another significant assumption when using PCA is that the principal components of future data will look like those of the training data. It’s also possible that the smallest component, the one describing the least variance, is the only one carrying information about the target variable; that information would likely be lost when only the major variance contributors are retained.

To investigate the effect of PCA on model performance, I cross validated two random forest models: the first built on the principal components of the 15 variables, the second on all 15 variables in their raw form. I chose the random forest model since it performs implicit feature selection and may therefore reveal how PCA stacks up against the other feature selection methods. For both models, I performed time series cross validation with a training window of 200 days and a testing window of 20 days.
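The rolling scheme can be sketched as follows. This is an illustrative Python version (the actual analysis was done in R), and the assumption that each window slides forward by the length of the test window is mine.

```python
def rolling_splits(n, train_len=200, test_len=20):
    """Rolling-origin time series splits: fit on train_len consecutive days,
    test on the next test_len days, then slide forward by the test window so
    every test day is strictly out-of-sample for its training window."""
    start = 0
    while start + train_len + test_len <= n:
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        yield train, test
        start += test_len

# e.g. 1,000 days of data yields 40 train/test resamples under this scheme
splits = list(rolling_splits(1000))
print(len(splits))  # → 40
```

Each resample produces one out-of-sample performance estimate, and it is the distribution of these estimates that is compared in the box plots below.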

In order to infer the difference in model performance, I collected the results from each resampling iteration of both final models and compared their distributions via a pair of box and whisker plots:



In this case, the model built on the raw data outperforms the model built on the data’s principal components: its mean profit is higher and its distribution is shifted in the positive direction. Sadly, however, both distributions are wide and look only slightly better than random.