Machine learning is a buzzword often thrown about when discussing the future of finance and the world. You may have heard of neural networks solving problems in facial recognition, language processing, and even financial markets, often without much explanation. It is easy to view this field as a black box, a magic machine that somehow produces solutions but that nobody understands. It is true that machine learning techniques (neural networks in particular) pick up on obscure and hard-to-explain features; however, there is more room for research, customization, and analysis than may first appear.

Today we'll be discussing at a high level the various factors to be considered when researching investing through the lens of machine learning. The contents of this notebook and further discussions on this topic are heavily inspired by Marcos Lopez de Prado's book Advances in Financial Machine Learning. If you would like to explore his research further, his website is available here.

Garbage in -> garbage out. This is the mantra of computer science, and of modeling doubly so. A model is only as good as the data it accepts, so it is vital that researchers understand the nature of their data. Data is the foundation of an algorithm, which will succeed or fail on its merits.

In general, unstructured and unique data are more useful than pre-packaged data from a vendor, as they haven't been picked clean of alpha by other asset managers. Such data are not offered on the Quantopian platform, but you can upload datasets for your own use with the Self-Serve feature. If you do not have unique data, the breadth of offerings on the Quantopian platform still gives us plenty to work with. Data can vary in what they describe (fundamentals, price) and in frequency (monthly, minutely, tick, etc.). Listed below are the chief types of data, in order of increasing diversity:

Fundamental Data - Company financials, typically published quarterly.

Market Data - All trading activity in a trading exchange or venue.

Analytics - Derivative data: analyses of various factors (including all other data types). Typically purchased from a vendor.

Alternative Data - Primary information, not made from other sources. Satellite imagery, oil tanker movements, weather, etc.

The data structures used to contain trading information are often referred to as bars. These can vary greatly in how they are constructed, though there are shared characteristics. Common variables are open, high, low, and close prices, the date/time of the trade, and an indexing variable. A common index is time: daily bars each represent one trading day, minute bars each represent one minute of trading, and so on. Time is held constant. Trading volume is another option, where each bar is indexed with a consistent number of shares traded (say, 200K volume bars). A third option is value traded, where the index is dollars (shares traded * price per share).

Time Bars:

Bars indexed by time intervals, minutely, daily, etc. OHLCV (Open, High, Low, Close, Volume) is standard.

Tick Bars:

Bars indexed by orders, with each set # of orders (usually just 1) creating a distinct bar. Order price, size, and the exchange the order was executed on are common. Unavailable on Q platform.

Volume Bars:

Bars indexed by total volume, with each set # of shares traded creating a distinct bar. We can transform minute bars into an approximation for volume bars, but ideally we would use tick bars to maintain information for all parameters across bars.

Dollar Bars:

Similar to volume bars, except measuring the total value (in $) traded hands. An example would be $100,000 bars, with each bar containing as precisely as possible that dollar value.
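As a concrete illustration, here is a minimal sketch of dollar-bar construction from a stream of price/volume observations. The function name and threshold are illustrative; a real implementation would work from tick data and split trades that straddle a bar boundary.

```python
import pandas as pd

def dollar_bars(prices, volumes, bar_value):
    """Aggregate (price, volume) observations into dollar bars.

    A bar closes once the cumulative traded value (price * volume)
    reaches `bar_value`. This is a rough sketch: trades that straddle
    a boundary are not split across bars.
    """
    bars, cum_value, o, h, l = [], 0.0, None, None, None
    for p, v in zip(prices, volumes):
        if o is None:                      # first observation of a new bar
            o = h = l = p
        h, l = max(h, p), min(l, p)
        cum_value += p * v
        if cum_value >= bar_value:         # threshold crossed: close the bar
            bars.append({'open': o, 'high': h, 'low': l,
                         'close': p, 'value': cum_value})
            cum_value, o = 0.0, None
    return pd.DataFrame(bars)

# Four observations of $50,000 each -> two $100,000 bars
bars = dollar_bars([10.0, 10.0, 10.0, 10.0], [5000, 5000, 5000, 5000], 100_000)
```

Volume bars follow the same recipe with `v` accumulated in place of `p * v`.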

Alternative data structures exhibit statistical properties to different degrees, with volume and dollar bars typically expressing greater stationarity of returns than time and tick bars. These properties play a big role when considering which bars to use in a machine learning framework, as we will discuss next.

Much of the literature on machine learning, and statistical inference in general, makes the assumption that observations are independent and identically distributed (iid). Independent means that the occurrence of one observation has no effect on any other, and identically distributed means that all observations are drawn from the same probability distribution (e.g. have the same mean, variance, skew, etc.).

Unfortunately, these properties are rarely found in financial time series. Consider prices: today's price is highly dependent on yesterday's, the mean price over any time interval is constantly changing, and volatility can shift rapidly when important information is released. Returns, on the other hand, remove most of these dependencies. However, the variance (i.e. volatility) of returns still changes over time as the market moves through different volatility regimes, so returns are not identically distributed either.

The different bar types (and additional data structures) exhibit varying statistical properties. This is important to consider when applying machine learning or other statistical inference techniques, as they assume that inputs are iid sampled (or stationary, for time series). Using dollar bars in lieu of time bars can make the difference between a weak, overfit algorithm and a consistently profitable one. This is just one step in the search for stationarity, however, and we must have other tools in our arsenal.

The note above about independence of price series vs return series illuminates one concept: the tradeoff between memory and stationarity. The latter is a necessary attribute for inference, but provides no value without the former. In the extreme, consider transforming any series into strictly 1's; you've successfully attained stationarity, but at the cost of all information contained in the original series. A useful intuition is to consider degrees of differentiation from an original series, where greater degrees increase stationarity and lower memory: the price series has zero differentiation, returns are 1-step differentiated, and the example of all 1's is fully differentiated.

Lopez de Prado proposes an alternative method, named fractional differentiation, that aims to find the optimal balance between these opposing factors: the minimum non-integer degree of differentiation necessary to achieve stationarity. This retains the maximum amount of information in our data. For a thorough description, read chapter 5 of de Prado's Advances in Financial Machine Learning. With this implemented, and our data sufficiently prepared, we are almost ready to whip out the machine learning algorithms. First, though, we have to label our data.
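Before moving on, the fractional differentiation idea can be sketched numerically. The weight recursion below follows chapter 5 of the book (\(w_0 = 1\), \(w_k = -w_{k-1}(d - k + 1)/k\)); the truncation window and function names are just illustrative, and de Prado's full fixed-width-window method handles weight cutoffs more carefully.

```python
import numpy as np

def frac_diff_weights(d, size):
    """Weights for fractional differentiation of order `d`:
    w_0 = 1 and w_k = -w_{k-1} * (d - k + 1) / k, truncated at `size`.
    d = 0 leaves the series untouched; d = 1 reproduces plain first
    differences; non-integer d interpolates between memory and stationarity.
    """
    w = [1.0]
    for k in range(1, size):
        w.append(-w[-1] * (d - k + 1) / k)
    return np.array(w)

def frac_diff(series, d, size=10):
    """Apply the truncated weights to a 1-D series. With mode='valid',
    np.convolve lines w[0] up with the newest observation in each window."""
    return np.convolve(np.asarray(series, dtype=float),
                       frac_diff_weights(d, size), mode='valid')

# d = 1 recovers ordinary differences; d = 0.4 keeps a long, decaying memory
diffs = frac_diff([1.0, 2.0, 4.0, 8.0], d=1.0, size=2)
frac_w = frac_diff_weights(0.4, 4)
```

Note how the weights for non-integer \(d\) decay slowly rather than cutting off after one lag; that slow decay is precisely the retained memory.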

Most machine learning classifiers require labeled data (those that don't are powerful but difficult to engineer, and come with a high risk of overfitting). We intend to predict the future performance of a security, so it seems fair to label each observation based on its ensuing price performance. It is tempting to just use whether the returns were positive or negative over a fixed time window. This method, however, produces many labels that refer to non-meaningful, small price changes. Moreover, real trading is often done with limit orders to take profits or stop losses. Marcos Lopez de Prado proposed a labeling strategy that he calls the Triple-Barrier Method, which combines our labeling desires with real market behavior.

When a trade is made, investors may choose to pre-emptively set orders to execute at certain prices. If the current price of security \(s\) is $5.00, and we want to control our risk, we might set a stop-loss order at $4.50. If we want to take profits before they vanish, we may set a profit-taking order at $5.50. These orders automatically close the position when the price reaches either limit. The stop-loss and profit-taking orders represent the two horizontal barriers of the Triple-Barrier Method, while the third, vertical, barrier is simply time-based: if a trade is stalling, you may want to close it out within \(t\) days, regardless of performance.

The method outputs a value of either -1 or 1 for each purchase date and security given, depending on which barrier is hit first. If the top barrier is reached first, the label is 1 because a profit was made. If instead the bottom barrier is hit, losses were locked in and the label is -1. If the purchase times out before either limit is broken and the vertical barrier is hit, the label is set in the range (-1, 1), scaled by how close the final price was to each barrier (alternatively, if you want to label only sufficiently large price changes, 0 can be output here).
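A minimal sketch of this labeling for a single trade might look as follows. The barrier prices are assumed fixed and pre-computed here; de Prado's implementation sets the horizontal barriers dynamically from rolling volatility estimates.

```python
def triple_barrier_label(prices, upper, lower, max_hold):
    """Label one trade entered at prices[0]: 1 if the profit-taking
    barrier is touched first, -1 if the stop-loss is touched first,
    otherwise a value in (-1, 1) scaling where the final price sits
    between the two horizontal barriers. A sketch with fixed barriers.
    """
    horizon = min(max_hold, len(prices) - 1)
    for t in range(1, horizon + 1):
        if prices[t] >= upper:       # profit-taking barrier hit
            return 1
        if prices[t] <= lower:       # stop-loss barrier hit
            return -1
    # vertical barrier hit: linear scaling between the horizontal barriers
    return 2 * (prices[horizon] - lower) / (upper - lower) - 1

label_win  = triple_barrier_label([5.00, 5.20, 5.60], upper=5.50, lower=4.50, max_hold=5)
label_loss = triple_barrier_label([5.00, 4.40],       upper=5.50, lower=4.50, max_hold=5)
label_flat = triple_barrier_label([5.00, 5.05, 5.00], upper=5.50, lower=4.50, max_hold=2)
```

The third call times out with the price midway between the barriers, so its label scales to 0.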

Once you have a model trained to set the side of a trade (labeled by the Triple-Barrier Method), you can train a secondary model to set the size of the trade, taking the primary model's predictions as input. Learning the direction and size of a trade simultaneously is much more difficult than learning each separately, and this approach also allows modularity (the same sizing model may work for the long and short versions of a trade). We must again label our data, via a method de Prado calls Meta-Labeling. This strategy assigns each of the primary model's predicted trades a label of 0 or 1 (1 to take the trade, 0 to pass), along with a probability, which is then used to size the trade.

Useful considerations for binary classification (as in the Triple-Barrier Method and Meta-Labeling) are sensitivity and specificity. There exists a trade-off between Type 1 (false positive) and Type 2 (false negative) errors, as well as true positives and true negatives. The F1-score measures the efficiency of a classifier as the harmonic mean of precision (TP / (TP + FP)) and recall (TP / (TP + FN)). Meta-labeling helps maximize F1-scores: we first build a model with high recall, regardless of precision (it learns the direction, but with many superfluous transactions), then correct for the low precision by applying meta-labeling to the predictions of the primary model. This filters out false positives and scales our true positives by their calculated accuracy.
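A toy sketch of these two pieces, with purely illustrative function names:

```python
import numpy as np

def f1_from_counts(tp, fp, fn):
    """F1 = harmonic mean of precision TP/(TP+FP) and recall TP/(TP+FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def meta_labels(primary_side, realized_side):
    """Meta-labels: 1 where the primary model's predicted side matched
    the realized outcome (take the trade), 0 otherwise (pass). A secondary
    classifier trained on these labels then sizes bets with its predicted
    probabilities.
    """
    return (np.asarray(primary_side) == np.asarray(realized_side)).astype(int)

f1 = f1_from_counts(tp=8, fp=2, fn=2)            # precision = recall = 0.8
labels = meta_labels([1, -1, 1, -1], [1, 1, 1, -1])
```

Filtering out the trades the secondary model labels 0 removes false positives, raising precision while leaving the high-recall primary model untouched.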

Now that we've discussed the considerations when structuring financial data, we are finally ready to discuss how programs actually learn to trade!

A machine learning archetype known as ensemble learning has been shown time and again to be robust and efficient. These algorithms combine many weak learners (e.g. decision trees) to create a stronger signal. Examples include random forests, other bagged (bootstrap-aggregated) classifiers, and boosted classifiers. These produce a feature space that can be pruned to decrease the prevalence of overfitting. This discussion assumes you are at some level familiar with machine learning methods (particularly ensemble learners). If you are not, scikit-learn's tutorials are a fabulous starting point.

This is a popular ensemble learning method that aggregates many individual learners that are prone to overfitting if used in isolation (decision trees are common), into a lower variance 'bag' of learners. The rough recipe is as follows:

Generate N training datasets through random sampling with replacement.

Fit N estimators, one on each training set. Each is fit independently of the others, so they can be trained in parallel.

Take the simple average of the forecasts of the N models, and voilà! You have your ensemble forecast. (For classification with discrete labels, use majority-rule voting rather than a simple average; if prediction probabilities are involved, the ensemble forecast averages the probabilities.)
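The three steps above can be sketched with scikit-learn, which wraps the whole recipe in one object. The toy dataset and hyperparameters here are purely illustrative:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy, separable data standing in for real features (illustrative only):
# the label is simply the sign of the first feature
rng = np.random.RandomState(42)
X = rng.randn(500, 4)
y = (X[:, 0] > 0).astype(int)

# Steps 1-3 in one object: N bootstrap samples, N trees fit independently,
# and majority-vote (or mean-probability) aggregation of the forecasts
bag = BaggingClassifier(DecisionTreeClassifier(max_depth=3),
                        n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X[:400], y[:400])
accuracy = bag.score(X[400:], y[400:])
```

On this trivially separable toy problem the bagged trees score near-perfect accuracy; the payoff on real, noisy financial data is the variance reduction discussed next.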

The chief advantage of bagging is reducing variance to address overfitting. The variance is a function of the number \(N\) of bagged classifiers, the average variance of a single estimator's prediction, and the average correlation among their forecasts.
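Under the simplifying assumption that all \(N\) estimators have the same variance \(\bar\sigma^2\) and the same average pairwise correlation \(\bar\rho\), the variance of the averaged forecast \(\varphi_i\) works out to

\[
\operatorname{Var}\!\left[\frac{1}{N}\sum_{i=1}^{N}\varphi_i\right]
= \bar\rho\,\bar\sigma^2 + \frac{1-\bar\rho}{N}\,\bar\sigma^2,
\]

which exposes both levers: adding estimators shrinks the second term toward zero, but the floor \(\bar\rho\,\bar\sigma^2\) can only be lowered by decorrelating the estimators themselves.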

Lopez de Prado also presents sequential bootstrapping, a new bagging method that produces samples with higher degrees of independence from each other (in the hope of approaching an IID dataset). This further reduces the variance of the bagged classifiers.

Designed to reduce the variance/overfitting potential of decision trees. Random forests are an implementation of bagging (the 'forest' being the aggregation of many trees), with an extra layer of randomness: when optimizing each node split, only a random subsample (without replacement) of the attributes will be evaluated, to further decorrelate the estimators.
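In scikit-learn this extra layer is the `max_features` parameter; a minimal sketch (toy data and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy target depending on two features (illustrative only)
rng = np.random.RandomState(7)
X = rng.randn(600, 6)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# max_features caps the random subsample of attributes evaluated at each
# node split -- the extra decorrelation layer on top of ordinary bagging
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                random_state=0)
forest.fit(X[:500], y[:500])
accuracy = forest.score(X[500:], y[500:])
```

The fitted forest also exposes `feature_importances_`, which feeds directly into the feature-importance analysis below.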




Feature importance analysis allows us to prune the features in our noisy financial time-series dataset that do not contribute to performance. Once features are discovered, we can experiment on them. Are they always important, or only in some specific environments? What triggers a change in importance over time? Can those regime switches be predicted? Are those important features also relevant to other financial instruments? Are they relevant to other asset classes? What are the most relevant features across all financial instruments? What is the subset of features with the highest rank correlation across the entire investment universe? Pruning our feature space is an important part of optimizing our models for performance and risk of overfitting, like every other consideration above. In general, there is a lot to explore here, and it is outside of the scope of this already lengthy post.

Once our model has picked up on some features, we want to assess how it performs. Cross-validation is the standard technique for this analysis, but it does require some healthy finagling for application in finance, as is the theme today. Cross-validation (CV) splits observations drawn from an IID process into a training set and a testing set, the latter of which is never used to train the algorithm (for fairly obvious reasons), only to evaluate it. One of the most popular methods is \(k\)-fold, where the data is split into \(k\) equally-sized bins (or folds), one of which is used to test the results of training on the remaining \(k-1\) bins. The process is repeated \(k\) times, so that each bin is used as the testing bin exactly once.

This method, however, has problems in finance, as our data are not IID. Errors can also result from multiple testing and selection bias when the same sets are used for both training and testing. Information leaks between bins because our observations are correlated: if \(X_{t+1}\) depends on \(X_t\) through serial correlation and the two are binned separately, the bin containing the latter value carries information about the former. This inflates the perceived performance of any feature that describes \(X_t\) and \(X_{t+1}\), whether it is valuable or irrelevant, leading to false discoveries and overstated returns. Lopez de Prado presents solutions to these problems that also allow more learning to be done on the same amount of data. One in particular, which he calls Purged \(K\)-Fold CV, simply purges from the training set any observations whose labels overlap in time with the testing set. Another deletion strategy, which he calls embargo, eliminates from the training data any signals produced during the testing set (essentially deleting some number of bars directly following the testing data). He also provides a framework to find the optimal \(k\) value assuming little or no leakage between training and testing data.
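A minimal sketch of purging and embargo, assuming for simplicity that the label of sample \(i\) depends on information over the window \([i, i + \text{label\_span}]\) (de Prado's version works with explicit per-label start/end timestamps, and the function name here is illustrative):

```python
import numpy as np

def purged_kfold_indices(n, k, label_span, embargo=0):
    """Yield (train, test) index arrays for contiguous k-fold CV with
    purging and embargo. A sketch: sample i's label is assumed to span
    [i, i + label_span].
    """
    idx = np.arange(n)
    for test in np.array_split(idx, k):
        t0, t1 = test[0], test[-1]
        # purge: drop training samples whose label window overlaps the test set
        overlaps = (idx + label_span >= t0) & (idx <= t1)
        # embargo: also drop samples immediately following the test window
        embargoed = (idx > t1) & (idx <= t1 + embargo)
        yield idx[~overlaps & ~embargoed], test

splits = list(purged_kfold_indices(n=10, k=2, label_span=2, embargo=1))
```

With 10 samples and 2 folds, the first fold tests on indices 0-4, purges nothing extra on the left, and embargoes index 5; the second fold tests on 5-9 and purges training indices 3-4, whose label windows reach into the test set.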

So, finally, we have everything we need. Prune our data, label it for direction, train an ensemble learning algorithm, label for size, train another algorithm on that, and combine. Integrate the results of this with the Quantopian IDE and backtester and send it to the contest! This may seem like a lot, and it is, but much of the programming is modular. With the groundwork laid, you just might find yourself churning out more impressive strategies than ever before.




This discussion drew heavily from Advances in Financial Machine Learning by Marcos Lopez de Prado. This is an excellent resource if you are already familiar at a high level with investment management, machine learning, and data science. If you are not, the Quantopian lecture series is a great place to start, especially combined with Kaggle competitions and the scikit-learn machine learning tutorials. Don't be afraid to just play around, it can be fun!

Once you're comfortable with all of that, go through Lopez de Prado's book and work on implementing these methods with a data structure you created, and plugging the result into a few different specially-calibrated machine learning algorithms. If you have a predictive model, test it out-of-sample for a good long while (though your methodology should have prevented overfitting) and see if it sticks. If it does, congratulations! Pit your strategy against others in the contest and we might give you a call.

Well, that was a lot! If you are new to machine learning, new to its use in finance, or just want to learn more, please take advantage of the resources discussed here. We hope to provide significantly more in-depth resources on these topics in the future. This represents a good start to producing a truly thought-out and well-researched strategy, and we hope you make some amazing things with it. Good luck!