Tiny RL — Technical/RL/Policy

In this example we will use gradient ascent (equivalently, gradient descent on the negative reward) to maximise a reward function.

The Sharpe ratio will be used as the reward function. It is a common indicator of the risk-adjusted performance of an investment over time. Assuming a risk-free rate of zero, the Sharpe ratio can be written as:
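With a zero risk-free rate, the Sharpe ratio over T periods is conventionally the mean of the per-period returns divided by their standard deviation. One standard way to write this, using R_t for the return at time t (the helper symbols A and B are introduced here only for compactness), is:

```latex
S_T = \frac{A}{\sqrt{B - A^2}}, \qquad
A = \frac{1}{T}\sum_{t=1}^{T} R_t, \qquad
B = \frac{1}{T}\sum_{t=1}^{T} R_t^2
```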

Further, to decide what fraction of the portfolio should be allocated to the asset in a long-only strategy, we can specify the following function, which generates a value between 0 and 1.

The input vector at time t is the following, where rt is the percent change between the asset at time t and t+1, and M is the number of time series inputs.

This means that at every step the model is fed its last position and a series of historical price changes, which it uses to calculate the next position. Once we have a position at each time step, we can calculate our returns R at each time step using the following formula. In this example, δ is the transaction cost.

To perform the gradient step, one must compute the derivative of the Sharpe ratio with respect to the parameters theta, using the chain rule and the above formulas. It can be written as:
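The pieces described above can be put together in a minimal sketch. Several things here are assumptions rather than the original implementation: a sigmoid is used to squash the position into the 0 to 1 range, the input vector is taken to be a bias term, the last M price changes and the previous position, the return at each step is the previous position times the price change minus δ times the change in position, and a finite-difference gradient stands in for the analytic chain-rule derivative. All function names are hypothetical, and the price changes are synthetic.

```python
import numpy as np

def positions(r, theta):
    """Long-only position in [0, 1] at each step, from past changes and the last position."""
    M = len(theta) - 2                      # theta = [bias, M lag weights, last-position weight]
    F = np.zeros(len(r))
    for t in range(M, len(r)):
        x_t = np.concatenate(([1.0], r[t - M + 1:t + 1], [F[t - 1]]))
        F[t] = 1.0 / (1.0 + np.exp(-theta @ x_t))   # sigmoid: an assumed squashing function
    return F

def strategy_returns(F, r, delta):
    """Previous position times the price change, less a cost on position changes."""
    return F[:-1] * r[1:] - delta * np.abs(np.diff(F))

def sharpe(R):
    """Sharpe ratio with a zero risk-free rate."""
    return R.mean() / R.std()

def reward(theta, r, delta):
    return sharpe(strategy_returns(positions(r, theta), r, delta))

def numerical_gradient(theta, r, delta, eps=1e-5):
    """Central finite differences, standing in for the analytic derivative."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (reward(theta + step, r, delta) - reward(theta - step, r, delta)) / (2 * eps)
    return grad

def train(r, M=5, delta=0.0025, lr=0.1, epochs=30):
    theta = np.zeros(M + 2)
    for _ in range(epochs):
        theta = theta + lr * numerical_gradient(theta, r, delta)   # ascent: step up the gradient
    return theta

rng = np.random.default_rng(0)
r = rng.normal(0.0005, 0.01, size=300)      # synthetic price changes, for illustration only
theta = train(r)
F = positions(r, theta)
R = strategy_returns(F, r, 0.0025)
```

The finite-difference gradient is slow but hard to get wrong, which makes it a useful check against a hand-derived analytic gradient.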

Code and Resources:

Data, Code

Hat tip, Teddy Koker

Tiny VIX CMF — StatArb/RL/Policy

Futures on the CBOE Volatility Index (VIX) and on the Euro STOXX 50 Volatility Index (VSTOXX) are liquid, and so are the exchange-traded notes and exchange-traded funds (ETNs/ETFs) on VIX and VSTOXX. Prior research shows that the futures curves exhibit stationary behaviour with mean reversion toward a contango.

First, one can imitate the futures curves and ETN price histories by building a model and then use that model to manage the negative roll yield. The Constant Maturity Futures (CMF) can be specified as follows:
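As an illustration, one common way to build a constant maturity future is to interpolate linearly between the two listed contracts whose expiries bracket the target maturity. The sketch below assumes that construction; the function name and all parameter names are hypothetical and not taken from the original code.

```python
def constant_maturity_future(t, tau, T1, F1, T2, F2):
    """Interpolate between two contracts to get a future with fixed time to maturity tau.

    t  : current time (e.g. in days)
    tau: target time to maturity, so the target expiry is t + tau
    T1, F1: expiry and price of the nearer contract
    T2, F2: expiry and price of the further contract
    """
    target = t + tau
    if not (T1 <= target <= T2):
        raise ValueError("target maturity must lie between the two expiries")
    a = (T2 - target) / (T2 - T1)        # weight on the nearer contract
    return a * F1 + (1.0 - a) * F2

# A 30-day CMF when the bracketing contracts expire in 20 and 50 days.
cmf = constant_maturity_future(t=0, tau=30, T1=20, F1=15.0, T2=50, F2=18.0)
```

As time passes, the weight shifts from the nearer contract to the further one, which is where the roll yield discussed in the text comes from.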

One can then go on to define the value of the ETN so that the roll yield is taken into account. I wanted to focus on maturity and instrument selection, and therefore ignored the roll yield and worked with the CMFs alone. If you are interested, the value of the ETN can be obtained as follows.

where r is the interest rate.

Unlike the Tiny RL approach, this strategy makes use of numerical analyses before a reinforcement learning step. First, for all seven securities (J), build a matrix of 1 and 0 combinations to use in the simulation.

Then use a standard normal distribution to randomly assign weights to each value in the matrix. Create an inverse matrix and do the same. Now normalise each matrix so that every row sums to one, which forces neutral portfolios. The next part of the strategy is to run this random weight assignment N times (600 here); N is limited by your memory capacity, as this whole trading strategy is serialised.

Thus, each iteration (N) produces normally distributed long and short weights (W) that have been calibrated to initial position neutrality (Long Weights = Short Weights); the final result is 15,600 trading strategies.
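The simulation step can be sketched as follows. This is an interpretation under stated assumptions: the "inverse" matrix is taken to be the complement of the 1/0 matrix (the securities not held long are held short), absolute values of the normal draws are used so that the weights on each side are positive, and every combination with at least one long and one short leg is kept. The source does not say how its combinations were restricted, so the row count here will differ from the 15,600 quoted above. The function name is hypothetical.

```python
import itertools
import numpy as np

def random_neutral_weights(n_securities=7, n_iter=600, seed=0):
    """Return one row of long-minus-short weights per combination per iteration."""
    rng = np.random.default_rng(seed)
    combos = np.array(list(itertools.product([0, 1], repeat=n_securities)))
    n_long = combos.sum(axis=1)
    long_mask = combos[(n_long > 0) & (n_long < n_securities)]   # need both legs (assumption)
    short_mask = 1 - long_mask                                   # complement of the long matrix
    weights = []
    for _ in range(n_iter):
        draw = np.abs(rng.standard_normal(long_mask.shape))      # abs() is an assumption
        long_w = long_mask * draw
        short_w = short_mask * draw
        long_w = long_w / long_w.sum(axis=1, keepdims=True)      # long side sums to one
        short_w = short_w / short_w.sum(axis=1, keepdims=True)   # short side sums to one
        weights.append(long_w - short_w)                         # net exposure is zero
    return np.vstack(weights)

W = random_neutral_weights(n_iter=3)
```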

The next part of this system is to filter the strategies using the following criteria. Select the top X percent of strategies by highest median cumulative sum over the period. From that selection, select the top Y percent by lowest standard deviation.

Of that group, select the top Z percent, again by highest median cumulative sum. X, Y and Z are risk-return parameters that can be adjusted to suit your investment preferences. In this example, they are set at 5%, 40% and 25% respectively. It is possible to select these parameters efficiently by adding them to the reinforcement learning action space.
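The three-stage filter can be sketched as below. It assumes the strategy returns are stored as an array with one column per strategy, and that the standard deviation is taken over the per-period returns; both are assumptions, and the function names are hypothetical.

```python
import numpy as np

def top_fraction(indices, scores, fraction, largest=True):
    """Keep the given fraction of `indices`, ranked by their entries in `scores`."""
    n_keep = max(1, int(round(len(indices) * fraction)))
    order = np.argsort(scores[indices])
    if largest:
        order = order[::-1]
    return indices[order[:n_keep]]

def filter_strategies(returns, x=0.05, y=0.40, z=0.25):
    """returns: array of shape (n_periods, n_strategies). Returns surviving column indices."""
    median_cumsum = np.median(np.cumsum(returns, axis=0), axis=0)
    volatility = returns.std(axis=0)
    idx = np.arange(returns.shape[1])
    idx = top_fraction(idx, median_cumsum, x, largest=True)    # top X% by median cumulative sum
    idx = top_fraction(idx, volatility, y, largest=False)      # of those, Y% with lowest std
    idx = top_fraction(idx, median_cumsum, z, largest=True)    # of those, top Z% again
    return idx

rng = np.random.default_rng(0)
simulated = rng.normal(0.0, 0.02, size=(120, 1000))            # synthetic returns for illustration
survivors = filter_strategies(simulated)
```

With 1,000 candidate strategies and the 5%, 40% and 25% settings, the stages keep 50, then 20, then 5 strategies.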

Of the remaining strategies, iteratively remove highly correlated strategies until only 10 (S) remain. These 10 strategies have all been selected using only training data. Use the training data again to formulate a reinforcement learning strategy: a simple MLP neural network with two hidden layers selects the best strategy for each month by looking at the last 6 months of returns of all the strategies, i.e., 60 features in total.
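One way to do the correlation pruning and to build the 60 input features is sketched below. The rule for choosing which member of a highly correlated pair to drop is an assumption (the one that is more correlated with everything else), and both function names are hypothetical.

```python
import numpy as np

def prune_correlated(returns, n_keep=10):
    """Drop one strategy from the most correlated pair until n_keep columns remain."""
    keep = list(range(returns.shape[1]))
    while len(keep) > n_keep:
        corr = np.abs(np.corrcoef(returns[:, keep], rowvar=False))
        np.fill_diagonal(corr, 0.0)
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        # Of the pair, drop the one that is more correlated with all the others.
        drop = i if corr[i].mean() >= corr[j].mean() else j
        keep.pop(drop)
    return keep

def lagged_features(returns, t, lookback=6):
    """Flatten the previous `lookback` monthly returns of every strategy into one vector."""
    return returns[t - lookback:t].ravel()

rng = np.random.default_rng(0)
monthly = rng.normal(0.0, 0.03, size=(60, 25))    # synthetic monthly returns, 25 strategies
kept = prune_correlated(monthly, n_keep=10)
features = lagged_features(monthly[:, kept], t=12)
```

The feature vector for month 12 holds 6 months times 10 strategies, which matches the 60 features mentioned in the text.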

Finally, test the results on an out-of-sample test set. Note that no hyperparameter selection was done on a development set in this strategy; as a result, the results can be expected to improve further with tuning.

Data, Code

Hat tip Andrew Papanicolaou

Agent Strategy — Price/RL/Various Sub-methods

Here, 20+ reinforcement learning sub-methods are developed using different algorithms. The first three in the code supplement do not make use of RL; their rules are determined by arbitrary inputs. These are a turtle-trading agent, a moving-average agent, and a signal-rolling agent.

The rest of the coding notebook contains progressively more involved reinforcement learning agents. The notebook investigates, among others, policy gradient agents, q-learning agents, actor-critic agents, and some neuro-evolution agents and their variants.

With enough time, all these agents can be initialised, trained and measured for performance. Each agent individually generates a chart that contains some of the performance information as shown in Exhibit 2.

Exhibit 2: Example of a Reinforcement Learning Strategy’s Performance

In this section we will look at three of the most popular methods: Q-learning, Policy Gradient, and Actor-Critic. Some quick notes on notation: s = states, a = actions, r = rewards. In addition, action-value functions Q, state-value functions V, and advantage functions A are defined as:
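For reference, the textbook definitions of these three functions under a policy π are given below. The discount factor γ and the time subscripts are notation assumed here, not taken from the original exhibit.

```latex
Q^{\pi}(s_t, a_t) = \mathbb{E}\left[\sum_{k \ge 0} \gamma^{k} r_{t+k} \,\middle|\, s_t, a_t\right], \qquad
V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi}\left[Q^{\pi}(s_t, a_t)\right], \qquad
A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)
```

The advantage measures how much better a particular action is than the policy's average behaviour in that state.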

Q-learning is an online action-value function learning method with an exploration policy, e.g., epsilon-greedy. You take an action, observe the reward and the next state, update the value estimate toward the maximising action, adjust the policy and do it all again.
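A minimal tabular sketch of that loop is shown below, assuming discrete states and actions stored in a table Q. The learning rate `alpha`, discount `gamma` and the function names are illustrative assumptions; the agents in the notebook are more involved.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Explore with probability epsilon, otherwise take the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step: move Q[s, a] toward r + gamma * max_a' Q[s_next, a']."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = np.zeros((2, 3))                       # 2 states, 3 actions (e.g. buy, hold, sell)
rng = np.random.default_rng(0)
a = epsilon_greedy(Q[0], epsilon=0.1, rng=rng)
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
```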

Policy Gradients: here you maximise the expected reward directly by adjusting the policy so that actions that lead to higher rewards become more likely.
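The simplest member of this family is the REINFORCE update, sketched below for a tabular softmax policy. This is an illustrative assumption about the form of the policy, not the notebook's implementation, and the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, lr=0.1, gamma=0.99):
    """theta: (n_states, n_actions) logits; episode: list of (state, action, reward)."""
    grad = np.zeros_like(theta)
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G                  # discounted return from this step onward
        grad_log_pi = -softmax(theta[s])   # gradient of log softmax w.r.t. the logits
        grad_log_pi[a] += 1.0
        grad[s] += G * grad_log_pi
    return theta + lr * grad               # ascent on expected return

theta = np.zeros((1, 2))
theta_new = reinforce_update(theta, [(0, 1, 1.0)], lr=1.0)
```

After a single rewarded step with action 1, the logit of action 1 rises and the logit of action 0 falls, so action 1 becomes more likely.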

Actor-Critic is a combination of policy gradient and value-function learning. In this example, I will focus on the online as opposed to the batch model.
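A one-step online actor-critic update can be sketched as follows, again assuming a tabular softmax actor and a tabular value table as the critic. The TD error computed by the critic plays the role of the advantage in the actor's update. Names and step sizes are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      lr_actor=0.1, lr_critic=0.1, gamma=0.99):
    """Update the critic V and the actor logits theta from a single transition."""
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]               # one-sample estimate of the advantage
    grad_log_pi = -softmax(theta[s])       # gradient of log pi(a|s) w.r.t. the logits
    grad_log_pi[a] += 1.0
    V[s] += lr_critic * td_error           # critic: move V(s) toward the target
    theta[s] += lr_actor * td_error * grad_log_pi   # actor: policy-gradient step
    return theta, V, td_error

theta = np.zeros((2, 2))
V = np.zeros(2)
theta, V, td = actor_critic_step(theta, V, s=0, a=0, r=1.0, s_next=1, done=False,
                                 lr_actor=0.5, lr_critic=0.5, gamma=0.9)
```

Because the update uses only the current transition, it can run online after every step rather than waiting for a full batch or episode.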

Code (Data Self-Contained)

SUPERVISED LEARNING

Supervised learning (SL) techniques are used to learn the relationship between independent attributes and a designated dependent attribute. SL refers to the mathematical structure describing how to make a prediction yi given xi.

Instead of learning from the environment like RL, SL methods learn the relationships in data. All supervised learning tasks are divided into classification or regression tasks. Classification models are used to predict discrete responses (e.g., binary 1, 0; multi-class 1, 2, 3). Regression is used for predicting continuous responses (e.g., 3.5%, 35 times, $35,000). In the examples that follow, we will use both classification and regression models.

Industry Factor — Factor/SL/Lasso

In this example, we will look at the use of machine learning tools to analyse industry return predictability based on lagged industry returns across the economy (Rapach, Strauss, Tu, & Zhou, 2019). A strategy that goes long the industries with the highest predicted returns and short those with the lowest earns an alpha of 8%.

In this approach, one has to be careful about multiple testing and post-selection bias. A LASSO regression is eventually used in a machine learning format to weight industry importance; but before that we should first formulate a standard predictive regression framework:

where,

In addition, the LASSO objective 𝜰T can be expressed as follows, where ϑi is the regularisation parameter.

The LASSO regression generally performs well in selecting the most relevant predictor variables. Some argue that the LASSO penalty term over-shrinks the coefficients of the selected predictors. In that scenario, one can keep the selected predictors and re-estimate their coefficients using OLS.

This sub-model, an OLS regression model in this case, can be replaced by any other machine learning regressor. In fact, the main and sub-model can both be machine learning regressors, the first selecting the features and the second predicting the response variable based on those features.
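The two-stage idea, LASSO for selection followed by OLS on the selected predictors, can be sketched with NumPy alone. The coordinate-descent solver below is a plain textbook implementation used for self-containment, not the code from the supplement; the data are synthetic, the variables are assumed to be demeaned (no intercept), and the function names are hypothetical.

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            partial_resid = y - X @ beta + X[:, j] * beta[j]   # residual excluding feature j
            rho = X[:, j] @ partial_resid / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]   # soft threshold
    return beta

def post_lasso_ols(X, y, lam):
    """Stage 1: LASSO picks predictors. Stage 2: OLS re-estimates their coefficients."""
    selected = np.flatnonzero(lasso_coordinate_descent(X, y, lam))
    coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
    return selected, coef

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))                 # stand-in for lagged industry returns
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + 0.1 * rng.standard_normal(500)
selected, coef = post_lasso_ols(X, y, lam=0.1)
```

The LASSO coefficients themselves are shrunk toward zero by roughly the penalty, while the second-stage OLS coefficients are not, which is the motivation given in the text for re-estimating.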

Data, Code

Global Oil — Systematic Macro/SL/Elastic Net

When oil exits a bear market, the currencies of oil-producing nations should also rebound. With this strategy, we will investigate the effect the price of oil has on the Norwegian krone (NOK) and identify whether a profitable trading strategy can be executed. To start, we need a ‘stabiliser currency’ to regress against.

The stabiliser currency should be unrelated to the currency under investigation. Something like the Japanese yen (JPY) is a good candidate. From here on, one would use the prices of NOK and Brent as measured against JPY to identify whether the Norwegian currency is under- or overvalued.

I will use an elastic net regression as the machine learning technique. It is a good tool when multicollinearity is an issue. An elastic net is a regularised regression method that combines both L1 (Lasso) and L2 (Ridge) penalties. The estimates from the elastic net method are defined by:
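In its standard form, the elastic net estimator minimises the squared error plus both penalties. The symbols below (response y, predictor matrix X, coefficients β and penalty weights λ1 and λ2) are notation assumed here:

```latex
\hat{\beta} = \arg\min_{\beta}\;
  \lVert y - X\beta \rVert_2^2
  + \lambda_2 \lVert \beta \rVert_2^2
  + \lambda_1 \lVert \beta \rVert_1
```

Setting λ2 = 0 recovers the Lasso and setting λ1 = 0 recovers Ridge regression.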

The loss function becomes strongly convex as a result of the quadratic penalty term, and therefore has a unique minimum. Now that the predictors are in place, one has to set up a pricing signal; a two-sided one-sigma band is the common practice in arbitrage. We go short if the price spikes above the upper threshold and long if it falls below the lower threshold. The stop-loss is set at two standard deviations. At that point, one can expect the interpretation of the underlying model to be wrong and therefore choose to exit the position.
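The signal rule can be sketched as a function of the model residual, i.e. the observed price minus the fitted price, and the residual's standard deviation sigma. This is a simplified, stateless reading of the rule and an assumption on my part: a real implementation would track the open position and exit it when the stop is hit. The function name is hypothetical.

```python
import numpy as np

def threshold_signal(residual, sigma):
    """Return -1 (short), +1 (long) or 0 (flat) for each residual.

    Short above +1 sigma, long below -1 sigma, and flat again once the
    residual moves beyond 2 sigma in either direction (the stop-loss).
    """
    residual = np.asarray(residual, dtype=float)
    signal = np.zeros(len(residual), dtype=int)
    signal[residual > sigma] = -1                   # overvalued relative to the model: short
    signal[residual < -sigma] = 1                   # undervalued relative to the model: long
    signal[np.abs(residual) > 2 * sigma] = 0        # model presumed wrong: stay out
    return signal

signals = threshold_signal([0.0, 1.5, -1.5, 2.5, -3.0], sigma=1.0)
```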

Data, Code

Deep Trading — Technical/SL/Various DL

There are 30 different neural network sub-methods investigated here. This includes Vanilla RNN, GRU, LSTM, Attention, DNC, Byte-net, Fairseq, and CNN methods. The mathematics of the different frameworks are vast and would take too much space to include here. I have not turned any of the methods into trading strategies yet.

Here, I am simply predicting the future price of the stock, so the models can easily be transformed into directional trading strategies from this point. You can construct the trading policies by hand or rely on reinforcement learning strategies to ‘develop’ the best trading policies.

Exhibit 4: Architecture of RNN, GRU and LSTM cells.

Exhibit 4 can help us to understand the major differences between the sub-methods. A vanilla recurrent neural network (RNN) applies a weighted sum of the inputs (xt) and the previous output (ht-1), passed through a tanh activation function.

A Gated Recurrent Unit (GRU) introduces the additional concept of gates (an update gate and a reset gate) that decide how much of the previous output (ht-1) to pass on to the next cell, in an attempt to mitigate the vanishing gradient problem. These are simply additional mathematical operations performed on the same inputs.

With the Long Short-Term Memory unit (LSTM), a further gate and a separate cell state are introduced relative to the GRU. Again, these are additional mathematical operations on the same inputs. Moving from RNN to LSTM we are simply introducing more ‘control knobs’ for the flow and mixing of input data to establish the final weights.

The LSTM method is designed to focus on establishing weights that maintain information that persists for longer periods of time. The code for these three methods and many others is available in the online supplement.
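To make the difference between the cells concrete, here is a minimal NumPy sketch of a single forward step for a vanilla RNN and for a GRU. The parameter names and the dictionary layout are assumptions for illustration, and the GRU uses one of the two common sign conventions for the update gate; the implementations in the supplement will differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x_t, h_prev, W, U, b):
    """Vanilla RNN: weighted sum of input and previous output through tanh."""
    return np.tanh(W @ x_t + U @ h_prev + b)

def gru_step(x_t, h_prev, p):
    """GRU: gates computed from the same inputs control how much of h_prev survives."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])          # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])          # reset gate
    h_cand = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])
    return (1.0 - z) * h_prev + z * h_cand                           # blend old and candidate

n_in, n_hidden = 3, 2
x_t = np.ones(n_in)
h_prev = np.array([1.0, -1.0])
zeros = {k: np.zeros((n_hidden, n_in)) for k in ("Wz", "Wr", "Wh")}
zeros.update({k: np.zeros((n_hidden, n_hidden)) for k in ("Uz", "Ur", "Uh")})
zeros.update({k: np.zeros(n_hidden) for k in ("bz", "br", "bh")})
h_rnn = rnn_step(x_t, h_prev, zeros["Wh"], zeros["Uh"], zeros["bh"])
h_gru = gru_step(x_t, h_prev, zeros)
```

With all weights at zero, the RNN output collapses to zero, while the GRU still carries half of the previous state forward. That direct path for the previous state is what helps gradients survive over many steps.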

Code (Data Self-Contained)

Stacked Trading — Technical/SL/Stacked

This is purely experimental. It involves the training of multiple models (base-learners or level 1 models), after which they are weighted using an extreme gradient boosting model (metamodel or level 2 model). In the first stacked model, which I will refer to as EXGBEF, we use autoencoders to create additional features.

In the second model, DFNNARX, autoencoders are used to reduce the dimensions of existing features. Here I also add economic (130+ time series) and fund variables to the stock price variables. Similar to the Deep Trading example, we have price movement predictions, but we have not developed a trading policy yet. Exhibit 5 graphically shows the concept of stacking.

Exhibit 5: Architecture of Stacked Models

The training data X has m observations and n features. There are M different models that are trained on X. Each model provides predictions ŷ for the outcome y, which are then cast into a second-level training data set X^(l2) of size m × M. The M predictions become features for this second-level data. A second-level model (or models) can then be trained on this data to produce the final outcomes ŷ-fin, which will be used for predictions. With stacking it can help to use out-of-sample predictions at each modelling level; otherwise the nth-level model will be biased to use only the best-performing model in the previous modelling level.
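The construction of the second-level data can be sketched with NumPy. To keep the example self-contained, ridge regressions with different penalties stand in for the base learners and ordinary least squares stands in for the extreme gradient boosting metamodel; both substitutions, the fold scheme and the function names are assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression weights."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def out_of_fold_predictions(X, y, alphas, n_folds=5):
    """Build the m x M second-level matrix from predictions on held-out folds only."""
    m = X.shape[0]
    folds = np.array_split(np.arange(m), n_folds)
    X_l2 = np.zeros((m, len(alphas)))
    for hold in folds:
        train = np.setdiff1d(np.arange(m), hold)
        for k, alpha in enumerate(alphas):
            w = ridge_fit(X[train], y[train], alpha)
            X_l2[hold, k] = X[hold] @ w        # each row is predicted by a model that never saw it
    return X_l2

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(100)
alphas = [0.1, 1.0, 10.0]                      # M = 3 base learners
X_l2 = out_of_fold_predictions(X, y, alphas)
meta_w, *_ = np.linalg.lstsq(X_l2, y, rcond=None)   # second-level model (OLS stand-in)
y_fin = X_l2 @ meta_w
```

For time series, the folds would normally be arranged so that each model is trained only on data that precedes the fold it predicts; the plain contiguous folds here do not enforce that.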

Code (Data Self-Contained)

SUPERVISED LEARNING VS REINFORCEMENT LEARNING

The general pipeline for supervised machine learning trading involves the acquisition of data, processing of data, prediction, policy development, backtesting, parameter optimization, live paper simulation and finally trading of the strategy.

The basic supervised learning task involves some form of price prediction. This includes regressors that predict the price level and classifiers that predict price direction and magnitude in predefined classifications for future time steps.

Supervised machine learning models, especially neural networks, can keep up with changing market regimes as long as they are able to train online. The reason supervised learning processes tend to fail is that the iterative steps from ML prediction through to policy development, backtesting and parameter optimization are fragile, slow and prone to error.

A further issue is that the performance simulation turns up too late in the game after much hard work has been done. Also, the policy does not develop ‘intelligently’ with the machine learning model.

The benefit of reinforcement learning algorithms is that the final objective function can be the realised/unrealised profit and loss, but also values like the Sharpe Ratio, maximum drawdown, and value at risk measures.
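The candidate reward functions mentioned above are all simple functions of a return series, which is what makes them usable as an RL objective. The sketch below shows one plausible definition of each; the exact definitions (zero risk-free rate, compounding for the drawdown, historical quantile for value at risk) are assumptions, and the function names are hypothetical.

```python
import numpy as np

def total_pnl(returns):
    """Sum of per-period returns."""
    return float(np.sum(returns))

def sharpe_ratio(returns):
    """Mean over standard deviation, with a zero risk-free rate."""
    return float(np.mean(returns) / np.std(returns))

def max_drawdown(returns):
    """Largest peak-to-trough fall of the compounded equity curve (a negative number)."""
    equity = np.concatenate(([1.0], np.cumprod(1.0 + np.asarray(returns, dtype=float))))
    running_peak = np.maximum.accumulate(equity)
    return float(np.min(equity / running_peak - 1.0))

def historical_var(returns, level=0.05):
    """Loss that is exceeded in only `level` of the observed periods."""
    return float(-np.quantile(returns, level))

example = [0.2, -0.25, 0.1]
dd = max_drawdown(example)
```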

Reinforcement learning only has four or so steps as opposed to the seven or eight of supervised learning. RL allows for end-to-end optimization on what maximises rewards. The RL algorithm directly learns a policy. RL has to take an action in an interactive environment.

Supervised learning answers the question, “will the asset increase in price tomorrow?”, whereas reinforcement learning answers the question, “should I buy the asset today?”. The reinforcement learning algorithm is therefore already packaged as a trading strategy.

This does not mean that it is necessarily hard to create a trading strategy out of a supervised learning task. For example, one can simply buy all assets that are predicted to increase in price tomorrow.

Therefore, the reinforcement learning process draws on a larger process of automation. As with supervised strategy development, you still have to ensure that the model works; here, instead of backtesting, you use a simulated environment or paper trading.

Remember that the focus should remain on out-of-sample performance at the end of the day, so be sure to deflate your performance metrics appropriately to control for multiple-testing.

In a nutshell, RL comprises data analysis, agents training in a simulated environment, paper trading, and then finally live trading. In each of the last three steps the agent gets exposed to an environment.

The simplest RL approach is a discrete action space with three actions, buy, hold, and sell. Unlike supervised models, reinforcement models specify an action as opposed to a prediction, however the decision masks an underlying prediction.

So, if RL provides all these miraculous benefits, why is it barely used in industry? Even though RL can lead to a great strategy in fewer steps with less human involvement, it takes longer to train and is very computationally intensive.

RL needs a lot of data, even more so than supervised machine learning. It can also be expensive to test if you can’t reconstruct a good simulated environment.

In finance this is mostly not a big issue, but it does become one when accurate environment feedback is necessary. If the simulated environment won’t cut it, you might have to revert to the real environment, which can become very expensive. Lastly, the bigger the action space, the harder it is to optimise an RL agent.

It is likely that supervised learning will still rule the pack for the foreseeable future. Supervised learning is already quite flexible, and we should expect to see a lot of innovations that bring the experience of developing strategies closer to that of reinforcement learning without forsaking the benefits of supervised learning.

For example, researchers in SL have for a long time looked at embedding policy decisions into SL algorithms. Researchers in finance have also written about creating models that predict the best position sizes and entry and exit points (de Prado, 2018), bringing the trading policy and rules closer to the ML model and closer to a form of automated intelligence.

Let us consider a few more disadvantages of reinforcement learning. First, RL’s convergence to an optimal value is not guaranteed; the famous Bellman update can only guarantee the optimal value if every state is visited an infinite number of times and every action is tried an infinite number of times within each state, so essentially never.

You of course don’t need a truly optimal value; approximate optimality is fine. The big issue is that the sample size needed to obtain a good level of approximate optimality increases with the size of the state and action space. Further, without any assumptions there is no better way than to explore the space randomly, so progress at first is small and slow.

Continuous states and actions are a serious problem: how are we supposed to visit an infinite number of states an infinite number of times when the states and actions take continuous values and each time step is small and slow?

Some of the best approximations can only be done through the generalised nature of supervised learning. Generalisation can also be adopted in RL using function approximation as opposed to storing infinite values in an infinitely large table.

It is worth noting that this function approximation is still orders of magnitude harder than normal supervised learning problems. The reason is that you start the model off with no data, and as you collect data the action value changes and the ground truth labels also remain unfixed; a point previously labelled as good might look bad in the longer run.

To get closer to the true function, the agent has to keep exploring. This exploration in uncertain dynamics means that RL is way more sensitive to hyper-parameters and random seeds than SL as it does not train on a fixed data set and is dependent on network output, exploration mechanism, and environment randomness.

Thus, the same setup can produce different results from run to run. But do notice how great it is that you are never given any samples from the ‘true’ target function, yet you are able to learn by optimising towards a goal; that is why RL is so popular.

I simultaneously expect to see a lot of improvement on the RL trading front, so that RL adopts the advantages of SL trading methods while not forgoing its own strengths. Conceptually RL offers a kind of paradigm shift where we are not overtly focused on predictive power, which is an auxiliary task, but rather the optimization of actions which is and has always been the primary goal.

SL and RL algorithms indirectly pick up on well-known trading strategies without having to predefine and identify them. For example, a gradient step that leads the machine agent to buy more of what did best yesterday is indirectly creating a momentum investing strategy. We can expect machine learning to become part of the toolkit of all asset managers in the future.

SUMMARY

Around 40 years ago, Richard Dennis and William Eckhardt put systematic trend-following systems on a roll. Fifteen years later statistical arbitrage made its way onto the scene, and 10 years after that high-frequency trading started to stick its head out. In the meantime, machine learning tools were introduced to make statistical arbitrage much easier and more accurate.

Machine learning today, among other things, assists investment managers to refine the accuracy of their predictions by using supervised learning, improve the quality of their decisions by using reinforcement learning, and enhance their problem discovery skills by using unsupervised learning.

Technological adoption within portfolio management moves fast, and over the decades we have seen technologies come and go. It is likely that this cycle in quantitative finance will persist and that it also applies to machine learning in asset management, with one caveat: machine learning is also practically revolutionary, because instead of just maximising alpha it also minimises overhead costs.

Machine learning is already having large economic effects on many financial domains and it is poised to grow further. Advanced machine learning models present myriad advantages in flexibility, efficiency, and enhanced prediction quality.

In this article we have paid special attention to how machine learning can be used to improve various types of trading strategies. We started by identifying important components to asset management in the context of machine learning, one of which is portfolio construction, which itself was divided into trading and weight optimization sections.

The trading strategies were classified according to their respective machine learning frameworks, i.e., reinforcement, supervised and unsupervised learning. The article finished with a section explaining the difference between reinforcement learning and supervised learning, both conceptually and in relation to their respective advantages and disadvantages. The next article in this series will be on weight optimization strategies.

References

Britten‐Jones, M. (1999). The sampling error in estimates of mean‐variance efficient portfolio weights. The Journal of Finance, 54(2), 655–671.

de Prado, M. L. (2018). The 10 reasons most machine learning funds fail. The Journal of Portfolio Management, 44(6), 120–133.

de Prado, M. L. (2016). Building diversified portfolios that outperform out of sample. The Journal of Portfolio Management, 42(4), 59–69.

Rapach, D. E., Strauss, J. K., Tu, J., & Zhou, G. (2019). Industry return predictability: A machine learning approach. The Journal of Financial Data Science, 1(3), 9–28.

Author: Derek Snow is a doctoral candidate in Finance at the University of Auckland and was previously a visiting PhD student at NYU Tandon and the University of Cambridge.
