Algorithmic trading has similar problems to those in machine learning. Today, I’m going to show how to apply Bayesian optimization for tuning trading strategy hyperparameters.

Let’s suppose you created a trading strategy with a few hyperparameters. This strategy is profitable on a backtesting. You want to deploy the strategy to production mode, but there is one question left: “Is this set of parameters optimal?”.

Typically, a grid-search approach is used to search for optimal hyperparameters. This approach is also used in machine learning, but this requires a lot of computations, often in the wrong parameter space.

Another approach is a random search, which can perform slightly better than grid-search.

I found a great explanation on Quora:

The problem: The goal is to find an approximate minimum to some ‘expensive’ function. Such a function accepts a real valued vector 𝐱∈ℝ, returns a scalar and takes a damn long time doing it. Also, no gradient calculation is available. So let’s imagine a simple one dimensional case:

The dotted lines imply we can’t see this function — we may only choose to evaluate it at certain points. Since this function is costly, let’s say we have a fixed budget of evaluations — we may only evaluate it, say, 10 times. With that, how would you find the best possible minimum? This is the idea: Sample some input-outputs (less than 10) and use them to guess the true function with something called a ‘Gaussian Process’ (or ‘GP’ here). Then use that guessed function to determine where to evaluate next. Evaluate that point, add it to our set of input-outputs and infer the guessed function once again. Repeat this until you’ve exhausted your budget of evaluations (or some other stopping criteria). If the GP is any good at guessing the true function, we’ll do better than random sampling. To get a feel for the GP, let’s sample four points from our expensive function, hand these over to the GP and have it infer the rest of the function. That might look like this:

The solid green line is our guess of the true function. Each additional band of green is another half standard deviation on the output distribution. So now the question becomes: given all this useful guessed information, what point should we check next? In answering this, there are two things we care about: We should evaluate points we think will yield a low output value. That is, we should evaluate on points where the solid green line is low. We should check areas we know less about. So in the graph above, it would be wiser to check somewhere between .65 and .75 than to check between .15 and .25 since we have a pretty good idea as to what’s going on in the latter zone. Said differently, we should check regions which will reduce the variance in our guess the most. Balancing these two is the exploration-exploitation trade-off. Do you look in new spots or exploit the goldmines you’ve discovered thus far? We make our preference explicit with an acquisition function. This is a function of 𝑥x that will yield a number that tells us how well we are accomplishing these two goals. This function is cheap, so we can optimize it and use that 𝑥x as our next point to search next. So which acquisition function? Well, there are a handful of options, but I’ll go with expectation of improvement. That is, evaluate the next point where the expected improvement is highest. So if: 𝜇(𝑥)μ(x) is the guessed value of the function at 𝑥x (the green line). 𝜎(𝑥)σ(x) is the standard deviation of output at 𝑥x (proportional to the green bands). Then our acquisition/expectation of improvement (call it 𝐴(𝑥)A(x)) is:

where Φ(⋅) and N(⋅) refers to the CDF and PDF of a standard normal distribution, respectively. It’s not important to understand this formula exactly — just know that it’s some balance of a low 𝜇(𝑥)μ(x) and a high 𝜎(𝑥)σ(x). To see how this operates, check it out next to our previous graph:

So this would tell us to check at 𝑥=1 since that point has the highest activation. And then repeat. That’s it! Like I said earlier, if the GP is any good at guessing the true function, it’ll beat random sampling.

This article compares different approaches for hyperparameters tuning in machine learning. It suggests Bayesian optimization as a better approach than random, grid-search or manual.

Bayesian optimization

This video by Siraj Raval also has a good explanation:

If you want to dive deeper into this topic you should read this article.

Practical example

We will consider one strategy that I discussed earlier. This strategy has 4 hyperparameters, let’s define required libraries:

Bayesian optimization is implemented in hyperopt package. You can find an introduction to this package in the great article by Will Koehrsen.

Usually, researchers use backtesting to find optimal hyperparameters and out-of-sample test for evaluating. I like this approach, but I think we can get a more significant result if we use a few backtesting periods (folds)

When a fold is closer to a out-of-sample period, then the fold has a greater weight in the final weighted score.

Also, this function should penalize a final value if some fold has a negative performance. As a score function in a trading, you can use the Sortino ratio, or any function that estimates the quality of the strategy. For our purpose, the lower value is better. This means we should invert the Sortino ratio value.

In this example, we will use:

Five folds by one month for the training (optimization) step

One month for the testing period, and three weeks for validation (out-of-sample period)

The testing period allows selecting pairs that will be tested on the validation period.

Training period: 2018–06–01 to 2018–10–31

Testing period: 2018–11–1 to 2018–11–30

Validation period: 2018–12–1 to 2018–12–21

Let’s define the periods, hyperparameter space, and a weighted_mean function. This code allows to define a different length of folds by period_weights .

The next step is defining the score function that runs the algorithm and measures the performance (in this case, Sortino ratio).

This function is too long because also contains nested functions of the algorithm from the previous article. The alternative variant for passing parameters from hyperopt package to the Catalyst framework is using the os.environ module. Let me know in comments if you know the better way. Maybe guys from Enigma Project advise the right answer.

The core is the objective function that measures the performance of a few folds by weighted_mean function, otherwise gets the maximum of the folds (penalizing).

The next code runs the optimization with 300 iterations.

The output looks like this:

This table contains the top of trials by score. The last line is the set of best hyperparameters. I carried out this experiment for ETH/BTC, LTC/BTC, XMR/BTC, and ZEC/ETC using Bitfinex data.

Theoretically, you can visualize objective value in 3D space, like dimension 1 — dimension 2 — objective value, and the best solution has the global minimum:

3D space of objective function

This hyperparameters table shows the best values of parameters for each asset:

Best hyperparameters

Assets selection (testing step)

On this step, we should run the strategy with fixed hyperparameters for particular asset on testing period. The asset with positive performance will be selected for the next step (out-of-sample test).

This code allows to do this, you should just change the hyperparameters and set the cryptocurrency.

As you see, two assets (ETH/BTC and XMR/BTC) got a positive performance.