Hyperopt tutorial for Optimizing Neural Networks’ Hyperparameters

What is Hyperopt?

Hyperopt is a Python library for searching through a hyperparameter space. For example, it can use the Tree-structured Parzen Estimator (TPE) algorithm, which explores the search space intelligently while narrowing down to the estimated best parameters.

It is hence a good method for meta-optimizing a neural network, which is itself an optimization problem: training a neural network uses gradient descent methods, while tuning its hyperparameters must be done differently, since gradient descent cannot be applied to them. Hyperopt can therefore be useful not only for tuning hyperparameters such as the learning rate, but also for tuning more elaborate parameters in a flexible way: the number of layers of certain types, the number of neurons in a layer, or even the type of layer to use at a certain place in the network, given an array of choices that each have nested tunable hyperparameters.

This is an oriented random search, in contrast with a grid search, where hyperparameter values are fixed in advance at regular steps. Random Search for Hyper-Parameter Optimization (which is the basis of what Hyperopt does) has proven to be an effective search technique. The paper about this technique sits among the most cited deep learning papers. To sum up, it is more efficient to search randomly through values and to intelligently narrow the search space than to loop over fixed sets of values for the hyperparameters.
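One way to see why random sampling can beat a grid with the same budget is to count how many distinct values each strategy tries per hyperparameter. The sketch below uses only the standard library and a made-up two-dimensional unit square; it illustrates the random part of the argument only, not the narrowing done by TPE.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

budget = 9  # same number of trials for both strategies

# Grid search: 3 values per axis -> 3 x 3 = 9 trials.
grid_values = [0.0, 0.5, 1.0]
grid_trials = [(x, y) for x in grid_values for y in grid_values]

# Random search: 9 trials drawn uniformly from the same unit square.
random_trials = [(random.random(), random.random()) for _ in range(budget)]


def distinct_values_per_axis(trials):
    """Count how many different values were tried for x and for y."""
    xs = {x for x, _ in trials}
    ys = {y for _, y in trials}
    return len(xs), len(ys)


grid_coverage = distinct_values_per_axis(grid_trials)
random_coverage = distinct_values_per_axis(random_trials)
print("grid:  ", grid_coverage)
print("random:", random_coverage)
```

If only one of the two hyperparameters really matters, the grid has effectively tested 3 values of it, while the random search has tested 9.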

Note that this blog post is also available on our GitHub as a Notebook. It contains code that can be run with Jupyter.


How to define Hyperopt parameters?

A parameter is defined with a certain uniform range or else a probability distribution, such as:

hp.randint(label, upper)

hp.uniform(label, low, high)

hp.loguniform(label, low, high)

hp.normal(label, mu, sigma)

hp.lognormal(label, mu, sigma)

There are also a few quantized versions of those functions, which round the generated values to the nearest multiple of “q”:

hp.quniform(label, low, high, q)

hp.qloguniform(label, low, high, q)

hp.qnormal(label, mu, sigma, q)

hp.qlognormal(label, mu, sigma, q)
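The quantized variants are documented as returning round(value / q) * q, where value is drawn from the underlying distribution. A minimal standard-library sketch of that formula for the uniform case, with made-up bounds and step:

```python
import random


def quantize(value, q):
    """Round value to the nearest multiple of q, as round(value / q) * q."""
    return round(value / q) * q


random.seed(0)
low, high, q = 0.0, 10.0, 2.5  # hypothetical bounds and step

# Mimics the formula behind hp.quniform(label, low, high, q);
# this does not call Hyperopt itself.
samples = [quantize(random.uniform(low, high), q) for _ in range(1000)]
print(sorted(set(samples)))
```

Every sample lands on a multiple of q between low and high, which is handy for parameters such as a number of neurons that should move in fixed steps.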

It is also possible to use a “choice” which can lead to hyperparameter nesting:

hp.choice(label, ["list", "of", "potential", "choices"])

hp.choice(label, [hp.uniform(sub_label_1, low, high), hp.normal(sub_label_2, mu, sigma), None, 0, 1, "anything"])

Visualisations of the parameters for probability distributions can be found below. Then, more details on choices and parameter nesting will come.


Note on the above charts (especially for the loguniform and uniform distributions): the blurred line averaging the values fades out toward the ends of the signal because the signal is zero-padded. Ideally, the line would not fade out, which could be achieved with techniques such as mirror-padding.

On the loguniform and lognormal distributions

Those are the best distributions for modeling the values of a learning rate. That’s because we want to change the learning rate by multiplication rather than by addition: when adjusting the learning rate, we’ll want to try dividing it or multiplying it by 2 rather than adding or subtracting a fixed value.

To illustrate this, let’s generate a loguniform distribution for a multiplier of the learning rate, centered at 1.0. Dividing 1 by those values should yield the same distribution.
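The symmetry can be checked with the standard library alone. The sketch below assumes a multiplier between 0.1 and 10, drawn as exp(uniform(-log 10, log 10)), and compares a few summary statistics of the multipliers and of their reciprocals. The range and sample size are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(0)

# Log-uniform multiplier between 0.1 and 10: exp(uniform(-log(10), log(10))).
log_bound = math.log(10.0)
multipliers = [
    math.exp(random.uniform(-log_bound, log_bound)) for _ in range(100_000)
]
reciprocals = [1.0 / m for m in multipliers]


def summarize(values):
    """Return the median and the share of values below 1.0."""
    below_one = sum(v < 1.0 for v in values) / len(values)
    return statistics.median(values), below_one


median_m, share_m = summarize(multipliers)
median_r, share_r = summarize(reciprocals)
print(f"multipliers: median={median_m:.3f}, share below 1 = {share_m:.3f}")
print(f"reciprocals: median={median_r:.3f}, share below 1 = {share_r:.3f}")
```

Both sets have a median near 1 and about half of their values below 1, because taking the reciprocal only flips the sign of the underlying uniform variable.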


Example — optimizing for finding the minimum of:

f(x) = x^2 - x + 1

Let’s now define a simple search space and minimize f(x) = x^2 - x + 1, where x is a hyperparameter.


Found minimum after 1000 trials:

{'x': 0.500084824485627}

Example with a dict hyperparameter space

Let’s minimize f(x, y) = x^2 + y^2 over a space structured as a Python dict. Later, this will enable us to nest hyperparameters with choices in a clean way.


With choices, Hyperopt hyperspaces can be represented as nested data structures, too

So far, we have only defined very simple spaces, with one or two parameters. Normally, spaces contain many parameters. Let’s define a more complex one, with one nested hyperparameter choice for a uniform float:


Let’s now record the history of every trial

This will require us to import a few more things, and to return the results as a dict that has at least a “status” and a “loss” key. Let’s also keep the evaluated space in our return dict, as this may come in handy if we save results to disk.


Found minimum after 1000 trials

{'x': 0.1330891919905135, 'y': -0.22753380990535327}

Here are the space and results of the first 3 trials (out of a total of 1000):

What interests us most is the “result” key of each trial (here, we show 7):

Up next: saving results to disk while optimizing for resuming a stopped hyperparameter search

Note that the optimization could be parallelized by using MongoDB and storing the trials’ state there. Although this is a built-in feature of Hyperopt, let’s keep things simple for our examples here.

Indeed, the TPE algorithm used by the fmin function has state, which is stored in the trials and which is used to narrow the search space dynamically once a few trials have run. It is therefore interesting to be able to pause and resume a search, and to apply that to a real problem.

This is what’s done inside the hyperopt_optimize.py file of the GitHub repository for this project. There, as an example, we optimize a convolutional neural network for solving the CIFAR-100 problem.

It is also possible to glance at the results and the effect of each hyperparameter on the accuracy:

You might also like this other blog post of mine on how to use Git Large File Storage (Git LFS) to handle the versioning of huge files when working with machine learning projects.
