Another way to look at dropout is through kernel density estimation (KDE). Here we can see a slight tendency towards higher val_acc with dropout 0 or 0.1, as well as less tendency to end up with low val_acc (around the 0.6 mark).
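As a rough sketch of how such a per-dropout KDE comparison could be computed, this compares density estimates of val_acc for different dropout rates. The numbers are made up for illustration, not the experiment's actual results:

```python
from scipy.stats import gaussian_kde

# Hypothetical val_acc results grouped by dropout rate (illustrative
# numbers only, not the article's actual scan data)
results = {
    0.0: [0.93, 0.94, 0.95, 0.91, 0.94],
    0.1: [0.92, 0.94, 0.93, 0.95, 0.93],
    0.5: [0.60, 0.85, 0.88, 0.62, 0.90],
}

# One KDE per dropout level; comparing densities at a given val_acc
# shows which dropout setting concentrates its results there
kdes = {rate: gaussian_kde(vals) for rate, vals in results.items()}
for rate, kde in kdes.items():
    print(rate, float(kde(0.94)[0]))  # estimated density near high val_acc
```

With data shaped like the above, the low dropout rates show far more density mass near the high-val_acc region, which is exactly the pattern the KDE plot makes visible.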

The first action item for the next round of scanning is to get rid of the higher dropout rates altogether and focus on values between 0 and 0.2. Let’s take a look at learning rate more closely next. Note that learning rates are normalized across optimizers to a scale where 1 represents the Keras default value of that optimizer.
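The normalization described here can be sketched as a simple helper. The defaults below are the actual Keras defaults for these optimizers; the function name is mine:

```python
# Keras default learning rates for a few optimizers
KERAS_DEFAULT_LR = {"Adam": 0.001, "RMSprop": 0.001, "SGD": 0.01}

def normalize_lr(lr, optimizer):
    """Express a learning rate as a multiple of the optimizer's Keras default,
    so 1.0 always means 'the default', regardless of optimizer."""
    return lr / KERAS_DEFAULT_LR[optimizer]

print(normalize_lr(0.001, "Adam"))  # -> 1.0
print(normalize_lr(0.005, "SGD"))   # -> 0.5
```

This makes learning rates directly comparable across optimizers, which is why the plots can put them all on one axis.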

The situation is pretty clear; the smaller learning rates work well for both loss functions, and the difference is particularly pronounced with logcosh. But because binary cross-entropy clearly outperforms at all learning rate levels, it will be the loss of choice for the remainder of the experiment. A sanity check is still needed, though. What if what we're seeing doesn't factor in over-fitting to the training data? What if val_loss is all over the place and we're just getting carried away looking at one side of the picture? A simple regression analysis shows that's not the case. Other than a few outliers, everything is packed nicely in the lower left corner where we want it, with both training and validation loss tending close to zero.

I think for now we know enough; it’s time to set up the next round of the experiment! As a point of reference, the parameter space for the next experiment looks like this:

In addition to refining the learning rate, dropout, and batch size boundaries, I’ve added kernel_initializer ‘uniform.’ Remember that at this stage the objective is to learn about the prediction task, as opposed to being too focused on finding the solution. The key point here is experimentation and learning about the overall process, in addition to learning about the specific prediction challenge.
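To make the idea concrete, a narrowed round-2 parameter space could be written as a dict of lists in the style Keras scanning tools such as Talos use. The specific values below are illustrative placeholders, not the article's actual boundaries:

```python
# Sketch of a narrowed round-2 parameter space (illustrative values)
params = {
    "lr": [0.5, 1.0, 2.0],              # normalized to the Keras default
    "dropout": [0.0, 0.1, 0.2],         # higher dropout rates removed
    "batch_size": [2, 3, 4],
    "epochs": [100, 150],
    "kernel_initializer": ["uniform", "normal"],
    "losses": ["binary_crossentropy"],  # logcosh eliminated
}

# Number of permutations a full grid scan of this space would run
n_permutations = 1
for values in params.values():
    n_permutations *= len(values)
print(n_permutations)  # -> 108
```

Counting the permutations up front is a useful habit: it tells you roughly how much compute the round will cost before you start it.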

Round 2 — Increase the Focus on Result

Initially, the less we focus on the result (and the more on the process), the more likely we are to get a good result. It's like playing chess; if at first you're too focused on winning the game, you will not focus on the opening and mid-game. Competitive chess is won in the endgame, based on playing a strong beginning and middle. If things go well, the second iteration in the hyperparameter optimization process is the middle. We're not entirely focused on winning the game yet, but it helps to already have an eye on the prize. In our case, the results from the first round (94.1% validation accuracy) indicate that with the given dataset, and the set parameter boundaries, there are predictions to be made.

In this case, the prediction task is to say whether breast cancer is benign or malignant. This type of prediction is a big deal in the sense that both false positives and false negatives matter; getting the prediction wrong will have a negative effect on a person's life. In case you are interested, there are a number of papers written on this dataset, along with other relevant info, which you can find here.

The result for the second round is 96% validation accuracy. The below correlation shows that the only thing sticking out at this point is the number of epochs, so that's the one thing I'm going to change for the third round.
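A correlation check like the one described can be computed directly from the scan log. The arrays below are illustrative stand-ins for the real results:

```python
import numpy as np

# Hypothetical scan log: one entry per permutation (illustrative values)
epochs  = np.array([100, 100, 150, 150, 100, 150])
dropout = np.array([0.0, 0.1, 0.0, 0.2, 0.1, 0.2])
val_acc = np.array([0.94, 0.93, 0.96, 0.95, 0.93, 0.96])

# Pearson correlation of each hyperparameter with validation accuracy
corr_epochs  = np.corrcoef(epochs, val_acc)[0, 1]
corr_dropout = np.corrcoef(dropout, val_acc)[0, 1]
print(f"epochs:  {corr_epochs:.2f}")
print(f"dropout: {corr_dropout:.2f}")
```

When one parameter's correlation with val_acc dominates the others, as epochs does here, it's the natural candidate to adjust in the next round.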

If you look at the correlation alone, there is a danger of missing something in the bigger picture. In hyperparameter optimization, the big picture is about individual values within a given parameter, and their interconnectedness with all other values. Now that we’ve eliminated the logcosh loss function, and have just one loss (binary_crossentropy) in the parameter space, I want to learn a little bit about how the different optimizers are performing in the context of the epochs.

It is exactly as the correlation suggests regarding epochs (now on the x-axis). Because RMSprop underperforms at both 100 and 150 epochs, let's drop it from the next round as well.

Before moving on, let’s consider very briefly a fundamental question related to hyperparameter optimization as an optimization challenge. What is it that we’re trying to achieve? The answer can be summarized using two simple concepts;

Prediction optimum

Result entropy

Prediction optimum is where we have a model that is both precise and generalized. Result entropy is where the entropy is as close to zero (minimal) as possible. Result entropy can be understood as a measure of similarity between all the results within a result set (one round of going through n permutations). The ideal scenario is where the prediction optimum is 1, meaning 100% prediction performance and 100% generality, and the result entropy is 0. This would mean that no matter what we do within the hyperparameter space, we get the perfect result every time. This is not feasible for several reasons, but it is helpful to keep in mind as the objective when optimizing the hyperparameter optimization process itself. The other way to answer the question is through three levels of consideration;

The prediction task, where the goal is to find a model that provides a solution for the task

The hyperparameter optimization task, where the goal is to find the best model (with the least effort) for the prediction task

The hyperparameter optimization optimization task, where the goal is to find the best approach to finding the best model for the prediction task

You might then ask if this leads us to an infinite progression where we then need optimizers on top of optimizers, and the answer is yes. In my view, what makes the hyperparameter optimization problem interesting, is the way it leads us to the solution for the problem of “models that build models.” But that would take us far from the scope of this article.

With the second, and particularly the third, aspect in mind, we need to consider the computational efficiency of the process. The less computational resource we waste, the more we have for finding the best possible result with respect to the first two aspects. Consider the below graphs in this light.

The second round's KDE looks much better in the sense that resources are allocated where we need them to be: results are closer to 1 on the x-axis, with very little "spillage" towards 0. Whatever compute resources are going into the scan, they're doing important work. The ideal picture here is a single straight line at the x value of 1.

Round 3 — Generalization and Performance

Let’s get right to it. The peak validation accuracy is now 97.1%, and it looks like we’re going in the right direction. I made the mistake of capping the maximum at 175 epochs, and based on the below, it looks like we have to go further than that, at least with this configuration. Which makes me think…maybe for the last and final round, we should try something surprising.

As discussed in the foreword, it’s important to consider generalization as well. Every time we look at the results, our insights start to affect the experiment. The net result is that we get less generalized models that work well with the validation dataset but might not work well with a “real-life” dataset. In this case, we don’t have a good way to test for this kind of bias, but at least we can take measures to assess the degree of pseudo-generalization with what we have. Let’s look at training and validation accuracy first.

Even though this does not affirmatively confirm that we have a well-generalized model (in fact, it falls a great deal short of that), the regression analysis result could not be much better. Next, let’s look at loss.

It’s even better. Things are looking good. For the last round, I’m going to increase the number of epochs, but I’m also going to try another approach. So far I’ve only had very small batch sizes, which take a lot of time to process. In the third round, I only included batch sizes 1 through 4. For the next, I’m going to throw in 30 or something, and see what that does.

A few words about early stopping. Keras provides a very convenient way to use callbacks through its EarlyStopping functionality. As you might have noticed, I’m not using it. Very generally speaking, I would recommend using it, but it is not as trivial as everything we’ve done here so far. Getting the settings right in a way that doesn’t limit your ability to find the best possible results is not straightforward. The most important aspect has to do with metrics; I would want to have a custom metric created first, and then use that as my EarlyStopping monitor (instead of val_acc or val_loss). That said, EarlyStopping, and callbacks in general, provide a very powerful addition to your hyperparameter optimization process.
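For reference, a minimal EarlyStopping setup looks like the sketch below. The `patience` and `min_delta` values are placeholders you would tune for your own scan, and as noted above, `monitor` could point at a custom metric once one is defined:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Minimal EarlyStopping configuration (placeholder values, tune per scan)
early_stop = EarlyStopping(
    monitor="val_loss",        # or a custom metric name, as suggested above
    min_delta=0.001,           # smallest change counted as an improvement
    patience=10,               # epochs without improvement before stopping
    restore_best_weights=True, # roll back to the best epoch's weights
)

# Passed to training via the callbacks argument, e.g.:
# model.fit(x, y, validation_split=0.3, epochs=150, callbacks=[early_stop])
```

The risk in a hyperparameter scan is that an aggressive patience setting cuts off permutations that would have improved later, which skews the comparison between configurations.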

Round 4 — Final Results are In

Before diving into the results, let’s look at one more visualization from the results of the last round. This time 5-dimensional. I wanted to see the remaining parameters — kernel initializer, batch size, hidden layers, and epochs — all in the same picture compared against validation accuracy and loss. First accuracy.

Mostly it’s neck-and-neck, but some things do stand out. The first is that when a hidden layer value (hue) is down, in most cases it’s one hidden layer. For batch sizes (columns) it’s hard to say, as it is for kernel initializer (rows). Let’s next take a look at the validation loss on the y-axis, and see if we can learn more from there. And remember, here we’re looking for smaller values; we’re trying to minimize the loss function with each parameter permutation.
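A 5-dimensional view like the one described can be built with seaborn by mapping parameters to facet rows, columns, and hue. The data frame below is randomly generated filler standing in for the real scan log:

```python
from itertools import product
import random

import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import pandas as pd
import seaborn as sns

# Build a stand-in scan log covering every parameter combination
random.seed(0)
rows = []
for bs, init, hl, ep in product([2, 4], ["uniform", "normal"],
                                [1, 2], [100, 150]):
    rows.append({"batch_size": bs, "kernel_initializer": init,
                 "hidden_layers": hl, "epochs": ep,
                 "val_acc": round(random.uniform(0.90, 0.97), 3)})
df = pd.DataFrame(rows)

# batch size -> columns, kernel initializer -> rows, hidden layers -> hue,
# epochs on the x-axis, val_acc on the y-axis: five dimensions in one figure
g = sns.catplot(
    data=df, x="epochs", y="val_acc", hue="hidden_layers",
    col="batch_size", row="kernel_initializer", kind="point",
)
g.savefig("five_dim_view.png")
```

The same grid with val_loss on the y-axis gives the loss view discussed next.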

Uniform kernel initializer is doing a great job in keeping the loss down throughout all epoch, batch size, and hidden layer variations. But because the results are a little inconsistent, I’ll keep both initializers until the end.

And the Winner Is…

The winning combination came from the last-minute idea to try a bigger batch size to save time (and it got there in fewer epochs too):

The highest result for the small batch sizes was 97.7% validation accuracy. The larger batch size approach also has the upside of making the model converge very fast. At the end of this article, I will provide a video where you can see it for yourself. To be honest, once I saw how well the bigger batch size worked, I did set up a separate test just focusing on that. It took less than a minute to set up, as all I needed to change was the batch size (and, for this, smaller epochs), and the scan finished in 60 minutes. Regarding the plots, there is not much to see, as more or less all the results were near 100%. There is one more thing I want to share, though, as it relates to the idea of entropy from a different standpoint than the one we already discussed. Entropy can be an effective way to assess overfitting (and therefore a proxy for generalization). In this case, I measure the val_loss and val_acc entropy, using KL divergence, against training loss and accuracy respectively.
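The entropy check described above can be sketched with SciPy's `entropy` function, which computes KL divergence when given two distributions. The loss histories below are illustrative numbers, not the actual training curves:

```python
import numpy as np
from scipy.stats import entropy

# Hypothetical per-epoch loss histories (illustrative values)
train_loss = np.array([0.60, 0.35, 0.20, 0.12, 0.08])
val_loss   = np.array([0.62, 0.38, 0.22, 0.15, 0.11])

# Normalize each history into a probability distribution, then compute
# KL(val || train); scipy's entropy(p, q) returns the KL divergence.
p = val_loss / val_loss.sum()
q = train_loss / train_loss.sum()
kl = entropy(p, q)
print(round(float(kl), 4))
```

A KL divergence near zero says the validation curve behaves much like the training curve, which is the low-entropy, well-generalized picture we want; a large value flags divergence between the two, a classic overfitting signature.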

Summary of the Process

Start as simply and broadly as possible

Try to learn as much as possible about the experiment and your hypothesis

Try not to focus on the final result much for the first iterations

Make sure that your performance metric is right

Remember that performance is not enough, as it tends to lead you away from generality

Each iteration should reduce parameter space and model complexity

Don’t be afraid to try things; it is an experiment after all

Use methods you can understand, e.g. clearly visualized descriptive statistics

Here is the complete code notebook for the last round. And the video I had promised…