Introduction

Author: Sayak Paul

In this article, we'll review and compare a plethora of weight initialization methods for neural nets. We will also outline a simple recipe for initializing the weights in a neural net.

Try various weight initialization methods in Colab →

A summary of the different weight initialization methods


Why does weight initialization matter?

A neural net can be viewed as a function with learnable parameters, and those parameters are often referred to as weights and biases. When training of a neural net starts, these parameters (typically the weights) are initialized in a number of different ways: sometimes with constant values like 0s and 1s, sometimes with values sampled from some distribution (typically a uniform or normal distribution), and sometimes with more sophisticated schemes like Xavier initialization.

The performance of a neural net depends a lot on how its parameters are initialized at the start of training. If we initialize them with unseeded random values on each run, the results are (almost) guaranteed to be non-reproducible, and possibly not very performant either. On the other hand, if we initialize them with constant values, the net might take far too long to converge, and we would lose the benefit of randomness, which helps gradient-based learning reach convergence more quickly. We clearly need a better way to initialize the weights. Careful initialization not only makes neural nets more reproducible but also helps them train better, as we will see in this article. Let's dive in!

The different weight initialization schemes

We are going to study the effects of the following weight initialization schemes:

Weights initialized to all zeros

Weights initialized to all ones

Weights initialized with values sampled from a uniform distribution with a fixed bound

Weights initialized with values sampled from a uniform distribution with a careful tweak

Weights initialized with values sampled from a normal distribution with a careful tweak

Finally, we are going to see the effects of the default weight initialization scheme that comes with tf.keras.

Experiment setup: The data and the model

To keep the experiments quick and consistent, let's fix the dataset and a simple model architecture. For experiments like this, my favorite dataset to start with is the FashionMNIST dataset. We will be using the following model architecture:

The model takes a flattened feature vector of shape (784,) and, after passing it through a set of dropout and dense layers, produces a prediction vector of shape (10,) corresponding to the probabilities of the 10 different classes present in the FashionMNIST dataset. This is the architecture we will use for all the experiments. We will use sparse_categorical_crossentropy as the loss function and the Adam optimizer.

Method 1: Weights initialized to all zeros

Let's first throw a weight vector of all zeros at our model and see how it performs over 10 epochs of training. In tf.keras, layers like Dense, Conv2D, and LSTM have two arguments: kernel_initializer and bias_initializer. This is where we can pass in any pre-defined initializer, or even a custom one. I would recommend taking a look at this documentation, which lists all the available initializers in tf.keras.

We can set the kernel_initializer argument of all the Dense layers in our model to zeros to initialize the weight vectors to all zeros. Since the bias is a scalar quantity, setting it to zeros does not matter as much as it would for the weights. In code, it would look like so:

```python
tf.keras.layers.Dense(256, activation='relu',
                      kernel_initializer=init_scheme,
                      bias_initializer='zeros')
```

Model performance with all weights initialized to all zeros

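A small numpy sketch (not part of the article's experiments) of why all-zero initialization is problematic: with a zero kernel and zero bias, every unit outputs 0 for every input, so the layer passes no information forward and every unit receives the identical gradient update.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=(4, 784))   # a small batch of flattened FashionMNIST-like inputs
W = np.zeros((784, 256))        # kernel_initializer='zeros'
b = np.zeros(256)               # bias_initializer='zeros'

h = np.maximum(x @ W + b, 0)    # ReLU(xW + b)
# Every unit outputs 0 regardless of the input.
print(bool(np.all(h == 0)))     # True
```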

Method 2: Weights initialized to all ones

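All-ones initialization hits a related symmetry problem: since every unit starts with identical weights, every unit computes the identical function of the input and receives the identical gradient. A small numpy sketch (illustrative, not the article's experiment code) of that symmetry:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 784))  # a small batch of flattened inputs
W = np.ones((784, 256))        # kernel_initializer='ones'
h = np.maximum(x @ W, 0)       # ReLU(xW)
# All 256 units produce the same output column, so the layer
# behaves like a single neuron duplicated 256 times.
print(bool(np.allclose(h, h[:, :1])))  # True
```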

Method 3: Weights initialized with values sampled from a uniform distribution

From a mathematical viewpoint, a neural net is nothing but a chain of functions applied on top of each other. In these functions, we generally multiply an input vector by a weight matrix and add a bias term to the product (think of broadcasting). We then pass the resulting vector through an activation function and proceed from there. Ideally, we want the weight values to be such that they do not cause a loss of information from the input vector. Since we are ultimately multiplying the weights with the input vector, we need to be careful: it's good practice to keep the weight values as small as possible, but not so small that they cause numerical instabilities.

In the earlier experiments, we saw that initializing our model with constant values is not a good idea. So, let's try initializing the weights with distinct small numbers in the [0, 1] range. We can do this by sampling values from a uniform distribution. A uniform distribution looks like so:

A uniform distribution within [-5, 5] range

Here's the catch with uniform distributions: every value in the range has an equal chance of being sampled.

Initializing a tf.keras Dense layer with a uniform distribution is a bit more involved than the previous two schemes. We would make use of the tf.keras.initializers.RandomUniform(minval=min_val, maxval=max_val, seed=seed) class here. In this case, we would supply 0 as the minval and 1 as the maxval; seed could be any integer of your choice. Let's see how it performs!
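To see what this scheme hands the layer, here is an equivalent numpy sketch (mine, not the article's code) of the sampling that RandomUniform(minval=0, maxval=1) performs for one Dense kernel:

```python
import numpy as np

rng = np.random.default_rng(42)
# one (784, 256) Dense kernel's worth of weights, sampled uniformly from [0, 1)
w = rng.uniform(0.0, 1.0, size=(784, 256))
print(bool(w.min() >= 0.0 and w.max() < 1.0))  # True: every value lies in [0, 1)
print(round(float(w.mean()), 1))               # 0.5: mass is spread evenly over the range
```

Note that every weight is positive and averages around 0.5, which is still quite large when 784 of them get summed per unit; the next method addresses exactly that.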


The Results So Far: Ones vs. Uniform Weights Accuracy Trade-Off


Quick Tangent: The recipe for initializing weights

As we saw in the previous experiment, having some randomness when initializing the weights in a neural net can clearly help. But could we control this randomness and provide some meaningful information to our model? What if we could pass in some information about the inputs we will feed to the model and have the weights somehow depend on that? We can! The following rule (from Udacity's lesson on Weight Initialization) helps us do so: sample the weights of a layer uniformly from [-y, y], where y = 1/sqrt(n) and n is the number of inputs to that layer.

Method 4: Weights initialized with values sampled from a uniform distribution with a careful tweak

So, instead of sampling values from a uniform distribution over the [0, 1] range, we replace the range with [-y, y]. There are a number of ways we could do this in tf.keras, but I found the following way to be the most customizable and readable:

```python
import numpy as np
import tensorflow as tf

# iterate over the layers of a given model
for layer in model.layers:
    # check if the layer is of type `Dense`
    if isinstance(layer, tf.keras.layers.Dense):
        # shapes are important for matrix multiplication
        shape = (layer.weights[0].shape[0], layer.weights[0].shape[1])
        # determine the `y` value from the number of inputs to the layer
        y = 1.0 / np.sqrt(shape[0])
        # sample the values and assign them as the layer's parameters
        rule_weights = np.random.uniform(-y, y, shape)  # weights
        bias = np.zeros(layer.weights[1].shape)         # bias
        layer.set_weights([rule_weights, bias])
```

Let's see how this performs.
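A small numpy sketch (my own check, not from the article) of why the 1/sqrt(n) bound is a sensible choice: a uniform distribution over [-y, y] has variance y^2/3 = 1/(3n), so summing n such terms keeps the pre-activation variance near 1/3 no matter how wide the layer's input is.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 784                                 # number of inputs to the layer
y = 1.0 / np.sqrt(n)                    # the rule's bound
x = rng.normal(size=(1000, n))          # unit-variance inputs
W = rng.uniform(-y, y, size=(n, 256))   # weights sampled per the rule
pre = x @ W                             # pre-activations
# Var(pre) ≈ n * Var(w) = n * y^2/3 = 1/3: activations stay O(1)
# instead of growing with the input dimension.
print(round(float(pre.var()), 2))       # close to 1/3
```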


Method 5: Weights initialized with values sampled from a normal distribution with a careful tweak

Let's start with the why: why use a normal distribution here? Earlier, I mentioned that smaller weight values might help a network train well. To keep the initial weight values close to 0, a normal distribution is better suited than a uniform one: in a uniform distribution every number in the range has an equal probability of being sampled, but in a normal distribution values near the mean are sampled far more often. We would take a normal distribution with a mean of 0 and set the standard deviation to y. As can be seen in the following figure (which mimics a normal distribution), most of the values are concentrated around the mean. In our case the mean is 0, so this should work the way we intend.

A sample normal distribution

The code for initializing the weights with this scheme is pretty much the same; we just swap the uniform rule for a normal one:

```python
import numpy as np
import tensorflow as tf

# iterate over the layers of a given model
for layer in model.layers:
    # check if the layer is of type `Dense`
    if isinstance(layer, tf.keras.layers.Dense):
        # shapes are important for matrix multiplication
        shape = (layer.weights[0].shape[0], layer.weights[0].shape[1])
        # determine the `y` value from the number of inputs to the layer
        y = 1.0 / np.sqrt(shape[0])
        # sample the values and assign them as the layer's parameters
        rule_weights = np.random.normal(0, y, shape)  # weights
        bias = np.zeros(layer.weights[1].shape)       # bias
        layer.set_weights([rule_weights, bias])
```

Here's how it performs
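A quick numpy sketch (an illustrative check of mine, not the article's experiment) of the concentration this buys: with N(0, y) samples, roughly 95% of the weights land within two standard deviations of the 0 mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 784
y = 1.0 / np.sqrt(n)                    # standard deviation from the rule
w = rng.normal(0.0, y, size=(n, 256))   # weights sampled from N(0, y)
# Unlike the uniform case, samples pile up around the 0 mean:
frac = float(np.mean(np.abs(w) < 2 * y))
print(round(frac, 2))  # ≈ 0.95 of the weights fall within two standard deviations
```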
