In this article, we will learn about autoencoders in deep learning. As a practical example, we will implement a denoising autoencoder in TensorFlow and apply it to the MNIST handwritten digits dataset.

1. What is an autoencoder?

An autoencoder is an unsupervised machine learning algorithm that takes an image as input and reconstructs it using fewer bits. That may sound like image compression, but the biggest difference between an autoencoder and a general-purpose image compression algorithm is that an autoencoder achieves its compression by learning from a training set. It compresses reasonably well when an image is similar to the training data, but autoencoders are poor general-purpose image compressors; JPEG compression will do vastly better.

Autoencoders are similar in spirit to dimensionality reduction techniques like principal component analysis. They create a space where the essential parts of the data are preserved, while non-essential (or noisy) parts are removed.

There are two parts to an autoencoder:

Encoder: This is the part of the network that compresses the input into fewer bits. The space represented by these bits is called the “latent space”, and the point of maximum compression is called the bottleneck. Together, the compressed bits that represent the original input are called an “encoding” of the input.

Decoder: This is the part of the network that reconstructs the input image from its encoding.

Let’s look at an example to understand the concept better.

Figure 1: 2-layer autoencoder

In the above picture, we show a vanilla autoencoder: a 2-layer autoencoder with one hidden layer. The input and output layers have the same number of neurons. We feed five real values into the autoencoder, and the encoder compresses them into three real values at the bottleneck (middle layer). Using these three real values, the decoder tries to reconstruct the five real values that we fed as input to the network.
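The 5-to-3-to-5 example above can be sketched in a few lines of NumPy. This is only an illustration of the data flow, not the network used later in this post: the weights are random and untrained, and the names W_enc, W_dec, encode, and decode are hypothetical.

```python
import numpy as np

# Hypothetical toy setup: random, untrained weights for a 5 -> 3 -> 5 autoencoder.
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(5, 3))  # encoder weights: 5 inputs -> 3 bottleneck values
b_enc = np.zeros(3)
W_dec = rng.normal(size=(3, 5))  # decoder weights: 3 bottleneck values -> 5 outputs
b_dec = np.zeros(5)


def encode(x):
    """Compress a 5-value input into a 3-value encoding."""
    return np.tanh(x @ W_enc + b_enc)


def decode(z):
    """Map a 3-value encoding back to 5 output values."""
    return z @ W_dec + b_dec


x = np.array([0.1, 0.5, 0.9, 0.3, 0.7])  # five real input values
z = encode(x)                            # three values at the bottleneck
x_hat = decode(z)                        # five reconstructed values

print(z.shape, x_hat.shape)
```

Training would adjust the weights so that x_hat gets close to x; here the reconstruction is meaningless because nothing has been learned yet.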

In practice, there are usually many more hidden layers between the input and the output.

There are various kinds of autoencoders, such as sparse, variational, and denoising autoencoders. In this post, we will learn about the denoising autoencoder.

2. Denoising Autoencoder

Figure 2: Denoising autoencoder

The idea behind a denoising autoencoder is to learn a representation (latent space) that is robust to noise. We add noise to an image and then feed this noisy image as input to our network. The encoder part of the autoencoder transforms the image into a different space that preserves the handwritten digits but removes the noise. As we will see later, the original image is 28 x 28 x 1, and the transformed image is 7 x 7 x 32. You can think of the 7 x 7 x 32 image as a 7 x 7 image with 32 color channels.

The decoder part of the network then reconstructs the original image from this 7 x 7 x 32 image and, voila, the noise is gone!

How does this magic happen?

During training, we define a loss (cost function) to minimize the difference between the reconstructed image and the original noise-free image. In other words, we learn a 7 x 7 x 32 space that is noise free.
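The key point of the training objective is that the reconstruction is compared with the clean image, not with the noisy input. The sketch below illustrates this with a mean squared error, chosen here only for simplicity (the implementation later in this post uses a cross-entropy loss). The function name denoising_loss and the random stand-in image are assumptions made for illustration.

```python
import numpy as np


def denoising_loss(reconstruction, clean_image):
    """Mean squared difference between a reconstruction and the clean target."""
    return float(np.mean((reconstruction - clean_image) ** 2))


rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 1.0, size=(28, 28, 1))  # stand-in for a clean image
noisy = np.clip(clean + 0.5 * rng.normal(size=clean.shape), 0.0, 1.0)

# A "network" that simply copies its noisy input is penalized by the loss...
loss_copy_input = denoising_loss(noisy, clean)
# ...while a perfect denoiser would reach a loss of zero.
loss_perfect = denoising_loss(clean, clean)

print(loss_copy_input > loss_perfect)
```

Because copying the input is never optimal under this objective, the network is pushed to learn which parts of the input are signal and which are noise.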

3. Implementation of Denoising Autoencoder

This implementation is inspired by this excellent post Building Autoencoders in Keras.

3.1 The Network

The images are matrices of size 28 x 28. We reshape each image to 28 x 28 x 1, rescale its values to lie between 0 and 1, and feed it as input to the network. The encoder transforms the 28 x 28 x 1 image into a 7 x 7 x 32 image. You can think of this 7 x 7 x 32 image as a point in a 1568-dimensional space (because 7 x 7 x 32 = 1568). This 1568-dimensional space is called the bottleneck or the latent space. The architecture is shown graphically below.

Figure 3: Architecture of encoder model

The decoder does the exact opposite of the encoder; it transforms this 1568-dimensional vector back into a 28 x 28 x 1 image. We call this output image a “reconstruction” of the original image. The structure of the decoder is shown below.

Figure 4: Architecture of decoder model

Let’s dive into the implementation of an autoencoder using TensorFlow.

3.2 Encoder

The encoder has two convolutional layers and two max-pooling layers. Convolution layer-1 and Convolution layer-2 each have 32 filters of size 3 x 3. Each max-pooling layer uses a 2 x 2 window with a stride of 2.

import tensorflow as tf
from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Conv2D, Conv2DTranspose, MaxPooling2D

# note: lrelu is the leaky ReLU function defined in section 3.4;
# it must be defined before this block is run
encoder = Sequential([
    # convolution
    Conv2D(
        filters=32,
        kernel_size=(3,3),
        strides=(1,1),
        padding='SAME',
        use_bias=True,
        activation=lrelu,
        name='conv1'
    ),
    # the input size is 28x28x32
    MaxPooling2D(
        pool_size=(2,2),
        strides=(2,2),
        name='pool1'
    ),
    # the input size is 14x14x32
    Conv2D(
        filters=32,
        kernel_size=(3,3),
        strides=(1,1),
        padding='SAME',
        use_bias=True,
        activation=lrelu,
        name='conv2'
    ),
    # the input size is 14x14x32
    MaxPooling2D(
        pool_size=(2,2),
        strides=(2,2),
        name='encoding'
    )
    # the output size is 7x7x32
])

Figure 5: Encoder block diagram
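To see where the 7 x 7 x 32 encoding comes from, we can trace the spatial size through the encoder by hand. The sketch below assumes the standard output-size rules for 'SAME'-padded convolutions and for unpadded max pooling; the helper names conv_same and max_pool are hypothetical and not part of TensorFlow.

```python
import math


def conv_same(size, stride=1):
    """Spatial output size of a convolution with 'SAME' padding."""
    return math.ceil(size / stride)


def max_pool(size, pool=2, stride=2):
    """Spatial output size of max pooling without padding."""
    return (size - pool) // stride + 1


size = 28          # MNIST images are 28 x 28
trace = [size]
for layer in ("conv1", "pool1", "conv2", "encoding"):
    # layers named conv* are convolutions, the other two are pooling layers
    size = conv_same(size) if layer.startswith("conv") else max_pool(size)
    trace.append(size)

filters = 32
encoding_values = size * size * filters

print(trace)            # spatial size at the input and after each layer
print(encoding_values)  # number of values in the encoding
```

The convolutions keep the spatial size unchanged, and each pooling layer halves it, which gives 28, 14, and finally 7.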

3.3 Decoder

The decoder has two Conv2d_transpose layers, two Convolution layers, and one Sigmoid activation function. Conv2d_transpose performs upsampling, which reverses the downsampling done in the encoder. Each Conv2d_transpose layer upsamples its input by a factor of two.

decoder = Sequential([
    Conv2D(
        filters=32,
        kernel_size=(3,3),
        strides=(1,1),
        name='conv3',
        padding='SAME',
        use_bias=True,
        activation=lrelu
    ),
    # upsampling, the input size is 7x7x32
    Conv2DTranspose(
        filters=32,
        kernel_size=3,
        padding='same',
        strides=2,
        name='upsample1'
    ),
    # upsampling, the input size is 14x14x32
    Conv2DTranspose(
        filters=32,
        kernel_size=3,
        padding='same',
        strides=2,
        name='upsample2'
    ),
    # the input size is 28x28x32
    Conv2D(
        filters=1,
        kernel_size=(3,3),
        strides=(1,1),
        name='logits',
        padding='SAME',
        use_bias=True
    )
])

Figure 6: Decoder Block Diagram
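The same kind of hand trace works for the decoder. The sketch below assumes that a transposed convolution with 'same' padding multiplies the spatial size by its stride, which is how the two upsampling layers take the encoding from 7 to 14 to 28. The helper names are hypothetical.

```python
def conv_same(size, stride=1):
    """Spatial output size of a convolution with 'SAME' padding."""
    return -(-size // stride)  # ceiling division


def conv_transpose_same(size, stride):
    """Spatial output size of a transposed convolution with 'same' padding."""
    return size * stride


size = 7  # spatial size of the encoding
trace = [size]

size = conv_same(size)               # conv3: size unchanged
trace.append(size)
size = conv_transpose_same(size, 2)  # upsample1: 7 -> 14
trace.append(size)
size = conv_transpose_same(size, 2)  # upsample2: 14 -> 28
trace.append(size)
size = conv_same(size)               # logits: size unchanged, 1 output channel
trace.append(size)

print(trace)
```

The final layer has a single filter, so the decoder output is 28 x 28 x 1, the same shape as the input image.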

The resultant encoder-decoder model class is represented as:

# model class definition
class EncoderDecoderModel(Model):
    def __init__(self, is_sigmoid=False):
        super(EncoderDecoderModel, self).__init__()
        # assign encoder sequence
        self._encoder = encoder
        # assign decoder sequence
        self._decoder = decoder
        self._is_sigmoid = is_sigmoid

    # forward pass
    def call(self, x):
        x = self._encoder(x)
        decoded = self._decoder(x)
        if self._is_sigmoid:
            decoded = tf.keras.activations.sigmoid(decoded)
        return decoded

Finally, we compute the loss on the output with a cross-entropy loss function and minimize it with the Adam optimizer.
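Since the decoder outputs logits rather than probabilities, the cross-entropy is computed directly from the logits. The NumPy sketch below shows the numerically stable form max(x, 0) - x * z + log(1 + exp(-|x|)), where x is the logit and z is the target pixel value, and checks it against the textbook definition. Both function names are hypothetical; this is an illustration of the formula, not TensorFlow's own code.

```python
import numpy as np


def stable_sigmoid_cross_entropy(labels, logits):
    """Element-wise cross-entropy from logits, in a numerically stable form."""
    return np.maximum(logits, 0.0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))


def naive_sigmoid_cross_entropy(labels, logits):
    """Textbook definition: apply the sigmoid first, then take the logs."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))


labels = np.array([0.0, 0.5, 1.0])   # target pixel intensities in [0, 1]
logits = np.array([-2.0, 0.0, 3.0])  # raw decoder outputs

stable = stable_sigmoid_cross_entropy(labels, logits)
naive = naive_sigmoid_cross_entropy(labels, logits)
mean_loss = float(np.mean(stable))   # averaged over all pixels, as in training

print(np.allclose(stable, naive))
```

The stable form avoids computing log(0) when a logit is very large in magnitude, which the textbook version cannot handle.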

3.4 Why do we use a leaky ReLU and not a ReLU as an activation function?

We want gradients to flow while we backpropagate through the network. When we stack many layers, some neurons will have values that drop to zero or become negative. A ReLU activation clips negative values to zero, and in the backward pass the gradients do not flow through those neurons. As a result, their weights do not get updated, and the network stops learning for those values. So using ReLU is not always a good idea. However, we encourage you to change the activation function to ReLU and see the difference.

# define leaky ReLU function
def lrelu(x, alpha=0.1):
    return tf.math.maximum(alpha*x, x)

Therefore, we use a leaky ReLU, which, instead of clipping negative values to zero, scales them by a small factor set by the hyperparameter alpha. This ensures that the network keeps learning even when a neuron's input is below zero.
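The difference between the two activations is easiest to see in their gradients. The following NumPy sketch compares them on a few sample values; the function names are hypothetical, and alpha=0.1 is the same default value as in the lrelu function above.

```python
import numpy as np


def relu(x):
    """Standard ReLU: negative values are clipped to zero."""
    return np.maximum(0.0, x)


def leaky_relu(x, alpha=0.1):
    """Leaky ReLU: negative values are scaled by alpha instead of clipped."""
    return np.maximum(alpha * x, x)


def relu_grad(x):
    """Gradient of ReLU: zero for negative inputs, so nothing flows back."""
    return (x > 0).astype(float)


def leaky_relu_grad(x, alpha=0.1):
    """Gradient of leaky ReLU: a small but nonzero slope for negative inputs."""
    return np.where(x > 0, 1.0, alpha)


x = np.array([-2.0, -0.5, 1.0, 3.0])

print(relu(x), relu_grad(x))
print(leaky_relu(x), leaky_relu_grad(x))
```

For the two negative inputs, the ReLU gradient is exactly zero, while the leaky ReLU still passes back a gradient of alpha.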

3.5 Load the data

Once the architecture has been defined, we load the training and validation data.

As shown below, TensorFlow allows us to easily load the MNIST data. The training and testing images are stored in the variables train_imgs and test_imgs respectively. Since it’s an unsupervised task, we do not care about the labels.

# load mnist dataset
(train_imgs, train_labels), (test_imgs, test_labels) = tf.keras.datasets.mnist.load_data()

# fit image pixel values from 0 to 1
train_imgs, test_imgs = train_imgs / 255.0, test_imgs / 255.0

3.6 Data Analysis

Before training a neural network, it is always a good idea to do a sanity check on the data.

Let’s see what the data looks like. The data consists of handwritten digits ranging from 0 to 9, along with their ground truth labels. As loaded here, it has 60,000 training samples and 10,000 test samples. Each sample is a 28×28 grayscale image. Let’s view the data details:

# check data array shapes:
print("Size of train images: {}, Number of train images: {}".format(train_imgs.shape[-2:], train_imgs.shape[0]))
print("Size of test images: {}, Number of test images: {}".format(test_imgs.shape[-2:], test_imgs.shape[0]))

The output is:

Size of train images: (28, 28), Number of train images: 60000
Size of test images: (28, 28), Number of test images: 10000

Here is how to visualize example train and test images:

import matplotlib.pyplot as plt

# plot image example from training images
plt.imshow(train_imgs[1], cmap='Greys')
plt.show()

# plot image example from test images
plt.imshow(test_imgs[0], cmap='Greys')
plt.show()
plt.close()

Output:

Figure 7: Train and test MNIST images

3.7 Preprocessing the data

The images are grayscale, and the original pixel values range from 0 to 255. We apply the following preprocessing to the data before feeding it to the network.

Add a new dimension to the train and test images, which will be fed into the network.

# prepare training reference images: add new dimension
train_imgs_data = train_imgs[..., tf.newaxis]

# prepare test reference images: add new dimension
test_imgs_data = test_imgs[..., tf.newaxis]

Add noise to both the train and test images, which we then feed into the network. The noise factor is a hyperparameter and can be tuned accordingly.

import numpy as np

# add noise to the images for train and test cases
def distort_image(input_imgs, noise_factor=0.5):
    noisy_imgs = input_imgs + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=input_imgs.shape)
    noisy_imgs = np.clip(noisy_imgs, 0., 1.)
    return noisy_imgs

# prepare distorted input data for training
train_noisy_imgs = distort_image(train_imgs_data)

# prepare distorted input data for evaluation
test_noisy_imgs = distort_image(test_imgs_data)
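Because the noise is random, every run produces different noisy images. If reproducible experiments matter, one option is to draw the noise from a seeded generator. The sketch below is a hypothetical variant of the noise function (the name distort_image_seeded and the seed parameter are assumptions, not part of the original code), applied to a small stand-in batch.

```python
import numpy as np


def distort_image_seeded(input_imgs, noise_factor=0.5, seed=0):
    """Add clipped Gaussian noise drawn from a seeded generator."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=1.0, size=input_imgs.shape)
    return np.clip(input_imgs + noise_factor * noise, 0.0, 1.0)


# Stand-in batch of four blank images with the extra channel dimension added.
imgs = np.zeros((4, 28, 28))[..., np.newaxis]

noisy_a = distort_image_seeded(imgs, seed=42)
noisy_b = distort_image_seeded(imgs, seed=42)

print(imgs.shape)                        # shape after adding the new axis
print(np.array_equal(noisy_a, noisy_b))  # same seed gives the same noise
```

Clipping keeps the noisy pixels in the same 0 to 1 range as the clean images, so the network sees inputs on a consistent scale.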

Let’s look at the noisy images:

# plot distorted image example from training images
image_id_to_plot = 0
plt.imshow(tf.squeeze(train_noisy_imgs[image_id_to_plot]), cmap='Greys')
plt.title("The number is: {}".format(train_labels[image_id_to_plot]))
plt.show()

# plot distorted image example from test images
plt.imshow(tf.squeeze(test_noisy_imgs[image_id_to_plot]), cmap='Greys')
plt.title("The number is: {}".format(test_labels[image_id_to_plot]))
plt.show()
plt.close()

Output:

Figure 8: Noisy train and test MNIST images

3.8 Train and evaluate the model

The network is ready to be trained. We set the number of epochs to 25 and the batch size to 64. This means that the whole dataset will be fed to the network 25 times. We will use the test data for validation.

# define custom target function for further minimization
def cost_function(labels=None, logits=None, name=None):
    loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits, name=name)
    return tf.reduce_mean(loss)

# init the model
encoder_decoder_model = EncoderDecoderModel()

# training loop params
num_epochs = 25
batch_size_to_set = 64

# training process params
learning_rate = 1e-5

# default number of workers for training process
num_workers = 2

# initialize the training configurations such as optimizer, loss function and accuracy metrics
encoder_decoder_model.compile(
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate),
    loss=cost_function,
    metrics=None
)

results = encoder_decoder_model.fit(
    train_noisy_imgs,
    train_imgs_data,
    epochs=num_epochs,
    batch_size=batch_size_to_set,
    validation_data=(test_noisy_imgs, test_imgs_data),
    workers=num_workers,
    shuffle=True
)
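It can help to work out how many weight updates these settings imply. The short calculation below uses the numbers from this section (60,000 training images, a batch size of 64, and 25 epochs) and assumes that the last, smaller batch of each epoch is kept rather than dropped.

```python
import math

num_train_images = 60000  # size of the MNIST training set loaded above
batch_size = 64
num_epochs = 25

# One epoch is a full pass over the data; the final batch may be smaller.
steps_per_epoch = math.ceil(num_train_images / batch_size)
last_batch_size = num_train_images - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * num_epochs

print(steps_per_epoch)  # optimizer steps in one epoch
print(last_batch_size)  # images in the final batch of each epoch
print(total_updates)    # optimizer steps over the whole training run
```

With the small learning rate of 1e-5 used here, this many updates is what gives the loss time to come down.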

After 25 epochs, both the training loss and the validation loss are quite low, which means our network did a pretty good job. Let’s now plot the training and validation losses using the utility function plot_losses(results), defined in the next section.

3.10 Training Vs. Validation Loss Plot

We define a utility function for plotting the losses:

# function for train and val losses visualizations
def plot_losses(results):
    plt.plot(results.history['loss'], 'bo', label='Training loss')
    plt.plot(results.history['val_loss'], 'r', label='Validation loss')
    plt.title('Training and validation loss',fontsize=14)
    plt.xlabel('Epochs ',fontsize=14)
    plt.ylabel('Loss',fontsize=14)
    plt.legend()
    plt.show()
    plt.close()

# visualize train and val losses
plot_losses(results)

The result is:

Figure 9: Training and validation losses

From the above loss plot, we can observe that the validation loss and the training loss both decrease steadily over the first ten epochs. The training loss and the validation loss are also very close to each other, which suggests that our model generalizes well to unseen test data.

We can further validate our results by looking at the original, noisy, and reconstructed test images.
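The same reading of the plot can be done numerically. The sketch below is a hypothetical helper (summarize_history is not part of Keras) that reports the epoch with the lowest validation loss and the final gap between the two curves. The loss values in the example are made up for illustration and are not the results of this experiment.

```python
def summarize_history(train_loss, val_loss):
    """Return the best validation epoch (1-based) and the final train/val gap."""
    best_index = min(range(len(val_loss)), key=lambda i: val_loss[i])
    return {
        "best_epoch": best_index + 1,
        "final_gap": val_loss[-1] - train_loss[-1],
    }


# Made-up loss curves, for illustration only.
example_train_loss = [0.30, 0.20, 0.15, 0.12]
example_val_loss = [0.31, 0.21, 0.16, 0.13]

summary = summarize_history(example_train_loss, example_val_loss)
print(summary)
```

With a real run, the two lists would come from results.history['loss'] and results.history['val_loss']. A small gap supports the conclusion that the model is not overfitting, while a validation loss that starts rising as the training loss keeps falling would point the other way.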

3.11 Results

Figure 10: Representation of MNIST images on different stages

From the above figures, we can observe that our model did a good job of denoising the noisy images that we fed into it.
