I recently read this paper on the Inception architecture, and I am compiling my notes on it here. This idea of taking notes on research papers for future reference is inspired by Jae Duk Seo.

Rethinking the Inception Architecture for Computer Vision

A Little History of ConvNets Before Starting

Before Convolutional Neural Networks were used to classify images, Computer Vision relied on hand-crafted feature extraction followed by a separate classifier. Feature extraction is a time-consuming process, and the features have to be selected carefully according to the data. Hence, it becomes a cumbersome task to tailor features to the images every time.

After the rise of Deep Learning, Convolutional Neural Networks started getting used for image classification, image segmentation, and other Computer Vision tasks.

What is ImageNet?

ImageNet is formally a project aimed at (manually) labeling and categorizing images into almost 22,000 separate object categories for the purpose of computer vision research.

However, when we hear the term “ImageNet” in the context of deep learning and Convolutional Neural Networks, we are likely referring to the ImageNet Large Scale Visual Recognition Challenge, or ILSVRC for short.

The goal of this image classification challenge is to train a model that can correctly classify an input image into 1,000 separate object categories.

Models are trained on ~1.2 million training images with another 50,000 images for validation and 100,000 images for testing.

These 1,000 image categories represent object classes that we encounter in our day-to-day lives, such as species of dogs, cats, various household objects, vehicle types, and much more. You can find the full list of object categories in the ILSVRC challenge here.

When it comes to image classification, the ImageNet challenge is the de facto benchmark for computer vision classification algorithms.

AlexNet — This was one of the first CNN architectures to achieve breakthrough results on ImageNet (it won ILSVRC 2012). It is composed of 5 convolutional layers followed by 3 fully connected layers. It popularized the ReLU activation function, which helps mitigate the vanishing gradient problem, and it used dropout as a regularization technique. The architecture uses 11x11, 5x5, and 3x3 convolutions, max pooling, data augmentation, and SGD with momentum, with ReLU activations after every convolutional and fully-connected layer.

VGG16 — VGGNet was introduced in the paper Very Deep Convolutional Networks for Large-Scale Image Recognition. And it was indeed a very deep architecture for its time: VGG16 consists of 16 weight layers (13 convolutional and 3 fully connected).

It uses only 3x3 convolutional filters and max pooling, and has a very uniform architecture.

This matters because architectural improvements in deep convolutional networks can be carried over to most other computer vision tasks, which increasingly rely on high-quality, learned visual features.

Introduction

AlexNet gave good accuracy even on tasks that, under the old approach, would have required the formulation of special hand-crafted features. But AlexNet has a huge number of parameters, which makes it computationally very expensive. And although VGGNet's architecture is more uniform and simple, it has about 3x more parameters than AlexNet.

GoogLeNet (the Inception architecture), by contrast, has a much lower computational cost, so it can be used where memory or computing power is limited, or in big-data scenarios. But since this architecture is more complex, it is harder to modify and improve. The paper lays out some general principles and techniques for doing so, which we will look at next.

General Design Principles

Some general principles for CNN architecture based on experimental evidence are:

1. Avoid representational bottlenecks caused by extreme compression, especially early in the network; the representation size should decrease gradually from input to output.

2. Higher-dimensional representations are easier to process locally: increasing the activations per tile allows for more disentangled features, and the resulting networks train faster.

3. Spatial aggregation (e.g. a 3x3 convolution) can be done over lower-dimensional embeddings, so the dimension of the input can be reduced first without much loss of representational power (see the sketch after this list).

4. Balance the width and depth of the network; the computational budget should be distributed between the two.
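To make the third principle concrete, here is a minimal PyTorch sketch of reducing the channel dimension with a 1x1 convolution before the 3x3 spatial aggregation. The channel counts and feature-map size are my own assumptions for illustration, not values taken from the paper.

    import torch
    import torch.nn as nn

    # Reduce channels with a 1x1 "bottleneck" convolution, then do the 3x3
    # spatial aggregation on the cheaper, lower-dimensional embedding.
    reduce_then_aggregate = nn.Sequential(
        nn.Conv2d(256, 64, kernel_size=1),             # 256 -> 64 channels (assumed sizes)
        nn.ReLU(inplace=True),
        nn.Conv2d(64, 256, kernel_size=3, padding=1),  # 3x3 aggregation back to 256 channels
        nn.ReLU(inplace=True),
    )

    x = torch.randn(1, 256, 35, 35)        # dummy 35x35 feature map
    print(reduce_then_aggregate(x).shape)  # torch.Size([1, 256, 35, 35])

The 3x3 convolution now operates on 64 channels instead of 256, which is where the computational saving comes from.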

Factorizing Convolutions with Large Filter Size

The Inception architecture was introduced in the paper Going Deeper with Convolutions. An Inception module runs filters of several sizes in parallel and concatenates their outputs. Apart from this concatenation, it also introduced a dimensionality reduction technique using 1x1 convolutions (discussed a bit further below). A simplified sketch of such a module follows.
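Here is a simplified PyTorch sketch of an Inception-style module. The branch channel counts below are illustrative choices of mine, not necessarily the exact ones used in GoogLeNet.

    import torch
    import torch.nn as nn

    class InceptionBlock(nn.Module):
        """A simplified Inception-style module: parallel branches, concatenated outputs."""
        def __init__(self, in_ch):
            super().__init__()
            self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)   # plain 1x1 branch
            self.branch3 = nn.Sequential(                        # 1x1 reduction, then 3x3
                nn.Conv2d(in_ch, 96, kernel_size=1),
                nn.Conv2d(96, 128, kernel_size=3, padding=1),
            )
            self.branch5 = nn.Sequential(                        # 1x1 reduction, then 5x5
                nn.Conv2d(in_ch, 16, kernel_size=1),
                nn.Conv2d(16, 32, kernel_size=5, padding=2),
            )
            self.branch_pool = nn.Sequential(                    # max pool, then 1x1 projection
                nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
                nn.Conv2d(in_ch, 32, kernel_size=1),
            )

        def forward(self, x):
            # Concatenate all branch outputs along the channel dimension.
            return torch.cat([self.branch1(x), self.branch3(x),
                              self.branch5(x), self.branch_pool(x)], dim=1)

    x = torch.randn(1, 192, 28, 28)
    print(InceptionBlock(192)(x).shape)  # torch.Size([1, 256, 28, 28]) -> 64+128+32+32 channels

Note how the 1x1 convolutions in the 3x3 and 5x5 branches shrink the channel dimension before the expensive spatial filters are applied.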

But convolutions with large filter sizes like 5x5 and 7x7 are computationally very expensive. What if we replace the single 5x5 filter with a stack of two 3x3 convolutional layers, like this -

As you can visualize, this two-layer network covers the same 5x5 receptive field as a single 5x5 convolution, but with fewer parameters and therefore less computation. Hence the computation drops to (9+9)/25 = 18/25 of the original, a saving of about 28%.
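We can sanity-check this arithmetic with a small PyTorch snippet. The channel count C here is an arbitrary assumption for illustration; the 18/25 ratio holds regardless of its value (ignoring biases).

    import torch.nn as nn

    C = 64  # assumed number of input and output channels, for illustration only

    # One 5x5 convolution vs. a stack of two 3x3 convolutions with the same receptive field.
    conv5x5   = nn.Conv2d(C, C, kernel_size=5, padding=2, bias=False)
    conv3x3x2 = nn.Sequential(
        nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
        nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    )

    def n_params(module):
        return sum(p.numel() for p in module.parameters())

    print(n_params(conv5x5))                        # 25 * C * C = 102400
    print(n_params(conv3x3x2))                      # (9 + 9) * C * C = 73728
    print(n_params(conv3x3x2) / n_params(conv5x5))  # 0.72, i.e. 18/25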