Let’s face it. All of us are bad at spelling (maybe you’re the exception 🤷). Thier? Their? Definetly? Definitely?

Either way, our brains still understand that these are the words their and definitely.

Let’s try taking this example of a face.

A normal face

What if I start moving the mouth to the forehead and the eyes to the chin? It might be harder to tell, but it’s still a face.

A deformed face

Convolutional Neural Networks (CNNs) do a great job of classifying images (like the one above) by understanding the different features of the face. They do this by examining thousands of sample images and learning from their mistakes. Convolutional layers near the start of the network detect basic features (e.g. edges), while deeper layers detect more detailed features (e.g. the eyes, mouth, etc.)

CNN architecture layout

However, you might be wondering, how does the network do this efficiently with all that data?

Max-pooling Layers

Between convolutional layers sits a max-pooling layer. These layers take the most active neurons from each convolutional layer and pass them on to the next one, which means the less active neurons are dropped. Dropping these neurons is the reason spatial information gets lost as data progresses through the network.
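To make this concrete, here's a minimal NumPy sketch of 2×2 max-pooling (the window size and toy activation values are illustrative):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max-pooling: keep only the most active neuron in each window."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A toy 4x4 activation map: the strong activation 0.9 survives,
# but its exact position inside its 2x2 window is lost.
fmap = np.array([[0.1, 0.9, 0.2, 0.0],
                 [0.3, 0.4, 0.1, 0.5],
                 [0.0, 0.2, 0.8, 0.1],
                 [0.6, 0.1, 0.3, 0.2]])
pooled = max_pool_2x2(fmap)
print(pooled)  # [[0.9, 0.5], [0.6, 0.8]]
```

Notice that after pooling we can no longer tell where inside each window the strong activation occurred — that is exactly the spatial information being thrown away.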

It’s ironic how well max-pooling layers function. By dropping neurons, you would expect the accuracy to drop, but this practice works so well that Geoffrey Hinton even said,

“The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster” — Geoffrey Hinton

This also presents another problem with CNNs. What goes on in each hidden layer and how do these layers communicate with each other?

We need another method that performs the functions of a CNN while improving the architecture for more robust classification. That's where capsule neural networks come into play.

How do they work?

Capsule networks use capsules in place of the neurons of a standard neural network. A capsule encapsulates all the important information about a feature and outputs a vector. Unlike neurons, which output a scalar quantity, capsules can keep track of the orientation of a feature: if we change the position of the feature, the length of the output vector stays the same, while its direction changes to reflect the new position.
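As a toy illustration (the 2-D vector and the rotation are hypothetical — real capsules are higher-dimensional and learned):

```python
import numpy as np

# Hypothetical capsule output: a 2-D pose vector whose length encodes
# the probability that the feature is present.
v = np.array([0.6, 0.0])          # feature detected, pointing "right"
print(np.linalg.norm(v))          # length 0.6 -> probability-like score

# Move/rotate the feature (e.g. the mouth shifts): direction changes...
theta = np.pi / 4
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
v_moved = rot @ v
# ...but the length (how confident we are the feature exists) does not.
print(np.linalg.norm(v_moved))    # still ~0.6
```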

Pros

1. Performs very well on smaller datasets (e.g. MNIST)
2. Easier to interpret, with more robust representations of images
3. Keeps all the information (pose, texture, location, etc.)

Cons

1. Doesn’t perform as well on larger datasets (e.g. CIFAR10)
2. The routing-by-agreement algorithm requires more time to compute

Architecture:

A capsule network (CapsNet) contains an encoder and a decoder. Together, the encoder and decoder contain 6 layers.

Encoder Architecture

The first three layers form the encoder, which is responsible for taking the input image and converting it into a 16-dimensional vector. The three layers that make up the encoder of a CapsNet are the following:

1. Convolutional layer
2. PrimaryCaps network
3. DigitCaps network

The first layer is responsible for extracting the basic features of the image (i.e. finding the strokes that make up the number 4).

The second layer (PrimaryCaps) takes these basic features and finds more detailed patterns between them (e.g. the spatial relationship between each stroke). The number of capsules in this layer varies with the dataset; for MNIST, there are 32 channels of capsules to hold the most relevant information.

Just like in the second layer, the number of capsules in the third layer (the DigitCaps layer) varies. For MNIST, we have 10 capsules: one for each digit.
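Putting the three layers together, we can sanity-check the MNIST shapes from the original CapsNet setup with a few lines of arithmetic (a sketch, assuming the paper's 9×9 "valid" convolutions with strides 1 and 2):

```python
# Shape walk-through of the CapsNet encoder on MNIST
# (layer sizes taken from the original Sabour et al. setup).

def conv_out(size, kernel, stride):
    # Output spatial size of a "valid" (no padding) convolution.
    return (size - kernel) // stride + 1

s1 = conv_out(28, kernel=9, stride=1)   # Conv layer: 28x28 -> 20x20, 256 channels
s2 = conv_out(s1, kernel=9, stride=2)   # PrimaryCaps conv: 20x20 -> 6x6
n_primary = s2 * s2 * 32                # 32 channels -> 6*6*32 = 1152 8-D capsules
print(s1, s2, n_primary)                # 20 6 1152
```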

MNIST handwritten digits

To determine which capsule from the PrimaryCaps layer routes to which capsule in the DigitCaps layer, the output of the lower-level capsule (PrimaryCaps) must agree with the output of the higher-level capsule (DigitCaps). (We'll discuss this in more detail later.)

Decoder Architecture

At the end of the encoder, we have a 16-D vector, which gets passed to the decoder. The decoder consists of three fully connected layers. Its main job is to take the 16-D vector and reconstruct the original image from that data alone. This makes the network more robust, because it must generate predictions based on its own internal representation.
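A minimal NumPy sketch of the decoder's forward pass, assuming the MNIST layer sizes from the original paper (512, 1024, and 784 units) and randomly initialized, untrained weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stand-in weights for the three fully connected layers
# (16 -> 512 -> 1024 -> 784, the MNIST sizes from the paper).
W1 = rng.standard_normal((16, 512)) * 0.01
W2 = rng.standard_normal((512, 1024)) * 0.01
W3 = rng.standard_normal((1024, 784)) * 0.01

def decode(v):
    """Reconstruct a 28x28 image from a 16-D DigitCaps vector."""
    h = relu(v @ W1)
    h = relu(h @ W2)
    return sigmoid(h @ W3).reshape(28, 28)  # pixel intensities in (0, 1)

image = decode(rng.standard_normal(16))
print(image.shape)  # (28, 28)
```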

Decoder architecture breakdown

The components:

With that in mind, there are four main computations in a capsule neural network.

1. Matrix multiplication
2. Scalar weighting
3. Dynamic routing algorithm
4. Vector-to-vector nonlinearity

Matrix multiplication.

We multiply the output of each lower-level capsule by a learned weight matrix as the information passes from the first capsule layer to the second. This encodes the spatial relationships between features: each resulting prediction vector represents what the lower-level capsule expects the higher-level capsule's output to be.
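This step can be sketched with NumPy (the shapes follow the MNIST setup; the random weights stand in for learned ones):

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, d_in = 1152, 8      # PrimaryCaps: 1152 capsules, 8-D each (MNIST sizes)
n_out, d_out = 10, 16     # DigitCaps: 10 capsules, 16-D each

u = rng.standard_normal((n_in, d_in))                # lower-level capsule outputs
W = rng.standard_normal((n_in, n_out, d_out, d_in))  # one matrix per (lower, higher) pair

# Prediction vectors: each lower capsule i predicts the output of
# every higher capsule j through its own transformation matrix W[i, j].
u_hat = np.einsum('ijab,ib->ija', W, u)
print(u_hat.shape)  # (1152, 10, 16)
```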

Scalar weighting.

Remember how I talked about the lower-level capsules having to agree with the higher-level ones? In this stage, the capsules from the lower level adjust their weights to match those of the higher level.

You might be wondering: the higher-level capsules receive a bunch of inputs, so how do they pick the right lower-level capsules?

Well, the higher-level capsules look at the distribution of the incoming predictions and accept the tightest cluster, i.e. the predictions that agree with each other.

How do they communicate?

Dynamic routing algorithm.

Using the dynamic routing algorithm, the layers communicate with each other. The algorithm implements the idea of "routing by agreement", which is what we talked about earlier: the contents of the lower-level capsules must agree with those of the higher-level capsules. Dynamic routing lets us pass data between layers more intelligently, although each routing iteration adds computation. In practice, we use three routing iterations, because over-iterating leads to overfitting and poor performance.
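A NumPy sketch of the routing procedure, assuming the formulation from the original paper (uniform initial routing logits, softmax coupling coefficients, and a dot-product agreement update, run for three iterations):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Shrink vector lengths below 1 while keeping their direction.
    norm2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, iterations=3):
    """Routing-by-agreement over prediction vectors u_hat[i, j, :]."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))              # routing logits, start uniform
    for _ in range(iterations):
        c = softmax(b, axis=1)               # coupling coefficients per lower capsule
        s = (c[..., None] * u_hat).sum(0)    # weighted sum into each higher capsule
        v = squash(s)                        # higher-level capsule outputs
        b += (u_hat * v[None]).sum(-1)       # agreement: dot(u_hat, v) raises b
    return v

u_hat = np.random.default_rng(0).standard_normal((1152, 10, 16))
v = dynamic_routing(u_hat)
print(v.shape)  # (10, 16)
```

Predictions that agree with a higher capsule's output get a larger dot product, so their coupling coefficients grow over the iterations — the "clustering" described above.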

Contents of lower capsules agreeing with higher-level capsules

Vector-to-Vector Nonlinearity.

After performing the dynamic routing algorithm and routing the capsule outputs to the correct higher-level capsules, the last step is to compress the information. We need some way to condense each capsule's output into something we can reuse.

We do this using a squash function.

The squash function compresses the length of the final vector to a value less than 1 while maintaining the direction of the vector. Here's the mathematical representation of the squash function:
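For reference, the squash formula as given in the original CapsNet paper (Sabour et al.) is:

$$ v_j = \frac{\lVert s_j \rVert^2}{1 + \lVert s_j \rVert^2} \, \frac{s_j}{\lVert s_j \rVert} $$

where $s_j$ is the total input to capsule $j$ and $v_j$ is its squashed output.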

The left side of the formula (the red box) performs the scaling we discussed earlier, and the right side (the green box) gives the squashed vector its unit-length direction.
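In code, the squash function (a NumPy sketch for a single capsule vector) looks like:

```python
import numpy as np

def squash(s, eps=1e-8):
    """Squash a capsule vector: length shrinks below 1, direction is kept."""
    norm2 = np.sum(s ** 2)
    scale = norm2 / (1.0 + norm2)        # left side: scaling into (0, 1)
    unit = s / np.sqrt(norm2 + eps)      # right side: unit-length direction
    return scale * unit

v = squash(np.array([3.0, 4.0]))         # input length 5
print(np.linalg.norm(v))                 # ~25/26 ~= 0.96, direction preserved
```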

Quick Recap: The length of this output vector (v_j) represents the probability of the given feature being recognized by the capsule.

Loss function:

Just like a traditional network, a CapsNet learns from its mistakes. We can represent those mistakes through a loss function that looks something like this:

Breakdown of CapsNet loss function [Source]

This loss function returns 0 if the network predicts the correct class with a probability of 0.9 or higher. Otherwise, if the confidence is lower than 0.9, the loss function returns a positive value that grows as the confidence drops.
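The margin loss from the paper can be sketched as follows (m⁺ = 0.9, m⁻ = 0.1, and λ = 0.5 are the paper's values; the example labels are illustrative):

```python
import numpy as np

def margin_loss(v_norms, labels, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss: zero when the correct class capsule's length is >= 0.9
    and every wrong class capsule's length is <= 0.1."""
    present = labels * np.maximum(0.0, m_plus - v_norms) ** 2
    absent = lam * (1 - labels) * np.maximum(0.0, v_norms - m_minus) ** 2
    return (present + absent).sum()

labels = np.zeros(10)
labels[3] = 1.0                              # true digit is 3
confident = np.zeros(10)
confident[3] = 0.95                          # correct capsule is long, rest short
print(margin_loss(confident, labels))        # 0.0 -> confident and correct
```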

Takeaways

Capsule networks maintain the majority of the information in an image, compared to max-pooling layers.

The dynamic routing algorithm allows layers in the network to communicate with each other through routing-by-agreement.

Capsule networks are still early in development.

Capsule networks lack in performance on larger datasets.
