
This article is about one-shot learning, and in particular the Siamese neural network, using face recognition as an example. I'm going to share what I learned about it from the paper FaceNet: A Unified Embedding for Face Recognition and Clustering and from deeplearning.ai, so that you can save time if you want to go deeper into this topic. So let's get started!

One-shot learning

In order to understand why we need one-shot learning, we first need to talk about deep learning and data. Normally, deep learning needs a large amount of data, and the more we have, the better the results get. However, it would be much more convenient to learn from only a few examples, because not everyone is rich in terms of how much data they have.

Also, the brain doesn't need thousands of pictures of the same object to be able to recognize it. But let's not push the brain analogy too far: the brain is far more complicated and powerful, and many things are involved in how we learn and memorize, such as feelings, prior knowledge, and interactions.

The idea here is to learn an object class from only a few examples, and that is exactly what one-shot learning is.

Face recognition

In face recognition systems, we want to be able to recognize a person's identity by feeding just one picture of that person's face to the system. If the system fails to recognize the picture, it means that this person's image is not stored in its database.

To solve this problem, we cannot use just a convolutional neural network, for two reasons: 1) a CNN doesn't work well on a small training set, and 2) it is not convenient to retrain the model every time we add a picture of a new person to the system. However, we can use a Siamese neural network for face recognition.

Siamese neural network

A Siamese neural network's objective is to find how similar two comparable things are (e.g., signature verification, face recognition). It consists of two identical subnetworks that share the same parameters and weights.

From C4W4L03 Siamese Network. Credit to Andrew Ng.

The image above is a good example of face recognition using the Siamese network architecture, from deeplearning.ai. As you can see, the first subnetwork's input is an image, followed by a sequence of convolutional, pooling, and fully connected layers, ending in a feature vector (we are not going to use a softmax function for classification). This last vector f(x1) is the encoding of the input x1. Then we do the same thing for a second image x2, feeding it to the second subnetwork, which is identical to the first one, to get its encoding f(x2).
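To make this concrete, here is a minimal sketch of the shared subnetwork in PyTorch. The layer sizes, the 96×96 input resolution, and the 128-dimensional embedding are illustrative assumptions, not the exact architecture from the video or from FaceNet; the point is that both images pass through the same weights.

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """One subnetwork of the Siamese pair: image -> encoding f(x).
    A minimal sketch; layer sizes are illustrative, not FaceNet's."""
    def __init__(self, embedding_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 24 * 24, embedding_dim),  # assumes 96x96 inputs
        )

    def forward(self, x):
        return self.fc(self.conv(x))

# The "two subnetworks" share weights, so in practice we build one
# network and simply call it on both images.
net = EmbeddingNet()
x1 = torch.randn(1, 3, 96, 96)  # image 1
x2 = torch.randn(1, 3, 96, 96)  # image 2
f_x1, f_x2 = net(x1), net(x2)   # encodings f(x1), f(x2)
```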

To compare the two images x1 and x2, we compute the distance d between their encodings f(x1) and f(x2). If it is less than a threshold (a hyperparameter), the two pictures are of the same person; if not, they are of two different persons.

d(x1, x2) = || f(x1) - f(x2) ||², the distance function between the encodings of x1 and x2.

And this works for any two images xi and xj.
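As a sketch, the comparison step might look like the following. The function name `verify` and the threshold value 0.7 are my own placeholders; in practice the threshold is tuned on a validation set.

```python
import torch

def verify(f_x1, f_x2, threshold=0.7):
    # d(x1, x2) = || f(x1) - f(x2) ||^2, the squared L2 distance
    d = torch.sum((f_x1 - f_x2) ** 2)
    # below the threshold -> same person, otherwise -> different persons
    return d.item() < threshold
```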

So, how can we learn the parameters in order to get a good encoding for the input image?

We can apply gradient descent on a triplet loss function, which is simply a loss function over three images: an anchor image A, a positive image P (the same person as the anchor), and a negative image N (a different person than the anchor). We want the distance d(A, P) between the encoding of the anchor and the encoding of the positive example to be less than or equal to the distance d(A, N) between the encoding of the anchor and the encoding of the negative example. In other words, we want pictures of the same person to be close to each other, and pictures of different persons to be far from each other.

The Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity. from the paper FaceNet: A Unified Embedding for Face Recognition and Clustering

The problem here is that the model can learn to produce the same encoding for every image, which makes all distances zero and, unfortunately, still satisfies the constraint d(A, P) ≤ d(A, N). For this reason, we add a margin alpha (a hyperparameter) to prevent this from happening and to always keep a gap between d(A, P) and d(A, N).

Triplet loss function:

L(A, P, N) = max( d(A, P) - d(A, N) + alpha, 0 ), where d is the squared distance between encodings defined above.

The max means that as long as d(A, P) - d(A, N) + alpha is less than or equal to zero, the loss L(A, P, N) is zero; but if it is greater than zero, the loss is positive, and gradient descent will push it back down toward zero.
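Put together, here is a hedged sketch of the triplet loss in PyTorch. The margin alpha = 0.2 follows the value used in the FaceNet paper, and averaging over the batch stands in for the sum over triplets described next.

```python
import torch

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    # squared L2 distances between the encodings
    d_ap = torch.sum((f_a - f_p) ** 2, dim=-1)  # d(A, P)
    d_an = torch.sum((f_a - f_n) ** 2, dim=-1)  # d(A, N)
    # max(d(A, P) - d(A, N) + alpha, 0), averaged over the batch
    return torch.clamp(d_ap - d_an + alpha, min=0).mean()
```

Note that PyTorch also ships a built-in torch.nn.TripletMarginLoss, though by default it uses the plain (not squared) L2 distance.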

The cost function is the sum of the individual losses over all triplets in the training set: J = Σ L(A(i), P(i), N(i)) over all training triplets i.

Training set:

The training set should contain multiple pictures of the same person, so that we can form the A and P pairs. Once the model is trained, it can recognize a person from only one picture.

How do we choose the triplets to train the model?

If we choose them randomly, the constraint of the loss function is too easy to satisfy, because for most random triplets d(A, N) is much larger than d(A, P), and gradient descent will not learn much from the training set. For this reason, we should choose triplets that are hard to train on, where d(A, P) is close to d(A, N), to push gradient descent to learn more. A sketch of this selection follows.
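Here is a brute-force illustration of that idea over one mini-batch of embeddings. The helper `select_hard_triplets` is my own sketch, not FaceNet's method: the paper uses online (semi-)hard mining within each batch, which is cheaper than scanning every combination.

```python
import torch

def select_hard_triplets(embeddings, labels, alpha=0.2):
    """Keep triplets where d(A, P) is close to d(A, N), i.e. triplets
    that still violate the margin and so produce a non-zero loss."""
    d = torch.cdist(embeddings, embeddings) ** 2  # pairwise squared distances
    n = len(labels)
    triplets = []
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue  # P must be a different image of the same person
            for neg in range(n):
                if labels[neg] == labels[a]:
                    continue  # N must be a different person
                # "hard" triplet: the loss max(...) would be positive
                if d[a, p] - d[a, neg] + alpha > 0:
                    triplets.append((a, p, neg))
    return triplets
```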

References:

- Florian Schroff, Dmitry Kalenichenko, James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. CVPR 2015. https://arxiv.org/abs/1503.03832
- Andrew Ng. C4W4L03 Siamese Network, Convolutional Neural Networks course, deeplearning.ai.