Training a Speaker Embedding from Scratch with Triplet Learning

Introduction

For the past few months, I have been researching how to build an end-to-end speaker identification system with deep learning. The area of research I focused on is metric learning. Knowing almost nothing going in, I significantly underestimated the effort required (surprise!). While the effort is ongoing, I have learnt much along the way and have some preliminary results I am excited to share.

In this blog post, I will explain what metric learning is, what speaker identification is, and how metric learning applies in this context, and share the lessons I learnt from applying that knowledge to build a speaker embedding function.

tl;dr In this blog post, I share what I learnt building a speaker embedding function with deep learning.

A bit of background on metric learning

Metric learning is about learning a function that maps an input, such as an image, to another hyperspace, commonly referred to as the latent space, where different instances can be compared on some metric, usually semantic similarity. That is, metric learning is not about learning to classify an input into a predetermined class; rather, it is about learning a similarity function we can use to compare how similar two inputs are. Why would we want such a function? It is a useful technique when the input space is large and it's not feasible to have a sample from every class to train a classifier.
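To make the comparison idea concrete, here is a minimal sketch (with made-up embedding vectors, not outputs of any real model) of how two inputs could be compared by cosine similarity once an embedding function exists:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings produced by a learned embedding function.
emb_anchor    = np.array([0.9, 0.1, 0.0, 0.2])
emb_same      = np.array([0.8, 0.2, 0.1, 0.3])  # same class as the anchor
emb_different = np.array([0.1, 0.9, 0.7, 0.0])  # different class

print(cosine_similarity(emb_anchor, emb_same))       # high similarity
print(cosine_similarity(emb_anchor, emb_different))  # low similarity
```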

Perhaps the most well-known application of metric learning is word vectors. A word vector function takes words as inputs and outputs embeddings in a vector space where words are clustered by semantic similarity. For example, in a well-trained word embedding, “queen” and “woman” will be closer together than “king” and “woman”, as the training implicitly learns the concept of gender. Another interesting application of metric learning is finding similar or duplicate images. FaceNet [1], which identifies people from face images, is an example of such an application.

Speaker identification problem in detail

Now that we have covered the basics of metric learning with examples such as word vectors and FaceNet, let's look at the speaker recognition problem in more detail.

In the research literature, speaker identification is divided into two categories. If we are identifying a speaker using a fixed phrase such as “Ok Google” or “Alexa”, this is referred to as text-dependent speaker identification. If we are identifying a speaker using only voice characteristics, without relying on specific words, this is known as text-independent speaker identification. For instance, a well-learnt text-independent embedding should be able to identify my voice whether I say the words “Hello World” or “Goodbye World”. I decided to focus on text-independent identification from the beginning since it solves the speaker identification problem in the general setting. See Google's paper on building “OK Google” text-dependent speaker identification at [2].

In summary, speaker identification in this post means recognizing a speaker's identity from a sample of their voice, independent of the spoken words. To accomplish this, we need to train a function that transforms a speaker's voice into an embedding space where similar voices are clustered near each other and different voices are far apart. With this function, we will then be able to identify a speaker based on their neighbors in the embedding hyperspace.
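As a rough sketch of that last step, assuming we already have a trained embedding function and a set of enrolled reference embeddings (the speaker ids and vectors below are made up for illustration), identification can be as simple as a nearest-neighbor lookup in the embedding space:

```python
import numpy as np

def identify_speaker(query_emb, enrolled):
    """Return the enrolled speaker whose reference embedding is closest to the query.

    `enrolled` maps a speaker id to a reference embedding, e.g. the average of the
    embeddings of that speaker's enrollment utterances.
    """
    def dist(a, b):
        return np.linalg.norm(a - b)  # Euclidean distance in the embedding space
    return min(enrolled, key=lambda spk: dist(query_emb, enrolled[spk]))

# Hypothetical embeddings; in practice these come from the trained network.
enrolled = {
    "speaker_a": np.array([0.9, 0.1, 0.0]),
    "speaker_b": np.array([0.1, 0.8, 0.3]),
}
query = np.array([0.85, 0.15, 0.05])
print(identify_speaker(query, enrolled))  # -> "speaker_a"
```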

Quick thoughts on data processing and dev environments

With the problem defined, the next step to tackle is data and the development environment. You have probably heard that a machine learning problem is 90% a data problem and 10% machine learning. You may also have thought this was an exaggeration. I did, and I was wrong. In terms of engineering effort for this experiment, I spent a significant amount of time setting up scalable data processing pipelines and repeatable machine learning experiments. This effort is worth a separate blog post in itself, so I won't belabor it further. I ended up going with PyTorch as the deep learning framework, chose Mel-frequency cepstral coefficients (MFCCs) as the data representation, used HDF5 with data versioning, recorded my experiment parameters in a YAML config, and used TensorBoard as the visualization tool.
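As an illustrative sketch of the feature pipeline (the specific library, file paths, and parameters below are assumptions for the example, not an exact copy of my pipeline), extracting MFCCs and storing them in a versioned HDF5 file can look roughly like this:

```python
import librosa
import h5py

def utterance_to_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Load an utterance and compute its MFCC matrix (n_mfcc x frames)."""
    y, _ = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

# Store features per speaker/utterance in an HDF5 file for fast random access.
with h5py.File("features.h5", "w") as f:
    # Hypothetical path and dataset key; one entry per utterance in practice.
    mfcc = utterance_to_mfcc("speaker_0001/utt_0001.wav")
    f.create_dataset("speaker_0001/utt_0001", data=mfcc)
    f.attrs["data_version"] = "v1"  # simple data versioning tag
```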

In the same vein, finding quality datasets for this experiment took nontrivial effort as well. According to Deep Speaker [3] from Baidu, the models they trained relied on a dataset with over one million speakers. As far as I know, there is no public dataset with nearly as many speakers. With that in mind, I started gathering my own dataset and currently have about 10,000 speakers across multiple languages; however, for the experiments and visualizations I share in this blog post, I used publicly available datasets comprising TIMIT, VCTK, and the dev and test portions of LibriSpeech. These datasets include approximately 880 English speakers, which I split into 680 training speakers and 200 test speakers. Note that there are other public speaker datasets such as Mozilla's Common Voice and VoxCeleb. I have not had a chance to explore them in detail yet, but I plan to do so later.

From Contrastive Loss to Triplet Loss

With the problem and dataset defined, let's look in more detail at how we can use metric learning to tackle the speaker recognition problem. Typically, we use a Siamese network architecture. This means we create a deep neural network with a fully connected layer as the output layer and train the network to minimize the difference between voice embeddings from the same speaker and maximize the difference between voice embeddings from different speakers. For a good introduction to the Siamese architecture, see [4].

Fig 0 — Siamese network schematic from [4]. X are the high-dimensional inputs, Gw(X) is the embedding output, and W are the shared weights

As shown in fig 0 above, a Siamese network is an architecture and does not prescribe what each layer of the neural network encompasses. What is important is that the two paths of the network share the same weights, so that the output embeddings are directly comparable. With a network architecture in hand, the next step is to train it with a suitable loss function. In the metric learning space, there are two common loss functions: contrastive loss and triplet loss.
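To make the shared-weights idea concrete, here is a minimal PyTorch sketch (the layer sizes and input shape are illustrative assumptions, not the exact architecture I used). The two branches of the Siamese network are simply two calls to the same module, and therefore the same weights W:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """Maps an MFCC feature matrix to an L2-normalized embedding Gw(X).

    The layer sizes are illustrative; all that matters here is that the
    output is a fixed-size embedding from a fully connected layer.
    """
    def __init__(self, input_dim=13 * 100, embedding_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, embedding_dim),  # fully connected output layer
        )

    def forward(self, x):
        emb = self.net(x.flatten(start_dim=1))
        return F.normalize(emb, p=2, dim=1)  # unit-length embeddings

# The "two branches" share weights because we call the same module on both inputs.
model = EmbeddingNet()
x1 = torch.randn(8, 13, 100)  # a batch of 8 utterances, 13 MFCCs x 100 frames
x2 = torch.randn(8, 13, 100)
emb1, emb2 = model(x1), model(x2)
```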

Training with contrastive loss takes a batch of sample pairs as input and trains the network to separate embeddings of different classes by a minimum distance, called the margin, while minimizing the distance between embeddings of the same class. The distance function is not prescribed but is usually Euclidean or cosine. During training, one trains over a generated list of pairs (x, y), where x is from the same class as y roughly half the time. Contrastive loss is defined in fig 1 below. For an example of contrastive loss in the literature, see [5].
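As a sketch of one common formulation of this pair-based objective (the exact definition I use is the one shown in fig 1; the sign convention for the pair label varies across papers), the loss can be written in PyTorch roughly as:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, same_class, margin=1.0):
    """Contrastive loss over a batch of embedding pairs.

    `same_class` is 1.0 when the pair comes from the same speaker and 0.0
    otherwise. Same-speaker pairs are pulled together; different-speaker
    pairs are pushed apart until they are at least `margin` away.
    """
    d = F.pairwise_distance(emb1, emb2)  # Euclidean distance per pair
    loss_same = same_class * d.pow(2)
    loss_diff = (1.0 - same_class) * F.relu(margin - d).pow(2)
    return 0.5 * (loss_same + loss_diff).mean()

# Example with embeddings from the EmbeddingNet sketch above; the labels
# indicate whether each pair in the batch comes from the same speaker.
labels = torch.tensor([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
loss = contrastive_loss(emb1, emb2, labels)
```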