Speaker Diarization with Kaldi

With the rise of voice biometrics and speech recognition systems, the ability to process audio that contains multiple speakers is crucial. This article is a basic tutorial for that process using Kaldi X-Vectors, a state-of-the-art technique.

In most real-world scenarios, speech does not come in well-defined audio segments with only one speaker. In most of the conversations our algorithms will need to work with, people interrupt each other, and cutting the audio between sentences won't be a trivial task.

In addition, in many applications we will want to identify multiple speakers in a conversation, for example when writing the minutes of a meeting. For such occasions, identifying the different speakers and connecting different sentences to the same speaker is a critical task.

Speaker Diarization is the solution to those problems. With this process we can divide an input audio stream into segments according to the speaker's identity. It can be described as answering the question "who spoke when?" in an audio segment.

Attributing different sentences to different people is a crucial part of understanding a conversation. Photo by rawpixel on Unsplash

History

The first ML-based works on Speaker Diarization appeared around 2006, but significant improvements started only around 2012 (Xavier, 2012), and at the time it was considered an extremely difficult task. Most methods back then were based on GMMs or HMMs (such as JFA) and didn't involve any neural networks.

A really big breakthrough came with the release of LIUM, an open-source software package dedicated to speaker diarization, written in Java. For the first time there was a freely distributed algorithm that could perform the task with reasonable accuracy. The algorithm at the core of LIUM is a complex mechanism that combines GMMs with I-Vectors, a method that used to achieve state-of-the-art results in speaker recognition tasks.

The entire process in the LIUM toolkit: a repetitive multi-part process with a lot of combined models.

Today, such complex multi-part algorithms are being replaced with neural networks in many different domains, such as image segmentation and even speech recognition.

X-Vectors

A recent breakthrough was published in 2017 by D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur in an article named "Deep Neural Network Embeddings for Text-Independent Speaker Verification", presenting a model that would later be named "X-Vectors".

A diagram of the proposed neural network; the different parts of the network are highlighted on the right. From the original article.

In that method the input to the network is the audio represented as MFCC features. Those features are fed into a neural network that can be partitioned into four parts:

1. Frame-level layers - These layers are essentially a TDNN (Time-Delay Neural Network). The TDNN is an architecture that was invented in the late 1980s, before the current surge in popularity of neural networks, and was then "rediscovered" in 2015 as a key part of speech recognition systems. This network is essentially a fully connected neural network that takes into account a sliding window in time across the sample. It is considered to be much faster than an LSTM/GRU.
2. Statistics pooling - Because each frame gives us a vector, we need to summarize those vectors in some manner. In this implementation we take the mean and the standard deviation of all vectors and concatenate them into one vector that represents the entire segment.
3. Fully connected layers - The pooled vector is fed into two fully connected layers (with 512 and 300 neurons) that we will use later. The second layer has a ReLU non-linearity.
4. Softmax classifier - A simple softmax classifier that takes the output after the ReLU and classifies the segment as one of the different speakers.

A visualization of a TDNN, the first part of the X-Vectors system.

The real power of the X-Vectors isn't (only) in classifying the different speakers, but also in the use of the two fully connected layers as an embedded representation of the entire segment. In the article they use these representations to classify an entirely different dataset from the one they trained on: they first created the embedding for each new audio sample and then classified them with a PLDA-backend similarity metric.

Diarization with X-Vectors

Now that we understand that these embeddings can represent the speaker in each audio sample, we can see how that representation can be used to segment sub-parts of an audio recording. That method is described in the article "Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge". The DIHARD challenge was especially hard because it contained 10 diverse audio domains, ranging from TV shows to phone calls to conversations between children, and in addition there were two extra domains that appeared only in the validation set.

In the article, they described many practices that brought their diarization algorithms to their current state-of-the-art level. Although using additional techniques (like Variational Bayes) improved the accuracy of the model dramatically, it was essentially based on the same X-Vectors embedding and PLDA backend.

From the DIHARD Challenge article. You can see the major improvements from using X-Vectors. Previous works are in blue and the state-of-the-art results are in red.

How to do that with Kaldi

First of all, if you haven't used Kaldi before, I highly recommend reading my first article about using Kaldi. It's hard to start using the system without prior experience with speech recognition systems.

Secondly, you don't need to re-train the X-Vectors network or the PLDA backend; you can just download them from the official site. If you still want to perform a full training from scratch, you can follow the callhome_diarization v2 recipe in the Kaldi project.

Now that you have a model, whether you've trained it yourself or downloaded the pretrained one, I will go through the different parts of the diarization process. This walkthrough is adapted from different comments on GitHub, mainly the one by david, and from the documentation.

Preparing the data

You'll first need to have a normal wav.scp and segments file, in the same way as in an ASR project. If you want an easy way to create the segments file you can always use the compute_vad_decision.sh script and then run the vad_to_segments.sh script on its output. If you don't want to divide your audio, just map each recording to one utterance spanning it from start to end.

Next, you'll need to create an utt2spk file that maps segments to utterances. You can do that simply in Linux by running the command awk '{print $1, $2}' segments > utt2spk. Then, to create the other necessary files, just put all the files in one folder and run the fix_data_dir.sh script.
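To make the file formats concrete, here is a toy sketch with hypothetical segment and recording IDs (the IDs and timestamps are made up for illustration; only the column layout matters):

```shell
# A segments file has the format: <segment-id> <recording-id> <start> <end>
cat > segments <<'EOF'
rec1-0000000-0000150 rec1 0.00 1.50
rec1-0000150-0000300 rec1 1.50 3.00
rec2-0000000-0000120 rec2 0.00 1.20
EOF

# utt2spk maps each segment to its recording, which stands in for the
# "speaker" until the diarization step assigns real speaker labels.
awk '{print $1, $2}' segments > utt2spk
cat utt2spk
```

Each line of the resulting utt2spk simply pairs a segment ID with its recording ID, e.g. `rec1-0000000-0000150 rec1`.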

Create Features

Now you’ll need to create the features for the audio that will later be the inputs for your X-Vectors extractor.

We will start with creating the MFCC features and applying CMVN, in the same way as in an ASR project. Note that you'll need an mfcc.conf file that matches the configuration the model was trained with. If you used the pre-trained model, use these files.

For the MFCC creation run the following command:

steps/make_mfcc.sh --mfcc-config conf/mfcc.conf --nj 60 \
  --cmd "$train_cmd_intel" --write-utt2num-frames true \
  $data_dir exp/make_mfcc $mfccdir

And then for the CMVN run this command:

local/nnet3/xvector/prepare_feats.sh --nj 60 \
  --cmd "$train_cmd_intel" $data_dir $cmn_dir $cmn_dir

After the features are finished, use utils/fix_data_dir.sh $data_dir to fix the data directory, then copy the segments file to the CMVN directory with cp $data_dir/segments $cmn_dir/, and finally fix the CMVN directory as well with utils/fix_data_dir.sh $cmn_dir .

Create the X-Vectors

The next phase is to create the X-Vectors for your data. I refer here to the export folder where you have the X-Vectors network as $nnet_dir; if you downloaded it from the Kaldi website, use the path "exp/xvectors_sre_combined". Then run this command:

diarization/nnet3/xvector/extract_xvectors.sh \
  --cmd "$train_cmd_intel --mem 5G" \
  --nj 60 --window 1.5 --period 0.75 --apply-cmn false \
  --min-segment 0.5 $nnet_dir \
  $cmn_dir $nnet_dir/exp/xvectors

Notice that in this example we are using a window of 1.5 seconds with a 0.75-second shift (period) between windows. Lowering the shift might help to capture more detail.
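To see the geometry of those parameters, here is a quick sketch assuming a hypothetical 5.0-second segment: with a 1.5 s window and a 0.75 s period, consecutive windows overlap by 50%, and the last full window starts at 5.0 - 1.5 = 3.5 s.

```shell
# Enumerate the window start times for a hypothetical 5.0 s segment:
# windows begin every 0.75 s, and the last one that still fits
# starts at 3.5 s, giving five overlapping windows in total.
seq 0 0.75 3.5
```

Halving the period would double the number of windows (and X-Vectors) extracted per segment, at the cost of more computation.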

Score X-Vectors with PLDA

Now you'll need to score the pair-wise similarity between the X-Vectors with the PLDA backend. Do that with the following command:

diarization/nnet3/xvector/score_plda.sh \
  --cmd "$train_cmd_intel --mem 4G" \
  --target-energy 0.9 --nj 20 $nnet_dir/xvectors_sre_combined/ \
  $nnet_dir/xvectors $nnet_dir/xvectors/plda_scores

Diarization

The last part is to cluster the PLDA scores you've created. Luckily, there is also a script for that, and you can run it in two ways: supervised or unsupervised.

In the supervised way you'll need to state how many speakers are in each utterance. This is especially easy when you're working with a phone call, which essentially has only two speakers, or with a conference meeting with a known number of speakers. To cluster the scores in a supervised way you'll first need to create a file that maps the utterances from the wav.scp file to the number of speakers in each utterance. The file should be named reco2num_spk and should look something like this:

rec1 2
rec2 2
rec3 3
rec4 1

An important note: you need to map each utterance (recording) to its number of speakers, not each segment. After you've created the reco2num_spk file you can run the following command:

diarization/cluster.sh --cmd "$train_cmd_intel --mem 4G" --nj 20 \
  --reco2num-spk $data_dir/reco2num_spk \
  $nnet_dir/xvectors/plda_scores \
  $nnet_dir/xvectors/plda_scores_speakers

If you don't know how many speakers each utterance has, you can always run the clustering in an unsupervised way and tune the threshold in the script. A good starting value is 0.5. To cluster in an unsupervised way, use the same script like so:

diarization/cluster.sh --cmd "$train_cmd_intel --mem 4G" --nj 40 \
  --threshold $threshold \
  $nnet_dir/xvectors/plda_scores \
  $nnet_dir/xvectors/plda_scores_speakers

The outcome

After clustering you’ll have an output file named rttm in the $nnet_dir/xvectors/plda_scores_speakers directory. The file will look something like this:

SPEAKER rec1 0 86.200 16.400 <NA> <NA> 1 <NA> <NA>
SPEAKER rec1 0 103.050 5.830 <NA> <NA> 1 <NA> <NA>
SPEAKER rec1 0 109.230 4.270 <NA> <NA> 1 <NA> <NA>
SPEAKER rec1 0 113.760 8.625 <NA> <NA> 1 <NA> <NA>
SPEAKER rec2 0 122.385 4.525 <NA> <NA> 2 <NA> <NA>
SPEAKER rec2 0 127.230 6.230 <NA> <NA> 2 <NA> <NA>
SPEAKER rec2 0 133.820 0.850 <NA> <NA> 2 <NA> <NA>

In that file, the 2nd column is the recording-id from the wav.scp file, the 4th column is the start time of the segment, the 5th column is its duration, and the 8th column is the ID of the speaker in that segment.
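Given that column layout, a one-line awk script can summarize the diarization output, for example by totaling each speaker's talk time per recording. The rttm excerpt below is a made-up toy file mirroring the example above:

```shell
# Toy rttm excerpt (column 2 = recording, column 5 = duration,
# column 8 = speaker label), for illustration only.
cat > rttm <<'EOF'
SPEAKER rec1 0 86.200 16.400 <NA> <NA> 1 <NA> <NA>
SPEAKER rec1 0 103.050 5.830 <NA> <NA> 1 <NA> <NA>
SPEAKER rec2 0 122.385 4.525 <NA> <NA> 2 <NA> <NA>
EOF

# Sum the segment durations per (recording, speaker) pair.
awk '{total[$2 " spk" $8] += $5}
     END {for (k in total) printf "%s %.3f\n", k, total[k]}' rttm | sort
# prints:
#   rec1 spk1 22.230
#   rec2 spk2 4.525
```

This kind of per-speaker summary is a quick sanity check that the clustering produced a sensible speaker distribution before you feed the segments into downstream recognition.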

And with that, we’ve completed the Diarization process! We can now try to use speech recognition techniques to determine what each speaker said or use speaker verification techniques to validate if we know any of the different speakers.