Networking event — costs and benefits

How do we get such a mask, you ask? One part of the answer is the network architecture and postprocessing that yields the given set of masks. The other part is the objective function that tells the model how to learn which masks work and which don’t.

Figure 4. Source-contrastive estimation network architecture. During inference, the output of the dense layer undergoes clustering (left branch); during training, it is passed to the source-contrastive loss function (SCE; right branch).

We have two bidirectional LSTM layers at the start, with the recurrence running over the time dimension of the spectrogram. A feedforward layer then generates an embedding: a matrix that assigns a 40-component vector to each time-frequency bin. The TensorFlow code implementing this is available on GitHub, as part of a larger class definition.
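To make the shapes concrete, here is a sketch of my own (not an excerpt from the repo): a plain NumPy random projection stands in for the two BLSTM layers and the dense layer, just to show how a (T, F) spectrogram becomes a (T, F, 40) embedding with one vector per time-frequency bin.

```python
import numpy as np

# Hypothetical sizes: T time frames, F frequency bins, K = 40 embedding dims.
T, F, K = 100, 257, 40

rng = np.random.default_rng(0)
spectrogram = rng.standard_normal((T, F))

# Stand-in for the BLSTM + dense layers: a single random linear map from
# each frame's F magnitudes to an F x K block of embedding components.
W_proj = rng.standard_normal((F, F * K)) * 0.01
embedding = (spectrogram @ W_proj).reshape(T, F, K)

# Normalize each bin's vector to unit length -- a common choice in
# embedding-based separation (an assumption here, not a claim about the repo).
embedding /= np.linalg.norm(embedding, axis=-1, keepdims=True)

print(embedding.shape)  # (100, 257, 40): one 40-d vector per time-frequency bin
```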

What do you do with that?

The insight from deep clustering and deep attractor networks is that representing time-frequency bins in a dense vector space lets you nudge them around, so that components belonging to different speakers are easily separated. When you deploy the model on a mixture, it transforms the spectrogram into a space where the difference between speakers is pretty clear-cut, as Figure 5 shows.

Figure 5. Low-dimensional view of the embedding for a two-speaker mixture.

The vectors in this new space can be clustered using any old algorithm; we used K-means. Cluster memberships can then be turned into masks by assigning 1 to every time-frequency bin in the cluster and -1 to all the other bins.
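As a rough sketch of that step (with a minimal hand-rolled K-means and made-up toy sizes, so it stays self-contained), clustering the bin embeddings and converting memberships into 1/-1 masks might look like this:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal K-means: returns a cluster label for each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy embedding: T x F bins, each with a 40-d vector, in two loose blobs.
T, F, K = 10, 20, 40
rng = np.random.default_rng(1)
emb = rng.standard_normal((T * F, K)) * 0.1
emb[: T * F // 2] += 1.0   # pretend half the bins belong to one speaker

labels = kmeans(emb, k=2)

# One mask per cluster: 1 for bins in the cluster, -1 for all the others.
masks = np.where(labels[None, :] == np.arange(2)[:, None], 1, -1)
masks = masks.reshape(2, T, F)
print(masks.shape)  # (2, 10, 20): one mask per speaker
```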

Putting on training wheels

Getting the vectors in the embedding space to separate, however, is the tricky part, which brings us to the cost function at the heart of SCE. In deep attractor networks, the goal is to pull each embedding close to a centroid (an "attractor") of its speaker's region of the embedding space. In deep clustering, the story is less geometrically neat: the embeddings are optimized to approximate a bin-to-bin similarity matrix. We thought we could do even better.

The “source-contrastive” part of “source-contrastive estimation” nods to noise-contrastive estimation, the engine behind (some instantiations of) word2vec. The intuition is that an embedding can be optimized by simultaneously pulling it toward a target vector representing the speaker it belongs to and pushing it away from the vectors representing the other speakers in the mixture. This effects a contrast between bins that should be different and draws bins together when they share an identity. And although speaker identities are used during training, they are not needed during inference — neat!
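In the word2vec/NCE spirit, that intuition can be sketched as a logistic loss over dot products: reward agreement with your own speaker's vector, penalize agreement with the others. This is a sketch of the idea, not the paper's exact objective, and the names (`sce_style_loss`, `V`, `W`) are mine:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sce_style_loss(V, W, speaker_of_bin):
    """Contrastive loss sketch: for each bin embedding v owned by speaker s,
    maximize sigmoid(v . W[s]) and minimize sigmoid(v . W[t]) for t != s."""
    loss = 0.0
    for v, s in zip(V, speaker_of_bin):
        loss -= np.log(sigmoid(v @ W[s]))             # pull toward own speaker
        for t in range(len(W)):
            if t != s:
                loss -= np.log(sigmoid(-(v @ W[t])))  # push away from others
    return loss / len(V)

rng = np.random.default_rng(0)
V = rng.standard_normal((50, 40)) * 0.1   # 50 bin embeddings
W = rng.standard_normal((2, 40)) * 0.1    # 2 speaker target vectors
who = rng.integers(0, 2, size=50)         # which speaker owns each bin

print(sce_style_loss(V, W, who))
```

With orthogonal speaker vectors, a bin embedding aligned with its own speaker's vector scores a lower loss than one aligned with the wrong speaker, which is exactly the contrast we want.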

Here’s another question: where exactly do we put the target vectors for the various speakers? We can actually just learn those along the way, too. As bin embeddings get pushed toward and away from each other, so too do the speaker embeddings. As a side benefit, you should end up with similar-sounding speakers closer together in the embedding space and different-sounding ones further apart.
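A toy illustration of that joint learning, with hand-derived gradients for a logistic contrastive loss (sizes and learning rate are made up; this is my sketch, not the paper's training code): a single gradient step moves both the bin embeddings and the speaker vectors, and the loss drops.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy setup: 30 bins, 2 speakers, 8-d embeddings (all sizes hypothetical).
rng = np.random.default_rng(2)
V = rng.standard_normal((30, 8)) * 0.1   # bin embeddings
W = rng.standard_normal((2, 8)) * 0.1    # speaker vectors, learned jointly
who = rng.integers(0, 2, size=30)

def loss_and_grads(V, W, who):
    gV, gW = np.zeros_like(V), np.zeros_like(W)
    loss = 0.0
    for i, s in enumerate(who):
        for t in range(len(W)):
            sign = 1.0 if t == s else -1.0        # attract own, repel others
            p = sigmoid(sign * (V[i] @ W[t]))
            loss -= np.log(p)
            gV[i] -= sign * (1 - p) * W[t]        # d(loss)/d(bin embedding)
            gW[t] -= sign * (1 - p) * V[i]        # d(loss)/d(speaker vector)
    return loss, gV, gW

lr = 0.05
before, gV, gW = loss_and_grads(V, W, who)
V -= lr * gV   # the bins move toward/away from the speaker vectors...
W -= lr * gW   # ...and the speaker vectors move too
after, _, _ = loss_and_grads(V, W, who)
print(after < before)   # a small gradient step lowers the contrastive loss
```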

And does it do anything?

Our experiments, documented in our arXiv manuscript, indicate that SCE outperforms deep clustering in a number of settings, including mixtures of two women and of one man and one woman. As Figure 6 shows, SCE yields an improvement in signal-to-distortion ratio (SDR) of 1 dB or more over the other techniques.

Figure 6. Source-to-distortion ratio for different separation techniques, across mixture conditions.

Another thing we think is really cool about our approach is that it cuts down on training time. Updating a deep clustering model requires some hefty matrix computations (even after the impressive magic tricks its inventors pulled off to speed up their updates). Like its cousin noise-contrastive estimation, SCE presents a relatively lightweight means of estimating an embedding that separates speakers from each other. In our experiments, the average wall-clock time per minibatch for SCE was almost half that of deep clustering!

There is a lot more to do — right now, we are concerned that the margin between speakers in the embedding space is not wide enough, making it hard to separate ambiguous components. We also suspect that the information contained in our model could be put to work for other purposes, such as speaker identification or diarization. We are also hard at work on ways to make the separation performance of SCE robust to noise, so that it can actually work in real-world scenarios. In any case, we are very encouraged by our results to date, even if we are still a long way from solving the cocktail party problem in general. If you are intrigued, please read our manuscript and check out our GitHub repo!