In a recent blog post, Google announced it has open-sourced its speaker diarization technology, which differentiates people’s voices with high accuracy. Google does this by partitioning an audio stream that includes multiple participants into homogeneous segments per participant.

Partitioning speech into homogeneous segments has many applications. Chong Wang, a research scientist at Google, explains:

By solving the problem of "who spoke when", speaker diarization has applications in many important scenarios, such as understanding medical conversations, video captioning and more.

Being able to accurately segment conversations improves the quality of both online and offline diarization systems, which has practical value in the healthcare industry, as a recent study in the Annals of Family Medicine reported:

Doctors often spend ~6 hours in an 11-hour workday in the Electronic Health Records (EHR) on documentation. Consequently, one study found that more than half of surveyed doctors report at least one symptom of burnout.

Training voice dictation systems using supervised learning methods has historically been challenging. Wang explains why:

Training these systems with supervised learning methods is challenging — unlike standard supervised classification tasks, a robust diarization model requires the ability to associate new individuals with distinct speech segments that weren't involved in training. Importantly, this limits the quality of both online and offline diarization systems.

Running online speaker diarization on streaming audio input allows different speakers to be detected, represented in the following image by different colors along the bottom axis.

Image source: https://ai.googleblog.com/2018/11/accurate-online-speaker-diarization.html

Google has published a research paper, Fully Supervised Speaker Diarization, which introduces a new model that uses supervised speaker labels more effectively than traditional approaches. Within this model, the number of speakers participating in a conversation is estimated, which increases the amount of labeled data the system can take advantage of.

On the NIST SRE 2000 CALLHOME benchmark, Google’s technique achieved a diarization error rate (DER) as low as 7.6%, where DER is defined as the "percentage of the input signal that is wrongly labeled by the diarization output." This result improves on the 8.8% DER achieved using a clustering-based method and the 9.9% DER achieved using deep neural network embedding methods.
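The quoted definition of DER can be illustrated with a short sketch. Note this is a simplified, hypothetical version: a full DER computation also finds the best mapping between reference and hypothesis speaker labels and accounts for missed speech and false alarms.

```python
def diarization_error_rate(reference, hypothesis):
    """Simplified DER: fraction of frames wrongly labeled by the
    diarization output, per the definition quoted above.

    reference/hypothesis: per-frame speaker labels, one entry per
    fixed-length audio frame. (Hypothetical helper, not Google's scorer.)
    """
    assert len(reference) == len(hypothesis)
    wrong = sum(r != h for r, h in zip(reference, hypothesis))
    return wrong / len(reference)
```

For example, if one of four frames is attributed to the wrong speaker, this simplified DER is 25%.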

Modern speaker diarization systems usually leverage clustering algorithms like k-means or spectral clustering. Wang explains some of the drawbacks of using these approaches:

Since these clustering methods are unsupervised, they could not make good use of the supervised speaker labels available in the data. Moreover, online clustering algorithms usually have worse quality in real-time diarization applications with streaming audio inputs. The key difference between our model and common clustering algorithms is that in our method, all speakers’ embeddings are modeled by a parameter-sharing recurrent neural network (RNN), and we distinguish different speakers using different RNN states, interleaved in the time domain.
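To make the clustering baseline the quote describes concrete, here is a minimal, hypothetical sketch of k-means over per-segment speaker embeddings. It shows the two limitations Wang mentions: no speaker labels are used, and the number of speakers must be supplied up front.

```python
import numpy as np

def kmeans_diarize(embeddings, n_speakers, n_iter=20, seed=0):
    """Cluster per-segment speaker embeddings into n_speakers groups.

    Toy unsupervised baseline (not Google's code): labels come purely
    from embedding geometry, and n_speakers must be known in advance.
    """
    rng = np.random.default_rng(seed)
    # initialize centers from randomly chosen segment embeddings
    centers = embeddings[rng.choice(len(embeddings), n_speakers, replace=False)]
    for _ in range(n_iter):
        # assign each segment to its nearest center
        dists = np.linalg.norm(embeddings[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        # recompute each center as the mean of its assigned segments
        for k in range(n_speakers):
            if (labels == k).any():
                centers[k] = embeddings[labels == k].mean(axis=0)
    return labels
```

Segments whose embeddings land in the same cluster are attributed to the same speaker; nothing in this procedure can learn from labeled training conversations.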

To illustrate how this model works, consider four different speakers (the model can support an unknown number of speakers), each represented with their own color (blue, yellow, pink and green). Each speaker will have their own RNN instance where the initial state is shared across all speakers. A speaker will continue to update their RNN until a different speaker begins to speak. For example, the blue speaker may initiate the conversation until it transitions over to the yellow speaker. During both of these timeframes, each speaker will be updating their RNN while they are speaking. This occurs across all participants as the conversation transitions from one speaker to another.
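The bookkeeping described above can be sketched in a few lines. This is a toy illustration (not Google's released implementation): a stand-in cell replaces a real RNN, each speaker gets a state initialized from a shared starting state, and only the active speaker's state is updated at each step.

```python
import numpy as np

def rnn_step(state, observation, W=0.9, U=0.1):
    # stand-in for a trained RNN cell: blend previous state with new input
    return np.tanh(W * state + U * observation)

def track_speakers(frames, shared_init):
    """frames: list of (speaker_id, observation) pairs in time order.

    Returns one state per speaker; all states start from shared_init,
    and a speaker's state advances only while that speaker is talking.
    """
    states = {}
    for speaker, obs in frames:
        prev = states.get(speaker, shared_init)  # shared initial state
        states[speaker] = rnn_step(prev, obs)    # update active speaker only
    return states
```

When the conversation hands off from, say, the blue speaker to the yellow speaker, the blue state is frozen and the yellow state (seeded from the shared initial state on first use) starts advancing instead.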

Image source: https://ai.googleblog.com/2018/11/accurate-online-speaker-diarization.html

Wang explains why using RNN states is important:

Representing speakers as RNN states enables us to learn the high-level knowledge shared across different speakers and utterances using RNN parameters, and this promises the usefulness of more labeled data. In contrast, common clustering algorithms almost always work with each single utterance independently, making it difficult to benefit from a large amount of labeled data.

The outcome of this RNN-state-based approach is a set of time-stamped speaker labels that identify who spoke and for how long. In addition, the approach is suitable for latency-sensitive applications.
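As a hypothetical illustration of that output format, a helper like the following turns a run of per-frame speaker labels into "who spoke when" segments with start and end times:

```python
def frames_to_segments(labels, frame_duration=0.25):
    """Collapse per-frame speaker labels into time-stamped segments.

    Returns (speaker, start_seconds, end_seconds) tuples, i.e. the
    "who spoke and for how long" output described above.
    (Illustrative helper; frame_duration is an assumed parameter.)
    """
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # close the current segment at the end or on a speaker change
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start],
                             start * frame_duration,
                             i * frame_duration))
            start = i
    return segments
```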

Moving forward, Google plans to reduce DER further and to integrate contextual information to perform offline decoding. To learn more about speaker diarization technology, Google has published a paper and made its source code available on GitHub.