Speech recognition is a key area of interest for Apple, whose cross-platform Siri virtual assistant is used by over 500 million customers worldwide. This past week, the tech giant published a series of preprint research papers investigating techniques to improve voice trigger detection and speaker verification, as well as language identification for multiple speakers.

Speaker verification and voice trigger detection

In the first of the papers, a team of Apple researchers proposes an AI model trained to perform both automatic speech recognition and speaker recognition. As they explain in the abstract, the commands recognized by speech-based personal assistants are usually prefixed with a trigger phrase (e.g., “Hey, Siri”), and detecting this trigger phrase involves two steps. The AI must first decide whether the phonetic content of the input audio matches that of the trigger phrase (voice trigger detection) and then determine whether the speaker’s voice matches the voice of a registered user or users (speaker verification).
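As a rough illustration, the two-step decision could be sketched like this (the scoring functions, inputs, and thresholds below are hypothetical placeholders, not Apple's implementation):

```python
# Illustrative sketch only: placeholder scorers standing in for real
# voice trigger detection and speaker verification models.

def phonetic_score(audio) -> float:
    """Placeholder: how closely the audio matches the trigger phrase."""
    return audio.get("trigger_similarity", 0.0)

def speaker_score(audio, enrolled_profile) -> float:
    """Placeholder: how closely the voice matches an enrolled user."""
    return audio.get("speaker_similarity", 0.0)

def should_wake(audio, enrolled_profile,
                trigger_threshold=0.8, speaker_threshold=0.7) -> bool:
    # Step 1: voice trigger detection -- does this sound like the phrase?
    if phonetic_score(audio) < trigger_threshold:
        return False
    # Step 2: speaker verification -- is a registered user speaking?
    return speaker_score(audio, enrolled_profile) >= speaker_threshold

sample = {"trigger_similarity": 0.9, "speaker_similarity": 0.75}
print(should_wake(sample, enrolled_profile=None))  # True: both checks pass
```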

The two tasks are usually considered independently, but the coauthors posited that knowledge of the speaker might help suss out the phonetic content in the acoustic signal, and vice versa, making it possible to estimate both properties jointly.

The researchers devised three sets of models capable of learning phonetic and speaker information, which they trained on a data set containing over 16,000 hours of annotated samples where 5,000 hours of audio had phonetic labels. (The rest had speaker labels only.) Over 100 subjects contributed to the corpus using a smart speaker device in a range of acoustic settings, including quiet room, external noise from a TV or kitchen appliance in the room, and music playback from the recorder at loud volume. Two thousand hours of continuous audio recordings from TV, radio, and podcasts that didn’t contain the trigger phrase were added to allow the measurement of “false alarm” rate.

The models showed an aptitude for learning both phonetic and speaker information while yielding accuracies “at least as good” as the baseline models for each task, with the same number of parameters — the internal variables a model learns during training — as the independent models. In fact, one of the three proposed models outperformed the speaker verification baselines in “multiple” settings, showing a relative improvement of 7.6% over the baseline on a text-independent task.

“[An] interesting feature of these results is that the model was trained using disjoint datasets — i.e. each audio example has either phonetic or speaker labels, never both,” wrote the researchers. “This observation suggests a flexible design where it is possible to train a model on multiple related tasks by concatenating training data for different tasks, rather than obtaining multiple labels for each training example. From a practical standpoint, being able to share computation between the two tasks can save on-device memory, computation time or latency, and the amount of power/battery consumed.”
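The disjoint-label setup the researchers describe — a shared encoder feeding two task-specific heads, with each example contributing loss only through the head that matches its label — could be sketched roughly as follows (the shapes, losses, and architecture here are illustrative assumptions, not the paper's model):

```python
import numpy as np

# Toy sketch of multi-task training on disjoint labels: each example has
# either a phonetic label or a speaker label, never both, and only the
# matching head contributes to the loss.

rng = np.random.default_rng(0)
W_shared = rng.normal(size=(8, 4))   # shared encoder weights
W_phone = rng.normal(size=(4, 3))    # phonetic head (3 classes)
W_spk = rng.normal(size=(4, 5))      # speaker head (5 speakers)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def example_loss(x, label, task):
    h = np.tanh(x @ W_shared)        # shared representation
    head = W_phone if task == "phonetic" else W_spk
    probs = softmax(h @ head)
    return -np.log(probs[label])     # cross-entropy for the one labeled task

# A "concatenated" batch mixing the two disjoint datasets:
batch = [
    (rng.normal(size=8), 1, "phonetic"),
    (rng.normal(size=8), 3, "speaker"),
]
total = sum(example_loss(x, y, t) for x, y, t in batch)
print(round(float(total), 3))
```

The point of the sketch is the batching logic: because every example routes through exactly one head, training data for the two tasks can simply be concatenated, as the quote above notes.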

False trigger mitigation

A complementary study addresses the task of false trigger mitigation, where speech not intended for a voice assistant like Siri is purposefully ignored by the assistant.

Employing a graph neural network (GNN) — a type of AI model that operates on graph-structured data, where the goal is to predict labels for nodes that lack ground-truth annotations — the coauthors say they managed to mitigate 87% of false triggers. “Voice-triggered smart assistants often rely on detection of a trigger phrase before they start listening for the user request … False triggers often originate either from background noise or from speech which sounds similar to the trigger-phrase,” they wrote. “Mitigation of false triggers is an important aspect of building a privacy-centric non-intrusive smart assistant.”
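To give a flavor of graph-based node classification, here is a toy sketch in which each node's features are mixed with its neighbors' before being classified — an illustrative simplification of GNN message passing, not the false trigger model from the paper:

```python
import numpy as np

# Toy graph: 3 nodes, node 0 connected to nodes 1 and 2.
adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]], dtype=float)
feats = np.array([[1.0, 0.0],
                  [0.9, 0.1],
                  [0.2, 0.8]])          # per-node feature vectors

# One round of message passing: mix each node with its neighbors' mean.
deg = adj.sum(axis=1, keepdims=True)
mixed = 0.5 * feats + 0.5 * (adj @ feats) / deg

# A fixed linear "classifier": label 0 = intended speech, 1 = false trigger.
logits = mixed @ np.array([[1.0, -1.0], [-1.0, 1.0]])
labels = logits.argmax(axis=1)
print(labels.tolist())  # → [0, 0, 0]
```

Note that node 2's own features alone would put it in class 1, but after borrowing evidence from its neighbor it is classified with the rest — the kind of neighborhood smoothing that makes graph structure useful for labeling nodes without ground truth.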

In future work, the team plans to extend GNN-based processing to other tasks, such as user-intent classification.

Multilingual speaker identification

In a separate paper, Apple researchers explore a speaker language identification system tailored to scenarios involving multilingual speakers. The work was motivated by the observation that language identification systems achieve high accuracy for most language combinations but underperform for others when accented speech is present, they say.

They’re not wrong. In a recent study commissioned by the Washington Post, popular smart speakers made by Google and Amazon were 30% less likely to understand non-American accents than those of native-born users. And corpora like Switchboard, a data set used by companies such as IBM and Microsoft to gauge the error rates of voice models, have been shown to skew measurably toward speakers from particular regions of the country.

The coauthors’ solution incorporates knowledge about usage patterns into a dictation system that’s able to make decisions for speakers across over 60 locales. An acoustic sub-model makes predictions based on the evidence conveyed by the speech signal, and a context-aware prediction component takes into account the assorted interaction context signals. The predictions from both are used to select the optimal monolingual automatic speech recognition system for the given request.

The context signals encompass information about the conditions under which the dictation request was made, including information about installed dictation locales, the currently selected dictation locale, and whether the user toggled the dictation locale before making the request. Importantly, they aid in situations where the speech signal is too short for the acoustic model to produce a reliable prediction — for instance, short ambiguous utterances such as “naIn,” which could be the negative “nein” in German or the number “nine” in English if the user has both English and German installed.
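One simple way to picture the combination of acoustic and context evidence is as a weighted blend of two probability distributions over locales; the locales, probabilities, and blending rule below are hypothetical placeholders, not the paper's method:

```python
# Illustrative sketch: blend acoustic and context predictions, then pick
# the monolingual recognizer for the most likely locale.

def choose_locale(acoustic_probs, context_probs, weight=0.5):
    """Blend the two distributions and return the most likely locale."""
    scores = {
        loc: weight * p + (1 - weight) * context_probs.get(loc, 0.0)
        for loc, p in acoustic_probs.items()
    }
    return max(scores, key=scores.get)

# A short, ambiguous utterance: acoustics alone can't separate
# German "nein" from English "nine"...
acoustic = {"de_DE": 0.5, "en_US": 0.5}
# ...but context (e.g., the currently selected dictation locale) can.
context = {"de_DE": 0.8, "en_US": 0.2}
print(choose_locale(acoustic, context))  # de_DE
```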

To evaluate the system, the researchers developed a custom metric dubbed Average User Accuracy (AUA) that they say better reflects “population-level” usage patterns in models. Trained on an internal corpus of 128,000 dictation utterances from strictly multilingual speakers with corresponding interaction context information, the system achieved an average of 87% accuracy across all language combinations while improving worst-case accuracy by over 60% relative to the baseline. Moreover, after the team tuned parameters to balance accuracy and latency against the computational load of running the model on-device, average latency fell from 2 seconds to 1.2 seconds without hurting AUA by more than 0.05%.
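The name Average User Accuracy suggests a metric that computes accuracy per user first and then averages across users, so heavy users don't dominate the score; the paper's exact definition may differ. A minimal sketch under that assumption:

```python
# Hedged sketch of an "Average User Accuracy"-style metric: our reading
# of the name, not the paper's published formula.

def average_user_accuracy(results):
    """results: {user_id: [(predicted_locale, true_locale), ...]}"""
    per_user = [
        sum(p == t for p, t in pairs) / len(pairs)
        for pairs in results.values()
    ]
    return sum(per_user) / len(per_user)

results = {
    "user_a": [("en_US", "en_US")] * 9 + [("de_DE", "en_US")],  # 90% on 10 requests
    "user_b": [("de_DE", "de_DE"), ("en_US", "de_DE")],         # 50% on 2 requests
}
print(average_user_accuracy(results))  # 0.7, not the pooled 10/12
```

Averaging per user rather than per request is what would make such a metric "population-level": a handful of high-volume users can no longer mask poor accuracy for everyone else.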