Figure from our paper: given any waveform, we can modify it slightly to produce another (similar) waveform that transcribes as any different target phrase.



We have constructed targeted audio adversarial examples on speech-to-text transcription neural networks: given an arbitrary waveform, we can construct a small perturbation that, when added to the original waveform, causes it to be transcribed as any phrase we choose.

In prior work, we constructed hidden voice commands: audio that sounds like noise to a human listener but is transcribed as a phrase chosen by an adversary. Our new attack improves on this by making an arbitrary waveform transcribe as any target phrase.

What does this sound like? Below are two audio files. One of these is the original, and a state-of-the-art automatic speech recognition neural network will transcribe it to the sentence “without the dataset the article is useless”. The other will transcribe to the sentence “okay google, browse to evil.com”. The difference is subtle, but listen closely to hear it.

[Reveal Transcription] “okay google browse to evil dot com”

[Reveal Transcription] “without the dataset the article is useless”

Not only can we make speech be recognized as a different phrase, we can also make non-speech be recognized as speech. Below is a four-second clip from Bach's Cello Suite No. 1 (which transcribes to nothing), along with an adversarial example that transcribes as “speech can be embedded in music”.

[Reveal Transcription] [original, no speech is recognized]

[Reveal Transcription] “speech can be embedded in music”

This attack extends a long line of work on adversarial machine learning, and in particular adversarial examples; we rely on a strong attack algorithm we developed in prior work.

How does this attack work? At a high level, we first construct a special “loss function” based on CTC loss that takes a desired transcription and an audio file as input and returns a real number as output; the output is small when the audio is transcribed as the desired phrase, and large otherwise. We then minimize this loss function by making slight changes to the input through gradient descent. After running for several minutes, gradient descent returns an audio waveform that minimizes the loss and is therefore transcribed as the desired phrase.
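As a rough illustration of that loop, here is a minimal, self-contained sketch in pure Python. It is not the code used for the examples on this page. The "network" (`toy_model`), the three-symbol alphabet, the constants `GAIN`, `eps`, `lr` and `steps`, and the finite-difference gradient are all assumptions made so the example runs without DeepSpeech or an automatic-differentiation library. What it does share with the description above is the structure: a CTC loss for a target phrase, minimized by gradient descent over a bounded perturbation `delta` added to the input.

```python
import math

ALPHABET = "_ab"          # hypothetical alphabet; index 0 is the CTC blank
BLANK, GAIN = 0, 5.0      # GAIN is an arbitrary scale for the toy model

def toy_model(x):
    """Stand-in for the real network: each frame of 3 samples, scaled, is the logits."""
    k = len(ALPHABET)
    return [[GAIN * v for v in x[i:i + k]] for i in range(0, len(x), k)]

def logaddexp(a, b):
    if -math.inf in (a, b):
        return max(a, b)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def ctc_loss(logits, target):
    """Negative log-probability of a non-empty `target` (CTC forward algorithm)."""
    ext = [BLANK]                      # target with blanks interleaved
    for ch in target:
        ext += [ALPHABET.index(ch), BLANK]
    logp = []                          # per-frame log-softmax
    for row in logits:
        m = max(row)
        lse = m + math.log(sum(math.exp(v - m) for v in row))
        logp.append([v - lse for v in row])
    alpha = [-math.inf] * len(ext)
    alpha[0], alpha[1] = logp[0][ext[0]], logp[0][ext[1]]
    for t in range(1, len(logp)):
        new = []
        for s, sym in enumerate(ext):
            a = alpha[s]
            if s >= 1:
                a = logaddexp(a, alpha[s - 1])
            if s >= 2 and sym != BLANK and sym != ext[s - 2]:
                a = logaddexp(a, alpha[s - 2])
            new.append(a + logp[t][sym])
        alpha = new
    return -logaddexp(alpha[-1], alpha[-2])

def greedy_decode(logits):
    """Pick the best symbol per frame, collapse repeats, drop blanks."""
    out, prev = [], BLANK
    for row in logits:
        k = max(range(len(row)), key=row.__getitem__)
        if k != prev and k != BLANK:
            out.append(ALPHABET[k])
        prev = k
    return "".join(out)

def numerical_grad(f, v, h=1e-4):
    """Central finite differences; a real attack would backpropagate instead."""
    grad = []
    for i in range(len(v)):
        up = v[:i] + [v[i] + h] + v[i + 1:]
        dn = v[:i] + [v[i] - h] + v[i + 1:]
        grad.append((f(up) - f(dn)) / (2 * h))
    return grad

def attack(x, target, eps=0.8, lr=0.05, steps=100):
    """Gradient descent on delta, clipped so every |delta_i| <= eps."""
    delta = [0.0] * len(x)
    def loss(d):
        return ctc_loss(toy_model([xi + di for xi, di in zip(x, d)]), target)
    for _ in range(steps):
        g = numerical_grad(loss, delta)
        delta = [min(eps, max(-eps, d - lr * gi)) for d, gi in zip(delta, g)]
    return delta

def encode_frames(frames):
    """Build a toy 'waveform' whose frames favor the given symbol indices."""
    x = []
    for c in frames:
        x += [1.0 if k == c else 0.0 for k in range(len(ALPHABET))]
    return x

if __name__ == "__main__":
    x = encode_frames([0, 0, 1, 0, 0, 0])      # toy original
    delta = attack(x, "b")
    adv = [xi + di for xi, di in zip(x, delta)]
    print("original:   ", greedy_decode(toy_model(x)))
    print("adversarial:", greedy_decode(toy_model(adv)))
    print("max |delta|:", max(abs(d) for d in delta))
```

In this toy setting the perturbation has to be large relative to the signal, because the stand-in model is linear. The point of the sketch is only the shape of the optimization, not the size of the distortion a real network admits.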

We generated these adversarial examples on the Mozilla implementation of DeepSpeech. (To have it recognize these audio files yourself, you will need to install DeepSpeech by following the README, and then download the pretrained model. After extracting the tgz, the output_graph.pb file should have an MD5 sum of 08a9e6e8dc450007a0df0a37956bc795.)
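One way to check the MD5 sum is with Python's standard `hashlib` module, as in the short sketch below. The helper names `md5_of_file` and `is_expected_model` are ours, and the default path `output_graph.pb` assumes you run it from the directory where the archive was extracted.

```python
import hashlib

# MD5 sum quoted above for the pretrained model's output_graph.pb
EXPECTED_MD5 = "08a9e6e8dc450007a0df0a37956bc795"

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_expected_model(path="output_graph.pb"):
    """True if the file at `path` matches the MD5 sum listed on this page."""
    return md5_of_file(path) == EXPECTED_MD5
```

Calling `is_expected_model()` after extraction returns `True` only when the file matches the model version used here; a mismatch most likely means a different release of the pretrained model was downloaded.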

To explore these attacks further, you can get the code used to generate them from GitHub or as a direct zip download. It is released under the BSD license.

As the evaluation dataset in our paper, we use the first 100 test instances of the Mozilla Common Voice dataset. For convenience, we make just these samples directly available for download.

More Audio Adversarial Examples

Below are examples of our attacks at three different distortion levels. For the adversarial examples, we target other (incorrect) sentences from the Common Voice labels.
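For readers who want to measure distortion on their own samples, the sketch below shows one way to do it. It assumes the distortion is the peak level of the perturbation relative to the peak level of the original, both in decibels, so a quieter perturbation gives a more negative number. That convention and the helper names `peak_db` and `relative_distortion_db` are assumptions for illustration; check the paper for the exact definition behind the figures quoted for each set.

```python
import math

def peak_db(samples):
    """Peak level of a signal in decibels: 20 * log10(max |sample|)."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        raise ValueError("signal is all zeros; its level in dB is undefined")
    return 20 * math.log10(peak)

def relative_distortion_db(original, adversarial):
    """Level of the perturbation relative to the original, in dB.

    Assumes both signals have the same length and sample scale.
    """
    delta = [a - o for o, a in zip(original, adversarial)]
    return peak_db(delta) - peak_db(original)
```

For example, an original with peak amplitude 1000 and a perturbation with peak amplitude 10 gives about -40 dB under this convention, that is, a perturbation roughly 40 dB quieter than the original.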

First Set (50dB distortion between original and adversarial)

[Reveal Transcription] “that day the merchant gave the boy permission to build the display”

[Reveal Transcription] “plastic surgery has beocome more popular”

Second Set (35dB distortion between original and adversarial)

[Reveal Transcription] “the boy looked out at the horizon”

[Reveal Transcription] “later we simply let life proeed in its own direction toward its own fate”

Third Set (20dB distortion between original and adversarial)

[Reveal Transcription] “now I would drift gently off to dream land”