WHEN SPEAKERS ARE ALL EARS

Understanding when smart speakers mistakenly record conversations

Daniel J. Dubois (Northeastern University), Roman Kolcun (Imperial College London), Anna Maria Mandalari (Imperial College London), Muhammad Talha Paracha (Northeastern University), David Choffnes (Northeastern University), Hamed Haddadi (Imperial College London)

Last updated: 07/21/2020

SUMMARY

Voice assistants such as Amazon’s Alexa, Google Assistant, Apple’s Siri, and Microsoft’s Cortana are becoming increasingly pervasive in our homes, offices, and public spaces. While convenient, these systems also raise important privacy concerns: what exactly are these systems recording from their surroundings, and does that include sensitive and personal conversations that were never meant to be shared with companies or their contractors? These aren’t just hypothetical concerns from paranoid users: there have been a slew of recent reports about devices constantly recording audio, and about cloud providers outsourcing the transcription of recordings of private and intimate interactions to contractors. Recent shifts to working from home make these issues more acute, as business conversations previously confined to places of work may be recorded by home devices.

Anyone who has used voice assistants knows that they accidentally wake up and record when the “wake word” isn’t spoken—for example, depending on the accent, “I’m sorry” may sound like the wake word “Hey Siri”, which causes Apple’s Siri-enabled devices to start listening. There are many other anecdotal reports of everyday words in normal conversation being mistaken for wake words. For the past year, our team has been conducting research to go beyond anecdotes through the use of repeatable, controlled experiments that shed light on what causes voice assistants to mistakenly wake up and record. Below, we provide a brief summary of our approach, findings so far, and their implications.

GOALS AND APPROACH

Goals: The main goals of our research are to detect if, how, when, and why smart speakers are unexpectedly recording audio from their environment (we call this activation). We are also interested in whether there are trends based on certain non-wake words, type of conversation, location, and other factors.

Approach: When figuring out what smart speakers listen and wake up to, we need to expose them to spoken words. And if we are to uncover any patterns in what causes devices to wake up, we further need repeatable, native-speaker, conversational audio—along with corresponding text that was spoken at each moment. In theory, we could accomplish this using researchers who speak from scripts, but this would take an enormous amount of time and would cover only a small number of people’s voices.

Instead, we came up with a much simpler approach: we turn to popular TV shows containing reasonably large amounts of dialogue. Namely, our experiments use 134 hours of Netflix content from a variety of themes/genres, and we repeat the tests multiple times to understand which non-wake words consistently lead to activations and voice recording.

Show: Category

Gilmore Girls: Comedy, Drama

Grey’s Anatomy: Medical drama

The L Word: Drama, Romance

The Office: Comedy

Greenleaf: Drama

Dear White People: Comedy, Drama

Riverdale: Crime, Drama, Mystery

Jane the Virgin: Comedy

Friday Night Tykes: Reality TV

Big Bang Theory: Comedy, Romance

The West Wing: Political Drama

Narcos: Crime drama

We also need ways to detect when smart speakers are recording audio. For this we use several approaches, including capturing video feeds of the devices (to detect lighting up when activated), network traffic (to detect audio data sent to the cloud), and self-reported recordings from smart speakers’ cloud services (when available). We remove cases where the wake word was spoken in TV shows. Finally, we use closed caption text from each TV show episode to automatically extract which spoken words caused each activation.
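The caption-matching step described above can be sketched as follows. This is only an illustration, not the code used in the study: the `Cue` structure, the 3-second look-back window, the substring-based wake word check, and all sample data are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds from the start of the episode
    end: float
    text: str


def words_near_activation(cues, activation_time, window=3.0):
    """Return caption text overlapping [activation_time - window, activation_time]."""
    lo = activation_time - window
    hits = [c.text for c in cues if c.end >= lo and c.start <= activation_time]
    return " ".join(hits)


def contains_wake_word(text, wake_words):
    """True if any wake word occurs in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(w.lower() in lowered for w in wake_words)


# Hypothetical caption cues and activation timestamps, for illustration only.
cues = [
    Cue(10.0, 12.5, "I can't live here."),
    Cue(13.0, 15.0, "Then come with me."),
    Cue(40.0, 42.0, "Alexa, play some music."),
]
activations = [12.8, 42.5]
wake_words = ["alexa", "amazon", "echo", "computer"]

for t in activations:
    text = words_near_activation(cues, t)
    if contains_wake_word(text, wake_words):
        continue  # the wake word really was spoken, so this is not a misactivation
    print(f"{t:.1f}s: {text}")
```

A substring check is deliberately crude: it would also discard captions containing words such as “echoes”, so a real pipeline would likely need word-level matching.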

Testbed:

We focused only on voice assistants installed on the following stand-alone smart speakers:

Google Home Mini (wake word: OK/Hey/Hi Google)

Apple Homepod (wake word: Hey, Siri)

Harman Kardon Invoke by Microsoft (wake word: Cortana)

2 Amazon Echo Dot 2nd generation (wake words: Alexa, Amazon, Echo, Computer)

2 Amazon Echo Dot 3rd generation (wake words: Alexa, Amazon, Echo, Computer)

To conduct our measurements, we needed to build a custom monitoring system consisting of smart speakers, a camera to detect when they light up, a speaker to play the audio from TV shows, a microphone to monitor what audio the speakers play (such as responses to commands), and a wireless access point that records all network traffic between the devices and the Internet. Our main testbed has been deployed at Northeastern University’s Mon(IoT)r Lab in Boston (US). A copy of the testbed has also been deployed at Imperial College London (UK), which we used to repeat all the US experiments to see whether smart speakers marketed and configured for the UK market behave differently from the US ones.

A picture of our US testbed: camera on the top to detect activations, speakers on the left to play video material, smart speakers under test on the right.

An example of video capture of an activation: the Amazon Echo Dot device in the center is lighting up, signaling a voice activation on 11/24/2019 at 09:52:22.

KEY FINDINGS

Below is a list of some of our findings. Everything described here is based on activations when the wake word was not spoken. Of course, all our findings pertain only to the source material (audio from selected TV shows) and we cannot make claims about more general trends.

Are these devices constantly recording our conversations? In short, we found no evidence to support this. The devices do wake up frequently, but often for short intervals (with some exceptions).

How frequently do devices activate? If we consider individual shows, a notable case is Google Home Mini, which while playing The West Wing exhibited 0.95 average activations per hour. If we consider all the shows, the devices that activated the most were the Invoke/Cortana and Echo Dot 2nd generation with “Echo” wake word (0.40 activations per hour), followed by Homepod (0.38 activations per hour).

How consistently do they activate during a conversation? The majority of activations do not occur consistently. We repeated our experiments 12 times (4 times for Invoke/Cortana), and most activations appeared in less than 25% of our experiments, meaning that the most common behavior is that the same audio sometimes activates the device and sometimes does not. This could be due to some randomness in the way smart speakers detect wake words, for example due to the random information loss that occurs when converting analog audio from the microphones to digital audio. Another explanation is that smart speakers may learn from previous mistakes and change the way they detect wake words. Although they are a minority, there are also notable cases of consistent activations. For example, 20.7% of Google Home Mini activations and 17.7% of Homepod activations appear in more than 75% of our experiments.
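To make the consistency measure concrete, here is a minimal sketch of how the share of rarely and frequently repeated activations could be computed. It is an illustration under stated assumptions rather than the study's analysis code: the activation keys (episode, offset in seconds), the function names, and the sample data are hypothetical.

```python
def consistency(appearances, n_runs):
    """Map each activation to the fraction of runs in which it appeared."""
    return {key: len(set(run_ids)) / n_runs for key, run_ids in appearances.items()}


def share(fractions, predicate):
    """Fraction of distinct activations whose consistency satisfies predicate."""
    values = list(fractions.values())
    return sum(1 for v in values if predicate(v)) / len(values)


# Hypothetical data: for each activation, the indices of the repetitions
# (out of 12) in which it was observed.
N_RUNS = 12
appearances = {
    ("ep1", 120): list(range(12)),       # seen in 12 of 12 runs
    ("ep1", 300): [3],                   # seen in 1 of 12 runs
    ("ep2", 45): [0, 7],                 # seen in 2 of 12 runs
    ("ep3", 10): list(range(10)),        # seen in 10 of 12 runs
    ("ep4", 75): [0, 2, 4, 6, 8, 10],    # seen in 6 of 12 runs
}

fractions = consistency(appearances, N_RUNS)
rare = share(fractions, lambda v: v < 0.25)      # appear in less than 25% of runs
frequent = share(fractions, lambda v: v > 0.75)  # appear in more than 75% of runs
print(f"rare: {rare:.0%}, frequent: {frequent:.0%}")
```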

Do they have any secret wake words? We did not find any clear evidence of consistent undocumented wake words that are malicious or completely unrelated to the real ones. Instead, we found evidence of attempts from the manufacturers to detect some variations of their known wake words. However, it is also possible for someone (for example the author of a TV commercial or YouTube video) to “craft” wake words by exploiting wake word similarities and use them to activate user-owned smart speakers without the user suspecting that it was intentional.

Are there specific TV shows that cause more overall activations than others? If so, why? The West Wing has caused the highest number of average activations over time: 4.26 activations per hour if we consider the sum of activations across all the devices we tested. Note that The West Wing is among the shows with the highest density of dialogue (145 words per minute). If we consider the amount of dialogue, Narcos caused the highest number of activations: 6.21 activations per 10,000 words of dialogue.

We then looked at other shows with a similarly high dialogue density (such as Gilmore Girls and The Office) and found that they also have a high number of activations, which suggests that the number of activations is at least in part related to the density of dialogue. However, we have also noticed that if we consider just the amount of dialogue (in number of words), Narcos is the one that triggers the most activations, even if it has the lowest dialogue density.

We investigated the actual dialogue that produced Narcos’ activations and we have seen that it was mostly Spanish dialogue and poorly pronounced English dialogue. This suggests that, in general, words that are not pronounced clearly may lead to more unwanted activations.
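The two normalizations used above (activations per hour and activations per 10,000 words of dialogue) can rank shows differently. The sketch below shows why, using made-up numbers: the show names, hours, word counts, and activation counts are hypothetical and are not measurements from our experiments.

```python
def per_hour(activations, hours):
    """Activations normalized by playback time."""
    return activations / hours


def per_10k_words(activations, words):
    """Activations normalized by the amount of dialogue."""
    return activations * 10_000 / words


def words_per_minute(words, hours):
    """Dialogue density."""
    return words / (hours * 60)


# Hypothetical shows: a dialogue-dense one and a dialogue-sparse one.
shows = {
    "dense show": {"hours": 10, "words": 87_000, "activations": 40},
    "sparse show": {"hours": 10, "words": 30_000, "activations": 18},
}

for name, s in shows.items():
    print(
        f"{name}: {words_per_minute(s['words'], s['hours']):.0f} words/min, "
        f"{per_hour(s['activations'], s['hours']):.2f} activations/hour, "
        f"{per_10k_words(s['activations'], s['words']):.2f} activations/10,000 words"
    )

top_by_time = max(shows, key=lambda n: per_hour(shows[n]["activations"], shows[n]["hours"]))
top_by_words = max(shows, key=lambda n: per_10k_words(shows[n]["activations"], shows[n]["words"]))
```

In this toy data the dense show leads per hour while the sparse show leads per 10,000 words, which mirrors the kind of ranking change described above.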

Do specific TV shows cause more activations for a given wake word? Yes. For each wake word, a different show causes the most activations.

Are there any TV shows that do not cause activations? No. All shows cause at least one device to wake up at least once. Almost every TV show causes multiple devices to wake up.

Do activations depend on the TV show character’s accent, ethnicity, gender, or other factors? We have found evidence that (English language) smart speakers activate more when they are exposed to unclear dialogue, such as a foreign language or garbled speech. This suggests that smart speaker users who do not speak English clearly, or who are farther away from the smart speaker (lower voice volume), may have an additional risk of unintentionally activating the device.

Are activations long enough to record sensitive audio from the environment? Yes, we have found several cases of long activations: 10% of the activations were at least 10 seconds long for the Homepod, 9 seconds for Google Home Mini, and 8 seconds for Echo Dot 2nd generation with “Echo” wake word. Half of the activations for Homepod and Echo Dot 2nd generation (Alexa and Computer wake words) were also at least 4 seconds long. During our experiments, we have also seen rare cases of activations lasting up to 43 seconds; however, such cases – which also appeared in our preliminary findings – represent situations that only happened in a single experiment, and therefore we have decided to consider them as outliers.
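Statements of the form “10% of the activations were at least N seconds long” correspond to reading a value off the sorted list of activation durations. The sketch below shows one way to compute it; the function name and the sample durations are hypothetical, and this is not necessarily how the figures above were derived.

```python
import math


def duration_reached_by(durations, fraction):
    """Largest duration d such that at least `fraction` of activations last >= d."""
    if not durations:
        raise ValueError("no durations given")
    ordered = sorted(durations, reverse=True)
    # Number of longest activations that make up the requested fraction;
    # rounding guards against floating-point noise such as 0.1 * 30.
    k = max(1, math.ceil(round(fraction * len(ordered), 9)))
    return ordered[k - 1]


# Hypothetical activation durations in seconds.
durations = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 6.0, 8.0, 11.0]

top_10_percent = duration_reached_by(durations, 0.10)
top_half = duration_reached_by(durations, 0.50)
print(f"10% of activations last at least {top_10_percent} s")
print(f"50% of activations last at least {top_half} s")
```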

How many activations lead to audio recordings being sent to the cloud vs. processed only on the smart speaker? We have found that almost all activations that are detected locally (device lit up) are also sent to the cloud.

Do cloud providers correctly show all cases of audio recording to users? For Amazon and Google smart speakers the answer is yes: they both show activations that match the ones we detected using camera and network traffic. For the Homepod and the Invoke, the answer is no, since they do not allow users to view the recordings that are stored in the cloud.

Do smart speakers adapt to observed audio and change whether they activate over time? We have found some evidence that Amazon devices are adapting to observed audio since they activate less often when we repeat the experiments, meaning that they may be building voice profiles of their users to improve recognition. It is possible that other devices adapt as well, but we did not find significant evidence from our study.

Is there any difference in how US smart speakers activate with respect to UK ones? Both testbeds showed the presence of unintentional activations; however, UK activations were significantly different from the US ones, meaning that either the region or other differences in the test environment play a role in how devices activate.

What kind of non-wake words consistently cause long activations? We found several patterns for non-wake words causing activations that can be reproduced at least three times during our experiments. Our PETS 2020 paper contains an appendix with full closed captions for the most reproducible activations. Our findings are summarized as follows.

For the Google Home Mini, these activations commonly occurred when the dialogue included words rhyming with “Hey” or “Hi” (e.g., “They” or “I”), followed by hard “G” or something containing “ol”. Examples include “okay … to go”, “maybe I don’t like the cold”, “they’re capable of”, “yeah … good weird”, “hey … you told”, “A-P … I won’t hold”.

For the Apple Homepod, activations occurred with words rhyming with “Hey” or “Hi” (e.g., “They” or “I”), followed by a voiceless “s”/“f”/“th” sound and an “i”/“ee” vowel. Examples include “hey … missy”, “they … sex, right?”, “hey, Charity”, “they … secretly”, “I’m sorry”, “hey … is here”, “yeah. I was thinking”, “Hi. Mrs. Kim”, “they say … was a sign”, “hey, how you feeling”.

For Invoke (powered by Cortana), we found activations with words containing a “K” sound closely followed by an “R” or a “T”. Examples include “take a break … take a”, “lecture on”, “quartet”, “courtesy”, “according to”.

For Amazon devices, we observed different activating patterns based on the wake word.

For the Alexa wake word, we found activations with sentences starting with “I” followed by a “K” or a voiceless “S”. Examples include “I care about”, “I messed up”, “I got something”, “it feels like I’m”.

For the Echo wake word, we found activations with words containing a vowel plus “k” or “g” sounds. Examples include “head coach”, “he was quiet”, “I got”, “picking”, “that cool”, “pickle”, “Hey, Co.”.

For the Computer wake word, we found activations with words starting with “comp” or rhyming with “here”/“ear”. Examples include “Comparisons”, “I can’t live here”, “come here”, “come onboard”, “nuclear accident”, “going camping”, “what about here?”.

For the Amazon wake word, we found activations with sentences containing combinations of “was”/“as”/“goes”/“some” or “I’m” followed by “s”, or words ending in “on/om”. Examples include “it was a”, “I’m sorry”, “just … you swear you won’t”, “I was in”, “what was off”, “life goes on”, “have you come as”, “want some water?”, “he was home”.



OUR PETS PAPER

Our research will be published in the proceedings of the 20th Privacy Enhancing Technologies Symposium (PETS 2020).

Paper title: When Speakers Are All Ears: Characterizing Misactivations of IoT Smart Speakers

Authors: Daniel J. Dubois (Northeastern University), Roman Kolcun (Imperial College London), Anna Maria Mandalari (Imperial College London), Muhammad Talha Paracha (Northeastern University), David Choffnes (Northeastern University), Hamed Haddadi (Imperial College London)

Abstract: Internet-connected voice-controlled speakers, also known as smart speakers, are increasingly popular due to their convenience for everyday tasks such as asking about the weather forecast or playing music. However, such convenience comes with privacy risks: smart speakers need to constantly listen in order to activate when the “wake word” is spoken, and are known to transmit audio from their environment and record it on cloud servers. In particular, this paper focuses on the privacy risk from smart speaker misactivations, i.e., when they activate, transmit, and/or record audio from their environment when the wake word is not spoken. To enable repeatable, scalable experiments for exposing smart speakers to conversations that do not contain wake words, we turn to playing audio from popular TV shows from diverse genres. After playing two rounds of 134 hours of content from 12 TV shows near popular smart speakers in both the US and in the UK, we observed cases of 0.95 misactivations per hour, or 1.43 times for every 10,000 words spoken, with some devices having 10% of their misactivation durations lasting at least 10 seconds. We characterize the sources of such misactivations and their implications for consumers, and discuss potential mitigations.

Full Text (PDF): pre-print available.

Presentation: available on YouTube.

Software and data: available on Github.

Citation:

@inproceedings{dubois-pets20,
  title={{When Speakers Are All Ears: Characterizing Misactivations of IoT Smart Speakers}},
  author={Dubois, Daniel J. and Kolcun, Roman and Mandalari, Anna Maria and Paracha, Muhammad Talha and Choffnes, David and Haddadi, Hamed},
  booktitle={Proc. of the Privacy Enhancing Technologies Symposium (PETS)},
  year={2020}
}

ONGOING WORK

There are several other important open questions that we are in the process of answering as part of future research, such as:

How do smart speakers react to other stimuli, such as non-verbal noises, a dictionary of words, and voices using different languages and accents?

Can such stimuli identify undocumented wake words or sounds?

Do such stimuli reveal discriminatory biases in the respective smart speaker ecosystems?

How do smart speaker ecosystems use and share the data they gather from their environments?

We will provide further updates to this page when we have more details to share.
