Participants

A total of 46 participants completed the study (30 females; 16 males; age range = 23–57 years, M (age) = 34.2, SD (age) = 8.0). The study took place at Klick Inc., a technology, media, and research company in the healthcare sector based in Toronto, Canada. All participants were employees of Klick Inc. and were recruited via the company’s intranet, which carries newsfeed posts about workplace and social events. All participants provided written informed consent and volunteered their time without compensation. The study was performed in accordance with relevant guidelines and regulations, and received full ethics approval from Advarra IRB Services (www.advarra.com/services/irb-services/), an independent ethics committee that reviewed the study.

Although all participants could speak English fluently, 12 participants had audible accents (as assessed by the authors) that differed from the typical “Canadian accent” (specifically that of Toronto and Southern Ontario), regardless of whether English was their native language. Of the 12 participants with a foreign (non-Canadian) accent, 2 were born in the Philippines, 1 in Albania, 1 in China, 1 in El Salvador, and 1 in Romania, with English not being their native language; 2 were born in England, 2 in Scotland, 1 in Botswana, and 1 in Eritrea, with English as their first language. Thus, 4 participants had a United Kingdom accent, 3 a Spanish/Filipino accent, 2 an African accent, 2 an Eastern European accent, and 1 a Chinese accent.

The remaining 34 participants had a typical Canadian accent, more specifically the dialect found in the Greater Toronto Area of Southern Ontario, regardless of their place of origin. Of these 34 participants, 28 were born in Canada, 1 in the US, 1 in England, 1 in France, and 1 in Israel, with English as their native language; 1 was born in Brazil and 1 in the Philippines, with English not being their native language.

Materials

In addition to demographic characteristics, health literacy was evaluated using the Rapid Estimate of Adult Literacy in Medicine (REALM) questionnaire,15 and participants reported how frequently they used voice assistants (Table 6).

Apparatus

Instead of speaking live into each voice assistant, participants’ voices were recorded in order to use the same audio clips to play back to each device during analyses. Participant voice recordings were captured via an Audio-Technica AT2020 Cardioid Condenser Microphone and QuickTime software on a MacBook Air laptop computer. During analyses, voice recordings were played back from the laptop using a Jabra Speak 410 speaker that was placed directly adjacent to the microphones of the voice assistant devices.

Alexa (Amazon), Google Assistant (Google), and Siri (Apple) were selected as voice assistants since they were the most popular and widely used consumer options at the time of the study, which took place from December 2018 to January 2019. Alexa was analyzed using a first-generation Amazon Echo smart speaker; Google Assistant was analyzed using a Samsung Galaxy S8 smartphone; Siri was analyzed using an iPhone 6 smartphone. All hardware devices were updated to the latest available software, and device language was set to English (Canada). All three assistants were connected to the internet using the Klick Inc. network.

Procedure

After informed consent, participants completed the REALM questionnaire and reported their demographic information. From this point onward, all participant responses were audio-recorded to play back to each voice assistant in a controlled manner. To assess the baseline comprehension performance of the voice assistants, each participant asked three calibration questions (“What day is it today?”, “What is 10 + 10?”, “What is the capital of France?”). Calibration questions were used primarily to test the speech recognition quality of the devices and to make sure the system was working properly for each participant’s voice.

Participants were then instructed to read aloud a list of all the brand name medications (Supplementary Table 2), followed by a list of all the generic name medications (Supplementary Table 1). Brand names were always read before generic names since they were assumed to be easier to read and were expected to “warm up” the participants for pronouncing complex medication names. The presentation order of the medication names within each list was randomized for each participant. For each medication name, participants were instructed to state the phrase “Tell me about…” followed by the drug name on the list (e.g., “Tell me about acetaminophen”). Participants were asked to pronounce each name as best they could, and if they felt that they had mispronounced a word, they were welcome to say it again. No feedback was given to participants after each medication name as to whether they had pronounced the word correctly. No maximum number of pronunciation attempts was imposed, but the great majority of participants said each name just once, and usually no more than two or three times. Only their best recording for each name was used for analyses.
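The presentation scheme described above (brand names first, then generic names, with the order shuffled within each list for every participant) can be sketched as follows. This is a minimal illustration rather than the procedure actually used in the study: the function name `build_prompts`, the seeding by participant ID, and the short example lists are all hypothetical stand-ins for the lists in the Supplementary Tables.

```python
import random


def build_prompts(brand_names, generic_names, participant_id):
    """Return the spoken prompts for one participant.

    Brand names always precede generic names; the order within each
    list is shuffled independently for every participant.
    """
    # Seeding with the participant ID is an assumption made here so the
    # order is reproducible; the source does not say how it randomized.
    rng = random.Random(participant_id)
    brand = list(brand_names)
    generic = list(generic_names)
    rng.shuffle(brand)
    rng.shuffle(generic)
    return [f"Tell me about {name}" for name in brand + generic]


# Placeholder lists for illustration only (not the study's full lists).
BRAND = ["Tylenol", "Xanax", "Lipitor"]
GENERIC = ["acetaminophen", "alprazolam", "atorvastatin"]

prompts = build_prompts(BRAND, GENERIC, participant_id=1)
```

Copying the input lists before shuffling keeps the master lists intact, so every participant's order is drawn from the same starting point.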

Data analysis

All voice recordings were analyzed between mid-December 2018 and mid-January 2019, using the latest software updates of all the voice assistants available before analyses began. No manual updates were performed on the devices during the analysis period. The relatively short time frame of the analysis was intended to minimize any potential improvements to each company’s algorithms during that period.

Calibration questions from each participant were first played back to each device to assess the baseline comprehension performance of the voice assistants, as well as to adjust the volume of the audio-recording playback if needed. All 46 participants elicited 100% comprehension accuracy on all three voice assistants for each calibration question. In other words, all of the voice assistants yielded perfect speech recognition accuracy for each participant on generic queries. Therefore, it could be inferred that any reduction in comprehension accuracy for a medication name would reflect the ability of the AI software to recognize a complex drug name, and not the hardware used for recording and playback or the incomprehensibility of the participants’ voices. Additionally, calibration questions were purposely designed to be easy to pronounce, using common words and phrases, so that the baseline comprehension performance of each voice assistant would be 100% accuracy, as intended. Although medication names differ from the calibration statements in commonality and complexity, the purpose of the current study was to test any detriment in speech recognition for drug names relative to the baseline performance on common voice assistant commands for everyday tasks.

Individual audio clips of each drug name (e.g., “Tell me about [medication name]”) from each participant were played back to each voice assistant. Although the reading ability of medication names by the participants was not directly being tested, their audio clips were scored by the authors using established norms16,17 as to whether the names were pronounced correctly or not. Each medication name pronunciation was scored in one of three ways:

i. Incorrect: the participant did not pronounce the word correctly at all (e.g., by missing syllables, adding extra letters, or rearranging the phonetic pronunciation of each syllable).
ii. Partially correct: the participant pronounced each syllable of the word correctly, but placed the emphasis on the wrong syllable (e.g., pronouncing alprazolam as “al-pra-ZO-lam” as opposed to the correct “al-PRA-zo-lam”).
iii. Fully correct: the participant pronounced the word correctly, including the proper enunciation of each syllable in the word.

Arguably, pronunciations scored as ii) and iii) are both “correct”. As an analogy, the word “tomato” is commonly pronounced “tuh-MAY-toh” in North America but “tuh-MAH-toh” in the UK, and a restaurant chef would likely understand a patron’s order regardless of which pronunciation is used or which country they are in. Similarly, a pharmacist or physician is likely to understand a medication name regardless of which syllable a patient emphasizes, as long as all the correct syllables are present. For statistical analyses, all of the participants’ voice recordings and pronunciations were played back to the voice assistants (Table 1). That is, incorrect pronunciations were not excluded, so as to represent real-world behavior when speaking complex medication names and to examine the AI capabilities of the software (see Results for further analyses); voice assistants were sometimes able to correctly recognize a word even when the participant pronounced it incorrectly.

After the audio clips of each medication name were played back to the voice assistants, the devices’ responses were also scored in one of three ways:

i. No response: the voice assistant did not recognize any form of the query (e.g., by stating, “Sorry, I don’t know that”), and did not display any results.
ii. Misinterpreted: the voice assistant comprehended a different word or sentence and responded inaccurately with an irrelevant answer.
iii. Accurate: the voice assistant comprehended the medication name accurately, and provided a relevant response based on the drug.

Only response type iii) was scored as a valid response and served as the main dependent variable, comprehension accuracy. Voice assistant responses were not scored on the quality or usefulness of the information received or on where the AI system sourced its information (e.g., Wikipedia, WebMD, World Health Organization, etc.). The only variable of interest was whether the voice assistant accurately recognized the medication name when spoken by a variety of individuals. No feedback was given to the voice assistants after each response, so as not to potentially alter their algorithms or response patterns.
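As a minimal sketch of the scoring rule just described, the three response types can be encoded as categories and comprehension accuracy computed as the share of type iii) responses. The names `Response` and `comprehension_accuracy`, and the example scores, are hypothetical and introduced only for illustration; the source does not describe how the scores were tabulated.

```python
from enum import Enum


class Response(Enum):
    """The three device response categories used for scoring."""
    NO_RESPONSE = "i"
    MISINTERPRETED = "ii"
    ACCURATE = "iii"


def comprehension_accuracy(responses):
    """Proportion of responses scored as accurate (type iii)."""
    if not responses:
        raise ValueError("no responses to score")
    hits = sum(1 for r in responses if r is Response.ACCURATE)
    return hits / len(responses)


# Made-up scores for one participant and one assistant (illustration only).
example = [
    Response.ACCURATE,
    Response.MISINTERPRETED,
    Response.ACCURATE,
    Response.NO_RESPONSE,
]
rate = comprehension_accuracy(example)  # 2 of the 4 responses are accurate
```

Both error categories count equally against accuracy here, which matches the statement that only type iii) was treated as a valid response.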

No statistical comparisons were conducted on the error rates, that is, misinterpreted responses (type ii above) or no responses (type i above), largely because Alexa was the only voice assistant that yielded “no response” scores (see Tables 2 and 3). By contrast, Google Assistant and Siri always returned a misinterpreted response when they made a speech recognition error or were faced with an incomprehensible medication name. Therefore, the misinterpreted response rates of Google Assistant and Siri are the exact complement of their accurate response rates, and were not reported due to redundancy.
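The redundancy argument can be written out explicitly. Using symbols introduced here for illustration only (they do not appear in the source), let $p_{\text{acc}}$, $p_{\text{mis}}$, and $p_{\text{none}}$ denote the proportions of accurate, misinterpreted, and no-response scores for a given voice assistant. Since every response falls into exactly one of the three categories:

```latex
p_{\text{acc}} + p_{\text{mis}} + p_{\text{none}} = 1
\quad\Longrightarrow\quad
p_{\text{mis}} = 1 - p_{\text{acc}}
\quad \text{when } p_{\text{none}} = 0 .
```

For Google Assistant and Siri, which never yielded a “no response” score, the misinterpreted rate therefore carries no information beyond the accuracy rate; for Alexa, $p_{\text{none}} > 0$ and the two error types are not interchangeable.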

Comprehension accuracy rates (i.e., accurate responses) were analyzed with a 2 (name type: brand medication, generic medication) × 3 (voice assistant: Alexa, Google Assistant, Siri) repeated measures analysis of variance (ANOVA), with participant accent (Canadian accent, foreign accent) as a between-subjects factor. Post-hoc t-tests (two-sided) were used to analyze differences in comprehension accuracy across voice assistants. Analyses revealed no significant effects of participant age or gender on comprehension accuracy across voice assistants.
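To illustrate the kind of post-hoc comparison mentioned above, the sketch below computes a paired t statistic for two voice assistants. It assumes the post-hoc tests were paired, which seems natural because the same recordings were played to every assistant, although the source does not state this. The function name `paired_t` and the accuracy values are hypothetical, and the two-sided p-value (which requires the t distribution) is deliberately left out.

```python
import math
import statistics


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("differences have zero variance")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Made-up per-participant accuracy rates for two assistants
# (illustration only, not data from the study).
assistant_a = [0.9, 0.8, 1.0, 0.7]
assistant_b = [0.6, 0.5, 0.8, 0.6]
t_stat, df = paired_t(assistant_a, assistant_b)
```

Working on the within-participant differences removes between-participant variability, such as overall pronunciation skill, from the comparison, which is the usual reason to prefer a paired test in a repeated measures design.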

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.