There's been a fair amount of media interest in a recent study suggesting that dyslexics are worse than controls at (certain kinds of) speaker recognition. This is an interesting study in itself, which is why it made it into Science. But I'm just as interested in its uptake in the popular press, which mostly ranged from "missing the point" to "catastrophic confusion" (and you may not be surprised to learn where on the spectrum the BBC's coverage landed, alas). I'll discuss the study itself here, and then take up the press coverage in another post.

The work in question is Tyler K. Perrachione, Stephanie N. Del Tufo, & John D. E. Gabrieli, "Human Voice Recognition Depends on Language Ability", Science 7/29/2011:

The ability to recognize people by their voice is an important social behavior. Individuals differ in how they pronounce words, and listeners may take advantage of language-specific knowledge of speech phonology to facilitate recognizing voices. Impaired phonological processing is characteristic of dyslexia and thought to be a basis for difficulty in learning to read. We tested voice-recognition abilities of dyslexic and control listeners for voices speaking listeners’ native language or an unfamiliar language. Individuals with dyslexia exhibited impaired voice-recognition abilities compared with controls only for voices speaking their native language. These results demonstrate the importance of linguistic representations for voice recognition. Humans appear to identify voices by making comparisons between talkers’ pronunciations of words and listeners’ stored abstract representations of the sounds in those words.

In interpreting this work, we should start with some of the implicit background. The current scientific consensus on dyslexia is clearly expressed in the National Institutes of Health's PubMed Health page:

Developmental reading disorder (DRD), or dyslexia, occurs when there is a problem in areas of the brain that help interpret language. It is not caused by vision problems. The disorder is a specific information processing problem that does not interfere with one's ability to think or to understand complex ideas. Most people with DRD have normal intelligence, and many have above-average intelligence. […]

A person with DRD may have trouble rhyming and separating sounds that make up spoken words. These abilities appear to be critical in the process of learning to read. A child's initial reading skills are based on word recognition, which involves being able to separate out the sounds in words and match them with letters and groups of letters.

Because people with DRD have difficulty connecting the sounds of language to the letters of words, they may have difficulty understanding sentences.

True dyslexia is much broader than simply confusing or transposing letters, for example mistaking "b" and "d".

In general, symptoms of DRD may include:

Difficulty determining the meaning (idea content) of a simple sentence

Difficulty learning to recognize written words

Difficulty rhyming

You'll find a similar perspective in The International Dyslexia Association's FAQ.

An older view, which remains strong in the popular imagination and in some corners of the specialist literature, is that reading-specific difficulties are caused by visual problems, especially by a propensity to reverse letters. The NIH PubMed page goes out of its way to reject this idea specifically, and Perrachione et al. endorse the current scholarly consensus in the phrase "Impaired phonological processing is characteristic of dyslexia and thought to be a basis for difficulty in learning to read".

Another now-generally-debunked theory is that reading difficulties are caused by children skipping the crawling phase of motor development. There are still people out there who offer therapy for poor readers in the form of re-learning to crawl, although as far as I know, there is no good evidence that this works. The conceptual link between crawling and reading is much less intuitive than the link between letter-recognition and reading, so the motor-development idea is much less common than the letter-reversal idea.

Turning to the problem of speaker recognition (or "voice recognition", as Perrachione et al. confusingly call it), it's plausible that people should be better at doing this in a language they know well than in one that they don't. This was confirmed by J.P. Goggin et al., "The role of language familiarity in voice identification", Memory and Cognition 1991:

Four experiments examined the effects of language characteristics on voice identification. In Experiment 1, monolingual English listeners identified bilinguals' voices much better when they spoke English than when they spoke German. The opposite outcome was found in Experiment 2, in which the listeners were monolingual in German. In Experiment 3, monolingual English listeners also showed better voice identification when bilinguals spoke a familiar language (English) than when they spoke an unfamiliar one (Spanish). However, English-Spanish bilinguals hearing the same voices showed a different pattern, with the English-Spanish difference being statistically eliminated. Finally, Experiment 4 demonstrated that, for English-dominant listeners, voice recognition deteriorates systematically as the passage being spoken is made less similar to English by rearranging words, rearranging syllables, and reversing normal text. Taken together, the four experiments confirm that language familiarity plays an important role in voice identification.

So if dyslexics have deficiencies in phonological processing, and if speaker recognition is improved by native-language linguistic analysis, which includes phonological processing, then it's plausible that dyslexics would show native-language-specific deficiencies in speaker recognition. The recent Perrachione et al. paper tests this plausible hypothesis, and finds supporting evidence:

(A) Mean voice-recognition performance of dyslexic and control listeners (error bars indicate SEM). All individuals scored above chance (20%), shown as baseline. (B and C) Relationships between clinical measures of language (phonological) ability in dyslexia and voice-recognition ability. CTOPP, Comprehensive Test of Phonological Processing.

The (A) graph shows that their dyslexic subjects, who were native speakers of English, performed just as well as controls in learning to identify new Chinese speakers' voices, but quite a bit worse at learning to recognize new English speakers' voices. [The paper doesn't give any numbers, but from a careful measurement of the graph, the performance of the dyslexic group appears to be 50% correct on average, while the performance of the control group was 68% correct on average. The standard deviations are roughly 14 and 15 respectively. This translates to an average of 2.5 correct out of 5, compared to an average of 3.4 correct out of 5, and an effect size of about 1.2.]
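For readers who want to check the arithmetic in that bracketed note, here is a minimal sketch. It assumes that "effect size" means Cohen's d computed with a pooled standard deviation for two equal-sized groups; that is my assumption, not something stated in the paper. The inputs are the approximate values read off the graph, and the function and variable names are purely illustrative.

```python
import math

def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d for two equal-sized groups, using the pooled SD (assumed formula)."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd

# Approximate values measured from graph (A), in percent correct
control_mean, control_sd = 68.0, 15.0
dyslexic_mean, dyslexic_sd = 50.0, 14.0

# Five talkers to choose from, so chance is 1 in 5
n_talkers = 5

d = cohens_d(control_mean, control_sd, dyslexic_mean, dyslexic_sd)
control_per_5 = control_mean / 100 * n_talkers
dyslexic_per_5 = dyslexic_mean / 100 * n_talkers

print(f"effect size d = {d:.2f}")
print(f"correct out of {n_talkers}: controls {control_per_5:.1f}, dyslexics {dyslexic_per_5:.1f}")
```

Under these assumptions d comes out at about 1.24, consistent with the "about 1.2" quoted above. Because the two groups are the same size, it makes no difference which group is assigned the SD of 14 and which the SD of 15.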

The (B) graph compares speaker-identification performance to scores on a "nonword repetition" task, which involves repeating (verbally-presented) nonsense words like "dooloowheep". The (C) graph shows the relationship to an "elision" task, which involves following instructions like "say 'blend' without saying /l/".

OK, now comes the boring part where we look at the experiment itself (see "Never mind the conclusions, what's the evidence?", 8/30/2010).

There were 16 controls and 16 "individuals with dyslexia". According to Perrachione et al.'s Supporting Online Material,

Inclusionary criteria for dyslexia consisted of a prior clinical diagnosis or lifelong history of reading disability and scoring below the 16th percentile (one standard deviation below the age-normed mean) on any two subtests from the following standard clinical reading and language assessments: Woodcock Reading Mastery Test-Revised (WRMT-R/NU), Test of Word Reading Efficiency (TOWRE), and Comprehensive Test of Phonological Processing (CTOPP).

For more about these tests, see WRMT-R, TOWRE, CTOPP.

Groups were matched based on cognitive performance (“Matrices” and “Block Design” from the Wechsler Abbreviated Scale of Intelligence, WASI; (10)), working memory (Wechsler Adult Intelligence Scale WAIS-IV; (11)), age, and education.

As usual, it's worth giving a bit of thought to the population from which the subjects were taken. Again as usual, we don't know a great deal about this, but the authors of the study are all at MIT, and the age of the subjects was 21.3 ± 2.7 (controls) and 23.9 ± 6.8 (dyslexia), so most of them were probably MIT students. As a result, it's worth registering the usual mental reservation to the effect that neither the control group nor the dyslexic group is typical in other respects of the group it conceptually represents. As usual, it's not clear whether this matters or not.

Also, we should note that when they say that the two groups were "matched on cognitive performance […], working memory […], age, and education", what they mean (apparently) is something like "group means were within a standard deviation of one another". It's slightly worrisome that in fact the dyslexia group was on average below the control group in every cognitive dimension on which they were supposed to be "matched", by as much as 0.644 standard deviations:

What about the experimental design?

Two sets of ten sentences designed for acoustic assessment were recorded for this experiment: one spoken in English, the other in Mandarin. The English sentences were read by five male native speakers of American English (aged 19-26 years, M = 21.6). The Mandarin sentences were read by five male native speakers of Mandarin Chinese (aged 21-26 years, M = 22.6). […] Recordings of sentences were 1.46sec to 4.09sec in duration (M = 2.43, SD = 0.54). In each language, five sentences were used during the familiarization and practice phases, and all ten were used during the final voice recognition test.

A few comments are in order here.

First, the same set of ten sentences was recorded by every speaker. This may put a premium on detailed segment-by-segment comparison, in a way that a text-independent task might not (i.e. a task where each speaker's utterances involved different words and phrases). So it would be nice to know whether the effect is maintained or attenuated or eliminated in a text-independent task.

Second, these are relatively short stimuli (mean of 2.43 seconds). In the automatic speaker-recognition area, the relative performance of different algorithms can be quite different as the length of the training and testing stimuli increases, and it's plausible that this is also true for various aspects of human voice-identification abilities.

Third, we aren't told whether the speakers differed significantly in regional, class, or ethnic features. If they did, then this would plausibly put a premium on paying attention to a phonological analysis, so that specific features (e.g. ae-raising) could be noted.

Fourth, it's important that the same stimuli were (partly) used in training and in testing: "In each language, five sentences were used during the familiarization and practice phases, and all ten were used during the final voice recognition test". This changes the nature of the test even further in the direction of text-dependent speaker recognition — which obviously puts a premium on phonological memory. (This is especially true when the texts are few and short, so that they can be memorized and used as a key for registering speaker-and-text-specific acoustic properties.)

And fifth, this is a test of read sentences rather than naturally-occurring speech. Read speech and natural speech have quite different properties, and it's possible that these differences include a different balance between phonological and other (e.g. prosodic) cues to speaker identity.

Here's more about the procedures used:

Participants learned to identify five talkers in each of two language conditions (English and Mandarin) from the sound of their voice. Each talker was associated with a distinct cartoon avatar. Training and testing on voice recognition were completed in each language condition separately, and the order was counterbalanced across listeners. During an initial familiarization phase, participants heard each of the voices in succession while the corresponding avatars were displayed on a computer screen. Participants then actively practiced identifying the talkers with corrective feedback: The five avatars appeared on the screen while a recording from one talker was played, and participants selected the avatar matching the voice they heard. If participants selected incorrectly, the computer indicated the correct response. During the task, all instructions were presented both as text on the screen and as auditory prompts recorded by an additional female talker. The familiarization and active practice phases were repeated over five training sentences, and each sentence was practiced ten times. Following training, participants undertook a 50-item talker identification test, in which they identified the voices without feedback.

Summing up:

This experimental design has many features that are likely to increase the value of phonological analysis and phonological memory: the stimuli are quite short; each speaker reads the same small set of sentences; half of the stimuli in the testing phase were also used in the training (with feedback) phase; the stimuli are read rather than natural.

To the extent that the goal is to confirm, in a new way, the existing consensus that the (probably diverse) collection of reading difficulties known as "dyslexia" is strongly associated with deficiencies in phonological processing, none of this matters.

To the extent that the goal is to confirm, in a new way, the plausible hypothesis that human speaker recognition is mediated in part by phonological processing, none of this matters.

But if we're interested in whether people with reading difficulties are likely also to have problems recognizing who's talking when, in the context of everyday life, these design issues matter quite a bit.

So which interpretive frame do you think dominated the media coverage?
