Abstract The dual-route model of speech processing includes a dorsal stream that maps auditory to motor features at the sublexical level rather than at the lexico-semantic level. However, the literature on gesture is an invitation to revise this model because it suggests that the premotor cortex of the dorsal route is a major site of lexico-semantic interaction. Here we investigated lexico-semantic mapping using word-gesture pairs that were either congruent or incongruent. Using fMRI-adaptation in 28 subjects, we found that temporo-parietal and premotor activity during auditory processing of single action words was modulated by the prior audiovisual context in which the words had been repeated. The BOLD signal was suppressed following repetition of the auditory word alone, and further suppressed following repetition of the word accompanied by a congruent gesture (e.g. [“grasp” + grasping gesture]). Conversely, repetition suppression was not observed when the same action word was accompanied by an incongruent gesture (e.g. [“grasp” + sprinkle]). We propose a simple model to explain these results: auditory and visual information converge onto premotor cortex where it is represented in a comparable format to determine (in)congruence between speech and gesture. This ability of the dorsal route to detect audiovisual semantic (in)congruence suggests that its function is not restricted to the sublexical level.

Citation: Josse G, Joseph S, Bertasi E, Giraud A-L (2012) The Brain’s Dorsal Route for Speech Represents Word Meaning: Evidence from Gesture. PLoS ONE 7(9): e46108. https://doi.org/10.1371/journal.pone.0046108 Editor: Emmanuel Andreas Stamatakis, University Of Cambridge, United Kingdom Received: April 6, 2012; Accepted: August 28, 2012; Published: September 26, 2012 Copyright: © Josse et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: This study was supported by a grant from the Agence Nationale de la Recherche attributed to Anne-Lise Giraud (http://www.agence-nationale-recherche.fr/). Goulven Josse was also supported by the Philippe Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.

Introduction What we know of human brain function at a macro-anatomical level includes a couple of relatively well-accepted distinctions. In the communication domain, the left hemisphere is thought to be specialized for language processing relative to the right hemisphere [1], [2]. In the visual domain, a ventral stream of information processing is thought to be specialized in object recognition (“What is it?”), whereas a dorsal stream specializes in spatial processing (“Where is it?”) [3]. More recently, it was proposed that the auditory system may be similarly divided into dorsal and ventral streams [4]. Again, the questions “What?” and “Where?” were used as a metaphor to emphasize the major differences between ventral and dorsal streams in the perceptual domain. However, coincidentally or not, these questions almost obviously suggest a link to language. Building on this, Hickok and Poeppel have argued that not only the left-right, but also the ventral-dorsal distinction could be useful to understand how the brain processes language [5], [6]. Their dual-route model is actually more about auditory speech than about language which can also involve the visual modality through reading, visually perceived mouth movements, sign language or (as we will see here) gesture. In their own dual-route model, Hickok and Poeppel have proposed that, first, a bilateral temporal network (the ventral route) maps word to meaning at the lexico-semantic level, thus allowing for word comprehension (“What does the sound I heard mean?”). Second, a left-lateralized temporo-parieto-frontal network maps auditory-to-motor word features at the sublexical level. This may seem at odds with the “Where?” question. In fact, answering “Where is the object?” is a pre-requisite to acting on the object by grasping it and manipulating it using both the sensory (visual or auditory) system and the motor system. In other words, the dorsal route is a sensori-motor pathway [7]. 
In a context where objects are words, this dorsal route could thus enable speech parsing based on articulatory movements [8], as well as speech repetition and, therefore, learning to speak [5], [6]. This dual-route model is supported by many behavioral and neuroimaging studies [5], [9], [10]. However, it is also challenged by evidence that the premotor cortex of the dorsal route is involved in lexico-semantic processing (“What?”) in addition to sublexical processing. For instance, premotor activity is thought to underlie the comprehension of action words [11], [12]. Premotor activity has also been associated with the perception of gesture that carries meaning alongside speech [13]. In this article, we argue that the dorsal route serves to map auditory speech not only onto articulatory representations, but also onto gestural representations corresponding to speech-associated hand gesture. Since gesture carries meaning, the auditory-motor mapping function of the dorsal route necessarily engages semantic representations. Importantly, due to repeated associations between words and gesture, these gestural representations are engaged even when words are perceived or produced alone. Our proposal that the (pre)motor representations of words and gesture overlap is based on several lines of evidence. Elegantly defined as “visible action as utterance” [14], speech-associated gesture has been described as an integral part of language [15]. Gesture is associated with speech in terms of meaning and the two are tightly synchronized. There is evidence that gesturing improves lexical retrieval in both normal subjects and Broca’s aphasics [16], [17], [18]. At the neuronal level, there are reports suggesting that hand and mouth representations could overlap in (pre)motor cortex [19], [20], [21], [22]. For instance, in people who speak while grasping an object, the size of the grasped object predicts the size of mouth openings [19]. 
This is the case despite the fact that the grasping task has nothing to do with the act of speaking other than both are taking place simultaneously. We can therefore suppose that gesture affects speech to an even greater extent when the two are congruent, and that the overlap between their neuronal representations is also greater. We further propose that the overlap between word and gesture representations in premotor cortex can explain some previously reported neuroimaging findings. A few studies have investigated the neural correlates of speech comprehension in a context where visually meaningful, iconic gesture could potentially modulate speech processing. In order to control for biological motion, most of these studies used a control condition where speech was presented with either incongruent gesture, or meaningless hand movements (we will refer to both cases as incongruence). These studies using functional magnetic resonance imaging (fMRI) consistently reported that incongruence was associated with increased activity relative to a similar condition where speech and gesture were congruent, whereas congruence was not associated with any specific activity relative to incongruence [23], [24], [25], [26]. We reasoned that this might be the case because the overlap between the neuronal population representing the word and the population representing gesture is less extensive when word and gesture are incongruent: in this case more neurons are activated and, therefore, a larger BOLD signal is observed. Alternatively, it has been proposed that increased semantic load in incongruent conditions, or after-effects unrelated to the process of comparing speech and gesture (such as inhibition), could explain increased activation [23], [25]. However, activation studies have so far only provided information at the macro-anatomical level that leaves both this question, and mechanisms of speech and gesture integration at the neuronal level, unaddressed. 
To test our neuronal model of overlapping word and gesture representation, we use repetition effects on the BOLD signal. Stimulus repetition is usually associated with a decrease in both neuronal activity and BOLD signal [27], [28], [29]. Accordingly, activation of a neuronal population by a given word should decrease when the word is repeated. We hypothesize that many neurons in this neuronal population are also activated by any congruent gesture. Therefore, the combined presentation of congruent word and gesture should enhance the repetition suppression effect. Conversely, if a word and a gesture are incongruent, each will target different neurons, and therefore the suppression effect related to word repetition will not be stronger when an incongruent gesture is repeated alongside the word. Our findings confirm our hypothesis that the dorsal route for speech is a major site of audiovisual integration and semantic processing.

Figure 1. Stimuli. Still frames of video stimuli from the two audiovisual conditions where Speech and Gesture were either Congruent (SGC) or Incongruent (SGI). In the two corresponding speech-only conditions (Sc and Si, respectively), videos were shown with a black mask matching the background, so that subjects perceived only the audio from the video files. https://doi.org/10.1371/journal.pone.0046108.g001

Results During the initial phase, posterior temporal, inferior parietal and precentral areas were activated by speech alone (Figure 3 and Table 1). We checked that no difference was observed between activity related to speech Sc extracted from SGC videos and activity related to speech Si extracted from SGI videos (p>0.05 corrected for multiple comparison). Additional occipital, occipito-temporal, inferior parietal and precentral activations were detected in the audiovisual conditions. Critically, no activation was detected in the SGC > SGI contrast even at lower thresholds of significance (p>0.001 uncorrected for multiple comparisons). The reverse contrast revealed more activation in the incongruent condition mainly in parietal and frontal areas. At the behavioral level, incongruence between speech and gesture was reported to be systematically detected. During the test phase, where words were always presented auditorily, the comparison between new, non-repeated, words (“NR”) with repeated words (“R”) showed repetition suppression bilaterally in the superior part of the temporal lobe, centered on the superior temporal sulcus (STS, Figure 4 and Table 2). However, we only observed this effect when the words had been repeated in an audio-only context (S) or in the congruent speech and gesture audiovisual condition (SGC), but not in the incongruent condition (SGI). In the latter condition, we detected repetition enhancement (R>NR) in the right precentral gyrus (Figure 4 and Table 2). This held when controlling for the audio-only condition (R-NR post SGI - post Si = NR-R post Si - post SGI, Figure 5 and Table 2). This latter contrast unveiled 2 similar effects bilaterally in the occipito-temporal cortex (Figure 5 and Table 2) where there was a trend for repetition suppression following Si, and a trend for repetition enhancement following SGI. 
These effects observed during the test phase suggest that visual information interacts with auditory information during word repetition and that this audiovisual interaction depends on congruency between word and gesture. To address this, we directly compared word repetition effects after the words were repeated in the congruent speech and gesture condition vs after the words were repeated in the incongruent speech and gesture condition ([post SGC NR-R] vs [post SGI NR-R]). We observed a significant effect in the left precentral gyrus (corrected for multiple comparisons in terms of spatial extent), suggesting that the effect localized to the premotor cortex (Figure 6 and Table 2). To confirm this localization, we overlapped the cluster onto probabilistic cytoarchitectonic maps - SPM Anatomy toolbox - [38] of the precentral gyrus. The cluster mainly overlapped with area 6 (54% of the cluster, Figure 6), but also with area 44 (15%) and other areas in the rolandic operculum or not part of cytoarchitectonic maps (31%). One could argue that the decision to use the spatial extent of this cluster as a correction for multiple comparisons was flawed, given that the cluster overlapped with several cytoarchitectonic areas with potentially different functional roles. We therefore focused our search on an independently defined mask of regions, which demonstrated auditory activation and word-specific auditory repetition suppression effects (see methods). By doing so, we confirmed the significance of the effect in left area BA 6 in terms of intensity, and found another significant peak in the left temporo-parietal cortex, which showed a similar pattern of activity across conditions (Figure 7 and Table 2). 
Furthermore, these differences cannot be attributed to differences between auditory conditions because no difference was observed between the test phase after repetition of speech Sc extracted from SGC videos and the test phase after repetition of speech Si extracted from SGI videos (p>0.05 corrected for multiple comparison). None of these repetition effects were significantly lateralized after correction for multiple comparisons across the whole brain. Plots in Figure 7 for these areas confirm that there was repetition suppression following repetition of a word either in the audio-only context or in the audiovisual context with a congruent gesture. Plots also show that repetition in the incongruent audiovisual condition led to enhancement rather than suppression of the BOLD signal. Overall, indications that the BOLD signal decreases following repetition of auditory words could only be observed if the words were presented alone or with a congruent gesture, but not with an incongruent gesture.
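The direct comparison described above, [post SGC NR-R] vs [post SGI NR-R], is a difference of differences. The short sketch below spells out that arithmetic. It is only an illustration: the function names are ours, and the numbers are arbitrary placeholders chosen to mimic the reported direction of the effects, not values measured in this study.

```python
# Illustrative arithmetic only. The signal values below are arbitrary
# placeholders, NOT data from the study.

def repetition_suppression(non_repeated, repeated):
    """NR - R: positive means suppression, negative means enhancement."""
    return non_repeated - repeated


def congruency_interaction(post_sgc, post_sgi):
    """[post SGC NR-R] minus [post SGI NR-R]; each argument is an (NR, R) pair."""
    return repetition_suppression(*post_sgc) - repetition_suppression(*post_sgi)


# The same comparison written as weights over
# (post-SGC NR, post-SGC R, post-SGI NR, post-SGI R).
CONTRAST_WEIGHTS = (1, -1, -1, 1)

post_sgc = (1.0, 0.5)    # placeholder: repeated word lower than new word
post_sgi = (1.0, 1.25)   # placeholder: repeated word higher than new word

interaction = congruency_interaction(post_sgc, post_sgi)
weighted = sum(w * x for w, x in zip(CONTRAST_WEIGHTS, post_sgc + post_sgi))

print(interaction, weighted)
```

Both formulations give the same number; a positive value means that repetition suppression was larger after congruent than after incongruent audiovisual repetition.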

Discussion Our results confirm that speech and audiovisual speech activate perisylvian regions [9], [39], [40], [41], [42], and critically show that co-speech gesture interferes with responses to auditory words in the premotor cortex. Not only did (in)congruency between word and gesture affect premotor activity the first time stimuli were presented, (in)congruency also later affected premotor activity during repetition of the auditory word alone. This straightforward finding supports our hypothesized mechanism for the comparison of speech and gesture. Plots of activity during the various conditions suggest that, following auditory word repetition, premotor activity was reduced only if the word was repeated alone or with a congruent gesture, but not with an incongruent gesture (Figures 7 and 8). According to this mechanism, each word activated a specific premotor representation “S” (for “speech”, Figure 8.A). After repetition of the same word, neurons participating in the representation of this word showed less activity relative to a new, non-repeated word. A significant part of this word representation overlaps with the representation “G” of a gesture associated with the action evoked by the word (Figure 8.B). Therefore, after repetition of both word and gesture, activity was reduced even further than after repetition of the word alone. This further reduction may have been caused solely by the mouth movements that were visually perceived alongside gesture. However, this effect disappeared when word and gesture were incongruent, showing the influence of gesture on word processing. We propose that when an incongruent gesture “J” was presented (Figure 8.C), word and gesture representations did not co-localize as well. The same cortical unit showed an even higher BOLD signal in the initial phase because more neurons were activated than in the condition where speech and gesture were congruent. 
The sum of these neuronal activations outweighed the co-activation of fewer neurons by congruent word and gesture. In the test phase, the representation of the auditory word was then associated with many neurons specific to the incongruent gesture. This could explain why the BOLD signal in this phase was even higher than when a new auditory word was presented. In other words, “repetition enhancement” here would reflect associative perceptual learning between what subjects recognized to be incongruent word and gesture. The repetition suppression effects we observed are in agreement with the “fatigue” model of repetition suppression, according to which each neuron responding to a given stimulus will respond less after the same or a similar stimulus is repeated [27], [29]. Our interpretation of the repetition enhancement effects may seem more speculative given the fewer reports of this type of effect and the rare discussions of its potential underlying causes and implications [28], [43], [44]. Yet, studies showing a relation between increased activity and perceptual learning support this account. Gauthier et al. have shown increased activity in the fusiform gyrus of subjects who had developed visual expertise for a novel type of object [45]. Dolan et al. have also shown increased activation of the fusiform gyrus in subjects who had learned to recognize familiar objects or faces in degraded pictures [46]. Furthermore, Turk-Browne et al. have reported repetition enhancement for low-contrast scenes versus repetition suppression for more easily recognizable high-contrast scenes [44]. Together, these studies support the view that repetition enhancement reflects additional processing of initially non-familiar/unrecognized stimuli [47]. We propose that such additional processing also exists in the case of a non-familiar association between word and gesture. 
This in turn may induce additional processing when the auditory word alone is presented after it has been repeatedly presented with an unexpected gesture. We further propose that the overlap between representations activated by an action word and an iconic gesture allows for quantifying congruence between these two types of stimuli. This computation would allow the observer to distinguish which of the speaker’s gestures are relevant to his/her discourse. The processing of gesture alone may not be sufficient: the same hand movement may be related or not to discourse. For instance, the speaker may need to briefly scratch his face because it is itching, or the same movement may be an iconic gesture imaging his verbal recall of when he had an itchy beard. A qualitative comparison between speech and gesture, as the one we propose in our model, allows for distinguishing between these two cases. Additionally, the mechanism we propose could explain why we and others only find more activation for incongruent vs congruent speech and gesture [23], [25]. Holle et al. have reported more activation for congruent vs incongruent speech and gesture in several occipito-temporal, inferior parietal and precentral regions, but the incongruent condition involved hand movements that were not communicative [24]. In addition, these hand movements were systematically directed towards the body (“grooming”). Subjects could therefore understand that these hand movements were not communicative prior to integrating speech and gesture. In another study, Skipper et al. found premotor activity specifically related to gestures congruent with speech, but this was again relative to “grooming” movements unrelated to speech [13]. Although the analysis was of an event-related type, words and gestures were presented to subjects in the context of discourse, and it is unclear whether grooming movements were as synchronized with speech as gestures were. 
Premotor activity in the congruent condition may therefore have reflected the sum of activities related to speech and gesture, whereas activity during grooming may have been related to hand motion only. At the macro-anatomical level, premotor and temporo-parietal cortices where we found audiovisual effects are part of the dorsal route for speech processing [5], [6]. In the dorsal route model, the temporo-parietal junction maps auditory information onto articulatory representations stored in premotor cortex. Premotor cortex is also able to represent mouth movements and gesture based on visual input [13], [41], [42]. In line with previous work, we therefore propose that audiovisual integration can take place in premotor cortex because it is a “neutral ground” between the auditory and visual modalities, potentially providing a unifying code in which auditory and visual information pertaining to the same communicative act can be compared [48], [49], [50]. This comparison assumes overlap between representations of the hands and vocal apparatus, a reasonable prerequisite given previous reports of overlapping representations of some body parts and actions in premotor and even primary motor cortex [19], [20], [21], [22]. Finally, the fact that the temporo-parietal junction was sensitive to audiovisual congruence may be explained by a feedback signal from premotor cortex [42]. The dorsal route has been proposed to be an auditory-motor pathway playing a role at the sublexical level [5], [6]. Our results suggest it is also sensitive to visual information of a lexico-semantic nature, in accordance with models where semantic representations involve sensory-motor associations [12]. This also agrees with improved lexical retrieval in normal subjects and Broca’s aphasics when they are allowed or encouraged to gesture [16], [17], [18]. 
Although our findings point to the dorsal route, it could be argued that they indirectly arise from an integrative process in the ventral route, where auditory word representations are presumably mapped onto semantic representations [5], [6]. We cannot rule out that we missed ventral effects due to reduced signals in temporal regions [51]. Even if this was the case, the lexical status of words may not apply to co-speech gesture. Contrary to words, co-speech gestures are not part of a formal linguistic code, and are not linguistically associated with a specific meaning in an arbitrary way [15], [52]. They are idiosyncratic, specific to each speaker, and hence only interpretable via their iconicity. These characteristics primarily suggest processing by a sensory-motor system similar to the dorsal action perception system, rather than by a lexico-semantic ventral system. On the other hand, while words are linguistic units, they may also be represented as articulatory gestures in a dorsal sensory-motor system, which therefore turns out to be the best candidate for the comparison of gesture and speech. Note that we do not claim that the ventral route for speech processing plays no role in integrating speech and gesture. Rather, the mechanism suggested by our data does not require it. In the context of discourse, Skipper et al. found stronger connectivity between the precentral gyrus and anterior temporal areas in a congruent speech and gesture condition [13] suggesting that gesture influences semantic processes taking place in temporal areas of the ventral route [10], [13], [39], [53]. Therefore, a more accurate model of speech and iconic gesture integration may include both the dorsal and the ventral route, starting with the dorsal route - where the comparison between single words and iconic gestures takes place, which then influences the ventral route, most prominently at the sentence and discourse levels. 
In summary, our data suggest that auditory and visual information from speech and gesture converge onto the premotor cortex of the dorsal route for speech processing where it is conveniently represented as unimodal information. Information from speech and gesture could thus be compared, and (in)congruence could be inferred from the size of the overlap between the populations of sensory-motor neurons activated by speech and gesture respectively, with congruence leading to a large overlap, and incongruence leading to a small or no overlap.
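The overlap account summarized above can be made concrete with a toy sketch. It is our own illustration under loudly stated assumptions, not a model fitted to the data: units are integer labels, the overlap index is a simple shared-over-total ratio, every unit loses a fixed fraction of its response each time an input drives it during repetition (a crude stand-in for the “fatigue” model), and the `learned_gain` term is a hypothetical way to mimic the proposed association between a word and the units of an incongruent gesture. All names and numbers are arbitrary.

```python
# Toy illustration only: unit counts, fatigue rate, and the "learned" term
# are arbitrary assumptions, not values estimated in the study.

def overlap_index(word_units, gesture_units):
    """Shared units divided by all units recruited by either stimulus."""
    union = word_units | gesture_units
    return len(word_units & gesture_units) / len(union) if union else 0.0


def word_alone_signal(word_units, gesture_units=frozenset(),
                      fatigue=0.3, learned_gain=0.0):
    """Summed response to the auditory word alone after the repetition phase."""
    signal = 0.0
    for unit in word_units:
        # A unit driven by both word and gesture during repetition fatigues twice.
        drives = 2 if unit in gesture_units else 1
        signal += (1.0 - fatigue) ** drives
    # Hypothetical associative term: gesture-only units recruited by the word.
    signal += learned_gain * len(gesture_units - word_units)
    return signal


word = frozenset(range(0, 100))           # representation "S"
congruent = frozenset(range(40, 140))     # "G": large overlap with S
incongruent = frozenset(range(95, 195))   # "J": almost no overlap with S

new_word = float(len(word))               # unrepeated word, no fatigue
after_word_alone = word_alone_signal(word)
after_congruent = word_alone_signal(word, congruent)
after_incongruent = word_alone_signal(word, incongruent, learned_gain=0.4)

print(overlap_index(word, congruent), overlap_index(word, incongruent))
print(new_word, after_word_alone, after_congruent, after_incongruent)
```

With these arbitrary settings, the congruent pair has the larger overlap index and yields the lowest test-phase signal, the word repeated alone yields intermediate suppression, and the incongruent pair exceeds the new-word level only because of the assumed associative term. Without that term, the sketch predicts suppression close to that of the word alone, which corresponds to the hypothesis stated in the Introduction rather than to the enhancement we observed.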

Acknowledgments We thank Matthew Longo (Birkbeck), Brian Fischer (INSERM), Katia Lehongre (INSERM), Sondip R. Mukherjee (UNESCO), Guillaume Flandin (FIL, UCL) and Emmanuelle Volle (ICM/Pitié-Salpêtrière) for comments and corrections on earlier versions, the Imaging team at the CR-ICM, and Laura Chadufau (INSERM) for administrative help.

Author Contributions Conceived and designed the experiments: GJ SJ EB. Performed the experiments: GJ EB. Analyzed the data: GJ SJ. Contributed reagents/materials/analysis tools: EB. Wrote the paper: GJ. Funded and commented on the paper: A-LG.