Speech-related neurons

Neuronal responses in human temporal and frontal lobes were recorded from 11 patients with intractable epilepsy monitored with intracranial depth electrodes to identify seizure foci for potential surgical treatment (see Methods). Following an auditory cue, subjects uttered one of five vowels (a/a/, e/ε/, i/i/, o/o/ and u/u/) or simple syllables containing these vowels (consonant+vowel: da/da/, de/dε/, di/di/, do/do/ and du/du/...). We recorded the activity of 716 temporal and frontal lobe units. As this study focuses on speech, and owing to the inherent difficulty of distinguishing between auditory- and speech-related neuronal activations, we analysed only the 606 units that did not respond to auditory stimuli. A unit was considered speech-related if its firing rate during speech differed significantly from that in the pre-cue baseline period (see Methods). Overall, 8% of the analysed units (49) were speech-related, of which more than half (25) were vowel-tuned, showing significantly different activation across the five vowels (see Supplementary Fig. S1).
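As a minimal illustration of such a criterion, the sketch below compares per-trial firing rates between a pre-cue baseline window and a speech window. The two-sample t-test, window choices, numbers and function name are assumptions made for illustration, not the procedure specified in the Methods.

```python
import numpy as np
from scipy import stats

def is_speech_related(baseline_rates, speech_rates, alpha=0.05):
    """Flag a unit as speech-related when its per-trial firing rate during
    speech differs significantly from the pre-cue baseline (placeholder
    two-sample t-test; the paper's Methods define the actual test)."""
    return stats.ttest_ind(speech_rates, baseline_rates).pvalue < alpha

# Hypothetical per-trial firing rates (spikes/s) for one unit
rng = np.random.default_rng(0)
baseline = rng.poisson(5, size=40).astype(float)   # pre-cue baseline window
speech = rng.poisson(12, size=40).astype(float)    # speech-production window
print(is_speech_related(baseline, speech))         # -> True for this unit
```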

Sharp and broad vowel tuning

Two areas commonly activated in speech studies2, the STG and a medial–frontal region overlying the rostral anterior cingulate cortex (rACC) and the adjacent medial orbitofrontal cortex (rAC/MOF; Brodmann areas 11 and 12; see Supplementary Fig. S2 for the anatomical locations of the electrodes), had the highest proportions of speech-related units (75% and 11%, respectively) and of vowel-tuned units (58% and 77% of these speech-related units, respectively). In imaging and electrocorticography studies, the rACC was shown to participate in speech control2,6, the orbitofrontal cortex in speech comprehension and reading7, and the STG in speech production at the phoneme level8. Involvement of STG neurons in speech production was also observed in earlier single-unit recordings in humans9. We analysed neuronal tuning in these two areas and found that it had divergent characteristics: broadly tuned units, which responded to all vowels with a gradual modulation of the firing rate between vowels, comprised 93% of tuned units in STG (13/14) but were not found in rAC/MOF (0/10), whereas sharply tuned units, with significant activation exclusively for one or two vowels, comprised 100% of the tuned rAC/MOF units (10/10) but were rare in STG (1/14).
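A hypothetical rule for this sharp/broad split might look like the following sketch, which labels a unit 'sharp' when at most two vowels evoke a significant change from baseline. The per-vowel t-test and threshold are illustrative stand-ins for the criterion in the Methods.

```python
import numpy as np
from scipy import stats

def tuning_class(rates_by_vowel, baseline_rates, alpha=0.05):
    """Label a vowel-tuned unit 'sharp' (significant activation for only one
    or two vowels) or 'broad' (activation for most or all vowels).
    rates_by_vowel maps each vowel to its per-trial firing rates."""
    n_active = sum(
        stats.ttest_ind(rates, baseline_rates).pvalue < alpha
        for rates in rates_by_vowel.values()
    )
    return 'sharp' if n_active <= 2 else 'broad'

# Hypothetical unit responding strongly to 'a' only
rng = np.random.default_rng(1)
baseline = rng.poisson(5, size=40).astype(float)
rates = {v: rng.poisson(5, size=40).astype(float) for v in 'aeiuo'}
rates['a'] += 10.0
print(tuning_class(rates, baseline))   # -> 'sharp'
```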

Figure 1 displays the responses of five sharply tuned units in rAC/MOF, each exhibiting a strong, robust increase in firing rate specifically for one or two vowels, while firing for the other vowels remains at the baseline rate. For example, a single unit in the right rACC (Fig. 1, top row) elevated its firing rate to an average of 97 spikes/s when the patient said 'a', compared with 6 spikes/s for 'i', 'e', 'o' and 'u' (P<10−13, one-sided two-sample t-test). Anecdotally, in the first two trials of this example (red arrow), the firing rate remained at the baseline level, unlike in the rest of the 'a' trials; in these two trials, the patient mistakenly said 'ah' rather than 'a' (confirmed by the sound recordings).

Figure 1: Sharply tuned medial–frontal (rAC/MOF) units. Raster plots and peri-stimulus time histograms of five units during the utterance of the five vowels a, e, i, u and o. For each unit, a significant change in firing rate from baseline occurred for only one or two vowels (Methods). Red vertical dashed lines indicate speech onset. All vertical scale bars correspond to firing rates of 20 spikes/s.

A completely different encoding of vowels was found in the STG, where the vast majority of tuned units exhibited broad variation of their response over the vowel space, during the articulation of both vowels (Fig. 2a) and simple syllables containing these vowels (Supplementary Fig. S3a). This structured variation is well approximated by sinusoidal tuning curves (Fig. 2b and Supplementary Fig. S3b), analogous to the directional tuning curves commonly observed in motor cortical neurons10. The units shown in Fig. 2 had maximal responses ('preferred vowel', in analogy to 'preferred direction') to the vowels 'i' and 'u', which correspond to a closed articulation in which the tongue is maximally raised, and minimal ('anti-preferred') responses to 'a' and 'o', in which it is lowered.
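Such sinusoidal tuning can be made concrete with a least-squares fit of f(v) = b0 + b1·cos(θv − θpref), the same functional form used for motor cortical directional tuning. In the sketch below, the assignment of the five vowels to equally spaced angles in their articulatory order, the example rates and all names are assumptions for illustration; the actual fitting procedure is described in the Methods.

```python
import numpy as np

# Assign the five vowels equally spaced angles around a circle, ordered by
# place and manner of articulation (an assumption made for illustration).
VOWELS = ['a', 'e', 'i', 'u', 'o']
ANGLES = {v: 2 * np.pi * k / len(VOWELS) for k, v in enumerate(VOWELS)}

def fit_cosine_tuning(mean_rates):
    """Least-squares fit of f(v) = b0 + b1*cos(theta_v - theta_pref).
    Returns (baseline b0, modulation depth b1, preferred angle)."""
    theta = np.array([ANGLES[v] for v in VOWELS])
    y = np.array([mean_rates[v] for v in VOWELS])
    # f = b0 + a*cos(theta) + b*sin(theta) is linear in (b0, a, b)
    X = np.column_stack([np.ones_like(theta), np.cos(theta), np.sin(theta)])
    b0, a, b = np.linalg.lstsq(X, y, rcond=None)[0]
    return b0, np.hypot(a, b), np.arctan2(b, a)

# Hypothetical unit preferring closed articulation ('i'/'u')
rates = {'a': 4.0, 'e': 9.0, 'i': 15.0, 'u': 14.0, 'o': 5.0}
b0, depth, pref = fit_cosine_tuning(rates)
idx = int(np.round((pref % (2 * np.pi)) / (2 * np.pi / 5))) % 5
print(VOWELS[idx])   # -> 'i', the vowel closest to the preferred angle
```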

Figure 2: Broadly tuned STG units. (a) Raster plots and peri-stimulus time histograms during the utterance of the five vowels a, e, i, u and o. A significant change in firing rate from baseline occurred for all or most vowels, with the firing rate modulated across vowels (Methods). Red vertical dashed lines indicate speech onset; vertical bars, 10 spikes/s. (b) Tuning curves of the respective units in a over the vowel space, showing an orderly variation in the firing rate of STG units with the articulated vowel.

Population-level decoding and structure

Unlike directional tuning curves, where angles are naturally ordered, vowels admit many possible orderings. In the tuning curves of Fig. 2, we ordered the vowels according to their place and manner of articulation, as expressed by their location in the IPA chart1, but is this ordering natural to the neural representation? Rather than assuming a particular ordering, we can try to deduce the natural organization of speech features represented in the population-level neural code; that is, to infer a neighbourhood structure (or order) of the vowels in which similar (neighbouring) neuronal representations correspond to neighbouring vowels. We reasoned that this neighbourhood structure could be extracted from the error structure of neuronal classifiers: when a decoder, such as the population vector11, errs, it is more likely to prefer a value that is a neighbour of the correct value than a more distant one. Thus, when the feature ordering accurately reflects the neighbourhood structure of the neural representation, classification error rates are expected to be higher between neighbours than between distant features, and the classifier's confusion matrix will have a band-diagonal structure.
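The search for such an ordering can be sketched in a few lines: score every candidate ordering by the mean circular distance of the confusion-matrix mass from the diagonal, and keep the ordering that minimizes it. The confusion matrix below is hypothetical, constructed only to illustrate the band-diagonal criterion.

```python
import numpy as np
from itertools import permutations

def band_diagonal_score(C, order):
    """Mean circular distance between true and predicted labels when the
    confusion matrix C is re-indexed by `order` (lower = more band-diagonal)."""
    n = len(order)
    P = C[np.ix_(order, order)]
    i, j = np.indices(P.shape)
    d = np.minimum(np.abs(i - j), n - np.abs(i - j))   # circular distance
    return (P * d).sum() / P.sum()

def best_ordering(C, labels):
    """Exhaustive search over circular orderings (first label fixed to
    remove rotational symmetry) for the most band-diagonal arrangement."""
    n = len(labels)
    best = min((list(p) for p in permutations(range(1, n))),
               key=lambda p: band_diagonal_score(C, [0] + p))
    return [labels[k] for k in [0] + best]

# Hypothetical confusion matrix (rows: true, columns: predicted) in the
# order a, e, i, u, o: errors concentrate between articulatory neighbours.
labels = ['a', 'e', 'i', 'u', 'o']
C = np.array([[90,  6,  0,  0,  4],
              [ 5, 88,  6,  1,  0],
              [ 0,  7, 89,  4,  0],
              [ 0,  1,  5, 90,  4],
              [ 5,  0,  0,  6, 89]])
print(best_ordering(C, labels))   # -> ['a', 'e', 'i', 'u', 'o']
```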

To apply this strategy, we decoded the population firing patterns using multivariate linear classifiers with a sparsity constraint to infer the uttered vowel (Methods). The five vowels were decoded with a high average (cross-validated) accuracy of 93% (significantly above the 20% chance level, P<10−5, one-sided one-sample t-test, n=6 cross-validation runs; Supplementary Table S1), and up to 100% when decoding pairs of vowels (Fig. 3a). Next, we selected the vowel ordering that leads to band-diagonal confusion matrices (Fig. 3b). Interestingly, this ordering is consistent across different neuronal subpopulations (Fig. 3b and Supplementary Fig. S4) and exactly matches the organization of vowels according to their place and manner of articulation as reflected by the IPA chart (Fig. 3c). As the vowel chart represents the position of the highest point of the tongue during articulation, the natural organization of speech features by neuronal encoding reflects a functional spatial-anatomical axis in the mouth.
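As a rough sketch of this decoding step, L1-penalized multinomial logistic regression can stand in for the multivariate linear classifiers with a sparsity constraint described in the Methods; the synthetic spike-count features, regularization strength and printed accuracy below are illustrative assumptions, not the study's data or results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical design matrix: one row per utterance, one column per unit
# (e.g., spike counts in a peri-speech window); y holds vowel labels 0..4.
rng = np.random.default_rng(2)
n_trials, n_units = 300, 60
y = rng.integers(0, 5, size=n_trials)
X = rng.poisson(5.0, size=(n_trials, n_units)).astype(float)
X[np.arange(n_trials), y] += 8.0          # make a few units vowel-tuned
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize for the solver

# Sparse (L1) multinomial logistic regression; C sets the sparsity strength
clf = LogisticRegression(penalty='l1', solver='saga', C=0.5, max_iter=5000)
acc = cross_val_score(clf, X, y, cv=6)    # 6 cross-validation runs
print(f"mean decoding accuracy: {acc.mean():.2f} (chance = 0.20)")
```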