How the human auditory system extracts perceptually relevant acoustic features of speech is unknown. To address this question, we used intracranial recordings from nonprimary auditory cortex in the human superior temporal gyrus to determine what acoustic information in speech sounds can be reconstructed from population neural activity. We found that slow and intermediate temporal fluctuations, such as those corresponding to syllable rate, were accurately reconstructed using a linear model based on the auditory spectrogram. However, reconstruction of fast temporal fluctuations, such as syllable onsets and offsets, required a nonlinear sound representation based on temporal modulation energy. Reconstruction accuracy was highest within the range of spectro-temporal fluctuations that have been found to be critical for speech intelligibility. The decoded speech representations allowed readout and identification of individual words directly from brain activity during single trial sound presentations. These findings reveal neural encoding mechanisms of speech acoustic parameters in higher order human auditory cortex.

Spoken language is a uniquely human trait. The human brain has evolved computational mechanisms that decode highly variable acoustic inputs into meaningful elements of language such as phonemes and words. Unraveling these decoding mechanisms in humans has proven difficult, because invasive recording of cortical activity is usually not possible. In this study, we take advantage of rare neurosurgical procedures for the treatment of epilepsy, in which neural activity is measured directly from the cortical surface and therefore provides a unique opportunity for characterizing how the human brain performs speech recognition. Using these recordings, we asked what aspects of speech sounds could be reconstructed, or decoded, from higher order brain areas in the human auditory system. We found that continuous auditory representations, for example the speech spectrogram, could be accurately reconstructed from measured neural signals. Reconstruction quality was highest for sound features most critical to speech intelligibility and allowed decoding of individual spoken words. The results provide insights into higher order neural speech processing and suggest it may be possible to read out intended speech directly from brain activity.

Copyright: © 2012 Pasley et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

In this study, we focus on whether important spectro-temporal auditory features of spoken words and continuous sentences can be reconstructed from population neural responses. Because significant information may be transformed or lost in the course of higher order auditory processing, an exact reconstruction of the physical stimulus is not expected. However, analysis of stimulus reconstruction can reveal the key auditory features that are preserved in the temporal cortex representation of speech. To investigate this, we analyzed multichannel electrode recordings obtained from the surface of human auditory cortex and examined the extent to which these population neural signals could be used for reconstruction of different auditory representations of speech sounds.

The early auditory system decomposes speech and other complex sounds into elementary time-frequency representations prior to higher level phonetic and lexical processing [1]–[5]. This early auditory analysis, proceeding from the cochlea to the primary auditory cortex (A1) [1]–[3],[6], yields a faithful representation of the spectro-temporal properties of the sound waveform, including those acoustic cues relevant for speech perception, such as formants, formant transitions, and syllable rate [7]. However, relatively little is known about what specific features of natural speech are represented in intermediate and higher order human auditory cortex. In particular, the posterior superior temporal gyrus (pSTG), part of classical Wernicke's area [8], is thought to play a critical role in the transformation of acoustic information into phonetic and pre-lexical representations [4],[5],[9],[10]. The pSTG is believed to participate in an “intermediate” stage of processing that extracts spectro-temporal features essential for auditory object recognition and discards nonessential acoustic features [4],[5],[9]–[11]. To investigate the nature of this auditory representation, we directly quantified how well different stimulus representations account for observed neural responses in nonprimary human auditory cortex, including areas along the lateral surface of STG. One approach, referred to as stimulus reconstruction [12]–[15], is to measure population neural responses to various stimuli and then evaluate how accurately the original stimulus can be reconstructed from the measured responses. Comparison of the original and reconstructed stimulus representation provides a quantitative description of the specific features that can be encoded by the neural population. Furthermore, different stimulus representations, referred to as encoding models, can be directly compared to test hypotheses about how the neural population represents auditory function [16].

Results

Words and sentences from different English speakers were presented aurally to 15 patients undergoing neurosurgical procedures for epilepsy or brain tumor. All patients in this study had normal language capacity as determined by neurological exam. Cortical surface field potentials were recorded from non-penetrating multi-electrode arrays placed over the lateral temporal cortex (Figure 1, red circles), including the pSTG. We investigated the nature of auditory information contained in temporal cortex neural responses using a stimulus reconstruction approach (see Materials and Methods) [12]–[15]. The reconstruction procedure is a multi-input, multi-output predictive model that is fit to stimulus-response data. It constitutes a mapping from neural responses to a multi-dimensional stimulus representation (Figures 1 and 2). This mapping can be estimated using a variety of different learning algorithms [17]. In this study a regularized linear regression algorithm was used to minimize the mean-square error between the original and reconstructed stimulus (see Materials and Methods). Once the model was fit to a training set, it could then be used to predict the spectro-temporal content of any arbitrary sound, including novel speech not used in training.
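As a rough illustration of the kind of mapping described above, the following is a minimal sketch of a regularized (ridge) linear reconstruction from time-lagged neural responses to a spectrogram. It is not the authors' implementation: the function names, the lag convention (responses at and after each stimulus time point), the number of lags, and the ridge parameter are all assumptions chosen for the example.

```python
import numpy as np

def lagged_design(responses, n_lags):
    """Stack time-lagged copies of the responses (hypothetical helper).

    responses: array of shape (n_times, n_electrodes).
    Row t of the result holds the responses at times t, t+1, ...,
    t+n_lags-1 (zero-padded at the end), on the assumption that the
    neural response follows the sound that evoked it.
    """
    n_times, n_elec = responses.shape
    X = np.zeros((n_times, n_elec * n_lags))
    for lag in range(n_lags):
        X[: n_times - lag, lag * n_elec:(lag + 1) * n_elec] = responses[lag:]
    return X

def fit_reconstruction(responses, spectrogram, n_lags=10, ridge=1.0):
    """Ridge regression weights minimizing mean-square reconstruction error.

    spectrogram: array of shape (n_times, n_frequencies).
    Returns weights of shape (n_electrodes * n_lags, n_frequencies).
    """
    X = lagged_design(responses, n_lags)
    XtX = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ spectrogram)

def reconstruct(responses, weights, n_lags=10):
    """Predict a spectrogram for (possibly novel) neural responses."""
    return lagged_design(responses, n_lags) @ weights
```

Once fitted on training data, `reconstruct` can be applied to responses evoked by sounds that were not part of the training set, which is the property the text relies on.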

Figure 1. Experiment paradigm. Participants listened to words (acoustic waveform, top left), while neural signals were recorded from cortical surface electrode arrays (top right, red circles) implanted over superior and middle temporal gyrus (STG, MTG). Speech-induced cortical field potentials (bottom right, gray curves) recorded at multiple electrode sites were used to fit multi-input, multi-output models for offline decoding. The models take as input time-varying neural signals at multiple electrodes and output a spectrogram consisting of time-varying spectral power across a range of acoustic frequencies (180–7,000 Hz, bottom left). To assess decoding accuracy, the reconstructed spectrogram is compared to the spectrogram of the original acoustic waveform. https://doi.org/10.1371/journal.pbio.1001251.g001

Figure 2. Spectrogram reconstruction. (A) Top: spectrogram of six isolated words (deep, jazz, cause) and pseudowords (fook, ors, nim) presented aurally to an individual participant. Bottom: spectrogram-based reconstruction of the same speech segment, linearly decoded from a set of electrodes. Purple and green bars denote vowels and fricative consonants, respectively, and the spectrogram is normalized within each frequency channel for display. (B) Single trial high gamma band power (70–150 Hz, gray curves) induced by the speech segment in (A). Recordings are from four different STG sites used in the reconstruction. The high gamma response at each site is z-scored and plotted in standard deviation (SD) units. Right panel: frequency tuning curves (dark black) for each of the four electrode sites, sorted by peak frequency and normalized by maximum amplitude. Red bars overlay each peak frequency and indicate SEM of the parameter estimate. Frequency tuning was computed from spectro-temporal receptive fields (STRFs) measured at each individual electrode site. Tuning curves exhibit a range of functional forms including multiple frequency peaks (Figures S1B and S2B). (C) The anatomical distribution of fitted weights in the reconstruction model. Dashed box denotes the extent of the electrode grid (shown in Figure 1). Weight magnitudes are averaged over all time lags and spectrogram frequencies and spatially smoothed for display. Nonzero weights are largely focal to STG electrode sites. Scale bar is 10 mm. https://doi.org/10.1371/journal.pbio.1001251.g002

The key component in the reconstruction algorithm is the choice of stimulus representation, as this choice encapsulates a hypothesis about the neural coding strategy under study. Previous applications of stimulus reconstruction in non-human auditory systems [14],[15] have focused primarily on linear models to reconstruct the auditory spectrogram. The spectrogram is a time-varying representation of the amplitude envelope at each acoustic frequency (Figure 1, bottom left) [18]. The spectrogram envelope of natural sounds is not static but rather fluctuates across both frequency and time [19]–[21]. Envelope fluctuations in the spectrogram are referred to as modulations [18]–[22] and play an important role in the intelligibility of speech [19],[21]. Temporal modulations occur at different temporal rates and spectral modulations occur at different spectral scales. For example, slow and intermediate temporal modulation rates (<4 Hz) are associated with syllable rate, while fast modulation rates (>16 Hz) correspond to syllable onsets and offsets. Similarly, broad spectral modulations relate to vowel formants while narrow spectral structure characterizes harmonics. In the linear spectrogram model, modulations are represented implicitly as the fluctuations of the spectrogram envelope. Furthermore, neural responses are assumed to be linearly related to the spectrogram envelope.

For stimulus reconstruction, we first applied the linear spectrogram model to human pSTG responses using a stimulus set of isolated words from an individual speaker. We used a leave-one-out cross-validation fitting procedure in which the reconstruction model was trained on stimulus-response data from isolated words and evaluated by directly comparing the original and reconstructed spectrograms of the out-of-sample word. Reconstruction accuracy is quantified as the correlation coefficient (Pearson's r) between the original and reconstructed stimulus. The reconstruction procedure is illustrated in Figure 2 for one participant with a high-density (4 mm) electrode grid placed over posterior temporal cortex. For different words, the linear model yielded accurate spectrogram reconstructions at the level of single trial stimulus presentations (Figure 2A and B; see Figure S7 and Supporting Audio File S1 for example audio reconstructions). The reconstructions captured major spectro-temporal features such as energy concentration at vowel harmonics (Figure 2A, purple bars) and high frequency components during fricative consonants (Figure 2A, [z] and [s], green bars). The anatomical distribution of weights in the fitted reconstruction model revealed that the most informative electrode sites within temporal cortex were largely confined to pSTG (Figure 2C).
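The leave-one-out procedure and the accuracy measure can be sketched as follows. This is an illustrative outline only, assuming each word has already been converted to a feature matrix (for example, time-lagged high gamma responses) and a matching spectrogram; the function names and the ridge parameter are hypothetical.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation between two arrays, flattened."""
    a = np.ravel(a) - np.mean(a)
    b = np.ravel(b) - np.mean(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def leave_one_out_accuracy(features, spectrograms, ridge=1.0):
    """Hold out each word in turn, fit on the rest, score the held-out word.

    features:     list of arrays, one per word, shape (n_times_i, n_features).
    spectrograms: list of arrays, one per word, shape (n_times_i, n_freqs).
    Returns one correlation coefficient per held-out word.
    """
    scores = []
    for held_out in range(len(features)):
        train = [i for i in range(len(features)) if i != held_out]
        X = np.vstack([features[i] for i in train])
        S = np.vstack([spectrograms[i] for i in train])
        # Ridge solution fitted without any data from the held-out word.
        W = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ S)
        S_hat = features[held_out] @ W
        scores.append(pearson_r(spectrograms[held_out], S_hat))
    return scores
```

Because the held-out word never enters the fit, each score reflects out-of-sample reconstruction accuracy rather than goodness of fit.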

Across the sample of participants (N = 15), cross-validated reconstruction accuracy for single trials was significantly greater than zero in all individual participants (p<0.001, randomization test, Figure 3A). At the population level, mean accuracy averaged over all participants and stimulus sets (including different word sets and continuous sentences from different speakers) was highly significant (mean accuracy r = 0.28, p<10⁻⁵, one-sample t test, df = 14). As a function of acoustic frequency, mean accuracy ranged from r = ∼0.2–0.3 (Figure 3B).

Figure 3. Individual participant and group average reconstruction accuracy. (A) Overall reconstruction accuracy for each participant using the linear spectrogram model. Error bars denote resampling SEM. Overall accuracy is reported as the mean over all acoustic frequencies. Participants are grouped by grid density (low or high) and stimulus set (isolated words or sentences). Statistical significance of the correlation coefficient for each individual participant was computed using a randomization test. Reconstructed trials were randomly shuffled 1,000 times and the correlation coefficient was computed for each shuffle to create a null distribution of coefficients. The p value was calculated as the proportion of elements greater than the observed correlation. (B) Reconstruction accuracy as a function of acoustic frequency averaged over all participants (N = 15) using the linear spectrogram model. Shaded region denotes SEM over participants. https://doi.org/10.1371/journal.pbio.1001251.g003
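The randomization test described in the Figure 3 caption can be sketched roughly as below. This is a simplified illustration, assuming that every trial has been flattened to a vector of equal length; the function name, the array layout, and the seed argument are assumptions and not part of the original analysis.

```python
import numpy as np

def randomization_p_value(originals, reconstructions, n_shuffles=1000, seed=0):
    """Shuffle reconstructed trials to build a null distribution of r.

    originals, reconstructions: arrays of shape (n_trials, n_features),
    where row i of each array belongs to the same trial.
    Returns (observed r, proportion of null r values greater than it).
    """
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(originals.ravel(), reconstructions.ravel())[0, 1]
    null = np.empty(n_shuffles)
    for k in range(n_shuffles):
        # Break the trial pairing while keeping each reconstruction intact.
        perm = rng.permutation(len(reconstructions))
        null[k] = np.corrcoef(originals.ravel(),
                              reconstructions[perm].ravel())[0, 1]
    return float(observed), float(np.mean(null > observed))
```

With 1,000 shuffles, a returned proportion of zero is conventionally reported as p<0.001, since the resolution of the test is limited by the number of shuffles.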

We observed that overall reconstruction quality was influenced by a number of anatomical and functional factors as described below. First, informative temporal electrodes were primarily localized to pSTG. To quantify this, we defined “informative” electrodes as those associated with parameters with high signal-to-noise ratio in the reconstruction models (t ratio>2.5, p<0.05, false discovery rate (FDR) correction). Figure 4A shows the anatomical distribution of informative electrodes pooled across participants and plotted in standardized anatomical coordinates (Montreal Neurological Institute, MNI) [23]. The distribution was centered in the pSTG (x = −70, y = −29, z = 12, MNI coordinates; Brodmann area 42), and was dispersed along the anterior-posterior axis.

Figure 4. Factors influencing reconstruction quality. (A) Group average t value map of informative electrodes, which are predominantly localized to posterior STG. For each participant, informative electrodes are defined as those associated with significant weights (p<0.05, FDR correction) in the fitted reconstruction model. To plot electrodes in a common anatomical space, spatial coordinates of significant electrodes are normalized to the MNI (Montreal Neurological Institute) brain template (Yale BioImage Suite, www.bioimagesuite.org). The dashed white line denotes the extent of electrode coverage pooled over participants. (B) Reconstruction accuracy is significantly greater than zero when using neural responses within the high gamma band (∼70–170 Hz; p<0.05, one sample t tests, df = 14, Bonferroni correction). Accuracy was computed separately in 10 Hz bands from 1–300 Hz and averaged across all participants (N = 15). (C) Mean reconstruction accuracy improves with increasing number of electrodes used in the reconstruction algorithm. Error bars indicate SEM over 20 cross-validated data sets of four participants with 4 mm high density grids. (D) Accuracy across participants is strongly correlated (r = 0.78, p<0.001, df = 13) with tuning spread (which varied by participant depending on grid placement and electrode density). Tuning spread was quantified as the fraction of frequency bins that included one or more peaks, ranging from 0 (no peaks) to 1 (at least one peak in all frequency bins, ranging from 180–7,000 Hz). https://doi.org/10.1371/journal.pbio.1001251.g004

Second, significant predictive power (r>0) was largely confined to neural responses in the high gamma band (∼70–170 Hz; Figure 4B; p<0.01, one-sample t tests, df = 14, Bonferroni correction). Predictive power for the high gamma band (∼70–170 Hz) was significantly better compared to other neural frequency bands (p<0.05, Bonferroni adjusted pair-wise comparisons between frequency bands, following significant one-way repeated measures analysis of variance (ANOVA), F(30,420) = 128.7, p<10⁻¹⁰). This is consistent with robust speech-induced high gamma responses reported in previous intracranial studies [24]–[29] and with observed correlations between high gamma power and local spike rate [30].

Third, increasing the number of electrodes used in the reconstruction improved overall reconstruction accuracy (Figure 4C). Overall prediction quality was relatively low for participants with five or fewer responsive STG electrodes (mean accuracy r = 0.19, N = 6 participants) and was robust for cases with high density grids (mean accuracy r = 0.43, N = 4, mean of 37 responsive STG electrodes per participant).

What neural response properties allow the linear model to find an effective mapping to the stimulus spectrogram? There are two major requirements as described in the following paragraphs. First, individual recording sites must exhibit reliable frequency selectivity (e.g., Figure 2B, right column; Figures S1B, S2). An absence of frequency selectivity (i.e., equal neural response amplitudes to all stimulus frequencies) would imply that neural responses do not encode frequency and could not be used to differentiate stimulus frequencies. To quantify frequency tuning at individual electrodes, we used estimates of standard spectro-temporal receptive fields (STRFs) (see Materials and Methods). The STRF is a forward modeling approach commonly used to estimate neural tuning to a wide variety of stimulus parameters in different sensory systems [16]. We found that different electrodes were sensitive to different acoustic frequencies important for speech sounds, ranging from low (∼200 Hz) to high (∼7,000 Hz). The majority of individual sites exhibited a complex tuning profile with multiple peaks (e.g., Figure 2B, rows 2 and 3; Figure S2B). The full range of the acoustic speech spectrum was encoded by responses from multiple electrodes in the ensemble, although coverage of the spectrum varied by participant (Figure 4D). Across participants, total reconstruction accuracy was positively correlated with the proportion of spectrum coverage (r = 0.78, p<0.001, df = 13; Figure 4D).

A second key requirement of the linear model is that the neural response must rise and fall reliably with fluctuations in the stimulus spectrogram envelope. This is because the linear model assumes a linear mapping between the response and the spectrogram envelope. This requirement for “envelope-locking” reveals a major limitation of the linear model, which is most evident at fast temporal modulation rates. This limitation is illustrated in Figure 5A (blue curve), which plots reconstruction accuracy as a function of modulation rate. A one-way repeated measures ANOVA (F(5,70) = 13.99, p<10⁻⁸) indicated that accuracy was significantly higher for slow modulation rates (≤4 Hz) compared to faster modulation rates (>8 Hz) (p<0.05, post hoc pair-wise comparisons, Bonferroni correction). Accuracy for slow and intermediate modulation rates (≤8 Hz) was significantly greater than zero (r = ∼0.15 to 0.42; one-sample paired t tests, p<0.0005, df = 14, Bonferroni correction) indicating that the high gamma response faithfully tracks the spectrogram envelope at these rates [26]. However, accuracy levels were not significantly greater than zero at fast modulation rates (>8 Hz; r = ∼0.10; one-sample paired t tests, p>0.05, df = 14, Bonferroni correction), indicating a lack of reliable envelope-locking to rapid temporal fluctuations [31].

Figure 5. Comparison of linear and nonlinear coding of temporal fluctuations. (A) Mean reconstruction accuracy (r) as a function of temporal modulation rate, averaged over all participants (N = 15). Modulation-based decoding accuracy (red curve) is higher compared to spectrogram-based decoding (blue curve) for temporal rates ≥4 Hz. In addition, spectrogram-based decoding accuracy is significantly greater than zero for lower modulation rates (≤8 Hz), supporting the possibility of a dual modulation and envelope-based coding scheme for slow modulation rates. Shaded gray regions indicate SEM over participants. (B) Mean ensemble rate tuning curve across all predictive electrode sites (n = 195). Error bars indicate SEM. Overlaid histograms indicate proportion of sites with peak tuning at each rate. (C) Within-site differences between modulation and spectrogram-based tuning. Arrow indicates the mean difference across sites. Within-site, nonlinear modulation models are tuned to higher temporal modulation rates than the corresponding linear spectrogram models (p<10⁻⁷, two sample paired t test, df = 194). https://doi.org/10.1371/journal.pbio.1001251.g005

Given the failure of the linear spectrogram model to reconstruct fast modulation rates, we evaluated competing models of auditory neural encoding. We investigated an alternative, nonlinear model based on modulation (described in detail in [18]). Speech sounds are characterized by both slow and fast temporal modulations (e.g., syllable rate versus onsets) as well as narrow and broad spectral modulations (e.g., harmonics versus formants) [7]. The modulation model represents these multi-resolution features explicitly through a complex wavelet analysis of the auditory spectrogram. Computationally, the modulation representation is generated by a population of modulation-selective filters that analyze the two-dimensional spectrogram and extract modulation energy (a nonlinear operation) at different temporal rates and spectral scales (Figure 6A) [18]. Conceptually, this transformation is similar to the modulus of a 2-D Fourier transform of the spectrogram, localized at each acoustic frequency [18]. The modulation model and applications to speech processing are described in detail in [18] and [7].

Figure 6. Schematic of nonlinear modulation model. (A) The input spectrogram (top left) is transformed by a linear modulation filter bank (right) followed by a nonlinear magnitude operation (not shown). This nonlinear operation extracts the modulation energy of the incoming spectrogram and generates phase invariance to local fluctuations in the spectrogram envelope. The input representation is the two-dimensional spectrogram S(f,t) across frequency f and time t. The output (bottom left) is the four-dimensional modulation energy representation M(s,r,f,t) across spectral modulation scale s, temporal modulation rate r, frequency f, and time t. In the full modulation representation [18], negative rates by convention correspond to upward frequency sweeps, while positive rates correspond to downward frequency sweeps. Accuracy for positive and negative rates was averaged unless otherwise shown. See Materials and Methods. (B) Schematic of linear (spectrogram envelope) and nonlinear (modulation energy) temporal coding. Left: acoustic waveform (black curve) and spectrogram of a temporally modulated tone. The linear spectrogram model (top) assumes that neural responses are a linear function of the spectrogram envelope (plotted for the tone center frequency channel, top right). In this case, the instantaneous output may be high or low and does not directly indicate the modulation rate of the envelope. The nonlinear modulation model (bottom) assumes that neural responses are a linear function of modulation energy. This is an amplitude-based coding scheme (plotted for the peak modulation channel, bottom right). The nonlinear modulation model explicitly estimates the modulation rate by taking on a constant value for a constant rate [32]. https://doi.org/10.1371/journal.pbio.1001251.g006

The nonlinear component of the model is phase invariance to the spectrogram envelope (Figure 6B). A fundamental difference with the linear spectrogram model is that phase invariance permits a nonlinear temporal coding scheme, whereby envelope fluctuations are encoded by amplitude rather than envelope-locking (Figure 6B). Such amplitude-based coding schemes are broadly referred to as “energy models” [32],[33]. The modulation model therefore represents an auditory analog to the classical energy model of complex cells in the visual system [32]–[36], which are invariant to the spatial phase of visual stimuli.
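A toy example may help make phase invariance concrete. The sketch below is not the wavelet-based modulation model of [18]; it assumes a much simpler stand-in, in which modulation energy at a given rate is the magnitude of the envelope's projection onto a complex sinusoid at that rate. The function name, sampling rate, and test envelopes are all hypothetical.

```python
import numpy as np

def modulation_energy(envelope, rate_hz, fs):
    """Magnitude of the mean-removed envelope projected onto a complex
    sinusoid at rate_hz (a simplified, non-localized energy measure)."""
    t = np.arange(len(envelope)) / fs
    carrier = np.exp(-2j * np.pi * rate_hz * t)
    return float(np.abs(np.mean((envelope - np.mean(envelope)) * carrier)))

fs = 100.0                      # assumed envelope sampling rate in Hz
t = np.arange(200) / fs         # two seconds of samples
env_a = 1 + np.cos(2 * np.pi * 8 * t)            # 8 Hz modulated envelope
env_b = 1 + np.cos(2 * np.pi * 8 * t + np.pi)    # same rate, opposite phase

# The two envelopes are anti-correlated sample by sample, so a response
# locked to one of them would mismatch the other.
envelope_correlation = np.corrcoef(env_a, env_b)[0, 1]

# Their modulation energy at 8 Hz is identical: the magnitude operation
# discards the phase of the fluctuation and keeps only its strength.
energy_a = modulation_energy(env_a, 8.0, fs)
energy_b = modulation_energy(env_b, 8.0, fs)
```

In this toy case an envelope-locked code would distinguish the two stimuli only by timing, whereas the energy code assigns both the same value at the 8 Hz rate, which is the sense in which fluctuations are encoded by amplitude rather than by envelope-locking.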

Reconstructing the modulation representation proceeds similarly to the spectrogram, except that individual reconstructed stimulus components now correspond to modulation energy at different rates and scales instead of spectral energy at different acoustic frequencies (see Materials and Methods, Stimulus Reconstruction). We next compared reconstruction accuracy using the nonlinear modulation model to that of the linear spectrogram model (Figure 5A; Figure S3). In the group data, the nonlinear model yielded significantly higher accuracy compared to the linear model (two-way repeated measures ANOVA; main effect of model type, F(1,14) = 33.36, p<10⁻⁴). This included significantly better accuracy for fast temporal modulation rates compared to the linear spectrogram model (4–32 Hz; Figure 5A, red versus blue curves; model type by modulation rate interaction effect, F(5,70) = 3.33, p<0.01; post hoc pair-wise comparisons, p<10⁻⁴, Bonferroni correction).

The improved performance of the modulation model suggested that this representation provided better neural sensitivity to fast modulation rates compared to the linear spectrogram. To further investigate this possibility, we estimated modulation rate tuning curves at individual STG electrode sites (n = 195) using linear and nonlinear STRFs, which are based on the spectrogram and modulation representations, respectively (Figure S4). Consistent with prior recordings from lateral temporal human cortex [31], average envelope-locked responses exhibit prominent tuning to low rates (1–8 Hz) with a gradual loss of sensitivity at higher rates (>8 Hz) (Figure 5B and C). In contrast, the average modulation-based tuning curves preserve sensitivity to much higher rates approaching 32 Hz (Figure 5B and C).

Sensitivity to fast modulation rates at single STG electrodes is illustrated for one participant in Figure 7A. In this example (the word “waldo”), the spectrogram envelope (blue curve, top) fluctuates rapidly between the two syllables (“wal” and “do,” ∼300 ms). The linear model assumes that neural responses (high gamma power, black curves, left) are envelope-locked and directly track this rapid change. However, robust tracking of such rapid envelope changes was not generally observed, in violation of linear model assumptions. This is illustrated for several individual electrodes in Figure 7A (compare black curves, left, with blue curve, top). In contrast, the modulation representation encodes this fluctuation nonlinearly as an increase in energy at fast rates (>8 Hz, dashed red curves, ∼300 ms, bottom two rows). This allows the model to capture energy-based modulation information in the neural response. Modulation energy encoding at these sites is quantified by the corresponding nonlinear rate tuning curves (Figure 7A, right column). These tuning curves show neural sensitivity to a range of temporal modulations with a single peak rate. For illustrative purposes, Figure 7A (left) compares modulation energy at the peak temporal rate (dashed red curves) with the neural responses (black curves) at each individual site. This illustrates the ability of the modulation model to account for a rapid decrease in the spectrogram envelope without a corresponding decrease in the neural response.

Figure 7. Example of nonlinear modulation coding and reconstruction. (A) Top: the spectrogram of an isolated word (“waldo”) presented aurally to one participant. Blue curve plots the spectrogram envelope, summed over all frequencies. Left panels: induced high gamma responses (black curves, trial averaged) at four different STG sites. Temporal modulation energy of the stimulus (dashed red curves) is overlaid (computed from 2, 4, 8, and 16 Hz modulation filters and normalized to maximum value). Dashed black lines indicate baseline response level. Right panels: nonlinear modulation rate tuning curves for each site (estimated from nonlinear STRFs). Shaded regions and error bars indicate SEM. (B) Original spectrogram (top), modulation-based reconstruction (middle), and spectrogram-based reconstruction (bottom), linearly decoded from a fixed set of STG electrodes. The modulation reconstruction is projected into the spectrogram domain using an iterative projection algorithm and an overcomplete set of modulation filters [18]. The displayed spectrogram is averaged over 100 random initializations of the algorithm. https://doi.org/10.1371/journal.pbio.1001251.g007

The effect of sensitivity to fast modulation rates can also be observed when the modulation reconstruction is viewed in the spectrogram domain (Figure 7B, middle; see Materials and Methods, Reconstruction Accuracy). The result is that dynamic spectral information (such as the upward frequency sweep at ∼400–500 ms, Figure 7B, top) is better resolved compared to the linear spectrogram-based reconstruction (Figure 7B, bottom). These combined results support the idea of an emergent population-level representation of temporal modulation energy in primate auditory cortex [37]. In support of this notion, subpopulations of neurons have been found that exhibit both envelope and energy-based response properties in primary auditory cortex of non-human primates [37]–[39]. This has led to the suggestion of a dual coding scheme in which slow fluctuations are encoded by synchronized (envelope-locked) neurons, while fast fluctuations are encoded by non-synchronized (energy-based) neurons [37].

While these results indicate that a nonlinear model is required to reliably reconstruct fast modulation rates, psychoacoustic studies have shown that slow and intermediate modulation rates (∼1–8 Hz) are most critical for speech intelligibility [19],[21]. These slow temporal fluctuations carry essential phonological information such as formant transitions and syllable rate [7],[19],[21]. The linear spectrogram model, which also yielded good performance within this range (Figure 5A; Figure S3), therefore appears sufficient to reconstruct the essential range of temporal modulations. To examine this issue, we further assessed reconstruction quality by evaluating the ability to identify isolated words using the linear spectrogram reconstructions. We analyzed a participant implanted with a high-density electrode grid (4 mm spacing), the density of which provided a large set of pSTG electrodes. Compared to lower density grid cases, data for this participant included ensemble frequency tuning that covered the majority of the (speech-related) acoustic spectrum (180–7,000 Hz), a factor which we found was critical for accurate reconstruction (Figure 4D). Spectrogram reconstructions were generated for each of 47 words, using neural responses either from single trials or averaged over 3–5 trials per word (same word set and cross-validated fitting procedure as described in Figure 2). To identify individual words from the reconstructions, a simple speech recognition algorithm based on dynamic time warping was used to temporally align words of variable duration [40]. For a target word, a similarity score (correlation coefficient) was then computed between the target reconstruction and the actual spectrograms of each of the 47 words in the candidate set. The 47 similarity scores were sorted and word identification rank was quantified as the percentile rank of the correct word. (1.0 indicates the target reconstruction matched the correct word out of all candidate words; 0.0 indicates the target was least similar to the correct word among all other candidates.) The expected mean of the distribution of identification ranks is 0.5 at chance level.
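The identification rank described above can be sketched as follows. This is a simplified illustration that omits the dynamic time warping step and assumes the reconstruction and every candidate spectrogram have already been aligned to the same shape; the function name and the tie-handling rule (only strictly lower scores count as worse) are assumptions.

```python
import numpy as np

def identification_rank(reconstruction, candidates, correct_index):
    """Percentile rank of the correct word among all candidate words.

    reconstruction: array (n_times, n_freqs), the reconstructed target.
    candidates:     list of arrays with the same shape, the actual
                    spectrograms of every word in the candidate set.
    Returns 1.0 when the correct word is the most similar candidate and
    0.0 when it is the least similar.
    """
    scores = np.array([
        np.corrcoef(reconstruction.ravel(), c.ravel())[0, 1]
        for c in candidates
    ])
    # Count the competing words that are less similar than the correct one.
    n_worse = np.sum(scores < scores[correct_index])
    return float(n_worse / (len(candidates) - 1))
```

With 47 candidates the rank takes values in steps of 1/46, and a decoder that ranks the correct word at random yields a mean rank of 0.5, matching the chance level stated in the text.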

Word identification using averaged trials was substantially higher than chance (Figure 8A and B, median identification rank = 0.89, p<0.0001; randomization test), with correctly identified words exhibiting accurate reconstructions and poorly identified words exhibiting inaccurate reconstructions (Figure 8C). For single trials, identification performance declined slightly but remained significant (median = 0.76, p<0.0001; randomization test). In addition, for each possible word pair, we computed the similarity between the two original spectrograms and compared this to the similarity between the reconstructed and actual spectrograms (using averaged trials; Figure 8D; Figure S5). Acoustic and reconstruction word similarities were correlated (r = 0.41, p<10⁻¹⁰, df = 45), suggesting that acoustic similarity of the candidate words is likely to influence identification performance (i.e., identification is more difficult when the word set contains many acoustically similar sounds).