Participants

Three human epilepsy patients undergoing treatment at the UCSF Medical Center participated in this study. For the clinical purpose of localizing seizure foci, ECoG arrays were surgically implanted on the cortical surface of one hemisphere for each participant. All participants were right-handed with left hemisphere language dominance determined by their clinicians.

The research protocol was approved by the UCSF Committee on Human Research. Prior to surgery, each patient gave his or her written informed consent to participate in this research.

Neural data acquisition

Participants 1 and 2 were each implanted with two 128-channel ECoG arrays (PMT Corp.) and participant 3 was implanted with a 256-channel ECoG array (Ad-Tech Corp.). Participants 1 and 3 had left hemisphere coverage and participant 2 had right hemisphere coverage. Each implanted array contained disc electrodes with 1.17 mm exposure diameters arranged in a square lattice formation with a 4 mm center-to-center electrode spacing. We used the open source img_pipe package46 to generate MRI brain reconstruction images with electrode locations for each participant (Fig. 2, Supplementary Fig. 3).

We used a data acquisition (DAQ) rig to process the local field potentials recorded from these arrays at multiple cortical sites from each participant. These analog ECoG signals were amplified and quantized using a pre-amplifier (PZ5, Tucker-Davis Technologies). We then performed anti-aliasing (low-pass filtering at 1500 Hz) and line noise removal (notch filtering at 60, 120, and 180 Hz) on a digital signal processor (RZ2, Tucker-Davis Technologies). On the DAQ rig, we stored these neural data (at 3051.76 Hz) along with the time-aligned microphone and speaker audio channels (at 24414.06 Hz). These neural data were anti-aliased again (low-pass filtered at 190 Hz) and streamed at a sampling rate of 381.47 Hz to a real-time computer, which was a Linux machine (64-bit Ubuntu 14.04, Intel Core i7-4790K processor, 32 GB of RAM) implementing a custom software package called real-time Neural Speech Recognition (rtNSR)14.

High gamma feature extraction

The rtNSR package implemented a filter chain comprising three processes to measure high gamma activity in real time (Supplementary Fig. 1). We used high gamma band activity (70–150 Hz) in this work because previous research has shown that activity in this band is correlated with multi-unit firing processes in the cortex28 and can be used as an effective representation of cortical activity during speech processing4,10,13,14,15.

The first of these three processes applied eight band-pass finite impulse response (FIR) filters to the ECoG signals acquired from the DAQ rig (at 381.47 Hz). The logarithmically increasing center frequencies of these filters were 72.0, 79.5, 87.8, 96.9, 107.0, 118.1, 130.4, and 144.0 (in Hz, rounded to one decimal place)13,15. The filters each had an order of 150 and were designed using the Parks-McClellan algorithm47.
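A bank of filters like this can be sketched with SciPy's Parks-McClellan (Remez exchange) routine. The center frequencies and filter order come from the text above; the passband half-width and transition width used here are illustrative assumptions, since the paper does not report band edges.

```python
import numpy as np
from scipy.signal import remez, freqz

FS = 381.47          # acquisition sampling rate (Hz)
ORDER = 150          # filter order from the text -> 151 taps

def design_bandpass(center_hz, half_bw_hz=6.0, trans_hz=6.0):
    """Parks-McClellan band-pass FIR; half_bw/trans widths are assumptions."""
    lo, hi = center_hz - half_bw_hz, center_hz + half_bw_hz
    bands = [0, lo - trans_hz, lo, hi, hi + trans_hz, FS / 2]
    desired = [0, 1, 0]   # stopband, passband, stopband gains
    return remez(ORDER + 1, bands, desired, fs=FS)

centers = [72.0, 79.5, 87.8, 96.9, 107.0, 118.1, 130.4, 144.0]
filters = [design_bandpass(c) for c in centers]
```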

The second process in the filter chain estimated the analytic amplitude values for each band and channel using the signals obtained from the band-passing process. An 80th-order FIR filter was designed using the Parks-McClellan algorithm to approximate the Hilbert transform. For each band and channel, this process estimated the analytic signal using the original signal (delayed by 40 samples, which was half of the filter order) as the real component and the FIR Hilbert transform approximation of the original signal as the imaginary component48. The analytic amplitudes were then computed as the magnitudes of these analytic signals. This filtering approach was applied to every fourth sample of the received signals, effectively decimating the signals to 95.37 Hz.
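The analytic amplitude estimation above can be sketched as follows. The 80th-order FIR Hilbert approximation and the 40-sample delay of the real component come from the text; the Remez band edges (roughly 5–95% of the Nyquist frequency) are illustrative assumptions.

```python
import numpy as np
from scipy.signal import remez, lfilter

FS = 95.37            # effective rate after 4x decimation (Hz)
ORDER = 80            # Hilbert FIR order from the text
DELAY = ORDER // 2    # group delay of a linear-phase FIR (40 samples)

# Equiripple FIR approximation of the Hilbert transform; the band edges
# here are assumptions, not values reported in the paper.
hilb = remez(ORDER + 1, [0.05 * FS / 2, 0.95 * FS / 2], [1],
             type='hilbert', fs=FS)

def analytic_amplitude(x):
    """|analytic signal|: delayed real part + FIR-Hilbert imaginary part."""
    imag = lfilter(hilb, 1.0, x)
    real = np.concatenate([np.zeros(DELAY), x[:len(x) - DELAY]])
    return np.abs(real + 1j * imag)

t = np.arange(2000) / FS
amp = analytic_amplitude(np.sin(2 * np.pi * 10.0 * t))  # 10 Hz test tone
```

For a pure tone, the estimated amplitude settles near the true envelope (1.0) once the filter transient has passed.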

The final process in the filter chain averaged analytic amplitude values across the eight bands, yielding a single high gamma analytic amplitude measure for each channel.

After filtering, the high gamma signals were z-scored using Welford’s method with a 30-second sliding window49. To mitigate signal artifacts such as channel noise and epileptic activity, we clipped the z-score values to lie within the range of [−3.5, 3.5]. We used the resulting z-scores as the representation of high gamma activity in all subsequent analyses and real-time testing.
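A minimal sketch of this normalization step follows: Welford-style running mean/variance updates with sample removal to realize the sliding window, and clipping to [−3.5, 3.5]. The class name and exact update bookkeeping are illustrative, not the rtNSR implementation.

```python
import numpy as np
from collections import deque

class SlidingZScore:
    """Running z-score over a sliding window, using Welford-style
    mean/variance updates with removal, clipped to [-3.5, 3.5]."""
    def __init__(self, window_len, clip=3.5):
        self.window = deque()
        self.window_len = window_len
        self.clip = clip
        self.mean = 0.0
        self.m2 = 0.0    # sum of squared deviations from the mean

    def _add(self, x):
        self.window.append(x)
        delta = x - self.mean
        self.mean += delta / len(self.window)
        self.m2 += delta * (x - self.mean)

    def _remove(self):
        x = self.window.popleft()
        n = len(self.window)
        if n == 0:
            self.mean = self.m2 = 0.0
            return
        delta = x - self.mean
        self.mean -= delta / n
        self.m2 -= delta * (x - self.mean)

    def update(self, x):
        if len(self.window) == self.window_len:
            self._remove()
        self._add(x)
        n = len(self.window)
        std = np.sqrt(self.m2 / n) if n > 1 else 1.0
        z = (x - self.mean) / std if std > 0 else 0.0
        return float(np.clip(z, -self.clip, self.clip))

# ~30 s window at the 95.37 Hz high gamma sample rate
zscorer = SlidingZScore(window_len=int(30 * 95.37))
```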

Experimental task design

The overall goal of this task was to demonstrate real-time decoding of perceived and produced speech while leveraging contextual relationships between the content of the two speech modalities. To achieve this, we designed a question-and-answer task in which participants listen to questions and respond verbally to each question with an answer. There were 9 pre-recorded acoustic question stimuli and 24 possible answers (Table 1). Questions were recorded by a female speaker at 44.1 kHz and were presented to each participant aurally via loudspeakers. Each visual answer choice was represented as a small rectangle containing the text prompt and a small image depicting the text (Fig. 1b; images were included to increase participant engagement). The stimuli were divided into four question/answer sets (QA sets 1–4). The answers in each QA set represented the answer choices that would appear on the screen for each of the questions in that set.

We used the following three types of task blocks: (1) question (perception) training, in which participants heard each question 10 times in a random order (stimulus durations ranged from 1.38 to 2.42 s, with an onset-to-onset interval of 3 s); (2) answer (production) training, in which participants read each possible answer choice aloud 10 times in a random order (each answer appeared on the screen with a gray background for 0.5 s, was changed to a green background for 1.5 s to represent a go cue for the participant to read the answer, and was removed from the screen for 0.5 s before the next answer was displayed); and (3) testing, in which participants heard questions and responded verbally with answers (choosing a response from the possible options presented on the screen after each question). During the testing blocks, a green circle appeared on the screen after each question was presented to cue participants to respond aloud with an answer of their choosing. We encouraged the participants to choose different answers when they encountered the same questions, although they were free to respond with any of the presented answer choices during each trial. There was 2–3 s of silence and a blank screen between each trial. In each block, the questions played to the participant were chosen based on how many questions and answers were in each QA set (questions with more valid answers had a greater chance of being played in each trial). Trials in which the participant failed to respond or responded with an invalid choice (less than 0.5% of trials) were excluded from further analysis. There were 26 question-and-answer trials in each testing block.

During each block, time-aligned behavioral and neural data were collected and stored. The data collected during training blocks were used to fit the decoding models. The data collected during testing blocks were used to decode the perceived questions and produced answers in real time and were also used offline during hyperparameter optimization.

Phonetic transcription

After data collection, both questions and answers were phonetically transcribed from the time-aligned audio using the p2fa package50, which uses the Hidden Markov Model Toolkit and the Carnegie Mellon University Pronouncing Dictionary51,52. The phone boundaries were manually fine-tuned using the Praat software package53. Including a silence phone token /sp/, there were a total of 38 unique phones in the question stimuli and 38 unique phones in the produced answer utterances, although these two phone sets were not identical.

Modeling

After collecting training data for a participant, we fit models using the time-aligned high gamma z-score neural data and phonetic transcriptions. Model fitting was performed offline, and the trained models were saved to the real-time computer to be used during online testing. As described in the Hyperparameter optimization section, the values for many model parameters that were not learned directly from the training data were set using hyperparameter optimization. We used three types of models in this work: speech detection models, utterance classification models, and context integration models.

Speech detection: Before using the neural data to train speech detection models, we analyzed the collected data to identify electrodes that were responsive to speech events13,14. For each time point in the neural data, we used the phonetic transcriptions to determine if that time point occurred during speech perception, speech production, or silence. We then performed Welch’s analysis of variance (ANOVA) on each electrode to identify channels that were significantly modulated by the different types of speech events. Channels that had a Welch’s ANOVA p-value less than a threshold hyperparameter were included in the feature vectors used to train and test the speech detection models.

Speech events were modeled discriminatively as conditional probability distributions of the form p(h t |y t ). Here, h t represents the speech event at time t and is one of the values in the set {perception,production,silence}, and y t is the spatiotemporal neural feature vector at time t. The h t labels were determined from the phonetic transcriptions: for any given time index t, h t was perception if the participant was listening to a phone at time t, production if the participant was producing a phone at time t, or silence otherwise. Each of these feature vectors was constructed by concatenating high gamma z-score values for relevant electrodes across all of the time points in a time window relative to the target time point, capturing both spatial (multiple electrodes) and temporal (multiple time points) dynamics of the cortical activity13,14 (Supplementary Fig. 7). Specifically, a feature vector associated with the speech event label at some time index t consisted of the neural data at the time indices within the closed interval [t + ν shift , t + ν shift + ν duration ], where ν shift and ν duration represent the window onset shift and window duration, respectively, and were determined using hyperparameter optimization.
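The window construction described above can be sketched in a few lines. The function name and the toy array are illustrative; the closed interval [t + ν_shift, t + ν_shift + ν_duration] follows the text.

```python
import numpy as np

def spatiotemporal_features(z, t, shift, duration):
    """Concatenate z-scores (channels x time) over the closed window
    [t + shift, t + shift + duration] into one feature vector."""
    window = z[:, t + shift : t + shift + duration + 1]
    return window.ravel()

# toy example: 3 relevant electrodes, 50 time points
z = np.arange(150, dtype=float).reshape(3, 50)
vec = spatiotemporal_features(z, t=10, shift=-2, duration=4)
```

Each feature vector thus captures both spatial (rows) and temporal (columns) structure of the high gamma activity around the labeled time point.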

To compute the speech event probabilities p(h t |y t ) at each time point, we fit a principal component analysis (PCA) model with the constraint that the dimensionality of the projected feature vectors would be reduced to the minimum number of principal components required to explain a certain fraction of the variance across the features (this fraction was a hyperparameter determined during optimization). We then used these new projected feature vectors and the speech event labels to fit a linear discriminant analysis (LDA) model implementing the least-squares solution with automatic shrinkage described by the Ledoit-Wolf lemma54. After training, these PCA-LDA models could be used during testing to extract the principal components from a previously unseen spatiotemporal feature vector and predict speech event probabilities from the resulting projection (the LDA model assumed flat class priors when computing these probabilities). We used the Python package scikit-learn to implement the PCA and LDA models55.
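The PCA-LDA pipeline described above maps directly onto scikit-learn. The synthetic data, the 0.8 variance fraction, and the class shifts below are illustrative stand-ins (the variance fraction was an optimized hyperparameter in the paper); the least-squares LDA solver with Ledoit-Wolf automatic shrinkage and flat class priors follow the text.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

# Synthetic stand-ins for spatiotemporal feature vectors and event labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))
y = rng.integers(0, 3, size=300)   # perception / production / silence
X[y == 1] += 1.5                   # make the classes separable (toy data)
X[y == 2] -= 1.5

model = make_pipeline(
    # keep the fewest components explaining 80% of the variance (illustrative)
    PCA(n_components=0.8),
    # least-squares LDA with Ledoit-Wolf shrinkage and flat class priors
    LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto',
                               priors=np.full(3, 1 / 3)),
)
model.fit(X, y)
probs = model.predict_proba(X)     # p(h_t | y_t) per time point
```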

During testing, the predicted speech event probabilities were used to detect the onsets and offsets of speech events (Supplementary Fig. 2) with a multi-step approach. For every time point t, the p(h t |y t ) probabilities were computed using the speech event probability model (Supplementary Fig. 2a). For perception and production, these probabilities were smoothed using a sliding window average (Supplementary Fig. 2b). Next, these smoothed probabilities were discretized to be either 1 if the detection model assigned time point t to the associated speech event type or 0 otherwise (Supplementary Fig. 2c). These probability-thresholded binary values were then thresholded in time (debounced); a speech onset (or offset) was only detected if this binary value changed from 0 to 1 and remained 1 for a certain number of time points (or the opposite for offsets; Supplementary Fig. 2d). Whenever a speech event offset was detected (which could only occur after an onset had been detected), the neural data in the detected window were passed to the appropriate utterance classification model (Supplementary Fig. 2e). The number of recent time points used during probability averaging, probability threshold value, time threshold duration, and onset and offset index shifts (integers added to the predicted onset and offset time indices before segmenting the neural data) were all treated as hyperparameters and set via optimization (with separate parameters for perception and production).
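The smooth-threshold-debounce sequence above can be sketched as a simple state machine over one event type's probability trace. The averaging length, probability threshold, and time threshold below are illustrative; in the paper all of these were set by hyperparameter optimization.

```python
import numpy as np

def detect_events(p, n_avg=4, p_thresh=0.5, t_thresh=3):
    """Smooth -> binarize -> debounce one speech event probability trace.
    Returns (onset, offset) index pairs. All thresholds are illustrative."""
    # causal sliding average over the most recent n_avg probabilities
    kernel = np.ones(n_avg) / n_avg
    smoothed = np.convolve(p, kernel)[:len(p)]
    binary = (smoothed > p_thresh).astype(int)
    events, onset, run = [], None, 0
    for t, b in enumerate(binary):
        if onset is None:
            run = run + 1 if b else 0
            if run >= t_thresh:                  # debounced onset
                onset, run = t - t_thresh + 1, 0
        else:
            run = run + 1 if not b else 0
            if run >= t_thresh:                  # debounced offset
                events.append((onset, t - t_thresh + 1))
                onset, run = None, 0
    return events

p = np.array([0.1] * 10 + [0.9] * 20 + [0.1] * 15)
events = detect_events(p)
```

A detected (onset, offset) pair would then be shifted by the optimized index shifts before segmenting the neural data for classification.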

Utterance classification: For each participant and utterance type (questions and answers), we used classification models to predict the likelihood of each utterance given a detected time segment of neural activity. For each utterance, we constructed a hidden Markov model (HMM) to represent that utterance56, with phones q t as hidden states and spatiotemporal neural feature vectors y t as observed states at each time index t. Each of these HMMs was created using the representative phone sequence for the associated utterance (determined from the phonetic transcriptions). The transition matrix for each HMM, which specified the transition probabilities p(q t+1 |q t ), was defined such that each hidden state was one of the phones in the associated representative sequence and could only self-transition (with some probability p self ) or transition to the next phone in the sequence (with probability 1 − p self ). A self-transition probability of 1 was used for the final state. We used the silence phone token /sp/ as the initial and final states for each HMM. Given a time series of high gamma z-score values, each of these HMMs yielded the likelihood of observing those neural features during perception or production of the underlying phone sequence. These likelihoods are robust to natural variability in the durations of the phones in the sequence, which is a key motivation for using HMMs in this approach (even with a single speaker producing the same utterance multiple times, phone durations will vary).
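The left-to-right transition structure described above can be built directly from a representative phone sequence. The example phone sequence and the p_self value are hypothetical (the self-transition probability was not learned from these toy values).

```python
import numpy as np

def make_transition_matrix(phone_seq, p_self=0.6):
    """Left-to-right HMM transitions: each state self-transitions with
    p_self or advances to the next phone; the final state absorbs."""
    n = len(phone_seq)
    A = np.zeros((n, n))
    for i in range(n - 1):
        A[i, i] = p_self
        A[i, i + 1] = 1.0 - p_self
    A[n - 1, n - 1] = 1.0
    return A

# /sp/ delimits each utterance; the phone sequence here is hypothetical
phones = ['sp', 'B', 'R', 'AY', 'T', 'sp']
A = make_transition_matrix(phones)
```

Because a state may self-transition for any number of steps, the resulting likelihoods tolerate the natural variability in phone durations noted above.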

Similar to the relevant electrode selection procedure used for the speech detection models, we identified which channels should be considered relevant to the type of speech processing associated with each utterance type. Using the three previously described data subsets (perception, production, and silence), we performed two-tailed Welch’s t-tests for each channel between the appropriate subsets for each utterance type (perception vs. silence for questions and production vs. silence for answers). Channels with a p-value less than a threshold hyperparameter value were considered relevant for the current utterance type and were used during subsequent phone likelihood modeling.


PCA-LDA models were then trained to compute the phone emission likelihoods p(y t |q t ) at each time point t. The hyperparameters associated with these models, including the feature time window parameters and the PCA minimum variance fraction, were optimized separately from the parameters in the speech event model.

During testing, we used Viterbi decoding on each HMM to determine the likelihood of each utterance given a detected time segment of high gamma z-scores13,14,29,45 (Supplementary Fig. 8). We computed the log likelihood of each utterance using the following recursive formula:

$$\upsilon _{(t,s)} = w_e{\mathrm{log}}p(y_t|s) + \mathop {{{\mathrm{max}}}}\limits_{i \in S} \left[ {\upsilon _{(t - 1,i)} + {\mathrm{log}}p(s|i)} \right],$$ (1)

where υ (t,s) is the log probability of the most likely Viterbi path that ends in phone (state) s at time t, p(y t |s) is the phone emission likelihood (the probability of observing the neural feature vector y t if the current phone is s), p(s|i) is the phone transition probability (the probability of transitioning from phone i to phone s), w e is an emission probability scaling factor (a model hyperparameter) to control the weight of the emission probabilities relative to the transition probabilities (see Supplementary Note 4), and S is the set of all possible phones. To initialize the recursion, we forced each Viterbi decoding procedure to start with a Viterbi path log probability of zero for the first state (the initial silence phone /sp/) and negative infinity for every other state.
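Equation (1) with this initialization can be implemented as a short vectorized recursion. The toy transition and emission matrices below are hypothetical; the function only illustrates the recursion and the forced start in the initial silence state.

```python
import numpy as np

def viterbi_loglik(log_emis, log_A, w_e=1.0):
    """Eq. (1): v[t,s] = w_e*log p(y_t|s) + max_i (v[t-1,i] + log p(s|i)).
    log_emis: (T, S) emission log likelihoods; log_A: (S, S) log transition
    matrix. The recursion starts with log prob 0 in state 0 (initial /sp/)
    and -inf elsewhere, and returns the final-state log likelihood."""
    T, S = log_emis.shape
    v = np.full(S, -np.inf)
    v[0] = 0.0
    for t in range(T):
        v = w_e * log_emis[t] + np.max(v[:, None] + log_A, axis=0)
    return v[-1]

# toy 3-state left-to-right chain with near-deterministic emissions
log_A = np.log(np.array([[0.5, 0.5, 0.0],
                         [0.0, 0.5, 0.5],
                         [0.0, 0.0, 1.0]]) + 1e-12)
log_emis = np.log(np.array([[0.9, 0.05, 0.05],
                            [0.05, 0.9, 0.05],
                            [0.05, 0.05, 0.9]]) + 1e-12)
ll = viterbi_loglik(log_emis, log_A)
```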

After decoding for each HMM, the Viterbi path log probability at the final state and time point for that HMM represents the log likelihood \(\ell _u\) of the corresponding utterance u given the neural data. Log probabilities are used here and in later computations for numerical stability and computational efficiency.

The computed log likelihoods for each utterance were then smoothed and normalized using the following formula:

$$\ell _u^ \ast : = \omega \ell _u - {\mathrm{log}}\left[ {\mathop {\sum}\nolimits_{j \in U} {\mathrm{exp}(\omega \ell _j)} } \right],$$ (2)

where \(\ell _u^ \ast\) is the smoothed and normalized log likelihood for utterance u, ω is the smoothing hyperparameter, and U is the set of all valid utterances (for the current utterance type). Because differences in utterance log likelihoods can be large (e.g., in the hundreds), the smoothing hyperparameter, which lay in the closed interval [0, 1], was included to allow the model to control how confident its likelihood predictions were. The closer ω is to zero, the smoother the log likelihoods are (less sample variance among the log likelihoods). The final log term in Eq. 2 represents the LogSumExp function and was used to compute the normalization constant for the current smoothed log likelihoods. After computing this constant and subtracting it from the smoothed log likelihoods, the \(\ell _u^ \ast\) values satisfied the following equality:

$$\mathop {\sum}\limits_{j \in U} \,{\mathrm{exp}}(\ell _j^ \ast ) = 1.$$ (3)

These \(\ell _u^ \ast\) values were used as the utterance classification model’s estimate of the utterance log likelihoods given the corresponding neural data.
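Equations (2) and (3) amount to a temperature-scaled LogSumExp normalization, which can be sketched as follows (the raw log likelihood values are illustrative):

```python
import numpy as np
from scipy.special import logsumexp

def smooth_normalize(log_liks, omega=0.5):
    """Eq. (2): scale log likelihoods by omega, then subtract the
    LogSumExp normalizer so the results exponentiate to a distribution."""
    scaled = omega * np.asarray(log_liks)
    return scaled - logsumexp(scaled)

raw = np.array([-310.0, -420.0, -555.0])   # large-magnitude toy values
norm = smooth_normalize(raw, omega=0.5)
```

As omega approaches zero the normalized log likelihoods flatten toward a uniform distribution, reducing the model's confidence, while omega = 1 leaves the relative likelihoods unscaled.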

Context integration: Because each answer was only valid for specific questions and an answer always followed each question, we developed a context integration model that used predicted question likelihoods to update the predicted answer probabilities during testing. Based on our previous demonstration that auditory sentences could be decoded from neural activity14, we hypothesized that we could use reliable decoding of the questions to improve answer predictions.

Prior to testing, we defined the relationships between questions and answers in the form of conditional probabilities. These probabilities, referred to as the context priors, were computed using the following formula:

$$p(u_a|u_q) = \left\{ {\begin{array}{*{20}{l}} {\frac{1}{{N_{A,q}}}} \hfill & {{\mathrm{if}}\;u_a\;{\mathrm{and}}\;u_q\;{\mathrm{are}}\;{\mathrm{in}}\;{\mathrm{the}}\;{\mathrm{same}}\;{\mathrm{QA}}\;{\mathrm{set,}}} \hfill \\ 0 \hfill & {{\mathrm{otherwise,}}} \hfill \end{array}} \right.$$ (4)

where p(u a |u q ) is the context prior specifying the probability of responding to the question u q with the answer u a and N A,q is the number of answers in the same question-and-answer (QA) set as u q (the number of valid answers to u q ; Table 1). These context priors assume that the valid answers to any question are equally likely.
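Building the context prior matrix of Eq. (4) is a direct translation of the definition. The miniature QA sets below are hypothetical, not the actual stimuli from Table 1.

```python
import numpy as np

def context_priors(qa_sets):
    """Eq. (4): p(u_a|u_q) = 1/N_{A,q} when u_a is a valid answer to u_q,
    else 0. qa_sets maps each question to its list of valid answers."""
    questions = sorted(qa_sets)
    answers = sorted({a for ans in qa_sets.values() for a in ans})
    P = np.zeros((len(questions), len(answers)))
    for qi, q in enumerate(questions):
        for a in qa_sets[q]:
            P[qi, answers.index(a)] = 1.0 / len(qa_sets[q])
    return P, questions, answers

# hypothetical miniature QA sets (not the actual stimuli)
qa = {'q1': ['a1', 'a2'], 'q2': ['a3', 'a4', 'a5']}
P, qs, ans = context_priors(qa)
```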

During testing, the context integration model receives predicted utterance log likelihoods from both the question and answer classification models. Each time the model receives predicted question log likelihoods (denoted \(\ell _{U_Q}^ \ast\), containing the log likelihoods \(\ell _{u_q}^ \ast\) for each question utterance u q ), it computes prior log probabilities for the answer utterances from these question likelihoods and the pre-defined context priors using the following formula:

$${\mathrm{log}}p_{_Q}\left( {u_a} \right) = {\mathrm{log}}\left\{ {\mathop {\sum}\limits_{u_q \in U_Q} \,{\mathrm{exp}}\left[ {{\mathrm{log}}p(u_a|u_q) + \ell _{u_q}^ \ast } \right]} \right\} + c,$$ (5)

where p Q (u a ) is defined as the prior probability of the answer utterance u a computed using \(\ell _{U_Q}^ \ast\), U Q is the set of all question utterances, and c is a real-valued constant. Each time the model receives predicted answer log likelihoods (the \(\ell _{u_a}^ \ast\) values for each answer utterance u a ), it computes posterior log probabilities for the answer utterances from these answer likelihoods and the answer priors. The unnormalized log posterior probabilities \(\phi _{u_a}\) were computed for each answer utterance u a using the following formula:

$$\phi _{u_a}: = m\,{\mathrm{log}}\,p_{_Q}(u_a) + \ell _{u_a}^ \ast + d,$$ (6)

where m is the context prior scaling factor and d is a real-valued constant. Here, m is a hyperparameter that controls the weight of the answer priors relative to the answer likelihoods (a larger m causes the context to have a larger impact on the answer posteriors). We then normalize these answer log posterior values using the following formula:

$$\phi _{u_a}^ \ast : = \phi _{u_a} - {\mathrm{log}}\left[ {\mathop {\sum}\limits_{j \in U_A} \,{\mathrm{exp}}(\phi _j)} \right],$$ (7)

where \(\phi _{u_a}^ \ast\) is the normalized log posterior probability of u a and U A is the set of all answer utterances. The constants c and d do not need to be computed in practice because they are canceled out during the normalization step in Eq. 7. These \(\phi _{u_a}^ \ast\) values satisfy the following equality:

$$\mathop {\sum}\limits_{j \in U_A} \,{\mathrm{exp}}(\phi _j^ \ast ) = 1.$$ (8)

Finally, the predicted utterance identities are computed as:

$$\hat u_q = \mathop {{{\mathrm{argmax}}}}\limits_{u_q \in U_Q} \ell _{u_q}^ \ast ,$$ (9)

$$\hat u_{a - } = \mathop {{{\mathrm{argmax}}}}\limits_{u_a \in U_A} \ell _{u_a}^ \ast ,$$ (10)

$$\hat u_{a + } = \mathop {{{\mathrm{argmax}}}}\limits_{u_a \in U_A} \phi _{u_a}^ \ast ,$$ (11)

where \(\hat u_q\), \(\hat u_{a - }\), and \(\hat u_{a + }\) are the system’s predictions for questions, answers without context, and answers with context, respectively. The \(\hat u_q\) and \(\hat u_{a + }\) predictions are the system outputs during decoding, and the \(\hat u_{a - }\) predictions are used in offline analyses. For a more thorough mathematical description of the context integration procedure, see Supplementary Note 5.
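The full context integration step, Eqs. (5)-(7) followed by the argmax of Eq. (11), can be sketched compactly. The two-question, three-answer setup and all probability values below are hypothetical; the constants c and d are omitted because, as noted above, they cancel during normalization.

```python
import numpy as np
from scipy.special import logsumexp

def answer_posteriors(log_pq, log_la, log_prior_matrix, m=1.0):
    """Combine question log likelihoods (log_pq), the context prior
    matrix log p(u_a|u_q), and answer log likelihoods (log_la) into
    normalized answer log posteriors."""
    # Eq. (5): answer log priors via LogSumExp over questions
    log_prior_a = logsumexp(log_prior_matrix + log_pq[:, None], axis=0)
    # Eq. (6): weight the priors by m and add the answer likelihoods
    phi = m * log_prior_a + log_la
    # Eq. (7): subtract the LogSumExp normalizer
    return phi - logsumexp(phi)

# hypothetical setup: answers 0-1 valid only for question 0, answer 2 for q1
priors = np.log(np.array([[0.5, 0.5, 0.0],
                          [0.0, 0.0, 1.0]]) + 1e-300)
log_pq = np.log(np.array([0.9, 0.1]))       # question classifier favors q0
log_la = np.log(np.array([0.2, 0.3, 0.5]))  # answer classifier favors a2
post = answer_posteriors(log_pq, log_la, priors)
```

In this toy case the context flips the prediction: without context the answer classifier would pick answer 2, but the strong question evidence for question 0 shifts the posterior toward its valid answers.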

Although an answer followed each question during testing, it was possible for the speech detector to fail to detect question or answer events (or to detect false positives). Because of this, we did not force the context integration model to always expect answer likelihoods after receiving question likelihoods or vice versa. Instead, during each test block, we maintained a set of values for the answer priors that were only updated when a new set of question likelihoods was received. When a new set of answer likelihoods was received, the current answer prior values were used to compute the posteriors. If answer likelihoods were received before receiving any question likelihoods, answer posteriors and answer with context predictions would not be computed from those likelihoods (although this did not actually occur in any test blocks).

Hyperparameter optimization

Each type of model (speech detection, utterance classification, and context integration) had one or more parameters that could not be learned directly from the training data. Examples of physiologically relevant hyperparameters include a temporal offset shift between perceived and produced phones and the neural data (which could account for neural response delays or speech production planning), the duration of the spatiotemporal neural feature vectors used during model training and testing, and a p-value threshold used when deciding which electrodes should be considered relevant and included in the analyses.

Instead of manually selecting values for these hyperparameters, we performed cross-validated hyperparameter optimization using the hyperopt Python package32,33. This package uses a Bayesian-based optimization algorithm called the Tree-structured Parzen Estimator to explore a hyperparameter space across multiple epochs. Briefly, this optimization approach samples hyperparameter values from pre-defined prior distributions, uses a loss function to evaluate the current hyperparameters, and then repeats these steps using knowledge gained from the evaluations it has already performed. After a desired number of epochs, the hyperparameter set associated with the minimal loss value across all epochs is chosen as the optimal hyperparameter set.

We performed hyperparameter optimization for each participant, model type, and test block. We used a leave-one-block-out cross-validation scheme for each test block. Specifically, during an optimization run for any given test block, the hyperparameters were evaluated on a held-out validation set comprising all of the other test blocks available for the current participant. We used 250 epochs for each optimization run. All of the hyperparameters that were set via optimization are described in Supplementary Table 5, and the full optimization procedure is described in Supplementary Method 2.
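The optimization loop has the following shape. This sketch substitutes plain random sampling for hyperopt's Tree-structured Parzen Estimator, and the loss function, parameter names, and ranges are illustrative placeholders rather than the paper's actual search space.

```python
import random

def evaluate(params, validation_blocks):
    """Placeholder loss: the real system would train and test models on
    the held-out blocks and return, e.g., a cross entropy."""
    return sum((params['p_thresh'] - 0.01) ** 2 for _ in validation_blocks)

def optimize(validation_blocks, n_epochs=250, seed=0):
    """Random-search stand-in for TPE: sample, evaluate, keep the best."""
    rng = random.Random(seed)
    best_loss, best_params = float('inf'), None
    for _ in range(n_epochs):
        params = {'p_thresh': rng.uniform(0.001, 0.05),  # relevance p-value
                  'nu_shift': rng.randint(-30, 0)}       # window onset shift
        loss = evaluate(params, validation_blocks)
        if loss < best_loss:
            best_loss, best_params = loss, params
    return best_params

best = optimize(validation_blocks=['block_a', 'block_b'])
```

Unlike random search, TPE reuses past evaluations to bias sampling toward promising regions, but the evaluate-and-keep-best skeleton is the same.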

Evaluation methods and statistical analyses

Primary evaluation metrics: We used the following metrics during the primary evaluations of our system: decoding accuracy rate, classification accuracy, cross entropy, speech detection score, and electrode discriminative power (Fig. 2). The decoding accuracy rate metric represented the full performance of the system (the combined performance of the speech detection, utterance classification, and context integration models). When computing the accuracy rates for each prediction type (questions, answers without context, and answers with context) and participant, we obtained overall actual and predicted sequences by concatenating the actual and predicted utterances across all of the test blocks. We then calculated an utterance error rate using these sequences, which is an analog of the commonly used word error rate metric and is a measure of the edit (Levenshtein) distance between the actual and decoded utterance label sequences in a given test block. The accuracy rate was then computed as 1 minus the utterance error rate (or 0 if this difference would be negative, although this was never observed in our experiments).
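The utterance error rate and accuracy rate can be computed with a standard Levenshtein dynamic program over utterance labels. The label strings below are hypothetical examples, not the actual decoded sequences.

```python
def utterance_error_rate(actual, predicted):
    """Levenshtein distance between utterance label sequences divided by
    the length of the actual sequence (analogous to word error rate)."""
    m, n = len(actual), len(predicted)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if actual[i - 1] == predicted[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / m

def accuracy_rate(actual, predicted):
    return max(0.0, 1.0 - utterance_error_rate(actual, predicted))

# hypothetical utterance labels with one substitution error
actual = ['fine', 'bright', 'cold', 'six']
predicted = ['fine', 'dark', 'cold', 'six']
acc = accuracy_rate(actual, predicted)
```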

Classification accuracy and cross entropy metrics were computed for each participant by using only the utterance classification and context integration models (and not the speech detection models). In this approach, we performed decoding on the test blocks using the actual speech event times and the previously trained utterance classification models. Because the HMMs used to represent the utterances were designed to start and end the Viterbi decoding process during silence, we padded 300 ms of silence time points before and after the utterance in each speech-related time window of neural data passed to the classifiers. We then performed context integration model optimization with these new classification results and applied the optimized context integration models to the results. After this step, we pooled all of the pairs of actual and predicted utterance labels for each prediction type across all of the test blocks for each participant.

Classification accuracy was defined as the proportion of trials in which the utterance classification model correctly predicted the identity of the utterance. To obtain the mean and variance of the classification accuracy, we used classification accuracies computed on bootstrapped resamples of the trials (one million resamples). To measure information transfer rate (which we did not consider a primary evaluation metric in this work), we used these classification accuracy values, speech durations from the test blocks, and the number of possible answer responses (see Supplementary Note 1 for details).

The cross entropy metric quantified the amount of predictive information provided by the utterance classification and context integration models during testing and hyperparameter optimization. We computed cross entropies using the surprisal values for each classification trial, prediction type, and participant. For a given trial and prediction type, the relevant surprisal value for that trial is equal to the negative of the predicted log probability associated with the actual utterance label. The cross entropy is equal to the mean of these surprisal values. To obtain the mean and variance of the cross entropy, we used cross entropies computed on bootstrapped resamples of the trials (one million resamples). Lower cross entropy indicates better performance.
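The surprisal-based cross entropy and its bootstrapped mean and variance can be sketched as follows; the toy predicted probabilities, label values, and resample count are illustrative.

```python
import numpy as np

def cross_entropy(log_probs, labels):
    """Mean surprisal: negative predicted log probability of the actual
    label, averaged over trials."""
    surprisals = np.array([-log_probs[t, lab]
                           for t, lab in enumerate(labels)])
    return surprisals.mean()

def bootstrap_ce(log_probs, labels, n_resamples=2000, seed=0):
    """Mean and variance of cross entropy over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n = len(labels)
    ces = [cross_entropy(log_probs[idx], labels[idx])
           for idx in rng.integers(0, n, size=(n_resamples, n))]
    return np.mean(ces), np.var(ces)

# toy predicted log probabilities (3 trials x 2 classes), true label 0
log_probs = np.log(np.array([[0.8, 0.2], [0.6, 0.4], [0.9, 0.1]]))
labels = [0, 0, 0]
ce = cross_entropy(log_probs, labels)
mean_ce, var_ce = bootstrap_ce(log_probs, labels)
```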

To evaluate and optimize the speech detector, we created a score metric that computes a weighted combination of a frame-by-frame accuracy a frame and a general event detection accuracy a event . The frame-by-frame accuracy measures the performance of the speech detector using the detected presence or absence of a speech event at each time point. This measure is analogous to sensitivity and specificity analyses commonly used for binary prediction. Phonetic transcriptions were used to determine the actual times of the speech events and compute true positives, true negatives, false positives, and false negatives. When using these transcribed speech times, we decremented each speech onset time and incremented each speech offset time by 300 ms to label some silence time points before and after each utterance as positive frames. We performed this modification to encourage the optimizer to select hyperparameters that would include silence before and after each utterance in the detected neural feature time windows, which is useful during utterance classification. We calculated the frame-by-frame accuracy measure using the following formula:

$$a_{{\mathrm{frame}}}: = \frac{{w_{\mathrm{P}}N_{{\mathrm{TP}}} + (1 - w_{\mathrm{P}})N_{{\mathrm{TN}}}}}{{w_{\mathrm{P}}N_{\mathrm{P}} + (1 - w_{\mathrm{P}})N_{\mathrm{N}}}},$$ (12)

where w P is the positive weight fraction, N TP is the number of true positives detected, N TN is the number of true negatives detected, N P is the total number of positive frames in the test data, and N N is the total number of negative frames in the test data. The positive weight fraction was included to allow control over how important true positive detection was relative to true negative detection. In practice, we used w P = 0.75, meaning that correctly detecting positive frames was three times as important as correctly detecting negative frames. We used this value to encourage the optimizer to select hyperparameters that would prefer to make more false positive errors than false negative errors, since the performance of the utterance classifiers should diminish more if a few speech-relevant time points were excluded from the detected time window than if a few extra silence time points were included. The general event detection accuracy, which measures how well the speech events were detected without considering which time points were associated with each event, was computed using the following formula:

$$a_{{\mathrm{event}}}: = 1 - {\mathrm{min}}\left( {1,\frac{{\left| {N_{{\mathrm{DE}}} - N_{{\mathrm{AE}}}} \right|}}{{N_{{\mathrm{AE}}}}}} \right),$$ (13)

where N DE and N AE are the number of detected and actual speech events in the current test block, respectively. To compute the speech detection score s detection , these two measures were combined using the following formula:

$$s_{{\mathrm{detection}}} = w_{\mathrm{F}}a_{{\mathrm{frame}}} + (1 - w_{\mathrm{F}})a_{{\mathrm{event}}},$$ (14)

where \(w_{\mathrm{F}}\) is the frame-by-frame accuracy weight fraction, which allows control over how much impact the frame-by-frame accuracy measure has on the speech detection score relative to the general event detection accuracy. In practice, we let \(w_{\mathrm{F}} = 0.5\) for an equal weighting between the two measures. For each participant and utterance type, the overall detection score was computed by averaging the detection scores across the test blocks.
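Combining Eqs. (12)–(14), the detection score reduces to a few lines of array arithmetic. The following is a minimal illustrative sketch (function and argument names are ours, not from the actual rtNSR code):

```python
import numpy as np

def detection_score(pred, actual, n_detected_events, n_actual_events,
                    w_p=0.75, w_f=0.5):
    """Combined speech detection score (Eqs. 12-14).

    pred, actual: boolean arrays of per-frame speech labels
    (detected and transcribed, respectively)."""
    pred = np.asarray(pred, dtype=bool)
    actual = np.asarray(actual, dtype=bool)
    n_tp = np.sum(pred & actual)      # true positives
    n_tn = np.sum(~pred & ~actual)    # true negatives
    n_p = np.sum(actual)              # total positive frames
    n_n = np.sum(~actual)             # total negative frames
    # Eq. 12: weighted frame-by-frame accuracy
    a_frame = (w_p * n_tp + (1 - w_p) * n_tn) / (w_p * n_p + (1 - w_p) * n_n)
    # Eq. 13: general event detection accuracy
    a_event = 1 - min(1, abs(n_detected_events - n_actual_events) / n_actual_events)
    # Eq. 14: weighted combination
    return w_f * a_frame + (1 - w_f) * a_event
```

With the default weights, a detector that labels every frame correctly and detects the right number of events scores exactly 1.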

To assess the importance of each electrode during phone and speech event likelihood modeling, we estimated the discriminative power of each electrode within the trained PCA-LDA models13. We arbitrarily selected a test block for each participant and obtained the trained and optimized utterance classification and speech detection models associated with that test block. For each of these models, we examined the learned parameters within the LDA model. For each feature in the LDA model (which is a principal component), we measured the between-class variance for that feature by computing the variance of the corresponding class means. We used the values along the diagonal of the shared covariance matrix as a measure of the within-class variance of each feature (because we did not force diagonal covariance matrices in the LDA models, this is only an approximation of the true within-class variances). Similar to a coefficient of determination (\(R^2\)) calculation, we then estimated the discriminative power for each LDA feature as a ratio of the between-class variance to the total variance using the following formula:

$$\eta _i = \frac{{\sigma _{{\mathrm{b}},i}^2}}{{\sigma _{{\mathrm{w}},i}^2 + \sigma _{{\mathrm{b}},i}^2}},$$ (15)

where \(\eta_i\), \(\sigma _{{\mathrm{b}},i}^2\), and \(\sigma _{{\mathrm{w}},i}^2\) are the estimated discriminative power, between-class variance, and within-class variance, respectively, for the ith LDA feature. To obtain the discriminative powers for each original feature in the spatiotemporal neural feature vectors (the inputs to the PCA model), the absolute values of the PCA component weights were used to project the LDA feature discriminative powers back into the original feature space. Finally, the discriminative power for each electrode was set equal to the maximum discriminative power value observed among the original features associated with that electrode (that is, the maximum function was used to aggregate the discriminative powers across time for each electrode within the spatiotemporal feature vectors). The resulting discriminative power values were used to quantify the relative contributions of each electrode during phone or speech event discrimination.
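The steps above (Eq. 15, back-projection through the PCA weights, and max-aggregation across time) can be sketched as follows. This is a generic illustration under assumed array shapes, including the assumption that original features are ordered electrode-major (all time points of one electrode contiguous); the actual model code may differ:

```python
import numpy as np

def electrode_discriminative_power(class_means, shared_cov, pca_components,
                                   n_electrodes, n_time_points):
    """Estimate per-electrode discriminative power from a trained PCA-LDA model.

    class_means: (n_classes, n_lda_features) LDA class means
    shared_cov: (n_lda_features, n_lda_features) shared covariance matrix
    pca_components: (n_lda_features, n_original_features) PCA weights,
        with n_original_features == n_electrodes * n_time_points."""
    # Between-class variance: variance of the class means for each LDA feature.
    var_between = np.var(class_means, axis=0)
    # Within-class variance: diagonal of the shared covariance (an approximation,
    # since the covariance matrices were not forced to be diagonal).
    var_within = np.diag(shared_cov)
    # Eq. 15: ratio of between-class variance to total variance.
    eta = var_between / (var_within + var_between)
    # Project back into the original feature space via |PCA weights|.
    eta_original = np.abs(pca_components).T @ eta
    # Aggregate across time points with the max for each electrode.
    return eta_original.reshape(n_electrodes, n_time_points).max(axis=1)
```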

Auxiliary decoding analyses: As described in Section 2.3, we investigated the sensitivity of the decoding models to limited data availability and sub-optimal hyperparameter configurations (Fig. 3). Thorough descriptions of these analyses are provided in Supplementary Method 1. The analysis of how spatial resolution affected decoder performance is described in Supplementary Note 3.

As described in Section 2.4, we performed additional analyses on the Viterbi decoding and phone likelihood modeling approaches used by the answer classifiers (Fig. 4). Thorough descriptions of these analyses are provided in Supplementary Method 3.

When performing answer classification with hard or true priors instead of soft priors, the question likelihoods in each trial were modified prior to context integration. For hard priors, the likelihood of the most likely question was set to 1 and the likelihoods of the other questions were set to 0. For true priors, the likelihood of the question that was actually presented to the participant was set to 1 and the likelihoods of the other questions were set to 0. After this modification, the context integration procedure was performed normally to obtain the answer predictions.
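The hard- and true-prior modifications amount to replacing the question likelihood vector with a one-hot vector before context integration. A minimal sketch (function and argument names are illustrative, not from the actual analysis code):

```python
import numpy as np

def apply_priors(question_likelihoods, mode, true_index=None):
    """Modify question likelihoods prior to context integration.

    mode: 'soft' leaves the likelihoods unchanged, 'hard' one-hots the
    most likely question, and 'true' one-hots the question that was
    actually presented (given by true_index)."""
    q = np.asarray(question_likelihoods, dtype=float)
    if mode == 'soft':
        return q
    out = np.zeros_like(q)
    if mode == 'hard':
        out[np.argmax(q)] = 1.0   # most likely question gets likelihood 1
    elif mode == 'true':
        out[true_index] = 1.0     # presented question gets likelihood 1
    return out
```

After this step, the context integration procedure operates on the modified likelihoods exactly as it would on the soft ones.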

Statistical testing: The statistical tests used in this work are all described in Section 2. For all tests, we considered p-values less than 0.05 as significant. We used a 4-way Holm-Bonferroni correction30 for the chance comparisons with the three prediction types (questions, answers without context, and answers with context) and the answer with vs. without context comparison because the neural data used during these analyses were not independent of each other. Thorough descriptions of all of the statistical tests are provided in Supplementary Method 4.
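For reference, the Holm-Bonferroni step-down procedure used for the 4-way correction can be sketched generically as follows (this is a standard implementation of the method, not the actual analysis code):

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return per-test significance decisions under the Holm-Bonferroni
    step-down procedure for m comparisons."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, idx in enumerate(order):
        # Compare the k-th smallest p-value against alpha / (m - k).
        if p_values[idx] <= alpha / (m - rank):
            significant[idx] = True
        else:
            break  # step down: once one test fails, all larger p-values fail
    return significant
```

With m = 4 comparisons, the smallest p-value is tested against 0.05/4, the next against 0.05/3, and so on.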

Real-time decoding

In our previous work, we introduced the rtNSR software package14. Written in Python57, this package is flexible and efficient due to its modular structure and utilization of software pipelining58. With further development, we used rtNSR here to present the audio and visual stimuli, process the neural signals, and perform speech decoding in real-time (Supplementary Fig. 9). We also used it for offline model training and data analysis.

Due to clinical time constraints, we were not able to perform hyperparameter optimization prior to real-time testing with the participants. All of the results reported in this work were therefore computed using offline simulations of the data with the rtNSR system, a process that we described in our previous work. During the offline simulations, the real-time process that reads samples from the real-time interface card is replaced with a process that simulates input samples from a dataset on disk; the remainder of the decoding pipeline is unchanged. During online testing at the patient’s bedside, the system performed decoding without systematic or runtime errors and with negligible latency, using hyperparameter values chosen via trial and error on previously collected datasets. Because the only relevant difference between the online and offline tests was the specific hyperparameter values, we can reasonably expect that the decoding performance observed in our offline simulations would have been achievable in the online setting with the patients.
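The substitution described above, swapping the hardware input process for a disk-backed one while leaving the downstream pipeline untouched, can be sketched as follows. Class and function names are illustrative only and do not reflect the actual rtNSR API:

```python
class FileSampleReader:
    """Offline input source: streams samples from a dataset in fixed-size
    chunks, mimicking the read interface of the real-time hardware reader."""

    def __init__(self, data, chunk_size):
        self.data = data
        self.chunk_size = chunk_size
        self.pos = 0

    def read(self):
        chunk = self.data[self.pos:self.pos + self.chunk_size]
        self.pos += self.chunk_size
        return chunk  # empty once the dataset is exhausted

def run_pipeline(reader, process_chunk):
    """Downstream decoding stages are unchanged; only the source differs."""
    results = []
    while len(chunk := reader.read()) > 0:
        results.append(process_chunk(chunk))
    return results
```

Because the pipeline only sees the `read` interface, the same decoding code runs unmodified in both the online and the simulated setting.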

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.