Dataset

The study was a single-center, prospective observational study approved by the Partners Institutional Review Board (IRB). The IRB waived the requirement for written signed consent in this study. The EEG signals were collected from 195 distinct ICU patients. The inclusion criteria were: (1) age ≥18 years; (2) on mechanical ventilation; and (3) at least one RASS or CAM-ICU assessment during EEG recording. The exclusion criteria were: (1) any known focal neurologic deficits or dementia; and (2) poor EEG signal quality by visual inspection (21 patients excluded). The final dataset contains 174 patients. The average ICU stay was 12–13 days. The most commonly used sedative was propofol. Patient characteristics are summarized in Table 2.

Table 2 Patient characteristics

RASS and CAM-ICU

To measure the level of consciousness, we used the Richmond Agitation-Sedation Scale (RASS; ref. 7) as the target to train the model. RASS was assessed by ICU nurses and clinical research technicians approximately every 2 h. RASS has ten levels from −5 to +4, as shown in Supplementary Table 1. The range from −5 to 0 (inclusive) describes different levels of sedation, where −5 and −4 indicate coma (unarousable, no response to verbal or noxious stimulation) and 0 indicates an alert and calm state. The range from +1 to +4 (inclusive) describes different levels of agitation, which are associated with hyperactive delirium. In this study we limited RASS assessments to those of normal or decreased levels of arousal only, i.e. −5 to 0, since (1) there was no positive RASS during CAM-ICU assessments with EEG signal available in the dataset (Supplementary Fig. 1) and (2) combative and agitated states can be reliably detected by ICU staff.

To measure delirium, we used the CAM-ICU as the target to train the model. The CAM-ICU is a screening protocol that is performed about every 24 h (ref. 8; Supplementary Table 2). While unresponsive patients (RASS = −4 or −5) are typically not further assessed in formal use of the CAM-ICU, we treated these patients as CAM-ICU positive for model training purposes, given the clearly abnormal mental status.

EEG preprocessing

The EEG signals were recorded using SedLine brain function monitors (Masimo Corporation, Irvine, CA, USA), with a 250 Hz sampling rate and 4 frontal electrodes. We re-referenced the signals to 2 bipolar channels: Fp1-F7 and Fp2-F8. The signals were first notch filtered at 60 Hz, then bandpass filtered between 0.5 Hz and 20 Hz, and finally downsampled to 62.5 Hz.
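The filtering chain above can be sketched as follows. This is a minimal illustration, not the authors' code: the electrode ordering, the notch quality factor, the Butterworth filter order, and the use of zero-phase filtering are all assumptions, since the text specifies only the cut-off frequencies and sampling rates.

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch, sosfiltfilt

FS_RAW = 250.0   # Hz, sampling rate stated in the text
DOWNSAMPLE = 4   # 250 Hz / 4 = 62.5 Hz

def preprocess(eeg, fs=FS_RAW, notch_q=30.0, bp_order=4):
    """eeg: array of shape (4, n_samples), assumed ordered [Fp1, Fp2, F7, F8]."""
    fp1, fp2, f7, f8 = eeg
    # Re-reference to the two bipolar channels Fp1-F7 and Fp2-F8.
    x = np.stack([fp1 - f7, fp2 - f8])
    # 60 Hz notch (quality factor is an assumed value).
    b, a = iirnotch(60.0, notch_q, fs=fs)
    x = filtfilt(b, a, x, axis=-1)
    # 0.5-20 Hz bandpass (Butterworth design and order are assumptions).
    sos = butter(bp_order, [0.5, 20.0], btype="bandpass", fs=fs, output="sos")
    x = sosfiltfilt(sos, x, axis=-1)
    # The 20 Hz cut-off is below the new Nyquist frequency (31.25 Hz),
    # so keeping every 4th sample does not alias.
    return x[:, ::DOWNSAMPLE]
```

Calling `preprocess` on 10 s of raw data (2500 samples per electrode) returns an array of shape `(2, 625)`.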

For each RASS or CAM-ICU assessment, we took a 1 h EEG segment spanning from 30 min before to 30 min after the recorded assessment time. The assessment times recorded by the ICU nurses may be imprecise, since they are entered after the assessment is performed. We therefore used this longer EEG segment to ensure that it included the actual assessment time.

EEG artifacts were defined based on the presence of any of the following in any EEG channel: (1) maximum amplitude higher than 1000 µV; (2) standard deviation less than 0.2 µV; (3) overly fast changes of more than 900 µV within 0.1 s; or (4) a spuriously staircase-like spectrum, detected when the maximum value obtained by convolution with a predefined staircase-like kernel exceeds an empirical threshold of 10, indicating the presence of nonphysiologic single-frequency artifacts from ICU machines (e.g. cooling blankets or pumps).
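Criteria (1) to (3) can be sketched as below. Criterion (4) is omitted because the staircase-like kernel is not given here. The sketch assumes that "maximum amplitude" means the maximum absolute value, and that the check runs on the 62.5 Hz signal; neither is stated in the text.

```python
import numpy as np

FS = 62.5  # Hz; assumes the check runs on the downsampled signal

def is_artifact(segment, fs=FS, max_amp=1000.0, min_std=0.2,
                max_jump=900.0, jump_window_s=0.1):
    """segment: (n_channels, n_samples) in µV. True if any channel fails criteria 1-3."""
    seg = np.asarray(segment, dtype=float)
    # (1) amplitude above 1000 µV (interpreted as absolute value).
    too_large = np.abs(seg).max(axis=-1) > max_amp
    # (2) nearly flat signal.
    too_flat = seg.std(axis=-1) < min_std
    # (3) change of more than 900 µV between samples at most 0.1 s apart.
    max_lag = max(1, int(round(jump_window_s * fs)))
    too_fast = np.zeros(seg.shape[0], dtype=bool)
    for lag in range(1, max_lag + 1):
        diff = np.abs(seg[:, lag:] - seg[:, :-lag])
        too_fast |= diff.max(axis=-1) > max_jump
    return bool(np.any(too_large | too_flat | too_fast))
```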

Deep learning model

The overall deep learning model consisted of a convolutional neural network (CNN) followed by a long short-term memory (LSTM) network, as shown in Supplementary Fig. 9. The CNN extracts useful information from each 4 s window of the EEG waveform, and the LSTM provides the temporal context. The CNN followed the architecture in Hannun et al. (ref. 28). It contains 8 blocks, each consisting mainly of two convolutional layers (conv) and a max-pooling skip connection. The output of the CNN is then fed to a two-layer LSTM, followed by an output layer, which performs ordinal regression for RASS and binary classification for CAM-ICU. The ordinal regression learns a continuous “z-score” and a set of thresholds. If needed, the learned thresholds can be applied to discretize the z-score into RASS levels. The binary classification outputs the probability of being CAM-ICU positive (delirium). A detailed description of the model architecture and coding details can be found in Supplementary Methods.
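As an illustration of the discretization step, the sketch below maps a continuous z-score to one of the six RASS levels used in this study by locating it among five ordered thresholds. The threshold values are hypothetical placeholders; in the model they are learned by the ordinal regression layer.

```python
import numpy as np

RASS_LEVELS = np.array([-5, -4, -3, -2, -1, 0])
# Hypothetical thresholds, for illustration only (the real ones are learned).
THRESHOLDS = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

def z_to_rass(z, thresholds=THRESHOLDS):
    """Return the RASS level whose threshold interval contains each z-score."""
    # searchsorted counts how many thresholds are <= z, giving an index 0..5.
    idx = np.searchsorted(thresholds, z, side="right")
    return RASS_LEVELS[idx]
```

With these placeholder thresholds, a z-score below −2 maps to RASS −5 and a z-score of 2 or above maps to RASS 0.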

Model training

To reduce the risk of overfitting to the dataset, we randomly split patients into ten groups (folds). We took each fold in turn as the testing set, and the other nine folds as the training set. Within the training set we further randomly selected 10% of assessments as the validation set, and kept the remaining 90% of assessments for training. The model with the minimum loss on the validation set was used, and results were then calculated on the held-out testing set. The above procedure was repeated for each fold.
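A minimal sketch of this splitting scheme follows, assuming one patient identifier per assessment. The function name `patient_folds` and the fixed random seed are illustrative choices, not details from the paper.

```python
import numpy as np

def patient_folds(patient_ids, n_folds=10, seed=0):
    """Yield (train_idx, val_idx, test_idx) with folds split at the patient level."""
    patient_ids = np.asarray(patient_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.unique(patient_ids))
    for test_patients in np.array_split(shuffled, n_folds):
        test_mask = np.isin(patient_ids, test_patients)
        test_idx = np.flatnonzero(test_mask)
        # Validation set: 10% of the remaining assessments, chosen at random.
        rest = rng.permutation(np.flatnonzero(~test_mask))
        n_val = max(1, int(round(0.1 * len(rest))))
        yield rest[n_val:], rest[:n_val], test_idx
```

Splitting by patient rather than by assessment keeps all recordings from a test patient out of training, which is what makes the held-out results an estimate of performance on unseen patients.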

To prepare data for the CNN, the 1 h EEG signal around each assessment was segmented into non-overlapping 4 s windows (Supplementary Fig. 2a). We removed 4s-segments identified as artifact; 10% of segments were removed for this reason. The input to the CNN has size N × 2 × 250, where N is the number of 4s-segments, 2 is the number of channels and 250 is the number of time points in 4 s (at 62.5 Hz). The choice of a 4 s window is informed by domain knowledge: in clinical neurology practice, windows of 10 s are used, but 4 s is enough to discern the features usually used to describe the EEG, e.g. the presence of delta or theta slowing, epileptiform abnormalities, and EEG suppression.
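The windowing can be sketched as a reshape, assuming consecutive 4 s windows with a 4 s step, which is what the counts of 900 segments per hour and 142 segments per 9.5 min imply. Trailing samples that do not fill a whole window are dropped in this sketch; how the original code handled them is not stated.

```python
import numpy as np

FS = 62.5                        # Hz, after downsampling
WINDOW_N = int(4.0 * FS)         # 250 samples per 4 s window

def segment(eeg):
    """eeg: (n_channels, n_samples) -> (n_segments, n_channels, 250)."""
    n_ch, n_samples = eeg.shape
    n_seg = n_samples // WINDOW_N
    trimmed = eeg[:, : n_seg * WINDOW_N]
    return trimmed.reshape(n_ch, n_seg, WINDOW_N).transpose(1, 0, 2)
```

A 1 h recording (225,000 samples per channel) yields an array of shape `(900, 2, 250)`.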

Data preparation for the LSTM is different. There are 900 4s-segments in each 1 h EEG signal, and training an LSTM model on such a long sequence is difficult. We therefore trained the two LSTM layers separately while fixing the parameters of the already trained CNN. The first LSTM layer was trained using 9.5 min sequences with a step size of 1 min (Supplementary Fig. 2b). The input had size N × 142 × 2 × 250, where 142 is the number of 4s-segments in a 9.5 min sequence. To train the second LSTM layer, we fixed the first LSTM layer and used 1 h sequences with size N × 900 × 2 × 250, where 900 is the number of 4s-segments in a 1 h sequence (Supplementary Fig. 2c). Sequences in which more than 50% of 4s-segments were artifact were removed; in the remaining sequences, artifact segments were kept to preserve the continuity of the sequence. 9% of the sequences were removed.
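The construction of the 9.5 min training sequences can be sketched as a sliding window over the 4s-segments, with a step of 15 segments (1 min) and the 50% artifact rule applied per sequence. The function name and the boolean `artifact_mask` input are illustrative assumptions.

```python
import numpy as np

SEQ_LEN = 142   # number of 4 s segments in 9.5 min
STEP = 15       # 1 min step = 15 segments of 4 s

def make_sequences(segments, artifact_mask, seq_len=SEQ_LEN, step=STEP,
                   max_artifact_frac=0.5):
    """segments: (n_seg, n_ch, n_t); artifact_mask: (n_seg,) bool.

    Returns the kept sequences, shape (n_sequences, seq_len, n_ch, n_t).
    """
    starts = range(0, len(segments) - seq_len + 1, step)
    kept = [segments[s:s + seq_len] for s in starts
            # Drop a sequence only if more than half of its segments are artifact.
            if artifact_mask[s:s + seq_len].mean() <= max_artifact_frac]
    if not kept:
        return np.empty((0, seq_len) + segments.shape[1:])
    return np.stack(kept)
```

Under these assumptions, an artifact-free 1 h recording of 900 segments gives 51 sequences.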

For the CAM-ICU, since the number of samples was smaller than for RASS, we copied the first M layers of the RASS CNN model into the CAM-ICU CNN model and fixed them to avoid overfitting; only the layers after the first M layers were trained. The performance for different values of M is shown in Supplementary Fig. 3. We took M = 5 since it achieved the best validation performance.

In both tasks, to address the imbalance of RASS levels or CAM-ICU scores in the dataset, we computed sample weights for each level inversely proportional to the number of examples in this level from the training set. The models were trained with a minibatch size of 32 and the RMSprop optimizer with learning rate 0.001.
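One common way to compute such weights is sketched below. The normalization, which makes the weights average to 1 over the training set, is an assumption; the text states only that weights are inversely proportional to class counts.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Per-sample weights inversely proportional to the size of each sample's class."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    # n / (K * count) makes the weights sum to n (an assumed normalization).
    class_weight = len(labels) / (len(classes) * counts)
    lookup = dict(zip(classes.tolist(), class_weight.tolist()))
    return np.array([lookup[y] for y in labels.tolist()])
```

For example, with three assessments at RASS 0 and one at RASS −5, the single −5 sample receives three times the weight of each RASS 0 sample.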

Model evaluation

The final performance was reported using the testing patients pooled from all folds. For tracking RASS, the predicted z-score was averaged across all 4s-segments in each 1 h sequence, and then the thresholds learned by the ordinal regression layer were used to discretize the averaged z-score to produce the predicted RASS level. We evaluated the RASS tracking performance using three metrics: (1) balanced mean absolute error (MAE), i.e. the average absolute difference between true and predicted RASS levels, weighted by class weights inversely proportional to the number of samples in each class; (2) balanced accuracy when allowing up to one level of difference, weighted by the same class weights; and (3) binary classification performance, measured by the area under the receiver operating characteristic curve (AUC), for discriminating RASS levels −5 or −4 (“coma”) from −1 or 0 (“awake”), while discarding other levels. For tracking CAM-ICU, the predicted probability was averaged across all 4s-segments in each 1 h sequence to obtain the probability of being delirious.
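Metrics (1) and (2) can be sketched as follows. Weighting each sample by the inverse of its class size and normalizing is equivalent to averaging the per-class means, which is what this sketch computes; whether the original implementation normalized in exactly this way is an assumption.

```python
import numpy as np

def balanced_metrics(y_true, y_pred, tolerance=1):
    """Return (balanced MAE, balanced accuracy within +/- tolerance levels)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    err = np.abs(y_true - y_pred)
    classes = np.unique(y_true)
    # Average within each true class first, then across classes.
    mae = np.mean([err[y_true == c].mean() for c in classes])
    acc = np.mean([(err[y_true == c] <= tolerance).mean() for c in classes])
    return float(mae), float(acc)
```

Averaging per class first prevents the most frequent RASS level from dominating the score.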

The accuracy per 4 s without averaging (CNN only) is shown in Supplementary Fig. 7. These accuracies are worse than the averaged versions. The 4 s window is best thought of as a step for local evaluation of the signal, and these local evaluations are aggregated to compute the probability of RASS/delirium at the present time, based on the prior EEG. Our model still reports an updated prediction every 4 s (this is the step size), although the prediction for the present time is based on the past 1 h. By contrast, in the ICUs in our institution, RASS is manually assessed every 2 h, and delirium is formally assessed only once per day, so the proposed method provides far more frequent updates than current practice.

Technician–nurse agreement

Since RASS assessments were available from both ICU nurses and clinical research technicians, we were able to measure the technician–nurse agreement, as follows. For each assessment done by each research staff member, we found the closest nurse assessment for the same patient. We excluded assessment pairs more than 4 h apart.
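The pairing rule can be sketched for a single patient as below, with assessment times expressed in hours. The function name and the representation of times as plain numbers are illustrative assumptions.

```python
import numpy as np

def pair_with_nearest(tech_times, nurse_times, max_gap_h=4.0):
    """Return (tech_index, nurse_index) pairs for one patient; times in hours."""
    tech = np.asarray(tech_times, dtype=float)
    nurse = np.asarray(nurse_times, dtype=float)
    if tech.size == 0 or nurse.size == 0:
        return []
    # Absolute time gap between every technician and nurse assessment.
    gaps = np.abs(tech[:, None] - nurse[None, :])
    nearest = gaps.argmin(axis=1)
    # Keep only pairs no more than max_gap_h apart.
    return [(i, int(j)) for i, j in enumerate(nearest) if gaps[i, j] <= max_gap_h]
```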

Baseline methods to be compared

To compare with other deep learning candidates, we built three other models: (1) using EEG waveforms as input and CNN only; (2) using EEG spectrograms as input and LSTM only; and (3) using EEG band powers as input and LSTM only. The CNN and LSTM had the same structure as in Supplementary Fig. 9. The EEG band powers included delta (0–4 Hz), theta (4–8 Hz), and alpha (8–12 Hz), as well as the relative band powers normalized by total power (0–12 Hz).

To compare with non-deep learning methods, we extracted the above band powers from each 4s-segment and averaged them across the 1 h segment. We also extracted the burst suppression ratio (BSR), i.e., the proportion of time within 1 h in which the signal envelope is less than 5 µV. Using these features, we trained ordinal regression for RASS, and logistic regression, support vector machine, and random forest for CAM-ICU.
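A band power feature of this kind can be sketched for one 4s-segment as below. The spectral estimator is an assumption: this sketch uses a single Hann-windowed periodogram via `scipy.signal.welch`, whereas the estimator actually used is not specified here. Band edges are treated as inclusive at the lower end and exclusive at the upper end.

```python
import numpy as np
from scipy.signal import welch

FS = 62.5  # Hz
BANDS = {"delta": (0.0, 4.0), "theta": (4.0, 8.0), "alpha": (8.0, 12.0)}

def band_powers(segment, fs=FS):
    """Return (absolute, relative) band power dicts for a 1-D EEG segment."""
    x = np.asarray(segment, dtype=float)
    # One window spanning the whole segment (assumed estimator).
    freqs, psd = welch(x, fs=fs, nperseg=x.shape[-1], axis=-1)
    df = freqs[1] - freqs[0]
    absolute = {name: psd[..., (freqs >= lo) & (freqs < hi)].sum(axis=-1) * df
                for name, (lo, hi) in BANDS.items()}
    # The three bands tile 0-12 Hz, so their sum is the total power in that range.
    total = sum(absolute.values())
    relative = {name: p / total for name, p in absolute.items()}
    return absolute, relative
```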

Statistical tests

To compare the performance among multiple algorithms, we used Kruskal–Wallis one-way analysis of variance (KW-ANOVA), which is a nonparametric version of ANOVA. The null hypothesis is that the medians of all groups are equal. We used Dunn’s test (two-sided) as the post hoc test, together with Bonferroni correction for multiple comparisons, to decide which pairs had significantly different medians. All reported confidence intervals are 95% confidence intervals obtained by bootstrapping 1000 times.
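A percentile bootstrap with 1000 resamples can be sketched as follows. The choice of the percentile variant and of resampling individual values with replacement are assumptions; the text states only the number of bootstrap repetitions and the confidence level.

```python
import numpy as np

def bootstrap_ci(values, stat=np.mean, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    values = np.asarray(values)
    rng = np.random.default_rng(seed)
    n = len(values)
    # Recompute the statistic on n_boot resamples drawn with replacement.
    stats = np.array([stat(values[rng.integers(0, n, size=n)])
                      for _ in range(n_boot)])
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)
```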

Delays in tracking level of consciousness

For each patient we artificially concatenated two 9.5 min segments of EEG signal with different RASS levels, denoted RASS1 and RASS2, where the absolute difference between RASS1 and RASS2 was more than one level. The delay is defined as the time from the concatenation point to the first time the prediction reaches RASS2 ± 1.
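Given a sequence of predicted RASS levels, one per 4 s step, the delay defined above can be computed as in this sketch. The function name and the convention of returning `None` when the prediction never comes within one level of RASS2 are illustrative assumptions.

```python
import numpy as np

STEP_S = 4.0  # seconds between consecutive predictions

def tracking_delay(pred, switch_idx, rass2, step_s=STEP_S, tol=1):
    """Seconds from the concatenation point until pred is within tol of rass2."""
    after = np.asarray(pred)[switch_idx:]
    hits = np.flatnonzero(np.abs(after - rass2) <= tol)
    # None signals that the target range was never reached.
    return float(hits[0] * step_s) if hits.size else None
```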

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.