Datasets

We included EHR data from the University of California, San Francisco (UCSF) from 2012 to 2016, and the University of Chicago Medicine (UCM) from 2009 to 2016. We refer to these health systems as Hospital A and Hospital B, respectively. All EHRs were de-identified, except that dates of service were maintained in the UCM dataset. Both datasets contained patient demographics, provider orders, diagnoses, procedures, medications, laboratory values, vital signs, and flowsheet data (i.e., all other structured data elements, such as nursing flowsheets) from all inpatient and outpatient encounters. The UCM dataset additionally contained de-identified, free-text medical notes. Each dataset was kept in an encrypted, access-controlled, and audited sandbox.

The ethics review and institutional review boards at each institution approved the study with a waiver of informed consent or an exemption.

Data representation and processing

We developed a single data structure that could be used for all predictions, rather than requiring custom, hand-created datasets for every new prediction. This approach represents the entire EHR in temporal order: data are organized by patient and by time. To represent events in a patient’s timeline, we adopted the FHIR standard.75 FHIR defines the high-level representation of healthcare data in resources, but leaves values in each individual site’s idiosyncratic codings.28 Each event is derived from a FHIR resource and may contain multiple attributes; for example, a medication-order resource could contain the trade name, generic name, ingredients, and others. Data in each attribute were split into discrete values, which we refer to as tokens. For notes, the text was split into a sequence of tokens, one for each word. Numeric values were normalized, as detailed in the supplement. The entire sequence of time-ordered tokens, from the beginning of a patient’s record until the point of prediction, formed the patient’s personalized input to the model. This process is illustrated in Fig. 4, and further details of the FHIR representation and processing are provided in Supplementary Materials.
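To make this concrete, the following sketch illustrates how FHIR-derived events might be flattened into a time-ordered token sequence ending at the prediction point. The Event fields and helper names are hypothetical simplifications for illustration, not the actual pipeline:

```python
# Illustrative sketch of the timeline representation: events derived from
# FHIR resources are filtered to those before the prediction point, sorted
# by time, and emitted as discrete attribute=value tokens.
from dataclasses import dataclass

@dataclass
class Event:
    time: float      # seconds since an arbitrary epoch
    attribute: str   # e.g., "Medication.genericName" (hypothetical naming)
    value: str       # discrete value derived from the FHIR resource

def build_patient_sequence(events, prediction_time):
    """Return time-ordered tokens from record start up to prediction_time."""
    visible = sorted((e for e in events if e.time <= prediction_time),
                     key=lambda e: e.time)
    return [f"{e.attribute}={e.value}" for e in visible]

events = [
    Event(100.0, "Medication.genericName", "aspirin"),
    Event(50.0, "Observation.value", "heart_rate_88"),  # normalized numeric
    Event(200.0, "Note.token", "chest"),                # one token per word
]
print(build_patient_sequence(events, prediction_time=150.0))
# ['Observation.value=heart_rate_88', 'Medication.genericName=aspirin']
```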

Fig. 4 Data from each health system were mapped to an appropriate FHIR (Fast Healthcare Interoperability Resources) resource and placed in temporal order. This conversion did not harmonize or standardize the data from each health system other than to map them to the appropriate resource. The deep learning model could use all data available prior to the point when the prediction was made. Therefore, each prediction, regardless of the task, used the same data.

Outcomes

We were interested in understanding whether deep learning could produce valid predictions across a wide range of clinical problems and outcomes. We therefore selected outcomes from divergent domains, including an important clinical outcome (death), a standard measure of quality of care (readmissions), a measure of resource utilization (length of stay), and a measure of understanding of a patient’s problems (diagnoses).

Inpatient mortality

We predicted impending inpatient death, defined as a discharge disposition of “expired.”42,46,48,49

30-day unplanned readmission

We predicted unplanned 30-day readmission, defined as an admission within 30 days after discharge from an “index” hospitalization. A hospitalization was considered a “readmission” if its admission date was within 30 days after discharge of an eligible index hospitalization, and a readmission could only be counted once. There is no standard definition of “unplanned,”76 so we used a modified form of the Centers for Medicare and Medicaid Services definition,77 which we detail in the supplement. Billing diagnoses and procedures from the index hospitalization were not used for the prediction because they are typically generated after discharge. We included only readmissions to the same institution.
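The core of this labeling logic can be sketched as follows. This is a simplified illustration (same-institution stays, with the “unplanned” filter from the supplement omitted), not the exact implementation:

```python
# Label each hospitalization True if it is followed by a readmission within
# 30 days of its discharge; a hospitalization can serve as the readmission
# for at most one index stay.
from datetime import date, timedelta

def label_readmissions(stays):
    """stays: list of (admit_date, discharge_date), sorted by admit_date."""
    labels = [False] * len(stays)
    used = set()  # indices already counted as a readmission
    for i, (_, discharge) in enumerate(stays):
        for j in range(i + 1, len(stays)):
            if j in used:
                continue
            admit_j = stays[j][0]
            if discharge < admit_j <= discharge + timedelta(days=30):
                labels[i] = True
                used.add(j)
                break
    return labels

stays = [(date(2015, 1, 1), date(2015, 1, 5)),
         (date(2015, 1, 20), date(2015, 1, 25))]
print(label_readmissions(stays))  # [True, False]
```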

Long length of stay

We predicted a length of stay of at least 7 days, which was approximately the 75th percentile of hospital stays for most services across the datasets. The length of stay was defined as the time between hospital admission and discharge.

Diagnoses

We predicted the entire set of primary and secondary ICD-9 billing diagnoses from a universe of 14,025 codes.

Prediction timing

This was a retrospective study. To predict inpatient mortality, we stepped forward through each patient’s time course, making predictions every 12 h from 24 h before admission until 24 h after admission. Since many clinical prediction models, such as APACHE,78 are rendered 24 h after admission, our primary outcome prediction for inpatient mortality was at that time point. Unplanned readmission and the set of diagnosis codes were predicted at admission, 24 h after admission, and at discharge. The primary endpoints for those predictions were at discharge, when most readmission prediction scores are computed79 and when all information necessary to assign billing diagnoses is available. Long length of stay was predicted at admission and 24 h after admission. For every prediction, we used all information available in the EHR up to the time at which the prediction was made.
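The schedule can be restated compactly as below; the function is an illustrative summary of the time points listed above (hours relative to admission, with negative values before admission), not code from the study:

```python
# Prediction time points per task, in hours relative to admission.
def prediction_times(task, discharge_hour):
    if task == "inpatient_mortality":
        # every 12 h, from 24 h before admission to 24 h after
        return list(range(-24, 25, 12))   # [-24, -12, 0, 12, 24]
    if task in ("readmission", "diagnoses"):
        return [0, 24, discharge_hour]    # primary endpoint: discharge
    if task == "long_length_of_stay":
        return [0, 24]
    raise ValueError(f"unknown task: {task}")

print(prediction_times("inpatient_mortality", discharge_hour=120))
```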

Study cohort

We included all admissions for patients 18 years or older. We only included hospitalizations of 24 h or longer to ensure that predictions at various time points had identical cohorts.

To simulate the accuracy of a real-time prediction system, we included patients typically removed in studies of readmission, such as those discharged against medical advice, since these exclusion criteria would not be known when making predictions earlier in the hospitalization.

For predicting the ICD-9 diagnoses, we excluded encounters without any ICD-9 diagnosis (2–12% of encounters). These were generally encounters after October 2015, when the hospitals switched to ICD-10. We included such hospitalizations, however, for all other predictions.

Algorithm development and analysis

We used the same modeling algorithm on both hospitals’ datasets, but treated each hospital as a separate dataset and reported results separately.

Patient records vary significantly in length and density of data points (e.g., vital sign measurements in an intensive care unit vs an outpatient clinic), so we formulated three deep learning architectures that take advantage of such data in different ways: one based on recurrent neural networks (long short-term memory (LSTM)),80 one on an attention-based time-aware neural network (TANN), and one on a neural network with boosted time-based decision stumps. Details of these architectures are explained in the supplement. We trained each of the three architectures on each of the four tasks at multiple time points (e.g., before admission, at admission, 24 h after admission, and at discharge), and the predictions of the architectures were combined using ensembling.81
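To give a flavor of the recurrent variant, the sketch below shows a minimal LSTM classifier over embedded EHR tokens in TensorFlow/Keras. The vocabulary size, layer dimensions, and training configuration are placeholders, not the published architecture (see the supplement):

```python
import tensorflow as tf

VOCAB_SIZE = 50_000  # illustrative: number of discrete EHR tokens
EMBED_DIM = 128

# Embed the time-ordered token sequence, summarize it with an LSTM, and
# emit a probability for a binary outcome (e.g., inpatient mortality).
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True),
    tf.keras.layers.LSTM(256),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auroc")])
# An ensemble, as used here, would average the predicted probabilities of
# the three trained architectures.
```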

Comparison to previously published algorithms

We implemented models based on previously published algorithms to establish baseline performance on each dataset. For mortality, we used a logistic model with variables inspired by the NEWS51 score, augmented with additional variables to improve accuracy: the most recent systolic blood pressure, heart rate, respiratory rate, and temperature, and 24 common laboratory tests, such as the white blood cell count, lactate, and creatinine. We call this the augmented Early Warning Score (aEWS). For readmission, we used a logistic model with variables used by the HOSPITAL67 score, including the most recent sodium and hemoglobin levels, hospital service, occurrence of CPT codes, number of prior hospitalizations, and length of the current hospitalization. We refer to this as the mHOSPITAL score. For long length of stay, we used a logistic model with variables similar to those used by Liu et al.:44 age, gender, hierarchical condition categories, admission source, hospital service, and the same 24 common laboratory tests used in the aEWS score. We refer to this as the mLiu score. Details of these and additional baseline models are in the supplement. We are not aware of any commonly used baseline model for predicting all diagnosis codes, so we compared against results reported in the literature.
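The shape of these baselines can be illustrated with a short scikit-learn sketch. The feature columns and values below are hypothetical stand-ins; the full variable lists appear in the supplement:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Rows: hospitalizations. Columns (illustrative subset): most recent systolic
# BP, heart rate, respiratory rate, temperature, WBC, lactate, creatinine.
X = np.array([[118.0,  84, 16, 37.0,  7.2, 1.1, 0.9],
              [ 92.0, 121, 28, 38.6, 15.4, 4.0, 2.3]])
y = np.array([0, 1])  # inpatient mortality labels

aews = make_pipeline(StandardScaler(), LogisticRegression())
aews.fit(X, y)
print(aews.predict_proba(X)[:, 1])  # predicted risk of death
```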

Explanation of predictions

A common criticism of neural networks is that they offer little insight into the factors that influence the prediction.82 Therefore, we used attribution mechanisms to highlight, for each patient, the data elements that influenced their predictions.83
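As one concrete example of such a mechanism, shown here as an illustration only (the attribution methods actually used are described in the supplement), a gradient-times-input saliency score can be computed over the token embeddings of a model like the LSTM sketched above:

```python
import tensorflow as tf

def token_attributions(model, token_ids):
    """Score each input token by the gradient of the prediction with respect
    to its embedding (gradient x input); assumes model.layers[0] is the
    Embedding layer, as in the sketch above."""
    embedding_layer = model.layers[0]
    with tf.GradientTape() as tape:
        embedded = embedding_layer(token_ids)
        tape.watch(embedded)
        h = embedded
        for layer in model.layers[1:]:  # run the remaining layers
            h = layer(h)
        prediction = h[:, 0]
    grads = tape.gradient(prediction, embedded)
    return tf.reduce_sum(grads * embedded, axis=-1)  # one score per token

# e.g.: scores = token_attributions(model, tf.constant([[3, 17, 42]]))
```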

The LSTM and TANN models were trained with TensorFlow, and the boosting model was implemented in C++. Statistical analyses and baseline models were done in Python with Scikit-learn.84

Technical details of the model architecture, training, variables, baseline models, and attribution methods are provided in the supplement.

Model evaluation and statistical analysis

Patients were randomly split into development (80%), validation (10%), and test (10%) sets. Model accuracy is reported on the test set, and 1000 bootstrapped samples were used to calculate 95% confidence intervals. To prevent overfitting, the test set remained unused (and hidden) until final evaluation.
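A minimal sketch of this bootstrap procedure, assuming arrays of test-set labels and predicted scores, is shown below:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(y_true, y_score, n_boot=1000, seed=0):
    """95% CI for the AUROC from n_boot resamples of the test set."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aurocs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample with replacement
        if len(np.unique(y_true[idx])) < 2:
            continue  # AUROC undefined if a resample has one class
        aurocs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.percentile(aurocs, [2.5, 97.5])
```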

We assessed model discrimination by calculating the area under the receiver operating characteristic curve (AUROC) and model calibration using comparisons of predicted and empirical probability curves.85 We did not use the Hosmer–Lemeshow test, as it may be misleadingly significant with large sample sizes.86 To quantify the potential clinical impact of an alert with 80% sensitivity, we report the work-up to detection ratio, also known as the number needed to evaluate.87 For prediction of a patient’s full set of diagnosis codes, which can range from 1 to 228 codes per hospitalization, we evaluated the accuracy for each class using the macro-weighted AUROC88 and the micro-weighted F1 score89 to compare with the literature. The F1 score is the harmonic mean of positive predictive value and sensitivity; we used a single threshold, picked on the validation set, for all classes. We did not create confidence intervals for this task given the computational cost implied by the number of possible diagnoses.
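The work-up to detection ratio at a fixed 80% sensitivity can be computed as sketched below (the toy labels and scores are illustrative): choose the threshold achieving the target sensitivity, then report flagged patients per true positive, i.e., 1/PPV:

```python
import numpy as np
from sklearn.metrics import roc_curve

def number_needed_to_evaluate(y_true, y_score, target_sensitivity=0.80):
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    i = np.argmax(tpr >= target_sensitivity)  # first threshold reaching it
    flagged = y_score >= thresholds[i]
    true_pos = np.sum(flagged & (y_true == 1))
    return flagged.sum() / true_pos  # = 1 / positive predictive value

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([.1, .2, .9, .3, .8, .4, .6, .7, .2, .85])
print(number_needed_to_evaluate(y_true, y_score))  # 1.0 for this toy data
```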

Data availability

The datasets analysed during the current study are not publicly available. Owing to privacy and security concerns, the underlying EHR data are not easily redistributable to researchers other than those engaged in Institutional Review Board-approved research collaborations with the named medical centers.

Code availability

The FHIR format used in this work is available at https://github.com/google/fhir. The transformation of FHIR-formatted data into TensorFlow training examples, and the models themselves, depend on Google’s internal distributed computation platforms that cannot reasonably be shared. We have therefore emphasized a detailed description of how our models were constructed and designed in our Methods section and Supplementary Materials.