Data Preprocessing

First, patients under 16 years old and those who stayed in the ICU for less than 7 hours are removed from the dataset.
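As a minimal sketch of that exclusion step, the filter below uses only the two thresholds given above. The `Stay` record and its field names are hypothetical stand-ins for whatever the real patient table looks like, and treating exactly 16 years or exactly 7 hours as "kept" is my reading of the wording.

```python
from dataclasses import dataclass

MIN_AGE_YEARS = 16   # from the text: patients under 16 are excluded
MIN_ICU_HOURS = 7    # from the text: stays shorter than 7 hours are excluded


@dataclass
class Stay:
    # Hypothetical field names, not the dataset's actual columns.
    patient_id: str
    age_years: float
    icu_hours: float


def keep_stay(stay: Stay) -> bool:
    """Return True when a stay passes both cohort criteria."""
    return stay.age_years >= MIN_AGE_YEARS and stay.icu_hours >= MIN_ICU_HOURS


stays = [
    Stay("a", 45, 30.0),
    Stay("b", 15, 48.0),   # excluded: under 16
    Stay("c", 70, 5.5),    # excluded: stay shorter than 7 hours
    Stay("d", 16, 7.0),    # boundary case, kept under the assumption above
]
cohort = [s for s in stays if keep_stay(s)]
```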

Each type of data used needed its own cleaning steps: diagnoses, treatments, past medical history, periodic vital signs, aperiodic vitals, and lab results. Each was loaded and preprocessed independently according to its needs.

In order to train the model, positive and negative patients need to be identified, and timepoints for those events need to be saved for later extraction of features. Because I chose to keep the model flexible, different labeling algorithms were needed for identifying mortality and other events in the dataset. In the case of mortality prediction, all patients who expired are identified, and the last event (treatment or diagnosis) for that patient is used as the timestamp. In the case of a diagnosis, only patients who received that diagnosis more than three hours after admission are selected, and the first instance of that diagnosis is used for the timestamp.
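The two labeling rules can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual labeling code: events are represented as hypothetical `(hours since admission, name)` pairs, and I read the diagnosis rule as requiring the first instance of the diagnosis to fall more than three hours after admission.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (hours since ICU admission, event name).
Event = Tuple[float, str]

MIN_DIAGNOSIS_OFFSET_HOURS = 3.0  # from the text: "more than three hours after admission"


def mortality_timepoint(expired: bool, events: List[Event]) -> Optional[float]:
    """Timestamp of the last treatment or diagnosis for a patient who expired."""
    if not expired or not events:
        return None
    return max(t for t, _ in events)


def diagnosis_timepoint(diagnoses: List[Event], target: str) -> Optional[float]:
    """Timestamp of the first instance of `target`, or None if the patient
    is not selected as a positive example."""
    times = [t for t, name in diagnoses if name == target]
    if not times:
        return None
    first = min(times)
    # Assumption: a first instance at or before 3 hours disqualifies the patient.
    return first if first > MIN_DIAGNOSIS_OFFSET_HOURS else None
```

Returning `None` for patients who do not qualify keeps the positive set explicit; how negative examples get their timepoints is not covered by this sketch.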

Feature Engineering

After selecting positive and negative patients, feature vectors are generated for all of those examples. I used three categories of data: categorical, numerical and sequential.

Categorical features included ethnicity, gender, the admitting unit, and more. These are one-hot encoded.
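For readers unfamiliar with one-hot encoding, here is a small dependency-free sketch. The unit names are made up for illustration, and mapping an unseen category to an all-zero vector is a design choice of this sketch rather than something the pipeline is known to do.

```python
def fit_categories(values):
    """Collect the distinct categories in a stable (sorted) order."""
    return sorted(set(values))


def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`.
    Values not seen during fitting map to all zeros (an assumption)."""
    return [1 if value == c else 0 for c in categories]


# Hypothetical admitting units, for illustration only.
units = ["MICU", "SICU", "MICU", "CCU"]
categories = fit_categories(units)
encoded = [one_hot(u, categories) for u in units]
```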

Numerical features represented the bulk of the feature engineering work, and included vital signs, lab results, and static values like admission weight. I created 66 features here, drawing on my medical education. For example, one feature is the ratio of blood urea nitrogen (BUN) to serum creatinine, which helps localize the cause of an acute kidney injury. For each feature, the maximum, minimum or mean value is calculated over one of three windows: a 4-hour window just prior to the ‘present moment’ (i.e. the prediction timepoint), a window extending from admission to the present moment, or a baseline window covering the first six hours of admission. Other features, like the BUN-to-creatinine ratio, the creatinine change from baseline, or the ratio of arterial oxygen partial pressure to the fraction of inspired oxygen (PaO2/FiO2, the PF ratio), are calculated from these first features. Missing values are tolerated for these features, although the 10% of examples missing the greatest number of features are removed. The remaining missing values are replaced with the feature mean, and the features are then scaled.
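The windowing, derived features, and imputation can be sketched as below. Several details are assumptions of mine and not stated above: readings are hypothetical `(hour since admission, value)` pairs, the baseline window is capped at the prediction time so it never looks into the future, and "scaled" is taken to mean standardization to zero mean and unit variance.

```python
from statistics import mean, pstdev

RECENT_HOURS = 4     # from the text: 4-hour window before the present moment
BASELINE_HOURS = 6   # from the text: first six hours of admission


def aggregate(readings, now, window, func):
    """Apply func (max, min or mean) to readings inside the named window.
    Returns None when the window holds no readings."""
    bounds = {
        "recent": (now - RECENT_HOURS, now),
        "stay": (0, now),
        # Assumption: never read past the prediction timepoint.
        "baseline": (0, min(BASELINE_HOURS, now)),
    }
    start, end = bounds[window]
    values = [v for t, v in readings if start <= t < end]
    return func(values) if values else None


def ratio(numerator, denominator):
    """Derived feature such as BUN / creatinine; None if either part is missing."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def impute_and_scale(column):
    """Replace None with the mean of observed values, then standardize."""
    observed = [v for v in column if v is not None]
    mu = mean(observed)
    sigma = pstdev(observed) or 1.0   # avoid dividing by zero for constant columns
    return [((mu if v is None else v) - mu) / sigma for v in column]


# Made-up readings for one patient, with the prediction timepoint at hour 12.
creatinine = [(1, 1.0), (3, 1.2), (10, 2.0), (11, 2.4)]
bun = [(2, 20.0), (11, 48.0)]
now = 12

cr_recent_max = aggregate(creatinine, now, "recent", max)
cr_baseline_mean = aggregate(creatinine, now, "baseline", mean)
bun_cr_ratio = ratio(aggregate(bun, now, "recent", max), cr_recent_max)
cr_change = cr_recent_max - cr_baseline_mean
```

Keeping missing aggregates as `None` until the final imputation step makes it straightforward to count missing features per example before dropping the sparsest 10%.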

Finally, the sequential features include the past medical history diagnoses (given in English), diagnoses given in the ICU up until the present moment (given in English and ICD-9 codes), and therapeutic and diagnostic interventions up until the present moment. After translating all of the past medical history diagnoses to the common language of ICD-9 codes, all of these events are arranged by timestamp, that is, in the order in which they actually happened.
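A rough sketch of how such an event sequence might be assembled and turned into integer tokens is shown below. The tiny lookup table, the `rx:` prefix for interventions, placing past medical history ahead of all ICU events, and silently dropping unmapped history terms are all assumptions made for illustration.

```python
# Illustrative lookup only; the real translation table would be far larger.
PMH_TO_ICD9 = {
    "congestive heart failure": "428.0",
    "atrial fibrillation": "427.31",
}


def build_sequence(pmh_terms, icu_events, pmh_to_icd9=PMH_TO_ICD9):
    """Return event tokens in chronological order.

    pmh_terms:  English past-medical-history terms (assumed to precede the stay)
    icu_events: (hours since admission, token) pairs for diagnoses and interventions
    """
    # Assumption: history terms without a mapping are dropped.
    history = [pmh_to_icd9[t] for t in pmh_terms if t in pmh_to_icd9]
    ordered = sorted(icu_events, key=lambda e: e[0])
    return history + [token for _, token in ordered]


def build_vocab(sequences):
    """Map each token to an integer id, reserving 0 for padding."""
    vocab = {}
    for seq in sequences:
        for token in seq:
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


sequence = build_sequence(
    ["atrial fibrillation"],
    [(6.0, "038.9"), (1.5, "rx:vasopressors"), (3.0, "415.19")],
)
vocab = build_vocab([sequence])
token_ids = [vocab[t] for t in sequence]
```

Integer ids of this kind are what an embedding layer expects as input, with id 0 left free so shorter sequences can be padded to a common length.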

A Multimodal Deep Learning Model

A flexible, multimodal architecture for predicting diagnoses or survival from ICU data.

I chose to keep the diagnoses and treatments of each patient in sequential order so as to preserve the clinical patterns that I know to be important. For example, a patient with a history of malignancy who now has unilateral swelling of the leg is relatively likely to be diagnosed with a pulmonary embolism.

I needed a model architecture that would respect that sequential information, so I used a long short-term memory (LSTM) network with a bidirectional wrapper and 32 units. I also wanted the model to learn representations for each diagnosis and treatment, so I used an embedding layer with 32-element vectors, which the model learns as it trains.

With three 16-unit hidden layers, as determined by hyperparameter tuning, the RNN achieved an area under the ROC curve (AUROC) of 0.85.
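For concreteness, the sequence model described above might look something like this in Keras. I am assuming Keras here since the framework is not named; the vocabulary size, sequence length, ReLU activations, optimizer, and the placement of the three 16-unit layers after the LSTM are likewise assumptions, while the 32-element embedding, the bidirectional 32-unit LSTM, and the three 16-unit hidden layers come from the text.

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 5000   # hypothetical number of distinct event tokens (incl. padding id 0)
MAX_LEN = 200       # hypothetical padded sequence length


def build_sequence_model(vocab_size=VOCAB_SIZE, max_len=MAX_LEN):
    events = keras.Input(shape=(max_len,), dtype="int32", name="events")
    # 32-element embedding learned during training; id 0 is treated as padding.
    x = layers.Embedding(vocab_size, 32, mask_zero=True)(events)
    # Bidirectional wrapper around a 32-unit LSTM.
    x = layers.Bidirectional(layers.LSTM(32))(x)
    # Three 16-unit hidden layers (activation is an assumption).
    for _ in range(3):
        x = layers.Dense(16, activation="relu")(x)
    output = layers.Dense(1, activation="sigmoid", name="prediction")(x)

    model = keras.Model(events, output)
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=[keras.metrics.AUC(name="auroc")],
    )
    return model


sequence_model = build_sequence_model()
```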

Next, I trained a deep neural network (DNN) on just the categorical and numerical data. Three hidden layers of 32 units each, with 50% dropout between layers, worked best here, and after feature selection and engineering, the DNN achieved an AUROC of 0.87.

In order to combine the two models, I popped the last layer off of each model and concatenated the outputs. That layer serves as input for a 3-layer, 64-unit DNN with dropout. An additional auxiliary output from the categorical/numerical DNN ensures that part of the network learns a representation useful to the final task. So how did the composite model perform?
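Before turning to the results, here is a minimal sketch of that wiring using the Keras functional API. It builds each branch without its output layer, which has the same effect as popping it off, then concatenates the two. Keras itself, the input sizes, the ReLU activations, the 0.5 dropout rate in the combined layers, reading "64-unit" as 64 units per layer, and the 0.2 weight on the auxiliary loss are all assumptions on my part.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_composite(vocab_size=5000, max_len=200, n_tabular=100):
    # Sequential branch: embedding -> bidirectional LSTM -> three 16-unit layers.
    events = keras.Input(shape=(max_len,), dtype="int32", name="events")
    s = layers.Embedding(vocab_size, 32, mask_zero=True)(events)
    s = layers.Bidirectional(layers.LSTM(32))(s)
    for _ in range(3):
        s = layers.Dense(16, activation="relu")(s)

    # Categorical/numerical branch: three 32-unit layers with 50% dropout.
    tabular = keras.Input(shape=(n_tabular,), name="tabular")
    t = tabular
    for _ in range(3):
        t = layers.Dense(32, activation="relu")(t)
        t = layers.Dropout(0.5)(t)
    # Auxiliary output taken from the categorical/numerical branch.
    aux = layers.Dense(1, activation="sigmoid", name="aux")(t)

    # Concatenate the two penultimate representations and add the combined DNN.
    h = layers.Concatenate()([s, t])
    for _ in range(3):
        h = layers.Dense(64, activation="relu")(h)
        h = layers.Dropout(0.5)(h)   # dropout rate here is an assumption
    main = layers.Dense(1, activation="sigmoid", name="main")(h)

    model = keras.Model(inputs=[events, tabular], outputs=[main, aux])
    model.compile(
        optimizer="adam",
        loss=["binary_crossentropy", "binary_crossentropy"],
        loss_weights=[1.0, 0.2],     # hypothetical weighting of main vs. auxiliary loss
    )
    return model


composite_model = build_composite()
```

The auxiliary loss gives the categorical/numerical branch its own training signal, so it keeps learning a useful representation even if the combined layers lean mostly on the sequence branch.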