Anonymized electronic medical records are an increasingly popular source of research data. However, these datasets often lack race and ethnicity information. This creates problems for researchers modeling human disease, as race and ethnicity are powerful confounders for many health exposures and treatment outcomes, and are closely linked to population-specific genetic variation. We showed that deep neural networks generate more accurate estimates for missing racial and ethnic information than competing methods (e.g., logistic regression, random forest, support vector machines, and gradient-boosted decision trees). RIDDLE yielded significantly better classification performance across all metrics considered: accuracy, cross-entropy loss (error), precision, recall, and area under the curve for receiver operating characteristic plots (all p < 10−9). We made specific efforts to interpret the trained neural network models to identify, quantify, and visualize the medical features that are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race and ethnicity could reflect (1) a skewed distribution of blue- and white-collar professions across racial and ethnic groups, (2) uneven accessibility and subjective importance of prophylactic health care, (3) possible variation in lifestyle, such as dietary habits, and (4) differences in background genetic variation which predispose to diseases.

Race and ethnicity are typically unspecified in very large electronic medical claims datasets. Computationally estimating a patient's missing race and ethnicity from their medical records is valuable both academically and practically. Academically, discriminative medical events tell us about racial and ethnic health disparities and divergent genetic predispositions. Practically, imputed race and ethnicity information can substantially improve genetic and epidemiological analyses with these large datasets.

Funding: The study was supported by funds from the Defense Advanced Research Projects Agency, contract W911NF1410333 to AR (https://www.darpa.mil/program/big-mechanism), the National Heart, Lung, and Blood Institute, award R01HL122712 to AR (https://www.nhlbi.nih.gov/), the National Institute of Mental Health, award P50 MH094267 to AR (https://grants.nih.gov/grants/guide/pa-files/PAR-14-120.html), by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR), awards FCC/1/1976-04, URF/1/3007-01, URF/1/3450-01 and URF/1/3454-01 to XG, and a gift from Liz and Kent Dauten to AR. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability: The data comprise millions of de-identified patient clinical records that cannot be deposited publicly and cannot be shared without special agreement with Columbia University and the University of Chicago. Data are available from third parties: to access the University of Chicago data, please visit the Center for Research Informatics, http://cri.uchicago.edu; at Columbia University, data can be accessed through the Electronic Medical Records and Genomics (eMERGE) network, http://emerge.cumc.columbia.edu.

Copyright: © 2018 Kim et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

In addition to investigating the novel utility of deep learning for race and ethnicity imputation, we used recent methods for interpreting neural network models [10] to perform a systematic evaluation of racial and ethnic patterns for approximately 15,000 different medical events. To our knowledge, a large-scale evaluation of disease patterns and maladies by race and ethnicity of this type has not been done before.

RIDDLE uses a multi-layer perceptron (MLP) network containing two hidden layers of either Rectified Linear Unit (ReLU) or Parametric Rectified Linear Unit (PReLU) nodes. The input to the MLP is the set of binary encoded features comprising age, gender, and International Classification of Diseases, version 9 (ICD9) codes. The output is the set of probability estimates for each of the four race and ethnicity classes.
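The forward pass of such a network can be sketched in plain NumPy. This is a minimal illustration of the architecture described above (binary input vector → two ReLU hidden layers → four-class softmax output), not the actual RIDDLE implementation; the layer widths and random weights here are toy placeholders (the real model takes roughly 15,000 input features).

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    # subtract the max for numerical stability
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, params):
    """Forward pass through a two-hidden-layer MLP.
    x: (n_features,) binary-encoded input (age, gender, ICD9 indicators).
    params: list of (W, b) per layer; the final layer has 4 outputs."""
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)          # hidden layers use ReLU activations
    W, b = params[-1]
    return softmax(h @ W + b)        # probabilities over the 4 classes

# Toy dimensions (hypothetical): 20 input features, two hidden layers of 8
rng = np.random.default_rng(0)
dims = [20, 8, 8, 4]
params = [(rng.normal(size=(i, o)) * 0.1, np.zeros(o))
          for i, o in zip(dims[:-1], dims[1:])]
x = rng.integers(0, 2, size=20).astype(float)
p = mlp_forward(x, params)           # p sums to 1 across the 4 classes
```

The softmax output is what makes the final layer interpretable as a posterior distribution over the four mutually exclusive race and ethnicity classes.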

We introduce a framework for using deep learning to estimate missing race and ethnicity information in EMR datasets: RIDDLE, or Race and ethnicity Imputation from Disease history with Deep LEarning. RIDDLE uses a relatively simple multilayer perceptron (MLP), a type of neural network architecture that is a directed acyclic graph (see Fig 1).

Deep learning involves the approximation of some utility function (e.g., classification of an image) as a neural network. A neural network is a directed graph of functions which are referred to as units, neurons, or nodes. This network is organized into several layers; each layer corresponds to a different representation of the input data. As the input data is transformed and propagated through this network, the data at each layer corresponds to a new representation of the sample [9]. For our imputation task, the aim was to learn the representation of an individual as a mixture of race and ethnicity classes where each class is assigned a probability. This representation is encoded in the final output layer of the neural network. The output of a neural network functions as a prediction of the distribution of race and ethnicity classes given a set of input features.

Traditionally, logistic regression classifiers have been used to impute categorical variables such as race and ethnicity [8]. However, there has been recent interest in the use of deep learning for solving similar supervised learning tasks. Deep learning is particularly exciting as it offers the ability to automatically learn complex representations of high-dimensional data. These representations can be used to solve learning tasks such as regression or classification [9].

Bayesian approaches to race and ethnicity imputation using census data have been proposed [6] and have been used for race and ethnicity imputation in EMR datasets [7]. However, these approaches require sensitive geolocation and surname data from patients. Geolocation and surname data can be missing in anonymized EMR datasets (as in the datasets used here), limiting the utility of approaches which use this information.

The task of race and ethnicity imputation can be framed as a supervised learning problem. Typically, the goal of imputation is to estimate a posterior probability distribution over plausible values for a missing variable. This distribution of plausible values can be used to generate a single imputed dataset (e.g., by choosing the plausible value with highest probability), or to generate multiple imputed datasets as in multiple imputation [4]. In our setting, the goal was to impute the distribution of mutually exclusive race and ethnicity classes given a set of clinical features. Features comprised age, gender, and codes from the International Classification of Diseases, version 9 (ICD9, [5]); ICD9 codes describe medical conditions, medical procedures, family information, and some treatment outcomes.
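The two uses of the estimated posterior described above can be sketched concretely. In this hypothetical example, `probs` stands in for a classifier's predicted distribution over the four classes; single imputation takes the argmax, while multiple imputation draws several plausible values from the distribution.

```python
import numpy as np

def single_impute(probs, classes):
    """Single imputation: pick the class with highest posterior probability."""
    return classes[int(np.argmax(probs))]

def multiple_impute(probs, classes, m, rng):
    """Multiple imputation: draw m plausible values from the posterior,
    yielding m imputed datasets that reflect imputation uncertainty."""
    return list(rng.choice(classes, size=m, p=probs))

# Hypothetical posterior for one patient over the four classes
classes = np.array(["White", "Black", "Hispanic", "Other"])
probs = np.array([0.55, 0.25, 0.15, 0.05])
rng = np.random.default_rng(0)

best = single_impute(probs, classes)          # "White" (argmax of probs)
draws = multiple_impute(probs, classes, 5, rng)
```

Multiple imputation propagates the classifier's uncertainty into downstream analyses, whereas single imputation discards it in exchange for one concrete dataset.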

In addition, race and ethnicity information can be useful for producing and investigating hypotheses in epidemiology. For example, variation in disease risk across racial and ethnic groups that cannot be fully explained by allele frequency information may provide insights into the possible environmental modifiers of genes [3].

However, these datasets are often anonymized and lack race and ethnicity information (e.g., insurance claims datasets). Race and ethnicity information may also be missing for specific individuals within datasets. This is problematic in research settings as race and ethnicity can be powerful confounders for a variety of effects. Race and ethnicity are strong correlates of socioeconomic status, a predictor of access to and quality of education and healthcare. These factors are differentially associated with disease incidence and trajectories. As a result of this correlation, race and ethnicity may be associated with variation in medical histories. For example, it has been reported that referrals for cardiac catheterization are rarer among African American patients than among White patients [2]. Furthermore, researchers have reported differences in genetic variation which influence disease across racial and ethnic groups [3]. Due to the association between race, ethnicity, and medical histories, we hypothesize that clinical features in EMRs can be used to impute missing race and ethnicity information.

Electronic medical records (EMRs) are an increasingly popular source of biomedical research data [1]. EMRs are digital records of patient medical histories, describing the occurrence of specific diseases and medical events such as the observation of heart disease or dietary counseling. EMRs can also contain demographic information such as gender or age.

Results

We aimed to assess RIDDLE's imputation performance in a multiclass classification setting. We used EMR datasets from Chicago and New York City, collectively describing over 1.5 million unique patients. There were approximately 15,000 unique input features consisting of basic demographic information (gender, age) and observations of clinical events (codified as ICD9 codes). The target class was race and ethnicity; possible values were White, Black, Other, or Hispanic (see Table 1). Although race and ethnicity can be described as a mixture, our training datasets labeled race and ethnicity as one of four mutually exclusive classes. For the testing set, we treated the target race and ethnicity class as unknown, and compared the predicted class against the true class. The high dimensionality of the feature space, the large number of samples, and the heterogeneity of the source populations present a unique and challenging classification problem.
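To make the input representation concrete, a binary encoding of one patient record might look like the following sketch. The feature names, age binning, and vocabulary here are hypothetical illustrations (the paper does not specify this exact scheme); the real feature space spans roughly 15,000 columns.

```python
def encode(record, feature_index):
    """Encode one patient record as a fixed-length 0/1 vector.
    record: dict with 'age_bin', 'gender', and 'icd9' (a set of codes).
    feature_index: feature name -> column position in the vector."""
    x = [0.0] * len(feature_index)
    x[feature_index[f"age:{record['age_bin']}"]] = 1.0
    x[feature_index[f"gender:{record['gender']}"]] = 1.0
    for code in record["icd9"]:
        key = f"icd9:{code}"
        if key in feature_index:        # skip out-of-vocabulary codes
            x[feature_index[key]] = 1.0
    return x

# Hypothetical 7-feature vocabulary for illustration
vocab = ["age:40-49", "age:50-59", "gender:F", "gender:M",
         "icd9:250.00", "icd9:401.9", "icd9:V65.3"]
feature_index = {f: i for i, f in enumerate(vocab)}

vec = encode({"age_bin": "50-59", "gender": "F", "icd9": {"401.9"}},
             feature_index)
# → [0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
```

Each patient thus becomes a sparse indicator vector, which is what the MLP consumes as input.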

In our experiments, RIDDLE yielded an average accuracy of 0.668, and cross-entropy loss of 0.857 on test data, significantly outperforming logistic regression, random forest classifiers, and gradient-boosted decision tree (GBDT) classifiers across all classification metrics (p < 10−9; see Table 2).

Table 2. Evaluation of RIDDLE and baseline classification methods. All values are averaged over ten k-fold cross-validation experiments. In addition, the precision, recall, and ROC scores are averaged across classes, weighted by the number of samples in each class. Support vector machines (SVMs) could not be evaluated on the full dataset as individual trials required more than 36 hours of computation. For runtime comparisons, a standard computing configuration was used: 16 Intel Sandy Bridge cores at 2.6 GHz and 16 GB RAM; graphics processing units were not utilized. https://doi.org/10.1371/journal.pcbi.1006106.t002

Support vector machines (SVMs) with various kernels were also evaluated. However, SVMs could not feasibly be used with the full dataset, as individual trials each took longer than 36 hours (the maximum runtime allowed on the system used in our analysis). Additional experiments on a smaller subset of the full dataset (165K samples) were performed; in these experiments, SVMs could be practically utilized, and RIDDLE significantly outperformed the baseline methods across all classification metrics (p < 10−2; see Table E in S1 Supplement).

While the multiclass learning problem appeared relatively hard, RIDDLE achieved class-specific receiver operating characteristic (ROC) area under the curve (AUC) values above 0.8 (see Fig 2), and a micro-average (all cases considered as binary) AUC of 0.874, significantly higher than that of logistic regression (mean = 0.854, p = 6.67 × 10−11), random forest (mean = 0.844, p = 2.05 × 10−10) and GBDT (mean = 0.846, p = 1.20 × 10−10) classifiers (see Table 2).
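The micro-average AUC described above can be illustrated with a small sketch: every (sample, class) pair is pooled into a single binary problem before computing the AUC. This toy implementation (a plain Mann-Whitney-style AUC, not the paper's evaluation code) uses hypothetical predicted probabilities for four patients.

```python
import numpy as np

def binary_auc(y, s):
    """Binary AUC: probability a random positive outranks a random
    negative, with ties counted as half-wins."""
    pos, neg = s[y == 1], s[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def micro_average_auc(y_true, y_score, n_classes):
    """Micro-average AUC: flatten one-hot labels and class scores into
    one binary problem (e.g., Hispanic vs. non-Hispanic pooled with the
    other class-wise comparisons), then compute a single AUC."""
    onehot = np.eye(n_classes)[y_true]
    return binary_auc(onehot.ravel(), y_score.ravel())

# Hypothetical near-perfect predictions: 4 patients, 4 classes
y_true = np.array([0, 1, 2, 3])
y_score = np.array([[0.7, 0.1, 0.1, 0.1],
                    [0.1, 0.7, 0.1, 0.1],
                    [0.1, 0.1, 0.7, 0.1],
                    [0.1, 0.1, 0.1, 0.7]])
micro_auc = micro_average_auc(y_true, y_score, 4)
# → 1.0 here, since every true class is scored above every wrong one
```

Unlike the macro-average (a plain mean of per-class AUCs), the micro-average weights classes by their number of samples, which matters when the class distribution is imbalanced.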

Fig 2. Receiver operating characteristic (ROC) curves. ROC curves and their corresponding area under the curve (AUC) values were calculated for each of the four race and ethnicity classes. Micro-average (all cases considered as binary, e.g., Hispanic vs. non-Hispanic) and macro-average (average across classes) curves were also computed. Data and metrics for a representative experiment are shown. Across experiments, the mean micro-average AUC was 0.874, and the macro-average AUC was 0.833. https://doi.org/10.1371/journal.pcbi.1006106.g002

RIDDLE exhibited runtime performance comparable to that of other machine learning methods on a standard computing configuration without the use of a graphics processing unit (GPU) (see Table 2).

As noted above, SVMs were also evaluated, but precise runtime measurements could not be obtained because the computational cost was too high. However, on a smaller subset (165K samples) of the full dataset where SVMs could be utilized, RIDDLE exhibited significantly faster runtime performance compared to all SVM methods (p < 10−10; see Table E in S1 Supplement).

Influence of missing data on classifier performance

To replicate real-world applications where data other than race and ethnicity may be missing, we conducted additional experiments simulating random missing data. A random subset of feature observations (ranging from 10% to 30% of all feature observations), at the level of individual samples (e.g., a particular ICD9 code for a specific patient), was artificially masked completely at random. The set of features itself was kept fixed; only individual observations were removed. Otherwise, the same classification training and evaluation scheme was used as before. Under simulation of random missing data, RIDDLE significantly outperformed logistic regression, random forest classifiers, and GBDTs in classification metrics across all simulation experiments (p < 10−9 for the 10% and 20% missing data simulations, p < 10−4 for the 30% missing data simulation; see Table 3).
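The masking procedure described above can be sketched as follows: a fixed fraction of the observed (nonzero) entries of the binary feature matrix is zeroed out completely at random, leaving the set of feature columns intact. This is an illustrative reconstruction, not the paper's experiment code.

```python
import numpy as np

def mask_observations(X, frac, rng):
    """Simulate data missing completely at random (MCAR): zero out a
    random fraction of the observed (nonzero) feature entries, while
    keeping the number of whole features fixed.
    X: (n_samples, n_features) binary feature matrix."""
    X = X.copy()
    rows, cols = np.nonzero(X)                 # observed entries
    n_mask = int(round(frac * len(rows)))      # e.g., 10%-30% of them
    idx = rng.choice(len(rows), size=n_mask, replace=False)
    X[rows[idx], cols[idx]] = 0                # delete those observations
    return X

# Toy binary EMR-style matrix: 100 patients, 50 features, ~20% dense
rng = np.random.default_rng(0)
X = (rng.random((100, 50)) < 0.2).astype(int)
X_masked = mask_observations(X, 0.30, rng)     # 30% missing-data condition
```

Because the deletions target individual (patient, feature) observations rather than whole columns, the classifiers still see every feature during training, just with sparser support, mirroring how EMR fields go missing in practice.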

Table 3. Evaluation of RIDDLE and other methods under simulation of random missing data. All values are averaged over ten k-fold cross-validation experiments involving different proportions of random missing data (10%–30%). In addition, the precision, recall, and ROC scores are averaged across classes, weighted by the number of samples in each class. SVMs could not be evaluated on the full dataset as individual trials required more than 36 hours of computation. https://doi.org/10.1371/journal.pcbi.1006106.t003