Sun-Gou is an Insight alumnus from the Health Data Science program in Boston and is now an R&D Scientist at Seven Bridges. While at Insight, he partnered with Sage Bionetworks, a Seattle-based non-profit organization that accelerates open biomedical research, to identify molecular traits that are predictive of Alzheimer’s Disease.

The Unknowns of Alzheimer’s Disease

Alzheimer’s disease (AD) is the most common form of dementia, affecting around 5 million people in the US. Although there are medications that can temporarily improve memory and thinking in half of the patients, there is no effective cure for this disease. Currently, efforts are underway to develop drugs that target various pathways involved in the excessive accumulation of proteins in the brain, which is known to be associated with AD but is also observed in the healthy aging population. What are the biological mechanisms that drive protein accumulation? If we can’t distinguish patients from healthy individuals pathologically, can we look into their molecular profiles (DNA, RNA abundance)? As an Insight Fellow, I found that molecular profiles are indeed good predictors of Alzheimer’s state.

A growing Alzheimer’s Disease population and healthcare cost. (Image from www.alz.org)

A look into the dataset (ROS/MAP cohort)

Religious Orders Study (ROS): ~1150 individuals from religious communities (nuns, priests, etc.) with restricted life experiences and socioeconomic status and high educational attainment

Memory and Aging Project (MAP): ~1550 individuals with a much wider range of life experiences, from the retirement community in Northeastern Illinois

Every individual in the ROS/MAP cohorts was subjected to annual clinical evaluation from their 50s and agreed to donate their brain after death. Longitudinal check-ups over the course of disease progression were collected, and at the time of death, expert neurologists made a summary clinical diagnosis by evaluating all available clinical data. This summary diagnosis was then compared to an independent opinion, formed from postmortem data by one or more neurologists and a neuropsychologist, to agree on a final clinical diagnosis (the COGDX score, or Clinical Consensus Diagnosis). The COGDX score is regarded as a more accurate diagnosis of an individual’s mental state than the MMSE score (a cognitive questionnaire), the BRAAK stage (representing the amount of tangles), or the CERAD score (the amount of plaques). The diagnosis of Alzheimer’s or cognitive impairment is often complicated, and having this COGDX score available for most of the samples can reduce the heterogeneity within analysis groups.

1. No cognitive impairment, 2. Mild cognitive impairment (MCI), 3. MCI and another cause of CI, 4. Alzheimer’s disease, 5. AZ and another cause of CI, 6. Other dementia

What got me most interested in this dataset was not only the high quality of clinical diagnosis, but also the availability of diverse clinical information along with multi-dimensional molecular assays for each patient. Molecular assays that measured DNA variation, as well as transcriptional and epigenetic profiles within these subjects, are all publicly available on Synapse (a data repository/software platform hosted by Sage Bionetworks) once you sign a data usage agreement. A large dataset with multiple layers of molecular features assayed within an individual is extremely rare, especially in a tissue such as the brain, which is very difficult to obtain. In addition, this information can be used along with longitudinal clinical examination data gathered since the 1990s, which makes this a dataset that would get any human geneticist super excited.

The Goals

After looking through the dataset and talking to Dr. Lara Mangravite and Dr. Thanneer Perumal at Sage Bionetworks, I focused on three of the molecular features (see footnotes): DNA methylation (methylation array), mRNA expression (RNA sequencing), and micro RNA expression (micro RNA array). I wanted to understand which of these three features predicted Alzheimer’s with the greatest accuracy. Understanding which of these regulatory mechanisms is more involved in an Alzheimer’s brain can potentially enhance our understanding of Alzheimer’s pathogenesis at the molecular level. If successful, this effort would help prioritize the different cellular regulatory mechanisms and help researchers focus on the specific molecular biology underlying Alzheimer’s.

Venn diagram showing the number of samples that were assayed using different technologies

Exploratory Data Analysis

The mRNA expression data from RNA sequencing consists of FPKM (Fragments Per Kilobase of transcript per Million mapped reads, a measure of RNA abundance) values for 55,889 unique Ensembl gene IDs (ENSG) across 649 individuals. As a first step, I used Principal Component Analysis (PCA) to reduce the data dimensionality, projecting the data onto the axes along which variability is maximized.
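To make the PCA step concrete, here is a minimal sketch in Python with NumPy. It is not the code used in this project: the expression matrix is synthetic, and the function name `pca_scores`, the +1 pseudocount before the log transform, and the simulated plate effect are all assumptions made for illustration.

```python
import numpy as np

def pca_scores(fpkm, n_components=2):
    """Project samples (rows) onto the top principal components.

    fpkm: array of shape (n_samples, n_genes) with non-negative FPKM values.
    Returns (scores, explained_variance_ratio) for the requested components.
    """
    x = np.log2(fpkm + 1.0)        # log-transform; the +1 pseudocount is an assumption
    x = x - x.mean(axis=0)         # center each gene across samples
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    var_ratio = (s ** 2) / np.sum(s ** 2)
    return scores, var_ratio[:n_components]

# Synthetic stand-in for the expression matrix: 40 samples x 500 genes,
# with a plate-specific scaling added to mimic a batch effect.
rng = np.random.default_rng(0)
plate = np.repeat(np.arange(4), 10)            # 4 hypothetical plates, 10 samples each
fpkm = rng.gamma(shape=2.0, scale=5.0, size=(40, 500))
fpkm = fpkm * (1.0 + 0.5 * plate)[:, None]

scores, var_ratio = pca_scores(fpkm)
# Correlation between the first principal component and plate membership
pc1_plate_corr = abs(np.corrcoef(scores[:, 0], plate)[0, 1])
```

Plotting `scores[:, 0]` against `scores[:, 1]` and coloring each point by `plate` is the kind of view that exposes a batch effect when one dominates the variance.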

PCA of all RNA sequencing data coloured by sequencing plate (1–8). Each circle represents a single sample

LDA of all RNA sequencing data

As shown in the PCA figure above, the largest variation within this dataset was a batch effect: the plate on which each sample was sequenced accounted for the greatest variation between samples. I also used an alternative dimensionality reduction method, linear discriminant analysis (LDA). Whereas PCA projects the data onto orthogonal axes representing maximum variation in the dataset, LDA finds the axes (linear combinations of features) that maximally separate the given classes. Although the LDA did show some clustering into controls, MCI (a less severe form of AZ) and AZ (Alzheimer’s), the clusters were not clearly separated.
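The contrast with PCA is easier to see in code: LDA is supervised, so it takes the class labels as input. The sketch below uses scikit-learn’s `LinearDiscriminantAnalysis` on synthetic data; the group sizes, the number of features, and the shifts between groups are invented for illustration and are not taken from the ROS/MAP data.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic data: 90 samples x 20 features, three diagnostic groups.
rng = np.random.default_rng(1)
labels = np.repeat(["control", "MCI", "AZ"], 30)
x = rng.normal(size=(90, 20))
x[labels == "MCI", 0] += 1.0     # hypothetical shift separating MCI from controls
x[labels == "AZ", 0] += 2.0      # larger shift for AZ on the same feature
x[labels == "AZ", 1] += 1.5      # plus a second AZ-specific feature

# With three classes, LDA yields at most two discriminant axes.
lda = LinearDiscriminantAnalysis(n_components=2)
lda_scores = lda.fit_transform(x, labels)     # labels are required, unlike PCA
train_accuracy = lda.score(x, labels)
```

Because the axes are chosen using the labels, separation on the training samples tends to look better than it will on new samples, which is one reason a visually clean LDA plot does not guarantee good classification performance.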

Feature Selection

In order to improve classification/clustering performance and reduce the batch effect, I performed a feature selection step by identifying differentially expressed genes (DEGs). Although the DESeq2 or edgeR packages are better suited to RNA sequencing data, I was not able to obtain non-normalized read counts, and had to settle for a suboptimal strategy: using log2-transformed FPKM values and the generalized linear model (GLM) implemented in the limma R package.

Number of DEGs for each pairwise comparison

The GLM was applied to all pairwise comparisons between the three groups, with known clinical and technical covariates included, such as postmortem interval (PMI, hours between death and sampling), ApoE genotype (a known risk factor of Alzheimer’s), RNA integrity number (RIN), and sequencing plate. The Venn diagram on the left shows the number of DEGs between the pairwise comparisons. Interestingly, there were no DEGs between control subjects and subjects with mild cognitive impairment, whereas more than nine hundred DEGs were identified between controls and Alzheimer’s patients. This suggested that people with a mildly impaired cognitive state did not differ much from cognitively normal people, at least in terms of mRNA expression.

PCA of RNA sequencing data based on DEGs colored by sequencing plate (1–8)

LDA of RNA sequencing data based on DEGs

Now look at the figures with PCA and LDA results based on the 982 DEGs. First, they clearly show that the batch effect, which used to be the largest source of variation within the data, is no longer present. Second, the supervised clustering through LDA shows far clearer separation between controls, MCI and AZ. Reducing the number of features and the noise based on DEGs had a marked effect on classification accuracy.

Similar approaches for feature selection were applied to the micro RNA and the DNA methylation datasets. However, in the case of the DNA methylation dataset, I used the 71 methylation marks that were identified and validated by De Jager et al. (Nat Neurosci, 2014). This study used the ROS/MAP DNA methylation data and a GLM to identify methylation marks significantly associated with the burden of neuritic amyloid plaques (NP), which is a key quantitative measure of Alzheimer’s disease neuropathology. This allowed for greater power to detect associations, and the methylation marks identified during this discovery stage were validated on an independent cohort of 117 subjects.

Predicting Alzheimer’s from Molecular Features

After feature selection, I assessed the performance of four different machine learning approaches in predicting Alzheimer’s state: logistic regression, LDA, elastic net, and random forest. I optimized model hyper-parameters (L1 ratio, number of trees, etc.) using a grid search with 5-fold cross-validation, and then tested the best parameters on the 20% of samples that were held out at the beginning of each analysis.
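The evaluation scheme described above (hold out 20%, tune on the rest with 5-fold cross-validation, score once on the held-out samples) can be sketched with scikit-learn as follows. This is an illustration for one of the four models only, using synthetic data; the parameter grid, the feature scaling step, and the choice of AUC as the tuning metric are assumptions rather than the settings used in the project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 samples x 50 features, two informative features.
rng = np.random.default_rng(3)
x = rng.normal(size=(200, 50))
y = (x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=200) > 0).astype(int)

# Hold out 20% of the samples before any tuning takes place.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0
)

# Elastic net penalized logistic regression; scaling happens inside the
# pipeline so each cross-validation fold is scaled using its own training part.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000),
)
param_grid = {
    "logisticregression__l1_ratio": [0.1, 0.5, 0.9],   # hypothetical grid values
    "logisticregression__C": [0.01, 0.1, 1.0],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="roc_auc")
search.fit(x_train, y_train)

# The held-out samples are used exactly once, with the best parameters.
test_auc = roc_auc_score(y_test, search.predict_proba(x_test)[:, 1])
```

Keeping the held-out split outside the grid search is what makes `test_auc` an honest estimate: the tuned parameters never see those samples.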

Before conducting predictions on Alzheimer’s using molecular features, I wanted to make sure that I was able to predict Alzheimer’s based on known clinical features such as ApoE genotype, sex and age of death. As shown in the figure on the upper left, the combination of these clinical features had better than random prediction power based on the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). On the other hand, the prediction power was close to random when technical variables (PMI, RIN and plate information) were used to predict Alzheimer’s state (figure on the lower left). These results showed that my pipeline was working as expected.
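For readers less familiar with AUC, it can be read as the probability that a randomly chosen patient receives a higher predicted score than a randomly chosen control, so 0.5 corresponds to random guessing and 1.0 to perfect ranking. A small pure-Python sketch of that definition is below; the function name `roc_auc` and the toy scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 class labels (1 = case, 0 = control)
    scores: iterable of predicted scores, higher meaning more likely a case
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])        # 1.0
partial = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])       # 0.75
uninformative = roc_auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])  # 0.5
```

The pairwise loop is quadratic in the number of samples, which is fine for a few hundred individuals; library implementations compute the same quantity from sorted scores.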

Finally, I set out to see how well the molecular features of interest performed in terms of predicting Alzheimer’s, starting again with mRNA expression.

Based on the ROC curve, mRNA expression predicted Alzheimer’s state better than the clinical features did. In addition, although LDA was successful in visually clustering the three groups, it had the worst performance in classification. Meanwhile, logistic regression showed the best performance.

Above is the prediction performance based on AUC in all sets of analyses. Here are some of the most interesting observations and my interpretations:

First, it is interesting to note that random forest performs worse than logistic regression. One reason for this observation could be that the sample size is too small to train a non-linear model. There are still more features than samples after feature selection, and although we know that a biological system functions through interactions between its components, the current sample size may not allow for a thorough investigation of interactions between all components.

Second, mRNA expression, micro RNA expression, and DNA methylation all provide greater prediction performance than clinical features (ApoE genotype, age of death and sex combined). This suggests that these regulatory mechanisms within the brain tissue may be important in Alzheimer’s.

Third, mRNA expression, micro RNA expression, and DNA methylation provide little extra information when combined. This suggests that these features are correlated and that there is no information gain when they are combined. Although it is still possible that one of the three regulatory mechanisms is predominantly active in some pathways that lead to Alzheimer’s, none of the three molecular features is globally more important in predicting Alzheimer’s state.

Final Remarks

Although I was not able to identify a single globally important regulatory mechanism that governs Alzheimer’s pathogenesis, I discovered that all three molecular features are more predictive of Alzheimer’s than standard clinical features. While biological data presents unique challenges for machine learning methods (e.g. an extremely high-dimensional feature space and a large number of non-independent variables), it is encouraging to see molecular traits help identify Alzheimer’s Disease and its progression.

Footnotes

mRNAs are molecules that encode proteins that are the functional modules within a cell

DNA methylation is currently understood to regulate how much mRNA is transcribed from a gene

microRNAs will turn off mRNAs after they have been transcribed, meaning no functional proteins are produced

All code used in this project is available on GitHub.
