The present results show that a combination of relatively few genetic and clinical variables can predict whether an individual with depression may reach remission with a specific antidepressant. The prediction models are parsimonious, based on only 17 and 20 variables, and the predictions are reproducible in non-overlapping validation datasets. These results demonstrate that combining genomic and clinical information within a statistical learning framework has the potential to serve as a clinical decision support tool that may help select the antidepressant from which an individual is more likely to benefit.

The prediction was largely antidepressant-specific. The models predicted remission in the validation sample treated with the same antidepressant, but not in samples treated with the other antidepressant. This drug specificity makes the multivariate prediction more useful and applicable to clinical decision making. While the prediction of remission with escitalopram was driven by a combination of clinical and genetic variables, the achievement of remission with nortriptyline was predicted from genetic variants only. The clinical variables that contributed to the prediction of remission with escitalopram overlapped with previously reported predictors. Our model suggested that patients who had low levels of interest and activity, sleep problems, somatic symptoms and severe depression were less likely to reach remission, reflecting previously identified associations with symptom profiles10,12,13. For the prediction of response to nortriptyline, the procedure selected only genetic variables. This suggests that the information predictive of nortriptyline response was better captured by genetic variables than was the information predictive of response to escitalopram. The genetic variants selected into the prediction models were distinct from those identified in univariate genome-wide association studies3,4,5,6,7. For example, the genetic variants that predicted remission with nortriptyline in the multivariate model did not include the variant rs2500535 in UST that was previously identified as significantly associated with response to this antidepressant in the same dataset7. These results demonstrate that a statistical learning framework uses a multidimensional pool of predictors in a way that is partially distinct from traditional univariate approaches and has the potential to build novel prediction models that are relevant to clinical outcomes and robust in generalisation.

It is widely accepted that multiple genes and alleles are involved in determining response to antidepressants, some of which may not yet have been discovered. Interestingly, some of the genes containing variants that we report as predictive of antidepressant treatment response have recently been identified as depression risk genes, and have also been associated with bipolar disorder, schizophrenia and other brain diseases (Tables 1 and 2). For example, the SGCZ gene, part of the sarcoglycan complex, a group of six proteins that bridge the inner cytoskeleton and the extracellular matrix, has recently been reported to be associated with major depression, schizophrenia and bipolar disorder14, as well as with alcohol and nicotine co-dependence15 and Parkinson’s disease16. Consistent down-regulation of SLC25A37 in patients with major depression across three independent samples suggested that this gene may serve as a potential biomarker for the diagnosis of major depression17. This gene was also associated with fatigue18. The acid-sensing ion channel gene ACCN1 has been associated with response to lithium treatment in bipolar disorder19 and with risk of autism20. The gene encoding transmembrane protein 229B has been associated with risk of Parkinson’s disease21 and with childhood obesity22. The genes TMEM170A, encoding transmembrane protein 170A, and CFDP1, encoding craniofacial development protein 1, have both been associated with risk of coronary disease23. The latter has also been associated with lung function24. Another variant identified in this work was located in the transmembrane protein 2 gene TMEM2, which has an essential role in the coordination of myocardial and endocardial morphogenesis25. None of the selected genetic variants were located in genes previously associated with pharmacogenetics in depression treatment. However, it is a common finding in genomics that most predictive genetic variants lie outside the candidate genes proposed in advance. This accounts for the general failure of the candidate gene approach and opens new avenues for understanding pathogenesis and pharmacology. Surprising findings from genomic research in other disorders have opened new ways of understanding and treating those disorders (e.g. the previously unsuspected involvement of complement in macular degeneration and schizophrenia). Further functional characterization may provide potential targets for future antidepressant therapeutics.

The prediction was accurate enough to be clinically meaningful. Remission was predicted in validation data with an AUC of 0.77 in the escitalopram group and 0.77 in the nortriptyline group. Following the classification proposed by Hosmer & Lemeshow26, our models had “acceptable discrimination” (values of AUC of 0.7 or higher). The utility of biomarkers and prediction models in practice does not depend solely on their prediction accuracy, as reflected by the AUC, but also on the clinical context, the gravity of the predicted outcomes, and the cost and burden of the test. For example, a comparison among breast cancer prediction algorithms27 reported good performance for models with AUCs below 0.7. The fact that the genetic and clinical variables used in the present models can be obtained with high accuracy and with low-cost measurements that do not burden participants suggests that such models may be useful in practice.
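
To make the interpretation of the AUC concrete, the following minimal sketch computes it as the proportion of remitter/non-remitter pairs in which the remitter receives the higher predicted score (ties counted as one half), and then maps the value onto discrimination bands in the style of Hosmer & Lemeshow. The function names, the band labels above "acceptable", and the toy labels and scores are assumptions made for illustration only; they are not GENDEP data and do not reproduce the models reported here.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen remitter (label 1)
    has a higher predicted score than a randomly chosen non-remitter (label 0).
    Tied scores contribute 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one remitter and one non-remitter")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def discrimination_label(value: float) -> str:
    """Bands assumed for this sketch: >=0.7 acceptable, >=0.8 excellent,
    >=0.9 outstanding. Only the 0.7 threshold is quoted in the text."""
    if value >= 0.9:
        return "outstanding"
    if value >= 0.8:
        return "excellent"
    if value >= 0.7:
        return "acceptable"
    return "below acceptable"


# Hypothetical validation sample: 1 = remission, 0 = no remission.
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.6, 0.7, 0.25, 0.5, 0.3, 0.2]

toy_auc = auc(toy_labels, toy_scores)
print(toy_auc, discrimination_label(toy_auc))
```

In this toy sample, 12 of the 16 remitter/non-remitter pairs are ordered correctly, so the AUC is 0.75. Read this way, an AUC of 0.77 means that roughly three out of four such pairs would be ranked correctly by the model.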

Most of our previous work reporting on GENDEP applied analytical methods from the traditional inferential statistical framework, in which each test assesses the association of a single clinical variable or genetic variant with treatment response. Association analysis aims to test the effects of specific factors on the response. This approach will highlight the predictive variable that has the strongest relationship with outcome on its own. In contrast, our current report aims to achieve an optimized prediction of outcome using all available predictor variables, which is a substantially different aim. Statistical learning can be used to build a model that will predict treatment outcome for new (unseen) cases, with clinical utility in practice. While explanatory power provides information about the strength of an underlying causal relationship, it does not imply predictive power. By capturing underlying complex patterns and relationships, predictive modeling can suggest improvements to existing explanatory models28.
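
The contrast between single-variable association and joint prediction can be illustrated with a deliberately artificial example. In the sketch below, the outcome occurs only when exactly one of two binary predictors is present, so neither predictor discriminates on its own, whereas a rule that uses both predicts the outcome perfectly. The data, the variable names and the hand-written joint rule are all assumptions made for this illustration; nothing here is learned from data, and it is not the modelling procedure used in this study.

```python
from itertools import product
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive case scores higher than a random
    negative case; tied scores contribute 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical cohort of 100 cases: every combination of two binary
# predictors appears 25 times; the outcome is 1 when exactly one is present.
rows = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)] * 25
labels = [y for _, _, y in rows]

# Each predictor considered on its own, as in a one-variable-at-a-time analysis.
auc_x1 = auc(labels, [x1 for x1, _, _ in rows])
auc_x2 = auc(labels, [x2 for _, x2, _ in rows])

# A hand-written rule that uses both predictors together.
auc_joint = auc(labels, [abs(x1 - x2) for x1, x2, _ in rows])

print(auc_x1, auc_x2, auc_joint)
```

Real predictors are rarely this extreme, but the same logic explains why variables selected into a multivariate model need not coincide with those showing the strongest univariate associations.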

GENDEP has several strengths that make it suitable for prediction modeling. It is a randomised controlled trial that allows optimal comparison between treatments and the development of treatment-specific predictors29,30. The longitudinal design of GENDEP allowed the follow-up of patients and the prospective assessment of symptom change, which is the most appropriate approach to establish cause-effect relations and avoid inconsistencies in data collection. The study was specifically designed to assess remission as the primary outcome, with patients being followed for 12 weeks. All patients had four or more depression severity measurements, and more than eighty percent of the sample had eight or more, allowing enough time to observe a clinical trend that could lead to remission. However, interpretation of the present results has to take into account several limitations. First, while a wealth of information was available in the GENDEP dataset, not all relevant predictors were measured. For example, history of maltreatment in childhood has been shown to predict outcome of treatment with antidepressants31, but information on childhood maltreatment is not available in GENDEP. Second, since GENDEP only included individuals of white European ancestry without a family history of bipolar disorder, the results may not generalize to individuals of other ethnicities or to those with a family history of bipolar disorder. Third, GENDEP only included two antidepressant drugs, which are distinct in their mechanisms of action. Similar prediction of outcomes with other antidepressants, with neurostimulation and with psychological treatments will require investigation in large and richly assessed samples of individuals treated with different modalities. Fourth, the GENDEP study was used as an exploratory dataset to build and test the predictive models. The clinical application of these models will require a comparison of outcomes between individuals whose treatment is selected according to a prediction model and those whose treatment is selected by chance or according to the judgement of the treating physician.

In conclusion, the present results demonstrate that a combination of a relatively small number of clinical and genetic variables can meaningfully and robustly predict remission with escitalopram and nortriptyline antidepressants among individuals with major depressive disorder. Statistical learning methods may be used to derive similar models for individuals treated with various antidepressants and other treatment modalities to map the opportunities for individualized indications for treatments.

The models are available online at https://gist.github.com/raqini/669c38a6329aa2231268770200519d64.