The AUC results are reported in Table 1. Comparing the AUC values across the dataset versions (last column in Table 1), it is clear that, in general, the set of chemical descriptors has a greater ability to predict a compound’s class than the set of GO terms. More precisely, the dataset using only chemical descriptors as features has a substantially larger AUC than the one using only GO terms as features (0.781 vs. 0.716, respectively). However, the GO term features still make a positive contribution to the predictive accuracy of random forests, since the dataset version leading to the highest AUC value in Table 1 (0.800) was the one using both GO terms and chemical descriptors as features.

Predictive accuracy of the models was evaluated by the area under the ROC curve (AUC). This is a measure between 0 and 1, with 1 indicating perfect (error-free) class predictions. The reported predictive accuracy is the median over the 10 test sets of the external cross-validation. We report the median rather than the mean because the median is more robust to outliers. The median AUC is reported for each of the different versions of the DrugAge dataset (using chemical descriptors, biological descriptors, or both), where for each dataset version we optimised the parameters ntrees and mtry of the random forest method as described in the Methods section.
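As an illustration of the evaluation measure, the following minimal Python sketch computes the AUC through its ranking interpretation (the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one) and then summarises per-fold values by their median. The function name `auc` and the ten fold values are hypothetical and chosen only to show why the median is less sensitive to a single outlying fold than the mean; they are not results from this study.

```python
from statistics import mean, median


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy predictions: 3 of the 4 positive/negative pairs are ordered correctly.
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])

# Hypothetical AUC values for 10 external cross-validation test sets,
# including one outlying fold (0.35).
fold_aucs = [0.78, 0.81, 0.80, 0.79, 0.82, 0.35, 0.80, 0.83, 0.79, 0.81]
median_auc = median(fold_aucs)
mean_auc = mean(fold_aucs)
```

In this toy case the single poor fold pulls the mean well below the median, whereas the median stays close to the typical fold value.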

We use the random forest method as the classification algorithm to analyse this dataset. This type of method was chosen for four reasons: it is particularly popular in bioinformatics [21,22]; it is robust to overfitting in datasets where the number of features is much larger than the number of instances (as with our dataset) [22,23]; it is relatively simple to understand and to use; and, in contrast to other state-of-the-art classification methods such as support vector machines, random forests produce interpretable results based on a variable (feature) importance measure, an interpretation mechanism also exploited in this paper.

We have created a DrugAge dataset specifically for studying the classification of compounds into the classes “increase lifespan” or “do not increase lifespan”, depending on each compound’s effect when administered to C. elegans. In this dataset, each compound to be classified belongs to one of these two classes, and is described by a large set of chemical descriptors and biological GO term features.

Biological and chemical features for the prediction of longevity compounds in C. elegans

One of the benefits of the random forest method, besides its high predictive accuracy, is that an importance measure can be calculated for each feature. This importance measure (often called variable importance) offers the opportunity to interpret the relevance of each feature in the model produced. In this work, using the Boruta and Ranger R packages [21,24] and computing the importance of features in the best model (built using both GO terms and chemical descriptors as features), 93 features (73 chemical descriptors and 20 GO terms) were selected as statistically significant (full table in Supplementary Material). Recall that the GO term features are derived from the proteins which are targeted by each compound.
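The general idea behind shadow-feature selection of the kind implemented by Boruta can be sketched as follows. This is a simplified illustration in Python, not the procedure used in this work: the absolute Pearson correlation with the class label stands in for random forest variable importance, a one-sided binomial test replaces Boruta's full decision rule, and no multiple-testing correction is applied. All function names, the toy data and the parameters `n_iter` and `alpha` are assumptions made for the example.

```python
import math
import random


def abs_correlation(xs, ys):
    """Toy importance score (assumption): |Pearson correlation| with the label."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / math.sqrt(vx * vy)


def binomial_upper_tail(hits, trials, p=0.5):
    """P(X >= hits) for X ~ Binomial(trials, p)."""
    return sum(
        math.comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(hits, trials + 1)
    )


def shadow_feature_selection(columns, labels, n_iter=50, alpha=0.01, seed=0):
    """Flag features that beat the best shuffled ("shadow") feature
    significantly more often than expected by chance."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(n_iter):
        # Shadow features: shuffled copies whose link to the label is destroyed.
        shadow_scores = []
        for values in columns.values():
            shuffled = list(values)
            rng.shuffle(shuffled)
            shadow_scores.append(abs_correlation(shuffled, labels))
        threshold = max(shadow_scores)
        for name, values in columns.items():
            if abs_correlation(values, labels) > threshold:
                hits[name] += 1
    return {
        name: binomial_upper_tail(h, n_iter) < alpha for name, h in hits.items()
    }


# Toy data: one feature that tracks the class label and one noise feature.
labels = [i % 2 for i in range(40)]
noise_rng = random.Random(1)
columns = {
    "informative": [y + 0.1 * (i % 3) for i, y in enumerate(labels)],
    "noise": [noise_rng.random() for _ in labels],
}
selected = shadow_feature_selection(columns, labels)
```

In this toy setting the informative feature outscores every shadow feature in each iteration and is flagged as significant, while the fate of the noise feature depends on the random draws.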

The 20 GO terms selected as significant are mainly biological process terms (14 out of 20), together with five molecular function terms and one cellular component term. Biological process GO terms describe broad series of processes as well as specific biological processes such as macromitophagy and macroautophagy, which are among the features with the highest importance in this work. Molecular function GO terms describe specific activities that occur at the molecular level, such as isomerase activity and protein disulfide isomerase activity. Finally, cellular component GO terms describe locations in the cell, e.g. at the level of organelles or macromolecular complexes such as the mitochondrial proton-transporting ATP synthase complex, the only significant cellular component GO term feature in this work.

Chemical molecular descriptors are calculated from the chemical structure and are normally used to build predictive models to study the relationship between a compound’s chemical structure and its biological and pharmacokinetic properties such as drug distribution and absorption [25,26]. This paper is the first use of chemical molecular descriptors (as well as GO terms) to study the relationship between longevity and the chemical structure of compounds that may affect longevity.

Chemical molecular descriptors can be broadly categorized into three main groups, which describe a compound’s chemical structure and its main properties: hydrophobic, electronic and steric (size and/or shape) descriptors. Hydrophobicity descriptors describe the hydrophobic character of a chemical compound and how easily it can cross cell membranes, and they may also be important for receptor interactions. Electronic molecular descriptors describe the electron distribution in a chemical compound and its electrostatic interactions; they therefore give an indication of how strongly (in terms of affinity) and how specifically a chemical compound binds to specific receptors. Finally, steric descriptors describe the size and shape of the chemical compound. The size and shape of a compound may influence its binding to enzyme or receptor binding sites and can also affect other physicochemical properties. Note that a chemical molecular descriptor can belong to more than one of the categories described above.

The top 20 selected features with the highest median variable importance are shown in Table 2. Among these top 20 features, there are slightly more GO terms (12 out of 20) than chemical molecular descriptors (8 out of 20). The 12 GO terms include terms related to mitochondrial processes, terms related to enzymatic and immunological processes, and terms related to metabolic and transport processes. Furthermore, the eight chemical molecular descriptors in the top 20 features include descriptors related to electronic and steric (size and shape) effects, but not directly to hydrophobic effects.

Table 2. Top 20 selected features with the highest median variable importance.

| Median Variable Importance | Feature | Feature type | Feature Description |
|---|---|---|---|
| 14.4 | a_nN | MD | Number of nitrogen atoms in the molecule |
| 12.8 | isomerase activity | GO | Catalysis of the geometric or structural changes within one molecule |
| 11.8 | macromitophagy | GO | Degradation of a mitochondrion by macroautophagy |
| 11.6 | macroautophagy | GO | Process in which cellular contents are degraded by lysosomes |
| 11.1 | protein disulfide isomerase activity | GO | Catalysis of the rearrangement of both intrachain and interchain disulfide bonds in proteins |
| 11.0 | dipeptidase activity | GO | Catalysis of the hydrolysis of a dipeptide |
| 9.72 | pyruvate metabolic process | GO | The chemical reactions and pathways involving pyruvate |
| 9.47 | PEOE_VSA+4 | MD | Total positive van der Waals surface area of atoms with atomic charge in the range of 0.20-0.25 |
| 9.31 | fatty acid transport | GO | The directed movement of fatty acids into, out of or within a cell, or between cells |
| 8.79 | mitochondrial electron transport, NADH to ubiquinone | GO | The transfer of electrons from NADH to ubiquinone mediated by the multisubunit enzyme known as complex I |
| 8.64 | vsurf_Wp2 | MD | Polar volume at -0.5, a descriptor reflecting the polarizability of a molecule |
| 8.57 | isotype switching | GO | The switching of activated B cells from IgM biosynthesis to biosynthesis of other isotypes |
| 8.40 | translation | GO | The cellular metabolic process in which a protein is formed |
| 8.18 | Q_RPC- | MD | Relative negative partial charge, defined as the most negative atomic charge divided by the sum of all negative atomic charges in the molecule |
| 8.09 | aerobic respiration | GO | The enzymatic release of energy from inorganic and organic compounds |
| 7.98 | a_IC | MD | Atom information content (total), defined as the entropy of the element distribution in the molecule multiplied by the number of atoms |
| 7.95 | PEOE_VSA_FPPOS | MD | Fractional polar positive van der Waals surface area |
| 7.86 | triglyceride mobilization | GO | The release of triglycerides from storage within cells or tissues, making them available for metabolism |
| 7.79 | chi1v | MD | Valence corrected molecular connectivity index (order 1) |
| 7.70 | bpol | MD | Sum of the absolute value of the difference between atomic polarizabilities of all bonded atoms in the molecule |

GO: Gene ontology term; MD: Chemical molecular descriptor
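The split between GO terms and molecular descriptors in Table 2 can be checked with a short tally. The sketch below simply transcribes the feature names and types from Table 2 into a Python dictionary (the variable names are illustrative) and counts each type.

```python
from collections import Counter

# Feature name -> feature type, transcribed from Table 2 in ranked order.
TABLE_2_FEATURE_TYPES = {
    "a_nN": "MD",
    "isomerase activity": "GO",
    "macromitophagy": "GO",
    "macroautophagy": "GO",
    "protein disulfide isomerase activity": "GO",
    "dipeptidase activity": "GO",
    "pyruvate metabolic process": "GO",
    "PEOE_VSA+4": "MD",
    "fatty acid transport": "GO",
    "mitochondrial electron transport, NADH to ubiquinone": "GO",
    "vsurf_Wp2": "MD",
    "isotype switching": "GO",
    "translation": "GO",
    "Q_RPC-": "MD",
    "aerobic respiration": "GO",
    "a_IC": "MD",
    "PEOE_VSA_FPPOS": "MD",
    "triglyceride mobilization": "GO",
    "chi1v": "MD",
    "bpol": "MD",
}

# Count how many of the top 20 features fall into each feature type.
type_counts = Counter(TABLE_2_FEATURE_TYPES.values())
print(dict(type_counts))
```

The tally gives 12 GO terms and 8 molecular descriptors, matching the counts stated in the text.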

It can be seen from the list of important features that the vast majority of them are very specific molecular and biological processes. However, these specific processes are generic in their applicability and occur across many tissues and organs. For example, “isomerase activity” covers a broad range of enzymes that catalyze reactions across many biological processes, such as glycolysis and carbohydrate metabolism. Although it is evident that isomerase activity is relevant to metabolism (amongst other processes) and hence to ageing, this feature is not specific enough to suggest practical targets for pharmacological intervention. Nevertheless, some of the specific features have been linked with longevity and ageing processes.

GO terms related to metabolism make up the vast majority of the GO term features listed in Table 2. These GO terms range from very general metabolism-related properties such as aerobic respiration to more specific processes such as dipeptidase activity, pyruvate metabolic process, fatty acid transport and mitochondrial electron transport from NADH to ubiquinone. Given the involvement of metabolic factors in several theories of ageing, such as the free radical theory of ageing, as well as the well-established effect of calorie restriction on longevity, it is to be expected that compounds that affect ageing do so by interacting with these pathways and processes, as also evidenced by the importance of such features in the random forest model.

One apparent group of features that can be related to longevity and ageing comprises the GO terms related to autophagy (macroautophagy and macromitophagy) and mitochondrial processes. Macroautophagy is the process by which cellular contents are degraded by lysosomes or vacuoles and recycled; it controls cytosolic protein and organelle degradation [27,28]. Macromitophagy, in turn, is the degradation of a mitochondrion by macroautophagy, and it controls mitochondrial quality and quantity [29]. Autophagy in general is known to be associated with ageing processes. This is evidenced by the degenerative changes, similar to those seen with ageing, that occur in mammalian tissues as a result of genetic inhibition of autophagy. Moreover, pharmacological or genetic manipulations that increase lifespan in model organisms often stimulate autophagy. Likewise, autophagy decreases with increasing age in organisms, which leads to an accumulation of damage [30] that is thought to be responsible for the functional loss in many biological and physiological processes as ageing occurs [31,32]. In addition to macroautophagy, mitophagy is specifically implicated in ageing. Mitophagy has been shown to be a selective, “non-random” process [33] that is governed by several biological pathways (see [34] for a review of the molecular mechanisms).

Mitochondrial respiration, and in particular the electron transport chain, is the main source of reactive oxygen species (ROS). As a result, mitochondrial homeostasis is particularly affected by ageing, as ROS generation in mitochondria leads to mitochondrial protein and mtDNA damage [34]. Therefore, mitophagy can be regarded as a defense against oxidative stress, mitochondrial dysfunction, and ageing. This is supported by findings that, along with mitochondrial biogenesis pathways, a key mediator of mitophagy and longevity assurance under conditions of stress in C. elegans (DCT-1) is upregulated when mitophagy is impaired [35]. It is therefore not unexpected to find in this work that chemical compounds that modulate mitophagy are also important promoters of longevity. It is interesting to note that in model organisms such as C. elegans, disruption of mitochondrial electron transport chain processes, through genetic [36] or pharmacological interventions [37], can lead to increases in longevity. Finally, a related property, aerobic respiration, was also selected by the random forest model. Although aerobic respiration is a very broad term encompassing many processes that lead to the production of cellular energy, it is strongly associated with ageing through the known impact of mitochondrial function and caloric restriction.

Other GO features with links to longevity and ageing processes are protein disulfide isomerase activity and translation. Protein disulfide isomerase activity refers to the activity of isomerases that are involved in protein folding via the formation and breakage of disulfide bonds within proteins in the endoplasmic reticulum (ER) [38,39]. The activity of this enzyme is key to protein folding and quality control in the ER. A number of studies have demonstrated that the levels of disulfide isomerases and their catalytic activity diminish with age [40]. Misfolding of proteins and ER stress are alleviated by the signalling pathway known as the ER stress response or the unfolded protein response, which involves protective measures to limit the protein load. These include up-regulation of ER chaperones involved in the refolding of proteins, and activation of pathways leading to reduced protein translation and to degradation of misfolded proteins. Where ER stress cannot be reversed, cellular functions deteriorate and apoptosis occurs [41]. There is evidence in the literature to suggest that disruption of protein disulfide isomerase activity leads to ER stress and accumulation of misfolded proteins, which can give rise to age-related disease pathology [42]. Finally, the GO term translation has a clear biological relevance, since it is well known that translation inhibition extends lifespan in C. elegans [43]. Translation has also been highlighted as a prime category among age-related genes in C. elegans in a recent paper by Fernandes et al. (2016) [44]. It is therefore evident that pathways involved in protein translation and folding may be a target of anti-ageing compounds, hence the significance of GO terms such as “translation” and “protein disulfide isomerase activity” in the random forest model.

The molecular descriptors in Table 2 indicate the molecular properties that impact the longevity effect of the compounds. Of the eight molecular descriptors listed in the table, the majority are electrostatic descriptors, such as PEOE_VSA+4, vsurf_Wp2, Q_RPC-, PEOE_VSA_FPPOS and bpol. These electrostatic parameters also carry information regarding the topology of the molecule and, along with steric parameters such as chi1v and a_IC, explain the interaction and binding of the compounds with their target sites. These targets/processes are in addition to those already described in the model by the biological features (GO terms).

Overall, even though the dataset used (like any other biological dataset) is somewhat biased by the fact that some genes have been much more studied than others [44], some of the most important features shown in Table 2 can be related to important and known biological processes of ageing and longevity, such as those related to autophagy and mitochondrial processes. Furthermore, the other selected biological and chemical features are a good starting point for further investigation into the links between the chemical and biological features of chemical compounds, longevity and the underlying biological ageing processes.