Network-based machine-learning strategy for drug and food repositioning

The work presented herein exploits publicly available data on interactions between molecules and gene-encoded proteins, as well as protein-protein interaction data. In brief, the sparse data on interactions between drugs and their protein/gene targets are first mapped onto large-scale interactome networks, i.e. the whole set of protein-protein interactions in humans (here and below, owing to the specifics of the existing interaction datasets, the terms “gene” and “protein” are used interchangeably). Most drugs exert their biomedical and functional activity by binding to a specific subset of proteins. Proteins rarely function in isolation but rather operate as part of highly interconnected networks23. Taking this into account, we have tailored random walks with restarts on graphs (controlled by a single network diffusion parameter “c”) to simulate the perturbation exerted by individual drugs on human proteome networks, using aggregated datasets of their targeted proteins. Similar network-based propagation approaches have recently compared favourably with alternatives in predicting drug-target interactions and in evaluating network perturbations caused by cancer mutations for improved patient stratification24,25. This network diffusion transforms a short list of proteins targeted by a given molecule/drug into a genome-wide profile of gene scores based on their network proximity to target candidates. Using the genome-wide profiles of drugs, a supervised machine-learning strategy (“maximum margin criterion” and support vector machines, in this case) is trained to classify molecules as “anti-cancer” or “other”. The best models obtained were used to predict the probability that a given approved drug exhibits anti-cancer properties. After validation of the predictive capacity of the model for anti-cancer drug repositioning, the same machine-learning strategy was applied to predict cancer-beating molecules within foods (Fig. 1).
It should be noted that there are various methodologies for drug repositioning, based on molecular structural commonality, molecular target similarity, or shared genetic or phenotypic (e.g. side-effect profile) influence26,27. However, these approaches require additional datasets (such as gene-expression, proteomics, metabolomics or phenotypic-effect data) for model building. In the search for food-based cancer-beating molecules, such data are very limited.
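The network diffusion step described above can be illustrated with a minimal sketch of a random walk with restart on a toy graph. This is not the authors' implementation: the function name `random_walk_with_restart`, the dictionary-based graph format and the toy node labels are hypothetical, and the sketch assumes that “c” acts as the restart probability, which the text does not state explicitly.

```python
def random_walk_with_restart(adjacency, seeds, c=0.5, tol=1e-10, max_iter=1000):
    """Diffuse seed scores over an undirected graph.

    adjacency: dict mapping each node to the set of its neighbours.
    seeds:     nodes where the walk restarts (e.g. a drug's known targets).
    c:         restart probability (ASSUMED meaning of the diffusion parameter).
    Returns a dict node -> score; the scores sum to 1.
    """
    nodes = sorted(adjacency)
    seeds = sorted(set(seeds) & set(adjacency))
    if not seeds:
        raise ValueError("no seed node is present in the network")

    # Restart distribution: uniform over the seed nodes.
    p0 = {n: 0.0 for n in nodes}
    for s in seeds:
        p0[s] = 1.0 / len(seeds)

    p = dict(p0)
    for _ in range(max_iter):
        # With probability c the walker jumps back to a seed node.
        nxt = {n: c * p0[n] for n in nodes}
        for n in nodes:
            neighbours = adjacency[n]
            if not neighbours:
                # Isolated node: the walker cannot move, so its mass stays put.
                nxt[n] += (1.0 - c) * p[n]
                continue
            # Otherwise the remaining mass is split equally among neighbours.
            share = (1.0 - c) * p[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy network (hypothetical labels): a chain A-B-C-D plus an isolated node E.
toy_network = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}, "E": set()}
profile = random_walk_with_restart(toy_network, seeds=["A"], c=0.3)
for node in sorted(profile, key=profile.get, reverse=True):
    print(node, round(profile[node], 4))
```

In this toy case the score decays with distance from the seed, which is the sense in which a short target list becomes a network-wide profile. Note that an isolated seed keeps all of its mass under this scheme.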

Figure 1 Schematic diagram of the overall workflow.

Benchmarking and optimization of machine learning strategy

Among the machine-learning methods tried, MMC (maximum margin criterion)28 and SVM29 with a linear kernel showed comparable performance and relatively good processing speed (including parameter optimization, model training and prediction under 10-fold cross-validation). Radial kernel SVM did not exceed the performance of the linear methods and required much longer processing time (the best F1-score achieved was 0.85 for radial kernel SVM vs 0.86 for linear kernel SVM). Furthermore, the optimal gamma parameter for the radial SVMs tended to be very low (~10^−7), effectively making them similar to linear kernel SVMs. We also explored 2 neural network classifiers and 2 regularized LASSO/Elastic Net logistic classifiers to see whether they would bring any improvement in classification accuracy. For the best-performing type of interactome and random-walk settings, these more advanced approaches resulted in prediction accuracies comparable to linear SVM and MMC (see SI Appendix M1). This behaviour is well known in genomics studies involving a small number of examples and a large number of features, where linear classifiers are preferred because of their transparency and biological interpretability. As a result, the final round of optimization focused on the linear kernel SVM and MMC methods. The best F-score achieved was 0.86, obtained with linear kernel SVM, with 84% correct anti-cancer predictions and 90% correct non-anti-cancer predictions (see SI Dataset S1). Re-running the optimization multiple times with the same settings showed consistent performance (maximum 1–2% difference). Based on these results, the top 700 models (F-score ≥ 0.84) based on linear kernel SVM and MMC were selected for anti-cancer likeness prediction for existing approved drugs (SI Dataset S2) and food compounds (SI Dataset S3).
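To make the selection criterion concrete, the sketch below computes an F1-score from confusion counts and then combines the predictions of all models that pass an F-score threshold. The helper names, the dictionary layout and the numbers are hypothetical. Averaging the predicted probabilities across selected models is an assumption made for illustration; the text does not specify how the 700 models are combined into a single likeness score.

```python
def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def ensemble_likeness(models, threshold=0.84):
    """Average per-compound probabilities over models with F-score >= threshold.

    models: list of dicts with keys "f_score" (cross-validated F-score) and
            "probs" (dict compound -> predicted probability). Hypothetical layout.
    ASSUMPTION: a plain mean over the selected models.
    """
    selected = [m for m in models if m["f_score"] >= threshold]
    if not selected:
        raise ValueError("no model passes the F-score threshold")
    compounds = selected[0]["probs"].keys()
    return {
        name: sum(m["probs"][name] for m in selected) / len(selected)
        for name in compounds
    }


# Made-up example: three candidate models scoring two placeholder compounds.
candidate_models = [
    {"f_score": 0.86, "probs": {"compound_x": 0.9, "compound_y": 0.2}},
    {"f_score": 0.84, "probs": {"compound_x": 0.7, "compound_y": 0.4}},
    {"f_score": 0.80, "probs": {"compound_x": 0.0, "compound_y": 1.0}},  # filtered out
]
print(f1_score(tp=8, fp=2, fn=2))
print(ensemble_likeness(candidate_models))
```

Filtering on the cross-validated F-score before averaging keeps weaker parameter settings from diluting the combined score.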
Interestingly, log-transformation of the input propagated profiles systematically increased classifier performance. This is likely because individual isolated genes, which do not propagate and thus retain a very high perturbation level, have a lesser effect on the overall profile in log-space. By contrast, the “c” parameter of the random walker and the different matching settings between compounds and genes had less pronounced effects. Gene-gene connection thresholds were also not strongly influential, except in the case of the BioPlex interactome. This is likely because the connections provided by STRING draw on a wide range of knowledge sources, giving a more representative and complete graph of gene-gene (or protein-protein) interactions, and the sheer number of connections can compensate for the larger values of “c” and the higher thresholds used. We have also evaluated the influence of individual genes on the final classification, i.e. gene importance, by computing the correlation between the gene levels and the prediction outcomes of the optimized model. The full table of averaged importance values for the top 700 selected models is provided as SI Dataset S4. As expected, the top-rated genes are involved in the control of cell proliferation, and their mutations are often associated with cancer. This provides transparency to the machine-learning-based prediction of the anti-cancer properties of drugs.
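The two operations described in this paragraph, log-transforming the propagated profiles and scoring gene importance by correlation, can be sketched as follows. The function names are hypothetical and the data are made up. The small pseudocount `eps`, used to avoid taking the logarithm of zero, and the choice of Pearson correlation are assumptions, as the text does not name either.

```python
import math


def log_transform(profile, eps=1e-12):
    """Natural log of each score; eps is an ASSUMED pseudocount for zero scores."""
    return [math.log(value + eps) for value in profile]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def gene_importance(profiles, outcomes):
    """Correlate each gene's level across compounds with the model outputs.

    profiles: one row per compound, one column per gene.
    outcomes: one model output (e.g. predicted probability) per compound.
    """
    n_genes = len(profiles[0])
    return [pearson([row[g] for row in profiles], outcomes) for g in range(n_genes)]


# Made-up scores spanning several orders of magnitude: the log compresses
# the gap between an isolated high-scoring gene and the rest of the profile.
raw = [0.9, 1e-3, 1e-6]
print(log_transform(raw))

# Made-up profiles for three compounds and two genes; gene 1 is constant.
toy_profiles = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
toy_outcomes = [0.1, 0.5, 0.9]
print(gene_importance(toy_profiles, toy_outcomes))
```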

Pathway analytics and differential interactome

A list of the most influential genes/proteins for predicting anti-cancer therapeutics, derived from network-based machine learning, was subjected to pathway analytics using gene-set enrichment (SI Dataset S4). Among the top 25 impacted pathways were cell cycle, DNA replication, apoptosis, p53 signaling, JAK-STAT signaling and mismatch repair, as well as various cancer-specific pathways. That the pathways identified as key drivers are those consistently implicated in cancer development and progression adds to the biological plausibility of the modelling approach used here. In Fig. 2, relevant discriminating genes and their corresponding impacted pathways are presented. Here, individual node size corresponds to the relative discriminating capacity of a given gene-encoded protein and node color illustrates shared biological pathway functionality. Increasingly, it is understood that the mechanistic bases for cancer cell survival, dissemination and therapeutic resistance are manifold and involve multiple biochemical pathways. Most machine-learning-derived pathways in our analysis have been suggested as targets for cancer prevention or therapeutic interventions30,31,32. Therefore, the “ideal” anti-cancer agent should be capable of disrupting multiple pro-tumorigenic biochemical processes. The machine-learning approach presented here highlights the biological pathways influenced by currently utilized anti-cancer therapeutics, and thus permits, in parallel, a targeted search for unique agents, in this case bioactive compounds within foods, with the potential to impact multiple pathways simultaneously.
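As an illustration of what a pathway enrichment calculation involves, the sketch below runs a hypergeometric over-representation test for one pathway. This is one common form of gene-set enrichment and is used here as a stand-in: the text does not say which enrichment method or software was used. The function name and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe, n_pathway, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_universe: number of genes in the background (e.g. the interactome).
    n_pathway:  number of background genes annotated to the pathway.
    n_selected: number of top-ranked (influential) genes.
    n_overlap:  number of selected genes that fall in the pathway.
    """
    total = comb(n_universe, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Made-up counts: 20,000 background genes, a 120-gene pathway,
# 200 influential genes, 15 of which belong to the pathway.
print(enrichment_p_value(20000, 120, 200, 15))
```

In practice such a test would be repeated for every pathway and the resulting p-values corrected for multiple testing before pathways are ranked.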

Figure 2 Relevant genes and pathways derived from machine learning models for prediction of anti-cancer therapeutics tested in human trials. Individual node size corresponds to the relative discriminating capacity of a given gene-encoded protein and node color illustrates shared biological pathway functionality.

Drug repositioning in cancer using interactomics

The full prediction summary is presented in SI Dataset S2. As expected, most compounds currently in use as cancer therapeutics demonstrated strong anti-cancer probability. Interestingly, several compounds which are not conventionally used in cancer treatment demonstrated very high anti-cancer likeness (ACL). The available literature on these compounds was further interrogated to understand the mechanistic basis for the potential anti-cancer effect(s) of these agents. For example, the quinolone derivative rosoxacin and the quinoline-based clioquinol primarily act as anti-microbial and anti-fungal agents, respectively. However, the analysis presented here indicates a potential direct role for these therapeutics in cancer. The quinolone antibiotics were shown to have significant inhibitory potency against eukaryotic topoisomerase-II, resulting in cytotoxicity in various cancer cell types33. This group of compounds can be explored in comparison to anti-tumor drugs that inhibit human topoisomerase-II, such as doxorubicin and etoposide. Clioquinol is a chelator of zinc, copper and iron, which are known to be involved in both carcinogenesis and angiogenesis34. The anti-neoplastic activity of clioquinol is thought to arise through several potential mechanisms, including NF-κB apoptosis induction, mTOR signaling and lysosome inhibition35. Although of great promise, its role in cancer therapy remains largely unexplored in clinical settings. Anti-diabetic agents such as metformin and chromium picolinate also emerged from this evaluation as potential candidates for anti-cancer drug repositioning. The molecular mechanisms responsible for this association remain uncertain; however, both agents are used to alleviate insulin resistance through modulation of the insulin signaling cascade, and a number of studies have shown that chromium specifically alters proximal insulin signaling and directly affects insulin receptor phosphorylation and kinase activity36.
The downstream consequence of therapy with both metformin and chromium is a reduction in insulin and insulin-like growth factor levels, which in turn is understood to inhibit several key processes within the mTOR signaling pathway, a central molecular driver of a variety of cancers37. Correspondingly, a strong association has been shown on pooled analysis between metformin usage and incidence of cancer in type II diabetics37. By contrast, chromium picolinate might act as a “double-edged sword” owing to its capacity to interfere with DNA, leading to structural genetic lesions and thereby promoting carcinogenesis38. This example highlights a limitation of our approach: it identifies molecules that interact with relevant carcinogenic processes irrespective of the nature of the interaction (i.e. inhibition or stimulation). Identifying the nature of molecular interactions would require additional datasets, such as gene expression or proteomics, but these are not generally available for food-based molecules.

Prediction of cancer-beating molecules in foods

Of all small molecules approved for anti-cancer therapy, almost half are derived from natural products39. These drugs are generally better tolerated and less toxic to normal cells39. The methodology outlined above was next applied to predict the anti-cancer likeness of ~7692 bioactive compounds across various food categories. Here a comprehensive view of drug-like molecules in food is provided, unlike most studies in the literature to date, which have tended to focus on a single compound or a single food type. Approximately 110 molecules from different chemical classes (see Fig. 3), including terpenoids, isoflavonoids, flavonoids, polyphenols and brosso-steroids, were identified and mapped according to their food sources using multiple experimental databases. A complete list of food molecules with an anti-cancer drug likeness of >0.1, ranked by this proxy, is provided in SI Dataset S3. Using the unsupervised random walk on graphs, we have propagated the influence of the most promising molecules on human interactome networks and identified their impacted molecular pathways (for detailed analysis see SI Dataset S3 and, for compounds with ACL > 0.7 only, SI Dataset S5). SI Appendix Table S1 lists the cancer-beating compounds identified in the present study with ACL > 0.7, together with their associated food sources. Furthermore, we have conducted a comprehensive review of the available literature on the top anti-cancer drug-like molecules (with ACL > 0.9) and their putative molecular mechanisms of anti-cancer action (SI Appendix Table S2). Both the computational analysis and the experimental data from the literature show that the pathways and mechanisms responsible for these anti-cancer properties cover the breadth of our current understanding of the multi-step process of carcinogenesis.
These include anti-inflammatory and pro-apoptotic effects, potent antioxidant activity and free-radical scavenging; regulation of gene expression in cell proliferation, cell differentiation, oncogenes and tumor suppressor genes; modulation of enzyme activities in detoxification, oxidation and regulation of hormone metabolism; and antibacterial and antiviral effects40. For example, indole-3-carbinol, which is found abundantly in members of the Brassica oleracea family of vegetables (including cabbage, broccoli and Brussels sprouts), appears to be one of the most strongly anti-cancer-like molecules. This bioactive compound has been shown to target multiple aspects of cancer cell cycle regulation and survival, including caspase activation, oestrogen metabolism and receptor signaling, and endoplasmic reticulum function (see SI Appendix Table S2 and references therein). Other prominent examples include didymin, a flavonoid glycoside found in citrus fruits, and apigenin, which is particularly abundant in coriander, parsley and dill. Both are understood to influence apoptotic pathways as well as cell cycle arrest mechanisms and are believed to suppress cancer cell migration and invasion (see SI Appendix Table S2 and references therein). Figure 4 provides a visual summary of cancer-beating molecules (CBMs) associated with strong anti-cancer likeness. Each node in the figure denotes a particular food item, and node size is proportional to the number of CBMs in that food. The links between nodes reflect the pairwise correlation of the CBM profiles of foods; thus the clusters of foods seen in Fig. 4 illustrate molecular commonality between them. The foods that show the greatest diversity in CBMs include tea, grape, carrot, coriander, sweet orange, dill, cabbage and wild celery.
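The construction of the food map described above (node size from the number of CBMs, links from the pairwise correlation of CBM profiles) can be sketched as follows. This is an illustrative reconstruction, not the authors' code. The function names, the binary presence/absence encoding, the use of Pearson correlation and the `min_corr` cut-off are all assumptions, and the food and compound labels are placeholders rather than real data.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def food_network(food_to_cbms, min_corr=0.0):
    """Build node sizes and weighted links for a food map.

    food_to_cbms: dict mapping a food to the set of CBMs it contains.
    min_corr:     ASSUMED cut-off; only pairs with correlation above it are linked.
    Returns (node_size, edges), where edges maps (food_a, food_b) -> correlation.
    """
    compounds = sorted({c for cbms in food_to_cbms.values() for c in cbms})
    # Binary presence/absence profile of every food over all compounds.
    vectors = {
        food: [1.0 if c in cbms else 0.0 for c in compounds]
        for food, cbms in food_to_cbms.items()
    }
    node_size = {food: len(cbms) for food, cbms in food_to_cbms.items()}
    edges = {}
    for a, b in combinations(sorted(food_to_cbms), 2):
        r = pearson(vectors[a], vectors[b])
        if r > min_corr:
            edges[(a, b)] = r
    return node_size, edges


# Placeholder foods and compounds, for illustration only.
toy_foods = {
    "food_a": {"cbm_1", "cbm_2", "cbm_3"},
    "food_b": {"cbm_1", "cbm_2"},
    "food_c": {"cbm_4"},
}
sizes, links = food_network(toy_foods)
print(sizes)
print(links)
```

Foods that share many compounds end up linked with a high weight, which is what produces the clusters of molecularly similar foods in a layout of the resulting graph.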

Figure 3 Hierarchical classification of the top 110 predicted cancer-beating molecules in food with anti-cancer drug likeness of >0.7.

Figure 4 Profiles of cancer-beating molecules (CBMs) contained within selected foods that are highly likely to be effective in fighting cancer. Each node in the figure denotes a particular food item, and node size is proportional to the number of CBMs in that food. The links between nodes reflect the pairwise correlation of the CBM profiles of foods; thus the clusters of foods illustrate molecular commonality between them.

Food map and phytochemical synergy

The potential of food sources to exert their preventative or therapeutic capacity depends upon the bioavailability and diversity of the disease-beating molecular compounds contained therein41. A key limitation of the existing literature on food-based compounds is the largely one-dimensional view that is commonly taken, with studies tending to focus on specific molecular components in isolation, for example anti-oxidants40. It is accepted that regular consumption of fruits and vegetables can reduce the risk of carcinogenesis42. However, when antiproliferative agents acting in isolation have been subjected to clinical trial evaluation, they do not appear to consistently confer the same level of benefit. The point is simply illustrated by the apple: apple extracts contain bioactive compounds that have been shown to inhibit tumor cell growth in vitro. Interestingly, however, phytochemicals in apples with the peel preserved inhibit colon cancer cell proliferation by 43%, whereas this effect was reduced to 29% when apple without peel was tested42. These observations suggest that the successful implementation of food-based approaches in the fight against complex diseases such as cancer will rely on a consortium of biologically active substances, such as those present in whole fruits and vegetables, in order to increase the chances of success. The anti-cancer properties of a given food will thus be determined by (1) the additive, antagonistic and synergistic actions of its individual components and (2) the way in which these simultaneously modulate different intracellular oncogenic pathways. Both of these conditions are fulfilled in the case of tea, for example, which we found to exhibit strong anti-cancer drug-like properties compared with other food ingredients.
Tea is a rich source of anti-cancer molecules, including catechins (epigallocatechin gallate), terpenoids (lupeol) and tannins (procyanidin), all three of which exert strong and complementary anti-cancer effects by protecting against reactive oxygen species-induced DNA damage, suppressing inflammation, and inducing apoptosis and cancer cell cycle arrest, respectively. Correspondingly, several recent meta-analyses showed that the consumption of green tea was associated with delayed cancer onset, lower rates of cancer recurrence after treatment, and increased rates of long-term cancer remission43,44. Other examples include citrus fruits such as sweet orange, which contains didymin (citrus flavonoid), obacunone (limonoid glucose) and β-elemene, with strong anti-oxidant, pro-apoptotic and chemosensitization effects, respectively. The latter has strong effects, particularly against drug-resistant and complex malignancies across different types of cancers. The inverse associations between citrus fruit intake and the incidence of different types of cancers were confirmed by meta-analysis of multiple case-control and prospective observational studies45. With this understanding, we have constructed anti-cancer drug-like molecular profiles of over 250 different food sources (see Fig. 4 and SI Appendix Table S1).