Promiscuous vs. Tissue and Cell line specific Drugs

We assigned a cell line sensitive to a drug if its growth inhibition IC 50 value is less than 0.5 μM, otherwise we assigned it resistant. We computed percent of cell lines resistant to each drug as well as percent of cell lines resistant in a tissue (see Supplementary Table S1). It was observed that drug Panobinostat is highly promiscuous anticancer drug, effective against more than 99% cell lines (Fig. 1). Targets of this drug are histone deacetylases (HDAC) or lysine deacetylases (KDAC), enzymes that remove acetyl groups. Another drug Paclitaxel is sensitive against 83% cell lines, as well as it is effective against 100% of the cell lines belonging to Autonomic Ganglia tissue. As shown in Table S1, out of 24 drugs only five drugs (17AAG, Irinotecan, Paclitaxel, Panobinostat, Topotecan) are sensitive against more than 50% of the cell lines assayed. Most of the kinase drugs are only sensitive against limited number of cell lines, whereas most of cytotoxic drugs are sensitive against large number of cell lines. Furthermore, it was observed that certain drugs are tissue specific; for example drug Topotecan is sensitivity against 33% of the breast cell lines (66.7% resistant) where as it is sensitive against 87% cell lines of hematopoietic and lymphoid tissue. Similarly drug Irinotecan is sensitive against 100% cell lines belonging to Autonomic-Ganglia and soft-tissue. There are drugs that are sensitive only against few cell lines like Nutlin3 (effective against less than 1% cell lines) and resistant against most of the cell lines.

Figure 1 Illustration of tissue-specific response of 24 anticancer drugs, where right column contains names of drugs and bottom row has names of tissues. Each cell shows percent of sensitive cell lines of a tissue for corresponding drug. Full size image

Genomic factors responsible for drug resistance

Each cell line has its own genomic characteristics; these genomic features might be contributing towards drugs resistant. In this study, we investigated role of mutations/variation in genes and gene expression in drug resistant and sensitive cell lines. We identified top five genes corresponding to each drug that exhibit highest difference in genomic features between drug resistant and sensitive cell lines (Supplementary Table S2). Similarly, we also identified genes involved in important activities (e.g., drug membrane transport activity, growth arrest, epigenetic factors, DNA damage, tyrosine protein kinases and tumor suppressors) for each drug (Supplementary Tables S3–S9).

Variations and mutations

In order to examine the involvement of variation and mutation in drug resistance, we calculated the frequencies of mutant cell lines in both resistant and sensitive group of cell lines (see Methods section) for each gene (Supplementary dataset). In other words, higher the difference of frequencies of mutation between resistant and sensitive cell lines, greater will be the chances of contribution of this gene-mutation combination, in drug resistance. For example, gene PDE4DIP (Phosphodiesterase 4D anchoring protein) shows highest difference, it has 38.6% higher frequency of mutation in drug resistant (PF2341066) cell lines as compare to sensitive cell lines (Table 2, Supplementary dataset). It is interesting to note that PDE4DIP mutated in 241 cell lines and most of mutant cell lines around 99% were resistant for anticancer drug PF2341066.

Table 2 Gene showed most biased mutation (fraction of mutant cell lines is more in resistant than in sensitive cell lines) for each anticancer drug. Full size table

While focusing on different biologically relevant functions, we found that DNA damage related proteins like MSH3 and UBR5 genes have >= 12% higher frequency of mutation in TAE684 resistant cell lines as compare to sensitive cell lines (Supplementary Table S7). Similarly, tumor suppressors like TP53 have 19% higher frequency mutation for each of AZD6244 and PD0325901 resistant cell lines (Supplementary Table S9). Among epigenomic factors, SMARCA4 mutations are 11.5% higher in PF2341066 resistant cell lines as compare to sensitive cell lines (Supplementary dataset). PF2341066 resistance may also be brought about by DNA damage related proteins like ATR, FMN2 and UBR5 genes, showing 13%, 14% and 12% higher frequency of mutation in resistant cell lines (Supplementary dataset).

Similar to mutations, we also looked at the presence of variations. We found PRKCB (Protein Kinase C, Beta) has 35.7% higher frequency of variation in AZD0530 resistant cell lines as compared to sensitive cell lines (Supplementary Table S2B & Supplementary dataset). Similarly, we found that genes like CYP1A2 (Cytochrome P450, Family 1, Subfamily A, Polypeptide 2) among drug metabolism genes and SLC22A3 (Solute Carrier Family 22) in drug transmembrane transport activity genes showing higher frequency of variations in TAE684 (18%) and Paclitaxel (13%) resistant cell lines as compared to sensitive cell lines, respectively (Supplementary Table S3 and S4). In contrast, epigenomic factors like SMARCB1 (SWI/SNF Related, Matrix Associated, Actin Dependent Regulator of Chromatin, Subfamily B; relieves repressive chromatin structures) and KDM6A (Lysine K-Specific Demethylase 6A; catalyzes the demethylation of tri/dimethylated histone H3) show as much as 16% more variations in RAF265 and AZD6244 sensitive cell lines, respectively (Supplementary Table S6). Among DNA damage related proteins, NUAK1 (NUAK family, SNF1-like kinase, 1) and PLK3 (polo-like kinase 3) were found to be harboring more variations (21% and 18% respectively) in TAE684 resistant cell lines than sensitive ones (Supplementary Table S7).

Gene Expression

Since the expression of a gene may be associated with the drug resistance, we calculated the average expression of resistant and sensitive cell lines. The difference of two averages shows the relation between expression of that gene and probable drug resistance caused. For example, C3orf14 shows higher average expression (4.5 fold) in AZD0530 resistant cell lines as compare to sensitive (Supplementary Table S2C). Similarly, drug transmembrane transport proteins like ATP8B1 (transport phosphatidylserine and phosphatidylethanolamine across membrane) have high average expression in PD0325901 resistant cell lines as compared to sensitive cell lines, which means its expression may lead to PD0325901 resistance (Supplementary Table S4). In contrast, higher average expression of TSPAN1 (Tetraspanin 1; involved in cell development, activation, growth and motility) may lead to Topotecan sensitivity (Supplementary Table S4). Moreover, while DNA damage related genes like CCND1 (Cyclin D1) and CDC14B (Cell Division Cycle 14B) have higher average expression in PD0325901 resistant cell lines. EYA1 (Eyes Absent Homolog 1) was also found to have higher average expression in PD0325901 sensitive cell lines (Supplementary Table S7). Among tyrosine kinases, ERBB2 and ERBB3 show high average expression in Topotecan sensitive cell lines (Supplementary Table S8). The tumor suppressors genes such as EFNA1 (Ephrin-A1), ERRFI1 (ERBB Receptor Feedback Inhibitor 1), LATS2 (Large Tumor Suppressor Kinase 2) and PLK2 (Polo-Like Kinase 2) also show higher average expression in PD0325901 resistant cell lines and thus may be contributing to PD0325901 resistance (Supplementary Table S9).

Resistance in biologically related group of drugs

In order to understand genomic factors responsible for drug resistance, we group drugs based on biological mechanisms. All the 24 drugs can be broadly divided in three categories; (a) Kinase inhibitors, (b) Cytotoxic drugs and (c) other targeted therapies. These kinase inhibitors can be further divided based on inhibition of a kinase like EGFR inhibitors, c-MET inhibitors, ALK inhibitors, Raf kinase B Inhibitor, MEK1 and MEK2 Inhibitors and Multi-kinase inhibitors. The group of cytotoxic drugs includes inhibition of DNA Topoisomerase-I and Beta tubulin. Other targeted therapies involved in inhibition of MDM2, HSP90, Gamma Secreatase and HDAC. We analyzed mutation/variation of genes in both drug resistant and sensitive cell lines (Supplementary Tables S11 and S12). It was observed that PDE4DIP has 35% higher frequency of mutation in cMET inhibitors’ resistant cell lines as compare to sensitive cell lines. On the other hand, HSPA4 has 49% higher mutation frequency in HDAC inhibitor resistant cell lines. Similarly, MDM2 inhibitors have 46% higher frequency of mutation in sensitive cell lines as compare to resistant cell lines. In case of ABL inhibitors, ADCK1 gene variation is 40% higher in sensitive cell lines.

Similarly, we examined the expression and CNV of different class of inhibitors and identified genes (top 10 and bottom 10 genes) that exhibit maximum correlation between expression/CNV and IC50 (Supplementary Table S13). Here we found certain genes showing good correlation with particular group of drugs, for example, Kinase inhibitors like CDK4/6 inhibitors and ALK inhibitors have correlation of 0.46 and 0.26 with CTTN, which is also known as Src substrate cortactin. Similarly, ETV4 and DUSP6 expression is in negative correlation of 0.4 with MEK1/2 inhibitors’ resistance.

Prediction Models

In order to develop models for prediction drug prioritization, we developed support vector machine (SVM) based models. In this study, we implemented SVM based regression models using popular software package SVMlight (http://svmlight.joachims.org/).

Mutation based models

Since the machine learning approaches required fixed length data, therefore, we took the mutation in a gene in the form of binary where mutated and wild type gene was presented as 1 and 0 respectively. Since the number of features or the vector size for mutational input was 1650, we used cfsSubsetEval algorithm of WEKA package to reduce the number of features13. The feature selection with this method reduced the number of genes up to an average of 43 per drug. Still the total number of genes required for modeling all 24 drugs was 388 (Supplementary Table S10). Therefore, we additionally adopted the strategy of F-stepping14 on the features obtained from cfsSubsetEval algorithm. Here we looked at the performance of each model by removing every feature one by one as described in Methods section. This process further reduced the number of genes to an average of 20, which are required for modeling of 24 drugs (Supplementary Table S10). After feature selection, the total number of unique genes required for model development for all drugs are 268. SVM based models were developed using selected features and got Pearson Correlation Coefficient (PCC) from 0.24 (Nutlin3) to 0.58 (Irinotecan) with average correlation 0.43 for all drugs. The performances of models after cfsSubsetEval-based feature selection are given in the Supplementary Table S10.

Models based on gene variation

Similar to the mutational feature, we have taken general variations in a gene as binary input for developing models. The feature selection methods cfsSubsetEval and F-stepping reduced the average number of required genes per drug up to 54 and 25 respectively. After F-stepping filtering on variation data, model development for 24 drugs required a total of 425 unique genes (Supplementary Table S10). As per our hypothesis, the variations present in cancer, which may and may not be mutations, can be used as features to correlate with the drug sensitivity. The average performance of variation-based models in terms of PCC was 0.52 (Table 3), which is better than correlation achieved by mutation based model. In addition we achieved correlation more than 0.60 for certain drugs like Irinotecan, L685458, PD0332991.

Table 3 The performance of SVM models developed using various genomic features that include mutant genes, variant genes, CNV, expression, hybrid. Full size table

Gene expression based models

SVM-based models developed on gene expression as input features performed very well for all the 24 drugs. The initial feature selection with cfsSubsetEval algorithm reduced the average number of required genes for model development up to 66. The total number of unique genes required for 24 drugs, were 1296 in number (Supplementary Table S10). We further reduced the average required genes up to 28 with total 619 unique genes using feature selection technique F-stepping (Supplementary Table S10). Subsequently, the models were developed with these selected numbers of genes. The mean PCC for models of all the drugs was 0.73 (Table 3), which was significantly higher than the correlation achieved using models based on models and variation. Our expression based models got high correlations more than 0.8 for certain drugs like AZD6244, Irinotecan, Nilotinib, PD0332991. The performance of models without using F-stepping is also provided in the Supplementary Table S10.

Models using copy number variations

Since the copy number variation (CNV) deals with alteration of genome with large regions resulting in deletion or duplication of many genes, we have also taken CNV as one of the features for predictive modeling. The average number of genes required for all 24 drugs were 45 and 21, before and after F-stepping respectively. After F-stepping, the final models required a total of 877 unique genes (Supplementary Table S10). With an average Pearson correlation of 0.55 for all 24 drugs, CNV as a feature performed better than mutation and variation but less than expression. It is interesting to note that we got high correlation 0.71 and 0.68 for drug Nilotinib and PLX4720 respectively.

Models based on hybrid features

In order to improve performance of our predictive models, we used hybrid features. In case of hybrid features, we combined two or more than two types of genomic features. First, we generated hybrid features by combining genomic features based on mutation, expression and CNV. These hybrid features were reduced using feature selection techniques cfsSubsetEval and F-stepping and achieved average number of genes up to 80 and 34 respectively (Supplementary Table S10). These reduced hybrid features were used for developing prediction model and achieved average correlation of 0.78, which is better than models developed using different sets of features separately. Similarly, we also developed a number of models using combination of genomic features; none of the models got average correlation more than 0.78. In addition our models were highly successful to certain drugs like LBW242, PLX4720 where correlation was 0.9.

Significant mutational features

In order to further minimize the number of genes for modeling, we looked at the drugs, which showed significant difference of IC 50 between mutated and normal cell lines for a given gene. For this, we minimized the number of drugs to as few as 7 drugs, which led us to 30 genes with significant difference (p-value < 0.05) between IC 50 of mutated and normal cell lines. Further, we performed feature selection on significant genes, by F-stepping. Our SVMlight based models achieved a maximum correlation of 0.22 using selected features (Supplementary Table S10). We have included these models in our web server for mutation-based module, which provided drug prioritization prediction with minimum number of genes.

Significant variation features

Similar to significant genes in mutation as given above, we also looked at significant genes keeping variations in mind. We got 32 significant genes, which cover at least 7 drugs. We incorporated the variation profile of these genes as machine learning input and our best model achieved correlation of 0.22. The drug wise performances are mentioned in Supplementary Table S10.

Correlated expressional features

Similar to the significant genes in mutation, we also identified genes whose expression shows high correlated with growth inhibition IC 50 . We selected top 50 such genes, which are highly correlated for all drugs. Our SVM model, based on these 50 genes, attained average correlation of 0.46 for all 24 drugs (Supplementary Table S10).

Drug targets vs selected features

We computed correlation between drug targets and genes selected for developing prediction models to understand relationship between targets and selected features (Supplementary Table S14 to S17). It was observed that number of drug targets have good correlation with genomic features. As shown in Table S14, number of genes selected for developing mutation-based models also includes drug targets (e.g., EGFR, FLT3). In addition, number of drug targets shows high correlation with genes selected for developing prediction model; for example CDK6 shows high correlation 0.41 with gene PDPK1 selected for developing mutation based model for drug PD033299. In case of expression-based models, ALK (target of TKI258) and model gene IL26 have a correlation of 0.58 in expression. In CNV-based models also the target of Nilotinib (ABL1) is in good correlation (0.9) with model genes like QRFP and NUP214 (Supplementary table S16 & S17). In variation-based models, genes selected for developing model for drugs (e.g., 17AAG, PD0325901) also include their drug targets (e.g., HSP90AA1, MAP2K1). Expression-based model for Nutlin includes the drug target MDM2 in the selected model genes. Furthermore, among CNV-based models, drug targets ERBB2 (Lapatinib) and MET (PF2341066) are part of genes selected for developing prediction models.

Comparison with existing study

In the past, several large-scale pharmacogenomics studies have been done with human cancer cell lines, for example, Cancer Cell Line Encyclopedia (CCLE), NCI-60 and Cancer Genome Project. As shown in Table 3, most of our models perform better than CCLE models developed in previous study. CCLE models got maximum correlation 0.43 where as our best model achieved average correlation up to 0.78.

Implementation of web server

In order to serve the scientific community working on cancer biology, we have developed a user-friendly webserver for predicting growth inhibition of anticancer drugs against a cancer cell using its genomic features. In addition, various useful tools are also integrated which assist users in identifying most effective drugs. Brief description of major modules is given follows:

Prioritization

This module is built on the basis of our machine learning models developed based on mutations, variations, expressions, CNVs and their hybrid as input features. User may submit related genomic information or raw NGS data (VCF/ANNOVAR input file)15 and subsequently server will predict the effectiveness of drugs based on models developed in this study.

Drug calculator

This module has been built in order to find out the contribution of each gene in drug prioritization. This module is based on probabilistic approach. First, we have divided the cell lines into resistant (IC 50 > 0.5 μM) and sensitive (IC 50 <= 0.5 μM) cell lines for each of 24 anticancer cells. Then, we developed modules based on various genomic features as shown below.

Mutation and variation-based modules

After classifying the cell lines into resistant and sensitive cell lines, then we computed probability (frequency) of finding the mutations/variations in a particular gene in both resistant and mutant cell lines. User has to enter the HUGO symbol of mutated/variant gene(s) and this module will return the probability values of finding mutations/variations in both resistant and sensitive cell lines.

Expression and CNV-based modules

In these modules, we have calculated the average expression/CNV of a particular gene in resistant and sensitive cell lines together with their differences (DIFF values) for 24 anticancer drugs. By looking at these values user can identify, which drug will be more effective for cell line having high/low expression/CNV of the query gene.

Signature module

The purpose of this module is to identify important genes corresponding to each drug. In case of mutation, we computed average IC 50 of a drug for each gene in mutant & wild cell lines to understand effect of mutation on drug sensitivity or resistance. In case of expression, we computed average expression of each gene for resistant & sensitive cell lines for a given drug.

Mutation/variation based

In these modules, server displays each gene used in this study and average growth inhibition of selected drug in cell lines contains mutated and wild form of a gene. This module also displays difference in growth inhibition with significance (p-value) in difference. These modules provide comprehensive information about a selected drug in terms of effect of mutation or variation of a gene on its growth inhibition.

Expression and CNV Based

The aim of this tool is to find out the difference between average expression/CNV of a particular genes in resistant as well as sensitive cell lines. It allows the user to select a desired drug; then these modules display a list of all genes with their average expression/CNV in resistant and sensitive cell lines. These modules are very useful in understanding effect of expression/CNV of a gene on growth inhibition of a drug.