Gene co-expression relationship and network analyses identified discrete modules across different cancers

Statistically significant differentially expressed genes were identified in normalized gene expression data from the TCGA depository for eleven cancer types and individually subjected to Weighted Gene Correlation Network Analysis (WGCNA). For each cancer this achieved clustering into a specific number of gene groups referred to as modules (Supplementary Dataset 1). Modules exhibiting a high topological overlap were visualized on a dendrogram using Dynamic Tree Cut algorithm (Fig. 1a).

Figure 1 (a) Cluster dendrograms for (a-i to a-xi) BRCA, COAD, GBM, KIRC, KIRP, LGG, LUAD, LUSC, READ, UCEC, OVCA, each colour represents a specific module, n = number of modules; (b) Percentage module preservation across different cancers (computation details in methods)—numbers in diagonal denote module preservation within each cancer type, those in bold denote strong (across BRCA-COAD-OVCA-READ) and moderate module preservation (GBM-LUAD); (c) Strongly Preserved modules and gene sets BRCA-COAD-OVCA-READ; d. Moderately Preserved modules and gene sets in GBM-LUAD. Full size image

Module preservation and validation reveals co-regulated, preserved common genes between different cancers

Module preservation analysis was carried out at two levels to probe the robustness of conservation or stability–.

Identification of common enriched modules and gene sets across different cancer types

To address our hypothesis that some modules and genes are conserved across eleven different cancers and possibly define common functions, we evaluated the co-expression relationships between modules by applying module preservation statistics Z summary and medianRank across the eleven cancers (Fig. 1b; Supplementary Dataset 2; Supplementary Table S1). Four cancers viz. BRCA, COAD, OVCA and READ expressed strong module preservation, with OVCA exhibiting maximum co-expression with BRCA-COAD-GBM-LUAD-READ, followed by BRCA with COAD-LUAD-OVCA-READ-GBM; COAD with BRCA-OVCA-READ and READ with BRCA-COAD-OVCA (Fig. 1b). Moderate module preservation was expressed by two cancers viz. GBM (with OVCA-LUAD-BRCA-COAD) and LUAD (with BRCA-COAD-GBM-OVCA). However, KIRC, KIRP, LGG, LUSC and UCEC modules show very poor preservation with any cancer. Further extracting details of preserved modules and genes from the contingency table (Supplementary Dataset 2; Supplementary Table S2) revealed three strongly conserved modules in BRCA-COAD-OVCA-READ and one of which not only preserved but was in fact, the only module conserved in GBM-LUAD (Fig. 1c). This affirms the BRCA-OVCA-COAD-READ group as expressing strong module preservation, while GBM-LUAD present moderate module preservation. One common gene sets was associated with each of the 3 preserved modules viz. Set 1 (n = 161), Set 2 (n = 50) and Set 3 (n = 44) in the BRCA-OVCA-COAD-READ group, while a subset of Set 1 genes (n = 55; referred to as the Set 1-s) was conserved in the GBM-LUAD (Fig. 1d; Supplementary Dataset 2; Supplementary Tables S2, S3). Despite a high preservation with OVCA-LUAD-BRCA-COAD, GBM stratifies into the moderate group due to conservation of a single module. Together, these results provided a first definitive proof of conserved modules and genes across different cancers.

Identification of robustly conserved modules across different validation datasets

Expression datasets other than the TCGA available in the public domain for the six conserved cancers were used to validate robustness of network module identification across different platforms and various cohorts (GEO24; Supplementary Dataset 3; Supplementary Table S1). Although this analysis revealed variations in numbers of WGCNA modules and genes, high preservation was evident across different datasets representing the same cancer type (Supplementary Dataset 3; Supplementary Figs S1-S2; Supplementary Dataset 3; Supplementary Table S2). Set 1 (n = 161) and Set 2 (n = 50) genes were moderately to strongly conserved in all validation datasets within the BRCA-COAD-OVCA-READ group, while Set 1-s genes were strongly conserved in the GBM-LUAD group (Supplementary Dataset 3; Supplementary Fig. S3); poor validation of Set 3 genes led to their exclusion from further analyses.

Important hub genes and transcriptional regulators are identified within each preserved gene set

Literature annotation identified Set 1 and Set 1-s to comprise of Transcription Factors (TFs), co-stimulatory molecules, cytokines-cytokine receptors, chemokines-chemokine receptors, cell adhesion molecules and signaling kinases involved in inflammatory and immune responses in the context of cancer development and progression, while Set 2 genes include TFs and ECM molecules involved in cell adhesion, angiogenesis, migration and invasion (Fig. 2). We also probed co-expression networks that could indicate interactions and similar functionalities between enriched genes within WGCNA. Identifying significant genes within each set as those hubs with at least ten interacting partners revealed several specific important hubs within Set 1 and Set 2 genes for all 6 cancers (Supplementary Dataset 4; Supplementary Fig. S1; Supplementary Table S1). Interestingly, although direct interactions between Set 1 and Set 2 hubs did not exist in any cancer, cross-talk through specific interactors was identified in OVCA (Supplementary Dataset 4; Supplementary Fig. S1). We scanned the interacting networks to find key TFs that could be crucial in regulating gene expression during development, differentiation and function of immune cells and/or tumor cell metastases identified AIF1, IKZF1, MNDA, SAMSN1, EOMES, GFI1 and KLHDC7B as significant TF-hubs within Set 1/Set 1-s gene networks in the six conserved tumor types (Supplementary Dataset 4; Supplementary Table S2). These TFs are known to mediate specific biological functions in immune regulation and/or cell migration. TBX21 (Tbet), EOMES and IKAROS regulate the differentiation and function of inflammatory Th1 cells; GFI modulates the differentiation of thymic regulatory CD4 T cells, while MNDA controls TRAIL induced apoptosis of granulocyte-macrophage progenitor cells. Th1 cells are known to inhibit cancer growth whereas Tregs makes the tumor microenvironment more immunologically suppressive. SAMSN1 is reported as being highly expressed in GBM tissue and has been suggested as a prognostic marker25. The Evi group of transcription factors are known to inhibit granulocyte—erythroid lineages and promote megakaryocytic differentiation26. Although KLHDC7B is reported to be highly expressed in breast cancer, its functions are not well studied27. Enriched Set 2 TFs (PRRX1-SNAI1-SNAI2-ZEB1-ZEB2-TWIST1) are extensively studied in the context of cell migration during embryonic development and tumor metastases and mediate EMT28,29,30,31,32. These TFs are also additionally known to be associated with other features besides EMT including oncogenic transformation, resistance to apoptosis and senescence, cancer cell stemness and can also promote tumor angiogenesis33,34,35,36.

Figure 2 AIF1 and PRRX1 as master regulators interact with several genes to mediate specific functionalities. Full size image

Amongst the Set 1 TFs, AIF1 appears to be the most significant hub with maximum network interactions and strong correlation with its interacting partners in the BRCA-COAD-OVCA-READ-GBM-LUAD group (Supplementary Dataset 4; Supplementary Fig. S2). Thus AIF1 within its hub may control expression of other TFs responsible for the development and function of immune cells involved in tumor growth. It is known to regulate several important co-stimulatory molecules in the tumor microenvironment that plays vital role in biological processes like immune response, inflammation, cell survival, apoptosis, proliferation and angiogenesis, is involved in regulating expression of cytokine and cytokine receptors that play a crucial role innate as well as adaptive inflammatory host defenses, apoptosis, angiogenesis, cell growth etc. AIF1 also controls key chemokine receptors involved in migration of immune cells into the tumor micrenvironment and can also modulate the various kinases and signaling molecules; it is involved in inflammatory responses, promotes cell proliferation via activation of NFκB/cyclin D1 pathway and facilitates tumor growth which implicate its association with immune modulation37,38. In addition to immunomodulation, our data indicates that several important cell adhesion molecules such as ICAM1, CD38, CD33, CD37 and CD52 may be regulated by AIF1. Further, such regulation need not be restricted only to cancer cells but could extend further to infiltrating immune and stromal cells towards making the tumor microenvironment favorable for growth and metastases. Such direct or indirect interactions with several key regulatory genes towards regulation of differentiation and function of immune cells, controlling inflammation and promoting a tolerogenic microenvironment during tumor establishment makes AIF1 a candidate ‘master’ gene regulator of immunomodulation.

Within the Set 2 interacting networks, PRRX1 was the lone significant, large conserved TF-hub in the BRCA-COAD-OVCA-READ strongly preserved tumors although other TFs including EVI2A, EVI2B, TBX21, SNAI1, SNAI2, ZEB1, ZEB2 and TWIST1 were also identified from their extended interactions with the primary important hubs. Like AIF1, PRRX1 expression strong correlated with its interacting partners, several of which were conserved across the four cancers (Supplementary Dataset 4; Supplementary Fig. S2). PRRX1 expression is implicated in the induction of EMT, while lowered expression confers colonization and stemness abilities39. Our results corroborate these findings by revealing its interactions with several EMT-TFs (SNAI1, SNAI2, ZEB1, ZEB2, TWIST1), ECM components (ADAM12, CDH11, COL5A1, THBS2, SPARC, FAP, VCAN, COL1A1, COL1A2) and also with ZNF469, which is a TF associated with collagen synthesis and organization40. A possible regulatory role for PRRX1 is thus assigned in the regulation of metastases. Effectively, a cumulative effect of inflammatory and tolerogenic immune responses in the tumor microenvironment controlled by AIF1 and PRRX1 are identified to dictate tumor growth and metastasis. These findings together very importantly, suggests that the two conserved gene sets drive biological functions contributing to some of the hallmarks of cancer41. To confirm that regulation of immunomodulation and metastasis by master regulators AIF1 and PRRX1 were not specific to the TCGA datasets, we independently validated gene interactions regulated by Set 1/Set 1-s and Set 2 for six cancers in in other datasets of six cancers (Supplementary Dataset 3; Supplementary Table 1). Consistent with the TCGA-based results, gene interactions in these validation datasets affirmed regulation of genes and TFs involved in immune responses and metastasis to be governed by AIF1 and PRRX1 respectively (Supplementary Dataset 5).

Stratification with Set1/Set 1-s and Set 2 genes correlates with clinical outcome

We further defined the 44 common significant hub markers amongst the preserved gene sets (36 and 8 genes from Set 1 and Set 2 respectively) across the 4 strongly preserved cancers (BRCA-COAD-OVCA-READ) as classifiers and applied these in the stratification of TCGA tumor sample datasets into 4 classes (Supplementary Dataset 4; Supplementary Table S3). In each cancer, Class 1 represents expressed classifiers of both gene sets; Class 2 and Class 3 are associated with upregulated Set 1 and Set 2 classifiers respectively while Class 4 represents downregulated classifiers of both sets. The 2 moderately preserved cancers (GBM-LUAD) were stratified into 2 classes using 33 Set 1-s gene classifiers. GSEA affirmed upregulation of Set 1 associated pathways in Class 1—Class 2 samples and their downregulation in Class 3—Class 4 for BRCA-COAD-OVCA-READ. In GBM-LUAD, Class 1 and Class 2 samples exhibited upregulated and downregulated pathways associated with Set 1-s genes (Supplementary Dataset 6; Supplementary Figs S1-S2; Supplementary Dataset 6; Supplementary Tables S1-S2). Set 1 & Set 1-s genes influenced pathways of inflammatory responses and are related with cancer development and progression. Similarly, Class 1 and Class 3 across all BRCA-COAD-OVCA-READ were positively associated with Set 2 driven pathways whereas class 2 and 4 show a negative association (Supplementary Dataset 6; Supplementary Fig. S3; Supplementary Dataset 6; Supplementary Table S3). Three Set 2 pathways including ECM-receptor interactions (govern tissue and organ morphogenesis to maintain cell and tissue structure and function), focal adhesion (cell motility, proliferation, differentiation, regulation of gene expression and cell survival) and integrin 1 (adhesion receptors in cell-extracellular matrix interactions) were common to all 4 cancers.

To find a correlation between these classifiers and patient prognosis, expression of Set 2 classifiers appears to be significantly correlate with poor clinical outcome in BRCA patients (Classes 1 and 3 vs. Classes 2 and 4; Fig. 3a-i,a-ii) while Set 1 classifier expression in COAD correlate significantly with poor clinical outcome (Classes 1 and 2 vs. Classes 3 and 4; Fig. 3b-i,b-ii). In OVCA, Class 1 patients in which classifiers of both gene sets are upregulated, present the worst survival as compared to any other class (Fig. 3c-i,c-ii), that could possibly result from complementation of biological functions to drive aggressive disease and poor patient survival. On the other hand, Class 4 READ patients in which classifiers of both gene sets are downregulated show best prognosis as compared to any other class (Fig. 3d-i,d-ii), suggestive that expression of classifiers from either of the two gene sets can contribute to adverse prognosis. Expression of the 33 Set 1-s classifiers in GBM and LUAD Class 1 tumors associates them with poor prognosis as compared to Class 2 patients in which these genes are downregulated (Fig. 3e-i,e-ii,f-i,f-ii). Taken together, such stratification identifies differential correlation between expression of classifiers and patient survival.

Figure 3 Heat map representation of 44 classifiers (Set 1: n = 36; Set 2: n = 8) expression based stratification into four classes and associated Kaplan Meier analysis for survival in (a) BRCA (b) COAD (c) OVCA (d) READ; and similarly for 33 Set 1-s classifier expression based stratification into two classes in (e) GBM and (f) LUAD. Full size image

This class-associated prognosis enticed us to further winnow out significant cancer-specific risk genes (prognostic factors) within the Set 1, Set 2 and Set 1-s genes (methods detailed in Supplementary-Dataset 7). Thus 23 risk genes were identified in BRCA patients and all of them were associated with Set 2 and included 4 classifiers viz. COL5A2, FAP, COL5A1 and ADAM12 (Supplementary Dataset 7; Supplementary Fig. S1; Supplementary Dataset 7; Supplementary Table S1; p ≤ 0.01). In case of COAD patients, 18 Set 1 genes were identified as risk genes, 5 of which viz. PLEK, CLEC4A, LCP2, ITK and CD53 were classifiers (Supplementary Dataset 7; Supplementary Fig. S2; Supplementary-Dataset 7; Supplementary Table S1; p ≤ 0.01). 12 risk genes from Set 1 and Set 2 (n = 2 and n = 10 respectively) were predicted for OVCA cancer, none of which were amongst the classifiers (Supplementary Dataset 7; Supplementary Fig. S3; Supplementary-Dataset 7; Supplementary Table S1; p ≤ 0.01). 39 risk genes were predicted in READ patients (Supplementary Dataset 7; Supplementary Fig. S4; Supplementary Dataset 7; Supplementary Table S1; p ≤ 0.01) 30 of which were from Set 1 (including 15 classifiers viz. PTPRC, CD33, CLEC4A, CYBB, AIF1, CD53, PLEK, EVI2A, CD74, CD48, LCP2, ITGB2, CD52, LAPTM5, ARHGAP15) and 9 from Set 2 (single classifier CDH11). This further strengthens the possibility that complementation of biological functions between Set 1 and Set 2 genes may work against survival in OVCA and READ. 24 of the 37 predicted risk genes in GBM were classifiers (Supplementary Dataset 7; Supplementary Fig. S5; Supplementary Dataset 7; Supplementary Table S1; p ≤ 0.01), while 23 of the 26 predicted risk genes in LUAD were classifiers (Supplementary Dataset 7; Supplementary Fig. S6; Supplementary Dataset 7; Supplementary Table S1; p < 0.05).

Common risk genes effectively predicts survival of cancer patients

The above risk signatures unique for each individual cancer involve a large number of genes for establishing prognosis. We hypothesized that a practical approach towards simplifying the same without compromising the predictive potential may be possible by applying selective common genes as opposed to an extensive panel of cancer-specific genes that would be a convenience in moving prognostic predictions to a next level of applications. To evaluate this, we further identified six Set 1/Set 1-s common risk factors in COAD, OVCA, READ, GBM and LUAD (PLEK, LCP2, CD53, MNDA, NCF2, CYBB; p < 0.05; Fig. 4a-i; OVCA was an outlier and failed to demonstrate any commonality in this set of genes). Similarly three Set 2 genes were identified as being common to BRCA-OVCA-READ (WISP1, CTSK, ADAM12; p < 0.05; Fig. 4b-i). Principal component analysis (PCA) of these risk genes to observe their correlation and joint behavior across patients of COAD-READ-GBM-LUAD and BRCA-OVCA-READ cancers was carried out using first three principal components (PCs) that capture 96% of expression variance. Gene weights for first PC have same sign and similar values and represent a fairly uniform shift in the overall expression (Supplementary Dataset 7; Supplementary Table S2). Further projection of COAD-READ-GBM-LUAD or BRCA-OVCA-READ tumor samples onto a plane defined by the first two PCs stratified these into two discrete clusters based on a either a positive and negative shift along the first PC that represents differential survival (Fig. 4a-ii,b-ii). The first three PCs account for 99% of expression variance for genes and as above, similar sign and weights for first PC reflect a uniform shift in overall expression (Supplementary Dataset 7; Supplementary Table S2).

Figure 4 (a-i) Venn diagram representing common risk genes (p < 0.05); (a-ii) Projection of tumor samples onto the plane defined by the first and the second principal components using 6 risk genes for COAD-READ-GBM-LUAD group, High positive shift along first PC suggests overall decrease in gene expression whereas negative shift denotes overall increase in gene expression; (b-i) Venn diagram and (b-ii) PCA plot for samples of BRCA-OVCA-READ group. Full size image

To further evaluate the efficacy of such practicality, comparative risk predictions between individual and common risk gene signatures in each cancer were determined. Thus, risk associations with gene expression were first established to (prediction of high or low risk; p < 0.05) followed by correlations with actual existing patient risk, that led to computation of sensitivity and specificity scores. In all cancers individual risk gene sensitivities and specificities ranged from 62–78% and 58–100% respectively, while the 6 common risk gene signature in exhibited 57–66% sensitivity and 65–80% specificity COAD-READ-GBM-LUAD and the 3 common risk gene signature ranges for sensitivity and specificity were 59–61% and 51–90% respectively in BRCA-OVCA-READ (Fig. 5a). Statistical evaluation of prognostic efficacies between individual and the common risk signatures using Mcnemar’s test assigned significance to prediction of high and low risk GBM-BRCA-OVCA over COAD-READ-LUAD patients (p < 0.05; Supplementary Dataset 7; Supplementary Tables S3-S4). Within the former group, the common risk signature prognostication was higher than that of individual risk genes for GBM patients, remained similar for in BRCA and showed marginally lowered specificity in OVCA (Fig. 5a).

Figure 5 (a,b) Comparison of prognostic efficacies (Sen-Sensitivity, Spe-Specificity) between (a) individual (indvl) vs. common (c) risk signatures and (b) individual vs. 15-gene signature for six cancers, $-p < 1 × 10−8, §-p < 1 × 10−7; (c–h) K-M plots of survival in predicted risk groups based on differential expression of specific gene subsets within the 15-gene signature in BRCA, COAD, OVCA, READ, GBM or LUAD cancers in HR (high risk) and LR (low risk) samples; (i) GBOCRL-IIPr chip for the 15-gene signature (first 6 genes panel (left to right)- 6 common Set 1 risk genes, second 3 genes panel—3 common Set 2 risk genes, third 3 genes panel—3 most significant genes from COAD, last 3 genes panel—3 most significant genes from LUAD); (j) Graphical representation of risk assessment of GBOCRL- IIPr panel through re-sampling in 100 random datasets of the six cancers. Full size image

To further improve the prognostic prediction value for COAD-READ-LUAD, different approaches were tested. Since READ cancer patients showed similar prognostic efficacies for Set 1 and Set 2 genes, evaluating all 9 common genes (6 from Set 1 and 3 from Set 2) enhanced sensitivity and specificity to 55% and 88% respectively (Fig. 5b). On the other hand, in COAD and LUAD a combination of the 6 common risk genes from Set 1 with 3 most significant individual risk genes for each cancer type was tested and observed to enhance prognostic efficacy (Fig. 5b; Supplementary Dataset 7; Supplementary Table S5). Kaplan-Meier analysis further supported these derivations since the defined gene subsets could successfully predict patients as being at either at a high or low risk and were further supported by Kaplan-Meier analysis (Fig. 5c–h; Supplementary Dataset 7; Supplementary Fig. S7). Thereby, 6 common Set 1 genes in GBM, 3 common Set 2 genes in BRCA-OVCA, 6 common Set 1 + 3 common Set 2 genes in READ, 6 common Set 1 + 3 most significant individual genes in COAD and LUAD are highly prognosticative. In conclusion, this 15-gene signature as identified through a comparative risk prediction between the individual and common risk genes represents in GBM-BRCA-OVCA-COAD-READ-LUAD (GBOCRL) cancers, the biological functions of immunomodulation and invasion (II) with a high prognostic (Pr) efficacy. This we termed as a GBOCRL-IIPr panel (Fig. 5i). As a final validation, we applied a random re-sampling strategy as described earlier42 within the TCGA data to generate 100 random datasets for each cancer type in which the GBOCRL-IIPr panel was further evaluated (Supplementary Dataset 8; Supplementary Table 1). The robust predictive power achieved in this reassessment confirmed the statistical significance of GBOCRL-IIPr genes in these six cancers (Fig. 5j).