Called SNVs

After variant calling and SNPiR post-processing, 96,025 and 213,020 unique variants with read coverages >=10 were found in the NSCLC and breast cancer datasets, respectively. SNV matrices were created by limiting SNVs to those that had at least 3 non-zero values across samples in the NSCLC dataset and 6 non-zero values in the breast cancer dataset. Tables 1 and 2 show the distributions of SNVs for each dataset based on region of origin as determined by RefSeq annotations.

Table 1 Lung: Distribution of SNVs by Region Full size table

Table 2 Breast: Distribution of SNVs by Region Full size table

Model performance

sPLS-DA was used to create models using different subsets of SNVs based on type (as shown in Tables 1 and 2) over different ranges of K. For the NSCLC dataset, the classification target was patient relapse within a 3 year period, labeled as Relapse, R, or Disease Free, DF. For the breast cancer dataset, the model sought to classify each sample as being from either the cancer subtype estrogen receptor positive, ER+, or hormone receptor triple negative, TR-.

Disease outcome in non-small cell lung cancer

Table 3 contains measures of performance for models trained on different subsets of SNVs in the NSCLC dataset. What is immediately apparent is that the nonsynonymous exonic model had the best performance by a large margin. The model performed better than chance as seen by its AUC and predictive accuracy. Furthermore, permutation tests reveal that the performance of the model is dependent on the true label groupings (p=0.016), thereby, suggesting that selected SNVs reflect a true biological phenomenon. With the addition of synonymous variants in the model, however, performance reflects that of a failed model. Interestingly, the distribution of AUCs from the nonsynonymous exonic model was significantly better than the distribution of AUCs from the gene expression model (Student’s t-test, p<0.001). Though some of the other models have AUC values that seem to be better than chance, their performances are not significant based on permutation test values.

Table 3 NSCLC model performances Full size table

Hormone receptor status in breast cancer

Table 4 contains measures of performance for models trained on different subsets of SNVs in the larger breast cancer dataset. Strikingly, all models tested have high AUC distributions and high predictive accuracies. In contrast to the NSCLC dataset, the addition of synonymous SNVs in the exonic model produced significantly better performance (Student’s t-test, p<0.001), while selecting roughly half as many features. The distributions of AUC values from the intergenic and all-SNVs models were significantly higher than from all other models (Student’s t-test, ps <0.001) - being able to accurately predict TR- samples 96 and 97 % of the time. Interestingly, models with relatively small amounts of starting features (5’ UTR, up/downstream models, and RNA-editing) were also able to produce accurate results. Most importantly, the biological significance of these models is supported by the result that all had permutation test p-values <0.001. Of note, the model trained on gene expression features had significantly better performance than all models trained on SNV features (Student’s t-test, p <0.001), however, the all-SNVs model surpassed the gene expression model when classifying TR- samples (Student’s t-test, p=0.018), which in this dataset can be considered the experimental group. Furthermore, only one of the top 15 features selected by the gene expression model corresponds to a gene where from a top predictive SNV feature is found: ZNF552.

Table 4 Breast model performances Full size table

Predictive SNV features: disease outcome in non-small cell lung cancer

The rankings of selected nonsynonymous features in the NSCLC dataset were significantly different than univariate rankings from Fisher’s exact and Wilcoxon rank sum tests as determined by Friedman rank sum tests (ps < 10−16). Figure 2 visualizes allele fraction distributions of the top 15 predictive SNVs identified during the creation of nonsynonymous exonic SNV model. SNVs chosen are more abundant in one of the groups and/or have higher AF values. Both classes were equally representative in the top selected SNV features (7 DF vs 8 R). Figure 3 contains a heatmap produced during hierarchical clustering analysis of the top 15 selected features. Not surprisingly, SNV-DA was able to prioritize features that segregate the two groups. The heatmap also visualizes co-occurring features, one example being the three SNV features lying in TACC3, which form their own cluster.

Fig. 2 NSCLC Exonic SNVs Alelle Fractions. Box plots of allele fraction distributions of the top 15 predictive SNVs identified during the creation of the nonsynonymous exonic SNV model in the NSCLC dataset. Only allele fractions >0 are plotted, though zero values contribute to box plot distributions Full size image

Fig. 3 Hierarchical Clustering of NSCLC Exonic SNVs A heatmap demonstrating the distinct clustering of sample groups and the co-occurrence of top nonsynonymous exonic SNV features in the NSCLC dataset. NA values are filled in with black Full size image

Additional file 1 contains the list of nonsynonymous exonic SNVs selected in this dataset as well as their respective model loadings. Limiting analysis to genes where selected SNVS are located, 19.49 % of gene/sample pairs showed significant ASE - an enrichment compared to that of the null distribution (p<0.001, 2.64X greater than the mean of the null distribution).

Table 5 contains annotations of the top 15 selected features. With 11 of the top 15 features being from genes that have previous associations with cancer, it is clear that the proposed methodology was able to identify features that have possible implications to cancer biology. Several examples include: CEACAM6, which is routinely used as a tumor marker in several cancers (including lung cancer) [9]; MTRR, in which variants have a well-documented association with increased risk of NSCLC [77]; and three SNVS in TACC3, whose high expression is associated with poor prognosis in NSCLC [37].

Table 5 Top 15 nonsynonymous exonic SNV features in NSCLC Full size table

Predictive SNV features: breast cancer hormone receptor status

Except for up/downstream and 5’UTR models, the rankings of selected SNV features for each model were significantly different than univariate rankings from Fisher’s exact test (ps <10−6). When comparing rankings to those produced by Wilcoxon rank sum test, all were significantly different (ps <10−7). The similarity to univariate rankings in the two models is likely a result of a small initial feature set size and/or the types of patterns seen in the data. For example, though the 5’ UTR model produced rankings that were significantly different than rankings from Wilcoxon rank sum test, they were not significantly different than those from Fisher’s exact test (p =0.515), suggesting that predictive power of selected SNVs in this model result more from the differential abundance of AF values (number of nonzeros) than with the differential magnitude of AF values between groups, which Wilcoxon rank sum test seeks to quantify. The distribution of SNVs by region of origin selected during the training of the all-SNVs model is given in Fig. 4. Notice that the majority of selected SNVs are located in traditional coding regions.

Fig. 4 Distribution of Selected SNV Features in All-SNVs Breast Cancer Model. 3’UTR, exonic, and intronic SNVs dominate the distribution of selected SNV features Full size image

Additional files 2, 3, 4, 5, 6, 7 and 8 contain the lists of SNV features selected during the creation of each model. In the following sections, the top 15 SNVs for selected models are highlighted to demonstrate that the genes in which they are reside are enriched for associations with cancer.

Exonic

Figure 5 visualizes allele fraction distributions of the top 15 predictive SNVs identified during the creation of the exonic SNV model in the breast cancer dataaset. SNVs chosen are more abundant in one of the groups and/or have higher AF values. Figure 6 demonstrates the clustering of samples by hormone receptor status. Though not a perfect clustering, the top 15 (of 50) features adequately segregate the two groups. Genes where selected exonic SNVs are located showed an enrichment of significant ASE events −16.67 % of gene/sample pairs (p<0.001, 1.75X greater than random mean).

Fig. 5 Breast Exonic SNVs Alelle Fractions. Box plots of allele fraction distributions of top 15 predictive SNVs identified during the creation of the exonic SNV model in the breast cancer dataset. Only allele fractions >0 are plotted, though zero values contribute to box plot distributions Full size image

Fig. 6 Hierarchical Clustering of Breast Exonic SNVs. A heatmap demonstrating the distinct clustering of sample groups and the co-occurrence of top exonic SNV features in the breast cancer dataset. NA values are filled in with black Full size image

Ten of the top 15 SNV features lie in genes that have previous associations with cancer; 6, of which, are located in genes specifically associated with breast cancer (Table 6). Several outstanding examples include: HRPAP20, a hormone regulated breast cancer oncogene that promotes malignant tumor growth [40]; ARHGEF16, which promotes migration and invasion of breast cancer cells [29]; FASN, whose upregulation is associated with HER2+ tumors and metastatic lesions [38] (confirmed in this dataset: FC = 3.6, p<10−5, DESeq2); and USP35, amplification of which is associated with significantly worse prognosis in breast cancer patients and is associated with ER- tumors [21]. The latter demonstrates a seemingly paradoxical result as the SNV in question is largely abundant in ER+ tumors. In fact, in this dataset USP35 is significantly upregulated in ER+ tumors (FC =4.2, p<10−5), perhaps providing evidence that conflict with the observation that USP35 amplification is associated with ER- tumors. It is important to note that the 5 SNV features that do not have previous associations with cancer lie in genes that are uncharacterized, 3 of which from the same gene, LRRC56. These SNVs thus implicate genes that may provide future insight into the biology of the different cancer subtypes.

Table 6 Top exonic SNVs in breast Full size table

Intronic

Figure 7 visualizes allele fraction distributions of the top 15 predictive SNVs identified during the creation of the intronic SNV model in the breast cancer dataaset. The majority of identified SNV features have obvious differences in AF distributions, except for the SNV in HFM1 (chr1:91852851 A → G) which has small AF values. Interestingly, this gene is differentially expressed in this dataset with ER+ tumors expressing less reads (FC=−1.9,p=0.011), suggesting a possible association of the SNV in the downregulation of the gene. Though not significant, the expression of HFM1 is decreased in samples in which the SNV is present (Student’s t-test, p=0.084). Furthermore, selected intronic SNVs lie in genes that are enriched with ASE events, 24.04 % of gene/sample pairs (p<0.001, 3.16X greater than random mean).

Fig. 7 Breast Intronic SNVs Alelle Fractions. Box plots of allele fraction distributions of top 15 predictive SNVs identified during the creation of the intronic SNV model in the breast cancer dataset. Only allele fractions >0 are plotted, though zero values contribute to box plot distributions Full size image

Eleven of the top 15 SNVs are located in genes that have previous associations with cancer (Table 7); 9 of which are associated with breast cancer specifically. Some interesting examples include: 2 SNVs that are located in HDAC7, which has been shown to promote breast cancer cell survival and resistance to therapy [80]; 2 SNVs lying in CTBP1 and CTBP2, both of which are associated with breast cancer cell proliferation - the former being a regulator of BRCA [23, 53]; CD151, whose deregulation is predictive of poor outcome in node-negative lobular breast carcinoma [70]; and SNED1, high expression of which is correlated with poor outcome for ER-/PR- breast cancer patients [61] (interestingly SNED1 is significantly upregulated in ER+ in this dataset: FC = 2.8, p<10−4). Also of note, several variants lie in genes that were previously implicated in the exonic model: two SNVs in ARHGEF16, one in EML4, and one in LMNTD2.

Table 7 Top intronic SNVs in breast Full size table

3’ UTR

Figure 8 visualizes allele fraction distributions of the top 15 predictive SNVs identified during the creation of the 3’ UTR SNV model in the breast cancer dataset - demonstrating differential abundance and/or AF distributions. 20.41 % of gene/sample pairs were enriched with significant ASE events (p<0.001, 1.71X greater than random mean).

Fig. 8 Breast 3’UTR SNVs Alelle Fractions. Box plots of allele fraction distributions of top 15 predictive SNVs identified during the creation of the 3’UTR SNV model in the breast cancer dataset. Only allele fractions >0 are plotted, though zero values contribute to box plot distributions Full size image

Similar to the previously described models, the top 15 SNV features are located in genes enriched with cancer associations, 10 of 15 (Table 8); 6 of which are specifically associated with breast cancer. Some interesting examples include: NOTCH1, which has been shown to promote recurrence in breast cancer [1]; IKBKE, a breast cancer oncogene which is upregulated in TR- breast cancers [6]; LAMB1, a breast cancer biomarker [85]; PDXK, which is associated with breast cancer relapse and metastasis [35]; and COL1A1, which is upregulated in progesterone receptor positive breast cancer patients [50]. Interestingly, the top predictive SNV, rs11515 (a SNP located in the tumor suppressor gene CDKN2A), is associated with poor survival in glibolastoma patients and is moderately associated with breast cancer risk [24, 71]. Also of note, one of the SNVs lies in LRRC56, which has appeared in the top selected features in the two previous models.

Table 8 Top 3’ UTR SNVs in breast Full size table

RNA-editing

Figure 9 visualizes allele fraction distributions of the 12 predictive SNVs identified during the creation of the model using SNVs located at known RNA-editing sites. Nine of the SNVs lie in genes that have previous associations with cancer (Table 9). The top SNV, a synonymous mutation lying in NEIL1, is the only exonic SNV chosen in the model. In fact, there is a paucity of exonic SNVs in the total set (n=8). Interestingly, RNA-editing sites and SNPs in this gene have been identified in other cancers [65]. Another interesting example is a SNV lying in NEAT1, a lncRNA whose overexpression is associated with poor prognosis in squamous cell carcinoma patients [19]. Intriguingly, one of the RNA-editing SNVs lies in ZNF552, which is also implicated in the 3’ UTR and gene expression models.

Fig. 9 Breast RNA-editing SNVs Alelle Fractions. Box plots of allele fraction distributions of 12 predictive SNVs identified during the creation of the RNA-editing SNV model in the breast cancer dataset. Only allele fractions >0 are plotted, though zero values contribute to box plot distributions Full size image

Table 9 Top RNA-editing SNVs in breast Full size table

Intergenic

Something immediately noticeable about the features selected by this model is that 66 of the 771 features are located proximal to each other spanning 8,610 bp (chr9: 68, 416, 905–68, 425, 515; hg19). Interestingly, 36 of these are further concentrated in a 1,635 bp region (chr9: 68, 418, 108–68, 419, 742) defined by a strong H3K27Ac peak and histone methylation peaks (H3K4me1, H3K4me3, H3K27me3, and H3K36me3) present in chronic myelogenous leukemia cell lines (K562), a moderate peak in human embryonic stem cells (H1-ESC), as well as a DNaseI hypersensitivity site and evidence of CTCF binding (Fig. 10) [25]. The enrichment of these peaks provide strong evidence that this locus is regulated in K562. Because this region is highly enriched with selected SNVs associated with the ER+ subtype, we have termed this locus estrogen receptor positive associated hotspot (ERPAHS). In fact, 9 of the top 15 selected intergenic SNVs are located in this concentrated region (Fig. 11). Furthermore, the only characterized transcripts within 100 kbp of ERPAHS are two immediately flanking miRNAs (mir4477A and mir4477B) and a pseudogene FRG1JP, all of unknown function [68], though the two miRNA were shown to be expressed by a subset of lymphoma cell lines in a published study [33].

Fig. 10 Breast Intergenic Hotspot A UCSC genome browser view demonstrating the region of enriched selected intergenic SNVs in genomic position 9q12[43]. This locus is defined by the presence of several regulatory markers including CTCF binding, H3K27Ac, histone methylation marks, and a DNaseI hypersensitvity site, all of which were found K562 cell lines. The enrichment of top selective intergenic SNV features suggests that this locus is associated and regulated in ER+ primary tumors Full size image

Fig. 11 Breast Intergenic SNVs Alelle Fractions. Box plots of allele fraction distributions of top 15 predictive SNVs identified during the creation of the intergenic SNV model in the breast cancer dataset. Only allele fractions >0 are plotted, though zero values contribute to box plot distributions. SNVs lying in the densely populated region of the estrogen positive associated hotspot are highlighted in red Full size image

To assay whether this region is regulated in ER+ and TR- breast samples, we sought to determine if the locus is differentially expressed. The location of ERPAHS (as defined by the 8,610 bp region ± 1,000 bp; chr9: 68, 415, 905–68, 426, 515) was included in the hg19 RefSeq annotation used by featureCounts during read assignment. Importantly, this region does not overlap the annotated transcripts mentioned above. Strikingly, this region is highly expressed in both ER+ and TR- breast samples and is upregulated in ER+ tumors (FC = 1.63, p=0.0265, Fig. 12). Furthermore, allele fraction values of the most predictive intergenic SNV, rs113539941 (chr9: 68418921 C →T), were significantly correlated with increased expression of this locus (Fig. 12). When comparing expression across samples classified with the binary presence of rs113539941, there is a more pronounced level of differential expression (FC = 2.67, p<10−5, Fig. 13), suggesting a functional role for associated mutations at this locus. This example provides a proof of concept of the utility of the proposed approach to identify biologically associated regions, genes, and/or SNVs. In fact, there are 7 other regions identified via the selected intergenic SNVs that have at least 10 co-occurring SNVs. These loci provide potential new avenues for breast cancer research - further work should be done to identify their biological significance and functional roles in cancer.

Fig. 12 Correlation of ERPAHS Expression with Hormone Receptor Status and rs113539941 A boxplot demonstrating the high expression of ERPAHS in both ER+ and TR- primary breast cancer samples as well as the significant upregulation of ERPAHS in ER+ tumors compared to those of TR-. Increased expression of ERPAHS is also correlated with higher allele fraction values in tumors. The grey region represents the 95 % CI from linear regression Full size image