Identification of lncRNA genes associated with metastasis in the training dataset

The 508 breast cancer patients from the GSE25066 dataset, which was the largest patient dataset used in this study, were randomly divided into a training dataset (n = 254) and a testing dataset (n = 254). A univariate Cox proportional hazards regression analysis was performed to test whether the expression level of each lncRNA was significantly associated with differences in patient MFS in the training dataset, with MFS as the continuous variable and the expression value of lncRNA as the explanatory variable. Nine lncRNAs (RP11-482H16.1, AC010729.1, RP11-983P16.4, FOXD3-AS1, LINC01249, AC096574.4, AC015971.2, AC012487.2 and RP11-15A1.2) were found to be significantly correlated with patient MFS (p < 0.002, Table 1). Of these, RP11-482H16.1 and AC010729.1 showed a positive coefficient in univariate analysis, indicating that patients with a higher expression level of RP11-482H16.1 and AC010729.1 tended to have a shorter MFS compared with patients with lower expression level of RP11-482H16.1 and AC010729.1. For the seven remaining genes, we observed negative coefficients in univariate analysis, indicating that their high expression was associated with a longer MFS. Furthermore, other clinicopathological and molecular features, such as ER status (p = 0.001; HR = 0.427, 95% CI = 0.257–0.711) and ESR1 status (p = 0.005; HR = 0.486, 95% CI = 0.294–0.803) (Table 2), were also found to be significantly associated with MFS in univariate analysis. Therefore, to further obtain the predictive power of lncRNAs after correcting for these variables, the selected nine lncRNAs and these clinical features were fitted in a multivariate Cox proportional hazards regression model in the training dataset. With the expression levels and regression coefficients of the selected nine lncRNAs, a risk score model was built to predict each patient’s risk of developing metastasis, as follows: metastasis risk score = (0.527 × expression value of RP11-482H16.1) + (0.217 × expression value of AC010729.1) + (−0.319 × expression value of RP11-983P16.4) + (−0.422 × expression value of FOXD3-AS1) + (−0.347 × expression value of LINC01249) + (−0.355 × expression value of AC096574.4) + (−0.390 × expression value of AC015971.2) + (−0.389 × expression value of AC012487.2) + (−0.549 × expression value of RP11-15A1.2). Using receiver operating characteristic (ROC) curve analysis, the prognostic power of the nine-lncRNA signature-based risk score was evaluated against the advent of a metastasis event within 5 years as the defining point in the training dataset of 254 patients. As shown in Fig. 1A, the AUC of the nine-lncRNA risk score applied to the training dataset was 0.693, indicating good performance of the nine-lncRNA signature for predicting metastasis in the training dataset. The 254 patients of the training dataset were then divided into a high-risk group (n = 127) and a low-risk group (n = 127) using the median metastasis risk score as the cut-off. The Kaplan-Meier analysis for MFS as a function of the nine-lncRNA signature showed significant differences in MFS between high-risk and low-risk groups (log-rank test p < 0.001, Fig. 1B). Distribution of the nine-lncRNA risk score, metastasis status and lncRNA expression in patients in the training dataset are shown in Fig. 1C. Median MFS of the high-risk and low-risk groups was 2.4 years and 3.0 years, respectively. At 2 and 6 years, the respective absolute difference in the MFS between the high-risk and low-risk groups was 20.4% (73.9% versus 94.3%) and 24.0% (56.1% versus 80.1%). In univariate analysis, the hazard ratio of the high-risk group versus the low-risk group for MFS was 2.993 (p < 0.001, 95% CI = 1.728–5.184) (Table 2), indicating that the high metastasis risk scores from the nine-lncRNA signature was significantly correlated with shorter MFS.

Table 1 lncRNAs significantly associated with the MFS of breast cancer patients in the training set (n = 254). Full size table

Table 2 Univariate analysis on the lncRNA signature for MFS. Full size table

Figure 1 Establishment and performance evaluation of the nine-lncRNA signature for MFS of breast cancer patients in the training dataset. (A) The ROC curves for MFS prediction by the nine-lncRNA signature in the training dataset. (B) Kaplan-Meier analysis for MFS of breast cancer patients using the nine-lncRNA signature in the training dataset. (C) The distribution of the metastasis risk score, patients’ metastasis status and lncRNA expression in the training dataset. Full size image

Performance assessment of the nine-lncRNA signature for metastasis prediction in the testing dataset, the entire GSE25066 dataset and two independent datasets

To validate the prognostic power of the nine-lncRNA signature for the prediction of metastatic risk, the predictive model was applied to the testing dataset (n = 254) and the entire GSE25066 dataset (n = 508). Patients in the testing dataset were divided into a high-risk group (n = 126) and a low-risk group (n = 128) with the nine-lncRNA signature using the same predictive model and threshold from the training dataset. As in the training dataset, MFS of patients in the high-risk group (median 2.6 years) was significantly shorter than that of patients in the low-risk group (median 3.4 years) (log-rank test p < 0.001) (Fig. 2A). MFS of the high-risk group was 77.4% and 55.6% at 2 years and 6 years, respectively, while the corresponding rates in the low-risk group were 90.9% and 84.4%, respectively. The results from univariate analysis of the testing dataset showed that the association of the nine-lncRNA signature with MFS was also significant (HR = 2.794, 95% CI = 1.517–5.148, p < 0.001) (Table 2).

Figure 2 Performance evaluation of the nine-lncRNA signature for MFS of breast cancer patients in the testing dataset and entire GSE25066 dataset. (A) Kaplan-Meier curves for patients in the testing dataset (n = 254). (B) Kaplan-Meier curves for patients in the entire GSE25066 dataset (n = 508). The two-sided Log-rank test was performed to test the difference for MFS between the high-risk and low-risk groups. The number of patients at risk was listed below the survival curves. (C) The distribution of the metastasis risk score, patients’ metastasis status and lncRNA expression in the testing dataset. (D) The distribution of the metastasis risk score, patients’ metastasis status and lncRNA expression in the entire GSE25066 dataset. Full size image

Similar results were obtained from the entire GSE25066 dataset, in which the statistically significant association between MFS and the nine-lncRNA signature risk score was observed by univariate analysis (HR = 2.908, 95% CI = 1.934–4.372, p < 0.001). Patients in the high-risk group (n = 253) had significantly shorter MFS (median 2.3 years) than those in the low-risk group (n = 255) (median 3.1 years) (log-rank test p < 0.001) (Fig. 2B). MFS of the high-risk and low-risk groups was 75.7% and 92.6% at 2 years and 58.1% and 82.2% at 6 years, respectively. The distributions of lncRNA risk score, metastasis status and lncRNA expression of patients in the testing and GSE25066 datasets were then analyzed (Fig. 2C,D). As shown in Fig. 2C,D, the lncRNAs RP11-482H16.1 and AC010729.1 were expressed at high levels in patients with high-risk scores, whereas the remaining seven lncRNAs were expressed at high levels in the low-risk patients of the testing and entire GSE25066 datasets.

The prognostic value of the nine-lncRNA signature was further validated in two independent breast cancer patient datasets (GSE4922 and GSE1456). The nine-lncRNA signature showed similar results in these datasets, confirming its prognostic power to predict metastatic risk. Using the same predictive model and threshold as in the training dataset, 249 patients of GSE4922 were classified into high-risk (n = 99) and low-risk (n = 150) groups. Patients in the high-risk group exhibited significant shorter MFS (median 7.92 years) than those in the low-risk group (median 10.0 years) (log-rank p = 2.98E-02) (Fig. 3A). The nine-lncRNA risk score also classified patients of the GSE1456 dataset into high-risk (n = 67) and low-risk (n = 92) groups with different MFSs (median 6.3 years versus 7.4 years, log-rank p = 9.65E-03) (Fig. 3B). Univariate analysis was also performed on these two independent datasets. The hazard ratio of the high-risk group versus the low-risk group for MFS was 1.584 (p = 0.031; 95% CI = 1.043–2.404) in the GSE4922 dataset and 2.257 (p = 0.012; 95% CI = 1.198–4.250) in the GSE1456 dataset. The distribution of lncRNA risk score, metastasis status and lncRNA expression of patients in each independent dataset was consistent with our findings in the training dataset (Fig. 3C,D).

Figure 3 The nine-lncRNA signature-focused risk score in predicting MFS of two independent datasets. Differences in MFS were assessed between high-risk and low-risk groups for the GSE4922 dataset (n = 249) (A) and the GSE1456 dataset (n = 159) (B). All the p values of Kaplan-Meier analysis were calculated using a two-sided log-rank test. The number of patients at risk was shown below the survival curves. The nine-lncRNA risk score distribution, patients’ metastasis status and heatmap of the nine lncRNA expression profiles were analyzed in the GSE4922 dataset (C) and GSE1456 dataset (D). Full size image

Independence of metastasis prediction by the nine-lncRNA signature from other clinical variables

To further investigate whether the predictive ability of the nine-lncRNA signature was independent of other clinical variables, multivariate Cox regression analysis was performed using lncRNA risk score, age, ER status, ESR1 status and ERBB2 status as covariables in the GSE25066 and GSE4922 datasets (no clinical information was available for the GSE1456 patient dataset). The results showed that the nine-lncRNA signature risk score maintained a significant correlation with MFS after adjustment for other clinical variables (HR = 1.791, 95% CI = 1.105–2.904, p = 0.018 for GSE25066; HR = 1.598, 95% CI = 1.018–2.508, p = 0.042 for GSE4922) (Table 3), indicating that the prognostic value of the nine-lncRNA signature was an independent prognostic factor for the prediction of developing metastatic breast cancer. Next, a data stratification analysis was performed according the ER status, which stratified 747 breast patients of three datasets with known ER status into an ER-negative group and an ER-positive group. The risk score of the nine-lncRNA signature further classified 239 patients with ER-negative status into high-risk (n = 191) and low-risk (n = 48) groups with significantly different MFSs (median MFS 2.1 years versus 3.2 years, log-rank test p = 1.16E-03, Fig. 4A). For 508 patients with ER-positive status, the prognostic value of the nine-lncRNA signature between high-risk (n = 159) and low-risk (n = 349) was similar (median MFS time 3.3 years versus 4.0 years, log-rank test p = 3.07E-02, Fig. 4B). These results suggested that the nine-lncRNA signature was also an independent prognostic variable in the subgroups stratified by ER status.

Table 3 Multivariate analysis on the lncRNA signature for MFS. Full size table

Figure 4 Kaplan-Meier analysis for MFS of breast cancer patients using the nine-lncRNA signature in the subgroups stratified by ER status. (A) Kaplan-Meier curves for breast cancer patients with ER-negative status (n = 239). (B) Kaplan-Meier curves for breast cancer patients with ER-positive status (n = 508). The differences between the two curves were access by the two-sided log-rank test. The number of patients at risk was listed below the survival curves. Full size image

Functional prediction of the nine-lncRNA signature

To infer the potential function of the lncRNAs included in the prognostic signature in breast cancer metastasis, an integrative analysis of lncRNA-mRNA functional association was performed, as previously described24,25,26. mRNAs co-expressed with the nine lncRNAs were identified by examining the correlation between expression levels of lncRNAs and those of mRNAs in the 508 breast cancer patients of the GSE25066 dataset. The expression levels of 321 mRNA were positively correlated with those of at least one of the nine prognostic lncRNAs (Pearson correlation coefficient >0.40). The results from Gene Ontology (GO) functional enrichment analysis showed that 321 mRNAs were significantly enriched in 51 GO terms (see Supplementary Table S1 online), which clustered in cell cycle, translation, DNA damage, signal transduction, response to stimulus, cell death, development, differentiation and apoptosis (Fig. 5). Three Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways were found to be enriched by 321 mRNAs, including cell cycle, oocyte meiosis and the p53 signaling pathway (Fig. 5), which are all known to be associated with breast cancer27,28,29.