The Prediction of Survival in Hepatocellular Carcinoma Based on A Four Long Non-coding RNAs Expression Signature

Prognostic stratification in hepatocellular carcinoma (HCC) patients is still challenging. Long non-coding RNAs (lncRNAs) have been proven to play a crucial role in tumorigenesis and progression of cancers. The aim of this study is to develop a useful prognostic index based on lncRNA signature to identify patients at high risk of disease progression. We obtained lncRNA expression profiles from three publicly available datasets from Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA). By the risk scoring method, we built an individualized four-lncRNA signature (HCCLnc-4) to predict survival of HCC patients in the discovery set (ROC curve, AUC: 0.83, 95% CI: 0.65-1.00, P < 0.05, Kaplan-Meier analysis and log-rank test, P < 0.01). Similar prognostic value of HCCLnc-4 has been further verified in two other independent sets. Stratified analysis and multivariate Cox regression analysis suggested the independence of HCCLnc-4 for prediction of HCC patient survival from traditional clinicopathological factors. Area under curve (AUC) analysis suggested that HCCLnc-4 could compete sufficiently with, or might be even better than classical pathological staging systems to predict HCC patient prognosis in the same data sets. Functional analysis and network analysis suggested the potential implication of lncRNA biomarkers. Our study developed and validated the lncRNA prognostic index of HCC patients, warranting further clinical evaluation and preventive interventions.


Introduction
Hepatocellular carcinoma (HCC), as the most common type of liver cancers, ranks as the fifth most common cancer globally and is the second most common cause of death by cancer worldwide [1]. Accurately estimating HCC patients' prognosis, choosing effective treatment protocol for high-risk patients, and prolonging the survival time are greatly important in clinical practice. Previous studies suggested numerous factors related to the prognosis of HCC, including gender [2], age [3,4], infection of HBV [5,6], cirrhosis [7], alpha-fetoprotein (AFP) levels [8,9], and various pathophysiological characteristics of tumor [8,10]. However, because of highly clinically and molecularly heterogeneity of human HCC, and hence our limited understanding of the mechanisms underlying tumorigenesis and development of HCC, evaluation based on these traditional prognosis factors is not comprehensive. Moreover, given the systematic measurement biases due to the experimental batch effects, a more personalized prognostic evaluation is needed for individual patients to guide clinical treatments [11][12][13]. With the advance of high-resolution microarrays and sequencing technology, numerous tissue-and serum-associated HCC biomarkers as potential diagnostic, prognostic, and therapeutic Ivyspring International Publisher targets were presented [14]. Among them, one class of newly discovered non-coding RNAs (ncRNAs), long non-coding RNAs (lncRNAs), have obtained increasing attention [15].
In this study, from the perspective of lncRNA expression pattern, applying three datasets of gene expression profiles from Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) for 447 HCC patients, we investigated associations between expression levels of lncRNAs and survival outcomes of HCC patients. By the risk scoring method, we established and validated an individualized four-lncRNA signature in different datasets. Functional analysis suggested the four lncRNAs and related genes might be involved in known HCC-related pathophysiologic processes.

HCC patient cohorts
The GSE14520 (https://www.ncbi.nlm.nih.gov/ geo/query/acc.cgi?acc=GSE14520) dataset contained two batches of samples. The smaller batch included 22 HCC samples, generated by Affymetrix Human Genome U133A 2.0 Array platform, were used to train a prognostic signature. The larger batch included 225 HCC samples, generated by Affymetrix HT Human Genome U133A Array, were used as the first validation dataset (test cohort-1). The second validation dataset was composed of 200 TCGA HCC samples from TANRIC database (TANRIC: An interactive open platform to explore the function of lncRNAs in cancer.), denoted as test cohort-2. All non-coding transcripts included in the three datasets were for patients treated with surgery only. In addition, the RNA and miRNA expression data, as well as the clinical data of HCC patients were obtained from TCGA (https://tcga-data.nci.nih.gov/ tcga/). The flow chart of the study is described in Figure 1.

Probe re-annotation and identification of lncRNA
The probe sets ID represented on the Affymetrix microarrays were checked by NetAffx Annotation Files ( 'anti-sense', 'sense_ overlapping', '3prime_overlapping_ncrna', or 'misc_RNA'. Next, for the probe sets from the Refseq database, those IDs beginning with 'NR' were retained, and transcript IDs labeled with 'NP' were deleted. We removed probe set IDs annotated as 'microRNA', 'snoRNAs', ' pseudo-genes' and other small RNAs. Finally, 842 lncRNA-specific probes corresponding to 351 lncRNAs were obtained for further analysis. When multiple probes were mapped to the same lncRNA, expression values of these probes were integrated by using the median value to represent the expression value of the single lncRNAs. All of the raw data were processed using affy and related R packages with Robust Multi-array Average approach for background normalization as per the package instruction.

Statistical analysis
The association between expression levels of lncRNAs and HCC patients' overall survival (OS) was evaluated by univariate Cox proportional hazards regression analysis (P < 0.001 as selection criteria). Then the combinations of lncRNAs related to HCC patients' OS were analyzed by repeated comparison analysis to identify the best prognostic model for predicting the OS of patients. Next, the lncRNA expression signature, termed HCCLnc-4, was established by the risk scoring method, as described by Lossos et al [21]. Then the risk score value that produced the shortest distance to the point of perfect prediction of the ROC curve, was selected as the cutoff point. By the cutoff value that was determined by the ROC curve, patients were divided into low-risk and high-risk groups. Kaplan-Meier survival analysis and log-rank test were used to compare the difference in OS time between the high-risk group and low-risk group. Given other clinicopathological factors associated with HCC patient OS time as confounding variables, stratified analysis and multivariate Cox regression analysis were performed to evaluate the independence of the lncRNA expression signature to predict the OS outcomes of HCC patients. Area under curve (AUC) analysis was used to determine the superiority of HCCLnc-4 for prediction of HCC patient OS comparing with traditional clinicopathological staging systems. SPSS version 19.0 software (SPSS Inc., Chicago, IL) and GraphPad Prism 5.0 (GraphPad Software, San Diego, CA) was used for statistical analysis and graphics, respectively. The statistical significance level was 0.05.

Functional enrichment analysis
Pearson correlation coefficients were calculated between lncRNAs and protein-coding genes (PCGs) or miRNAs. Functional enrichment analysis for the correlated PCGs was performed using DAVID Bioinformatics Tool (https://david.ncifcrf.gov/, version 6.7) [22]. The network was generated by highly correlated lncRNAs-miRNAs-mRNAs and visualized by Cytoscape 3.2 and ggalluvial package of R software [23].

Patient characteristics
Three HCC patient cohorts with definitive OS information were included in this study. The patient cohort from GSE14520 (GPL571 platform) was selected as a discovery cohort. The patient cohorts from GSE14520 (GPL3921 platform) and TCGA (TANRIC) were used as independent validation cohorts to verify the robustness of the lncRNA biomarkers (hereafter referred as test cohort-1, and test cohort-2, respectively). All 447 patients used in this study were clinically and pathologically diagnosed with HCC. The mean OS time was 31.8 months (range, 1.8-63.8 months, discovery cohort), 40.4 months (range, 2-67.4 months, test cohort-1), and 24.6 months (range, 0-115.9 months, test cohort-2), respectively. All the statistical information was summarized in Table 1 and Table S1.

Identification of lncRNAs associated with OS of HCC patients from the discovery set
As shown in Figure 1, the discovery set was firstly analyzed to identify the potential prognostic lncRNA biomarkers, and then the validation data sets were conducted for validation. In the discovery set, univariate Cox proportional hazards regression analysis was performed for lncRNA expression data, and lncRNAs related to patient OS (P < 0.001 as selection criteria) were identified. Then the combinations of these lncRNAs related to patient OS were analyzed, and a model consisting of four lncRNAs was identified as the best prognostic model for predicting the OS of patients. All these four lncRNAs were verified in the NCBI database and classified as ncRNAs in this website. The detailed information of these four lncRNAs is shown in Table  2. Subsequently, these four lncRNAs were integrated into a predictive signature (hereafter inferred as HCCLnc-4) by risk scoring method that was described in "Materials and methods", to predict the prognostic outcomes, as follows: HCCLnc-4 risk score = (3.6392 × expression value of ENSG00000234608) + (-2.9565 × expression value of ENSG00000242086) + (-6.9077 × expression value of ENSG00000273032) + (1.5738 × expression value of ENSG00000228463).

Association of HCCLnc-4 and patient OS in the discovery set
Univariate Cox regression analysis shows that the levels of HCCLnc-4 were significantly associated with patient survival (HR: 11.697, 95% CI: 2.257-60.618, P < 0.01, see in Table 3). As shown in Figure 2a, the AUC value by ROC analysis was 0.83 (95% CI: 0.65-1.00, P < 0.05), indicating HCCLnc-4 had high sensitivity and specificity to predict the prognostic survival of patients in the discovery set. And the risk score value of -8.77, which produced the shortest distance to the point of perfect prediction of the ROC curve, was selected as the cutoff point. Using the same cutoff point produced from the ROC curve, patients were divided into low-risk (n = 16) and high-risk groups (n = 6). Kaplan-Meier analysis and log-rank test were then performed. Compared with the low-risk group, patients in the high-risk group had significantly shorter OS time (log-rank test P < 0.01, see in Figure 2b).

Validation of HCCLnc-4 in additional independent test cohorts
Next, the robustness of HCCLnc-4 was tested in the other two validation cohorts. Univariate Cox regression analysis shows that the levels of the HCCLnc-4 were significantly associated with patient survival both in test cohort-1 (HR: 3.711, 95% CI: 2.263-6.086, P < 0.01, see in Table 3), and in test cohort-2 (HR: 2.46, 95%CI: 1.580-3.831, P < 0.01, see in Table 3). The area under ROC curves was 0.69 (95% CI: 0.62-0.76, P < 0.01) and 0.62 (95% CI: 0.54-0.70, P < 0.01) for test cohort-1 and test cohort-2, respectively ( Figure 3a and Figure 3b). In addition, Kaplan-Meier analysis and log-rank test show that compared with the low-risk group, patients in the high-risk group had significantly shorter OS time in the two validation cohorts, respectively (in both cohorts, log-rank test P < 0.01), as seen in Figure 3c and Figure 3d. Kaplan-Meier survival curve shows the correlation between HCCLnc-4 and the overall survival of patients. A two-sided log-rank test was performed to evaluate the survival differences between the two curves. The cutoff point was the value that produced the shortest distance to the point of perfect prediction in the ROC curve. The statistical significance level was 0.05. HCCLnc-4 and the overall survival of patients from test cohort-1 (GPL3921). A two-sided log-rank test was performed to evaluate the survival differences between the two curves. (d) Kaplan-Meier analysis shows the correlation between HCCLnc-4 and the overall survival of patients from test cohort-2 (TANRIC). A two-sided log-rank test was performed to evaluate the survival differences between the two curves. The statistical significance level was 0.05.

Independence of HCCLnc-4 for prediction of patient OS from clinicopathological factors
To distinguish whether HCCLnc-4 could serve as a predictor independent of other clinicopathological parameters, we conducted Kaplan-Meier analysis and log-rank test after stratification by other factors, using the same cutoff value that was determined by ROC curve of the whole group, and multivariate Cox regression analysis. For the discovery cohort, after stratification by age, Kaplan-Meier analysis and log-rank test show that the level of HCCLnc-4 has significant association with OS both in patients < 65 years old (log-rank test P < 0.01, see Figure 4a), and in patients >= 65 years old (log-rank test P < 0.05, see Figure 4a). After stratification by tumor size, Kaplan-Meier analysis and log-rank test show that the level of HCCLnc-4 has significant association with OS both in patients with small tumors (<= 5 cm) (log-rank test P < 0.01, see Figure 4b) and in patients with large tumors (> 5 cm) (log-rank test P < 0.05, see Figure 4b). After stratification by alanine aminotransferase (ALT) levels, Kaplan-Meier analysis and log-rank test show that the levels of HCCLnc-4 have a significant association with OS in patients with ALT levels <= 50 U/L (log-rank test P < 0.05, see Figure 4c), however has no significant association with OS in patients with ALT levels > 50 U/L (log-rank test P > 0.05, see Figure  4c). Similarly, after stratification by AFP levels, Kaplan-Meier analysis and log-rank test show that the levels of HCCLnc-4 have a significant association with OS in patients with AFP levels <= 300 ng/ml (log-rank test P < 0.05, see Figure 4d), however has no significant association with OS in patients with AFP levels > 300 ng/ml (log-rank test P > 0.05, see Figure  4d). For test cohort-1, after stratification by other clinicopathological factors, Kaplan-Meier analysis and log-rank test show that the levels of HCCLnc-4 has significant association with OS in patients < 65 years old (log-rank test P < 0.01, see Figure S1a), with small tumors (log-rank test P < 0.01, see Figure S1b), with large tumors (log-rank test P < 0.01, see Figure  S1b), male (log-rank test P < 0.01, see Figure S1c), with TNM staging I-II (log-rank test P < 0.01, see Figure  S1d), and with TNM staging III-IV (log-rank test P < 0.01, see Figure S1d), respectively, however has no significant association with OS in patients >= 65 years old (log-rank test P > 0.05, see Figure S1a), and female (log-rank test P > 0.05, see Figure S1c), respectively. In addition, after stratification by more clinicopathological factors, Kaplan-Meier analysis and log-rank test show that the levels of HCCLnc-4 has significant association with OS in patients with AFP levels <= 300 ng/ml (log-rank test P < 0.01, see Figure S2a), with AFP levels > 300 ng/ml (log-rank test P < 0.01, see Figure S2a), with ALT levels <= 50 U/L (log-rank test P < 0.01, see Figure S2b), with ALT levels > 50 U/L (log-rank test P < 0.01, see Figure S2b), with cirrhosis (log-rank test P < 0.01, see Figure S2c), with chronic carrier status of Hepatitis B Virus (HBV) infection (log-rank test P < 0.01, see Figure S2d), and with active viral replication chronic carrier status of HBV infection (log-rank test P < 0.01, see Figure S2d), respectively, however has no significant association with OS in patients without cirrhosis (log-rank test P > 0.05, see Figure S2c), and normal status of HBV infection (log-rank test P > 0.05, see Figure S2d), respectively. Similarly, after stratification by different Barcelona Clinic Liver Cancer (BCLC) staging, and Cancer of The Liver Italian Program (CLIP) staging, Kaplan-Meier analysis and log-rank test show that the levels of HCCLnc-4 has significant association with OS in patients with BCLC staging (0-A) (log-rank test P < 0.01, see Figure S3a), with BCLC staging (B-C) (log-rank test P < 0.05, see Figure S3a), and with CLIP staging (0-2) (log-rank test P < 0.01, see Figure S3b), respectively, however has no significant association with OS in patients with CLIP staging (> 2) (log-rank test P > 0.05, see Figure S3b).
For test cohort-2, after stratification by other clinicopathological factors, Kaplan-Meier analysis and log-rank test show that the levels of HCCLnc-4 has significant association with OS in patients < 65 years old (log-rank test P < 0.01, see Figure S4a), >= 65 years old (log-rank test P < 0.05, see Figure S4a), male (log-rank test P < 0.01, see Figure S4b), with TNM staging I-II (log-rank test P < 0.01, see Figure S4c), with TNM staging III-IV (log-rank test P < 0.05, see Figure S4c), of American Indian, Alaska Native, and Asian (log-rank test P < 0.01, see Figure S4d), and in white patients (log-rank test P < 0.05, see Figure S4d), respectively, however has no significant association with OS in female patients (log-rank test P > 0.05, see Figure S4b), and in black or African American patients (log-rank test P > 0.05, see Figure S4d).
Additionally, multivariate Cox regression analysis suggests that after adjusted by other clinicopathological factors, the levels of HCCLnc-4 was significantly related to patient survival in all the three cohorts (discovery cohort, HR: 36.227, 95% CI: 3.807-344.704, P < 0.01; test cohort-1, HR: 3.352, 95% CI: 2.011-5.590, P < 0.01, and test cohort-2, HR: 2.243, 95% CI: 1.356-3.711, P < 0.01, see Table 3). All the results above indicate that HCCLnc-4 could predict the survival of HCC patients well independent of other clinical performances. Table 4. ROC analysis of clinical and pathological parameters and the four-lncRNA signature in the three datasets.

Functional prediction of HCCLnc-4
To explore the function of HCCLnc-4, we identified highly positively or negatively correlated PCGs with at least one of 4 lncRNAs by calculating the Pearson correlation coefficient (P < 1E-10 as selection criteria) of paired lncRNA and PCGs expression profiles. The functional enrichment analysis of Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway revealed that PCGs positively correlated with lncRNAs were involved in rRNA processing process and spliceosome pathway, while PCGs negatively correlated with lncRNAs were enriched in GO terms mostly related to metabolic process and KEGG pathways, including Complement and coagulation cascades, Metabolic pathways, and PPAR signaling pathway, etc. The HCCLnc-4 associated biological processes and pathways can be found in Table S2.

Discussion
The ability to recognize patients with high risk would aid the decision for HCC management. Several clinical prognostic indicators have been proposed to discriminate patients who stand to benefit from liver transplantation. However, they are still insufficient or not sensitive enough for predicting HCC patients at high risk for recurrence and selecting those at low risk [27,28]. Highly sensitive and accurate molecular prognostic biomarkers are sorely needed [29].
As the newly identified class of ncRNA, lncRNAs participate in tumor proliferation, metastasis, invasion, energy regulation, and tumor-initiating cells self-renewal [30]. lncRNA based prognostic indexes have been proved to be useful to predict survival, metastasis, and recurrence of tumor patients [31][32][33]. And established by different methods, mRNA [34], lncRNA [35,36], miRNA [37], or mixed molecule type signature [38] had shown the prognostic value of HCC. For instance, using six prediction machine-learning algorithms, Yuan and colleagues established a metastasis-related signature comprised of five mRNAs and one lncRNA, which presented the well prognostic value of HCC [38]. In the present study, using the risk score method and directly based upon different prognostic status, we got an HCC prognosis related four-lncRNA signature and explored the potential implication of lncRNA biomarkers by functional analysis.
Recent studies found that some of the microarray probes on the commonly used arrays are likely to map to lncRNAs [39,40], which represents a cost-effective way to obtain lncRNA expression profiles. In the present study, by repurposing the expression profiles with OS information of HCC patients from GEO database and TCGA dataset, we obtained lncRNA expression data among the patients, and by Cox regression analysis and risk scoring method, we established a four-lncRNA signature (HCCLnc-4) for predicting HCC patient survival in the discovery cohort. ROC analysis and Kaplan-Meier analysis suggested that HCCLnc-4 was significantly correlated with the survival status of HCC patients. The prognostic value of HCCLnc-4 was further validated in the other two validation cohorts. Stratified analysis and multivariate Cox regression analysis indicated that HCCLnc-4 for HCC patient survival prediction was universal among different subgroups and independent of other prognostic factors. And compared to traditional evaluation systems, such as TNM staging, BCLC staging and CLIP staging systems, a similar or even better HCC prognostic value of HCCLnc-4 could be seen in the same data sets in our research. In addition, recently, Li et al. established a three-gene prognostic signature for patients with HCC, which contained UPB1, SOCS2, and RTN3 [34]. GSE14520 (containing two batches of samples, GPL571 and GPL3921) was also used as a validation cohort in Li et al's study, and in their study, AUC in time-dependent ROC curve was 0.645 for 1-year survival, 0.638 for 2-year survival, 0.618 for 3-year survival, 0.607 for 4-year survival, and 0.622 for 5-year survival, respectively, compared to 0.83 (GPL571, 95% CI: 0.65-1.00, P < 0.05) and 0.69 (GPL3921, 95% CI: 0.62-0.76, P < 0.01) of HCCLnc-4 in our study, indicating well or possibly higher sensitivity and specificity of HCCLnc-4 to predict the prognostic survival of HCC patients compared to the recently published mRNA-based signature in the same data sets. These results suggested that HCCLnc-4 was an individualized and robust prognostic marker to predict HCC patient survival.
Though more than ten thousand lncRNAs have been discovered in human, functional studies of these lncRNAs are still in its early stages. MAPKAPK5 antisense RNA 1 (MAPKAPK5-AS1, ENSG00000234608), one of 4 HCC prognosis-related lncRNAs, was differentially expressed between HCC tissues and adjacent non-tumor tissues and negatively associated with OS of HCC patients [41]. Similarly, the expression level of MAPKAPK5-AS1 is positively correlated with HCCLnc-4 and negatively associated with the OS of HCC patients in our study. Upregulation of MAPKAPK5-AS1 in HCC patients with vascular invasion was also observed [41]. Through network analysis, MAPKAPK5-AS1 is co-expressed with the genes involved in ribosome and spliceosome pathways, which have been partly elucidated elsewhere [42,43]. In addition, another lncRNA, DiGeorge syndrome critical region gene 9 (DGCR9, ENSG00000273032), one of the 4 HCC prognosis-related lncRNAs in our study, is also one lncRNA of a 9-lncRNA risk score system for the prognostic prediction of patients with HBV-positive HCC [44]. In the study conducted by Dr. Sun and colleagues [44], using one RNA-sequencing dataset from TCGA and three datasets from GEO, they built a 9-lncRNA risk score system, and the expression level of DGCR9 was negatively associated with the 9-lncRNA risk score and positively correlated with OS of HBV-positive HCC patients. In our study, a similar association between the expression level of DGCR9 and HCC patient OS was also observed, though our study included patients with HBV either positive or negative. Furthermore, in the network analysis (Figure 5a and Figure 5b), we showed that DGCR9 (ENSG00000273032) might act as a ceRNA to intervene in HCC pathogenesis by competitively associated to miR-122 and miR-885, which have been found to be involved in liver diseases or liver cancer [45][46][47]. All these results call for further studies to conclude a causal association between these relationships, which may contribute to novel targeted therapy.
Several limitations should be noted in our study. Firstly, we obtain lncRNA expression profiles from the commonly used arrays, so not expression profiles of all lncRNAs were analyzed in our study. Secondly, our model could not include all the potential important prognostic factors, such as steatosis and nonalcoholic fatty liver disease (NAFLD), because the information of these factors was unavailable in the publicly published datasets. Thirdly, prospective studies with larger sample sizes are needed to validate the prognostic value of HCCLnc-4 in HCC patients.
Fourthly, the potential biological mechanisms behind the associations between HCCLnc-4 and OS of HCC patients remains to be investigated in further studies.
In conclusion, our study suggested the potential of lncRNA signature as novel candidate biomarkers in the prognosis of HCC, or potential biological functions of lncRNAs in hepatocarcinogenesis.