A novel seven-gene signature as Prognostic Biomarker in Hepatocellular Carcinoma

Purpose: Our study is designed to develop and certify a promising prognostic signature for hepatocellular carcinoma (HCC). Materials and methods: We retrospectively analyzed mRNA expression profiles and clinicopathological data fetched from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) datasets. We formulated a prognostic seven-gene signature composed of differentially expressed mRNAs (DEmRNAs) between HCC and nonneoplastic tissues through univariate Cox regression analysis. The receiver operating characteristic (ROC) curve, survival analysis and multivariate Cox regression analysis as well as nomograms were utilized to assess the prognostic performance of the seven-gene signature. Results: The risk score based on a seven-gene signature categorized HCC subjects into a high- and low-risk group. There was significantly discrepant overall survival (OS) between patients in both groups and the corresponding ROC curve revealed a satisfactory predictive performance in HCC survival in both TCGA and GSE76427 cohort. Multivariate Cox regression analysis demonstrated that a seven-gene signature was an independently prognostic factor for HCC. Nomograms combining this prognostic signature with significant clinical characteristics conferred a crucial reference to predict the 1-,3- and 5 years OS. Conclusions: Our study defined a promising seven-gene signature and nomogram model to forecast the OS of HCC patients, which is instrumental in clinical decision and personalized therapy.


Introduction
Liver cancer is the second most frequent reason for tumor-associated deaths, with an approximated 841, 080 new cases and 780 thousand deaths occurring globally in 2018. In particular, almost one-half of new diagnoses and deaths are Chinese [1][2][3]. HCC, accounting for almost 90% of all primary liver cancers, can be induced by multifarious risk factors, including hepatitis virus infection, metabolic disorders, aflatoxin and autoimmune hepatitis [1,4]. Despite recent encouraging progress in therapeutic interventions (such as surgery, radiofrequency ablation, targeted therapy, and radiotherapy), HCC patients display an unsatisfactory prognosis with a lower than 40% 5-year survival rate because of frequent recurrence or distant metastasis [5][6][7]. Several clinicopathological parameters, including pathologic differentiation, tumor-node-metastasis (TNM) staging and vascular invasion, constitute conventional prognostic models Ivyspring International Publisher to predict the OS of HCC patients [8]. Nevertheless, their predictive performance is not very encouraging owing to the great heterogeneity of HCC. Additionally, regarding several valuable biomarkers, such as des-γ-carboxyprothrombin [9] and alphafetoprotein (AFP) [10], their prognostic efficiency is variable among studies, which is partly ascribed to difference in sample size, assay methods or statistical methods [6].
Recent advance in genome-sequencing technologies has investigated the prognostic prediction of gene signatures in HCC. A majority of studies have concentrated on single gene molecule at mRNA level [11][12][13][14][15][16][17]. However, multiple-gene-based signatures are more robust to evaluate HCC prognosis compared with the predictive power of single-biomarker. For example, the prognostic predictive value of apolipoprotein A1 (APOA1) combined with C-reactive protein (CRP) is more favorable than that of serum AFP alone [12]. Sulfite oxidase (SUOX) integrated with AFP are sufficient to predict the performance and recurrence risk of HCC [18]. Similarly, Dong et al demonstrated that synergic analysis of STAT genes (STAT5A, STAT5B and STAT6) exhibited a more desirable predictive efficiency for HCC prognosis than did single gene [19]. The accumulating accessibility of genome-wide gene expression information in HCC potentially permits the development of a credible gene signature [20,21]. Thus, deep excavation of publicly accessible genomic data is a considerable strategy to appraise novel multi-gene signatures with reliable predictive power in the OS of HCC patients, thus conferring promise to improve patients' risk stratification and individual therapeutic interventions. In our study, we acquired the HCC mRNA expression profile from the TCGA and GEO to formulate a prognostic seven-gene signature and a nomogram with satisfactory credibility for HCC patients, which is conducive to covering the imperfection of the current staging system.

Data extraction and manipulation
The original mRNA expression information was acquired from TCGA and GEO database, respectively. HCC patients without crucial clinical information (including follow-up or survival status) or mRNA expression data were excluded. The DEmRNAs were identified between tumor tissues and normal samples. We normalized the RNA expression data through multi-array average (RMA) expression measure method. We further investigated DEmRNAs through DESeq vesion 1.38.0 R package in TCGA dataset and by the Limma version 3.36.2 R package in GEO dataset [22]. We further utilized univariate Cox regression analysis to extract DEmRNAs that were significantly related to the prognosis of HCC patients. The hazard ratio (HR)-cutoff value is conventionally set at 1 to define protective genes (HR < 1) and risky genes (HR > 1). We conducted a crosstalk between above two datasets and eventually selected seven reliable DEmRNAs associated with the OS of HCC.

Construction of prognostic signature
A prognostic risk score model was further formulated in accordance with the mRNA expression levels and its corresponding regression coefficient (β). The calculative method was as follows: risk score = the expression level of transketolase (TKT) * β TKT + the expression of TTC39B * β TTC39B + the expression level of poly-N-acetyllactosamine (PLN) * β PLN + the expression level of CBFA2T2 * β CBFA2T2 + the expression level of heat shock protein beta 3 (HSPB3) * β HSPB3 + the expression level of Progestin and adipoQ receptor 4 (PAQR4) * β PAQR4 + the expression level of C21orf58 * β C21orf58 . All incorporated HCC patients were categorized as high-and low-risk groups on the basis of a median risk score.

Establishment of nomogram
Nomogram has a robust capacity to predict tumor prognosis [23,24]. A nomogram was established through incorporating all significant prognostic clinicopathological parameters determined through multivariate Cox regression analysis, thus estimating the probability of 1 -, 3-, and 5 years-OS of HCC. We calculated the concordance index (C-index) to identify the discrimination of a nomogram. The calibration curve of a nomogram was utilized to vividly assess the consistency between its prediction probabilities and the actual observation.

Statistical analysis
To evaluate the survival differences between low-and high-risk HCC patients, we performed survival analysis through Kaplan-Meier curve combined with the log-rank test. We also conducted univariate and multivariate Cox regression analyses to identify the association between OS and risk score as well as clinicopathological features. The ROC curve with the corresponding area under the curve (AUC) was rendered to estimate the predictive performance of the prognostic gene signature for HCC survival by the R package "survival ROC" [25]. P < 0.05 was defined as statistically significant.

Recognition of differentially expressed genes associated with prognosis
Based on TCGA dataset, 3256 DEmRNAs were identified in HCC samples (n = 374) when compared with noncancerous samples (n = 50). Similarly, a total of 12674 DEmRNAs were extracted from GSE47595 (both P < 0.05). The heatmap of the DEmRNAs was revealed in Supplementary Figure 1A and 1B. We further utilized univariate Cox regression model to identify 1569 DEmRNAs from TCGA and 174 DEmRNAs from GSE47595, which were all significantly related to OS of HCC patients (P < 0.05). To narrow down the range of genes, seven differentially expressed genes associated with HCC prognosis were confirmed by overlapping the above two datasets (Supplementary Figure 1C) and were further incorporated into a prognostic gene-signature. Collectively, the seven genes were as follows: TKT, TTC39B, PLN, CBFA2T2, HSPB3, PAQR4, C21 or f58. The general characteristics of the seven genes were summarized in Table 1. Under the condition of the cutoff value of HR = 1, there were four common candidate risky genes (TKT, CBFA2T2, PAQR4, C21orf58) and three candidate protective genes (TTC39B, PLN, HSPB3) ( Table 2).

Constitution and validation of a seven-gene prognostic signature
The risk score of each HCC patient was calculated based on the equation: risk score = (0.345) * TKT value + (0.213) * CBFA2T2 value + (0.132) * PAQR4 value + (0.115) * C21orf58 value + (-0.015) * TTC39B value + (-0.043) * PLN value + (-0.077) * HSPB3 value. All HCC subjects were stratified into high-and low-risk groups in accordance with risk score. The risk score distribution, gene expression and survival status of each HCC patient were revealed in Figure 1. For the seven genes, four genes corresponded to high risk (TKT, CBFA2T2, PAQR4, C21orf58; HR > 1) and three genes seemed to be protective (TTC39B, PLN, HSPB3; HR < 1). We made a comparison in the expression discrepancies of seven genes between high-and low-risk groups. Indeed, risky genes were prone to express in patients with high-risk scores, while patients in the low-risk group were characterized with protective genes expression (Figure 2 and Figure 3). To identify the relationship between risk score model and clinicopathological characteristics in HCC patients, we further analyzed the risk score level in HCC patients at different clinical stages. As revealed in Table 3, risk score based on this seven-gene prognostic model was significantly associated with N classification, M classification, histologic grade, AJCC staging, fibrosis score (both P < 0.05). Thus, these findings highlight that the level of risk score is related to diverse crucial pathological characteristics of HCC patients.

Correlation between a seven-gene prognostic signature and HCC survival
The HCC subjects in low-risk group were characterized with more satisfactory OS in comparison to those in high-risk group (P < 0.001) (Figure 4A, C). The time-dependent ROC curves based on TCGA dataset displayed that the AUCs for 1-year, 3-year, and 5-year OS were 0.759, 0.822 and 0.914, respectively and the corresponding values based on GSE76427 dataset were 0.79, 0.657, and 0.765, respectively (P < 0.05) (Figure 4B, D), highlighting a substantially effective predictive performance of the seven-gene signature for HCC prognosis.     All HCC subjects were further stratified into different subgroups based on clinicopathologic characteristics to confirm the association between signature risk score and HCC prognosis. Survival analysis revealed that regardless of age, gender and tumor status, there was statistically significant discrepancy in HCC prognosis between the high-and low-risk groups (HR = 0.38, 95% CI = 0.21-0.69, P < 0.0001 for female patients; HR = 0.49, 95% CI = 0.31-0.77, P = 0.0005 for male patients; HR = 0.40, 95% CI = 0.23-0.68, P = 0.0002 for patients with age < 60 years old; HR = 0.45, 95% CI = 0.28-0.74, P < 0.0001 for patients with age ≥ 60 years old; HR = 0.43, 95% CI = 0.26-0.72, P = 0.0004 for patients with free tumor; Figure 5). Low signature risk scores were significantly related to more favorable OS in the subgroup of patients with elevated AFP levels (HR = 0. 46 (Figure 6). Nevertheless, for HCC patients with HBV or HCV infection, normal AFP levels and at TNM stage II and IV, the statistically significant prognostic difference was not revealed between the high-and low-risk groups. Above findings imply that the risk score signature potentially serves as a robust prognostic marker for HCC patients and fails to be affected by parameters that usually trigger variations in the performance of conventional biomarkers.
Moreover, as shown in Supplementary Figure 2, the predictive accuracy of our model was relatively satisfactory than additional clinical indicators, such as TNM stage, BCLC stage and pathological differentiation.

Cox proportional hazards regression analysis
We further evaluated the effect of the seven-gene prognostic signature on the OS of HCC patients through univariate and multivariate Cox regression. For the whole TCGA cohort, univariate Cox regression revealed that gender, histologic grade, AJCC stage and fibrosis score as well as risk score were significantly correlated with HCC survival (both P < 0.05). The corresponding multivariate Cox regression analysis demonstrated that HCC patients with poor histologic grade, advanced AJCC stage as well as high risk scores potentially exhibited more unsatisfactory prognosis (both P < 0.05) ( Table 4). Based on data extracted from GSE76427 database, BCLC stage, TNM stage and risk score were significantly associated with HCC prognosis through univariate and multivariate Cox regression analysis (both P < 0.05) ( Table 5). Therefore, low risk score was indeed an independent protective indicator for HCC prognosis.

Establishment and validation of a predictive nomogram
Eventually, we incorporated all significantly prognostic factors based on multivariate Cox regression analysis to formulate a nomogram, thus predicting OS of 342 HCC patients from TGGA. Specifically, AJCC stage and predictive risk score made the greatest contributions to HCC prognosis, followed by histologic grade in TCGA database (Figure 7). Risk score, TNM stage and BCLC stage exerted a crucial effect on the HCC prognosis in GSE76427 database (Figure 8). The C-index of nomogram associated with TCGA and GSE76427 database was 0.745 (95% CI: 0.676-0.816) and 0.7645 (95% CI: 0.699-0.835), highlighting a desirable predictive value of our nomogram models.

Discussion
Accumulating studies have highlighted that genetic changes and defects in the signaling pathways exert a crucial effect on tumorigenesis and development of HCC, indicating the potential prediction value of molecular biomarkers in HCC prognosis [26]. Furthermore, the prognostic gene signature combined with traditional clinical indicators exhibit a more satisfactory predictive performance than a single parameter [12,27]. Currently, multi-gene signatures based on abnormal mRNA levels have attracted much consideration and displayed a promising predictive potential in HCC prognosis [19,[28][29][30].
In our report, we selected the seven common genes (TKT, TTC39B, PLN, CBFA2T2, HSPB3, PAQR4, C21 or f58) most significantly related to HCC prognosis through overlapping the TCGA and GEO databases. Each gene was defined as a risky gene (HR > 1) or protective gene (HR < 1) through univariate Cox regression analysis. A signature risk score based on the nine genes was developed and it conferred a standard to stratify HCC patients into high-and low-risk groups. We further conducted univariate and multivariate Cox regression analysis to validate the independent prognostic effect of this seven-gene signature on HCC. Kaplan-Meier curves showed that HCC patients in the high-risk group were characterized with unfavorable prognosis. Moreover, we exploited a nomogram with robust predictive performance to estimate survival through combining the signature risk score and additional clinicopathological characteristics with statistically significance. These findings highlight that the risk score is a stable, independent prognostic indicator with significant and effective predictive value for HCC patients based on our seven-gene model. The mRNA TKT was one of the seven-gene prognostic signature in our study. TKT, a vital enzyme in the pentose phosphate pathway (PPP), is essential for tumor proliferation on account of its capability to influence NAPDH generation to counteract oxidative stress. Disturbing the redox homeostasis of cancer cells by genetic knockdown or pharmacologic inhibition of TKT sensitizes cancer cells to existing targeted therapy (Sorafenib) [31]. Reduced expression of TKT, which regulate flux into pyrimidine biosynthesis, correlates with better prognosis in pancreatic cancer patients on fluoropyrimidine analogs [32]. Specifically, TKT can promote the development of HCC in a non-metabolic manner via its nuclear localization and EGFR pathway [33]. Loss of TKT in the liver increased apoptosis, reduced cell proliferation, decreased TNF-α, IL-6, and STAT3 levels, and alleviated DEN/HFD-induced hepatic steatosis and fibrosis, highlighting a key role for TKT in promoting genome instability during liver injury and tumor initiation [34]. CBFA2T2 (also known as MTGR1), a member of the Myeloid Translocation Gene (MTG) family of transcriptional corepressors, can significantly predict the survival of renal cell carcinoma (RCC) patients. Knocking-down of CBFA2T2 can inhibit cell migration and invasion in RCC cells in vitro, and reduce ALDH high cancer stem cells (CSCs) populations. CBFA2T2 expression is necessary for sphere-forming ability and cancer stem cells marker expression in RCC cell lines [35,36]. CBFA2T2 is required for tumorigenesis in the murine colitis-associated carcinoma [37][38][39]. PAQR4 has a tumorigenic effect on human breast cancers, and such effect is associated with a modulatory activity of PAQR4 on protein degradation of CDK4 [40,41]. PAQR4 promotes cell proliferation and metastasis in both non-small-cell lung cancer [42] and gastric cancer [43]. C21orf58 exerts a momentous effect on breast cancer cell growth [14]. Nevertheless, the role of abnormal CBFA2T2, PAQR4 or C21orf58 in HCC remains undefined. Our study initially confirmed the negative effect of CBFA2T2, PAQR4 or C21orf58 on HCC prognosis for the first time. Conversely, tetratricopeptide repeat (TPR) domain protein 39B (TTC39B, C9orf52) (T39), a high density lipoprotein (HDL) gene discovered in human genome wide association studies (GWAS) [44,45], is associated with atherosclerosis and steatohepatitis as well as inflammation [46]. HSPB3 is an unfavorable molecular biomarker in colorectal adenocarcinoma [47]. Nevertheless, the role of HSPB3 in HCC has not been illuminated.   Significantly, we formulated and identified a predictive prognostic model composed of seven genes to confer reference for HCC patient stratification in clinical practice. All enrolled HCC patients were sorted into high-and low-risk groups through mRNA expression levels rather than gene mutations or methylation alterations of merely seven prognostic genes. This method was more accessible and economical in practice considering that it diminished the utilization of whole-genome sequencing for all HCC subjects. Additionally, a nomogram was developed by combining this signature with conventional clinical indicators such as TNM stage, pathological differentiation, thus significantly enhancing the accuracy of predictive performance. It was also beneficial for clinicians to select high-risk HCC patients for adjuvant therapy except for surgical treatment. Notably, several limitations in our study need to be discussed. Initially, the clinical data from GEO database was insufficient and there was no additional valuable information concerning prognosis, including Child-Pugh scoring, cirrhosis scoring, AFP levels, tumor size and vascular invasion as well as therapeutic interventions. Furthermore, our study merely retrospectively analyzed relatively small sample size. A majority of patients from TCGA dataset were White or Asian. We should cautiously expand the results to additional ethnicities. Thirdly, further investigation should be warranted to determine the expression and the prognostic role of the seven genes at protein level as well as their underlying mechanisms. Thus, further independent prospective cohort studies with larger sample sizes and more elaborate clinical information are essential to validate the nine-gene signature and prognostic nomogram.

Conclusion
A novel seven-gene signature for prognostic prediction in HCC was established, with higher risk scores implying unfavorable prognosis. A nomogram model integrating the seven-gene signature with additional significant clinicopathological parameters also yielded promising predictive performance in HCC survival.

Author Contributions
YG Tao and H Xie designed/planned the study. SP Liu performed computational modeling, acquired and analyzed clinical data. H Xie, SP Liu, P Chen and ZY Zhang performed imaging analysis. H Xie, SP Liu, P Chen, and ZY Zhang participated in discussion of related data. H Xie and SP Liu wrote the paper.