Prognostic implication of glycolysis related gene signature in non-small cell lung cancer

Abnormal glycolysis is one of the hallmarks of cancer and plays an important role in its development. This study was devoted to identify glycolysis related genes as prognostic biomarkers for non-small cell lung cancer (NSCLC). The mRNA expression profile and clinical follow-up data were obtained using The Cancer Genome Atlas (TCGA) database. The validation set was obtained by bootstrap method of random repeated sampling. A total of 200 glycolysis-related genes were obtained from Gene Set Enrichment Analysis (GSEA) and 46 genes were significantly associated with overall survival (OS). Five genes (PKP2, LDHA, HMMR, COL5A1 and B3GNT3) were eventually identified to calculate risk score of NSCLC patients. The univariate and multivariate Cox regression analysis indicated that the risk score was an independent prognostic factor (training set: HR=2.126, 95% CI [1.605, 2.815], p<0.001; validation set: HR=2.298, 95%CI [1.450, 3.640], p<0.001). Patients assigned to the high-risk group were associated with poor OS compared with patients in the low-risk group (training set: P=7.946e-06; validation set: P=6.368e-07). Receiver operating characteristic (ROC) curve and stratification analysis also demonstrated the potential prognostic performance. In conclusion, we constructed a novel glycolysis related risk signature which might contribute to predicting the prognosis of NSCLC.


Introduction
Lung cancer is one of the leading causes of cancer-related death [1], especially in developed countries, where it has ascended to number one in cancer deaths [2]. NSCLC accounts for approximately 80% of all lung cancer cases [3]. Despite progress in treatment, the prognosis of NSCLC remains poor, due to the lack of early identification markers [4]. To date, the prognosis mainly depends on histopathologic diagnosis and tumor staging [5]. However, this has limitations as patients with the same degree of progression will show distinct outcomes due to individual differences [6]. Some evidence has shown that the discovery and application of molecular biomarkers can provide prognostic value [7][8][9][10]. Therefore, new diagnosis markers must be urgently identified for assessing prognosis of NSCLC.
Glycolysis is one of the earliest evidence of metabolic changes in tumor [11,12]. Even in oxygenated circumstances, the glycolysis activity of tumor cells is also active. This metabolic characteristic is known as Warburg effect, which shows high glucose uptake, active glycolysis and large amounts of lactic acid production [13,14]. Increased glycolysis satisfies the increasing proliferation of cancer cells [15]. In addition, studies have shown that enhanced glucose metabolism could alter apoptosis pathway in cancer cells [16]. The oncogenic regulation of glycolysis provides us with new explanation for tumor progression [17]. Many studies have found that glycolysis changes take an important role in cancer progression. For example, tumor glycolysis promotes immune evasion by participating in immunosuppressive networks [18]. In the present study, transcriptome profiling was downloaded from TCGA database and 200 mRNAs were identified that significantly relate to glycolysis. Furthermore, a risk predictive signature including five genes was conducted to calculate the risk of NSCLC patients and further successfully validated in the validation set. Univariate and multivariate Cox regression analysis revealed the signature was an independent predictive factor for OS. The risk score also performed better than other clinical parameters in prognosis. This study might present a new thinking for assessing the prognosis of NSCLC.

Data collection and processing
Detailed information of NSCLC patients' containing transcriptome profiling and clinical follow-up data, were downloaded from TCGA database on April 2, 2020. Patients meeting the following criteria were excluded: 1) patient had no survival time or survival status, 2) patient had clinical information but no gene expression data. There were 935 patients included in this study. The detailed clinical characteristics are summarized in Table 1.

Gene set enrichment analysis
GSEA analysis was used to explore whether the selected gene sets were associated with downloaded transcriptome data [23]. We determined gene sets for further investigation at normalized p values (p<0.05) and normalized enrich score (|NES|≥1).

Establishment and validation of glycolysis related gene signature
We intersected glycolysis related genes and their expression in TCGA for matching. Differentially expressed glycolysis related genes was obtained using limma R package with |log2 FC| ≥1.5 and false discovery rate (FDR) <0.05 [24]. Then differentially expressed genes combined with TCGA clinical data for screening prognostic genes. We identified differentially glycolysis related genes whose . Independent analyses were applied to verify the independence of the signature. Then NSCLC patients were divided into two groups based on a median risk score. The Kaplan-Meier curve was visualized using the survival R package and survminer R package. P < 0.05 was regarded as statistically significant. Survival ROC R package was used to compare the ROC areas. The validation set was acquired by bootstrap method [25]. Furthermore, stratification analysis also verified the prognostic implication of the signature.

Prognostic genes expression and alteration were shown by online databases
In our study, five prognostic genes involved in the signature were all upregulated in NSCLC. The gene expression was further validated in the Oncomine database and Tumor Immune Estimation Resource (TIMER) database [26,27]. The alternation of the signature genes was shown by cBioportal for Cancer Genomics [28,29].

Establishment a prognostic signature of five glycolysis related genes in NSCLC
Clinical information and corresponding gene expression data from 935 NSCLC patients were obtained from TCGA. GSEA results showed gene sets which were significantly enriched, with normalized p-values <0.05 and NES ≥1, including glycolysis, oxidative phosphorylation, DNA repair, inflammatory response, mitotic spindle, bile acid metabolism, E2F targets and G2M checkpoint (Table 2, Figure 1). We selected GLYCOLYSIS (P = 0.000, NES =2.46) for further analysis. Total 200 glycolysis related genes were obtained from GSEA, of which 46 genes were considered statistically significant (FDR < 0.05, |log2FC)| ≥1.5). The heatmap ( Figure 2a) and boxplot ( Figure 2b) showed the expression of these genes. A total of 10 genes expression pattern had a significant association with OS (P < 0.05) using univariate COX regression analysis. Subsequently, 5 of 10 genes were identified as independent prognosis genes to establish model using multivariate COX regression analysis. Finally, five prognosis genes and the expression coefficient of each gene were obtained, and a gene-based prognostic signature was constructed for calculating risk score of each NSCLC patient. The five genes were LDHA, HMMR, B3GNT3, PKP2, and COL5A1 (Table 3). The risk score = (0.2170 × Expression of LDHA) + (0.1200 × expression of HMMR) + (0.1088 × expression of B3GNT3) + (0.0960 × expression of PKP2) + (0.1344 × expression of COL5A1). The heatmap displayed the expression of five signature genes (Figure 3a). The five genes were all significantly up-regulated (Figure 3b-3f). Based on the median risk score, the NSCLC patients were divided into high-and low-risk groups. Patients of high-risk group tended to have a shorter survival time than those in low-risk group (P=7.946e-06) (Figure 3g). The risk score and survival state distribution were also visualized (Figure 3h, 3i). Furthermore, we used cox regression analysis to identify the independence of the risk score. Univariate independent prognostic analysis illustrated that TNM stage, T stage, N stage and risk score have statistical significance with OS (P<0.001) (Figure 4a). Multivariate independent prognostic analysis illustrated that risk score was an independent prognostic factor (P<0.001) (Figure 4b). Thus, our results confirmed that the glycolysis-related gene signature could be used as an independent prognostic factor in clinical practice (Table 4).

Clinical correlation analysis
ROC curves of OS helped to assess the prognosis significance of gene signature with AUC of 0.649, whereas lower scores were shown in other clinical parameters, such as age (AUC = 0.547), gender (AUC = 0.551), TNM stage (AUC = 0.634), T stage (AUC = 0.629), N stage (AUC = 0.579) and M stage (AUC = 0.501) (Figure 5a). The connection between signature and other clinical pathological factors was also used to demonstrate prognostic performance. Patients grouped by age (≤65, >65) or gender (male, female) or TNM stage (I-II, III-IV). Risk score was found to be significantly associated with age (p = 0.027) and TNM stage (p <0.001), but not with gender (Figure 5b-d).
Additionally, these five genes in signature with showed substantial difference among some clinical parameters. Differential expression of HMMR showed across different gender, N stage and TNM stage (Figure 5e-g). The differential expression of PKP2 exhibited across age and gender (Figure 5h, 5i). B3GNT3 was expressed differently across different gender and TNM stage (Figure 5j, 5k). The differential expression of LDHA was found in N stage and TNM stage (Figure 5l, 5m).

Stratification analysis
Patients were grouped in the same manner as the previous step for stratification analysis where high-risk patients with poorer OS were grouped by age, no matter if they were older or younger than 65 (Figure 6a, 6b). The same was true of TNM stage (Figure 6e, 6f). Therefore, the potential associations between risk score with patients' age and TNM stage were statistically significant. However, there was no significant difference in gender stratification ( Figure  6c, 6d).

Prognostic significance in validation set
Internal validation was conducted using the bootstrapping method. The original dataset was acted as training set. The validation set was performed as described above, including survival analysis, risk analysis, independent prognosis analysis and ROC curve drawing. The AUC of ROC was 0.631, which showed that this model has superior accuracy ( Figure  7a). The OS was notably shorter in high-risk group than in the low-risk group (P-value=6.368e-07; Figure  7b). Besides, univariate and multivariate regression analysis were performed on the validation set to demonstrate the independent prognostic significance of the signature (Figure 4c, 4d). The heatmap, risk score curve and survival status data in validation set were shown in Figure 7c-e. The prognostic significance of our signature was further verified in the validation set.

External validation using online database
We analyzed the alterations of the five genes in 1144 NSCLC samples in the cBioPortal for Cancer Genomics. The results showed that the genes in signature were mutated in 191 (17%) of the sequenced cases. As shown in Figure 8b.The five genes in the signature were all up-regulated in NSCLC.This is also confirmed in the Oncomine database ( Figure 8a) and the Timer database (Figure 9).

Discussion
NSCLC remains a major challenge for global public health, with incidence and mortality still increasing in many countries. Although great progress has been made, the underlying molecular pathogenesis of NSCLC is still unclear. Considering the poor prognosis of NSCLC due to its later diagnosis, identification reliable prognostic markers and establishment of more accurate prognostic models are urgently needed.    In recent years, the glycolysis mechanism in various diseases has caused wide attention [30][31][32]. Growing evidence show that glycolysis key genes play an important role in regulating cancer cell metabolism and may be a potential therapeutic option [33,34]. In lung cancer, glucose metabolism feature has also been further elucidated [35][36][37]. For instance, significantly upregulated OTUB2 in NSCLC stimulated the Warburg effect and was closely related to metastasis, advanced tumor stages, poor survival, and recurrence [38]. Transcription factor BACH1 stimulates glycolysis-dependent lung cancer metastasis by increasing glucose uptake, glycolysis rates, and lactate secretion [39].
In this study, we established a new predictive signature including five glycolysis related genes. Univariate and multivariate cox regression analysis determined the independent prognostic effect for predicting the risk of NSCLC. ROC curve and stratification analysis also proved the prognostic significance. More importantly, we established a validation set to verify the reliability of this signature. Similar studies have been published before. Zhang et al. and Liu et al. respectively identified glycolysisrelated gene signature for predicting survival of lung adenocarcinoma patients [40,41]. However, their signatures were not validated internally, which is the advantage of this study.    For the five signature genes, there have been some studies on its role in cancer. LDHA, a key enzyme which catalyzes the formation of lactic acid from pyruvic acid, is a key gene in aerobic glycolysis and is widely seen as a therapeutic target for cancer [42]. Phosphorylated LDHA could promote head and neck cancer and breast cancer cells invasion and metastasis [43]. Besides, LDHA inhibitors can be used together with other chemotherapy drugs to play a synergistic role in anti-tumor [44]. HMMR plays an important role in neural development by correcting spindle position. HMMR knockout mice suffer defective neural development [45]. Furthermore, the overexpression of HMMR in numerous tumor types is also associated with tumor relapse and propagation. HMMR also supports the cancer stem cell properties in gastric cancer and glioblastoma [46,47]. HMMR might be a potential prognostic marker and therapeutic target [48,49]. Some studies have suggested that the B3GNT3 were significantly overexpressed in various tumors [50,51]. The expression level of B3GNT3 in patients with NSCLC or early-stage cervical cancer was associated with unfavourable clinicopathological parameters. These patients with high B3GNT3 expression had a shorter OS and disease-free survival (DFS) compared with those with low expression [52,53]. In contrast, B3GNT3 predicts a favorable cancer behavior of neuroblastoma [54]. In addition, down-regulation of B3GNT3 can enhance cytotoxic T-cell-mediated anti-tumor immunity in triple-negative breast cancer [55]. COL5A1 is collagen family member that encodes an alpha chain for one of the low abundance fibrillar collagens. Upregulated COL5A1 indicated poor prognosis in breast cancer, clear cell renal cell carcinoma, lung adenocarcinoma and tongue squamous cell carcinoma [56][57][58][59]. PKP2, a member of the arm-repeat protein family, is important for the assembly of junctional proteins. But as far as we know, the mechanism of PKP2 in cancer is still unclear. The risk score was based on mRNA expression which could be easily acquired. The signature can effectively evaluate the prognosis of NSCLC patients and may be a complement to CT and pathological methods. Despite practical independent prognosis of our results, our study leaves much to be desired. First, our study only involved 200 glycolysis related genes, not the entire mRNA expression profiles. Second, validating the signature in a larger independent external set is necessary. Third, cell and animal experiments are needed to clarify specific mechanism of the five mRNAs in the regulation of tumor glycolysis. Nevertheless, the present study provides a novel suggestion for predicting prognosis of NSCLC patients from the perspective of glycolysis.

Conclusion
In summary, we recognized five glycolysis related genes associated with OS of NSCLC patients, and constructed a risk score signature to make a survival prediction. Our risk score model could distinguish NSCLC patients with different survival outcomes, which may contribute to the clinical decision of individualized treatment program.