The prediction of survival in Gastric Cancer based on a Robust 13-Gene Signature

Gastric cancer represents a major public health problem. Owing to the great heterogeneity of GC, conventional clinical characteristics are limited in the accurate prediction of individual outcomes and survival. This study aimed to establish a robust gene signature to predict the prognosis of GC based on multiple datasets. Initially, we downloaded raw data from four independent datasets of The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO), and performed univariate Cox proportional hazards regression analysis to identify prognostic genes associated with overall survival (OS) from each dataset. Thirteen common genes from four datasets were screened as candidate prognostic signatures. Then, a risk score model was developed based on this 13‑gene signature and validated by four independent datasets and the entire cohort. Patients with a high-risk score had poorer OS and recurrence-free survival (RFS). Multivariate regression and stratified analysis revealed that the 13-gene signature was not only an independent predictive factor but also associated with recurrence when adjusting for other clinical factors. Furthermore, in the high-risk group, gene set enrichment analysis (GSEA) showed that the mTOR signaling pathway and MAPK signaling pathway were significantly enriched. The present study provided a robust and reliable gene signature for prognostic prediction of both OS and RFS of patients with GC, which may be useful for delivering individualized management of patients.


Introduction
GC remains a major challenge for public health worldwide [1]. Accurate prediction of prognosis can confirm patients who would benefit from more radical treatment, such as chemotherapy, neoadjuvant therapy and targeted molecular therapy. Currently, the tumor-node-metastasis (TNM) classification system is still the most common method to select therapeutic strategies and evaluate the prognosis of patients with GC [2]. Nevertheless, various outcomes have been identified in patients with GC with similar clinical factors, which suggests that the predictive efficacy of conventional models may be insufficient [3][4][5]. Hence, it is crucial to develop robust and reliable prognostic signatures to improve individualized survival predictions in GC.
Modern biomedical research, such as microarray and next-generation sequencing analyses, has explored many GC prognostic genes that are crucial for risk stratification and personalized treatment decisions [6,7]. However, the vast majority of studies have concentrated mainly on a single gene, and its predictive ability is insufficient compared with multiple biomarker-based models [8,9]. In clinical practice, accurately predicting OS for patients with Ivyspring International Publisher GC may facilitate individual clinical decision-making. In this study, we established a robust 13-gene prognostic signature for gastric cancer by integrating multiple datasets, which might complement classical clinical prognostic characteristics, and further aid clinicians in personalized treatment planning.

Acquisition of gene expression clinical data
Three independent datasets and corresponding clinical information were downloaded from GEO (GEO; https://www.ncbi.nlm.nih.gov/geo/), under the accession numbers GSE15459, GSE62254 and GSE57303, and one dataset in TCGA was employed from University of California Santa Cruz Xenabrowser (UCSC Xena) (http://genome.ucsc. edu/). The mRNA expression profiles of the GSE15459, GSE62254 and GSE57303 datasets were all detected by using the GPL570 platform (Affymetrix Human Genome U133 Plus 2.0 Array). After the samples with overall survival (OS) ≤ 30 days were excluded, 884 patients were enrolled, including 182 samples from GSE15459, 299 from GSE62254, 70 from GSE57303, and 333 from TCGA (Table S1). In terms of cancer recurrence, patients from TCGA were excluded if (i) the records of primary therapy outcome and radiation therapy were not clearly presented (n = 26); (ii) patients underwent radiation therapy (n = 45); or (iii) patients with stage I gastric cancer who recurred within one year did not achieve complete resection after primary therapy (n = 2), resulting in the enrolling of a total of 191 patients (Table S2). Raw datasets from the GEO database were normalized respectively via Robust Multichip Average (RMA) in the affy package of R, while the TCGA data were normalized as transcripts per million (TPM).

Development and validation of the gene signature
In the present study, a univariate Cox proportional hazard regression model was used to assess the relationships between overall survival and the level of expression in each cohort. Statistical significance was assumed at p < 0.05, and the shared genes between four datasets were selected to formulate a signature model for prediction of prognosis. Risky genes (hazard ratio (HR) > 1) and protective genes (HR < 1) were defined by the HR from the univariate Cox regression analysis. Next, we used a risk score model to develop prognostic signatures, and validated it via four datasets and the entire cohort. A detailed flow-chart of this study is depicted in Figure 1.
Based on the expression of the gene signature and corresponding weighted coefficient, we calculated a risk score for each patient as follows: = � * =1 In the above equation, n was the number of prognostic genes, expi was the expression value of gene i, and βi was the regression coefficient of gene i. According to the median risk score obtained, patients with GC in each cohort were segmented into high-risk and low-risk group. Subsequently, we investigated the relationship between the prognosis signature and RFS based on TCGA.

Gene set enrichment analysis (GSEA)
GSEA was implemented to explore the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways that were significantly enriched in high-risk tumor samples by GSEA v4.0.3 (http://www. broadinstitute.org/gsea/index.jsp). The gene sets of canonical pathways (c2.all.v7.0.symbols.gmt) were used 1,000 times for each analysis to obtain the normalized enrichment score (NES). P<0.05 and NES >1 were considered statistically significant.

Statistical analysis
The data were analysed using R software (version 3.4.3). Comparisons between the two groups were inspected by the Kaplan-Meier curve and compared by the log-rank test. The chi-square test was implemented to evaluate the relativity between the risk score and recurrence. To investigate whether the prognostic gene signature could be independent of other clinical and pathological characteristics, we carried out univariate and multivariate Cox proportional hazard regression models. Factors with P value < 0.05 in the univariate analyses were included in the multivariate analyses [10]. Unless otherwise stipulated, P < 0.05 was set as the cut-off value. Receiver operating characteristic (ROC) curve analysis was performed to examine the predictive accuracy of the risk score for time-dependent disease outcomes within three years. Afterwards, the area under the curve (AUC) values were calculated according to the ROC curves.

Prognostic signature generation
Survival-associated genes were identified via univariate Cox regression analysis in each dataset. Under the cut-off threshold of HR < 1 and P < 0.05, 1919 genes in GSE15459, 3924 genes in GSE62254, 591 genes in GSE57303, and 274 genes in TCGA were selected as candidate protective genes. According to the screening criteria (HR > 1 and P < 0.05), 1632 genes in GSE15459, 3159 genes in GSE62254, 909 genes in GSE57303, and 6619 genes in TCGA were selected as risky candidate genes. Finally, a total of 13 common genes (ADAT3, TMEM171, DCBLD2, MARCKS,  CLIP4, CTNNAL1, PIP4K2A, ZBTB10, NRP1, CST6, PLTP, CD109, and JAZF1) were finally identified and nominated as the "13-gene signature" by overlapping these candidate genes, including 2 protective genes and 11 risky genes (Table S3). The general information of these 13 genes and the prognostic correlation of the 13 in each dataset are displayed in Tables 1 & 2.

Thirteen-gene prognostic signature validation
The risk score (RS) for each patient was achieved based on the expression levels of the 13-gene signature and corresponding weighted coefficients. The samples in each data set were sorted by risk score, and the median risk score was set as the cut-off point to stratify the subjects into high-and low-risk groups. In Figure 2, the survival status, and gene expression of patients in each dataset were profiled, which showed that the survival events decreased as RS increased. Furthermore, the survival curves demonstrated that the prognosis of patients with a high-risk score suffered a poorer prognosis than those with a low-risk score (GSE15459: P < 0.0001, GSE57303: P = 0.0013, GSE62254: P < 0.0001, TCGA: P = 0.00012) ( Figure 3). The area under the ROC curve of the 13-gene signature was 0.689, 0.701, 0.676, and 0.601 in GSE57303, GSE62254, GSE15459, and TCGA respectively, revealing that the 13-gene prognostic model had a certain accuracy in predicting outcome in GC ( Figure 3). The above data confirmed that RS derived from a 13-gene signature could appropriately predict the prognosis of patients with GC.

The 13-gene signature is an independent prognostic factor
To investigate the prognostic prediction value of the 13-gene prognostic signature and identify the independent factors in gastric cancer, univariate and multivariate Cox regression analyses were performed. Covariates are composed of RS and other clinicopathological characteristics, such as gender, age, stage, Lauren classification and recurrence. Multivariate regression analysis revealed that the 13-gene signature was significantly correlated with OS even after adjustment for other covariates in the four datasets (GSE15459: P = 9.84E-05, GSE57303: P = 2.07E-03, GSE62254: P = 4.27E-03, and TCGA: P = 4.89E-03) ( Table 3) or the entire cohort (P = 2.59E-12) ( Table 4).

The prognostic signature is associated with cancer recurrence
Since cancer recurrence and RS are both independent factors of patient survival in TCGA (Table 3), the relativity between recurrence and the risk score was tested. The results revealed significantly favourable RFS in the low-risk group ( Figure 4A: χ2 = 7.822, P = 0.0005; Figure 4B: P = 0.009), and patients with recurrence were more likely to have a higher risk score ( Figure 4C). Meanwhile, the AUC for 3-year recurrence-free survival predictions was 0.651( Figure 4D). Then we performed univariate and multivariate Cox proportional hazards models. In univariate analyses, high stage (HR = 4.2, 95% CI = 1.0-17.0, P = 0.04) and risk score (HR = 2.5, 95% CI = 1.2-4.9, P = 0.01) were associated with shorter (RFS). Multivariate analysis indicated that risk score was the only independent prognostic factor in terms of RFS (HR = 2.3, 95% CI = 1.1-4.6, P = 0.02).

Stratification analysis
Based on clinical parameters, such as age (65/>65), gender (female/male), stage (II/III/IV) and Lauren subtype (diffuse/intestinal), the patients in the entire cohort were factitiously stratified. Patients in stage I were excluded from the stratification analysis due to the small sample size. In Figure 5, the survival curves in different subgroups displayed that patients with high-risk scores in each stratum had significantly shorter OS (P < 0.05).

Identification of KEGG pathways
To explore the potential function of the 13-gene signature, we conducted GSEA among high-risk and low-risk patients in the four datasets. Oncological pathways, such as the mTOR signaling pathway

Discussion
Previous studies have developed a number of molecular signatures that divide patients into various prognostic groups and are involved in tumor progression [11][12][13][14][15][16][17]. Wang et al. built a prognostic seven-gene signature using integrated multi-step analysis [11]. Chen et al. found novel proposed clinical-immune signature based on TCGA [12]. However, these assumed prognostic signatures were mostly derived from one or two training sets. In the present study, we establish 13-gene prognostic signatures by taking the intersection of the survival-related genes from four independent datasets for the first time in gastric cancer and partially handling the problems of clinical heterogeneity and insufficient sample size, which were eventually validated.
Our study suggested that the 13-gene signature was an independent prognostic factor (Tables 3 & 4). In addition, age, stage and recurrence were significantly associated with OS of patients in TCGA; stage was correlated with OS of patients in GSE15459; and stage and Lauren subtype were prominently related to OS of patients in GSE62254. Age and stage were identified as independent prognostic factors for OS in the entire patient cohort. Interestingly, Marqués-Lespier et al. indicated that diffuse-type gastric cancer is more aggressive and has a worse prognosis than intestinal-type gastric cancer [18]. In our study, multivariate Cox regression analysis revealed that, except for patients in GSE62254, no significant correlation was found between the Lauren type and either OS or RFS in GC (Table 4, Figure 4E).
Among the identified 13 genes (ADAT3, TMEM171, DCBLD2, MARCKS, CLIP4, CTNNAL1, PIP4K2A, ZBTB10, NRP1, CST6, PLTP, CD109, and JAZF1) in our study, overexpression of MARCKS, CLIP4, NRP1, PLTP, CD109 has been previously associated with poor prognosis of GC [19][20][21][22][23]. MARCKS aggravates gastric cancer tumorigenesis and progression via EMT pathway [19]. It has been found that high expression of CLIP4 and CD109 was associated with poor GC overall survival according to assay and RNA-Seq data of patients with GC [20,21]. Loss of PLTP expression has been reported to inhibite the proliferation in both AGS and SGC-7901 cell lines [23]. NRP1 plays an essential role in the proliferation, migration, and invasion of gastric cancer cells [24]. Our findings are consistent with these studies. In addition, ZBTB10 was regarded to regulate specificity proteins (Sp) by reactive oxygen species (ROS)-microRNA27a [25]. Chen et al. proposed that JAZF1 suppresses proliferation and induces apoptosis via TAK1/NF-κB pathways [26]. DCBLD2 over-expression has been implicated in causing tumorigenesis, invasion and metastasis in colorectal cancer expression [27]. CTNNAL1 can contribute to drug-resistance of melanoma through the way of activating NF-κB and AP-1 [28]. Shin et al. believed that PIP4K2A played a negative regulatory role on PI3K in PTEN-deficient glioblastoma [29]. Emerging evidence has shown that High CST6 expression predicts poor prognosis in Triple-Negative Breast Cancer [30]. For ADAT3 and TMEM171, there is little published data on TMEM171 and ADAT3 function in cancer. Except for MARCKS, CLIP4, NRP1, PLTP, CD109, other mRNAs are newly reported to be associated with GC survival. The TNM staging system, currently used as the most important and basic tool for GC patient stratification, is deficient in accurately predicting individual survival [3][4][5]. This happens because nearly one-third of patients experienced recurrence after surgery, whereas the current staging system cannot accurately reflect it [31], which is also displayed in our study ( Figure 4E). This is directly linked to chemotherapy strategy after gastrectomy. GSEA disclosed that several oncological pathways, including the mTOR signaling pathway and MAPK signaling pathway, were significantly concentrated in the high-risk group. Among them, the activated mTOR signaling pathway is critical for cell transformation, growth, metastasis and predicts poor prognosis in gastric cancer [32,33]. The consequences of increased mTOR pathway signaling can also lead to drug resistance, which continues to be the principal limiting factor to achieving cures in patients with cancer [34,35]. In trastuzumab-resistant HER2 positive gastric cancer cells, mTOR pathway is among multiple signaling pathways that mediate trastuzumab resistance [36]. Notably, suppressing the mTOR pathway can significantly inhibit tumor progression and improve the efficacy of trastuzumab in GC [36,37]. The MAPK signaling pathway is reported to be frequently activated in the process of tumor development, such as proliferation, migration, and invasion [38,39]. The expression of MAPKs is elevated in almost all high-grade GC and is correlated with tumorigenesis and metastatic potential [40]. Similarly, lots of studies have demonstrated that the MAPK signaling pathway has an effect on chemotherapy resistance in gastric cancer. Guo et al. found that the p38-MAPK pathway was activated in vincristine-resistant gastric cancer SGC7901/VCR cells and confirmed its regulatory effect on multidrug resistance [41]. In addition, emerging evidence suggests that SB203580, a selective inhibitor of p38 MAPK, increases sensitivity to doxorubicin in two gastric cancer cell lines (SGC7901 and BGC823) by inducing apoptosis in vitro and in vivo [42]. In general, these oncological pathways may provide further understanding of the risk score model in GC, and repression of these pathways might be a useful therapeutic strategy for high-risk patients.
Some flaws in our study should be taken into consideration. First, this is a retrospective study and validation of the signature for each patient in a large-scale prospective clinical trial is necessary. Second, the molecular mechanism of these genes remains unclear, and further functional experiments are needed to in the future.

Conclusions
Collectively, we identified a novel 13-gene signature as a potential prognostic biomarker based on four independent datasets for the first time in GC, that was independent of other clinical factors. Besides, most of the 13 genes were the first time finding associated with the prognosis of GC. Last but not least, the signature may be associated with cancer recurrence and chemotherapy resistance in GC ( Figure 7). Further experimental studies are warranted to elucidate the mechanisms of the 13 genes in gastric cancer.