The use of DNA repair genes as prognostic indicators of gastric cancer

DNA repair genes can be used as prognostic biomarkers in many types of cancer. We aimed to identify prognostic DNA repair genes in patients with gastric cancer (GC) by systematically bioinformatic approaches using web-based database. Global gene expression profiles from altogether 1,325 GC patients' samples from six independent datasets were included in the study. Clustering analysis was performed to screen potentially abnormal DNA repair genes related to the prognosis of GC, followed by unsupervised clustering analysis to identify molecular subtypes of GC. Characteristics and prognosis differences were analyzed among these molecular subtypes, and modular key genes in molecular subtypes were identified based on changes in expression correlation. Multivariate Cox proportional hazard analysis was used to find the independent prognostic gene. Kaplan-Meier method and log-rank test was used to estimate correlations of key DNA repair genes with GC patients'overall survival. There were 57 key genes significantly associated to GC patients' prognosis, and patients were stratified into three molecular clusters based on their expression profiles, in which patients in Cluster 3 showed the best survival (P < 0.05). After a three-phase training, test and validation process, the expression profile of 13 independent key DNA repair genes were identified can classify the prognostic risk of patients. Compared with patients with low-risk score, patients with high risk score in the training set had shorter overall survival (P < 0.0001). Furthermore, we verified equivalent findings by these key DNA repair genes in the test set (P < 0.0001) and the independent validation set (P = 0.0024). Our results suggest a great potential for the use of DNA repair gene profiling as a powerful marker in prognostication and inform treatment decisions for GC patients.


Introduction
Gastric Cancer (GC), including the gastro-esophageal junction (GEJ) adenocarcinoma, is one of the most common cancers worldwide [1]. In China, the disease burden of GC is extremely high. Previous study reported that an estimated 4,292,000 new cancer cases and 2,814,000 cancer deaths occurred in China in 2015 [2]. Surgery, chemotherapy, and molecular targeted therapies are the mainstay of treatment for GC patients. Most patients, however, are diagnosed at an advanced stage and, therefore, miss operation chance [3]. Therefore, detection of potential prognostic biomarker is crucial for GC patients prognosis and treatment.

Ivyspring International Publisher
DNA repair is a collection of processes by which cells identify and correct damages to the DNA molecules that encode its genome. The reaction may restore the DNA structure to its original function, whereas sometimes it cannot completely eliminate DNA damage and only enable cells to tolerate and survive with DNA damages. When normal repair processes fail without activation of cellular apoptosis, irreparable DNA damage may occur [4]. This can eventually lead to cellular senescence and even malignancies in human entities. DNA damage can be caused by both extracellular (e.g. radiation and virus) and intracellular (e.g. reactive oxygen) environmental factors and normal metabolic processes. Helicobacter pylori is a well-established risk factor of GC and may cause injuries to genomic integrity through an inefficient DNA repair [5].
DNA damage repair, proceed by several mechanisms, including base excision repair, mismatch excision repair, nucleotide excision repair and homologous recombination, is crucial to maintaining the integrity of the genome and the exercise of normal function. Variations in DNA repair capacity resulting from somatic mutations could therefore correlate with GC progression [6]. DNA damage caused by internal and external factors can be repaired by DNA repair genes, which involving in activation or inhibition of specific DNA repair related pathways. Multiple genes that were initially shown to influence life span have turned out to be involved in DNA damage repair and protection [7]. Large-scale genome sequencing of GC indicated that somatic mutations in genes involved in homologous recombination DNA repair are common features [8,9]. A thorough understanding of the DNA repair genes expression profile in tumor tissues would be shed new light on cancer prognostication and improvement of therapeutic response.
In this present study, we selected 1,325 GC samples from the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) GC database, aimed to clarify the relationship between the expression patterns of GC associated DNA repair genes and GC prognoses. Different molecular subtypes were built for further identification of the correlation between gene expression and GC prognosis. With repeated unsupervised clustering analysis, multivariate Cox proportional hazard analysis and Robust likelihood-based modeling, 13 key prognostic DNA repair genes were finally identified for GC. Our study provided a group of biomarker that identifies the subset of GC deficient in DNA gene repair, which is critical for the rational design of clinical trials using DNA mismatch repair targeting agents.

Data download and preprocessing
The Microarray gene expression profiles of  gastric cancer, GSE33335, GSE27342, GSE63089,  GSE62254 and GSE26253, were downloaded from  Gene Expression Omnibus database (GEO http://www.ncbi.nlm.nih.gov/geo/). The general information of each dataset was shown in Supplemetary Table 1. All patients in all six datasets were staged by the 7th of the AJCC gastric cancer staging system [10]. Formalin fixed paraffin embedded (FFPE) tissues were used in GSE26253 and GSE62254, and frozen tissue specimens were used in GSE33335, GSE27342, GSE63089 and the TCGA GC datasets. On the basis of maximum possible inclusion strategy, a total of 727 DNA repair genes were obtained from KEGG and reported in previous articles (http://www.cgal.icnet.uk/DNA_Repair_Genes.htm l, up to June 2017) [11] (Supplementary Table 2). Actually, there were many outdated genesymbol, that is, one specific gene matched to different names in the 727 gene lists, so finally we obtained 215 genes with expression level in the learning set (GSE62254).
The expression profile data were matched to all genes and all no-load probes were removed. The median value is selected if multiple probes matched to a single gene, and further quantile standardization was used to standardize each dataset, finally the expression profile of DNA repair genes were extracted.

Analysis expression profiles of DNA repair genes
The expression profiles of DNA repair genes in cancer and adjacent normal tissue samples were analyzed in GSE33335, GSE27342 and GSE63089 datasets, GSEA (hallmarks sets_BP) was implemented by using R survival software, to observe the GO biological processes that DNA repair genes involved.

Screening of DNA repair genes which related to prognosis
Cancer samples at different stages may show different DNA gene expression patterns, which is closely related to patients' prognosis. We analyzed DNA repair gene expression profile of each sample in the GSE62254 dataset and classified samples using hierarchical clustering, then analyzed the differences on patients' prognosis between different categories. Given that patients' prognosis altered with different gene expression levels, we selected DNA repair genes whose expression dysregulation variance is greater than 0.1 for further analysis. Potentially altered genes, which were screened from patient samples, were separately analyzed using univariate COX proportional hazards model. Genes with P-values less than 0.05 were selected.

GC molecular subtype construction and prognostic genes analysis
GC molecular subtypes were built by unsupervised clustering of prognostic DNA repair genes. The Kaplan-Meier curve with log-rank analysis was used for prognosis analysis of each molecular subtypes and patients with different risk. Clinical characteristics were identified and compared through observing individual subtypes. The expression correlation among these genes was analyzed by Pearson correlation coefficient.

Independent prognostic genes and formula exploring
We did a multivariate Cox regression analysis using a backward stepwise approach to test if the gene was an independent prognostic factor of overall survival, and derived a formula to calculate the risk score for every patient from the expression values of independent prognostic DNA repair genes, weighted by regression coefficient. Risk score=expgene1* βgene1+expgene2*βgene2+...expgene10*βgene10 (exp: expression level, β: the regression coefficient derived from the multivariate Cox regression model). We used receiver operating characteristics (ROC) curves to analyze the sensitivity, specificity and Youden index for the prediction of survival by the DNA repair gene signature [12].

Verifying of key prognostic genes through external data
The influence of DNA repair genes on prognosis was verified in GSE26253 and TCGA GC dataset, respectively. The RNA sequence data of gastric cancer in The Cancer Genome Atlas (TCGA) was downloaded from (http://cancergenome.nih.gov/). A total of 433 patient samples and their clinical follow-up information were available.

Statistical analysis
All statistical analyses were done with R survival package two-tailed tests, and significance was defined as p values of less than 0.05 [13].

Sample data download and pre-processing
A total of 1,325 GC samples which included 23,521 gene expression values were obtained from 5 GEO GC datasets (GSE33335, GSE27342, GSE63089, GSE62254 and GSE26253) and the TCGA GC dataset. A total of 727 expressed DNA repair genes were selected from these GC samples. The flow diagram of key prognostic gene identification and performance evaluation was shown in Figure 1.  Clustering analysis of 727 DNA repair genes in indicated GEO GC database. The horizontal axis represents sample, using Euclidean distance to calculate distance; the vertical axis stands for genes, using Pearson correlation coefficient to calculate distance.

Expression pattern of DNA repair genes in GC and adjacent normal tissues
We analyzed the expression profiles of DNA repair genes in three groups of cancer and adjacent normal tissues (GSE33335, GSE27342 and GSE63089). As shown in Figure 2, the average expression level of DNA repair gene in GC samples was higher than that in adjacent tissues, suggesting DNA repair was more active in cancer tissues than normal GC tissues.
In order to observe the biological processes associated with the DNA repair genes in GC, we used GSEA database (GO gene sets) to analyze DNA repair gene expression enrichment in these three GEO GC datasets. As shown in Supplementary Figure 1 and Supplementary Table 3, they were significantly enriched in several biological processes including cell cycle (GO cell cycle G1/S phase transition and GO cell cycle phase transition), metabolism (GO DNA metabolic process) and related molecular function (GO positive regulation of molecular function). It is suggested that cell cycle abnormalities in GC tissues were associated with DNA repair genes.

Relationship between different expression patterns of DNA repair genes and GC prognosis
We analyzed the expression profiles of DNA repair genes in GSE62254 dataset (in which we extracted 215 DNA repair genes) by hierarchical clustering and categorized the expression profiles of these DNA repair genes into three groups (group IV was excludes because it only has one sample), in which Group III showed the lowest overall DNA repair activity while Group II showed the highest overall DNA repair activity (Supplementary Figure  2A). Next, we analyzed the overall survival of patients in these three groups, which showed that the patient in Group III had a significantly worst outcome than those patients in Group I (P < 0.0001, Supplementary Figure 2B) and Group II (P = 0.0021, Supplementary Figure 2C). This indicates that DNA repair genes could significantly distinguish GC patients' prognostic outcomes.
Among these 215 genes in the GSE62254 dataset, we enrolled 156 genes whose variance > 0.1 in all samples for Cox univariate survival analysis and got 57 genes with significant influence on GC patients' prognosis (all P < 0.05, Table 1), in which the Hazard ratio (HR) of 52 genes were less than 1. We then analyzed the correlation among the expression level of these 57 prognostic genes by Pearson correlation coefficient, the hierarchical cluster of correlation between each gene showed that 50 of the 57 DNA repair genes were positively correlated with each other, while 7 genes had negative correlations (Supplementary Figure  3). By performing unsupervised clustering of these 57 DNA repair genes, we divided GC patients into three clusters (Figure 3A), and DNA repair gene expression significantly differed in these three clusters ( Figure  3B). Moreover, patients Cluster 2 had a better outcome than patients in Cluster 1 and Cluster 3 (P < 0.05, Figure 3C-E). The analysis of clinicalpathological characteristics distribution of the three groups is shown in Supplementary Figure 4. Based on this analysis, the age distribution of groups III was significantly lower than group I (P = 0.0002) and group II (P < 0.0001). This suggests that the better prognosis of groups II should be age-related.

Development of a prognostic risk score formula using DNA repair genes in GC
We further identified 13 independent prognostic genes by Cox multivariate survival analysis from these 57 DNA repair genes, and derived a formula to calculate the risk score for each patient from the expression values of the 13 DNA repair genes, weighted by regression coefficient. The specific formula for the GSE62254 dataset was as follows: Risk score = (0.41 * expMCM2 + 0.2 * expMLH1 + 0.25 * expCLK2 -0.34 * expFANCG + 0.38 * expEXO1 -0.33 * expPARP1 + 0.29 * expCETN2 -0.25 * expNEIL3 + 0.24 * expALKBH3 + 0.24 * expPOLI -0.18 * expPOLD3 + 0.22 * expERCC1 -0.29 * expFANCF). The forest map of these 13 genes is shown in Supplementary Figure 5, in which five genes (FANCG, PARP1, NEIL3, POLD3 and FANCF) showed HR < 1, and eight genes (MCM2, MLH1, EXO1, CLK2, CETN2, ALKBH3, POLI and ERCC1) showed risk HR > 1. We divided the patients into two groups (group I and group II) by the expression spectrum clustering analysis, and found that the expression level of seven genes (FANCF, POLD3, PARP1, MCM2, EXO1, FANCG and NEIL3) in Group I was higher than that in Group II (Figure 4A). Compared with patients in group I, patients in group II had significantly shorter overall survival (P = 0.0066, Figure 4B). In Receiver operating characteristic (ROC) curve to analyze sensitivity and specificity of survival prediction, the area under the AUC line was 0.780 (P = 5.7528e-12, Figure 4C). With this risk score formula, patients in the learning set were divided into high-risk or low-risk groups with the Youden index (0.2446) as the cutoff. Compared with patients in the low-risk group, patients in the high-risk group had shorter overall survival (P < 0.0001, Figure 4D).

Independent datasets test and validation
To assess whether these 13 DNA repair genes had the same or similar prognostic value in different populations, we extracted the expression profiles of 13 DNA repair genes in the test datasets GSE26253 (n = 432), 12 genes (except of NEIL3) were found. With the same algorithm used in the learning set, we developed a new formula for the expression data from the 432 GC tissues from the test set as follow: Risk score = (-0.1003 * expMCM2 -0.0247 * expMLH1 -0.1412 * expCLK2 -0.1752 * expFANCG -0.4061 * expEXO1 +0.1372 * expPARP1 + 0.1547 * expCETN2 + 0.0116 * expALKBH3 -0.0509 * expPOLI -0.0159 * expPOLD3 + 0.1722 * expERCC1 -0.1339 * expFANCF). By the expression spectrum clustering analysis, we also divided the patients into two groups (Group I and Group II, Figure 5A), and compared with patients in Group I, patients in Group II had significant shorter overall survival (P = 0.042, Figure  5B). In ROC analysis, the area under the AUC line was 0.671 (P = 0.0002, Figure 5C). With this risk score formula, patients in the test set were divided into high-risk or low-risk groups with the Youden index (3.9377) as the cutoff. As expected, patients in the test set with high risk-scores had shorter overall survival than those with low risk-scores (P < 0.0001, Figure  5D).
And then we further assess the prognostic value of these 13 DNA repair genes in an independent validation set (TCGA GC dataset, n = 443). With the same algorithm used in the learning set, we developed a new formula for the expression data from the 443 GC tissues from the independent validation set: Risk score = (-0.9094 * expERCC1 -0.1896 * expCETN2 -0.2124 * expPARP1 -1.70426 * expCLK2 -0.1481 * expFANCG -1.3774 * expMCM2 + 0.9026 * expALKBH3 -2.1348 * expFANCF + 5.4205 * expEXO1 -1.4920 * expPOLI + 1.5893 * expMLH1 -2.8104 * expPOLD3 -6.6587 * expNEIL3). By the expression spectrum clustering analysis, we also divided the patients into two groups (Figure 6A), whereas there was no significant difference on overall survival between group I and group II (P = 0.86, Figure 6B). In ROC analysis, the area under the AUC line was 0.719 (P = 0.0498, Figure 6C). With this risk score formula, patients in the learning set were divided into high-risk or low-risk groups with the Youden index (0.1516) as the cutoff. As expected, patients in the independent validation set with high risk-scores had shorter overall survival than those with low risk-scores (P = 0.0024, Figure 6D). The clinicalpathological characteristics distribution between the two groups divided by the 13 DNA repair genes showed significantly differences (Supplementary Figure 6).

Discussion
Cumulative studies have reported DNA damage repair deficiencies in multiple solid tumor types, including GC [14,15]. DNA repair genes are differentially expressed in normal tissues and cancer tissues, the abnormal expression of DNA repair genes in human cancers is significantly associated with the patients' prognosis [9,16,17]. The existence of tumor heterogeneity and complexity specificity, to some extent, greatly limits the prognostication value (specificity, reproducibility and generalisability) of individual gene [18,19]. Thus, employing a gene expression panel instead of an individual gene as a biomarker provides a rational option to circumvent the limitation in genes utilisation in predicting GC outcomes. However, the globe expression profile of DNA repair genes and their clinical application prospect has yet to be elucidated in GC. In the current study, we identified an expression profile of 13 independent key DNA repair genes which could classify the prognostic risk of patients.
With the development of microarray and next-generation sequencing technology, increasing data resources regarding DNA repair genes, such as TCGA and GEO databases, are available for comparing and analysing in the human malignancies  [20,21]. In the current study, bio-information of gene spectrum in 150 GC samples obtained from three GEO GC datasets, was analyzed to determine the expression activity of DNA repair genes in GC. As a result, the expression activity of 727 DNA repair genes we examined were significantly higher in GC samples compared with normal gastric mucosa, which indicated that DNA damage repair is a common event in GC tumorigenesis and progression. Therefore, their expression levels may be an indicator of the intrinsic characteristics of GC. DNA repair genes could potentially be sensitive and specific prognostic predictors. Indeed, we confirmed that the less active of DNA repair genes expression in cancer samples, the worse the prognosis. All this suggested that patients with higher DNA repair gene expression abundance tend to have a better outcome and most of the DNA repair genes were prognostic protective factors.
Unsupervised clustering analysis further verified that the prediction by using the molecular signature of these prognostic effective DNA repair genes is matched to the clinical outcomes for these samples. The expression profile of DNA repair genes in cluster 2 were significantly higher than the others two clusters. And the overall survival of cluster 2 was significantly higher than of the others two clusters, indicated that most of the 57 genes may be the promoters of DNA damage repair and tumor suppressor in GC. Interestingly, the differentially expressed DNA repair genes described here, including POLD3, MLH1, MCM2 and ERCC1, are similar to those listed in previous reports [22][23][24][25]. These aforementioned studies, together with our results, firmly support the notion that DNA repair gene expression profile may generate a unique molecular signature for prognostication of GC.
The most striking results came from the analysis of 13 DNA repair genes (MCM2, MLH1, CLK2, FANCG, EXO1, PARP1, CETN2, NEIL3, ALKBH3, POLI, POLD3, ERCC1 and FANCF) whose altered expression profile was significantly related to the survival of GC patients. DNA damage repair has been list as the main process involving in cell cycle, and specific genetic changes to a great extent abrogate the fidelity of DNA replication [26]. Our GESA analysis results also proposed that the dysregulated DNA repair genes in GC are enriched in cell cycle related biological process. Thus, the reason that these 13