Significant prognostic values of differentially expressed-aberrantly methylated hub genes in breast cancer

Introduction: Abnormal status of gene expression plays an important role in tumorigenesis, progression and metastasis of breast cancer. Mechanisms of gene silence or activation were varied. Methylation of genes may contribute to alteration of gene expression. This study aimed to identify differentially expressed hub genes which may be regulated by DNA methylation and evaluate their prognostic value in breast cancer by bioinformatic analysis. Methods: GEO2R was used to obtain expression microarray data from GSE54002, GSE65194 and methylation microarray data from GSE20713, GSE32393. Differentially expressed-aberrantly methylated genes were identified by FunRich. Biological function and pathway enrichment analysis were conducted by DAVID. PPI network was constructed by STRING and hub genes was sorted by Cytoscape. Expression and DNA methylation of hub genes was validated by UALCAN and MethHC. Clinical outcome analysis of hub genes was performed by Kaplan Meier-plotter database for breast cancer. IHC was performed to analyze protein levels of EXO1 and Kaplan-Meier was used for survival analysis. Results: 677 upregulated-hypomethylated and 361 downregulated-hypermethylated genes were obtained from GSE54002, GSE65194, GSE20713 and GSE32393 by GEO2R and FunRich. The most significant biological process, cellular component, molecular function enriched and pathway for upregulated-hypomethylated genes were viral process, cytoplasm, protein binding and cell cycle respectively. For downregulated-hypermethylated genes, the result was peptidyl-tyrosine phosphorylation, plasma membrane, transmembrane receptor protein tyrosine kinase activity and Rap1 signaling pathway (All p< 0.05). 12 hub genes (TOP2A, MAD2L1, FEN1, EPRS, EXO1, MCM4, PTTG1, RRM2, PSMD14, CDKN3, H2AFZ, CCNE2) were sorted from 677 upregulated-hypomethylated genes. 4 hub genes (EGFR, FGF2, BCL2, PIK3R1) were sorted from 361 downregulated-hypermethylated genes. Differential expression of 16 hub genes was validated in UALCAN database (p<0.05). 7 in 12 upregulated-hypomethylated and 2 in 4 downregulated-hypermethylated hub genes were confirmed to be significantly hypomethylated or hypermethylated in breast cancer using MethHC database (p<0.05). Finally, 12 upregulated hub genes (TOP2A, MAD2L1, FEN1, EPRS, EXO1, MCM4, PTTG1, RRM2, PSMD14, CDKN3, H2AFZ, CCNE2) and 3 downregulated genes (FGF2, BCL2, PIK3R1) contributed to significant unfavorable clinical outcome in breast cancer (p<0.05). High expression level of EXO1 protein was significantly associated with poor OS in breast cancer patients (p=0.03). Conclusion: Overexpression of TOP2A, MAD2L1, FEN1, EPRS, EXO1, MCM4, PTTG1, RRM2, PSMD14, CDKN3, H2AFZ, CCNE2 and downregulation of FGF2, BCL2, PIK3R1 might serve as diagnosis and poor prognosis biomarkers in breast cancer by more research validation. EXO1 was identified as an individual unfavorable prognostic factor. Methylation might be one of the major causes leading to abnormal expression of those genes. Functional analysis and pathway enrichment analysis of those genes would provide novel ideas for breast cancer research.


Introduction
Breast cancer is the most frequently diagnosed cancer among females worldwide following lung cancer [1]. Aberrant gene expression plays an important role in tumorigenesis, progression and metastasis of breast cancer and it is considered to be the consequence of not only genetic defects (such as TP53, PIK3CA mutation, BRCA1/BRCA2 inactivation, Cyclin D1 amplification [2]) but also epigenetic modifications [3]. Epigenetic alterations in breast cancer consist of DNA methylation, RNA methylation, histone modification , non-coding RNAs (especially miRNA and lncRNA) regulation and so no [4]. This study focused on DNA methylation, one of the most widely studied epigenetic modifications. DNA methylation occurs with the addition of a methyl (CH3) group from S-adenosylmethionine (SAM) into cytosine residues of the DNA template [5], mostly located on cytosine-phosphate-guanine (CpGs) dinucleotides. Both DNA hypermethylation and hypomethylation can be involved in diverse processes of breast cancer development and prognosis [6].
In clinical practice, though breast cancer is classified into three subtypes according to hormone receptor status, growth factor receptor status and Ki-67 which reflected partial prognostic information. And serum CA 15-3, CEA level, BRCA1/2 mutation status, PALB2 mutation status and circulating tumor DNA methylation might provide additional information for prognosis. However, heterogeneity of prognosis still exists. Therefore, more biomarkers are still urgently needed for more accurate prognosis. To date, there are many public databases for gene expression and methylation whose data was provided by published researches. Among them, plenty of researches have demonstrated the correlation between DNA methylation and prognosis of breast cancer, but the comprehensive profile and the interaction network of these aberrantly-expressed methylated genes still remain elusive. This study was aimed to identify aberrantly expressed hub genes that might be regulated by DNA methylation in breast cancer and to evaluate the prognostic value of these genes by using public databases. Several accessible software, databases, simple operations and basic bioinformatic knowledge were needed to complete this study and results might provide directions for further research.

Microarray and RNASeq data
In the initiation of present study, we screened the breast cancer expression microarray and methylation microarray datasets in GEO DataSets of NCBI (https://www.ncbi.nlm.nih.gov/gds/),sorted by sample number (From high to low). Study type was restricted to expression profiling by array and methylation profiling by array, and datasets both including breast cancer and normal breast samples were utilized. Finally, expression microarray datasets GSE54002, GSE65194 and methylation microarray datasets GSE20713, GSE32393 were included. 16 normal breast tissues and 200 primary breast cancer tissues were selected in GSE54002. 11

Screening for upregulated-hypomethylated and downregulated-hypermethylated genes
GEO2R, an online analyzing tool of GEO DataSets, was utilized to analyze differentially expressed genes (DEG) between breast cancer tissues and normal tissues in expression microarray datasets and differential methylation in methylation microarray datasets. P value < 0.05 and |t| >2 were used as cutoff criteria to identify differential expression and methylation genes. Next, FunRich software (Functional Enrichment analysis tool, latest version 3.1.3 was download from http://www.fun rich.org/) [7] was used to identify overlapping genes from GSE54002, GSE65194, GSE20713 and GSE32393. Finally, the overlapping genes were identified as upregulated-hypomethylated and downregulatedhypermethylated genes in breast cancer from the previous four datasets.

GO and KEGG pathway enrichment analysis
DAVID (The Database for Annotation, Visualization and Integrated Discovery, https:// david.ncifcrf.gov/), an online analysis tool box consists of an integrated biological knowledgebase and analytic tools aimed at systematically extracting biological meaning from large gene/protein lists [8], was used for gene functional and pathway enrichment analysis. The gene ontology (GO) analysis [9] consisted of analysis of biological process (BP), cellular component (CC), molecular function (MF) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis [10] were performed for the identified downregulatedpromotor hypermethylated genes. The top 5 GO analysis terms and top 10 KEGG pathways were visualized in the result. P value < 0.05 was used as statistical significance.

Construction of protein-protein interaction (PPI) network and identification of hub genes
STRING (Search Tool for the Retrieval of Interacting Genes, https://string-db.org/), an online protein interaction analysis tool, was performed to construct PPI network of the previous identified downregulated-promotor hypermethylated genes. Homo sapiens were selected as the organism for subsequent analysis. Medium confidence 0.4 of Interaction score was regarded as the cut-off criterion for network visualization and disconnected nodes was hidden. Subsequently, Cytoscape software (latest version 3.7.0, download from http://www.cytoscape. org/), a network data integration, analysis and visualization tool, was conducted to identify hub genes and modules within PPI network. Degree >25 was used as cutoff criteria for hub gene identification. MCODE score >3 and number of nodes >4 were utilized as cutoff criteria for module identification.

Validation of gene expression in UALCAN database
UALCAN (http://ualcan.path.uab.edu/), an online cancer transcriptome database, is designed to provide easy access to publicly available cancer transcriptome data (TCGA and MET500 transcriptome sequencing) [11]. This database was used to compare expression level of hub genes between normal breast tissue and primary invasive breast carcinoma.

Validation of gene methylation in MethHC database
MethHC (http://MethHC.mbc.nctu.edu.tw), a database of DNA Methylation and gene expression in Human Cancer, integrates data from TCGA (The Cancer Genome Atlas), which includes 18 human cancers in more than 6000 samples, 6548 microarrays and 12 567 RNA sequencing data. Differential DNA methylation was compared by average beta value in tumor sample and matched normal samples in different genes [12].

Survival analysis of genes in Kaplan Meier-plotter database
Clinical outcome analysis of hub genes in breast cancer was performed by Kaplan Meier-plotter database [13] whose background database was established using gene expression data and survival information of 1,809 patients downloaded from GEO (Affymetrix HGU133A and HGU133+2 microarrays). Overall survival analysis of 16 hub genes was performed separately in its default settings with 300 months follow-up.

COX regression analysis and Kaplan
Meier-plotter construction IBM SPSS statistics 22 was used for cox regression analysis. Relative ratio (RR) and 95% CI were used for statistical analysis. After univariable analysis, hub genes were enrolled in multivariable analysis whose p value <0.1. Backward regression was used for multivariable regression analysis. Finally, hub genes included in the last step was selected as combined genes for prognostic analysis. Next, the upper 50% gene expression patient was defined as high expression group and lower 50% gene expression was defined as low expression group. Exp(B) from multivariable analysis was saved for the following classification. The upper 50% RR was defined as high risk group and lower 50% was defined as low risk group. ROC curve analysis was performed for single gene and combine genes group for predicting 5-year overall survival. Kaplan Meier-plotter analysis was performed for specific group identified previously. Log Rank (Mantel-Cox) was used for statistical analysis.

Tissue microarrays (TMAs) and IHC staining
TMAs were purchased from Outdo Biotech (Shanghai, China), and contained 140 breast cancer samples. But 7 samples were missing after we finished IHC staining. Clinicopathological characteristics of these 133 samples are listed in Table S2. This experiment was approved by the Ethics Committee of the Second Affiliated Hospital, Zhejiang University School of Medicine. Primary antibodies against EXO1 (Huabio, ER1908-42, China) (1:200) was used for IHC staining.
EXO1 expression scores were blindly evaluated by two pathologists using the immunoreactivity score (IRS), based on the percentage of positive cells and the intensity of staining. When the two pathologists had a very different score, we asked a third pathologist to evaluate the slide. The percentage of positive cells was graded as follows: 0 (negative), 1 (<10%), 2 (10%-50%), 3 (51%-80%), 4 (>80%). The intensity of staining was graded as follows: 0 (no color reaction), 1 (mild reaction), 2 (moderate reaction), 3 (intense reaction). IRS was multiplied by the two scores. In this study, EXO1 expression was defined as low (IRS ≤6) or high (IRS >6).

GO functional enrichment analysis
Functional enrichment analysis of 677 upregulated-hypomethylation genes and 361 downregulated-hypermethylation genes was performed by GO analysis. The result was listed in  Table 1. The top 5 terms of analytical result including biological process, cellular component and molecular function, were visualized in the table ranking by P-value from low to high (p<0.05). The most significant biological process, cellular component and molecular function enriched in the 677 upregulatedhypomethylation genes was viral process, cytoplasm and protein binding, separately. For downregulatedhypermethylated genes, the result was peptidyltyrosine phosphorylation, plasma membrane, transmembrane receptor and protein tyrosine kinase activity, separately.

KEGG pathway enrichment analysis
The top 10 KEGG pathways enriched by DAVID were demonstrated in Table 2

PPI network construction and hub gene validation
PPI network was constructed by STRING database and hub gene validation was identified by Cytoscape software. Module analysis was performed by MCODE, an application in Cytoscape software. PPI network was displayed in Fig. S1 and  (Table. 3), 12 genes for upregulated-hypomethylated and 4 genes for downregulated-hypermethylated genes group.

Expression validation of the hub genes in TCGA dataset through UALCAN database
In order to validate the expression status of 16 hub genes in breast cancer compared to normal breast tissue, UALCAN database was utilized to confirmed the result. All of the 16 hub genes were found to be significantly differential expressed in invasive breast carcinoma. 12 upregulated hub genes (TOP2A, MAD2L1, FEN1, EPRS, EXO1, MCM4, PTTG1, RRM2, PSMD14, CDKN3, H2AFZ, CCNE2) identified from GSE database were confirmed to be high expression genes in breast cancer ( Fig. 2 and 3) (p<0.05) and 4 downregulated hub genes (EGFR, FGF2, BCL2, PIK3R1) confirmed to be low expression genes (Fig. 4) (p<0.05). P values for those genes were listed in Table. 3. What's more, we compared expression of 16 hub genes using breast cancer RNASeq data downloaded from TCGA and the result was consistent with the previous data (Fig. S3).

Prognosis in patients with protein expression of EXO1
Since high EXO1 mRNA expression was identified as a strong individual prognostic factor in the previous part. Then, we performed IHC to investigated if EXO1 protein levels were significantly associated with OS in breast cancer patients.
Finally, there were 133 of 140 tumor samples used for EXO1 protein expression analysis since 7 samples were missing in the process of IHC staining within the TMAs. Representative images of EXO1 low and high expression were shown (Fig.11A). EXO1 was mainly expressed in nuclear. 48.1% (64 of 133) patients exhibited EXO1 high expression (IRS >6). Median follow-up time was 123 months. For survival analysis, Kaplan-Meier analysis showed that high EXO1 protein expression was significantly associated with decreased OS in breast cancer patients (p=0.03, Fig.  11B). There were no differences in clinical Clinicopathological characteristics between EXO1 high expression and low expression cohorts (Table  S2). The results suggested that in protein levels, high EXO1 expression was correlated with poor OS.

Discussion
Recurrence and drug resistance are the main causes of mortality in breast cancer [14], therefore it is of great significance to evaluate the prognosis precisely and individually before progression. In this study, we discovered 677 upregulated-hypomethylated and 361 downregulated-hypermethylated genes from GSE54002, GSE65194, GSE20713 and GSE32393 by GEO2R and FunRich. Among them, 12 upregulated hub genes and 3 downregulated genes turn out to contribute to significant adverse clinical outcome in breast cancer.
As was suggested in KEGG pathway enrichment analysis by DAVID, MAD2L1, PTTG1, MCM4 and CCNE2 are both involved in cell cycle pathway. MAD2L1 is a mitotic spindle checkpoint gene. Among patients with primary breast cancer, higher expression of MAD2L1 and BUB1 existed in patients with ER-, PR-, and high-grade tumors compared to those with ER +, PR+, and low-grade tumors. High MAD2L1 expression was associated with poor overall survival [15], which is consistent with our results. PTTG1 is a regulator in chromosomal segregation. It can promote the proliferation of breast cancer cell through binding to P27 directly to induce nuclear exclusion of P27 [16]. MCM4 gene encodes a kind of minichromosomal maintenance proteins. MCM4 overexpression was found only weakly associated with shorter survival in breast cancer alone while the MCM complex had a better prognosis value [17]. CCNE2 is known to promote G1-S transition. It has been demonstrated that CCNE2, targeted by miR-26a, miR-30b, might play an important role in acquired trastuzumab resistance in HER2+ breast cancer [18]. TOP2A and RRM2 are both essential enzyme in DNA replication. TOP2A amplification often occurs with HER2 amplification [19]. Previous researches have confirmed that upregulated TOP2A has unfavorable prognosis in breast cancer in both 5-year disease-free survival [20] and adjuvant treatment [21]. High expression of RRM2 was associated significantly with decreased survival in all breast cancer subtypes and increased expression was shown in tamoxifenresistant patients [22]. Moreover, RRM2 can be targeted and suppressed by miR-204-5p and RRM2 overexpression can promote the proliferation and metastasis of breast cancer cells and suppressed cell apoptosis [23]. FEN1 and CDKN3 are tumor suppressor genes while elevated expression of these genes may not seem to protect against carcinogenesis. FEN1 is a DNA repair-specific nuclease. High level of FEN1 expression in breast cancer cells could reflect the enhanced proliferation or increased DNA damage of cancer cells [24]. The level of FEN1 is inversely associated with cancer drug and radiation resistance [25]. YY1 [25] and Nrf2 [26] can down-regulated FEN1 expression through binding to the FEN1 promoter region and inactivating it. CDKN3 is involved in mitosis. Overexpression of CDKN3 predicts poor prognosis in cervical cancer [27] and lung adenocarcinoma [28] and it is also an effective biomarker in digestive system carcinomas [29,30]. However, there are not many relative researches about CDKN3 in breast cancer. EPRS is one of the Aminoacyl-tRNA synthetases, which are involved in protein translation. EPRS copy number gains in breast cancers tumors [31]. Additionally, EPRS was selectively carbonylated in tumor tissue compared to matched adjacent healthy tissue in breast cancer [32]. EXO1 is an exonuclease, which belongs to the mismatch repair system. EXO1 plays a role in replication fork degradation in BRCA1-and BRCA2-deficient cells [33]. A meta-analysis of transcriptomic data of primary breast tumors also supports our finding about EXO1 [34]. PSMD14 is a kind of deubiquitinating enzymes. There are few researches on PSMD14 in breast cancer. While recent studies have shown that high expression level of PSMD14 predicts poor prognosis of human esophageal squamous cell carcinoma [35]. H2AFZ is an oncogenic histone variant that is expressed independently of DNA replication. SMYD3-mediated H2AFZ methylation accelerates G1-S transition and promotes breast cancer proliferation [36]. 4 downregulated-hypermethylated hub genes (EGFR, FGF2, BCL2, PIK3R1) were sorted out by PPI network, and 3 of them turned out to be significantly associated with poor overall survival except EGFR. EGFR is a commonly mutant gene in many malignant tumors. EGFR is often overexpressed in breast cancer, especially in triple-negative breast cancer [37], while hypermethylation of EGFR can contribute to cetuximab resistance [38]. FGF2 involves in cell proliferation and angiogenesis. FGF2 is hypermethylated in HR+ breast cancers and may have prognosis value [39] while it has low prognosis implication in triple-negative breast cancer [40]. BCL2 belongs to the anti-apoptotic and anti-proliferative. A lot of studies have proved that positive BCL2 shows better prognosis in breast cancer [41][42][43]. Therefore, we can infer that down-regulated expression of BCL2 predicts unfavorable prognosis.  PIK3R1 encodes the p85α regulatory subunit which regulates and stabilizes p110α. PIK3R1 has a lower frequency mutation than PIK3CA in breast cancer [2]. Previous studies have confirmed that Low expression of PIK3R1 is associated with poor prognosis [44,45].The prognosis value of some genes has been practiced in previous studies which is in accordance with our findings but more validation studies will still be needed. And the rest also needs to be explored in the future.
To the best of our knowledge, this is the first study to screen the differentially expressed-aberrantly methylated hub genes of significant prognosis value in breast cancer by simple bioinformatics. In this study, 4 microarray datasets were used to screen these genes to make the results much more convincing. Moreover, databases from TCGA dataset were used to validate the expression and DNA methylation of these 16 hub genes. In addition, the prognosis values of the 16 genes in breast cancer were analyzed through Kaplan Meier-plotter database. Except EFGR, other 15 differently-expressed genes are all significantly associated with prognosis. Also, the KEGG pathway analysis of these hub genes can provide guidance for further researches in breast cancer. What's more, we analyzed the prognostic value of combined genes and found that it was no better than EXO1 in both datasets. Not enough patients in either dataset or overexpression of EXO1 which was really a strong prognostic factor may cause this result. Protein levels of EXO1 were also analyzed and we found that high expression of EXO1 was associated with significant poor OS which suggested that EXO1 was a competitive prognostic factor for clinical application.
However, there were still some limitations in this study should be acknowledged. Though, expression of these genes were all remained to be significantly upregulated or downregulated after validation in UALCAN database, but only 7 of 12 hypomethylated and 2 of 4 hypermethylated genes was validated in another methylation database. Expression analysis may be more stable across different database. However, methylation status of some genes was varied among different platforms and databases. In our study, two methylation microarray datasets were performed on the same platform (Illumina HumanMethylation27 Bead Chip) in order to eliminate platform diversity and using another methylation database to make the result more convincing.

Conclusion
In general, our study identified 9 aberrantly expressed-methylated hub genes in breast cancer significantly contributing to poor prognosis by bioinformatic analysis. Functional analysis and pathway enrichment analysis of those genes would provide novel ideas for further breast cancer research. 12 upregulated hub genes (TOP2A, MAD2L1, FEN1, EPRS, EXO1, MCM4, PTTG1, RRM2, PSMD14, CDKN3, H2AFZ, CCNE2) and 4 downregulated hub genes (EGFR, FGF2, BCL2, PIK3R1) might serve as diagnosis and poor prognosis biomarkers in breast cancer in the future by more research validation. Especially, EXO1 was identified as an individual unfavorable prognostic factor in this research.