Identification of key genes and pathways associated with esophageal squamous cell carcinoma development based on weighted gene correlation network analysis

Background: As one of the most aggressive malignancies, esophageal squamous cell carcinoma(ESCC) remains one of the leading causes of cancer related death worldwide. The majority of ESCCs are diagnosed at advanced stages with poor five-year survival rate, making it urgent to identify specific genes and pathways associated with its initiation and prognosis. Materials and Methods: The differentially expressed genes in TCGA were analysed to construct a co-expression network by WGCNA. Gene ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways analysis were performed for the selected genes. Module-clinical trait relationships were analyzed to explore the genes and pathways that associated with clinicopathological parameters of ESCC. Log-rank tests and COX regression were used to identify the prognosis-related genes. Results: The brown module containing 716 genes which most significantly contributed to ESCC. GO analysis suggested enrichment of adaptive immune response, cyclin-dependent protein serine, regeneration and mRNA metabolic process. KEGG analysis indicated pathways including Cellular senescence, Ribosome biogenesis, Proteasome, Base excision repair and p53 signaling pathway. Clinical stage was associated with cyan module; clinical M was associated with grey60 module; clinical T was associated with darkturquoise module; while clinical N, histological type and cancer location were associated with turquoise module. Key genes of TCP1, COQ3, PTMA and MAPRE1 might be potential prognostic markers for ESCC. Discussion: Differentially expressed genes and key modules contributing to initiation and progression in ESCC were identified by WGCNA. These findings provide novel insights into the mechanisms underlying the initiation, prognosis and treatment of ESCC.


Introduction
As one of the most aggressive malignancies, esophageal cancer remains one of the leading causes of cancer related death worldwide [1,2]. Among the two main histological types of esophageal cancer including esophageal squamous cell carcinoma (ESCC) and esophageal adenocarcinoma (EAC), ESCC represents the dominant subtype in Asians and EAC incidence is higher in Western countries [3]. These two subtypes have different pathogenesis, biology features and clinical outcomes [4]. At present, the majority of ESCCs are diagnosed at advanced stages, the five-year survival rate of which still remains unfavourable with frequently metastasis and recurrence [5,6]. Therefore, comprehensive researches Ivyspring International Publisher are urgently needed to fully clarify critical molecular mechanisms related to the initiation, progression and prognosis of ESCC.
It has been widely accepted that alcohol drinking and tobacco consumption are certain risk factors for ESCC with synergistic effects [7,8]. There is also evidence suggesting that possible association of human papillomavirus (HPV) with ESCC risk according to meta-analysis [9]. Studies have found that polymorphisms in aldehyde dehydrogenase 2 (ALDH2) gene also lead to the development of ESCC especially in Asian populations [10][11][12]. Multiple studies including high-throughput analysis revealed that genes and pathways involved in ESCC were related with cell cycle, differentiation, and Epidermal Growth Factor Receptor signalling [13]. In addition, epigenetic alterations including DNA methylation of such as APC, RB1 and CDKN2A, histone modification, and loss of genome imprinting also contribute to ESCC [14][15][16].
Although a number of genes and mechanisms have been proved to be closely implicated in the development of ESCC, the comprehensive picture of the whole genes and regulations of ESCC is still unclear. In recent years, bioinformatic methods become increasingly effective in exploration and analysis of multiple genes or proteins of complicated diseases. Weighted gene co-expression network analysis (WGCNA), a new gene co-expression network-based method, has been successfully used to screen biomarkers and pathways that could be applied in susceptibility genes, diagnose and treatment of cancer [17]. In this study, WGCNA was conducted to analyse data of TCGA data repository of ESCC to identify gene modules and biomarkers (hub genes) implicated in the pathogenesis, progression and prognosis of ESCC.

Acquiring and preparing genetic and clinical data
The RNA sequencing and clinical data of esophageal squamous cell carcinoma patients were downloaded from TCGA data repository (https://cancergenome.nih.gov/). The level of gene expression was measured as fragments per kilobase of transcript per million mapped reads (FPKM). Clinical data included the sample type, clinical TNM stage, histologic grade, cancer position and survival information. Samples with incomplete pathologic stage or histologic grade information were not included. As genes with little variation in expression usually represent noise, the most variant genes were filtered for network construction. Gene variabilities were measured by median absolute deviation (MAD).

Constructing gene co-expression network
Gene co-expression network was constructed by the WGCNA package in R and was visualized by Cytoscape software [18]. Power values were screened out by WGCNA algorithm in the construction of co-expression modules. Scale independence and average connectivity analysis of modules with different power value were performed by gradient test (power value ranging from 1 to 20). Appropriate power value was determined when the scale independence value was equal to 0.9. WGCNA algorithm was then used to construct the co-expression network and extract the gene information in the most relevant module. The criterion of co-expression weight> 2 was used to select the candidate network. Heatmap tool package in R language was adopted to describe the strength of the relationship among modules (strong or weak degree).

Relating modules to cancer risk and identifying the prognosis related genes
One of the advantages of WGCNA is that the correlation between modules and clinical parameters can be analyzed. The module eigengene (ME), which can be regarded as a representative of the gene expression profiles from a module, is defined as the first principal component of a given module. Given that the ME can summarize the gene expression profiles, we calculated the correlation between MEs and external sample type data. The gene in the most relevant module was chosen as the risk-related gene. Log-rank tests and COX regression were used to select the prognosis-related genes in the risk-related genes. R package survival was used to carry out log-rank tests and survminer package was used to plot Kaplan-Meier survival curves. In order to validate our findings, microarray dataset of GSE53625 from Gene Expression Omnibus (GEO, https:// www.ncbi.nlm.nih.gov/geo/) in NCBI (The National Center for Biotechnology Information) containing 179 esophageal squamous cell carcinoma patients with prognosis information were used to confirm our results.

Gene ontology and pathway Enrichment analysis
To explore the potential biological themes and pathways of genes in risk-related module, the clusterprofiler package in R was used to annotate and visualize gene ontology (GO) terms [19] and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [20].

Screening for candidate module with clinical significance and KEGG analysis
We calculated the correlation between MEs and external clinical data. p <0.05 suggested significant correlation. We selected the most relevant module in each clinical data for analysis. KEGG pathway analysis was used to enrich the related genes.

Gene co-expression network of ESCC
Clinical and level-3 RNA sequencing data for 162 esophageal squamous cell carcinoma samples and 11 normal samples were obtained from TCGA database. For module detection, the 15121 most variant genes were selected according to MAD value for further analysis. When the value of soft thresholding power β was 4, the connectivity between genes met a scale-free network distribution (Fig. S1). Altogether 21 modules were identified by hierarchical clustering and the Dynamic branch Cutting. Each module was assigned a unique color as an identifier. Interaction relationship analysis of co-expression genes was shown in Figure  1. The number of genes in modules ranged from 34 to 2353. The grey module represented a gene set that was not assigned to any of the modules.

The risk related module in the WGCNA algorithm
We then use the cancer type as the Clinical phenotype to select the most relevant module with the occurrence of ESCC. Finally, the brown module was selected (Fig. S2). The brown module contained 716 genes. Then we constructed the co-expression network with the criterion of co-expression weight >2. As was shown in Figure 2, a total of 161 genes were selected and their multiple interactions were visualized. The size represents the connection numbers and the color represents the fold change compared by the cancer and normal samples.

Enrichment analysis of the brown module
GO and KEGG enrichment analysis was performed on the genes in the constructed cancer-risk-related module. Altogether 984 terms showed difference in GO enrichment ( Table 1). As was illustrated in Figure 3, this module in GO analysis was related with biological process of regulation of adaptive immune response, regulation of cyclin−dependent protein serine, regeneration and positive regulation of mRNA metabolic process. According to the KEGG analysis, 19 pathways were associated with brown module including Cellular senescence, Ribosome biogenesis in eukaryotes, Proteasome, Base excision repair and p53 signaling pathway.

Prognosis analysis of the brown module
Since the brown module were selected from the modules related to patients' cancer type. It was of great significance to evaluate the potentiality of them to serve as prognostic biomarkers. Among the 716 genes, 39 genes were associated with the prognosis of esophageal squamous cell carcinoma ( Table 2). As was summarized in Figure 4, the top four up-regulated genes which indicated a poor prognosis included TCP1 (HR=2 Altogether 179 esophageal squamous cell carcinoma patients with prognosis information of microarray GSE53625 were used to validate our results. As was shown in Table 4 (Table 4).

Module-clinical trait relationships
Identifying genes associated with a certain clinical trait is of great value to explore the molecular mechanisms behind the trait. In the present study, the clinical parameters of ESCC patients, including clinical T, clinical N, clinical M, clinical stage, recrudescence stage, histological type and cancer location were involved in the module-trait relationship (MTR) analysis. As was suggested in Figure 5, clinical stage was associated with cyan module; clinical M was associated with grey60 module; clinical T was associated with darkturquoise module; while clinical N, histological type and cancer location were associated with turquoise module. No module correlated with recrudescence.

KEGG analysis of clinical related module
We then performed KEGG pathway analysis of the candidate module. The results indicated that genes of cyan module were enriched in Cell cycle; grey60 module was associated with Ribosome; darkturquoise module was related with RNA transport; turquoise module was associated with Fat digestion and absorption, Tight junction, Maturity onset diabetes of the young, Protein digestion and absorption and Fructose and mannose metabolism (Table 3).  Fructose and mannose metabolism 11/607 3.566E-05 11 Figure 5. The module-clinical trait relationships of genes involved in ESCC.

Discussion
Although significant progresses have been made in studies concerning the risk and development of ESCC, our understanding of the complex mechanisms of ESCC is still limited. In this study, we conducted a differential expression analysis followed by WGCNA to detect genes and pathways associated with the occurrence, clinical parameters and prognosis of ESCC. Finally, key modules were identified which contribute to the risk (brown module) and clinicopathological parameters (cyan module, grey60 module, darkturquoisemodule and turquoise module) of ESCC. In addition, enrichment analysis of the genes in core modules suggested significant involvement of pathways such as Cellular senescence, Ribosome biogenesis in eukaryotes, Proteasome, Base excision repair and p53 signalling pathway.
In this present study, RNA sequencing data for 162 esophageal squamous cell carcinoma samples and 11 normal samples from TCGA were systematically analyzed. According to the WGCNA analysis of most variant genes, altogether 21 modules were identified and each module was assigned a unique color as an identifier. The number of genes in modules ranged from 34 to 2353. The brown module containing 716 genes was selected when we used the cancer type as the Clinical phenotype, indicating its close involvement in the development of ESCC. GO enrichment analysis were then performed on the genes in the constructed cancer-risk-related module, suggesting significance of biological process including adaptive immune, cyclin−dependent protein serine, regeneration and mRNA metabolic process. It has been reported that immune response and immune escape might play a critical role in ESCC progression and therapy [21,22]. In addition, key mRNA metabolic procedure such as alternative splicing has also been detected in ESCC. Alternative splicing isoforms of LOXL2, VIL2, OSMRβ and MUC1 have been reported to contribute to ESCC development and progression [23][24][25][26]. From this point of view, the specific role of immune response and altered mRNA metabolic process in ESCC require future investigations to elucidate. As for key pathways of KEGG enrichment, aberrant base excision repair capacity and altered p53 signaling pathways have been found to be associated with ESCC development by a number of studies [27,28]. Pathways of cellular senescence, ribosome biogenesis in eukaryotes and proteasome might be novel research directions for physicians and surgeons for ESCC.
The five-year survival rate of ESCC currently remains unsatisfactory despite recent improvements in treatments of surgical resection and adjuvant chemotherapy. It was therefore of great significance to evaluate the potentiality of key factors to serve as prognostic biomarkers. We eventually screened 39 prognosis-related genes in ESCC after analyzing the risk-associated brown module. Aberrant TCP1 gene status has been detected in hepatocellular carcinoma, breast cancer and colorectal cancer, but no study has been performed for its effect in ESCC [29][30][31]. MTHFD2 RNA and protein are remarkably increased in many types of cancers and correlated with worse survival in breast cancer according to a comprehensive analysis of RNA profiles of 1,454 metabolic enzymes across 1,981 tumors spanning 19 cancer types [32]. Although some of the prognosis-related genes we identified have been investigated in various types of cancers, limited information was known about their roles in ESCC until now. One limitation of this study is that most results were generated by analysis of publicly available data, further studies based on larger populations and molecular mechanism researches are therefore needed to confirm the results of our study.
In order to explore the genes and modules that associated with clinicopathological parameters of ESCC, we also performed module-clinical trait relationships analysis. We finally detected specific modules which determine the clinical stage, TNM classification, histological type and cancer location. Pathway analysis of the selected modules indicated that these genes enriched in Cell cycle, Ribosome, RNA transport, Fat digestion and absorption, Tight junction, Maturity onset diabetes of the young, Protein digestion and absorption and Fructose and mannose metabolism. Remodeling of energetic metabolism is a hallmark of malignant tumor, by which tumor cells changes themselves and the environment into more suitable conditions for growth and invasion [33]. According to our findings, aberrant metabolism of carbohydrate, fat and protein might be closely related to the progression of ESCC. In addition, alternations in functions of ribosome, RNA transport might also be promising research directions for ESCC.
In summary, differentially expressed genes and key modules contributing to initiation and progression in ESCC were identified by means of WGCNA. Pathways including cellular senescence, ribosome biogenesis in eukaryotes, Proteasome, base excision repair, fat digestion and absorption and p53 signalling might be closely related with ESCC development. Key genes of TCP1, COQ3, PTMA and MAPRE1 might be potential prognostic markers for ESCC. These findings provide novel insights into the mechanisms underlying the etiology, prognosis and treatment of ESCC.