Survival Risk Prediction Models of Gliomas Based on IDH and 1p/19q

Gliomas have been classified into different molecular subtypes based on their molecular features. To explore the prognostic factors of different subtypes of gliomas, we performed a univariate survival analysis based on the RNA-seq data of 653 patients obtained from The Cancer Genome Atlas. We identified 12205 (20.18%), 6125 (10.13%) and 5206 (8.61%) genes associated with the overall survival (OS) of the IDH-wildtype, IDH-mutation 1p/19q intact and IDH-mutation 1p/19q codeletion gliomas, respectively. Pathway enrichment analysis revealed that OS related genes were mainly involved in alcoholism, systemic lupus erythematosus, hematopoietic cell lineage and diabetes. The OS related genes were further selected using Lasso regression, and three prognostic risk score models were constructed to effectively predict the OS of the patients with different subtypes of gliomas. In total, 76 signature genes were identified and were selected to construct the three models. Moreover, neither of the 76 genes overlapped between different models, which suggested the enormous difference among the three subtypes, although some signature genes (SERPINA5, RP11.229A12.2 and RP11.62F24.2) were also identified as the OS related genes in different glioma subtypes. Interestingly, five genes (RP11.229A12.2, RP11.62F24.2, C3orf67, RP11.275H4.1 and TBX3) played opposing roles (protective or risk factor) in different subtypes. Additionally, the prognosis models consisted of a substantial proportion of non-coding RNA (58.74%, 70.13% and 58.11% in the IDH-wildtype, IDH-mutation 1p/19q intact and IDH-mutation 1p/19q codeletion). Furthermore, multivariate analysis integrating clinical variables demonstrated that risk group predicted by the prognostic models was an independent prognostic factor for gliomas. In conclusion, we have constructed and validated three models that have the potential to predict the prognosis of glioma patients. The genes and pathways identified in this study require further investigation for their underlying mechanisms and potential clinical significance in improving the OS of the glioma patients.


Introduction
Glioma is the most common malignant Central Nervous System (CNS) tumor, with a 5-year overall survival (OS) no greater than 35% [1]. Despite multiple therapy options, it is still challenging for researchers and doctors to improve the OS [2][3][4][5]. The reasons are complex but may lie in RNAs because gene expression profiling has been verified as a promising tool to classify tumors and predict the prognosis of cancer [6,7]. To find out the OS related genes, gliomas were classified into several molecular Ivyspring International Publisher subtypes because not only their prognosis differs stupendously between each other, but also the molecular alteration is an origin that causes glial or precursor cells evolving into different histological types [8,9].
The World Health Organization has reclassified glioma according to the IDH and 1p/19q status in 2016 [10]. In this study, we selected the genes associated with OS to establish three prognostic models by lasso regression to predict the survival risk of patients from The Cancer Genome Atlas (TCGA) with IDH-wildtype glioma, IDH-mutation 1p/19q intact gliomas and IDH-mutation 1p/19q codeletion gliomas, respectively. All of the three models are able to accurately predict the survival rate of 1 year, 3 years and 5 years. Therefore, the genes in these models are the potential factors that can influence the OS greatly. Also, these genes will be able to give insight into glioma and provide directions for subsequent research.

Data collection
All data of the 702 patients including RNA-seq data, clinical information and molecular features were obtained from TCGA (https://cancergenome.nih. gov/). The data collection was conducted in compliance with the publication guidelines and policies for the protection of human subjects provided by TCGA. Ensembl gene ID annotated RNA expression profiles were based on the V24 (hg19) of GENCODE (https://www.gencodegenes.org). Patients with the following conditions were eliminated in the study: (1) no IDH or 1p/19q molecular features (2) no survival information.

Identification of OS related genes
The analysis was carried out among 233 IDH-wildtype, 254 IDH-mutation 1p/19q intact and 166 IDH-mutation 1p/19q codeletion glioma samples using the "survival" and "survminer" packages in R/Bioconductor. All the total 60483 genes were analyzed one by one to perform a univariate survival analysis, and genes with P-value < 0.05 were selected for model establishment.

Lasso regression
Because genes affect each other can cause several colinear gene groups, we performed Lasso using the "lars" and "glmnet" packages in R/Bioconductor to reduce the collinearity influence and enhance the prediction accuracy and interpretability of the statistical mode. Cross-validation was used to select the regularization parameter. In order to quantify the risk of OS, a standard form of risk score (RS) for each patient was calculated combining the expression levels of the RNAs (Expi) and LASSO coefficients (Li), Risk score = ∑ ×

ROC and Kaplan-Meier curve
To investigate the performance of the prognostic classifier in predicting patient outcome, the area under the curve (AUC) of the receiver-operator characteristic (ROC) was calculated and compared. To validate the efficiency and repeatability of the models, we randomly divided the patients in each group into training set and validation set with a ratio of 6:4 by R.
To divide the patients into the high or low risk group in the training set, the best cutoff was determined with the "survival" and "survminer" packages in R/Bioconductor. Kaplan-Meier curves were plotted to estimate the survival status for patients in the high risk and the low risk group in both the training set and the testing set. In order to further assess the accurate role of the genes, heatmap was drawn using the R/Bioconductor package "ggplot2".

Cox regression
Whether the prognostic value of the multi-RNA-based classifier is independent of clinical features was assessed by multivariate Cox regression model.
All the statistical tests were done with R software version 3.5.1 and corresponding fundamental package. The figures were drawn with R software with or without subsequent modification by Adobe Illustrator CS6.

Enrichment analysis of KEGG pathways
Genes associated with OS were included in the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis using the WEB-based Gene SeT AnaLysis Toolkit (http://www.webgestalt. org). The hypergeometric test statistical method and the Benjamini-Hochberg multiple test adjustment method were used. All genes from human beings were used as reference. Top 10 pathways with at least 10 genes involved were considered significantly enriched (FDR < 0.05).

Results
The overall flowchart was summarized in Figure  1.

Clinicopathological features of patients in the TCGA
The clinicopathological features of 233 IDH-wildtype, 254 IDH-mutation 1p/19q intact and 166 IDH-mutation 1p/19q codeletion patients were listed in Table 1. The features of IDH-wildtype gliomas are quite different from that of the IDH-mutation gliomas, while the two types of IDH-mutation are similar to each other. Consistent with previous studies, IDH-wildtype patients were older than IDH-mutation patients at diagnosis (50-70 years old vs 20-50 years old), with a poorer prognosis and mainly consisted of glioblastoma and G4 instead of G2 and G3. IDH-mutation 1p/19q intact and IDH-mutation 1p/19q codeletion patients had no significant difference among age, grade, OS and gender, except the histologic types because IDH-mutation 1p/19q intact patients mainly harbored astrocytoma while IDH-mutation 1p/19q codeletion patients mainly harbored oligodendroglioma.

Identification of genes associated with OS
To begin with, we identified three sets of genes by performing univariate survival analysis using Cox Proportional Hazard Regression Model, with the threshold of P-value set as 0.05. The overlapping condition of the genes contained in the three sets is shown in Figure 2. Respectively, 12205 (20.18%), 6125 (10.13%) and 5206 (8.61%) genes were included in the IDH-wildtype, IDH-mutation 1p/19q intact and IDH-mutation 1p/19q codeletion set. The non-coding RNAs accounted for 58.74%, 70.13% and 58.11% of the OS related genes in the IDH-wildtype, IDH-mutation 1p/19q intact and IDH-mutation 1p/19q codeletion set respectively.  Figure 1. The overall workflow chart. 702 glioma samples with RNA-seq data were obtained from TCGA. After data screening, 49 samples were excluded. The remaining 653 samples were divided into three subtypes. Univariate survival analysis identified 12205, 6126, and 5206 candidate genes that were associated with the OS of IDH-wildtype, IDH-mutation 1p/19q intact and IDH-mutation 1p/19q codeletion gliomas, respectively. These genes were included in the KEGG pathway enrichment analysis. Three models were established using Lasso regression based on the OS-related genes. The models were validated in the testing sets.

Further genes screening and model establishment
We next screened genes associated with OS based on Lasso. Three sets of significant genes optimally predicting the survival risk score of patients with glioma were selected with cross-validation method ( Figure 3A, Figure 4A and Figure 5A) and three survival risk score systems based on the 76 genes were constructed (Table 2). Moreover, non-coding RNAs accounted for almost half in IDH-mutation types while less in IDH-wildtype.
Then we plotted time-dependent ROC curves for 1, 3 and 5 years. the AUCs were high enough to support the efficiency of the genes and models (IDH  Figure 4B and Figure 5B).  To investigate the relationship between RS and survival status of each group of the glioma patients, Kaplan-Meier analysis and log-rank test were conducted. The optimal cutoff was calculated with the methods described above and shown in Figure 3C, Figure 4C and Figure 5C. Obviously, patients with higher RS generally had a significantly worse OS than those with lower RS (all P-values of the three groups are less than 0.0001) ( Figure 3D, Figure 4D and Figure  5D). Furthermore, the ratios of high-risk patients and low-risk patients are 8.7, 0.3 and 0.1 in IDH wildtype, IDH mutation 1p/19q codeletion and IDH mutation 1p/19q intact patients respectively. The results corresponded to the recognition that prognosis of IDH mutation is much better than that of IDH wildtype.

Validation of the models in the TCGA
To test the models, we randomly divided all patients into training sets and testing sets at a ratio of 6:4. Within the training sets, RS cutoff ( Figure 6A, Figure 6D and Figure 6G), Kaplan-Meier analysis and log-rank test were made for each subtype. Obviously, patients with higher RS generally had a significantly worse OS than those with lower RS (p<0.0001; Figure  6B, Figure 6E and Figure 6H). Additionally, the RS cutoff was calculated in the training set, and Kaplan-Meier analysis was performed for both the training and testing sets to evaluate whether the models were powerful enough to separate patients into high-risk group and low-risk group. Consequently, all the subtypes' models were good enough to separate high-risk group and low-risk group apart from each other in the testing set ( Figure  6C, Figure 6F and Figure 6I).

Gene function inversion in different subtypes
Surprisingly, there are no overlapping genes between different models, which suggested the enormous difference among the three types. The overlapping condition among genes in model of one subtype and genes related to OS of the other two subtypes was shown in Figure 7. SERPINA5 in IDH-mutation 1p/19q codeletion type, RP11.229A12.2 and RP11.62F24.2 in IDH-mutation 1p/19q intact type were also identified as OS related genes of other two subtypes. Additionally, the heatmap also showed that a protective or risk factor of a glioma subtype can sometimes play a different or even an opposing role in other subtypes of gliomas. Genes with reverse function (changing from protective factor to risk factor or from risk factor to protection factor) included RP11.229A12.2, RP11.62F24.2, C3orf67, RP11.275H4.1 and TBX3. The Kaplan-Meier curves were plotted for the five genes in Figure 8.

Prognostic value of the models is independent of clinical features
To assess whether the models represent an independent indicator in glioma patients, the effect of each clinicopathologic feature on survival was analyzed by Cox regression. As shown in Table 3, in the univariate analysis, gender was not a powerful factor in all the three subtypes; histology was not powerful in IDH-mutation types but powerful in IDH-wildtype; age was not a powerful factor in 1p/19q intact type but powerful in the others. Both grade and risk score were powerful in all the three subtypes. Furthermore, hazard ratio (HR) of risk score was the highest (IDH wildtype 33.72 (14.13-80.45), P-value = 2.20E-15, IDH-mutation 1p/19q codeletion 18.07 (6.928-47.13), P-value = 3.28E-9 and IDH-mutation 1p/19q intact 10.59 (6.319-17.74), P-value < 2E-16).
After multivariable adjustment, the risk scores of these models remained powerful and independent. Except for age in IDH-wildtype and 1p/19q codeletion types which remained powerful, other factors, even grade was not identified as powerful or independent. The HRs of risk score calculated by the three models (IDH wildtype 125.7044 (32.1569-491.390), P-value = 3.67E-12, IDH-mutation 1p/19q codeletion 7.590 (2.4850-23.181), P-value = 0.00037 and IDH-mutation 1p/19q intact 10.6033 (6.1490-18.284), P-value < 2E-16) were also much higher than other factors, which illustrates the efficiency of the three models.

KEGG pathway enrichment analysis
To identify the key pathways associated with OS, we performed KEGG pathway enrichment analysis for the three sets of OS related genes (Table  4). Cell-energy related pathways, GTP and cAMP were included. This may illustrate that energy requirement of cancer cells was related to the OS. Surprisingly, the genes of IDH-mutation were enriched in alcoholism, systemic lupus erythematosus, hematopoietic cell lineage and diabetes. This might give a cue of the connection among these diseases and might represent the relationship between unhealthy lifestyle and the OS of glioma patients.

Discussions
In this study, we identified 76 genes to establish three risk-score models that can effectively predict the overall survival of patients with IDH-wildtype, IDH-mutation 1p/19q intact and IDH-mutation 1p/19q codeletion gliomas respectively. The genes identified by univariate survival analysis are enriched in alcoholism, systemic lupus erythematosus and hematopoietic cell lineage. Recent studies suggested that alcoholism may share a common pathway with glioma through proteasome and oxidative stress, and disulfiram (a drug used for the treatment of alcoholism for decades) has been proven effective in glioma treatment [11,12]. As for the systemic lupus erythematosus, more studies are needed to investigate its relationship with glioma [13]. IDH mutation can promote leukemogenesis, and hematopoietic cell lineage was suggested in gliomas in previous studies [14][15][16][17]. However, whether glioma patient with these diseases has a short OS and the underlying mechanisms require further studies. Additionally, the RAGE pathway in diabetes and diabetes itself and insulin usage are proved associated with glioma [18,19].
Further studies are required for the 76 genes in the models to explore their roles in affecting the OS of patients with glioma, as only a small part of them has been reported to be related to the OS of cancer. Additionally, the non-coding RNA accounted for almost half of the 76 genes but has seldom been studied in glioma, and they should be attached enough attention for their increasing importance in glioma. Some non-coding RNAs have been verified to play roles in glioma [20][21][22], and therapies targeting non-coding RNAs have become promising in glioma [23,24]. Molecular pathways regulating the expression of protective RNAs or risk RNAs might provide potential therapeutic targets to improve the prognosis of glioma patients.
Interestingly, several genes changed from protective factors to risk factors in different glioma types and vice versa. For example, TBX3 is associated with a number of cancers, including head and neck squamous cell carcinoma, gastric, breast, ovary, cervical, pancreatic, bladder, liver cancers and melanoma [25]. Its special stemness function and role as a microenvironment-dependent and epigenetically regulated lineage-commitment factor makes it a risk factor in cancers [25,26]. Besides, it has been related to the hormonal therapy resistance in breast cancers [27]. In this study, TBX3 is a protective factor in IDH-wildtype but a risk factor in IDH-mutation 1p/19q codeletion gliomas. These opposing roles may suggest a complex underlying mechanism for the pathway regulation in gliomas [28]. Although we have associated these genes with different molecular features of gliomas, such as IDH and 1p/19q, further studies are needed to explore this phenomenon.
One limitation of this study is that we did not validate the models in other databases because no studies with qualified data for a reliable validation were available. With the rapid progress of the next-sequencing applied in cancer, the robustness of the models would be further validated in the future.