Novel biomarkers and prediction model for the pathological complete response to neoadjuvant treatment of triple-negative breast cancer

Objective: To develop and validate a prediction model for the pathological complete response (pCR) to neoadjuvant chemotherapy (NCT) of triple-negative breast cancer (TNBC). Methods: We systematically searched Gene Expression Omnibus, ArrayExpress, and PubMed for the gene expression profiles of operable TNBC accessible to NCT. Molecular heterogeneity was detected with hierarchical clustering method, and the biological profiles of differentially expressed genes were investigated by Gene Ontology, Kyoto Encyclopedia of Genes and Genomes analyses, and Gene Set Enrichment Analysis (GSEA). Next, machine-learning algorithms including random-forest analysis and least absolute shrinkage and selection operator (LASSO) analysis were synchronously performed and, then, the intersected proportion of significant genes was undergone binary logistic regression to fulfill variables selection. The predictive response score (pRS) system was built as the product of the gene expression and coefficient obtained from the logistic analysis. Last, the cohorts were randomly divided in a 7:3 ratio into training cohort and validation cohort for the introduction of a robust model, and a nomogram was constructed with the independent predictors for pCR rate. Results: A total of 217 individuals from four cohort datasets (GSE32646, GSE25065, GSE25055, GSE21974) with complete clinicopathological information were included. Based on the microarray data, a six-gene panel (ATP4B, FBXO22, FCN2, RRP8, SMERK2, TET3) was identified. A robust nomogram, adopting pRS and clinical tumor size stage, was established and the performance was successively validated by calibration curves and receiver operating characteristic curves with the area under curve 0.704 and 0.756, respectively. Results of GSEA revealed that the biological processes including apoptosis, hypoxia, mTORC1 signaling and myogenesis, and oncogenic features of EGFR and RAF were in proactivity to attribute to an inferior response. Conclusions: This study provided a robust prediction model for pCR rate and revealed potential mechanisms of distinct response to NCT in TNBC, which were promising and warranted to further validate in the perspective.


Background
Triple-negative breast cancer (TNBC) is a special subtype of breast cancer, of which accounts for around 15% proportion, and marked by the absent expression of estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) [1]. The biological pro les of TNBC tend to be aggressive, which is characterized by the early relapse and distant metastases in addition to the inferior prognosis with 5-year survival rate less than 30% [2]. Systemic treatment has been considered as the mainstay, which cytotoxic agents are widely applied in the overall course of therapeutics for TNBC.
Considering the acknowledged aggressiveness, the neoadjuvant chemotherapy (NCT) has been recommended for TNBC within a broader range, in comparisons with the other subtypes, and pathological complete response (pCR) rate was used to assess the e cacy in associations with a better prognosis of TNBC undergone NCT [3].
Although systemic treatment is the essential composition of the therapeutic introduction for triplenegative breast cancer, the pCR rate is just 35-40% of the whole group of patients based on the application of standard-of-care protocols. Besides, several efforts have been put in the eld of pCR prediction including both clinical and experimental management, however, except for a few predictors such as BRCA1 de ciency for TNBC treated with platinum-containing NCT, no e cient biomarkers have been yet generally recommended [4][5][6][7][8][9]. This dilemma could be the result of molecular heterogeneities of TNBC. Previous studies have managed to explore the heterogenous subtypes of TNBC and propose different classi cations with distinct biological pro les, which some speci c subtypes tend to chemosensitive to neoadjuvant therapy [10,11]. Additionally, current ndings remain still in the translational or even experimental phases, indicating the ful llment of general application in clinical practice is challenging.
Recently, the dissection of multi-omics data through bioinformatic tools have been used to provide precise insight on the cancer biology and novel parameters for cancer therapy [12,13]. Howbeit, it is crucial to systematically integrate the molecular data and clinicopathological characteristics for the broad application. Herein, we carried out this study to construct and validate a prediction model for the pCR rate of neoadjuvant chemotherapy for TNBC patients, with the aim of the accumulation of solid evidence for clinical practice and the potential survival bene t.

Datasets selection and preprocessing
The gene expression pro les of operable TNBC accessible to NCT were systematically searched on Gene Expression Omnibus (GEO), ArrayExpress, and PubMed. Cohorts datasets were included if adopted operable TNBC patients, complete clinicopathological information, and de nite clinical e cacy of neoadjuvant treatment. Raw gene expression matrix and supplementary materials were downloaded and data from TNBC patients were identi ed through subtype selection. The multiple gene panel corresponding to a single probe were divided into individuals, and only the maximum expression values of genes was preserved for the following analysis. Each cohort dataset was independently processed and further collected with the removal of batch effect using R package limma (version 3.42.2) [14].

Exploration of subtypes and biologic pro les
To identify the TNBC subtypes with the potentially distinct response to therapeutics, the consensus clustering method using R package ConsensusClusterPlus (version 1.50.0) was performed based on hierarchical clustering algorithms. The differential expressed genes (DEGs) were retrieved using R package limma, which the absolute of fold change (FC) more than 1 and P value adjusted by Benjamini-Hochberg method less than 0.05 were considered as the criteria for signi cant DEGs. Biological pro les of both up-regulated and down-regulated genes were respectively elucidated through enrichment analyses with the basis of Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) using Cytoscape software (version 3.8.0). The STRING database (http://string.db.org/) was utilized to clarify the interactive correlations among DEGs with protein-protein interaction (PPI) networks established and further visualized by Cytoscape software.

Identi cation of predictive biomarkers
The retrieved DEGs were prepared for machine-learning algorithms to facilitate the dimensionally reduction and selection of features with the simultaneous performance of random-forest analysis and least absolute shrinkage and selection operator (LASSO) analysis using R package RandomForest (version 4.6-14) and glmnet (version 4.0-2), respectively [15]. Then, the signi cant genes identi ed from the couple methods were collected for the intersected proportions. Next, binary logistic regression analysis was carried out to discover the genes with the predictive values for pCR rate using R package survival (version 3.2-3). To quantitatively assess the promising values of biomarkers, the predictive response score (pRS) was de ned as: The exprsk and coefk was the gene expression and regression coe cient for DEGs of which odds ratio (OR) more than 1, while exprsi and coe was for the genes of which OR less than 1. The pRS of each patient was calculated and recorded for the following analysis.

Construction and validation of prediction model for pCR rate
The included datasets were randomly in a 7:3 ratio divided into training cohort and validation cohort. Subsequently, the predictive model was constructed in training cohort with the combination of pRS and the clinicopathological factors through logistic regression analysis, and the characteristics with independent predictive values for pCR rate were adopted for nomogram using R package rms (version 6.0-0). Receiver operating characteristic (ROC) curves with the calculated area under curve (AUC), using R package pROC (version 1.16.2) were utilized to validate the discriminative power of this model, while calibration plot was adopted for calibrating capability.
Quest for promising clinical response Eligible patients were classi ed into high pRS group and low pRS group with the borderline of median value, which was followed by differential analysis to investigate the potential mechanisms of distinct response. After the DEGs were identi ed, the Gene Set Enrichment Analysis (GSEA) were performed with 1000 permutations referred to gene sets of h: hallmarks (h.all. v7.1. symbols) and c6: oncogenic signatures (c6.all.v7.1. symbols) downloaded from Molecular Signatures Database (MSigDB, https://gsea-msigdb.org/) using R package ClusterPro ler (version 3.14.3).

Statistical analysis
In this study, statistical tests were two-sided and P value less than 0.05 was considered as signi cance. All the statistical analyses were accomplished by SPSS (version 26.0) and R software (version 3.6.4).

Results
A total of eighteen cohort datasets were initially retrieved, and 217 TNBC patients from four GEO datasets (GSE32646, GSE25065, GSE25055, GSE21974) were eligible with 193 patients nally adopted. The procedure of population selection and ow diagram was presented in Figure S1, and the detailed information were listed in Table S1 and Table S2. Three phenotypes and differential signatures of TNBC On the basis of gene expression pro les, three stable phenotypes of TNBC subject to neoadjuvant chemotherapy were identi ed (Fig. 1a), with the assumption that the response to therapeutics and clinical evolvements were intrinsically heterogenous. Differential analysis was performed to clarify the potential mechanisms attributed from DEGs, and there were 1487 DEGs identi ed including 326 up-regulated genes and 381 down-regulated genes with statistical signi cance ( Fig. 1b; Table S3). Among this proportion of DEGs, SMARCC2, VHL, KANSL3, SOX3, and NPAS3 were the utmost signi cant up-regulated genes, while SYN2, FGF4, ADAM21, KLF7, and PPP1R11 were the foremost down-regulated genes.
To illustrate the biologic pro les of the DEGs, the signi cant up-regulated genes and down-regulated genes were undergone enrichment analyses, respectively. Results from GO analysis demonstrated that the up-regulated DEGs were remarkably mapped to the GO terms including outer membrane, lymph vessel morphogenesis, regulation of mitochondrial organization and the G protein-coupled receptor signaling pathway, and the downregulated DEGs were substantially enriched in plasma membrane region, cell part morphogenesis and positive regulation of protein kinase B signaling pathway (Fig. 2a). Results of KEGG analysis suggested that the up-regulated DEGs were signi cantly functional at the biologic pathways of focal adhesion and glucagon signaling, and the downregulated DEGs mainly ensembled in the biologic pro les of regulation of action skeleton and GABAergic synapse (Fig. 2b). The interactive correlations among functional products of gene expression were measured in degree, and visualized in circular output as protein-protein interactive networks (Fig. 2c). It was evident that the core functional expression lay in the panels comprising CASP8, ITGA1, ITGAX, PAK1, and EXOSC10.
Predictive biomarkers and score system To retrieve the most signi cant variables with predictive values, we dimensionally reduced the volume and selected the foremost characteristics based on machine-learning algorithms, which random forest analysis and Lasso regression analysis were concurrently carried out. There were 131 genes and 94 genes were respectively chosen, and 22 genes in total intersected which were considered to carry the foremost predictive power for pCR rate (Fig. 3a-c). To further achieve the shrinkage of variables, the binary logistic regression analysis was performed toward each gene, of which the results revealed a total of 6 genes were statistically signi cant ( Fig. 4a; Table S4). Next, the pRS of each patient was calculated as the product of gene expression and the corresponding coe cient obtained from the logistic analysis.
ROC curve was used to depict the predictive power of this score system and the computational AUC was 0.696 (Fig. 4b).
Prediction model for pCR rate Populations were randomly divided in a 7:3 ratio to training cohort and validation cohort to establish and validate model. The integrity of characteristics consisted of clinicopathological factors, including age at diagnosis, clinical stage, histopathological grade, tumor size, nodal status, and pRS was adopted to construct the predictive model for pCR rate of neoadjuvant chemotherapy. Through binary logistic regression analysis, the signi cant variables were identi ed, which pRS (P < 0.0001) and tumor size (P = 0.024) were in statistical signi cance to this predictive value (Table S5). Then, nomogram was established and validated through ROC curve for the discrimination power and calibration curve for the calibrating capability, respectively. The AUC of ROC curves from training cohort was 0.704 and from validation cohort was 0.756, while the calibration curve presented a slope in around 45° angle.
Collectively, this predictive model for the pCR rate was well-performed.

Potential mechanisms of distinct response
To further investigate the potential mechanisms of distinct response to neoadjuvant therapy in TNBC, differential analysis was performed between high pRS group and low pRS group, and followed by enrichment analysis based on the rendered gene sets. A total of 98 genes were considered as statistically signi cant, which GIT2, DYNC1H1, FN1, CNPY2, PTPN11 were evidently up-regulated, and AFF4, FGD2, MYO16, GLS2, CCDC132 were of the foremost down-regulated proportions (Table S6).Results of GSEA revealed that the biological processes associated with apoptosis, hypoxia, mTORC1 signaling, and myogenesis were up enriched, while the oncogenic signatures including EGFR and RAF were up regulated and CTIP and RELA were downregulated.

Discussion
Overall, this study identi ed a prediction score system curated from a six-gene panel, and constructed a predictive model adopting clinicopathological characteristics for pCR rate of NCT in TNBC. Given the distinct response to systemic therapy, potential mechanisms were investigated and the promising signatures were identi ed.
The heterogeneity of molecular features in TNBC has been long investigated, and it is well recognized that this kind of divergence could result in distinct clinical pro les and survival outcomes [10,11,16]. With the consideration of molecular heterogeneities, we identi ed three distinct subtypes of TNBC with the receipt of NCT through consensus clustering method, and further investigated the genomic difference based on gene expression pro les. Then, enrichment analyses were carried out to comprehensively explore the cellular component, molecular functions, and biological process. Both the up-regulated genes and down-regulated genes were undergone analyses, respectively, of which the results revealed that the increased proactivity of mitochondrion organization, lymph vessel morphogenesis, and focal adhesion could attribute to this kind of molecular heterogeneity. The core functional group of genes was presented in PPI network, suggestive of the signi cant roles in the biological course.
To facilitate the introduction of a practical prediction model, we conducted analyses on the basis of random forest analysis and Lasso analysis to retrieve the intersected proportion for dimensionally reduction and features selection, which has been widely used for characteristics of cancer diagnosis and therapy [17][18][19]. After this group of genes identi ed, the respective prediction values for pCR was successively evaluated by logistic analysis, and the a six-gene panel comprising ATP4B, FBXO22, FCN2, RRP8, SMERK2, TET3 were nally recognized. With the application of mathematic formula, pRS system was quantitatively established with a decent value for pCR rate. To optimize this prediction method, the clinicopathological characteristics were integrated and assessed the predictive values to select the valuable multi-variables, which pRS and clinical stage of tumor size was ultimately determined.
With the selected gene panel and variables, the nomogram which was considered as an intuitive method was constructed to quantitatively assess the predicted pCR rate in TNBC [20]. Results from validation analyses was suggestive of the well performance of this model and the rationale of broad application in clinical practice. Several have managed to establish prediction models for pCR of NCT in TNBC which was estimated to perform well in clinical practice [21][22][23]. However, most of them basically focused on the characteristics from imaging or laboratory indexes. This nomogram took full consideration of the molecular heterogeneities detected from transcriptome information and clinicopathological characteristics to provide practical prediction of pCR rate for TNBC patients planned to NCT. Besides, we also discussed the potential mechanisms of bene t and non-bene t phonotypes in order to provide promising evidence for practice. Biological processes including apoptosis, hypoxia, mTORC1 signaling and myogenesis, and oncogenic features of EGFR and RAF were in proactivity to attribute to an inferior response. In fact, the potential triggers of distinct response to NCT remain undetermined, considering, these ndings to some extent enlighten the quest for improvement of e cacy, which were in accordance with the previous studies [24][25][26][27]. Current trials have exerted efforts and assess the e cacy of this kind of targeted therapies in TNBC [28,29], however, these results were controversial and remained to be updated through randomized controlled trials with a large sample cohort.
Indeed, there were some inevitable limitations of this study. Firstly, this was a retrospective analysis with the adoption of the identi ed datasets from publicly databases, and the heterogeneities among populations from different cohorts could not be removed. Secondly, a few characteristics, such as the Ki67 index and therapeutic protocols, could not be taken into considerations due to the lack of records in publicly available database, which could potentially weaken the prediction power of this model. Thirdly, the cutoff value of pRS was determined as the median which was practical yet less precise, which was necessary to be validated and optimized. Last, both experimental and clinical research are supposed to conducted and validate these ndings obtained this study.

Conclusion
In conclusions, our study established a robust prediction model based on the transcriptome signatures and clinicopathological features, taking molecular heterogeneities into consideration, for the pCR rate, and discussed the potential mechanisms of distinct response to NCT in TNBC, which were promising for the exploration of novel therapeutics. This prediction model is warranted to be further applied and validated among large-scale cohorts in the upcoming future.   Figure S1. Flow diagram of selection and identi cation of cohort datasets.

Figure 1
Identi cation and differential analysis of triple-negative breast cancer subtypes. (a) Three stable phenotypes of TNBC subject to neoadjuvant chemotherapy were identi ed through consensus clustering algorithms based on hierarchical clustering method. (b) A total of 1487 differentially expressed genes were recognized by differential analysis, which included 326 up-regulated genes and 381 down-regulated genes with statistical signi cance.    Exploration of the mechanisms for distinct pCR rate. (a) Differential analysis was performed between high pRS group and low pRS group. (b) Gene set Enrichment Analysis were conducted with the reference to gene lists of cancer hallmarks signatures and oncogenic signatures.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.