A Stemness and EMT Based Gene Expression Signature Identifies Phenotypic Plasticity and is A Predictive but Not Prognostic Biomarker for Breast Cancer

Aims: Molecular heterogeneity of breast cancer results in variation in morphology, metastatic potential and response to therapy. We previously showed that breast cancer cell line sub-groups obtained by a clustering approach using highly variable genes overlapped almost completely with sub-groups generated by a drug cytotoxicity-profile based approach. The two distinct cell populations thus identified were cancer stem cell (CSC)-like and non-CSC-like. In this study we asked whether an mRNA-based gene signature identifying these two cell types would explain variation in stemness, EMT (epithelial to mesenchymal transition), drug sensitivity, and prognosis in silico and in vitro. Main methods: In silico analyses were performed using publicly available cell line and patient tumor datasets. In vitro analyses of phenotypic plasticity and drug responsiveness were performed using human breast cancer cell lines. Key findings: We identify a novel gene list (CNCL) that can generate both categorical and continuous variables corresponding to the stemness/EMT state of tumors. This robust gene signature unites previous observations related either to EMT or to stemness in breast cancer. We show in silico that this signature perfectly predicts the behavior of tumor cells tested in vitro and can reflect tumor plasticity. We thus demonstrate, for the first time, that breast cancer subtypes are sensitive to either Lapatinib or Midostaurin. The same gene list is not capable of predicting prognosis in most cohorts, except for one that includes patients receiving neo-adjuvant taxane therapy. Significance: CNCL is a robust gene list that can identify both the stemness and the EMT state of cell lines and tumors. It can be used to trace tumor cells during the course of the phenotypic changes they undergo, which result in altered responses to therapeutic agents.
The fact that such a list cannot be used to identify prognosis in most patient cohorts suggests that factors other than stemness and EMT affect mortality.
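One way to picture how a single gene list could yield both a continuous and a categorical variable is sketched below in Python. This is an illustrative assumption, not the scoring procedure used in the study: the gene names (`GENE_M1`, `GENE_E1`, and so on), the per-gene z-scoring, the difference of means, and the zero cutoff are all hypothetical, and the expression values are toy numbers.

```python
from statistics import mean, pstdev

def zscore_rows(expr):
    """Standardize each gene across samples; constant genes become all zeros."""
    out = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)
        out[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return out

def signature_score(expr, up_in_csm, up_in_nse):
    """Continuous score per sample: mean z of CS/M-high genes minus mean z of NS/E-high genes."""
    z = zscore_rows(expr)
    n_samples = len(next(iter(expr.values())))
    scores = []
    for j in range(n_samples):
        csm = mean([z[g][j] for g in up_in_csm if g in z])
        nse = mean([z[g][j] for g in up_in_nse if g in z])
        scores.append(csm - nse)
    return scores

def classify(scores, cutoff=0.0):
    """Categorical call derived from the continuous score (cutoff is an assumption)."""
    return ["CS/M" if s > cutoff else "NS/E" for s in scores]

# Toy expression matrix: gene -> values for four hypothetical samples.
expr = {
    "GENE_M1": [8.0, 7.5, 2.0, 2.5],
    "GENE_M2": [6.0, 6.5, 1.0, 1.5],
    "GENE_E1": [1.0, 1.5, 7.0, 7.5],
    "GENE_E2": [2.0, 2.5, 9.0, 8.5],
}
scores = signature_score(expr, ["GENE_M1", "GENE_M2"], ["GENE_E1", "GENE_E2"])
labels = classify(scores)
```

In this toy example the first two samples receive positive scores and the last two negative scores, so the same list gives a graded value and a two-class call.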

Authors developed and analyzed a data resource containing gene expression, copy number and drug cytotoxicity data for 947 cell lines. They showed that cell lines do represent subtypes of various cancers and that the drug response data generated could help in the development of personalized therapeutic regimens.
We used gene expression data from 56 breast cancer cell lines to find genes differentially expressed between the CS/M and NS/E groups. Drug cytotoxicity data were used to identify drugs that could target each of the discovered groups separately.
E-MTAB-783 (CGP)
Cell lines were screened with 130 different drugs.
We used gene expression data from 39 breast cancer cell lines to find genes differentially expressed between the CS/M and NS/E groups. Drug cytotoxicity data were used to identify drugs that could target each of the discovered groups separately.

GSE24717
Authors developed a stemness signature to distinguish cancer stem cell enriched samples. This signature has prognostic importance, and authors recommend treating stem cell enriched samples with topoisomerase inhibitors and resveratrol.
To identify this signature, authors did not use the breast cancer specific cancer stem cell markers (CD44/CD24) but used CD133. They later showed that when stem cell enriched cell lines are separated from the rest, both markers show the same pattern. In contrast, our signature can not only distinguish the CS/M group (CD44+/CD24-) from the NS/E group (CD44-/CD24+) but also differentiate between cell lines resistant and sensitive to several commercial drugs; our classification also overlaps with the epithelial and mesenchymal classification reported in the literature. We used this dataset to show that our differentially expressed gene signature classifies breast cancer cell lines into the same groups that were formed in our discovery datasets, CCLE and CGP.

GSE50811
Authors performed gene expression profiling of breast cancer cell lines and used these data to identify genes related to paclitaxel and eribulin sensitivity. They showed that EMT genes were related to eribulin sensitivity.
In this paper, authors treated cell lines with paclitaxel and eribulin for only 24 hours before measuring gene expression. We considered this exposure too short for such an experiment, so we used only the untreated cell line data to further validate our gene signature by clustering breast cancer cell lines. Cell lines clustered into the same groups that were formed in our discovery datasets, CCLE and CGP.

GSE73526
Authors performed shRNA dropout screens on 77 breast cancer cell lines to identify vulnerabilities in breast cancer and associated these data with genomic and proteomic data of those cell lines. Additionally, comparing those vulnerabilities with drug data revealed potential resistance mechanisms, anticancer effects and the need for combination therapies.
We used this dataset to show that our gene signature classifies breast cancer cell lines into the same groups that were formed in our discovery datasets, CCLE and CGP.

GSE15192
Authors showed that a subpopulation of MCF-10A cells acquires a CD44+/CD24- phenotype, and that a few EMT related genes play a role in this switch. They found 2035 differentially expressed genes and validated some of them.
We developed a gene signature that can differentiate between CD44+/CD24- and CD44-/CD24+ phenotypes. We used the gene expression data uploaded by the authors to validate this signature and successfully clustered the samples as expected.

GSE36643
Authors investigated a new CSC marker, GD2, in HMLER cells and proposed using it as a single marker of CSCs in breast cancer, as opposed to the CD44 and CD24 markers.
We utilized CD44 and CD24 based distinctions to validate our gene list successfully.

GSE52327
Authors sorted patient derived breast cancer cells based on ALDH, another marker for stemness. We showed that CD44 and CD24 gene expression does not correlate with ALDH gene expression. We used this dataset to validate this observation.

GSE9691
Authors investigated the role of E-cadherin loss in promoting metastasis and concluded that its loss in breast cancer HMLE cells not only increases their metastatic potential but also increases their invasiveness, motility and resistance to apoptosis.
We could identify E-cad downregulated samples as CS/M from control and beta catenin downregulated samples as NS/E.

GSE24202
Authors used this dataset to associate EMT with breast cancer stem cells. They generated mesenchymal HMLE cells by overexpressing TGF beta, Twist or Gsc, or by downregulating E-cad. They identified a gene signature of 159 transcription factors responsible for separating mesenchymal/stem cells from epithelial/non-stem cells.
We used this dataset to successfully distinguish the epithelial and mesenchymal cell groups generated by the authors, with the exception of siEcad cells, which we explain in figure 3.

GSE7515
Mammosphere culture is associated with enrichment of cancer stem cells. Authors generated this dataset from human breast tumor cells cultured in adherent conditions and in mammosphere culture. Their aim was to identify genes that could distinguish adherent cells from mammospheres.
We used this dataset to further validate CNCL; most mammospheres were clustered as CS/M and primary breast cancer cell lines as NS/E.

GSE24460
Authors generated doxorubicin resistant MCF7 cells which were highly invasive, tumorigenic and formed mammospheres when compared to control cells. 30% of these doxorubicin resistant MCF7 cells showed a CD44+/CD24- phenotype. Expression of genes responsible for drug resistance and stem cell characteristics was higher in resistant cells than in sensitive cells.
We used this dataset to test whether CNCL can distinguish resistant MCF7 cells from controls. Upon hierarchical clustering, as expected, resistant cells were clustered as CS/M, separately from control cells, which were clustered as NS/E.
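The hierarchical clustering step mentioned above can be sketched as follows. The distance (1 minus Pearson correlation), the average linkage and the cut into two clusters are assumptions made for illustration, since the text does not specify them; the sample names and values are toy data, not GSE24460 measurements.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def two_cluster_average_linkage(profiles):
    """Agglomerative clustering (average linkage) stopped when two clusters remain.

    profiles: dict mapping sample name -> list of signature gene values.
    """
    names = list(profiles)
    dist = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            dist[frozenset((a, b))] = 1.0 - pearson(profiles[a], profiles[b])

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [frozenset([n]) for n in names]
    while len(clusters) > 2:
        candidates = [
            (linkage(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        ]
        _, i, j = min(candidates)  # closest pair of clusters
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

# Toy signature profiles (hypothetical values).
profiles = {
    "resistant_1": [8.0, 7.0, 1.0, 2.0],
    "resistant_2": [7.5, 6.5, 1.5, 2.5],
    "control_1": [2.0, 1.0, 8.0, 7.0],
    "control_2": [2.5, 1.5, 7.5, 6.5],
}
clusters = two_cluster_average_linkage(profiles)
```

With these toy profiles the two resistant samples end up in one cluster and the two controls in the other; which cluster is then called CS/M or NS/E would depend on the direction of the signature genes.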

GSE10281
Stem cells are responsible for drug resistance. Authors took biopsies from patients before treatment and after 3 months of treatment with letrozole. They examined mesenchymal and epithelial markers and found them to be differentially expressed between samples taken before and after letrozole therapy.
We used this dataset to show that CNCL can identify patients before and after undergoing treatment. Half of NS/E samples switched to a CS/M phenotype and only one patient switched in the opposite direction while others maintained their phenotype.
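Phenotype switching between paired samples can be tabulated with a small helper such as the one below. This is a minimal sketch under stated assumptions: the patient identifiers and labels are toy values, not the GSE10281 calls, and `count_transitions` is a hypothetical helper name.

```python
from collections import Counter

def count_transitions(pre, post):
    """Count (pre-treatment label, post-treatment label) pairs for patients present in both."""
    shared = sorted(set(pre) & set(post))
    return Counter((pre[p], post[p]) for p in shared)

# Toy labels for four hypothetical patients.
pre = {"P1": "NS/E", "P2": "NS/E", "P3": "CS/M", "P4": "CS/M"}
post = {"P1": "CS/M", "P2": "NS/E", "P3": "CS/M", "P4": "CS/M"}

transitions = count_transitions(pre, post)
switched_to_csm = transitions[("NS/E", "CS/M")]
switched_to_nse = transitions[("CS/M", "NS/E")]
```

Keeping the full table of four possible transitions, rather than only the number of switches, makes the direction of the change explicit.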

GSE12791
In this dataset, authors developed Paclitaxel resistance in the breast cancer cell line MDA-MB-231 by prolonged drug treatment and studied the effect of bexarotene in switching the resistant phenotype back to a sensitive one.
We used this dataset to successfully separate the Paclitaxel resistant phenotype (CS/M) from the sensitive phenotype (NS/E) by clustering.

GSE23399
Breast cancer associated fibroblasts (CAFs) were isolated from patients' tumor specimens and treated with Paclitaxel over a prolonged time. These chemotherapy resistant CAFs are responsible for tumor growth and aggression.
We used this dataset to successfully demonstrate that the drug resistant phenotype behaves as CS/M and control cells behave as NS/E.

GSE16179
Authors treated breast cancer cell line BT474 with lapatinib over a prolonged period of time and demonstrated that AXL plays a novel role in acquiring resistance to Lapatinib.
We used the Lapatinib sensitive and resistant cell models to successfully demonstrate that the resistant phenotype is classified as CS/M, while the sensitive phenotype is classified as NS/E.

GSE28844
In this study, authors aimed to identify pathways that confer resistance to tumors post chemotherapy.
We used this dataset to show that tumors treated with taxane have a higher CS/M score when compared to pre-treatment samples.

Survival analysis related datasets

GSE1456
Authors developed a 64 gene signature which can estimate breast cancer patients' response to adjuvant therapy.
Our survival analysis using CNCL revealed that patients with the NS/E phenotype had significantly worse prognosis than patients with the CS/M phenotype, using disease specific survival, overall survival and relapse free survival data.

GSE2034
Authors developed a 76 gene signature which can distinguish patients at high risk of distant recurrence from patients with favorable prognosis.
CNCL showed no difference in recurrence between CS/M and NS/E patients.
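The CS/M versus NS/E survival comparisons reported for these cohorts can be illustrated with a two-group log-rank test. The text does not name the statistical test that was used, so the log-rank test is an assumption here, and the follow-up times and event flags below are toy values rather than cohort data.

```python
from math import erfc, sqrt

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test.

    times_*: follow-up times; events_*: 1 if the event was observed, 0 if censored.
    Returns (chi-square statistic, p-value with 1 degree of freedom).
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    obs_minus_exp = 0.0  # observed minus expected events in group A
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)  # at risk in A
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)  # at risk in B
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        d_b = sum(1 for ti, e, g in data if ti == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    chi2 = obs_minus_exp ** 2 / variance
    p_value = erfc(sqrt(chi2 / 2.0))  # upper tail of chi-square with 1 df
    return chi2, p_value

# Toy example: group A has early events, group B has late events.
chi2_diff, p_diff = logrank_test([1, 2, 3, 4, 5], [1] * 5, [10, 11, 12, 13, 14], [1] * 5)

# Toy example: two groups with identical follow-up show no difference.
chi2_same, p_same = logrank_test([2, 4, 6, 8], [1, 0, 1, 1], [2, 4, 6, 8], [1, 0, 1, 1])
```

A result such as "no survival difference" corresponds to a large p-value, while the direction of a significant difference (which group fares worse) has to be read from the survival curves or from the sign of observed minus expected events.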

GSE2603
Authors identified genes which are responsible for breast cancer metastasis to bone and lung tissue.
CNCL showed that patients with CS/M phenotype had worse prognosis when compared with NS/E patients for lymph node metastasis free survival.

GSE3494
Authors identified a 32 gene signature which can differentiate between p53 wild type and mutant samples, and predicts survival independent of all other prognostic factors.
CNCL showed no significant difference when patients with CS/M phenotype were compared with NS/E patients.

GSE4922
Authors identified a genetic grade signature which can separate low and high grade disease and can improve therapeutic decision making for breast cancer patients.
CNCL showed that patients with the CS/M phenotype had better prognosis than NS/E patients, with borderline significance.

GSE6532
Authors developed a gene grade index which defined histologic grade and found 2 distinct ER+ subgroups with survival difference.
CNCL showed that patients with the CS/M phenotype had significantly better prognosis than NS/E patients.

GSE7390
Authors validated a 76 gene signature for distant metastasis free survival, overall survival, relapse free survival, time to distant metastasis survival.
CNCL showed no survival difference between CS/M and NS/E patients.

GSE11121
Authors generated several metagenes and associated them with distant metastasis free survival (proliferation metagene and B cell metagene).
CNCL showed that patients with the CS/M phenotype had better prognosis than NS/E patients, but the difference was not statistically significant.

GSE12276
Authors identified genes which are responsible for breast cancer metastasis to brain (COX2, HBEGF and ST6GALNAC5).
CNCL showed no survival difference between CS/M and NS/E patients.

GSE19615
Authors identified 2 genes (LAPTM4B and YWHAZ) as responsible for generation of chemoresistance to anthracyclines.
CNCL showed no survival difference between CS/M and NS/E patients.

GSE20685
Authors identified molecular subtypes of breast cancer and proposed using these subtypes for better customization of breast cancer treatment.
CNCL showed no survival difference between CS/M and NS/E patients.

GSE21653
Authors suggested ECRG4 as a tumor suppressor gene which can be used to improve breast cancer prognostication.
CNCL showed no survival difference between CS/M and NS/E patients.

GSE58812
Authors identified 3 subtypes of triple negative breast cancer and proposed that immune mediation in these tumors can be channeled to treat specific subtypes.
CNCL showed that patients with the CS/M phenotype had better prognosis than NS/E patients, significantly for metastasis free survival and non-significantly for overall survival.

GSE25066
Authors developed a genomic predictor for patients treated with taxane and anthracycline chemotherapy.
CNCL showed that patients with the CS/M phenotype had significantly worse prognosis than NS/E patients.

Metabric British Cohort
Authors performed unsupervised analysis of paired DNA-RNA profiles, found novel groups with distinct clinical outcomes and then validated these in another cohort.
CNCL showed that patients with the CS/M phenotype had significantly worse prognosis than NS/E patients.

Metabric Canadian Cohort
CNCL showed no survival difference between CS/M and NS/E patients.