Open access peer-reviewed chapter

Genome-Wide Gene Expression Analysis to Identify Epistatic Gene-Pairs Associated with Prognosis of Breast Cancer

Written By

I-Hsuan Lin and Ming-Ta Hsu

Submitted: June 7th, 2014 Reviewed: October 7th, 2014 Published: March 25th, 2015

DOI: 10.5772/59462

Chapter metrics overview

1,424 Chapter Downloads

View Full Metrics

1. Introduction

Breast cancer is the most common cancer in women in the world [1]. According to the most recent estimates from GLOBOCAN published by the International Agency for Research on Cancer (IARC) [2], there were nearly 1.7 million new breast cancer cases diagnosed in 2012 (25.2% of all cancers in women) and 6.3 millions have been diagnosed with breast cancer in 2007-2012. Breast cancer incidence has been increasing by more than 20% and mortality increased by 14% since 2008 and is the most common cause of cancer death in women in less developed regions (324,000 deaths, 14.3% of total). Breast cancer is less favorable in the under-developed countries due to less advanced medical diagnosis and treatments. Therefore a good diagnosis/prognosis would help to prevent as well as provide effective clinical treatments.

Biomarker testing is an essential step in the evaluation of breast cancer and help medical doctors and patients in deciding the best treatment strategy. There are several commercial products or services developed towards this purpose. The Oncotype DX (Genomic Health) measures the expression levels of 21 genes and is most helpful for patients of early stage breast cancer with estrogen receptor (ER) positivity and no cancer cells in the lymph node. The HERmark assay (Monogram Biosciences) can quantitatively measure the HER2 total proteins with greater sensitivity than immunohistochemistry (IHC) which is an important indicator of predicting response of HER2-positive breast cancer patients to trastuzumab therapy. There are also tests for BRCA1and BRCA2mutations for the hereditary breast cancer patients. The targeted sequencing-based breast cancer panels such as BreastNext (Ambry Genetics) and BROCA (University of Washington) can be used to screen for mutations and copy number variants in genes implicated in breast cancer, including BRCA1and BRCA2.

Despite the relative success of these tests, there is a need for more efficient biomarkers in specific groups of breast cancer, such as lobular carcinoma [3, 4], triple-negative breast cancer [5] and early-onset breast cancers [6, 7] for diagnostic and/or prognostic application. We believe the discovery of more useful markers using the wealth of gene expression data available publicly nowadays would help medical doctors in the decision of the best way to help breast cancer patients, especially if the markers are correlated with specific therapeutic interventions. In the past decade, the high throughput microarray technique has been widely used to identify potential biomarkers for various cancers [8-12]. Recent years, the employment of RNA sequencing (RNA-seq) allows researchers to obtain transcriptome information and differential gene expression profiling at a much higher resolution. With the huge amount of data generated by these technologies, we are able to study the association of genes with cancer survival and identify novel potentially prognostic biomarkers for cancers with improved estimation. Traditionally, genetic search identifies genes that correlate with poor or good prognosis of patients. However, it is important to consider the epistatic gene-gene interactions underlying gene expression in complex diseases such as cancer. The epistatic (second or higher order) information would allow more refined prognostic evaluation that may help clinical treatments. Furthermore, epistatic analysis could be useful for identifying hub genes involved in prognosis and help to identify the major genetic risk factors and pathways in breast cancer.

In this work, we utilized the breast tumor RNA-seq data from The Cancer Genome Atlas (TCGA) as well as microarray-based expression datasets from Gene Expression Omnibus (GEO) to detect differentially expressed genes in various subsets of breast cancer patients, to identify genes whose expression profile is associated with survival of breast cancer patients and to examine the influence of co-expression of a second gene in the survival of patients. This analysis identifies specific gene groups differentially expressed between early-onset vs. late-onset breast cancer, between ductal vs. lobular carcinoma, between early vs. advanced stage breast tumors and tumor of various receptor status. Furthermore, epistatic interactions among these genes demonstrate the gene-gene interactions in patient survival and identify several hub genes that may be important determinants of breast cancer.


2. Statistical analysis of gene expression data

A global change in gene expression is a common theme in many human cancers. High-throughput techniques such as microarrays and next generation sequencing allow investigators to observe and compare the transcriptional landscapes of tumor cells in different biological states [13-18]. In this work, we integrated multiple gene expression data from several large-scale breast cancer studies to improve the assessment of differential gene expression in breast tumor cells and to effectively increase statistical power.

We collected 3,188 breast cancer related Affymetrix expression microarray data from GEO ( from the following 16 series: GSE2603, GSE4922, GSE2990, GSE3494, GSE6532, GSE9195, GSE7390, GSE20194, GSE20271, GSE20685, GSE25066, GSE16391, GSE19615, GSE42568, GSE45255 and GSE50948. We also obtained 1,172 breast invasive carcinoma (BRCA) RNA-seq Level 3 data from TCGA Data Portal ( The demographic and clinicopathological characteristics of the breast cancer patients from each study were also retrieved.

2.1. Processing of gene expression data and differential gene expression analysis

The CEL files obtained from microarray experiments were pre-processed by subjecting to quality check using Bioconductor in the R environment to ensure comparability between the different series and microarray platforms. The following quality measurements from the simpleaffyand affypackages were performed: average background (avbg), scale factor (sfs), percent present (percent.present), and possible RNA Degradation (AffyRNAdeg) of the array. Additionally, the relative log expression (RLE) and normalized unscaled standard error (NUSE) was also estimated using the affyPLMpackage. 466 arrays that did not pass the quality control tests were removed. For the 2,722 arrays that had sufficient quality, the quantile normalization and background correction were performed using the justRMA (robust multi-array average) algorithm of the affypackage and the gene (probe set)-level log2-transformed expression values were summarized with Custom CDF file annotations (version 18.0.0. ENSG) [19]. Lastly, the COMBAT method available in the inSilicoMergingpackage was used to remove batch effect when combining the final expression data from the HG-U133A and HG-U133 Plus 2.0 arrays [20]. The RMA-normalized expression values from microarrays and the raw count data from RNA-seq datasets were then analyzed using the edgeRpackage [21]. The differentially expressed genes were selected with a threshold of FDR adjusted P-value < 0.05.

2.2. Chinicopathological characteristics of breast cancer patients

We include 2,722 breast cancer patients from various microarray-based studies (referred as GEO cohort) and 1,052 breast cancer patients from the TCGA project (referred as TCGA cohort) following differential gene expression analyses (Table 1). All patients were women in the GEO cohort with a median age of 53 years. The patients from the TCGA cohort were older with a median age of 58 years and approximately 96% of patients were women. There was a significant amount of clinicopathological data not available from the GEO cohort as noted in Table 1. In both cohorts, there were more stage I/II breast cancer cases than advanced stage cases, and invasive ductal carcinoma (IDC) being the major histological subtype diagnosed. The data also contained status of tumor receptors such as the estrogen receptor (ER), progesterone receptor (PR) and HER2 which are frequently used prognostic factors to aid therapeutic decisions. Many patients were positive for the ER and/or PR, and/or negative for the HER2 receptor.

CharacteristicNo. of Patients
Microarray (n = 2722), %RNA-seq (n = 1052), %
Missing data00.0%363.4%
Median Age (range)53 (24-93)58 (26-90)
Younger than 4030011.0%716.7%
40 to 55118443.5%36534.7%
Older than 55122445.0%58055.1%
Missing data140.5%363.4%
Early (Stage I and II)123645.4%75171.4%
Late (Stage III and IV)37013.6%24623.4%
Missing data111641.0%555.2%
Histologic Subtype
Missing data213778.5%373.5%
ER Status
ER positive171062.8%74971.2%
ER negative64723.8%22221.1%
Missing data36513.4%817.7%
PR Status
PR positive101737.4%65061.8%
PR negative68425.1%31830.2%
Missing data102137.5%848.0%
HER2 Status
HER2 positive2027.4%15014.3%
HER2 negative94634.8%52449.8%
Missing data157457.8%37835.9%
Female patients with a least one type of survival data229484.3%99936.7%

Table 1.

Patient characteristics of the GEO and TCGA cohorts.

2.3. Differentially expressed genes in patients from different age groups

Differential gene expression analysis was performed to identify over- and under-expressed genes specific to tumors derived from young, middle-aged and elderly breast cancer patients. As presented in Figure 1, there were very few middle-aged-specific expression signatures, indicating that the gene expression pattern of middle-aged patients was not significantly different from the young adults and/or elderly patients. In contrast, the elderly breast cancer patients possessed a high number of differentially expressed genes specific to this age group. IPA analysis of the differentially expressed genes from tumor cells obtained from older patients have decreased cell proliferation, movement, migration and cell cycle progression (activation z-score between -2.677 and -1.611) and increased cell death (activation z-score = 1.321). On the contrary, tumor cells from young patients were predicted to have increased proliferation of cells and DNA synthesis (activation z-score between 2.000 and 2.117) and decreased cell death and apoptosis (activation z-score between -0.586 and -0.299).

Figure 1.

Number of significantly over- and under-expressed genes in the three age groups presented with the jvenn Venn diagram viewer [22].

It is interesting to note that 14 genes that were over-expressed in young patients were under-expressed in elderly patients, and conversely, 15 genes under-expressed in young patients were over-expressed in elderly patients (Table 2). Several of these genes such as BIRC5(survivin), KPNA2, PLAC8(onzin), TFPI2, CITED2, NKX3-1, PIPand ZNF395have been found to play a role in cancer cell proliferation and cancer progression [23-28].

TypeSymbolEntrez Gene NameLocationType(s)
Young-Up Elderly-DnBIRC5baculoviral IAP repeat containing 5CytoplasmOther
GALgalanin/GMAP prepropeptideExtracellular SpaceOther
HN1hematological and neurological expressed 1NucleusOther
KPNA2karyopherin alpha 2NucleusTransporter
NOL11nucleolar protein 11NucleusOther
NUP85nucleoporin 85kDaCytoplasmOther
PLAC8placenta-specific 8NucleusOther
POLR3Gpolymerase (RNA) III (DNA directed) polypeptide GNucleusEnzyme
PSMA4proteasome alpha 4 subunit isoform 1CytoplasmPeptidase
RAPGEFL1Rap guanine nucleotide exchange factorOtherOther
TFPI2tissue factor pathway inhibitor 2Extracellular SpaceOther
UCHL1ubiquitin carboxyl-terminal esterase L1CytoplasmPeptidase
XDHxanthine dehydrogenaseCytoplasmEnzyme
Young-Dn Elderly-UpABCC6ATP-binding cassette, sub-family C, member 6Plasma MembraneTransporter
ACAA1acetyl-CoA acyltransferase 1CytoplasmEnzyme
CCDC28Acoiled-coil domain containing 28AOtherOther
CITED2Cbp/p300-interacting transactivatorNucleusTranscription regulator
CLMNcalmin (calponin-like, transmembrane)CytoplasmOther
CTDSPLsmall CTD phosphatase 3 isoform 1NucleusOther
CTSFcathepsin FCytoplasmPeptidase
FMO5flavin containing monooxygenase 5CytoplasmEnzyme
GPC4glypican 4Plasma MembraneTransmembrane receptor
KIF13Bkinesin family member 13BCytoplasmOther
NDST1N-deacetylase/N-sulfotransferase (heparan)CytoplasmEnzyme
NKX3-1NK3 homeobox 1NucleusTranscription regulator
PIPprolactin-induced proteinExtracellular SpacePeptidase
ZNF385Dzinc finger protein 385DNucleusOther
ZNF395zinc finger protein 395CytoplasmOther

Table 2.

Concordant differentially expressed genes identified in the young and elderly breast cancer patients.

2.4. Differentially expressed genes in patients with early stage versus advanced stage breast cancer

We compared the gene expression profile of patients diagnosed with early stage (stage I and II) breast cancer with those with advanced stage (stage III and IV) breast cancer. We identified 79 over-expressed and 140 under-expressed genes in early stage breast cancer. IPA analysis showed 121 of the total 219 differentially expressed genes were associated with cancer (P-value = 4.81E-02), and 24 were specifically associated with breast cancer (P-value = 3.25E-03, Table 3). Also, there were 17 under-expressed genes in early stage breast cancer (i.e. over-expressed in advanced stage tumors) that were found to be cancer recurrence-associated (ADORA3, FLT4, GSR, HSP90AA1, TEKand TXNRD1) and metastasis-associated (ACP5, FLT4, FTL, GSR, HSP90AA1, MAPK11, MMP9, NRAS, PGF, SCDand TEK). Interestingly, we detected over-expression of the DNA methyltransferase DNMT1in early stage tumors. In cancer cells, the over-expression of this gene can lead to hypermethylation of CpG islands and epigenetically silences multiple tumor suppressor genes and hence promotes tumorigenesis in early stage cancers [29-31].

SymbolEntrez Gene NameLocationType(s)DE Status
ACP5acid phosphatase 5, tartrate resistantCytoplasmphosphataseDown
APOEapolipoprotein EExtracellular SpacetransporterDown
ARRB1arrestin, beta 1CytoplasmotherDown
CDKN1Acyclin-dependent kinase inhibitor 1ANucleusotherDown
ETV1ets variant 1Nucleustranscription regulatorDown
FLT4fms-related tyrosine kinase 4Plasma Membranetransmembrane receptorDown
GPC3glypican 3Plasma MembraneotherUp
GPR126G protein-coupled receptor 126Plasma MembraneG-protein coupled receptorDown
HBBhemoglobin, betaCytoplasmtransporterDown
HIC1hypermethylated in cancer 1Nucleustranscription regulatorDown
HSP90AA1heat shock protein 90kDa alpha (cytosolic)CytoplasmenzymeDown
HSPB7cardiovascular heat shock proteinCytoplasmotherDown
MMP15matrix metalloproteinase 15 preproproteinExtracellular SpacepeptidaseDown
MMP28matrix metalloproteinase 28 isoform 1Extracellular SpacepeptidaseDown
MMP9matrix metalloproteinase 9 preproproteinExtracellular SpacepeptidaseDown
NOS3nitric oxide synthase 3 (endothelial cell)CytoplasmenzymeDown
PPM1Dprotein phosphatase 1DCytoplasmphosphataseDown
PSIP1PC4 and SFRS1 interacting protein 1NucleusotherUp
S100A2S100 calcium binding protein A2NucleusotherDown
SELLselectin LPlasma Membranetransmembrane receptorDown
TAL1T-cell acute lymphocytic leukemia 1Nucleustranscription regulatorDown
TEKTEK tyrosine kinase, endothelialPlasma MembranekinaseDown
TNCtenascin CExtracellular SpaceotherUp

Table 3.

The 25 differentially expressed genes associated with breast cancer in the early versus advanced stage analysis.

2.5. Differentially expressed genes in patients with Invasive Ductal Carcinoma (IDC) versus Invasive Lobular Carcinoma (ILC)

The invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC) are the two major histological subtypes of breast cancer. They also represented the main types of breast cancer cases gathered in this study. We compared the gene expression profiles of patients with IDC and ILC and identified 216 over-expressed and 126 under-expressed genes in IDC as compared to patients with ILC. IPA analysis showed 66 genes were related to breast cancer (Table 4), including 12 transcription regulators (ATF3, BTG2, EGR1, EZH2, FOS, FOSB, JUN, JUNB, MTDH, STAT1, ZFP36and ZNF423) and a translation regulator (EIF4EBP1). There were also 12 genes annotated as tumor suppressor genes in the TSGene database [32], where CDH1, DKK1and S100A2were over-expressed in IDC and BTG2, CAV1, EGR1, GPC3, MUC1, NR4A1, SLIT2, TGFBR2and ZFP36were under-expressed in IDC. IPA predicted common upstream regulators KDM5B, STUB1, CDKN1A, HIF1Aand TGFB1to be inhibited whereas FOXM1, IFNB1, IFNGand PPARGwere in activated states. Additionally, the activities of several disease functions such as cell proliferation, invasion and DNA replication were predicted to be increased in IDC (activation z-score between 1.342 and 3.092).

SymbolEntrez Gene NameLocationType(s)DE Status
ACACBacetyl-CoA carboxylase betaCytoplasmenzymeDown
ALDH1A1aldehyde dehydrogenase 1 family, member A1CytoplasmenzymeDown
APOBEC3Bapolipoprotein B mRNA editing enzyme, catalyticCytoplasmenzymeUp
ATF3activating transcription factor 3Nucleustranscription regulatorDown
BIRC5baculoviral IAP repeat containing 5CytoplasmotherUp
BTG2BTG family, member 2Nucleustranscription regulatorDown
CAV1caveolin 1, caveolae protein, 22kDaPlasma Membranetransmembrane receptorDown
CCL21chemokine (C-C motif) ligand 21Extracellular SpacecytokineDown
CD34CD34 moleculePlasma MembraneotherDown
CD69CD69 moleculePlasma Membranetransmembrane receptorDown
CDC20cell division cycle 20NucleusotherUp
CDH1cadherin 1, type 1, E-cadherin (epithelial)Plasma MembraneotherUp
CDH3cadherin 3, type 1, P-cadherin (placental)Plasma MembraneotherUp
CDH5cadherin 5, type 2 (vascular endothelium)Plasma MembraneotherDown
CDK1cyclin-dependent kinase 1NucleuskinaseUp
CXCL14chemokine (C-X-C motif) ligand 14Extracellular SpacecytokineDown
CXCL2chemokine (C-X-C motif) ligand 2Extracellular SpacecytokineDown
CYR61cysteine-rich, angiogenic inducer, 61Extracellular SpaceotherDown
DKK1dickkopf WNT signaling pathway inhibitor 1Extracellular Spacegrowth factorUp
DSCC1DNA replication and sister chromatid cohesion 1NucleusotherUp
DUSP1dual specificity phosphatase 1NucleusphosphataseDown
EGR1early growth response 1Nucleustranscription regulatorDown
EIF4EBP1eukaryotic translation initiation factor 4ECytoplasmtranslation regulatorUp
EZH2enhancer of zeste homolog 2 (Drosophila)Nucleustranscription regulatorUp
FABP7fatty acid binding protein 7, brainCytoplasmtransporterUp
FOSFBJ murine osteosarcoma viral oncogene homologNucleustranscription regulatorDown
FOSBFBJ murine osteosarcoma viral oncogene homolog BNucleustranscription regulatorDown
GPC3glypican 3Plasma MembraneotherDown
GRB7growth factor receptor-bound protein 7Plasma MembraneotherUp
HSPB8heat shock 22kDa protein 8CytoplasmkinaseUp
IER2immediate early response 2CytoplasmotherDown
IGF1insulin-like growth factor 1 (somatomedin C)Extracellular Spacegrowth factorDown
IGFBP6insulin-like growth factor binding protein 6Extracellular SpaceotherDown
ITIH5inter-alpha trypsin inhibitor heavy chainOtherotherDown
JUNjun proto-oncogeneNucleustranscription regulatorDown
JUNBjun B proto-oncogeneNucleustranscription regulatorDown
KPNA2karyopherin alpha 2NucleustransporterUp
KRT6Bkeratin 6BCytoplasmotherUp
MMP1matrix metalloproteinase 1 preproproteinExtracellular SpacepeptidaseUp
MMP9matrix metalloproteinase 9 preproproteinExtracellular SpacepeptidaseUp
MRPL13mitochondrial ribosomal protein L13CytoplasmotherUp
MRPL15mitochondrial ribosomal protein L15CytoplasmotherUp
MTDHmetadherinCytoplasmtranscription regulatorUp
MUC1mucin 1, cell surface associatedPlasma MembraneotherDown
NR4A1nuclear receptor subfamily 4, group A, member 1Nucleusligand-dependent nuclear receptorDown
ORM1orosomucoid 1Extracellular SpaceotherUp
PCNAproliferating cell nuclear antigenNucleusenzymeUp
PDK4pyruvate dehydrogenase kinase, isozyme 4CytoplasmkinaseDown
PGK1phosphoglycerate kinase 1CytoplasmkinaseUp
RFC4replication factor C (activator 1) 4, 37kDaNucleusotherUp
RRM2ribonucleotide reductase M2NucleusenzymeUp
S100A2S100 calcium binding protein A2NucleusotherUp
SLIT2slit homolog 2 (Drosophila)Extracellular SpaceotherDown
SPP1secreted phosphoprotein 1Extracellular SpacecytokineUp
SQLEsqualene epoxidaseCytoplasmenzymeUp
STAT1signal transducer and activator of transcriptionNucleustranscription regulatorUp
TCP1t-complex 1CytoplasmotherUp
TGFBR2transforming growth factor, beta receptor IIPlasma MembranekinaseDown
TIMP4TIMP metallopeptidase inhibitor 4Extracellular SpaceotherDown
TOP2Atopoisomerase (DNA) II alpha 170kDaNucleusenzymeUp
TPD52tumor protein D52CytoplasmotherUp
TYMSthymidylate synthetaseNucleusenzymeUp
UBE2Cubiquitin-conjugating enzyme E2CCytoplasmenzymeUp
VEGFAvascular endothelial growth factor AExtracellular Spacegrowth factorUp
ZFP36ZFP36 ring finger proteinNucleustranscription regulatorDown
ZNF423zinc finger protein 423Nucleustranscription regulatorDown

Table 4.

The 66 differentially expressed genes associated with breast cancer in the IDC versus ILC analysis.

2.6. Differentially expressed genes in patients with different receptor status

In the last part of the differential gene expression analysis, we sought to examine the differentially expressed genes of breast cancer patients of different receptor status: (1) estrogen receptor positive (ER+) versus ER negative (ER–), (2) progesterone receptor positive (PR+) versus PR negative (PR–), (3) HER2 receptor positive (HER2+) versus HER2 negative (HER2–), and (4) triple-negative breast cancer (TNBC, also known as basal-like breast cancer) versus non-TNBC. The Venn diagram in Figure 2 summarized the intersections between the differentially expressed genes identified in the four assays. There were 57% and 65% of breast cancer patients that were both ER+ and PR+ in the GEO and TCGA cohorts respectively; hence the patient pools divided by the ER positivity for gene expression assays are similar to that divided by the PR positivity. Because of this fact, it is not surprising to observe genes that were found over- or under-expressed in the ER assay were also differentially expressed in the same direction in the PR assay. Likewise, genes that were over-expressed in TNBC were under-expressed in the ER and PR assays (n = 74) and ER, PR and HER2 assays (n = 35), and vice versa for the under-expressed genes in TNBC (n = 87 and 2 respectively).

Figure 2.

Number of differentially up- and down-regulated genes in the ER, PR, HER2 or TNBC receptor status assays.

The GALNT6(polypeptide N-acetylgalactosaminyltransferase) and SCGB2A2(secretoglobin) are the two genes consistently over-expressed in ER+, PR+, HER2+ breast tumors but under-expressed in TNBC. There were also 87 genes over-expressed in ER+ and PR+ breast tumors and under-expressed in TNBC, including seven transcription regulators (EGR3, FOXA1, GATA3, INSM1, NRIP1, TBX3and XBP1) and 11 breast cancer associated genes (ABAT, AGR2, CXCL14, GSTM3, HSPB8, MUC1, NR4A2, PGR, PIP, PLATand PSD3). On the other hand, there were more under-expressed genes (n = 35) shared among ER+, PR+ and HER2+ breast tumors that were over-expressed in TNBC. Among these are four transcription regulators (ELF5, EN1, FOXC1and ZIC1) and 12 extracellular proteins (CHI3L1, CHI3L2, COL2A1, COL9A3, CRLF1, KLK6, KLK7, MMP12, MMP7, PTX3, SERPINB5and SOSTDC1), and some of these genes are known TNBC-associated markers [33-37].


3. Identifying survival-related genes of patients with breast cancer

In cancer survival analysis, survival time is often defined as the period of time from the beginning of the medical process (treatment, surgery, etc.) until the death (or some other events such as development of a particular symptom or to relapse after the remission of disease) of the observed patient or until the end of observation. The goal of such analysis is to link the time to event (i.e. survival time) to certain explanatory variables. New methodologies were developed for calculating the survival probabilities using gene expression profiles when genome-wide expression data becomes increasingly available in the past two decades [38-42].

In this work, we analyzed associations between breast cancer patient survival and gene expression of breast tumors from published microarray and the RNA-seq datasets, denoted as the GEO and TCGA cohorts respectively. Survival analysis was performed separately for each cohort and the median times from diagnosis to death or last follow-up were 99.5 and 21.4 months in the GEO and TCGA cohorts respectively. We transform the expression values into gene expression status (i.e. 0 for low expression and 1 for high expression) using the modified auto_cutoff function of the R script available from the Kaplan Meier-plotter website ( The survival probability is calculated using the “survival” package and modified kmplotfunction ( is used to plot Kaplan-Meier curves. The hazard ratio with 95% confidence intervals and log-rank P-value are estimated using the Cox proportional hazards model. All analyses were conducted within the R statistical environment.

3.1. Univariate gene selection and survival analysis

We extract the gene expression profiles of 1,694 genes that were found differentially expressed (consistently in microarray and RNA-seq datasets) in any one type of the assays discussed in section 2.3 to 2.6. We calculated the overall survival (OS), relapse-free survival (RFS) and the distant metastasis-free survival (DMFS) of breast cancer patients with respect to the expression status. DMFS is not calculated for the TCGA cohort due to unavailability of the time to distant metastases information from patients in this cohort. The log-rank P-values were adjusted for multiple comparisons using the Benjamini-Hochberg false discovery rate (FDR) method and were used to select genes expression profiles significantly associated with survival.

We summarized the survival statistics, including the hazard ratios (an estimate of the ratio of the hazard rate in the highly versus the lowly expressed patient group) and the estimated 2- and 10-year survival rates in Table 5. There were about 24% OS-associated, 48% RFS-associated and 51% DMFS-associated genes that have adjusted log-rank P-value < 0.01 in the GEO cohort. There were 23% OS-associated genes but only 1.3% RFS-associated genes in the TCGA cohort, due to much fewer relapse/recurrence information in this cohort (adjusted log-rank P-value < 0.05). The breast cancer patients in the TCGA cohorts have lower overall survival (10-year) than those from the GEO and both cohorts have similar RFS rates. In the DMFS analysis, 51% of the differentially expressed genes were significant predictor of DMFS.

Adjusted log-rank P-value cutoff0.010.05
Overall survival (OS)
No. of genes associated with OS414 (24.4%)386 (22.8%)
Hazard Ratio > 1
No. of genes192134
2-year survival (low / high expression)0.969, 0.9230.968, 0.947
10-year survival (low / high expression)0.798, 0.6580.562, 0.369
Hazard Ratio < 1
No. of genes222252
2-year survival (low / high expression)0.918, 0.9680.945, 0.970
10-year survival (low / high expression)0.654, 0.7870.382, 0.558
Relapse-free survival (RFS)
No. of genes associated with RFS811 (47.8%)22 (1.3%)
Hazard Ratio > 1
No. of genes3447
2-year survival (low / high expression)0.901, 0.8400.949, 0.829
10-year survival (low / high expression)0.685, 0.5860.754, 0.555
Hazard Ratio < 1
No. of genes46715
2-year survival (low / high expression)0.845, 0.9000.826, 0.946
10-year survival (low / high expression)0.588, 0.6840.538, 0.749
Distant metastasis-free survival (DMFS)
No. of genes associated with DMFS856 (50.5%)NA
Hazard Ratio > 1
No. of genes384NA
2-year survival (low / high expression)0.923, 0.863
10-year survival (low / high expression)0.755, 0.663
Hazard Ratio < 1
No. of genes472NA
2-year survival (low / high expression)0.858, 0.926
10-year survival (low / high expression)0.667, 0.754

Table 5.

Survival statistics according to gene expression profiles of breast cancer patients.

3.2. Cox regression analysis using the expression profiles of two genes

From the three survival data, i.e. OS, RFS and DMFS, we selected the top 500 most significantly survival associated gene expression profiles consistent in both cohorts to generate 124,750 two-gene combinations and perform Cox regression analysis with two covariates (i.e. using the expression status of each gene as a covariate).

There were 81,902 (65.7%) and 78,136 (62.6%) pairs whose expression signatures of both genes remained predictors of OS in the GEO and TCGA cohorts respectively (P-value of the coefficient estimates < 0.05). Of these, 31,189 pairs were mutual in the two cohorts and 234 gene-pairs (consisting of 131 genes) had survival probability patterns greatly shifted compared with the previous single-gene model. The strongest predictor pairs were COL16A1-ARHGEF3, IGF1R-LTB, IGF1R-PTGDS, NPY1R-ARHGEF3and SERPINA1-ACADSB, where the lower expression of both genes in each pair was associated with lowest survival probabilities in all five cases. The results were presented as Kaplan Meier plots in Figure 3A to E.

Figure 3.

The Kaplan Meier plots of five OS-associated gene-pairs that also gained most changes in survival probabilities compared to the matching univariate approach.

The same analysis was also performed for RFS in the GOE and TCGA cohorts, and 85,244 (68.3%) and 64,049 (51.3%) pairs were significant predictors of RFS respectively (P-value of the coefficient estimates < 0.05). We found 22,165 significant pairs common in the two cohorts and 1,130 gene-pairs (consisting of 276 genes) whose survival probability patterns had greatly shifted compared with the single-gene model. The most significant RFS-associated pairs were ADM-MYBPC1, DIRAS3-TANC2, KIFC1-ADORA3, PDSS1-DIRAS3, STMN1-ADORA3(Figure 4).

Figure 4.

The Kaplan Meier plots of five RFS-associated gene-pairs that also gained most changes in survival probabilities compared to the matching univariate approach.

Lastly, we also make use of the DMFS data available from the GEO cohort to demonstrate the improvement of epistatic gene-pair approach in predicting survival probabilities. Of the 124,750 two-gene combinations, 122,751 (98.4%) were significant predictors of DMFS in breast cancer patients. The high percentage of strong two-gene predictors derived from the DMFS analysis was most likely due to the already high numbers of strong single-gene predictors as shown in Table 5. We further distinguished 228 gene-pairs (consisting of 138 genes) whose survival probability patterns had greatly shifted compared with the single-gene model. Six most significantly improved DMFS-associated pairs were MMP15-SPDEF, TRIB3-ETV1, TRIB3-PLD1, TRIB3-TRIM2, TRIM2-KRT14and XBP1-TRIB3(Figure 5).

Figure 5.

The Kaplan Meier plots of six DMFS-associated gene-pairs that also gained most changes in survival probabilities compared to the matching univariate approach.

3.3. Differentially expressed survival-associated hub genes and gene-pair candidates

As mentioned in the section 3.2, we have identified 234, 1,130 and 228 OS-, RFS- and DMFS-associated gene-pairs (consisting of 131, 276 and 138 genes respectively) that showed improved predictive performance. Some of these genes may be paired with many partners while remaining highly significant. In Table 6, we list five genes that have high number of pairing possibilities and also common in OS, RFS and DMFS analysis.

GenesNo. of Gene-pairsGEO RFS log-rankP
OSRFSDMFSsingle covariatemultiple covariates
MEOX161804.94E-083.43E-11 (C3orf18) ~ 4.02E-08
PPAP2B371401.27E-041.07E-10 (ADM) ~ 7.17E-05
PRPF38B84903.41E-021.04E-12 (DIRAS3) ~ 1.17E-02
SERPINA1720223.21E-052.30E-12 (CDT1) ~ 2.34E-05
XBP101151.97E-032.47E-13 (DIRAS3) ~ 1.11E-03

Table 6.

Five hub genes that associated with more than one type of survival data.

Gene-PairsOSRFSDMFSDE Status of Gene 1DE Status of Gene 2
C3orf18-PPAP2B0.510.510.660.64Her2+ DownElderly Down
IGF1R-KLRB10.390.600.720.73ER+ Up / TNBC DownIDC Down
NME5-PPAP2B0.510.560.770.67ER+ Up / PR+ Up / Her2+ Down / TNBC DownElderly Down
PRPF38B-RAMP30.560.580.800.62Early Stage Up
(i.e. Advanced Stage Down)
ER+ Up / PR+ Up / Her2+ Down / TNBC Down
GATA3-SERPINA10.620.410.560.56ER+ Up / PR+ Up / TNBC DownYoung Down / ER+ Up / PR+ Up / Her2+ Down / TNBC Down
PSAT1-SERPINA11.610.441.530.56Elderly Down / ER+ Down / PR+ Down / TNBC UpYoung Down / ER+ Up / PR+ Up / Her2+ Down / TNBC Down
MMP15-SLC44A41.290.751.690.60Early Stage Down
(i.e. Advanced Stage Up)
ER+ / TNBC Down

Table 7.

The gene-pairs that associated with more than one type of survival data.

There were also seven gene-pairs that were significantly associated with more than one type of survival data (Table 7). By incorporating the differential expression information we derived previously, we may observe that the TNBC patients were noticeably having worse survival outcomes than non-TNBC patients as TNBC is known to be an aggressive breast cancer subtype [43, 44]. For example, both GATA3[45, 46] and SERPINA1were found significantly under-expressed in TNBC cases and the low expressions of both genes were correlated with poor OS and DMFS. Additionally, the over-expression of PSAT1and the under-expression of SERPINA1in TNBC patients also correlated with poor OS and DMFS. Moreover, the over-expression of MMP15relating to advanced stage breast cancer and the under-expression of SLC44A4associated with TNBC are predictors of cancer recurrence as well as distant metastases.


4. Conclusion and perspectives

In section 2 of this chapter, we identified 1,694 genes that were differentially expressed in breast cancer patients of three age groups, early versus advanced stage breast cancers, invasive ductal versus invasive lobular breast cancers, and patients of various receptor status. While some of these genes are known to participate in the biological and genetic pathways that lead to breast cancer and many are novel findings. In section 3, we showed that more than 20% the differentially expressed genes were associated with at least one type of survival data. Our data indicated improved predictive performance when using a multivariate approach of combining the expression of two genes in the assessment of survival data. Perceivably, the gene pairs found in the epistatic analysis could provide useful pictures in gene interactions in breast carcinogenesis.

Breast cancer is a heterogeneous and complex disease where researchers and doctors have implemented different classifications (be it molecular, pathological, genetic or prognostic) to aid disease diagnosis and treatment decision. In the future, we hope to use the gene expression profiles of multiple survival-associated biomarkers to sub-classify patients of different types of breast cancer, and ultimately allow medical practitioners to derive better disease assessment and treatment decision.



The authors acknowledge financial support from the Ministry of Science and Technology, Taiwan (MOST 103-2811-B-010-020), and from the Ministry of Education, Taiwan (Aim for the Top University Plan; 103AC-T903). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.


  1. 1. Breast cancer: prevention and control. World Health Organization (WHO) Web site. Accessed June 2014. []
  2. 2. Breast Cancer - Estimated Incidence, Mortality and Prevalence Worldwide in 2012. GLOBOCAN 2012 Fact Sheets. Accessed June 2014. []
  3. 3. Cristofanilli M, Gonzalez-Angulo A, Sneige N, Kau SW, Broglio K, Theriault RL, Valero V, Buzdar AU, Kuerer H, Buchholz TA, Hortobagyi GN: Invasive lobular carcinoma classic type: response to primary chemotherapy and survival outcomes.J Clin Oncol2005, 23:41-48.
  4. 4. Reis-Filho JS, Simpson PT, Turner NC, Lambros MB, Jones C, Mackay A, Grigoriadis A, Sarrio D, Savage K, Dexter T, et al: FGFR1 emerges as a potential therapeutic target for lobular breast carcinomas.Clin Cancer Res2006, 12:6652-6662.
  5. 5. Lehmann BD, Pietenpol JA: Identification and use of biomarkers in treatment strategies for triple-negative breast cancer subtypes.J Pathol2014, 232:142-150.
  6. 6. Fredholm H, Eaker S, Frisell J, Holmberg L, Fredriksson I, Lindman H: Breast cancer in young women: poor survival despite intensive treatment.PLoS One2009, 4:e7695.
  7. 7. Assi HA, Khoury KE, Dbouk H, Khalil LE, Mouhieddine TH, El Saghir NS: Epidemiology and prognosis of breast cancer in young women.J Thorac Dis2013, 5 Suppl 1:S2-8.
  8. 8. Clarke PA, te Poele R, Workman P: Gene expression microarray technologies in the development of new therapeutic agents.European journal of cancer2004, 40:2560-2591.
  9. 9. Midorikawa Y, Makuuchi M, Tang W, Aburatani H: Microarray-based analysis for hepatocellular carcinoma: from gene expression profiling to new challenges.World journal of gastroenterology : WJG2007, 13:1487-1492.
  10. 10. Lacroix L, Commo F, Soria JC: Gene expression profiling of non-small-cell lung cancer.Expert review of molecular diagnostics2008, 8:167-178.
  11. 11. Nannini M, Pantaleo MA, Maleddu A, Astolfi A, Formica S, Biasco G: Gene expression profiling in colorectal cancer using microarray technologies: results and perspectives.Cancer treatment reviews2009, 35:201-209.
  12. 12. Sorensen KD, Orntoft TF: Discovery of prostate cancer biomarkers by microarray gene expression profiling.Expert review of molecular diagnostics2010, 10:49-64.
  13. 13. Oostlander AE, Meijer GA, Ylstra B: Microarray-based comparative genomic hybridization and its applications in human genetics.Clinical genetics2004, 66:488-495.
  14. 14. Shih Ie M, Wang TL: Apply innovative technologies to explore cancer genome.Current opinion in oncology2005, 17:33-38.
  15. 15. Dupuy A, Simon RM: Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting.Journal of the National Cancer Institute2007, 99:147-157.
  16. 16. Morozova O, Hirst M, Marra MA: Applications of new sequencing technologies for transcriptome analysis.Annual review of genomics and human genetics2009, 10:135-151.
  17. 17. Marguerat S, Bahler J: RNA-seq: from technology to biology.Cellular and molecular life sciences : CMLS2010, 67:569-579.
  18. 18. Costa V, Aprile M, Esposito R, Ciccodicola A: RNA-Seq and human complex diseases: recent accomplishments and future perspectives.European journal of human genetics : EJHG2013, 21:134-142.
  19. 19. Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, et al: Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data.Nucleic Acids Res2005, 33:e175.
  20. 20. Johnson WE, Li C, Rabinovic A: Adjusting batch effects in microarray expression data using empirical Bayes methods.Biostatistics2007, 8:118-127.
  21. 21. Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.Bioinformatics2010, 26:139-140.
  22. 22. Bardou P, Mariette J, Escudie F, Djemiel C, Klopp C: jvenn: an interactive Venn diagram viewer.BMC Bioinformatics2014, 15:293.
  23. 23. Chou YT, Wang H, Chen Y, Danielpour D, Yang YC: Cited2 modulates TGF-beta-mediated upregulation of MMP9.Oncogene2006, 25:5547-5560.
  24. 24. Sohn DM, Kim SY, Baek MJ, Lim CW, Lee MH, Cho MS, Kim TY: Expression of survivin and clinical correlation in patients with breast cancer.Biomedicine & pharmacotherapy = Biomedecine & pharmacotherapie2006, 60:289-292.
  25. 25. Naderi A, Meyer M: Prolactin-induced protein mediates cell invasion and regulates integrin signaling in estrogen receptor-negative breast cancer.Breast cancer research : BCR2012, 14:R111.
  26. 26. Noetzel E, Rose M, Bornemann J, Gajewski M, Knuchel R, Dahl E: Nuclear transport receptor karyopherin-alpha2 promotes malignant breast cancer phenotypes in vitro.Oncogene2012, 31:2101-2114.
  27. 27. Xu C, Wang H, He H, Zheng F, Chen Y, Zhang J, Lin X, Ma D, Zhang H: Low expression of TFPI-2 associated with poor survival outcome in patients with breast cancer.BMC cancer2013, 13:118.
  28. 28. Li C, Ma H, Wang Y, Cao Z, Graves-Deal R, Powell AE, Starchenko A, Ayers GD, Washington MK, Kamath V, et al: Excess PLAC8 promotes an unconventional ERK2-dependent EMT in colon cancer.The Journal of clinical investigation2014, 124:2172-2187.
  29. 29. Vertino PM, Yen RW, Gao J, Baylin SB: De novo methylation of CpG island sequences in human fibroblasts overexpressing DNA (cytosine-5-)-methyltransferase.Molecular and cellular biology1996, 16:4555-4565.
  30. 30. Sun L, Hui AM, Kanai Y, Sakamoto M, Hirohashi S: Increased DNA methyltransferase expression is associated with an early stage of human hepatocarcinogenesis.Japanese journal of cancer research : Gann1997, 88:1165-1170.
  31. 31. Damiani LA, Yingling CM, Leng S, Romo PE, Nakamura J, Belinsky SA: Carcinogen-induced gene promoter hypermethylation is mediated by DNMT1 and causal for transformation of immortalized bronchial epithelial cells.Cancer Res2008, 68:9005-9014.
  32. 32. Zhao M, Sun J, Zhao Z: TSGene: a web resource for tumor suppressor genes.Nucleic Acids Res2013, 41:D970-976.
  33. 33. Bauer JA, Chakravarthy AB, Rosenbluth JM, Mi D, Seeley EH, De Matos Granja-Ingram N, Olivares MG, Kelley MC, Mayer IA, Meszoely IM, et al: Identification of markers of taxane sensitivity using proteomic and genomic analyses of breast tumors from patients receiving neoadjuvant paclitaxel and radiation.Clinical cancer research : an official journal of the American Association for Cancer Research2010, 16:681-690.
  34. 34. Ray PS, Wang J, Qu Y, Sim MS, Shamonki J, Bagaria SP, Ye X, Liu B, Elashoff D, Hoon DS, et al: FOXC1 is a potential prognostic biomarker with functional significance in basal-like breast cancer.Cancer Res2010, 70:3870-3876.
  35. 35. Rody A, Karn T, Liedtke C, Pusztai L, Ruckhaeberle E, Hanker L, Gaetje R, Solbach C, Ahr A, Metzler D, et al: A clinically relevant gene signature in triple negative and basal-like breast cancer.Breast cancer research : BCR2011, 13:R97.
  36. 36. Umekita Y, Ohi Y, Souda M, Rai Y, Sagara Y, Tamada S, Tanimoto A: Maspin expression is frequent and correlates with basal markers in triple-negative breast cancer.Diagnostic pathology2011, 6:36.
  37. 37. Beltran AS, Graves LM, Blancafort P: Novel role of Engrailed 1 as a prosurvival transcription factor in basal-like breast cancer and engineering of interference peptides block its oncogenic function.Oncogene2013.
  38. 38. Jenssen TK, Kuo WP, Stokke T, Hovig E: Associations between gene expressions in breast cancer and patient survival.Hum Genet2002, 111:411-420.
  39. 39. Gordon GJ, Richards WG, Sugarbaker DJ, Jaklitsch MT, Bueno R: A prognostic test for adenocarcinoma of the lung from gene expression profiling data.Cancer Epidemiol Biomarkers Prev2003, 12:905-910.
  40. 40. Huang E, Cheng SH, Dressman H, Pittman J, Tsou MH, Horng CF, Bild A, Iversen ES, Liao M, Chen CM, et al: Gene expression predictors of breast cancer outcomes.Lancet2003, 361:1590-1596.
  41. 41. Sotiriou C, Neo SY, McShane LM, Korn EL, Long PM, Jazaeri A, Martiat P, Fox SB, Harris AL, Liu ET: Breast cancer classification and prognosis based on gene expression profiles from a population-based study.Proc Natl Acad Sci U S A2003, 100:10393-10398.
  42. 42. Zhao H, Ljungberg B, Grankvist K, Rasmuson T, Tibshirani R, Brooks JD: Gene expression profiling predicts survival in conventional renal cell carcinoma.PLoS Med2006, 3:e13.
  43. 43. Bernardi R, Gianni L: Hallmarks of triple negative breast cancer emerging at last?Cell Res2014, 24:904-905.
  44. 44. Al-Ejeh F, Simpson PT, Sanus JM, Klein K, Kalimutho M, Shi W, Miranda M, Kutasovic J, Raghavendra A, Madore J, et al: Meta-analysis of the global gene expression profile of triple-negative breast cancer identifies genes for the prognostication and treatment of aggressive breast cancer.Oncogenesis2014, 3:e100.
  45. 45. Yoon NK, Maresh EL, Shen D, Elshimali Y, Apple S, Horvath S, Mah V, Bose S, Chia D, Chang HR, Goodglick L: Higher levels of GATA3 predict better survival in women with breast cancer.Human pathology2010, 41:1794-1801.
  46. 46. Chu IM, Michalowski AM, Hoenerhoff M, Szauter KM, Luger D, Sato M, Flanders K, Oshima A, Csiszar K, Green JE: GATA3 inhibits lysyl oxidase-mediated metastases of human basal triple-negative breast cancer cells. Oncogene 2012, 31:2017-2027.

Written By

I-Hsuan Lin and Ming-Ta Hsu

Submitted: June 7th, 2014 Reviewed: October 7th, 2014 Published: March 25th, 2015