Bioinformatics in Breast Cancer Research

Advances in both computer hardware and software have made it possible to store, distribute, and analyze data obtained from biological experimentation, which is the very definition of bioinformatics. From this standpoint, bioinformatics can be narrowly defined as a field at the crossroads of biology and computer engineering, responsible for the storage, distribution, and analysis of biological information.[1] More broadly, the term refers to the creation and advancement of algorithms, computational and statistical techniques, and theory to solve formal and practical problems posed by or inspired by the management and analysis of biological data.[2,3]


Introduction
Advances in both computer hardware and software have made it possible to store, distribute, and analyze data obtained from biological experimentation, which is the very definition of bioinformatics. From this standpoint, bioinformatics can be narrowly defined as a field at the crossroads of biology and computer engineering, responsible for the storage, distribution, and analysis of biological information. [1] More broadly, the term refers to the creation and advancement of algorithms, computational and statistical techniques, and theory to solve formal and practical problems posed by or inspired by the management and analysis of biological data. [2,3] Since its emergence as an independent discipline in the 1980s, bioinformatics has developed rapidly, keeping pace with the expansion of genome sequence data. Twenty years ago, publishing computationally derived results was a challenge, and experimental observation was considered the only way of making progress. [1] After the famous Clinton-Blair handshake for the completion of the human genome in April 2003 [4], headlines such as "the laboratory rat is giving way to the computer mouse" appeared. [5] The importance of bioinformatics methods has increased further with the technological improvement of large-scale gene expression analysis using DNA microarrays and proteomics experiments. Wet-lab experiments and bioinformatics analyses go hand in hand in today's biological and clinical research. [6] Indeed, it is almost inconceivable that a high-impact research publication in biology would not contain some element of computing. [1] Today, the genome, transcriptome, and proteome are investigated with large-scale, high-throughput techniques to suggest treatments and predict outcomes.
With the availability of high-throughput sequencing in hypothesis-driven science, various sequence-based techniques have emerged, namely expressed sequence tags (ESTs) [7], serial analysis of gene expression (SAGE) [8], massively parallel signature sequencing (MPSS) [9], and the 'HapMap' project, which uses individual single nucleotide polymorphisms (SNPs) to link specific genotypes to diseases. [10,11] Aside from sequencing, microarray technology is another high-throughput technique, and possibly the most promising one. Protein analysis techniques include tissue arrays [12] and proteomics.
Microarrays are microscope slides or chips carrying immobilized probes, usually cDNA (complementary DNA), BAC (bacterial artificial chromosome), or oligonucleotide probes. [13] An array holds a very large number of spots, each containing a huge number of identical DNA molecules. Two important applications of microarray technology are gene expression monitoring and single nucleotide polymorphism (SNP) detection. [14] The technique is widely applicable because a small amount of RNA suffices to analyze thousands of genes. Despite its increasing use around the world, microarray analysis has limitations when used as the sole method for exploring tumor biology. An obvious weakness is that a microarray represents a single snapshot of the patient. [15] In addition, many kinds of lesions can disturb gene function [16], such as large and small deletions, single base substitutions, mutations that affect promoter regions or splice sites, and epigenetic silencing. These factors may influence the result, but they may also go undetected, depending on the exact type of lesion and its location relative to the region that hybridizes with the probe. [17] Furthermore, differentially expressed genes do not necessarily translate into different protein levels with functional implications; the expression of a gene does not always correlate with the amount of translated protein. [18] Also, compared to RT-PCR (reverse transcription polymerase chain reaction), microarray signals are less sensitive and less accurate, and they cannot resolve small differences in gene expression. [19] Despite its comparative simplicity, microarray technology requires a good understanding of its limitations and careful attention to experimental design and data analysis to yield meaningful results.
Bioinformatics applications are used to analyze entire gene expression profiles, to approach the disease at the genome level, and to pose new hypotheses about mechanisms including, but not limited to, the signaling pathways that govern tumor formation, maintenance, and expansion. [20] Bioinformatics analyses can also be applied to miRNA, DNA copy-number, SNP, sequence, and methylation data. [21] In the medical sciences, they help determine which genomic changes give rise to each known inherited disease, i.e., identify the disease-causing gene, and point to genetic therapies that can reverse the disease phenotype. [14] Various browsers and databases have been developed to analyze and process this huge quantity of data (Table 1.0 and Table 2.0).
Given that the discovery of complete protein classes is still in progress, e.g., the kinases of the human genome [22], the classification of proteins with related structure and function [23] will retain its significance in the molecular dissection of human health and disease. In the future, bioinformatics is expected to continue its fascinating interplay with genomics in cancer research, that is, in cancer bioinformatics and oncogenomics. [24]

Bioinformatics in various cancers
Cancer is one of the most prevalent causes of death worldwide. Now that scientists have sequenced the human genome [25], it is time to use these genomic data, and the high-throughput technology developed to generate them, to tackle major health problems such as cancer. [24] The molecular mechanisms of cancer are examined more successfully when the interactions and networks of genes and proteins are taken into account. Bioinformatics tools are vital for gaining a more holistic view of cancer and for analyzing the intricate data, which speeds up the research process, including biomarker discovery. Moreover, cancer clinical bioinformatics is critical for achieving systems clinical medicine: it combines clinical measurements and signs with bioinformatics generated from human cancer tissue to improve the understanding of clinical symptoms and signs, disease development and progression, and therapeutic strategy. [26,27,28] Lung cancer is the leading cause of cancer death, yet it still lacks reliable molecular markers. Kim et al. [29] used multiple clinical samples and combined bioinformatics analysis of public gene expression data with clinical validation to identify biomarker genes for non-small-cell lung cancer, which is characterized by poor prognosis and recurrence. They meta-analyzed SAGE and EST data and chose 20 genes for experimental validation by semiquantitative RT-PCR. They then applied quantitative RT-PCR to 7 genes (CBLC, CYP24A1, ALDH3A1, AKR1B10, S100P, PLUNC, and LOC147166) identified as potential diagnostic markers, which yielded 2 highly probable novel biomarkers (CBLC and CYP24A1).
After lung cancer, liver cancer is the most common cause of cancer-related death. Sawey et al. [30] performed a forward genetic screen using a mouse hepatoblast model and RNAi, guided by human hepatocellular carcinoma amplification data. They found that the amplification conferred selective sensitivity to FGF19 inhibition. Hence, FGF19 is as important a driver gene of the 11q13.3 amplicon as CCND1 in liver cancer, which means that 11q13.3 amplification could be an effective biomarker for identifying patients likely to respond to anti-FGF19 therapy.
In a recent study [31], an individualized bioinformatics analysis strategy was applied to previously established transcriptome data for clear cell renal cell carcinoma (ccRCC) to identify 8 FDA-approved drugs, each showing a negative correlation with a P-value < 0.05, as candidates for repositioning in anticancer therapy. The authors demonstrated that pentamidine is effective against RCC cells in culture and slows tumor growth in an RCC xenograft mouse model, so it might be a new therapeutic agent to combine with current standard-of-care regimens for patients with metastatic RCC.
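The selection criterion mentioned above (negative correlation with P < 0.05) can be illustrated with a minimal sketch. The statistic and data used in the cited study are not described here, so the following assumes a Spearman rank correlation between a disease signature and each drug-induced signature, with a one-sided permutation P-value. All signature values, drug names, and function names are hypothetical.

```python
import random
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def negative_correlation_test(disease, drug, n_perm=2000, seed=0):
    """Return (rho, p) where p is a one-sided permutation P-value
    for observing a correlation at least as negative as rho."""
    rng = random.Random(seed)
    observed = spearman(disease, drug)
    shuffled = list(drug)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if spearman(disease, shuffled) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical log fold changes for ten genes (tumor versus normal).
disease = [2.1, 1.8, 1.5, 0.9, 0.4, -0.3, -0.8, -1.2, -1.9, -2.4]

# Hypothetical drug-induced expression changes for the same ten genes.
drug_signatures = {
    "drug_a": [-2.0, -1.5, -1.7, -0.6, -0.5, 0.2, 0.9, 1.0, 2.2, 1.8],
    "drug_b": [0.3, -1.1, 0.8, -0.2, 1.4, -0.9, 0.1, 1.0, -0.5, 0.6],
}

results = {
    name: negative_correlation_test(disease, signature)
    for name, signature in drug_signatures.items()
}

# Keep drugs whose signature opposes the disease signature.
candidates = [
    name for name, (rho, p) in results.items() if rho < 0 and p < 0.05
]
print(candidates)
```

In this toy setting only the drug whose signature clearly reverses the disease signature passes both filters; a real analysis would also need correction for testing many drugs at once.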
In leukemia, diagnosis and subclassification rely mostly on techniques such as cytomorphology, cytogenetics, fluorescence in situ hybridization, multiparameter flow cytometry, and PCR-based methods, which are time-consuming and cost-intensive and require the expertise of central reference laboratories. Microarray analysis therefore represents a promising novel diagnostic tool. [14] A key determinant of prognosis in chronic lymphocytic leukemia (CLL) is the mutational status of the immunoglobulin heavy chain variable region (IGHV) genes. [32] To delineate the mutational status correctly, the patient's leukemic cells should be compared with their closest germline counterpart. Unfortunately, public web-based databases are commonly used instead of the patient's own germline DNA sequence from non-leukemic cells. These reference databases include VBASE, GenBank/IgBLAST, and the international ImMunoGeneTics information system; they differ in the software they employ, in the amount of natural IGHV polymorphism they cover, and in the criteria used to map the complementarity determining regions and framework regions. As a result, the interpretation of the IGHV mutational status in CLL may be affected. [33]
Because many tumors are heterogeneous, identifying good molecular targets is very challenging. For instance, overexpressed or mutated genes may fail as molecular targets because of resistant subclones. The best target is therefore a 'red dot' gene whose mutation occurs early in oncogenesis and dysregulates a key pathway that drives tumor growth in all of the subclones. Examples include mutations in the genes ABL, HER-2, KIT, EGFR, and probably BRAF, in chronic myelogenous leukemia, breast cancer, gastrointestinal stromal tumors, non-small-cell lung cancer, and melanoma, respectively. Efficacious therapeutics require the identification of red-dot targets, the development of drugs that inhibit them, and the diagnostic classification of the related pathways. [34]
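The dependence of the IGHV mutational status on the chosen germline reference, discussed earlier in this section, can be made concrete with a small sketch. It assumes pre-aligned sequences of equal length and a percent-identity cutoff; the 98% value is an illustrative assumption, since the text does not specify one. The sequences are synthetic toy strings, not real IGHV alleles, and all function names are hypothetical.

```python
def percent_identity(query, reference):
    """Percent identity over aligned positions, ignoring gap characters."""
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for q, r in zip(query, reference):
        if q == "-" or r == "-":
            continue  # skip alignment gaps
        compared += 1
        if q == r:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


def mutational_status(identity, cutoff=98.0):
    """Classify by identity to germline; the cutoff is an assumption."""
    return "unmutated" if identity >= cutoff else "mutated"


def substitute(seq, position, base):
    """Return seq with one base replaced (toy helper for building examples)."""
    if seq[position] == base:
        raise ValueError("replacement base must differ from the original")
    return seq[:position] + base + seq[position + 1:]


# Synthetic 100-nucleotide reference standing in for a database allele.
database_germline = "ACGT" * 25

# The patient's own germline carries two natural polymorphisms
# that this hypothetical database does not contain.
patient_germline = substitute(substitute(database_germline, 10, "T"), 50, "A")

# The leukemic clone has one additional somatic substitution.
patient_leukemic = substitute(patient_germline, 80, "C")

identity_vs_database = percent_identity(patient_leukemic, database_germline)
identity_vs_patient = percent_identity(patient_leukemic, patient_germline)

print(identity_vs_database, mutational_status(identity_vs_database))
print(identity_vs_patient, mutational_status(identity_vs_patient))
```

In this toy case the same leukemic sequence is classified differently depending on the reference, because polymorphisms missing from the database are counted as somatic mutations.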

Bioinformatics and breast cancer
Breast cancer occurs in both men and women, although male breast cancer is less common. A cure for every stage of breast cancer has not yet been found, but identifying the genetic mutations that cause the disease can play an important role. Scientists compare this task to looking for needles in a haystack: after finding the needles, that is, the coding regions, they must still find the disease-related sequences within them. [3,6] Bioinformatics sets the stage for searching 3 billion base pairs to detect genetic defects. Allinen et al. described comprehensive gene expression profiles of each cell type composing normal breast tissue and in situ and invasive breast carcinomas, using SAGE (serial analysis of gene expression) together with cell-type-specific cell surface markers and magnetic beads for rapid sequential isolation. Their results suggest that considerable transcriptional alterations occur in all cell populations, whereas genetic changes were detected only in epithelial cells and not in myoepithelial, endothelial, or stromal cells, myofibroblasts, or lymphocytes. [35] In another study, based on a systematic Sanger sequencing analysis of 13,023 genes in 11 human breast cancers, individual tumors were found to accumulate an average of approximately 90 point mutations in gene coding regions, but only a small number of these were recurrent and occurred in genes significant for breast cancer, including p53 and PIK3CA. A much larger number of the mutated genes do not necessarily contribute to carcinogenesis. [36] In the genomic landscape of breast cancer, the more commonly mutated genes resemble "mountains", while the vast majority of genes form "hills" that are infrequently mutated. Elucidating the mechanisms involved in the disease is necessary to understand the heterogeneity of human cancers and to use personal genomics for tumor diagnosis and new therapeutic strategies. [37]
It is widely accepted that early detection of breast cancer has an enormous impact on patient survival. Because the genome-wide expression patterns of tumors mirror their biology, relating gene expression patterns to clinical outcomes sheds light on the biological diversity of tumors. [38] High-throughput, genome-wide, array-based techniques such as array comparative genomic hybridization (aCGH) and transcriptional profiling can be used to discover genes and pathways that are specifically activated or inactivated during tumor progression. [13] A molecular classification of breast cancer with at least five reproducible subtypes (basal-like, ERBB2, normal-like, luminal A, luminal B) was defined through gene expression profiling and microarray analysis. [17,38,39] In addition, gene set enrichment analysis (GSEA) showed that a gene set linked to growth factor (GF) signaling was significantly enriched in luminal B tumors. [40] Another study identified multiple pathways by mapping gene sets defined in Gene Ontology Biological Process (GOBP) for estrogen receptor positive (ER+) and estrogen receptor negative (ER-) tumors; among them, in a separate set, pathways related to apoptosis and cell division were associated with the metastatic capability of ER+ tumors, and pathways related to G-protein coupled receptor signal transduction with that of ER- tumors. [41] Additionally, one study supports the view that breast cancer is initiated by mutated stem cells or progenitors, also called "breast cancer stem cells" because they are sufficient to sustain oncogenesis and tumor growth. [42] To identify genetic changes in the progression of breast carcinoma, Yao et al. [43] combined aCGH and SAGE to study ductal carcinoma in situ (DCIS), invasive breast carcinomas, and lymph node metastases.
They identified 49 minimal commonly amplified regions and reported that the overall frequency of copy number alterations was higher in invasive tumors than in DCIS, with several alterations present only in invasive cancer. In breast cancer, gene amplification occurs recurrently at certain chromosomal locations (e.g., 1q, 8p12, 8q24, 11q13, 12p13, 12q13, 17q21-q23, 20q13) [43,44], which points to the frequent activation of some oncogenes during tumor growth. Amplification is a mechanism that constitutively raises gene expression above the range of normal physiological variation, and the significance of oncogene amplification in tumorigenesis emerged from expression profiling of tumor cells with oncogene arrays. [45]
Bioinformatics is also crucial in pharmacogenomics. Effective treatment based on the biological characterization of each patient's tumor has created a need for accurate tools. Gene expression profiling of tumors with DNA microarrays is a powerful tool for the pharmacogenomic targeting of treatments. A good example is the Oncotype DX™ assay (Genomic Health), which was described for identifying the subset of node-negative, estrogen-receptor-positive breast cancer patients who do not require adjuvant chemotherapy. [34,46] A recent study using microarray analysis with qRT-PCR validation revealed distinct pathways of resistance to bevacizumab (BEV) in xenograft models of human ER+ breast cancer, with Follistatin (FST) and NOTCH as the top signaling pathways associated with resistance in VEGF-driven tumors (P < 0.05). According to the gene expression analysis, the level of VEGF expression affects the response to BEV therapy and the gene pathways involved. [47] With appropriate bioinformatics tools, such findings may clarify drug resistance in individual patients and provide a deeper understanding of treatments and risk factors, opening the door from novel targets and disease-related biomarkers to the right drugs.
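The gene set enrichment analysis mentioned earlier in this section rests on a running-sum statistic, which can be sketched briefly. The following is a simplified, unweighted variant (a Kolmogorov-Smirnov-style statistic), not the full weighted GSEA procedure with permutation-based significance testing. The gene identifiers and gene sets are hypothetical placeholders, not data from the cited studies.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    Walk down the ranked list, stepping up by 1/n_hit at each gene in
    the set and down by 1/n_miss otherwise; return the running-sum
    value with the largest absolute deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some but not all ranked genes")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical genes ordered from most up-regulated to most down-regulated.
ranked_genes = ["G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9", "G10"]

# A set concentrated near the top of the ranking gives a positive score.
top_heavy_set = {"G1", "G2", "G4"}
es_top = enrichment_score(ranked_genes, top_heavy_set)

# A set concentrated near the bottom gives a negative score.
bottom_heavy_set = {"G8", "G9", "G10"}
es_bottom = enrichment_score(ranked_genes, bottom_heavy_set)

print(round(es_top, 3), round(es_bottom, 3))
```

A score near +1 or -1 indicates that the set members cluster at one end of the ranking; in practice the score is compared against scores from permuted data to judge significance.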
Last but not least, the effect of epigenetic changes on breast cancer etiology is beyond doubt. Although a considerable number of DNA methylation studies have revealed diverse patterns involving tumor suppressor genes and oncogenes, only a small fraction of them connect epigenome data with the transcriptome. In a recent study by Minning and coworkers [48], DNA methylation and gene expression profiling of primary breast tumor tissues and adjacent noncancerous breast tissues was carried out. The results were validated with MS-MLPA or MS-qPCR. The genes overlapping between the DNA methylation and gene expression datasets were then mapped to the KEGG database to identify the molecular pathways linking them, and supervised hierarchical clustering was used for data analysis. The authors found that most of the overlapping genes belong to the focal adhesion and extracellular matrix-receptor interaction pathways, which play important roles in breast carcinogenesis. The more gene signature data different studies provide, the better the epigenetic regulation of gene expression will be understood and the more feasible remedial intervention will become.
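The overlap-and-map step described above can be sketched as a set intersection followed by a pathway lookup. This is only an illustration of the idea: the gene labels are placeholders, and the small dictionary stands in for a pathway annotation source such as KEGG; it is not taken from KEGG or from the cited study.

```python
from collections import Counter


def overlapping_genes(methylation_genes, expression_genes):
    """Genes flagged in both the methylation and the expression dataset."""
    return set(methylation_genes) & set(expression_genes)


def count_pathways(genes, gene_to_pathways):
    """Count how many of the given genes are annotated to each pathway."""
    counts = Counter()
    for gene in genes:
        for pathway in gene_to_pathways.get(gene, ()):
            counts[pathway] += 1
    return counts


# Hypothetical gene lists from the two profiling experiments.
differentially_methylated = {"GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"}
differentially_expressed = {"GENE_B", "GENE_C", "GENE_D", "GENE_F"}

# Toy annotation table (an assumption standing in for a pathway database).
toy_pathway_map = {
    "GENE_B": ["focal adhesion", "ECM-receptor interaction"],
    "GENE_C": ["focal adhesion"],
    "GENE_D": ["cell cycle"],
    "GENE_F": ["focal adhesion"],
}

shared = overlapping_genes(differentially_methylated, differentially_expressed)
pathway_hits = count_pathways(shared, toy_pathway_map)

# List pathways by the number of overlapping genes they contain.
for pathway, n_genes in pathway_hits.most_common():
    print(pathway, n_genes)
```

A real analysis would typically follow the counting with an over-representation test, so that large pathways are not favored simply because they contain many genes.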
Advances in bioinformatics and its applications are made possible largely by multidisciplinary teams pursuing focused research. The sensitivity, specificity, and combination of tools, methodologies, and databases should be evaluated comprehensively. On top of that, findings must be confirmed with several molecular techniques before translation into clinical practice.