Genomic and gene characteristics revealed by the Human Genome Project.
Dynamic regulation of genes is an important part of the cell life cycle in health and disease. This regulation includes variation and alteration of the genome and of gene expression, and a concept such as the quality of the genome may be useful for predicting and assessing the developmental stage of cells, disease status, and drug sensitivity. Recent technologies and worldwide sequencing projects have revealed 26,383 annotated genes in the 2.91-gigabase human genome [1,2]. The main molecular functions of the annotated genes, as categorized by Gene Ontology (GO), are enzyme, signal transduction, nucleic acid binding, cell adhesion, chaperone, cytoskeletal structural protein, extracellular matrix, immunoglobulin, ion channel, motor, structural protein of muscle, protooncogene, select calcium binding protein, intracellular transporter, and transporter [1,3]. Despite this wealth of knowledge, the function of 42% of the annotated genes remains unknown. When the human genome sequence was published in 2001, 39,114 genes were predicted, of which 59% were of unknown function. According to the International Human Genome Sequencing Consortium, the number of identified genes is approximately 32,000, of which 51% show a match within InterPro, a database that integrates diverse information about protein families, domains, and functional sites [2-5]. In 2001, InterPro combined sequence and pattern information from four databases (PRINTS, PROSITE, Pfam, and Prosite Profile); it now also includes information from an additional eight databases (SMART, ProDom, PIRSF, SUPERFAMILY, PANTHER, CATH-Gene3D, TIGRFAM, and HAMAP) [2,4-16]. The InterPro entries are collapsed into 12 broad categories: cellular processes, metabolism, DNA replication/modification, transcription/translation, intracellular signaling, cell–cell communication, protein folding and degradation, transport, multifunctional proteins, cytoskeletal/structural, defense and immunity, and miscellaneous function.
The rate of single nucleotide polymorphism (SNP) variation has been reported as 1 in 1250 base pairs, and more than 1.4 million SNPs have been identified (Table 1).
|Size of the genome||2.91 Gbp|||
|Number of annotated genes||26,383|||
|Main molecular functions of annotated genes||enzyme, signal transduction, nucleic acid binding, cell adhesion, chaperone, cytoskeletal structural protein, extracellular matrix, immunoglobulin, ion channel, motor, structural protein of muscle, protooncogene, select calcium binding protein, intracellular transporter, transporter|||
|Percentage of annotated genes with unknown function||42%|||
|Number of hypothetical and annotated genes||39,114|||
|Percentage of hypothetical and annotated genes with unknown function||59%|||
|Number of identified genes||approx. 32,000|||
|Percentage of matches with InterPro||51%|||
|Rate of SNP variation||1/1250 bp|||
|SNPs identified||more than 1.4 million|||
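The figures quoted above can be combined in a rough back-of-the-envelope calculation. The sketch below uses only numbers from the text and assumes, purely for illustration, that the reported SNP rate can be treated as a uniform genome-wide average; the variable names are hypothetical.

```python
# Figures quoted in the text; nothing here is new data.
GENOME_SIZE_BP = 2.91e9      # 2.91 Gbp
BP_PER_SNP = 1250            # one SNP per 1250 bp
ANNOTATED_GENES = 26_383
FRACTION_UNKNOWN = 0.42      # annotated genes with unknown function

# Assumption: the SNP rate is treated as a uniform genome-wide average.
expected_snp_sites = GENOME_SIZE_BP / BP_PER_SNP
genes_unknown_function = round(ANNOTATED_GENES * FRACTION_UNKNOWN)

print(f"Expected SNP sites: {expected_snp_sites:,.0f}")
print(f"Annotated genes of unknown function: {genes_unknown_function:,}")
```

Under this simplified reading, the expected number of variable sites is on the order of 2.3 million, the same order of magnitude as the more than 1.4 million SNPs identified.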
Among the databases combined in InterPro (Table 2), PRINTS, PROSITE, and Pfam contain protein families in which the homology between proteins is predicted from the degree of sequence similarity. The others (SMART, ProDom, PIRSF, SUPERFAMILY, PANTHER, CATH-Gene3D, TIGRFAM, and HAMAP) [4-16] have unique characteristics and URLs, and have been developed by sharing information with each other and incorporating information from GO. In detail, PRINTS is a collection of diagnostic protein family “fingerprints”, which are groups of conserved motifs evident in multiple sequence alignments; PROSITE is a protein domain database for functional characterization and annotation that consists of documentation entries describing protein domains, families, and functional sites, as well as associated patterns and profiles to identify them; Pfam contains collections of protein families, each represented by multiple sequence alignments and hidden Markov models, available via servers in the UK, the USA, and Sweden; SMART (Simple Modular Architecture Research Tool) is an online resource for the identification and annotation of protein domains and the analysis of protein domain architectures; ProDom is a comprehensive set of protein domain families generated automatically from the UniProt database; PIRSF is a classification system that reflects evolutionary relationships among full-length proteins and domains; SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes; PANTHER is a classification system that classifies genes by their functions, using published experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence; CATH-Gene3D is a comprehensive database of protein domain assignments for sequences from the major sequence databases; TIGRFAM is a collection of protein family definitions built to aid high-throughput annotation of specific protein functions; and HAMAP is composed of two databases (the proteome database and the family database) and an automatic annotation pipeline focused mainly on microbial proteomes. Hidden Markov models are commonly used as the underlying algorithm of these databases.
|InterPro||integrative predictive models of protein families, domains, and functional sites from multiple databases such as PRINTS, PROSITE, Pfam, SMART, ProDom, PIRSF, SUPERFAMILY, PANTHER, CATH-Gene3D, TIGRFAM, and HAMAP|||
|PRINTS||a collection of diagnostic protein family "fingerprints", which are groups of conserved motifs evident in multiple sequence alignments|||
|PROSITE||a protein domain database for functional characterization and annotation, which consists of documentation entries describing protein domains, families, and functional sites as well as associated patterns and profiles to identify them||http://prosite.expasy.org/|||
|Pfam||a collection of protein families, each represented by multiple sequence alignments and hidden Markov models, available via servers in the UK, the USA, and Sweden|||
|SMART||Simple Modular Architecture Research Tool, an online resource for the identification and annotation of protein domains and the analysis of protein domain architectures|||
|ProDom||a comprehensive set of protein domain families automatically generated from UniProt|||
|PIRSF||a classification system that reflects evolutionary relationships of full-length proteins and domains||http://pir.georgetown.edu/pirsf/|||
|SUPERFAMILY||a database of structural and functional annotation for all proteins and genomes|||
|PANTHER||a classification system that classifies genes by their functions, using published experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence|||
|CATH-Gene3D||a comprehensive database of protein domain assignments for sequences from the major sequence databases||http://gene3d.biochem.ucl.ac.uk/|||
|TIGRFAM||a collection of protein family definitions built to aid high-throughput annotation of specific protein functions|||
|HAMAP||a system composed of two databases, the proteome database and the family database, and of an automatic annotation pipeline||http://hamap.expasy.org/|||
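To make the idea of a PROSITE-style pattern concrete, the following minimal sketch translates the pattern syntax into a Python regular expression and scans a sequence with it. The converter covers only the common syntax elements (`x`, `[...]`, `{...}`, repeat counts, and terminal anchors), the function name `prosite_to_regex` is hypothetical, and the test sequence is invented for illustration; it is not the scanning engine used by PROSITE or InterPro.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex_parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")    # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                      # x -> any residue
        element = element.replace("<", "^").replace(">", "$")    # terminal anchors
        regex_parts.append(element)
    return "".join(regex_parts)


# N-glycosylation site pattern, written in PROSITE syntax.
glyco_regex = prosite_to_regex("N-{P}-[ST]-{P}")

# Invented sequence: one site that fits the pattern and one blocked by proline.
sequence = "MKNGSAANPTQ"
hits = [(m.start(), m.group()) for m in re.finditer(glyco_regex, sequence)]
print(glyco_regex, hits)
```

Note that `re.finditer` reports non-overlapping matches only; a lookahead would be needed to recover overlapping sites.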
2. Gene regulation
2.1. Gene markers for cancer and cancer stem cells
Several molecular markers of cancer have been identified. Metastatic cancer cells can pass into bodily fluids through the cellular epithelia, which enables the detection of cancer markers in bodily fluids such as blood plasma, urine, or saliva. The different types of cancer markers include genomic DNA point mutations, microsatellite alterations, promoter hypermethylation, viral sequences, aberrant chromosomal copy number, chromosomal translocations, deletions or loss of heterozygosity, telomere extension, alterations in RNA or protein expression, and mitochondrial DNA mutations.
Molecular markers of cancer include TP53 (encoding p53), which has been shown to be mutated in head and neck, lung, colon, pancreatic, and bladder cancer [17,18], as well as in cancers of the esophagus, breast, liver, brain, reticuloendothelial tissue, and hematopoietic tissue. Mutation of the epidermal growth factor receptor (EGFR) gene is an important predictive and prognostic factor for EGFR-tyrosine kinase inhibitor therapy in non-small cell lung cancer. RAS oncogene mutations have been identified in colorectal tumors. Microsatellites, which are tandem iterations of simple di-, tri-, or tetranucleotide repeats, have been reported to be unstable in some inherited diseases and in some types of cancer, including head and neck, lung, breast, and bladder cancer [17,23].
The expression levels of the cell cycle-related proteins p21 (CDKN1A), p53 (TP53), cyclin D1 (CCND1), and aurora kinase A (AURKA) may be used as prognostic markers to predict recurrence in stage II and stage III colon cancer. In addition, markers of the epithelial–mesenchymal transition (EMT), such as reduced expression of keratins, a switch from E-cadherin to N-cadherin, and enhanced migration in D492M cells, might be useful markers in breast cancer. Furthermore, expression of the stem cell markers cytokeratins 15 and 19 was altered in squamous cell carcinoma: cytokeratin 15 levels were decreased and the localization of cytokeratin 19 was altered. KLK3, which encodes prostate-specific antigen, a member of the kallikrein family of serine proteases, is a biomarker for prostate cancer detection and disease monitoring [27,28]. Mitochondrial DNA mutations have been associated with bladder, head and neck, lung, colorectal, and pancreatic cancer [29-32] (Table 3).
Highly parallel identification of cancer-related genes using small hairpin RNA screening has revealed that the expression of known and putative oncogenes, such as EGFR, KRAS, MYC, BCR-ABL, MYB, CRKL, and CDK4 that are essential for cancer proliferation, is altered in cancer cells . Other genes such as PTPN1, NF1, SMARCB1, and SMARCE1 have been identified as essential for the imatinib response of leukemia cells, and TOPOIIA expression is involved in resistance to etoposide, an anti-topoisomerase II agent, in small cell lung cancer [33-36].
|TP53 mutation||head and neck cancer; lung cancer (small cell lung cancer and non-small cell lung cancer); breast, colon, esophagus, liver, bladder, ovary, and brain cancers; sarcomas, lymphomas, and leukemias|||
|EGFR mutation||non-small cell lung cancer|||
|RAS mutation||colorectal tumors|||
|DNA microsatellite alterations||bladder cancer|||
|alteration in cell cycle mRNA||colon cancer|||
|alteration in cytokeratin mRNA||squamous cell carcinoma|||
|alteration in kallikrein mRNA||prostate cancer|||
|mitochondrial DNA mutations||bladder cancer, head and neck cancer, lung cancer; colorectal tumors|||
2.2. Genes related to cell proliferation
Cyclins, which regulate the cell cycle, play important roles in cell proliferation, including the uncontrolled proliferation that is the most important factor in tumorigenesis. Tumor cells accumulate mutations that result in constitutive mitogenic signaling and defective responses to anti-mitogenic signals, both of which contribute to unscheduled proliferation. In cancer, unscheduled proliferation, genomic instability, and chromosomal instability are the three major factors in cell cycle dysregulation. The cell cycle is regulated mainly by complexes of cyclins and cyclin-dependent kinases. The roles of cyclin D1 in cell migration and proliferation are separated temporo-spatially by its biphasic expression, which is induced by thrombin, a G protein-coupled receptor agonist, and mediated by nuclear factor of activated T cells c1 (NFATC1) and signal transducer and activator of transcription 3 (STAT3). Cyclin D1 regulates kinase activity and the G1–S phase transition in the cell cycle; deregulated cyclin D1 expression is well documented in breast, colon, and prostate cancers [39,40]. The expression of cyclin D1 is regulated by several factors, including cytokines such as interleukin 3 and interleukin 6 (via STAT3 and STAT5), extracellular matrix factors such as collagen, fibronectin, and vitronectin (which activate focal adhesion kinase upon integrin clustering), and hepatocyte nuclear factor 6. Cyclin D1 is a crucial regulator of Wnt- and Notch-regulated development [41,42]. The binding of Wnt to its receptor, Frizzled, releases β-catenin, which translocates from the cytoplasm to the nucleus, where it forms a complex with T-cell factor and/or lymphoid enhancer-binding factor [41,43]. Cyclin D1 is induced by overexpression of β-catenin, which is a major component of adherens junctions that link the actin cytoskeleton to members of the cadherin family of transmembrane cell–cell adhesion receptors.
β-Catenin plays an important role in linking the cytoplasmic side of cadherin-mediated cell–cell contacts to the actin cytoskeleton. β-Catenin is upregulated in colorectal cancer, which is thought to trigger cyclin D1 gene expression followed by uncontrolled progression of the cell cycle. In addition, β-catenin has a signaling role that involves transactivation in complex with transcription factors of the lymphoid enhancer-binding factor (LEF) family in the nucleus. The pathway involving β-catenin/LEF1 and elevation of cyclin D1 might be crucial for tumorigenesis. Inhibiting EglN2, a member of the EglN (also called PHD or HPH) family of prolyl hydroxylases that regulates the heterodimeric transcription factor hypoxia-inducible factor (HIF), decreases the expression of its interaction partner cyclin D1 in cancer cells and impairs the cells’ ability to proliferate in vivo.
Progression of the eukaryotic cell cycle is driven by cyclin-dependent protein kinases (CDKs), which are binding partners of cyclins. The CDK oscillator acts as the primary organizer of the cell cycle. Phosphorylation of cyclin–CDK complexes is one of the primary mechanisms of cell cycle regulation. Cyclins are degraded by ubiquitin-mediated proteolysis. The ubiquitylation and degradation of cyclin 1 and cyclin 2 are mediated by the SCF complex, a multi-subunit ubiquitin ligase that contains Skp1, a member of the cullin family (Cdc53), and an F-box protein, as well as a RING-finger-containing protein. CDKs including CDK1, CDK2, CDK4, CDK6, and CDK11 have various functions that have been investigated using loss-of-function, target validation, and gain-of-function mouse models. CDK1 is a mitotic CDK, also known as cell division control protein 2 (CDC2). It is one of the master regulators of mitosis, as it controls the centrosome cycle as well as mitotic onset; deficiency in CDK1 results in embryonic lethality in the first cell divisions [38,47]. CDK2, CDK4, and CDK6 are interphase CDKs that are not essential for the mammalian cell cycle; they are, however, required for the proliferation of specific cell types. Combined deficiency in CDK2, CDK4, and CDK6 causes mid-gestation embryonic lethality because of hematopoietic defects [38,47].
2.3. Genes related to cell differentiation
Inhibitor of differentiation 1 (Id1) is associated with the induction of cell proliferation and invasion, as well as with the invasive features of cancer and the EMT. The HOX genes encode homeodomain-containing transcription factors involved in the regulation of cellular proliferation and differentiation during embryogenesis. The expression of HOXA1, which plays an important role in proliferation, apoptosis, adhesion, invasion, the EMT, and anchorage-independent growth, was significantly higher in oral squamous cell carcinoma than in healthy oral mucosa, and it might be a useful prognostic marker for patients with this disease.
Wnt/β-catenin signaling controls skeletal development and differentiation. The initiating step of skeletal development is mesenchymal condensation, during which mesenchymal progenitor cells are at least bipotential. Osteochondral progenitor cells differentiate into osteoblasts instead of chondrocytes when Wnt/β-catenin signaling is activated. In vitro models using human pluripotent stem cell-derived neural progenitor cells have been used to examine whether G11778A-mutated mitochondrial DNA, which is associated with Leber’s hereditary optic neuropathy, might be involved in the differentiation of neural progenitor cells into neurons, oligodendrocytes, and astrocytes. The differentiation of neural progenitor cells can be visualized by staining for the neuronal marker class III beta-tubulin. Alternative splicing of exons plays an important role in cellular differentiation and pathogenesis. Alternative splicing in colorectal cancer and renal cell cancer samples has been analyzed with the Bioinformatics Exon Array Tool (BEAT).
2.4. Genes related to apoptosis
Cell proliferation and death are regulated by various molecules. Recently, microRNAs have been shown to play important roles during death receptor-mediated apoptosis (programmed cell death). Transfection with miR-133b had a proapoptotic effect on tumor necrosis factor alpha (TNFα)-stimulated HeLa cells: the expression of apoptosis regulatory proteins such as transgelin 2 (TAGLN2), myosin, heavy chain 9, non-muscle (MYH9), cytoskeleton-associated protein 4 (CKAP4), polypyrimidine tract binding protein 1 (PTBP1), glutathione-S-transferase pi 1 (GSTP1), and copine III (CPNE3) was down-regulated compared with control cells. The BCL protein family plays a major role in regulation of the apoptotic cascade. BCL2-associated X protein (BAX) promotes apoptosis and delays disease progression, and has been associated with longer disease-free survival in patients with a number of gastrointestinal cancers, such as esophageal, stomach, small intestine, and colon cancer; by contrast, high BCL6 expression is correlated with worse prognosis in patients with other gastrointestinal tumors, such as esophageal adenocarcinoma. Two major cell death pathways transduce the effects of various death inducers: the extrinsic death pathway, which is mediated through cell death receptors of the TNF receptor family, such as the Fas receptor, and the intrinsic death pathway, which proceeds through mitochondria. The expression of apoptosis signal-regulating kinase 1 (ASK1), which plays an important role as a mitogen-activated protein kinase kinase kinase in apoptosis signaling, is increased in gastric cancer. Furthermore, the levels of cyclin D1 and phosphorylated JNK were higher in gastric cancer than in non-tumor epithelium. ASK1 may therefore play a role in the development of gastric cancer.
2.5. Detection of cell proliferation or apoptosis
Several methods have been suggested for the diagnosis of cancer. Protein markers for cancer include prostate-specific antigen for prostate cancer, CA125 for ovarian cancer, carcinoembryonic antigen for colon cancer, human chorionic gonadotropin for trophoblastic cancer, and α-fetoprotein for hepatocellular carcinoma and germ cell tumors. Assays to detect telomerase activity in clinical samples include the TRAP (telomere repeat amplification protocol) assay, which involves protein extraction and subsequent primer-directed PCR amplification of telomere extensions.
Assays for the detection of kinases that regulate cell growth, proliferation, differentiation, and metabolism have been developed . The assay technology includes fluorescence polarization to detect protein phosphorylation, scintillation proximity to detect protein dephosphorylation by phosphatases, fluorescence resonance energy transfer to detect protein cleavage or modification, immunosorbent assays to detect phosphorylation state, luciferase-based ATP detection to detect the kinase-dependent depletion of ATP, luminescent oxygen channeling to detect phosphorylation, time-resolved fluorescence resonance energy transfer to detect phosphopeptide formation, and enzyme fragment complementation to detect molecular interactions with kinases [58,59]. Cell proliferation can also be determined by the tetrazolium hydroxide (XTT) cell proliferation assay, in which absorbance is measured by an ELISA reader under 490-nm-wavelength light (Biological Industries) .
Cell proliferation assays and apoptosis assays have been used to examine the effects of inhibitors on cancer cells. The proliferation of Neuro-2A neuroblastoma cells can be determined using the CellTiter 96 Aqueous Non-Radioactive Cell Proliferation Assay reagent (Promega). A colony formation assay using Neuro-2A cells was used to determine the effect of an inhibitor of GSK-3β. In this experiment, colonies were allowed to form for 10 days, after which the cells were fixed with 70% ethanol and stained with 1% methylene blue. Apoptosis was then measured by flow cytometry using an Annexin V-allophycocyanin (APC)/propidium iodide (PI) detection kit (BD PharMingen). Apoptosis was also assessed by 4′,6-diamidino-2-phenylindole (DAPI) staining to observe apoptotic nuclear morphology, and by immunoblotting with antibodies to β-catenin, X-linked inhibitor of apoptosis, and BCL2. Cell cycle status was examined by PI staining to quantify the proportions of cells in the G1/G0 or G2–M phases.
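As an illustration of how Annexin V/PI flow cytometry data are commonly interpreted, the sketch below assigns each cell to a quadrant using two fluorescence cutoffs. The conventional quadrant reading is assumed here (it is not spelled out in the text), and the cutoffs, function name, and event values are hypothetical; in practice gates are set from unstained and single-stained controls.

```python
from collections import Counter


def classify_cell(annexin_v: float, pi: float,
                  annexin_cutoff: float, pi_cutoff: float) -> str:
    """Assign one event to a quadrant of the Annexin V / PI plot."""
    annexin_pos = annexin_v >= annexin_cutoff
    pi_pos = pi >= pi_cutoff
    if not annexin_pos and not pi_pos:
        return "viable"
    if annexin_pos and not pi_pos:
        return "early apoptotic"
    if annexin_pos and pi_pos:
        return "late apoptotic/necrotic"
    return "PI only (damaged membrane)"


# Hypothetical (Annexin V, PI) fluorescence intensities for five events.
events = [(50, 30), (900, 40), (1200, 800), (60, 20), (30, 700)]
counts = Counter(classify_cell(a, p, annexin_cutoff=200, pi_cutoff=200)
                 for a, p in events)

for label, n in counts.items():
    print(f"{label}: {100 * n / len(events):.0f}%")
```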
Viable cells can be quantified using MTT (3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide) colorimetric assays. Absorbance at 570 nm is used to quantify the MTT converted by viable cells. Apoptosis can also be assessed through caspase activation, using an anti-poly ADP-ribose polymerase (PARP) antibody. Viable cells can also be quantified using a 3-(4,5-dimethyl-thiazol-2yl)-5-(3-carboxymethoxyphenyl)-2-(4-sulfophenyl)-2H-tetrazolium (MTS) kit (Promega). The terminal deoxynucleotidyl transferase dUTP nick end labeling (TUNEL) assay is commonly used to detect apoptosis. Harvested cells are resuspended in DNA labeling solution consisting of TdT reaction buffer, TdT enzyme, and BrdUTP, then incubated with a fluorescein isothiocyanate-labeled anti-BrdU antibody and stained with PI. Cell viability and proliferation assays were used to validate internal tandem duplication mutations in FLT3 as a therapeutic target for human acute myeloid leukemia. Cell viability and proliferation can be determined using a Vi-cell XR automated cell viability analyzer (Beckman Coulter).
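Absorbance readings from an MTT assay are usually converted into a relative viability. The following minimal sketch assumes the common blank-subtracted ratio of treated to untreated wells; this formula is a widely used convention rather than one given in the text, and the function name and absorbance values are hypothetical.

```python
from statistics import mean


def percent_viability(treated, control, blank):
    """Relative viability (%) from replicate absorbance readings at 570 nm."""
    background = mean(blank)
    treated_signal = mean(treated) - background
    control_signal = mean(control) - background
    if control_signal <= 0:
        raise ValueError("control signal must exceed the blank")
    return 100.0 * treated_signal / control_signal


# Hypothetical triplicate readings (arbitrary absorbance units).
blank_wells = [0.05, 0.05, 0.05]
control_wells = [1.00, 1.05, 1.10]
treated_wells = [0.50, 0.55, 0.60]

viability = percent_viability(treated_wells, control_wells, blank_wells)
print(f"Viability relative to control: {viability:.1f}%")
```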
3. Genomic variation in disease
3.1. Genome-wide association studies in cancer
Despite extensive research efforts for several decades, the genetic basis of common human diseases such as cancers remains largely unknown . Genome-wide association studies (GWAS) have emerged as an important tool for the discovery of genomic regions that harbor genetic variants conferring risk for various cancers [66,67]. Family-based linkage studies and studies comprising tens of thousands of gene-based SNPs can also assay genetic variation across the genome , but the National Institutes of Health guidelines for GWAS require a sufficient density of genetic markers to capture a large proportion of the common variants in the study population, measured in enough individuals to provide sufficient power to detect variants of modest effect . The recent success of GWAS can be attributed to the convergence of new technologies that can genotype hundreds of thousands of SNPs in hundreds or thousands of samples [66,69].
GWAS have been conducted in five of the most common cancer types (breast, prostate, colorectal, lung, and melanoma; Table 4) and have identified more than 20 novel disease loci, confirming that susceptibility to these diseases is polygenic. For many years, human genetics has been used to map rare mutations with large effect sizes in families or genetically homogeneous populations, such as BRCA1/BRCA2 mutations in Ashkenazi women with breast cancer and ovarian cancer. A number of SNPs have now been associated with breast cancer; for example, a SNP in intron 2 of the FGFR2 gene, which encodes a receptor tyrosine kinase that is amplified and overexpressed in 5–10% of breast tumors [72,73], and SNPs on chromosomes 16q and 5q. The locus on 16q contains a gene, TNRC9, and a hypothetical gene, LOC643714. The function of TNRC9 is unknown, but the presence of an HMG box motif suggests that it might act as a transcription factor. The 5q locus includes MAP3K1, which encodes a protein involved in signal transduction (but not previously known to be involved in cancer), and two other genes, MGC33648 and MIER3. In addition, several of the breast cancer loci appear to be associated with specific subtypes of the disease. In particular, the FGFR2 SNP is strongly associated with estrogen receptor-positive breast cancer, while the TNRC9 SNP is associated with both estrogen receptor-positive and -negative breast cancer [74,75]. It is surprising that none of the strongest associations map to regions harboring estrogen/progesterone genes in women of European background, particularly because a GWAS in Asian women reported a convincing association with markers near the estrogen receptor alpha (ESR1) gene. In prostate cancer, the first and most important region to emerge was 8q24. This region was first associated with prostate cancer through linkage studies by the deCODE group, was followed up by association analyses, and has been confirmed in subsequent GWAS [78-81].
Another signal, on chromosome 10q11, points to a variant in the promoter of the MSMB gene, which encodes the PSP94 protein; this is now under intense investigation as a biomarker for prostate cancer [80,81].
In general, the susceptibility alleles discovered thus far are common (that is, with a frequency of >10% in one or more populations), and each allele makes a small contribution to the overall risk of the disease. For nearly all regions conclusively identified by GWAS, the effect sizes per allele are estimated at <1.3. It was not anticipated that GWAS in certain cancers would yield many novel regions while other cancers strongly associated with particular environmental exposures would yield so few. For example, prostate cancer, breast cancer, and colon cancer have been associated with 29, 13, and 10 regions of the genome, respectively, whereas there are only three associated regions for lung cancer in smokers and three for bladder cancer, despite analysis of sufficiently large data sets. Several GWAS for lung cancer have identified the same locus on 15q25, suggesting that this is an important susceptibility locus for this disease [82-87]. This locus contains the nicotinic acetylcholine receptor subunit genes CHRNA3 and CHRNA5, suggesting that susceptibility may be mediated through smoking behavior [86,87].
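A per-allele effect size of the kind discussed above is often reported as an allelic odds ratio. The sketch below assumes that measure (the text does not name the statistic) and computes it from genotype counts in cases and controls; the function names and all counts are hypothetical and chosen only so that the result falls in the modest range described.

```python
def allele_counts(n_risk_hom: int, n_het: int, n_other_hom: int):
    """Return (risk alleles, other alleles) from genotype counts."""
    risk = 2 * n_risk_hom + n_het
    other = 2 * n_other_hom + n_het
    return risk, other


def allelic_odds_ratio(case_genotypes, control_genotypes) -> float:
    """Odds of carrying the risk allele in cases relative to controls."""
    case_risk, case_other = allele_counts(*case_genotypes)
    ctrl_risk, ctrl_other = allele_counts(*control_genotypes)
    return (case_risk * ctrl_other) / (case_other * ctrl_risk)


# Hypothetical genotype counts: (risk homozygotes, heterozygotes, other homozygotes).
cases = (100, 400, 500)
controls = (60, 380, 560)

odds_ratio = allelic_odds_ratio(cases, controls)
print(f"Allelic odds ratio: {odds_ratio:.2f}")
```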
GWAS represent an important advance in discovering genetic variants that influence disease, but they have important limitations: there is a high potential for false-positive results, they do not yield information on gene function, they are insensitive to rare and structural variants, they require large sample sizes, and they can be biased by case and control selection and by genotyping errors. Clinicians and scientists must understand the unique aspects of these studies and be able to assess and interpret GWAS results for themselves and their patients. At present, these studies mainly represent a valuable discovery tool for examining genomic function and clarifying pathophysiological mechanisms. Nevertheless, the identification through GWAS of variants, genes, and pathways involved in multiple cancers offers a potential route to new therapies, improved diagnosis, and better disease prevention.
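The false-positive problem follows directly from the number of SNPs tested. As a minimal sketch, the code below applies a Bonferroni correction, one common way of controlling false positives that is used here as an assumption rather than as the method of any cited study; the SNP count is a hypothetical value within the "hundreds of thousands" range mentioned earlier.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error at alpha."""
    return alpha / n_tests


def expected_false_positives(p_threshold: float, n_tests: int) -> float:
    """Expected number of null SNPs passing the threshold by chance alone."""
    return p_threshold * n_tests


N_SNPS = 500_000   # hypothetical number of SNPs genotyped
ALPHA = 0.05

uncorrected = expected_false_positives(ALPHA, N_SNPS)
threshold = bonferroni_threshold(ALPHA, N_SNPS)

print(f"Expected chance hits at p < {ALPHA}: {uncorrected:,.0f}")
print(f"Bonferroni-corrected threshold: {threshold:.1e}")
```

With no correction, tens of thousands of SNPs would be expected to pass a nominal threshold of 0.05 by chance, which is why genome-wide thresholds are set several orders of magnitude lower.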
|Cancer type||Reference||Year||Platform [SNPs passing QC]||Ethnic group||Initial sample size||Replication sample size|
|East Asian||2,062||2,066||East Asians||15,091||14,877|
|Swedish & Finnish||617||4,583||European||1,001||7,604|
[up to 607,728]
|Ashkenazi Jewish||249||299||Ashkenazi Jewish||1,193||1,166|
|African American||3,425||3,290||African American|
|European||1,854||1,894||European, Chinese, Japanese, African American, Latino, and Hawaiian||19,879||18,761|
|||2007||Affymetrix & Illumina|
|||2007||Affymetrix & Illumina|
|Han Chinese||245||-||Han Chinese||305||-|
|10,995 smokers||4,848 smokers|
3.2. Genetic risk score in cancer and diabetes
Type 2 diabetes mellitus and cancers are major health problems worldwide [150,151]. The recent increase in the prevalence of these diseases is largely attributable to environmental factors. However, convincing evidence shows that genetic factors may also play an important role in these diseases [152,153]. Recent GWAS have led to the identification of a series of SNPs that are robustly associated with the risk of either diabetes or cancers [151,154-159]. For type 2 diabetes mellitus, common SNPs have been identified in the PPARG, KCNJ11, and TCF7L2 genes, and have been widely replicated in populations of various ethnicities [160-162]. Other potential new loci include HHEX, CDKAL1, CDKN2A/B, IGF2BP2, SLC30A8, and WFS1 [65,155-159,163,164]. A number of SNPs have been identified as associated with breast cancer risk, including SNPs in FGFR2, CASP8, ERBB4, TAB2, BARX2, TMEM45B, ESR1, TNRC9, MAP3K1, MGC33648, MIER3, and RAD51L1 [74,75,151] (Table 5).
Combining multiple loci with modest effects into a global genetic risk score (GRS) might improve the identification of those at risk for common complex diseases such as type 2 diabetes and cancers [165-167]. Several studies have developed methods to predict the risk of certain diseases, such as coronary heart disease, type 2 diabetes, and breast cancer, by aggregating information from multiple SNPs into a single GRS [151,168,169]. For example, in the Atherosclerosis Risk in Communities study, the aggregation of multiple SNPs into a single GRS improved the prediction of coronary heart disease incidence. In a study that used a GRS to determine the risk of type 2 diabetes in US men and women, individuals in the highest quintile of GRS had a significantly increased risk of type 2 diabetes compared with those in the lowest quintile; however, adding a GRS to the conventional model, which consisted of lifestyle risk factors, increased the area under the curve (AUC) by only 1% (AUC = 0.78). In this instance, the GRS was determined to be useful only when combined with the body mass index or a family history of diabetes. For breast cancer, a GRS was created using 14 SNPs previously associated with breast cancer, and was substantially more predictive of estrogen receptor-positive breast cancer than of estrogen receptor-negative breast cancer, particularly for absolute risk. Further studies are needed to confirm whether a GRS improves disease risk prediction.
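The area under the curve used to compare these prediction models can be read as the probability that a randomly chosen case receives a higher risk score than a randomly chosen control. The sketch below computes it directly from that definition; the function name and the scores are hypothetical and serve only to show the calculation.

```python
def auc(case_scores, control_scores) -> float:
    """Probability that a case outranks a control (ties count as one half)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical predicted risks for four cases and four controls.
case_risk = [0.9, 0.7, 0.6, 0.4]
control_risk = [0.8, 0.5, 0.3, 0.2]

model_auc = auc(case_risk, control_risk)
print(f"AUC = {model_auc:.2f}")
```

An AUC of 0.5 corresponds to a model that ranks cases and controls no better than chance, and 1.0 to perfect separation, so a gain of 0.01 represents a small improvement in discrimination.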
The GRS is calculated on the basis of reproducible tagging of SNP-associated loci reaching genome-wide levels of significance. The GRS can be created by two methods: a simple count method (count GRS) and a weighted method (weighted GRS) [169,170]. Both methods assume that each SNP is independently associated with risk. An additive genetic model is used for each SNP, applying a linear weighting of 0, 1, or 2 to genotypes containing 0, 1, or 2 risk alleles, respectively. This model is known to perform well even when the true genetic model is unknown or wrongly specified. The count GRS assumes that each SNP in the panel contributes equally to the disease risk and is calculated by summing the values for each of the SNPs. The weighted GRS is calculated by multiplying each β-coefficient (the estimate obtained from an analysis of standardized variables) by the number of corresponding risk alleles (0, 1, or 2) and summing the products.
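The two scoring methods described above can be sketched in a few lines. The SNP identifiers, genotypes, and β-coefficients below are hypothetical placeholders rather than values from any cited study, and the function names are introduced only for this illustration.

```python
def count_grs(risk_alleles: dict) -> int:
    """Count GRS: every SNP contributes equally (0, 1, or 2 risk alleles)."""
    for snp, n in risk_alleles.items():
        if n not in (0, 1, 2):
            raise ValueError(f"{snp}: risk allele count must be 0, 1, or 2")
    return sum(risk_alleles.values())


def weighted_grs(risk_alleles: dict, betas: dict) -> float:
    """Weighted GRS: each risk allele count is multiplied by its beta-coefficient."""
    count_grs(risk_alleles)  # reuse the genotype validation
    return sum(betas[snp] * n for snp, n in risk_alleles.items())


# Hypothetical individual and hypothetical per-SNP beta-coefficients.
genotype = {"snp_a": 2, "snp_b": 1, "snp_c": 0}
beta = {"snp_a": 0.10, "snp_b": 0.25, "snp_c": 0.05}

simple_score = count_grs(genotype)
weighted_score = weighted_grs(genotype, beta)
print(simple_score, round(weighted_score, 2))
```

The two scores rank individuals identically only when all β-coefficients are equal; otherwise the weighted score gives more influence to SNPs with larger estimated effects.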
|Disease||Reference||Year||Ethnic group||Participants||No. of SNPs||Genes found from GWAS|
|||2008||Framingham||2,377 diabetic patients||18||NOTCH2 (rs10923931),|
CDC123, CAMK1D (rs12779790),
KCNJ11 (rs5219), INS (rs689),
TSPAN8, LGR5 (rs7961581)
3.3. Cancer Cell Line Encyclopedia
The Cancer Cell Line Encyclopedia (CCLE) has made predictive modeling of anticancer drug sensitivity a realistic proposition by determining genomic markers of drug sensitivity in cancer cells [172,173]. The CCLE contains information from 947 human cancer cell lines, including gene expression, chromosomal copy number, and massively parallel sequencing data. It has been used to identify genetic, lineage-specific, and gene expression-based predictors of drug sensitivity. This has revealed, for example, that the plasma cell lineage is correlated with sensitivity to IGF1 receptor inhibitors, that aryl hydrocarbon receptor (AHR) expression is associated with MEK inhibitor efficacy in NRAS-mutant lines, and that SLFN11 expression is associated with sensitivity to topoisomerase inhibitors. Genomic markers of drug sensitivity in cancer cells have also been systematically identified using the Genomics of Drug Sensitivity in Cancer database.
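The simplest form of an expression-based predictor is a correlation between a gene's expression and a drug response measured across cell lines. The sketch below uses a Pearson correlation as a stand-in; the text does not describe the modeling method used with the CCLE, and the expression and sensitivity values are invented for illustration only.

```python
import math


def pearson_r(x, y) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Invented values for four cell lines: expression of one gene and a
# drug-sensitivity score (higher means more sensitive).
expression = [2.0, 4.0, 6.0, 8.0]
sensitivity = [1.0, 3.0, 2.0, 5.0]

r = pearson_r(expression, sensitivity)
print(f"r = {r:.2f}")
```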
The genomes of cancer cells undergo dramatic changes, which vary according to cancer subtype. Integrative, wide-ranging investigations of cancer cell genomes have revealed mutations and alterations in gene expression that are associated with the disease. Databases that include abundant data on gene and protein conformation, gene expression, and genomic mutations enable the construction of dynamic cellular simulations and disease models. New tools such as next-generation sequencing will open new horizons in the prediction of disease and drug sensitivity, which plays an important role in personalized medicine. Appropriate translation of this abundance of information into clinical practice is one of the most important future challenges for medicine. The quality of the genome may be one of the important factors for detecting the development of disease.
The authors are grateful to all those who helped with preparation of the manuscript. In particular, we thank Jaeseong Jo for his great assistance.