Genomic and gene characteristics revealed by the Human Genome Project.
Dynamic regulation of genes is an important part of the cell life cycle in health and disease. This regulation includes variation and alteration of the genome and of gene expression, and a concept such as the quality of the genome will be useful for predicting and assessing the developmental stages of cells, disease status, and drug sensitivity. Recent technologies and worldwide sequencing projects have revealed 26,383 annotated genes in the 2.91-gigabase human genome [1,2]. The main molecular functions of the annotated genes, as categorized by Gene Ontology (GO), are enzyme, signal transduction, nucleic acid binding, cell adhesion, chaperone, cytoskeletal structural protein, extracellular matrix, immunoglobulin, ion channel, motor, structural protein of muscle, protooncogene, select calcium binding protein, intracellular transporter, and transporter [1,3]. Despite a wealth of knowledge, the function of 42% of the annotated genes remains unknown. When the human genome sequence was published in 2001, there were a predicted 39,114 genes, of which 59% were of unknown function. According to the International Human Genome Sequencing Consortium, the number of identified genes is approximately 32,000, of which 51% show a match within InterPro, a database that integrates diverse information about protein families, domains, and functional sites [2-5]. In 2001, InterPro combined sequence and pattern information from four databases (PRINTS, PROSITE, Pfam, and Prosite Profile); it now also includes information from eight additional databases (SMART, ProDom, PIRSF, SUPERFAMILY, PANTHER, CATH-Gene3D, TIGRFAM, and HAMAP) [2,4-16]. The InterPro entries are collapsed into 12 broad categories: cellular processes, metabolism, DNA replication/modification, transcription/translation, intracellular signaling, cell–cell communication, protein folding and degradation, transport, multifunctional proteins, cytoskeletal/structural, defense and immunity, and miscellaneous function.
The rate of single nucleotide polymorphism (SNP) variation has been reported as 1 in 1250 base pairs, and more than 1.4 million SNPs have been identified (Table 1).
|Size of the genome||2.91 Gbp|||
|Number of annotated genes||26,383|||
|Main molecular functions of annotated genes||enzyme, signal transduction, nucleic acid binding, cell adhesion, chaperone, cytoskeletal structural protein, extracellular matrix, immunoglobulin, ion channel, motor, structural protein of muscle, protooncogene, select calcium binding protein, intracellular transporter, transporter|||
|Percentage of annotated genes with unknown function||42%|||
|Number of hypothetical and annotated genes||39,114|||
|Percentage of hypothetical and annotated genes with unknown function||59%|||
|Number of identified genes||approx. 32,000|||
|Percentage of matches with InterPro||51%|||
|Rate of SNP variation||1/1250 bp|||
|SNPs identified||more than 1.4 million|||
Among the databases combined in InterPro (Table 2), PRINTS, PROSITE, and Pfam contain protein families in which the homology between proteins is predicted from the degree of sequence similarity. The others (SMART, ProDom, PIRSF, SUPERFAMILY, PANTHER, CATH-Gene3D, TIGRFAM, and HAMAP [4-16]) each have distinct characteristics and URLs, and have been developed by sharing information with one another and by incorporating information from GO. In detail, PRINTS is a collection of diagnostic protein family "fingerprints", which are groups of conserved motifs evident in multiple sequence alignments; PROSITE is a protein domain database for functional characterization and annotation that consists of documentation entries describing protein domains, families, and functional sites, as well as the associated patterns and profiles used to identify them; Pfam contains collections of protein families, each represented by multiple sequence alignments and hidden Markov models, available via servers in the UK, the USA, and Sweden; SMART (Simple Modular Architecture Research Tool) is an online resource for the identification and annotation of protein domains and the analysis of protein domain architectures; ProDom is a comprehensive set of protein domain families generated automatically from the UniProt database; PIRSF is a classification system that reflects evolutionary relationships among full-length proteins and domains; SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes; PANTHER is a classification system that classifies genes by their functions, using published experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence; CATH-Gene3D is a comprehensive database of protein domain assignments for sequences from the major sequence databases; TIGRFAM is a collection of protein family definitions built to aid high-throughput annotation of specific protein functions; and HAMAP is composed of two databases, the proteome database and the family database, together with an automatic annotation pipeline mainly focused on microbial proteomes. Hidden Markov models are commonly used as the underlying algorithm in these databases.
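The pattern-based identification that PROSITE provides can be illustrated with a small sketch. The code below assumes a simplified subset of the PROSITE pattern syntax (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for sequence ends) and translates it into a Python regular expression. The function name `prosite_to_regex` and the toy sequence are hypothetical, and this is not the scanning code used by PROSITE or InterPro.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified PROSITE-style pattern into a Python regex."""
    pattern = pattern.rstrip(".")  # PROSITE patterns end with a period
    parts = []
    for element in pattern.split("-"):
        anchored_start = element.startswith("<")
        anchored_end = element.endswith(">")
        element = element.strip("<>")
        # Separate the residue specification from an optional repeat, e.g. x(2,4)
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        # [ST]-style alternatives and single letters are already valid regex
        if repeat:
            core += "{" + repeat + "}"
        parts.append(("^" if anchored_start else "") + core + ("$" if anchored_end else ""))
    return "".join(parts)


# N-glycosylation site pattern: N, then anything but P, then S or T, then anything but P
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
toy_sequence = "MKNGSAANPTQ"  # made-up sequence for illustration only
hits = [m.start() for m in re.finditer(regex, toy_sequence)]
print(regex, hits)
```

Note that `re.finditer` reports non-overlapping matches only, so a production scanner would need a lookahead or a sliding window to report overlapping sites.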
|InterPro||integrated predictive models of protein families, domains, and functional sites from multiple databases such as PRINTS, PROSITE, Pfam, SMART, ProDom, PIRSF, SUPERFAMILY, PANTHER, CATH-Gene3D, TIGRFAM, and HAMAP|||||
|PRINTS||a collection of diagnostic protein family "fingerprints", which are groups of conserved motifs evident in multiple sequence alignments|||||
|PROSITE||a protein domain database for functional characterization and annotation, which consists of documentation entries describing protein domains, families, and functional sites as well as the associated patterns and profiles used to identify them||http://prosite.expasy.org/|||
|Pfam||a collection of protein families, each represented by multiple sequence alignments and hidden Markov models, available via servers in the UK, the USA, and Sweden|||||
|SMART||an online resource for the identification and annotation of protein domains and the analysis of protein domain architectures; the name stands for Simple Modular Architecture Research Tool|||||
|ProDom||a comprehensive set of protein domain families automatically generated from the UniProt database|||||
|PIRSF||a classification system that reflects evolutionary relationships among full-length proteins and domains||http://pir.georgetown.edu/pirsf/|||
|SUPERFAMILY||a database of structural and functional annotation for all proteins and genomes|||||
|PANTHER||a classification system that classifies genes by their functions, using published experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence|||||
|CATH-Gene3D||a comprehensive database of protein domain assignments for sequences from the major sequence databases||http://gene3d.biochem.ucl.ac.uk/|||
|TIGRFAM||a collection of protein family definitions built to aid high-throughput annotation of specific protein functions|||||
|HAMAP||a system composed of two databases, the proteome database and the family database, and of an automatic annotation pipeline||http://hamap.expasy.org/|||
2. Gene regulation
2.1. Gene markers for cancer and cancer stem cells
Several molecular markers of cancer have been identified. Metastatic cancer cells can transfer into bodily fluids through the cellular epithelia, which enables the detection of cancer markers in bodily fluids such as blood plasma, urine, or saliva. The different types of cancer markers include genomic DNA point mutations, microsatellite alterations, promoter hypermethylation, viral sequences, aberrant chromosomal copy number, chromosomal translocations, deletions, or loss of heterozygosity, telomere extension, alterations in RNA or protein expression, and mitochondrial DNA mutations.
Molecular markers of cancer include
The expression levels of the cell cycle-related proteins p21 (
Highly parallel identification of cancer-related genes using small hairpin RNA screening has revealed that the expression of known and putative oncogenes, such as
|head and neck cancer|||
|lung cancer (small cell lung cancer and non-small cell lung cancer); breast, colon, esophagus, liver, bladder, ovary, and brain cancers; sarcomas, lymphomas, and leukemias|||
|non-small cell lung cancer|||
|DNA microsatellite alterations||bladder cancer|||
|alteration in cell cycle mRNA|
|alteration in cytokeratin mRNA|
|squamous cell carcinoma|||
|alteration in kallikrein mRNA|
|mitochondrial DNA mutations||bladder cancer, head and neck cancer, lung cancer|||
|colorectal tumors|||
2.2. Genes related to cell proliferation
Cyclins, which regulate the cell cycle, play important roles in cell proliferation, and uncontrolled cell proliferation is the most important factor in tumorigenesis. Tumor cells accumulate mutations that result in constitutive mitogenic signaling and defective responses to anti-mitogenic signals, both of which contribute to unscheduled proliferation. In cancer, unscheduled proliferation, genomic instability, and chromosomal instability are the three major factors in cell cycle dysregulation. Regulation of the cell cycle is mainly conducted by complexes of cyclins and cyclin-dependent kinases. The roles of cyclin D1 in cell migration and proliferation are temporally and spatially separated by its biphasic expression, which is induced by thrombin, a G protein-coupled receptor agonist, and mediated by nuclear factor of activated T cells c1 (
Progression of the eukaryotic cell cycle is driven by cyclin-dependent protein kinases (CDKs), which are the binding partners of cyclins. The CDK oscillator acts as the primary organizer of the cell cycle. Phosphorylation of cyclin-CDK complexes is one of the primary mechanisms of cell cycle regulation. Cyclins are degraded by ubiquitin-mediated proteolysis. The ubiquitylation and degradation of cyclin 1 and cyclin 2 are mediated by the SCF complex, a multi-subunit ubiquitin ligase that contains Skp1, a member of the cullin family (Cdc53), and an F-box protein, as well as a RING-finger-containing protein. CDKs including CDK1, CDK2, CDK4, CDK6, and CDK11 have various functions that have been investigated using loss-of-function, target validation, and gain-of-function mouse models. CDK1 is a mitotic CDK, also known as cell division control protein 2 (CDC2). It is one of the master regulators of mitosis, as it controls the centrosome cycle as well as mitotic onset; deficiency in CDK1 results in embryonic lethality in the first cell divisions [38,47]. CDK2, CDK4, and CDK6 are interphase CDKs that are not essential for the mammalian cell cycle; they are, however, required for the proliferation of specific cell types. Deficiency in CDK2, CDK4, and CDK6 together causes mid-gestation embryonic lethality because of hematopoietic defects [38,47].
2.3. Genes related to cell differentiation
Inhibitor of differentiation 1 (Id1) is associated with the induction of cell proliferation and invasion, as well as with the invasive features of cancer and the epithelial-mesenchymal transition (EMT). The
Wnt/β-catenin signaling controls skeletal development and differentiation. The initiating step of skeletal development is mesenchymal condensation, during which mesenchymal progenitor cells are at least bipotential. Osteochondral progenitor cells differentiate into osteoblasts instead of chondrocytes when Wnt/β-catenin signaling is activated.
2.4. Genes related to apoptosis
Cell proliferation and death are regulated by various molecules. Recently, microRNAs have been shown to play important roles during death receptor-mediated apoptosis (programmed cell death). Transfection with miR-133b had a proapoptotic effect on tumor necrosis factor alpha (TNFα)-stimulated HeLa cells: the expression of apoptosis regulatory proteins such as transgelin 2 (TAGLN2), myosin, heavy chain 9, non-muscle (MYH9), cytoskeleton-associated protein 4 (CKAP4), polypyrimidine tract binding protein 1 (PTBP1), glutathione-S-transferase pi 1 (GSTP1), and copine III (CPNE3) was down-regulated compared with that in control cells. The BCL protein family plays a major role in regulation of the apoptotic cascade. BCL2-associated X protein (BAX) promotes apoptosis and delays disease progression, and has been associated with longer disease-free survival in patients with a number of gastrointestinal cancers, such as esophageal, stomach, small intestine, and colon cancer; moreover, high BCL6 expression is correlated with worse prognosis in patients with other gastrointestinal tumors, such as esophageal adenocarcinoma. There are two major cell death pathways that transduce the effects of various death inducers: the extrinsic death pathway, which is mediated through cell death receptors of the TNF receptor family, such as the Fas receptor; and the intrinsic death pathway, which proceeds through mitochondria. The expression of apoptosis signal-regulating kinase 1 (ASK1), which plays an important role as a mitogen-activated protein kinase kinase kinase in apoptosis signaling, is increased in gastric cancer. Furthermore, the levels of cyclin D1 and phosphorylated JNK were higher in gastric cancer than in non-tumor epithelium. ASK1 may therefore play a role in the development of gastric cancer.
2.5. Detection of cell proliferation or apoptosis
Several methods have been suggested for the diagnosis of cancer. Protein markers for cancer include prostate-specific antigen for prostate cancer, CA125 for ovarian cancer, carcinoembryonic antigen for colon cancer, human chorionic gonadotropin for trophoblastic cancer, and α-fetoprotein for hepatocellular carcinoma and germ cell tumors. Assays to detect telomerase activity in clinical samples include the TRAP (telomeric repeat amplification protocol) assay, which involves protein extraction and subsequent primer-directed PCR amplification of telomere extensions.
Assays for the detection of kinases that regulate cell growth, proliferation, differentiation, and metabolism have been developed. The assay technologies include fluorescence polarization to detect protein phosphorylation, scintillation proximity to detect protein dephosphorylation by phosphatases, fluorescence resonance energy transfer to detect protein cleavage or modification, immunosorbent assays to detect phosphorylation state, luciferase-based ATP detection to detect the kinase-dependent depletion of ATP, luminescent oxygen channeling to detect phosphorylation, time-resolved fluorescence resonance energy transfer to detect phosphopeptide formation, and enzyme fragment complementation to detect molecular interactions with kinases [58,59]. Cell proliferation can also be determined by the tetrazolium hydroxide (XTT) cell proliferation assay (Biological Industries), in which absorbance is measured at 490 nm with an ELISA reader.
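Absorbance readings from colorimetric assays such as XTT are often reported relative to an untreated control. The sketch below shows one common normalization, assuming blank-corrected absorbance and an untreated control well; the function name `percent_viability` and the readings are hypothetical, and the formula is a widely used convention, not a procedure taken from the cited kit protocols.

```python
def percent_viability(sample_abs: float, control_abs: float, blank_abs: float = 0.0) -> float:
    """Return the signal of a treated well as a percentage of the untreated control.

    All values are absorbance readings at the assay wavelength (e.g. 490 nm).
    The blank (medium plus reagent, no cells) is subtracted from both wells.
    """
    corrected_control = control_abs - blank_abs
    if corrected_control <= 0:
        raise ValueError("control absorbance must exceed the blank")
    return 100.0 * (sample_abs - blank_abs) / corrected_control


# Illustrative readings only, not measured data
readings = {"untreated": 1.25, "inhibitor": 0.65, "blank": 0.05}
viability = percent_viability(readings["inhibitor"], readings["untreated"], readings["blank"])
print(f"{viability:.1f}% of control")
```

In practice, replicate wells would be averaged before normalization, and the linear range of the assay would need to be checked for the cell line in question.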
Cell proliferation assays and apoptosis assays have been used to examine the effects of inhibitors on cancer cells. The proliferation of Neuro-2A neuroblastoma cells can be determined using the CellTiter 96 Aqueous Non-Radioactive Cell Proliferation Assay reagent (Promega). A colony formation assay using Neuro-2A cells was used to determine the effect of an inhibitor of GSK-3β. In this experiment, colonies were allowed to form for 10 days, after which the cells were fixed with 70% ethanol and stained with 1% methylene blue. Apoptosis was then measured by flow cytometry using an Annexin V-allophycocyanin (APC)/propidium iodide (PI) detection kit (BD PharMingen). Apoptosis was also determined by 4′,6-diamidino-2-phenylindole (DAPI) staining, by observing apoptotic nuclear morphology, and by immunoblotting with antibodies to β-catenin, X-linked inhibitor of apoptosis, and BCL2. Cell cycle analysis using PI to quantify the proportions of cells in the G1/G0 or G2–M phases was used to examine cell cycle status.
Viable cells can be determined using MTT (3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide) colorimetric assays. Absorbance at 570 nm is used to quantify the reduction of MTT. Apoptosis can also be assessed through caspase activation, detected using an anti-poly(ADP-ribose) polymerase (PARP) antibody. Viable cells can also be determined using a 3-(4,5-dimethylthiazol-2-yl)-5-(3-carboxymethoxyphenyl)-2-(4-sulfophenyl)-2H-tetrazolium (MTS) kit (Promega). The terminal deoxynucleotidyl transferase dUTP nick end labeling (TUNEL) assay is commonly used to detect apoptosis. Harvested cells are resuspended in DNA labeling solution consisting of TdT reaction buffer, TdT enzyme, and BrdUTP; the incorporated BrdU is then detected with a fluorescein isothiocyanate-labeled anti-BrdU antibody, and the cells are stained with PI. Cell viability and proliferation assays were used to validate internal tandem duplication mutations in
3. Genomic variation in disease
3.1. Genome-wide association studies in cancer
Despite extensive research efforts over several decades, the genetic basis of common human diseases such as cancers remains largely unknown. Genome-wide association studies (GWAS) have emerged as an important tool for the discovery of genomic regions that harbor genetic variants conferring risk for various cancers [66,67]. Family-based linkage studies and studies comprising tens of thousands of gene-based SNPs can also assay genetic variation across the genome, but the National Institutes of Health guidelines for GWAS require a sufficient density of genetic markers to capture a large proportion of the common variants in the study population, measured in enough individuals to provide sufficient power to detect variants of modest effect. The recent success of GWAS can be attributed to the convergence of new technologies that can genotype hundreds of thousands of SNPs in hundreds or thousands of samples [66,69].
GWAS have been conducted in five of the most common cancer types (breast, prostate, colorectal, lung, and melanoma; Table 4) and have identified more than 20 novel disease loci, confirming that susceptibility to these diseases is polygenic. For many years, human genetics has been used to map rare mutations with large effect sizes in families or genetically homogeneous populations, such as
In general, the susceptibility alleles discovered thus far are common (that is, with a frequency of >10% in one or more populations), and each allele confers a small contribution to the overall risk of the disease. For nearly all regions conclusively identified by GWAS, the effect sizes per allele are estimated at <1.3. It was not anticipated that GWAS in certain cancers would yield many novel regions when other cancers strongly associated with particular environmental exposures have yielded so few regions. For example, prostate cancer, breast cancer, and colon cancer have been associated with 29, 13, and 10 regions of the genome, respectively, whereas there are only three associated regions for lung cancer in smokers and three for bladder cancer, despite analysis of sufficiently large data sets. Several GWAS for lung cancer have identified the same locus on 15q25, suggesting that this is an important susceptibility locus for this disease [82-87]. This locus contains the nicotinic acetylcholine receptor subunit genes
GWAS represent an important advance in discovering genetic variants that influence disease, but they have important limitations. There is a high potential for false-positive results, they do not yield information on gene function, they are insensitive to rare and structural variants, they require large sample sizes, and they are subject to possible biases arising from case and control selection and genotyping errors. Clinicians and scientists must understand the unique aspects of these studies and be able to assess and interpret GWAS results for themselves and their patients. At present, these studies mainly represent a valuable discovery tool for examining genomic function and clarifying pathophysiological mechanisms. Nevertheless, the identification through GWAS of variants, genes, and pathways involved in multiple cancers offers a potential route to new therapies, improved diagnosis, and better disease prevention.
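The false-positive problem noted above follows from the number of tests performed. The sketch below is a simple illustration, assuming independent tests and a Bonferroni correction (a common but conservative choice that the text itself does not name). The function names and the figure of 500,000 SNPs are illustrative assumptions.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of null SNPs passing an uncorrected threshold alpha."""
    return n_tests * alpha


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


n_snps = 500_000  # illustrative size for a genotyping array
print(expected_false_positives(n_snps, 0.05))  # false positives expected without correction
print(bonferroni_threshold(n_snps))            # per-SNP threshold after correction
```

With hundreds of thousands of SNPs, an uncorrected threshold of 0.05 would flag tens of thousands of SNPs by chance alone, which is why GWAS apply stringent genome-wide significance levels and require replication.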
|East Asian||2,062||2,066||East Asians||15,091||14,877|
|Swedish & Finnish||617||4,583||European||1,001||7,604|
[up to 607,728]
|Ashkenazi Jewish||249||299||Ashkenazi Jewish||1,193||1,166|
|African American||3,425||3,290||African American|
|European||1,854||1,894||European, Chinese, Japanese, African American, Latino, and Hawaiian||19,879||18,761|
|||2007||Affymetrix & Illumina|
|||2007||Affymetrix & Illumina|
|Han Chinese||245||-||Han Chinese||305||-|
|10,995 smokers||4,848 smokers|
3.2. Genetic risk score in cancer and diabetes
Type 2 diabetes mellitus and cancers are major health problems worldwide [150,151]. The recent increase in the prevalence of these diseases is largely attributable to environmental factors. However, convincing evidence shows that genetic factors may play an important role in these diseases [152,153]. Recent GWAS have led to the identification of a series of SNPs that are robustly associated with either the risk of diabetes or cancers [151,154-159]. For type 2 diabetes mellitus, common SNPs have been identified in the
Combining multiple loci with modest effects into a global genetic risk score (GRS) might improve the identification of those at risk for common complex diseases such as type 2 diabetes and cancers [165-167]. Several studies have developed methods to predict the risk of certain diseases, such as coronary heart disease, type 2 diabetes, and breast cancer, by aggregating information from multiple SNPs into a single GRS [151,168,169]. For example, in the Atherosclerosis Risk in Communities study, the aggregation of multiple SNPs into a single GRS improved the prediction of coronary heart disease incidence. In a study that used a GRS to determine the risk of type 2 diabetes in US men and women, individuals in the highest quintile of GRS had a significantly increased risk of type 2 diabetes compared with those in the lowest quintile; however, adding the GRS to the conventional model consisting of lifestyle risk factors increased the area under the curve by only 1% (AUC=0.78). In this instance, the GRS was determined to be useful only when combined with the body mass index or a family history of diabetes. For breast cancer, a GRS was created using 14 SNPs previously associated with breast cancer, and was substantially more predictive of estrogen receptor-positive breast cancer than of estrogen receptor-negative breast cancer, particularly for absolute risk. Further studies are needed to confirm whether a GRS improves disease risk prediction.
The GRS is calculated from SNPs that reproducibly tag associated loci at genome-wide levels of significance. The GRS can be created by two methods: a simple count method (count GRS) and a weighted method (weighted GRS) [169,170]. Both methods assume that each SNP is independently associated with risk. An additive genetic model is used for each SNP, applying a linear weighting of 0, 1, or 2 to genotypes containing 0, 1, or 2 risk alleles, respectively. This model is known to perform well even when the true genetic model is unknown or wrongly specified. The count GRS assumes that each SNP in the panel contributes equally to the disease risk and is calculated by summing the values for each of the SNPs. The weighted GRS is calculated by multiplying each β-coefficient (the estimate resulting from an analysis carried out on standardized variables) by the number of corresponding risk alleles (0, 1, or 2) and summing the products.
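The two scoring methods can be sketched in a few lines. The example below assumes the additive coding described above (0, 1, or 2 risk alleles per SNP) and a plain sum of products for the weighted score. The function names, genotype counts, and β values are hypothetical placeholders, not estimates from any cited study, and some published scores apply an additional rescaling that is not shown here.

```python
from typing import Mapping


def _check_counts(risk_alleles: Mapping[str, int]) -> None:
    """Ensure every SNP carries 0, 1, or 2 risk alleles (additive model)."""
    for snp, count in risk_alleles.items():
        if count not in (0, 1, 2):
            raise ValueError(f"{snp}: risk allele count must be 0, 1, or 2")


def count_grs(risk_alleles: Mapping[str, int]) -> int:
    """Count GRS: every SNP contributes equally, so the score is the allele total."""
    _check_counts(risk_alleles)
    return sum(risk_alleles.values())


def weighted_grs(risk_alleles: Mapping[str, int], betas: Mapping[str, float]) -> float:
    """Weighted GRS: each allele count is multiplied by its SNP's beta, then summed."""
    _check_counts(risk_alleles)
    missing = set(risk_alleles) - set(betas)
    if missing:
        raise KeyError(f"no beta coefficient for: {sorted(missing)}")
    return sum(betas[snp] * count for snp, count in risk_alleles.items())


# Hypothetical individual; counts and betas are made up for illustration
genotype = {"rs10923931": 1, "rs12779790": 0, "rs5219": 2, "rs689": 1, "rs7961581": 2}
betas = {"rs10923931": 0.10, "rs12779790": 0.08, "rs5219": 0.12, "rs689": 0.05, "rs7961581": 0.07}

print(count_grs(genotype))
print(round(weighted_grs(genotype, betas), 2))
```

The count score treats all loci as interchangeable, whereas the weighted score lets loci with larger estimated effects dominate; the two agree up to scale only when all β values are equal.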
|||2008||Framingham||2,377 diabetic patients||18||NOTCH2 (rs10923931),|
CDC123, CAMK1D (rs12779790),
KCNJ11 (rs5219), INS (rs689),
TSPAN8, LGR5 (rs7961581)
3.3. Cancer Cell Line Encyclopedia
The Cancer Cell Line Encyclopedia (CCLE) has made predictive modeling of anticancer drug sensitivity a realistic proposition by determining genomic markers of drug sensitivity in cancer cells [172,173]. The CCLE contains information from 947 human cancer cell lines, including data on gene expression, chromosomal copy number, and massively parallel sequencing. It has been used to identify genetic, lineage-specific, and gene expression-based predictors of drug sensitivity. This has revealed, for example, that the plasma cell lineage is correlated with sensitivity to IGF1 receptor inhibitors, aryl hydrocarbon receptor (
There are dramatic changes in the genomes of cancer cells, which vary according to cancer subtype. Integrative, wide-ranging investigations of cancer cell genomes have revealed mutations and alterations in gene expression that are associated with the disease. Databases that include abundant data on gene and protein conformation, gene expression, and genomic mutations enable the construction of dynamic cellular simulations and disease models. New sequencing tools such as next-generation sequencing will open new horizons in the prediction of disease and drug sensitivity, which play an important role in personalized medicine. Appropriate translation of this abundance of information into clinical practice is one of the most important future challenges for medicine. The quality of the genome may be one of the important factors for detecting the development of disease.
The authors are grateful to all those who helped with preparation of the manuscript. In particular, we thank Jaeseong Jo for his great assistance.