Database Mining: Defining the Pathogenesis of Inflammatory and Immunological Diseases

Cardiovascular disease (CVD) is a leading cause of mortality in developed countries (Jan et al., 2010; Yang et al., 2008). Despite a long-held understanding and strong characterization of the traditional and non-traditional risk factors for CVD, some mechanisms of CVD onset have only recently been uncovered. Atherosclerosis is a chronic inflammatory autoimmune disease, and its progression involves both the innate and adaptive immune systems. Using new concepts and technologies to improve the current understanding of the molecular pathogenesis of inflammatory and immune responses could lead to the development of novel therapeutics for these diseases. Biomedical literature and databases, available in electronic form, contain a vast amount of knowledge resulting from experimental research (Ishii et al., 2007; Palakal et al., 2007). In the past decade, both traditional hypothesis-driven research and discovery-driven “-omics” research, including genomics, transcriptomics (Liang et al., 2005), proteomics, metabolomics, glycomics, lipidomics, localizomics, protein-DNA interactomics, protein-protein interactomics, fluxomics, phenomics (Joyce & Palsson, 2006), and antigen-omics (http://www.cancerimmunity.org/links/databases.htm) (Houle et al., 2010; Shimokawa et al., 2010; Weinstein, 1998; 2002), have generated a tremendous amount of data and established many searchable databases built on experimental data. These databases include PubMed, the nucleotide database, the protein database, and other databases generated by the National Institutes of Health (NIH)/National Center for Biotechnology Information (NCBI) (see the NCBI handbook at http://www.ncbi.nlm.nih.gov/books/NBK21101/) and other institutions. This development has not only provided resources, but also raised unprecedented challenges and opportunities for biomedical scientists to develop more systemic and panoramic approaches to analyze the data contained in the databases and generate new hypotheses.
The mismatch between the vast amount of experimental data and searchable databases on the one hand and the relatively small number of database-mining research papers on the other (< 50 papers on database mining in inflammation and immune responses listed in PubMed) indicates the challenges that experimental biomedical scientists face, which include both technical/methodological difficulties and out-of-date concepts. Traditionally, a medical literature search using the Index Medicus was the major approach for biomedical scientists to identify knowledge gaps and prepare new hypotheses. However, this approach has been significantly enhanced by more systemic approaches such as 1)


often involved in the bioinformatic algorithm generation, but may want to use database mining methods in their research either as parts of existing experimental studies or as free-standing projects. Of note, the database mining concept is not "brand new". Medical research has a long history of extracting full value from costly data. For example, a meta-analysis uses a statistical approach to combine the results of several epidemiological studies that address a set of related research hypotheses. This practice started well over 100 years ago and has been widely used in research on various diseases (http://en.wikipedia.org/wiki/Meta-analysis) (Egger & Smith, 1997; Egger et al., 1997). We believe that the practice of database mining will become a routine exercise to identify existing knowledge gaps and to generate new hypotheses.

Principles of database mining
In recent years, many databases regarding immune responses and inflammation have been established (Jan et al., 2010; Yang et al., 2006a), which have expanded the scope and depth of the publicly searchable online repertoire of tools. The results derived from database mining analyses have become parts of many research papers or free-standing papers. Although projects may vary in format, database mining approaches follow the same set of principles (Fig. 1): 1) Hypothesis: As with experimental projects, a database mining project requires a clearly presented hypothesis based on a search of the current biomedical literature in a given field and on previous experimental data from the lab, as we reported (Ng et al., 2004; Yan et al., 2004). Of note, the database mining referred to here is a free-standing project rather than a part of experimental research; 2) Scope: In terms of gene numbers, the scope of database mining is far larger than that of experimental approaches. For example, our own research examined the mRNA transcript expression of about 30 genes, including all the reported toll-like receptors, NOD-like receptors, and inflammatory caspases, in more than ten tissues. This scope allowed us to obtain a panoramic view of the expression of inflammatory pathways across many tissues rather than focusing on a single gene (Yin et al., 2009); 3) Suitable databases: Databases suitable for examining the hypothesis must be available for online analytic search, just as methods and reagents must be available for experimental projects; 4) Sizable experimentally verified data for generating confidence intervals with statistical significance: To consolidate the results generated from database mining, experimentally verified data published by various laboratories can be used to generate statistically significant confidence intervals with the same online analysis tools, as we reported (Virtue, 2011). In this study, our 
analysis in TargetScan yielded 524 microRNAs, which were predicted to participate in 1368 unique interactions with the 33 inflammatory gene mRNAs. To ensure relevance, we examined the context value and context percentage of experimentally verified microRNAs. Confidence intervals were generated from 45 interactions between 28 experimentally verified human microRNAs and 36 genes found within Tarbase, an online database of experimentally verified microRNAs (http://diana.cslab.ece.ntua.gr/tarbase/) (Papadopoulos et al., 2009; Sethupathy et al., 2006). These experimental interactions were also selected based on their confirmation by luciferase reporter assays and single-site specificity. The 45 microRNA-mRNA interactions that met these criteria were then evaluated in TargetScan to determine the microRNA context values and percentages. Analysis of these data yielded a mean ± standard deviation (SD) of -0.25 ± 0.12 for context value and 76.07 ± 19.07 for context percentage. The 95% confidence intervals were then constructed and the least stringent limits (the mean offset by 1.96 standard errors) were calculated. For context percentage, where higher values are more favorable, this is the lower limit: 76.07 - 1.96(19.07/SQRT(46)) = 76.07 - 5.51 = 70.56. For context value, where more negative values are more favorable, this is the upper limit: -0.25 + 1.96(0.12/SQRT(46)) = -0.25 + 0.03 = -0.22. All predicted microRNA interactions with a context value ≤ -0.22 and a context percentage ≥ 70 were accepted. Using these thresholds for context value and percentage, 297 out of the 524 predicted microRNAs met the criteria and were considered equivalent to the experimentally verified microRNAs. In order to generate valid confidence intervals, sample sizes have to be estimated with statistical tools of sample size determination (Rosner, 2000), as we reported (Ng et al., 2004); 5) Verifiable methods: Experimental methods are available to verify the data generated by the database mining (Yan et al., 2004); and 6) A new working model/hypothesis: Through database mining, a new knowledge gap will be identified, and a new hypothesis will be proposed to 
test fewer, much more focused genes in further experiments. The following sections will illustrate these principles in our own publications (Chen et al., 2010; Jan et al., 2010; Ng et al., 2004; Virtue, 2011; Yang et al., 2006a; Yang et al., 2006b; Yin et al., 2009).

In our invited review, we pointed out that the identification and molecular characterization of self-antigens that are expressed by human malignancies and are capable of eliciting anti-tumor immune responses in patients has been an active field in tumor immunology (Yang & Yang, 2005). More than 2,000 tumor antigens have been identified, and most of these antigens are self-antigens (Yang & Yang, 2005). Despite this, the important question of how non-mutated self-protein antigens, generated from normal cells and tumor cells, gain immunogenicity and trigger immune recognition remained unanswered (Yang & Yang, 2005). Mutations may be responsible for some aspects of the elevated immunogenicity of certain tumor-specific antigens (p53 and Ras), while chromosome translocations and abnormalities, such as expression of the fusion oncogene Bcr-Abl in chronic myelogenous leukemia (Clark et al., 2001; Pinilla-Ibarz et al., 2000; Yotnda et al., 1998; Zorn, 2001; Yang et al., 2002; Yang et al., 2001), are responsible for other aspects. However, the mechanism underlying the immunogenicity of most non-mutated self-tumor antigens is their aberrant overexpression in tumors (Yang & Yang, 2005). Zinkernagel and Hengartner (2001) suggested that the overexpression of self-antigens or novel antigenic structures overcomes the threshold of antigen concentration at which an immune response is initiated (Shlomchik et al., 2001). This threshold might be lower for certain untolerized regions of certain antigen epitopes. Genes encoding tumor antigens are often overexpressed by up to 100-fold. These genes are identified by serological identification of self-antigens by screening a cDNA library with patients' sera (SEREX) (Sahin et al., 
1995), which may reflect the inherent methodological bias toward the detection of abundant transcripts (Preuss et al., 2002). The overexpression of tumor antigens in tumors may result from transcriptional and post-transcriptional mechanisms. We recently demonstrated that overexpression of the tumor antigen CML66L in leukemia cells and tumor cells via alternative splicing is the mechanism for its immunogenicity in patients with tumors (Yan et al., 2004; Yang et al., 2001). This not only illustrates the principle of tumor antigen overexpression, but also elucidates alternative splicing as its molecular mechanism (Yan et al., 2004). A significant proportion of the SEREX-defined self-tumor antigens are autoantigens (Chen, 2004); for example, CML28, which we identified, is the autoantigen Rrp46p (Yang et al., 2002). Using this information gathered from SEREX, we hypothesized that alternative splicing is a general mechanism for the overexpression of untolerized self-antigen epitopes in tumors and autoimmune diseases. In order to test this hypothesis, we mined the NIH-NCBI AceView database to examine the potential mechanisms by which non-mutated self-proteins gain new untolerized structures that trigger immune recognition (Ng et al., 2004). The AceView database provides a curated, comprehensive, and non-redundant sequence representation of all public mRNA sequences (mRNAs from GenBank or RefSeq, and single-pass cDNA sequences from dbEST and Trace). These experimental cDNA sequences are first  (Lewin, 2000). Our results demonstrated that 80% of the autoantigen transcripts undergo non-canonical alternative splicing, which is significantly higher than the less than 1% rate in randomly selected gene transcripts (p<0.001). These studies suggest that non-canonical alternative splicing may be an important mechanism for the generation of untolerized epitopes that may lead to autoimmunity. Furthermore, the product of a transcript that does not undergo alternative splicing is unlikely to be a 
target antigen in autoimmunity (Ng et al., 2004). To consolidate this finding, we also examined the effect of the proinflammatory cytokine tumor necrosis factor-α (TNF-α) on the prototypic alternative splicing factor (ASF)/SF2 in the splicing machinery. Our results show that TNF-α downregulates ASF/SF2 expression in cultured muscle cells. This result correlates with our finding of reduced expression of ASF/SF2 in inflamed muscle cells from patients with autoimmune myositis (Xiong et al., 2006). Based on our and others' data, we recently proposed a new model of stimulation-responsive splicing for the selection of autoantigens and self-tumor antigens (Yang et al., 2006a) [also see Fig. 1 at (http://preview.ncbi.nlm.nih.gov/pubmed/16890493)]. Our new model theorizes that the significantly higher rates of alternative splicing of autoantigen and self-tumor antigen transcripts that occur in response to stimuli, such as proinflammatory cytokines, could induce extra-thymic expression of untolerized antigen epitopes to elicit autoimmune and anti-tumor responses. By using the B lymphocyte (B cell) antigen epitope analysis databases and T cell antigen epitope analysis databases listed in the tables of our recent invited review (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2858284/pdf/JBB2010-459798.pdf) (Jan et al., 2010), we showed that protein sequences encoded by alternatively spliced exons are sufficient to provide antibody-binding antigen epitopes and major histocompatibility complex (MHC) class I- and MHC class II-restricted T cell antigen epitopes to stimulate B lymphocytes and T lymphocytes, respectively (Ng et al., 2004). Of note, our model applies not only to non-mutated self-tumor antigens associated with tumors and autoantigens associated with various autoimmune diseases, but also to the composition and expansion of the self-antigen repertoire of stem cells. Our additional database mining study has generated a new model of differential epitope processing for MHC class I-restricted viral antigen epitopes 
and tumor antigen epitopes (Yang et al., 2006b). Our reports have demonstrated the principles of database mining in adaptive immune responses.
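
The threshold calculation described under principle 4 above can be illustrated with a short Python sketch. It is a minimal illustration and not the authors' analysis code: the helper `ci_half_width`, the function `passes`, and the three example interactions are hypothetical. The sample size of 46 follows the printed formula, even though the text describes 45 verified interactions, and the sketch assumes that more negative context values and higher context percentages are the more favorable directions.

```python
import math


def ci_half_width(sd, n, z=1.96):
    """Half-width of a normal-approximation confidence interval for a mean."""
    return z * sd / math.sqrt(n)


# Summary statistics as printed in the text. N = 46 follows the printed
# formula, although the text describes 45 verified interactions.
N = 46
PCT_MEAN, PCT_SD = 76.07, 19.07
VAL_MEAN, VAL_SD = -0.25, 0.12

# Context percentage: higher is more favorable, so use the lower limit.
pct_cutoff = PCT_MEAN - ci_half_width(PCT_SD, N)
# Context value: more negative is more favorable, so use the upper limit.
val_cutoff = VAL_MEAN + ci_half_width(VAL_SD, N)


def passes(context_value, context_pct):
    """True if a predicted interaction meets both cutoffs."""
    return context_value <= val_cutoff and context_pct >= pct_cutoff


# Hypothetical predicted interactions: (microRNA, gene, value, percentage).
predicted = [
    ("miR-x1", "GENE_A", -0.31, 88.0),
    ("miR-x2", "GENE_B", -0.12, 91.0),
    ("miR-x3", "GENE_C", -0.27, 55.0),
]
accepted = [row for row in predicted if passes(row[2], row[3])]

print(round(pct_cutoff, 2), round(val_cutoff, 2))
print([row[0] for row in accepted])
```

Requiring both cutoffs at once is the conservative choice: a predicted interaction is kept only if it is at least as favorable as the interval bound on each measure.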

Database mining example 2: Three-tier model for inflammasome/caspase-1 activation and inflammation privilege of tissues are important mechanisms underlying the differences in the readiness of inflammation initiation in tissues
Atherosclerosis is the leading cause of morbidity and mortality in industrialized society.
Several "traditional" risk factors have been identified for atherosclerosis, including hyperlipidemia, oxidized low density lipoprotein, cigarette smoking, diabetes, hypertension, obesity (Ross, 1992), and hyperhomocysteinemia (HHcy). Chronic vascular inflammation is an essential requirement for the progression of atherosclerosis in patients (Hansson, 2005). Recent progress in characterizing the receptor families for pathogen-associated molecular patterns (PAMPs), termed PAMP-Rs, and inflammasomes (the protein complexes that activate caspase-1) has further emphasized the importance of signaling by the proinflammatory cytokine interleukin-1β (IL-1β) in bridging proatherogenic risk factors to the initiation of inflammation (Yang et al., 2008). However, the constitutive expression levels and expression readiness of PAMP-Rs, inflammasome components, and proinflammatory caspases in tissues remained poorly defined. We hypothesized that PAMP-Rs, inflammasome components, proinflammatory caspases, IL-1β, and IL-18 are differentially expressed in cardiovascular tissues. To examine this hypothesis, we mined the NCBI-UniGene database, analyzed cDNA cloning and DNA sequencing data from tissue cDNA libraries, and studied the expression profiles of Toll-like receptors (TLRs), cytosolic nucleotide binding and oligomerization domain (NOD)-like receptors (NLRs), inflammasome components, inflammatory caspases, and caspase-1 cleavable inflammatory cytokines. The UniGene database provides an organized view of the transcriptome with information on protein similarities, gene expression, cDNA clone reagents, and genomic location (http://www.ncbi.nlm.nih.gov/unigene), in which each UniGene entry is a set of transcript sequences that appear to come from the same transcription locus (gene or expressed pseudogene). After analyzing the data from the UniGene database, we made several important findings: (1) Among 11 tissues examined, vascular tissues and heart express fewer types of TLRs and NLRs than immune system tissues including 
blood, lymph nodes, thymus, and trachea; (2) Brain, lymph nodes, and thymus do not express the proinflammatory cytokines IL-1β and IL-18 constitutively, suggesting that these two cytokines need to be upregulated in response to inflammatory stimuli in these tissues; and (3) based on the expression data of three characterized inflammasomes (the NALP1, NALP3, and IPAF inflammasomes), the examined tissues can be classified into three tiers: the first-tier tissues, including brain, placenta, blood, and thymus, express inflammasome(s) constitutively; the second-tier tissues have inflammasome(s) in a nearly ready expression status (requiring upregulation of one component); and the third-tier tissues, such as heart and bone marrow, require upregulation of at least two components in order to assemble functional inflammasomes. Based on the expression readiness of inflammasomes in tissues, we propose a new working model of three-tier responsive expression of inflammasomes in tissues and suggest a new concept of the inflammatory privilege of third-tier tissues, which provides insight into the differences among tissues in initiating acute inflammation. This model suggests that (a) first-tier tissues with constitutively expressed inflammasomes initiate inflammation more quickly than second-tier and third-tier tissues; and (b) second-tier tissues (requiring upregulation of one component), including vascular tissue, and third-tier tissues (requiring upregulation of more than one component), including heart, are in an inducible expression status of inflammasomes. The inducible expression of inflammasomes is presumably mediated through various signaling pathways that initiate inflammation, and the interplay between these pathways may take longer and must overcome a higher threshold than in first-tier tissues. Traditional concepts of immune privilege suggest a protective mechanism against autoimmune destruction based on the lack of expression of antigen-presenting self-major histocompatibility complex (MHC) 
molecules in tissues (Yang & Yang, 2005). The lack of expression of self-MHCs in immune-privileged tissues, including testis, results in the failure of the self-antigen presentation that stimulates the host's immune system, thereby protecting immune-privileged tissues from autoimmune destruction. Similarly, we proposed a new concept of the inflammatory privilege of tissues that emphasizes a protective mechanism against tissue destruction mediated by inflammasome/IL-1β-based innate immune responses. In our new concept of inflammatory privilege, vascular tissue and heart express disproportionately fewer types of TLRs and NLRs and may only inducibly express inflammasomes, thus protecting against uncontrolled inflammatory destruction mediated by inflammasome-based innate immune responses (Streilein & Stein-Streilein, 2000). Our new concept and model may also explain the potential differences between cardiovascular tissues and other tissues in initiating acute inflammation. The first-tier tissues may have a higher probability of experiencing acute inflammation than the second-tier and third-tier tissues. We and others showed that an elevated level of plasma homocysteine (Hcy), termed hyperhomocysteinemia (HHcy), is an independent risk factor, equivalent to hyperlipidemia, for cardiovascular diseases (CVD) including coronary heart disease and stroke (Maron & Loscalzo, 2009; Wang et al., 2003; Zhang et al., 2009). Recently, we performed an additional database mining study to examine the expression of more than 20 homocysteine metabolic enzymes and methylation enzymes in >20 tissues in humans and mice (Chen et al., 2010). We generated a new model of how hypomethylation (a post-translational protein modification) modulates the expression of homocysteine-metabolizing enzymes (Chen et al., 2010). Taken together, our studies have demonstrated the principles of database mining in innate immune reactions.
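
The three-tier rule described above (no component, one component, or at least two components requiring upregulation) can be sketched as a small classifier. This is an illustrative sketch only: the component lists in `INFLAMMASOMES` are simplified assumptions for demonstration rather than the gene sets used in the study, the function name `tier` is hypothetical, and the example inputs are invented rather than taken from UniGene.

```python
# Assumed, simplified component lists for the three inflammasomes named in
# the text. They are illustrative and not taken from the study.
INFLAMMASOMES = {
    "NALP1": {"NALP1", "ASC", "CASP1", "CASP5"},
    "NALP3": {"NALP3", "ASC", "CARDINAL", "CASP1"},
    "IPAF": {"IPAF", "CASP1"},
}


def tier(expressed):
    """Classify a tissue from its set of constitutively expressed components.

    The tissue is scored by the inflammasome that is closest to complete:
    tier 1 if nothing is missing, tier 2 if exactly one component is
    missing, tier 3 if two or more components are missing.
    """
    fewest_missing = min(
        len(components - expressed) for components in INFLAMMASOMES.values()
    )
    if fewest_missing == 0:
        return 1
    if fewest_missing == 1:
        return 2
    return 3


# Hypothetical expression profiles, for illustration only.
print(tier({"IPAF", "CASP1"}))
print(tier({"NALP3", "ASC", "CASP1"}))
print(tier({"ASC"}))
```

Scoring a tissue by its most complete inflammasome reflects the wording "express inflammasome(s)": a single fully expressed complex is enough for first-tier status.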

Database mining example 3: A group of anti-inflammatory microRNAs may play critical roles in inhibiting the expression of proatherogenic molecules
Previous research has established that numerous genes are upregulated in atherogenesis through epigenetic or genetic transcriptional mechanisms (Turunen et al., 2009). However, transcription-independent mechanisms have received far less scrutiny. Recent publications suggest that microRNAs, a newly characterized class of short (18-24 nucleotides long), endogenous, non-coding RNAs (Bartel, 2009), contribute to the development of particular disease states by regulating diverse biological processes such as cell growth, differentiation, proliferation, and apoptosis (Zhang, 2008). This biological control is accomplished by post-transcriptional gene silencing (Naeem et al., 2010) through Watson-Crick base pairing, predominantly at the 3'-untranslated region (3'UTR) of messenger RNAs (mRNAs) (Cordes et al., 2009; Rasmussen et al., 2010). This pairing can be further characterized as "perfect" or "near perfect", leading to target mRNA cleavage and degradation, or "imperfect", causing the inhibition of mRNA translation (Naeem et al., 2010). With the identification and sequencing of more than 800 human microRNAs thus far, it is thought that up to 30% of human genes may be regulated by microRNAs (Cheng et al., 2010; Zhang, 2008). Supporting evidence suggests that microRNAs function as key players during critical stages of cellular development and finely tune gene expression in the maintenance of routine cellular functioning (Baek et al., 2008). Furthermore, microRNAs can act on transcription factors, which leads to broad indirect cellular effects because transcription factors themselves modulate many genes. In addition, recent research has demonstrated that changes in microRNA expression patterns are connected to several pathological conditions, including cardiovascular disease and atherosclerosis. These studies primarily focused on characterizing microRNAs that had previously been reported to have elevated expression in atherosclerosis disease models (Haver et al., 
2010; Rink & Khanna, 2010). Thus, current microRNA research has failed to provide a panoramic view of how microRNAs regulate proatherogenic inflammatory genes and of whether upregulation of proatherogenic inflammatory genes is the result of anti-inflammatory microRNA downregulation. To address these issues, we hypothesized that a group of anti-inflammatory microRNAs may regulate the expression of proatherogenic molecules (Virtue, 2011). We then developed a novel database mining approach using three types of databases: the online microRNA target prediction software TargetScan (http://www.targetscan.org/) (Dong et al., 2010; Rosero et al., 2010; Vickers & Remaley, 2010); Tarbase, an online database of experimentally verified microRNAs (http://diana.cslab.ece.ntua.gr/tarbase/) (Papadopoulos et al., 2009; Sethupathy et al., 2006); and the online microRNA.org expression database (http://www.microrna.org/microrna/home.do) (Betel et al., 2008), in concert with a statistical analysis strategy established in our previous database mining publications (Chen et al., 2010; Ng et al., 2004; Shen et al., 2010; Yang et al., 2006b; Yin et al., 2009). Our research using database mining yielded several key findings. First, we found that the expression of 33 inflammatory genes (mRNAs) is upregulated in atherosclerotic lesions. Second, the mRNAs of those genes contain structural features in their 3'UTRs for potential regulation by microRNAs, and these structural features are statistically indistinguishable from experimentally verified 3'UTR microRNA binding sites. Third, 21 out of the 33 inflammatory genes (64%) are targeted by highly expressed microRNAs, while the remaining 12 inflammatory genes (36%) are targeted by normally expressed microRNAs. Fourth, 10 of the 21 inflammatory genes targeted by highly expressed microRNAs (48%) were targeted by a single microRNA, suggesting the specificity of microRNA regulation. Meanwhile, 12 out 
of the 25 highly expressed microRNAs (48%) targeted single inflammatory genes, while the other 13 microRNAs targeted multiple inflammatory genes. Finally, it was determined that the microRNAs targeting atherosclerotic inflammatory genes use statistically higher numbers of "poorly conserved" binding interactions than the control group of microRNAs from the confidence interval. These results suggest that the microRNAs regulating atherosclerotic inflammatory genes possess special features (Virtue, 2011). Previous research has shown that microRNAs participate in modulating atherosclerosis-related processes including hyperlipidemia (microRNA-33, microRNA-125a-5p), hypertension (microRNA-155), plaque rupture (microRNA-222, microRNA-210), and atherosclerosis itself (microRNA-21, microRNA-126) (Rink & Khanna, 2010). However, whether certain microRNAs play a role in preventing disease development remains unknown. One of the most interesting findings from our study is that the 25 microRNAs that are highly expressed under normal untreated conditions target 21 out of the 33 (64%) atherosclerosis-upregulated inflammatory genes. This important result suggests a novel mechanism in which a group of highly expressed anti-inflammatory microRNAs suppresses the upregulation of proatherogenic inflammatory genes under normal physiological conditions. It has been well established that microRNAs play important roles in fine-tuning developmental processes and participate in the development of diseases such as cancer. Our results are the first to suggest that microRNAs may play a protective role by suppressing proatherogenic genes to maintain healthy arteries. Our conclusion is supported by other publications, which show that 7 out of the 20 microRNAs identified in this study were downregulated in experimental studies by various proatherogenic factors (Chen et al., 2009; Elia et al., 2009; Ji et al., 2007). Together, our studies have demonstrated the principles of database mining in inflammation.
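
The counts reported above (genes targeted by a single microRNA, and microRNAs targeting a single gene) amount to a summary of a two-way mapping between microRNAs and genes. The following Python sketch shows one way such a summary could be computed. The function `summarize_targets` and the example pairs are hypothetical and do not reproduce the study's data.

```python
from collections import defaultdict


def summarize_targets(pairs):
    """Summarize (microRNA, gene) pairs from a filtered prediction list.

    Returns the genes targeted by exactly one microRNA and the microRNAs
    that target exactly one gene, each as a sorted list.
    """
    mirnas_by_gene = defaultdict(set)
    genes_by_mirna = defaultdict(set)
    for mirna, gene in pairs:
        mirnas_by_gene[gene].add(mirna)
        genes_by_mirna[mirna].add(gene)
    single_hit_genes = sorted(
        gene for gene, mirnas in mirnas_by_gene.items() if len(mirnas) == 1
    )
    single_target_mirnas = sorted(
        mirna for mirna, genes in genes_by_mirna.items() if len(genes) == 1
    )
    return single_hit_genes, single_target_mirnas


# Hypothetical accepted interactions, for illustration only.
pairs = [
    ("miR-a", "GENE1"),
    ("miR-a", "GENE2"),
    ("miR-b", "GENE2"),
    ("miR-c", "GENE3"),
]
single_hit_genes, single_target_mirnas = summarize_targets(pairs)
print(single_hit_genes)
print(single_target_mirnas)
```

Using sets for both directions means that duplicate predictions of the same microRNA-gene pair are counted once.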

Conclusion
Active research on human and mouse genomes, transcriptomes, microRNA transcriptomes, proteomes, and antigen-omes in the past decade has generated a tremendous amount of data and established many searchable databases built on experimental data. This provides unprecedented opportunities for biomedical scientists to develop more systemic and panoramic approaches to analyze the databases and generate new hypotheses. In this chapter, we briefly summarize our efforts in using our new database mining methods to address important questions in inflammatory and immunological diseases. The principles and basic methodologies of database mining developed in our laboratories are illustrated by the following studies: 1) a stimulation-responsive alternative splicing model for the generation of untolerized autoantigen epitopes; 2) a three-tier model for inflammasome/caspase-1 activation and the inflammatory privilege of tissues; and 3) a group of anti-inflammatory microRNAs that inhibit proatherogenic gene expression during atherogenesis. With recent technological breakthroughs, database mining has provided significant new insights and hypotheses that point to novel directions for experimental research.

Database mining example 1: Stimulation-responsive alternative splicing is an important mechanism in generating self-antigen epitopes (Ng et al., 2004; Xiong et al., 2006; Yan et al., 2004; Yang et al., 2006a; Yang, 2007)
on the genome, and then clustered into a minimal number of alternative transcript variants and grouped into genes (http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly/). Our results showed that alternative splicing occurs in 100% of autoantigen transcripts. This is significantly higher than the approximately 42% rate of alternative splicing observed in the 9554 randomly selected human gene transcripts (p<0.001). Within the isoform-specific regions of the autoantigens, 92% and 88% encoded MHC class I- and class II-restricted T-cell antigen epitopes, respectively, and 70% encoded antibody-binding domains. Alternative splicing can be canonical or non-canonical. Canonical splicing removes introns that have 5'GT and 3'AG consensus flanking sequences (GT-AG rule)
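
To show how a comparison such as 100% versus 42% can reach p<0.001, the sketch below runs a one-sided exact binomial test using only the Python standard library. It is a simplified illustration and not the test used in the study: it treats the 42% background rate as a fixed null proportion, and the number of autoantigen transcripts (`N_AUTOANTIGENS = 40`) is a hypothetical value because the sample size is not stated in this passage.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


BACKGROUND_RATE = 0.42  # rate reported for the 9554 random transcripts
N_AUTOANTIGENS = 40     # hypothetical sample size, not from the text
N_SPLICED = 40          # 100% alternatively spliced, as reported

# Probability of seeing all transcripts spliced if the true rate were 42%.
p_value = binom_upper_tail(N_SPLICED, N_AUTOANTIGENS, BACKGROUND_RATE)
print(p_value < 0.001)
```

When every transcript in the sample is alternatively spliced, the tail probability reduces to the background rate raised to the power of the sample size, so even a modest sample falls below the 0.001 threshold.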