Functional Interpretation of Omics Data by Profiling Genes and Diseases Using MeSH–Controlled Vocabulary

One of the major aims of molecular biology and medical science is to understand disease mechanisms. A genetic disorder is a disease caused by abnormalities in genes and chromosomes, and researchers often report the identification of disease-relevant genes and correlations between phenotypes and genotypes (Butte & Kohane 2006; Lamb 2007; Perez-Iratxeta et al. 2002, 2005, 2007). Omics analysis using microarrays, new generation sequencing (NGS) technology, and mass spectrometry is widely employed for determining genome sequences and profiling gene expression. Omics analysis can detect changes in gene expression on a genome-wide scale and produces various types of huge datasets. These data are often archived in public databases: nucleotide sequences in the DDBJ/EMBL/GenBank International Nucleotide Sequence Database (INSD) (Cochrane et al. 2011), gene expression data in Gene Expression Omnibus (GEO) (Barrett et al. 2011), and journal articles in MEDLINE. Currently, research cannot proceed without these databases. In Japan, the Database Center for Life Science (DBCLS) has developed infrastructure that allows researchers to access and easily reuse these data by providing index sites such as the INSD and GEO yellow pages and by constructing a portal site for life science databases and tools. Researchers can easily analyze public data in conjunction with their own omics data. Here we present an analytical method to clarify the associations between genes and diseases. We characterized genes and diseases by assigning MeSH-controlled vocabulary terms (Nakazato et al. 2008, 2009). Our objective was to help interpret omics data from molecular and clinical aspects by comparing these feature profiles.


Introduction
One of the major aims of molecular biology and medical science is to understand disease mechanisms. A genetic disorder is a disease caused by abnormalities in genes and chromosomes, and researchers often report the identification of disease-relevant genes and correlations between phenotypes and genotypes (Butte & Kohane 2006; Lamb 2007; Perez-Iratxeta et al. 2002, 2005, 2007). Omics analysis using microarrays, new generation sequencing (NGS) technology, and mass spectrometry is widely employed for determining genome sequences and profiling gene expression. Omics analysis can detect changes in gene expression on a genome-wide scale and produces various types of huge datasets. These data are often archived in public databases: nucleotide sequences in the DDBJ/EMBL/GenBank International Nucleotide Sequence Database (INSD) (Cochrane et al. 2011), gene expression data in Gene Expression Omnibus (GEO) (Barrett et al. 2011), and journal articles in MEDLINE. Currently, research cannot proceed without these databases. In Japan, the Database Center for Life Science (DBCLS) has developed infrastructure that allows researchers to access and easily reuse these data by providing index sites such as the INSD and GEO yellow pages and by constructing a portal site for life science databases and tools. Researchers can easily analyze public data in conjunction with their own omics data. Here we present an analytical method to clarify the associations between genes and diseases. We characterized genes and diseases by assigning MeSH-controlled vocabulary terms (Nakazato et al. 2008, 2009). Our objective was to help interpret omics data from molecular and clinical aspects by comparing these feature profiles.
Bioinformatics Institute (EBI) (Parkinson et al. 2011). GEO is a public functional genomics data repository that accepts array- and sequence-based data. It has been developed and maintained by NCBI since 2000. GEO archives three types of data: datasets derived from research projects, samples such as the species and cell lines used, and platforms used to produce the data (i.e., chipsets and massively parallel sequencers). As of March 2011, GEO contained approximately 22,000 series of experiments, approximately 8500 platforms, and approximately 540,000 samples. GEO data are freely downloadable (http://www.ncbi.nlm.nih.gov/geo/); therefore, researchers can use the data for further analyses and compare them with their own omics data. However, it is extremely difficult to grasp the data archived in GEO because the experimental conditions referred to in each GEO entry are complicated and partially described in plain English. To ease this situation, Dr. Okubo and his colleagues at the National Institute of Genetics (Japan) developed an index site, a kind of yellow pages, called the GEO Overview (http://lifesciencedb.jp/geo/), which is maintained by DBCLS (Fig. 1). It shows a list of project titles with their platforms and data provider names. In the GEO Overview, the datasets archived in GEO are categorized and organized by taxonomy (species type) and platform (methods or instruments). Researchers can easily refine the results by clicking the tabs corresponding to the taxonomy and platform of interest. In addition, the datasets can be searched using the search box at the top of the page. Hit data are categorized by histology with a hyperlink to the original GEO entry, and the total data size is also provided. The GEO Overview should help researchers survey the abundant gene expression data available from GEO. A tutorial movie for the GEO Overview is available on the TogoTV site (http://togotv.dbcls.jp/20100816.html).

NGS and its repository sites
Microarray technology has been widely employed to detect genome-wide gene expression. More recently, NGS, also called next-generation sequencing, has been used for the same purpose. NGS is an ultra-high-throughput nucleotide sequencing technology that drastically reduces the cost and time of sequencing compared with what was previously possible (Shendure & Ji 2008). NGS technology has rapidly spread to whole-genome sequencing, metagenomics, and transcriptomics, and it is also applied to epigenetics and genome-wide association studies (GWAS) (Kahvejian et al. 2008). NGS produces a tremendous number of captured images and numerous sequence reads (Nat. Biotechnol. Editorial Board 2008), and the in-process files require huge amounts of disk space. Nevertheless, NGS data are important for researchers and should be shared in the same way as microarray data in GEO. Thus, NGS data are also archived in public databases: the Sequence Read Archive (SRA) (Leinonen et al. 2011b) at NCBI, the European Nucleotide Archive (ENA) at EBI, and the DNA Data Bank of Japan (DDBJ) Sequence Read Archive (DRA) (Kaminuma et al. 2010) at DDBJ. These databases archive raw NGS data, and their contents are collaboratively synchronized. Researchers can search and download the archived data from the DDBJ site (http://trace.ddbj.nig.ac.jp/dra/). Data downloaded from the SRA/ENA/DRA sites can be used for genome mapping, assembly, and annotation (Kaminuma et al. 2010). To make these data more searchable and usable, DBCLS has developed an index site for NGS data called the Survey of Read Archives (http://sra.dbcls.jp/), similar to the GEO Overview site. The deposited NGS data contain not only sequence reads but also experimental conditions, including project titles, species or cell lines, sample names, and sequencing platforms, as metadata. The metadata consist of six files in XML format: submission, study, experiment, run, sample, and analysis.
However, not every submission contains all of these metadata because additional experiments or runs belonging to a previous project are often performed and archived as a new submission. Therefore, we determined the connections among the corresponding metadata of each type and developed a project list as an index site. We also curated the metadata by correcting misspellings and disambiguating spelling variations. The Survey of Read Archives site provides a list of project titles and sample names with hyperlinks to the corresponding experiment and run data. It categorizes the data by study type, including whole genome sequencing, transcriptome analysis, and metagenomics. Furthermore, the archived data are divided by platform and sample taxonomy. Thus, researchers can easily narrow the list down to data with the features of interest. The Survey of Read Archives site also provides NGS statistics such as the number of projects assigned to each study type, platform, and sample taxonomy. Table 1 shows the top ten entries of these NGS statistics as of March 2011. In addition, the Survey of Read Archives site offers a list of publications that refer to NGS data. We obtained the PubMed IDs (PMIDs) cited in the SRA database as reference articles. We also extracted hyperlinks to and descriptions of SRA IDs from PubMed articles. The publication list provides article titles, journals, and project titles. Using this publication list, researchers can retrieve NGS data of sufficiently high quality for analysis. Users can narrow down the publication list by NGS study type, platform, and sample species.
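
The linking of metadata across submissions described above might be sketched as follows. This is a minimal illustration and not the pipeline used for the Survey of Read Archives: the record layout (the `submission`, `study`, and `experiment` keys), the study titles, and every accession number are hypothetical; only the accession prefixes (SRA, SRP, SRX) follow the SRA naming convention.

```python
from collections import defaultdict

# Hypothetical, simplified experiment records as they might be parsed from
# the experiment XML files. Each experiment points to its parent study, which
# may have been registered in an earlier submission.
experiments = [
    {"submission": "SRA000001", "study": "SRP000001", "experiment": "SRX000001"},
    {"submission": "SRA000001", "study": "SRP000001", "experiment": "SRX000002"},
    # A later submission that adds an experiment to the earlier study.
    {"submission": "SRA000002", "study": "SRP000001", "experiment": "SRX000003"},
    {"submission": "SRA000003", "study": "SRP000002", "experiment": "SRX000004"},
]

# Hypothetical study titles as they might be parsed from the study XML files.
study_titles = {
    "SRP000001": "Example transcriptome project",
    "SRP000002": "Example metagenome project",
}


def build_project_list(experiments, study_titles):
    """Group experiments by study, regardless of the submission they arrived in."""
    by_study = defaultdict(list)
    for record in experiments:
        by_study[record["study"]].append(record["experiment"])
    return [
        {"study": study, "title": study_titles.get(study, ""), "experiments": sorted(exps)}
        for study, exps in sorted(by_study.items())
    ]


project_list = build_project_list(experiments, study_titles)
for project in project_list:
    print(project["study"], project["title"], project["experiments"])
```

Grouping by the study accession rather than by the submission is what lets an experiment archived in a later submission appear under its original project.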

Online Mendelian Inheritance in Man (OMIM)
OMIM is one of the disease databases most widely referred to by biological researchers (Amberger et al. 2009; Hamosh et al. 2002, 2005). It contains more than 21,000 detailed entries

Previous work using OMIM data
OMIM is an excellent resource for obtaining information on genetic diseases and disease-relevant genes and for researchers attempting to understand disease features. Using Entrez Gene or Ensembl as a gene database, gene features including gene names, genomic locations, and gene ontology (GO) terms can be obtained. However, OMIM is not fully exploited for omics analysis because of its bibliographic data structure; it is written in plain English (Bajdik et al. 2005). To overcome these difficulties, previous studies attempted to extract the knowledge described in OMIM and make it easier to use for biological research, including omics analysis. Some groups focused on terms listed in the clinical synopsis (CS) section of OMIM (Cantor & Lussier 2004; Freudenberg & Propping 2002; Hishiki et al. 2004; van Driel et al. 2006). The OMIM CS section contains keywords and key phrases for the mode of inheritance, symptoms, and phenotypes such as eye color, pain sensitivity, height, and weight. Table 2 shows a partial list of terms listed in the CS section for Prader-Willi syndrome (OMIM ID: 176270) as an example. This section describes the clinical features of disorders: modes of inheritance such as autosomal dominant, body system features such as almond-shaped eyes, and endocrine features such as growth hormone deficiency. In one previous study, Freudenberg & Propping (2002) categorized each OMIM disease entry using criteria such as episodes, etiology, tissue, onset, and inheritance. They also calculated correlations between OMIM entries on the basis of profile similarities. Masseroli et al. normalized variant descriptions such as Neuro and Neurologic in the CS section and characterized OMIM disease entries. They developed a web service called GFINDer to analyze phenotypes of inherited disorders (Masseroli et al. 2005b). Cohen et al. also developed a web service called CSI-OMIM to search the OMIM CS section (Cohen et al. 2011).
Using CS terms, researchers can retrieve disease information from OMIM without using text-mining techniques. Although the OMIM full-text content includes detailed biological and genetic descriptions, the CS terms are mainly clinical and diagnostic; therefore, it is difficult to decipher disease information in conjunction with biological process data such as gene expression data. Furthermore, CS terms such as Cardiac and Cardiovascular are ambiguous because the assigned terms are often taken from the authors' original descriptions in the cited articles. We therefore used the medical subject headings (MeSH)-controlled vocabulary to characterize OMIM entries.

Feature profiling of OMIM data using MeSH keywords
Many methods such as noise reduction, hierarchical clustering (Eisen et al. 1998), and self-organizing maps (Tamayo et al. 1999) have been proposed for analyzing omics data, including microarray data. However, these methods are statistical approaches, and researchers in molecular biology and medicine often need to grasp their microarray data from a biological viewpoint. Researchers often use a controlled vocabulary, called an ontology, to annotate biological features including genes. The most popular ontology among biologists is GO (Ashburner et al. 2000). GO consists of three categories: biological process, molecular function, and cellular component. For omics analysis, GO terms are often used by assigning the corresponding terms to genes of interest (Khatri & Draghici 2005; Zeeberg et al. 2003, 2005). However, GO cannot be applied to annotate OMIM diseases because it focuses on features at the molecular level and contains no terms corresponding to specific diseases or chemical substances.
Here we introduce MeSH terms to characterize genes and diseases.

MeSH
MeSH (http://www.nlm.nih.gov/mesh/) is a controlled vocabulary that contains more than 23,000 keywords (Nelson et al. 2004). These keywords are hierarchically categorized into 15 concepts such as disease, chemicals and drugs, and anatomy. MeSH was originally curated by the National Library of Medicine (NLM) for indexing MEDLINE articles. Researchers can view the MeSH keywords assigned to each MEDLINE article in PubMed results. In a PubMed search, some queries are automatically expanded with the corresponding MeSH terms, and PubMed is searched by the converted query. PubMed also accepts MeSH keywords as an input query. MeSH has over 177,000 entry terms that assist in finding the most appropriate MeSH heading. For example, vitamin C is an entry term for ascorbic acid. In addition, approximately 200,000 further terms for chemical compounds and proteins are available as Substance Names. MeSH is freely available on the NLM site in XML and ASCII formats and is updated annually.
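
As a small illustration of how entry terms lead to a MeSH heading, the lookup below maps a query to its heading. It is a sketch under stated assumptions: the tables `ENTRY_TERMS` and `HEADINGS` and the function `to_mesh_heading` are hypothetical names, and the tables hold only the vitamin C example from the text, whereas a real table would be built from the MeSH XML or ASCII files.

```python
# Toy excerpt of a MeSH entry-term table. Only the vitamin C example comes
# from the text; the table names and layout are hypothetical.
ENTRY_TERMS = {"vitamin c": "Ascorbic Acid"}
HEADINGS = {"Ascorbic Acid"}


def to_mesh_heading(query):
    """Return the MeSH heading for a query, or None if the query is unknown."""
    normalized = query.strip().lower()
    # A query that already is a heading maps to itself (case-insensitively).
    for heading in HEADINGS:
        if heading.lower() == normalized:
            return heading
    # Otherwise fall back to the entry-term table.
    return ENTRY_TERMS.get(normalized)


print(to_mesh_heading("Vitamin C"))
```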

Feature profiling of OMIM data

Data collection
We retrieved OMIM data available as of January 2010 by downloading them from the NCBI FTP site (ftp://ftp.ncbi.nih.gov/repository/OMIM/) and by using the web service with Entrez Programming Utilities (eUtils, http://eutils.ncbi.nlm.nih.gov/). We obtained MeSH terms (2010 release) from the NLM web site (http://www.nlm.nih.gov/mesh/meshhome.html). MEDLINE article data were also obtained from NLM.
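
The text does not say which eUtils calls were used. As one possible illustration, the sketch below only builds a request URL for the ELink utility, asking for PubMed records linked to an OMIM entry; the choice of ELink, the `dbfrom=omim` and `db=pubmed` parameter values, and the helper name `elink_url` are assumptions made for this example, and no request is actually sent.

```python
from urllib.parse import urlencode

# Base address of the Entrez Programming Utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def elink_url(omim_id):
    """Build an ELink request URL for PubMed records linked to an OMIM entry."""
    params = {"dbfrom": "omim", "db": "pubmed", "id": str(omim_id)}
    return EUTILS_BASE + "elink.fcgi?" + urlencode(params)


# Alzheimer Disease, AD (OMIM ID: 104300), the example entry used in the text.
url = elink_url(104300)
print(url)
```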

Article extraction related to each OMIM entry
As previously described, MeSH was originally designed as a set of keywords for indexing MEDLINE articles and is not directly linked to OMIM entries. Thus, we developed a method to retrieve the articles related to each OMIM entry. A schematic view of the pipeline for generating OMIM-PMID associations is shown in Fig. 2. The pipeline consists of three steps. First, we retrieved the PMIDs cited in the OMIM reference section (Fig. 2 a). For example, the OMIM reference section for Alzheimer Disease, AD (OMIM ID: 104300) cited 191 articles as of March 2011. Previous studies also extracted hyperlinks to external databases to use MeSH terms for interpreting microarray data (Djebbari et al. 2005; Masys et al. 2001). Next, we retrieved the OMIM IDs listed in the Secondary Source ID section of MEDLINE articles (Fig. 2 b). We also collected OMIM ID descriptions from full-text articles by searching PubMed Central. IDs of external databases, including GenBank and GEO, that are mentioned in full-text articles are often assigned to MEDLINE articles as Secondary Source IDs. As of April 2010, 5463 OMIM IDs were assigned to MEDLINE articles as Secondary Source IDs. These Secondary Source IDs are assigned by NLM, but not all IDs are extracted. In the last step, we obtained the PMIDs of articles assigned the MeSH term corresponding to each OMIM entry (Fig. 2 c). As described above, MeSH contains disease category terms; therefore, there is often a MeSH keyword corresponding to an OMIM entry. For example, the OMIM entry for Alzheimer Disease, AD corresponds to the MeSH term Alzheimer Disease. We also obtained articles referring to human genes. OMIM contains entries describing not only genetic diseases but also disease-relevant human genes; therefore, by obtaining the articles related to each OMIM entry using these steps, we also obtained articles on human genes. To complement these, we obtained further articles on human genes using Entrez Gene as a gene database.
Using the process described above, we obtained the PMIDs of articles cited in Entrez Gene, articles mentioning an Entrez Gene ID in the abstract, and articles assigned the corresponding MeSH terms. Accordingly, we retrieved approximately 500,000 unique pairs of OMIM IDs and PMIDs and generated approximately 2,000,000 OMIM-MeSH pairs. In a previous version (Nakazato et al. 2008, 2009), we retrieved PMIDs by searching PubMed using disease names. Identifying contexts that indicate genes and diseases in articles is a major research theme, and many text-mining approaches, such as named entity recognition (NER), have been reported (Gaudan et al. 2005; Hirschman et al. 2005; Jensen et al. 2006; Shatkay 2005). One of the difficulties is that a single disease often has many names, e.g., type 2 diabetes, non-insulin dependent diabetes, and NIDDM. Another problem is that the same abbreviation may refer to several diseases, genes, or drugs; e.g., EVA refers to enlarged vestibular aqueduct (disease), epithelial V-like antigen (gene), and ethylene vinyl acetate (chemical). We attempted to overcome these problems by creating abbreviation/long-form pairs for disease names, such as PWS and Prader-Willi syndrome, and searching MEDLINE for articles in which both names co-occur. However, this text-mining approach is noisy; therefore, we did not apply this step in the current version of data creation.
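
The three-step collection of articles and the derivation of OMIM-MeSH pairs can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all variable and function names are hypothetical, the PMIDs are placeholders rather than real citations, and the MeSH assignments are toy data.

```python
# Hypothetical toy inputs; every PMID below is a placeholder.
omim_references = {"104300": {1001, 1002}}       # step (a): OMIM reference section
secondary_source_ids = {"104300": {1002, 1003}}  # step (b): Secondary Source IDs
omim_to_mesh_heading = {"104300": "Alzheimer Disease"}  # used in step (c)

# MeSH terms assigned to each MEDLINE article (toy data).
mesh_index = {
    1001: {"Alzheimer Disease", "Brain"},
    1002: {"Amyloid Beta Protein"},
    1003: {"Brain"},
    1004: {"Alzheimer Disease"},
    1005: {"Obesity"},
}


def collect_pmids(omim_id):
    """Union of the PMIDs found by steps (a), (b), and (c) for one OMIM entry."""
    pmids = set(omim_references.get(omim_id, set()))
    pmids |= secondary_source_ids.get(omim_id, set())
    heading = omim_to_mesh_heading.get(omim_id)
    if heading is not None:
        # Step (c): articles indexed with the MeSH term matching the entry.
        pmids |= {pmid for pmid, terms in mesh_index.items() if heading in terms}
    return pmids


def omim_mesh_pairs(omim_id):
    """All (OMIM ID, MeSH term) pairs reachable through the collected articles."""
    return {
        (omim_id, term)
        for pmid in collect_pmids(omim_id)
        for term in mesh_index.get(pmid, set())
    }


print(sorted(collect_pmids("104300")))
print(sorted(omim_mesh_pairs("104300")))
```

Using sets keeps the OMIM-PMID pairs unique when the same article is found by more than one step.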

Scoring associations between OMIM entries and MeSH terms
OMIM contains gene entries, which describe molecular mechanisms, and disease entries, which describe the corresponding phenotypes (Amberger et al. 2009). We calculated the scores of diseases and genes separately. The entry type is indicated by a symbol prefixed to the OMIM ID, such as #143100 (Huntington Disease; HD) and *613004 (Huntingtin; HTT). We divided the OMIM entries into three groups according to these types: sequence known (*, +), locus known (%), and phenotype (#, none). We then calculated p values as scores of OMIM-MeSH pairs in each group. The p value is the probability of observing the actual or a more extreme outcome under the null hypothesis; a lower p value indicates a more significant association. We used the R language to calculate the p values.
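
The grouping and scoring can be illustrated with the sketch below. The text does not state which statistical test was used, and the original calculation was done in R; a one-sided hypergeometric test over article counts is assumed here purely for illustration, and the function names `omim_group` and `hypergeom_p` are hypothetical.

```python
from math import comb


def omim_group(entry):
    """Assign an OMIM entry such as '#143100' or '*613004' to one of the three groups."""
    prefix = entry.strip()[0]
    if prefix in "*+":
        return "sequence known"
    if prefix == "%":
        return "locus known"
    if prefix == "#" or prefix.isdigit():
        # '#' or no prefix symbol at all.
        return "phenotype"
    raise ValueError(f"unrecognized OMIM prefix: {prefix!r}")


def hypergeom_p(total, with_term, with_entry, with_both):
    """One-sided p value under an assumed hypergeometric null model.

    total      -- number of articles in the group
    with_term  -- articles assigned the MeSH term
    with_entry -- articles linked to the OMIM entry
    with_both  -- articles that have both
    Returns the probability of observing at least `with_both` shared articles.
    """
    upper = min(with_term, with_entry)
    tail = sum(
        comb(with_term, i) * comb(total - with_term, with_entry - i)
        for i in range(with_both, upper + 1)
    )
    return tail / comb(total, with_entry)


print(omim_group("#143100"), omim_group("*613004"))
print(hypergeom_p(total=10, with_term=5, with_entry=5, with_both=5))
```

Under this assumed model, a pair whose articles overlap far more than expected by chance receives a small p value, which matches the reading that a lower score means a stronger association.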

Data visualization
To visualize the retrieved features of OMIM disease entries with their relevant MeSH terms, we developed a web-based software application called the gene disease features ontology-based overview system (Gendoo) (Nakazato et al. 2008, 2009). Gendoo accepts OMIM IDs, OMIM titles, Entrez Gene IDs, gene names, and MeSH terms as input queries. For disease names, Gendoo currently uses the titles, alternative titles, and symbols given in OMIM; therefore, not all synonyms are included in the disease name dictionary. We will increase the number of synonyms by adding the canonical names and synonyms (entry terms) of the corresponding MeSH terms and by extracting disease names from MEDLINE and OMIM resources using text mining. Gendoo generates high-scoring lists that display the relevant MeSH terms for diseases, drugs, biological phenomena, and anatomy together with their scores. These MeSH terms are sorted according to their scores. The background color of each association indicates its p value. Gendoo also provides a hierarchical tree view of the MeSH terms associated with diseases of interest using JavaScript and cascading style sheet (CSS) resources from the Yahoo! User Interface (YUI) library (http://developer.yahoo.com/yui/). Gendoo can be openly accessed at http://gendoo.dbcls.jp/. Every association file, including Entrez Gene/OMIM IDs, MeSH terms, and their scores, is available from the web site. Dictionary files including gene/disease names, synonyms, and IDs are also downloadable. These web services and files are freely available under a Creative Commons Attribution 2.1 Japan license (http://creativecommons.org/licenses/by/2.1/jp/deed.en). Table 3 shows a list of scores and MeSH terms closely associated with Alzheimer Disease, AD (OMIM ID: 104300) and Amyloid Beta A4 Precursor Protein, APP (OMIM ID: 104760) as a positive control. Alzheimer disease is a neurodegenerative disorder caused by accumulation of amyloid plaques in the brain.
Here we used three MeSH terms as keywords to describe the features of Alzheimer disease: Alzheimer Disease, Amyloid Beta Protein, and Brain. The scores (p values) between these keywords and the OMIM entry for Alzheimer Disease were small; thus, the retrieved associations appear to illustrate the features of the disease properly. The entry for Amyloid Beta A4 Precursor Protein, as an example of a gene, was also strongly associated with these keywords.

Example 3: Comparison of profiles between diseases
We applied this analysis to type 1 and type 2 diabetes mellitus (OMIM IDs 222100 and 125853, respectively). Figure 3 shows a summary of typical features and their scores for the two diseases. Each cell color on the heat map reflects the p value of the association. Only type 1 diabetes mellitus was closely related to Autoimmune Diseases and Spleen (p values of 4.55 × 10⁻⁵ and 5.53 × 10⁻⁷, respectively), whereas type 2 diabetes mellitus was associated with Obesity (p value = 1.18 × 10⁻¹⁵) and Adipocytes (p value = 5.17 × 10⁻⁵). Type 1 diabetes mellitus involves the immune system, whereas type 2 diabetes mellitus is a metabolic disorder (Rother 2007). The retrieved profiles thus reflect the biological features of the diseases. This result suggests that MeSH profiles can clarify the differences and similarities in features between OMIM entries.
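
The comparison of the two profiles can be sketched as below, using only the four p values reported in the text. The significance cutoff `alpha`, the function name `specific_terms`, and the rule that a term absent from a profile counts as not significant are assumptions made for this illustration.

```python
# p values reported in the text; terms not reported for a disease are absent.
type1 = {"Autoimmune Diseases": 4.55e-5, "Spleen": 5.53e-7}
type2 = {"Obesity": 1.18e-15, "Adipocytes": 5.17e-5}


def specific_terms(profile, other, alpha=1e-3):
    """Terms significant in `profile` but not in `other`.

    A term missing from `other` is treated as not significant there,
    which is a simplification for this sketch.
    """
    return sorted(
        term
        for term, p in profile.items()
        if p < alpha and other.get(term, 1.0) >= alpha
    )


print("type 1 only:", specific_terms(type1, type2))
print("type 2 only:", specific_terms(type2, type1))
```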

Discussion
Diverse types of life science data are available, including nucleotide sequences at the molecular level and clinical records at the individual level. Omics analysis makes it easy to detect genome-wide upregulation and downregulation of gene expression under various conditions. These raw data are analyzed using several approaches such as statistical clustering and pathway analysis. In addition, to decipher phenotype information in conjunction with molecular data, genes whose expression levels change drastically in an omics analysis are often related to diseases. However, molecular biologists often understand the mechanisms of only specific diseases. Although OMIM is an excellent knowledge bank for various diseases, it is not fully exploited for omics analysis because of its bibliographic data structure. Moreover, drug information is not linked to elements associated with the diseases and genes of interest. To alleviate this problem, we comprehensively characterized the diseases and genes described in OMIM with the MeSH-controlled vocabulary. MeSH profiles allow disease features to be shared and compared. Using GO terms, researchers can decipher their omics data from a molecular viewpoint. The feature profiles developed here additionally illustrate related diseases and drugs, so more clinical and medical information can be obtained using these MeSH profiles. Furthermore, the profiles can be applied to analyses of disease-relevant genes by comparing the similarities among profiles of OMIM entries and groups of genes such as those found in gene expression clustering results. Researchers can also obtain overviews of the features of unfamiliar diseases. Genetic disease subtype entries are available in OMIM. For example, diabetes mellitus, non-insulin dependent, 3 (NIDDM3, OMIM ID: 603694) is a genetic subtype of diabetes mellitus, non-insulin dependent (NIDDM, OMIM ID: 125853, i.e., type 2 diabetes mellitus), which is linked to chromosome 20q12-q13.1.
Another genetic type of NIDDM, NIDDM1 (OMIM ID: 601283), is reportedly linked significantly to chromosome 2q37.3. The differences in the clinical features between these two NIDDM genetic types seem to be unclear but the genetic mechanisms are probably different. Omics analysis emphasizes these genetic differences. The diabetes mellitus entry was missing in OMIM, although the entries diabetes mellitus, insulin-dependent (type 1 diabetes mellitus) and diabetes mellitus, non-insulin dependent (type 2 diabetes mellitus) were present. We plan to create a dictionary of diseases using not only OMIM but also MeSH disease category and ICD-10 terms.

Conclusion
We characterized diseases and genes by generating feature profiles for associated drugs, biological phenomena, and anatomy using MeSH keywords. We developed a web service called Gendoo to visualize the retrieved profiles. This approach illustrates disease features from not only a clinical but also a biological viewpoint. We also clarified the differences and similarities between disease features by comparing their profiles. Because the retrieved feature profiles are easy to reuse and recombine, Gendoo can accelerate the process of omics analysis.