Online Metabolomics Databases and Pipelines

Metabolomics is a rapidly emerging field in life sciences, which aims to identify and quantify metabolites in a biological system. Analytical chemistry is combined with sophisticated informatics and statistics tools to determine and understand metabolic changes upon genetic or environmental perturbations. Together with other 'omics analyses, such as genomics and proteomics, metabolomics plays an important role in functional genomics and systems biology studies in any biological science. This book will provide the reader with summaries of the state-of-the-art of technologies and methodologies, especially in the data analysis and interpretation approaches, as well as give insights into exciting applications of metabolomics in human health studies, safety assessments, and plant and microbial research.


Introduction
As metabolomics becomes an increasingly major component of modern biological research, steps must be taken to preserve and make maximal use of the ever increasing torrents of new data entering the public domain. While this task is by no means unique to the field of metabolomics, the complexity, heterogeneity and large sizes of metabolomics datasets make the development of effective metabolomics bioinformatics tools particularly challenging. Despite these challenges, metabolomics specialists have recently been making rapid progress in this area. A wide range of powerful web-based tools designed to facilitate the systematic online storage, processing, dissemination and biological interpretation of technically and biologically diverse metabolomics datasets have now emerged and are rapidly becoming cornerstones of advancement in biological science.
Web-based tools for metabolomics perform a wide variety of functions. These can be divided into several broad categories, including:
1. Storage and dissemination of technical, biological, and physicochemical reference data for metabolites
2. Processing of raw instrument data to generate [metabolite x sample] data matrices suitable for statistical and multivariate data analysis
3. Database storage and querying of pre-processed relative and/or absolute metabolite level data
4. Statistical and multivariate analysis of pre-processed relative and/or absolute metabolite level data
5. Aiding biological interpretation of metabolomics results by integration of biological knowledge such as known biomarkers or metabolic pathway information.
While some tools are broader in scope than others and some tools can essentially fully service the data-processing requirements of certain metabolomics approaches, it is important to note that no single tool is currently capable of fulfilling every requirement of every metabolomics researcher. This chapter will review the current state of development in the area of web-based informatics tools for metabolomics and explain how currently available tools can be used to accelerate scientific discovery. It will then attempt to predict future developments in the area of metabolomics web-tool development and advise new metabolomics researchers on strategies to maximise their own benefit from these developments.

Information about metabolites: Biological cheminformatics

Background
One of the fundamental questions of metabolomics is "how many metabolites occur in nature, what are their structures, what are their physical, chemical and biological properties and how are they distributed amongst species?". Large-scale efforts to build comprehensive databases of metabolite-related knowledge are beginning to provide at least approximate answers to these questions. Defining "the metabolome" of an organism in a qualitative sense, by building well-annotated catalogues of metabolites and their properties, is analogous to sequencing and mapping the genome of an organism. That is, it provides a crucial foundation for the development of analytical approaches and experiments, aids in the interpretation of analytical results and provides an important scaffold upon which to attach new information as it becomes available.
Because metabolites are small molecule chemicals of biological origin, organisation of metabolite information lies at the interface between bioinformatics (the management of biological information) and cheminformatics (the management of chemical information). While metabolomics researchers will find useful information about metabolites in broad-scoped, general cheminformatic databases, a new generation of biology-focused cheminformatic databases is making it easier for biologists to find cheminformatic data specifically related to biology. This section will guide the reader towards online sources of metabolite information and explain how these information sources can be used to aid metabolomics research.

Molecular semantics: The metabolite naming issue
One of the challenges associated with finding online information about metabolites can be figuring out what text to enter into search engines. Metabolites can be named in many different ways in many different places online and searching with one name will generally only retrieve resources tagged with that particular name. Moreover, in cutting-edge metabolomics research, it is frequently the case that one is searching for information about a poorly-known or even completely hypothetical metabolite for which one has a structure in mind but for which its common name, if indeed it has one, is unknown. Fortunately, there are ways around these problems, thanks to the thoughtful design of cheminformatic databases. These will be explained below.
For well-known metabolites, finding detailed information is particularly easy. Most metabolite information databases annotate each metabolite entry with a large set of 'synonyms': a range of different names commonly used to refer to a given metabolite. As a result, if one uses a common name to search those databases for information about a well-known metabolite, one will usually find the information one needs. In those cases, the key thing is to know which databases to search (these will be outlined shortly).
However, metabolomics researchers are often interested in discovering new metabolites or learning what little is known about poorly-known metabolites. Often, a researcher may know the structure of a theoretical metabolite but have no idea whether it has been observed in nature before, let alone what its common name might be. Indeed, such 'theoretical' metabolites often have been observed in nature before and have a common name, but finding this out can be challenging if one does not know where to start. This is where InChI codes and comprehensive InChI-enabled cheminformatic databases become indispensable (Wohlgemuth et al., 2010).
"InChI" is an abbreviation for "International Chemical Identifier", a system of expressing chemical structures as compact strings of text suitable for efficiently and unambiguously conveying chemical structures across text-based systems such as web search engines. The InChI system was developed by the International Union of Pure and Applied Chemistry (IUPAC) and the National Institute of Standards and Technology (NIST). Each unique chemical structure can be converted into its own unique InChI code and vice versa (see footnote 1). There are a range of freely-available software tools that allow one to draw a chemical structure and obtain its InChI code or enter an InChI code and have its structure drawn automatically (see Table 1 for examples). All the major metabolite information databases tag their entries with InChI codes, so if one is uncertain of the name of a target metabolite, the best approach is to generate its InChI code and search with that. Some cheminformatic databases provide web-based structure drawing tools allowing users to effectively generate an InChI code and search with it in a single step. One of the advantages of using an unambiguous structural identifier such as InChI to search a database is that if no hits are obtained, one may fairly safely conclude that the target molecule was not in the database (see footnote 2). When a hit is obtained, however, the returned information may include common name(s) for the molecule that can aid in subsequent literature searches. For anyone building metabolite databases or supplying supplementary tables of metabolite data for publication, annotation of these data with InChI codes is highly recommended (Wohlgemuth et al., 2010). Online tools for generating InChI codes from structures or other identifiers are listed in Table 1.
A particularly useful tool for metabolomics researchers is the Chemical Translation Service provided by the lab of Oliver Fiehn (Wohlgemuth et al., 2010) since this tool is capable of batch translations of miscellaneous metabolite identifiers and synonyms to standard InChI codes and other common identifiers.
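Because Standard InChI strings always begin with "InChI=1S/" while older non-standard strings begin with "InChI=1/", it can be worth checking which type one holds before searching a database. The following is a minimal sketch of such a check. The function name `classify_inchi` is hypothetical, and the test looks only at the prefix; it does not validate the remaining layers of the string.

```python
def classify_inchi(identifier: str) -> str:
    """Classify an identifier by its InChI prefix only (no full validation)."""
    text = identifier.strip()
    if text.startswith("InChI=1S/"):
        return "standard"
    if text.startswith("InChI=1/"):
        return "non-standard"
    return "not-inchi"


examples = [
    "InChI=1S/CH4O/c1-2/h2H,1H3",  # methanol written as a Standard InChI
    "InChI=1/CH4O/c1-2/h2H,1H3",   # same layers with the non-standard prefix
    "methanol",                     # a common name, not an InChI at all
]

for item in examples:
    print(f"{item} -> {classify_inchi(item)}")
```

A check of this kind could be run over a list of search keys so that non-standard strings are converted, or at least flagged, before they are compared with database entries annotated with Standard InChI.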

Chemical ontologies: Organising metabolites into useful categories
In scientific communication, biologists frequently refer to broad 'classes' of metabolites using terms related to their functional groups (eg. 'alcohols'), their chemical properties (eg. 'organic acids') or biological roles (eg. 'hormones'). Moreover, researchers are often interested in obtaining lists of metabolites that fall in a particular class. For example, a researcher may want to identify metabolites in an organism that contain a particular functional group and will therefore be expected to undergo certain chemical reactions. Potential classes range in scope from very broad (eg. 'organic') to moderately specific (eg. 'alkaloids') to even more specific (eg. 'monoterpenoid indole alkaloids') and so on. While "metabolite classes" like these appear throughout the scientific literature, formalising them into accurately and systematically defined 'chemical ontologies' that can be used in practically useful ways is a non-trivial task. Despite this, a number of metabolite-related databases have begun developing and/or employing hierarchical systems of compound classification, allowing users to browse lists of metabolites via classification trees (ontologies). Examples of databases employing compound ontologies or hierarchical compound taxonomies for annotation of metabolite information include PubChem, ChEBI (Degtyarenko et al., 2008), the BioCyc family of metabolic pathway databases (Caspi et al., 2010), the Human Metabolome Database (HMDB) (Wishart et al., 2007) and MetabolomeExpress (Carroll et al., 2010). The ChEBI compound ontology is by far the most advanced and comprehensive ontology for biological small molecules and is downloadable in open formats from the ChEBI website. Its adoption is recommended in the development of new metabolomics databases.

Footnote 1: There is one caveat to the statement that each unique chemical structure can be converted into its own unique InChI code and vice versa. The only truly non-ambiguous InChI codes are called "Standard" InChI (often abbreviated to "StdInChI"); these always begin with the string "InChI=1S/". If building a metabolomics database, it is advisable to use only standard InChI codes.
Footnote 2: Some metabolite databases were built prior to the release of Standard InChI and have been annotated using non-standard InChI codes (always beginning with "InChI=1/"). It is always a good idea to check which InChI type a database uses before searching it with an InChI code.

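[Table 1: Online tools for generating InChI codes from structures or other identifiers. Columns: Tool (URL); Features. Surviving row label: ChEBI.]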

Physicochemical information
Physicochemical information about metabolites includes information about their physical and chemical properties such as their structures, molecular formulas, molecular weights, melting and boiling points, solubilities in different solvents at different temperatures, polarities, pKa, light absorbance and fluorescence properties, energy contents, refractive indices and other similar types of basic empirical information. This kind of information can be extremely useful when designing extraction, sample clean-up or analyte-enrichment protocols, for example.

Recommended sources of general metabolite information
Many online databases offer information about metabolites. These have varying scales and scopes of content, query tools and modes of access. In these aspects, several databases stand out from all others and these are described below.

ChemSpider
Description: A freely-accessible collection of compound data from across the web with a very versatile search engine.
Scope: all chemicals, not just metabolites
Limitations: Important fields are empty for some very common metabolites. Being limited to human metabolites limits utility for other research areas. Downloadable flat-file format requires parsing in order to be usable in spreadsheets or local databases.

Chemical Entities of Biological Interest (ChEBI)
Description: A freely-available dictionary of small molecule chemicals of interest to biologists.

Metabolic pathway databases
A wide range of biological information about metabolites is available online. Utilising this information can aid in the development of hypotheses, the design of experiments and the biological interpretation of metabolomics results. For this purpose, among the most useful types of database are metabolic pathway databases. These play a crucial role in metabolomics research by systematically capturing and providing a close representation of current knowledge about: a) which metabolites occur in particular biological systems; b) the enzymatic and non-enzymatic reactions that link different metabolites together into metabolic pathways; c) the enzymes that carry out these reactions and the genes that encode them; and d) the allosteric interactions and signalling networks that regulate these genes and gene products. Another highly useful function that some metabolic pathway databases carry out is to visually overlay metabolomic datasets over pathway diagrams to provide biological contexts aiding the biological interpretation of results. Some of the most useful metabolic pathway databases are described below.

Kyoto Encyclopedia of Genes and Genomes (KEGG)
Description: A knowledgebase of genomes, genes and gene products, their properties and the metabolic and regulatory pathways they form.

Species: many species from many different classes
Metabolic pathway content: metabolite names, formulas, masses, structures and external database IDs; reactions; reactant-product atom mappings; pathways; enzymes; enzyme genes; orthologies; bioactivities; allosteric interactions / regulatory pathways; pathway, compound, taxonomy and biological process ontologies

Noteworthy features: Structural similarity search
Modes of access: browse, search, API, FTP download (requires subscription)
Strengths: Enormous amount of information. The largest source of atom-mapped reactions available.
Limitations: Broad focus means extracting desired subsets of information can be challenging. Query tools are limited.

BioCyc and the "Cyc" family of metabolic pathway databases
Description: Similar to KEGG. A collection of Pathway / Genome Databases (PGDBs) built using software that predicts metabolic pathways from genome sequences and subsequently refined by varying degrees of expert curation.
Species: BioCyc itself includes highly-curated PGDBs for 3 organisms: Escherichia coli (EcoCyc), Arabidopsis thaliana (AraCyc) and Saccharomyces cerevisiae (YeastCyc). Another highly-curated PGDB called MetaCyc compiles pathway and enzyme information from >1900 organisms (mainly single-cell organisms) into a single reference database. See also the separate HumanCyc, PlantCyc and many other "Cyc" databases.
Limitations: Some useful and easily-fillable fields are empty for some metabolites. The Cyc databases often refer to generic entities such as 'a fatty acid'; this can limit their utility when researchers are interested in modelling connections between certain specific entities.
Reference: (Caspi et al., 2010)
Limitations: Reaction-centric. Not much information about metabolites and does not provide any tools for overlaying metabolite expression data.

KappaView
Description: A web-based tool allowing users to overlay metabolite- and gene-expression responses and correlations onto custom pathway diagrams or onto a collection of neat, simple and interactive metabolic pathway diagrams.

KNApSAcK
Description: A comprehensive species-metabolite relationship database for plants.
Although not strictly a metabolic pathway database, this database is useful for identifying plant species that contain a certain chemical or identifying chemicals that have been reported in a particular plant species or higher level taxon.

Species: Plants
Metabolism-related content: References to literature reporting the presence of compounds in different plant species. Chemical structures. Masses.
Noteworthy features: References to literature.

Modes of access: browse, search
Strengths: Contains information on many plant-specific specialised metabolites.
Limitations: Data itself is not downloadable.

The roles of analytical reference libraries in metabolomics research
The first online metabolomics databases to store and disseminate actual instrument data for metabolites generally provided spectral reference libraries. These spectral libraries provide reference signals for authentic standard compounds and sometimes also for 'unknown' metabolites obtained through the analysis of standards and biological materials under controlled conditions. The de novo construction of large analytical reference libraries requires expertise in chemistry and is time-consuming and expensive. Centralization of spectral reference data in expert-curated public repositories helps the metabolomics community by: 1) making it easier and cheaper for new labs to build their own data processing pipelines; 2) reducing the probability of metabolite misidentification by non-specialists; and 3) promoting efficient communication about 'unknown' metabolites that are recognisable on the basis of their analytical properties but for which no structural information is available.

Types of analytical reference spectra available online
Reference spectra are available from a number of online sources. Types of reference data available include downloadable mass-spectral and retention-index (MSRI) libraries for gas chromatography / mass spectrometry (GC/MS) (Schauer et al., 2005; Carroll et al., 2010), searchable but non-downloadable MSRI data (Skogerson et al., 2011), NMR spectra collected under standardized conditions (Wishart et al., 2007; Cui et al., 2008; Ulrich et al., 2008) and MS and MS/MS spectra from a wide range of platforms including accurate mass instruments (Smith et al., 2005; Horai et al., 2010). In addition, most cheminformatic and metabolic pathway databases provide accurate monoisotopic mass information for metabolites, which can help provide candidate identities for accurate-mass LC/MS and direct-infusion (DI)/MS peaks. These data sources are described in detail later.

Reference data for Nuclear Magnetic Resonance (NMR)
One of the great advantages that NMR has over mass spectrometry is that chemical shifts and coupling constants (unlike mass-spectral fragmentation patterns) are, under readily controllable conditions, absolute physical constants that may be readily and accurately reproduced between different makes and models of instrument. Reference libraries of NMR spectra of metabolites, acquired under standardized conditions, are therefore of broad utility to the metabolomics research community. The major sources of standardized NMR spectra for metabolomics are the Madison Metabolomics Consortium Database (Cui et al., 2008), the Biological Magnetic Resonance Bank (Ulrich et al., 2008) and the HMDB (Wishart et al., 2007). These are detailed shortly.

Reference data for Gas-Chromatography / Mass-Spectrometry (GC/MS)
The most useful reference data for GC/MS are downloadable MSRI libraries. These are libraries of mass spectra and retention indices for peaks observed in GC/MS chromatograms obtained by the GC/MS analysis of pure compounds and biological samples under standardised conditions (Schauer et al., 2005). When the same standardized conditions are employed for GC/MS analysis in different laboratories, a single common MSRI library can be used for the high-confidence identification of common metabolite signals in those different labs. Researchers setting up new GC/MS metabolomics platforms are advised to consider adopting a standardised GC/MS protocol already supported by a publicly-available MSRI library, such as those available from the Golm Metabolome Database or MetabolomeExpress (Carroll et al., 2010), since this will enable them to share MSRI libraries with those labs and benefit from ongoing efforts to extend those libraries and annotate the large number of 'unknown' metabolites detected in GC/MS chromatograms of biological samples.
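To make the idea of MSRI-based identification concrete, the sketch below matches a query peak against a library using two criteria: the retention index must fall within a tolerance window, and the mass spectrum must be sufficiently similar. Cosine similarity is used here as one common scoring choice; it is an assumption of this sketch, not necessarily the scoring used by any particular database. The function names, the tolerance values and the two library entries are invented for illustration.

```python
import math
from typing import Dict, List, Tuple

Spectrum = Dict[int, float]  # nominal m/z -> intensity


def cosine_similarity(a: Spectrum, b: Spectrum) -> float:
    """Cosine of the angle between two spectra treated as sparse vectors."""
    dot = sum(a[mz] * b[mz] for mz in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_msri(
    query_ri: float,
    query_spectrum: Spectrum,
    library: Dict[str, Tuple[float, Spectrum]],
    ri_window: float = 5.0,        # assumed tolerance in RI units
    min_similarity: float = 0.8,   # assumed similarity cut-off
) -> List[Tuple[str, float]]:
    """Return (name, similarity) hits inside the RI window, best first."""
    hits = []
    for name, (lib_ri, lib_spectrum) in library.items():
        if abs(lib_ri - query_ri) > ri_window:
            continue  # retention index disagrees: reject regardless of spectrum
        similarity = cosine_similarity(query_spectrum, lib_spectrum)
        if similarity >= min_similarity:
            hits.append((name, similarity))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy library: identical spectra, different retention indices (invented values).
toy_library = {
    "toy_compound_A": (1500.0, {73: 100.0, 147: 40.0, 217: 20.0}),
    "toy_compound_B": (1900.0, {73: 100.0, 147: 40.0, 217: 20.0}),
}

toy_hits = match_msri(1502.0, {73: 95.0, 147: 42.0, 217: 18.0}, toy_library)
print(toy_hits)
```

In this toy case the two library entries have the same spectrum, so only the retention index separates them, which is the reason MSRI libraries pair the two kinds of information.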

Reference data for liquid chromatography-MS, MS/MS and MSn
While the low cost and operational simplicity of GC/MS have led it to become the most widely employed analytical platform in metabolomics, an increasing number of laboratories are adopting complementary techniques based on liquid chromatography (LC)- and direct infusion (DI)/MS methods that employ different ionisation techniques and more advanced mass spectrometers capable of MS, MS/MS, MS3 and MSn modes of analysis together with much higher mass accuracy and resolution than is provided by most standard GC-MS systems. In the paragraphs below, the various types of non-GC/MS, MS-based metabolomics techniques such as LC/MS, DI/MS and capillary electrophoresis (CE)/MS, including tandem MS and MSn methods, will be referred to collectively as "LC/MS" techniques.
While GC/MS metabolomics is dominated almost entirely by electron impact ionisation (EI) methods using the industry-standardised ionisation energy of 70 eV, yielding highly reproducible fragmentation spectra between different GC/MS instruments, such broad standardisation has not occurred for LC/MS. For LC/MS, the enormous diversity of mass spectrometer types, combined with a lack of highly-developed LC 'retention-index' systems, presents significant challenges to the creation of standardized MSRI reference libraries, analogous to those available for GC/MS, capable of unambiguous cross-laboratory peak identification.
The simplest type of online reference data for LC/MS metabolomics are the accurate, monoisotopic masses and molecular formulas of metabolites and, in some cases, their stable-isotope-labelled isotopomers. The data-processing packages provided with MS instruments capable of high-accuracy mass measurements generally allow users to create custom libraries of accurate masses and/or molecular formulas (for improved match scoring based on the shapes of isotopic envelopes) for target analytes to assist with peak identification. Although accurate masses or molecular formulas alone are not sufficient to unambiguously identify metabolite signals (due to the high frequency of structural isomers across nature), using these data in a rational manner can often provide valuable clues about the possible identities of peaks.
A good way of reducing (but not eliminating) ambiguity in accurate mass-based assignments is to build a separate accurate mass library for each biological system under investigation and to include in each library only those metabolites for which literature evidence exists to support their presence in that organism. An easy way of doing this is to use the advanced query tool provided with each of the BioCyc family of metabolic pathway databases (of which there are many). While the metabolite sets thus obtained may not be complete, this is a fast way of obtaining a good quality starting set.
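The accurate-mass matching described above can be sketched as follows. The example computes monoisotopic masses from molecular formulas and lists every library entry within a parts-per-million (ppm) tolerance of an observed neutral mass. The function names, the 5 ppm default and the three-entry library are assumptions made for illustration; the library is written by hand here and is not an export from any database. Adduct and charge handling are deliberately left out.

```python
import re
from typing import Dict, List

# Monoisotopic masses (u) of the most abundant isotope of each element.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
    "P": 30.97376200,
    "S": 31.97207117,
}


def monoisotopic_mass(formula: str) -> float:
    """Sum isotope masses for a simple formula such as 'C6H12O6'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def ppm_error(observed: float, theoretical: float) -> float:
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def candidates(observed: float, library: Dict[str, str], tol_ppm: float = 5.0) -> List[str]:
    """Names of all library entries whose mass lies within the tolerance."""
    return [
        name
        for name, formula in library.items()
        if abs(ppm_error(observed, monoisotopic_mass(formula))) <= tol_ppm
    ]


# Hypothetical, hand-written library for one biological system.
toy_library = {
    "glucose": "C6H12O6",
    "fructose": "C6H12O6",
    "citric acid": "C6H8O7",
}

print(round(monoisotopic_mass("C6H12O6"), 4))
print(candidates(180.0634, toy_library))
```

Glucose and fructose share a molecular formula, so both are returned for the same observed mass. This is the isomer ambiguity noted above: an accurate mass narrows the list of candidates but does not by itself identify a peak.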
Another approach for reducing ambiguity in LC/MS peak identifications is to use MS/MS spectral similarity as a scoring parameter to complement accurate-mass MS based assignments (see Matsuda et al., 2009 and Matsuda et al., 2010 for good examples). The major online sources of MS/MS spectra for metabolites are MassBank (Horai et al., 2010), METLIN (Smith et al., 2005), ReSpect for Phytochemicals (http://spectra.psc.riken.jp/menta.cgi/index) and the HMDB (Wishart et al., 2007). These databases each have different strengths and limitations which will be outlined shortly. With the notable exception of ReSpect for Phytochemicals, a drawback that these databases share is a lack of support for bulk downloading of spectra. That said, MassBank does provide a powerful API to partially overcome the need for bulk download while the METLIN website currently reports that an API is in development.

The need for chromatographic retention data in LC/MS reference databases
It is important to note that, for high-confidence peak identifications that meet minimum reporting standards outlined by the Metabolomics Standards Initiative (MSI) (Sansone et al., 2007), it is necessary to support peak identifications with an additional, orthogonal identification parameter. In the case of LC/MS, where chromatography is used, this parameter is generally retention time or relative retention time agreement with an authentic standard. Unfortunately, there appear to be few if any LC/MS reference databases that provide retention time or relative retention time information. Absolute retention times vary from instrument to instrument and from column to column (even between columns of the same make and model), and are therefore considered to be of limited use for high-confidence inter-laboratory peak identification. However, relative retention times (or retention indices), where the retention time of each peak is expressed relative to one or two other peaks in the same chromatogram, are far more stable (Tarasova et al., 2009) and may provide an avenue to the compilation of LC-MS reference libraries capable of providing MSI-compliant peak identifications by combining accurate mass MS or MS/MS spectra with meaningful and highly reproducible retention index (RI) properties. Complementary to this approach would be the further development of RI-prediction models that can accurately predict the LC retention indices of metabolites based on their structures (Hagiwara et al., 2010).
It is important to note that sufficient RI reproducibility may only be achievable with certain simple types of stationary and mobile phase combinations whereby a single stationary phase interaction mechanism (eg. hydrophobic interactions in C18 reversed-phase chromatography or hydrogen-bonding interactions in silanol-based normal phase chromatography) applies to all analytes. In separations over mixed-mode stationary phases where multiple interaction mechanisms occur, there is more potential for variations in chromatographic conditions to differentially affect different peaks, thus changing their relative retention times. Public databases of "Accurate Mass / Retention Time (AMT) tags" are playing increasingly important roles in peptide identification in LC-MS proteomics (Hagiwara et al., 2010). A similar trend is to be expected in metabolomics.
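The difference between absolute and relative retention can be illustrated with a small sketch. It expresses a peak's retention time relative to one reference peak (a ratio) or relative to two bracketing reference peaks (a linear index). Both formulas, the function names and the scale of 100 are assumptions chosen for illustration, not a published RI definition, and the simulated drift between the two runs is idealised as a simple stretch plus offset.

```python
import math


def relative_retention_time(t_peak: float, t_reference: float) -> float:
    """Retention time expressed as a ratio to a single reference peak."""
    return t_peak / t_reference


def bracketed_retention_index(
    t_peak: float, t_low: float, t_high: float, scale: float = 100.0
) -> float:
    """Position of a peak between two reference peaks on a 0..scale axis."""
    if t_high == t_low:
        raise ValueError("reference peaks must have different retention times")
    return scale * (t_peak - t_low) / (t_high - t_low)


# Run 1: retention times in minutes (invented values).
run1 = {"ref_low": 2.0, "peak": 5.0, "ref_high": 10.0}
# Run 2: the same peaks after an idealised drift, t' = 1.1 * t + 0.3.
run2 = {name: 1.1 * t + 0.3 for name, t in run1.items()}

for label, run in (("run 1", run1), ("run 2", run2)):
    ratio = relative_retention_time(run["peak"], run["ref_high"])
    index = bracketed_retention_index(run["peak"], run["ref_low"], run["ref_high"])
    print(f"{label}: absolute={run['peak']:.2f} ratio={ratio:.3f} index={index:.2f}")
```

Under this idealised drift the absolute retention time and the single-reference ratio both change between runs, while the two-reference index is unchanged. Real chromatographic drift is not exactly linear, which is one reason the reproducibility of such indices depends on the separation conditions.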

Madison Metabolomics Consortium Database (MMCD)
Description: An analytical reference database and signal-matching tool for metabolomics.
Strengths: Enormous resource for NMR metabolomics. Includes a wide range of metabolites including those that don't occur in humans (eg. plant-specific metabolites). Spectral matching tools provide batch-processing capability.

Human Metabolome Database (HMDB)
Description: A comprehensive, freely-available knowledgebase of human metabolite information.

Noteworthy features: NMR, MS/MS and GC/MS spectrum-based search
Modes of access: browse, search and bulk download (bulk download of MS/MS spectra only provides images of spectra).

Strengths: A large set of standardized NMR and GC/MS spectra helps new labs to quickly set up metabolite profiling platforms.
Limitations: No support for bulk download of metabolite information based on complex query. No batch-processing capabilities for spectral matching. No API for integration with other web tools.

METLIN
Description: A repository for metabolite information and tandem mass spectrometry data.

MassBank
Description: A repository for mass-spectra of pure compounds. Features a unique design involving a centralised interface but a distributed network of data servers providing the mass-spectra.
Species: Not species constrained. Not limited to biological metabolites.
Reference data: >29000 mass spectra from a wide range of instrument types including, but not limited to, GC/MS, LC/MS and LC-MS/MS.

Noteworthy features: Batch searching of MS/MS files against the database. Neutral loss search. Most sophisticated and powerful spectral search and visualisation capabilities of all available mass-spectral repositories.

Modes of access: Search, browse and API.
Strengths: Many spectra, powerful search capabilities.

Limitations: No bulk download. However, individual spectra may be downloaded in text format.

ReSpect for Phytochemicals
Description: An interactive collection of MSn spectra of plant metabolites, collected by the LC/MS metabolomics group of the RIKEN Plant Science Center.

Species: Plant species.
Reference data: A total of >8500 MS/MS spectra including >3000 spectra from the literature, >4000 triple quadrupole MS/MS spectra corresponding to >861 standard compounds and >1000 Q/TOF spectra corresponding to >550 standard compounds. Includes both +ve and -ve ionization modes.

Noteworthy features: Spectral search online using cosine method
Modes of access: Search, browse and complete download.
Strengths: Contains many plant-specific spectra not available elsewhere. Free for bulk download.

MoTo DB
Description: A liquid chromatography-mass spectrometry-based metabolome database for tomato

Species: Tomato (Solanum lycopersicum)
Reference data: Masses, retention times, UV/Vis properties and MS/MS fragment information for a range of metabolites reported to occur in tomato plants.
Noteworthy features: Includes retention times.

Modes of access: Search only.
Strengths: Provides literature references to support peak annotations.
Limitations: Very limited search capability. No browse capability. No download.

The Golm Metabolome Database (GMD)
Description: An interactive and downloadable database of electron impact (EI) ionization mass-spectra and associated retention indices of metabolite peaks detected by GC-EI-Quadrupole (GC-EI-Q-MS) and GC-EI-Time Of Flight (GC-EI-TOF-MS) instruments operated under standardized conditions.
Species: Not formally species-constrained but is plant-centric.

Modes of access: Search, browse and API
Strengths: Very comprehensive. Free for download. Well curated and supported.

Limitations: Does not provide innate support for sharing of MSRI libraries by arbitrary users.

MetabolomeExpress
Description: An interactive database of downloadable MSRI libraries, raw and processed GC/MS metabolite profiling datasets and a database of metabolic phenotypes observed in any organism using any analytical technique. Includes a complete GC/MS data processing pipeline and cross-study data mining tools.
Species: Not formally species-constrained but current content is plant-centric.
Reference data: A number of GC/MS MSRI libraries are downloadable from the website. Golm Metabolome Database MSRI libraries are provided for use within the data processing pipeline.

Noteworthy features: Members may independently upload their own MSRI libraries for interactive dissemination and use within the GC/MS data-processing pipeline.

Modes of access: browse and FTP
Strengths: Libraries free for download. Provides a built-in GC/MS data processing pipeline.

Background
Less than a decade ago, software packages enabling processing and analysis of metabolomics datasets were restricted to a limited range of desktop software programs. Would-be metabolomics researchers would have to download or purchase and install software on local computers, set up local reference libraries for peak identification and sometimes develop custom in-house computer scripts to adapt the outputs of various programs into the formats required by programs used for downstream analysis. These challenges were compounded by the fact that available programs often lacked the kinds of specialised, biology-related features desirable for metabolomics research. However, the understandable widespread dissatisfaction of metabolomics researchers with this situation has, over the last decade, driven rapid development of powerful online, platform-independent data processing pipelines tailored towards the needs of metabolomics research. Thanks to the availability of these packages and the availability of standardised analytical reference libraries, it is now quite feasible for researchers with limited experience to conduct detailed processing and analysis of their instrumental datasets with little more than a fast internet connection, an up-to-date web browser and, in some cases, an FTP client program for uploading data. This section will provide an overview of the types of data processing pipelines that are currently accessible online and compare the most powerful examples in more detail.

Functions carried out by online data-processing tools
An ideal metabolomics data-processing pipeline, whether online or offline, should be able to: a) identify and quantify biologically relevant signals from raw instrument files and distinguish them from biologically irrelevant signals; b) identify non-redundant metabolite signals and, where possible, annotate them with their molecular identities; c) assemble a [metabolite x sample] data matrix appropriately normalised to sample volumes, internal standards and/or other useful normalisation factors; d) facilitate determination and statistical analysis of relative metabolite levels between sample classes; e) carry out multivariate analyses such as principal components analysis (PCA), hierarchical clustering analysis (HCA) and partial least squares discriminant analysis (PLS-DA); and f) provide facilities to assist biological interpretation of results (e.g. mapping of detected metabolite responses onto metabolic pathways, overrepresentation analysis and biomarker detection). While the vast majority of online metabolomics data-processing tools carry out only one or a few of these functions, some systems are capable of carrying out all of them. The functionalities of a variety of web-based data-processing tools for metabolomics are summarised in Table 2.
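
Functions (c) and (d) can be illustrated with a minimal Python sketch. It is not taken from any of the tools discussed in this chapter: the function names, metabolite names, sample names and peak areas are all hypothetical, and the normalisation shown (dividing each signal by the sample's internal-standard signal and sample amount) is only one of several reasonable choices.

```python
import math
from statistics import mean


def normalise(raw, internal_standard, sample_amounts):
    """Divide each raw signal by the sample's internal-standard signal and sample amount."""
    normalised = {}
    for metabolite, signals in raw.items():
        normalised[metabolite] = {
            sample: value / (internal_standard[sample] * sample_amounts[sample])
            for sample, value in signals.items()
        }
    return normalised


def log2_fold_changes(matrix, treated, control):
    """Return log2(mean treated level / mean control level) for each metabolite."""
    result = {}
    for metabolite, levels in matrix.items():
        treated_mean = mean(levels[s] for s in treated)
        control_mean = mean(levels[s] for s in control)
        result[metabolite] = math.log2(treated_mean / control_mean)
    return result


# Hypothetical raw peak areas: {metabolite: {sample: area}}
raw = {
    "citrate": {"wt1": 100.0, "wt2": 120.0, "mut1": 400.0, "mut2": 480.0},
    "alanine": {"wt1": 50.0, "wt2": 60.0, "mut1": 25.0, "mut2": 30.0},
}
# Hypothetical internal-standard peak areas and sample amounts per sample
internal_standard = {"wt1": 10.0, "wt2": 12.0, "mut1": 10.0, "mut2": 12.0}
sample_amounts = {"wt1": 1.0, "wt2": 1.0, "mut1": 1.0, "mut2": 1.0}

matrix = normalise(raw, internal_standard, sample_amounts)
responses = log2_fold_changes(matrix, treated=["mut1", "mut2"], control=["wt1", "wt2"])
```

In this toy dataset the normalised citrate level is four times higher in the mutant samples than in the wild-type samples, while alanine is halved. A real pipeline would follow this step with a significance test on the replicate values before reporting a response.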

Background
The long-standing scientific tradition of openly disclosing supporting primary data whenever scientific claims are made has been a fundamental factor underlying the credibility of science. However, in more recent years, the scale and complexity of primary datasets have risen dramatically, presenting ever-new challenges to this tradition. The widespread emergence of high-throughput metabolomics technologies in bioscience is a good example.
In this author's view, it is absolutely crucial that the culture of open primary data disclosure is maintained, and that "challenges" should not become "excuses". Even in the extreme case of next-generation DNA sequencing, where the sizes of primary datasets (after parsing of raw image data) are typically measured in the tens of gigabytes (at least 10 times larger than typical metabolomics datasets), scientists have risen to the challenge by providing online storage space and developing specialised data repository systems capable of systematically archiving and effectively disseminating these data (Kaminuma et al., 2010; Cochrane et al., 2011; Leinonen et al., 2011).
Given the relatively small sizes of metabolomics datasets and the fact that metabolomics techniques predate next-generation sequencing by a considerable number of years, it is difficult to think of a satisfactory justification for the number of scientific claims that have been made on the basis of metabolomics datasets that have not, at the very least, been made freely available for download from a publicly accessible web site. That said, recent years have seen a strong increase in the number of metabolomics labs sharing primary datasets from their own websites and even the emergence of centralized metabolomics data repositories allowing arbitrary labs to share their datasets publicly without even having to set up their own website. These groups that have been voluntarily driving the free and open dissemination of primary metabolomics data should be commended! The following sections will highlight the data sharing efforts that have been made by individual groups within the metabolomics community and describe the centralized metabolomics data repositories that are currently in operation and/or development.

DROP met: Data resources of plant metabolomics
Description: A part of the PRIMe (Platform for RIKEN Metabolomics) website.

Species: Plant species
Reference data: Provides a simple download page allowing free download of raw and/or processed LC/MS and GC/MS datasets and metadata from 8 different peer-reviewed publications emerging from the RIKEN Plant Science Center.
Noteworthy features: Metadata for each raw data file are provided in a systematic, MSI-compliant format.

Modes of access: browse
Strengths: Data are easy to find and well annotated.
Limitations: Metabolic phenotypes are not stored in a database. There is no way of querying the data without downloading, extracting biological information and importing into a local database.
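
The local import step mentioned under Limitations can be sketched in Python as follows. This is an illustration under stated assumptions, not a description of the DROP Met file formats: the tab-delimited layout (columns `metabolite`, `sample`, `level`), the table name, the helper functions and all values are hypothetical, and an in-memory SQLite database stands in for a local database.

```python
import csv
import io
import sqlite3

# Hypothetical tab-delimited table of processed levels (not an actual DROP Met file)
TSV = (
    "metabolite\tsample\tlevel\n"
    "sucrose\tleaf_1\t12.5\n"
    "sucrose\troot_1\t3.0\n"
    "proline\tleaf_1\t0.8\n"
    "proline\troot_1\t4.2\n"
)


def load_levels(connection, tsv_text):
    """Create the table if needed, insert every row of the table and return the row count."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS levels (metabolite TEXT, sample TEXT, level REAL)"
    )
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = [(r["metabolite"], r["sample"], float(r["level"])) for r in reader]
    connection.executemany("INSERT INTO levels VALUES (?, ?, ?)", rows)
    connection.commit()
    return len(rows)


def samples_above(connection, metabolite, threshold):
    """Return the samples, sorted by name, in which a metabolite exceeds a threshold."""
    cursor = connection.execute(
        "SELECT sample FROM levels WHERE metabolite = ? AND level > ? ORDER BY sample",
        (metabolite, threshold),
    )
    return [row[0] for row in cursor]


connection = sqlite3.connect(":memory:")  # in-memory database for illustration only
n_rows = load_levels(connection, TSV)
high_proline = samples_above(connection, "proline", 1.0)
```

Once the data are in a local table, questions such as "in which samples does proline exceed a given level?" become single queries. Without a database-backed repository, each user has to repeat this import for every downloaded dataset.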

KomicMarket (Kazusa omics data market)
Description: A freely accessible database of annotations of metabolite peaks from FT-ICR-MS analysis of standard compounds and plant samples.

Species: Plant species
Reference data: Metabolites detected in tomato fruits by FT-ICR-MS. 215 standard compounds detected by FT-ICR-MS.

Noteworthy features: None
Modes of access: Search, browse and API.
Strengths: Good collection of high mass-accuracy flavonoid spectra. API makes download of spectra and associated annotations relatively easy.

Limitations: No bulk download of spectra, which needlessly makes access to spectra more challenging.

Species: Arabidopsis thaliana
Reference data: Provides relative levels of a large number of metabolites in a large number of Arabidopsis thaliana mutants.

Modes of access: Search, browse and download.
Strengths: Data on a very wide range of metabolites. Incorporates phenotypic notes on mutants.
Limitations: Important metadata fields are frequently left empty. Raw data files are not provided. Origins of processed results are not transparent. There is no way to align and compare global phenotypes of mutants.

Mery-B
Description: A repository for plant metabolomics datasets, including experimental metadata, processed data and raw data for NMR experiments.

Species: Plants.
Reference data: Provides NMR-based metabolite quantification data for a variety of tissues from a variety of species grown under a variety of conditions. Based on ~1000 spectra. Chemical shift peak assignment information is provided.

Noteworthy features: Interactive raw data viewers for 1D NMR and GC/MS data.

Modes of access: Search and browse.
Strengths: Contains data from a range of peer-reviewed publications and references to literature are clearly presented. Raw NMR spectra and GC chromatograms are available for visualisation. All experimental protocols are provided.
Limitations: Tools for statistical analysis are not yet functional. Data are not downloadable for offline analysis. Analytical reference libraries are not provided. Peak assignments are not seamlessly integrated into the raw data viewer. No direct links between statistical results and raw data visualisation. Interface is not very intuitive.

MetabolomeExpress
Description: An interactive, centralized repository for metabolomics data from all organisms and all analytical platforms that provides a variety of cross-study data-mining tools for analysis of metabolic phenotypes. Processed data may be uploaded in a simple tab-delimited format. Alternatively, raw GC/MS data may be uploaded and processed online using the integrated data-processing pipeline before being imported into the data repository.
Species: Not formally species-constrained but current content is plant-centric. Data from other systems is currently being gathered from the literature.
Reference data: MSRI libraries, GC/MS chromatograms, processed results, metadata in systematic formats. Database currently includes >12000 publicly available metabolite response statistics representing >100 metabolic phenotypes from 8 species under 22 different experiments in 16 different peer-reviewed publications.
Noteworthy features: Members may independently upload their own MSRI libraries for interactive dissemination and use within the GC/MS data-processing pipeline. Provides tools for cross-study meta-analysis and database-driven phenotype recognition by pattern matching.

Modes of access: browse and FTP
Strengths: All public data free for download. Provides a built-in GC/MS data processing pipeline. Allows cross-study analysis. Processed metabolite response statistics are transparently linked to underlying raw data in an interactive raw data viewer.
Limitations: No API. No search. Raw data processing pipeline needs to be extended to support analytical platforms other than GC/MS. Does not provide as many multivariate analysis and classification tools as other web-based metabolomics data-processing systems.
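
The idea of database-driven phenotype recognition by pattern matching can be sketched as follows. The similarity measure used by MetabolomeExpress is not described here, so this example simply assumes Pearson correlation between log2 metabolite responses over the metabolites shared by the query and each stored phenotype. The phenotype names, response values, minimum-overlap threshold and function names are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def rank_matches(query, database, min_shared=3):
    """Rank stored phenotypes by correlation with the query over shared metabolites."""
    scores = []
    for name, phenotype in database.items():
        shared = sorted(set(query) & set(phenotype))
        if len(shared) < min_shared:
            continue  # too few metabolites in common for a meaningful comparison
        r = pearson([query[m] for m in shared], [phenotype[m] for m in shared])
        scores.append((name, r))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical stored phenotypes: {phenotype: {metabolite: log2 response}}
database = {
    "mutant_A": {"citrate": 2.0, "alanine": -1.0, "sucrose": 0.5, "proline": 1.5},
    "stress_B": {"citrate": -1.5, "alanine": 1.0, "sucrose": -0.5, "proline": 3.0},
    "mutant_C": {"citrate": 0.1, "alanine": 0.2},
}
# Hypothetical query phenotype from a new experiment
query = {"citrate": 1.8, "alanine": -0.9, "sucrose": 0.4, "glycine": 0.3}

ranked = rank_matches(query, database)
```

Here the query most closely resembles the hypothetical phenotype `mutant_A`, correlates negatively with `stress_B`, and is not compared with `mutant_C` because only two metabolites are shared. Restricting the comparison to shared metabolites matters in practice, because different studies rarely measure exactly the same set of compounds.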

Conclusion
The field of metabolomics informatics is moving very rapidly. New data-processing tools and new data repositories will continue to emerge. As they do, an increasingly important area for progress will be the standardization of universal data exchange formats that allow free flow of data between compliant databases. Similarly important will be the development of user-friendly metadata capture tools that make systematic annotation of datasets as painless as possible for biologists. These developments will require the creation of new ontologies and/or the extension of existing ontologies that do not cover all of the terms required to describe metabolomics experiments. The efficient sharing and mining of well-annotated and well-quality-controlled metabolomics data across the internet will undoubtedly lead to many important discoveries in the future.