Computational Methods to Interpret and Integrate Metabolomic Data

Introduction
Revolutionary improvements in high-throughput DNA sequencing technologies have made it possible to measure genes, mRNAs, proteins and metabolites, as well as their interactions, at a global level. In the past decades, significant efforts have been made to improve the analytical technologies for measuring mRNA, proteins and metabolites. These efforts have led to several new 'omics' research fields: transcriptomics, proteomics, metabolomics, interactomics and so on (Singh & Nagaraj, 2006; Fiehn, 2007; Lin & Qian, 2007; Kandpal et al., 2009; Ishii & Tomita, 2009). Among them, metabolomics is an approach to obtain a comprehensive evaluation of metabolites in cells. Compared with transcriptomics and proteomics approaches, metabolomics can achieve large-scale quantitative and qualitative measurements of cellular metabolites, and can thus generate high-resolution biochemical and functional information about an organism.
Due to the chemical complexity of cellular metabolites, it is generally accepted that no single analytical technique can provide a comprehensive view of all metabolites, so multiple technologies are generally employed (Dunn & Ellis, 2005; Villas-Boas Silas et al., 2005; Hollywood et al., 2006; Dettmer et al., 2007; Lenz & Wilson, 2007; Seger & Sturm, 2007). The selection of the most suitable technology is typically a compromise between speed, chemical selectivity, and instrumental sensitivity. Tools such as nuclear magnetic resonance spectroscopy (NMR) are rapid, highly selective and non-destructive, but have relatively low sensitivity. Other tools, such as capillary electrophoresis (CE) coupled to laser-induced fluorescence detection, are highly sensitive but have limited chemical selectivity (Ramautar et al., 2006). So far, mass spectrometry (MS) measurement following chromatographic separation offers the best combination of sensitivity and selectivity (Dunn & Ellis, 2005; Bedair & Sumner, 2008). Mass-selective detection provides highly specific chemical information, including the molecular mass and/or characteristic fragment ions, that is directly related to the chemical structure of molecules. This information can be utilized for compound identification through spectral matching with data compiled in libraries for

Major analytical technologies of metabolomics

Mass spectrometry (MS)-based metabolomics
Several MS-based metabolomics technologies have been developed in the past decades; a brief introduction is presented here. More detailed information regarding these technologies can be found in several recent excellent reviews (Dunn & Ellis, 2005; Hollywood et al., 2006; Dettmer et al., 2007; Bedair & Sumner, 2008).
Direct infusion mass spectrometry (DIMS) is a method for the direct analysis of complex metabolic extracts, without further extraction or separation, via electrospray ionization (ESI) MS. It provides a sensitive, high-throughput method that makes it possible to analyze several hundred samples per day. In terms of disadvantages, DIMS analysis is susceptible to ionization suppression due to competitive ionization with other components in the matrix (e.g., salts and other ionic compounds, organic acids/bases, and hydrophobic compounds), although ionization suppression effects can be reduced by nano-electrospray (nano-DIMS), which has increased ionization efficiency (Southam et al., 2007). In addition, typical DIMS is not able to discriminate isomeric compounds; however, when DIMS is coupled with tandem MS (DIMS-MS) or Fourier transform ion cyclotron resonance (FT-ICR) spectrometers, fragment ions can be trapped and accumulated, which often makes it possible to distinguish different isomeric structures (Aharoni et al., 2002). A comparison between direct infusion negative-ESI ion-trap MS and GC-quadrupole MS analysis for the metabolic fingerprinting of five yeast mutants was reported recently (Mas et al., 2007). Negative-ESI LCQ ion-trap MS was reported as an effective method for the characterization of plant extracts, giving well-defined clusters in comparison to positive-ion ESI and 1H-NMR profiling (Mattoli et al., 2006).
Gas chromatography-mass spectrometry (GC-MS) has been a very useful technology for volatile and thermally stable polar and nonpolar metabolites (Tanaka et al., 1980). Metabolite identification or confirmation is performed by retention time or index comparisons with pure compounds, and by mass spectral interpretation or comparison using retention index/mass spectral library databases (Wagner et al., 2003). Metabolites can be divided into two classes: volatile metabolites not requiring chemical derivatisation (Yassaa et al., 2001; Mallouchos et al., 2002; Deng et al., 2004) and non-volatile metabolites requiring chemical derivatisation (Roessner et al., 2000). GC-MS based metabolic profiling has been used to compare four Arabidopsis genotypes, showing that each genotype exhibited a different metabolite profile (Birkemeyer et al., 2003), and to compare transgenic tomato plants overexpressing hexokinase (Roessner-Tunali et al., 2003). Silent phenotypes of potatoes have been distinguished from their parental background by employing metabolic profiling (Weckwerth et al., 2004). The same approach has recently been employed in microbial metabolomics to study the effect of different growth conditions on Corynebacterium glutamicum (Strelkov et al., 2004).
The application of liquid chromatography coupled to mass spectrometry (LC-MS) in metabolomics has been growing over the past few years (Wittmann et al., 2004). LC is a universal separation technique that can be tailored for the targeted analysis of specific metabolite groups or utilized in a broader non-targeted manner. It offers the additional benefit of analyte recovery by fraction collection and/or concentration, which has been difficult with GC separation. In addition, LC-MS operates at lower analysis temperatures than GC-MS, which enables the analysis of heat-labile metabolites. LC-MS analysis does not involve sample derivatization, which simplifies sample preparation and improves the identification of metabolites. The major disadvantage of LC-MS relative to GC-MS is the lack of transferable LC-MS libraries for metabolite identification, although some efforts have been initiated to construct in-house LC-MS or LC-MS-MS libraries for automated metabolite identification (Noteborn et al., 2000). Two-dimensional LC has also been utilized to increase the peak separation capacity (Aharoni et al., 2002). Recent LC-MS metabolite-profiling examples include the identification of flavonoids and isoflavonoids in Medicago truncatula (Daykin et al., 2002), the discovery of novel pathways by studying the differential and elicitor-specific responses in phenylpropanoid and isoflavonoid biosynthesis in Medicago truncatula cell cultures (Farag et al., 2008), and the investigation of small polar-metabolite responses to salt stress in Arabidopsis thaliana (Lindon et al., 2000). LC-MS has also been used for the non-targeted, unbiased analysis of endogenous metabolites (Rashed et al., 1997; De Vos et al., 2007).
Capillary electrophoresis-mass spectrometry (CE-MS) is a powerful separation technique for charged metabolites (Ramautar et al., 2006; Monton et al., 2007). CE has superior separation efficiency compared to LC because the electroosmotic flow (EOF) generates a plug-flow profile, whereas the flow in LC is parabolic. Capillary zone electrophoresis (CZE) has been the major CE mode used for CE-MS analysis of metabolites, due to the simplicity of the running buffer. Simultaneous separation of charged and neutral metabolites can be achieved using other CE modes, e.g., micellar electrokinetic chromatography (MEKC) or capillary electrochromatography (CEC). Cationic and anionic CE-MS analysis of Bacillus subtilis extracts detected 1692 metabolite features, of which 150 were identified (Soga et al., 2003). The same analytical procedure was recently used to study the alteration of metabolic pathways in transgenic rice lines that overexpress dihydroflavonol-4-reductase (Sato et al., 2004). Non-aqueous CE-ESI ion-trap MSn was utilized for quantitative and qualitative profiling of isoquinoline alkaloids in single-plant tubers of four central European Corydalis species (Sturm et al., 2007).

Other emerging mass spectrometry technologies
Matrix-assisted laser desorption ionization mass spectrometry (MALDI-MS) is a popular analytical technique for biopolymer analysis. It has a high throughput capacity and a higher tolerance for salts than ESI. In metabolomics, MALDI has largely been confined to the targeted analysis of high-molecular-weight metabolites, because the matrix generates substantial chemical-background signals in the low-molecular-weight region (<1,000 m/z). Examples include the analysis of phospholipids in mammalian tissues (Jones et al., 2006), plant carotenoids (Fraser et al., 2007), and plant cell wall xyloglycans (Lerouxel et al., 2002). MALDI has also been used for imaging MS (IMS) of proteins and small molecules in tissues (Reyzer & Caprioli, 2007). Whole organisms or selected tissue sections are analyzed through an array of spots in which MS spectra are acquired at spatial intervals that define the image resolution. The intensities of a given m/z value in the acquired spectra are then plotted against the x and y coordinates to form a 2D image, which represents the spatial distribution of that metabolite/ion in the tissue (Rubakhin et al., 2005). MALDI-TOF IMS has been applied successfully to the study of drug and metabolite distributions in rat-brain tissues (Hsieh et al., 2006) and in the whole rat body (Khatib-Shahidi et al., 2006).
Desorption electrospray ionization (DESI) is a new, ambient, soft-ionization technique that combines features from both ESI and desorption ionization (DI) methods (Takats et al., 2004; Cooks et al., 2006). In DESI, an electrospray emitter is used to generate a spray of charged micro-droplets that is directed towards an ambient sample surface. Virtually no sample preparation is required for DESI, which allows the direct analysis of animal and plant tissues. The application of DESI in metabolomics is relatively new, but its ambient DI properties as well as its high-throughput capacity make it an attractive tool for metabolomics. One promising area for DESI is in vivo metabolomics, which was demonstrated through the direct profiling of alkaloids from plant tissues of Conium maculatum without sample preparation, while still identifying all of its previously reported alkaloids using tandem MS (Talaty et al., 2005). Although this technique can be incorporated into an IMS configuration, the spatial resolution of the DESI source (0.5-1.0 mm) is currently coarser than that of MALDI ion imaging (50-100 μm) (Wiseman et al., 2005).
Extractive electrospray ionization (EESI) is another new ESI technique, which uses two separate sprayers. Although the exact sample-ionization mechanism is still unclear, the ionization process depends on liquid-liquid extraction between the colliding micro-droplets of the sample spray and the charged reagent-solvent spray (Chen et al., 2006). The advantage of EESI is its ability to analyze complex biological samples, such as urine and serum, directly with minimal or no sample preparation for an extended period of time. EESI-MS along with 1H NMR was recently used to monitor the effect of diet on the metabolites found in rat urine (Gu et al., 2007).
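The construction of an ion image described above can be sketched as follows. This is only an illustration under stated assumptions: the spot layout, the `(x, y, spectrum)` tuple format, the function name `ion_image` and the m/z tolerance are all hypothetical and are not taken from any instrument software.

```python
def ion_image(spots, target_mz, tol=0.5):
    """Build a 2D intensity grid for one m/z value.

    spots: list of (x, y, spectrum) tuples, where x and y are integer spot
    coordinates and spectrum is a list of (mz, intensity) peaks
    (a hypothetical data layout chosen for this sketch).
    """
    width = max(x for x, _, _ in spots) + 1
    height = max(y for _, y, _ in spots) + 1
    image = [[0.0] * width for _ in range(height)]
    for x, y, spectrum in spots:
        # Sum every peak that falls inside the m/z window around the target.
        image[y][x] = sum(i for mz, i in spectrum if abs(mz - target_mz) <= tol)
    return image


# Four spots on a 2 x 2 raster; the values are made up for illustration.
spots = [
    (0, 0, [(184.1, 5.0), (300.2, 1.0)]),
    (1, 0, [(184.0, 2.0)]),
    (0, 1, [(300.2, 7.0)]),
    (1, 1, [(184.3, 1.5), (184.2, 0.5)]),
]
for row in ion_image(spots, target_mz=184.1, tol=0.3):
    print(row)
```

Each row of the returned grid corresponds to one y position, so the grid can be passed directly to a plotting routine to render the spatial distribution of the selected ion.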

Nuclear magnetic resonance (NMR) spectroscopy
NMR spectroscopy is a technique that exploits the magnetic properties of certain atomic nuclei to determine physical and chemical properties of atoms or the molecules in which they are contained. It relies on the phenomenon of nuclear magnetic resonance and can provide detailed information about the structure, dynamics, reaction state, and chemical environment of molecules. Most frequently, NMR spectroscopy is used to investigate the properties of organic molecules, though it is applicable to any nucleus possessing spin. Applications range from small compounds analyzed with 1-dimensional proton or carbon-13 NMR to large proteins or nucleic acids analyzed with 3- or 4-dimensional techniques (Grivet & Delort, 2009). NMR is an unbiased technique for metabolite fingerprinting that is quantitative even in complex mixtures. Nicholson and co-workers have pioneered the application of NMR for metabolite fingerprinting (Lindon et al., 2003). In an analysis of six yeast knock-out strains, the genotypes could be classified and related to one another by multivariate statistics, an approach that can potentially be applied to functional genomics (Raamsdonk et al., 2001). NMR is non-destructive, and therefore in vivo analysis is also possible (Gmati et al., 2005). Thus, NMR provides a powerful method for accessing the metabolite complement (metabolome) and metabolic fluxes (fluxome) at a fine scale (metabolite identification) and a global scale (metabolomics).

Vibrational spectroscopy
Vibrational spectroscopy is one of the oldest spectroscopic methods. The vibrational states of a molecule can be probed in a variety of ways. The most direct way is through infrared spectroscopy (IRS), as vibrational transitions typically require an amount of energy that corresponds to the infrared region of the spectrum. Raman spectroscopy (RS), which typically uses visible light, can also be used to measure vibration frequencies directly. RS has been used for the identification of microorganisms of medical relevance (Dunn et al., 2005). However, its application to complex biological systems outside the area of microbiology is still in its infancy, although the potential of using 1064 nm excitation has been demonstrated in the biochemical analysis of honey (de Oliveira et al., 2002) and in the analysis of plant pigments and essential oils (Schrader et al., 1999). In contrast, IRS has been applied to diagnostics, characterisation of microorganisms and plants, adulteration and quality assurance, biomarker discovery and biochemical responses (Dunn et al., 2005).

Single-cell metabolomics
Single-cell analysis is the new frontier in "OMICS" (Wang & Bodoritz, 2010). Most current metabolomic technologies only collect data averaged over thousands or millions of cells. However, cellular heterogeneity within a cell population is widespread (Irish et al., 2006; Graf & Stadtfeld, 2008). Since the metabolome reflects the biological processes occurring in the cells, it will be imperative to establish a reliable metabolomic method for measurements at the single-cell level.
The challenges for single-cell metabolomics include the following. i) The quantities of metabolites from a single cell are tiny. A typical single cell is about 1-500 fL in volume (from E. coli to large mammalian cells) and contains metabolites in amounts as low as attomoles to femtomoles (Schmid et al., 2010), 10^6 times lower than in typical population-based metabolomics. While amplification of DNA/RNA and highly sensitive fluorescence measurements can be employed in single-cell genomics, transcriptomics and proteomics, no similar technique is available for single-cell metabolomics. ii) Sample processing for a single cell is extremely challenging. Even though detection limits for metabolites using MS can be as low as the femtomole to attomole range (Amantonico et al., 2008), transferring a cell or its contents to the mass spectrometer, conserving the original metabolome, and separating metabolites from cell debris, proteins and salts are all critical steps.
In recent years, several approaches have been established for MS-based single-cell metabolomics (Figure 1) (see review by Heinemann & Zenobi, 2011). i) Sampling the cell contents with a micropipette, followed by injection into a mass spectrometer using a nano-electrospray ionization (nano-ESI) source (Masujima, 2009). This approach is probably only suitable for very large cells and can only measure a few cells per hour. ii) Sample preparation on a microfluidic chip, followed by deposition on a sample plate for (matrix-assisted) laser desorption/ionization (MALDI or LDI) mass spectrometry (Lu et al., 2006; Mellors et al., 2008; Amantonico et al., 2008, 2010; Holmes et al., 2009). Once a complete setup is realized, it has the potential to generate high-throughput data in an automated way. iii) Cell arraying: single cells are deposited on a sample plate for LDI or MALDI that is covered by a solvent-repelling layer; application of a MALDI matrix in an organic solvent then lyses the cells and extracts the compounds of interest for analysis by MALDI. This approach is a true high-throughput operation because the sample arraying can be automated, so the speed of the MS instrument is the only limiting factor (1000s of cells/hour) (Urban et al., 2010). iv) Imaging mass spectrometry: many modern mass spectrometers have imaging capabilities, with a spatial resolution of typically ~50 µm (MALDI or LDI) and ~1 µm (secondary ion mass spectrometry, SIMS), at relatively fast acquisition speed (Fletcher, 2009). With SIMS, the distribution of ions such as Na+, K+ and Ca2+, as well as cationized cholesterol and lipids present at cell surfaces, can be imaged (Fletcher, 2009). However, so far the data generated through this approach have been less quantitative (Heinemann & Zenobi, 2011).
Only very few metabolites can be analyzed directly in single cells by autofluorescence (Amantonico et al., 2010); however, by incorporating fluorescent tags or probes, researchers have been able to detect more metabolites. For example, Fehr et al. (2002) developed a protein-based nanosensor for the detection of maltose uptake by living yeast cells. In another study, a genetically encoded fluorescent sensor was expressed in living cells for detecting adenylate nucleotides (Berg et al., 2009). However, it is still debated whether the formation of these foreign complexes between sensor and metabolites will damage the cells and alter their physiological status.
The high sensitivity of electrochemical detectors to electroactive species makes them suitable for targeted studies of metabolites in single-cell analysis. A range of microscale electrochemical methods have been introduced to monitor various physiological processes (Huang & Kennedy, 1995). For example, the release of metabolites such as catecholamines and oxygen can be readily measured electrochemically (Cannon et al., 2000). Specific methods targeting particular metabolic pathways in single cells have been in use for a long time, including autoradiography of cells preincubated with radioisotopically labeled compounds (Fliermans & Schmidt, 1975).
Other methods used in single cells include single-cell spectroscopy in conjunction with image analysis for glycogen metabolism in yeast cells (Cahill et al., 2000), an enzyme-catalyzed luminescence method for dopamine release from a mammalian nerve cell (Shinohara & Wang, 2007), synchrotron Fourier transform infrared spectromicroscopy for ethanol formation in single living cells of unicellular algae (Goff et al., 2009), Raman spectroscopy for detecting nucleic bases and amino acids in single cells (Schuster et al., 2000; Buckmaster et al., 2009; Hermelink et al., 2009), and nuclear magnetic resonance (NMR) for structural characterization of organic compounds, including metabolites (Beckonert et al., 2007; Motta et al., 2010). A brief summary and comparison of the various metabolomics techniques discussed above is given in Table 1.

Computational methods for analyzing metabolomic data
Several computational methods have been developed in recent years to analyze metabolomic data. An overview of metabolomic data processing is shown in Figure 2. These computational methods can be divided into two major categories: methods for data pre-processing (low-level, such as noise reduction) and methods for interpretation (high-level, such as feature selection). Pre-processing methods concern the improvement and enhancement of raw signals, and typically include noise reduction, peak detection, baseline correction, peak alignment and normalization. Pre-processing methods for metabolomic data have been reviewed in detail by researchers from different perspectives (Jewett et al., 2007; Enot et al., 2011). In this chapter, we will focus on the most widely used or recently developed high-level methods for interpreting metabolomic data. Some computational methods may also depend on the platform or instruments used. The reader should refer to Section 2 of the chapter for a detailed explanation of popular platforms such as

Computational methods for interpreting metabolomics
The computational methods for interpreting metabolomics data should be selected according to the aim of a study. If the aim is sample classification and prior information about the sample identity is unknown, unsupervised methods such as hierarchical clustering analysis (HCA), principal component analysis (PCA), or self-organizing maps (SOM) are typically used. On the other hand, if sample identity is known and the aim of a study is to discover characteristic biomarkers (e.g., to search for biomarkers of a disease by comparing samples from healthy and diseased subjects), supervised methods such as linear discriminant analysis, artificial neural networks (ANN) or support vector machines (SVM) can be used. The supervised methods use prior information about sample class and perform better in biomarker discovery. However, if the aim of the study is simply biomarker discovery in samples of known classes, common statistical methods such as ANOVA with multiplicity control can also be used (Jonsson et al., 2005). Below, we categorize and discuss different computational methods according to their usage in practice. We will not specifically differentiate them by mathematical form, such as univariate or multivariate models. It should be noted that most of the methods are multivariate and are often used in combination in practical applications.
Biomarker discovery: Like other "omics" studies, the primary objective of many metabolomics studies is to find biomarkers that discriminate between matched "case" and "control" samples, i.e., metabolites that are altered under different physiological conditions. In pharmaceutical research, metabolomics has been used for biomarker discovery for different diseases, safety markers, or drug mechanism research. However, given the large number of metabolites studied simultaneously, usually with a small sample size, it is very common to find metabolites that appear persuasive but are in fact spurious. Thus, it is of crucial importance to control the rate of false positives (Broadhurst & Kell, 2006). To tackle this problem, many statistical methods have been developed under the term of large-scale hypothesis testing (Benjamini & Hochberg, 1995; Storey, 2002; Efron, 2003, 2004a, 2004b, 2007a, 2007b, 2008; Storey & Tibshirani, 2003; Reiner et al., 2003; Xie et al., 2005). In classical hypothesis testing, the fundamental problem is to control the type I error, the probability that a non-trivial finding is declared while it actually happened by chance. The overall type I error increases with the number of independent hypotheses considered simultaneously. The well-known and widely used strategy to control the overall type I error rate is the Bonferroni correction, in which the critical value for an individual hypothesis test is obtained by dividing the significance level by the number of hypotheses considered. For example, in metabolomics, if the search for discriminating biomarkers is performed using 500 metabolites and the acceptable chance of rejecting one true hypothesis is 0.05, then the Bonferroni-corrected critical value for rejecting an individual null hypothesis for a metabolite should be 0.05/500 = 0.0001. The Bonferroni correction is conservative in the sense that it excludes type I errors at the cost of increasing the potential for type II errors (false negatives) (Broadhurst & Kell, 2006).
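The arithmetic of the Bonferroni example can be made concrete with a minimal sketch. The function names below are hypothetical and chosen only for illustration; the family-wise error formula assumes independent tests.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one false positive among n independent tests,
    each run at level alpha, when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test critical value under the Bonferroni correction."""
    return alpha / n_tests


def bonferroni_reject(p_values, alpha=0.05):
    """Flag the hypotheses whose p-value passes the corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


# The chapter's example: 500 metabolites, overall level 0.05.
print(bonferroni_threshold(0.05, 500))
# Without correction, a false positive is almost certain with 500 tests.
print(familywise_error(0.05, 500))
```

The second printed value illustrates why uncorrected testing across hundreds of metabolites is unreliable, and the first shows how strict the corrected per-metabolite threshold becomes.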
A widely accepted error measure in the microarray literature for large-scale hypothesis testing is the false discovery rate (FDR), the proportion of false positives among all the discoveries. The procedure for controlling the false discovery rate proposed by Benjamini and Hochberg (1995) has been recognized as a breakthrough and has been widely applied or adapted by statistical researchers (Efron, 2004a). Most of the literature assumes that the theoretical null hypothesis is known in advance. However, Efron argued that in large-scale hypothesis testing, as in 'omics' studies, the theoretical null often fails for reasons such as correlations among proteins or genes, unknown confounding factors, or systematic bias (Efron, 2004b, 2007a, 2008). Thus, it is appropriate to estimate the distribution of the null statistics from the data in order to make more meaningful discoveries. Translated to the setting of a metabolomics study, Efron's concept aims to find a subset of metabolites that behave very differently from the majority of the metabolites. Efron's idea has received significant attention from the research field. Applications to metabolite biomarker discovery have not yet been reported but are certainly expected.
Metabolomics-based biomarker discovery has been reported in several studies. In one example, a differential analysis of 291 detected metabolites in sixty-six invasive ovarian carcinomas and nine borderline tumors of the ovary revealed 51 metabolites that were significantly different between borderline tumors and carcinomas, with the FDR controlled at 7.8% (Denkert et al., 2006). For Onchocerca volvulus, LC-MS analysis of an African sample set of 73 serum and plasma samples revealed a set of 14 biomarkers that showed excellent discrimination between Onchocerca volvulus-positive and negative individuals (Denery et al., 2010). Controlling the FDR at 54% using Storey's q-value approach (Storey, 2002) resulted in 194 features selected from a total of 2350 mass features. Among the 194 features, the authors selected the top 14 features for investigation.
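As an illustration of FDR control, the step-up procedure of Benjamini and Hochberg (1995) can be sketched in a few lines. This is a minimal sketch that uses the theoretical null (it does not implement Efron's empirical null); the function name and the example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking the hypotheses rejected at FDR level q.

    Step-up rule: sort the p-values, find the largest rank k such that
    p_(k) <= (k / m) * q, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank  # keep the largest rank that satisfies the bound
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for ten metabolites.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p, q=0.05))
```

Because the rule is step-up, a p-value that fails its own rank-specific bound can still be rejected when a larger p-value further down the sorted list satisfies its bound.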

Data clustering and visualization:
Clustering or unsupervised modeling is useful for class discovery and provides information on data similarity: metabolomic samples clustered or grouped together can be objectively considered to be similar.
Principal components analysis (PCA) is probably the most widely used unsupervised approach to data mining or visualization. PCA is a multivariate technique that transforms the data into a coordinate system in which each new axis (called a principal component, PC) is a linear combination of the original variables. PCs are orthogonal, so that each dimension relates to different data characteristics and sources of variability in a mathematical sense (Enot et al., 2008). As a dimension reduction tool, PCA is very useful for metabolomics data visualization and further data clustering. However, PCA may not work well for metabolomics data in which the differences between groups are minor and obscured by other covariates.
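The idea that a principal component is a linear combination of the original variables can be illustrated with a small standard-library sketch. It extracts only the first component, and it uses power iteration on the covariance matrix, which is a convenience chosen for this example rather than a method prescribed by the chapter; all function names and data values are hypothetical.

```python
import math


def center(data):
    """Subtract the per-variable mean from every sample."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in data]


def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_component(cov, n_iter=100):
    """Leading eigenvector of cov by power iteration (unit length).

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    p = len(cov)
    v = [1.0 / (j + 1) for j in range(p)]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def pca_first_component(data):
    """Return the PC1 loadings and the PC1 score of every sample."""
    centered = center(data)
    pc1 = leading_component(covariance(centered))
    scores = [sum(r[j] * pc1[j] for j in range(len(pc1))) for r in centered]
    return pc1, scores


# Two correlated "metabolite" intensities measured in five samples.
data = [[1, 2.1], [2, 3.9], [3, 6.2], [4, 7.8], [5, 10.1]]
loadings, scores = pca_first_component(data)
print(loadings)
print(scores)
```

The loadings show how strongly each original variable contributes to PC1, and the scores are the coordinates used when samples are plotted in the reduced space.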
The self-organizing map (SOM) is another visualization tool for high-dimensional data (Kohonen, 1998, 2001). The SOM describes a mapping from a higher-dimensional input space to a lower-dimensional map space. To place a vector from the data space onto the map, the node whose weight vector is closest to the data vector is found. Once the closest node is located, it is assigned the values of the vector taken from the data space. The SOM places similar input data in adjacent nodes. Therefore, the SOM forms a semantic map in which similar samples are mapped close together and dissimilar samples far apart. One disadvantage of the SOM is that the final map depends on the order in which the training data are presented. The batch-learning version of the algorithm (Kohonen, 2001) overcomes this problem and yields reproducible maps. SOM has been applied to metabolic profiling for clustering blood plasma (Kaartinen et al., 1998) and NMR spectra of breast cancer tissues (Beckonert et al., 2003b). More recently, Kouskoumvekaki et al. (2008) applied SOM to identify similarities among the metabolic profiles of different filamentous fungi. Meinicke et al. (2008) proposed a one-dimensional SOM for metabolite-based clustering and visualization of marker candidates. In a case study on the wound response of Arabidopsis thaliana, they showed how the clustering and visualization capabilities of SOM can be utilized to identify relevant groups of biomarkers.
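A minimal sketch of a one-dimensional, batch-trained SOM is given below to illustrate the mapping and the reproducibility of batch learning. It is a toy version under stated assumptions: the linear initialisation, the Gaussian neighbourhood, the linearly shrinking width and all parameter values are choices made for this example, not settings taken from the cited studies.

```python
import math


def best_matching_unit(nodes, sample):
    """Index of the node whose weight vector is closest to the sample."""
    dists = [sum((w - x) ** 2 for w, x in zip(node, sample)) for node in nodes]
    return dists.index(min(dists))


def train_som_1d(data, n_nodes=4, n_epochs=20, sigma_start=1.5, sigma_end=0.5):
    """Batch training of a 1-D SOM; deterministic, so the map is reproducible."""
    n_features = len(data[0])
    lo = [min(row[j] for row in data) for j in range(n_features)]
    hi = [max(row[j] for row in data) for j in range(n_features)]
    # Initialise the nodes evenly between the per-feature minimum and maximum.
    nodes = [[lo[j] + k / (n_nodes - 1) * (hi[j] - lo[j]) for j in range(n_features)]
             for k in range(n_nodes)]
    for epoch in range(n_epochs):
        frac = epoch / max(n_epochs - 1, 1)
        sigma = sigma_start + frac * (sigma_end - sigma_start)
        bmus = [best_matching_unit(nodes, row) for row in data]
        new_nodes = []
        for k in range(n_nodes):
            # Each sample pulls node k with a weight that decays with the
            # map distance between k and the sample's best matching unit.
            weights = [math.exp(-((k - b) ** 2) / (2 * sigma ** 2)) for b in bmus]
            total = sum(weights)
            new_nodes.append([sum(w * row[j] for w, row in zip(weights, data)) / total
                              for j in range(n_features)])
        nodes = new_nodes
    return nodes


# Two well-separated groups of hypothetical two-metabolite profiles.
group_a = [[0, 0], [1, 0], [0, 1]]
group_b = [[10, 10], [9, 10], [10, 9]]
nodes = train_som_1d(group_a + group_b)
print([best_matching_unit(nodes, s) for s in group_a])
print([best_matching_unit(nodes, s) for s in group_b])
```

Because every epoch uses all samples at once, the result does not depend on the order of the training data, which is the property of batch learning noted above.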
Hierarchical cluster analysis (HCA), a popular unsupervised learning method, clusters the data to form a tree diagram or dendrogram that shows the relationships between samples (Ebbels, 2007). The algorithm begins by computing the distances between all pairs of samples. Initially, each cluster consists of a single sample. The algorithm then proceeds iteratively, merging the two closest clusters at each step, until all samples are members of a single cluster. The final structure of the resulting clusters depends on the choice of distance function or "linkage" between two clusters, as well as on a similarity cut-off. The most popular choices of linkage are centroid, average, single (nearest neighbor) and complete (farthest neighbor) linkage. The centroid linkage defines the inter-cluster distance as the distance between the cluster centroids. To determine cluster membership, one must decide on a similarity cut-off, which breaks the dendrogram into a number of separate clusters. As an example, Beckonert et al. (2003) used the HCA method to explore a set of toxicology studies. HCA allowed interpretation of the data in terms of the magnitude and site of toxicological effect, and helped to explain misclassifications by other methods.
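The agglomerative procedure can be sketched as follows. This is an illustrative, unoptimised implementation; the cut-off here is expressed as a distance rather than a similarity, and the function names, data and cut-off value are assumptions made for the example.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def linkage_distance(c1, c2, data, method):
    """Distance between two clusters (lists of sample indices)."""
    d = [euclidean(data[i], data[j]) for i in c1 for j in c2]
    if method == "single":      # nearest neighbor
        return min(d)
    if method == "complete":    # farthest neighbor
        return max(d)
    if method == "average":
        return sum(d) / len(d)
    raise ValueError("unknown linkage: " + method)


def agglomerate(data, cutoff, method="average"):
    """Merge the two closest clusters repeatedly; stop at the distance cutoff."""
    clusters = [[i] for i in range(len(data))]  # every sample starts alone
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(clusters[a], clusters[b], data, method)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > cutoff:
            break  # remaining clusters are too far apart to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Five hypothetical samples described by two metabolite intensities.
data = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
print(agglomerate(data, cutoff=3.0, method="average"))
```

Raising the cut-off merges more clusters; with no effective cut-off, the loop continues until a single cluster containing all samples remains, which corresponds to the root of the dendrogram.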
The k-means clustering is a method of cluster analysis that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.In kmeans clustering, the Euclidean distance is used as a distance metric and variance is used as a measure of cluster scatter.The number of clusters k is an input parameter.When performing k-means, it is important to run diagnostic checks for determining the number of clusters in the data set.Thus, k-means is often used in conjunction with other clustering and visualization methods (Ebbels, 2007).2009) applied fuzzy k-means to cluster three gene types of Escherichia coli on the basis of their metabolic profiles and delivered better results than PCA.On the basis of the optimized parameters, the fuzzy k-means was able to reveal main phenotype changes and individual characters of three gene types of E. coli, while PCA failed to model the metabolite data.
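The basic (non-fuzzy) k-means iteration can be sketched as below. This is a minimal illustration of Lloyd's algorithm; taking the first k samples as initial centers is a naive, deterministic choice made for the example, and the helper names and data are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(data, centers):
    """Label every sample with the index of its nearest center."""
    labels = []
    for row in data:
        dists = [squared_distance(row, c) for c in centers]
        labels.append(dists.index(min(dists)))
    return labels


def kmeans(data, k, n_iter=100):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    centers = [list(row) for row in data[:k]]  # naive deterministic start
    for _ in range(n_iter):
        labels = assign(data, centers)
        new_centers = []
        for c in range(k):
            members = [row for row, lab in zip(data, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return assign(data, centers), centers


def within_cluster_ss(data, labels, centers):
    """Total within-cluster sum of squares, a simple measure of scatter."""
    return sum(squared_distance(row, centers[lab]) for row, lab in zip(data, labels))


# Six hypothetical samples forming two groups.
data = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 9), (9, 10)]
labels, centers = kmeans(data, k=2)
print(labels)
print(within_cluster_ss(data, labels, centers))
```

Comparing the within-cluster sum of squares across several values of k is one simple diagnostic for choosing the number of clusters, since the scatter drops sharply once k matches the group structure in the data.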
Clustering of metabolomics data can be hampered by noise originating from biological variation, physical sampling error and analytical error.Bootstrap aggregating (bagging) is a resampling technique that can deal with noise and improves accuracy.Hageman et al. (2006) demonstrated the application of bagged clustering to metabolomics data.It was argued that the bagged k-means should be favored against ordinary k-means clustering when dealing with noisy metabolomics data.
In practice, it is common to combine dimensionality reduction and clustering methods. For example, a sample-based principal component analysis (PCA) is first performed to compute a subset of principal components. Then the metabolite-specific PCA loadings of these components are used for metabolite-based clustering using k-means or hierarchical methods (Pohjanen et al., 2006).
Classification and prediction: While the purpose of clustering is to group similar data together, classification aims at finding a rule that discriminates the classes in an optimal way, as well as selecting the subset of features that are most discriminative or predictive. In contrast to clustering applications, in a classification problem the class labels and the number of classes are known a priori for a subset of the data (the training samples). Once the rule, or classifier, has been determined using the training dataset, it can be used to predict the class label (such as diseased or not) of a test sample.
The k-Nearest Neighbour (kNN) rule for classification may be the simplest of all supervised classification approaches (Ebbels, 2007). Unlike other supervised learning methods, the training phase of kNN consists only of storing the training samples and the corresponding class labels. In the classification (prediction) phase, a test sample is classified by assigning the label which is most frequent among the k nearest training samples. The method requires only the choice of k, the number of neighbors to be considered when making the classification. Greater values of k reduce the effect of noise on the classification, but make boundaries between classes less distinct. A limitation of this classification method is that the classes with more frequent examples tend to dominate the prediction of a test sample. Usually, k is chosen through a cross-validation procedure. kNN has often been used as a comparator for other methods in the literature (Beckonert et al., 2003; Baumgartner et al., 2004).
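A minimal sketch of the kNN rule and of leave-one-out cross-validation for choosing k is given below. It assumes Euclidean distance and simple majority voting; the function names and the two-class toy data are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k=3):
    """Label a query by majority vote among its k nearest training samples."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def loo_accuracy(xs, ys, k):
    """Leave-one-out cross-validation accuracy for a given k."""
    hits = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]   # hold out sample i
        rest_y = ys[:i] + ys[i + 1:]
        hits += knn_predict(rest_x, rest_y, xs[i], k) == ys[i]
    return hits / len(xs)


features = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
classes = ["control"] * 3 + ["treated"] * 3
print(knn_predict(features, classes, (0.5, 0.5), k=3))  # control
for k in (1, 3, 5):
    print(k, loo_accuracy(features, classes, k))
```

On this toy set, k = 5 fails in cross-validation because each held-out sample is outvoted by the three members of the other class, which illustrates how an overly large k lets the more frequent class dominate.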
Partial least squares discriminant analysis (PLS-DA) is a regression extension of PCA that takes advantage of class label information to maximize separation between groups of observations. PLS-DA models the relationship between the class affiliation matrix (Y) and the feature matrix (X), which is a generalization of multiple linear regression. It determines a set of latent variables explaining as much as possible of the covariance between the two matrices. PLS-DA can deal with incomplete datasets and multicollinearity. The output of PLS-DA is the score matrix, which can be plotted as in PCA, and the predictor matrix containing the estimated class affiliation (Ciosek et al., 2005; Trygg & Lundstedt, 2007). The orthogonal PLS (OPLS) method (Cloarec et al., 2005; Trygg & Lundstedt, 2007) is a recent modification of the PLS method. The main idea of OPLS is to separate the systematic variation in X into two parts, one that is linearly related to Y and one that is orthogonal to Y. The OPLS method provides a prediction similar to that of PLS. However, the interpretation of the models is improved because the structured noise is modeled separately from the variation common to X and Y. Analogous to PLS, when the Y matrix is class affiliation, the corresponding analysis is named OPLS-DA. Cloarec et al. (2005) illustrated the applicability of the method in combination with statistical total correlation spectroscopy to 1H NMR spectra of urine from a metabonomic study of a model of insulin resistance based on the administration of a carbohydrate diet to three different mouse strains. Tapp and Kemsley (2009) recently discussed similarities and differences between PLS-DA and OPLS-DA with a focus on the usage of OPLS in the analytical chemistry literature. They concluded that the two methods are very similar and neither outperforms the other, and that the reported discrepancies in the literature must be due to differences in the implementation details, or some otherwise "unfair" comparison between the methods.
An artificial neural network (ANN) is a widely used non-linear data modeling tool (Bishop, 1995). An ANN is a computational model that is inspired by the structure and functional aspects of biological neural networks. An ANN consists of a layered network of nodes with simple linear or sigmoid activation functions. The most widely used type of ANN is the multilayer perceptron (MLP), which has at least three layers: the input layer, one or more hidden layers, and the output layer. The most attractive feature of an MLP is its capacity to approximate any continuous function to arbitrary precision, given enough nodes with sigmoid-type activation functions in the hidden layers. ANNs have been applied to the classification of tumor cells by different researchers (Maxwell et al., 1998; Ott et al., 2003).
One disadvantage of ANNs is that a non-linear optimization algorithm is needed to train a neural network. The optimization procedure can easily become stuck in local minima while searching for the optimal parameters. Thus, a globally optimal solution is not guaranteed. Moreover, it is difficult to interpret the connection weights to gain biological insight, for example for feature ranking.
Support vector machines (SVMs) are a supervised learning technique for classification and regression (Cristianini & Shawe-Taylor, 2000). A support vector machine constructs an optimal hyperplane for classification. The hyperplane is constructed in such a way that it has the largest distance to the nearest training data points of any class. In contrast to ANNs, SVMs are trained by convex programming, such as quadratic programming, and can therefore find the globally optimal solution efficiently. The computational complexity of SVM training depends mainly on the number of samples rather than on the dimension of each sample, which mitigates the curse of dimensionality. When cases are not linearly separable, appropriate kernel functions can be chosen to transform the original data into a high-dimensional feature space. SVMs are well known for their good generalization capacity. For nearly a decade, SVMs have been used in the field of bioinformatics for classifying and evaluating gene expression microarray data (Furey et al., 2000; Guyon et al., 2002). Mahadevan et al. (2008) compared the performance of PLS-DA multivariate analysis to that of SVMs and showed that SVMs were superior to PLS-DA in terms of predictive accuracy, yielding a better predictive model with fewer features.
Computational methods for identification of metabolites: Metabolomics studies are targeted at identifying and quantifying all metabolites in a given biological context. One central problem is the translation of the measured mass into molecular formulae. Given the observed mass and the candidate atoms, the first problem is to find all the possible elemental compositions by solving the Diophantine equation:

$$ M = \sum_{i} n_i\, m_i , $$

where $M$ is the observed mass, $m_i$ is the mass of the ith atom and $n_i$ is a non-negative integer giving the number of atoms of type i. The Diophantine equation is the basis on which much of the mass spectrometry software obtains compositions. Mathematically, it usually has many integer solutions. Among all the mathematical solutions, we then seek the sets of integers $n_i$ that are chemically feasible considering chemical contextual information such as the valence rules, double-bond equivalents or exact mass (Meija, 2006). Even with very high mass accuracy (<1 ppm), many chemically possible formulae can be obtained in higher mass regions. To further reduce the number of potential elemental compositions, it is necessary to utilize the isotope abundance pattern (Kind & Fiehn, 2006).
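The enumeration of elemental compositions can be illustrated with a small brute-force search. The sketch below is an assumption-laden toy, not the algorithm of any particular software: it is restricted to the elements C, H, N and O, treats the observed mass as that of a neutral molecule, and uses a deliberately crude plausibility filter (a non-negative, integer double-bond equivalent) as a stand-in for the full set of chemical rules. The function names are hypothetical.

```python
# Monoisotopic masses (u) of the most abundant isotopes.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def compositions(target_mass, ppm=5.0, elements=("C", "H", "N", "O")):
    """Enumerate non-negative integers n_i with sum(n_i * m_i) within ppm of target."""
    tol = target_mass * ppm * 1e-6
    results = []

    def search(idx, remaining, counts):
        m = MONOISOTOPIC[elements[idx]]
        if idx == len(elements) - 1:
            # Last element: only the nearest integer count can close the gap.
            n = round(remaining / m)
            if n >= 0 and abs(remaining - n * m) <= tol:
                results.append(dict(zip(elements, counts + [n])))
            return
        for n in range(int((remaining + tol) / m) + 1):
            search(idx + 1, remaining - n * m, counts + [n])

    search(0, target_mass, [])
    return results


def dbe(formula):
    """Double-bond equivalents for a CHNO formula: C - H/2 + N/2 + 1."""
    return formula["C"] - formula["H"] / 2 + formula["N"] / 2 + 1


def chemically_plausible(formula):
    """Crude filter (an assumption): DBE must be a non-negative integer."""
    value = dbe(formula)
    return value >= 0 and value == int(value)


candidates = compositions(180.06339, ppm=5.0)   # monoisotopic mass of C6H12O6
plausible = [f for f in candidates if chemically_plausible(f)]
print(len(candidates), len(plausible))
```

The purely mathematical solution list is typically longer than the filtered one, which mirrors the need for chemical constraints and isotope patterns described above.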
The identification of small metabolites has been seen as one of the bottlenecks in interpreting metabolomics data. Neumann & Bocker (2010) provided a review focusing on the computational methods for electrospray ionization (ESI) mass spectrometry. One of the most common methods for the identification of compounds using mass spectrometry is the comparison with spectra of authentic standards. The Metabolomics Standards Initiative (MSI) has defined several confidence levels for the identification of non-novel chemical compounds, ranging from level 1 for a rigorous identification to unidentified signals at level 4 (Sumner et al., 2007). The difference between levels 1 and 2 is that the former requires comparison with authentic standards based on in-house data measured under identical analytical conditions, whereas the latter allows one to use literature or external databases. Level 1 or level 2 identifications are based on a comparison of "exact mass and isotope pattern". Even with the most exact mass and isotope pattern, the identification will be limited to the elemental composition, and many known metabolites in databases such as KEGG or PubChem share the same sum formula. For all other MSI levels, the "identification" usually reduces to an annotation with lower levels of confidence (Neumann & Bocker, 2010).
When a reference spectrum is used, a similarity or distance function is needed for selecting database entries. The most basic similarity functions are those based on counting the number of matching peaks between a query spectrum and each of the database spectra. For this, both spectra can be considered as binary vectors with 0's and 1's for "peak absent" and "peak present", respectively (Neumann & Bocker, 2010). Common distance functions on binary vectors are the Hamming distance (counting any difference) or the Jaccard coefficient (the fraction of matching peaks). Besides counting matches, other measures also consider the actual mass and intensity of the peaks, such as the Euclidean distance, the probability-based matching (PBM), the normalized dot product (NDP), and a modified cosine distance for the database search of EI spectra (Stein, 1994). Oberacher et al. (2009) proposed and optimized a search function for tandem mass spectrometry (MS/MS) spectra based on a combination of relative and absolute match probabilities, which combines the principle of peak counting and summed intensities of matching peaks. The X-Rank algorithm (Mylonas et al., 2009) for matching MS/MS spectra is based on probability calculations. It sorts the peak intensities of a spectrum and then establishes a correlation between two sorted spectra. X-Rank computes the probability that a rank from an experimental spectrum matches a rank from a reference library spectrum. The solution requires training on a representative dataset: in a training step, characteristic parameter values are generated for a given data set. Identification of small compounds is still challenging, especially for compounds that have not been recorded in any library or structure database. Methods for these tasks are highly sought.
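The simplest of these similarity functions can be sketched as follows. The code assumes that peaks are matched by binning m/z values to a fixed width (1.0 by default, an arbitrary choice), and it implements the plain cosine form of the normalized dot product rather than the mass-weighted variants used in library search software. All function names and the two toy spectra are hypothetical.

```python
from math import sqrt


def bin_spectrum(peaks, bin_width=1.0):
    """Map (m/z, intensity) peaks to {bin index: summed intensity}."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def jaccard(a, b):
    """Fraction of occupied bins that are shared (binary peak present/absent)."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0


def hamming(a, b):
    """Number of bins occupied in exactly one of the two spectra."""
    return len(a.keys() ^ b.keys())


def normalized_dot_product(a, b):
    """Cosine of the angle between the two intensity vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


query = bin_spectrum([(41.0, 30), (57.1, 100), (85.0, 40)])
reference = bin_spectrum([(41.0, 30), (57.0, 100), (99.1, 10)])
print(jaccard(query, reference))   # 0.5
print(hamming(query, reference))   # 2
print(round(normalized_dot_product(query, reference), 3))
```

The example shows why intensity-aware measures are often preferred: the two spectra share only half of their peaks, yet the dominant peaks agree, so the normalized dot product is much closer to 1 than the Jaccard coefficient.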

Computational methods for metabolic pathway analysis using metabolomic data
Metabolomics data provide a series of snapshots of cellular metabolism, which can be combined with metabolic flux data for further analysis. Metabolic pathways are the true functional units of metabolic systems (Schilling et al., 2000b). Finding biochemically plausible pathways between metabolites in a metabolic network is a central problem in computational metabolic modeling. Mathematical modeling approaches to metabolic regulation analysis involve different levels of detail and complexity, ranging from detailed kinetic models, stoichiometric analysis and structural kinetic models to large-scale topological network analysis (Steuer, 2007).
Detailed kinetic modeling of metabolic pathways, based on explicit enzyme-kinetic rate equations, is a bottom-up approach towards more comprehensive large-scale dynamic models. It allows for the most detailed quantitative evaluation of the dynamics of metabolic systems, which is very important for improving the understanding of metabolic regulation and control. Metabolic control analysis (MCA) is the culminating mathematical theory built on kinetic models; it describes the control and regulatory properties of metabolic systems (Heinrich & Schuster, 1996; Fell, 1997).
A metabolic network is a collection of enzyme-catalyzed reactions and transport processes that serve to consume substrate metabolites and generate final metabolites. The dynamics of a metabolic system can be described by a set of ordinary differential equations:

$$ \frac{dX_i}{dt} = \sum_{j=1}^{n} S_{ij}\, v_j(X, k), \qquad i = 1, \dots, m, $$

where $X_i$ represents the concentration of the ith metabolite, $S_{ij}$ stands for the stoichiometric coefficient of reactant i in the jth reaction, and $v_j(X, k)$ corresponds to the flux through the jth reaction. The vector $v(X, k)$ consists of nonlinear enzyme-kinetic rate functions, which depend on the concentrations $X$ and a set of kinetic parameters $k$. Given the functional form of $v(X, k)$, the set of kinetic parameters $k$ and an initial state $X(0)$, the differential equations can be solved numerically to obtain the time-dependent behavior of all metabolites under consideration. The stoichiometric matrix $S$ is an m by n matrix, where m corresponds to the number of metabolites and n is the total number of fluxes taking place in the network.
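To make the numerical solution concrete, the following sketch integrates dX/dt = S v(X, k) for a hypothetical two-metabolite chain (constant influx into A, conversion of A to B, removal of B) with simple mass-action rate laws. The network, the rate laws, the parameter values and the use of explicit Euler integration are all illustrative assumptions; stiff or larger systems would call for a dedicated ODE solver.

```python
# Rows are metabolites (A, B); columns are reactions (influx, A -> B, B -> out).
S = [[1, -1, 0],
     [0, 1, -1]]


def rates(x, k):
    """Hypothetical rate laws: constant influx, then two mass-action steps."""
    a, b = x
    return [k[0], k[1] * a, k[2] * b]


def dxdt(x, k):
    """Right-hand side of dX/dt = S v(X, k)."""
    v = rates(x, k)
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]


def integrate(x0, k, t_end, dt=0.001):
    """Explicit Euler integration from t = 0 to t_end."""
    x = list(x0)
    for _ in range(int(round(t_end / dt))):
        derivative = dxdt(x, k)
        x = [xi + dt * di for xi, di in zip(x, derivative)]
    return x


k = (1.0, 2.0, 0.5)
final_state = integrate([0.0, 0.0], k, t_end=40.0)
print(final_state)  # approaches the steady state A = k1/k2, B = k1/k3
```

For this toy system the long-time solution can be checked by hand: setting the derivatives to zero gives A = k1/k2 = 0.5 and B = k1/k3 = 2.0.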
The stoichiometric analysis approach takes advantage of the structural nature of the metabolic system. Knowledge of the stoichiometry puts constraints on the feasible flux distributions, which can be utilized to model the functional capabilities of metabolic networks (Varma & Palsson, 1994; Edwards & Palsson, 2000; Stelling et al., 2002; Famili et al., 2003; Price et al., 2003).
The pathway structure should be an invariant property of the network along with stoichiometry. Under steady state, the set of ordinary differential equations reduces to a set of linear equations:

$$ S\, v(X^0, k) = S\, v^0 = 0 , \qquad (2) $$

which is typically underdetermined, as the number of reactions exceeds the number of participating metabolites. The set of all solutions $v^0$ to the steady-state equation (2) is called the null space of $S$. A set of basis vectors, each corresponding to a steady-state biochemical pathway, can be selected to describe the null space. Additional constraints from biological insights are necessary to determine the system completely (Schilling et al., 2000b), which generally results in a set of linear equalities and inequalities. The set of feasible solutions under biochemical constraints forms a convex steady-state flux cone. The convex analysis of biochemical networks was founded by Clarke (1980, 1988). The study of convex flux cones utilizes methods and concepts rooted in linear algebra and optimization, and forms the underlying mathematical structure for metabolic pathway analysis.
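As a small worked illustration (a hypothetical three-reaction chain, not taken from the cited studies), consider an uptake flux $v_1$ producing metabolite A, a conversion $v_2$ of A into B, and a secretion flux $v_3$ removing B. The steady-state condition then reads

$$ S\, v^0 = \begin{pmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix} = 0 \quad \Longrightarrow \quad v_1 = v_2 = v_3 . $$

With m = 2 metabolites and n = 3 reactions, $\operatorname{rank}(S) = 2$, so the null space has dimension $n - \operatorname{rank}(S) = 1$ and is spanned by $(1, 1, 1)^T$. Adding the irreversibility constraints $v_j \ge 0$ restricts the solutions to the ray $\alpha\,(1, 1, 1)^T$ with $\alpha \ge 0$, which is the flux cone of this toy network; an assumed capacity bound such as $v_1 \le v_{\max}$ would truncate the ray to a finite segment.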
Flux balance analysis (FBA) is a computational approach to reducing the admissible flux space (Schilling et al., 1999; Edwards & Palsson, 2000). FBA optimizes an objective function, such as maximal biomass yield or maximal energy production, over the steady-state flux space, and has resulted in many applications (Papin et al., 2004; Almaas et al., 2004; Stephanopoulos et al., 2004). Using FBA, in silico studies on the systemic properties of the Haemophilus influenzae and E. coli (Edwards & Palsson, 2000) metabolic networks have been completed. Schilling et al. (2000a) used FBA to explore the metabolic capabilities and predict the functions of a sub-system of E. coli under various substrate conditions.
A metabolic network can be decomposed into distinct pathways, termed elementary flux modes (EFMs). An EFM is a minimal set of reactions capable of working together in a steady state, and the set of EFMs is unique for a given metabolic network. Another closely related concept is that of extreme pathways, which are a subset of the elementary modes (Klamt & Stelling, 2003). All feasible flux vectors can be described as non-negative linear combinations of EFMs. The concept of EFMs has resulted in a vast number of applications for metabolic network analysis (Stelling et al., 2002; Schuster et al., 2002; Klamt & Schuster, 2002; Klamt & Gilles, 2004; Klamt et al., 2006). For medium-sized metabolic networks, software packages have been developed for the computation of elementary flux modes (Hoops et al., 2006; Klamt et al., 2007). Owing to a combinatorial explosion of the number of elementary vectors, this approach becomes computationally intractable for genome-scale networks. To develop an analysis approach that is computationally feasible even for genome-scale networks, Urbanczik & Wagner (2005) proposed to focus on the conversion cone, a projection of the flux cone, which describes the interaction of the metabolism with its external chemical environment. The method for calculating the elementary vectors of this cone was applied to study the metabolism of Saccharomyces cerevisiae.
Stoichiometric analysis does not incorporate dynamic properties into the description of the system. Steuer (2006, 2007) proposed a structural kinetic modeling approach that augments the stoichiometric analysis with kinetic properties. The idea of the proposed approach is to use a local linear approximation of the explicit kinetic model to capture the dynamic response to perturbations, the stability of a metabolic state, and the transition to oscillatory behavior. The local linear approximation is obtained from a Taylor series expansion of the metabolic system. The linear term of the expansion is the derivative of the kinetic rate equations with respect to the metabolite concentrations X at a given state, which usually requires knowledge of the enzyme-kinetic rate equations. Even in the absence of enzyme-kinetic information, it is still possible to specify the structure of the linear term. The structural kinetic modeling approach allows quantitative conclusions about the possible dynamics of the system, based on only a minimal amount of additional information.
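For a kinetic model written as $dX/dt = S\, v(X, k)$, the linearization described above can be sketched as follows. This is the generic Taylor expansion only; the particular normalization of concentrations and fluxes used in the cited structural kinetic modeling papers is not reproduced here. Writing $X(t) = X^0 + \delta X(t)$ around a steady state $X^0$ with $S\, v(X^0, k) = 0$ and keeping only the linear term gives

$$ \frac{d\,\delta X}{dt} \approx J\, \delta X , \qquad J = S \left. \frac{\partial v}{\partial X} \right|_{X^0} . $$

The steady state is locally stable if all eigenvalues of the Jacobian $J$ have negative real parts, and a pair of complex-conjugate eigenvalues crossing the imaginary axis signals the onset of oscillations. Which entries of $\partial v / \partial X$ can be nonzero is fixed by which metabolites influence which reactions, so these entries can be treated as free parameters when explicit rate equations are unavailable.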
The extension of detailed kinetic models to whole-cell models faces some fundamental difficulties, including the absence of comprehensive measured kinetic parameter values, the observed inconsistency in the available kinetic data, and the computational complexity of such models. Traditionally, kinetic models are constructed using rate equations derived to describe conditions in vitro and thus rely on the use of kinetic parameters measured in vitro. However, the conditions under which in vitro experiments are performed are often very different from those inside the cell. Thus, in vitro kinetic rates and parameters describe enzymatic behaviors that may not truly represent the physiological kinetic behavior observed in the cell. Several methods have been proposed to address this issue by incorporating in vivo measurements in constructing kinetic models (Visser & Heijnen, 2003) or by estimating kinetic parameters in biochemical networks using measured variables (Lei & Jorgensen, 2001; Moles et al., 2003; Segre et al., 2003). These methods require considerable mathematical effort or utilize nonlinear optimization techniques. Famili et al. (2003) proposed an approach for combining steady-state in vivo data with a constraint-based modeling approach to determine all candidate numerical values of kinetic constants. The kinetic solution space, termed the k-cone, contains all the allowable numerical values of the kinetic constants. The k-cone is obtained by approximating the nonlinear kinetic rate laws as linear or bilinear functions of the kinetic constants. The k-cone approach can be used to determine the consistency between kinetic values measured in vitro and in vivo concentration and flux measurements when used in a network-scale kinetic model. To calculate the relationship between kinetic parameters measured in vitro and the k-cone, optimization methods were essential. The approach was successful in determining whether in vitro measured kinetic values used in the reconstruction of a kinetic-based model of Saccharomyces cerevisiae central metabolism could reproduce in vivo measurements.
Despite the number of studies published, information gained from theoretical or experimental metabolic networks has not yet fully enabled the probing of biochemical pathway structure with the aim of detecting novel metabolic routes (Fiehn, 2007).

Computational methods to integrate metabolomic data with other "omics" datasets
It is also becoming clear that any single "omics" approach may not be sufficient to characterize the complexity of biological systems, and an integrated "omics" approach may be a key to deciphering complex biological systems (Gygi et al., 1999; Zhang et al., 2010). In general, integrated analysis of a metabolomics dataset with other types of "omics" datasets can increase both the dimension of information sources and the statistical power, in order to generate a conclusion with high confidence. In recent years, several computational methods have been applied to integrate metabolomics with other "omics" datasets, and the results demonstrated that better pattern recognition and association identification can be achieved when proper mathematical, statistical and other computational tools are applied.
Simple correlation analysis, using Pearson's or Spearman's correlation, has been used to assess the degree of association between metabolomic and transcriptomic data. For example, by integrating metabolomic and transcriptomic datasets, it has been used to distinguish different potato tuber systems and to determine the relationship between genes and their paired metabolites (Urbanczyk-Wochniak et al., 2003, 2007).
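The two correlation coefficients can be computed as in the following minimal sketch, which assumes paired measurements of one transcript and one metabolite across the same samples. Spearman's coefficient is obtained as the Pearson correlation of the ranks, with tied values sharing their mean rank. The function names and the toy values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


transcript = [1.0, 2.0, 3.0, 4.0, 5.0]
metabolite = [1.0, 4.0, 9.0, 16.0, 25.0]   # monotone but nonlinear in transcript
print(round(pearson(transcript, metabolite), 3))   # 0.981
print(round(spearman(transcript, metabolite), 3))  # 1.0
```

The toy data show the practical difference between the two measures: a monotone but nonlinear relationship yields a perfect rank correlation while the Pearson coefficient stays below 1.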
Because of the high dimensionality of metabolomic data and other "omics" data, dimension reduction tools such as PCA can demonstrate their power in the integrated analysis. PCA has been used in two different but related scenarios: i) PCA is applied to each "omics" dataset separately to reduce its dimensionality, so that the "omics" data are ready for integration; ii) PCA is applied directly to the integrated metabolomics and other "omics" data to identify a particular pattern. In scenario 1, metabolomic data or other "omics" data are not directly used in any integrated analysis. Rather, PCA is used to reduce the dimension of each dataset, so that researchers can focus on the most important components of the metabolomic data or other "omics" data (Urbanczyk-Wochniak et al., 2003; Rubingh et al., 2009; van den Berg et al., 2009). In scenario 2, PCA is used directly to identify patterns in the metabolomic data integrated with other "omics" data. For example, it was applied to integrated metabolomic and proteomic data to reveal clustering of two genotypes (Weckwerth et al., 2004). While PCA can also be used for exploring polynomial relationships and for multivariate outlier detection, the method itself is restricted to linear relationships.
In both correlation and PCA analysis, the roles of all variables are the same and they are interchangeable. They are used to explore associations between factors. In some other analyses, some factors (independent variables X) are used to explain or predict the variable of main interest (dependent variable Y). For example, PLS is a statistical method that models Y on X through a linear relationship. Rather than considering all independent variables as regressors in a multivariate regression analysis, PLS regresses Y on a small number of latent components constructed from X (Garthwaite, 1994). The method has been used to model metabolomic variables as a function of the transcriptome profiles (Pir et al., 2006). The analysis allowed discrimination between the effects of the growth media, dilution rate and deletion of specific genes on the transcriptomic and metabolomic profiles (Pir et al., 2006). The method was also used to relate quantifiable phenotypes of interest, such as protease activity or productivity, to the concentrations of each of the metabolites determined (Braaksma et al., 2011). The analysis revealed that various sugar derivatives correlated with glucoamylase activity.
As an extension of PLS, Le Cao and colleagues proposed a sparse PLS approach to combine integration and simultaneous variable (e.g., gene) selection in one step (Le Cao et al., 2008, 2009). In this approach, PLS is penalized by the sum of the absolute values of the coefficients through the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996), therefore automatically eliminating variables (e.g., genes) with negligible effects. This model selection approach, together with the smoothly clipped absolute deviation approach (Fan & Li, 2001), is effective in analyzing data with sparsity (e.g., when only a few genes have significant effects).
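The way an absolute-value penalty eliminates variables can be illustrated with the soft-thresholding operator, which is the standard closed-form solution for a LASSO-type penalty applied to a single weight. The sketch below applies it to a hypothetical vector of loading weights; whether it matches the cited sparse PLS implementation in detail is an assumption, and the weights and the penalty value are made up for illustration.

```python
def soft_threshold(weights, penalty):
    """Shrink each weight toward zero by penalty; small weights become exactly 0."""
    shrunk = []
    for w in weights:
        magnitude = abs(w) - penalty
        if magnitude <= 0:
            shrunk.append(0.0)           # variable is dropped from the model
        else:
            shrunk.append(magnitude if w > 0 else -magnitude)
    return shrunk


def selected_variables(weights):
    """Indices of variables that survive the penalty."""
    return [i for i, w in enumerate(weights) if w != 0.0]


loading_weights = [0.9, -0.05, 0.3, -0.6]      # hypothetical gene weights
sparse_weights = soft_threshold(loading_weights, penalty=0.1)
print(selected_variables(sparse_weights))       # [0, 2, 3]
```

Increasing the penalty removes more variables, so the penalty acts as the tuning parameter that controls how sparse the resulting model is.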
The methods previously discussed in this section, including Pearson correlation, PCA, and PLS, are all methods for exploring linear relationships. On the other hand, kinetic models and artificial networks may be more appropriate when nonlinearity occurs. In an analysis integrating metabolomics and pharmacokinetics (or nutrikinetics), Van Velzen et al. (2009) presented a one-compartment nutrikinetic model with first-order excretion, a lag time, and a baseline function, which was fitted to the time courses of selected biomarkers based on metabolomic data. A kinetic model was also used to model the relationship between enzyme kinetics and intracellular metabolites through a two-substrate Michaelis-Menten equation with competitive substrate inhibition or competitive product inhibition (Schroer et al., 2009). The kinetic constants were estimated by nonlinear regression of initial rate measurements. Martense & Vanrolleghem (2010) summarized a few other modeling approaches. Compared to data-driven unsupervised analysis, mathematical modeling may provide a meaningful relationship for a better understanding. However, because the modeling is generally based on approximation under some restrictive assumptions, a simple model may not precisely describe the complex biological system.
Artificial networks are another way to cope with nonlinearity. A batch-learning self-organizing network was utilized to classify the metabolomes and the transcriptomes according to their time-dependent pattern of changes (Kanaya et al., 2001); the results showed that the metabolomes and transcriptomes regulated by the same mechanism tended to be clustered together (Hirai et al., 2004, 2005). A batch-learning self-organizing network is an artificial neural network that is trained using unsupervised learning to produce a low-dimensional representation of the input space of the training samples. Network-embedded thermodynamic analysis (NET analysis) has been presented as a framework for mechanistic and model-based analysis of metabolite data. By coupling the data to an operating metabolic network via the second law of thermodynamics and the metabolites' Gibbs energies of formation, NET analysis allows functional principles to be inferred from quantitative metabolite data; for example, it identifies reactions that are subject to active allosteric or genetic regulation, as exemplified with quantitative metabolomic data from E. coli and S. cerevisiae (Kummel et al., 2006). A network typically produces a graph representing the global relationships. In a review article, Feist et al. (2006) classified studies using networks into three categories: studies that use a reconstruction to examine topological network properties, studies that use a reconstruction in constraint-based modeling for quantitative or qualitative analyses, and studies that are purely data driven. The mathematical framework of some of these networks is graph theory, which provides a visual representation of the complex biological system. However, as more features are involved, the network approach often becomes too complicated to yield a clear picture.
Bayesian graphical modeling approaches infer biological regulatory networks by integrating expression levels of different types. Specific sequence or structure information can be incorporated into the prior probability models. Webb-Robertson et al. (2009) presented a Bayesian approach to integration that uses posterior probabilities to assign class memberships to samples using individual and multiple data sources; these probabilities are based on lower-level likelihood functions derived from standard statistical learning algorithms. The approach was demonstrated by integrating two proteomic datasets and one metabolic dataset from microbial infections of mice; the results showed that integration of the different datasets improved classification accuracy to ~89%, compared with ~83% for the best individual dataset.
Integrative interpretations of data from different "omics", including metabolomics, are still at an early stage of development. More thoughtful interpretation methods that are capable of revealing biology at a system level are yet to come. Collaboration between mathematicians, statisticians, bioinformaticians and experimental biologists will be the key to the success of these efforts.

Final remarks
Although comprehensive coverage of the metabolome in cells is not yet possible, significant advancements in the large-scale profiling of metabolites have been achieved in recent years, and these analyses have offered unique insight into the metabolic and regulatory networks of cells. In this chapter, we first reviewed some of the widely used and emerging technologies for metabolomics analysis, and then focused on recent progress in developing computational methodologies to improve the biological interpretation of high-throughput metabolomic data. In addition, we presented some mathematical, statistical and bioinformatics methods that have been utilized for the integration of metabolomics data with other types of "omics" datasets, and described how this integrative analysis has improved our interpretation of biological systems.

Fig. 1. Schematics of the four MS-based approaches for single cell metabolomics (Heinemann & Zenobi, 2011).
GC-MS, LC-MS and NMR etc. Some software packages have been developed for interpreting metabolomic data in recent years; although a review of the software tools is beyond the scope of this chapter, most of the software utilizes the statistical methods discussed here.

Fuzzy k-means (also called fuzzy c-means) is an extension of the k-means clustering technique based on fuzzy logic. While k-means discovers hard clusters (a point belongs to only one cluster), fuzzy k-means is a more statistically formalized method that discovers soft clusters, in which a particular point can belong to more than one cluster with a certain probability. The membership of each sample in each cluster is calculated and represented by a membership value between 0 and 1, instead of just 0 or 1 as in hard clustering. Cuperlovic-Culf et al. (2009) presented the application of the fuzzy k-means clustering method to the classification of samples based on metabolomics 1D 1H NMR fingerprints. The sample classification was performed on NMR spectra of cancer cell line extracts and of urine samples from type 2 diabetes patients and animal models. The fuzzy k-means clustering method allowed more accurate sample classification in both datasets relative to the other tested methods, including PCA, HCA and k-means clustering.