Chemometrics-Based TLC and GC-MS for Small Molecule Analysis: A Practical Guide

Nowadays, thin-layer chromatography (TLC) and gas chromatography/mass spectrometry (GC-MS) instruments can produce more data than ever before. Mathematical and statistical tools provide the key to resolving this information overload. In this chapter, a practical guide is provided for the TLC and GC-MS analysis of short-chain fatty acids (SCFAs), amino acids, and monosaccharides. A methodology for extracting the chromatographic data and transforming them into a format suitable for chemometrics is described. Furthermore, a procedure for chemometric analysis based on principal component analysis and clustering analysis is suggested.


Introduction
Chemometrics can be defined as the application of mathematical and/or statistical methods to chemical analysis [1]. Like any aspect of mathematics, it requires the use of numbers obtained from measured values of chemical variables [1,2]. A large number of instrumental techniques are available for analytical chemistry, each with advantages for specific metabolites or matrices. Broadly speaking, these include spectroscopic (infrared, ultraviolet-visible, X-ray, etc.), mass spectrometric, chromatographic, electrochemical, and thermal methods, as well as hybrid techniques [3,4]. Advances in electronics and computer-assisted data processing have provided powerful instruments that generate more information than can be analyzed with traditional data analysis methods. Chemometrics has therefore become the preferred choice for the analysis of these complex data [1,5]. Chromatography comprises the separation techniques based on the partition of the analytes between a mobile phase and a stationary phase. It is extensively used in the food industry, pharmaceutical sciences, and natural products science and technology [3,4] due to its high sensitivity, selectivity, and reproducibility. The versatility of chromatographic techniques allows the analysis of a wide range of metabolites, from low-molecular-weight aliphatic gases to complex high-molecular-weight polymeric substances.
In this chapter, we describe the state of the art in the use of chemometrics-based thin-layer chromatography (TLC) and gas chromatography coupled to mass spectrometry (GC-MS), paying special attention to the analysis of low-molecular-weight metabolites such as amino acids, short-chain fatty acids (SCFAs), and monosaccharides (referred to herein as small molecules). Schematic representations of the advantages and some characteristics of TLC and GC-MS for small molecule analysis are shown in Figure 1.

TLC analysis of small molecules

Generalities
TLC is a planar chromatographic technique extensively used due to its rapidity, versatility, and affordable equipment [6]. It has gradually been relegated to the role of a screening technique; however, for some metabolites, TLC analysis may still represent the best option. The versatility of TLC results from the wide variety of stationary phases available for separation. Common stationary phases include silica with different physical (e.g., pore diameter) and chemical traits (e.g., silanized silica, Cn alkyl-bonded silica) as well as other sorbents such as celluloses, aluminas, and polyamides [6][7][8][9]. In addition, the separation of specific metabolites can be largely improved by varying the mobile phase, from pure nonpolar solvents to complex mixtures of solvents with different polarities [7][8][9].
The principal measure that can be obtained from a TLC analysis is the retention factor (Rf). This parameter describes the migration of components over a TLC plate and is defined as the ratio of the distance traveled by the center of a spot to the distance traveled by the solvent front, with both distances measured from the starting point [7]. By definition, Rf values always range from 0 to 1, or from 0 to 100 if multiplied by 100 (hRf) to avoid the decimal point. This parameter is used for component identification, since every compound has a specific Rf value for every specific mobile phase. If other compounds comigrate and appear with the same or a similar Rf value, it is preferable to improve the separation by changing the mobile phase or using a specific visualization reagent [6][7][8]. Although the Rf value is very useful for identifying components in TLC and in some cases gathers enough information to qualify and semiquantify a sample [7,10], it is inadequate for chemometrics-based analysis. For more precise TLC data handling, it is necessary to complement Rf data with numerical values that correspond to the abundance of the retained components [10].
An elementary approach for quantitative purposes is the visual comparison of the spot/band intensity of a known sample aliquot with the intensities of a concentration series of known standards, all developed on the same TLC plate. This approach offers semiquantitative results, with precision (expressed as absolute deviation) and accuracy (expressed as experimental error) ranging from 10 to 30% [6][7][8][9]. Concentration estimation via visual comparison depends on the interpretation of the analyst. To reduce interanalyst variation, the concentrations of the standard series should give intensities very close to that of the sample aliquot. Other simple approaches for quantitative TLC determination include measuring the area of the spot or band directly by approximating it to a regular geometric figure. However, with these approaches there is still an error that depends on the precision of the approximation.
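The definition of Rf and hRf given above can be illustrated with a short Python sketch. The function names and the example distances are hypothetical and serve only to show the arithmetic; they are not taken from any TLC software.

```python
def retention_factor(spot_distance_mm, front_distance_mm):
    """Rf = distance traveled by the spot center / distance traveled by the solvent front.

    Both distances are measured from the starting (application) line.
    """
    if front_distance_mm <= 0:
        raise ValueError("solvent front distance must be positive")
    if not 0 <= spot_distance_mm <= front_distance_mm:
        raise ValueError("spot must lie between the start line and the solvent front")
    return spot_distance_mm / front_distance_mm


def h_rf(spot_distance_mm, front_distance_mm):
    """hRf is simply Rf multiplied by 100."""
    return 100.0 * retention_factor(spot_distance_mm, front_distance_mm)


# Hypothetical measurement: spot center at 32 mm, solvent front at 80 mm.
rf = retention_factor(32.0, 80.0)
print(f"Rf = {rf:.2f}, hRf = {h_rf(32.0, 80.0):.0f}")
```

The range checks mirror the statement that Rf values lie between 0 and 1; a value outside that range indicates a measurement error rather than a chromatographic result.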
SCFAs are produced by dietary fiber fermentation in the colon [13]. Some SCFAs, like butyric acid, have medical relevance since they contribute to colon health by working as anti-inflammatory and energy-source compounds [11][12][13]. For TLC analysis of SCFAs, it is necessary to consider the autoesterification and autopolymerization reactions that some organic acids, particularly lactic acid, undergo when they are dehydrated from a dilute aqueous solution [14]. These reactions are evidenced by the presence of two or more spots in a chromatogram of the respective acid.
The term "amino acids" is used to group the organic compounds that contain amine and carboxyl functional groups, along with a side-chain group. Some of these compounds are the building units of proteins. Commonly, this term refers only to the 20 amino acids of the genetic code, but there exist about 500 naturally occurring compounds that are chemically amino acids [15,16]. The TLC techniques used to analyze this group of compounds exploit the physicochemical characteristics of the amino and carboxylic moieties, although the separation on a TLC plate is due to the characteristics of the side chain [17][18][19]. It is noteworthy that two-dimensional TLC development is usually required to adequately separate a complex mixture of more than 10 amino acids [19][20][21].
Monosaccharides belong to a large family of natural products (i.e., carbohydrates) with the general formula Cn(H2O)n, the basic structures consisting of five carbons (pentoses) or six carbons (hexoses). They are either polyhydroxyaldehydes or polyhydroxyketones. The alpha hydroxyl group of some monosaccharides can be replaced by another substituent, such as hydrogen in deoxy sugars and an amino group in amino sugars. Furthermore, they can be oxidized to acidic sugars or reduced to polyols [8,22]. Although they mainly exist as their cyclic hemiacetals or hemiketals, it is necessary to consider the equilibrium between cyclic and acyclic forms for appropriate chromatographic analysis [8]. Monosaccharides are the most polar of the small molecules mentioned here, and the TLC techniques used to analyze this group exploit this feature. Table 1 summarizes the conditions and procedures for TLC analysis of the abovementioned small molecules. Because most of these molecules are colorless and nonfluorescent under ultraviolet and visible light, a derivatization reagent is required for their visualization.

TLC data treatment
Direct optical quantification in TLC can be realized by using a slit-scanning densitometer. In this technique, the absorbance or emitted fluorescence of the components separated in a TLC plate is measured. According to the compound nature or their derivatives' spectral characteristics, halogen or tungsten (for the visible range) and deuterium lamps (for the UV region) can be used; nevertheless, better results are usually obtained with absorption of UV light on regular layers. On the other hand, on layers with incorporated phosphor, the compound abundance can be quantified upon UV absorption by dark zones on a fluorescent background (fluorescence quenching) [7]. This type of equipment can be computer-controlled, performing automated and accurate data acquisition and processing. At present, commercially available scanning TLC densitometers have common technical characteristics such as spectral range
A TLC densitometer chromatogram consists of one axis representing the Rf values and another representing the measured absorbance or fluorescence, which can be extracted as a two-column matrix for subsequent chemometric analysis. A well-resolved component is characterized by a well-shaped (taller-than-wide) and normally distributed peak in the chromatogram. With the peak area corresponding directly to the compound concentration, this typically leads to RSD values in the range of 0.5-3% in quantitative high-performance TLC [7,10].
Apart from slit-scanning densitometry, a compound can be quantified based on the analysis of its spot/band image obtained from a TLC plate. The instrument used for this purpose is known as a video scanner densitometer, and it is necessarily coupled to a computer system. For chromatogram data acquisition, image densitometers obtain a picture of the TLC plate and subsequently measure the color brightness of the visible spots. Commonly, the image is obtained under white light and/or UV light (short-wave and/or long-wave radiation) [30].
Since the first step of this technique is acquiring a high-resolution color image, homemade instruments equipped with a high-quality digital camera can be adapted for image acquisition.
The disadvantages of image densitometers are their lower sensitivity and chromatographic resolution compared with slit-scanning densitometers. As a result, the data matrix lacks information on components that cannot be detected under the few wavelengths of light applied, and it may contain more background interference. However, due to their simplicity and lower cost, these instruments are the most popular for densitometric evaluation nowadays [7,30].
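The peak-area and RSD calculations mentioned above can be sketched in Python as follows. This is a minimal sketch, assuming the densitogram has already been exported as paired Rf and signal values; the function names and all numbers are hypothetical and are not produced by any densitometer software.

```python
from statistics import mean, stdev


def peak_area(rf_values, signal, rf_start, rf_end):
    """Trapezoidal area under the signal between rf_start and rf_end."""
    points = [(x, y) for x, y in zip(rf_values, signal) if rf_start <= x <= rf_end]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area


def rsd_percent(values):
    """Relative standard deviation (sample standard deviation / mean), in percent."""
    return 100.0 * stdev(values) / mean(values)


# Hypothetical densitogram: one symmetric peak centered at Rf 0.40.
rf_axis = [0.30, 0.35, 0.40, 0.45, 0.50]
absorbance = [0.0, 50.0, 100.0, 50.0, 0.0]
area = peak_area(rf_axis, absorbance, 0.30, 0.50)

# Hypothetical peak areas from three replicate applications of one sample.
replicate_areas = [10.0, 10.2, 9.8]
print(f"peak area = {area:.2f}, RSD = {rsd_percent(replicate_areas):.1f}%")
```

The two lists `rf_axis` and `absorbance` together form the two-column matrix described in the text; the integration limits would normally be chosen from the peak boundaries on the densitogram.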

Generalities
Basically, GC is a more sophisticated technique than TLC; for a better understanding, some references are suggested [3,4,31]. This is the method of choice for separation and detection of permanent inorganic gases and volatile organic compounds in a mixture. It is based on the partitioning of vaporized or gaseous compounds between an inert gas mobile phase and a stationary liquid or solid phase. Helium is the most commonly used carrier gas, but others such as nitrogen or hydrogen can be used. The separation column is packed with a finely divided solid or coated with a thin film of liquid (typically <1 μm). In the market, there is a wide variety of capillary columns, which can be grouped by the polarity of their stationary phases, for achieving better separation and resolution [31].
The affinity of the components of a mixture for the stationary phase depends on their physicochemical characteristics and it impacts directly on their separation and resolution.
Component separation is based on the "like-dissolves-like" rule, which explains the different interaction strengths between the compounds and the stationary phase. A stronger compound-stationary phase interaction leads to longer contact between the compound and the stationary phase, and more time is needed for the compound to migrate through the column. This migration time is known as the retention time Rt (commonly expressed in minutes) and represents one of the principal numeric values that can be obtained by GC for compound characterization. The other value obtained by GC is the relative abundance of the components, which corresponds to the height or area of the respective chromatogram peak. The unit of the relative abundance depends on the type of detector coupled to the GC [4,31].
GC-MS is a hybrid technique, in which a gas chromatograph is coupled to a mass spectrometer via an interface (i.e., a heated metal tube equipped with a temperature controller, connecting the column exit in the gas chromatograph and the entrance to the ion source of the mass spectrometer). GC alone can separate volatile compounds with great resolution, but it cannot identify them properly [4]. MS uses the difference in mass-to-charge ratio (m/z) of ionized atoms or molecules, providing structural information through the identification of distinctive fragmentation patterns. Thus, after separation in the GC column, the analytes are transported to the mass spectrometer to be ionized for subsequent mass filtration and detection. GC-MS is a potent tool for modern analytical chemistry that allows separating the compounds in complex mixtures and identifying them effectively with some considerations [3,31].
Theoretically, MS is based on the analysis of ions moving through a vacuum, and a mass spectrometer must include an ion source (electron or chemical ionization), an ion analyzer (quadrupole, ion trap, or time-of-flight), and an ion detector [3,31]. In GC-electron impact-quadrupole-mass spectrometry (GC-EIMS), immediately after the compounds leave the capillary column, they are bombarded by an electron beam and fragmented into ions that correspond to fragments of the molecule. Then, these ions are separated in the quadrupole according to their mass-to-charge ratio (m/z) before being sensed and quantified by the detector.
A GC-EIMS chromatogram (named total ion chromatogram or TIC) is a graph showing the relationship between the retention time (x-axis) and ion abundance or total ion current (y-axis). Since each point of the TIC is the sum of the individual abundances of all the monitored ions, the TIC can be decomposed to obtain the distribution of each ion. The data can also be displayed in three dimensions simultaneously (i.e., a 3D chromatogram recording the number of ions created along with their masses over time). By examining the TIC and "slicing" along the third dimension (m/z) at a chosen peak, the mass spectrum can be evaluated at a given time. When all of these ions come from the same compound, the ion distribution can be plotted in an abundance versus m/z graph (i.e., the compound mass spectrum or fragmentation pattern). When mass spectra are obtained under standard conditions, the distribution of ions is essentially the same regardless of the GC step and/or MS instrument. This important feature has enabled the construction of large mass spectral libraries that serve for precise compound identification [31].
The first step in analyzing small molecules or any metabolite by GC-EIMS is to make sure that all analytes can be volatilized at the injection port and oven temperatures. The vaporization of a specific compound occurs at a given temperature and pressure that, in turn, depend on the number of carbons and the polarity of the compound (among other properties). Amino acids, SCFAs, and monosaccharides are polar compounds due to their amino, hydroxyl, carbonyl, and carboxylic groups. These functional groups allow intermolecular interactions such as hydrogen bonds and van der Waals forces, increasing the compounds' boiling points and making their analysis by GC-EIMS difficult. Thus, only some of these compounds can be analyzed directly by GC-EIMS, and for the remaining ones, a derivatization step is required.
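The relationship between individual scans, the TIC, and the mass spectrum at a given time can be sketched in Python. This is a minimal sketch under the assumption that each scan is available as a retention time plus a mapping of m/z to abundance; the data layout, function names, and values are hypothetical and do not correspond to any vendor file format.

```python
# Each scan: (retention time in min, {m/z: abundance}). Values are hypothetical.
scans = [
    (5.00, {73: 120.0, 204: 10.0, 217: 5.0}),
    (5.01, {73: 300.0, 204: 900.0, 217: 450.0}),
    (5.02, {73: 150.0, 204: 20.0, 217: 8.0}),
]


def total_ion_chromatogram(scan_list):
    """Sum the abundances of all monitored ions in each scan."""
    return [(rt, sum(spectrum.values())) for rt, spectrum in scan_list]


def mass_spectrum_at(scan_list, retention_time):
    """'Slice' along m/z: spectrum of the scan closest to the requested time."""
    _, spectrum = min(scan_list, key=lambda scan: abs(scan[0] - retention_time))
    return sorted(spectrum.items())


tic = total_ion_chromatogram(scans)
apex_rt = max(tic, key=lambda point: point[1])[0]  # retention time of the tallest TIC point
print("TIC:", tic)
print("Mass spectrum at apex:", mass_spectrum_at(scans, apex_rt))
```

Plotting the first structure against time gives the chromatogram, whereas the second gives the abundance versus m/z graph for the chosen peak.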

Direct GC-MS analysis of small molecules
Among small molecules (i.e., amino acids, monosaccharides, and SCFAs), SCFAs are suitable for direct GC-EIMS analysis due to their small aliphatic acyl chain and relatively high vapor pressure [32]. The critical issue in analyzing SCFAs directly is the correct selection of the stationary phase. Excellent results can be obtained by using a polar-phase capillary column such as Nukol (acid-modified polyethylene glycol phase, Supelco, Sigma-Aldrich). Since these compounds do not have more than 12 carbons, neither a long column nor a lengthy chromatographic method is required. For direct GC-EIMS analysis of SCFAs, the chromatographic conditions are as follows: injection port temperature set at 250°C; GC oven temperature initially set at 90°C for 3 min, then subjected to a three-step program [(i) increased to 150°C with a 15°C/min ramp rate, (ii) increased to 170°C with a 5°C/min ramp rate, and (iii) increased to 200°C with a 20°C/min ramp rate and held for 10 min]; transfer line temperature set at 250°C; stationary phase: a 30 m × 320 μm × 0.25 μm Nukol capillary column (Supelco, Sigma-Aldrich); carrier gas: a constant helium flow of 1 mL/min; standard MS parameters applied (electron energy of 70 eV, ion source temperature set at 230°C, quadrupole analyzer temperature set at 150°C). This method takes around 16 min for analysis of C2 to C12 SCFAs, with measurements obtained over a 25-300 m/z range at approximately 3 scans per second.

Derivatization of small molecules for GC-EIMS analysis
The goal of derivatization before GC-MS analysis is to obtain chemical derivatives that are more volatile and less reactive than the compounds of interest, and that therefore present improved chromatographic characteristics [29,33].
The labile hydrogens of amino acids, SCFAs, and monosaccharides are commonly the target of derivatization procedures. In practice, these hydrogens of the carboxyl, amino, and hydroxyl groups of the abovementioned small molecules can be substituted by trimethylsilyl groups through a general silylation reaction [33].
The following GC-EIMS conditions enable the individual and simultaneous analysis of amino acids, SCFAs, and monosaccharides. The injector temperature is set at 260°C. An HP-5-MS capillary column (30 m × 250 μm × 0.25 μm) is used with helium as carrier gas at a constant flow rate of 1 mL/min. The oven program begins at 45°C (hold for 5 min), then increases at a rate of 10°C/min to 300°C (hold for 25 min). The transfer line temperature is set at 280°C. The mass spectrometer operates at an electron energy of 70 eV; the quadrupole and ion-source temperatures are set at 150 and 230°C, respectively. The scan mode is used in the range 40-550 m/z. Using this method, SCFAs, amino acids, and monosaccharides can be chromatographically separated if their mixture contains appropriate amounts of each. SCFAs elute from the GC column during the first 10-15 min of analysis, followed later by amino acids and monosaccharides. Due to the interference of solvent and derivatizing reagent traces, this method is not suitable for the analysis of compounds containing fewer than three to four carbons; for C2-C3 SCFAs, for example, direct analysis is preferable.
For amino acids, it is important to consider that, at equilibrium, different forms of derivatization products can be found, i.e., the totally derivatized compound (all the labile hydrogens are derivatized) and partially derivatized compounds (conserving one or more labile hydrogens). If the same conditions are applied and equilibrium is reached in all samples, the proportion of totally and partially derivatized amino acids remains constant, and both forms are commonly included in the mass spectral libraries.
A critical point for the GC-EIMS analysis of monosaccharide silyl derivatives is the presence of natural isomers. D-Hexoses naturally have five isomers (two furanoses, two pyranoses, and one linear structure), and their silyl derivatives retain the same isomers. These derivatives can be separated by GC, and the mass spectra of pyranoses and furanoses can be differentiated [34]. As with amino acids, at equilibrium the proportion of each isomer remains constant. It is important to note that in TLC analysis, the problem of isomer generation after the derivatization process is avoided.
For better GC-EIMS results, the use of internal standards is advisable, for example, synthetic methylated SCFAs, synthetic or nonprotein amino acids, and nonbiological polyols or glycosides, as well as carbon- and hydrogen-labeled compounds [32,35,36].
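The run length implied by an oven program such as the one described above can be estimated with a few lines of Python. This is a minimal sketch; the function name and the tuple layout used to describe the program are hypothetical, and cooling or equilibration time between injections is not considered.

```python
def oven_program_minutes(initial_temp_c, initial_hold_min, steps):
    """Total duration of a GC oven program.

    steps is a list of (ramp rate in °C/min, target temperature in °C,
    hold time in min) tuples applied in order.
    """
    total = initial_hold_min
    temperature = initial_temp_c
    for rate, target, hold in steps:
        if rate <= 0:
            raise ValueError("ramp rate must be positive")
        total += abs(target - temperature) / rate + hold
        temperature = target
    return total


# Program from the text: 45°C (5 min hold), 10°C/min to 300°C (25 min hold).
run_time = oven_program_minutes(45.0, 5.0, [(10.0, 300.0, 25.0)])
print(f"Programmed oven time: {run_time:.1f} min")
```

Such an estimate is useful when planning sequences with many samples, since the long final hold accounts for a large share of each run.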

GC-MS data treatment
Nowadays, the extraction of GC-MS data is a simple and fast task, thanks to the convenience and availability of commercial and free software products. For GC-MS data analysis of small molecules, ChemStation Data Analysis and MassHunter software products (Agilent Technologies, Inc.) are commonly used [29].
Using these software packages, TIC data can be exported to CSV format, that is, as a two-column data matrix in which the retention time and the relative ion abundance values are stored.
Since TIC data contain the distribution of all measured ions, the data of each individual ion can be extracted and merged to obtain a consensus chromatogram that includes only the peaks in which all the selected ions are present. This feature is very useful when characteristic ions have been detected for the compound(s) of interest. For example, for silylated monosaccharides, the 204 and 217 m/z ions can be used as a diagnostic tool, and by monitoring both ions a complex chromatogram can be simplified to analyze only monosaccharides.
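Reading such a two-column export into a numeric matrix can be sketched in Python with the standard `csv` module. This is a minimal sketch: the column names and values below are hypothetical, and real exports may contain additional header or metadata lines, which is why non-numeric rows are skipped here.

```python
import csv
import io


def read_two_column_csv(handle):
    """Return a list of (retention time, abundance) pairs from a CSV stream.

    Rows that are too short or not numeric (e.g., header lines) are skipped.
    """
    rows = []
    for record in csv.reader(handle):
        if len(record) < 2:
            continue
        try:
            rows.append((float(record[0]), float(record[1])))
        except ValueError:
            continue
    return rows


# Hypothetical export; in practice this would be an open file handle.
example = io.StringIO(
    "retention_time_min,abundance\n"
    "5.00,135\n"
    "5.01,1650\n"
    "5.02,178\n"
)
tic_matrix = read_two_column_csv(example)
print(tic_matrix)
```

Passing a file handle instead of a path keeps the parser independent of where the export is stored.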
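One possible way to build such a consensus chromatogram is sketched below in Python. This is an illustrative interpretation, not the procedure implemented in any specific software: a scan contributes to the trace only if every selected ion is present above a threshold. The data layout, the function name, the threshold parameter, and all values are hypothetical.

```python
def consensus_chromatogram(scan_list, selected_mz, min_abundance=0.0):
    """Keep a scan only if all selected ions exceed min_abundance.

    Returns (retention time, summed abundance of the selected ions),
    with 0.0 where at least one selected ion is missing.
    """
    trace = []
    for rt, spectrum in scan_list:
        abundances = [spectrum.get(mz, 0.0) for mz in selected_mz]
        if all(a > min_abundance for a in abundances):
            trace.append((rt, sum(abundances)))
        else:
            trace.append((rt, 0.0))
    return trace


# Hypothetical scans: only the second one contains both diagnostic ions.
example_scans = [
    (10.0, {73: 500.0, 204: 0.0}),
    (10.1, {73: 200.0, 204: 800.0, 217: 400.0}),
    (10.2, {73: 900.0, 147: 300.0}),
]
monosaccharide_trace = consensus_chromatogram(example_scans, (204, 217))
print(monosaccharide_trace)
```

Raising `min_abundance` above the noise level would make the filter stricter; the appropriate value depends on the instrument and is left as an assumption here.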

Chemometrics-based chromatographic data analysis
At this point, it can be recognized that the data obtained by TLC and GC-EIMS are in the same format, that is, a numeric matrix for subsequent chemometric analysis.
The following paragraphs are intended as a practical guide rather than a theoretical description of the chemometric concepts, which are widely reviewed in Refs. [1,2,5,37]. They describe a workflow for the chemometric analysis of TLC and GC-MS data (Figure 2), which can be divided into three distinct stages: data preprocessing, data processing, and model validation.

Data preprocessing
This stage includes baseline correction, retention time or retention factor correction, and noise reduction [38]. It can be applied before or after numeric data extraction. Before data extraction, the software products ImageJ, winCATS, or visionCATS can be used for TLC analysis, whereas OpenChrom, ChemStation Data Analysis, or MassHunter Workstation can be used for GC-MS analysis. After data extraction, the data matrices can be edited with statistical software such as R or MATLAB. Normalization of data is recommended to minimize systematic variation due to changes in instrumental response; this can be done by using internal standards and a subsequent data correction. A data cleanup step can be added to remove artifact peaks or peaks with low repeatability by deleting the suspect peaks. After this, the data must be presented in a single table, with all samples treated under the same conditions and passed through the same filters.
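The preprocessing steps applied after data extraction can be sketched in Python. This is a deliberately simple sketch, assuming a linear baseline, a moving-average smoother, and a single internal standard peak at a known position; these choices, the function names, and the numbers are illustrative assumptions rather than the methods implemented in the software packages named above.

```python
def subtract_linear_baseline(signal):
    """Subtract the straight line joining the first and last points."""
    n = len(signal)
    if n < 2:
        return list(signal)
    slope = (signal[-1] - signal[0]) / (n - 1)
    return [y - (signal[0] + slope * i) for i, y in enumerate(signal)]


def moving_average(signal, window=3):
    """Smooth the signal with a centered moving average (shorter at the edges)."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


def normalize_to_internal_standard(signal, standard_index):
    """Divide every point by the response at the internal standard position."""
    reference = signal[standard_index]
    if reference == 0:
        raise ValueError("internal standard response is zero")
    return [y / reference for y in signal]


# Hypothetical raw trace with a drifting baseline; index 5 is the internal standard.
raw = [10.0, 12.0, 30.0, 16.0, 18.0, 60.0, 22.0, 24.0]
corrected = subtract_linear_baseline(raw)
smoothed = moving_average(corrected, window=3)
normalized = normalize_to_internal_standard(smoothed, standard_index=5)
print(normalized)
```

Applying the same three functions with the same parameters to every sample is what puts all chromatograms "under the same conditions" before they are combined into a single table.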

Data processing and model validation
Data can be processed by unsupervised [principal component analysis (PCA) and clustering analysis] and supervised multivariate statistical methods [partial least squares discriminant analysis (PLS-DA) and between-group analysis (BGA)] [39].
Regarding unsupervised methods, chromatographic data can be analyzed with PCA by visualizing grouping trends and inspecting atypical values [38,40]. One practical way to perform PCA is to use the FactoMineR package [41] in R, with RStudio as an interface [42,43], and to consider the eigenvalues when selecting the components to plot.
Clustering analysis uses similarity or dissimilarity measures between the samples to be analyzed. The goal of this analysis is to obtain a symbolic description of the data and an identification pattern [44]. The most commonly used clustering algorithm is the hierarchical method [44,45]. For a hierarchical clustering procedure, the pvclust package in R can be used to provide approximately unbiased and bootstrap probability p values [46]. Other R packages can also be used to improve the appearance and analysis of the dendrograms [47]. In addition, model validation can be performed by an internal or external method [48].
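To make the role of the eigenvalues concrete, the following Python sketch performs PCA on a small matrix using only the standard library. It is not the FactoMineR implementation: it uses power iteration with deflation on the covariance matrix, without scaling of the variables, and the sample matrix is hypothetical.

```python
import math


def center_columns(matrix):
    """Subtract the column mean from every variable."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]


def covariance(centered):
    """Sample covariance matrix of centered data (rows = samples)."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_eigenpair(sym, iterations=200):
    """Largest eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    p = len(sym)
    vec = [float(i + 1) for i in range(p)]
    for _ in range(iterations):
        nxt = [sum(sym[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    length = math.sqrt(sum(x * x for x in vec))
    vec = [x / length for x in vec]
    value = sum(vec[i] * sum(sym[i][j] * vec[j] for j in range(p)) for i in range(p))
    return value, vec


def pca(matrix, n_components=2):
    """Return sample scores and the fraction of variance explained per component."""
    centered = center_columns(matrix)
    cov = covariance(centered)
    total_variance = sum(cov[i][i] for i in range(len(cov)))
    components, explained = [], []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        components.append(vec)
        explained.append(value / total_variance)
        # Deflation: remove the component just found before searching for the next.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(len(vec))]
               for i in range(len(vec))]
    scores = [[sum(x * w for x, w in zip(row, vec)) for vec in components]
              for row in centered]
    return scores, explained


# Hypothetical matrix: 6 samples (rows) x 3 peak abundances (columns).
data = [[10, 2, 1], [11, 2, 1], [2, 9, 1], [2, 10, 2], [1, 2, 8], [2, 1, 9]]
scores, explained = pca(data, n_components=2)
print("Explained variance fractions:", [round(e, 3) for e in explained])
```

Each eigenvalue divided by the total variance gives the fraction of variance explained by that component, which is the quantity usually reported on the axes of a PCA score plot.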
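The principle of hierarchical clustering can be sketched in Python as follows. This is a minimal sketch assuming Euclidean distance and average linkage; it does not compute the bootstrap p values that pvclust provides, and the sample labels and vectors are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hierarchical_clustering(samples):
    """Agglomerative clustering with average linkage.

    samples maps a label to its data vector. Returns the list of merges as
    (cluster_a, cluster_b, distance), where clusters are tuples of labels.
    """
    clusters = [(label,) for label in samples]
    merges = []

    def linkage(c1, c2):
        distances = [euclidean(samples[a], samples[b]) for a in c1 for b in c2]
        return sum(distances) / len(distances)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical samples described by two chromatographic variables.
example = {"A": [10.0, 0.0], "B": [0.0, 0.0], "C": [4.0, 0.0], "D": [1.0, 0.0]}
merges = hierarchical_clustering(example)
for left, right, distance in merges:
    print(left, "+", right, f"at distance {distance:.2f}")
```

The ordered list of merges and their distances contains the information that a dendrogram displays graphically: the earliest merges join the most similar samples.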

Example of chemometrics-based TLC analysis
This example illustrates a TLC method using image analysis densitometry, but with some considerations it is also applicable to slit-scanning densitometry. Four extracts of Agave (S2, S3, and S4) and Cichorium (S1) species containing considerable amounts of carbohydrates (mainly fructans) were chromatographed and derivatized by using diphenylamine-aniline-phosphoric acid reagent. With this derivatization step, fructose and fructans appear as reddish spots, whereas glucose and maltooligosaccharides appear as bluish ones. After chromatographic development and visualization, an image of the TLC plate was acquired by using a commercial image densitometer, TLC-Visualizer 2 (CAMAG), as shown in Figure 3.
Once acquired, the TLC image was processed in the free software ImageJ (National Institutes of Health, NIH), where it was split into red, green, and blue channels. For this example, the red and blue channels were selected because the spots of interest were enriched and appeared better defined in these channels. For noise removal, a median filter of appropriate size was applied. A track containing the component(s) of interest was selected, designated as a lane, and then plotted. The numeric data of each plot can be extracted by selecting exclusively the area of the plot line (avoiding any text or other data). In this study, a Cartesian graph was also obtained to extract a single numeric data matrix for further chemometric analysis [49][50][51]. Besides the retention factor/relative abundance matrix, the relative peak areas of all or part of the mixture components can also be used to construct data matrices.
The data matrix was used to perform a hierarchical clustering analysis with the software STATISTICA (StatSoft, Inc.), as displayed in Figure 3. This analysis allows the samples to be grouped according to their monosaccharide and oligosaccharide composition without requiring compound identification. The most closely related samples are S2 and S4, which form a clade and have the smallest distance among all the samples. S3 appears as sister to the S2-S4 clade, but at a distance equivalent to four times the distance between S2 and S4. S1 appears as an external group to the clade formed by S2-S4-S3, at a greater distance than any of the other samples. According to this dendrogram, it is possible to conclude that, based on carbohydrate composition, S2 and S4 are the most similar among all the samples, S1 has the most different composition, and S3 has a composition more similar to S2-S4 than to S1; this corresponds to the botanical origin of the samples. The different polymerization degrees of the carbohydrates correspond to their Rf values, that is, the higher the Rf value, the lower the polymerization degree. Thus, besides the grouping according to the type of carbohydrate (spot color), the classification is also influenced by the size of the carbohydrate molecules. The advantage of using chemometrics to analyze these TLC data is that it avoids spending time on compound identification and standard-based quantification, making the grouping and classification of samples a fast but robust process.
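The channel splitting, lane plotting, and median filtering steps can be mimicked in plain Python to show what the resulting numeric profile represents. This sketch does not use ImageJ or reproduce its algorithms: the tiny image, the inversion of brightness so that dark spots become peaks, and the function names are all illustrative assumptions.

```python
from statistics import median


def split_channel(image, channel):
    """Extract one color channel from an image stored as rows of (R, G, B) pixels."""
    index = {"red": 0, "green": 1, "blue": 2}[channel]
    return [[pixel[index] for pixel in row] for row in image]


def lane_profile(channel_image, left, right):
    """Mean inverted brightness across columns left..right-1 for each row.

    Brightness is inverted (255 - value) so that dark spots appear as peaks.
    """
    return [255.0 - sum(row[left:right]) / (right - left) for row in channel_image]


def median_filter(profile, window=3):
    """One-dimensional median filter (window shrinks at the edges)."""
    half = window // 2
    return [median(profile[max(0, i - half): i + half + 1])
            for i in range(len(profile))]


# Hypothetical 10 x 2 pixel lane: white background, a bluish spot three rows
# tall (dark in the red channel), and a one-row speck of noise.
white, spot, speck = (255, 255, 255), (55, 60, 200), (155, 155, 155)
image = [[white] * 2, [white] * 2, [spot] * 2, [spot] * 2, [spot] * 2,
         [white] * 2, [white] * 2, [speck] * 2, [white] * 2, [white] * 2]

red = split_channel(image, "red")
profile = lane_profile(red, left=0, right=2)
filtered = median_filter(profile, window=3)
print(filtered)
```

In this toy example the median filter removes the isolated speck while leaving the three-row spot intact, which is the behavior expected of noise removal before the lane profile is exported as a data matrix.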

Example of chemometrics-based GC-MS analysis
To demonstrate the application of the GC-MS methods described herein, a dataset obtained from the monosaccharide analysis of plant tissue extracts by GC-EIMS was studied. A total of nine samples obtained from three tissues measured in triplicate (A = tissue 1, B = tissue 2, and C = tissue 3) were analyzed, and the total ion chromatograms were processed to extract the 204 and 217 m/z ion chromatograms with the aim of filtering the data for silylated monosaccharides. The chromatographic data were then extracted, and the numeric data matrix was subjected to principal component analysis. The resulting PCA data are shown in Figure 4; principal component 1 (PC1) and principal component 2 (PC2) explain 61.6 and 16.3% of the total variance, respectively. All samples are grouped according to the tissue from which they come. Samples of tissue B lie closer together, whereas samples of tissue C are more scattered. For B and C, PC1 has the major effect on the data dispersion, whereas for A, PC2 plays this role. This nondirected analysis, in which no compounds were identified, allowed the data to be grouped according to their origin, demonstrating the strength of the chemometric analysis. Although only nine samples of known origin are used in this example, the methodology is robust enough to be applied to datasets of tens or even hundreds of samples from diverse sources, as long as the same methodology is applied to all samples. In this chemometric analysis, as in the TLC example above, the samples were grouped according to their monosaccharide content in a simple way, without spending time on compound identification and standard-based quantification.

Conclusions
This chapter presents a practical guide for the chemometric analysis of amino acids, short-chain fatty acids, and monosaccharides in complex mixtures by thin-layer chromatography and gas chromatography coupled to mass spectrometry. Furthermore, it provides a workflow for converting chromatographic data to numerical values, and it describes chemometric methods for analyzing them, such as principal component analysis and clustering analysis.

Author details
Juan Vázquez-Martínez and Mercedes G. López*
*Address all correspondence to: mercedes.lopez@cinvestav.mx
Center for Research and Advanced Studies of IPN Campus Irapuato (Centro de Investigación y de Estudios Avanzados del IPN Unidad Irapuato), Guanajuato, Mexico