Application of Chemometrics to the Interpretation of Analytical Separations Data

Interesting real-world samples are almost always present as mixtures containing the analyte(s) of interest and a matrix of components that are irrelevant to answering the analytical question at hand. Additionally, the compounds comprising the matrix are usually present in far greater abundance (both number and concentration) than the analytes of interest, making quantification or even detection of these analytes difficult if not impossible.

There are many types of chromatography, with the most common being liquid chromatography (LC), where analytes partition between a mobile liquid phase and an immobile stationary phase, and gas chromatography (GC), where the mobile phase is a gas and the stationary phase is a solid or, more often, a viscous, liquid-like polymer. There are numerous modes for LC separations, including reverse-phase (RPLC), normal-phase (NPLC), ion (IC), size exclusion (SEC), and hydrophilic interaction (HILIC) chromatography, to name a few. From the point of view of chemometric data interpretation and the discussion in this chapter, all of these LC separations generate equivalent data. In any chromatographic separation, the sample is delivered to the inlet of the column while the outlet is connected to a detector, which records a continuous signal. The detector response rises and then falls to baseline according to the analyte flux passing through it, ideally generating one separate peak with an approximately Gaussian shape for each individual analyte. Assuming that the conditions for repeat analyses are not changed, the peak for a given analyte will appear at the same time in every analysis, with the peak area/height being proportional to the quantity of analyte present in a sample (Poole, 2003; Miller, 2005).
Another separations technique which is popular for some samples is capillary electrophoresis (CE). Here, an electric field applied across a fused silica capillary containing a buffer induces motion of the buffer and of the analytes in the sample. The CE separation depends on the differential mobilities of analytes in solution in the presence of the electric field, which arise because different analytes have different charges and sizes in solution. While the separation mechanism of CE is fundamentally different from the chromatographic mechanism, the data are likewise a series of peaks recorded as a function of time. Consequently, the same tools can be applied to data from a CE separation, and similar concerns exist for the interpretation of these data (Poole, 2003; Miller, 2005). For ease of readability, and because chemometrics is more often applied to chromatographic data than to electrophoretic data, we will generally refer to a chromatogram in this chapter. This could equally be an electropherogram; when considering the application of chemometric techniques to separations data, whether the origin is electrophoretic or chromatographic is largely irrelevant.
When tasked with incredibly complex samples, analysts are now turning more and more frequently to so-called comprehensive multidimensional separations (e.g.: GC×GC, LC×LC, CE×CE) (Liu & Phillips, 1991; Erni & Frei, 1978; Michels et al., 2002). In these techniques, the mixture of compounds is sequentially separated by two different separation mechanisms. In the case of GC×GC, for example, a sample might be separated first on an apolar column, followed by a polar column. The exact workings of comprehensive multidimensional separations are beyond the scope of this work, and are discussed elsewhere (Górecki et al., 2004; Cortes et al., 2009; François et al., 2009; Kivilompolo et al., 2011; Li et al., 2011). However, these techniques are gaining in popularity, and are capable of separating exceedingly complex mixtures comprising thousands of individual compounds. Due to the vastly improved separation power of these techniques, the data are much more information-rich, and without some form of chemometric treatment it is essentially impossible to do more than scratch the surface of the information contained therein.

Separations data
The detector signal from a separations experiment, when plotted vs. time, yields a series of (ideally) Gaussian peaks, each representing one compound in the sample. Acquisition speed is one consideration for a chromatographic detector: it must be sufficient to faithfully record the profile of each compound as it passes through the detector. In order to obtain an accurate peak profile, a minimum of about 10 acquisition points is required across a peak. Thus, the required speed of the detector is intrinsically linked to the nature of the separation. In separations where the base width of the peaks is on the order of 5 s, a data rate of 2 Hz would be acceptable, but when peak widths are 100-200 ms, as in GC×GC, detector rates on the order of 50-100 Hz are required for quantitative analysis.
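This rule of thumb reduces to simple arithmetic. The sketch below is illustrative only; the function name and the 10-points-per-peak default are our own, standing in for the guideline described above.

```python
def min_acquisition_rate(peak_width_s, points_per_peak=10):
    """Minimum detector acquisition rate (Hz) to place the desired
    number of points across a peak of the given base width (s)."""
    return points_per_peak / peak_width_s

print(min_acquisition_rate(5))    # 5 s peaks -> 2.0 Hz
print(min_acquisition_rate(0.1))  # 100 ms GCxGC peaks -> ~100 Hz
```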
From the point of view of chemometric analysis of separations data, another important consideration is whether the detector is univariate or multivariate. Univariate detectors, such as the flame ionisation detector or a single-wavelength UV-visible spectrometer, record only one variable as a function of time, generating data which take the form of a vector of instrument response. Other detectors, typically mass spectrometers and multi-channel spectroscopic instruments, can be operated such that they record a multivariate response. Data from these instruments comprise an array of signal responses, with each row representing a time when a response was recorded and each column representing a variable that was recorded (e.g.: detector wavelength, ion mass-to-charge ratio). To the chemometrician, it is immediately obvious that there are numerous advantages to collecting multivariate chromatographic data; however, it is worth noting that this advantage has been by and large ignored by chromatographers. Typically, only the profile of a single variable vs. time is used to selectively quantify an analyte, or the detector response across all channels at a given time is used to help identify a peak.
One other aspect of raw separations data is the sheer number of variables measured for each sample. When a univariate detector is used for a 15 min separation, operating with an acquisition speed of 10 Hz, the data will be a vector of 9000 individual measurements per sample. If a multivariate detector is employed instead, for example a mass spectrometer operating over a 30-300 m/z mass range, this number increases to 2 439 000 individual variables arranged in a 9000 × 271 array per sample! In the case of GC×GC-MS analyses, which are typically 60 min in length but have a high-speed MS collecting data at rates of ~100 Hz, there are on the order of 100 million data points collected for each sample.
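The figures above follow directly from the acquisition parameters. This back-of-the-envelope check assumes unit-resolution mass channels over an inclusive 30-300 m/z range, which is how the 271-channel count in the text arises.

```python
# 15 min separation sampled at 10 Hz with a univariate detector
minutes, rate_hz = 15, 10
time_points = minutes * 60 * rate_hz      # spectra (or points) per sample

# add a mass spectrometer scanning m/z 30-300 at unit resolution
mz_channels = 300 - 30 + 1                # 271 channels, endpoints included

print(time_points)                # 9000
print(time_points * mz_channels)  # 2439000 variables per sample
```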

Challenges with chromatographic data
Variations in analytical separations data are, in principle, no different from those derived from any other instrument, being based on both chemical and non-chemical aspects of the analysis. All relevant information will be contained within the chemical variations, and any chemometric approach to interpreting chromatographic data must be capable of identifying relevant chemical variation while minimizing the effects of irrelevant chemical and non-chemical variations. Sources of irrelevant chemical variation include matrix peaks, here defined as any chemical source of signal introduced with the sample but having no bearing on the conclusions drawn from the data. Additionally, there is background signal, which can derive, for example, from changes in mobile phase composition which influence detector signals in LC, or from chemical "bleed" signatures from stationary phases as they degrade in GC. Non-chemical variations include, for example, baseline drift (for non-chemical reasons), retention time shifts (due to minor fluctuations in operating conditions), and electronic noise. These may easily interfere with the relevant chemical information, degrading model performance and the validity of results (de la Mata-Espinosa et al., 2011a). Figure 1 presents an overlay of several LC chromatograms of similar samples, exemplifying the challenges of baseline drift and retention time shifts. One of the major challenges in handling chromatographic data using chemometric tools is appropriate pre-processing to remove as many non-chemical and irrelevant chemical variations as possible from the data set. Initial efforts in the application of statistical and chemometric tools to chromatographic data used data that were processed to provide a list of detected, integrated peak areas or heights (or the calibrated concentrations for known compounds). However, the trend in recent years has turned towards the direct chemometric interpretation of raw chromatographic signals (Watson et al., 2006; Johnson & Synovec, 2002). The reason for this trend is that many errors can occur during integration of raw signals (Asher et al., 2009; de la Mata-Espinosa et al., 2011b). By applying chemometric tools directly to the raw data, many of these errors can be avoided. Of course, when working with the raw data, other issues become more important, most notably retention time shifts and the sheer number of available variables.

Baseline and noise
Baseline variations, such as noise and drift, are due to small changes in experimental conditions, for example changes in detector response due to the mobile phase gradient in LC separations or increased levels of stationary phase bleed at higher temperatures in temperature-programmed GC.Other sources of noise and drift could include changes in detector response as its components age, contamination of solvents or gases, and of course electronic noise (which is minimal in modern chromatographic systems).
Chemometric approaches to handling chromatographic data should incorporate baseline correction of some form.When raw chromatographic data are processed, the method of baseline correction and its importance are generally obvious to the analyst.In the case where integrated peak tables are used, this is often done automatically by the chromatographic software with little consideration by the analyst, even though the manner in which the baseline is calculated will significantly influence the determination of peak areas/heights.

Retention time shifts
In all separations, retention times of peaks can easily shift by a few seconds from one analysis to the next. This is not much of an issue with simple samples having only a few peaks, which are then integrated prior to chemometric analysis. However, retention times of peaks are used for identifying the compounds. With complex separations, unstable retention times may result in unreliable peak identification, making comparisons from one run to the next impossible. When comparing raw data this is even more important, as one must ensure that the peak for a given component is always registered in the exact same position in the data matrix so that the algorithms will recognize the signals correctly.
The causes of retention time shifts depend on the separations technique being used. In GC, peaks may shift due to degradation of the stationary phase, decreasing retention times over time; build-up of heavy matrix components which foul the column, effectively changing the chemistry of the stationary phase; minor gas leaks which alter the flow rate; or even matrix effects on the evaporation rate in the injector, affecting the rate of mass transfer to the column. In LC, peak shifts may be due to small fluctuations in mobile phase chemistry from one run to the next; temperature fluctuations which in turn affect solvent viscosity and solute diffusion coefficients, altering the kinetics as well as the thermodynamics of the separation; or degradation/fouling of the stationary phase of the column. CE is the technique most prone to drastic shifts in migration time, due to the instability of the electroosmotic flow in the capillary (Figure 2). Electroosmotic flow depends on the applied voltage and on the buffer concentration and composition, and is incredibly sensitive to the surface chemistry of the capillary. The act of analyzing a sample by CE will often have a minor, possibly irreversible effect on the capillary surface, resulting in a change in the migration time of an analyte.
Shifts in retention times are minimized by proper instrument maintenance, precise control of instrumental conditions, or by using approaches such as retention time locking in GC (Etxebarria et al., 2009; Mommers et al., 2011) and relative retention times in CE to account for variations in instrument performance. Even with these approaches, some retention time shifting will occur, requiring more advanced alignment techniques for correction prior to chemometric analysis.

Incomplete separation
Another challenge with the interpretation of chromatographic data is incomplete separation of peaks. If two or more compounds have similar retention characteristics under a given set of separation conditions, they will not be completely resolved, as evidenced by the peak clusters in Figure 1. In these cases, apportioning the signal between the different compounds becomes a challenge, especially for univariate signals. The general approach used for these cases is one of deconvolution: decomposing the analytical signal to determine the contribution of each coeluting compound, or to determine the contribution of the compound of interest, disregarding the remaining data.

Data overload
As shown in Section 1.2, raw chromatographic signals present an overabundance of data to the analyst. This poses several challenges. From a practical point of view, attempts to construct a chemometric model using the entirety of the data set could easily exceed the capabilities of the computer system being used. More fundamentally, if the raw data are considered, the number of variables measured for each sample will vastly outnumber the number of samples available in the data set. Such underdetermined systems can defeat many chemometric techniques due, for example, to collinear variables. Finally, for most chromatograms, especially multidimensional ones, only a small fraction of the data points actually contain meaningful signal. Most of the signal is due to background noise or irrelevant matrix components. Consequently, the raw data must somehow be reduced in size prior to chemometric analysis. This is typically achieved via a feature selection approach, as discussed in Section 3.3.3.

Baseline correction
The aim of baseline correction is to separate the analyte signal of interest from signal which arises due to changes in mobile phase composition or stationary phase bleed, and from signal due to electronic noise. Several baseline correction methods have been proposed in the literature, with the two most common approaches being to fit a curve to the data and subtract it from the signal, or to model the baseline so that it can be excluded using factor models (Amigo et al., 2010).
Curve fitting is the classical approach used in virtually all commercial software packages provided by vendors of separations equipment. The algorithms used in this approach fit a polynomial function across segments of the chromatogram, using regions where no analyte peaks elute to determine the coefficients of the polynomial and then interpolating the background signal for regions where peaks are eluting. The functions are usually first-order polynomials; however, higher-order polynomials or a series of connected first-order polynomials are also used in some situations. Having determined the equation of the background signal, the fitted line is then subtracted from the signal (Brereton, 2003; Gan et al., 2006; Kaczmarek et al., 2005; Zhang et al., 2010; Persson & Strang, 2003; Eilers, 2003). Correction of the baseline using curve fitting is demonstrated in Figure 3. The approach of using models such as parallel factor analysis (PARAFAC) for background correction is analogous to the use of these models for deconvoluting coeluting peaks. As these models are more often used for that purpose than for simple background correction, they will be discussed in more detail in Section 3.3. These approaches often rely on having a multivariate signal and are applied to the chromatogram or, more typically, to small selected regions where a single analyte elutes. The result of applying these deconvolution techniques for background correction is essentially the deconvolution of a single analyte peak, with the background noise making up the error matrix (Amigo et al., 2010). These approaches are generally more powerful and likely result in better quality analytical data, but they are not widely used in separation science. The reason for this is likely historical: these tools have only recently become available to the separation sciences, while the classical curve fitting approach is well established, works with univariate detectors, and performs well in most practical situations.
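The curve-fitting approach can be sketched in a few lines. This is a minimal illustration, not a vendor implementation: it assumes the analyst can nominate peak-free regions, and uses numpy's `polyfit`/`polyval` with a first-order polynomial as described above; the function name and the synthetic signal are our own.

```python
import numpy as np

def baseline_correct(t, y, peak_free_regions, degree=1):
    """Fit a polynomial to analyst-chosen peak-free regions of the
    chromatogram and subtract the interpolated baseline everywhere."""
    mask = np.zeros(t.shape, dtype=bool)
    for lo, hi in peak_free_regions:
        mask |= (t >= lo) & (t <= hi)
    coeffs = np.polyfit(t[mask], y[mask], degree)  # baseline model
    return y - np.polyval(coeffs, t)               # interpolated under peaks

# Synthetic example: drifting linear baseline plus one Gaussian peak at t = 5
t = np.linspace(0, 10, 1001)
y = 0.5 + 0.05 * t + np.exp(-((t - 5) ** 2) / 0.1)
corrected = baseline_correct(t, y, [(0, 3), (7, 10)])
```

After correction, the peak-free regions sit at zero and the peak retains its original height, which is the behaviour expected of the subtraction step described above.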

Alignment of separations data
The retention times of analytes in separations fluctuate from one analytical run to the next and, in order for chemometric techniques to be applied to separations data, these fluctuations must be corrected during pre-processing.This ensures that the signal from each analyte in each analysis is correctly registered within the data matrix to be processed.There are essentially two approaches to this problem: integrated peak tables, or mathematical warping and alignment of the raw signal.

Peak tables
Integrated peak tables are the simplest way to ensure that analytical separations data are properly aligned for chemometric processing. In order to use this approach, one must be able to reliably assign a unique identifier to each peak in each sample of the data set, and ensure that the same compound is identified with the same identifier in each sample. It should be noted that while the compound name is an obvious identifier, a series of labels such as Unknown x (where x is a numerical identifier) would also be acceptable in the event that compound names were unknown, so long as compounds are matched correctly. Rather than identifying peaks by retention time, one could use relative retention times or retention indices in order to adjust for slight variations in the retention times of peaks. Algorithms for aligning peak tables exist and perform well, so long as some peaks can be easily and reliably matched across all chromatograms (Lavine et al., 2001).
The challenges with this approach stem from its reliance on integrated peak tables. Thus, any integration errors due to poorly-resolved peaks, or peaks that are missed because they fall outside of the integration parameters set in the software, will impact any subsequent analysis.

Raw signal alignment
Alignment of raw chromatographic signals prior to chemometric processing is more complex than the alignment of peak tables. In addition to the three more popular algorithms that will be presented below, several others have been developed (Yao et al., 2007; Toppo et al., 2008; Eilers, 2004; Van Nederkassel et al., 2006). In deciding which approach to use, one of the first questions to be answered is whether the analysis is to be qualitative or quantitative. This is because some alignment methods can distort peaks, affecting their quantification. Some of the more common algorithms include correlation optimized warping (COW) (Nielsen et al., 1998; Tomasi et al., 2004), correlation optimized shifting (coshift) (Van den Berg, 2005), and a piecewise peak-matching algorithm (Johnson et al., 2003).
In instances where there are non-systematic peak shifts, COW is a popular algorithm. COW relies on stretching or compressing segments of a sample signal such that the correlation coefficient between it and a reference signal is maximized for each interval. Care must be taken with the selection of the input parameters to avoid significant changes in peak shapes, as this approach to warping the chromatogram has been shown to affect peak areas, leading to poor quantitative conclusions (Nielsen et al., 1998; Tomasi et al., 2004).
A fast and simple alignment algorithm is coshift. This algorithm is useful when data only require a single left-right shift in retention time. The entire data matrix is shifted in one direction or the other by a set amount, maximizing the correlation between a target and the data matrix that requires alignment. The single shifting value for the entire data matrix is a weakness, especially for chromatographic data, where peaks can shift in different directions and to different extents within a single file. To handle this, an algorithm termed icoshift (interval-correlation-shifting) has been derived from coshift. Icoshift aligns each data matrix to a target by maximizing the cross-correlation between the sample and the target within a series of user-defined intervals (Savorani et al., 2010). The use of multiple intervals permits the alignment of separations data where shifts of different magnitudes and directions occur. These alignment algorithms have been used successfully for both one-dimensional data (de la Mata-Espinosa, 2011a; Liang, 2010; Laursen, 2010) and, with some modifications, two-dimensional data (Zhang, 2008). It is important to note that the shifting of chromatograms using coshift or icoshift does not lead to distortions of peak shape, and consequently does not introduce errors into quantitative results.
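The core of the coshift idea, a single rigid shift chosen to maximize correlation with a target, can be sketched as follows. This is our own minimal illustration, not the published coshift/icoshift code; icoshift would apply the same search independently within each user-defined interval.

```python
import numpy as np

def coshift(target, signal, max_shift=50):
    """Rigidly shift `signal` (in points) to maximize its correlation
    with `target`; return the shifted signal and the shift applied."""
    best_shift, best_corr = 0, -np.inf
    for s in range(-max_shift, max_shift + 1):
        c = np.corrcoef(target, np.roll(signal, s))[0, 1]
        if c > best_corr:
            best_corr, best_shift = c, s
    return np.roll(signal, best_shift), best_shift

# Two identical Gaussian peaks, 12 points out of register
t = np.arange(500)
target = np.exp(-((t - 100) ** 2) / 50)
sample = np.exp(-((t - 112) ** 2) / 50)
aligned, shift = coshift(target, sample)
print(shift)  # -12
```

Because the signal is only translated, never stretched, peak shapes (and hence peak areas) are preserved, which is the property of coshift/icoshift noted above.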
The piecewise peak matching approach (Johnson et al., 2003) provides another avenue for chromatographic alignment. In this approach, peaks are identified in a target signal to which all other signals will be aligned. The algorithm then identifies peaks within the sample signals located within predetermined windows of the peaks in the target. Peaks within windows are deemed to come from the same compound, and are matched. The chromatograms are aligned by stretching or compressing the regions between peak apexes. A variant of this algorithm can be used when MS data are available. In this case, the mass spectrum at the apex of each peak in the target signal is compared to the mass spectrum of each peak within a set window in the sample signal, and peaks are matched if their spectra have a high enough match quality (Watson et al., 2006). A general scheme for peak alignment using this approach is described in Figure 4. Depending on the number and relative positions of the peaks in chromatograms matched using this approach, peak shapes may be altered, possibly affecting quantitative results.
One of the biggest challenges for all alignment algorithms is that they depend on the data to be aligned being reasonably similar in terms of both matrix and analyte peaks. In some instances this will not be the case. In our laboratory, we have observed this when analyzing arson debris, where the matrix and analytes form an incredibly complex and variable chromatogram from one sample to the next. A similar situation can easily be imagined when processing samples of biological origin. One solution to this issue is to add markers to every sample prior to the separation step in the analysis. These markers should be easily identifiable within the samples, even under conditions where they coelute with matrix components; should occur in multiple, evenly distributed locations along the chromatogram; and should not occur natively in the samples. One choice is a series of deuterated compounds which, with MS detection, are trivial to identify even in a complex mixture (Sinkov et al., 2011b). One additional benefit is that these compounds can act as internal standards if quantitative results are desired (Johnson et al., 2003).

Deconvolution of overlapping peaks
The central issue in deconvolution is depicted in Figure 5. The instrument response is represented as a black solid line which is the sum of the four dashed, coloured peaks. Ideally, the four signals should be individually quantified. This is a common problem in analytical separations, even those of relatively simple mixtures. Some of these issues may be solved by changing the experimental conditions, or by using characteristic features (wavelengths or ions) of the coeluting analytes and a multivariate detector to selectively detect and quantify them. However, in many cases this is insufficient and more advanced techniques must be used. The strategies used for deconvolution depend heavily on whether the detector signal is univariate or multivariate.

Deconvolution of univariate signals
In the case of univariate signals, one is typically limited to using univariate curve-fitting analyses, where a number of Gaussian or modified Gaussian curves are determined such that the sum of these curves fits the experimentally observed cluster of peaks (Felinger, 1994). In these approaches, only a small window of chromatographic data (one peak cluster) should be processed at a time, and constraints such as fixed peak widths, shapes, unimodality, and non-negativity are often required to ensure the validity of the solution.
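A minimal sketch of this curve-fitting idea, using scipy's `curve_fit` on a synthetic two-peak cluster, is shown below. The peak model, initial guesses, and data are our own illustrative assumptions; published methods add modified Gaussian shapes and further constraints, as noted above.

```python
import numpy as np
from scipy.optimize import curve_fit

def two_gaussians(t, a1, c1, w1, a2, c2, w2):
    """Sum of two Gaussian peaks (amplitude, centre, width each)."""
    return (a1 * np.exp(-((t - c1) ** 2) / (2 * w1 ** 2))
            + a2 * np.exp(-((t - c2) ** 2) / (2 * w2 ** 2)))

# Synthetic overlapping pair: a large peak at t = 4 and a small one at t = 5
t = np.linspace(0, 10, 500)
observed = two_gaussians(t, 1.0, 4.0, 0.5, 0.6, 5.0, 0.5)

# Fit with rough initial guesses; bounds enforce non-negative parameters
p0 = [0.8, 3.8, 0.4, 0.8, 5.2, 0.4]
popt, _ = curve_fit(two_gaussians, t, observed, p0=p0, bounds=(0, np.inf))
```

The fitted amplitudes and centres apportion the merged signal between the two coeluting components; with real, noisy data the quality of the initial guesses and constraints matters far more than in this clean example.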
To solve a univariate deconvolution problem, approaches such as evolving factor analysis (EFA) (Maeder, 1987) or multivariate curve resolution (MCR) (Tauler & Barceló, 1993), among others (Vivó-Truyols et al., 2002; Sarkar et al., 1998; Kong et al., 2005), can be used. When these approaches are used with univariate data, the variables to be solved for are the number, positions, and abundances of each of the peaks that make up the signal. Multivariate curve resolution is widely applicable to separations data and is one of the most common approaches (Franch-Lage et al., 2011; Marini et al., 2011; de la Mata-Espinosa et al., 2011a). The aim of this technique is to determine the number of components present in a sample and the contribution of each component to the sample. In performing MCR, the concentration and response profiles for each analyte are obtained, providing a qualitative and semi-quantitative overview of the components in an unresolved mixture without a priori knowledge of the mixture composition.

Deconvolution of multivariate signals
When multivariate detectors are used for separations, the additional dimension of information can be exploited to aid in deconvolution. MCR and EFA can also be used with multivariate data. In the case of MCR, the experimental matrix is decomposed into a matrix of concentration vs. time profiles (deconvoluted peaks) and the pure spectral profiles of each compound. Knowledge of the number of components contributing to the signal in the region being deconvoluted is useful to guide the process and improve the results (de Juan & Tauler, 2006), though strictly speaking it is not required.
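The decomposition X ≈ C Sᵀ can be sketched with a bare-bones alternating least squares loop. This is our own minimal illustration: the non-negativity constraint is imposed crudely by clipping, whereas proper MCR-ALS implementations use non-negative least squares, convergence criteria, and further constraints such as unimodality.

```python
import numpy as np

def mcr_als(X, n_components, n_iter=300, seed=0):
    """Decompose X (time x channels) into non-negative concentration
    profiles C and spectral profiles S so that X ~= C @ S.T."""
    rng = np.random.default_rng(seed)
    S = rng.random((X.shape[1], n_components))
    for _ in range(n_iter):
        C = np.clip(X @ S @ np.linalg.pinv(S.T @ S), 0, None)
        S = np.clip(X.T @ C @ np.linalg.pinv(C.T @ C), 0, None)
    return C, S

# Two coeluting compounds (overlapping elution profiles, distinct spectra)
t = np.linspace(0, 10, 200)[:, None]
C_true = np.hstack([np.exp(-((t - 4.0) ** 2) / 0.8),
                    np.exp(-((t - 5.5) ** 2) / 0.8)])
S_true = np.array([[1.0, 0.2, 0.0], [0.1, 0.8, 0.5]]).T  # 3 detector channels
X = C_true @ S_true.T
C_est, S_est = mcr_als(X, 2)
```

The recovered columns of `C_est` are the deconvoluted elution profiles and the columns of `S_est` the corresponding spectra, mirroring the decomposition described above.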
Parallel factor analysis (PARAFAC) (Harshman, 1970; Bro, 1997; Amigo et al., 2010) is a technique that is ideally suited to interpreting multivariate separations data. PARAFAC is a decomposition model for multivariate data which provides three matrices, A, B and C, containing the scores and loadings for each component. The residuals, E, and the number of factors, r, are also extracted. The PARAFAC decomposition finds the best trilinear model that minimizes the sum of squares of the residuals in the model through a procedure of alternating least squares.
The biggest advantage of using PARAFAC over other models is the uniqueness of the solution; PARAFAC is less flexible and uses fewer degrees of freedom, being a more restricted model. However, its unique solution reflects actual pure analyte profiles in both the time dimension and the spectral dimension. Thus, the results of a PARAFAC analysis on a cluster of overlapping multivariate peaks provide both qualitative and quantitative data, where the deconvoluted signals appear as analyte peaks. One restriction on the use of PARAFAC is that the data must be trilinear (Bro, 1997; Amigo et al., 2010). In the case of chromatographic techniques with a multivariate detector, the dimensions are retention time, detector signal, and samples. In the case of comprehensive multidimensional separations, such as GC×GC, PARAFAC considers retention in the two dimensions and the samples as the three dimensions.
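The trilinear alternating least squares procedure behind PARAFAC can be sketched in numpy. This is a bare-bones illustration under idealized assumptions (clean, exactly trilinear data, fixed iteration count, random initialization); production implementations add line search, convergence checks, and constraints.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Khatri-Rao product of two factor matrices."""
    r = A.shape[1]
    return (A[:, None, :] * B[None, :, :]).reshape(-1, r)

def parafac_als(X, r, n_iter=500, seed=0):
    """Trilinear model X[i,j,k] ~= sum_f A[i,f] * B[j,f] * C[k,f],
    fitted by alternating least squares on the three unfoldings."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    A, B, C = (rng.random((n, r)) for n in (I, J, K))
    X1 = X.reshape(I, J * K)                     # mode-1 unfolding
    X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)  # mode-2 unfolding
    X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)  # mode-3 unfolding
    for _ in range(n_iter):
        A = X1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = X2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = X3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Demo: recover a synthetic rank-2 trilinear tensor
rng = np.random.default_rng(1)
At, Bt, Ct = (rng.random((n, 2)) for n in (20, 15, 10))
X = np.einsum('ir,jr,kr->ijk', At, Bt, Ct)
A, B, C = parafac_als(X, 2)
```

For separations data the three modes would be, for example, retention time, detector channel, and sample, as described above.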

Feature selection
High data acquisition rates, combined with the length of time required for many separations, result in a large number of data points collected for a given separation (see Section 1.2). In many situations, most of the data are collected when no analytes are eluting from the system, and represent background signal when only mobile phase is reaching the detector. In the case of spectroscopic and especially mass spectral detectors, at a given point in time many of the recorded data in this dimension will not contain useful information, even when an analyte of interest is eluting. Furthermore, many components in the mixture can be completely irrelevant to the analysis (Johnson & Synovec, 2002; Sinkov & Harynuk, 2011a). Consequently, only a small portion of separations data is potentially useful. It is also well known that any model will be heavily influenced by the specific variables that are included in its construction (Kjeldahl & Bro, 2010).
The inclusion of irrelevant data is detrimental to the model because the mathematics attempt to account for variations observed in these irrelevant variables. Consequently, the model is forced to model noise, resulting in a decrease in its predictive ability. Worse yet, the model could fit the data well and provide a seemingly useful prediction, until cross-validation shows otherwise. Finally, the inclusion of extraneous variables increases the demands on the computer system being employed, making model construction slower or, in some cases, outright impossible. Thus, prior reduction of separations data to a manageable size is crucial. Figure 6 depicts situations where either too few or too many variables were used to model a system.
One common way to achieve data reduction is to use a table of integrated peaks instead of raw chromatographic data. This has the advantage of reducing the number of variables to the compounds included in the peak list, removing baseline noise and, if the analyst knows exactly which peaks to use, removing signal from irrelevant compounds. Problems with this approach include the restriction to identified compounds, which may or may not include all of the information required for modeling, and integration errors that skew results. Finally, even with an error-free comprehensive peak table, the analyst must still perform feature selection, since many peaks will undoubtedly be irrelevant to the analysis.
In the case of multivariate detection, it can be advantageous to monitor only one or a few channels (wavelengths, ions, etc.), as this will selectively detect only a portion of the analytes, allowing the analyst to avoid many interfering species while greatly reducing the size of the data. However, in these cases the analyst must know exactly which signals to use, and runs the risk of missing important features of the data encoded in the channels that were ignored. Furthermore, this approach destroys much of the multivariate advantage that can be realized through these more complex (and expensive) detection strategies.
Objective feature selection techniques generally have two steps: variable ranking and variable selection. Objective variable ranking techniques such as analysis of variance (ANOVA) (Johnson & Synovec, 2002), the discriminating variable test (DIVA) (Rajalahti et al., 2009a, 2009b), and informative vectors (Teofilo et al., 2009) have the distinct advantage that variables are ranked based on a mathematically calculable "perceived utility" and not on subjective analyst perception. In essence, the data are given the chance to inform the user of what is relevant and what is likely noise, providing an approach that can be generalized to any set of analytical data.
ANOVA is an effective method when the goal is to discriminate between classes of samples. ANOVA calculates the F ratio for each variable: the ratio of between-class variance to within-class variance. If the F ratio for a given variable is high, that variable is deemed more valuable for describing the difference between classes. Once the F ratio has been calculated for every data point in the chromatogram, the variables can be ranked in order of decreasing F ratio. A chemometric model is then constructed using the fraction of variables having the highest F ratios. One significant advantage of ANOVA is that the algorithm can be written with memory conservation in mind, and thus is easily applied to data sets with very large numbers of samples and variables (hundreds or thousands of samples, each containing millions of variables). Consequently, it can be applied to a set of GC-MS chromatograms across the entire chromatogram, something that is difficult for other feature ranking approaches.
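The F-ratio ranking can be sketched directly from its definition. The function and the two-variable demo below are our own minimal illustration; memory-conserving implementations would accumulate the class statistics in a streaming fashion rather than holding all samples at once.

```python
import numpy as np

def anova_f_ratio(X, labels):
    """Per-variable F ratio (between-class variance / within-class
    variance). X is samples x variables; labels gives one class per row."""
    classes = np.unique(labels)
    grand_mean = X.mean(axis=0)
    ssb = np.zeros(X.shape[1])  # between-class sum of squares
    ssw = np.zeros(X.shape[1])  # within-class sum of squares
    for c in classes:
        Xc = X[labels == c]
        ssb += len(Xc) * (Xc.mean(axis=0) - grand_mean) ** 2
        ssw += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    msb = ssb / (len(classes) - 1)
    msw = ssw / (len(labels) - len(classes))
    return msb / msw

# Demo: variable 0 separates the two classes, variable 1 is uninformative
X = np.array([[1.0, 5.0], [1.1, 4.9], [0.9, 5.1],
              [3.0, 5.0], [3.1, 4.9], [2.9, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
F = anova_f_ratio(X, labels)
order = np.argsort(F)[::-1]  # variables ranked by decreasing F ratio
```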
DIVA is a feature ranking technique that aids feature selection prior to chemometric analysis (Rajalahti et al., 2009a, 2009b). This approach involves the creation of a PLS-DA model using all candidate variables. Projecting this PLS-DA model onto a single new LV yields what is termed a target-projected (TP) model (Rajalahti et al., 2009a). From this, the ratio of explained variance to residual variance for each variable in the TP model provides its selectivity ratio, upon which variables are ranked (Rajalahti et al., 2009a, 2009b; Kvalheim, 1990; Kvalheim & Karstang, 1989). DIVA produces a ranking that is slightly different from that produced by ANOVA, though to our knowledge a direct comparison on chromatographic data has not yet been performed.
Once variables have been ranked, those to be included in the model must be selected. This is generally achieved by constructing a model using a forward-selection or backward-elimination approach, in an attempt to maximize some metric of model quality. Model quality can be assessed using several metrics, such as the mean correct classification rate (Rajalahti et al., 2009b) or the degree of separation between classes of samples in principal component (or latent variable) space, for example using either a Euclidean distance-based metric (Pierce et al., 2005) or a metric that accounts for the size and shape of clusters (Sinkov & Harynuk, 2011a).
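A minimal forward-selection sketch, using cross-validated classification rate as the quality metric and LDA as the classifier (both choices, and all names and data, are illustrative; any ranked variable list could be passed in):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def forward_select(X, y, ranking, max_vars=5):
    """Walk down a precomputed variable ranking, adding one variable at a
    time, and keep the subset size giving the best cross-validated
    classification rate."""
    best_n, best_score = 1, -np.inf
    for n in range(1, max_vars + 1):
        score = cross_val_score(LinearDiscriminantAnalysis(),
                                X[:, ranking[:n]], y, cv=5).mean()
        if score > best_score:
            best_n, best_score = n, score
    return ranking[:best_n], best_score

# Hypothetical ranking (e.g. from an F-ratio step); variable 7 carries signal
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
y = np.array([0] * 20 + [1] * 20)
X[20:, 7] += 4.0
ranking = np.array([7, 2, 5, 0, 9])
selected, score = forward_select(X, y, ranking)
```

Backward elimination is the mirror image: start from the full ranked set and drop the lowest-ranked variable while the quality metric does not degrade.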
The one exception to the rank-and-select approach is the genetic algorithm (Yoshida et al., 2001). However, due to the sheer number of variables present in a typical separation, genetic algorithms are not often applied to raw separations data, as arriving at the optimal number and combination of variables is computationally inefficient and uncertain.
Sometimes, several feature selection methods are used for a given analysis. For example, an analyst might reduce a chromatogram to a peak table, selecting a series of candidate variables of interest, and then perform further variable ranking and optimization on the integrated peak table. This is especially useful in the case of multidimensional separations, where hundreds, if not thousands, of compounds can be resolved (Felkel et al., 2010).
Finally, cross-validation is extremely important, especially when processing raw separations data and using a feature ranking approach such as ANOVA. As discussed previously, raw separations data contain on the order of 10^5 to 10^6 data points for each sample. In such underdetermined systems, where variables vastly outnumber samples, it is entirely possible that some combinations of variables containing only noise will, by random chance, indicate a difference between samples. When handling raw separations data, a good approach to avoid this problem is to break the data set into three separate sets: a training set to construct the model, an optimization set to optimize data processing parameters (such as alignment and feature selection), and finally a test set to determine whether the optimized model has any meaning (Brereton, 2007). Of course, this requires that one collect data for a large number of samples so that a representative population exists for each of the three subsets.
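The three-way split described above can be sketched as follows (the fractions and function name are illustrative; in practice the split should also respect class balance):

```python
import numpy as np

def three_way_split(n_samples, frac_train=0.5, frac_opt=0.25, seed=0):
    """Randomly partition sample indices into training, optimization,
    and test sets; whatever remains after the first two fractions
    becomes the test set."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(frac_train * n_samples)
    n_opt = int(frac_opt * n_samples)
    return (idx[:n_train],                  # build the model
            idx[n_train:n_train + n_opt],   # tune alignment/feature selection
            idx[n_train + n_opt:])          # held back until the very end

train, opt, test = three_way_split(100)
```

The key discipline is that the test set is touched exactly once, after all processing parameters are frozen.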

Applications and examples
After applying the appropriate pre-processing, different chemometric techniques can be applied according to the aim of the study. Pattern recognition is one of the chemometric approaches most used in analytical chemistry, and this holds for separations data. It can generally be divided into two classes, unsupervised and supervised pattern recognition, and is often preceded by exploratory data analysis (Otto, 2007; Brereton, 2007).
Exploratory data analysis aims to extract important information, detect outliers, and identify relationships between samples; its use is recommended prior to the application of other chemometric techniques. Examples of exploratory tools applied to separations data include principal component analysis (PCA) (de la Mata-Espinosa et al., 2011a; Ruiz-Samblas et al., 2011) and factor analysis (Stanimirova et al., 2011).
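As a minimal PCA sketch on a hypothetical matrix of chromatograms (one row per sample, one column per retention-time point; the data are synthetic and purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Twelve synthetic chromatograms of 500 time points; half of the
# samples carry an extra Gaussian peak centered at point 250.
rng = np.random.default_rng(1)
chromatograms = rng.normal(scale=0.1, size=(12, 500))
peak = np.exp(-0.5 * ((np.arange(500) - 250.0) / 5.0) ** 2)
chromatograms[:6] += peak

pca = PCA(n_components=2)
scores = pca.fit_transform(chromatograms)   # sample coordinates for a scores plot
loadings = pca.components_                  # retention times driving each PC
```

A scores plot of the first two PCs would show the peak-bearing samples clustering apart from the rest, and the PC1 loading vector would point back to the retention times responsible, which is exactly the exploratory information sought.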
Unsupervised pattern recognition techniques uncover patterns within a data set without a priori class assignment of samples. Here, the objective is to find patterns in the data which allow grouping of similar samples using, for example, cluster analysis, which has been applied to separations data by Reid et al. (2007). When supervised pattern recognition is used, the classes of samples in a training set are known and used to calibrate a model, which is then used to predict the class assignments of unknown samples. Examples include linear discriminant analysis (LDA) and partial least squares-discriminant analysis (PLS-DA) (de la Mata-Espinosa et al., 2011b; Zorzetti et al., 2011; Sinkov et al., 2011b). In a study performed by Sinkov et al., two alignment techniques for chromatographic data were compared. The data comprised raw GC-MS chromatograms of simulated arson debris, where some samples contained different types of gasoline, weathered to different extents, spiked into debris samples which themselves exhibited a high degree of variability in their chemical composition. The goal was to build a PLS-DA model that could correctly classify debris samples based on whether or not they contained gasoline (Figure 7). As can be seen, the alignment algorithm used has a direct impact on the quality of the predictions: in Figure 7A there are multiple false positives, false negatives, and ambiguous samples, whereas in Figure 7B all samples are classified correctly and there are no ambiguous samples.
Another example of applying chemometrics to separations data is depicted in Figures 8 and 9. Here, interval PLS (iPLS) was applied to blends of oils in order to quantify the relative concentration of olive oil in the samples (de la Mata-Espinosa et al., 2011b). iPLS divides the data into a number of intervals and then calculates a PLS model for each interval. In this example, the two peak segments which presented the lowest root mean square error of cross-validation (RMSECV) were used for building the final PLS model.
As mentioned in Section 3.3.2, PARAFAC is a chemometric tool for multidimensional data treatment. The scores and loadings obtained with PARAFAC can be used in two-way models for data exploration and quantitative analysis (Vosough et al., 2010). When small deviations from trilinearity exist within the data, usually due to relatively small shifts in retention time in the case of separations data, a modified version of PARAFAC called PARAFAC2 is recommended (Bro et al., 1999).
Like PARAFAC, PARAFAC2 decomposes raw data into loading and score matrices, but without the imposition of trilinearity. Even without this constraint, the PARAFAC2 model preserves the property of uniqueness that is so advantageous in PARAFAC. Thus, analyte profiles and concentrations can be estimated by PARAFAC2 even if chromatographic alignment is not perfect (Amigo et al., 2008; Skov et al., 2009).

Conclusions
The analyst must choose from a plethora of methods for processing separations data, a potentially daunting task. It is our hope that this review will help chromatographers entertaining thoughts of applying chemometrics to their data understand what they must consider when choosing how to prepare their data. Likewise, it is hoped that we have informed chemometricians of some of the specific challenges associated with the processing of chromatographic data and the origins of those limitations. In the development of a chemometric model for the interpretation of separations data, there are numerous opportunities for missteps that will exclude key information from the model and/or generate meaningless results. However, when due care is taken, there are also many opportunities to apply chemometric techniques to transform the rich data generated by these powerful analytical tools into valuable information, effectively and efficiently.

Fig. 1. LC chromatograms of edible oils showing a high degree of variation in baseline.

Fig. 5. Deconvolution of overlapping peaks. The black, solid trace represents the analytical signal observed at the detector, which is the sum of the four peaks represented by dashed lines.

Fig. 6. Models constructed from the same data set using different numbers of top-ranked variables. (A) Too few variables; (B) too many variables; (C) optimal number of variables.

Fig. 7. PLS-DA models for identifying gasoline in simulated arson debris, derived from the same raw data but aligned with different techniques. (A) Feature-based alignment; (B) deuterated alkane ladder-based alignment. All other treatment and model construction algorithms were the same in both cases. Hollow markers indicate data in the training set, while filled markers indicate data in the validation set. Circles represent debris containing gasoline; triangles represent gasoline-free debris. Reprinted from Sinkov et al., 2011b, with permission.

Fig. 8. Feature selection using iPLS. Segments in green showed the lowest RMSECV and were thus used to construct the final model. Reprinted from de la Mata-Espinosa et al., 2011b, with permission.

Fig. 9. Predicted vs. actual % olive oil using the PLS model constructed based on the results in Figure 8. Reprinted from de la Mata-Espinosa et al., 2011b, with permission.