Statistical Analysis of Chemical Sensor Data

Chemical sensors measure and quantify substances via their associated chemical or physical response, thus providing data that can be analyzed to address a scientific question of interest (Eggins; 2002). Used in a variety of applications from monitoring to medicine, chemical sensors vary vastly by construction, style, format, size and dimension, and complexity. The common, underlying feature of these sensors lies in the associated data, which are abundant with technical and structural complexities, making statistical analysis a difficult task. These data further share a common need to be measured, analyzed, and interpreted properly so that the resulting inference is accurate. There are many image analysis algorithms available and amenable to a myriad of chemical analysis problems, thus potentially applicable to chemical sensor data problems in particular. By applying these tools to chemical sensor data, we can optimize and evaluate a chemical sensor’s ability to perform its intended tasks. This chapter is designed to give an overview of the modern statistical algorithms that are commonly used when designing and analyzing chemical sensor experiments. Without focusing the discussion around a specific chemical sensor platform, our goal is to provide a general framework that will be applicable, to some degree, for all chemical sensor data. From the beginning to the end of an experiment, various statistical methods can be employed to improve or understand the current scientific analysis. We decompose a general experiment into several facets and provide the motivation for potential statistical methods that can be applied within each component. Section 2 describes the pre-processing techniques that are available for summarizing the low-level image or signal data so that the subsequent scientific questions can be properly addressed. In this section, we particularly focus on removing background noise in order to isolate the chemical sensor signal data, quantifying this data, and normalizing it so that the resulting data are scalable across conditions. Section 3 introduces the higher-level statistical approaches that are used to analyze the pre-processed data, and Section 4 describes the statistical computational tools available for use to perform the analyses. Finally, Section 5 demonstrates and motivates the significance of these methods via a chemical sensor case study, and Section 6 concludes the chapter with discussion and summary.


Introduction
Chemical sensors measure and quantify substances via their associated chemical or physical response, thus providing data that can be analyzed to address a scientific question of interest (Eggins;2002).Used in a variety of applications from monitoring to medicine, chemical sensors vary vastly by construction, style, format, size and dimension, and complexity.The common, underlying feature of these sensors lies in the associated data, which are abundant with technical and structural complexities, making statistical analysis a difficult task.These data further share a common need to be measured, analyzed, and interpreted properly so that the resulting inference is accurate.There are many image analysis algorithms available and amenable to a myriad of chemical analysis problems, thus potentially applicable to chemical sensor data problems in particular.By applying these tools to chemical sensor data, we can optimize and evaluate a chemical sensor's ability to perform its intended tasks.This chapter is designed to give an overview of the modern statistical algorithms that are commonly used when designing and analyzing chemical sensor experiments.Without focusing the discussion around a specific chemical sensor platform, our goal is to provide a general framework that will be applicable, to some degree, for all chemical sensor data.From the beginning to the end of an experiment, various statistical methods can be employed to improve or understand the current scientific analysis.We decompose a general experiment into several facets and provide the motivation for potential statistical methods that can be applied within each component.Section 2 describes the pre-processing techniques that are available for summarizing the low-level image or signal data so that the subsequent scientific questions can be properly addressed.In this section, we particularly focus on removing background noise in order to isolate the chemical sensor signal data, quantifying this data, and normalizing it so that the resulting data are scalable across conditions.Section 3 introduces the higher-level statistical approaches that are used to analyze the pre-processed data, and Section 4 describes the statistical computational tools available for use to perform the analyses.Finally, Section 5 demonstrates and motivates the significance of these methods via a chemical sensor case study, and Section 6 concludes the chapter with discussion and summary.

Pre-processing
For the analysis of chemical sensor data, many of the suggested means to resolve low-level analysis problems have been posed by computer scientists or engineers, with little statistical contribution or consideration, and they remain open problems because of significant drawbacks in the proposed approaches.Alternatively, problems have been addressed in direct association with chemical sensor data analysis without recognizing that the resulting data represent a special case from a larger context of image data whose structure has been considered by statisticians.Scientists, for example, are interested in better tools that allow for a completely automated approach to detect chemical changes.They, therefore, recognize the need for minimal inherent noise in order to gain in data reproducibility and trust the obtained summary information for subsequent statistical analysis.Statistical and/or data mining tools, and machine learning algorithms are all methods that scientists can use to remove the noise from the meaningful signal contained within an experiment.Similar to other high throughput biological experiments, several initial steps are often necessary before analyzing the data for a scientific question.Natale et al. (2006) discuss several preprocessing steps including feature extraction, zero-centered scaling, autoscaling, and normalization.Jurs et al. (2000) outline these techniques for chemical sensor arrays, and further include discussion on background or baseline subtraction, and linearization.Meanwhile, a good introduction to the general notion of pre-processing is provided in Gentleman (2005).Although Gentleman (2005) introduce these methods motivated by a different data source, many of their techniques are general enough for chemical sensor data.These issues are all substantial problems that need to be addressed, because any subsequent chemical sensor data analyses are contingent particularly on appropriate and statistically sound low-level procedures.

Background correction
Background noise associated with chemical sensors can occur for any number of reasons, such as the nature of the chemical processes and the machines used to scan or quantify the sensor.Irrespective of the cause, the background effect must be removed in order for the data of interest to be accurate and informative.A variety of background correction techniques exist for various data forms, depending on the dimensionality and structure of the data.A general approach for background correction is to simply subtract the blank sample response, i.e. the response which is obtained before any sample is placed within the sensor (see Jurs et al. (2000), or Sellers et al. (2007) for an analogous approach).Other general approaches subtract either the global minimum from the data, or perform some local Winsorization at a low-percentile value.Such approaches are generally accepted for sensor data that appear in spectral form (e.g.Coombes et al. (2005); Lin et al. (2005); Morris et al. (2005)).While analogous approaches can likewise be applied to two-dimensional data, other methods have also been suggested, which include filtering in the wavelet domain (Coombes et al.;2003), and asymmetric least squares splines regression (Befekadu et al.;2008).

Sensor detection and quantification
In the event of fluorescent-or imaging-based sensor data, there are numerous computer algorithms and summary statistics available to identify the location and size of these features in the raw image data.Peak detection and quantification, for example, are of interest as they relate to spectral data.Various methods to achieve peak detection have been proposed via simple to complex means.One can recognize these methodological developments over time as further study was devoted to this area.Coombes et al. (2003) first suggested that peaks be detected by noting the locations where a change in slope occurs, and later fine-tuned the approach by instead considering the maximum value within the kth nearest neighbors; see Coombes et al. (2005), and independent discussion by Fushiki et al. (2006).More advanced proposals are to either apply an undecimated discrete wavelet transform (Morris et al.;2005), or a continuous wavelet transform (Du et al.;2006).These methods better eliminate the risk of detecting false positive peaks (i.e.data believed to represent peaks from a true signal when the data are actually, say for example, representative of residual noise).In a two-dimensional chemical sensor setting, there are several classes of algorithms that can be applied for spot detection and quantification.Determining the locations and boundaries for chemical sensors in two dimensions falls under the general research area of image segmentation.Within image segmentation there are four main approaches: threshold techniques, boundary-based techniques, region-based methods, and hybrid techniques that combine boundary and region criteria (Adams and Bischof;1994).Threshold techniques are based on the theory that all pixels whose values lie within a certain range belong to one class.This method neglects spatial information within the image and, in general, does not work well with noisy or blurred images.Boundary-based methods are motivated by the postulate that pixel values change rapidly at the boundary between two regions.Such methods apply a gradient operator in order to determine rapid changes in intensity values.High values in a gradient image provide candidates for region boundaries which must then be modified to produce closed curves that delineate the spot boundaries.The conversion of edge pixel candidates to boundaries of the regions of interest is often a difficult task.The complement of the boundary-based approach is to work within the region of interest, e.g. the chemical sensor.Region-based methods work under the theory that neighboring pixels within the region have similar values.This leads to the class of algorithms known as "region growing", of which the "split and merge" techniques are popular.In this technique, the general procedure is to compare one pixel to its neighbor.If some criterion of homogeneity is satisfied, then that pixel is said to belong to the same class as one or more of its neighbors.As expected, the choice of the homogeneity criterion is critical for even moderate success and can be highly deceiving in the presence of noise.Finally, the class of hybrid techniques that combine boundary and region criteria includes morphological watershed segmentation and variable-order surface fitting.The watershed method is generally applied to the gradient of the image.In this case, the gradient image can be viewed as a topography map with boundaries between the regions represented as "ridges".Segmentation is then equivalent to "flooding" the topography from local minima with region boundaries erected to keep water from different minima exclusive.Unlike the boundary-based methods above, the watershed is guaranteed to produce closed boundaries even if the transitions between regions are of variable strength or sharpness.Such hybrid techniques, like the watershed method, encounter difficulties with chemical sensor images in which regions are both noisy or have blurred or indistinct boundaries.A popular alternative is seeded region growing (SRG).This method is based on the similarity of pixels within regions but has an algorithm similar to the watershed method.SRG is controlled by choosing a small number of pixels or regions called "seeds".These seeds will control the location and formation of the regions in which the image will be segmented.The number of seeds determines what is a feature and what is irrelevant or noise-embedded.Once given the seeds, SRG divides the image into regions such that each connected region component intersects with exactly one of the seeds.The choice of the number of seeds is crucial to this algorithm's success.Fortunately, with many chemical sensor experiments, the number of chemical sensors, and thus the seed, is known beforehand; see Adams and Bischof (1994) for further details.Feature quantification is also an important issue, as there are several options that aid in reducing data dimensionality and complexity.At the same time, one ideally wants to measure a feature in such a way that captures an optimal amount of sensor information.Pardo and 329 Statistical Analysis of Chemical Sensor Data www.intechopen.comSberveglieri (2007) compare the performance of five feature summaries in chemical sensor arrays: the relative change in resistance; the area under the curve over gas adsorption, and gas desorption; and the phase space integral over adsorption, and desorption.In their study, while they do not attain uniform results across the various datasets, they find (on average) that the phase integral over desorption performed best.Further, the integral and phase space integral over desorption performed better than the analogous computations associated with adsorption.These results are consistent with other applied fields where such feature quantification is performed by computing the associated area under the curve.Carmel et al. (2003) instead argue that focusing on such features (such as the difference between the peak and baseline, the area under the curve, the area under curve left of the peak, or the time from the beginning to the peak of a signal) does not fully capture certain sensor properties, thus limiting one's ability to perform analyses.Focusing on transient signals, the authors fit various parametric models to chemical sensors for electronic noses, namely exponential, Lorentzian, and double-sigmoid models.Their results show that the double-sigmoid models fit optimally, followed by the Lorentzian model, with the exponential model being the worst of the three but still with decent performance.The computational time needed to fit these models, however, showed that the Lorentzian and exponential models were estimated far more quickly than for the double-sigmoid model.This makes sense because the double-sigmoid model has nine parameters that require estimation, while the Lorentzian and exponential models only have two parameters.Given this tradeoff, the authors propose using the Lorentzian model to analyze such chemical sensor data.

Normalization
In a normalization step, the goal is to remove obscuring sources of variation to give accurate measurements of the desired signal.Normalization could proceed in a manner similar to that described in Sellers et al. (2007) to remove known possible sources of variation, where one can obtain associated response data based on the presence of these factors in the design.Linearization can also be performed by considering the engineering-derived equations that drive the signal (see, e.g., Robins et al. (2005)).Some chemical sensors are ruled by a power law relationship between sensor signal and analyte concentration; this is often the case, for example, with metal-oxide semiconductor gas sensors (Natale et al.;2006).Using least squares approaches, it is possible to estimate the parameters in a power law relationship.This is a popular approach when preprocessing chemical sensor data because many of the subsequent analyses (e.g.linear discriminant analysis, principal component analysis, principal component regression, partial least squares) assume a linear relationship between sensor response and sample class (Jurs et al.;2000).Relative scaling is a common practice in the normalization of chemical sensor arrays, however the approach by which it is performed may vary.Options include dividing the signal by either the maximum signal value, the Euclidean norm from the signal, or the maximum value from a reference signal.In any respect, relative scaling serves to normalize the data in order to be on the same scale.
the same empirical distribution of the chemical sensor intensity of each profile (e.g. the profile for each experimental unit will have the same quartiles, etc).The algorithm proposed in Bolstad et al. (2003) is designed so that all profiles are matched (aligned) with the empirical distribution of the averaged sample profiles.

Low-level analysis discussion
Any or all low-level analysis procedures can be performed to obtain summary information on the raw chemical sensor data.The order of operations for these algorithms, however, are inconsistent and generally unrecoverable.As a result, the resulting preprocessed chemical sensor data can vary, thus potentially causing severe repercussions in the high-level analysis (see Baggerly et al. (2004)).To this end, one should be mindful of the low-level analyses performed (along with their order of operations) and comfortable with their use in data preprocessing.Nonetheless, data preprocessing results in the S × I summary matrix, X =( x si ), where x si denotes the normalized measure of sensor s in sample i.This data matrix will be used for subsequent statistical analysis.

Data analysis
There are several approaches that can be pursued to analyze the preprocessed data, depending on the question of interest.This section introduces these high-level, downstream methods.Here, we assume that the resulting preprocessed data matrix has rows associated with the chemical sensors used for the analysis, while the columns refer to the samples or patients.Jurs et al. (2000) classifies several methods as either statistical (including linear discriminant analysis (LDA), and principal component analysis (PCA)), or using neural networks while cluster analysis tools are classified separately.Given the popularity of LDA and PCA, we focus on these statistical methods here; see Jurs et al. (2000) for added discussion regarding various alternatives.Linear discriminant analysis (LDA) is a statistical method (credited to Fisher) for dimension reduction and potential classification in that it distinguishes between two or more groups.The discriminant functions are derived from means and covariance matrices, thus working to maximize the distance between groups (relative to the variance within respective groups).While LDA is a popular dimension reduction technique for its natural approach, it tends to overfit when the ratio of training samples to dimensionality is small; see Wang et al. (2004).Jurs et al. (2000) concur that one needs a "relatively large number of samples from each class in the training data" that is representative of the population.Principal component analysis (PCA) is an alternative statistical approach for dimension reduction.Invented by Karl Pearson, PCA performs singular value decomposition on the data matrix, X, where the resulting terms relate to the eigenvalue-eigenvector form of X ′ X and XX ′ , respectively.PCA is a popular choice in chemical sensor analysis because the first two principal components often account for at least 80% of the chemical sensor data variance (Jurs et al.;2000), and is more robust to overfitting than LDA (Wang et al.;2004).This method, however, may not successfully classify groups.Low-variance sensors, or nonlinear or nonadditive sensors can make classification difficult (Jurs et al.;2000;Wang et al.;2004).Recognizing the limitations of these statistical methods, Wang et al. (2004) propose a "hybrid" model, termed Principal Discriminant Analysis (PDA).The hybrid matrix, can be interpreted as a weighted average of the within-and between-group matrices from the LDA solution, and the total data covariance associated with PCA.The optimal ǫ ∈ [0, 1] is attained via cross-validation, where ǫ = 0 attains the LDA eigenvalues, and ǫ = 1 produces the PCA projection.As a result, PDA provides a compromise between the popular LDA and PCA.Similar to Jurs et al. (2000), we explore supervised and unsupervised machine learning/ statistical techniques to understand large complex datasets.Specifically, we examine linear modeling techniques to determine significantly different chemical sensors between two or more populations (e.g.neural nets as in Hashem et al. (1995) 1 ).We also explore classification (supervised) and clustering (unsupervised) techniques to explore the similarities and differences between samples or between sensors.Within classification methods, we explore methods to both build and validate (e.g.data splitting/ cross validation) the classification schemes.Ultimately, we use the concepts of sensitivity and specificity to choose among a class of classification schemes.

Multiple testing
While LDA, PCA, and PDA all work with the entire dataset, commonly researchers would like to identify a subset of chemical sensors that are associated with the outcome.In this sense, the researchers are performing a data reduction, where the goal is to choose a subset of chemical sensors that are related or associated with the outcome.In this setting, the researcher commonly uses hypothesis testing to choose the important subset of chemical sensors.Hypothesis testing seeks to obtain statistically significant results regarding a question of interest.In this process, the null hypothesis (H 0 ) represents the status quo statement while the alternative hypothesis (usually denoted as H 1 or H a ) defines that which is to be potentially proven or determined.When performing a hypothesis test, one wants to make a correct decision.There are, however, four possible scenarios that can occur when performing such a test; see Table 1.Two scenarios represent correct decisions, while the other two are incorrect decisions or "errors": (1) when one rejects the null hypothesis when it is actually true, and (2) when one does not reject the null hypothesis when it is actually false.The probability associated with the first scenario is referred to as Type I error (denoted α), and the second scenario's probability is termed Type II error (denoted β).For completeness in this discussion, statistical power refers to the probability of rejecting the null hypothesis when (in fact) the null hypothesis is false.In other words, statistical power equals one minus the Type II error (i.e. 1 − β).Even when performing one hypothesis test, one wants to minimize the error probabilities.
We assume that our chemical sensor is multivariate, in the sense that each experimental unit (i. Table 3.A summary of results from analyzing multiple hypothesis tests, where each cell represents the number (counts) in each category with m total tests.light of the multiple chemical sensor levels measured in each experimental unit.In this setting, researchers and statisticians usually design their tests such that rejecting H 0 will yield discoveries or chemical sensors of interest.For example, when testing case samples against control samples on a chemical sensor platform, we would like to configure our hypothesis tests such that we reject H 0 for chemical sensors that are distinct between the cases and controls; see Section 5. Commonly these hypothesis tests are performed using (linear) regression models.In these regressions, we estimate parameters designed to measure the effects of a chemical sensor in relation to an outcome.Common outcomes might be survival times or group membership (case vs. control).Using estimates of these parameters for a given chemical sensor, we can determine its significance.The interested reader is referred to Rawlings et al. (1998) and Cohen (2003) for comprehensive discussion of linear models and associated hypothesis testing.
In light of this discussion for multiple sensors, we can define our Type I and Type II errors in terms of sensitivity and specificity.That is, sensitivity is defined as sensitivity = number of true positives number of true positives + number of false negatives , while specificity is defined as specificity = number of true negatives number of true negatives + number of false positives . ( This situation is also summarized in Table 2.Note that sensitivity and specificity are estimated values because, in any experiment, we do not know the number of true positives or true negatives.Our goal when performing multiple testing and classification is to maximize the sensitivity and specificity, thus limiting the number of errors committed.Table 2 can be refined in light of performing multiple hypothesis tests.In Table 3, we generalize the hypothesis testing in light of performing m hypothesis tests.Note that in Table 3, U, V, T, Q denote random variables (unknowns), while we assume that m is a fixed unknown quantity of hypothesis tests.

333
Statistical Analysis of Chemical Sensor Data

www.intechopen.com
In light of performing multiple hypothesis tests (one for each sensor), we need a method to control the Type I error across the multiple tests.A first attempt for controlling Type I error is to perform a Bonferroni correction where, given m hypothesis tests, the measure for statistical significance is now attained if the associated p-value is less than α/m.I n other words, the significance level is now scaled by the number of hypothesis tests.While this approach successfully adjusts for multiple tests, the procedure is far too stringent!The following subsections detail alternative Type I error rates and the available methods to control those errors, where Table 3 provides the associated notation in terms of probabilities (Pr): the familywise error rate, FWER = Pr(V ≥ 1); the k-familywise error rate, k-FWER = Pr(V ≥ k); and the false discovery rate (FDR), which is E(V/R) if R > 0, or otherwise 0 with E() denoting the expected value function.

Familywise error rates
The k-FWER error rate is a generalized version of the familywise error rate (FWER).Control of FWER refers to controlling the probability of committing one or more false discoveries.
If we let V denote the number of false positives from m hypothesis tests, then notationally (according to Lehmann and Romano (2005)), α control of FWER can be expressed as or equivalently, Note that α is usually chosen to be small, e.g., 0.05.Often Equation ( 3) is abbreviated as FWER ≤ α.I nk-FWER, the equation becomes where k ≥ 1 and α are usually determined prior to the analysis.Similar to FWER, control of k-FWER can be expressed as k-FWER ≤ α.Note that there is the potential for ambiguity in control of k-FWER since, occasionally (as in Gentleman ( 2005)), k-FWER may be expressed as The adjusted Bonferroni method to control k-FWER is a generalized version of the Bonferroni correction designed to control FWER (Lehmann and Romano;2005).The Bonferroni correction is designed to control the FWER at level α by doing each individual test at level α/m, where m is the number of tests.The adjustment given in Lehmann and Romano (2005) to control k-FWER at α is done by performing each test at level kα/m.By performing each test at this level, the probability against k or more false positives is no larger than α; that is, k-FWER ≤ α.The proof is supplied in Lehmann and Romano (2005) and is a generalization of the proof for the original Bonferroni method designed to control FWER.For a description of other methods to control k-FWER and a power comparison of k-FWER methods, see Miecznikowski et al. (2011).

False discovery rate
Multiple statistical testing procedures began to be reexamined in the early 1990s with the advent of high-throughput genomic technologies.The Benjamini and Hochberg (BH) method was proposed to control the false discovery rate (FDR), or the expected rate of false test positives (see Benjamini and Hochberg (1995)).In the BH multiple testing procedure, the FDR is controlled by the following scheme: 2. Denote t = p (k) for the largest k such that p (k) ≤ kα m .3. Reject all null hypotheses, H 0i , for which p i ≤ t.
Note that we define FDR such that where E denotes the expected value function.Benjamini and Hochberg (1995) proves that, if the above procedure is applied, FDR ≤ α.Storey (2002) further show that, for p-value threshold t, where π is the probability that an alternative hypothesis is true, and F(t) is the distribution of p-values given the alternative.FDR performance has been evaluated for sensor detection given a variety of scenarios, for example, in the presence of correlation (Benjamini and Yekutieli;2001;Shao and Tseng;2007).Importantly, note that FDR analysis does not control what Genovese and Wasserman (2004) call "the realized FDR" (rFDR), i.e. the number of false rejections V divided by the number of rejections R (assuming at least one rejection) which, in fact, can be quite variable (as shown in Gold et al. (2009)).

Classification
This section provides an overview of classification models.Throughout this section, we assume that the outcome variable of interest is binary.This is commonly the situation with case/control experiments where, for example, the goal may be to predict the presence or absence of a disease.The simplest and most direct approach to classification with a binary outcome variable is to estimate the regression function, r(x)=E(Y|X = x), and use the classification rule, ĥ(x)= 1i fr(x) > 0.5 0 otherwise.(8) Here, the simplest regression model is the linear regression model, where the errors, ǫ, have mean 0. A simple example of this classifier is provided in Section 5.
Other examples of classifiers include linear discriminant analysis, support vector machines, and ensemble classifiers using bootstrapping and bagging techniques; see Hastie et al. (2005) and Wasserman (2004) for a more complete treatment of classification models.Similar to hypothesis testing, we want an accurate classifier that commits relatively few errors; i.e. we would like a classifier with a high sensitivity and specificity (see Equations ( 1) and ( 2)).To estimate sensitivity and specificity for a given classification model, we commonly use cross validation methods, as described in the next section.

335
Statistical Analysis of Chemical Sensor Data www.intechopen.com

Cross validation
Cross validation can be classified under the general realm of sample splitting.Its objective is to obtain an estimate of the prediction qualities of the model when using that same data to build the prediction model.The simplest version of cross-validation involves randomly splitting the data into two pieces: the training set, and the validation set.The classifier is constructed from the training set, and the associated error is estimated using the validation set; the error is defined as Error = number of misclassifications/ number of predictions.( 10) Two extensions of this method are g-fold cross validation, and leave-one-out cross validation (LOOCV).Note that LOOCV is a special case of g-fold cross validation, where g is equal to the number of objects in the dataset.As described in Wasserman (2004) in a g-fold cross validation, we do the following: 1. Randomly divide the data into G groups of approximately equal size.
2. For g = 1toG, (a) delete group g from the data.(b) fit or compute the classifier from the remaining data.
(c) use the classifier to predict the data in group g, and let L denote that observed error rate.
3. Let the overall error rate be estimated from averaging over the error rates from the previous step.
See Section 3.2.2 for a discussion of other summary measures (e.g.sensitivity and specificity) that are commonly estimated with cross validation.

Summary measures in a population
Naturally, we want our classifiers to make accurate predictions.We have seen that specificity and sensitivity as estimated via cross validation are reasonable measures to summarize our classification models.In this section, however, we highlight some of the measures used to evaluate our classification models in a population.The measures introduced in this section are often crucial in deciding the utility of a chemical sensor.When determining a classifier's effectiveness, analysts usually calculate the sensitivity and specificity using cross-validation.To understand the potential utility of the chemical sensor-derived classifier, however, it is important to calculate the positive predictive value (PPV) and negative predictive value (NPV).The PPV is the proportion of subjects with positive test results who are correctly diagnosed.It reflects the probability that a positive test reflects the truly positive underlying condition.The PPV depends heavily on the prevalence of the outcome of interest, which is usually unknown.Using Bayes Theorem (see Wasserman (2004)) and Table 2, we can derive the positive predictive value as Note that we define the prevalence in terms of epidemiologic factors, i.e. prevalence.Prevalence (of disease) is the total number of (disease) cases in the population divided by the number of individuals in the population.Prevalence is (essentially) an estimate of how common the underlying condition is within a population over a certain period of time.

336
Advances in Chemical Sensors

www.intechopen.com
Defining a as the number of individuals in a given population with the disease at a given time, and b as the number of individuals in the same population at risk of developing the disease at this given time (not including those already with the disease), the prevalence is specified by Similarly, we can define the NPV as the proportion of subjects with a negative test result who are correctly diagnosed.A high NPV means that, when the test yields a negative result, it is uncommon that the result should have been positive.Mathematically, the NPV is computed as While sensitivity and specificity play a role in assessing a chemical sensor, we stress that NPV and PPV are often the measures used when deciding the clinical and medical utility of a potential chemical sensor panel.A thorough handling of the topic of estimation with regard to sensitivity, specificity, PPV, and NPV is provided in Pepe (2004), Cai et al. (2006), and Pepe et al. (2008).

Software development
Bioconductor (Gentleman et al.;2004) and R (R Development Core Team; 2008) are two statistical computing tools that can be used for statistical programming and data analysis.
Both are freeware tools that are downloadable from the internet.Researchers developing novel chemical sensors should consider writing a computational package for analyzing their chemical sensor data in R.This will enable the statistical methods and algorithms to be used by the scientific community.For various examples of R packages, see Gaile et al. (2010) and Gentleman (2005).For example, Chandrasekhar et al. (2009) authored a software package specific to Xerogel chemical sensor images such as those shown in Figure 1 (A).These images and the Xerogel technology are further described in Chandrasekhar et al. (2009).The Xerogel R package consists of routines that import the tagged image file format (TIFF) image, read the image into a matrix, binarize the image matrix, identify the position and structure of the spots, and return statistics such as the mean, median, and total intensity for these spots.The summary statistics represent the results after preprocessing and, ultimately, provide information on how the intensity of light varies with varying amounts of the volatile organic compounds (VOCs).A summary set of images demonstrating the pre-processing of a representative Xerogel image is shown in Figure 1 (A)-(D).

Case study
In this case study, we examine a subset of the data from Schröder et al. (2010) which represents a chemical sensor dataset designed to classify pancreatic cancer patients from normal patients.This dataset is publicly available and can be downloaded from ArrayExpress (see Parkinson et al. (2009)) with ID accession number E-MEXP-3006; see http://www.ebi.ac.uk/arrayexpress/.This dataset is representative of a two dimensional chemical sensor dataset.The experiment employs protein antibody microarrays with two color channels on an array consisting of 1800 features (proteins).The preprocessing for this data is consistent with the methods described in Section 2. The data in this study were preprocessed using the following scheme: • Background Correction -as recommended in Ritchie et al. (2007), a convolution of normal and exponential distributions is fitted to the foreground intensities using the background intensities as a covariate, and the expected signal given the observed foreground becomes the corrected intensity.
• Normalizationlowess is applied as proposed in Yang et al. (2002) and Smyth and Speed (2003).Here, the signal in each array is adjusted to account for the intensity bias with a nonlinear curve fitting method, lowess.
All preprocessing steps were performed using the limma package in R (see Smyth (2004)).
The experimental design for this experiment is fully described in Schröder et al. (2010).For our subset of data, we have a total of 12 subjects, specifically three patients in each of four groups (male controls, female controls, males with pancreatic cancer, and females with pancreatic cancer).The patients were named in terms of their disease status (healthy = h, cancer = c) and gender (male = m, female = f); e.g.hf_1 refers to the sample for the first healthy female, while cm_1 refers to the sample for the first pancreatic cancer male.Midstream urine samples were collected from each patient and pH was adjusted to 7. After sample preparation, the samples were dye-labeled and incubated to antibody microarrays containing 1,800 features.Figure 2 shows a representative fluorescence array from this study where a urine sample and reference consisting of a pool of samples from diseased and healthy subjects were labeled with different dyes.After preprocessing, we arrive at a matrix containing 1678 rows (proteins) and 12 columns (samples), where each entry represents the logarithmic intensity (amount) of the protein present in the sample (relative to the reference channel).The heatmap representing this data is shown in Figure 3.
In performing an exploratory data analysis, we looked at clustering the samples as well as the distance between the samples.The clustering results in terms of a hierarchical clustering are shown in Figure 4. From this figure, we see that sample hf_2 may represent an outlier in this study; this is further confirmed in Figure 5. From this figure, we see that hf_2 is widely separated from the other samples in the study (white band).Note that the Euclidean distance metric was used to calculate the distance between each sample.For an overview of distance metrics, see Chapter 12 of Gentleman (2005).This potential outlier, hf_2, is also confirmed by studying the protein profile in Figure 3, where we see a pattern that is inconsistent with the other samples.
To further the analysis in this case study, we build linear model(s) to discover potential differentially expressed proteins between cancer patients and normal patients; see Section 3.1.We explored univariate protein models and multivariate protein models that adjusted for gender.Building these models allows us to test each protein for association with disease status.In particular, we let y ji = μ j + β dz x dz + ǫ ji (14) denote the observed protein level for protein j in sample i (i = 1, . . ., 12), with x dz equaling 1 if sample j is a cancer sample and 0 otherwise and ǫ ji is assumed to be normally distributed with mean 0 and variance σ 2 j .In this model, we are interested in estimating the parameters for each protein.For protein j, the pancreatic cancer samples have a mean of μ j + β dz while the healthy samples have a mean of μ j .We are especially interested in proteins where β dz is significantly different from 0, as this indicates proteins that are significantly different between diseased patients and healthy patients.Accordingly, for each protein, we wish to test the hypothesis, Using an empirical Bayes test described in Smyth (2004), we obtain a test statistic and p-value for each protein corresponding to the test in Equation ( 15).Using a Šidàk control method described in Miecznikowski et al. (2011), we control k-FWER such that the probability of committing no more than five false positives is no larger than 0.05.Under this scheme, we discover three significant proteins; see Table 4.
In Equation ( 16), we introduce a more complex model.We include the explanatory variable x gen , which is 1 if sample j is a female and 0 otherwise.By incorporating a variable for the patient's gender, this model is more complex than the model described in Equation ( 14).This model is described as Table 4. Table displaying the significant proteins from an analysis with univariate and multivariate models as described in Equations ( 14) and ( 16), respectively.
After estimating the parameters in Equation ( 16), we are also interested in testing the null hypothesis in Equation ( 15).With this model, however, the parameter β dz represents proteins that are significantly different in disease states (case/control) after adjusting for potential gender biases.For protein j, the female pancreatic cancer patients have estimates, μ j + β gen + β dz ; while the female healthy patients have estimates, μ j + β gen .Similarly, the male pancreatic cancer patients have estimates μ j + β dz , while the male healthy patients have estimates, μ j .Using the model described in Equation ( 16) for each protein, we obtain the test statistic and p-value for β dz from an empirical Bayes test (Smyth;2004).As with the model in Equation ( 14), we use a Šidàk method to control k-FWER such that the probability of committing 10 or more false positives to be no larger than 0.05.Under this scheme, we obtain four significant proteins.Figure 6 displays the heatmap of the significant proteins under this setting and   Fig. 6.Heatmap of the significant proteins as determined in case study using Equation ( 16) and a k-FWER error controlling scheme.
After determining the significant proteins in this study, it is reasonable to examine models for prediction.We explored using a logistic regression model as described in Section 3.2.We choose a classification model using the most significant protein (Prot1) as determined from fitting the model in Equation ( 14).The fitted logistic regression equation is r(Prot1)= exp(−10.303* Prot1) 1 + exp(−10.303* Prot1) where Prot1 is the intensity of the most significant protein after fitting Equation ( 14); the profile for this protein is shown in Row 1 in Figure 6.Our logistic classifier is then specified by ĥ(x)= Disease if r(x) > 0.5 Healthy if r(x) ≤ 0.5.( 18) Using a leave-one-out cross-validation method as described in Section 3.2.1,we obtain a sensitivity estimate of 83.3% (5/6) and a specificity estimate of 100% (6/6); the misclassified sample is cf_2.As seen in Row 1 of Figure 6, cf_2 has a value for Prot1 indicating a pattern more aligned with the healthy samples.Similar to Schröder et al. (2010), our conclusion from this analysis is that a urine proteomic profile as measured on antibody arrays shows promise in diagnosing pancreatic cancer.Due to the limited sample size (12 samples) of our case study, however, we caution the reader not to rely heavily on these estimates of sensitivity and specificity.

Summary
Chemical sensor data appear in a variety of contexts, from breathalyzers to carbon monoxide or smoke detectors.Given their pervasive existence in various aspects of life, it is essential that these sensors work and provide proper analysis to accurately assess chemical questions of interest.This can only happen with proper statistical insight and tools to provide accurate assessments that can lead to appropriate decision-making.Hirschfeld et al. (1984) noted the importance of chemometrics with sensor arrays, particularly the use of pattern recognition strategies and learning algorithms.As a result, calibration, quantification, and reproducibility are all attainable.In this chapter, we highlight some of the state-of-the-art statistical methods that are used to calibrate, quantify, and ultimately, benchmark modern chemical sensors.We stress that further advancements in the use and utility of chemical sensors will require input from statisticians to determine the accuracy/reproducibility of the sensors as well as their ability to make inferences on a population.

Appendix
In this section, we provide a brief overview of neural networks as they are commonly used for prediction with chemical sensor data (see Hashem et al. (1995)).As discussed in Wasserman (2004), neural networks often take the form, where σ is a smooth function.When compared to models such as in Equations ( 14) and ( 16), the neural networks model is obviously more complex.As such, these models are often difficult to fit to datasets and often require large datasets and heavy computational power.

343
Statistical Analysis of Chemical Sensor Data www.intechopen.com

Fig. 3 .
Fig.3.Heatmap showing the proteins levels for each sample after preprocessing.Note sample hf_2 appears to be an outlier in this subset of data.

Fig. 5 .
Fig.5.The distance matrix heatmap showing the measured Euclidean distance between the samples using the protein profile for each sample.From this figure, the profile for sample hf_2 is the furthest in terms of Euclidean distance from the other samples.

Table 1 .
Possible outcomes for a null hypothesis, H 0 , and associated outcome (rejecting or failing to reject H 0 ).

Table 2 .
e., subject, patient, sample, or animal) is measured with several chemical sensors.Thus each unit acquires a chemical sensor profile, that is a collection of signals acquired from the chemical sensor.The goal in hypothesis testing is to examine each chemical sensor in Table displaying the summary measures to distinguish positives (cases) from negatives (controls).TP (FP) denotes the number of true (false) positives, while TN (FN) denotes the number of true (false) negatives.PPV and NPV denote positive and negative predictive value, respectively.See Section 3.2.2 for details.