Principal Component Analysis – A Realization of Classification Success in Multi Sensor Data Fusion


Introduction
The field of measurement technology in the sensors domain is changing rapidly due to the availability of statistical tools that handle many variables simultaneously. This has led to a change in the approach to generating datasets from sensors. Nowadays, multiple sensors, or more specifically multi sensor data fusion (MSDF), are favoured over a single sensor because they offer significant advantages over single-source data and give a better representation of real cases. MSDF is an evolving technique concerned with the problem of systematically combining data from one or multiple (and possibly diverse) sensors in order to make inferences about a physical event, activity or situation. Mitchell (2007) defined MSDF as the theory, techniques, and tools which are used for combining sensor data, or data derived from sensory data, into a common representational format. The definition also includes multiple measurements produced at different time instants by a single sensor, as described by Smith & Erickson (1991).

The fusion of artificial sensors
The appreciation of food is based on the combination of many human senses, including sight, touch, sound, taste and smell. However, because panels of trained experts are expensive, the food industry needs a more rapid technique for objective, consistent and cost-effective measurement of food quality parameters (Winquist et al., 2003). Two human senses that are believed to be closely correlated in the perception of flavour are smell and taste. The e-nose and e-tongue have been defined as artificial sensing systems capable of producing a digital fingerprint of a given chemical ambient (D'Amico, 2000). Both devices consist of chemical sensor arrays coupled with an appropriate pattern recognition system capable of extracting information from complex signals (Buratti et al., 2004).
Basically, an e-nose is formed by an array of gas sensors with different selectivity, a signal collecting unit and suitable pattern recognition software, all controlled and executed by a computer. The principle of the e-tongue is similar to that of the e-nose, except that the array of sensors is designed for liquids (Cosio et al., 2007). The ultimate task of these sensors is to collect the digital fingerprint or signals, which are then interpreted using multivariate statistical tools before the objective of the fusion approach is attained. One of the most popular exploratory data analyses in chemical sensors is PCA (Di Natale et al., 2006). PCA is a procedure for extracting useful information from the data and for exploring the data structure, the relationship between the objects and features, and the global correlation of the features. Further details of PCA are given in Section 2. The principal components selected according to certain criteria are then used as input to a classification procedure using linear discriminant analysis (LDA). This technique is described in Section 3 of this chapter.
The selected architecture of MSDF in this research focuses on the approach of identity fusion. Identity fusion is a fusion of parametric data to determine the identity of an observed object. Our interest is to convert multiple sensor observations of a target's attributes (such as e-nose and e-tongue responses) into a joint declaration of target identity. One of the key issues in developing an MSDF system is to determine the stage or phase in the data flow at which to combine or fuse the data (Hall & Llinas, 1997). For identity fusion, Hall (1992) suggested three frameworks: (i) low level data fusion (or data level fusion); (ii) intermediate level data fusion (or feature level fusion); and (iii) high level data fusion (or decision level fusion). However, only data level and feature level fusion are discussed here.

Low level data fusion
In low level data fusion, the e-nose and e-tongue sensors observe the target objects independently, and the raw sensor data (i.e. the original data collected from each sensor) are combined afterwards. In order to fuse raw sensor data, the original sensor data must be commensurate, i.e. they must be observations of similar physical quantities (Hall et al., 1997). Sometimes the numbers of features recorded by the e-nose and e-tongue are different, but the raw sensor data can still be fused if both datasets have the same sample size (equal n).
It is important to ensure the new dataset is formed from the original non-normalized data. A framework of low level data fusion is illustrated in Fig. 1.

Fig. 1. Framework of low level data fusion by Hall (1992)

It is believed that low level data fusion in identity fusion provides the most accurate result (Hall et al., 1997). This may be because the original information from each sensor is maintained and used in further processes. Thus, low level data fusion is potentially more accurate than the other two fusion methods. However, the application of low level data fusion is made difficult by the noise that frequently occurs in the sensor data and by redundant data, both of which have an adverse effect on the classification results.
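As a minimal sketch of the data level fusion described above, the two raw sensor blocks can be joined column-wise once they share the same number of samples. The function name `fuse_low_level` and the random arrays are illustrative assumptions, not the chapter's data or software.

```python
import numpy as np

def fuse_low_level(enose, etongue):
    """Concatenate two raw sensor blocks column-wise (data level fusion)."""
    enose = np.asarray(enose, dtype=float)
    etongue = np.asarray(etongue, dtype=float)
    # Fusion at the data level requires equal sample size (equal n).
    if enose.shape[0] != etongue.shape[0]:
        raise ValueError("both blocks must have the same number of samples")
    return np.hstack([enose, etongue])

# Hypothetical readings: 6 samples, 32 e-nose and 11 e-tongue features.
rng = np.random.default_rng(0)
enose = rng.normal(size=(6, 32))
etongue = rng.normal(size=(6, 11))
fused = fuse_low_level(enose, etongue)
print(fused.shape)  # (6, 43)
```

The feature counts may differ between the blocks; only the row count has to match.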

Intermediate level data fusion
This approach consists of extracting features from the signals of each sensor to yield feature vectors. Then, the feature vectors are fused and an identity declaration is made based on the joint feature vectors. The identity declaration process includes techniques such as knowledge-based approaches, which include expert systems and fuzzy logic, or training-based approaches such as discriminant analysis, neural networks, Bayesian techniques, k-nearest neighbors and centre mobile algorithms. Fig. 2 illustrates the framework of the intermediate level data fusion.

Principal component analysis
Principal component analysis (PCA) was first described by Karl Pearson in 1901. A description of practical computing methods came much later from Harold Hotelling in 1933 (Manly, 2004). The idea of PCA is to retain as much as possible of the variation in the p original features using a smaller number k of unobservable variables (k ≤ p), termed principal components. Let [definition and Table 1 missing in source]

Before discussing how to reduce the dimension for further analysis, it is necessary to understand which matrix should be used to compute the principal components: a correlation matrix or a covariance matrix. One should clearly understand when to use each input matrix, as the results from the two often differ. Sections 2.1 and 2.2 briefly discuss the guidelines.

Principal component using covariance matrix
An implicit assumption when using covariance matrix as an input is that the features should not have grossly different variances.Such differences in variance might arise because of different scales of measurements, different magnitude of measurements, or some combination of the two factors (Krzanowski, 2000).If they do, then the first few principal components will be pulled toward those features with the larger variances (Dillon & Goldstein, 1984).
In such cases, the data should be standardized, which means that the correlation matrix is used in the PCA. As a general guideline, it would seem sensible to standardize first whenever the measured features show differences in variances, or whenever the user is concerned with very different measured entities or units (Krzanowski, 2000). However, transforming the original data results in PC scores with a different meaning (Martinez & A.R. Martinez, 2001). The big drawback of PCA based on the covariance matrix is the sensitivity of the PCs to the units of measurement used for each element of X (Jolliffe, 2002).

Principal component analysis using correlation matrix
PCA aims to create new variables that are linear combinations of the features and are uncorrelated with each other; thus, if the correlation matrix shows only small correlations, there is probably not much point in carrying out PCA (Chatfield & Collins, 1980). PCA based on the correlation matrix is suitable for features with unequal scales of measurement. One way to detect unequal scales is through widely differing variances among the features. In computing a correlation coefficient between two features, differences due to the mean and the dispersion of the features are removed (Dillon & Goldstein, 1984). This is recommended because the original features are all standardized to unit variance (Borgognone et al., 2001).
Therefore, data used to calculate PCA from a correlation input do not need any transformation, as standardization is applied automatically in the correlation computation. However, a disadvantage of using the correlation matrix is that the principal components give coefficients for standardized variables and are therefore less easy to interpret directly. Thus, to interpret the principal components in terms of the original variables, each coefficient must be divided by the standard deviation of the corresponding variable (Jolliffe, 2002).
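The contrast between the two input matrices can be sketched numerically. The example below uses hypothetical, independently generated features on very different scales (an assumption for illustration only) and shows that the correlation matrix is the covariance matrix of the standardized data, while the covariance-based first component is dominated by the large-variance feature.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: three independent features with standard deviations
# of roughly 1, 10 and 100 (grossly different variances).
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])

# Eigenvalues of the covariance and correlation matrices, largest first.
cov_eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
corr = np.corrcoef(X, rowvar=False)
corr_eigvals = np.linalg.eigvalsh(corr)[::-1]

# Standardizing first and then taking the covariance reproduces the
# correlation matrix, so no separate transformation is needed.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov_of_Z = np.cov(Z, rowvar=False)

# Share of total variance carried by the first component in each case.
share_first_cov = cov_eigvals[0] / cov_eigvals.sum()
share_first_corr = corr_eigvals[0] / corr_eigvals.sum()
print(share_first_cov, share_first_corr)
```

With the covariance input, the first component is pulled almost entirely toward the feature with the largest variance; with the correlation input, the three features contribute on an equal footing.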

Deciding the number of components to retain
Mathematically, the choice of values for the coefficients α is subject to the restrictions given in equations (2) and (3). Thus, the principal components are obtained in decreasing order of variance: the first principal component has the largest variance and the p-th the smallest. In practice, the first k principal components account for most of the variability of the original data, so keeping all p principal components is impractical. This means that only the first k principal components are used in further analysis, while the remaining p-k principal components are ignored. However, there is no universally accepted method for choosing k, because the decision is largely judgemental and a matter of taste (Dillon & Goldstein, 1984). A number of procedures to determine k have been suggested. The most common procedures are as follows.

Average eigenvalue
The most common criterion for determining the number of informative principal components in PCA is the Guttman-Kaiser criterion (Jackson, 1993). Principal components associated with eigenvalues (λ) derived from a covariance matrix that are larger in magnitude than the average of the eigenvalues are retained. For eigenvalues derived from a correlation matrix, the average is 1.0. Therefore, any principal component associated with an eigenvalue whose magnitude is greater than or equal to 1.0 is chosen for further analysis. Rencher (1998) noted that this method works well in practice, but that when it errs it is likely to retain too many components. It is well known as a simple criterion and is particularly suitable when confronted with numerous variables.
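The average-eigenvalue rule can be sketched in a few lines. The eigenvalues below are hypothetical, and the helper name `kaiser_retain` is an assumption; the sketch uses "greater than or equal to the average", which matches the correlation-matrix form of the rule stated above.

```python
def kaiser_retain(eigenvalues):
    """Return the eigenvalues at or above the average eigenvalue."""
    mean = sum(eigenvalues) / len(eigenvalues)
    return [ev for ev in eigenvalues if ev >= mean]

# Hypothetical eigenvalues of a 5 x 5 correlation matrix (they sum to 5,
# so the average eigenvalue is 1.0).
eigenvalues = [2.6, 1.2, 0.7, 0.3, 0.2]
kept = kaiser_retain(eigenvalues)
print(len(kept))  # 2
```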

Proportion of total variance explained
In a PCA model, each eigenvalue represents the amount of variation of the original features explained by the associated principal component. Another popular decision criterion is based on the proportion of the total variance explained by the principal components retained in the model. If k components are retained, the cumulative percentage of variance explained by the first k principal components can be written as

$$ t_k = 100 \times \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i} $$

Often, the researcher decides on a satisfactory value for t_k and then determines k accordingly. The obvious problem with the technique is deciding on an appropriate t_k. In practice, it is common to select a value from 70% to 90% (Jolliffe, 2002). Because this choice is arbitrary, the approach has sometimes been criticized for its subjectivity (Kim & Mueller, 1978). Jackson (1993) strongly argues against the use of this method, except possibly for exploratory purposes when little is known about the population of the data.
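The selection of k from a chosen threshold on the cumulative percentage of variance can be sketched as follows. The eigenvalues and the helper names are hypothetical, and the thresholds are simply taken from the 70% to 90% range quoted above.

```python
def cumulative_variance(eigenvalues):
    """Cumulative percentage of total variance for k = 1, ..., p."""
    total = sum(eigenvalues)
    out, running = [], 0.0
    for ev in eigenvalues:
        running += ev
        out.append(100.0 * running / total)
    return out

def smallest_k(eigenvalues, threshold):
    """Smallest k whose cumulative percentage reaches the threshold."""
    for k, t_k in enumerate(cumulative_variance(eigenvalues), start=1):
        if t_k >= threshold:
            return k
    return len(eigenvalues)

# Hypothetical eigenvalues, sorted in decreasing order.
eigenvalues = [2.6, 1.2, 0.7, 0.3, 0.2]
print(smallest_k(eigenvalues, 70))  # 2
print(smallest_k(eigenvalues, 80))  # 3
```

The two thresholds give different values of k for the same eigenvalues, which is the subjectivity the text refers to.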

Scree plot
Perhaps a much easier decision on k can be made with a graphical approach, the scree plot, as suggested by Cattell (1966). A scree plot is a plot of the eigenvalues versus the index of the eigenvalue. With this approach, the eigenvalues of each component are plotted in successive order of their extraction, and an elbow in the curve is then identified by applying a straightedge to the bottom portion of the eigenvalues to see where they form an approximately straight line (Dillon & Goldstein, 1984).
The value of k is given by the point at which the components curve above the straight line formed by the smaller eigenvalues. Fig. 3 shows a case in which k is equal to three and the (shallow) straight line runs from the fourth to the last component. As can be observed from Fig. 3, the third component lies exactly at an eigenvalue equal to 1. Dillon and Goldstein (1984) argue that this method is inconclusive when there is no obvious break or when there are several breaks. It becomes more troublesome when two breaks occur among the first half of the eigenvalues, since it is then difficult to decide which of the breaks reflects the correct number of components.

Linear discriminant analysis
Linear discriminant analysis, also called discriminant function analysis or simply discriminant analysis, is a supervised technique for classifying objects into two or more groups, given measurements of these objects on several features (i.e. sensor responses). It involves deriving linear combinations of the independent features that discriminate between the a priori defined groups in such a way that the misclassification error is minimized (Dillon & Goldstein, 1984). The discrimination is accomplished by maximizing the between-group variance relative to the within-group variance. The basic discriminant analysis involves only a two-group problem and was first suggested by R. A. Fisher (1936).
In the two-group problem, the aim is to find a single linear composite of the predictor features that discriminates between the two groups. The linear composite then acts as a new axis along which the groups are maximally separated.
In reality, we may encounter discrimination problems with more than two groups, which require an extension of the basic discriminant analysis called multiple discriminant analysis. The goal in multiple discriminant analysis is much the same as in discriminant analysis for two groups. Dillon and Goldstein (1984) describe that, in general, with k groups and p predictor features there are in total min(p, k-1) possible discriminant functions (i.e. linear composites). In most applications, since the number of features (p) exceeds the number of groups (k), at most k-1 discriminant functions are considered. However, not all of these functions show statistically significant variation among the groups, and fewer than k-1 discriminant functions may actually be needed. As with principal components in PCA, discriminant functions are generated so that the scores of each new discriminant function are uncorrelated with the scores of the previously obtained discriminant functions. Thus, each linear composite is the new single function that maximizes the ratio of between-groups to within-groups variability. In addition, the discriminant functions are extracted in decreasing order of accounted variation.
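The count of min(p, k-1) discriminant functions can be illustrated with a small sketch of Fisher's criterion, in which the linear composites are eigenvectors of the product of the inverse within-groups and the between-groups scatter matrices. The data, the function name `discriminant_functions` and the group layout are hypothetical, and this is only one common way of computing the functions, not necessarily the software used in the chapter.

```python
import numpy as np

def discriminant_functions(X, y):
    """Return eigenvalues and coefficient vectors of the discriminant functions."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    p = X.shape[1]
    grand_mean = X.mean(axis=0)
    S_w = np.zeros((p, p))  # within-groups scatter
    S_b = np.zeros((p, p))  # between-groups scatter
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        S_w += centered.T @ centered
        d = (mean_c - grand_mean).reshape(-1, 1)
        S_b += Xc.shape[0] * (d @ d.T)
    # Eigen-decomposition of inv(S_w) @ S_b, largest ratio first.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(eigvals.real)[::-1]
    n_funcs = min(p, len(classes) - 1)
    keep = order[:n_funcs]
    return eigvals.real[keep], eigvecs.real[:, keep]

# Hypothetical data: k = 3 groups, p = 4 features, 30 samples per group.
rng = np.random.default_rng(2)
centres = np.array([[0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0]], dtype=float)
X = np.vstack([rng.normal(size=(30, 4)) + c for c in centres])
y = np.repeat([0, 1, 2], 30)
vals, vecs = discriminant_functions(X, y)
print(len(vals))  # 2, i.e. min(p, k - 1)
```

The eigenvalues are returned in decreasing order, mirroring the decreasing order of accounted variation mentioned above.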
There are assumptions that researchers need to consider in order to obtain a procedure that is optimal in the sense of producing the smallest misclassification error rate. According to Dillon and Goldstein (1984), for optimality we assume (i) multivariate normality of the p predictor features, and (ii) equal variance-covariance matrices in each of the k groups. They added that the objectives of multiple discriminant analysis are for the most part generalizations of those of the two-group problem. Among others, they include:
i. To find the linear composites with as large as possible between-groups variability, subject to each newly extracted linear composite being uncorrelated with previously extracted composites. The accounted variations of the linear composites are in decreasing order.
ii. To determine whether the group centroids are statistically different.
iii. To determine the number of discriminant functions that are statistically significant.
iv. To successfully assign a new signal or observation to one of the several groups.
v. To determine the predictor features that contribute most to discrimination among groups.
The goal in constructing classification rules is to minimize the mistakes made in assigning new signals to their groups. Fewer mistakes mean a lower error rate for the classification rule. In real problems, one often has a set of data to be discriminated into g groups. However, using the same data both to construct a rule and to evaluate it gives a biased result. In fact, it does not mimic the real use of a discrimination rule, which is to classify a future object using a rule constructed from the existing data. Several techniques can be considered in an attempt to avoid such bias. Some of the techniques are the re-substitution method, the cross validation method (also known as the sample-splitting method) and the leave-one-out method. Lachenbruch and Mickey (1968), cited in Krzanowski (2000), proposed the leave-one-out method, which was believed to overcome most problems inherent in the previous two methods. The technique consists of determining the allocation rule using the sample data minus one observation and then using the resulting rule to classify the omitted observation. Repeating this procedure by omitting each of the individuals in the two training sets in turn yields an estimate of the error rates, i.e. the proportions of misclassified signals in the two training sets.
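The leave-one-out procedure described above can be sketched with a simple linear discriminant rule. The classifier here is an assumption for illustration: it uses a pooled covariance matrix with equal priors and assigns each observation to the group with the nearest centroid in Mahalanobis distance; the data and function names are hypothetical.

```python
import numpy as np

def lda_fit(X, y):
    """Estimate group centroids and the inverse pooled covariance matrix."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    pooled = np.zeros((X.shape[1], X.shape[1]))
    for c, m in zip(classes, means):
        centered = X[y == c] - m
        pooled += centered.T @ centered
    pooled /= X.shape[0] - len(classes)
    return classes, means, np.linalg.inv(pooled)

def lda_predict(model, x):
    """Assign x to the group with the smallest Mahalanobis distance."""
    classes, means, inv_cov = model
    diffs = means - x
    d2 = np.einsum("ij,jk,ik->i", diffs, inv_cov, diffs)
    return classes[np.argmin(d2)]

def leave_one_out_error(X, y):
    """Fit on all observations but one, classify the omitted one, repeat."""
    n = X.shape[0]
    wrong = 0
    for i in range(n):
        mask = np.arange(n) != i
        model = lda_fit(X[mask], y[mask])
        wrong += int(lda_predict(model, X[i]) != y[i])
    return wrong / n

# Hypothetical two-group training data with two predictor features.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(size=(20, 2)), rng.normal(size=(20, 2)) + 4.0])
y = np.repeat([0, 1], 20)
loo_error = leave_one_out_error(X, y)
print(loo_error)
```

Because each observation is classified by a rule that never saw it, the estimate avoids the optimism of evaluating the rule on its own training data.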

Materials and methods
The experiment was carried out in the Sensor Laboratory, Centre of Excellence for Advanced Sensor Technology, University Malaysia Perlis. The aim is to identify and classify different types of pure honey, beet sugar, cane sugar and adulterated samples (i.e. mixtures of pure honey with cane sugar and beet sugar) by applying low level data fusion and intermediate level data fusion. PCA was employed to reduce the data dimension, and classification was then performed by LDA.

Sample selection and preparation
In this experiment, 10 different brands of Tualang honey were purchased from the local market, with three different batches of each honey. For adulteration purposes, two types of sugar solution, namely beet sugar and cane sugar, were imported from Germany and the United Kingdom respectively. The pure honey and sugar samples are shown in Fig. 4, and all honey and sugar samples are summarized in [table reference missing in source]. Based on the three different batches of each pure honey, three samples of 5 ml were prepared for further measurement. For the adulterated samples, each pure honey was mixed with sugar at different concentrations (i.e. 20% and 40%) as shown in Table 3. Each pure sugar was also measured. Each sampling of pure honey, sugar and adulterated samples was repeated ten times. In total there were about 172 samples of pure honey, pure sugar and adulterated mixtures.

Percentage of Pure Honey    Descriptions
20% pure honey              1:4 (ratio of pure honey / sugar solution)
40% pure honey              2:3 (ratio of pure honey / sugar solution)

Table 3. Description of mixture for different samples of honey and sugar

Electronic nose setup and measurement
The e-nose used was the Cyranose320 from Smith Detection™, which consists of 32 non-selective sensors of different types of polymer matrix, blended with carbon black composite and configured as an array. It can be trained to analyze both simple and complex vapor mixtures with equal ease. When the sensors are exposed to vapors or aromatic volatile compounds they swell, changing the conductivity of the carbon pathways and causing an increase in the resistance value, which is monitored as the sensor signal. The resistance changes across the array are captured as a digital pattern that is representative of the test smell (Dutta et al., 2006).
The e-nose setup for this experiment is illustrated in Fig. 5, and the settings of the sniffing cycle are given in Table 4. Each sample was drawn from the bottle using a 10 ml syringe, kept in a 13 x 100 mm test tube and sealed with a silicone stopper. Each sample was replicated ten times. Before measurement, each sample was placed in a heater block and heated for 10 minutes to generate sufficient headspace volatiles. The temperature of the sample was controlled at 50 °C during the headspace collection.
Preliminary experiments were performed to determine the optimal experimental setup for the purging, baseline purge and sample draw durations. A ten-second baseline purge with a 40-second sample draw produced an optimal result (result not shown). The baseline purge was set longer to ensure residual gases were properly removed, since all the samples are in liquid form and contain moisture. The pump was set to medium speed during sample draw. The filter used is made of activated carbon granules and has a large surface area, which is effective in removing a wide range of volatile organic compounds and moisture from the ambient air. The experiment was carried out using the e-nose for a variety of honey samples, followed by sugar and adulterated samples.

Table 4. E-nose parameter setting for honey, sugar and adulterated samples assessment

Electronic tongue setup and measurement
The chalcogenide-based potentiometric e-tongue was made up of eleven distinct ion-selective sensors from Sensor Systems (St. Petersburg, Russia). The e-tongue system shown in Figure 6 was implemented by arranging an array of potentiometric sensors around the reference probe. Table 5 describes the potentiometric sensors used in this experiment. Each sensor output was connected to the analogue input of a data acquisition board (NI USB-6008) from National Instruments (Austin TX, USA).
A 10% (w/v) solution of honey in distilled water was prepared and stirred for 3 minutes at 1000 rpm before making any measurements. Each sample was replicated ten times. For each measurement, the e-tongue was steeped simultaneously and left for two minutes, and the potential readings were recorded for the whole duration. After each sampling, the e-tongue was rinsed twice using distilled water (stirred at 400 rpm for two minutes) to remove any …

Table 5. Chalcogenide-based potentiometric electrodes used in the e-tongue. [Table body not recovered; the surviving entries are "C320" and "Reference probe using Ag/AgCl electrode".]

Data preprocessing
The fractional measurement method is essential when using multi-modality sensor fusion. This technique is often known as baseline manipulation and was applied to preprocess the data of both modalities (Gardner & Bartlett, 1999). The baseline, S_0, is subtracted from the maximum sensor response, S_t, and the difference is then divided by S_0. The dimensionless and normalized S_frac is determined as follows:

$$ S_{frac} = \frac{S_t - S_0}{S_0} $$

This gives a unit response for each sensor array output with respect to the baseline, which compensates for sensors that have intrinsically large varying response levels. It can also further minimize the effect of temperature, humidity and temporal drifts (Gardner & Bartlett, 1999).
The data from different modalities were processed separately and all sensors were used in this analysis. In the case of the e-nose, S_0 is the minimum value taken during the baseline purge with ambient air and S_t was measured during the sample draw. Each sampling cycle was repeated three times and the average was obtained for each of the ten replicated samples.
For the e-tongue measurements, S_0 (baseline reading) is the average reading of distilled water, while S_t is the sensor reading when steeped in the solution. The steeping cycle was repeated three times for each sample and the average was obtained for each of the ten replicated samples. Each S_frac data point from each e-nose and e-tongue sensor formed the S_frac matrix for further analyses.
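The fractional preprocessing step can be sketched as below. The sketch assumes the usual form of the fractional method, (S_t - S_0) / S_0; the function name and the sensor readings are hypothetical and chosen only to show that sensors with very different response levels end up on a common, dimensionless scale.

```python
def fractional_response(s_t, s_0):
    """Fractional response of one sensor relative to its baseline s_0."""
    if s_0 == 0:
        raise ValueError("baseline must be non-zero")
    return (s_t - s_0) / s_0

# Hypothetical readings: two sensors whose raw levels differ by a
# factor of twenty but which both rise by 10% over baseline.
baseline = [100.0, 2000.0]
peak = [110.0, 2200.0]
s_frac = [fractional_response(t, b) for t, b in zip(peak, baseline)]
print(s_frac)
```

Both sensors give the same fractional response even though their raw changes (10 and 200 units) are very different.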

Low level data fusion
For the purpose of low level data fusion, measurements recorded from both sensors were fused at the data level. For the e-nose data, there were 720 observations with 32 features from 16 different honey, sugar and adulterated samples. Likewise, for the e-tongue data, 720 observations with 11 features from 16 different honey, sugar and adulterated samples were recorded. As a result, the fused data consisted of 720 observations with 43 features. At this stage, the original data from both measurements are arranged in a single data matrix, as described in Fig. 7. No transformation is applied at this stage.

Fig. 7. Illustration of fusing data in low level data fusion
The correlation matrix of the fused data was used as the input for the PCA calculation. For the purpose of classification in LDA, the reduced number of principal components was selected based on eigenvalues with magnitude greater than or equal to 1 (λ_i ≥ 1). The result from the scree plot is also applied for comparison and confirmation purposes.

Intermediate level data fusion
In this framework, fusion was applied after the feature extraction process. For that purpose, PCA was calculated based on the correlation matrix of each dataset. The number of principal components to retain was decided based on the associated eigenvalues with magnitude greater than or equal to 1.0 (λ_i ≥ 1), together with the scree plot of each dataset. Fig. 8 illustrates the related processes. The resulting principal components from each sensor, three principal components in each case, were then combined before classification using LDA was performed.
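The feature level fusion steps above can be sketched as follows: run a correlation-based PCA on each sensor block separately, keep the components with eigenvalues of at least 1.0, and join the retained scores. The data, the latent-factor construction and the function names are hypothetical assumptions, and the number of retained components depends on the data rather than being fixed at three.

```python
import numpy as np

def pca_scores_kaiser(X):
    """Correlation-based PCA; keep components with eigenvalue >= 1.0."""
    X = np.asarray(X, dtype=float)
    # Standardized data, so that their covariance is the correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals >= 1.0
    return Z @ eigvecs[:, keep], eigvals[keep]

def fuse_intermediate(enose, etongue):
    """Join the retained principal component scores of the two blocks."""
    scores_nose, _ = pca_scores_kaiser(enose)
    scores_tongue, _ = pca_scores_kaiser(etongue)
    return np.hstack([scores_nose, scores_tongue])

# Hypothetical blocks driven by two latent factors plus noise.
rng = np.random.default_rng(4)
latent = rng.normal(size=(40, 2))
enose = latent @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(40, 6))
etongue = latent @ rng.normal(size=(2, 4)) + 0.3 * rng.normal(size=(40, 4))

scores_nose, eig_nose = pca_scores_kaiser(enose)
scores_tongue, eig_tongue = pca_scores_kaiser(etongue)
fused_scores = fuse_intermediate(enose, etongue)
print(fused_scores.shape)
```

The fused score matrix would then serve as the input to the classifier, in place of the raw fused features used in low level data fusion.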

Results and discussion
Before the PCA was taken further, each candidate set of principal components (at low level data fusion) considered for classification using LDA was studied thoroughly, and the resulting classification error rates are shown in Fig. 9. The error rates were compared across the choice of correlation or covariance input matrix, evaluated with the leave-one-out approach, while the least important principal components were eliminated in turn (i.e. elimination begins with the principal component of the smallest eigenvalue). Table 6 shows the total variance explained using the correlation and covariance matrix inputs for the low level data fusion. Fig. 9 clearly reveals similar classification performance for the correlation and covariance input matrices with the leave-one-out approach for the low level data fusion. The classification performance for the correlation and covariance inputs does not differ much because the standard deviations of the features in the fused dataset are fairly small.

In reality, good classification performance is not determined by having a greater number of features in the data. What is needed are the features with the most discriminative effect, which is often measured by the error rate. In the case of low level data fusion, the PCA based on the correlation matrix of the fused data was used to extract the most important features in the form of linear combinations. Table 7 displays the total variance explained for the principal components of low level data fusion. Six principal components with eigenvalues greater than or equal to 1.0 were retained as the input for classification using LDA. With only six linear combinations of the original features, out of 43 principal components, we lose only about 9.3% of the information when proceeding with the classification task. The scree plot in Fig. 10 also shows that six principal components should be retained.

Based on the eigenvalues greater than or equal to 1.0 from both the e-tongue and e-nose data, three principal components each were retained as the input for classification using LDA. With the three principal components selected from each of the e-tongue and e-nose data, we lose about 31% of the information, which is quite high compared to the low level data fusion. The scree plot in Fig. 11 seems to agree that three principal components are adequate to represent the original features.

The selected principal components for low and intermediate level data fusion are further analyzed. The classification and prediction of the class of different types of pure honey, sugar, and adulterated samples were carried out using LDA with the leave-one-out procedure. Table 10 indicates the significant differences in means of the predictors (i.e. the selected principal components) between the seven groups for both fused models. The results indirectly show the importance of each principal component to the discrimination function. Based on Wilks' Lambda, a principal component with a smaller value is a more important predictor. The principal components are ranked from most to least important according to the italic numbers. Note that, conversely, the bigger the Wilks' Lambda, the smaller the F value. Besides knowing the important predictors for the discrimination function, it is worth investigating whether the assumption of homogeneity of covariance matrices is met.
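The inverse relation between Wilks' Lambda and F for a single predictor can be sketched as follows. For one predictor, Lambda is the ratio of the within-groups to the total sum of squares, and F follows from it through the one-way ANOVA identity; the scores and the function name are hypothetical and are not the chapter's results.

```python
def wilks_lambda_and_f(groups):
    """Univariate Wilks' Lambda and the corresponding one-way ANOVA F."""
    n = sum(len(g) for g in groups)
    g_count = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_within = sum(
        sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups
    )
    ss_total = sum((x - grand_mean) ** 2 for g in groups for x in g)
    lam = ss_within / ss_total
    f_value = ((1 - lam) / lam) * ((n - g_count) / (g_count - 1))
    return lam, f_value

# Hypothetical scores of two predictors in three groups: the first
# separates the groups well, the second hardly at all.
strong = [[1.0, 1.2, 0.8], [3.0, 3.1, 2.9], [5.0, 5.2, 4.8]]
weak = [[1.0, 3.0, 5.0], [1.2, 3.1, 5.2], [0.8, 2.9, 4.8]]
lam_strong, f_strong = wilks_lambda_and_f(strong)
lam_weak, f_weak = wilks_lambda_and_f(weak)
print(lam_strong < lam_weak, f_strong > f_weak)  # True True
```

The predictor that separates the groups has the smaller Lambda and the larger F, which is why a smaller Lambda marks a more important predictor.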

Fig. 2. Framework of intermediate level data fusion by Hall (1992)

It is important to note that both low and intermediate level data fusion apply feature extraction to transform the raw signals provided by the sensors into a reduced vector of features that describes the original information parsimoniously. Then, in the identity declaration, a quality class is assigned to the signals based on the feature extraction result.

Fig. 4. Display of different samples of honey and sugar

Fig. 5. E-nose setup for headspace evaluation of honey, sugar concentration and adulteration sample

Fig. 8. Illustration of fusing extracted features in intermediate level data fusion

Fig. 9. Different classification performance for correlation and covariance input matrix with leave-one-out approach.
Fig. 10. Scree plot for low level data fusion
Fig. 11. Scree plot for (a) e-tongue data and (b) e-nose data

Table 7. Total variance explained for low level data fusion

Table 8. Total variance explained of e-tongue data for intermediate level data fusion

Table 9. Total variance explained of e-nose data for intermediate level data fusion

Tables 8 and 9 display the total variance explained for the principal components of intermediate level data fusion.

Table 10. Test of equality of group means to identify the important variable to the discrimination function

Table 11 displays the Box's M test for both data fusion models. The significant values of both data fusion models indicate that the covariance matrices are not similar for the seven groups.

Table 11. Test null hypothesis of equal population covariance matrices.

Based on Table 12 and Table 13, the first five discriminant functions for low and intermediate level data fusion together explain 100% of the total variance. However, the canonical correlation values greater than 0.5 reveal that only the first two discriminant functions from both fusion models describe a strong relationship.

Table 12. Percentage of variance explained for each discrimination function for low level data fusion.

Table 13. Percentage of variance explained for each discrimination function for intermediate level data fusion.

The best predictors of the types of honey, sugar, and adulterated samples from the respective discrimination functions of each data fusion model are marked in italic in Table 14. The highest value in each function (column) marks the best predictor. For example, the best predictor for the first discriminant function of the low level data fusion is the third principal component (PC3).

Table 14. Indication of relative importance of the independent variables in predicting the groups for both data fusion models.

Graphical representations of the classification for low level data fusion and intermediate level data fusion are given in Fig. 12 and Fig. 13 respectively. Tables 15 and 16 describe in detail the classification results for each fusion model. It seems that the classification of several types of pure honey (group 1), beet sugar (group 2) and cane sugar (group 3) were very