The Basics of Linear Principal Components Analysis



Introduction
When you have obtained measures on a large number of variables, there may be redundancy in those variables. Redundancy means that some of the variables are correlated with one another, possibly because they measure the same "thing". Because of this redundancy, it should be possible to reduce the observed variables to a smaller number of variables. For example, if a group of variables are strongly correlated with one another, you do not need all of them in your analysis but only one, since the behavior of the others can largely be predicted from it. This raises the central issue of how to select or build a representative variable for each group of correlated variables. The simplest way is to keep one variable and discard all the others, but this is rarely reasonable because it throws away the information carried by the discarded variables. An alternative is to combine the variables in some way, for instance by taking a weighted average, along the lines of the well-known Human Development Index published by UNDP. However, such an approach raises the basic question of how to set the appropriate weights. If one has sufficient insight into the nature and magnitude of the interrelations among the variables, one might choose the weights using one's individual judgment. Obviously, this introduces a certain amount of subjectivity into the analysis and may be questioned by practitioners. To overcome this shortcoming, another approach is to let the data themselves reveal the relevant weights of the variables. Principal Components Analysis (PCA) is a variable reduction method that can be used to achieve this goal. Technically, this method delivers a relatively small set of synthetic variables, called principal components, that account for most of the variance in the original dataset.
Introduced by Pearson (1901) and Hotelling (1933), Principal Components Analysis has become a popular data-processing and dimension-reduction technique, with numerous applications in engineering, biology, economics and social science. Today, students and professionals can run PCA in standard statistical software, but the method is often poorly understood. The goal of this Chapter is to dispel the magic behind this statistical tool. The Chapter presents the basic intuitions for how and why principal component analysis works, and provides guidelines for interpreting the results. The mathematical aspects are kept to a minimum. By the end of this Chapter, readers of all levels should have a better understanding of PCA as well as the when, the why and the how of applying this technique. They should be able to determine the number of meaningful components to retain from PCA, create factor scores and interpret the components. Emphasis is placed on examples that explain in detail the steps of implementing PCA in practice.
We believe that a good understanding of this Chapter will make it easier to follow the subsequent chapters and the novel extensions of PCA proposed in this book (sparse PCA, Kernel PCA, Multilinear PCA, …).

The basic prerequisite: variance and correlation
PCA is useful when you have data on a large number of quantitative variables and wish to collapse them into a smaller number of artificial variables that account for most of the variance in the data. The method is mainly concerned with identifying variances and correlations in the data. Let us focus our attention on the meaning of these concepts. Consider the dataset given in Table 1. This dataset will serve to illustrate how PCA works in practice. The variance of a given variable $x$ is defined as the average of the squared differences from the mean:

$$ \sigma_x^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$

The square root of the variance is the standard deviation, denoted by the lowercase Greek letter sigma, $\sigma_x$. It is a measure of how spread out the numbers are.

The variance and the standard deviation are important in data analysis because of their relationships to correlation and the normal curve. Correlation between a pair of variables measures to what extent their values co-vary; the term covariance naturally comes to mind. There are numerous models for describing how two sets of values change together, such as linear, exponential and others. PCA uses linear correlation. The linear correlation coefficient for two variables $x$ and $y$ is given by:

$$ r_{xy} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y} $$

where $\sigma_x$ and $\sigma_y$ denote the standard deviations of $x$ and $y$, respectively. This is the most widely used type of correlation coefficient in statistics and is also called the Pearson correlation or product-moment correlation. Correlation coefficients lie between -1.00 and +1.00. A value of -1.00 represents a perfect negative correlation, while a value of +1.00 represents a perfect positive correlation. A value of 0.00 represents a lack of linear correlation. Correlation coefficients are used to assess the degree of collinearity or redundancy among variables. Notice that the value of the correlation coefficient does not depend on the specific measurement units used.
When correlations among several variables are computed, they are typically summarized in the form of a correlation matrix. For the five variables in Table 1, we obtain the results reported in Table 2. In this Table, the intersection of a given row and column shows the correlation between the two corresponding variables. For example, the correlation between variables $X_1$ and $X_2$ is 0.94.
As can be seen from the correlations, the five variables seem to hang together in two distinct groups. First, notice that variables $X_1$, $X_2$ and $X_3$ show relatively strong correlations with one another. This could be because they are measuring the same "thing". In the same way, variables $X_4$ and $X_5$ correlate strongly with each other, a possible indication that they measure the same "thing" as well. Notice that those two variables show very weak correlations with the rest of the variables.
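As a minimal sketch of the two definitions above, the following Python/NumPy snippet computes the variance, the standard deviation and the Pearson correlation by hand. The numbers are hypothetical and are not taken from Table 1; the helper names `variance` and `pearson` are introduced here for illustration only.

```python
import numpy as np

# Hypothetical measurements for two variables (not the chapter's Table 1).
x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 6.0, 9.0])

def variance(v):
    """Average of the squared differences from the mean (divides by n)."""
    return float(np.mean((v - v.mean()) ** 2))

def pearson(a, b):
    """Covariance divided by the product of the standard deviations."""
    cov = np.mean((a - a.mean()) * (b - b.mean()))
    return float(cov / np.sqrt(variance(a) * variance(b)))

var_x = variance(x)
sd_x = np.sqrt(var_x)
r_xy = pearson(x, y)

# Changing the unit of x (here, multiplying by 100) leaves r unchanged.
r_rescaled = pearson(100.0 * x, y)

print(var_x, sd_x, r_xy, r_rescaled)
```

The last two printed values coincide, which illustrates that the correlation coefficient does not depend on the measurement units.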
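The grouping pattern described above can be reproduced on synthetic data. The sketch below does not use the chapter's Table 1 (which is not reproduced here); it assumes 20 hypothetical subjects and five variables driven by two independent hypothetical factors, and computes their correlation matrix with `numpy.corrcoef`.

```python
import numpy as np

# Synthetic stand-in for Table 1: 20 subjects, five variables,
# built from two independent hypothetical factors f1 and f2.
rng = np.random.default_rng(0)
n = 20
f1, f2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([
    f1 + 0.2 * rng.normal(size=n),   # X1
    f1 + 0.2 * rng.normal(size=n),   # X2
    f1 + 0.3 * rng.normal(size=n),   # X3
    f2 + 0.2 * rng.normal(size=n),   # X4
    f2 + 0.2 * rng.normal(size=n),   # X5
])

# rowvar=False: columns are variables, rows are subjects.
R = np.corrcoef(X, rowvar=False)
print(np.round(R, 2))
```

In the printed matrix, the block formed by the first three variables and the block formed by the last two should show much stronger correlations than the entries linking the two blocks.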
Given that the 5 variables contain some "redundant" information, it is likely that they are not really measuring five different independent constructs, but two constructs or underlying factors. What are these factors? To what extent does each variable measure each of these factors? The purpose of PCA is to provide answers to these questions. Before presenting the mathematics of the method, let's see how PCA works with the data in Table 1.
In linear PCA, each of the two artificial variables (components $Z_1$ and $Z_2$) is computed as a linear combination of the original variables.
Notice that different coefficients were assigned to the original variables in computing subject scores on the two components. $X_1$, $X_2$ and $X_3$ are assigned relatively large weights that range from 0.554 to 0.579, while variables $X_4$ and $X_5$ are assigned very small weights ranging from 0.098 to 0.126. As a result, component $Z_1$ should account for much of the variability in the first three variables. In creating subject scores on the second component, much weight is given to $X_4$ and $X_5$, while little weight is given to $X_1$, $X_2$ and $X_3$. Subject scores on each component are computed by adding together weighted scores on the observed variables. For example, the value of a subject along the first component $Z_1$ is 0.579 times the standardized value of $X_1$ plus 0.577 times the standardized value of $X_2$ plus 0.554 times the standardized value of $X_3$ plus 0.126 times the standardized value of $X_4$ plus 0.098 times the standardized value of $X_5$.
At this stage of our analysis, it is reasonable to wonder how the weights in the preceding equations are determined. Are they optimal in the sense that no other set of weights could produce components that better account for the variance in the dataset? How are principal components computed?
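The weighted sum described above can be written as a dot product. In the sketch below the weights are those quoted in the text for the first component, while the standardized values of the subject are hypothetical and serve only as an illustration.

```python
import numpy as np

# Weights of the first component Z1, as quoted in the text.
w1 = np.array([0.579, 0.577, 0.554, 0.126, 0.098])

# Hypothetical standardized values of X1..X5 for one subject (illustrative only).
subject = np.array([1.0, 0.5, 1.5, -0.2, 0.3])

# Score of the subject on Z1: sum of weight * standardized value.
z1 = float(w1 @ subject)

# Sum of squared weights; PCA weights form a unit vector, so this is close to 1.
norm_sq = float(w1 @ w1)

print(round(z1, 4), round(norm_sq, 4))
```

The sum of squared weights is close to 1 (up to the rounding of the published weights), which anticipates the unit-length constraint imposed on the weights when the components are derived.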

Graphs and distances among points
Our dataset in Table 1 can be represented in two graphs: one representing the subjects, and the other the variables. In the first, each subject (individual) is a vector whose coordinates are its observed values on the 5 variables, so the cloud of subject points lies in $\mathbb{R}^5$. In the second, each variable is regarded as a vector in $\mathbb{R}^{20}$, with one coordinate per subject. The centroid of the cloud of subjects is the point whose coordinates are the means of the variables. The total variance of the dataset, i.e. the sum of the variances of the variables, measures how spread out the points are around the centroid. We will need this quantity when determining principal components.
We define the distance between subjects $s_i$ and $s_{i'}$ using the Euclidean distance as follows:

$$ d^2(s_i, s_{i'}) = \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \tag{7} $$

Two subjects are close to one another when they take similar values for all variables. We can use this distance to measure the overall dispersion of the data around the centroid or to cluster the points, as in classification methods.

How to proceed when data are in different units?
Several problems arise when variables are measured in different units. The first concerns the meaning of the total variance: how can we sum quantities expressed in different measurement units? The second is that the distance between points can be greatly influenced by the choice of units. To illustrate this point, let us consider the distances between subjects 7, 8 and 9. Applying Eq. (7) before and after a change in the measurement unit of a variable, we observe that subject 7 becomes closer to subject 9 than to subject 8. It is hard to accept that the measurement units of the variables can so greatly change the comparison among subjects. Indeed, in this way we could make a tall man appear as short as we want! As seen, PCA is sensitive to scale: if you multiply one variable by a scalar, you get different results. In particular, the principal components depend on the units used to measure the original variables as well as on the range of values they assume (variance). This makes comparison very difficult. For these reasons, we should usually standardize the variables prior to using PCA. A common standardization method is to subtract the mean and divide by the standard deviation. This yields:

$$ X^* = \frac{X - \bar{X}}{\sigma_X} $$

where $\bar{X}$ and $\sigma_X$ are the mean and standard deviation of $X$, respectively.
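The effect of measurement units on Euclidean distances can be illustrated with a small sketch. The three subjects below are hypothetical (they are not subjects 7, 8 and 9 of Table 1): each is described by a height and a weight, and the height is expressed first in metres and then in centimetres.

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two subjects."""
    return float(np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

# Hypothetical subjects: (height in metres, weight in kilograms).
a = np.array([1.60, 60.0])
b = np.array([1.90, 61.0])
c = np.array([1.62, 64.0])

# Distances with height in metres: the weight difference dominates.
d_ab_m, d_ac_m = euclidean(a, b), euclidean(a, c)

# Same subjects with height converted to centimetres.
to_cm = np.array([100.0, 1.0])
d_ab_cm = euclidean(a * to_cm, b * to_cm)
d_ac_cm = euclidean(a * to_cm, c * to_cm)

print("metres:      a-b =", round(d_ab_m, 2), " a-c =", round(d_ac_m, 2))
print("centimetres: a-b =", round(d_ab_cm, 2), " a-c =", round(d_ac_cm, 2))
```

With heights in metres, subject a is closer to b than to c; with heights in centimetres the ordering is reversed, although nothing about the subjects has changed.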
Thus, the new variables all have zero mean and unit standard deviation. Therefore the total variance of the data set is the number of observed variables being analyzed.
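A brief sketch of the standardization step, assuming hypothetical data whose five columns are deliberately on very different scales. It checks that each standardized variable has zero mean and unit standard deviation, so that the total variance equals the number of variables.

```python
import numpy as np

# Hypothetical data: 20 subjects, 5 variables on very different scales.
rng = np.random.default_rng(1)
n, p = 20, 5
scales = np.array([1.0, 10.0, 100.0, 0.1, 1000.0])
X = rng.normal(size=(n, p)) * scales + scales

# Subtract the mean and divide by the standard deviation (computed with 1/n).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Total variance = sum of the variances of the standardized variables.
total_variance = float(X_std.var(axis=0).sum())

print(np.round(X_std.mean(axis=0), 10))
print(np.round(X_std.std(axis=0), 10))
print(total_variance)
```

Note that `numpy.std` divides by n by default, which matches the definition of the variance used in this chapter; software that divides by n - 1 gives slightly different standardized values.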
Throughout, we assume that the data have been centered and standardized. Graphically, this implies that the centroid or center of gravity of the whole dataset is at the origin. In this case, the PCA is called normalized principal component analysis, and is based on the correlation matrix (and not on the variance-covariance matrix). The variables will lie on the unit sphere; their projection on the subspace spanned by the principal components is the "correlation circle". Standardization allows the use of variables which are not measured in the same units (e.g. temperature, weight, distance, size, etc.). Also, as we will see later, working with standardized data makes interpretation easier.
Consider a dataset consisting of p variables observed on n subjects. Variables are denoted by $(x_1, x_2, \ldots, x_p)$. In general, data are arranged in a table with the rows representing the subjects (individuals) and the columns the variables. The dataset can also be viewed as an $n \times p$ rectangular matrix $X$. Note that the variables must be such that their means make sense, i.e. they are quantitative. The variables are also standardized.
We can represent these data in two graphs: on the one hand, a subject graph in which we look for similarities or differences between subjects; on the other, a variable graph in which we look for correlations between variables. The subject graph lies in a p-dimensional space, i.e. in $\mathbb{R}^p$, while the variable graph lies in an n-dimensional space, i.e. in $\mathbb{R}^n$. We have two clouds of points in high-dimensional spaces, too large for us to plot and see anything in them. We cannot see beyond a three-dimensional space! PCA gives us a subspace of reasonable dimension such that the projection onto this subspace retains "as much as possible" of the information present in the dataset, i.e. such that the projected clouds of points are as "dispersed" as possible. In other words, the goal of PCA is to compute another basis that best re-expresses the dataset. The hope is that this new basis will filter out the noise and reveal hidden structure.
Dimensionality reduction implies information loss. How can we represent the data in a lower-dimensional form without losing too much information? Preserving as much information as possible is the objective of the mathematics behind the PCA procedure.
We first assume that we want to project the data points onto a 1-dimensional space. The principal component corresponding to this axis is a linear combination of the original variables and can be expressed as follows:

$$ z_1 = \alpha_{11} x_1 + \alpha_{12} x_2 + \cdots + \alpha_{1p} x_p = X u_1 \tag{14} $$

where $u_1 = (\alpha_{11}, \alpha_{12}, \ldots, \alpha_{1p})'$ is a column vector of weights. The principal component $z_1$ is determined such that the overall variance of the resulting points is as large as possible. Of course, one could make the variance of $z_1$ arbitrarily large by choosing large values for the weights $\alpha_{11}, \alpha_{12}, \ldots, \alpha_{1p}$. To prevent this, the weights are calculated under the constraint that their sum of squares is one, that is, $u_1$ is a unit vector:

$$ u_1' u_1 = \alpha_{11}^2 + \alpha_{12}^2 + \cdots + \alpha_{1p}^2 = 1 $$

Eq. (14) also gives the projections of the n subjects on the first component. PCA finds $u_1$ so that

$$ \mathrm{Var}(z_1) = \frac{1}{n} z_1' z_1 = \frac{1}{n} u_1' X' X u_1 = u_1' C u_1 $$

is maximal, where $C = \frac{1}{n} X'X$ is the correlation matrix of the variables. The optimization problem is:

$$ \max_{u_1} \; u_1' C u_1 \quad \text{subject to} \quad u_1' u_1 = 1 $$

This program means that we search for a unit vector $u_1$ that maximizes the variance of the projection on the first component. The technique for solving such equality-constrained optimization problems involves constructing a Lagrangian function:

$$ L = u_1' C u_1 - \lambda_1 (u_1' u_1 - 1) $$

Taking the partial derivative with respect to $u_1$ and setting it to zero, $\partial L / \partial u_1 = 2 C u_1 - 2 \lambda_1 u_1 = 0$, gives:

$$ C u_1 = \lambda_1 u_1 \tag{19} $$

By premultiplying each side of this condition by $u_1'$ and using the condition $u_1' u_1 = 1$ we get:

$$ u_1' C u_1 = \lambda_1 = \mathrm{Var}(z_1) \tag{20} $$

It is known from matrix algebra that the parameters $\lambda_1$ and $u_1$ that satisfy conditions (19) and (20) are the largest eigenvalue and the corresponding eigenvector of the correlation matrix $C$. Thus the optimal coefficients of the original variables generating the first principal component $z_1$ are the elements of the eigenvector corresponding to the largest eigenvalue of the correlation matrix. These elements are also known as loadings.
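The result above can be checked numerically. The following sketch assumes hypothetical random data (not Table 1) and uses `numpy.linalg.eigh`, which returns the eigenvalues of a symmetric matrix in ascending order. It verifies that the variance of the projection on the leading eigenvector equals the largest eigenvalue, and that no randomly drawn unit vector does better.

```python
import numpy as np

# Hypothetical correlated data: 20 subjects, 5 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5)) @ rng.normal(size=(5, 5))
n = X.shape[0]

# Standardize, then form the correlation matrix C = (1/n) Z'Z.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
C = Z.T @ Z / n

# eigh is meant for symmetric matrices; eigenvalues come out in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
lam1 = eigvals[-1]          # largest eigenvalue
u1 = eigvecs[:, -1]         # corresponding unit eigenvector

# Variance of the projection of the subjects on u1.
var_z1 = float(np.var(Z @ u1))

# Variance of the projection on 1000 random unit vectors, for comparison.
U = rng.normal(size=(1000, 5))
U /= np.linalg.norm(U, axis=1, keepdims=True)
random_vars = np.var(Z @ U.T, axis=0)

print(lam1, var_z1, random_vars.max())
```

The random search is only a sanity check, not a proof: it illustrates that the eigenvector direction attains a variance that the sampled directions never exceed.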
The second principal component is calculated in the same way, with the condition that it is uncorrelated (orthogonal) with the first principal component and that it accounts for the largest part of the remaining variance.
The optimization problem is therefore:

$$ \max_{u_2} \; u_2' C u_2 \quad \text{subject to} \quad u_2' u_2 = 1, \quad u_1' u_2 = 0 $$

Using the Lagrangian technique again, the following conditions are obtained:

$$ C u_2 = \lambda_2 u_2, \qquad u_2' C u_2 = \lambda_2 $$

So once more the second vector turns out to be the eigenvector corresponding to the second highest eigenvalue of the correlation matrix.
Using induction, it can be proven that PCA is a procedure of eigenvalue decomposition of the correlation matrix. The coefficients generating the linear combinations that transform the original variables into uncorrelated variables are the eigenvectors of the correlation matrix. This is good news, because finding eigenvectors is something that can be done rapidly using many statistical packages (SAS, Stata, R, SPSS, SPAD…), and because eigenvectors have many nice mathematical properties. Note that rather than maximizing variance, it might sound more plausible to look for the projection with the smallest average (mean-squared) distance between the original points and their projections on the principal components. This turns out to be equivalent to maximizing the variance (Pythagorean Theorem).
An interesting property of the principal components is that they are all uncorrelated (orthogonal) to one another. This is because the matrix $C$ is real and symmetric, and linear algebra tells us that such a matrix is diagonalizable with eigenvectors that are orthogonal to one another. Moreover, because $C$ is a covariance matrix, it is positive semi-definite in the sense that $u' C u \geq 0$ for any vector $u$. This tells us that the eigenvalues of $C$ are all non-negative.
The eigenvectors are the "preferential directions" of the data set. The principal components are derived in decreasing order of importance, and each has a variance equal to its corresponding eigenvalue. The first principal component is the direction along which the data have the most variance. The second principal component is the direction orthogonal to the first component with the most variance. Taken together, all the components explain 100% of the variability in the data. This is why we say that PCA works like a change of basis. Analyzing the original data in the canonical space yields the same results as examining them in the component space. However, PCA allows us to obtain a linear projection of our data, originally in $\mathbb{R}^p$, onto $\mathbb{R}^q$, where $q < p$. The variance of the projections onto the first q principal components is the sum of the eigenvalues corresponding to these components. If the data fall near a q-dimensional subspace, then $p - q$ of the eigenvalues will be nearly zero.

Summarizing the computational steps of PCA
Suppose the variables $x_1, x_2, \ldots, x_p$ are observed on n subjects. The computational steps that need to be accomplished in order to obtain the results of PCA are the following:
Step 1. Compute the mean of each variable: $\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}$
Step 2. Standardize the data: $x_{ij}^* = (x_{ij} - \bar{x}_j) / \sigma_j$
Step 3. Compute the correlation matrix: $C = \frac{1}{n} X^{*\prime} X^*$
Step 4. Compute the eigenvalues of C: $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$


Step 5. Compute the eigenvectors of C: $u_1, u_2, \ldots, u_p$
Step 6. Apply the linear transformation $\mathbb{R}^p \to \mathbb{R}^q$ that performs the dimensionality reduction.
Notice that, in this analysis, we gave the same weight to each subject. We could have given more weight to some subjects, to reflect their representativity in the population.
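The computational steps above can be sketched in a few lines of NumPy. This is an illustrative implementation under the chapter's conventions (equal weights for all subjects, variances computed with 1/n), not the routine of any particular statistical package; the function name `pca_correlation` and the random input data are assumptions made for the example.

```python
import numpy as np

def pca_correlation(X, q):
    """Normalized PCA: returns eigenvalues, eigenvectors and the first q scores."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    mean = X.mean(axis=0)                      # Step 1: means
    Z = (X - mean) / X.std(axis=0)             # Step 2: standardize (1/n variance)
    C = Z.T @ Z / n                            # Step 3: correlation matrix
    eigvals, eigvecs = np.linalg.eigh(C)       # Steps 4-5: eigen-decomposition
    order = np.argsort(eigvals)[::-1]          # sort from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs[:, :q]                # Step 6: project R^p -> R^q
    return eigvals, eigvecs, scores

# Hypothetical data: 20 subjects, 5 correlated variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5)) @ rng.normal(size=(5, 5))

eigvals, eigvecs, scores = pca_correlation(X, q=2)
print(np.round(eigvals, 3))          # eigenvalues in decreasing order
print(round(eigvals.sum(), 6))       # equals the number of variables
print(scores.shape)                  # one row per subject, one column per component
```

The sign of each eigenvector is arbitrary, so different programs may return components with flipped signs; this does not change the variances or the interpretation.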

Criteria for determining the number of meaningful components to retain
In principal component analysis the number of components extracted is equal to the number of variables being analyzed (under the general condition $n > p$). This means that an analysis of our 5 variables would actually result in 5 components, not two. However, since PCA aims at reducing dimensionality, only the first few components will be important enough to be retained for interpretation and used to present the data. It is therefore reasonable to wonder how many independent components are necessary to best describe the data.
Eigenvalues can be thought of as a quantitative assessment of how well a component represents the data. The higher the eigenvalue of a component, the more representative it is of the data. Eigenvalues are therefore used to determine the meaningfulness of components. Table 3 provides the eigenvalues from the PCA applied to our dataset. In the column headed "Eigenvalue", the eigenvalue for each component is presented. Each row in the table presents information about one of the 5 components: row "1" provides information about the first component (PCA1) extracted, row "2" provides information about the second component (PCA2) extracted, and so forth. Eigenvalues are ranked from the highest to the lowest.
Table 3. Eigenvalues from PCA
Several criteria have been proposed for determining how many meaningful components should be retained for interpretation. This section describes three criteria: the Kaiser eigenvalue-one criterion, the Cattell scree test, and the cumulative percent of variance accounted for.

Kaiser method
The Kaiser (1960) method provides a handy rule of thumb that can be used to retain meaningful components. This rule suggests keeping only components with eigenvalues greater than 1. This method is also known as the eigenvalue-one criterion. The rationale for this criterion is straightforward. Each observed variable contributes one unit of variance to the total variance in the data set. Any component that displays an eigenvalue greater than 1 accounts for a greater amount of variance than does any single variable. Such a component is therefore accounting for a meaningful amount of variance, and is worthy of being retained. On the other hand, a component with an eigenvalue of less than 1 accounts for less variance than does one variable. The purpose of principal component analysis is to reduce variables into a relatively smaller number of components; this cannot be effectively achieved if we retain components that account for less variance than do individual variables. For this reason, components with eigenvalues less than 1 are of little use and are not retained. When a covariance matrix is used, this criterion retains components whose eigenvalue is greater than the average variance of the data (Kaiser-Guttman criterion).
However, this method can lead to retaining the wrong number of components under circumstances that are often encountered in research. The thoughtless application of this rule can lead to errors of interpretation when differences in the eigenvalues of successive components are trivial. For example, if component 2 displays an eigenvalue of 1.01 and component 3 displays an eigenvalue of 0.99, then component 2 will be retained but component 3 will not; this may mislead us into believing that the third component is meaningless when, in fact, it accounts for almost exactly the same amount of variance as the second component. It is possible to use statistical tests to test for differences between successive eigenvalues. In fact, the Kaiser criterion ignores the sampling error associated with each eigenvalue. Lambert, Wildt and Durand (1990) proposed a bootstrapped version of the Kaiser approach to determine the interpretability of eigenvalues.
Table 3 shows that the first component has an eigenvalue substantially greater than 1. It therefore explains more variance than a single variable, in fact 2.653 times as much. The second component displays an eigenvalue of 1.98, which is substantially greater than 1, and the third component displays an eigenvalue of 0.269, which is clearly lower than 1. The application of the Kaiser criterion leads us to retain unambiguously the first two principal components.
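As a small sketch, the Kaiser rule can be applied to the three eigenvalues quoted in the text. The fourth and fifth eigenvalues are not given in the text and are deliberately left out; since eigenvalues are ranked in decreasing order, they are smaller than 0.269 and cannot change the outcome.

```python
import numpy as np

# First three eigenvalues quoted in the text (the last two are not reported here).
reported_eigenvalues = np.array([2.653, 1.98, 0.269])

# Kaiser eigenvalue-one criterion: keep components with eigenvalue > 1.
retained = np.flatnonzero(reported_eigenvalues > 1.0) + 1   # 1-based component numbers
n_retained = int(retained.size)

print("components retained:", retained.tolist())
```

The strict threshold of 1 applies to a PCA on the correlation matrix; for a covariance-based PCA the threshold would be the average variance instead, as noted above.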

Cattell scree test
The scree test is another device for determining the appropriate number of components to retain. First, it graphs the eigenvalues against the component number. As eigenvalues are constrained to decrease monotonically from the first principal component to the last, the scree plot shows the decreasing rate at which variance is explained by additional principal components. To choose the number of meaningful components, we next look at the scree plot and stop at the point where it begins to level off (Cattell, 1966; Horn, 1965). The components that appear before the "break" are assumed to be meaningful and are retained for interpretation; those appearing after the break are assumed to be unimportant and are not retained. Between the components before and after the break lies a scree.
The scree plot of eigenvalues derived from Table 3 is displayed in Figure 1. The component numbers are listed on the horizontal axis, while eigenvalues are listed on the vertical axis. The Figure shows a relatively large break between components 2 and 3, meaning that each successive component beyond the second accounts for smaller and smaller amounts of the total variance. This agrees with the preceding conclusion that two principal components provide a reasonable summary of the data, accounting for about 93% of the total variance. Sometimes a scree plot will display a pattern such that it is difficult to determine exactly where a break exists. In that case, the scree plot must be supplemented with additional criteria, such as the Kaiser method or the cumulative percent of variance accounted for.
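The scree test is a visual judgment, but a rough numerical stand-in is to look for the largest drop between successive eigenvalues. The sketch below applies this simple heuristic (an assumption made here for illustration, not Cattell's procedure itself) to the three eigenvalues quoted in the text.

```python
import numpy as np

# First three eigenvalues quoted in the text (the last two are not reported here).
reported_eigenvalues = np.array([2.653, 1.98, 0.269])

# Drop between each component and the next one.
drops = -np.diff(reported_eigenvalues)

# The "break" is placed after the component preceding the largest drop.
break_after = int(np.argmax(drops)) + 1   # 1-based component number

print("drops:", np.round(drops, 3).tolist())
print("break after component", break_after)
```

On real data the largest drop often occurs after the very first component, so this heuristic should be read alongside the plot rather than in place of it.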

Cumulative percent of total variance accounted for
When determining the number of meaningful components, remember that the subspace of components retained must account for a reasonable amount of variance in the data. It is common to express the eigenvalues as a percentage of the total. The fraction of an eigenvalue out of the sum of all eigenvalues represents the amount of variance accounted for by the corresponding principal component. The cumulative percent of variance explained by the first q components is calculated with the formula:

$$ r_q = \frac{\sum_{j=1}^{q} \lambda_j}{\sum_{j=1}^{p} \lambda_j} \times 100 $$

How many principal components we should use depends on how large an $r_q$ we need. This criterion involves retaining all components up to a chosen total percent of variance (Lebart, Morineau & Piron, 1995; Jolliffe, 2002). It is recommended that the components retained account for at least 60% of the variance. The principal components that offer little increase in the total variance explained are ignored; those components are considered to be noise. When PCA works well, the first two eigenvalues usually account for more than 60% of the total variation in the data.
In our current example, the percentage of variance accounted for by each component and the cumulative percent variance appear in Table 3. From this Table we can see that the first component alone accounts for 53.057% of the total variance and the second component alone accounts for 39.597% of the total variance. Adding these percentages together results in a sum of 92.65%. This means that the cumulative percent of variance accounted for by the first two components is about 93%. This provides a reasonable summary of the data. Thus we can keep the first two components and "throw away" the other components.
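The percentages above can be recomputed from the eigenvalues quoted in the text. Because the five variables are standardized, the sum of all eigenvalues equals 5, so the two unreported eigenvalues are not needed to compute the shares of the first components. Small differences with Table 3 come from the rounding of the quoted eigenvalues.

```python
import numpy as np

# First three eigenvalues quoted in the text; p = 5 standardized variables,
# so the eigenvalues of the correlation matrix sum to 5.
reported_eigenvalues = np.array([2.653, 1.98, 0.269])
p = 5

percent = 100.0 * reported_eigenvalues / p      # share of each component
cumulative = np.cumsum(percent)                 # cumulative share r_q

for k, (pc, cum) in enumerate(zip(percent, cumulative), start=1):
    print(f"component {k}: {pc:.2f}%  cumulative {cum:.2f}%")
```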

Interpretation of principal components
Running a PCA has become easy with statistical software. However, interpreting the results can be a difficult task. Here are a few guidelines that should help practitioners through the analysis.

The visual approach of correlation
Once the analysis is complete, we wish to assign a name to each retained component that describes its content. To do this, we need to know which variables explain the components. Correlations of the variables with the principal components are useful tools that help in interpreting the meaning of the components. The correlations between each variable and each principal component are given in Table 4. Those correlations are also known as component loadings. A coefficient greater than 0.4 in absolute value is considered significant (see Stevens (1986) for a discussion). We can interpret PCA1 as being highly positively correlated with variables $X_1$, $X_2$ and $X_3$, and weakly positively correlated with variables $X_4$ and $X_5$. So $X_1$, $X_2$ and $X_3$ are the most important variables in the first principal component. PCA2, on the other hand, is highly positively correlated with $X_4$ and $X_5$, and weakly negatively related to $X_1$ and $X_2$. So $X_4$ and $X_5$ are most important in explaining the second principal component. Therefore, the name of the first component comes from variables $X_1$, $X_2$ and $X_3$, while that of the second component comes from $X_4$ and $X_5$.

Variables
It can be shown that the coordinate of a variable on a component is the correlation coefficient between that variable and the principal component. This allows us to plot the reduced-dimension representation of the variables in the plane constructed from the first two components. Variables highly correlated with a component form a small angle with it. For two centered variables $x_j$ and $x_k$ separated by an angle $\theta_{jk}$:

$$ \cos \theta_{jk} = \frac{\langle x_j, x_k \rangle}{\|x_j\| \, \|x_k\|} = r(x_j, x_k) \tag{29} $$

Eq. (29) shows the connection between the cosine and the correlation coefficient: the cosine of the angle between two variables is interpreted as their correlation. Variables highly positively correlated with each other form a small angle, while those that are negatively correlated point in opposite directions, i.e. they form a straight angle.
From Figure 2 we can see that the five variables hang together in two distinct groups. Variables $X_1$, $X_2$ and $X_3$ are positively correlated with each other, and form the first group. Variables $X_4$ and $X_5$ also correlate strongly with each other, and form the second group. Those two groups are weakly correlated. In fact, Figure 2 gives a reduced-dimension representation of the correlation matrix given in Table 2.
It is extremely important, however, to notice that the angle between variables can be interpreted in terms of correlation only when the variables are well represented, that is, when they are close to the border of the circle of correlation. Remember that the goal of PCA is to explain multiple variables by a smaller number of components, and keep in mind that the graphs obtained from this reduction method are projections that optimize a global criterion (i.e. the total variance). As such, some relationships between variables may be greatly altered. Correlations between variables and components supply insights about variables that are not well represented. In a subspace of components, the quality of representation of a variable is assessed by the sum of squared component loadings across components. This is called the communality of the variable:

$$ h_j^2 = \sum_{k=1}^{q} r^2(x_j, z_k) $$

It measures the proportion of the variance of a variable accounted for by the components. In our example, the communality of the variable $X_1$ is $0.943^2 + 0.241^2 \approx 0.948$. This means that the first two components explain about 95% of the variance of the variable $X_1$. This is substantial enough to let us fully interpret the variability in this variable as well as its relationship with the other variables. Communality can be used as a measure of goodness-of-fit of the projection. The communalities of the 5 variables of our data are displayed in Table 5. As shown by this Table, the first two components explain more than 80% of the variance in each variable. This is enough to reveal the structure of correlation among the variables. Do not interpret the angle between two variables as a correlation when at least one of them has a low communality. Using communality prevents potential biases that may arise from directly interpreting the numerical and graphical results yielded by the PCA. All these results show that outcomes from normalized PCA can be easily interpreted without additional complicated calculations. From a visual inspection of the graph, we can see the groups of variables that are correlated, interpret the principal components and name them.
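The numbers in this example can be cross-checked. For standardized data, the correlation between a variable and a component equals the variable's weight in the corresponding unit eigenvector multiplied by the square root of the eigenvalue. The sketch below uses the first-component weights and eigenvalue quoted earlier in the text, together with the magnitude 0.241 of the loading of $X_1$ on the second component; results agree with the text up to rounding of the published figures.

```python
import numpy as np

# Weights of the first component and its eigenvalue, as quoted in the text.
w1 = np.array([0.579, 0.577, 0.554, 0.126, 0.098])
lam1 = 2.653

# Loadings (variable-component correlations) on the first component.
loadings_pc1 = w1 * np.sqrt(lam1)

# Magnitude of the loading of X1 on the second component, from the text.
loading_x1_pc2 = 0.241

# Communality of X1 in the plane of the first two components.
communality_x1 = float(loadings_pc1[0] ** 2 + loading_x1_pc2 ** 2)

print(np.round(loadings_pc1, 3))
print(round(communality_x1, 3))
```

The first entry of `loadings_pc1` reproduces the loading 0.943 used in the communality calculation above.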

Factor scores and their use in multivariate models
A useful by-product of PCA is factor scores. Factor scores are the coordinates of subjects (individuals) on each component. They indicate where a subject stands on the retained component. Factor scores are computed as weighted sums of the values of the observed variables. Results for our dataset are reported in Table 6.
Factor scores can be used to plot a reduced representation of subjects. This is displayed by Figure 3.
How do we interpret the position of points on this diagram? Recall that this graph is a projection. As such, some distances could be spurious. To distinguish wrong projections from real ones and better interpret the plot, we need to use what is called "the quality of representation" of subjects. This is computed as the square of the cosine of the angle between a subject $s_i$ and a component $z_k$, following the formula:
\[ \cos^2(s_i, z_k) = \frac{z_{ik}^2}{\lVert s_i \rVert^2} \qquad (30) \]
where $z_{ik}$ denotes the coordinate (factor score) of subject $s_i$ on component $z_k$. $\cos^2$ is interpreted as a measure of goodness-of-fit of the projection of a subject on a given component. Notice that in Eq. (30), $\lVert s_i \rVert^2$ is the squared distance of subject $s_i$ from the origin. It measures how far the subject is from the center. So if $\cos^2$ is close to 1, the extracted component reproduces a great amount of the original behavior of the subject. Since the components are orthogonal, the quality of representation of a subject in a given subspace of components is the sum of the associated $\cos^2$ values. This notion is similar to the concept of communality previously defined for variables.
These statistics are also reported in Table 6. As can be seen, the two retained components explain more than 80% of the behavior of the subjects, except for subjects 6 and 7. Now that we are confident that almost all the subjects are well represented, we can interpret the graph. Subjects located on the right side, with larger coordinates on the first component, i.e. 1, 9, 6, 3 and 5, have values of X1, X2 and X3 greater than the average. Those located on the left side, with smaller coordinates on the first axis, i.e. 20, 19, 18, 16, 12, 11 and 10, record lower values for these variables. On the other hand, subjects 15 and 17 are characterized by the highest values for variables X4 and X5, while subjects 8 and 13 record the lowest values for these variables.
Very often a small number of subjects can determine the direction of the principal components. This is because PCA relies on the notions of mean, variance and correlation, and it is well known that these statistics are influenced by outliers or atypical observations in the data. To detect these atypical subjects, we define the notion of "contribution", which measures how much a subject contributes to the variance of a component. With equal weights $1/n$ for the $n$ subjects, contributions (CTR) are computed as follows:
\[ \mathrm{CTR}_k(s_i) = \frac{\tfrac{1}{n}\, z_{ik}^2}{\lambda_k} \]
where $z_{ik}$ is the coordinate of subject $s_i$ on component $z_k$ and $\lambda_k$ is the variance (eigenvalue) of that component.
Contributions are reported in the last two columns of Table 6. Subject 4 contributes greatly to the first component, with a contribution of 16.97%. This indicates that subject 4 alone explains 16.97% of the variance of the first component. Therefore, this subject takes higher values for X1, X2 and X3. This can be easily verified from the original Table 1. Regarding the second component, over 25% of the variance of the data accounted for by this component is explained by subjects 15 and 17. These subjects exhibit high values for variables X4 and X5.
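The two diagnostics for subjects, the quality of representation (squared cosine) and the contribution to a component, can be sketched numerically as follows. This is a minimal illustration with NumPy on synthetic data, not the dataset of Table 1. It assumes normalized PCA with equal weights 1/n, so that the squared cosine is the squared score divided by the squared distance to the origin, and the contribution is the squared score divided by n times the eigenvalue. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 20 subjects, 5 variables (illustrative only, not Table 1).
X = rng.normal(size=(20, 5))
X[:, 1] += X[:, 0]   # make the second variable correlate with the first
X[:, 4] += X[:, 3]   # make the fifth variable correlate with the fourth

n = X.shape[0]
Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized variables (ddof=0)
R = Z.T @ Z / n                            # correlation matrix
eigval, eigvec = np.linalg.eigh(R)         # eigh returns ascending eigenvalues
order = np.argsort(eigval)[::-1]           # reorder to descending
eigval, eigvec = eigval[order], eigvec[:, order]

F = Z @ eigvec                             # factor scores of the subjects

# Quality of representation: squared cosine between subject i and component k.
cos2 = F**2 / (Z**2).sum(axis=1, keepdims=True)
quality_plane = cos2[:, :2].sum(axis=1)    # quality in the plane of components 1-2

# Contribution of subject i to the variance of component k (weights 1/n).
ctr = F**2 / (n * eigval)

print(np.round(quality_plane, 3))
print(np.round(100 * ctr[:, :2], 2))       # contributions in percent
```

Over all components the squared cosines of a subject sum to 1, and over all subjects the contributions to a component sum to 1, which gives a quick sanity check on the computation.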
The principal components obtained from PCA can be used in subsequent analyses (regressions, poverty analysis, classification…). For example, in linear regression models, the presence of correlated variables poses the well-known econometric problem of multicollinearity, which makes regression coefficients unstable. This problem is avoided when using the principal components, which are orthogonal to one another. At the end of the analysis you can re-express the model in terms of the original variables using the equations defining the principal components. If some variables are not correlated with the other variables, you can remove them prior to the PCA and reintroduce them in your model once the model is estimated.
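The use of components as regressors can be sketched as follows. This is a minimal illustration on synthetic, deliberately collinear data, assuming normalized PCA and ordinary least squares via `numpy.linalg.lstsq`; the variable names and the choice of two retained components are assumptions made for the example. The last step re-expresses the fitted model in terms of the standardized original variables.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
# Synthetic, deliberately collinear regressors (illustrative only).
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)        # almost a copy of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + x1 + x2 - x3 + rng.normal(scale=0.1, size=n)

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized regressors
R = Z.T @ Z / n                            # correlation matrix
eigval, eigvec = np.linalg.eigh(R)
order = np.argsort(eigval)[::-1]
V = eigvec[:, order][:, :2]                # weights of the first two components
F = Z @ V                                  # component scores, mutually uncorrelated

# Regress y on the retained components plus an intercept.
A = np.column_stack([np.ones(n), F])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, gamma = coef[0], coef[1:]

# Re-express the fitted model with the standardized original variables.
beta_std = V @ gamma

fitted_components = A @ coef
fitted_original = intercept + Z @ beta_std
print(np.round(beta_std, 3))
```

Both expressions give the same fitted values, because each component is a linear combination of the standardized variables.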

A Case study with illustration using SPSS
We collected data on 10 socio-demographic variables for a sample of 132 countries. We use these data to illustrate how to perform PCA using the SPSS software package. By following the indications provided here, readers can try to reproduce the results themselves.
To perform a principal components analysis with SPSS, follow these steps:
In what follows, we review and comment on the main outputs.

Correlation Matrix
To discover the pattern of intercorrelations among variables, we examine the correlation matrix. The variables fall into two groups of correlated variables, as we will see later.

Testing for the Factorability of the Data
Before applying PCA to the data, we need to test whether they are suitable for reduction. SPSS provides two tests to assist users:
Kaiser-Meyer-Olkin Measure of Sampling Adequacy (Kaiser, 1974): This measure varies between 0 and 1, and values closer to 1 are better. A value of 0.6 is a suggested minimum for a good PCA.
Bartlett's Test of Sphericity (Bartlett, 1950): This tests the null hypothesis that the correlation matrix is an identity matrix, in which all of the diagonal elements are 1 and all off-diagonal elements are 0. We reject the null hypothesis when the significance level (p-value) of the test is below 0.05.
The results reported in Table 8 suggest that the data may be grouped into a smaller set of underlying factors.
The scree plot is displayed in Figure 4. From the second component on, the line is almost flat, with a relatively large break following component 1. So the scree test would lead us to retain only the first component. The components appearing after the break (2-10) would be regarded as trivial (less than 10% of the variance each).
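The two factorability checks used in this section can also be computed by hand from a correlation matrix. The sketch below uses the standard textbook formulas and synthetic data with the same shape as the case study (132 rows, 10 columns). It is not the SPSS implementation, and the function names are hypothetical. Bartlett's statistic is $-(n - 1 - (2p+5)/6)\ln\lvert R\rvert$ with $p(p-1)/2$ degrees of freedom. The overall KMO compares squared correlations with squared partial correlations, which are obtained here from the inverse of the correlation matrix.

```python
import numpy as np


def bartlett_sphericity(R, n):
    """Chi-square statistic and degrees of freedom of Bartlett's test."""
    p = R.shape[0]
    _, logdet = np.linalg.slogdet(R)          # log of det(R), numerically stable
    chi2 = -(n - 1 - (2 * p + 5) / 6) * logdet
    df = p * (p - 1) // 2
    return chi2, df


def kmo(R):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    P = np.linalg.inv(R)
    d = np.sqrt(np.diag(P))
    partial = -P / np.outer(d, d)             # partial correlations
    off = ~np.eye(R.shape[0], dtype=bool)     # mask of off-diagonal entries
    r2 = (R[off] ** 2).sum()
    a2 = (partial[off] ** 2).sum()
    return r2 / (r2 + a2)


rng = np.random.default_rng(2)
n, p = 132, 10                                # same shape as the case study
common = rng.normal(size=(n, 1))              # one shared factor (synthetic)
X = common + 0.5 * rng.normal(size=(n, p))
R = np.corrcoef(X, rowvar=False)

chi2, df = bartlett_sphericity(R, n)
overall_kmo = kmo(R)
print(round(chi2, 1), df, round(overall_kmo, 3))
```

The p-value of Bartlett's test is the upper tail of a chi-square distribution with `df` degrees of freedom evaluated at `chi2`, which can be obtained for instance with `scipy.stats.chi2.sf(chi2, df)` if SciPy is available.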

Fig. 4. Scree Plot
In conclusion, the dimensionality of the data could be reduced to 1. Nevertheless, we shall add the second component for representation purposes: a plot in a plane is easier to interpret than a three- or ten-dimensional plot. Note that by default SPSS uses the Kaiser criterion to extract components. It is up to the user to specify the number of components to be extracted if the Kaiser criterion underestimates the appropriate number. Here we specified 2 as the number of components to be extracted.
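The quantities behind the scree plot and the Kaiser criterion can be sketched as follows, again on synthetic data rather than the 132-country dataset. The sketch assumes the usual form of the Kaiser criterion (retain components whose eigenvalue exceeds 1) and then overrides it to keep at least two components for plotting, as done in the text. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic data driven by one common factor (illustrative only).
common = rng.normal(size=(132, 1))
X = common + 0.5 * rng.normal(size=(132, 10))
R = np.corrcoef(X, rowvar=False)

eigval = np.linalg.eigvalsh(R)[::-1]       # eigenvalues in descending order
explained = eigval / eigval.sum()          # proportion of variance per component
cumulative = np.cumsum(explained)

n_kaiser = int((eigval > 1.0).sum())       # Kaiser criterion: eigenvalue > 1
n_retained = max(n_kaiser, 2)              # keep at least two for a planar plot

for k, (lam, share) in enumerate(zip(eigval, explained), start=1):
    print(f"component {k}: eigenvalue={lam:.3f}, variance={100 * share:.1f}%")
print("Kaiser:", n_kaiser, "retained:", n_retained)
```

Plotting `eigval` against the component number gives the scree plot; a sharp break after the first point corresponds to the situation described for Figure 4.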

Component loadings
Table 10 displays the loading matrix. The entries in this matrix are the correlations between the variables and the components. As can be seen, all the variables load heavily on the first component. It is now necessary to turn to the content of the variables being analyzed in order to decide how this component should be named. What common construct do the variables seem to be measuring?
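The statement that loadings are correlations between variables and components can be checked numerically. The sketch below assumes normalized PCA, in which the loading of variable j on component k is the k-th eigenvector entry for j multiplied by the square root of the k-th eigenvalue. It compares these loadings with correlations computed directly from the scores. The data are synthetic and the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic data: 50 subjects, 4 variables (illustrative only).
X = rng.normal(size=(50, 4))
X[:, 1] += X[:, 0]
X[:, 3] -= X[:, 2]

n, p = X.shape
Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized variables
R = Z.T @ Z / n                            # correlation matrix
eigval, eigvec = np.linalg.eigh(R)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

F = Z @ eigvec                             # component scores
loadings = eigvec * np.sqrt(eigval)        # column k scaled by sqrt(eigenvalue k)

# Direct correlation between each variable j and each component k.
corr = np.array([[np.corrcoef(Z[:, j], F[:, k])[0, 1] for k in range(p)]
                 for j in range(p)])

print(np.round(loadings[:, :2], 3))        # loading matrix for two components
```

The sign of each column is arbitrary, since an eigenvector and its negative describe the same component, so only the relative signs of loadings within a column carry meaning.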
In Figure 5 we observe two opposite groups of variables. The right-side variables are positively correlated with one another and deal with the social status of the countries. The left-side variables are also positively correlated with one another and relate to another aspect of social life. It is therefore appropriate to name the first component the "social development" component.
From Figure 5 we can say that countries with high positive scores on the first component demonstrate a higher level of social development relative to countries with negative scores. In Figure 6 we can see that countries such as Burkina Faso, Niger, Sierra Leone, Tchad, Burundi, Centrafrique and Angola belong to the under-developed group.
SPSS does not directly provide the scatterplot for subjects. Since the factor scores have been created and saved as variables, we can use the Graph menu to request a scatterplot. This is an easy task in SPSS. The character variable Country is used as an identifier variable. Notice that in SPSS factor scores are standardized, with a mean of zero and a standard deviation of 1.

Fig. 6. Scatterplot of the Countries
A social development index is most useful for identifying groups of countries according to their level of development. The construction of this index assigns a social development ranking score to each country. We rescale the factor scores as follows:
\[ F^{*} = \frac{F - F_{\min}}{F_{\max} - F_{\min}} \]
where $F_{\min}$ and $F_{\max}$ are the minimum and maximum values of the factor scores $F$. Using the rescaled scores, countries are sorted in ascending order. Lower scores identify socially underdeveloped countries, whereas higher scores identify socially developed countries.
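The rescaling and ranking step can be sketched as follows. This is a minimal illustration assuming a plain min-max rescaling of the first-component scores to the interval from 0 to 1; multiplying the result by 100 is a common variant. The country labels and score values are made up for the example and are not taken from the case study.

```python
def rescale(scores):
    """Min-max rescaling of a dict of factor scores."""
    lo, hi = min(scores.values()), max(scores.values())
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}


# Hypothetical standardized factor scores on the first component.
factor_scores = {
    "Country A": -1.42,
    "Country B": 0.10,
    "Country C": 1.37,
    "Country D": -0.35,
}

index = rescale(factor_scores)
ranking = sorted(index, key=index.get)     # ascending: least developed first

for name in ranking:
    print(f"{name}: {index[name]:.3f}")
```

Because the rescaling is an increasing linear transformation, the ranking of countries is the same as the ranking given by the raw factor scores; only the scale of the index changes.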

Conclusion
Principal components analysis (PCA) is widely used in multivariate statistical data analysis. It is extremely useful when we expect variables to be correlated with each other and want to reduce them to a smaller number of factors. However, we encounter situations where variables are nonlinearly related to each other. In such cases, PCA would fail to reduce the dimension of the variables. In addition, PCA suffers from the fact that each principal component is a linear combination of all the original variables and the loadings are typically nonzero. This often makes it difficult to interpret the derived components. Rotation techniques are commonly used to help practitioners interpret principal components, but we do not recommend them.