Application of Principal Component Analysis to Image Compression

In this chapter, an introduction to the basics of principal component analysis (PCA) is given, aimed at presenting PCA applications to image compression. Concepts of linear algebra used in PCA are introduced, and the theoretical foundations of PCA are explained in connection with those concepts. Next, an image is compressed by using different numbers of principal components, and concepts such as image dimension reduction and image reconstruction quality are explained. Also, using the almost periodicity of the first principal component, a comparative quality analysis of an image compressed with two and with eight principal components is carried out. Finally, a novel construction of principal components based on their periodicity has been included, in order to reduce the computational cost of their calculation, although at the expense of some accuracy.


Introduction
Principal component analysis, also known as the Hotelling transform or the Karhunen-Loeve transform, is a statistical technique that was proposed by Karl Pearson (1901) as part of factor analysis; however, its first theoretical development appeared in 1933 in a paper written by Hotelling [1][2][3][4][5][6][7][8]. The complexity of the calculations involved in this technique delayed its development until the birth of computers, and its effective use started in the second half of the twentieth century. Because methods based on principal components developed relatively recently, they are still little used by many non-statistician researchers. The purpose of these notes is to present the nature of principal component analysis and to show some of its possible applications.
Principal component analysis refers to the explanation of the structure of variances and covariances through a few linear combinations of the original variables, without losing a significant part of the original information. In other words, it is about finding a new set of orthogonal axes in which the variance of the data is maximum. Its objectives are to reduce the dimensionality of the problem and, once the transformation has been carried out, to facilitate its interpretation.
When p variables are collected on the units analyzed, all of them are required to reproduce the total variability of the system, but often most of this variability can be found in a small number, k, of principal components. The reason lies in the redundancy that frequently exists between different variables: redundancy is data, not information. The k principal components can then replace the p initial variables, so that the original set of data, consisting of n measures of p variables, is reduced to n measures of k principal components.
The objective pursued by the analysis of principal components is the representation of the numerical measurements of several variables in a space of few dimensions, where our senses can perceive relationships that would otherwise remain hidden in higher dimensions. The abovementioned representation must be such that, when discarding higher dimensions, the loss of information is minimal. A simile could illustrate the idea: imagine a large rectangular plate that is a three-dimensional object, but that for practical purposes, we consider it as a flat two-dimensional object. When carrying out this reduction in dimensionality, a certain amount of information is lost since, for example, opposite points located on the two sides of the rectangular plate will appear confused in a single one. However, the loss of information is largely compensated by the simplification made, since many relationships, such as the neighborhood between points, are more evident when they are drawn on a plane than when done by a three-dimensional figure that must necessarily be drawn in perspective.
The analysis of principal components can reveal relationships between variables that are not evident at first sight, which facilitates the analysis of the dispersion of the observations, highlighting possible groupings and detecting the variables that are responsible for the dispersion.

Preliminaries
The study of multivariate methods is greatly facilitated by means of matrix algebra [9][10][11]. Next, we introduce some basic concepts that are essential for the explanation of statistical techniques, as well as for geometric interpretations. In addition, the relationships that can be expressed in terms of matrices are easily programmable on computers, so we can apply calculation routines to obtain other quantities of interest. It is a basic introduction about concepts and relationships.

The vector of means and the covariance matrix
Let X = [X_1 … X_p]^t be a random column vector of dimension p. Each component, X_i, is a random variable (r.v.) with mean E[X_i] = μ_i and variance V[X_i] = E[(X_i − μ_i)²] = σ_ii. Given two r.v.s, X_i and X_j, we define the covariance between them as Cov[X_i, X_j] = E[(X_i − μ_i)(X_j − μ_j)] = σ_ij. The expected values, variances, and covariances can be grouped into vectors and matrices that we will call the population mean vector, μ, and the population covariance matrix, Σ (Eq. (1)). The population correlation matrix, ρ, has entries ρ_ij = σ_ij/√(σ_ii σ_jj). In the case of having n values of the r.v.s, we will consider estimators of the previous population quantities, which we will call sample estimators.
The generalized sample variance is the determinant of S, |S|. The sample correlation matrix, R, is defined analogously from the entries of S, r_ik = s_ik/√(s_ii s_kk).

Proposition 2.1: Let x_1, …, x_n be a simple random sample of a p-dimensional r.v. X with mean vector μ and covariance matrix Σ. The unbiased estimators of μ and Σ are x̄ and S.
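As an illustration, the sample estimators above can be computed with NumPy (the small data matrix is made up for the example; `np.cov` with `rowvar=False` uses the unbiased divisor n − 1):

```python
import numpy as np

# Hypothetical data matrix: n = 5 observations (rows) of p = 2 variables (columns).
X = np.array([[1.0,  2.0],
              [3.0,  3.0],
              [5.0,  7.0],
              [7.0,  6.0],
              [9.0, 12.0]])

x_bar = X.mean(axis=0)             # sample mean vector, estimator of mu
S = np.cov(X, rowvar=False)        # unbiased sample covariance matrix S
gen_var = np.linalg.det(S)         # generalized sample variance |S|
R = np.corrcoef(X, rowvar=False)   # sample correlation matrix
```

For these data, x̄ = (5, 6), s_12 = 11.5, and |S| = 22.75, which can be checked by hand from the definitions.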

Eigenvalues and eigenvectors
One of the problems that linear algebra deals with is the simplification of matrices through methods that produce diagonal or triangular matrices, which are widely used in the resolution of linear systems of the form Ax = b. For a symmetric matrix A with eigenvalues λ_1, …, λ_p and eigenvectors e_1, …, e_p, we highlight the following properties:

1. The eigenvectors are mutually perpendicular.

2. The eigenvectors are unique unless two or more eigenvalues are equal.

3. If P = [e_1, …, e_p] is an orthogonal matrix and Λ is a diagonal matrix with main diagonal entries λ_1, …, λ_p, the spectral decomposition of A can be given by A = PΛP^t.

Remark 2.1: Let X be a matrix with the values of a simple random sample of a p-dimensional r.v. in each column, and let 1_n be the n by one vector with all its coordinates equal to 1. It can be proven that:

1. The projection of the vector y_i on the vector 1_n is the vector x̄_i 1_n, whose 2-norm is equal to √n |x̄_i|.
2. The matrix S_n is obtained from the residuals e_i = y_i − x̄_i 1_n; the squared 2-norm of e_i is equal to (n − 1)s_ii, and the scalar product of e_i and e_j is equal to (n − 1)s_ij.

3. The sample correlation coefficient r_ij is the cosine of the angle between e_i and e_j.

4. If U is the volume generated by the vectors e_i, with i = 1, …, p, then |S| = U²/(n − 1)^p. Therefore, the generalized sample variance is proportional to the square of the volume generated by the deviation vectors. The volume will increase if the norm of some e_i is increased.
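The spectral decomposition A = PΛP^t can be checked numerically; a sketch with a made-up symmetric matrix:

```python
import numpy as np

# A small symmetric matrix (hypothetical example).
A = np.array([[5.0, 2.0],
              [2.0, 2.0]])

# eigh is intended for symmetric matrices: real eigenvalues, orthonormal eigenvectors.
lam, P = np.linalg.eigh(A)            # eigenvalues in ascending order, eigenvectors as columns

# Spectral decomposition A = P Lambda P^t:
A_rebuilt = P @ np.diag(lam) @ P.T
```

Here the eigenvalues are 1 and 6, the columns of P are mutually perpendicular unit vectors, and `A_rebuilt` reproduces A.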

Distances
Many techniques of multivariate statistical analysis are based on the concept of distance. Let Q = (x_1, x_2) be a point in the plane. The Euclidean distance from Q to the origin, O, is d(Q, O) = √(x_1² + x_2²). In general, given Q = (x_1, …, x_p) and R = (y_1, …, y_p), the Euclidean distance between these two points of ℜ^p is d(Q, R) = √((x_1 − y_1)² + ⋯ + (x_p − y_p)²). All points (x_1, …, x_p) whose square distance to the origin is a fixed quantity, for example c², are the points of the p-dimensional sphere of radius |c|.
For many statistical purposes, the Euclidean distance is unsatisfactory, since each coordinate contributes in the same way to the calculation of such a distance. When the coordinates represent measures subject to random changes, it is desirable to assign weights to the coordinates depending on how high or low the variability of the measurements is. This suggests a measure of distance that is different from the Euclidean.
Next, we introduce a statistical distance that will take into account the different variabilities and correlations. Therefore, it will depend on the variances and covariances, and this distance is fundamental in multivariate analysis.
Suppose we have a fixed set of observations in ℜ p , and, to illustrate the situation, consider n pairs of measures of two variables, x 1 and x 2 . Suppose that the measurements of x 1 vary independently of x 2 and that the variability of the measures of x 1 are much greater than those of x 2 . This situation is shown in Figure 1, and our first objective is to define a distance from the points to the origin.
In Figure 1, we see that points with a given deviation from the origin lie farther from the origin in the x_1 direction than in the x_2 direction, due to the greater variability inherent in the direction of x_1. Therefore, it seems reasonable to give more weight to the coordinate x_2 than to x_1. One way to obtain these weights is to standardize the coordinates, which leads to the distance d(Q, O) = √(x_1²/s_11 + x_2²/s_22). Therefore, the points that are equidistant from the origin at a constant distance c lie on an ellipse centered at the origin whose major axis coincides with the coordinate that has the greatest variability. In the case that the variability of one variable is analogous to that of the other and the coordinates are independent, the Euclidean distance is proportional to the statistical distance.

Figure 2 shows a situation where the pairs (x_1, x_2) seem to have an increasing trend, so the sample correlation coefficient will be positive. In Figure 2, we see that if we make a rotation of amplitude α and consider the axes (g_1, g_2), we are in conditions analogous to those of Figure 1(a). Therefore, the distance from a point Q to the origin can be written as d(Q, O) = √(g_1²/s̃_11 + g_2²/s̃_22), where s̃_ii is the sample variance of the variable g_i.

The relationships between the original coordinates and the new coordinates can be expressed as a rotation and, after some algebraic manipulations, the distance takes the form d(Q, O) = √(a_11 x_1² + 2a_12 x_1 x_2 + a_22 x_2²), where the a_ij are values that depend on the angle and the dispersions and must meet the condition that the distance between any two distinct points is positive.

The distance from a point Q = (x_1, x_2) to a fixed point R = (y_1, y_2) is defined analogously. In this case, the coordinates of all points Q = (x_1, x_2) at constant statistical distance c from R verify the equation a_11(x_1 − y_1)² + 2a_12(x_1 − y_1)(x_2 − y_2) + a_22(x_2 − y_2)² = c², which is the equation of an ellipse with center R = (y_1, y_2) and with axes parallel to (g_1, g_2). Figure 3 shows ellipses with constant statistical distances.

This distance can be generalized to ℜ^p if a_11, …, a_pp, a_12, …, a_{p−1,p} are values such that the squared distance from Q to R is given by d(Q, R)² = Σ_i Σ_j a_ij(x_i − y_i)(x_j − y_j).

This distance, therefore, is completely determined by the coefficients a_ij, with i, j ∈ {1, …, p}, which can be arranged in the matrix A = (a_ij) of Eq. (4). The elements of Eq. (4) cannot be arbitrary: in order to define a distance over a vector space, Eq. (4) must be a square, symmetric, positive definite matrix. Therefore, the inverse of the sample covariance matrix of a data matrix, S, is a natural candidate to define a statistical distance. Figure 4 shows a cloud of points with center of gravity, (x̄_1, x̄_2), at point R. At first glance, it can be seen that the Euclidean distance from point R to point Q is greater than the Euclidean distance from point R to the origin; however, Q seems to have more to do with the cloud of points than the origin does. If we take into account the variability of the points in the cloud and use the statistical distance, then Q will be closer to R than the origin is.

The explanation given above illustrates the need to consider distances other than the Euclidean one.
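With A = S⁻¹, the statistical distance above is the Mahalanobis distance. A minimal sketch with simulated data (the cloud mimics Figure 1, with much more variability in x_1; the data set and numbers are illustrative only):

```python
import numpy as np

# Hypothetical 2-D cloud: x1 varies much more than x2 (cf. Figure 1).
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2)) * np.array([10.0, 1.0])

S = np.cov(data, rowvar=False)
S_inv = np.linalg.inv(S)

def statistical_distance(q, r, S_inv):
    """Distance determined by the matrix A = S^{-1} (Mahalanobis distance)."""
    d = np.asarray(q) - np.asarray(r)
    return float(np.sqrt(d @ S_inv @ d))

origin = np.zeros(2)
# The same Euclidean offset along the high- and the low-variance direction:
d1 = statistical_distance([10.0, 0.0], origin, S_inv)
d2 = statistical_distance([0.0, 10.0], origin, S_inv)
# d1 < d2: the low-variance direction is weighted more heavily.
```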

Population principal components
Principal components are a particular case of linear combinations of p r.v.s, X 1 , …, X p . These linear combinations represent, geometrically, a new coordinate system that is obtained by rotating the original reference system that has X 1 , …, X p as coordinate axes. The new axes represent the directions with maximum variability and provide a simple description of the structure of the covariance.
Principal components depend only on the covariance matrix, Σ (or on the correlation matrix, ρ), of X_1, …, X_p, and it is not necessary to assume that the r.v.s follow an approximately normal distribution. In the case of a multivariate normal distribution, we will have interpretations in terms of ellipsoids of constant density, if we consider the distance defined by the matrix Σ, and inferences can be made from the population components.
Let X = [X_1 … X_p]^t be a p-dimensional random vector with covariance matrix Σ and eigenvalues λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_p. Let us consider the following p linear combinations: Y_i = l_i^t X = l_1i X_1 + ⋯ + l_pi X_p, i = 1, …, p. These new r.v.s verify V[Y_i] = l_i^t Σ l_i and Cov[Y_i, Y_k] = l_i^t Σ l_k. Principal components are those linear combinations that, being uncorrelated among themselves, have the greatest possible variance. Thus, the first principal component is the linear combination with the greatest variance, that is, the one for which V[Y_1] = l_1^t Σ l_1 is maximum. Since multiplying l_1 by some constant makes the previous variance grow, we restrict our attention to vectors of norm one, with which the aforementioned indeterminacy disappears. The second principal component is the linear combination that maximizes the variance among those uncorrelated with the first one and whose coefficient vector has norm equal to 1, and so on.

Proposition 3.1: Let Σ be the covariance matrix of the random vector X = [X_1 … X_p]^t. Let us assume that Σ has p pairs of eigenvalues and eigenvectors, (λ_1, e_1), …, (λ_p, e_p), with λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_p. Then, the ith principal component is given by Y_i = e_i^t X, i = 1, …, p. In addition, with this choice it is verified that:

1. V[Y_i] = e_i^t Σ e_i = λ_i, i = 1, …, p.

2. Cov[Y_i, Y_k] = e_i^t Σ e_k = 0, i ≠ k.

3. If any of the eigenvalues are equal, the choice of the corresponding eigenvectors as vectors of coefficients is not unique.
Due to the previous result, principal components are uncorrelated among themselves, with variances equal to the eigenvalues of Σ, and the proportion of the total population variance due to the ith principal component is given by λ_i/(λ_1 + ⋯ + λ_p). If a high percentage of the population variance, for example 90%, of a p-dimensional r.v., with large p, can be attributed to, say, the first five principal components, then we can replace the p original r.v.s by those five components without a great loss of information.
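The eigen decomposition underlying Proposition 3.1 and the proportion of variance explained can be sketched numerically (the covariance matrix below is a made-up example, not one from the text):

```python
import numpy as np

# A hypothetical covariance matrix Sigma for p = 3 variables.
Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 0.0],
                  [0.0, 0.0, 1.0]])

lam, E = np.linalg.eigh(Sigma)     # eigh returns eigenvalues in ascending order
order = np.argsort(lam)[::-1]      # reorder so that lambda_1 >= ... >= lambda_p
lam, E = lam[order], E[:, order]

# V[Y_i] = lambda_i; proportion of total variance explained by each component:
proportions = lam / lam.sum()
```

Note that the total variance λ_1 + ⋯ + λ_p equals the trace of Σ, so the proportions sum to one.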
Each component of the coefficient vector e_i^t = [e_1i, …, e_pi], e_ki, also deserves our attention, since it is a measure of the relationship between the r.v.s X_k and Y_i.
Proposition 3.2: If Y_1 = e_1^t X, …, Y_p = e_p^t X are the principal components obtained from the covariance matrix Σ, with pairs of eigenvalues and eigenvectors (λ_1, e_1), …, (λ_p, e_p), then the linear correlation coefficients between the variables X_k and the components Y_i are given by ρ_{Y_i,X_k} = e_ki √λ_i / √σ_kk, i, k = 1, …, p. Therefore, e_ki is proportional to the correlation coefficient between X_k and Y_i.
In the particular case that X has a p-dimensional normal distribution, N_p(μ, Σ), the density of X is constant on the ellipsoids centered at μ given by (x − μ)^t Σ^{-1} (x − μ) = c², whose axes are ±c√λ_i e_i, i = 1, …, p, where (λ_i, e_i) are the pairs of eigenvalues and eigenvectors of Σ.
If the covariance matrix, Σ, can be decomposed into Σ = PΛP^t, where P is orthogonal and Λ diagonal, it can be shown that Σ^{-1} = PΛ^{-1}P^t. If the principal components y_1 = e_1^t x, …, y_p = e_p^t x are considered, the equation of the constant density ellipsoid is given by y_1²/λ_1 + ⋯ + y_p²/λ_p = c². Therefore, the axes of the ellipsoid have the directions of the principal components.
Example 3.1: Consider a three-dimensional random vector X whose first variable, X_1, is uncorrelated with the other two. It can be verified that the largest eigenvalue of its covariance matrix is λ_1 = 9.243, with unit eigenvector e_1, and the principal components are Y_i = e_i^t X, i = 1, 2, 3. The norm of all the eigenvectors is equal to 1, and, in addition, the variable X_1 is the second principal component, because X_1 is uncorrelated with the other two variables.
The results of Proposition 3.1 can be verified for these data; for example, V[Y_1] = λ_1 = 9.243. Thus, the proportion of the total variance explained by the first component is λ_1/12 = 77%, and the proportion explained by the first two is (λ_1 + λ_2)/12 = 93.69%, so that the components Y_1 and Y_2 can replace the original variables with a small loss of information.
The correlation coefficients between the principal components and the variables can also be computed. In view of their values, it can be concluded that X_2 and X_3 individually are practically equally important with respect to the first principal component, although this is not the case with respect to the third component. If, in addition, it is assumed that the distribution of X is normal, N_3(μ, Σ), with a null mean vector, ellipsoids of constant density x^t Σ^{-1} x = c² can be considered. An ellipsoid of constant statistical distance and its projections are shown in Figure 5.
The ellipsoid with c² = 8 has been represented in Figure 5(a), together with its axes and the ellipsoid projections on planes parallel to the coordinate planes. The aforementioned projections are ellipses of red, green, and blue colors that are reproduced in Figure 5(b). Also, in this figure, the black ellipse obtained by projecting the ellipsoid on the plane determined by the first two principal components has been represented. The equation of this ellipse is η_1 y_1² + η_2 y_2² = c², with η_1 and η_2 being the two smallest eigenvalues of Σ^{-1}, and the axes are determined by Y_1 and Y_2. As can be seen, the diameters of the ellipse determined by the first two components are larger than the others. Therefore, the area enclosed by this ellipse is the largest of all, indicating that it is the one that gathers the greatest variability.

Principal components with respect to standardized variables
The principal components of the normalized variables Z_i = (X_i − μ_i)/√σ_ii can also be considered; in matrix notation, Z = V^{-1/2}(X − μ), where V is the diagonal matrix whose diagonal entries are the variances σ_ii. The principal components of Z are obtained from the eigenvalues and eigenvectors of the correlation matrix, ρ, of X. Furthermore, with some simplification, the previous results can be applied, since the variance of each Z_i is equal to 1.
Let W_1, …, W_p be the principal components of Z and (v_i, u_i), i = 1, …, p, the pairs of eigenvalues and eigenvectors of ρ; in general, these pairs are not the same as those of Σ.
The ith principal component of Z is given by W_i = u_i^t Z, i = 1, …, p. In addition, with this choice it is verified that:

1. V[W_i] = v_i, i = 1, …, p.

2. Cov[W_i, W_k] = 0, i ≠ k.

3. If any of the eigenvalues are equal, the choice of the corresponding eigenvectors as vectors of coefficients is not unique.

4. The linear correlation coefficients between the variables Z_k and the principal components W_i are ρ_{W_i,Z_k} = u_ki √v_i, i, k = 1, …, p.

These results are a consequence of those obtained in Proposition 3.1 and Proposition 3.2 applied to Z and ρ instead of X and Σ.
The total population variance of the normalized variables is the sum of the elements of the diagonal of ρ, that is, p. Therefore, the proportion of the total variability explained by the ith principal component is v_i/p.

Example 3.2: Let X_1 and X_2 be two one-dimensional r.v.s and X = [X_1, X_2]^t, with covariance matrix, Σ, and correlation matrix, ρ. It can be verified that the pairs of eigenvalues and eigenvectors for Σ are (λ_1 = 100.04, e_1^t = [−0.02, −0.999]) and (λ_2 = 0.96, e_2^t = [−0.999, 0.02]). Therefore, the principal components are Y_1 = −0.02X_1 − 0.999X_2 and Y_2 = −0.999X_1 + 0.02X_2. Furthermore, the eigenvalues and eigenvectors of ρ are (v_1 = 1.2, u_1^t = [0.707, 0.707]) and (v_2 = 0.8, u_2^t = [−0.707, 0.707]); hence, the principal components of the normalized variables are W_1 = 0.707Z_1 + 0.707Z_2 and W_2 = −0.707Z_1 + 0.707Z_2. Because the variance of X_2 is much greater than that of X_1, the first principal component for Σ is determined almost entirely by X_2, and the proportion of variability explained by that first component is λ_1/(λ_1 + λ_2) = 100.04/101 ≈ 99%. When the normalized variables are considered, each variable contributes to the components determined by ρ, and the correlations between the normalized variables and their first component are 0.707√1.2 = 0.774. The proportion of the total variability explained by the first component is v_1/p = 1.2/2 = 60%. Therefore, the importance of the first component is strongly affected by normalization. In fact, the weights, in terms of the X_i, are 0.707 and 0.0707 for ρ, as opposed to −0.02 and −0.999 for Σ.
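The figures in Example 3.2 are consistent with the covariance matrix Σ = [[1, 2], [2, 100]]; this matrix is inferred from the stated eigenvalues (the original matrix is not legible above), so treat it as an assumption. A sketch reproducing both analyses:

```python
import numpy as np

# Assumed covariance matrix, consistent with the eigenvalues quoted in Example 3.2.
Sigma = np.array([[1.0,   2.0],
                  [2.0, 100.0]])

# Components from the covariance matrix:
lam, E = np.linalg.eigh(Sigma)
lam, E = lam[::-1], E[:, ::-1]       # reorder from largest to smallest eigenvalue

# Components from the correlation matrix:
d = np.sqrt(np.diag(Sigma))
rho = Sigma / np.outer(d, d)         # rho_12 = 2 / (1 * 10) = 0.2
v, U = np.linalg.eigh(rho)
v, U = v[::-1], U[:, ::-1]
```

Running this gives λ ≈ (100.04, 0.96) and v = (1.2, 0.8), matching the example, so the first component explains 99% of the variance for Σ but only 60% for ρ.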

Remark 3.2:
The above example shows that the principal components deduced from the original variables are, in general, different from those derived from the normalized variables. So, normalization has important consequences.
When the units in which the different one-dimensional random variables are given are very different and in the case that one of the variances is very dominant compared to the others, the first principal component, with respect to the original variables, will be determined by the variable whose variance is the dominant one. On the other hand, if the variables are normalized, their relationship with the first components will be more balanced.
Principal components can be expressed in particular ways if the covariance matrix, or the correlation matrix, has special structures, such as diagonal ones, or structures of the form Σ ¼ σ 2 A.

Sample principal components
Once we have the theoretical framework, we can now address the problem of summarizing the variation of n measurements made on p variables. Let x_1, …, x_n be a sample of a p-dimensional r.v. X with mean vector μ and covariance matrix Σ. These data have a vector of sample means, x̄, a sample covariance matrix, S, and a sample correlation matrix, R.
This section is aimed at constructing linear uncorrelated combinations of the measured characteristics that contain the greatest amount of variability contained in the sample. These linear combinations are called principal sample components.
Given the n values of any linear combination l_1^t x_j = l_11 x_1j + ⋯ + l_p1 x_pj, j = 1, …, n, its sample mean is l_1^t x̄, and its sample variance is l_1^t S l_1. If we consider two linear combinations, l_1^t x_j and l_2^t x_j, their sample covariance is l_1^t S l_2.
The first principal component will be the linear combination, l_1^t x_j, that maximizes the sample variance, subject to the condition l_1^t l_1 = 1. The second component will be the linear combination, l_2^t x_j, that maximizes the sample variance, subject to the conditions that l_2^t l_2 = 1 and that the sample covariance of the pairs (l_1^t x_j, l_2^t x_j) is equal to zero. This procedure is continued until the p principal components are completed.

Proposition 4.1: Let S = (s_ik) be the p by p matrix of sample covariances, whose pairs of eigenvalues and eigenvectors are (λ̂_1, ê_1), …, (λ̂_p, ê_p), with λ̂_1 ≥ λ̂_2 ≥ ⋯ ≥ λ̂_p ≥ 0. Let x be an observation of the p-dimensional random variable X. Then:

1. The ith sample principal component is ŷ_i = ê_i^t x, i = 1, …, p.

2. The sample variance of ŷ_k is λ̂_k, k = 1, …, p.

3. The sample correlation coefficients between x_k and ŷ_i are r_{x_k,ŷ_i} = ê_ki √λ̂_i / √s_kk, i, k = 1, …, p.
In the case that the random variables have a normal distribution, the principal components can be obtained from the maximum likelihood estimate Σ̂ = S_n and, in this case, the sample principal components can be considered maximum likelihood estimates of the population principal components. Although the eigenvalues of S and Σ̂ are different, they are proportional, with the fixed constant of proportionality (n − 1)/n, so the proportion of variability they explain is the same. The sample correlation matrix is the same for S and Σ̂. We do not consider the particular case of a normal distribution of the variables further, so as not to have to include hypotheses that should be verified for the data under study.
Sometimes, the observations x are centered by subtracting the mean x̄. This operation does not affect the covariance matrix and produces principal components of the form ŷ_i = ê_i^t(x − x̄); in this case, the sample mean of ŷ_i is zero for every component, while the sample variances remain λ̂_1, …, λ̂_p.
When trying to interpret the principal components, the correlation coefficients r_{x_k,ŷ_i} are more reliable guides than the coefficients ê_ik, since they avoid the interpretive problems caused by the different scales in which the variables are measured.
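The construction of the sample principal components described above can be sketched as follows (the data are simulated and stand in for a hypothetical sample; names are illustrative):

```python
import numpy as np

def sample_pca(X):
    """Sample principal components of an n-by-p data matrix X (rows = observations)."""
    x_bar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)          # unbiased sample covariance matrix
    lam, E = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1]
    lam, E = lam[order], E[:, order]     # lambda_1 >= ... >= lambda_p
    Y = (X - x_bar) @ E                  # scores: y_ij = e_j^t (x_i - x_bar)
    return lam, E, Y

# Simulated sample with correlated variables on different scales.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.5, 0.2]])
lam, E, Y = sample_pca(X)
```

As Proposition 4.1 states, the scores Y are uncorrelated with sample variances equal to the eigenvalues, and centering makes their sample means zero.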

Interpretations of the principal sample components
Principal sample components have several interpretations. If the distribution of X is close to N_p(μ, Σ), then the components ŷ_i = ê_i^t(x − x̄) are realizations of the population principal components Y_i = e_i^t(X − μ), which will have distribution N_p(0, Λ), where Λ is the diagonal matrix whose elements are the eigenvalues, ordered from largest to smallest, of the sample covariance matrix. Keeping in mind the hypothesis of normality, contours of constant density, (x − x̄)^t S^{-1}(x − x̄) = c², can be estimated and inferences made from them.
Although it is not possible to assume normality in the data, geometrically the data are n points in ℜ^p, and the principal components represent an orthogonal transformation whose coordinate axes are the axes of the previous ellipsoid, with lengths proportional to √λ̂_i, λ̂_i being the eigenvalues of S. Since all eigenvectors have been chosen such that their norm is equal to 1, ŷ_i = ê_i^t(x − x̄) is the length of the projection of the vector (x − x̄) on the vector ê_i. Therefore, the principal components can be seen as a translation of the origin to the point x̄ and a rotation of the axes until they pass through the directions with greatest variability.
When there is a high positive correlation between all the variables and a principal component with all its coordinates of the same sign, this component can be considered as a weighted average of all the variables or the size of the index that forms that component. The components that have coordinates of different signs oppose a subset of variables against another, being a weighted average of two groups of variables.
The interpretation of the results is simplified assuming that the small coefficients are zero and rounding the rest to express the component as sums, differences, or quotients of variables.
The interpretation of the principal components can be facilitated by graphic representations in two dimensions. A usual graph is to represent two components as coordinate axes and project all points on those axes. These representations also help to test hypotheses of normality and to detect anomalous observations. If there is an observation that is atypical in the first variable, the variability in that first variable will grow, and its covariance with the other variables will decrease in absolute value. Consequently, the first component will be strongly influenced by the first variable, distorting the analysis.
Sometimes, it is necessary to verify that the first components are approximately normal, although it is not reasonable to expect this result from a linear combination of variables that do not have to be normal.
The last components can help detect suspicious observations. Each observation x_j can be expressed as a linear combination of the eigenvectors of S, x_j = ŷ_1j ê_1 + ⋯ + ŷ_pj ê_p, with which the difference between the first q components, ŷ_1j ê_1 + ⋯ + ŷ_qj ê_q, and the observation x_j is ŷ_{q+1,j} ê_{q+1} + ⋯ + ŷ_pj ê_p, which is a vector with squared norm ŷ²_{q+1,j} + ⋯ + ŷ²_pj, and we will suspect observations that have a large contribution to the aforementioned squared norm.
An especially small value of the last eigenvalue of the covariance matrix, or correlation matrix, can indicate a linear dependence between the variables that has not been taken into account. In this case, some variable is redundant and should be removed from the analysis. If we have four variables and the fourth is the sum of the other three, then the last eigenvalue will be zero, or close to zero due to rounding errors, in which case we should suspect some dependence. In general, eigenvalues close to zero should not be ignored; the eigenvectors associated with these eigenvalues can indicate linear dependencies in the data and cause distortions in the interpretations, calculations, and consequent analysis.
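The effect just described, a near-zero eigenvalue revealing a linear dependence, can be reproduced with simulated data (a sketch mirroring the four-variable example in the text):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
x4 = X.sum(axis=1)                  # fourth variable = sum of the other three
X = np.column_stack([X, x4])

S = np.cov(X, rowvar=False)
lam = np.linalg.eigvalsh(S)         # eigenvalues in ascending order
# lam[0] is numerically zero: the covariance matrix has rank 3,
# exposing the exact linear dependence among the four variables.
```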

Standardized sample principal components
In general, principal components are not invariant against changes of scale in the original variables, as has been mentioned when referring to the normalized population principal components. Normalizing, or standardizing, the variables consists of performing the transformation z_j = D^{-1/2}(x_j − x̄), j = 1, …, n, where D is the diagonal matrix whose diagonal entries are the sample variances s_ii. If Z is the p by n matrix whose columns are the z_j, it can be shown that its sample mean vector is the null vector and that its sample covariance matrix is the sample correlation matrix, R, of the original variables.

Remark 4.1:
Applying the fact that the principal components of the normalized variables are obtained as for the original sample observations but substituting the matrix S by R, we can establish that if z_1, …, z_n are the normalized observations, with covariance matrix R = (r_ik), where r_ik is the sample correlation coefficient between the variables x_i and x_k, and if the pairs of eigenvalues and eigenvectors of R are (v̂_1, û_1), …, (v̂_p, û_p), with v̂_1 ≥ ⋯ ≥ v̂_p ≥ 0, then:

1. The ith sample principal component is ω̂_i = û_i^t z, i = 1, …, p.

2. The sample variance of ω̂_k is v̂_k, k = 1, …, p.

3. The sample covariance of ω̂_i and ω̂_k, i ≠ k, is zero.

4. The total sample variance is tr(R) = p.

5. The sample correlation coefficients between z_k and ω̂_i are r_{z_k,ω̂_i} = û_ki √v̂_i, i, k = 1, …, p.

6. The proportion of the total sample variance explained by the ith principal component is v̂_i/p.
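A minimal sketch of principal components of normalized observations, assuming made-up data on very different scales:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical variables on very different scales:
X = rng.normal(size=(300, 2)) * np.array([1.0, 50.0])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # normalized observations
R = np.cov(Z, rowvar=False)                        # equals the sample correlation matrix of X

v, U = np.linalg.eigh(R)
v, U = v[::-1], U[:, ::-1]                         # largest eigenvalue first
W = Z @ U                                          # standardized sample components
```

The covariance matrix of Z is exactly the correlation matrix of X, its trace is p, and the components W are uncorrelated with variances v̂_i, as stated in the remark above.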

Criteria for reducing the dimension
The eigenvalues and eigenvectors of the covariance matrix, or correlation matrix, are the essence of the analysis of principal components, since the eigenvectors indicate the directions of maximum variability and the eigenvalues determine the variances. If a few eigenvalues are much larger than the rest, most of the variance can be explained with fewer than p variables.
In practice, decisions about the number of components to be considered must be made in terms of the pairs of eigenvalues and eigenvectors of the covariance matrix, or correlation matrix, and different rules have been suggested: a. When plotting the pairs (i, λ̂_i), it has been empirically verified that the first values decrease with a roughly linear tendency of quite steep slope and that, from a certain eigenvalue on, this decrease stabilizes; that is, there is a point from which the eigenvalues are very similar. The criterion consists of retaining the components that come before the small, approximately equal eigenvalues.
b. Select components until a preset proportion of the variance (e.g., 80%) is obtained. This rule should be applied with care, since components that are interesting because they reflect certain nuances suitable for the interpretation of the analysis could be excluded.
c. A rule that does not have great theoretical support, which must be applied carefully so as not to discard any component valid for the analysis, but which has given good empirical results, is to retain those components with variances, λ̂_i, above a certain threshold. If the working matrix is the correlation matrix, in which case the average value of the eigenvalues is one, the criterion is to keep the components associated with eigenvalues greater than unity and discard the rest.
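Rules (b) and (c) above can be sketched as small helper functions (the eigenvalue list is a made-up example):

```python
import numpy as np

def n_components_by_variance(eigvals, threshold=0.80):
    """Rule (b): smallest k whose first k eigenvalues explain >= threshold of the variance."""
    eigvals = np.sort(np.asarray(eigvals))[::-1]
    cum = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cum, threshold) + 1)

def n_components_above_one(eigvals):
    """Rule (c) for a correlation matrix: keep components with eigenvalue > 1."""
    return int(np.sum(np.asarray(eigvals) > 1.0))

# Hypothetical eigenvalues of a 6-variable correlation matrix (they sum to p = 6... here 10
# is used purely for arithmetic convenience; the rules only use ratios and thresholds):
lam = [4.0, 2.5, 1.5, 1.0, 0.5, 0.5]
```

For these values, rule (b) with an 80% target keeps 3 components, and rule (c) also keeps 3 (those with λ̂_i > 1).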

Application to image compression
We are going to illustrate the use of principal components to compress images. To this end, the image of Lena was considered. This photograph has been used by engineers, researchers, and students for experiments related to image processing.

Black and white photography
The 512 × 512 grayscale image was divided into 4096 blocks of 8 × 8 pixels, and each block was rearranged as a vector of ℜ^64; these vectors were grouped in the observation matrix x ∈ M_{4096,64}(ℜ). Fourth, the average of each column, x̄ = [x̄_1, …, x̄_64], was calculated, obtaining the vector of means, and from each observation x_ij, its corresponding mean x̄_j was subtracted. Thus, the matrix of centered observations, U, was obtained. The covariance matrix of x was S = U^t U ∈ M_{64,64}(ℜ).
Fifth, the 64 pairs of eigenvalues and eigenvectors of S, (λ̂_i, ê_i), were found, and they were ordered according to the eigenvalues from highest to lowest. The 8 largest eigenvalues are drawn in Figure 7. As can be seen, the first eigenvalue is much larger than the rest. Thus, the first principal component completely dominates the total variability.
Seventh, each vector ê_j = [ê_1,j, …, ê_64,j]^t was regrouped by rows into a matrix Ê_j ∈ M_{8,8}(ℜ). Each of the 64 matrices Ê_j was converted into an image. The images of the first three principal components are shown in Figure 8.
At this point, it is important to mention that the data matrix x has been assumed to be formed by 4096 vectors of ℜ⁶⁴ expressed in the canonical base, B. Also, the base whose vectors were the eigenvectors of S, B′ = {ê₁, …, ê₆₄}, was considered. The coordinates with respect to the canonical base of the vectors of B′ were the columns of the matrix PC = [ê₁, …, ê₆₄]. Then, given a vector v that had coordinates (x₁, …, x₆₄) with respect to the canonical base and coordinates (y₁, …, y₆₄) with respect to the base B′, the relation between them was (y₁, …, y₆₄) = (x₁, …, x₆₄) · PC. Thus, the 4096 vectors that formed the observation matrix had as coordinates, with respect to the new base, the rows of the matrix of dimension 4096 × 64 given by y = x · PC.
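Under these conventions, finding the ordered eigenpairs and performing the change of base can be sketched as follows (a NumPy sketch with our own function name; `numpy.linalg.eigh` is used because S is symmetric, and it returns the eigenvalues in ascending order):

```python
import numpy as np

def principal_coordinates(u, s):
    """Eigendecomposition of S; the columns of PC are the eigenvectors
    ordered by decreasing eigenvalue, so y = u @ PC gives the
    coordinates of each centered observation in the base B'."""
    lam, vec = np.linalg.eigh(s)    # ascending eigenvalues
    order = np.argsort(lam)[::-1]   # reorder from highest to lowest
    pc = vec[:, order]
    return u @ pc, pc
```

Because PC is orthogonal, multiplying `y` by `pc.T` recovers the centered observations exactly, which is the basis of the reconstruction step below.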
Eighth, in order to reduce the dimension, it was taken into consideration that if we keep all the vectors of B′, we can perfectly reconstruct our data matrix, because from y = x · PC we recover x = y · PCᵗ, PC being orthogonal. In order to compress the image, only the first vectors of the base B′ were used. Supposing that we kept M of them, M < 64, the matrix T_M given by Eq. (19), which keeps the first M coordinates and sets the remaining ones to zero, was defined. Therefore, the dimension of y_M = y · T_M was 4096 × 64, with only the first M columns different from zero.
Ninth, to reconstruct the compressed image, the rows of y_M were expressed back in the canonical base, and each one was regrouped into an 8 × 8 matrix. By increasing the number of principal components, the percentage of the variability explained increases only by very small amounts; nevertheless, sufficiently remarkable nuances are added to the photo, since they make it sharper, smooth out the contours, and mark the tones more precisely.
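The truncation and reconstruction steps (eighth and ninth) can be sketched as follows (a NumPy sketch; multiplying by T_M is implemented by zeroing all but the first M principal coordinates):

```python
import numpy as np

def compress_reconstruct(u, pc, m):
    """Keep the first m principal coordinates (y_M = y @ T_M, i.e. the
    remaining coordinates are set to zero) and map back to the
    canonical base; each row can then be regrouped into an 8x8 block."""
    y = u @ pc
    y_m = np.zeros_like(y)
    y_m[:, :m] = y[:, :m]
    return y_m @ pc.T
```

With m equal to the full dimension the data matrix is recovered exactly; smaller m trades reconstruction quality for compression.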

Objective measures of the quality of reconstructions
The two methods that we will use are the peak signal-to-noise ratio (PSNR) and the entropy of the error image. The PSNR measure evaluates the quality in terms of deviations between the processed and the original image, and the entropy of an image is a measure of the information content of that image. Figure 10(b) shows the values of the PSNR when we use three quarters (black), half (red), a quarter (blue), an eighth (green), a sixteenth (brown), and a thirty-second (yellow) of the components, which means a corresponding reduction in compression. A behavior close to linear with a slope of approximately 0.2 can be seen. With the reductions considered, the PSNR varies between 27 and 63.
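The PSNR of a reconstruction can be computed as follows (a standard sketch, with 255 assumed as the peak value of an 8-bit image):

```python
import numpy as np

def psnr(original, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio in decibels; higher values mean the
    reconstruction deviates less from the original image."""
    diff = np.asarray(original, float) - np.asarray(reconstruction, float)
    mse = np.mean(diff ** 2)
    return float('inf') if mse == 0.0 else 10.0 * np.log10(peak ** 2 / mse)
```

An exact reconstruction gives an infinite PSNR, and the worst possible deviation (every pixel off by 255) gives 0 dB.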
If the entropy is high, the variability of the pixels is high and there is little redundancy; thus, if we exceed a certain threshold in compression, the original image cannot be recovered exactly. If the entropy is small, the variability is smaller and the redundancy of a pixel with respect to the pixels of its surroundings is high; therefore, randomness is lost.

Figure 11(a) shows the entropy of the reconstructions from 1 to 256 components. As can be seen, the entropy increases up to the first 10 components, and then it becomes damped, tending asymptotically to the value of the entropy of the image (7.4452). It can be seen that with more than 170 components the difference is insignificant. Figure 11(b) shows the entropy of the reconstructions.

Finally, we consider the entropy of the images of the errors. Given an image I, the value of each of its pixels is an element of the set {0, …, 255}, and if we have a reconstruction Î and consider the error E = I − Î, then the value of its pixels will be an element of the set {−255, …, 255}. Therefore, E cannot itself be considered an image. Since a pixel of value eᵢⱼ in E represents an error of the same size as −eᵢⱼ, in order to work with images we call the image of the error the image whose pixels are the absolute values |eᵢⱼ|.

Figure 12(a) shows the entropy of the error image versus the number of principal components used for the reconstruction, together with a fitted line of slope −0.02. Figure 12(b) shows the entropy when we use 8 components (black), 16 components (brown), 32 components (green), 64 components (blue), and 128 components (red), respectively. With more than 200 principal components, the entropy of the errors is zero, which means that the errors have almost no variability, and with fewer components the decrease seems linear with slope −0.02.
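The entropy of an 8-bit image and the image of the error described above can be sketched as follows (a minimal NumPy sketch; the function names are ours):

```python
import numpy as np

def entropy(img):
    """Shannon entropy, in bits per pixel, of an 8-bit image."""
    counts = np.bincount(np.asarray(img, dtype=np.uint8).ravel(),
                         minlength=256)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def error_image(i, i_hat):
    """Image of the error: |e_ij|, so that e_ij and -e_ij count as the
    same size of error and the result is again an 8-bit image."""
    return np.abs(np.asarray(i, int) - np.asarray(i_hat, int)).astype(np.uint8)
```

A constant image has entropy 0, and an image whose pixels take two values with equal frequency has entropy 1 bit per pixel.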

Coordinates of the first principal component
In this section, we will consider the coordinates of the vectors that form the first principal components. If the vectors have been obtained as blocks of dimension 2³ × 2³, they have 64 coordinates. Figure 13 shows the coordinates of the first six principal components with respect to the canonical base.
As can be seen from Figure 13, all coordinates seem to have some component with period 8. This suggests that there may be some relationship with the shape of the chosen blocks and that most vectors are close to being periodic with period 8: when we consider each of the 4096 vectors of 64 components, the first 8 pixels (the first row of the block) are adjacent to the next 8 pixels (the second row), these to the following 8 pixels, and so on, 8 times.
Since the first principal components collect a large part of the characteristics of the vectors, it is plausible that they also reflect the periodicity of the vectors. Recall that the principal components are linear combinations of the vectors and that, if all the vectors had periodic coordinates with the same period, then all the components would be periodic as well.
In Figure 14, the coordinates of the first three principal components are shown when the vectors are constructed from blocks of 2² × 2² (see Figure 14(a-c)) and from blocks of 2⁴ × 2⁴ (see Figure 14(d-f)). As can be seen, the periodicity of the first components is again apparent.

Reduction of the first principal component by periodicity
Using the almost periodicity of the first principal component, we can use less information to obtain acceptable reconstructions of the image. If, in the first principal component of dimension 64, we repeat the first eight values periodically and use k principal components to reconstruct the image, we go from a reduction of k/64 to one of [(k − 1) + 8/64]/64. Figure 15 shows both the reconstruction of the image with 2 and 8 original principal components and the reconstruction with 2 and 8 principal components in which the first component has been replaced by a vector whose coordinates have period 8; we call this set components_per.
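Replacing the first component by its periodic extension can be sketched as follows (a NumPy sketch; the function name is ours):

```python
import numpy as np

def periodize_first_component(pc, period=8):
    """Return a copy of PC in which the first principal component
    (first column) is the periodic repetition of its first `period`
    coordinates, so only (k - 1) full vectors plus `period` values
    need to be stored when k components are used."""
    pc_per = pc.copy()
    head = pc_per[:period, 0]
    pc_per[:, 0] = np.tile(head, pc_per.shape[0] // period)
    return pc_per
```

Only the first column is modified; the remaining components are kept exactly as computed, which is why reconstructions with few components stay close to the originals.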
The first component of components_per is not the true one; therefore, reconstructions from this set cannot be made with total precision. If we compare the 1-norm, 2-norm, and ∞-norm of the difference between the image and the corresponding reconstruction, with the original principal components and with the principal components using their periodicity, we obtain, varying the number of principal components used, the results shown in Figure 16.
With the original principal components (blue), the original image can be completely reconstructed, while if we use only a few components, in this case 10 or fewer, components_per (green) yields approximations similar to those obtained with the original components.

Conclusions
This chapter has been devoted to giving a short but comprehensive introduction to the basics of the statistical technique known as principal component analysis, aimed at its application to image compression. The first part of the chapter focused on preliminaries: mean vector, covariance matrix, eigenvectors, eigenvalues, and distances. That part finished by bringing up the problems that the Euclidean distance presents and by highlighting the importance of using a statistical distance that takes into account the different variabilities and correlations. To that end, a brief introduction was made to a distance that depends on variances and covariances.
Next, in the second part of the chapter, principal components were introduced and connected with the previously explained concepts. Here, principal components were presented as a particular case of linear combinations of random variables, but with the peculiarity that those linear combinations represent a new coordinate system that is obtained by rotating the original reference system, which has the aforementioned random variables as coordinate axes. The new axes represent the directions with maximum variability and provide a simple description of the structure of the covariance.
Then, the third part of the chapter was devoted to showing an application of principal component analysis to image compression. An original image was taken and compressed using different numbers of principal components. The importance of carrying out objective measures of the quality of the reconstructions was highlighted. Also, a novel contribution of this chapter was the introduction to the study of the periodicity of the principal components and to the reduction of the first principal component by periodicity. In short, a novel construction of principal components by periodicity has been included in order to reduce the computational cost of their calculation, although at the price of decreased accuracy. It can be said that, using the almost periodicity of the first principal component, less information can be used to obtain acceptable reconstructions of the image.
Finally, we would not like to finish this chapter without saying that a few pages cannot gather the wide range of applications that this statistical technique has found in solving real-life problems. There is a countless number of applications of principal component analysis to problems that both scientists and engineers face in real-life situations. However, in order to be practical, it was decided to choose and develop, step by step, an application example that could be of interest to a wide range of readers. Accordingly, we thought that such an example could be one related to data compression, because with the advancement of information and communication technologies, both scientists and engineers need to store or transmit more information at lower cost, faster, and over greater distances with higher quality. In this sense, one example is image compression using statistical techniques, and this is the reason why, in this chapter, it was decided to take advantage of the statistical properties of an image to present a practical application of principal component analysis to image compression.