Open access peer-reviewed chapter

# Application of Principal Component Analysis to Image Compression

Written By

Wilmar Hernandez and Alfredo Mendez

Submitted: 29 January 2018 Reviewed: 06 February 2018 Published: 16 March 2018

DOI: 10.5772/intechopen.75007

From the Edited Volume

## Statistics - Growing Data Sets and Growing Demand for Statistics

Edited by Türkmen Göksel


## Abstract

In this chapter, an introduction to the basics of principal component analysis (PCA) is given, aimed at presenting PCA applications to image compression. Here, concepts of linear algebra used in PCA are introduced, and PCA theoretical foundations are explained in connection with those concepts. Next, an image is compressed by using different numbers of principal components, and concepts such as image dimension reduction and image reconstruction quality are explained. Also, using the almost periodicity of the first principal component, a comparative analysis of the quality of an image compressed with two and with eight principal components is carried out. Finally, a novel construction of principal components based on their periodicity has been included, in order to reduce the computational cost of their calculation, albeit at the cost of some accuracy.

### Keywords

• principal component analysis
• population principal components
• sample principal components
• image compression
• image dimension reduction
• image reconstruction quality

## 1. Introduction

Principal component analysis, also known as the Hotelling transform or Karhunen-Loeve transform, is a statistical technique that was proposed by Karl Pearson (1901) as part of factorial analysis; however, its first theoretical development appeared in 1933 in a paper written by Hotelling [1, 2, 3, 4, 5, 6, 7, 8]. The complexity of the calculations involved in this technique delayed its development until the birth of computers, and its effective use started in the second half of the twentieth century. Because methods based on principal components developed relatively recently, they are still little used by many non-statistician researchers. The purpose of these notes is to present the nature of principal component analysis and to show some of its possible applications.

Principal component analysis refers to the explanation of the structure of variances and covariances through a few linear combinations of the original variables, without losing a significant part of the original information. In other words, it is about finding a new set of orthogonal axes in which the variance of the data is maximum. Its objectives are to reduce the dimensionality of the problem and, once the transformation has been carried out, to facilitate its interpretation.

Although all p variables collected on the units analyzed are required to reproduce the total variability of the system, much of this variability can often be found in a small number, k, of principal components. This is possible because of the redundancy that frequently exists between different variables: redundant variability is data, not information. The k principal components can then replace the p initial variables, so that the original set of data, consisting of n measures of p variables, is reduced to n measures of k principal components.

The objective pursued by the analysis of principal components is the representation of the numerical measurements of several variables in a space of few dimensions, where our senses can perceive relationships that would otherwise remain hidden in higher dimensions. The abovementioned representation must be such that, when discarding higher dimensions, the loss of information is minimal. A simile may illustrate the idea: imagine a large rectangular plate, which is a three-dimensional object but which, for practical purposes, we consider as a flat two-dimensional object. When carrying out this reduction in dimensionality, a certain amount of information is lost since, for example, opposite points located on the two sides of the rectangular plate will appear merged into a single one. However, the loss of information is largely compensated by the simplification made, since many relationships, such as the neighborhood between points, are more evident when they are drawn on a plane than when drawn in a three-dimensional figure that must necessarily be represented in perspective.

The analysis of principal components can reveal relationships between variables that are not evident at first sight, which facilitates the analysis of the dispersion of observations, highlighting possible groupings and detecting the variables that are responsible for the dispersion.

## 2. Preliminaries

The study of multivariate methods is greatly facilitated by means of matrix algebra [9, 10, 11]. Next, we introduce some basic concepts that are essential for the explanation of statistical techniques, as well as for geometric interpretations. In addition, the relationships that can be expressed in terms of matrices are easily programmable on computers, so we can apply calculation routines to obtain other quantities of interest. It is a basic introduction about concepts and relationships.

### 2.1. The vector of means and the covariance matrix

Let $X=(X_1,\ldots,X_p)^t$ be a random column vector of dimension $p$. Each component, $X_i$, is a random variable (r.v.) with mean $E[X_i]=\mu_i$ and variance $V[X_i]=E[(X_i-\mu_i)^2]=\sigma_{ii}$. Given two r.v.s, $X_i$ and $X_j$, we define the covariance between them as $\operatorname{Cov}(X_i,X_j)=E[(X_i-\mu_i)(X_j-\mu_j)]=\sigma_{ij}$. The expected values, variances, and covariances can be grouped into vectors and matrices that we will call the population mean vector, $\mu$, and the population covariance matrix, $\Sigma$:

$$\mu=E[X]=\begin{pmatrix}\mu_1\\ \vdots\\ \mu_p\end{pmatrix},\qquad \Sigma=\operatorname{Cov}(X)=E\big[(X-\mu)(X-\mu)^t\big]=\begin{pmatrix}\sigma_{11}&\cdots&\sigma_{1p}\\ \vdots&\ddots&\vdots\\ \sigma_{p1}&\cdots&\sigma_{pp}\end{pmatrix}\tag{1}$$

The population correlation matrix is given by $\rho=(\rho_{ij})$, where $\rho_{ij}=\dfrac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}}$.

In the case of having n values of the r.v.s, we will consider estimators of the previous population quantities, which we will call sample estimators.

Definition 2.1: Let $X=\begin{pmatrix}x_{11}&\cdots&x_{1n}\\ \vdots&&\vdots\\ x_{p1}&\cdots&x_{pn}\end{pmatrix}$ be a simple random sample of a p-dimensional r.v. ordered in the data matrix, with the values of one observation in each column. The p-dimensional sample mean column vector is $\bar{X}=(\bar{x}_i)$, where $\bar{x}_i=\frac{1}{n}\sum_{m=1}^{n}x_{im}$. The sample covariance matrix is $S=(s_{ij})=\frac{n}{n-1}S_n$, where $S_n=\frac{1}{n}(X-\bar{X})(X-\bar{X})^t$ and $\bar{X}$ here denotes the matrix whose $n$ columns all equal the sample mean vector. The generalized sample variance is the determinant of $S$, $|S|$. The sample correlation matrix is $R=(r_{ij})$, where $r_{ij}=\frac{s_{ij}}{\sqrt{s_{ii}s_{jj}}}$, with $i,j=1,\ldots,p$.

Proposition 2.1: Let $X_1,\ldots,X_n$ be a simple random sample of a p-dimensional r.v. $X$ with mean vector $\mu$ and covariance matrix $\Sigma$. The unbiased estimators of $\mu$ and $\Sigma$ are $\bar{X}$ and $S$.
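As a concrete illustration of Definition 2.1, the following sketch computes the sample mean vector, the sample covariance matrix $S$ (with the $n-1$ divisor), and the correlation matrix $R$ with NumPy; the data matrix is made up for the example and follows the convention of variables in rows and observations in columns.

```python
import numpy as np

# Illustrative data matrix: p = 2 variables in rows, n = 5 observations in columns.
X = np.array([[2.0, 4.0, 6.0, 8.0, 10.0],
              [1.0, 3.0, 2.0, 5.0, 4.0]])
n = X.shape[1]

x_bar = X.mean(axis=1)                # sample mean vector
D = X - x_bar[:, None]                # deviations from the mean
S = D @ D.T / (n - 1)                 # sample covariance matrix S
R = S / np.sqrt(np.outer(np.diag(S), np.diag(S)))  # sample correlation matrix R

print(x_bar)
print(S)
print(R)
```

Note that `np.cov` applies the same $n-1$ divisor by default, so `np.cov(X)` reproduces $S$.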

### 2.2. Eigenvalues and eigenvectors

One of the problems that linear algebra deals with is the simplification of matrices through methods that produce diagonal or triangular matrices, which are widely used in the resolution of linear systems of the form Ax=b.

Definition 2.2: Let $A$ be a square matrix. If $v^tAv\ge 0$ for any vector $v$, $A$ is a nonnegative definite matrix. If $Av=\lambda v$, with $v\ne 0$, $\lambda$ is an eigenvalue associated with the eigenvector $v$.

Proposition 2.2: Let $A$ be a symmetric $p\times p$ matrix with real-valued entries. $A$ has $p$ pairs of eigenvalues and eigenvectors, $(\lambda_1,e_1),\ldots,(\lambda_p,e_p)$, such that:

1. All the eigenvalues are real. Also,

1. A is positive definite if all the eigenvalues are positive.

2. A is nonnegative definite if all the eigenvalues are nonnegative.

2. The eigenvectors can be chosen with 2-norm equal to 1.

3. The eigenvectors are mutually perpendicular.

4. The eigenvectors are unique unless two or more eigenvalues are equal.

5. The spectral decomposition of $A$ is $A=\lambda_1e_1e_1^t+\cdots+\lambda_pe_pe_p^t$.

6. If $P=(e_1\ \cdots\ e_p)$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix with main diagonal entries $\lambda_1,\ldots,\lambda_p$, the spectral decomposition of $A$ can be given by $A=P\Lambda P^t$. Therefore, $A^{-1}=P\Lambda^{-1}P^t=\sum_{i=1}^{p}\frac{1}{\lambda_i}e_ie_i^t$.
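The spectral decomposition of item 6 can be checked numerically; the sketch below uses an arbitrary symmetric matrix and `numpy.linalg.eigh`, which returns orthonormal eigenvectors for symmetric input.

```python
import numpy as np

# An arbitrary symmetric matrix for illustration.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

eigvals, P = np.linalg.eigh(A)   # columns of P are orthonormal eigenvectors
Lam = np.diag(eigvals)

# Reconstruct A = P Λ P^t and invert it via A^{-1} = P Λ^{-1} P^t.
A_rec = P @ Lam @ P.T
A_inv = P @ np.diag(1.0 / eigvals) @ P.T

print(np.allclose(A, A_rec))
print(np.allclose(A_inv, np.linalg.inv(A)))
```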

Remark 2.1: Let $X$ be a matrix with the values of a simple random sample of a p-dimensional r.v. in each column, and let $y_i^t=(x_{i1},\ldots,x_{in})$, with $i=1,\ldots,p$, be the ith row of $X$. Let $1_n^t=(1,\ldots,1)$ be the $n\times 1$ vector with all its coordinates equal to 1. It can be proven that:

1. The projection of the vector $y_i$ on the vector $1_n$ is the vector $\bar{x}_i1_n$, whose 2-norm is equal to $\sqrt{n}\,|\bar{x}_i|$.

2. Matrix $S_n$ is obtained from the residuals $e_i=y_i-\bar{x}_i1_n$; the squared 2-norm of $e_i$ is equal to $(n-1)s_{ii}$, and the scalar product of $e_i$ and $e_j$ is equal to $(n-1)s_{ij}$.

3. The sample correlation coefficient $r_{ij}$ is the cosine of the angle between $e_i$ and $e_j$.

4. If $U$ is the volume generated by the vectors $e_i$, with $i=1,\ldots,p$, then $|S|=\frac{U^2}{(n-1)^p}$. Therefore, the generalized sample variance is proportional to the square of the volume generated by the deviation vectors. The volume will increase if the norm of some $e_i$ is increased.

### 2.3. Distances

Many techniques of multivariate statistical analysis are based on the concept of distance. Let $Q=(x_1,x_2)$ be a point in the plane. The Euclidean distance from $Q$ to the origin, $O$, is $d(Q,O)=\sqrt{x_1^2+x_2^2}$. If $Q=(x_1,\ldots,x_p)$ and $R=(y_1,\ldots,y_p)$, the Euclidean distance between these two points of $\mathbb{R}^p$ is $d(Q,R)=\sqrt{(x_1-y_1)^2+\cdots+(x_p-y_p)^2}$. All points $(x_1,\ldots,x_p)$ whose squared distance to the origin is a fixed quantity, for example, $x_1^2+\cdots+x_p^2=c^2$, are the points of the p-dimensional sphere of radius $c$.

For many statistical purposes, the Euclidean distance is unsatisfactory, since each coordinate contributes in the same way to the calculation of such a distance. When the coordinates represent measures subject to random changes, it is desirable to assign weights to the coordinates depending on how high or low the variability of the measurements is. This suggests a measure of distance that is different from the Euclidean.

Next, we introduce a statistical distance that will take into account the different variabilities and correlations. Therefore, it will depend on the variances and covariances, and this distance is fundamental in multivariate analysis.

Suppose we have a fixed set of observations in $\mathbb{R}^p$, and, to illustrate the situation, consider $n$ pairs of measures of two variables, $x_1$ and $x_2$. Suppose that the measurements of $x_1$ vary independently of $x_2$ and that the variability of the measures of $x_1$ is much greater than that of $x_2$. This situation is shown in Figure 1, and our first objective is to define a distance from the points to the origin.

In Figure 1, we see that points with a given deviation from the origin lie farther from the origin in the $x_1$ direction than in the $x_2$ direction, due to the greater variability inherent in the direction of $x_1$. Therefore, it seems reasonable to give more weight to the coordinate $x_2$ than to $x_1$. One way to obtain these weights is to standardize the coordinates, that is, $x_1^*=x_1/\sqrt{s_{11}}$ and $x_2^*=x_2/\sqrt{s_{22}}$, where $s_{ii}$ is the sample variance of the variable $x_i$. Thus, the statistical distance from a point $Q=(x_1,x_2)$ to the origin is $d(Q,O)=\sqrt{\frac{x_1^2}{s_{11}}+\frac{x_2^2}{s_{22}}}$. Therefore, the points that are equidistant from the origin at a constant distance $c$ lie on an ellipse centered at the origin, whose major axis coincides with the coordinate that has the greatest variability. In the case that the variability of one variable is analogous to that of the other and the coordinates are independent, the Euclidean distance is proportional to the statistical distance.

If $Q=(x_1,\ldots,x_p)$ and $R=(y_1,\ldots,y_p)$ are two points of $\mathbb{R}^p$, the statistical distance between them is $d(Q,R)=\sqrt{\frac{(x_1-y_1)^2}{s_{11}}+\cdots+\frac{(x_p-y_p)^2}{s_{pp}}}$, with $s_{ii}$ being the sample variance of the variable $x_i$. The statistical distance defined so far does not include most of the important cases, where the variables are not independent. Figure 2 shows a situation where the pairs $(x_1,x_2)$ seem to have an increasing trend, so the sample correlation coefficient will be positive. In Figure 2, we see that if we make a rotation of amplitude $\alpha$ and consider the axes $(g_1,g_2)$, we are in conditions analogous to those of Figure 1 (a). Therefore, the distance from the point $Q=(g_1,g_2)$ to the origin will be $d(Q,O)=\sqrt{\frac{g_1^2}{\tilde{s}_{11}}+\frac{g_2^2}{\tilde{s}_{22}}}$, where $\tilde{s}_{ii}$ is the sample variance of the variable $g_i$.

The relationships between the original coordinates and the new coordinates can be expressed as

$$g_1=x_1\cos\alpha+x_2\sin\alpha,\qquad g_2=-x_1\sin\alpha+x_2\cos\alpha\tag{2}$$

and, after some algebraic manipulations, $d(Q,O)=\sqrt{a_{11}x_1^2+2a_{12}x_1x_2+a_{22}x_2^2}$, where the $a_{ij}$ are values that depend on the angle and the dispersions and must also meet the condition that the distance between any two points is positive.

The distance from a point $Q=(x_1,x_2)$ to a fixed point $R=(y_1,y_2)$ in situations where there is a positive correlation is $d(Q,R)=\sqrt{a_{11}(x_1-y_1)^2+2a_{12}(x_1-y_1)(x_2-y_2)+a_{22}(x_2-y_2)^2}$. So, in this case, the coordinates of all points $Q=(x_1,x_2)$ at a constant distance $c$ from $R$ verify the equation $a_{11}(x_1-y_1)^2+2a_{12}(x_1-y_1)(x_2-y_2)+a_{22}(x_2-y_2)^2=c^2$, which is the equation of an ellipse with center $R=(y_1,y_2)$ and axes parallel to $(g_1,g_2)$. Figure 3 shows ellipses with constant statistical distances.

This distance can be generalized to $\mathbb{R}^p$ if $a_{11},\ldots,a_{pp},a_{12},\ldots,a_{p-1,p}$ are values such that the distance from $Q$ to $R$ is given by

$$d(Q,R)=\sqrt{A+B},\quad\text{where}\quad\begin{aligned}A&=a_{11}(x_1-y_1)^2+\cdots+a_{pp}(x_p-y_p)^2\\ B&=2a_{12}(x_1-y_1)(x_2-y_2)+\cdots+2a_{p-1,p}(x_{p-1}-y_{p-1})(x_p-y_p)\end{aligned}\tag{3}$$

This distance, therefore, is completely determined by the coefficients $a_{ij}$, with $i,j\in\{1,\ldots,p\}$, which can be arranged in a matrix given by

$$A=\begin{pmatrix}a_{11}&\cdots&a_{1p}\\ \vdots&\ddots&\vdots\\ a_{p1}&\cdots&a_{pp}\end{pmatrix}\tag{4}$$

The elements of Eq. (4) cannot be arbitrary. In order to define a distance over a vector space, Eq. (4) must be a square, symmetric, positive definite matrix. Therefore, the sample covariance matrix of a data matrix, S, is a candidate to define a statistical distance.
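A minimal sketch of a statistical distance built from a sample covariance matrix, in the Mahalanobis form $d(Q,R)=\sqrt{(q-r)^tS^{-1}(q-r)}$; the data are synthetic and chosen only so that one direction has much more variability than the other.

```python
import numpy as np

# Synthetic 2-D data with strong variability along the first coordinate.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0],
                                             [1.5, 0.5]])

S = np.cov(data, rowvar=False)    # sample covariance matrix
S_inv = np.linalg.inv(S)

def statistical_distance(q, r):
    """Distance weighted by the inverse sample covariance matrix."""
    d = q - r
    return np.sqrt(d @ S_inv @ d)

center = data.mean(axis=0)
# The same Euclidean step costs less statistical distance along the
# high-variance direction than along the low-variance one.
print(statistical_distance(center + np.array([1.0, 0.0]), center))
print(statistical_distance(center + np.array([0.0, 1.0]), center))
```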

Figure 4 shows a cloud of points with center of gravity, $(\bar{x}_1,\bar{x}_2)$, at point R. At first glance, it can be seen that the Euclidean distance from point R to point Q is greater than the Euclidean distance from point R to the origin; however, Q seems to have more to do with the cloud of points than the origin does. If we take into account the variability of the points in the cloud and use the statistical distance, then Q will be closer to R than the origin is.

The above given explanation has tried to be an illustration of the need to consider distances other than the Euclidean.

## 3. Population principal components

Principal components are a particular case of linear combinations of p r.v.s, X1,,Xp. These linear combinations represent, geometrically, a new coordinate system that is obtained by rotating the original reference system that has X1,,Xp as coordinate axes. The new axes represent the directions with maximum variability and provide a simple description of the structure of the covariance.

Principal components depend only on the covariance matrix $\Sigma$ (or on the correlation matrix $\rho$) of $X_1,\ldots,X_p$, and it is not necessary to assume that the r.v.s follow an approximately normal distribution. In the case of a normal multivariate distribution, we will have interpretations in terms of ellipsoids of constant density, if we consider the distance defined by the matrix $\Sigma$, and inferences can be made from the population components.

Let $X=(X_1,\ldots,X_p)^t$ be a p-dimensional random vector with covariance matrix $\Sigma$ and eigenvalues $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_p\ge 0$. Let us consider the following $p$ linear combinations:

$$\begin{aligned}Y_1&=l_1^tX=l_{11}X_1+\cdots+l_{p1}X_p\\ &\ \,\vdots\\ Y_p&=l_p^tX=l_{1p}X_1+\cdots+l_{pp}X_p\end{aligned}\tag{5}$$

These new r.v.s verify the following equalities:

$$V(Y_i)=l_i^t\Sigma l_i,\quad i=1,\ldots,p;\qquad \operatorname{Cov}(Y_i,Y_j)=l_i^t\Sigma l_j,\quad i,j=1,\ldots,p,\ i\ne j\tag{6}$$

Principal components are those linear combinations that, being uncorrelated among them, have the greatest possible variance. Thus, the first principal component is the linear combination with the greatest variance, that is, the one for which $V(Y_1)=l_1^t\Sigma l_1$ is maximum. Since multiplying $l_1$ by a constant makes this variance grow, we restrict our attention to vectors of norm one, with which the aforementioned indeterminacy disappears. The second principal component is the linear combination that maximizes the variance, is uncorrelated with the first one, and has a coefficient vector of norm equal to 1.

Proposition 3.1: Let $\Sigma$ be the covariance matrix of the random vector $X=(X_1,\ldots,X_p)^t$. Let us assume that $\Sigma$ has $p$ pairs of eigenvalues and eigenvectors, $(\lambda_1,e_1),\ldots,(\lambda_p,e_p)$, with $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_p\ge 0$. Then, the ith principal component is given by

$$Y_i=e_i^tX=e_{1i}X_1+\cdots+e_{pi}X_p,\quad i=1,\ldots,p\tag{7}$$

In addition, with this choice it is verified that:

1. $V(Y_i)=e_i^t\Sigma e_i=\lambda_i$, $i=1,\ldots,p$.

2. $\operatorname{Cov}(Y_i,Y_j)=0$, $i,j=1,\ldots,p$, $i\ne j$.

3. If any of the eigenvalues are equal, the choice of the corresponding eigenvectors as vectors of coefficients is not unique.

4. $\sigma_{11}+\cdots+\sigma_{pp}=\sum_{i=1}^{p}V(X_i)=\lambda_1+\cdots+\lambda_p=\sum_{j=1}^{p}V(Y_j)$.

Remark 3.1: The demonstration of these results uses expressions for maxima of quadratic forms over vectors of fixed norm, $\max_{l\ne 0}\frac{l^t\Sigma l}{l^tl}=\lambda_1$, together with versions of this maximum subject to orthogonality conditions (which can be handled by the Lagrange multipliers method) and properties of the trace of a matrix (if $\Sigma=P\Lambda P^t$, then $\operatorname{tr}(\Sigma)=\operatorname{tr}(P\Lambda P^t)=\operatorname{tr}(\Lambda)$).
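The maximization result in this remark can be illustrated numerically: for an arbitrary symmetric $\Sigma$, the Rayleigh quotient $\frac{l^t\Sigma l}{l^tl}$ of random vectors never exceeds $\lambda_1$ and attains it at the first eigenvector. The matrix below is an illustrative example, not one from the chapter.

```python
import numpy as np

# Arbitrary symmetric positive definite matrix for illustration.
Sigma = np.array([[5.0, 2.0],
                  [2.0, 3.0]])
lam, vecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order

rng = np.random.default_rng(3)
ratios = []
for _ in range(10000):
    l = rng.normal(size=2)
    ratios.append(l @ Sigma @ l / (l @ l))   # Rayleigh quotient

e1 = vecs[:, -1]                    # eigenvector of the largest eigenvalue
print(max(ratios) <= lam.max())     # quotient never exceeds λ1
print(np.isclose(e1 @ Sigma @ e1, lam.max()))  # attained at e1 (unit norm)
```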

Due to the previous result, principal components are uncorrelated among them, with variances equal to the eigenvalues of $\Sigma$, and the proportion of the population variance due to the ith principal component is given by $\frac{\lambda_i}{\lambda_1+\cdots+\lambda_p}$.

If a high percentage of the population variance, for example, 90%, of a p-dimensional r.v., with large $p$, can be attributed to, for example, the first five principal components, then we can replace all the r.v.s by those five components without a great loss of information.

Each component of the coefficient vector $e_i^t=(e_{1i},\ldots,e_{pi})$, namely $e_{ki}$, also deserves our attention, since it is a measure of the relationship between the r.v.s $X_k$ and $Y_i$.

Proposition 3.2: If $Y_1=e_1^tX,\ldots,Y_p=e_p^tX$ are the principal components obtained from the covariance matrix $\Sigma$, with pairs of eigenvalues and eigenvectors $(\lambda_1,e_1),\ldots,(\lambda_p,e_p)$, then the linear correlation coefficients between the variables $X_k$ and the components $Y_i$ are given by

$$\rho(X_k,Y_i)=\frac{e_{ki}\sqrt{\lambda_i}}{\sqrt{\sigma_{kk}}},\quad i,k=1,\ldots,p\tag{8}$$

Therefore, eki is proportional to the correlation coefficient between Xk and Yi.

In the particular case that $X$ has a normal p-dimensional distribution, $N_p(\mu,\Sigma)$, the density of $X$ is constant on the ellipsoids with center at $\mu$ given by $(X-\mu)^t\Sigma^{-1}(X-\mu)=c^2$, which have axes $\pm c\sqrt{\lambda_i}\,e_i$, $i=1,\ldots,p$, where $(\lambda_i,e_i)$ are the pairs of eigenvalues and eigenvectors of $\Sigma$.

If the covariance matrix, $\Sigma$, can be decomposed into $\Sigma=P\Lambda P^t$, where $P$ is orthogonal and $\Lambda$ diagonal, it can be shown that $\Sigma^{-1}=P\Lambda^{-1}P^t=\sum_{i=1}^{p}\frac{1}{\lambda_i}e_ie_i^t$. Also, if it can be assumed that $\mu=0$, to simplify the expressions, then

$$c^2=x^t\Sigma^{-1}x=\frac{1}{\lambda_1}(e_1^tx)^2+\frac{1}{\lambda_2}(e_2^tx)^2+\cdots+\frac{1}{\lambda_p}(e_p^tx)^2\tag{9}$$

If the principal components $y_1=e_1^tx,\ldots,y_p=e_p^tx$ are considered, the equation of the constant density ellipsoid is given by

$$c^2=\frac{1}{\lambda_1}y_1^2+\frac{1}{\lambda_2}y_2^2+\cdots+\frac{1}{\lambda_p}y_p^2\tag{10}$$

Therefore, the axes of the ellipsoid have the directions of the principal components.

Example 3.1: Let $X_1,X_2,X_3$ be three unidimensional r.v.s and $X=(X_1,X_2,X_3)^t$, with covariance matrix

$$\Sigma=\begin{pmatrix}2&0&0\\ 0&8&3\\ 0&3&2\end{pmatrix}\tag{11}$$

It can be verified that the pairs of eigenvalues and eigenvectors are $\lambda_1=9.243$, $e_1^t=(0,0.924,0.383)$; $\lambda_2=2$, $e_2^t=(1,0,0)$; and $\lambda_3=0.757$, $e_3^t=(0,-0.383,0.924)$. Therefore, the principal components are the following:

$$Y_1=e_1^tX=0.924X_2+0.383X_3,\qquad Y_2=e_2^tX=X_1,\qquad Y_3=e_3^tX=-0.383X_2+0.924X_3\tag{12}$$

The norm of all the eigenvectors is equal to 1, and, in addition, the variable X1 is the second principal component, because X1 is uncorrelated with the other two variables.

The results of Proposition 3.1 can be verified for these data, for example, $V(Y_1)=9.243$ and $\operatorname{Cov}(Y_1,Y_2)=0$. Also, $\sum_{i=1}^{3}V(X_i)=2+8+2=12=9.243+2+0.757=\sum_{j=1}^{3}V(Y_j)$. Thus, the proportion of the total variance explained by the first component is $\lambda_1/12=77\%$, and the one explained by the first two is $(\lambda_1+\lambda_2)/12=93.69\%$, so that the components $Y_1$ and $Y_2$ can replace the original variables with a small loss of information.

The correlation coefficients between the principal components and the variables are the following:

$$\begin{array}{lll}\rho(X_1,Y_1)=0&\rho(X_2,Y_1)=0.993&\rho(X_3,Y_1)=0.823\\ \rho(X_1,Y_2)=1&\rho(X_2,Y_2)=0&\rho(X_3,Y_2)=0\\ \rho(X_1,Y_3)=0&\rho(X_2,Y_3)=-0.118&\rho(X_3,Y_3)=0.568\end{array}\tag{13}$$

In view of these values, it can be concluded that $X_2$ and $X_3$ individually are practically equally important with respect to the first principal component, although this is not the case with respect to the third component. If, in addition, it is assumed that the distribution of $X$ is normal, $N_3(\mu,\Sigma)$, with a null mean vector, ellipsoids of constant density $x^t\Sigma^{-1}x=c^2$ can be considered. An ellipsoid of constant statistical distance and its projections are shown in Figure 5.

The ellipsoid with $c^2=8$ has been represented in Figure 5 (a), together with its axes and the ellipsoid projections on planes parallel to the coordinate planes. The aforementioned projections are ellipses of red, green, and blue colors that are reproduced in Figure 5 (b). Also, in this figure, the black ellipse obtained by projecting the ellipsoid on the plane determined by the first two principal components has been represented. The equation of this ellipse is $\frac{y_1^2}{a^2}+\frac{y_2^2}{b^2}=8$, where $a=1/\sqrt{\eta_1}$ and $b=1/\sqrt{\eta_2}$, with $\eta_1$ and $\eta_2$ being the two smallest eigenvalues of $\Sigma^{-1}$, and the axes are determined by $Y_1$ and $Y_2$. As can be seen, the diameters of the ellipse determined by the first two components are larger than the others. Therefore, the area enclosed by this ellipse is the largest of all, indicating that it is the one that gathers the greatest variability.
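The numbers in Example 3.1 are easy to reproduce; the sketch below checks the eigenvalues of $\Sigma$ and the explained-variance proportions with NumPy.

```python
import numpy as np

# Covariance matrix of Example 3.1.
Sigma = np.array([[2.0, 0.0, 0.0],
                  [0.0, 8.0, 3.0],
                  [0.0, 3.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals = eigvals[::-1]          # reorder from largest to smallest
eigvecs = eigvecs[:, ::-1]

total = eigvals.sum()            # equals tr(Σ) = 12
print(np.round(eigvals, 3))      # ≈ 9.243, 2, 0.757
print(round(eigvals[0] / total, 2))                  # proportion of Y1
print(round((eigvals[0] + eigvals[1]) / total, 4))   # proportion of Y1 and Y2
```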

### 3.1. Principal components with respect to standardized variables

The principal components of the normalized variables $Z_1=\frac{X_1-\mu_1}{\sqrt{\sigma_{11}}},\ldots,Z_p=\frac{X_p-\mu_p}{\sqrt{\sigma_{pp}}}$ can also be considered, which in matrix notation is $Z=V(X-\mu)$, where $V$ is the diagonal matrix whose elements are $\frac{1}{\sqrt{\sigma_{11}}},\ldots,\frac{1}{\sqrt{\sigma_{pp}}}$. It is easily verified that the r.v. $Z$ verifies $E[Z]=0$ and $\operatorname{Cov}(Z)=V\Sigma V=\rho$, where $\rho$ is the correlation matrix of $X$.

Principal components of $Z$ are obtained from the eigenvalues and eigenvectors of the correlation matrix, $\rho$, of $X$. Furthermore, with some simplification, the previous results can be applied, since the variance of each $Z_i$ is equal to 1.

Let $W_1,\ldots,W_p$ be the principal components of $Z$ and $(v_i,u_i)$, $i=1,\ldots,p$, the pairs of eigenvalues and eigenvectors of $\rho$; these do not have to coincide with those of $\Sigma$.

Proposition 3.3: Let $Z=(Z_1,\ldots,Z_p)^t$ be a random vector with covariance matrix $\rho$. Let $(v_1,u_1),\ldots,(v_p,u_p)$ be the pairs of eigenvalues and eigenvectors of $\rho$, with $v_1\ge\cdots\ge v_p\ge 0$. Then, the ith principal component is given by $W_i=u_i^tV(X-\mu)$, $i=1,\ldots,p$. In addition, with this choice it is verified that:

1. $V(W_i)=v_i$, $i=1,\ldots,p$.

2. $\operatorname{Cov}(W_i,W_j)=0$, $i,j=1,\ldots,p$, $i\ne j$.

3. If any of the eigenvalues are equal, the choice of the corresponding eigenvectors as vectors of coefficients is not unique.

4. $\sum_{i=1}^{p}V(W_i)=v_1+\cdots+v_p=\sum_{j=1}^{p}V(Z_j)=p$.

5. The linear correlation coefficients between the variables $Z_k$ and the principal components $W_i$ are $\rho(Z_k,W_i)=u_{ki}\sqrt{v_i}$, $i,k=1,\ldots,p$.

These results are a consequence of those obtained in Proposition 3.1 and Proposition 3.2 applied to $Z$ and $\rho$ instead of $X$ and $\Sigma$.

The total population variance of the normalized variables is the sum of the elements of the diagonal of $\rho$, that is, $p$. Therefore, the proportion of the total variability explained by the ith principal component is $\frac{v_i}{p}$, $i=1,\ldots,p$.

Example 3.2: Let $X_1$ and $X_2$ be two unidimensional r.v.s and $X=(X_1,X_2)^t$ with the covariance matrix, $\Sigma$, and correlation matrix, $\rho$, given by

$$\Sigma=\begin{pmatrix}1&2\\ 2&100\end{pmatrix},\qquad \rho=\begin{pmatrix}1&0.2\\ 0.2&1\end{pmatrix}\tag{14}$$

It can be verified that the pairs of eigenvalues and eigenvectors for $\Sigma$ are $\lambda_1=100.04$, $e_1^t=(0.02,0.999)$ and $\lambda_2=0.96$, $e_2^t=(0.999,-0.02)$. Therefore, the principal components are the following:

$$Y_1=e_1^tX=0.02X_1+0.999X_2,\qquad Y_2=e_2^tX=0.999X_1-0.02X_2\tag{15}$$

Furthermore, the eigenvalues and eigenvectors of $\rho$ are $v_1=1.2$, $u_1^t=(0.707,0.707)$ and $v_2=0.8$, $u_2^t=(0.707,-0.707)$; hence, the principal components of the normalized variables are the following:

$$\begin{aligned}W_1&=u_1^tZ=0.707Z_1+0.707Z_2=0.707(X_1-\mu_1)+0.0707(X_2-\mu_2)\\ W_2&=u_2^tZ=0.707Z_1-0.707Z_2=0.707(X_1-\mu_1)-0.0707(X_2-\mu_2)\end{aligned}\tag{16}$$

Because the variance of $X_2$ is much greater than that of $X_1$, the first principal component for $\Sigma$ is determined by $X_2$, and the proportion of variability explained by that first component is $\frac{\lambda_1}{\lambda_1+\lambda_2}=0.99$.

When considering the normalized variables, each variable contributes to the components determined by $\rho$, and the dependencies between the normalized variables and their first component are $\rho(Z_1,W_1)=u_{11}\sqrt{v_1}=0.707\sqrt{1.2}=0.774$ and $\rho(Z_2,W_1)=u_{21}\sqrt{v_1}=0.707\sqrt{1.2}=0.774$. The proportion of the total variability explained by the first component is $\frac{v_1}{p}=0.6$.

Therefore, the importance of the first component is strongly affected by normalization. In fact, the weights, in terms of $X_i$, are 0.707 and 0.0707 for $\rho$, as opposed to 0.02 and 0.999 for $\Sigma$.
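The contrast drawn in Example 3.2 can be checked numerically: the sketch below compares the eigenvalues of $\Sigma$ with those of $\rho$ and the corresponding explained-variance proportions.

```python
import numpy as np

# Covariance matrix of Example 3.2; its correlation matrix follows from it.
Sigma = np.array([[1.0, 2.0],
                  [2.0, 100.0]])
d = np.sqrt(np.diag(Sigma))
rho = Sigma / np.outer(d, d)

lam, _ = np.linalg.eigh(Sigma)   # eigenvalues of Σ (ascending order)
v, _ = np.linalg.eigh(rho)       # eigenvalues of ρ (ascending order)

print(np.round(lam[::-1], 2))    # ≈ 100.04, 0.96
print(np.round(v[::-1], 2))      # 1.2, 0.8
print(round(lam.max() / lam.sum(), 2))   # proportion for Σ
print(round(v.max() / v.sum(), 2))       # proportion for ρ
```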

Remark 3.2: The above example shows that the principal components deduced from the original variables are, in general, different from those derived from the normalized variables. So, normalization has important consequences.

When the units in which the different one-dimensional random variables are given are very different and in the case that one of the variances is very dominant compared to the others, the first principal component, with respect to the original variables, will be determined by the variable whose variance is the dominant one. On the other hand, if the variables are normalized, their relationship with the first components will be more balanced.

Principal components can be expressed in particular ways if the covariance matrix, or the correlation matrix, has special structures, such as diagonal ones, or structures of the form $\Sigma=\sigma^2A$.

## 4. Sample principal components

Once we have the theoretical framework, we can now address the problem of summarizing the variation of n measurements made on p variables.

Let $x_1,\ldots,x_n$ be a sample of a p-dimensional r.v. $X$ with mean vector $\mu$ and covariance matrix $\Sigma$. These data have a vector of sample means $\bar{x}$, sample covariance matrix $S$, and sample correlation matrix $R$.

This section is aimed at constructing linear uncorrelated combinations of the measured characteristics that contain the greatest amount of variability contained in the sample. These linear combinations are called principal sample components.

Given the $n$ values of any linear combination $l_1^tx_j=l_{11}x_{1j}+\cdots+l_{p1}x_{pj}$, $j=1,\ldots,n$, its sample mean is $l_1^t\bar{x}$, and its sample variance is $l_1^tSl_1$. If we consider two linear combinations, $l_1^tx_j$ and $l_2^tx_j$, their sample covariance is $l_1^tSl_2$.

The first principal component will be the linear combination, $l_1^tx_j$, which maximizes the sample variance, subject to the condition $l_1^tl_1=1$. The second component will be the linear combination, $l_2^tx_j$, which maximizes the sample variance, subject to the conditions that $l_2^tl_2=1$ and that the sample covariance of the pairs $(l_1^tx_j,\,l_2^tx_j)$ is equal to zero. This procedure is continued until the $p$ principal components are completed.

Proposition 4.1: Let $S=(s_{ik})$ be the $p\times p$ matrix of sample covariances, whose pairs of eigenvalues and eigenvectors are $(\hat{\lambda}_1,\hat{e}_1),\ldots,(\hat{\lambda}_p,\hat{e}_p)$, with $\hat{\lambda}_1\ge\hat{\lambda}_2\ge\cdots\ge\hat{\lambda}_p\ge 0$. Let $x$ be an observation of the p-dimensional random variable $X$; then:

1. The ith principal component is given by $\hat{y}_i=\hat{e}_i^tx=\hat{e}_{1i}x_1+\cdots+\hat{e}_{pi}x_p$, $i=1,\ldots,p$.

2. The sample variance of $\hat{y}_k$ is $\hat{\lambda}_k$, $k=1,\ldots,p$.

3. The sample covariance of $(\hat{y}_i,\hat{y}_k)$, $i\ne k$, is equal to 0.

4. The total sample variance is $\sum_{i=1}^{p}s_{ii}=\hat{\lambda}_1+\cdots+\hat{\lambda}_p$.

5. The sample correlation coefficients between $x_k$ and $\hat{y}_i$ are $r(x_k,\hat{y}_i)=\frac{\hat{e}_{ki}\sqrt{\hat{\lambda}_i}}{\sqrt{s_{kk}}}$, $i,k=1,\ldots,p$.
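A minimal sketch of Proposition 4.1 with synthetic data (not from the chapter): the eigenpairs of the sample covariance matrix give component directions whose sample variances are the eigenvalues, whose covariances vanish, and whose variances sum to the total sample variance.

```python
import numpy as np

# Synthetic data: 200 observations (rows) of 3 correlated variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.2, 0.3, 0.4]])

S = np.cov(X, rowvar=False)                 # sample covariance matrix
lam_hat, e_hat = np.linalg.eigh(S)
order = np.argsort(lam_hat)[::-1]           # sort eigenpairs, largest first
lam_hat, e_hat = lam_hat[order], e_hat[:, order]

# Scores of the centered observations on each component.
Y = (X - X.mean(axis=0)) @ e_hat

# Items 2-4: component variances equal the eigenvalues, components are
# uncorrelated, and the total sample variance is preserved.
print(np.allclose(np.var(Y, axis=0, ddof=1), lam_hat))
print(np.allclose(np.cov(Y, rowvar=False), np.diag(lam_hat)))
print(np.allclose(np.trace(S), lam_hat.sum()))
```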

In the case that the random variables have a normal distribution, the principal components can be obtained from the maximum likelihood estimate $\hat{\Sigma}=S_n$, and, in this case, the sample principal components can be considered as maximum likelihood estimates of the population principal components. Although the eigenvalues of $S$ and $\hat{\Sigma}$ are different, they are proportional with a fixed constant of proportionality, so the proportion of variability they explain is the same. The sample correlation matrix is the same for $S$ and $\hat{\Sigma}$. We do not pursue the particular case of normally distributed variables further, so as not to have to include hypotheses that should be verified for the data under study.

Sometimes, the observations $x$ are centered by subtracting the mean $\bar{x}$. This operation does not affect the covariance matrix and produces principal components of the form $\hat{y}_i=\hat{e}_i^t(x-\bar{x})$; in this case $\bar{\hat{y}}_i=0$ for every component, while the sample variances remain $\hat{\lambda}_1,\ldots,\hat{\lambda}_p$.

When trying to interpret the principal components, the correlation coefficients $r(x_k,\hat{y}_i)$ are more reliable guides than the coefficients $\hat{e}_{ik}$, since they avoid interpretive problems caused by the different scales in which the variables are measured.

### 4.1. Interpretations of the principal sample components

Principal sample components have several interpretations. If the distribution of $X$ is close to $N_p(\mu,\Sigma)$, then the components $\hat{y}_i=\hat{e}_i^t(x-\bar{x})$ are realizations of the population principal components $Y_i=e_i^t(X-\mu)$, which will have distribution $N_p(0,\Lambda)$, where $\Lambda$ is the diagonal matrix whose elements are the eigenvalues of the sample covariance matrix, ordered from largest to smallest. Keeping in mind the hypothesis of normality, contours of constant density, $E_p=\{x\in\mathbb{R}^p : (x-\bar{x})^tS^{-1}(x-\bar{x})=c^2\}$, can be estimated, and inferences can be made from them.

Even when it is not possible to assume normality in the data, geometrically the data are $n$ points in $\mathbb{R}^p$, and the principal components represent an orthogonal transformation whose coordinate axes are the axes of the ellipsoid $E_p$, with lengths proportional to $\sqrt{\hat{\lambda}_i}$, where the $\hat{\lambda}_i$ are the eigenvalues of $S$. Since all eigenvectors have been chosen such that their norm is equal to 1, the absolute value of the ith component $\hat{y}_i=\hat{e}_i^t(x-\bar{x})$ is the length of the projection of the vector $x-\bar{x}$ on the vector $\hat{e}_i$. Therefore, the principal components can be seen as a translation of the origin to the point $\bar{x}$ and a rotation of the axes until they pass through the directions of greatest variability.

When there is a high positive correlation between all the variables, and a principal component has all its coordinates of the same sign, this component can be considered as a weighted average of all the variables, or a size index formed from them. The components that have coordinates of different signs contrast one subset of variables against another, each side being a weighted average of a group of variables.

The interpretation of the results is simplified assuming that the small coefficients are zero and rounding the rest to express the component as sums, differences, or quotients of variables.

The interpretation of the principal components can be facilitated by graphic representations in two dimensions. A usual graph is to represent two components as coordinate axes and project all points on those axes. These representations also help to test hypotheses of normality and to detect anomalous observations. If there is an observation that is atypical in the first variable, we will have that the variability in that first variable will grow and that the covariance with the other variables will decrease, in absolute value. Consequently, the first component will be strongly influenced by the first variable, distorting the analysis.

Sometimes, it is necessary to verify that the first components are approximately normal, although it is not reasonable to expect this result from a linear combination of variables that do not have to be normal.

The last components can help detect suspicious observations. Each observation $x_j$ can be expressed as a linear combination of the eigenvectors of $S$, $x_j=\hat{y}_{1j}\hat{e}_1+\cdots+\hat{y}_{pj}\hat{e}_p$, with which the difference between the approximation by the first $q$ components, $\hat{y}_{1j}\hat{e}_1+\cdots+\hat{y}_{qj}\hat{e}_q$, and the observation $x_j$ is $\hat{y}_{q+1,j}\hat{e}_{q+1}+\cdots+\hat{y}_{pj}\hat{e}_p$, which is a vector with squared norm $\hat{y}_{q+1,j}^2+\cdots+\hat{y}_{pj}^2$, and we will be suspicious of observations that make a large contribution to this squared norm.

An especially small value of the last eigenvalue of the covariance matrix, or correlation matrix, can indicate a linear dependence between the variables that has not been taken into account. In this case, some variable is redundant and should be removed from the analysis. If we have four variables and the fourth is the sum of the other three, then the last eigenvalue will be close to zero, differing from it only by rounding errors, in which case we should suspect some dependence. In general, eigenvalues close to zero should not be ignored; the eigenvectors associated with these eigenvalues can indicate linear dependencies in the data and cause deformations in the interpretations, calculations, and consequent analysis.

### 4.2. Standardized sample principal components

In general, principal components are not invariant against changes of scale in the original variables, as has been mentioned when referring to the normalized population principal components. Normalizing, or standardizing, the variables consists of performing the following transformation: $z_j=D(x_j-\bar{x})=\left(\frac{x_{1j}-\bar{x}_1}{\sqrt{s_{11}}},\ldots,\frac{x_{pj}-\bar{x}_p}{\sqrt{s_{pp}}}\right)^t$, $j=1,\ldots,n$, where $D$ is the diagonal matrix with entries $\frac{1}{\sqrt{s_{ii}}}$. If $Z$ is the $p\times n$ matrix whose columns are the $z_j$, it can be shown that its sample mean vector is the null vector and that its sample covariance matrix is the sample correlation matrix, $R$, of the original variables.

Remark 4.1: Applying that the principal components of the normalized variables are those obtained for the sample observations but substituting the matrix S by R, we can establish that if \(z_1, \ldots, z_n\) are the normalized observations, with covariance matrix \(R = (r_{ik})\), where \(r_{ik}\) is the sample correlation coefficient between variables \(x_i\) and \(x_k\), and if the pairs of eigenvalues and eigenvectors of R are \((\hat{v}_1, \hat{u}_1), \ldots, (\hat{v}_p, \hat{u}_p)\), with \(\hat{v}_1 \geq \cdots \geq \hat{v}_p \geq 0\), then

1. The ith principal component is given by \(\hat{\omega}_i = \hat{u}_i^t z = \hat{u}_{1i}z_1 + \cdots + \hat{u}_{pi}z_p\), \(i = 1, \ldots, p\).

2. The sample variance of \(\hat{\omega}_k\) is \(\hat{v}_k\), \(k = 1, \ldots, p\).

3. The sample covariance of \(\hat{\omega}_i\) and \(\hat{\omega}_k\), \(i \neq k\), is equal to 0.

4. The total sample variance is \(\operatorname{tr}(R) = p = \hat{v}_1 + \cdots + \hat{v}_p\).

5. The sample correlation coefficient between \(z_k\) and \(\hat{\omega}_i\) is \(r_{z_k,\hat{\omega}_i} = \hat{u}_{ki}\sqrt{\hat{v}_i}\), \(i, k = 1, \ldots, p\).

6. The proportion of the total sample variance explained by the ith principal component is \(\hat{v}_i/p\).
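As a quick numerical check of properties 1 to 6, the following sketch (synthetic data with deliberately mismatched scales; any numerical linear algebra routine would do) standardizes the sample and verifies that the eigenvalues of R sum to p:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 4
# Variables with wildly different scales, where unstandardized PCA misleads.
X = rng.normal(size=(n, p)) * np.array([1.0, 10.0, 100.0, 0.1])

# Standardize: z = D^{-1/2}(x - x̄), so the covariance of Z is the
# correlation matrix R of the original variables.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)

v, U = np.linalg.eigh(R)
v, U = v[::-1], U[:, ::-1]          # eigenvalues from largest to smallest

# Property 4: the total sample variance is tr(R) = p.
print(v.sum())                      # ≈ p = 4

# Property 6: proportion explained by the ith component is v̂_i / p.
proportions = v / p
```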

### 4.3. Criteria for reducing the dimension

The eigenvalues and eigenvectors of the covariance matrix, or correlation matrix, are the essence of the analysis of principal components, since the eigenvectors determine the directions of maximum variability and the eigenvalues indicate the variances along those directions. If a few eigenvalues are much larger than the rest, most of the variance can be explained with fewer than p variables.

In practice, decisions about the number of components to be considered must be made in terms of the pairs of eigenvalues and eigenvectors of the covariance matrix, or correlation matrix, and different rules have been suggested:

1. When plotting the graph of \((i, \hat{\lambda}_i)\), it has been empirically verified that the first values decrease with an approximately linear, fairly steep slope, and that from a certain eigenvalue onwards this decrease stabilizes. That is, there is a point from which the eigenvalues are very similar. The criterion consists of keeping the components that come before this point and excluding the small, approximately equal eigenvalues.

2. Select components until a preset proportion of the variance (e.g., 80%) is explained. This rule should be applied with care, since components that reflect nuances useful for the interpretation of the analysis could be excluded.

3. A rule that does not have great theoretical support, which must be applied carefully so as not to discard any component valid for the analysis, but which has given good empirical results, is to retain those components with variances, \(\hat{\lambda}_i\), above a certain threshold. If the working matrix is the correlation matrix, in which case the average value of the eigenvalues is one, the criterion is to keep the components associated with eigenvalues greater than unity and discard the rest.
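The second and third rules can be sketched as small helper functions; the names `components_to_keep` and `kaiser_rule` are ours, not from the chapter:

```python
import numpy as np

def components_to_keep(eigvals, target=0.80):
    """Rule 2: number of components needed to explain `target` of the variance."""
    eigvals = np.sort(np.asarray(eigvals))[::-1]
    cum = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cum, target) + 1)

def kaiser_rule(eigvals):
    """Rule 3 for a correlation matrix (mean eigenvalue 1): keep λ̂ > 1."""
    return int((np.asarray(eigvals) > 1.0).sum())

eigvals = [3.0, 0.5, 0.3, 0.2]        # hypothetical spectrum, tr(R) = p = 4
print(components_to_keep(eigvals))    # 2 (cumulative: 75%, then 87.5%)
print(kaiser_rule(eigvals))           # 1 (only 3.0 exceeds unity)
```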

## 5. Application to image compression

We are going to illustrate the use of principal components to compress images. To this end, the image of Lena was considered. This photograph has been used by engineers, researchers, and students for experiments related to image processing.

### 5.1. Black and white photography

The black and white photograph shown in Figure 6 was considered. First, the image in .jpg format was converted into the numerical matrix Image of dimension 512 by 512 (i.e., \(2^9 \times 2^9\)). Second, to obtain the observation vectors, the matrix was divided into blocks \(A_{ij}\) of dimension \(2^3 \times 2^3\), with which 4096 blocks were obtained, and each of them was a vector of observations.

$$\mathrm{Image} = \begin{pmatrix} A_{1,1} & \cdots & A_{1,64} \\ \vdots & \ddots & \vdots \\ A_{64,1} & \cdots & A_{64,64} \end{pmatrix} \tag{17}$$

Third, each matrix \(A_{ij}\) was stored in a vector x of dimension 64, which contained the elements of the matrix by rows, that is, \(x = (a_{i,1}, \ldots, a_{i,8}, a_{i+1,1}, \ldots, a_{i+1,8}, \ldots, a_{i+7,8})^t\). This way, we had the observations \(\{x_k \in \mathbb{R}^{64}\}_{k=1}^{4096}\), which were grouped in the observation matrix \(x = (x_{ij}) \in \mathcal{M}_{4096,64}\).

Fourth, the average of each column was calculated, obtaining the vector of means \(\bar{x} = (\bar{x}_1, \ldots, \bar{x}_{64})\), and from each observation \(x_{ij}\), its corresponding mean \(\bar{x}_j\) was subtracted. Thus, the matrix of centered observations U was obtained. The covariance matrix of the data was \(S = U^t U \in \mathcal{M}_{64,64}\).

Fifth, the 64 pairs of eigenvalues and eigenvectors of S, \((\hat{\lambda}_i, \hat{e}_i)\), were found, and they were ordered according to the eigenvalues from highest to lowest. The 8 largest eigenvalues are drawn in Figure 7. As can be seen, the first eigenvalue is much larger than the rest. Thus, the first principal component completely dominates the total variability.

Sixth, with the theoretical results and the calculations previously made, the 64 principal components \(\hat{y}_j = \hat{e}_j^t x = \hat{e}_{1,j}x_1 + \cdots + \hat{e}_{64,j}x_{64}\), \(j = 1, \ldots, 64\), were built. The first principal component was \(\hat{y}_1 = 0.1167x_1 + \cdots + 0.1166x_{64}\). Therefore, an orthonormal basis of \(\mathbb{R}^{64}\) was built.
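Steps one through six can be reproduced in a few lines. The sketch below uses a synthetic 512 × 512 array in place of the Lena photograph (an assumption, since we do not ship the image file); otherwise it follows the block layout and the definition \(S = U^tU\) from the text:

```python
import numpy as np

# A synthetic stand-in for the 512×512 grey-level image (the chapter uses Lena).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(512, 512)).astype(float)

# Split into 4096 blocks of 8×8 and flatten each block by rows into a 64-vector.
blocks = image.reshape(64, 8, 64, 8).transpose(0, 2, 1, 3).reshape(4096, 64)

# Center the observations and form S = UᵗU as in the text.
mean = blocks.mean(axis=0)
U = blocks - mean
S = U.T @ U

# Eigenpairs of S, ordered from the largest eigenvalue down.
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
```

On the real photograph the leading eigenvalue dominates, as Figure 7 shows; on this random stand-in the spectrum is flatter, but the mechanics are identical.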

Seventh, each vector \(\hat{e}_j = (\hat{e}_{1,j}, \ldots, \hat{e}_{64,j})^t\) was grouped by rows in a matrix \(\hat{E}_j \in \mathcal{M}_{8,8}\):

$$\hat{E}_j = \begin{pmatrix} \hat{e}_{1,j} & \cdots & \hat{e}_{8,j} \\ \vdots & \ddots & \vdots \\ \hat{e}_{57,j} & \cdots & \hat{e}_{64,j} \end{pmatrix} \tag{18}$$

Each of the 64 matrices Êj was converted into an image. The images of the first three principal components are shown in Figure 8.

At this point, it is important to mention that the data matrix has been assumed to be formed by 4096 vectors of \(\mathbb{R}^{64}\) expressed in the canonical base, B. Also, the base whose vectors were the eigenvectors of S, \(B' = \{\hat{e}_1, \ldots, \hat{e}_{64}\}\), was considered. The coordinates with respect to the canonical basis of the vectors of \(B'\) were the columns of the matrix \(PC = (\hat{e}_1 \mid \cdots \mid \hat{e}_{64})\). Then, given a vector v that with respect to the canonical base had coordinates \((x_1, \ldots, x_{64})\) and with respect to the base \(B'\) had coordinates \((y_1, \ldots, y_{64})\), the relation between them was \((x_1, \ldots, x_{64})^t = PC\,(y_1, \ldots, y_{64})^t\). Also, as PC is an orthogonal matrix, \((y_1, \ldots, y_{64}) = (x_1, \ldots, x_{64})\,PC\). Thus, the 4096 vectors that formed the observation matrix had as coordinates, with respect to the new base, the rows of the matrix of dimension 4096 × 64 given by \(y = x\,PC\).

Eighth, in order to reduce the dimension, it was taken into consideration that if we keep all the vectors of \(B'\), we can perfectly reconstruct our data matrix, because \(y = x\,PC \Rightarrow x = y\,PC^{-1} = y\,PC^t\). Additionally, for the case under study, to reduce the dimension, if we use the slope change rule, we can consider the first two principal components; five components if we want to explain 97% of the variability, because \(\sum_{i=1}^{5}\hat{\lambda}_i \big/ \sum_{j=1}^{64}\hat{\lambda}_j = 0.97\); or eight components if we want to explain 98% of the total variability.

In order to compress the image, the first vectors of the base \(B'\) were used. Moreover, supposing that we were left with M of them, M < 64, the matrix \(T_M\) given by Eq. (19) was defined:

$$T_M = \begin{pmatrix} I_{M \times M} & 0_{M \times (64-M)} \\ 0_{(64-M) \times M} & 0_{(64-M) \times (64-M)} \end{pmatrix} \tag{19}$$

Therefore, the dimension of \(y_M = y\,T_M\) was 4096 × 64.

Ninth, to reconstruct the compressed image, each row of \(y_M\) was regrouped in an 8 × 8 matrix. The ith row of \(y_M\), denoted by \(y_M^i = (b_{i,1}, \ldots, b_{i,8}, b_{i,9}, \ldots, b_{i,16}, \ldots, b_{i,64})\), was transformed into the matrix \(B_i\) given by Eq. (20), and the matrix Compressed_image given by Eq. (21) was built:

$$B_i = \begin{pmatrix} b_{i,1} & \cdots & b_{i,8} \\ b_{i,9} & \cdots & b_{i,16} \\ \vdots & \ddots & \vdots \\ b_{i,57} & \cdots & b_{i,64} \end{pmatrix}, \quad i = 1, \ldots, 4096 \tag{20}$$

$$\mathrm{Compressed\_image} = \begin{pmatrix} B_1 & \cdots & B_{64} \\ B_{65} & \cdots & B_{128} \\ \vdots & \ddots & \vdots \\ B_{4033} & \cdots & B_{4096} \end{pmatrix} \tag{21}$$

Tenth and finally, Eq. (21) was converted into a .jpg file. Figure 9 shows the original image and compressed images with two, five, and eight principal components.
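The whole pipeline, steps one to ten, condenses into a single function. This is a sketch under the same assumptions as before (a synthetic image stands in for Lena, and the function name `compress` is ours); keeping all 64 components reconstructs the image exactly, since then \(T_M\) in Eq. (19) is the identity:

```python
import numpy as np

def compress(image, M):
    """Project the 8×8 blocks onto the first M principal components, zero the
    rest, and rebuild the 512×512 image (steps 1-10 of the text)."""
    blocks = image.reshape(64, 8, 64, 8).transpose(0, 2, 1, 3).reshape(4096, 64)
    mean = blocks.mean(axis=0)
    U = blocks - mean
    _, eigvecs = np.linalg.eigh(U.T @ U)
    PC = eigvecs[:, ::-1]              # columns: eigenvectors, largest λ first
    y = U @ PC                         # coordinates in the new base B'
    y[:, M:] = 0                       # equivalent to y_M = y T_M
    rec = y @ PC.T + mean              # back to the canonical base
    return rec.reshape(64, 64, 8, 8).transpose(0, 2, 1, 3).reshape(512, 512)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(512, 512)).astype(float)
full = compress(image, 64)             # all components: exact reconstruction
coarse = compress(image, 8)            # 8 components, as in Figure 9
```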

By increasing the number of principal components, the percentage of variability explained grows only by very small amounts; nevertheless, the nuances added to the photo are sufficiently remarkable, since they make it sharper, smooth the contours, and mark the tones more precisely.

#### 5.1.1. Objective measures of the quality of reconstructions

The two methods that we will use are the peak signal-to-noise ratio (PSNR) and the entropy of the error image. The PSNR measure evaluates the quality in terms of deviations between the processed and the original image, and the entropy of an image is a measure of the information content contained in that image.

Definition 5.1: Let N be the number of rows times the number of columns of the image. Let \(\{x_n\}_{n=1}^{N}\) be the set of pixels of the original image, \(\{y_n\}_{n=1}^{N}\) the set of reconstruction pixels, and \(\{r_n = x_n - y_n\}_{n=1}^{N}\) the errors. The mean square error (MSE) is

$$\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N} r_n^2 \tag{22}$$

Definition 5.2: Let the images under study be 8 bit images. The peak signal-to-noise ratio of the reconstruction is

$$\mathrm{PSNR} = 10\log_{10}\frac{(2^8-1)^2}{\mathrm{MSE}} \tag{23}$$
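Eqs. (22) and (23) translate directly into code; a minimal sketch (the helper name `psnr` is ours):

```python
import numpy as np

def psnr(original, reconstruction):
    """Peak signal-to-noise ratio for 8 bit images, Eqs. (22)-(23)."""
    mse = np.mean((np.asarray(original, float) - np.asarray(reconstruction, float)) ** 2)
    return 10 * np.log10((2 ** 8 - 1) ** 2 / mse)

# A uniform error of 5 grey levels gives MSE = 25:
a = np.zeros((8, 8))
b = np.full((8, 8), 5.0)
print(psnr(a, b))   # 10·log10(255²/25) ≈ 34.15 dB
```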

Figure 10 (a) shows the PSNR of the reconstructions of the image versus the number of principal components used for the reconstruction, together with the regression line that fits the cloud of points. Figure 10 (b) shows the values of the PSNR when we use three quarters (black), half (red), a quarter (blue), an eighth (green), a sixteenth (brown), and a thirty-second (yellow) of the components, which means a corresponding reduction in compression. A behavior close to linearity, with a slope of approximately 0.2, can be seen. With the reductions considered, the PSNR varies between 27 and 63.

If the entropy is high, the variability of the pixels is very high, and there is little redundancy; thus, if we exceed a certain threshold in compression, the original image cannot be recovered exactly. If the entropy is small, the variability will be smaller; the redundancy of a pixel with respect to the pixels of its surroundings is high, and randomness is lost.

Definition 5.3: Let I be an 8 bit image whose pixels can take the values \(\{0, \ldots, 255\}\). Let \(p_i\) be the frequency with which the value \(i \in \{0, \ldots, 255\}\) appears. Then, the entropy is

$$H(I) = -\sum_{i=0}^{255} p_i \log_2 p_i \tag{24}$$
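Eq. (24) can be computed from a pixel histogram; a minimal sketch (the helper name `entropy` is ours), with the usual convention that \(0\log_2 0 = 0\):

```python
import numpy as np

def entropy(image):
    """Shannon entropy of an 8 bit image, Eq. (24), in bits per pixel."""
    counts = np.bincount(np.asarray(image, dtype=np.uint8).ravel(), minlength=256)
    p = counts / counts.sum()
    p = p[p > 0]                        # drop zero frequencies: 0·log2(0) := 0
    return -np.sum(p * np.log2(p))

flat = np.zeros((8, 8), dtype=np.uint8)         # one grey level: no information
half = np.array([0, 255] * 32, dtype=np.uint8)  # two equiprobable levels: 1 bit
print(entropy(flat), entropy(half))
```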

Figure 11 (a) shows the entropy of the reconstructions from 1 to 256 components. As can be seen, the entropy increases over roughly the first 10 components and then levels off, tending asymptotically to the value of the entropy of the image (7.4452). It can be seen that the difference with more than 170 components is insignificant. Figure 11 (b) shows the entropy of the reconstructions using 8 components (black), 16 components (brown), 32 components (green), 64 components (blue), and 128 components (red), respectively.

Finally, we consider the entropy of the images of the errors. Given an image, I, the value of each of its pixels is an element of the set \(\{0, \ldots, 255\}\), and if we have a reconstruction, \(\hat{I}\), and consider the error, \(E = I - \hat{I}\), then the value of its pixels will be an element of the set \(\{-255, \ldots, 255\}\). Therefore, E cannot be considered as an image. Since a pixel of value \(e_{ij}\) in E is an error of the same size as \(-e_{ij}\), in order to work with images we call image of the error the matrix \(\mathrm{Im}\,E = (|e_{ij}|)\), where \(E = (e_{ij})\).

Figure 12 (a) shows the entropy of the error image versus the number of principal components used for the reconstruction, together with a fitted line of slope −0.02. Figure 12 (b) shows the entropy when we use 8 components (black), 16 components (brown), 32 components (green), 64 components (blue), and 128 components (red), respectively. With more than 200 principal components, the entropy of the errors is zero, which means that the errors have very little variability, and with fewer components, the decrease appears linear with slope −0.02.

### 5.2. Coordinates of the first principal component

In this section, we will consider the coordinates of the vectors that form the first principal components. Since the vectors have been obtained from blocks of dimension \(2^3 \times 2^3\), they have 64 coordinates. Figure 13 shows the coordinates of the first six principal components with respect to the canonical base.

As can be seen from Figure 13, all the coordinates seem to have some component with period 8. This suggests that there may be some relationship with the shape of the chosen blocks and that most vectors are close to being periodic with period 8, because when we consider each of the 4096 vectors of 64 components, the first 8 pixels come from one row of the block, the next 8 pixels from the adjacent row, and so on, 8 times.

Since the first principal components collect a large part of the characteristics of the vectors, it is plausible that they also reflect the periodicity of the vectors. Recall that principal components are linear combinations of vectors and that if all of them had all their periodic coordinates with the same period, then all components would be periodic as well.

In Figure 14, the coordinates of the first three principal components are shown when the vectors are constructed from blocks of \(2^2 \times 2^2\) (see Figure 14 (a-c)) and from blocks of \(2^4 \times 2^4\) (see Figure 14 (d-f)). As can be seen, the periodicity in the first components is again appreciated.

### 5.3. Reduction of the first principal component by periodicity

Using the almost periodicity of the first principal component, we can use less information to obtain acceptable reconstructions of the image. If in the first principal component of dimension 64 we repeat the first eight values periodically and use k principal components to reconstruct the image, we go from a reduction of \(k/64\) to one of \((k-1+8/64)/64\). Figure 15 shows both the reconstruction of the image with 2 and 8 original principal components and the reconstruction with 2 and 8 principal components in which the first component has been replaced by a vector whose coordinates have period 8; we call this set componentsper.
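The periodic replacement can be sketched as follows. The helper below (our naming; a random orthonormal base stands in for the eigenvectors of S, which is an assumption for the sake of a self-contained example) tiles the first 8 coordinates of the first component with period 8 and re-normalizes, so only 8 numbers need to be stored for that component instead of 64:

```python
import numpy as np

def periodic_first_component(eigvecs):
    """Replace the first eigenvector (column 0) by the 8-periodic vector built
    from its first 8 coordinates, re-normalized to unit length."""
    per = eigvecs.copy()
    per[:, 0] = np.tile(eigvecs[:8, 0], 8)   # repeat the first 8 values 8 times
    per[:, 0] /= np.linalg.norm(per[:, 0])
    return per

# Stand-in orthonormal base of R^64 (in the chapter, the eigenvectors of S).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
per = periodic_first_component(Q)
stored = per[:8, 0]                          # only these 8 values are stored
```

Reconstruction then proceeds exactly as before, with `per` in place of the original eigenvector matrix, at the cost of some accuracy in the first component.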

The first component of componentsper is not the true one; therefore, reconstructions from this set cannot be made with total precision. If we compare the 1-norm, 2-norm, and ∞-norm of the difference between the image and the corresponding reconstruction, with the original principal components and with the principal components using their periodicity, we obtain, varying the number of principal components used, the results shown in Figure 16.

With the original principal components (blue), the original image can be completely reconstructed, while if we use only a few components, in this case 10 or less, approximations similar to the original ones are obtained with componentsper (green).

## 6. Conclusions

This chapter has been devoted to giving a short but comprehensive introduction to the basics of the statistical technique known as principal component analysis, aimed at its application to image compression. The first part of the chapter was focused on preliminaries: mean vector, covariance matrix, eigenvectors, eigenvalues, and distances. That part finished by bringing up the problems that the Euclidean distance presents and by highlighting the importance of using a statistical distance that takes into account the different variabilities and correlations. To that end, a brief introduction was made to a distance that depends on variances and covariances.

Next, in the second part of the chapter, principal components were introduced and connected with the previously explained concepts. Here, principal components were presented as a particular case of linear combinations of random variables, but with the peculiarity that those linear combinations represent a new coordinate system that is obtained by rotating the original reference system, which has the aforementioned random variables as coordinate axes. The new axes represent the directions with maximum variability and provide a simple description of the structure of the covariance.

Then, the third part of the chapter was devoted to showing an application of principal component analysis to image compression. An original image was taken and compressed by using different numbers of principal components, and the importance of carrying out objective measures of the quality of the reconstructions was highlighted. Also, a novel contribution of this chapter was the study of the periodicity of the principal components and the reduction of the first principal component by periodicity. In short, a construction of principal components by periodicity has been included, in order to reduce the computational cost of their calculation, although at the price of some accuracy. It can be said that, using the almost periodicity of the first principal component, less information can be used to obtain acceptable reconstructions of the image.

Finally, we would not like to finish this chapter without saying that few pages cannot gather the wide range of applications that this statistical technique has found in solving real-life problems. There is a countless number of applications of principal component analysis to solve problems that both scientists and engineers have to face in real-life situations. However, in order to be practical, it was decided to choose and develop step by step an application example that could be of interest for a wide range of readers. Accordingly, we thought that such an example could be one related to data compression, because with the advancements of information and communication technologies both scientists and engineers need to either store or transmit more information at lower costs, faster, and at greater distances with higher quality. In this sense, one example is image compression by using statistical techniques, and this is the reason why, in this chapter, it was decided to take advantage of statistical properties of an image to present a practical application of principal component analysis to image compression.

## References

1. Jackson JE. A User's Guide to Principal Components. John Wiley & Sons; 1991
2. Diamantaras KI, Kung SY. Principal Component Neural Networks: Theory and Applications. John Wiley & Sons; 1996
3. Elsner JB, Tsonis AA. Singular Spectrum Analysis: A New Tool in Time Series Analysis. Plenum Press; 1996
4. Rencher AC. Multivariate Statistical Inference and Applications. 1st ed. Wiley-Interscience; 1997
5. Flury B. A First Course in Multivariate Statistics. Springer-Verlag; 1997
6. Gnanadesikan R. Methods for Statistical Data Analysis of Multivariate Observations. 2nd ed. Wiley-Interscience; 1997
7. Jolliffe IT. Principal Component Analysis. 2nd ed. Springer; 2002
8. Wichern DW, Johnson RA. Applied Multivariate Statistical Analysis. 6th ed. Pearson Education Limited; 2014
9. Strang G. Linear Algebra and its Applications. 4th ed. Thomson; 2006
10. Larson R, Falvo D. Elementary Linear Algebra. 6th ed. Houghton Mifflin Harcourt Publishing Company; 2009
11. Lay DC, Lay SR, McDonald JJ. Linear Algebra and its Applications. 5th ed. Pearson; 2015
