Open access peer-reviewed chapter

Determining an Adequate Number of Principal Components

Written By

Stanley L. Sclove

Reviewed: 17 March 2022 Published: 05 May 2022

DOI: 10.5772/intechopen.104534

From the Edited Volume

Advances in Principal Component Analysis

Edited by Fausto Pedro García Márquez


The problem of choosing the number of PCs to retain is analyzed in the context of model selection, using so-called model selection criteria (MSCs). For a prespecified set of models, indexed by $k = 1, 2, \ldots, K$, these criteria take the form $\mathrm{MSC}_k = -2\,LL_k + a_n m_k$, where, for model $k$, $LL_k$ is the maximum log likelihood, $m_k$ is the number of independent parameters, and the constant $a_n$ is $a_n = \ln n$ for BIC and $a_n = 2$ for AIC. The maximum log likelihood $LL_k$ is achieved by using the maximum likelihood estimates (MLEs) of the parameters. In Gaussian models, $LL_k$ involves the logarithm of the mean squared error (MSE). The main contribution of this chapter is to show how best to use BIC to choose the number of PCs, and to compare these results to ad hoc procedures that have been used. The findings are stated as they apply to the eigenvalues of the correlation matrix, which are between 0 and $p$ and have an average of 1. With AIC, inclusion of an additional PC, PC $k+1$, is justified if the corresponding eigenvalue $\lambda_{k+1}$ is greater than $\exp(2/n)$. With BIC, inclusion of PC $k+1$ is justified if $\lambda_{k+1} > n^{1/n}$, a threshold that tends to 1 for large $n$. Therefore, this is in approximate agreement with the average eigenvalue rule for correlation matrices, stating that one should retain dimensions with eigenvalues larger than 1.


  • reduction of dimensionality
  • principal components
  • model selection criteria
  • information criteria
  • AIC
  • BIC

1. Introduction and background

1.1 Introduction

Sometimes, researchers know how many principal components (PCs) they need. For example, an optimal two-dimensional scatterplot is constructed from the scores of the sample on the first two principal components, and an optimal three-dimensional scatterplot from the scores on the first three. In many applications, however, researchers will ask how many principal components they need. This chapter discusses the application of various methods to the problem of reduction of dimensionality, in the sense of choosing an adequate number of principal components to retain to represent a dataset. The methods discussed include ad hoc methods, likelihood-based methods, and model selection criteria (MSCs), especially Akaike’s information criterion (AIC) and the Bayesian information criterion (BIC). This chapter applies the concepts of [1, 2] to this particular problem.

1.2 Background

To begin the discussion here, we first give a short review of some general background on the relevant portions of multivariate statistical analysis, which may be obtained from textbooks such as [3] or [4].

1.3 Sample quantities

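Let $x_1, x_2, \ldots, x_n$ denote a sample of $n$ $p$-dimensional random vectors

$$x_i = (x_{1i}, x_{2i}, \ldots, x_{pi})', \quad i = 1, 2, \ldots, n.$$

Here, the transpose (′) means that the vectors are being considered as column vectors. The sample mean vector is

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

The $p \times p$ sample covariance matrix is denoted by $S$.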


1.4 Population quantities

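The sample covariance matrix $S$ estimates the true covariance matrix $\Sigma$ of the random variables

$$X = (X_1, X_2, \ldots, X_p)'.$$

The true covariance matrix is

$$\Sigma = [\sigma_{uv}], \quad u, v = 1, 2, \ldots, p,$$

where

$$\sigma_{uv} = C(X_u, X_v)$$

is the covariance of $X_u$ and $X_v$, for $u \neq v$, $u, v = 1, 2, \ldots, p$. For $u = v$, we have $C(X_v, X_v) = V(X_v)$, the variance of $X_v$.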

1.5 Principal components

The principal components of $\Sigma$ are defined as uncorrelated linear combinations of maximal variance. Let us elaborate on this brief definition. First, a linear combination, say LC, of the $p$ variables can be expressed as the vector product $a'x$ of two vectors $a$ and $x$, that is,

$$\mathrm{LC} = a'x = a_1 x_1 + a_2 x_2 + \cdots + a_p x_p.$$

Here, the vector $a$ is a vector of scalars $a_1, a_2, \ldots, a_p$:

$$a = (a_1, a_2, \ldots, a_p)'.$$

These $a_j$ are the coefficients in the linear combination. Such linear combinations are called variates. Principal components are also called latent variables.

The variance $V$ of a linear combination LC is

$$V(\mathrm{LC}) = V(a'X) = a'\Sigma a.$$

This is estimated as $a'Sa$, which is to be maximized over $a$. The derivative with respect to the vector $a$ is

$$\frac{\partial\,(a'Sa)}{\partial a} = 2Sa.$$

Setting this derivative equal to zero does not give a unique solution: if $a$ is a solution to this set of equations, so is $ca$, where $c$ is any scalar constant. Therefore, a constraint is required to obtain a meaningful solution. A reasonable such constraint is the condition $a'a = 1$, that is, the squared length of the vector $a$ equals 1. This is of course equivalent to the length of $a$, the quantity $\sqrt{a'a}$, being equal to 1.

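A function incorporating the constraint, the Lagrangian function, is

$$L(a, \lambda) = a'Sa - \lambda\,(a'a - 1).$$

The partial derivatives of the function $L$ with respect to $a$ and $\lambda$ are

$$\frac{\partial L}{\partial a} = 2Sa - 2\lambda a$$

and

$$\frac{\partial L}{\partial \lambda} = -(a'a - 1).$$

Setting these partial derivatives equal to zero gives the simultaneous linear equations

$$Sa = \lambda a$$

and the equation

$$a'a = 1.$$

The simultaneous linear equations can be written as

$$Sa - \lambda a = 0,$$

where $0$ is the zero vector, the vector whose elements are all zeroes. Factoring out $a$ on the right, we obtain

$$(S - \lambda I)\,a = 0.$$

For nontrivial solutions, the determinant of the coefficient matrix $S - \lambda I$ must be zero, that is, we must have $\det(S - \lambda I) = 0$. This condition is a polynomial equation of degree $p$ in $\lambda$. Denote the $p$ roots by $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$. These roots are the eigenvalues (also called latent values). Their sum is the trace of $S$; their product is the determinant of $S$.

The corresponding eigen equations are

$$S a_j = \lambda_j a_j, \quad j = 1, 2, \ldots, p.$$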


1.5.1 Values of PCs in terms of Xs

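The $j$th principal component (PC), $C_j$, is the linear combination of the form

$$C_j = a_j'X = a_{1j} X_1 + a_{2j} X_2 + \cdots + a_{pj} X_p,$$

where $a_j = (a_{1j}, a_{2j}, \ldots, a_{pj})'$. That is to say, for $j = 1, 2, \ldots, p$, the value of the $j$th PC for Individual $i$ is $c_{ji} = a_j'x_i$, $i = 1, 2, \ldots, n$.

The equation for the $j$th PC in terms of the vector $x = (x_1, x_2, \ldots, x_p)'$ is $c_j = a_j'x$, $j = 1, 2, \ldots, p$. Let $c$ be the $p$-vector of values of the $p$ PCs. Then, $c = A'x$, where $A = [a_1\ a_2\ \cdots\ a_p]$ is the $p \times p$ matrix whose columns are the eigenvectors.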
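As a minimal sketch of the eigensystem and the PC scores described above, the following NumPy code checks the eigen equations, the trace and determinant identities, and the relation $c = A'x$ on synthetic data. The data, the seed, and all variable names are illustrative assumptions; nothing here comes from the chapter's own computations.

```python
import numpy as np

rng = np.random.default_rng(0)             # fixed seed, for reproducibility only
n, p = 200, 4                              # illustrative sample size and dimension
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # synthetic correlated data

S = np.cov(X, rowvar=False)                # p x p sample covariance matrix

# eigh is intended for symmetric matrices and returns eigenvalues in
# ascending order, so reorder to get lambda_1 >= ... >= lambda_p.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]
A = eigvecs[:, order]                      # columns are the eigenvectors a_j

# Eigen equations S a_j = lambda_j a_j, and orthonormality A'A = I.
eigen_ok = np.allclose(S @ A, A * lam)
ortho_ok = np.allclose(A.T @ A, np.eye(p))

# Sum of the eigenvalues = trace(S); product = det(S).
trace_ok = np.isclose(lam.sum(), np.trace(S))
det_ok = np.isclose(lam.prod(), np.linalg.det(S))

# PC scores: row i of C holds c_i' = (A' x_i)', i.e. C = X A.
C = X @ A

# The sample variance of the j-th PC equals the j-th eigenvalue.
var_ok = np.allclose(C.var(axis=0, ddof=1), lam)

print(eigen_ok, ortho_ok, trace_ok, det_ok, var_ok)
```

Using `eigh` rather than a general eigen-solver is a design choice suited to symmetric matrices such as $S$; the sign of each eigenvector remains arbitrary.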

1.5.2 Values of Xs in terms of PCs

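The inverse relation is

$$x = (A')^{-1} c,$$

that is,

$$x = Bc,$$

where $B$ is the matrix of loadings of the $X_v$ on the PCs $C_j$. Actually, $A$ is an orthonormal matrix (meaning that its columns are of length one and are pairwise orthogonal), so $A^{-1} = A'$. Thus, $B = (A')^{-1} = A$. Therefore,

$$x = Ac.$$

Letting $a_{(v)}$ be the $v$th row of the matrix $A$, that is,

$$a_{(v)} = (a_{v1}, a_{v2}, \ldots, a_{vp}),$$

we have, for $v = 1, 2, \ldots, p$, the representation of each variable $X_v$ in terms of the variables $C_1, C_2, \ldots, C_p$ that are the principal components,

$$X_v = a_{v1} C_1 + a_{v2} C_2 + \cdots + a_{vp} C_p.$$

In terms of the first $k$ PCs, this is

$$X_v = a_{v1} C_1 + a_{v2} C_2 + \cdots + a_{vk} C_k + \varepsilon_v, \qquad (*)$$

where the error $\varepsilon_v$ is

$$\varepsilon_v = a_{v,k+1} C_{k+1} + \cdots + a_{vp} C_p.$$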


The covariance matrix can be represented in terms of its principal idempotents $a_j a_j'$ as

$$S = \sum_{j=1}^{p} \lambda_j a_j a_j'.$$

It follows as a result of this representation that the best approximation of rank $k$ to $S$ is the eigenvalue-weighted sum of the first $k$ principal idempotents,

$$\sum_{j=1}^{k} \lambda_j a_j a_j'.$$

The weights are all non-negative, recalling that, for a symmetric positive semidefinite matrix, such as a covariance matrix, the eigenvalues are non-negative.
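The representation of the covariance matrix by its principal idempotents can be checked numerically. The sketch below, on synthetic data with hypothetical names (`rank_k_approx` is not from the chapter), rebuilds $S$ from all $p$ terms and forms the rank-$k$ approximation from the first $k$. It also uses the standard fact that the Frobenius-norm error of that approximation is the square root of the sum of squares of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)             # fixed seed, for reproducibility only
X = rng.normal(size=(150, 5)) @ rng.normal(size=(5, 5))   # synthetic data
S = np.cov(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(S)       # ascending order
lam, A = eigvals[::-1], eigvecs[:, ::-1]   # reorder to decreasing eigenvalues
p = len(lam)

def rank_k_approx(lam, A, k):
    """Eigenvalue-weighted sum of the first k principal idempotents a_j a_j'."""
    out = np.zeros((A.shape[0], A.shape[0]))
    for j in range(k):
        out += lam[j] * np.outer(A[:, j], A[:, j])
    return out

# All p terms reproduce S exactly (up to rounding).
full_ok = np.allclose(rank_k_approx(lam, A, p), S)

# Rank-k approximation and its Frobenius-norm error.
k = 2
S_k = rank_k_approx(lam, A, k)
err = np.linalg.norm(S - S_k, ord="fro")
err_expected = np.sqrt(np.sum(lam[k:] ** 2))

print(full_ok, np.isclose(err, err_expected))
```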


2. Some ad hoc arithmetic procedures for determining an appropriate number of PCs

2.1 Procedure based on the average of the eigenvalues

The mean $\bar{\lambda}$ of the eigenvalues is their sum divided by their number,

$$\bar{\lambda} = \frac{1}{p} \sum_{j=1}^{p} \lambda_j.$$


The sum of the eigenvalues turns out to be equal to the trace of the covariance matrix; therefore, the mean eigenvalue is equal to the trace divided by p.

One procedure for deciding on the number of PCs to retain is to retain those for which the eigenvalues are greater than average, that is, greater than $\bar{\lambda}$. When working in terms of the correlation matrix, this average value is 1. To see this, recall that the correlation matrix is a special case of the covariance matrix, namely, the correlation matrix is the covariance matrix of the standardized variables. It is often preferable to work in terms of the correlation matrix rather than the covariance matrix, to control the effects of different units of measurement and different variances. If a variable has high variance relative to the other variables, the PC will be pulled in the direction of the variable with large variance.

When $S$ is taken to be the sample correlation matrix, the trace of the matrix is simply $p$, and therefore, the mean $\bar{\lambda}$ of the eigenvalues is 1.

2.2 An ad hoc arithmetic procedure based on retaining a prescribed proportion of the total variance

Another ad hoc procedure is to retain a number of PCs sufficient to account for a prescribed proportion, say, 90% of the total variance, that total variance being $\operatorname{trace} S = \sum_{j=1}^{p} \lambda_j$. The figure of 90% is of course somewhat arbitrary, so it might be good to have somewhat more objective criteria based on the pattern of the eigenvalues.
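The two ad hoc rules so far can be sketched in a few lines. The eigenvalues used are those reported for the heart data example in Section 4 of this chapter; the function name `n_components_for_proportion` and the 90% default are illustrative assumptions.

```python
import numpy as np

# Eigenvalues of the correlation matrix reported for the heart data (Section 4).
lam = np.array([2.1894, 1.5382, 0.6617, 0.4485, 0.1621])

def n_components_for_proportion(eigenvalues, target=0.90):
    """Smallest k whose first k eigenvalues account for at least `target` of the total."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cum_prop = np.cumsum(ev) / ev.sum()
    # searchsorted gives the first index at which cum_prop >= target.
    k = int(np.searchsorted(cum_prop, target)) + 1
    return min(k, len(ev)), cum_prop

k_90, cum_prop = n_components_for_proportion(lam, target=0.90)

# Average-eigenvalue rule: count eigenvalues above their mean.
k_avg = int(np.sum(lam > lam.mean()))

print("cumulative proportions:", np.round(cum_prop, 3))
print("PCs needed for 90% of total variance:", k_90)   # 4 for these eigenvalues
print("PCs with above-average eigenvalue:", k_avg)     # 2 for these eigenvalues
```

For these eigenvalues the two rules disagree (four PCs versus two), which illustrates how much the answer can depend on the arbitrary proportion chosen.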

2.3 Procedure based on the decrease of the eigenvalues

Another procedure, a graphical one, is to plot $\lambda_1, \lambda_2, \ldots, \lambda_p$ against $1, 2, \ldots, p$. The $\lambda$s are in decreasing order, so one then looks for a dropoff (an elbow) in the curve and retains a number of PCs corresponding to the point before the leveling off of the curve, if it does indeed take an elbow shape. Such a plot, of the eigenvalues versus $1, 2, \ldots, p$, is called a scree plot, “scree” being the debris at the foot of a glacier (or, more generally, a collection of broken rock fragments at the base of crags, mountain cliffs, volcanoes, or valley shoulders).


3. Model selection criteria AIC and BIC for the number of PCs

Let us now delve a bit further into mathematical statistics and consider some more objective, numerical criteria, in particular, the information criteria AIC and BIC. Let us see what a Gaussian model would imply about AIC and BIC. The maximized likelihood for the model (*) approximating the $p$ variables in terms of $k$ PCs is $(2\pi)^{-np/2}\,|\hat{\Sigma}_k|^{-n/2}\,C(n, p, k)$, where $C(n, p, k)$ is a constant depending upon the sample size, $n$, the number of variables, $p$, and $k$, the Model $k$ being considered, $k = 1, 2, \ldots, K$, and $|\hat{\Sigma}_k|$ denotes the determinant of the residual covariance matrix $\hat{\Sigma}_k$.

The determinant of the covariance matrix is the product of the eigenvalues,

$$|S| = \prod_{j=1}^{p} \lambda_j.$$

For a model based on the first $k$ PCs, the determinant of the residual covariance matrix is the product of the remaining, smaller eigenvalues, $\prod_{j=k+1}^{p} \lambda_j$.

The model selection criterion AIC (Akaike’s information criterion [5, 6, 7]) is based on an estimate of the logarithm of the cross-entropy of the $K$ proposed models with a null model. That is, for alternative models indexed by $k = 1, 2, \ldots, K$, $\mathrm{AIC}_k$ is an estimate of the log cross-entropy of the proposed Model $k$ with the null model. The cross-entropy of the distribution with the probability density function $q(x)$ relative to a distribution with the probability density function $p(x)$ is defined as $H(p, q) = E_p[-\ln q(X)] = -\int \ln q(x)\, p(x)\, dx$.

The Bayesian information criterion (BIC) [8] is based on a large-sample estimate of the posterior probability $pp_k$ of Model $k$, $k = 1, 2, \ldots, K$. More precisely, $\mathrm{BIC}_k$ is an approximation to $-2 \ln pp_k$.

Formulated in this way, these model selection criteria (MSCs) are, thus, smaller-is-better criteria and take the form

$$\mathrm{MSC}_k = -2 \ln \max L_k + a_n m_k,$$

where $L_k$ is the likelihood for Model $k$, $a_n = \ln n$ for $\mathrm{BIC}_k$, $a_n = 2$ (not depending upon $n$) for $\mathrm{AIC}_k$, and $m_k$ is the number of independent parameters in Model $k$. The first term is a lack-of-fit (LOF) term, and the second term is a penalty term based on the number of parameters used. With AIC, the penalty is two units per parameter; with BIC, the penalty is $\ln n$ units per parameter. For $n \geq 8$, $\ln n$ exceeds 2: for sample sizes greater than 7, the penalty per parameter with BIC exceeds that for AIC. Therefore, relative to AIC, BIC tends to favor more parsimonious models, that is, models with a smaller number of parameters.

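Note that

$$pp_k \approx C \exp(-\mathrm{BIC}_k / 2),$$

where $C$ is a constant. Thus, BIC values can be converted to values on a scale of 0–1. This is done by exponentiating $-\mathrm{BIC}_k/2$, summing the values, and dividing by the sum. That is,

$$pp_k \approx \frac{\exp(-\mathrm{BIC}_k / 2)}{\sum_{j=1}^{K} \exp(-\mathrm{BIC}_j / 2)}.$$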
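The conversion of BIC values to a 0–1 scale described above can be sketched as follows. The function name `bic_to_posterior` and the three BIC values are hypothetical. Subtracting the minimum BIC before exponentiating is an added numerical safeguard, not part of the chapter's description; it does not change the result because the common factor cancels in the ratio.

```python
import numpy as np

def bic_to_posterior(bic):
    """Approximate posterior model probabilities from BIC values.

    Computes exp(-BIC_k / 2) / sum_j exp(-BIC_j / 2). The minimum BIC is
    subtracted first to avoid underflow; the common factor cancels.
    """
    bic = np.asarray(bic, dtype=float)
    w = np.exp(-(bic - bic.min()) / 2.0)
    return w / w.sum()

# Hypothetical BIC values for K = 3 candidate models (illustration only).
bic_values = [210.3, 204.1, 206.8]
pp = bic_to_posterior(bic_values)

print(np.round(pp, 3))        # the model with the smallest BIC gets the largest value
print(pp.sum())               # the values sum to 1
```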


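To relate the maximum likelihood to the eigenvalues, note that for the PC model,

$$-2\,LL_k = \text{Const.} + n \ln \prod_{j=k+1}^{p} \lambda_j = \text{Const.} + n \sum_{j=k+1}^{p} \ln \lambda_j.$$

The model selection criteria can be written as

$$\mathrm{MSC}_k = \mathrm{Deviance}_k + \mathrm{Penalty}_k,$$

where $\mathrm{Deviance}_k = -2 \ln \max L_k$ is a measure of lack of fit and $\mathrm{Penalty}_k = a_n m_k$. Inclusion of an additional PC is justified if the criterion value decreases, that is, if $\mathrm{MSC}_{k+1} < \mathrm{MSC}_k$. For PCs, this is

$$n \sum_{j=k+2}^{p} \ln \lambda_j + a_n m_{k+1} < n \sum_{j=k+1}^{p} \ln \lambda_j + a_n m_k.$$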


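This is

$$a_n (m_{k+1} - m_k) < n \ln \lambda_{k+1}.$$

With one additional parameter for the additional PC, $m_{k+1} - m_k = 1$, this becomes

$$\ln \lambda_{k+1} > a_n / n,$$

that is,

$$\lambda_{k+1} > \exp(a_n / n).$$

Thus, for AIC, the inclusion of the additional PC $k+1$ is justified if $\lambda_{k+1}$ is greater than $\exp(2/n)$.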

For BIC, the inclusion of an additional PC $k+1$ is justified if

$$\lambda_{k+1} > \exp\!\left(\frac{\ln n}{n}\right) = n^{1/n}.$$

The quantity $n^{1/n}$ tends to 1 for large $n$. Therefore, this procedure is in approximate agreement with the average eigenvalue rule for correlation matrices, stating that one should retain dimensions with eigenvalues larger than 1.
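To see how close the two thresholds are to 1, the sketch below tabulates $\exp(2/n)$ for AIC and $n^{1/n}$ for BIC over a few sample sizes. The function names and the chosen values of `n` are illustrative assumptions.

```python
import math

def aic_threshold(n):
    """Eigenvalue threshold for AIC: exp(a_n / n) with a_n = 2."""
    return math.exp(2.0 / n)

def bic_threshold(n):
    """Eigenvalue threshold for BIC: exp(ln(n) / n) = n ** (1 / n)."""
    return n ** (1.0 / n)

def retain_next_pc(lam_next, n, criterion="BIC"):
    """True if an additional PC with eigenvalue lam_next lowers the criterion."""
    threshold = bic_threshold(n) if criterion == "BIC" else aic_threshold(n)
    return lam_next > threshold

for n in (8, 30, 100, 1000, 10000):
    print(f"n={n:6d}  AIC threshold={aic_threshold(n):.4f}  "
          f"BIC threshold={bic_threshold(n):.4f}")
```

Both thresholds decrease toward 1 as `n` grows, and for `n >= 8` the BIC threshold is the larger of the two, in line with the heavier penalty per parameter.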


4. Examples

4.1 An artificial example

The synthesis/analysis paradigm can be useful for understanding a problem. This means synthesizing (simulating) a dataset, so that you know the model and parameter values, and then applying your analysis method to see how well it performs. In the present context, it is interesting to simulate a dataset of measurements of rectangles, with variables length (L) and width (W) and also some functions of those, such as perimeter = 2L + 2W and difference = L − W. In one synthesis, we took L to be Normal with a mean of 10 and a variance of 1, W to be Normal with a mean of 10 and a variance of 1, PERI = 2L + 2W plus N(0,1) error, and DIFF = L − W plus N(0,1) error. The eigensystem was computed, and, as expected, there are two large eigenvalues, with the subsequent ones dropping off sharply and being close to zero. The eigenvalues of the correlation matrix were 1.91, 1.83, 0.21, and 0.05.
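A synthesis of this kind can be sketched as follows. The sample size `n = 100`, the seed, and the independence of L and W are assumptions made for illustration (the text does not state them), so the eigenvalues printed will resemble, but not reproduce, the values quoted above.

```python
import numpy as np

rng = np.random.default_rng(2022)          # arbitrary seed, for reproducibility only
n = 100                                    # assumed sample size (not given in the text)

# Length and width: Normal, mean 10, variance 1 (scale is the standard deviation).
L = rng.normal(10.0, 1.0, size=n)
W = rng.normal(10.0, 1.0, size=n)

# Derived variables, each with independent N(0, 1) error added.
PERI = 2 * L + 2 * W + rng.normal(0.0, 1.0, size=n)
DIFF = L - W + rng.normal(0.0, 1.0, size=n)

data = np.column_stack([L, W, PERI, DIFF])
R = np.corrcoef(data, rowvar=False)        # 4 x 4 sample correlation matrix

# Eigenvalues of the correlation matrix, largest first.
lam = np.sort(np.linalg.eigvalsh(R))[::-1]

print(np.round(lam, 2))
print("sum of eigenvalues:", round(float(lam.sum()), 6))   # equals p = 4
```

Under these assumptions the population correlation matrix has two eigenvalues well above 1 and two well below, so the sample pattern of two large and two small eigenvalues is expected.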

4.2 A real example

Next, we consider the principal component analysis of a sample from the Los Angeles (LA) Heart Study, a long-term study (1947–1972) among civil servants of Los Angeles County. A total of 2252 randomly selected LA civil servants, ages 21–70, received a battery of examinations for “routine” cardiovascular disease (CVD) risk factors.

The variables include age, systolic blood pressure (SYS), diastolic blood pressure (DIAS), weight (WT), height (HT), and coronary incident, a binary variable indicating whether the individual had a coronary incident during the course of the study. Blood pressure is reported as a bivariate variable, (SYS, DIAS). SYS is the pressure when the heart pumps, and DIAS is the pressure when the heart relaxes.

In the textbook [9], data for a sample of $n = 100$ men were studied. (Data on the same variables for another sample of 100 men are also given in [9]. Results can be compared and contrasted between the two samples.) Although, of course, the emphasis in the Heart Study was on explaining and predicting the coronary incident variable, here, we focus on the first five variables, their representation in terms of a smaller number of PCs, and the interpretations of the PCs. We did the PC analysis ourselves; it was not part of the LA Heart Study or the textbook.

We used Minitab statistical software for the analysis. Aspects of the analysis are shown as follows.

The lower-triangular portion of the correlation matrix for the five variables is shown in Table 1. The highest correlation is 0.835, between SYS and DIAS. The next highest correlation, 0.426, is between HT and WT.

Variable | AGE | SYS | DIAS | WT | Note
DIAS | 0.354 | 0.835 | | | highest r, 0.835, is between SYS and DIAS
HT | −0.332 | −0.088 | −0.099 | 0.426 | next highest r, 0.426, is between HT and WT

Table 1.

Correlation matrix of five variables—LA heart data.

Correlations: AGE, SYS, DIAS, WT, HT.

Cell Contents: Pearson correlation.

4.3 Principal component analysis in the example

Note that an eigenvector can be multiplied by −1, changing the signs of all its elements. In the following, this is done with PC1 so that SYS and DIAS have positive loadings. Our interpretations, related to the scientific/medical context of the study, are BPtotal, SIZE, AGE, OVERWT, and BPdiff and are written below the eigenvectors. The interpretations are based on which loadings are large and which are small, that is, on the relative sizes of the loadings. Taking 0.6 as a cutoff point, in PC1, SYS and DIAS have loadings above this, while the other variables have loadings less than this (in fact, less than 0.4), so PC1 can be interpreted as an index of total BP. In PC2, the variables WT and HT have large loadings with the same sign, so PC2 can be interpreted as SIZE (Tables 2 and 3).

Eigenanalysis of the correlation matrix

Table 2.

PCs of heart data.

Principal component analysis: AGE, SYS, DIAS, WT, HT.

Interpretations (edited in to the computer output):

Table 3.

PC1 is multiplied by −1.

As above, denote the eigensystem in terms of the eigenpairs

$$(\lambda_j, a_j), \quad j = 1, 2, \ldots, p.$$

Then, the eigensystem equations are

$$S a_j = \lambda_j a_j, \quad j = 1, 2, \ldots, p.$$

Here, $S$ is taken to be the correlation matrix. Let $1_v = (0, \ldots, 0, 1, 0, \ldots, 0)'$, the vector with 1 in the $v$th position and zeroes elsewhere. The covariance between a variable $X_v$ and a PC $C_u$ is $C(X_v, C_u) = C(1_v'X, a_u'X) = 1_v'\Sigma a_u = 1_v'\lambda_u a_u = \lambda_u a_{uv}$, where $a_{uv}$ is the $v$th element of the vector $a_u$. The coefficient of correlation is $\mathrm{Corr}(X_v, C_u) = C(X_v, C_u) / [SD(X_v)\,SD(C_u)] = \lambda_u a_{uv} / (\sigma_v \sqrt{\lambda_u}) = \sqrt{\lambda_u}\, a_{uv} / \sigma_v$. When the covariance matrix used is the correlation matrix, each standard deviation $\sigma_v = 1$, and therefore, this correlation is $\sqrt{\lambda_u}\, a_{uv}$. A correlation of size greater than 0.6 corresponds to more than $0.6^2 \times 100\% = 36\%$ of variance explained. The variable $X_v$ has a correlation higher than 0.6 with the component $C_u$ if its loading in $C_u$, the value $a_{uv}$, is greater than $0.6 / \sqrt{\lambda_u}$. These values are appended to Table 4. Loadings larger than this cutoff value are in boldface. (The cutoff point of 0.6 is somewhat arbitrary; one might use, for example, a cutoff of 0.5.)

Component | PC1 | PC2 | PC3 | PC4 | PC5
Eigenvalue, λ | 2.1894 | 1.5382 | 0.6617 | 0.4485 | 0.1621
Square root, √λ | 1.48 | 1.24 | 0.81 | 0.67 | 0.40
0.6/√λ | 0.40 | 0.48 | 0.74 | 0.90 | 1.50

Table 4.

Loadings corresponding to correlations >0.6 are boldface.
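The cutoff values $0.6/\sqrt{\lambda_u}$ can be recomputed from the reported eigenvalues, as in the sketch below. Because it works from unrounded square roots, its values agree with the tabulated ones only up to rounding in the second decimal. The helper `correlations_from_loadings` and the 2×2 demonstration matrix (built from the SYS and DIAS correlation of 0.835) are illustrative assumptions, since the full loading matrix is not reproduced here.

```python
import numpy as np

# Eigenvalues of the correlation matrix reported for the heart data.
lam = np.array([2.1894, 1.5382, 0.6617, 0.4485, 0.1621])
r_cut = 0.6                                # correlation cutoff used in the text

sqrt_lam = np.sqrt(lam)
loading_cutoff = r_cut / sqrt_lam          # loading needed for |corr| > r_cut

print(np.round(sqrt_lam, 2))
print(np.round(loading_cutoff, 2))

def correlations_from_loadings(A, lam):
    """Corr(X_v, C_u) = sqrt(lambda_u) * a_uv for a correlation matrix.

    A has the eigenvectors as columns, so entry [v, u] of the result is the
    correlation between variable v and component u.
    """
    return A * np.sqrt(lam)

# Small demonstration on a 2 x 2 correlation matrix with r = 0.835.
R2 = np.array([[1.0, 0.835], [0.835, 1.0]])
vals, vecs = np.linalg.eigh(R2)
lam2, A2 = vals[::-1], vecs[:, ::-1]       # decreasing eigenvalue order
corr2 = correlations_from_loadings(A2, lam2)

# For each variable, squared correlations with all PCs sum to 1.
print(np.round((corr2 ** 2).sum(axis=1), 6))
```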

One can also focus on the pattern of loadings within the different PCs for the interpretation of the PCs. To reiterate this process and the interpretations, we have the following:

PC1: SYS and DIAS have large loadings with the same sign; we interpret PC1 as BPindex, or BPtotal.

PC2: WT and HT have large loadings with the same sign; we interpret PC2 as the man’s SIZE.

PC3: Only AGE has a large loading, so we interpret PC3 simply as AGE.

PC4: WT and HT have large loadings with opposite signs; we interpret PC4 as OVERWEIGHT.

PC5: SYS and DIAS have large loadings with opposite signs; we interpret PC5 as BPdrop.

We continue to marvel at how readily interpretable the PCs are. This simplicity is attained even without using a factor analysis model and using rotation to simplify the pattern of the loadings.

4.4 Employing the criteria in the example

To compare and contrast the methods, Table 5 shows the eigenvalues and the results according to the various criteria for deciding on the adequate number of PCs. According to the rule based on the average eigenvalue, the dimension is retained if its eigenvalue is greater than 1 (when working in terms of the correlation matrix). For BIC, the $k$th PC is retained if

$$n \ln \lambda_k > a_n,$$

where $a_n = \ln n$. Here, $n = 100$ and $\ln n = \ln 100$, approximately 4.61. For AIC, the $k$th PC is retained if $n \ln \lambda_k > 2$. In this example, the methods agree on retaining $k = 2$ PCs.

No. of PCs, k | λk | λk > 1? | ln λk | n ln λk | for BIC: n ln λk > 4.61? | for AIC: n ln λk > 2?

Table 5.

Estimating the number of PCs by various methods.
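The comparison of the three rules can be recomputed from the eigenvalues reported for the heart data and $n = 100$. The sketch below is an illustration under that input only; the helper `n_retained`, which counts leading PCs until a rule first fails, is a hypothetical name.

```python
import numpy as np

# Eigenvalues of the correlation matrix reported for the heart data; n = 100.
lam = np.array([2.1894, 1.5382, 0.6617, 0.4485, 0.1621])
n = 100

n_ln_lam = n * np.log(lam)
keep = {
    "average eigenvalue (lambda > 1)": lam > 1.0,
    "BIC (n ln lambda > ln n)": n_ln_lam > np.log(n),
    "AIC (n ln lambda > 2)": n_ln_lam > 2.0,
}

def n_retained(flags):
    """Number of leading PCs retained, stopping at the first PC that fails the rule."""
    count = 0
    for flag in flags:
        if not flag:
            break
        count += 1
    return count

# One row per PC: k, eigenvalue, ln(eigenvalue), n * ln(eigenvalue).
for k, (l, nl) in enumerate(zip(lam, n_ln_lam), start=1):
    print(f"k={k}  lambda={l:.4f}  ln lambda={np.log(l):8.4f}  n ln lambda={nl:9.2f}")

results = {rule: n_retained(flags) for rule, flags in keep.items()}
print(results)                             # each rule retains 2 PCs
```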

We should remark that, although two PCs are suggested, the fourth and fifth PCs do have simple and interesting interpretations. It is just that they do not improve the fit very much. The third PC is essentially a single variable, age.


5. Discussion

The focus here has been on determining the number of dimensions needed to represent a complex of variables adequately. The algebraic solution devolves upon the analysis of properties of the covariance matrix of the variables, especially through its eigensystem.

5.1 Regression on principal components

Next, we consider applying principal component analysis in the context of multiple regression. In this context, there is, of course, a response variable $Y$ and explanatory variables $X_1, X_2, \ldots, X_p$. One may transform the Xs to their principal components, as this may aid in the interpretation of the results of the regression. In addition, the number of significant regression coefficients may be decreased. In such regression on principal components (see, e.g., [10]), however, one should not necessarily eliminate the principal components with small eigenvalues, as they may still be strongly related to the response variable.

The value of the Bayesian information criterion for Model $k$ is

$$\mathrm{BIC}_k = -2\,LL_k + m_k \ln n,$$

for alternative models indexed by $k = 1, 2, \ldots, K$, where $LL_k$ is the maximum log likelihood for Model $k$, that is, $LL_k = \max \ln L_k$, and $m_k$ is the number of independent parameters in Model $k$. For linear regression models with Gaussian-distributed errors, $-2\,LL_k = \text{Const.} + n \ln \mathrm{MSE}_k$, and so BIC takes the form

$$\mathrm{BIC}_k = \text{Const.} + n \ln \mathrm{MSE}_k + m_k \ln n,$$

where $\mathrm{MSE}_k$ is the mean squared error of Model $k$, here the maximum likelihood estimate (MLE), with divisor $n$, of the error variance.

The total number of subsets of $p$ things is $2^p$. Therefore, with $p$ explanatory variables, there are $2^p$ alternative models, called “subset regressions” (including the model where no explanatory variables are used and the fitted value of $Y$ is simply $\bar{y}$). For example, if there are three Xs, the eight subsets are $X_1$ alone, $X_2$ alone, $X_3$ alone, $(X_1, X_2)$, $(X_1, X_3)$, $(X_2, X_3)$, $(X_1, X_2, X_3)$, and the empty set. It would usually seem to be expedient to evaluate all $2^p$ regression models, that is, regressions on all $2^p$ subsets of principal components, using adjusted R-square, AIC, and/or BIC, rather than reducing the number of models considered by regressing on only a few principal components. That is, in the context of regression on principal components, it is probably wise not to reduce the number of principal components, for, as stated above, it is conceivable that some principal components with small eigenvalues may nevertheless be important in explaining and predicting the response variable.
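The all-subsets idea can be sketched on synthetic data. In the illustration below, the response is constructed to depend only on the PC with the smallest eigenvalue, so that a rule of discarding small-eigenvalue PCs would miss it. The data, the seed, the coefficients, and the helper `bic_for_subset` are assumptions for illustration; the parameter count `m` includes only the regression coefficients, which shifts every model's BIC by the same constant relative to also counting the error variance.

```python
import itertools
import numpy as np

rng = np.random.default_rng(7)             # fixed seed, for reproducibility only
n, p = 200, 3

# Synthetic explanatory variables with very unequal variances.
X = rng.normal(size=(n, p)) * np.array([3.0, 1.0, 0.3])
Xc = X - X.mean(axis=0)

# Principal components of the sample covariance matrix, largest eigenvalue first.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
lam, A = eigvals[::-1], eigvecs[:, ::-1]
C = Xc @ A                                 # column j holds the scores on PC j+1

# By construction, the response depends only on the last (smallest-variance) PC.
y = 5.0 + 4.0 * C[:, p - 1] + rng.normal(scale=0.5, size=n)

def bic_for_subset(cols):
    """BIC (up to a constant) of the regression of y on the PCs indexed by cols."""
    Z = np.column_stack([np.ones(n)] + [C[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    mse = np.mean(resid ** 2)              # MLE of the error variance (divisor n)
    m = Z.shape[1]                         # intercept plus one coefficient per PC
    return n * np.log(mse) + m * np.log(n)

# All 2**p subsets of the PCs, including the empty set (intercept only).
subsets = [cols for r in range(p + 1)
           for cols in itertools.combinations(range(p), r)]
bic = {cols: bic_for_subset(cols) for cols in subsets}
best = min(bic, key=bic.get)

for cols in subsets:
    print(cols, round(bic[cols], 2))
print("subset with the smallest BIC (0-based PC indices):", best)
```

Enumerating all subsets is feasible only for small `p`, since the number of models doubles with each additional PC.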

5.2 Some related recent literature

Other researchers have considered the problem of the choice of the number of principal components. For example, Bai et al. [11] examined the asymptotic consistency of AIC and BIC for determining the number of significant principal components in high-dimensional problems. The focus in this chapter has not necessarily been on high-dimensional problems.

Some applications from the recent literature that involve choosing the number of principal components are the following. The method presented here could possibly be applied in them.

For example, a good book on the topic of model selection and testing, covering many aspects, is [12]. In recent years, various econometricians have examined the problems of diagnostic testing, specification testing, semiparametric estimation, and model selection. In addition, various researchers have considered whether to use model testing and model selection procedures to decide upon the models that best fit a particular dataset. This book explores both issues with application to various regression models, including models for arbitrage pricing theory. Along the lines of model selection criteria, the book references, e.g., [8], the foundational paper for BIC.

Next, we mention some recent papers, which show applications of model selection in various research areas.

One such paper is [13], an application of principal component analysis and other methods to water quality assessment in a lake basin in China.

Another is [14], on feature selection for classification using principal component analysis.

As mentioned, a particularly interesting application of principal component analysis is in regression and logistic regression. We have mentioned the paper [10] on using principal component analysis in regression, taking several principal components to replace the set of explanatory variables. Another interesting application is in [15], on using principal components in logistic regression.


6. Conclusions

The problem of choice of the number of principal components to use to represent a complex of variables—a multivariate sample—has been considered in this chapter.

In addition to some ad hoc arithmetic criteria, Akaike’s information criterion (AIC) and the Bayesian information criterion (BIC) have been applied here to the choice of the number of principal components to represent a dataset. The results have been compared and contrasted with ad hoc criteria such as retaining those principal components that explain more than an average amount of the total variance. The use of BIC is seen to correspond rather closely to the rule of retaining PCs whose eigenvalues are larger than average.



There are no further acknowledgements.


Authors’ contributions

Stanley L. Sclove is the sole author.



There was no funding other than the author’s usual salary at the university.


Competing interests

There are no competing interests.

Availability of data and material

The source of data used is a book that is referenced and available.


AIC: Akaike’s information criterion
BIC: Bayesian information criterion
DIAS: diastolic blood pressure
LC: linear combination
LL: maximum log likelihood
MLE: maximum likelihood estimate
MSE: mean squared error
PC: principal component
SYS: systolic blood pressure


  1. Sclove SL. Application of model-selection criteria to some problems in multivariate analysis. Psychometrika. 1987;52:333-343. DOI: 10.1007/BF02294360
  2. Sclove SL. Principal components. In: Darity WA, editor. International Encyclopedia of the Social Sciences. 2nd ed. Detroit, USA: Macmillan Reference
  3. Anderson TW. An Introduction to Multivariate Statistical Analysis. 3rd ed. New York, NY: Wiley; 2002
  4. Johnson RA, Wichern DW. Applied Multivariate Statistical Analysis. 6th ed. Upper Saddle River, NJ: Pearson; 2008
  5. Akaike H. Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csáki F, editors. 2nd International Symposium on Information Theory, Tsahkadsor, Armenia, USSR, September 2-8, 1971. Budapest: Akadémiai Kiadó; 1973. pp. 267-281. Republished in: Kotz S, Johnson NL, editors. Breakthroughs in Statistics, I. Berlin, Germany: Springer-Verlag; 1992. pp. 610-624
  6. Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974;19(6):716-723. DOI: 10.1109/TAC.1974.1100705
  7. Akaike H. Prediction and entropy. In: Atkinson AC, Fienberg SE, editors. A Celebration of Statistics. New York, NY: Springer; 1985. pp. 1-24
  8. Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978;6:461-464
  9. Dixon WJ, Massey FJ Jr. Introduction to Statistical Analysis. 3rd ed. New York: McGraw-Hill; 1969
  10. Massy WF. Principal components regression in exploratory statistical research. Journal of the American Statistical Association. 1965;60(309):234-256. DOI: 10.1080/01621459.1965.10480787
  11. Bai Z, Choi KP, Fujikoshi Y. Consistency of AIC and BIC in estimating the number of significant components in high-dimensional principal component analysis. The Annals of Statistics. 2018;46(3):1050-1076. DOI: 10.1214/17-AOS1577
  12. Bhatti MI, Al-Shanfari H, Hossain MZ. Econometric Analysis of Model Selection and Model Testing. Oxfordshire, England, UK: Routledge; 2017
  13. Xu S, Cui Y, Yang C, Wei S, Dong W, Huang L, et al. The fuzzy comprehensive evaluation (FCE) and the principal component analysis (PCA) model simulation and its applications in water quality assessment of Nansi Lake Basin, China. Environmental Engineering Research. 2021;26(2):222-232
  14. Omuya EO, Okeyo GO, Kimwele MW. Feature selection for classification using principal component analysis and information gain. Expert Systems with Applications. 2021;174:114765
  15. Aguilera AM, Escabias M, Valderrama MJ. Using principal components for estimating logistic regression with high-dimensional multicollinear data. Computational Statistics and Data Analysis. 2006;50(8):1905-1924
