Subset Basis Approximation of Kernel Principal Component Analysis

Principal component analysis (PCA) has been extended in various ways because of its simple definition. In particular, non-linear generalizations of PCA have been proposed and used in many areas. Non-linear generalizations of PCA, such as principal curves (Hastie & Stuetzle, 1989) and principal manifolds (Gorban et al., 2008), have more intuitive explanations and formulations than other non-linear dimensionality-reduction techniques such as ISOMAP (Tenenbaum et al., 2000) and locally-linear embedding (LLE) (Roweis & Saul, 2000).

In the case of KPCA, the optimization problem reduces to an eigenvalue problem of size n. There are efficient techniques for eigenvalue problems, such as the divide-and-conquer eigenvalue algorithm (Demmel, 1997) and the implicitly restarted Arnoldi method (IRAM) (Lehoucq et al., 1998). However, their computational complexity is still too large when n is large, because KPCA does not have a sparse solution. These algorithms require O(n^2) working memory and O(rn^2) computational complexity, where r is the number of principal components. Moreover, all n learning samples have to be stored to evaluate input vectors.
Subset KPCA (SubKPCA) approximates KPCA by using a subset of the samples for its basis and all learning samples for the criterion of the cost function (Washizawa, 2009). The optimization problem for SubKPCA then reduces to a generalized eigenvalue problem whose size is the size of the subset, m. The subset size m controls the trade-off between approximation accuracy and computational complexity. Since all learning samples are utilized for the criterion, the approximation error is small even if m is much smaller than n. The approximation error due to this subset approximation is discussed in this chapter. Moreover, after the construction, only the subset has to be stored to evaluate input vectors.
An illustrative example is shown in Figure 1. Figure 1 (a) shows 1000 artificial two-dimensional samples and contour lines of the norms of vectors transformed onto a one-dimensional subspace by PCA. Figure 1 (b) shows contour curves obtained by KPCA (transformed onto a five-dimensional subspace in F). This is a non-linear analysis; however, it requires solving an eigenvalue problem of size 1000, and, for each input vector, evaluating the kernel function with all 1000 samples. Figure 1 (c) randomly selects 50 samples and obtains KPCA from them. In this case, the size of the eigenvalue problem is only 50, and only 50 kernel function evaluations are required to obtain the transform. However, the contour curves are rather different from (b). Figure 1 (d) shows contour curves of SubKPCA using the 50 samples for its basis and all 1000 samples for the criterion. The contour curves are almost the same as (b). In this case, the size of the eigenvalue problem is also only 50, and the number of kernel function evaluations is also 50.
In this chapter, we denote vectors by bold-italic lower-case symbols x, y, and matrices by bold-italic capital symbols A, B. In kernel methods, F could be an infinite-dimensional space depending on the selection of the kernel function. If vectors could be infinite-dimensional (functions), we denote them by italic lower-case symbols f, g. If either the domain or the range of a linear transform could be an infinite-dimensional space, we denote the transform by an italic capital symbol X, Y. This is summarized as follows: (i) bold symbols, x, A, are always finite; (ii) non-bold symbols, f, X, could be infinite.

Kernel PCA
This section briefly reviews KPCA, and shows some characterizations of KPCA.

Brief review of KPCA
Let x_1, ..., x_n be d-dimensional learning samples, and suppose that their mean is zero or has been subtracted. Standard PCA obtains the eigenvectors of the variance-covariance matrix

Σ = (1/n) ∑_{i=1}^n x_i x_i^⊤.

The ith largest eigenvector then corresponds to the ith principal component. In the case of KPCA, input vectors are mapped to the feature space F by a non-linear map Φ before the operation. Let

S = [Φ(x_1), ..., Φ(x_n)],   Σ_F = S S^*,   K = S^* S,

where •^* denotes the adjoint operator, the constant factor 1/n is omitted since it does not affect the eigenvectors, and K is called the kernel Gram matrix (Schölkopf et al., 1999); the (i, j)-component of K is k(x_i, x_j). The ith largest eigenvector of Σ_F then corresponds to the ith principal component. If the dimension of F is large, the eigenvalue decomposition (EVD) of Σ_F cannot be performed directly. Let {λ_i, u_i} be the ith eigenvalue and corresponding eigenvector of Σ_F, and {λ_i, v_i} be the ith eigenvalue and eigenvector of K. Note that K and Σ_F have the same eigenvalues. Then the ith principal component can be obtained from the ith eigenvalue and eigenvector of K,

u_i = (1/√λ_i) S v_i.

Note that it is difficult to obtain u_i explicitly on a computer because the dimension of F is large. However, the inner product of a mapped input vector Φ(x) and the ith principal component is easily obtained from

⟨Φ(x), u_i⟩ = (1/√λ_i) v_i^⊤ k_x,   k_x = [k(x, x_1), ..., k(x, x_n)]^⊤,

where k_x is an n-dimensional vector called the empirical kernel map.
Let us summarize using matrix notation. Let V_KPCA = [v_1, ..., v_r], Λ = diag(λ_1, ..., λ_r), and U_KPCA = S V_KPCA Λ^{-1/2}. Then the projection and the transform of x onto the r-dimensional eigenspace are

P_KPCA Φ(x) = U_KPCA U_KPCA^* Φ(x)   and   Λ^{-1/2} V_KPCA^⊤ k_x,   (14)

respectively.
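The following is a minimal numerical sketch of the KPCA transform. The Gaussian kernel, its parameter, the toy data, and all variable names are our illustrative assumptions, not taken from the original experiments:

```python
# A minimal KPCA sketch: build the kernel Gram matrix K, take its EVD,
# and evaluate the transform Lambda^{-1/2} V^T k_x for an input x.
import numpy as np

def gauss_kernel(a, b, gamma=0.1):
    # k(a, b) = exp(-gamma * ||a - b||^2)  (assumed kernel)
    return np.exp(-gamma * np.sum((a - b) ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))          # n = 100 learning samples x_1..x_n
r = 5                                      # number of principal components

K = np.array([[gauss_kernel(xi, xj) for xj in X] for xi in X])  # n x n
lam, V = np.linalg.eigh(K)                 # eigh returns ascending eigenvalues
lam, V = lam[::-1][:r], V[:, ::-1][:, :r]  # keep the r largest eigenpairs

def transform(x):
    k_x = np.array([gauss_kernel(x, xi) for xi in X])  # empirical kernel map
    return (V / np.sqrt(lam)).T @ k_x      # Lambda^{-1/2} V^T k_x

print(transform(X[0]).shape)               # (r,)
```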

Characterization of KPCA
There are several characterizations or definitions of PCA (Oja, 1983). SubKPCA is extended from the least mean square (LMS) error criterion,

min_X (1/n) ∑_{i=1}^n ||x_i − X x_i||^2   subject to rank(X) ≤ r.

From this definition, the X that minimizes the averaged distance between x_i and X x_i over i is obtained under the rank constraint. Note that under this criterion, each individual principal component is not characterized; the minimum solution is X = U_PCA U_PCA^⊤ (the projection and the transform of x onto the r-dimensional eigenspace are U_PCA U_PCA^⊤ x and U_PCA^⊤ x, respectively), and the transform U_PCA itself is not determined.
In the case of KPCA, the criterion is

min_X J_1(X) = ∑_{i=1}^n ||Φ(x_i) − X Φ(x_i)||^2   subject to rank(X) ≤ r, N(X) ⊃ R(S)^⊥,   (16)

where R(A) denotes the range or image of the matrix or operator A, and N(A) denotes the null space or kernel of the matrix or operator A. In the linear case, we can assume that the number of samples n is sufficiently larger than r and d, and the second constraint N(X) ⊃ R(S)^⊥ is often ignored. However, since the dimension of the feature space is large, r could be larger than the dimension of the space spanned by the mapped samples Φ(x_1), ..., Φ(x_n). For such cases, the second constraint is introduced.

Solution to the problem (16)
Here, a brief derivation of the solution to the problem (16) is shown. Since the problem is in R(S), X can be parameterized as X = S A S^*, A ∈ R^{n×n}. Accordingly, J_1 yields

J_1(A) = ||K^{1/2} − K^{1/2} A K||_F^2,

where •^{1/2} denotes the square root matrix and ||•||_F denotes the Frobenius norm. Consider the eigenvalue decomposition K = V Λ V^⊤. From the Schmidt approximation theorem (also called the Eckart-Young theorem) (Israel & Greville, 1973), J_1 is minimized when

A = ∑_{i=1}^r (1/λ_i) v_i v_i^⊤,   i.e.,   X = U_KPCA U_KPCA^*.
The dominant computation in the learning stage is the EVD. In realistic situations, n should be less than several tens of thousands. For example, if n = 100,000, 20 Gbytes of RAM are required to store K on a four-byte floating-point system (storing only the symmetric half, n(n+1)/2 ≈ 5 × 10^9 entries). This computational cost is sometimes too heavy for real large-scale problems. Moreover, in the evaluation stage, the response time of the system depends on n.

Definition
Since the problem of KPCA in the feature space F lies in the subspace spanned by the mapped samples Φ(x_1), ..., Φ(x_n), i.e., R(S), the problem in F is transformed to a problem in R^n. SubKPCA seeks the optimal solution in the space spanned by a smaller number of samples, Φ(y_1), ..., Φ(y_m), m ≤ n, which is called the basis set. Let T = [Φ(y_1), ..., Φ(y_m)]; then the optimization problem of SubKPCA is defined as

min_X J_1(X) = ∑_{i=1}^n ||Φ(x_i) − X Φ(x_i)||^2   subject to rank(X) ≤ r, R(X) ⊂ R(T), N(X) ⊃ R(T)^⊥.   (20)

The last two constraints indicate that the solution is in R(T). It is worth noting that SubKPCA seeks the solution in a restricted space; however, the objective function is the same as that of KPCA, i.e., all training samples are used for the criterion. We call the set of all training samples the criterion set. The selection of the basis set {y_1, ..., y_m} is also an important problem; here we assume that it is given, and the selection is discussed in the next section.

Solution of SubKPCA
First, the minimum solutions to the problem (20) are shown; their derivations follow.
If R(T) ⊂ R(S), the solution is simplified. Note that if the set {y_1, ..., y_m} is a subset of {x_1, ..., x_n}, R(T) ⊂ R(S) is satisfied. Therefore, solutions for two cases are shown: the case R(T) ⊂ R(S), and the general case.

The case R(T) ⊂ R(S)

Let κ_1, ..., κ_r and z_1, ..., z_r be the sorted eigenvalues and corresponding eigenvectors of the generalized eigenvalue problem

K_xy^⊤ K_xy z = κ K_y z,   (21)

where K_y ∈ R^{m×m} is the kernel Gram matrix of the basis set, (K_y)_{i,j} = k(y_i, y_j), and K_xy = S^* T ∈ R^{n×m}, (K_xy)_{i,j} = k(x_i, y_j). Each eigenvector z_i is normalized by

z_i^⊤ K_y z_i = 1.

Then the problem (20) is minimized by

X = P_SubKPCA = T Z Z^⊤ T^*,   Z = [z_1, ..., z_r].

The projection and the transform of SubKPCA for an input vector x are

P_SubKPCA Φ(x) = T Z Z^⊤ h_x   and   Z^⊤ h_x,   (24)

respectively, where h_x = [k(x, y_1), ..., k(x, y_m)]^⊤ ∈ R^m is the empirical kernel map of x for the subset.
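The following sketch demonstrates this solution, i.e., the generalized EVD (21) and the transform (24). The Gaussian kernel, the randomly chosen subset, and the small ridge added to K_y for numerical stability are our illustrative choices, not prescribed by the text:

```python
# SubKPCA under R(T) ⊂ R(S): solve K_xy^T K_xy z = kappa K_y z.
import numpy as np
from scipy.linalg import eigh

def gauss_kernel_matrix(A, B, gamma=0.1):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))         # criterion set, n samples
Y = X[:50]                                 # basis set, m samples
r = 5

K_y  = gauss_kernel_matrix(Y, Y)           # m x m
K_xy = gauss_kernel_matrix(X, Y)           # n x m

# eigh(A, B) solves A z = kappa B z with z^T B z = 1, i.e. the required
# normalization z_i^T K_y z_i = 1; a tiny ridge keeps K_y positive definite.
kappa, Z = eigh(K_xy.T @ K_xy, K_y + 1e-10 * np.eye(len(Y)))
Z = Z[:, ::-1][:, :r]                      # eigenvectors of the r largest kappa

def transform(x):
    h_x = gauss_kernel_matrix(x[None, :], Y).ravel()   # empirical kernel map
    return Z.T @ h_x                       # the transform of Eq. (24)

print(transform(X[0]).shape)               # (r,)
```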
A matrix or an operator A that satisfies AA = A and A^⊤ = A (A^* = A) is called a projector (Harville, 1997). If R(T) ⊂ R(S), P_SubKPCA is a projector since P_SubKPCA^* = P_SubKPCA and P_SubKPCA P_SubKPCA = P_SubKPCA.

All cases
The Moore-Penrose pseudo-inverse is denoted by •^†. Let w_1, ..., w_r be the singular vectors obtained in the derivation below, and let W = [w_1, ..., w_r]; the problem (20) is then minimized by the solution given in Eq. (27). Since this solution is rather complex, and we do not find any advantage in using a basis set {y_1, ..., y_m} such that R(T) ⊄ R(S), we henceforth assume that R(T) ⊂ R(S).

Derivation of the solutions
Since the problem (20) is in R(T), the solution can be parameterized as X = T B T^*, B ∈ R^{m×m}. Then the objective function yields

J_1(B) = ||S − T B T^* S||_F^2 = ||K_y^{1/2} B K_xy^⊤ − (K_y^†)^{1/2} K_xy^⊤||_F^2 + (tr[K] − tr[K_xy K_y^† K_xy^⊤]),

where the relations T^* T = K_y and S^* T = K_xy are used. Since the second term is constant with respect to B, from the Schmidt approximation theorem the minimum is attained at the best rank-r approximation, which is given by the singular value decomposition (SVD) of (K_y^†)^{1/2} K_xy^⊤. From the matrix equation theorem (Israel & Greville, 1973), the minimum solution is then given by Eq. (27).
Now let us consider the case R(T) ⊂ R(S); in this case, the SVD above reduces to the generalized eigenvalue problem (21), which yields the simplified solution shown above.

Computational complexity of SubKPCA
The procedures and computational complexities of SubKPCA are as follows:

1. Select the subset from the training samples (discussed in the next section).
2. Calculate K_y and K_xy^⊤ K_xy. [O(m^2) + O(nm^2)]
3. Perform the generalized EVD, Eq. (21). [O(rm^2)]
4. Store Z and the samples in the subset.
5. For an input vector x, calculate the empirical kernel map h_x. [O(m)]
6. Obtain the transformed vector, Eq. (24).
Procedures 1-4 constitute the construction stage, and procedures 5 and 6 the evaluation stage. The dominant calculation in the construction stage is the generalized EVD. In standard KPCA, the size of the EVD is n, whereas for SubKPCA the size of the generalized EVD is m. Moreover, in the evaluation stage, the computational complexity depends on the subset size m, and the memory required to store Z and the subset is also reduced. This means that the response time for an input vector x of a system using SubKPCA is shorter than that of standard KPCA.

Approximation error
The approximation error due to the subset approximation should be shown. In the case of KPCA, the approximation error is the value of the objective function of the problem (16). From Eqs. (17) and (19), the value of J_1 at the minimum solution is

J_1(P_KPCA) = tr[K] − ∑_{i=1}^r λ_i.

In the case of SubKPCA, the approximation error is

J_1(P_SubKPCA) = (tr[K_xy K_y^† K_xy^⊤] − ∑_{i=1}^r κ_i) + (tr[K] − tr[K_xy K_y^† K_xy^⊤]).

The first term is due to the approximation error of the rank reduction, and the second term is due to the subset approximation. Let P_R(S) and P_R(T) be the orthogonal projectors onto R(S) and R(T), respectively. The second term yields

tr[K] − tr[K_xy K_y^† K_xy^⊤] = tr[S^* (P_R(S) − P_R(T)) S],

since K = S^* P_R(S) S and K_xy K_y^† K_xy^⊤ = S^* P_R(T) S. Therefore, if R(S) = R(T) (for example, if the subset contains all training samples), the second term is zero. If the range of the subset is far from the range of the whole training set, the second term is large.
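The second term can be checked numerically. The following sketch (Gaussian kernel, random data, and nested random subsets are our illustrative choices) shows it shrinking toward zero as the subset grows:

```python
# Numerical check of the subset term tr[K] - tr[K_xy K_y^+ K_xy^T]:
# it vanishes when R(T) = R(S) and grows as R(T) departs from R(S).
import numpy as np

def gauss_kernel_matrix(A, B, gamma=0.1):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
K = gauss_kernel_matrix(X, X)

for m in (10, 50, 200, 500):
    Y = X[:m]
    K_y, K_xy = gauss_kernel_matrix(Y, Y), gauss_kernel_matrix(X, Y)
    err = np.trace(K) - np.trace(K_xy @ np.linalg.pinv(K_y) @ K_xy.T)
    print(m, err)   # decreases toward 0 as m -> n
```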

Pre-centering
Although we have assumed so far that the mean of the training vectors in the feature space is zero, this is not always true in real problems. In the case of PCA, we subtract the mean vector from all training samples when we obtain the variance-covariance matrix Σ. In KPCA, on the other hand, although we cannot obtain the mean vector in the feature space, Φ̄ = (1/n) ∑_{i=1}^n Φ(x_i), explicitly, the pre-centering can be built into the algorithm of KPCA. The pre-centering is achieved by using the subtracted vector Φ(x_i) − Φ̄ instead of the mapped vector Φ(x_i); that is to say, S and K in Eq. (17) are respectively replaced by

S̄ = S(I − (1/n) 1_{n,n}),   K̄ = (I − (1/n) 1_{n,n}) K (I − (1/n) 1_{n,n}),

where I denotes the identity matrix, and 1_n and 1_{n,n} are an n-dimensional vector and an n × n matrix whose elements are all one, respectively.
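A minimal sketch of this centering step (the toy data and kernel parameter are our assumptions):

```python
# Pre-centering sketch: replacing K by (I - 1_{n,n}/n) K (I - 1_{n,n}/n)
# is equivalent to mapping each sample to Phi(x_i) - Phi_bar.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 3))
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-0.1 * sq)

H = np.eye(n) - np.ones((n, n)) / n   # centering matrix I - 1_{n,n}/n
K_centered = H @ K @ H
print(np.allclose(K_centered.sum(axis=0), 0.0))   # rows/cols now sum to zero
```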
For SubKPCA, the following three estimates of the centroid can be considered:

Φ̄_1 = (1/n) ∑_{i=1}^n Φ(x_i),   Φ̄_2 = (1/m) ∑_{j=1}^m Φ(y_j),   Φ̄_3 = P_R(T) Φ̄_1.

The first one is the same as that of KPCA. The second one is the mean of the basis set; if the basis set is a subset of the criterion set, its estimation accuracy is not as good as that of Φ̄_1. The third one is the best approximation of Φ̄_1 in R(T). Since SubKPCA is discussed in R(T), Φ̄_1 and Φ̄_3 are equivalent. However, for post-processing such as the pre-image, they are not equivalent.

For SubKPCA, only S in Eq. (28) has to be modified for pre-centering. If Φ̄_3 is used, S and K_xy are replaced by

S̄ = S(I − (1/n) 1_{n,n}),   K̄_xy = (I − (1/n) 1_{n,n}) K_xy.

Selection of samples
The selection of samples for the basis set is an important problem in SubKPCA. The ideal criterion for the selection depends on the application, such as classification accuracy or PSNR for denoising. Here we show a simple criterion using the empirical error,

min_{y_1, ..., y_m} J_1(P_SubKPCA).

The following strategies to find a (sub-)optimal basis set are considered: 1. greedy forward search; 2. backward search; 3. RANSAC; 4. clustering; and their combinations.

Greedy forward search
The greedy forward search adds samples to the basis set one by one or several at a time. The algorithm is given as Algorithm 1 below. If several samples are added at steps 9 and 10, the algorithm is faster, but the cost function may become larger.

Backward search
On the other hand, a backward search removes samples that have the least effect on the cost function. In this case, the standard KPCA using all samples has to be constructed at the beginning, which may have a very high computational complexity. However, the backward search may be useful in combination with the greedy forward search. In that combination, the size of the temporary basis set does not become large, and the value of the cost function is monotonically decreasing.
Sparse KPCA (Tipping, 2001) is a kind of backward procedure; therefore, the kernel Gram matrix K using all training samples, and its inverse, have to be calculated at the beginning.

Algorithm 1 Greedy forward search (one-by-one version)
1: Set the initial basis set T = ∅, the size of the current basis set m̃ = 0, the residual set S̃ = {x_1, ..., x_n}, and the size of the residual set ñ = n.
2: while m̃ < m do
3:   for i = 1, ..., ñ do
4:     Let the temporary basis set be T̃ = T ∪ {x_i}
5:     Obtain SubKPCA using the temporary basis set
6:     Store the empirical error E_i = J_1(X)
7:   end for
8:   Obtain the smallest E_i, k = argmin_i E_i
9:   Add x_k to T, and set m̃ ← m̃ + 1
10:  Remove x_k from S̃, and set ñ ← ñ − 1
11: end while
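Below is a rough Python sketch of Algorithm 1. It assumes a Gaussian kernel and evaluates the empirical error of each candidate through the closed form J_1(P_SubKPCA) = tr[K] − ∑_{i=1}^r κ_i, which follows from the error expression of the previous section; the naive O(ñ) scan per iteration is kept for clarity:

```python
import numpy as np
from scipy.linalg import eigh

def gauss_kernel_matrix(A, B, gamma=0.1):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def j1(X, idx, r, trK):
    # empirical error of SubKPCA with basis X[idx]:
    # J_1 = tr[K] - (sum of the r largest generalized eigenvalues of Eq. (21))
    K_y  = gauss_kernel_matrix(X[idx], X[idx])
    K_xy = gauss_kernel_matrix(X, X[idx])
    kappa = eigh(K_xy.T @ K_xy, K_y + 1e-10 * np.eye(len(idx)),
                 eigvals_only=True)
    return trK - np.sum(kappa[::-1][:r])

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 2))
trK = np.trace(gauss_kernel_matrix(X, X))
m, r = 20, 5
basis, residual = [], list(range(len(X)))
while len(basis) < m:                            # steps 2-11
    errs = [j1(X, basis + [i], r, trK) for i in residual]
    k = residual[int(np.argmin(errs))]           # step 8
    basis.append(k)                              # step 9
    residual.remove(k)                           # step 10
print(basis)
```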

RANSAC

RANSAC is a simple sample (or parameter) selection technique. The best basis set is chosen from many random sampling trials. The algorithm is simple to code.

Clustering
Clustering techniques can also be used for sample selection. When clustering is used to obtain the basis set, either i) the sample closest to each centroid should be used, or ii) the centroids themselves should be used and included in the criterion set. Clustering in the feature space F has also been proposed (Girolami, 2002).

Comparison with conventional methods
This section compares SubKPCA with related conventional methods.

Improved KPCA
Improved KPCA (IKPCA) (Xu et al., 2007) directly approximates u_i ≃ T ṽ_i in Eq. (7). From S S^* u_i = λ_i u_i, the approximated eigenvalue problem is

S S^* T ṽ_i = λ_i T ṽ_i.

By multiplying by T^* from the left side, one gets the approximated generalized EVD, K_xy^⊤ K_xy ṽ_i = λ_i K_y ṽ_i. The parameter vector ṽ_i is substituted into the relation ũ_i = T ṽ_i; hence, the transform of an input vector x is

⟨ũ_i, Φ(x)⟩ = ṽ_i^⊤ h_x = (1/√κ_i) z_i^⊤ h_x,

where κ_i is the ith largest eigenvalue of (21).
This approximation has no guarantee of being a good approximation of u_i. In our experiments in the next section, IKPCA showed worse performance than SubKPCA. As far as feature extraction is concerned, each dimension of the feature vector is multiplied by 1/√κ_i compared with SubKPCA. If the classifier is invariant to such linear transforms, the classification accuracy of the feature vectors may be the same as with SubKPCA. Indeed, (Xu et al., 2007) uses IKPCA only for feature extraction in a classification problem, and there IKPCA shows good performance.

Sparse KPCA
Two methods to obtain a sparse solution to KPCA have been proposed (Smola et al., 1999; Tipping, 2001). Both approaches focus on reducing the computational complexity in the evaluation stage and do not consider that of the construction stage. In addition, the degree of sparsity cannot be tuned directly for these sparse KPCAs, whereas the subset size m can be tuned for SubKPCA.
As mentioned in Section 4.1.2, (Tipping, 2001) is based on a backward search; therefore, it requires calculating the kernel Gram matrix over all training samples, and its inverse. These procedures have a high computational complexity, especially when n is large. (Smola et al., 1999) utilizes l_1-norm regularization to make the solution sparse. The principal components are represented by linear combinations of the mapped samples, u_i = ∑_{j=1}^n α_i^j Φ(x_j). The coefficients α_i^j have many zero entries due to the l_1-norm regularization. However, since α_i^j has two indices, even if each individual principal component u_i is represented by a few samples, the solution may not be sparse over all i.

Nyström approximation
The Nyström approximation is a method to approximate an EVD, and it has been applied to KPCA (Williams & Seeger, 2001). Let {λ̃_i, ũ_i} and {λ_i, u_i} be the ith eigenpairs of K_y and K, respectively. The Nyström approximation approximates

u_i ≈ √(m/n) (1/λ̃_i) K_xy ũ_i,   λ_i ≈ (n/m) λ̃_i.

Since the eigenvectors of K are approximated by the eigenvectors of K_y, the computational complexity of the construction stage is reduced, but that of the evaluation stage is not. In our experiments, SubKPCA shows better performance than the Nyström approximation.
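A sketch of this approximation (the Gaussian kernel and a random subset are our assumptions):

```python
# Nystrom approximation: eigenpairs of the n x n Gram matrix K estimated
# from the eigendecomposition of the m x m submatrix K_y.
import numpy as np

def gauss_kernel_matrix(A, B, gamma=0.1):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
Y = X[:50]                                  # subset of m samples
n, m, r = len(X), len(Y), 5

K_y  = gauss_kernel_matrix(Y, Y)
K_xy = gauss_kernel_matrix(X, Y)
lam, U = np.linalg.eigh(K_y)
lam, U = lam[::-1][:r], U[:, ::-1][:, :r]   # top-r eigenpairs of K_y

lam_approx = (n / m) * lam                  # approximate eigenvalues of K
U_approx = np.sqrt(m / n) * K_xy @ (U / lam)  # approximate eigenvectors of K

lam_true = np.linalg.eigh(gauss_kernel_matrix(X, X))[0][::-1][:r]
print(lam_true)                             # compare with
print(lam_approx)                           # the Nystrom estimates
```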

Iterative KPCA
There are some iterative approaches to KPCA (Ding et al., 2010; Günter et al., 2007; Kim et al., 2005). They update the transform matrix Λ^{-1/2} V_KPCA^⊤ in Eq. (14) for incoming samples. Iterative approaches are sometimes used to reduce computational complexity; even if the optimization does not converge to the optimal point, an early stopping point may be a good approximation of the optimal solution. However, Kim et al. (2005) and Günter et al. (2007) do not compare their computational complexity with that of standard KPCA. In the next section, comparisons of run-times show that iterative KPCAs are not faster than batch approaches.

Incomplete Cholesky decomposition
Incomplete Cholesky decomposition (ICD) can also be used to reduce the computational complexity of KPCA. ICD approximates the kernel Gram matrix K by

K ≈ G G^⊤,

where G ∈ R^{n×m} is a lower-triangular matrix (its upper-triangular part is zero), and m is a parameter that specifies the trade-off between approximation accuracy and computational complexity. Instead of performing the EVD of K, the eigenvectors of K are obtained from the EVD of G^⊤ G ∈ R^{m×m}, using the relation Eq. (7) approximately. As with the Nyström approximation, ICD reduces the computational complexity of the construction stage but not of the evaluation stage, and all training samples have to be stored for the evaluation.
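A minimal sketch of ICD with pivoting (our own illustrative implementation; with pivoting, G is lower triangular only after the corresponding row permutation):

```python
import numpy as np

def icd(K, m, tol=1e-10):
    # pivoted incomplete Cholesky: K ~ G @ G.T with G of rank at most m
    n = K.shape[0]
    G = np.zeros((n, m))
    d = np.diag(K).astype(float).copy()     # residual diagonal
    for j in range(m):
        p = int(np.argmax(d))               # pivot: largest residual
        if d[p] < tol:                      # remaining residual negligible
            return G[:, :j]
        G[:, j] = (K[:, p] - G[:, :j] @ G[p, :j]) / np.sqrt(d[p])
        d -= G[:, j] ** 2
    return G

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 2))
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-0.1 * sq)

G = icd(K, 40)
print(np.linalg.norm(K - G @ G.T) / np.linalg.norm(K))  # small relative error
```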
In the next section, our experimental results indicate that ICD is slower than SubKPCA for very large datasets, where n is more than several thousand.
ICD can also be applied to SubKPCA. In Eq. (21), K_xy^⊤ K_xy is approximated by G G^⊤; the approximated z is then obtained from the EVD of G^⊤ K_y G.

Numerical examples
This section presents numerical examples and comparisons with the other methods.

Methods and evaluation criteria
First, the methods to be compared and the evaluation criteria are described. The following methods are compared.

Full KPCA [FKp]

Standard KPCA using all training samples.

Reduced KPCA [RKp]
Standard KPCA using subset of training samples.
For the evaluation criteria, the empirical error, i.e., J_1, is used,

E_emp(X) = J_1(X) = ∑_{i=1}^n ||Φ(x_i) − X Φ(x_i)||^2,

where X is replaced by each operator. Note that full KPCA gives the minimum value of E_emp(X) under the rank constraint. Since E_emp(X) depends on the problem, the value normalized by that of full KPCA is also used, E_emp(X)/E_emp(P_FKp), where P_FKp is the projector of full KPCA.
The validation error E_val, which uses validation samples instead of training samples in the empirical error, is also used. An alternative criterion is the operator distance from full KPCA: since these methods are approximations of full KPCA, an operator that is closer to that of full KPCA is the better one.
In the feature space, the distance between projectors is measured by the Frobenius distance,

D(X) = ||X − P_FKp||_F.

For example, if X = P_SubKPCA = T Z Z^⊤ T^* (Eq. (27)), D can be computed from the kernel matrices alone.
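A sketch of this computation (it relies on tr[P] = r for both rank-r projectors and on the normalization z_i^⊤ K_y z_i = 1; the kernel and data are our toy assumptions):

```python
# D^2 = ||P_FKp - P_SubKPCA||_F^2 = 2r - 2 tr[P_FKp P_SubKPCA],
# with tr[P_FKp P_SubKPCA] = ||Lambda_r^{-1/2} V_r^T K_xy Z||_F^2.
import numpy as np
from scipy.linalg import eigh

def gk(A, B, gamma=0.1):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X, r = rng.standard_normal((400, 2)), 5
Y = X[:40]
K, K_y, K_xy = gk(X, X), gk(Y, Y), gk(X, Y)

lam, V = np.linalg.eigh(K)
lam, V = lam[::-1][:r], V[:, ::-1][:, :r]            # full-KPCA eigenpairs
kap, Z = eigh(K_xy.T @ K_xy, K_y + 1e-10 * np.eye(len(Y)))
Z = Z[:, ::-1][:, :r]                                # SubKPCA eigenvectors

M = (V / np.sqrt(lam)).T @ K_xy @ Z                  # Lambda^{-1/2} V^T K_xy Z
D = np.sqrt(max(2 * r - 2 * np.sum(M ** 2), 0.0))
print(D)
```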

Artificial data
The two-dimensional artificial data described in the Introduction are used again with more comparisons and a quantitative evaluation. The Gaussian kernel function k(x_1, x_2) = exp(−0.1 ||x_1 − x_2||^2) and the number of principal components r = 5 are chosen. The training samples of Reduced KPCA and the basis sets of SubKPCA, the Nyström approximation, and IKPCA are identical and chosen randomly. For Sparse KPCA (SpKp), the parameter σ is chosen to give the same sparsity level as SubKPCA. Figure 2 shows the contour curves and the values of the evaluation criteria. In terms of the criteria E_emp and D, SubKp shows the best approximation accuracy among these methods.
Table 1 compares sample selection methods. The values in the table are the mean values and standard deviations over 10 trials using different random seeds or initial points. SubKPCA performed better than the other methods. Regarding sample selection, K-means and the forward search give almost the same results for SubKPCA.

Open dataset
Three open benchmark datasets, "concrete," "housing," and "tic" from the UCI (University of California, Irvine) machine learning repository, are used (Asuncion & Newman, 2007). Table 2 shows the properties of the datasets. The Gaussian kernel k(x_1, x_2) = exp(−||x_1 − x_2||^2/(2σ^2)), whose σ^2 is set to the variance over all elements of each dataset, is used for the kernel function. Figure 5 shows the relation between runtime [s] and the squared distance from full KPCA. In this figure, "kmeans" includes the runtime of the K-means clustering. The vertical dotted line stands for the run-time of full KPCA. For (a) concrete and (b) housing, incomplete Cholesky decomposition is faster than our method. However, for the larger dataset, (c) tic, incomplete Cholesky decomposition is slower than our method. KHA-SMD (Günter et al., 2007) is slower than full KPCA on these three datasets.

Classification
PCA and KPCA are also used as classifiers in subspace methods (Maeda & Murase, 1999; Oja, 1983; Tsuda, 1999). Subspace methods obtain projectors onto subspaces that correspond to classes. Let P_i be the projector onto the subspace of class i. In the class feature information compression (CLAFIC) method, one of the subspace methods, P_i is the projector of PCA for each class. An input sample x is then classified to the class k whose squared projection norm is the largest, that is,

k = argmax_{i=1,...,c} ||P_i x||^2,

where c is the number of classes. Binary classifiers such as the SVM cannot be applied to multi-class problems directly; therefore, some extensions such as the one-against-all strategy have to be used. Subspace methods, however, can be applied to many-class problems directly.
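A minimal sketch of CLAFIC with linear PCA projectors on toy data (the class construction and the subspace dimension are our illustrative assumptions):

```python
# CLAFIC sketch: per-class PCA projectors, classify by the largest
# squared projection norm k = argmax_i ||P_i x||^2.
import numpy as np

rng = np.random.default_rng(0)
# two toy classes in R^5 with different covariance structure
classes = [rng.standard_normal((100, 5)) @ rng.standard_normal((5, 5))
           for _ in range(2)]
r = 2                                        # subspace dimension per class

projectors = []
for Xc in classes:
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U = Vt[:r].T                             # top-r principal directions
    projectors.append(U @ U.T)               # P_i = U_i U_i^T

def classify(x):
    # P_i is a projector, so x^T P_i x = ||P_i x||^2
    return int(np.argmax([x @ P @ x for P in projectors]))

print(classify(classes[0][0]), classify(classes[1][0]))
```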

Conclusion
The theory, properties, and numerical examples of SubKPCA have been presented in this chapter. SubKPCA has a simple solution form, Eq. (22), and places no constraints on the kernel function. Therefore, SubKPCA can be applied to any application of KPCA. Furthermore, it should be emphasized that SubKPCA is always better than reduced KPCA in the sense of the empirical error if the subset is the same.

Fig. 2. Contour curves of projection norms.

Fig. 6. Relation between training error and elapsed time on the MNIST dataset.

A subset of the MNIST training samples is used for the basis of SubKPCA. Figure 6 shows the relation between run-time and training error. SubKPCA achieves the lower training error E_emp = 4.57 × 10^−5 in 28 seconds, whereas ICD achieves E_emp = 4.59 × 10^−5 in 156 seconds.

Table 1.
Mean values and standard deviations over 10 trials of E_emp and D in Experiment 1: upper rows are E_emp(X)/E_emp(P_FKp); lower rows are D. Sparse KPCA does not require sample selection.

Table 2.
Open dataset properties.

The number of principal components, r, is set to be the input dimension of each dataset. 90% of the samples are used for training, and the remaining 10% are used for validation. The division into training and validation sets is repeated 50 times randomly. Figures 3-(a) and (b) show the averaged squared distance from KPCA using all samples. SubKPCA shows better performance than Reduced KPCA and the Nyström method; in particular, SubKPCA with a forward search performed the best of all. In both datasets, even if the number of basis samples is one tenth of the number of all samples, the distance error of SubKPCA is less than 1%. Figures 3-(c) and (d) show the averaged normalized empirical error, and Figures 3-(e) and (f) show the averaged validation error. SubKPCA with K-means or forward search performed the best, and its performance did not change much with 20% more basis samples. The results of the Nyström method are outside the range illustrated in the figures. Figures 4-(a) and (b) show the calculation times for the construction. The simulation was done on a system with an Intel Core 2 Quad CPU at 2.83 GHz and 8 Gbytes of RAM. The routines dsygvx and dsyevx in the Intel math kernel library (MKL) were used for the generalized eigenvalue decomposition of SubKPCA and the eigenvalue decomposition of KPCA, respectively. The figures indicate that SubKPCA is faster than full KPCA if the number of basis samples is less than 80% of all samples.