Kernel Methods for Dimensionality Reduction Applied to the «Omics» Data


Introduction
Microarray technology has advanced to the point at which the simultaneous monitoring of gene expression on a genome scale is now possible. Microarray experiments often aim to identify individual genes that are differentially expressed under distinct conditions, such as between two or more phenotypes or cell lines, under different treatments, or between diseased and healthy subjects. Such experiments may be the first step towards inferring gene function and constructing gene networks in systems biology.
The term "gene expression profile" refers to the gene expression values on all arrays for a given gene in different groups of arrays. Frequently, a summary statistic of the gene expression values, such as the mean or the median, is also reported. Dot plots of the gene expression measurements in subsets of arrays, and line plots of the summaries of gene expression measurements, are the most common plots used to display gene expression data (see for example Chambers (1983) and references therein).
An ever increasing number of techniques are being applied to detect genes which have similar expression profiles in microarray experiments. Techniques such as clustering (Eisen et al. (1998)) and self-organizing maps (Tamayo et al. (1999)) have been applied to the analysis of gene expression data. There are also several applications to microarray analysis based on machine learning methods such as Gaussian processes (Chu et al. (2005); Zhao & Cheung (2007)), Boosting (Dettling (2004)) and Random Forest (Diaz (2006)). Finding gene or sample clusters with similar gene expression patterns is useful for interpreting microarray data.
However, due to the large number of genes involved, it might be more effective to display these data on a low dimensional plot. Recently, several authors have explored dimension reduction techniques. Alter et al. (2000) analyzed microarray data using singular value decomposition (SVD), Fellenberg et al. (2001) used correspondence analysis to visualize genes and tissues, and Pittelkow & Wilson (2003) and Park et al. (2008) used several variants of biplot methods as a visualization tool for the analysis of microarray data. Visualizing gene expression may facilitate the identification of genes with similar expression patterns.
Principal component analysis has a very long history and is known to be very powerful in the linear case. However, the sample space that many research problems face, especially the sample space of microarray data, is considered nonlinear in nature. One reason might be that the interactions among genes are not completely understood, and many biological pathways are still beyond human comprehension. It is then quite naive to assume that the genes are connected in a linear fashion. Following this line of thought, research on nonlinear dimensionality reduction for microarray gene expression data has increased (Zhenqiu et al. (2005), Xuehua & Lan (2009) and references therein). Finding methods that can handle such data is of great importance if as much information as possible is to be gleaned.
Kernel representation offers an alternative way of handling nonlinear functions: the data are projected into a high-dimensional feature space, which increases the computational power of linear learning machines (see for instance Shawe-Taylor & Cristianini (2004); Scholkopf & Smola (2002)).
Kernel methods enable us to construct nonlinear versions of any algorithm that can be expressed solely in terms of dot products; this is known as the kernel trick. Thus, kernel algorithms avoid the explicit use of the input variables in the statistical learning task. Kernel machines can be used to implement several learning algorithms, but they usually act as a black box with respect to the input variables. This can be a drawback in biplot displays, in which we pursue the simultaneous representation of samples and input variables.
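The kernel trick can be illustrated with a small sketch. It is not part of the method of this chapter: the homogeneous polynomial kernel of degree 2 is chosen here only because its feature map can be written down explicitly, and the names `poly2_kernel` and `poly2_features` are hypothetical. The example checks that evaluating the kernel in input space gives the same number as a dot product in the feature space.

```python
import numpy as np

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: k(x, z) = <x, z>^2."""
    return float(np.dot(x, z)) ** 2

def poly2_features(x):
    """Explicit feature map phi(x) such that <phi(x), phi(z)> = <x, z>^2.

    phi(x) lists all products x_i * x_j, so it has n^2 coordinates.
    """
    return np.outer(x, x).ravel()

# Two illustrative points in R^3 (arbitrary values).
x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

implicit = poly2_kernel(x, z)  # never builds phi
explicit = float(np.dot(poly2_features(x), poly2_features(z)))
```

Both routes give the same value, but the implicit one never touches the n^2 feature coordinates. For the radial basis kernel the feature space is infinite dimensional, so only the implicit route is available.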
In this work we develop a procedure to enrich the interpretability of Kernel PCA by adding a representation of the input variables to the plot. We used the radial basis kernel (Gaussian kernel) in our implementation; however, the procedure is also applicable when another smooth kernel is more appropriate, for example the linear kernel, which yields standard PCA. In particular, for each input variable (gene) we can locally represent the direction of maximum variation of the gene expression. As we describe below, our implementation enables us to extract the nonlinear features without discarding the simultaneous display of input variables (genes) and samples (microarrays).

Kernel PCA methodology
KPCA is a nonlinear equivalent of classical PCA that uses methods inspired by statistical learning theory. We briefly describe the KPCA method of Scholkopf et al. (1998).
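Given a set of observations $x_i \in \mathbb{R}^n$, $i = 1, \ldots, m$, let us consider a dot product space $F$ related to the input space by a map $\phi : \mathbb{R}^n \to F$ which is possibly nonlinear. The feature space $F$ could have an arbitrarily large, and possibly infinite, dimension. Hereafter upper case characters are used for elements of $F$, while lower case characters denote elements of $\mathbb{R}^n$. We assume that we are dealing with centered data, $\sum_{i=1}^{m} \phi(x_i) = 0$. In $F$ the covariance matrix takes the form

$$C = \frac{1}{m} \sum_{j=1}^{m} \phi(x_j)\, \phi(x_j)^{\top}.$$

We have to find eigenvalues $\lambda \geq 0$ and nonzero eigenvectors $V \in F \setminus \{0\}$ satisfying

$$\lambda V = C V.$$

As is well known, all solutions $V$ with $\lambda \neq 0$ lie in the span of $\{\phi(x_i)\}_{i=1}^{m}$. This has two consequences. First, we may instead consider the set of equations

$$\lambda \langle \phi(x_k), V \rangle = \langle \phi(x_k), C V \rangle \tag{1}$$

for all $k = 1, \ldots, m$. Second, there exist coefficients $\alpha_i$, $i = 1, \ldots, m$, such that

$$V = \sum_{i=1}^{m} \alpha_i\, \phi(x_i). \tag{2}$$

Combining (1) and (2) we get the dual representation of the eigenvalue problem

$$m \lambda K \alpha = K^{2} \alpha, \tag{3}$$

where $K$ is the $m \times m$ matrix with entries $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$ and $\alpha$ denotes the column vector with entries $\alpha_1, \ldots, \alpha_m$. To find the solutions of (3), we solve the dual eigenvalue problem

$$K \alpha = m \lambda \alpha \tag{4}$$

for nonzero eigenvalues. It can be shown that this yields all solutions of (3) that are of interest to us. Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$ be the eigenvalues of $K$ (that is, the values $m\lambda$ in (4)) and $\alpha^1, \ldots, \alpha^m$ the corresponding set of eigenvectors, with $\lambda_r$ being the last nonzero eigenvalue. We normalize $\alpha^1, \ldots, \alpha^r$ by requiring that the corresponding vectors in $F$ be normalized, $\langle V^k, V^k \rangle = 1$ for all $k = 1, \ldots, r$. Taking into account (2) and (4), we may rewrite the normalization condition for $\alpha^1, \ldots, \alpha^r$ in this way:

$$1 = \sum_{i,j=1}^{m} \alpha_i^k \alpha_j^k \langle \phi(x_i), \phi(x_j) \rangle = \langle \alpha^k, K \alpha^k \rangle = \lambda_k \langle \alpha^k, \alpha^k \rangle. \tag{5}$$

For the purpose of principal component extraction, we need to compute the projections onto the eigenvectors $V^k$ in $F$, $k = 1, \ldots, r$. Let $y$ be a test point, with an image $\phi(y)$ in $F$. Then

$$\langle V^k, \phi(y) \rangle = \sum_{i=1}^{m} \alpha_i^k \langle \phi(x_i), \phi(y) \rangle \tag{6}$$

are the nonlinear principal components corresponding to $\phi$.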
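The dual eigenvalue problem and its normalization can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function names `rbf_gram`, `dual_eigen` and `project` are hypothetical, the data are random, and centering in feature space is deliberately ignored here, in line with the assumption made in this section.

```python
import numpy as np

def rbf_gram(X, c):
    """Gram matrix K_ij = exp(-c * ||x_i - x_j||^2) for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-c * np.maximum(d2, 0.0))

def dual_eigen(K, tol=1e-10):
    """Eigenpairs of K with nonzero eigenvalue, scaled so that
    lam_k * <alpha_k, alpha_k> = 1 (the normalization condition)."""
    lam, A = np.linalg.eigh(K)          # ascending eigenvalues, orthonormal columns
    lam, A = lam[::-1], A[:, ::-1]      # reorder so that lam_1 >= lam_2 >= ...
    keep = lam > tol * lam[0]           # drop numerically zero eigenvalues
    lam, A = lam[keep], A[:, keep]
    return lam, A / np.sqrt(lam)        # divide column k by sqrt(lam_k)

def project(K_test, A):
    """Row j holds <V^k, phi(y_j)>, k = 1..r, where K_test[j, i] = k(y_j, x_i)."""
    return K_test @ A

# Hypothetical toy data: 6 observations in R^3.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K = rbf_gram(X, c=0.5)
lam, A = dual_eigen(K)
scores = project(K, A)                  # projections of the training points
```

With this scaling the matrix `A.T @ K @ A` is the identity, which is the statement that the vectors V^k have unit norm in F.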

Centering in feature space
For the sake of simplicity, we have made the assumption that the observations are centered. This is easy to achieve in input space but harder in $F$, because we cannot explicitly compute the mean of the mapped observations in $F$. There is, however, a way to do it.
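Given any $\phi$ and any set of observations $x_1, \ldots, x_m$, let us define $\bar{\phi} = \frac{1}{m} \sum_{i=1}^{m} \phi(x_i)$. Then the points

$$\tilde{\phi}(x_i) = \phi(x_i) - \bar{\phi} \tag{7}$$

will be centered. Thus the assumption made above now holds, and we go on to define the covariance matrix and the dot product matrix $\tilde{K}_{ij} = \langle \tilde{\phi}(x_i), \tilde{\phi}(x_j) \rangle$ in $F$. We arrive at our already familiar eigenvalue problem

$$m \tilde{\lambda} \tilde{\alpha} = \tilde{K} \tilde{\alpha}, \tag{8}$$

with $\tilde{\alpha}$ being the expansion coefficients of an eigenvector (in $F$) in terms of the centered points (7),

$$\tilde{V} = \sum_{i=1}^{m} \tilde{\alpha}_i\, \tilde{\phi}(x_i). \tag{9}$$

Because we do not have the centered data (7), we cannot compute $\tilde{K}$ explicitly; however, we can express it in terms of its noncentered counterpart $K$. In the following, we shall use $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$. Expanding the dot product of the centered points gives

$$\tilde{K}_{ij} = \langle \phi(x_i) - \bar{\phi},\, \phi(x_j) - \bar{\phi} \rangle = K_{ij} - \frac{1}{m} \sum_{s=1}^{m} K_{sj} - \frac{1}{m} \sum_{s=1}^{m} K_{is} + \frac{1}{m^2} \sum_{s,t=1}^{m} K_{st}.$$

Using the vector $1_m = (1, \ldots, 1)^{\top}$, we get the more compact expression

$$\tilde{K} = K - \frac{1}{m} 1_m 1_m^{\top} K - \frac{1}{m} K 1_m 1_m^{\top} + \frac{1}{m^2} \left( 1_m^{\top} K 1_m \right) 1_m 1_m^{\top}.$$

We thus can compute $\tilde{K}$ from $K$ and solve the eigenvalue problem (8). As in equation (5), the solutions $\tilde{\alpha}^k$, $k = 1, \ldots, r$, are normalized by normalizing the corresponding vector $\tilde{V}^k$ in $F$, which translates into $\tilde{\lambda}_k \langle \tilde{\alpha}^k, \tilde{\alpha}^k \rangle = 1$.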
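The compact expression for the centered kernel matrix can be checked numerically. The sketch below is an illustration under stated assumptions: the function name `center_gram` and the random data are hypothetical, and the linear kernel is used only because for it the feature map is the identity, so centering can also be done explicitly in input space and compared.

```python
import numpy as np

def center_gram(K):
    """Centered kernel matrix computed from the noncentered one.

    Implements K - (1/m) 11^T K - (1/m) K 11^T + (1/m^2) (1^T K 1) 11^T.
    """
    m = K.shape[0]
    J = np.ones((m, m)) / m             # J = (1/m) 1_m 1_m^T
    return K - J @ K - K @ J + J @ K @ J

# Hypothetical toy data: 7 observations in R^4.
rng = np.random.default_rng(1)
X = rng.normal(size=(7, 4))

K = X @ X.T                             # linear kernel, noncentered
Xc = X - X.mean(axis=0)                 # explicit centering in input space
K_explicit = Xc @ Xc.T                  # Gram matrix of the centered points
K_formula = center_gram(K)              # same matrix, obtained from K alone
```

For a nonlinear kernel only the second route is available, which is why the formula matters.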
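Consider a test point $y$. To find its coordinates we compute projections of the centered $\phi$-image of $y$ onto the eigenvectors of the covariance matrix of the centered points,

$$\langle \tilde{\phi}(y), \tilde{V}^k \rangle = \sum_{i=1}^{m} \tilde{\alpha}_i^k \langle \phi(y) - \bar{\phi},\, \phi(x_i) - \bar{\phi} \rangle = \sum_{i=1}^{m} \tilde{\alpha}_i^k \left\{ K(y, x_i) - \frac{1}{m} \sum_{s=1}^{m} K(x_s, x_i) - \frac{1}{m} \sum_{s=1}^{m} K(y, x_s) + \frac{1}{m^2} \sum_{s,t=1}^{m} K(x_s, x_t) \right\}.$$

Introducing the vector

$$Z = \left( K(y, x_i) \right)_{m \times 1}, \tag{10}$$

the coordinates can be written in matrix form as

$$\left( \langle \tilde{\phi}(y), \tilde{V}^k \rangle \right)_{1 \times r} = \left( Z^{\top} - \frac{1}{m} 1_m^{\top} K \right) \left( I_m - \frac{1}{m} 1_m 1_m^{\top} \right) \tilde{V}, \tag{11}$$

where $\tilde{V}$ is an $m \times r$ matrix whose columns are the coefficient vectors $\tilde{\alpha}^1, \ldots, \tilde{\alpha}^r$ of the eigenvectors $\tilde{V}^1, \ldots, \tilde{V}^r$.
Principal Component Analysis - Multidisciplinary Applications, www.intechopen.com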
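A minimal sketch of the projection of new points, assuming the radial basis kernel and random toy data. The helper names `rbf`, `fit_kpca` and `project` are hypothetical, and `V` below stores the normalized coefficient vectors as columns. A useful sanity check is that projecting the training points themselves reproduces the centered kernel matrix times `V`.

```python
import numpy as np

def rbf(A, B, c):
    """Matrix with entries k(a_i, b_j) = exp(-c * ||a_i - b_j||^2)."""
    d2 = (np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-c * np.maximum(d2, 0.0))

def fit_kpca(X, c, r):
    """Return K, the centering matrix H = I - (1/m) 11^T, and the m x r matrix V."""
    m = X.shape[0]
    K = rbf(X, X, c)
    H = np.eye(m) - np.ones((m, m)) / m
    lam, U = np.linalg.eigh(H @ K @ H)          # ascending order
    lam, U = lam[::-1][:r], U[:, ::-1][:, :r]   # keep the r leading eigenpairs
    return K, H, U / np.sqrt(lam)               # lam_k * <alpha_k, alpha_k> = 1

def project(Y, X, K, H, V, c):
    """Coordinates of each row y of Y: (Z^T - (1/m) 1^T K) H V."""
    Z_T = rbf(Y, X, c)                          # row j is Z^T for y_j
    return (Z_T - K.mean(axis=0)) @ H @ V       # K.mean(axis=0) is (1/m) 1^T K

# Hypothetical toy data: 8 training points and 3 test points in R^5.
rng = np.random.default_rng(2)
X = rng.normal(size=(8, 5))
Y = rng.normal(size=(3, 5))
K, H, V = fit_kpca(X, c=0.1, r=2)
train_scores = project(X, X, K, H, V, c=0.1)
test_scores = project(Y, X, K, H, V, c=0.1)
```

The number of retained components `r` must not exceed the number of nonzero eigenvalues of the centered kernel matrix, which is at most m - 1.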
Notice that KPCA uses the input variables only implicitly, since the algorithm formulates the reduction of the dimension in the feature space through kernel function evaluations. Thus KPCA is useful for nonlinear feature extraction by reducing the dimension, but not for explaining the selected features in terms of the input variables.

Adding input variable information into Kernel PCA
In order to get interpretability we add supplementary information into KPCA representation.
We have developed a procedure to project any given input variable onto the subspace spanned by the eigenvectors (9).
We can consider that our observations are realizations of the random vector $X = (X_1, \ldots, X_n)$.
To represent the prominence of the input variable $X_k$ in the KPCA, we take a set of points of the form $y = a + s e_k \in \mathbb{R}^n$, $s \in \mathbb{R}$, where $e_k = (0, \ldots, 1, \ldots, 0) \in \mathbb{R}^n$ has its $k$-th component equal to 1 and all other components equal to 0. Then we can compute the projections of the images $\phi(y)$ of these points onto the subspace spanned by the eigenvectors (9).
Taking into account equation (11), the induced curve in the eigenspace, expressed in matrix form, is given by the row vector

$$\sigma(s)_{1 \times r} = \left( Z_s^{\top} - \frac{1}{m} 1_m^{\top} K \right) \left( I_m - \frac{1}{m} 1_m 1_m^{\top} \right) \tilde{V}, \tag{12}$$

where $Z_s$ is of the form (10), evaluated at $y = a + s e_k$.
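In addition, we can represent the directions of maximum variation of $\sigma(s)$ associated with the variable $X_k$ by projecting the tangent vector at $s = 0$. In matrix form, we have

$$\left. \frac{d\sigma}{ds} \right|_{s=0} = \left. \frac{dZ_s^{\top}}{ds} \right|_{s=0} \left( I_m - \frac{1}{m} 1_m 1_m^{\top} \right) \tilde{V}, \tag{13}$$

with

$$\left. \frac{dZ_s^{\top}}{ds} \right|_{s=0} = \left( \left. \frac{dZ_s^{1}}{ds} \right|_{s=0}, \ldots, \left. \frac{dZ_s^{m}}{ds} \right|_{s=0} \right)$$

and

$$\left. \frac{dZ_s^{i}}{ds} \right|_{s=0} = \left. \frac{dK(y, x_i)}{ds} \right|_{s=0} = \sum_{t=1}^{n} \left. \frac{\partial K(y, x_i)}{\partial y_t} \right|_{y=a} \delta_k^t = \left. \frac{\partial K(y, x_i)}{\partial y_k} \right|_{y=a},$$

where $\delta_k^t$ denotes the Kronecker delta. In particular, let us consider the radial basis kernel $k(x, z) = \exp(-c \lVert x - z \rVert^2)$, with $c > 0$ a free parameter. Using the above notation, we have

$$K(y, x_i) = \exp\left( -c \lVert y - x_i \rVert^2 \right) = \exp\left( -c \sum_{t=1}^{n} (y_t - x_{it})^2 \right).$$

When we consider the set of points of the form $y = a + s e_k \in \mathbb{R}^n$,

$$\left. \frac{dZ_s^{i}}{ds} \right|_{s=0} = \left. \frac{\partial K(y, x_i)}{\partial y_k} \right|_{y=a} = -2c\, K(a, x_i)\, (a_k - x_{ik}).$$

In addition, if $a = x_\beta$ (a training point), then

$$\left. \frac{dZ_s^{i}}{ds} \right|_{s=0} = -2c\, K(x_\beta, x_i)\, (x_{\beta k} - x_{ik}).$$

Thus, by applying equation (12) we can locally represent any given input variable in the KPCA plot. Moreover, by using equation (13) we can represent the tangent vector associated with any given input variable at each sample point. Therefore, we can plot a vector field over the KPCA plot that points in the directions in which the given variable grows.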
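The tangent vector for the radial basis kernel can be verified against a finite difference of the curve. The following is a sketch on random toy data, not the authors' implementation; the names `rbf_row`, `curve_point` and `tangent`, the value c = 0.1 and the step h are illustrative assumptions.

```python
import numpy as np

def rbf_row(y, X, c):
    """Z^T for one point y: entries k(y, x_i) = exp(-c * ||y - x_i||^2)."""
    return np.exp(-c * np.sum((X - y) ** 2, axis=1))

def curve_point(y, X, K, H, V, c):
    """Coordinates of phi(y) in the KPCA subspace: (Z^T - (1/m) 1^T K) H V."""
    return (rbf_row(y, X, c) - K.mean(axis=0)) @ H @ V

def tangent(a, k, X, H, V, c):
    """Tangent at s = 0 of the curve induced by y = a + s * e_k (RBF kernel)."""
    dZ = -2.0 * c * rbf_row(a, X, c) * (a[k] - X[:, k])
    return dZ @ H @ V

# Hypothetical toy data: 10 observations in R^4, two retained components.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
c, r = 0.1, 2
m = X.shape[0]
K = np.array([rbf_row(x, X, c) for x in X])
H = np.eye(m) - np.ones((m, m)) / m
lam, U = np.linalg.eigh(H @ K @ H)
V = U[:, ::-1][:, :r] / np.sqrt(lam[::-1][:r])

# Compare the closed form with a central finite difference at a training point.
a, k, h = X[3], 1, 1e-5
e_k = np.eye(X.shape[1])[k]
numeric = (curve_point(a + h * e_k, X, K, H, V, c)
           - curve_point(a - h * e_k, X, K, H, V, c)) / (2 * h)
analytic = tangent(a, k, X, H, V, c)
```

The term (1/m) 1^T K does not depend on s, which is why it disappears from the derivative and only the derivative of Z_s remains.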
We used the radial basis kernel in our implementation; however, the procedure we have introduced is also applicable to any other smooth kernel, for instance the linear kernel, which yields standard PCA.
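As a worked special case, the linear kernel can be followed through equation (13). The derivation below is a sketch added for illustration, assuming $K(y, x_i) = \langle y, x_i \rangle$ and the normalization $\tilde{\lambda}_k \langle \tilde{\alpha}^k, \tilde{\alpha}^k \rangle = 1$; the symbols $X_c$, $U$, $\Lambda$ and $W$ are introduced here and do not appear elsewhere in the chapter.

$$\left. \frac{dZ_s^{i}}{ds} \right|_{s=0} = \left. \frac{\partial}{\partial y_k} \sum_{t=1}^{n} y_t x_{it} \right|_{y=a} = x_{ik}, \qquad \left. \frac{d\sigma}{ds} \right|_{s=0} = (x_{1k}, \ldots, x_{mk}) \left( I_m - \frac{1}{m} 1_m 1_m^{\top} \right) \tilde{V}.$$

The result does not depend on the base point $a$, so with the linear kernel every sample carries the same arrow for the variable $X_k$. If $X_c = U \Lambda^{1/2} W^{\top}$ is the singular value decomposition of the column-centered data matrix, restricted to the $r$ retained components, then $\tilde{K} = X_c X_c^{\top} = U \Lambda U^{\top}$ and $\tilde{V} = U \Lambda^{-1/2}$, so that

$$\left. \frac{d\sigma}{ds} \right|_{s=0} = (X_c e_k)^{\top} U \Lambda^{-1/2} = e_k^{\top} W \Lambda^{1/2} U^{\top} U \Lambda^{-1/2} = e_k^{\top} W,$$

which is the $k$-th row of the PCA loading matrix. In this sense the tangent vectors reduce to the usual variable directions of a linear PCA display.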

Validation
In this section we illustrate our procedure with the leukemia data set of Golub et al. (1999) and with a lymphoma data set.
In these examples our aim is to validate our procedure for adding input variable information to the KPCA representation. We proceed in the following steps. First, in each data set, we build a list of genes that are differentially expressed. This selection is made in accordance with previous studies (Golub et al. (1999), Pittelkow & Wilson (2003), Reverter et al. (2010)).
In addition, we compute the expression profile of each selected gene; this profile confirms the evidence of differential expression.
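An expression profile with its summary statistics can be computed along these lines. The values and class labels below are made up for illustration, and `expression_profile` is a hypothetical helper name, not a function from any microarray package.

```python
import numpy as np

def expression_profile(values, labels):
    """Per-class mean and median of the expression values of one gene."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    return {
        str(g): {
            "mean": float(values[labels == g].mean()),
            "median": float(np.median(values[labels == g])),
        }
        for g in np.unique(labels)
    }

# Made-up expression values of a single gene on six arrays.
values = [0.1, -0.2, 0.0, 2.1, 1.9, 2.3]
labels = ["AML", "AML", "AML", "T-ALL", "T-ALL", "T-ALL"]
profile = expression_profile(values, labels)
```

In this toy example the class summaries differ clearly, which is the pattern expected for a differentially expressed gene; a flat profile would show similar summaries in every class.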

Second, we compute the curves through each sample point associated with each gene in the list. These curves are given by the $\phi$-image of points of the form

$$y(s) = x_i + s e_k, \quad s \in \mathbb{R},$$

where $x_i$ is the $1 \times n$ expression vector of the $i$-th sample, $i = 1, \ldots, m$, $k$ denotes the index in the expression matrix of the gene selected to be represented, and $e_k = (0, \ldots, 1, \ldots, 0)$ is a $1 \times n$ vector with zeros except in the $k$-th position. These curves describe locally the change of the sample $x_i$ induced by the change of the gene expression.
Third, we project the tangent vector of each curve at s = 0, that is, at the sample points x i , i = 1, ..., m, onto the KPCA subspace spanned by the eigenvectors (9). This representation captures the direction of maximum variation induced in the samples when the expression of the gene increases.
By simultaneously displaying both the samples and the gene information on the same plot it is possible both to visually detect genes which have similar profiles and to interpret this pattern by reference to the sample groups.
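The three steps can be tried on synthetic data where the answer is known. The sketch below is an illustration only: the data are simulated (two groups of 20 samples, with variable 0 shifted upwards in the second group), the value c = 0.01 and all names are assumptions, and no plotting is done. It computes the sample coordinates and, for a chosen variable, one tangent vector per sample.

```python
import numpy as np

# Simulated data: 2 groups x 20 samples, 5 variables; variable 0 is
# "up-regulated" in the second group, the others are pure noise.
rng = np.random.default_rng(0)
m_per, n, c = 20, 5, 0.01
low = rng.normal(0.0, 1.0, size=(m_per, n))
high = rng.normal(0.0, 1.0, size=(m_per, n))
high[:, 0] += 4.0
X = np.vstack([low, high])
X = X - X.mean(axis=0)                      # centering of each variable
m = X.shape[0]

# Kernel PCA with the radial basis kernel, two leading components.
sq = np.sum(X ** 2, axis=1)
K = np.exp(-c * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
H = np.eye(m) - np.ones((m, m)) / m
lam, U = np.linalg.eigh(H @ K @ H)
V = U[:, ::-1][:, :2] / np.sqrt(lam[::-1][:2])
scores = H @ K @ H @ V                      # sample coordinates, m x 2

def tangent_field(k):
    """Row b is the tangent vector of variable k at the sample x_b."""
    diff = X[:, k][:, None] - X[:, k][None, :]   # diff[b, i] = x_bk - x_ik
    dZ = -2.0 * c * K * diff                     # derivative of Z_s at s = 0
    return dZ @ H @ V

arrows = tangent_field(0)                   # one arrow per sample
shift = scores[m_per:].mean(axis=0) - scores[:m_per].mean(axis=0)
```

Here `shift` is the displacement from the centroid of the first group to the centroid of the second. On average the arrows of variable 0 have a positive dot product with it, so they point towards the group in which the variable is higher; drawing `arrows` at the positions in `scores` gives the kind of vector field described above.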

Leukemia data sets
The leukemia data set is composed of 3051 gene expressions in three classes of leukemia: 19 cases of B-cell acute lymphoblastic leukemia (ALL), 8 cases of T-cell ALL and 11 cases of acute myeloid leukemia (AML). Gene expression levels were measured using Affymetrix high-density oligonucleotide arrays.
The data were preprocessed according to the protocol described in Dudoit et al. (2002).
In addition, we complete the preprocessing of the gene expression data with a microarray standardization and gene centring.
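The last preprocessing step can be sketched as follows. Several details are assumptions made for the illustration and are not stated in the text: the expression matrix is stored with genes in rows and arrays in columns, "microarray standardization" is read as scaling each array to zero mean and unit variance (sample variance, `ddof=1`), and the toy matrix is random.

```python
import numpy as np

def standardize_arrays(E):
    """Scale every array (column) to zero mean and unit variance."""
    return (E - E.mean(axis=0)) / E.std(axis=0, ddof=1)

def center_genes(E):
    """Subtract from every gene (row) its mean over the arrays."""
    return E - E.mean(axis=1, keepdims=True)

# Random toy matrix standing in for real data: 50 genes x 6 arrays.
rng = np.random.default_rng(1)
E = rng.normal(loc=5.0, scale=2.0, size=(50, 6))

E_std = standardize_arrays(E)
E_prep = center_genes(E_std)
X = E_prep.T        # samples in rows, matching the observations x_i in R^n
```

After gene centering the arrays still have zero mean, but their variance is no longer exactly one, so the order of the two steps matters.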
In this example we perform KPCA as detailed in the previous section. We compute the kernel matrix using the radial basis kernel with c = 0.01; this value is set heuristically. The resulting plot is given in Figure 1. It shows the projection of the microarrays onto the two leading kernel principal components. In this figure we can see that KPCA detects the group structure in reduced dimension: AML, T-cell ALL and B-cell ALL are fully separated by KPCA.
To validate our procedure we select a list of differentially expressed genes proposed by Golub et al. (1999), Pittelkow & Wilson (2003) and Reverter et al. (2010), and a list of genes that are not differentially expressed. In particular, Figures 2, 3, 4 and 5 show the results for the genes X76223_s_at, X82240_rna1_at, Y00787_s_at and D50857_at, respectively. The first three genes belong to the list of differentially expressed genes and the last gene is not differentially expressed. The expression profile of the X76223_s_at gene agrees with our procedure, because the direction in which the expression of this gene increases points to the T-cell cluster.
Figure 3 (top) shows the tangent vectors associated with the X82240_rna1_at gene, attached at each sample point. This vector field reveals higher expression towards the B-cell cluster, as expected from the references mentioned above. Figure 3 (bottom) shows the expression profile of the X82240_rna1_at gene. We can observe that the X82240_rna1_at gene is up-regulated in the B-cell class. This profile agrees with our procedure, because the direction in which the expression of the X82240_rna1_at gene increases points to the B-cell cluster.
Figure 4 (top) shows the tangent vectors associated with the Y00787_s_at gene, attached at each sample point. This vector field reveals higher expression towards the AML cluster, as expected from the references mentioned above. Figure 4 (bottom) shows the expression profile of the Y00787_s_at gene. We can observe that the Y00787_s_at gene is up-regulated in the AML class. This profile agrees with our procedure, because the direction in which the expression of the Y00787_s_at gene increases points to the AML cluster.
Figure 5 (top) shows the tangent vectors associated with the D50857_at gene. The arrows are short and variable in direction in comparison with the genes shown in the previous figures. Figure 5 (bottom) shows a flat expression profile of the D50857_at gene. This profile agrees with our procedure, because no direction of expression of the D50857_at gene is highlighted.

Lymphoma data sets
The lymphoma data set comes from a study of gene expression in three prevalent lymphoid malignancies: B-cell chronic lymphocytic leukemia (B-CLL), follicular lymphoma (FL) and diffuse large B-cell lymphoma (DLCL). From the 96 samples we took 62 samples, with 4026 genes, in three classes: 11 cases of B-CLL, 9 cases of FL and 42 cases of DLCL. Gene expression levels were measured using 2-channel cDNA microarrays.
After preprocessing, all gene expression profiles were base 10 log-transformed and, in order to prevent single arrays from dominating the analysis, standardized to zero mean and unit variance. Finally, we completed the preprocessing of the gene expression data with gene centring.
In this example we perform KPCA as detailed in the previous section. We compute the kernel matrix using the radial basis kernel with c = 0.01; this value is set heuristically. The resulting plot is given in Figure 6. It shows the projection of the microarrays onto the two leading kernel principal components. In this figure we can see that KPCA detects the group structure in reduced dimension: DLCL, FL and B-CLL are fully separated by KPCA.
To validate our procedure we select a list of differentially expressed genes proposed by Reverter et al. (2010) and a list of genes that are not differentially expressed. In particular, Figures 7, 8, 9 and 10 show the results for the genes 139009, 1319066, 1352822 and 1338456, respectively. The first three genes belong to the list of differentially expressed genes and the last gene is not differentially expressed.
Figure 8 (top) shows the tangent vectors associated with the 1319066 gene, attached at each sample point. This vector field reveals higher expression towards the FL cluster, as expected from the references mentioned above. This gene is mainly represented by the second principal component. Figure 8 (bottom) shows the expression profile of the 1319066 gene. We can observe that the 1319066 gene is up-regulated in the FL class. This profile agrees with our procedure, because the direction in which the expression of the 1319066 gene increases points to the FL cluster.
The 1352822 gene (Figure 9) is up-regulated in the B-CLL class. Its profile agrees with our procedure, because the direction in which the expression of the 1352822 gene increases points to the B-CLL cluster.
Figure 10 (top) shows the tangent vectors associated with the 1338456 gene, attached at each sample point. This vector field shows no preferred direction towards any of the three cell groups.
The arrows are short and variable in direction in comparison with the genes shown in the previous figures. Figure 10 (bottom) shows a flat expression profile of the 1338456 gene. This profile agrees with our procedure, because no direction of expression of the 1338456 gene is highlighted.

Conclusion
In this paper we propose an exploratory method based on Kernel PCA for elucidating relationships between samples (microarrays) and variables (genes). Our approach has two main properties: the extraction of nonlinear features and the preservation of the input variables (genes) in the output display. The method described here is easy to implement and facilitates the identification of genes which have similar or reversed profiles. Our results indicate that enriching KPCA with supplementary input variable information is complementary to other tools currently used for finding gene expression profiles, with the advantage that it can capture the nonlinear nature that microarray data usually have.

Figure 2 (top) shows the tangent vectors associated with the X76223_s_at gene, attached at each sample point. This vector field reveals higher expression towards the T-cell cluster, as expected from the references mentioned above. This gene is well represented by the second principal component. The length of the arrows indicates the strength of the influence of the gene on the sample position despite the dimension reduction. Figure 2 (bottom) shows the expression profile of the X76223_s_at gene.
Fig. 2. (Top) Kernel PCA of the leukemia data set and tangent vectors associated with the X76223_s_at gene at each sample point. The vector field reveals higher expression towards the T-cell cluster. (Bottom) The expression profile of the X76223_s_at gene confirms the KPCA plot enriched with the tangent vector representation.

Figure 7 (top) shows the tangent vectors associated with the 139009 gene, attached at each sample point. This vector field reveals higher expression towards the DLCL cluster, as expected from the references mentioned above. This gene is mainly represented by the first principal component. The length of the arrows indicates the strength of the influence of the gene on the sample position despite the dimension reduction. Figure 7 (bottom) shows the expression profile of the 139009 gene. We can observe that the 139009 gene is up-regulated in the DLCL cluster. This profile agrees with our procedure, because the direction in which the expression of the 139009 gene increases points to the DLCL cluster.

Fig. 5. (Top) Kernel PCA of the leukemia data set and tangent vectors associated with the D50857_at gene at each sample point. The vector field shows no preferred direction. (Bottom) The flat expression profile of the D50857_at gene confirms the KPCA plot enriched with the tangent vector representation.

Figure 9 (top) shows the tangent vectors associated with the 1352822 gene, attached at each sample point. This vector field reveals higher expression towards the B-CLL cluster, as expected from the references mentioned above. Figure 9 (bottom) shows the expression profile of the 1352822 gene. We can observe that the 1352822 gene is up-regulated in the B-CLL class. This profile agrees with our procedure, because the direction in which the expression of the 1352822 gene increases points to the B-CLL cluster.
Fig. 7. (Top) Kernel PCA of the lymphoma data set and tangent vectors associated with the 139009 gene at each sample point. The vector field reveals higher expression towards the DLCL cluster. (Bottom) The expression profile of the 139009 gene confirms the KPCA plot enriched with the tangent vector representation.