
## 1. Introduction

A high level of image content analysis is required for many applications, and its importance is growing as the number of stored digital images increases exponentially. Technology should help store these images and should also enable newer algorithmic models aimed at efficient and quick retrieval of images. Not all of the captured data may be relevant to a given application, so deriving a subset of the data that still meets the objective is desirable.

Face detection and recognition are preliminary steps to a wide range of applications such as personal identity verification, video surveillance, facial expression extraction, gender classification, and advanced human-computer interaction. A face recognition system would allow a user to be identified simply by walking past a surveillance camera. Research has been devoted to face recognition for years and has produced algorithms that attempt to be as accurate as humans are.

A face recognition system is expected to identify faces present in images and videos automatically. It can operate in either or both of two modes:

Face verification or authentication (see the figure above)

Face identification or recognition.

Face verification involves a one-to-one match that compares a query face image against a template face image whose identity is being claimed. Face identification involves one-to-many matches that compare a query face image against all the template images in the database to determine the identity of the query face. Another face recognition scenario involves a watch-list check, where a query face is matched to a list of suspects (one-to-few matches). As per Hietmeyer, face recognition is one of the most effective biometric techniques for travel documents and scored higher on several evaluation parameters.

Computational models of face recognition must address several difficult problems. The difficulty arises from the fact that faces must be represented in a way that best utilizes the available face information to distinguish a particular face from all other faces. The problem of dimensionality reduction arises in face recognition because an m × n face image is reshaped into a column vector of mn components for computational purposes. As the number of images in the data set increases, the complexity of representing the data set increases. Analysis with a large number of variables generally consumes a large amount of memory and computation power.
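The reshaping step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `images_to_data_matrix`, the image size (112 × 92) and the random pixel values are all assumptions chosen for the example.

```python
import numpy as np

def images_to_data_matrix(images):
    """Stack m x n images as the columns of an (m*n) x N data matrix."""
    columns = [np.asarray(img, dtype=float).reshape(-1) for img in images]
    return np.stack(columns, axis=1)

rng = np.random.default_rng(0)
m, n, num_images = 112, 92, 10            # hypothetical image size and count
images = [rng.random((m, n)) for _ in range(num_images)]

A = images_to_data_matrix(images)
print(A.shape)                            # each column has m*n components
```

Even for these small images every face becomes a vector with more than ten thousand components, which is why the later sections work in a reduced subspace.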

## 2. Dimensionality reduction

Efforts are under way to store and retrieve images efficiently. Face recognition has progressed considerably with newer models, especially with the development of powerful models of face appearance. These models represent faces as points in high-dimensional image spaces and employ dimensionality reduction to find a more meaningful representation, thereby addressing the "curse of dimensionality". Dimension reduction is the process of reducing the number of variables under observation. The need for dimension reduction arises when there is a large number of univariate data points or when the data points themselves are observations of a high-dimensional variable. The key observation is that although face images can be regarded as points in a high-dimensional space, they often lie on a manifold (i.e., subspace) of much lower dimensionality, embedded in the high-dimensional image space. The main issue is how to properly define and determine a low-dimensional subspace of face appearance in a high-dimensional image space.

Dimensionality reduction techniques using linear transformations have been very popular in determining the intrinsic dimensionality of the manifold as well as extracting its principal directions. Dimensionality reduction is an effective approach to downsizing data. In statistics, dimension reduction is the process of reducing the number of random variables under consideration, R^{N}→R^{M} (M<N) and can be divided into feature selection and feature extraction.

Feature selection chooses a subset of the existing features:

$$[x_1, x_2, \ldots, x_n] \xrightarrow{\text{feature selection}} [x_{i_1}, x_{i_2}, \ldots, x_{i_m}]$$

Feature extraction creates new features from the existing ones:

$$[x_1, x_2, \ldots, x_n] \xrightarrow{\text{feature extraction}} [y_1, y_2, \ldots, y_m]$$

In either case, the goal is to find a low-dimensional representation of the data while still describing the data with sufficient accuracy.
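The difference between the two approaches can be illustrated with a small sketch. The data are synthetic, the variance-based selection rule is only one possible (assumed) selection criterion, and the extraction step uses a centered SVD as a stand-in for any linear feature extractor.

```python
import numpy as np

rng = np.random.default_rng(0)
scales = np.array([3.0, 0.1, 2.0, 0.2, 1.0])      # hypothetical feature scales
X = rng.standard_normal((100, 5)) * scales         # rows: samples, columns: features
m_keep = 2

# Feature selection: keep m of the n original columns unchanged
# (here, the columns with the largest sample variance).
selected_idx = np.sort(np.argsort(X.var(axis=0))[-m_keep:])
X_selected = X[:, selected_idx]

# Feature extraction: build m new features, each a linear combination
# of all n original columns (here, the leading right singular vectors).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_extracted = Xc @ Vt[:m_keep].T

print(selected_idx, X_selected.shape, X_extracted.shape)
```

Selected features remain directly interpretable as original variables, while extracted features generally mix all of them.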

For reasons of computational and conceptual simplicity, the representation is often sought as a linear transformation of the original data. In other words, each component of the representation is a linear combination of the original variables. Well-known linear transformation methods include principal component analysis, factor analysis, and projection pursuit. Independent component analysis (ICA) is a recently developed method in which the goal is to find a linear representation of nongaussian data so that the components are statistically independent, or as independent as possible. Such a representation seems to capture the essential structure of the data in many applications, including feature extraction and signal separation.

Several techniques exist to tackle the curse of dimensionality; some are linear methods and others are nonlinear. PCA, LDA, and LPP are popular linear methods, and nonlinear methods include ISOMAP and Eigenmaps. PCA and LDA are the two most widely used subspace learning techniques for face recognition. These methods project the training sample faces to a low-dimensional representation space where the recognition is carried out. The main supposition behind this procedure is that the face space (given by the feature vectors) has a lower dimension than the image space (given by the number of pixels in the image), and that the recognition of the faces can be performed in this reduced space. PCA has the advantage of capturing holistic features but ignores localized features. Fisherfaces, obtained from the LDA technique, extract features that discriminate between classes and are found to perform better for large data sets. Their shortcoming is the small sample size (SSS) problem. LPPs are linear projective maps that arise by solving a variational problem that optimally preserves the neighborhood structure of the data set.

In many cases, face images may be visualized as points drawn on a low-dimensional manifold hidden in a high-dimensional ambient space. Specifically, consider a sheet of rubber that is crumpled into a (high-dimensional) ball. The objective of a dimensionality-reducing mapping is to unfold the sheet and to make its low-dimensional structure explicit. If the sheet is not torn in the process, the mapping is topology-preserving. Moreover, if the rubber is not stretched or compressed, the mapping preserves the metric structure of the original space.

When the manifold is linearly embedded, PCA is guaranteed to discover its dimensionality and produces a compact representation. Turk and Pentland use Principal Component Analysis to describe face images in terms of a set of basis functions, or "eigenfaces". LDA is a supervised learning algorithm. LDA searches for the projection axes on which the data points of different classes are far from each other while requiring data points of the same class to be close to each other. Unlike PCA, which encodes information in an orthogonal linear space, LDA encodes discriminating information in a linearly separable space using bases that are not necessarily orthogonal. It is generally believed that algorithms based on LDA are superior to those based on PCA. However, some recent work shows that, when the training dataset is small, PCA can outperform LDA, and also that PCA is less sensitive to different training datasets.

Recently, a number of research efforts have shown that the face images possibly reside on a nonlinear submanifold. However, both PCA and LDA effectively see only the Euclidean structure. They fail to discover the underlying structure, if the face images lie on a nonlinear submanifold hidden in the image space. Some nonlinear techniques have been proposed to discover the nonlinear structure of the manifold, *e.g.* Isomap, LLE and Laplacian Eigenmap. These nonlinear methods do yield impressive results on some benchmark artificial data sets. However, they yield maps that are defined *only* on the training data points and how to evaluate the maps on novel test data points remains unclear.

## 3. Singular Value Decomposition (SVD)

Singular value decomposition (SVD) is an important factorization of a rectangular real or complex matrix, with many applications in signal processing and statistics. As applied to face recognition, this technique is used to extract the holistic global features of the training set. SVD is the best linear dimension reduction technique in the mean-square error sense. Being based on the covariance matrix of the variables, it is a second-order method. SVD seeks to reduce the dimension of the data by finding a few orthogonal linear combinations of the original variables with the largest variance.

The basic idea behind SVD is taking a high dimensional, highly variable set of data points and reducing it to a lower dimensional space that exposes the substructure of the original data more clearly and orders it from most variation to the least. What makes SVD practical for pattern recognition applications is that one can simply ignore variation below a particular threshold to massively reduce the data but be assured that the main relationships of interest have been preserved.

Singular value decomposition (SVD) can be looked at from three mutually compatible points of view. On the one hand, we can see it as a method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items. At the same time, SVD is a method for identifying and ordering the dimensions along which data points exhibit the most variation. This ties into the third way of viewing SVD, which is that once we have identified where the most variation is, it's possible to find the best approximation of the original data points using fewer dimensions. Hence, SVD can be seen as a method for data reduction.

As noted earlier, singular value decomposition is a way of factoring matrices into a series of linear approximations that expose the underlying structure of the matrix. If *A* is the input matrix, calculating the SVD consists of finding the eigenvalues and eigenvectors of *AA*^{T} and *A*^{T}*A*. This yields three matrices *U*, *V*, and *S* with *A* = *USV*^{T}, where the eigenvectors of *A*^{T}*A* make up the columns of *V*, the eigenvectors of *AA*^{T} make up the columns of *U*, and the singular values in *S* are the square roots of the nonzero eigenvalues of *AA*^{T} or *A*^{T}*A*. The singular values are the diagonal entries of the *S* matrix and are arranged in descending order. The singular values are always real numbers. If the matrix *A* is a real matrix, then *U* and *V* are also real.
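The relationship between the SVD and the two eigenvalue problems can be checked numerically. The sketch below uses a small random matrix as an assumed stand-in for a face data matrix and relies only on `numpy.linalg.svd` and `numpy.linalg.eigvalsh`.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))                   # hypothetical input matrix

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

# Eigenvalues of A^T A, sorted in descending order for comparison.
eig_AtA = np.linalg.eigvalsh(A.T @ A)[::-1]

reconstructs = np.allclose(A, U @ np.diag(s) @ Vt)
squares_match = np.allclose(s**2, eig_AtA)
V_are_eigvecs = np.allclose(A.T @ A @ V, V * s**2)     # columns of V
U_are_eigvecs = np.allclose(A @ A.T @ U, U * s**2)     # columns of U
print(reconstructs, squares_match, V_are_eigvecs, U_are_eigvecs)
```

In practice the SVD routine is usually preferred over forming *A*^{T}*A* explicitly, since squaring the matrix also squares its condition number.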

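In mathematical terms, the first principal component (PC), $s_1$, is the linear combination of the original variables with the largest variance. We have

$$s_1 = x^T w_1, \qquad w_1 = \arg\max_{\|w\| = 1} \operatorname{Var}\left(x^T w\right),$$

where $x$ is the vector of original variables and $w_1$ is the coefficient vector.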

The second PC is the linear combination with the second largest variance that is orthogonal to the first PC, and so on. There are as many PCs as there are original variables. For many data sets, the first several PCs explain most of the variance, so that the rest can be disregarded with minimal loss of information. Since the variance depends on the scale of the variables, it is customary to first standardize each variable to have mean zero and standard deviation one. After the standardization, the original variables with possibly different units of measurement are all in comparable units.
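A minimal sketch of this procedure follows: standardize the variables, take the SVD, and keep only as many PCs as are needed to reach a chosen fraction of the total variance. The function name `pca_by_svd`, the 95% threshold and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def pca_by_svd(X, variance_to_keep=0.95):
    """PCA of a samples-by-variables matrix via SVD of the standardized data."""
    # Standardize each variable to zero mean and unit standard deviation.
    # (Assumes no variable is constant; otherwise the division fails.)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)         # variance fraction per PC, descending
    cumulative = np.cumsum(explained)
    # Smallest k whose cumulative variance reaches the threshold.
    k = min(int(np.searchsorted(cumulative, variance_to_keep)) + 1, len(s))
    scores = Z @ Vt[:k].T                   # data expressed in the first k PCs
    return scores, Vt[:k], explained

rng = np.random.default_rng(0)
latent = rng.standard_normal((200, 2))      # two hidden factors
X = latent @ rng.standard_normal((2, 6)) + 0.05 * rng.standard_normal((200, 6))

scores, components, explained = pca_by_svd(X)
print(scores.shape, np.round(explained, 3))
```

The threshold plays the role of the cut-off discussed earlier: variation below it is ignored, and the retained PCs carry the main relationships in the data.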

The mathematical model formulated is given below:

Let $A$ be an $m' \times n'$ real matrix and $N = A^T A$.

Let $\mathcal{R}(\cdot)$ denote the range space and $\mathcal{N}(\cdot)$ the null space of a matrix. The matrices $A$, $A^T$, $A^T A$, and $A A^T$ all have the same rank, denoted by $\rho$. Orthonormal bases $v_i$, $1 \le i \le \rho$, for $\mathcal{R}(A^T)$ and $u_i$, $1 \le i \le \rho$, for $\mathcal{R}(A)$ are sought such that

$$A v_i = s_i u_i, \tag{2}$$

$$A^T u_i = s_i v_i. \tag{3}$$

The advantages of having such a basis are that the geometry becomes easy and that it gives a decomposition of $A$ into $\rho$ rank-one matrices, $A = \sum_{i=1}^{\rho} s_i u_i v_i^T$. Combining the equations (Eq. 2) & (Eq. 3) gives

$$A^T A v_i = s_i^2 v_i.$$

If $v_j$ is known and $s_j \neq 0$, then, choosing $s_j > 0$,

$$u_j = \frac{1}{s_j} A v_j.$$

Thus the $v_i$ are found as orthonormal eigenvectors of $N = A^T A$, and the $u_j$ follow from $u_j = A v_j / s_j$. The resulting $u_i$ span the eigen subspace. When SVD is applied to the sample set shown in figure 3, the corresponding eigenfaces obtained are shown in figure 4. The figure highlights the holistic features of the given sample set.

Basis selection from SVD

If $A$ is the face space, then $x$ vectors are drawn from $[X_1, \ldots, X_x] = \Pi_{\langle 1..x \rangle}(A^{-1} U D)$, where $U$ and $D$ are the unitary and diagonal matrices of the SVD of $A$.

## 5. Linear Discriminant Analysis

The Fisher Linear Discriminant, also referred to as Linear Discriminant Analysis, is a classical pattern recognition method introduced by Fisher (1936). It is a very effective feature extraction method but suffers from the small sample size (SSS) problem.

The dimensionality reduction technique SVD searches for the directions in the data that have the largest variance and subsequently projects the data onto them. In this way, one can obtain a lower-dimensional representation of the data that removes some of the "noisy" directions. Deciding how many directions to keep is a difficult issue. SVD is an unsupervised technique and as such does not use the label information of the data. For instance, imagine 2 clusters in 2 dimensions, where one cluster has *y* = 1 and the other *y* = −1. The clusters are positioned in parallel and very close together, such that the variance of the total data set, ignoring the labels, is in the direction of the clusters. For classification, this would be a terrible projection, because all labels get evenly mixed and the useful information is destroyed.

A much more useful projection is orthogonal to the clusters, i.e. in the direction of least overall variance, which would perfectly separate the data-cases (obviously, we would still need to perform classification in this 1-D space).
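The two-cluster example can be reproduced with a short sketch. The cluster geometry (spread of 5 along the clusters, 0.1 across them, centers at ±0.5) and the helper `threshold_accuracy` are assumptions chosen to make the effect visible; they are not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
labels = np.repeat([1, -1], n)

# Two long, thin, parallel clusters: large spread along the first axis,
# small spread across it, separated only in the second axis.
along = rng.normal(0.0, 5.0, size=2 * n)
across = rng.normal(0.0, 0.1, size=2 * n) + 0.5 * labels
X = np.column_stack([along, across])

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
max_var_dir, min_var_dir = Vt[0], Vt[-1]

def threshold_accuracy(direction):
    """Accuracy of thresholding the 1-D projection at zero (sign-agnostic)."""
    predicted = np.where(Xc @ direction > 0, 1, -1)
    acc = np.mean(predicted == labels)
    return max(acc, 1.0 - acc)

acc_max_var = threshold_accuracy(max_var_dir)
acc_min_var = threshold_accuracy(min_var_dir)
print(acc_max_var, acc_min_var)
```

Projecting onto the direction of largest variance leaves the labels close to evenly mixed, while the direction of least variance separates them almost perfectly, which is the motivation for using label information.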

The conventional remedy for misclassification caused by the small sample size problem, and by large data sets with similar faces, is to combine PCA with LDA, i.e. Fisherfaces. PCA is used for dimensionality reduction and then LDA is performed on the lower-dimensional space. Discriminant analysis often produces models whose accuracy approaches that of complex modern methods. The target variable may have two or more categories. Figure 5 shows a plot of the two categories with the two predictors on orthogonal axes:

A transformation function is found that maximizes the ratio of between-class variance to within-class variance as illustrated by this figure 6 produced by Ludwig Schwardt and Johan du Preez:

The transformation seeks to rotate the axes so that when the categories are projected on the new axes, the differences between the groups are maximized. So the question is, how do we utilize the label information in finding informative projections?

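To that purpose, Fisher-LDA considers maximizing the following objective:

$$J(w) = \frac{w^T S_B w}{w^T S_W w},$$

where $S_B$ is the between-class scatter matrix and $S_W$ is the within-class scatter matrix.

The second use of the term LDA refers to a discriminative feature transform that is optimal for certain cases [10]. This is what we denote by LDA throughout this paper. In the basic formulation, LDA finds the eigenvectors of the matrix

$$T = S_W^{-1} S_B.$$

The eigenvectors corresponding to the largest eigenvalues of $T$ form the columns of the transformation matrix $W$, and new discriminative features $y$ are derived from the original features $x$ by $y = W^T x$.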

The straightforward algebraic way of deriving the LDA transform matrix is both a strength and a weakness of the method. Since LDA makes use of only second-order statistical information, covariances, it is optimal for data where each class has a unimodal Gaussian density with well separated means and similar covariances. Large deviations from these assumptions may result in sub-optimal features.

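Also, the maximum rank of the between-class scatter matrix is the number of classes minus one, so LDA cannot produce more discriminative features than that.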
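A minimal sketch of the basic LDA transform follows, assuming the standard definitions of the within-class and between-class scatter matrices and the eigenvectors of $S_W^{-1} S_B$. The function name `fisher_lda` and the three-class synthetic data are illustrative assumptions. The example also shows the rank limit: with three classes, only two eigenvalues are non-zero.

```python
import numpy as np

def fisher_lda(X, y, n_components):
    """Leading eigenvectors of S_W^{-1} S_B for samples in the rows of X."""
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        centered = Xc - class_mean
        S_W += centered.T @ centered                       # within-class scatter
        diff = (class_mean - overall_mean)[:, None]
        S_B += len(Xc) * (diff @ diff.T)                   # between-class scatter
    # Assumes S_W is nonsingular (enough samples per dimension).
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1][:n_components]
    return eigvecs.real[:, order], eigvals.real[order]

rng = np.random.default_rng(0)
means = np.array([[0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0]], dtype=float)
y = np.repeat(np.arange(3), 50)
X = means[y] + rng.standard_normal((150, 4))

W, eigvals = fisher_lda(X, y, n_components=4)
features = X @ W[:, :2]            # only the first two directions are informative
print(np.round(eigvals, 6))
```

When the number of samples is smaller than the dimensionality, `S_W` becomes singular and the `solve` call fails, which is the small sample size problem discussed in this section.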

However, the classification performance of traditional LDA is often degraded by the fact that its separability criteria are not directly related to classification accuracy in the output space. A solution to the problem is to introduce weighting functions into LDA. Object classes that are closer together in the output space, and thus can potentially result in misclassification, should be more heavily weighted in the input space. This idea has been further extended with the introduction of the fractional-step linear discriminant analysis algorithm (F-LDA), where the dimensionality reduction is implemented in a few small fractional steps, allowing the relevant distances to be more accurately weighted. Although the method can be applied to low-dimensional patterns, it cannot be directly applied to high-dimensional patterns, such as face images, due to two factors: (1) the computational difficulty of the eigen-decomposition of matrices in the high-dimensional image space; (2) the degenerate scatter matrices caused by the small sample size, which is common in FR tasks where the number of training samples is smaller than the dimensionality of the samples.

The traditional solution to the SSS problem requires the incorporation of a PCA step into the LDA framework. In this approach, PCA is used as a pre-processing step for dimensionality reduction so as to discard the null space of the within-class scatter matrix of the training data set. Then LDA is performed in the lower dimensional PCA subspace. However, it has been shown that the discarded null space may contain significant discriminatory information. To prevent this from happening, solutions without a separate PCA step, called direct LDA (D-LDA) methods have been presented recently. In the D-LDA framework, data are processed directly in the original high-dimensional input space avoiding the loss of significant discriminatory information due to the PCA pre-processing step.

First, the dimensionality of the original input space is lowered by introducing a new variant of D-LDA that results in a low-dimensional SSS-free subspace where the most discriminatory features are preserved. The variant of D-LDA utilizes a modified Fisher's criterion to avoid a problem resulting from the usage of the zero eigenvalues of the within-class scatter matrix as possible divisors. Also, a weighting function is introduced into the variant of D-LDA, so that a subsequent F-LDA step can be applied to carefully re-orient the SSS-free subspace, resulting in a set of optimal discriminant features for face representation.

DF-LDA is a linear pattern recognition method. Compared with nonlinear models, a linear model is rather robust against noise and is less likely to overfit. Although it has been shown that the distribution of face patterns is highly non-convex and complex in most cases, linear methods are still able to provide cost-effective solutions to FR tasks through integration with other strategies, such as the principle of "divide and conquer", in which a large and nonlinear problem is divided into a few smaller and local linear subproblems.

Let

The maximization process in (3) is not directly linked to the classification error which is the criterion of performance used to measure the success of the Face Recognition procedure. Thus, the weighted between-class scatter matrix can be expressed as:

where $\bar{z}_i$ is the mean of class $i$ and $d_{ij}$ is the Euclidean distance between the means of classes $i$ and $j$.

Basis selection from DF-LDA

The set of $y$ vectors is chosen by the equation $[y_1, \ldots, y_y] = \Pi_{\langle 1..y \rangle}(U^T S_{TOT} U^T)$, where $S_{TOT}$ is the sum of the between-class and within-class scatter matrices and $U$ is a diagonal matrix obtained from the eigenvalues and eigenvectors. Fisherfaces are shown in figure 7 below.

The Fisherfaces clearly highlight pronounced features, such as the hair and eyebrows, more strongly than the rest of the face.

## 6. Locality preserving projections

Unlike Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), which effectively see only the Euclidean structure of face space, LPP finds an embedding that preserves local information and obtains a face subspace that best detects the essential face manifold structure. The Laplacianfaces are the optimal linear approximations to the eigenfunctions of the Laplace-Beltrami operator on the face manifold. In this way, the unwanted variations resulting from changes in lighting, facial expression, and pose may be eliminated or reduced. Theoretical analysis shows that PCA, LDA and LPP can be obtained from different graph models. By using Locality Preserving Projections (LPP), the face images are mapped into a face subspace for analysis.

LPP shares many of the data representation properties of nonlinear techniques such as Laplacian Eigenmaps or Locally Linear Embedding. Yet LPP is linear and, more crucially, is defined everywhere in the ambient space rather than just on the training data points. It builds a graph incorporating neighborhood information of the data set. Using the notion of the Laplacian of the graph, a transformation matrix is computed that maps the data points to a subspace.

This linear transformation optimally preserves local neighborhood information in a certain sense. The representation map generated by the algorithm may be viewed as a linear discrete approximation to a continuous map that naturally arises from the geometry of the manifold. In the meantime, there has been some interest in the problem of developing low dimensional representations through kernel based techniques for face recognition. These methods can discover the nonlinear structure of the face images. However, they are computationally expensive. Moreover, none of them explicitly considers the structure of the manifold on which the face images possibly reside.

While the Eigenfaces method aims to preserve the global structure of the image space, and the Fisherfaces method aims to preserve the discriminating information, the Laplacianfaces method aims to preserve the local structure of the image space. In many real-world classification problems, the local manifold structure is more important than the global Euclidean structure, especially when nearest-neighbor-like classifiers are used for classification.

LPP seems to have discriminating power although it is unsupervised. An efficient subspace learning algorithm for face recognition should be able to discover the nonlinear manifold structure of the face space. LPP shares some similar properties to LLE, such as a locality preserving character. However, their objective functions are totally different. LPP is obtained by finding the optimal linear approximations to the eigen functions of the Laplace Beltrami operator on the manifold. LPP is linear, while LLE is nonlinear. Moreover, LPP is defined everywhere, while LLE is defined only on the training data points and it is unclear how to evaluate the maps for new test points. In contrast, LPP may be simply applied to any new data point to locate it in the reduced representation space. LPP seeks to preserve the intrinsic geometry of the data and local structure.

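The objective function of LPP is as follows:

$$\min \sum_{ij} (y_i - y_j)^2 S_{ij},$$

where $y_i$ is the one-dimensional representation of $x_i$ and $S$ is a symmetric similarity (weight) matrix whose entry $S_{ij}$ is large when $x_i$ and $x_j$ are close.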

## 7. Statistical view of LPP

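LPP can also be obtained from a statistical viewpoint. Suppose the data points follow some underlying distribution, and let $z = x - y$ denote the difference between two points $x$ and $y$ drawn from it. Let $d$ be the number of non-zero $S_{ij}$, and $D$ be a diagonal matrix whose entries are column (or row, since $S$ is symmetric) sums of $S$, $D_{ii} = \sum_j S_{ji}$. By the Strong Law of Large Numbers, $E(zz^T \mid \|z\| < \varepsilon)$ can be estimated from the sample points as follows, taking $S_{ij} = 1$ when $\|x_i - x_j\| < \varepsilon$ and $S_{ij} = 0$ otherwise:

$$E\left(zz^T \mid \|z\| < \varepsilon\right) \approx \frac{1}{d} \sum_{\|x_i - x_j\| < \varepsilon} (x_i - x_j)(x_i - x_j)^T = \frac{1}{d} \sum_{i,j} (x_i - x_j)(x_i - x_j)^T S_{ij} = \frac{2}{d}\left(X D X^T - X S X^T\right) = \frac{2}{d} X L X^T,$$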

where *L = D – S* is the Laplacian matrix. The *ith* column of matrix *X* is x*i*.
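The estimate can be checked numerically on a small synthetic sample. The sketch assumes indicator weights (*S*_{ij} = 1 for distinct points closer than ε, 0 otherwise), a hypothetical ε = 2, and random three-dimensional points; it compares the pairwise sum of outer products with the Laplacian form.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, eps = 8, 2.0                              # hypothetical sample size and radius
X = rng.standard_normal((3, n_points))              # i-th column is x_i

diff = X[:, :, None] - X[:, None, :]                # diff[:, i, j] = x_i - x_j
dist = np.linalg.norm(diff, axis=0)
S = ((dist < eps) & ~np.eye(n_points, dtype=bool)).astype(float)
D = np.diag(S.sum(axis=0))
L = D - S
d = np.count_nonzero(S)                             # number of non-zero S_ij

# Average of (x_i - x_j)(x_i - x_j)^T over all neighboring pairs.
pair_sum = np.einsum("ij,aij,bij->ab", S, diff, diff)
estimate_from_pairs = pair_sum / d

# The same quantity written with the graph Laplacian.
estimate_from_laplacian = (2.0 / d) * (X @ L @ X.T)
print(np.allclose(estimate_from_pairs, estimate_from_laplacian))
```

The Laplacian form avoids the explicit double loop over pairs, which matters once the number of samples grows.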

## 8. Theoretical analysis of LPP, PCA AND LDA

In this section, we present a theoretical analysis of LPP and its connections to PCA and LDA.

### 8.1. Connections to PCA

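It is worthwhile to point out that $XLX^T$ is the data covariance matrix if the Laplacian matrix $L$ is

$$L = \frac{1}{n} I - \frac{1}{n^2} e e^T,$$

where $n$ is the number of data points, $I$ is the identity matrix and $e$ is a column vector taking 1 at each entry. In fact, the Laplacian matrix here has the effect of removing the sample mean from the sample vectors.

In this case, the weight matrix $S$ takes $1/n^2$ at each entry, i.e.

$$S_{ij} = \frac{1}{n^2} \quad \forall i, j, \qquad D_{ii} = \sum_j S_{ji} = \frac{1}{n}.$$

Hence the Laplacian matrix is

$$L = D - S = \frac{1}{n} I - \frac{1}{n^2} e e^T.$$

Let $m$ denote the sample mean, i.e.

$$m = \frac{1}{n} \sum_{i} x_i;$$

we have

$$XLX^T = \frac{1}{n} X \left(I - \frac{1}{n} e e^T\right) X^T = \frac{1}{n} \sum_i x_i x_i^T - m m^T = \frac{1}{n} \sum_i (x_i - m)(x_i - m)^T = E\left[(x - m)(x - m)^T\right],$$

where $E\left[(x - m)(x - m)^T\right]$ is the covariance matrix of the data set.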
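This connection is easy to confirm numerically. The sketch below builds the constant weight matrix with entries 1/*n*², forms *L* = *D* − *S*, and compares *XLX*^{T} with the sample covariance; the data are synthetic and deliberately given a non-zero mean so that the mean-removal effect is visible.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 4, 50
X = rng.standard_normal((dim, n)) + 3.0       # i-th column is x_i, non-zero mean

S = np.full((n, n), 1.0 / n**2)               # every weight equals 1/n^2
D = np.diag(S.sum(axis=0))                    # D_ii = 1/n
L = D - S

e = np.ones((n, 1))
closed_form_L = np.eye(n) / n - (e @ e.T) / n**2

# np.cov treats rows as variables; bias=True divides by n instead of n - 1.
covariance = np.cov(X, bias=True)
print(np.allclose(L, closed_form_L), np.allclose(X @ L @ X.T, covariance))
```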

The above analysis shows that the weight matrix *S* plays a key role in the LPP algorithm. When we aim at preserving the global structure, we take ε (or *k*) to be infinity and choose the eigenvectors (of the matrix *XLX*^{T}) associated with the largest eigenvalues. Hence the data points are projected along the directions of maximal variance. When we aim at preserving the local structure, we take ε to be sufficiently small and choose the eigenvectors associated with the smallest eigenvalues.

Hence the data points are projected along the directions preserving locality. It is important to note that, when ε (or *k*) is sufficiently small, the Laplacian matrix is no longer the data covariance matrix, and hence the directions preserving locality are not the directions of minimal variance. In fact, the directions preserving locality are those minimizing *local* variance.

### 8.2. Connections to LDA

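LDA seeks directions that are efficient for discrimination. The projection is found by solving the generalized eigenvalue problem

$$S_B w = \lambda S_W w,$$

where $S_B$ and $S_W$ are the between-class and within-class scatter matrices. Suppose there are $l$ classes. The $i$th class contains $n_i$ sample points. Let $m^{(i)}$ denote the average vector of the $i$th class. Let $x^{(i)}$ denote the random vector associated with the $i$th class and $x_j^{(i)}$ denote the $j$th sample point in the $i$th class. We can rewrite the matrix $S_W$ as follows:

$$S_W = \sum_{i=1}^{l} \sum_{j=1}^{n_i} \left(x_j^{(i)} - m^{(i)}\right)\left(x_j^{(i)} - m^{(i)}\right)^T = \sum_{i=1}^{l} \left(X_i X_i^T - \frac{1}{n_i} X_i e_i e_i^T X_i^T\right) = \sum_{i=1}^{l} X_i L_i X_i^T,$$

where $X_i = [x_1^{(i)}, \ldots, x_{n_i}^{(i)}]$, $L_i = I - \frac{1}{n_i} e_i e_i^T$, $I$ is the identity matrix and $e_i = (1, 1, \ldots, 1)^T$ is an $n_i$-dimensional vector.

To further simplify the above equation, we define

$$X = [x_1, x_2, \ldots, x_n], \qquad W_{ij} = \begin{cases} 1/n_k & \text{if } x_i \text{ and } x_j \text{ both belong to the } k\text{th class}, \\ 0 & \text{otherwise}, \end{cases} \qquad L = I - W,$$

so that $S_W = X L X^T$.

It is interesting to note that we could regard the matrix $W$ as the weight matrix of a graph with data points as its nodes. Specifically, $W_{ij}$ is the weight of the edge $(x_i, x_j)$. $W$ reflects the class relationships of the data points. The matrix $L$ is thus called the *graph Laplacian*, which plays a key role in LPP.

Similarly, we can compute the matrix $S_B$ as follows:

$$S_B = \sum_{i=1}^{l} n_i \left(m^{(i)} - m\right)\left(m^{(i)} - m\right)^T = -X L X^T + X\left(I - \frac{1}{n} e e^T\right) X^T = -X L X^T + C,$$

where $m$ is the mean of all samples, $e = (1, 1, \ldots, 1)^T$ is an $n$-dimensional vector and $C = X\left(I - \frac{1}{n} e e^T\right) X^T$ is the data covariance matrix (up to the factor $1/n$).

Thus, the generalized eigenvector problem of LDA can be written as follows:

$$S_B w = \lambda S_W w \;\Rightarrow\; \left(C - X L X^T\right) w = \lambda X L X^T w \;\Rightarrow\; C w = (1 + \lambda) X L X^T w \;\Rightarrow\; X L X^T w = \frac{1}{1 + \lambda} C w.$$

Thus, the projections of LDA can be obtained by solving the following generalized eigenvalue problem,

$$X L X^T w = \lambda C w.$$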

The optimal projections correspond to the eigenvectors associated with the smallest eigenvalues. If the sample mean of the data set is zero, the covariance matrix is simply *XX*^{T} which is close to the matrix *XDX*^{T} in the LPP algorithm. Our analysis shows that LDA actually aims to preserve discriminating information and global geometrical structure. Moreover, LDA has a similar form to LPP. However, LDA is supervised while LPP can be performed in either supervised or unsupervised manner.
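The graph view of LDA can be verified on a toy data set. The sketch assumes class-membership weights *W*_{ij} = 1/*n*_{k} when samples *i* and *j* both belong to class *k* (and 0 otherwise), with *L* = *I* − *W*, and checks that the scatter matrices computed directly from their definitions agree with the graph-Laplacian forms. Class sizes and data are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], [5, 7, 4])          # three classes of unequal size
n = labels.size
X = rng.standard_normal((3, n)) + labels          # columns are samples

# Class-membership weight matrix and its graph Laplacian.
same_class = labels[:, None] == labels[None, :]
class_sizes = np.bincount(labels)
W = same_class / class_sizes[labels][:, None]
L = np.eye(n) - W

# Scatter matrices computed directly from their definitions.
overall_mean = X.mean(axis=1, keepdims=True)
S_W = np.zeros((3, 3))
S_B = np.zeros((3, 3))
for k in np.unique(labels):
    Xk = X[:, labels == k]
    class_mean = Xk.mean(axis=1, keepdims=True)
    S_W += (Xk - class_mean) @ (Xk - class_mean).T
    S_B += Xk.shape[1] * (class_mean - overall_mean) @ (class_mean - overall_mean).T

C = X @ (np.eye(n) - np.ones((n, n)) / n) @ X.T   # total scatter
print(np.allclose(S_W, X @ L @ X.T), np.allclose(S_B, C - X @ L @ X.T))
```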

### 8.3. Learning laplacian faces for representation

LPP is a general method for manifold learning. It is obtained by finding the optimal linear approximations to the eigenfunctions of the Laplace-Beltrami operator on the manifold. Therefore, though it is still a linear technique, it seems to recover important aspects of the intrinsic nonlinear manifold structure by preserving local structure. Based on LPP, the Laplacianfaces method for face representation yields a locality preserving subspace. In the face analysis and recognition problem one is confronted with the difficulty that the matrix *XDX*^{T} is sometimes singular. This stems from the fact that sometimes the number of images in the training set (*n*) is much smaller than the number of pixels in each image (*m*). In such a case, the rank of *XDX*^{T} is at most *n*, while *XDX*^{T} is an *m*×*m* matrix, which implies that *XDX*^{T} is singular. To overcome the complication of a singular *XDX*^{T}, we first project the image set to a PCA subspace so that the resulting matrix *XDX*^{T} is nonsingular. Another reason for using PCA as preprocessing is noise reduction. This method, which we call *Laplacianfaces*, can learn an optimal subspace for face representation and recognition.

The algorithmic procedure of Laplacianfaces is formally stated below:

PCA projection: We project the image set {x*i*} into the PCA subspace by throwing away the smallest principal components.

Constructing the nearest-neighbor graph: Let G denote a graph with *n* nodes. The *i*th node corresponds to the face image x_{i}. We put an edge between nodes *i* and *j* if x_{i} and x_{j} are "close", i.e. x_{i} is among the *k* nearest neighbors of x_{j} or x_{j} is among the *k* nearest neighbors of x_{i}. The constructed nearest-neighbor graph is an approximation of the local manifold structure. Note that here we do not use the ε-neighborhood to construct the graph. This is simply because it is often difficult to choose the optimal ε in real-world applications, while the *k* nearest-neighbor graph can be constructed more stably. The disadvantage is that the *k* nearest-neighbor search increases the computational complexity of the algorithm. When computational complexity is a major concern, one can switch to the ε-neighborhood.

Choosing the weights: If nodes *i* and *j* are connected, put

$$S_{ij} = e^{-\|x_i - x_j\|^2 / t},$$

where *t* is a suitable constant. Otherwise, put *S*_{ij} = 0. The weight matrix *S* of graph G models the face manifold structure by preserving local structure.

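Eigenmap: Compute the eigenvectors and eigenvalues for the generalized eigenvector problem:

$$X L X^T w = \lambda X D X^T w,$$

where $D$ is the diagonal matrix whose entries are the column (or row) sums of $S$, and $L = D - S$ is the Laplacian matrix. Let $w_0, w_1, \ldots, w_{k-1}$ be the solutions, ordered according to their eigenvalues in ascending order. The embedding is then $x \to y = W^T x$ with $W = W_{PCA} W_{LPP}$ and $W_{LPP} = [w_0, w_1, \ldots, w_{k-1}]$, where $y$ is a $k$-dimensional vector and $W$ is the transformation matrix. This linear mapping best preserves the manifold's estimated intrinsic geometry in a linear sense. The column vectors of $W$ are the so-called *Laplacianfaces*.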
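The graph construction, weighting and eigenmap steps can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `lpp`, the heat-kernel weights exp(−‖x_i − x_j‖²/t), the parameter defaults, and the small ridge term `reg` (used here in place of the PCA projection step to keep *XDX*^{T} nonsingular) are all choices made for illustration.

```python
import numpy as np

def lpp(X, n_neighbors=5, t=1.0, n_components=2, reg=1e-6):
    """Locality preserving projection for samples stored in the columns of X."""
    dim, n = X.shape
    sq_norms = np.sum(X**2, axis=0)
    dist2 = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X.T @ X, 0.0)

    # Symmetric k-nearest-neighbor graph (a point is not its own neighbor).
    dist2_no_self = dist2.copy()
    np.fill_diagonal(dist2_no_self, np.inf)
    nearest = np.argsort(dist2_no_self, axis=1)[:, :n_neighbors]
    adjacency = np.zeros((n, n), dtype=bool)
    adjacency[np.arange(n)[:, None], nearest] = True
    adjacency |= adjacency.T

    # Heat-kernel weights on connected pairs, zero elsewhere.
    S = np.where(adjacency, np.exp(-dist2 / t), 0.0)
    D = np.diag(S.sum(axis=1))
    L = D - S

    A = X @ L @ X.T
    B = X @ D @ X.T + reg * np.eye(dim)      # ridge keeps B positive definite

    # Solve A w = lambda B w by whitening with B^{-1/2}.
    b_vals, b_vecs = np.linalg.eigh(B)
    B_inv_sqrt = b_vecs @ np.diag(b_vals**-0.5) @ b_vecs.T
    eigvals, eigvecs = np.linalg.eigh(B_inv_sqrt @ A @ B_inv_sqrt)

    # Smallest eigenvalues correspond to the locality preserving directions.
    W = B_inv_sqrt @ eigvecs[:, :n_components]
    return W, eigvals[:n_components]

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 60))             # hypothetical PCA-reduced data
W, eigvals = lpp(X, n_neighbors=5, t=10.0, n_components=2)
Y = W.T @ X                                  # embedded training points
print(W.shape, Y.shape, eigvals)
```

Because the mapping is a matrix, a new sample `x_new` is embedded simply as `W.T @ x_new`, which is the property that distinguishes LPP from embeddings defined only on the training points.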

## 9. Face representation using laplacianfaces

As described previously, a face image can be represented as a point in image space. A typical image of size *m*×*n* describes a point in *m*×*n*-dimensional image space. However, due to the unwanted variations resulting from changes in lighting, facial expression, and pose, the image space might not be an optimal space for visual representation. We have discussed how to learn a locality preserving face subspace that is insensitive to outliers and noise. The images of faces in the training set are used to learn such a locality preserving subspace. The subspace is spanned by a set of eigenvectors of equation (1), i.e. w0, w1, …, wk-1.

The eigenmaps are obtained from the generalized eigenvector problem $A L A^T a = \lambda A D A^T a$, where $D$ is a diagonal matrix whose entries are the column (or row, since $W$ is symmetric) sums of $W$, $D_{ii} = \sum_j W_{ji}$, and $L = D - W$ is the Laplacian matrix, which plays the role of the Laplace-Beltrami operator on the manifold. The $i$th column of the matrix $A$ is $x_i$. Let the column vectors $a_0, \ldots, a_{l-1}$ be the solutions of this equation, ordered according to their eigenvalues in ascending order. The embedding is then as follows: $y_i = E^T x_i$, $E = (a_0, a_1, \ldots, a_{l-1})$, where $y_i$ is an $l$-dimensional vector and $E$ is an $n \times l$ matrix. The $y_i$ represent the Laplacianfaces.

### 9.1. Basis selection from LPP

Locality information can be preserved by the following transformation on $A$, the input face space: $[z_1, \ldots, z_z] = \Pi_{\langle 1..z \rangle}(A^T L A)$, where $L = D - W$ is the Laplacian matrix, $D$ is the diagonal matrix and $W$ is the weight matrix of the K-nearest-neighbor graph.

Basis for the face space is obtained as,

where M is the dimension of the original face space

### 9.2. Projection onto reduced subspace

Each face in the training set, *U*_{i} ∈ B, 1 ≤ *i* ≤ K, such that

## 10. Independent component analysis

Independent component analysis (ICA) is a statistical method, the goal of which is to decompose multivariate data into a linear sum of non-orthogonal basis vectors with coefficients (encoding variables, latent variables, hidden variables) being statistically independent.

ICA generalizes a widely-used subspace analysis method such as principal component analysis (PCA) and factor analysis, allowing latent variables to be non-Gaussian and basis vectors to be non-orthogonal in general. ICA is a density estimation method where a linear model is learned such that the probability distribution of the observed data is best captured, while factor analysis aims at best modeling the covariance structure of the observed data.

The ICA model is a generative model: it describes how the observed data are generated by a process of mixing the components *s*_{i}. The independent components are latent variables, meaning that they cannot be directly observed. The mixing matrix *A* is also assumed to be unknown. All we observe is the random vector *x*, and we must estimate both *A* and *s* from it, under assumptions that are as general as possible. The starting point for ICA is the very simple assumption that the components *s*_{i} are statistically *independent*. We must also assume that the independent components have *non-Gaussian* distributions. However, in the basic model we do *not* assume that these distributions are known (if they are known, the problem is considerably simplified). For simplicity, we also assume that the unknown mixing matrix is square, although this assumption can sometimes be relaxed. Then, after estimating the matrix *A*, we can compute its inverse, say *W*, and obtain the independent components simply by *s* = *Wx*.

ICA is very closely related to the method called *blind source separation* (BSS) or blind signal separation. A “source” here means an original signal, i.e. an independent component, like the speaker in a cocktail party problem. “Blind” means that we know very little, if anything, about the mixing matrix, and make few assumptions about the source signals. ICA is one method, perhaps the most widely used, for performing blind source separation.

The task of ICA is to estimate the mixing matrix *A* or its inverse *W* = *A*^{−1} such that elements of the estimate *y* = *A*^{−1}*x* =*Wx* are as independent as possible. For the sake of simplicity, we often leave out the index *t* if the time structure does not have to be considered.
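The generative model and the role of the unmixing matrix can be illustrated with a small NumPy sketch. It does not estimate anything: the mixing matrix `A` below is a hypothetical, known matrix, used only to show that *W* = *A*^{−1} recovers the sources and that mixing pushes the signals towards Gaussianity (here measured by excess kurtosis, one common proxy). A real ICA algorithm would have to estimate *W* from *x* alone.

```python
import numpy as np

def excess_kurtosis(v):
    """Sample excess kurtosis; close to zero for a Gaussian variable."""
    v = v - v.mean()
    return np.mean(v ** 4) / np.mean(v ** 2) ** 2 - 3.0

rng = np.random.default_rng(0)
n_samples = 5000

# Two independent, non-Gaussian (uniform) sources, one per row.
s = rng.uniform(-1.0, 1.0, size=(2, n_samples))

# Hypothetical square mixing matrix; unknown in a real ICA problem.
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])
x = A @ s                      # observed mixtures, x = A s

W = np.linalg.inv(A)           # unmixing matrix W = A^{-1}
y = W @ x                      # recovered components, y = W x

print("sources recovered:", np.allclose(y, s))
print("kurtosis of source :", excess_kurtosis(s[0]))
print("kurtosis of mixture:", excess_kurtosis(x[0]))
```

The mixture's excess kurtosis is closer to zero than the source's, which is why ICA estimators typically search for directions that maximize non-Gaussianity.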

PCA makes one important assumption: the probability distribution of the input data must be Gaussian. When this assumption holds, the covariance matrix contains all the information about the (zero-mean) variables. In other words, PCA is concerned only with second-order (variance) statistics. This assumption need not hold. If face images follow more general probability density functions along each dimension, the representation problem has more degrees of freedom, and PCA can fail because the directions of largest variance need not correspond to meaningful axes.

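With the observed signals written as linear mixtures of the sources, the model is

$$x_i(t) = \sum_{j} a_{ij}\, s_j(t).$$

Here, $i = 1, \ldots, 4$.

In vector-matrix notation, and dropping the index $t$, this is

$$x = A s.$$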

## 11. Random projections

There has been a strong trend lately in face processing research away from geometric models and towards appearance models. Appearance-based methods employ dimensionality reduction to represent faces more compactly in a low-dimensional subspace, which is found by optimizing certain criteria. Recently, Random Projection (RP) has emerged as a powerful method for dimensionality reduction. It is a computationally simple and efficient method that preserves the structure of the data without introducing significant distortion.

RP transforms a vector $x$ in a $p$-dimensional space into a vector $\tilde{x}$ in a lower-dimensional space of dimension $d$, with $d \ll p$, via the transformation

$$\tilde{x} = R x,$$

where $R$ is orthonormal and its columns are realizations of independent and identically distributed (i.i.d.) zero-mean normal variables, scaled to have unit length. RP is motivated by the *Johnson-Lindenstrauss* lemma, which states that a set of *M* points in a high-dimensional Euclidean space can be mapped down onto a subspace of dimension $d = O(\log M / \epsilon^{2})$ such that the pairwise distances between the points are preserved to within a factor of $1 \pm \epsilon$.

The main reason for orthogonalizing the random vectors is to preserve the similarities between the original vectors in the low-dimensional space. In high enough dimensions, however, it is possible to save computation time by skipping the orthogonalization step without much affecting the quality of the projection matrix. This is because, in high-dimensional spaces, there exists a much larger number of almost orthogonal vectors than orthogonal vectors. Thus, high-dimensional vectors with random directions are very likely to be close to orthogonal.
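The two variants discussed above, with and without orthogonalization, can be compared in a short NumPy sketch. The sizes `p`, `d`, `M`, the Gaussian toy data, and the choice to store the random directions as rows of the projection matrix are all assumptions made for illustration. Projected distances are rescaled by $\sqrt{p/d}$ so that a ratio near 1 indicates good preservation.

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, M = 1024, 128, 20            # original dim, reduced dim, number of points

X = rng.normal(size=(p, M))        # toy data, one point per column
G = rng.normal(size=(p, d))        # i.i.d. zero-mean normal entries

# Variant 1: orthonormalize the d random directions with a QR decomposition.
Q, _ = np.linalg.qr(G)             # Q has orthonormal columns, shape (p, d)
R_orth = Q.T                       # projection matrix, shape (d, p)

# Variant 2: skip orthogonalization, only scale each direction to unit length.
R_unit = (G / np.linalg.norm(G, axis=0)).T

def pairwise_distances(Z):
    """Euclidean distances between all pairs of columns of Z."""
    diff = Z[:, :, None] - Z[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

def distance_ratios(R):
    """Projected / original pairwise distances, rescaled by sqrt(p / d)."""
    iu = np.triu_indices(M, k=1)
    return np.sqrt(p / d) * pairwise_distances(R @ X)[iu] / pairwise_distances(X)[iu]

ratios_orth = distance_ratios(R_orth)
ratios_unit = distance_ratios(R_unit)

# How far the non-orthogonalized directions are from being orthogonal.
gram_offdiag = R_unit @ R_unit.T - np.eye(d)

print("orthonormal  :", ratios_orth.min(), ratios_orth.max())
print("unit-length  :", ratios_unit.min(), ratios_unit.max())
print("max |cosine| :", np.abs(gram_offdiag).max())
```

In this setting the unit-length directions are already nearly orthogonal, which is the effect the paragraph above describes.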

## 12. Mixture of components

One can use different ratios of feature vectors drawn from the SVD, DF-LDA and LPP techniques. The first step is to normalize the images in the training set to compensate for illumination effects. The processed images are then subjected to dimensionality reduction using each of the methods mentioned in the chapter. Basis selection can be carried out using these independent sets of dimension-reduced vectors in different proportions, with the aim of improving the efficiency and accuracy of the recognition task. Below is an example with two trials, one with 1/3^{rd} dimensionality reduction and another with 2/3^{rd} reduction. In each trial, several iterations are performed with different combinations of the feature vectors. The iterations stop when the desired recognition rate is obtained.
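The idea of drawing basis vectors from several methods in a chosen proportion can be sketched as follows. Everything here is illustrative: the helper names `split_by_ratio` and `mix_bases` are hypothetical, the three bases are random orthonormal stand-ins rather than real SVD, DF-LDA or LPP outputs, and the weights are assumed to be computed as $B^{T}x$. The 3:2:5 split over 50 vectors corresponds to the 15/10/25 combination in the example tables.

```python
import numpy as np

def split_by_ratio(total, ratio):
    """Split `total` basis vectors according to an integer ratio, e.g. (3, 2, 5)."""
    counts = [total * r // sum(ratio) for r in ratio]
    counts[-1] += total - sum(counts)      # rounding remainder goes to the last method
    return counts

def mix_bases(bases, counts):
    """Concatenate the leading counts[i] columns of each basis matrix."""
    return np.hstack([B[:, :c] for B, c in zip(bases, counts)])

rng = np.random.default_rng(0)
n_pixels, n_vectors = 400, 30

# Stand-ins for the SVD, DF-LDA and LPP bases: random orthonormal columns.
bases = [np.linalg.qr(rng.normal(size=(n_pixels, n_vectors)))[0] for _ in range(3)]

counts = split_by_ratio(50, (3, 2, 5))     # SVD : DF-LDA : LPP
B = mix_bases(bases, counts)               # mixed basis, one vector per column

probe = rng.normal(size=n_pixels)          # stand-in for a normalized probe image
weights = B.T @ probe                      # assumed projection: one weight per basis vector

print(counts, B.shape, weights.shape)
```

Because columns taken from different methods are generally not orthogonal to each other, the weights from $B^{T}x$ are correlations rather than least-squares coefficients; a different projection rule may be preferable in practice.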

### 12.1. Example

#### 12.1.1. Preprocessing

The Face Space: For the recognition task, each *m* × *n* image *I*_{i} in the training set is transformed into a column vector of *mn* components. A matrix S (*mn* × M) is constructed such that S = [*I*_{1} *I*_{2} … *I*_{M}], where M is the number of face images in the training set. It is found that all of these column vectors are linearly independent, which implies that the range space of S is the entire region spanned by the columns of S, i.e. R(S) = [S].

Normalization: Normalize the images to reduce the effects of illumination and lighting conditions as,

For i=1,2,3….., M

Where,
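The construction of the face-space matrix S described above can be sketched in NumPy. The image sizes and the random stand-in images are assumptions for illustration only; the normalization step is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
M, m, n = 10, 8, 6                         # number of images, image height, image width

images = rng.random(size=(M, m, n))        # stand-ins for the training images I_1..I_M

# Flatten each m x n image into a column of mn components: S is (mn x M).
S = images.reshape(M, m * n).T

# The columns are linearly independent exactly when rank(S) equals M.
rank = np.linalg.matrix_rank(S)

print(S.shape, rank)
```

With real face images the rank check is worth running explicitly, since duplicated or nearly identical images would make the columns dependent.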

#### 12.1.2. Basis selection

Recognition Task: An unknown probe face is normalized as in the preprocessing step and projected onto the subspace to obtain the weights for the probe image

Deciding on the Threshold: A set of 150 known images other than the ones in the data set is used in the computation of threshold given by

_{i} Є γ

The method of choosing the right combination and proportion of feature vectors has been applied to a large database consisting of a variety of still images with illumination and expression variations as well as partially occluded images. The ratio 3:2:5 (SVD:DF-LDA:LPP) yielded the highest recognition accuracy. The example uses a test set of 165 images drawn from the Yale dataset and a training set of 15 classes with five images per class.

An ROC graph is plotted to visualize and analyze the efficiency of face recognition. It is a two-dimensional graph in which the true positive (TP) rate is plotted on the Y axis and the false positive (FP) rate is plotted on the X axis. Given a set of test images, a two-by-two contingency table is constructed representing the dispositions of the set of images.

| SVD (no. of vectors) | DF-LDA (no. of vectors) | LPP (no. of vectors) | Efficiency (%) |
| --- | --- | --- | --- |
| 15 | 5 | 5 | 80.00 |
| 5 | 5 | 15 | 81.21 |
| 8 | 9 | 8 | 81.81 |
| 15 | 5 | 15 | 87.27 |
| 5 | 15 | 5 | 81.21 |

| Graph No. | True Positive | False Negative | False Positive | True Negative |
| --- | --- | --- | --- | --- |
| 1 | 122 | 28 | 5 | 10 |
| 2 | 123 | 27 | 4 | 11 |
| 3 | 125 | 15 | 5 | 10 |
| 4 | 132 | 18 | 3 | 12 |
| 5 | 122 | 28 | 3 | 12 |

| SVD (no. of vectors) | DF-LDA (no. of vectors) | LPP (no. of vectors) | Efficiency (%) |
| --- | --- | --- | --- |
| 30 | 10 | 10 | 84.24 |
| 10 | 10 | 30 | 85.45 |
| 20 | 15 | 15 | 86.67 |
| 25 | 15 | 10 | 84.84 |
| 15 | 10 | 25 | 92.12 |

| Graph No. | True Positive | False Negative | False Positive | True Negative |
| --- | --- | --- | --- | --- |
| 1 | 129 | 21 | 5 | 10 |
| 2 | 132 | 18 | 4 | 11 |
| 3 | 133 | 17 | 5 | 10 |
| 4 | 128 | 32 | 4 | 11 |
| 5 | 140 | 10 | 3 | 12 |
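The ROC coordinates described earlier follow directly from a contingency table. The sketch below uses the first contingency row of the first trial (TP = 122, FN = 28, FP = 5, TN = 10). The function name `roc_point` is hypothetical, and treating the reported efficiency as (TP + TN) / total is an assumption, although it reproduces the 80.00% of the corresponding row.

```python
def roc_point(tp, fn, fp, tn):
    """Return (TP rate, FP rate, accuracy) for one contingency table."""
    tpr = tp / (tp + fn)                       # Y axis of the ROC graph
    fpr = fp / (fp + tn)                       # X axis of the ROC graph
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return tpr, fpr, accuracy

tpr, fpr, accuracy = roc_point(tp=122, fn=28, fp=5, tn=10)
print(f"TPR={tpr:.4f} FPR={fpr:.4f} accuracy={100 * accuracy:.2f}%")
```

Each contingency row yields one point in ROC space; points closer to the top-left corner indicate better recognition behaviour.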

## 13. Conclusion

In this chapter, several linear and nonlinear dimensionality reduction techniques were discussed from the perspective of face recognition. Since face images contain several characteristic features, both global and local, using any one method alone may not yield the best recognition accuracy. Combining basis vectors from several approaches may achieve higher accuracy. The underlying manifold structure of the face subspace within image space can be captured with methods such as LPP and ICA. More pronounced features can be drawn from the space with LDA-based algorithms. Random and PCA projections give appearance models that are holistic in nature.

Future work on face recognition can also consider additional dimensions, such as depth information, for recognition purposes. Algorithmic models should aim at scale-invariant feature vectors, which may make recognition possible even under extreme variations in images.

The approach to face recognition was motivated by information theory, leading to the idea of basing face recognition on a small set of image features that best approximate the set of known face images, without requiring that they correspond to our intuitive notions of facial parts and features. The approach provides a practical solution to the problem of face recognition, is relatively simple, and has been shown to work well in a constrained environment. Anecdotal experimentation with acquired image sets indicates that profile size, complexion, ambient lighting and facial angle play significant parts in the recognition of a particular image.