Open access peer-reviewed chapter

A Generally Semisupervised Dimensionality Reduction Method with Local and Global Regression Regularizations for Recognition

Written By

Mingbo Zhao, Yuan Gao, Zhao Zhang and Bing Li

Submitted: 18 November 2015 Reviewed: 22 March 2016 Published: 06 July 2016

DOI: 10.5772/63273

From the Edited Volume

Face Recognition - Semisupervised Classification, Subspace Projection and Evaluation Methods

Edited by S. Ramakrishnan


Abstract

The insufficiency of labeled data is an important problem in image classification tasks such as face recognition. However, unlabeled data are abundant in real-world applications. Therefore, semisupervised learning methods, which incorporate a few labeled data and a large number of unlabeled data into learning, have received more and more attention in the field of face recognition. During the past years, graph-based semisupervised learning has become a popular topic in the area of semisupervised learning. In this chapter, we present a new graph-based semisupervised learning method for face recognition. The presented method is based on local and global regression regularization. The local regression regularization adopts a set of local classification functions to preserve both local discriminative and geometrical information, as well as to reduce the bias of outliers and handle imbalanced data, while the global regression regularization preserves the global discriminative information and calculates the projection matrix for out-of-sample extrapolation. Extensive simulations based on synthetic and real-world datasets verify the effectiveness of the proposed method.

Keywords

  • Semi-supervised Learning
  • Dimensionality Reduction
  • Local and Global Regressions
  • Face Recognition
  • Transductive and Inductive Learning

1. Introduction

In the real world, there is an ever-increasing amount of face image data generated from Internet surfing and daily social communication. These data can be labeled or unlabeled, and accordingly be utilized for image retrieval, summarization, and indexing. To handle these datasets for realizing the above tasks, automatic annotation is an elementary step, which can be formulated as a pattern classification problem and accomplished by learning-based techniques. Traditionally, supervised-learning-based methods, such as Linear Discriminant Analysis (LDA) and the Support Vector Machine (SVM), can deliver satisfactory recognition accuracy provided that the number of labeled data is adequate. But labeling a huge amount of data is expensive and time consuming. On the other hand, unlabeled data are plentiful and can be easily obtained from real-world applications. Therefore, semisupervised learning methods that utilize a few labeled data and a huge amount of unlabeled data are becoming more and more popular than purely supervised learning methods [27–33].

Recently, since the two pioneering semisupervised methods, i.e., Gaussian Fields and Harmonic Functions (GFHF) [3] and Learning with Local and Global Consistency (LLGC) [4], were proposed in 2003 and 2004, respectively, graph-based semisupervised learning methods have received considerable research interest in the area of semisupervised learning. These methods usually represent both the labeled and unlabeled sets by a graph and then utilize the graph Laplacian matrix to characterize the manifold structure. Finally, different learning tasks such as image classification, clustering, and dimensionality reduction are performed on the graph Laplacian matrix. For example, GFHF and LLGC work in a transductive way by directly propagating the class label information from the labeled set to the unlabeled set along the graph, so that the labels of unlabeled data can be estimated. Other similar works include Random Walk [5] and Special Label Propagation (SLP) [8]. However, transductive learning methods cannot predict the class labels of new-coming samples and hence suffer from the out-of-sample problem.

To solve the out-of-sample problem, inductive learning methods have been proposed during the past decades. Typical inductive methods are Manifold Regularization (MR) [1] and Semisupervised Discriminant Analysis (SDA) [2]. MR learns a projection matrix by adding a graph-Laplacian-regularized term to the cost function of the original supervised methods. Therefore, both unlabeled and new-coming data can be cast into a low-dimensional subspace, and the out-of-sample problem can be naturally solved [7, 9, 10, 16]. For example, MR has extended regularized least squares and SVM to their semisupervised learning extensions, i.e., Laplacian regularized least squares (Lap-RLS) and Laplacian SVM, by adding a manifold regularized term. Similarly, Cai et al. [2] have extended LDA to SDA for semisupervised dimensionality reduction.

It should be noted that the success of semisupervised learning depends on how the unlabeled data are utilized to characterize the distribution of labels in the data space. Several methods, including Locally Linear Reconstruction [11, 12, 20], Local Regression and Global Alignment [13, 14], and Local Spline Regression [18, 19], have been developed to discover the intrinsic manifold structure of data. However, when we perform semisupervised classification, the data points lying far away from the data manifold are outliers that are noisy for learning the correct classifier and can deteriorate the classification performance. On the other hand, sampling in real-world applications is usually not uniform. As a result, the sampled data may be imbalanced or have a multi-density distribution. None of the aforementioned methods focuses on solving these two problems.

In this chapter, we develop an effective semisupervised dimensionality reduction method, i.e., Local and Global Regression (LGR), for face recognition with outliers and imbalanced face data. In order to handle both transductive and inductive learning problems, LGR aims to sufficiently learn the classification function by using all data. In detail, the presented method first extends the original supervised regression term to a supervised loss term and a global regression regularized term, where the loss term fixes the inconsistency between the predicted labels and the initial labels, while the global regression term sufficiently learns the classification function using all training data and obtains the projection matrix for handling the out-of-sample problem. Furthermore, to capture the local discriminative information, a set of weighted local classification functions is adopted, one for each data point, to estimate the labels of its nearby data, where the weight is used to reduce the outlier bias and to deal with imbalanced data. Thus, both local and global discriminative information of the dataset can be preserved by the proposed LGR method.

The main contributions of this work are as follows: (1) we propose a new effective method for semisupervised dimensionality reduction, which can handle both transductive and inductive learning problems; (2) we develop a graph Laplacian matrix that can characterize both local geometrical and discriminative information, as well as reduce the bias of outliers and handle imbalanced data; (3) we establish the connection between the proposed method and other state-of-the-art methods. Theoretical analysis shows that several popular semisupervised methods, such as LRGA, can be viewed as special cases of the proposed method. Extensive simulations based on synthetic and real-world datasets verify the effectiveness of the proposed method.

This chapter is organized as follows. In Section 2, the notations and motivations are first given. We then propose our LGR method for handling both transductive and inductive learning problems. Finally, we establish the connection between the proposed method and other state-of-the-art methods. Section 3 presents the extensive simulations and the final conclusions are drawn in Section 4.


2. The proposed method

2.1. Notation and motivation

In semisupervised learning, let $X = \{X_l, X_u\} = \{x_1, x_2, \ldots, x_{l+u}\} \in \mathbb{R}^{D \times (l+u)}$ be the data matrix, where the first $l$ and the remaining $u$ columns are the labeled and unlabeled samples, respectively, and let $Y_l = \{y_1, y_2, \ldots, y_l\} \in \mathbb{R}^{c \times l}$ be the binary label matrix, with each column $y_j$ representing the class assignment of $x_j$: $y_{ij} = 1$ if $x_j$ belongs to the $i$th class and $y_{ij} = 0$ otherwise. Here, $D$ and $c$ are the numbers of features and classes, respectively. We also let $L = D - W$ be the graph Laplacian matrix associated with both the labeled and unlabeled sets [17], where $W$ is the weight matrix defined as $w_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$ if $x_i$ is among the $k$ nearest neighbors of $x_j$ or $x_j$ is among the $k$ nearest neighbors of $x_i$, and $w_{ij} = 0$ otherwise, and $D$ is a diagonal matrix satisfying $D_{ii} = \sum_{j=1}^{l+u} w_{ij}$.
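As a concrete illustration of this construction, the following minimal NumPy sketch builds the Gaussian-weighted $k$-nearest-neighbor affinity matrix $W$, the degree matrix $D$, and the graph Laplacian $L = D - W$; the function name and the symmetrization detail are our own choices, not taken from the chapter.

```python
import numpy as np

def knn_gaussian_laplacian(X, k=5, sigma=1.0):
    """Build W, D and L = D - W for the columns of X (features x samples), as in Section 2.1.

    A pair (i, j) is connected if either sample is among the other's k nearest neighbors;
    the edge weight is exp(-||x_i - x_j||^2 / (2 sigma^2)).
    """
    n = X.shape[1]
    sq = np.sum(X**2, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X   # pairwise squared distances
    np.fill_diagonal(dist2, np.inf)                     # exclude self-connections
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist2[i])[:k]                 # k nearest neighbors of x_i
        W[i, nbrs] = np.exp(-dist2[i, nbrs] / (2.0 * sigma**2))
    W = np.maximum(W, W.T)                              # connect if either sample is a kNN
    D = np.diag(W.sum(axis=1))
    return W, D, D - W
```

In practice the Gaussian width $\sigma$ is often tuned or set to a typical neighbor distance; as discussed next, this sensitivity is one motivation for the regression-based graph used by LGR.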

Most semisupervised learning methods utilize the Gaussian-function-based affinity matrix. As pointed out in references [11, 12], the Gaussian-function-based affinity matrix is oversensitive to the Gaussian variance; even a slight variation of the variance may change the results dramatically. Thus, the Gaussian-function-based affinity matrix is not a good choice for handling image classification, and the method developed should be robust to its parameters.

Second, when carrying out semisupervised classification, the samples lying far away from the data manifold are outliers, which may lead to learning an incorrect classifier and deteriorate the classification performance. Considering Figure 1(a and b) as examples, we generate two-cycle and two-moon datasets with outliers. Considering the distribution of the two datasets, the ideal decision boundary should lie in the gap between the two data sub-manifolds. However, since there are many outliers around the data manifold, these outliers blur the distribution of the whole dataset and act as noise when learning a correct classifier. Therefore, it is very important to develop a method that can adaptively reduce the effects of outliers.

Third, in real-world applications, sampling is usually not uniform. Consequently, the sampled data can be imbalanced or follow a multi-density distribution. Figure 1(c) shows a two-plate dataset with two classes: each class follows a Gaussian distribution but with a different center and density. Obviously, the data points in the high-density area (left) will carry more weight than those in the low-density area (right) when learning a classifier, which may cause incorrect classification results. The method developed should handle such imbalanced data with a multi-density distribution.

The method developed should also solve the out-of-sample problem. To address the above problems, we propose in this chapter a new semisupervised learning method based on local and global regression.

Figure 1.

(a) Two-cycle dataset; (b) two-moon dataset; (c) two-plate dataset.

2.2. Local and global regression

We start from supervised least-squares regression, which fits a linear model $y_j = V^T x_j + b^T$ by regressing $X$ on $Y$:

$$\min_{V, b} \sum_{j=1}^{l} \|V^T x_j + b^T - y_j\|_F^2 + \alpha_t \|V\|_F^2 \qquad (E1)$$

where $V$ is the projection matrix used to project new-coming samples and $b$ is the bias term. Although the label $y_j$ of $x_j$ ($j \le l$) is already known, since $l$ is usually very small, the classification function $z_j = V^T x_j + b^T$ may not be sufficiently trained due to the small sample size. To solve this problem, we introduce $Z = \{Z_l, Z_u\} = \{z_1, z_2, \ldots, z_{l+u}\} \in \mathbb{R}^{c \times (l+u)}$ as a set of estimated labels to play the same role, replacing $V^T x_j + b^T$ with $z_j$, and add a regression term to Eq. (1) as follows:

$$\min_{Z, V, b} \sum_{i=1}^{l} \|z_i - y_i\|_F^2 + \alpha_r \Big( \sum_{j=1}^{l+u} \|V^T x_j + b^T - z_j\|_F^2 + \eta \|V\|_F^2 \Big) \qquad (E2)$$

According to Eq. (2), the classification function $z_j = V^T x_j + b^T$ can be sufficiently learned by using all the estimated labels while fitting the labeled samples to their original labels. In other words, the global discriminative information can be preserved by the regression term of Eq. (2). Furthermore, to capture the local discriminative information, we introduce a local regression function for each data sample $x_j$. We denote by $N_k(x_j)$ the $k$-nearest-neighborhood set of $x_j$ including itself, and by $X_j = \{x_{j_0}, x_{j_1}, \ldots, x_{j_{k-1}}\} \in \mathbb{R}^{D \times k}$ the local data matrix formed by all samples in $N_k(x_j)$, where $\{j_0, j_1, \ldots, j_{k-1}\}$ is the index set of $N_k(x_j)$ and $j_0 = j$, $x_{j_0} = x_j$. We also denote by $Z_j = \{z_{j_0}, z_{j_1}, \ldots, z_{j_{k-1}}\} \in \mathbb{R}^{c \times k}$ the local low-dimensional label matrix in $N_k(x_j)$. Then, the local regression functions for all data samples can be given as follows:

$$\min_{Z, V_j, b_j} \sum_{j=1}^{l+u} \Big( \sum_{i=0}^{k-1} \|V_j^T x_{j_i} + b_j^T - z_{j_i}\|_F^2 + \eta \|V_j\|_F^2 \Big) \qquad (E3)$$

However, minimizing the above total error over all data samples tends to force the local errors $\alpha_{j_i} = \|V_j^T x_{j_i} + b_j^T - z_{j_i}\|_F$ to be similar to each other. If the dataset includes outliers, treating all the local regression errors equally may emphasize the effects of the outliers and weaken those of the normal data. In this section, to weaken the effects of outliers, we add a weight vector $\Gamma_j = \{\tau_{j_0}, \tau_{j_1}, \ldots, \tau_{j_{k-1}}\} \in \mathbb{R}^{1 \times k}$ for each local data patch $X_j$ in order to penalize each regression error, which can be written as

$$\min_{Z, V_j, b_j} \sum_{j=1}^{l+u} \Big( \sum_{i=0}^{k-1} \tau_{j_i} \|V_j^T x_{j_i} + b_j^T - z_{j_i}\|_F^2 + \eta \|V_j\|_F^2 \Big) \qquad (E4)$$

In the following section, we will discuss how to select the weight $\tau_{j_i}$. Our motivation is to let the weight of the local error $\alpha_{j_i}$ be large if $x_{j_i}$ is a normal data point and, in contrast, to let the weight be small if $x_{j_i}$ is an outlier. In detail, to obtain the local projection matrix $V_j$ and bias $b_j$, we set the derivatives of Eq. (4) with respect to $V_j$ and $b_j$ to zero. Then, Eq. (4) reduces to

$$\min_{Z} \sum_{j=1}^{l+u} \mathrm{Tr}\big(Z S_j L_j S_j^T Z^T\big) = \min_{Z} \mathrm{Tr}\big(Z L_d Z^T\big) \qquad (E5)$$

where $L_j = H_j - H_j X_j^T (X_j H_j X_j^T + \eta I)^{-1} X_j H_j$, with $H_j = \Delta_j - \Delta_j e_k^T e_k \Delta_j / (e_k \Delta_j e_k^T)$ the weighted local centering matrix, $\Delta_j = \mathrm{diag}(\Gamma_j)$, and $e_k \in \mathbb{R}^{1 \times k}$ a row vector of all ones; $S_j \in \mathbb{R}^{(l+u) \times k}$ is the selection matrix satisfying $(S_j)_{pq} = 1$ if $x_p$ is the $q$th sample in $N_k(x_j)$ and $(S_j)_{pq} = 0$ otherwise; and $L_d = \sum_{j=1}^{l+u} (S_j L_j S_j^T)$ is the local graph Laplacian matrix. Similarly, by setting the derivatives of Eq. (2) with respect to $V$ and $b$ to zero, we have

$$b = \frac{e Z^T - e X^T V}{e e^T}, \qquad V = (X L_c X^T + \eta I)^{-1} X L_c Z^T \qquad (E6)$$

where $e \in \mathbb{R}^{1 \times (l+u)}$ is a row vector of all ones and $L_c = I - e^T e / (e e^T)$ is the centering matrix, which centers the samples by subtracting the mean of all samples. With $b$ and $V$ in Eq. (6), the global regression term in Eq. (2) can be written as

$$\|V^T X + b^T e - Z\|_F^2 + \eta \|V\|_F^2 = \mathrm{Tr}\big(Z L_g Z^T\big) \qquad (E7)$$

where $L_g = L_c - L_c X^T (X L_c X^T + \eta I)^{-1} X L_c$ is the global graph Laplacian matrix. By integrating Eq. (7) with Eq. (2), we formulate our method as follows:

$$J(Z) = \min_{Z} \mathrm{Tr}\big((Z - Y) U (Z - Y)^T\big) + \alpha_m \mathrm{Tr}\big(Z L_d Z^T\big) + \alpha_r \mathrm{Tr}\big(Z L_g Z^T\big) \qquad (E8)$$

where $U \in \mathbb{R}^{(l+u) \times (l+u)}$ is the diagonal matrix whose first $l$ and remaining $u$ diagonal elements are 1 and 0, respectively; the second term describes the local discriminative structure of the data; the third term describes the global discriminative structure; and $\alpha_m$ and $\alpha_r$ are two balancing parameters. Since both local and global regressions are regularized in our method, we refer to it as LGR. Finally, by setting the derivative of $J(Z)$ with respect to $Z$ to zero, we obtain the solution of $Z$ as

$$Z = Y U (U + \alpha_m L_d + \alpha_r L_g)^{-1} \qquad (E9)$$

Then, we can obtain the optimal projection matrix and bias term by substituting $Z$ into Eq. (6).
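To make the closed-form training procedure concrete, the following NumPy sketch strings Eqs. (4)–(9) together: it accumulates the weighted local Laplacian $L_d$ patch by patch, forms the global Laplacian $L_g$, solves for the soft labels $Z$, and recovers $V$ and $b$. This is our own illustration under the chapter's notation (samples as columns of $X$, $Y$ the $c \times (l+u)$ label matrix with zero columns for unlabeled samples); the function names are ours, and uniform local weights are used as a placeholder for the weights derived in Section 2.3.

```python
import numpy as np

def local_laplacian(X, knn_idx, tau, eta=1.0):
    """Accumulate L_d = sum_j S_j L_j S_j^T of Eq. (5) using neighbor indices
    instead of explicit selection matrices S_j.

    X       : (D, n) data matrix (samples as columns).
    knn_idx : (n, k) integer array; row j holds the global indices of N_k(x_j).
    tau     : (n, k) array of local weights (Eq. 4).
    """
    D_feat, n = X.shape
    Ld = np.zeros((n, n))
    for j in range(n):
        idx, tj = knn_idx[j], tau[j]
        Xj = X[:, idx]                                   # local patch X_j
        # weighted centering: H_j = Delta_j - Delta_j e^T e Delta_j / (e Delta_j e^T)
        Hj = np.diag(tj) - np.outer(tj, tj) / tj.sum()
        M = Xj @ Hj @ Xj.T + eta * np.eye(D_feat)
        Lj = Hj - Hj @ Xj.T @ np.linalg.solve(M, Xj @ Hj)
        Ld[np.ix_(idx, idx)] += Lj                       # the S_j L_j S_j^T scatter-add
    return Ld

def lgr_fit(X, Y, n_labeled, knn_idx, tau=None, alpha_m=1.0, alpha_r=1.0, eta=1.0):
    """Closed-form LGR: L_g from Eq. (7), Z from Eq. (9), V and b from Eq. (6)."""
    D_feat, n = X.shape
    if tau is None:
        tau = np.ones(knn_idx.shape)                     # placeholder for the Section 2.3 weights
    Ld = local_laplacian(X, knn_idx, tau, eta)
    e = np.ones((1, n))
    U = np.diag(np.r_[np.ones(n_labeled), np.zeros(n - n_labeled)])
    Lc = np.eye(n) - e.T @ e / n                         # centering matrix
    A = X @ Lc @ X.T + eta * np.eye(D_feat)
    Lg = Lc - Lc @ X.T @ np.linalg.solve(A, X @ Lc)      # global Laplacian, Eq. (7)
    Z = Y @ U @ np.linalg.inv(U + alpha_m * Ld + alpha_r * Lg)   # soft labels, Eq. (9)
    V = np.linalg.solve(A, X @ Lc @ Z.T)                 # projection matrix, Eq. (6)
    b = (e @ Z.T - e @ X.T @ V) / n                      # bias (1 x c), Eq. (6)
    return Z, V, b
```

A new sample $x$ can then be mapped to $V^T x + b^T$ for out-of-sample prediction.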

2.3. Weight selection for bias reduction

In this section, we consider how to select the weights used in the method proposed in Section 2.2. Note that our goal in using the weights is to weaken the effects of outliers, so the weight $\tau_{j_i}$ should be set to a small value if $x_{j_i}$ is an outlier. We can therefore make the weight $\tau_{j_i}$ inversely proportional to the distance between $x_{j_i}$ and a center $\mu_j$, i.e., $\tau_{j_i} = 1 / \|x_{j_i} - \mu_j\|$. Such a center is expected to represent the ideal center of the data in the neighborhood of $x_j$ and should be far away from outliers; hence, the weight $\tau_{j_i}$ is usually small if $x_{j_i}$ is an outlier. But this center $\mu_j$ is unknown. We next present an iterative approach that calculates $\mu_j$ and the weights $\tau_{j_i}$ simultaneously; its convergence is proved afterward.

1. Initialize $\mu_j^0$ as the average center of all data points in the local patch of $x_j$.
2. Update $\tau_{j_i}^t$ for each $x_{j_i}$ as $\tau_{j_i}^t = 1 / \|x_{j_i} - \mu_j^{t-1}\|$ and form the weight vector $\Gamma_j^t$ (with the diagonal matrix $\Delta_j^t = \mathrm{diag}(\Gamma_j^t)$).
3. Update $\mu_j^t = \sum_{i=0}^{k-1} \tau_{j_i}^t x_{j_i} / \sum_{i=0}^{k-1} \tau_{j_i}^t = X_j \Delta_j^t e_k^T / (e_k \Delta_j^t e_k^T)$.
4. Iterate steps 2 and 3 until $\sum_{i=0}^{k-1} \|x_{j_i} - \mu_j^t\|$ no longer changes. Output $\tau_{j_i}^t$.

Table 1.

Iterative approach for calculating the weight.

Table 1 shows the basic steps of the iterative approach. Following Table 1, the weight $\tau_{j_i}^t$ at each iteration is updated from the last center $\mu_j^{t-1}$, and the newly updated center $\mu_j^t$ is then calculated from the current $\tau_{j_i}^t$. The iterations continue until convergence, so that the weights $\tau_{j_i}^t$ are adaptively and iteratively re-weighted to minimize $\sum_{i=0}^{k-1} \|x_{j_i} - \mu_j^t\|$. In addition, as can be seen in the simulation of Figure 2, the updated $\mu_j^t$ adaptively moves close to the main center of most data points, while the updated $\tau_{j_i}^t$ is weakened if $x_{j_i}$ is an outlier and strengthened if $x_{j_i}$ is close to the ideal center. We next give a theorem that guarantees the convergence of the approach in Table 1.
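A minimal NumPy sketch of the procedure in Table 1 for a single local patch is given below; it is our own illustration, and the convergence tolerance and iteration cap are arbitrary choices not specified in the chapter.

```python
import numpy as np

def local_weights(Xj, max_iter=100, tol=1e-6, eps=1e-12):
    """Iterative approach of Table 1 for one local patch Xj (D x k):
    returns the weights tau (length k) and the robust center mu."""
    mu = Xj.mean(axis=1)                                    # step 1: average center
    prev_obj = np.inf
    for _ in range(max_iter):
        dist = np.linalg.norm(Xj - mu[:, None], axis=0)     # ||x_ji - mu||
        tau = 1.0 / np.maximum(dist, eps)                   # step 2: inverse-distance weights
        mu = (Xj * tau).sum(axis=1) / tau.sum()             # step 3: weighted mean
        obj = np.linalg.norm(Xj - mu[:, None], axis=0).sum()
        if abs(prev_obj - obj) < tol:                       # step 4: stop when the objective stalls
            break
        prev_obj = obj
    return tau, mu
```

The returned weights can then be plugged into Eq. (4), before the normalization of Section 2.4.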

Theorem 1. The approach in Table 1 monotonically decreases the objective function $\sum_{i=0}^{k-1} \|x_{j_i} - \mu_j^t\|$ until convergence.

Proof. According to step 3 in Table 1, we know that

$$\mu_j^t = \arg\min_{\mu_j^t} \sum_{i=0}^{k-1} \tau_{j_i}^t \|x_{j_i} - \mu_j^t\|_F^2 \qquad (E10)$$

where $\tau_{j_i}^t = 1 / \|x_{j_i} - \mu_j^{t-1}\|$ as in step 2 of Table 1. Following Eq. (10), we have

$$\sum_{i=0}^{k-1} \frac{\|x_{j_i} - \mu_j^t\|^2}{\|x_{j_i} - \mu_j^{t-1}\|} \le \sum_{i=0}^{k-1} \frac{\|x_{j_i} - \mu_j^{t-1}\|^2}{\|x_{j_i} - \mu_j^{t-1}\|} \qquad (E11)$$

Based on the lemma in reference [6] that $2\sqrt{a} - a/\sqrt{b} \le 2\sqrt{b} - b/\sqrt{b}$ holds for any two nonzero values $a$ and $b$, we have

$$\sum_{i=0}^{k-1} \Big\{ 2\|x_{j_i} - \mu_j^t\| - \frac{\|x_{j_i} - \mu_j^t\|^2}{\|x_{j_i} - \mu_j^{t-1}\|} \Big\} \le \sum_{i=0}^{k-1} \Big\{ 2\|x_{j_i} - \mu_j^{t-1}\| - \frac{\|x_{j_i} - \mu_j^{t-1}\|^2}{\|x_{j_i} - \mu_j^{t-1}\|} \Big\} \qquad (E12)$$

By summing Eqs. (11) and (12) on both sides, we have

$$\sum_{i=0}^{k-1} \|x_{j_i} - \mu_j^t\| \le \sum_{i=0}^{k-1} \|x_{j_i} - \mu_j^{t-1}\| \qquad (E13)$$

Eq. (13) indicates that the objective function $\sum_{i=0}^{k-1} \|x_{j_i} - \mu_j^t\|$ monotonically decreases at each iteration. Since the objective function is bounded below ($\sum_{i=0}^{k-1} \|x_{j_i} - \mu_j^t\| \ge 0$), the iterative approach will certainly converge. We thus prove Theorem 1. Finally, by incorporating the weight into Eq. (4) to penalize each local regression error, we can reduce the bias caused by outlying data samples.

Here, in order to show the convergence of the approach, we give a simple example in Figure 2(a), where we generate eight normal data points and two outliers in $\mathbb{R}^2$. Figure 2(b) shows the convergence route of $\mu$, where we initialize $\mu^0$ as the mean of all data points and mark $\mu^t$ at each iteration $t$. From Figure 2(b), we can observe that the solution $\mu^t$ iteratively moves close to the main center of the normal data and away from the outliers. Figure 2(c) shows the convergence curve of the approach discussed in Table 1: the objective $\sum_{i} \|x_i - \mu^t\|$ monotonically decreases until convergence. Figure 2(d) shows the converged weights of the data points: the weights of the normal data points are strengthened, while those of the outliers are reduced.

Figure 2.

The convergence of the approach in Table 1: (a) original data, (b) the converged route of mean, (c) the converged curve of objective, (d) the converged weight.

2.4. Normalizing graph Laplacian matrix

It can be easily proved that $L_d$ is a graph Laplacian matrix (see the Appendix). But $L_d$ may not be a normalized graph Laplacian matrix. As pointed out in references [8, 23], normalization can strengthen the local regressions in the low-density region and weaken those in the high-density region. Since data sampling is usually not uniform in practice, normalization is useful for handling the case in which the density of the dataset varies dramatically. In this section, we show that by choosing a special weight vector $\Gamma_j$ for each $X_j$, $L_d$ becomes a normalized graph Laplacian matrix.

Specifically, let us consider a data sample $x_l$ and let $K_l$ be the index set of those neighborhood sets $N_k(x_j)$ that contain $x_l$ as a neighbor, i.e., if $j \in K_l$, then $x_l \in N_k(x_j)$, where $x_l$ is denoted as $x_{j_i}$ in the neighborhood set $N_k(x_j)$ and $i = i(l, j)$ is the local index depending on $l$ and $j$. Obviously, if $x_l$ is in a low-density area, it has sparse neighbors and $K_l$ is relatively small. As a result, its connections to other samples will be weaker than those of a sample with a large $K_l$. Here, to strengthen the connections of samples in the low-density area, we need to normalize the weights corresponding to each $K_l$. Let $\tau_{jl}$ be the weight of $x_{j_i}$, where $l$ is the global index of $x_{j_i}$. We then define the normalized weight $\tau_{j_i} = \tau_{jl}$ as follows:

$$\tau_{j_i} = \tau_{jl} \leftarrow \frac{\tau_{jl}}{\sum_{i \in K_l} \tau_{il}} \qquad (E14)$$

Hence, based on this definition, we have the following theorem:

Theorem 2. With the normalization of each $\tau_{jl}$ as in Eq. (14), $L_d$ is both a graph Laplacian matrix and a normalized graph Laplacian matrix.

Proof. The proof that $L_d$ is a graph Laplacian matrix is given in the Appendix. In order to prove that $L_d$ is a normalized graph Laplacian matrix, we need to prove that $L_d$ can be reformulated in the form $L_d = I - W_d$, where the sum of each row or column of the affinity matrix $W_d$ is equal to 1. Note that $L_d = \sum_{j=1}^{l+u} (S_j L_j S_j^T)$ and $L_j = H_j - H_j X_j^T (X_j H_j X_j^T + \eta I)^{-1} X_j H_j$, where $H_j = \Delta_j - \Delta_j e_k^T e_k \Delta_j / (e_k \Delta_j e_k^T)$. We first define the affinity matrix $W_d$ as follows:

$$W_d = \sum_{j=1}^{l+u} (S_j W_j^d S_j^T) \qquad (E15)$$

where each Wjd satisfies

$$W_j^d = \frac{\Delta_j e_k^T e_k \Delta_j}{e_k \Delta_j e_k^T} + H_j X_j^T (X_j H_j X_j^T + \eta I)^{-1} X_j H_j \qquad (E16)$$

Then, Ld can be reformulated as

$$L_d = \sum_{j=1}^{l+u} (S_j \Delta_j S_j^T) - \sum_{j=1}^{l+u} (S_j W_j^d S_j^T) \qquad (E17)$$

Here, for each $S_j \Delta_j S_j^T$, we have $S_j^T e^T = e_k^T$ and hence $S_j \Delta_j S_j^T e^T = S_j \Gamma_j^T$, where $S_j \Gamma_j^T \in \mathbb{R}^{(l+u) \times 1}$ is the column vector obtained by placing each $\tau_{jl}$ at the global index $l$ corresponding to $x_{j_i}$. We thus have

$$\Big\{ \sum_{j=1}^{l+u} (S_j \Delta_j S_j^T) \Big\} e^T = \sum_{j=1}^{l+u} (S_j \Gamma_j^T) = e^T \qquad (E18)$$

The second equality holds since $\sum_{i \in K_l} \tau_{il} = 1$ after normalization; hence, each element of the sum of all $S_j \Gamma_j^T$ is equal to 1. Eq. (18) then indicates that $\sum_{j=1}^{l+u} (S_j \Delta_j S_j^T)$ is an identity matrix, i.e., $\sum_{j=1}^{l+u} (S_j \Delta_j S_j^T) = I$. Based on the above analysis, we can therefore reformulate $L_d$ in the form $L_d = I - W_d$. In addition, since $L_d$ is a graph Laplacian matrix (as proved in the Appendix), it satisfies $L_d e^T = 0$; then we have

$$L_d e^T = 0 \;\Rightarrow\; \Big\{ \sum_{j=1}^{l+u} (S_j \Delta_j S_j^T) - \sum_{j=1}^{l+u} (S_j W_j^d S_j^T) \Big\} e^T = 0 \;\Rightarrow\; \Big\{ \sum_{j=1}^{l+u} (S_j \Delta_j S_j^T) \Big\} e^T = \Big\{ \sum_{j=1}^{l+u} (S_j W_j^d S_j^T) \Big\} e^T \;\Rightarrow\; W_d e^T = e^T \;\text{ or }\; e W_d = e \qquad (E19)$$

which indicates that the sum of each column or row of $W_d$ is equal to 1. We thus prove the theorem. Theorem 2 indicates that by choosing a special weight $\tau_{j_i}$ for each $x_{j_i}$, $L_d$ becomes both a graph Laplacian matrix and a normalized graph Laplacian matrix.
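In implementation terms, Eq. (14) simply rescales, for each sample $x_l$, the weights it receives across all patches that contain it so that they sum to one. A small sketch (our own, using the `knn_idx`/`tau` layout of the earlier sketches) is:

```python
import numpy as np

def normalize_local_weights(knn_idx, tau):
    """Normalize tau as in Eq. (14): for each global sample l, the weights assigned to it
    over all patches containing it are divided by their sum.

    knn_idx, tau : (n, k) arrays (global neighbor indices and local weights per patch).
    """
    n = knn_idx.shape[0]
    col_sum = np.zeros(n)
    np.add.at(col_sum, knn_idx, tau)      # sum_{j in K_l} tau_{jl} for every global index l
    return tau / col_sum[knn_idx]         # tau_{jl} / sum_{i in K_l} tau_{il}
```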

Here, it should be noted that if $x_l$ is an outlier, its local weights are significantly decreased, whether $x_l$ is taken as a neighbor of itself or of other data points; otherwise, the normalization does not change the magnitude of its original local weights. For data points in the low-density area, normalizing the weights can increase the information flow through those points. Finally, the basic steps of the proposed LGR are given in Table 2, and the flowchart of utilizing the proposed LGR for face recognition is given in Figure 3.

Input: Data matrix $X \in \mathbb{R}^{D \times (l+u)}$, the initial label matrix $Y \in \mathbb{R}^{c \times (l+u)}$, and other related parameters.
Output: The projection matrix $V^* \in \mathbb{R}^{D \times d}$ and the estimated label matrix $Z^* \in \mathbb{R}^{c \times (l+u)}$.
Algorithm:
1. Determine the weights for each local patch based on Table 1.
2. Normalize the weights as in Eq. (14).
3. Form the local regression regularized term $L_d$ as in Eq. (5) with the special local weight vectors.
4. Form the global regression regularized term $L_g$ as in Eq. (7).
5. Solve the regression problem in Eq. (8),
$$J(Z) = \min_{Z} \mathrm{Tr}\big((Z - Y) U (Z - Y)^T\big) + \alpha_m \mathrm{Tr}\big(Z L_d Z^T\big) + \alpha_r \mathrm{Tr}\big(Z L_g Z^T\big) \qquad (E28)$$

and calculate the estimated label matrix $Z^* = Y U (U + \alpha_m L_d + \alpha_r L_g)^{-1}$ as in Eq. (9).
6. Calculate the projection matrix $V^*$ by substituting $Z^*$ into Eq. (6) as $V^* = (X L_c X^T + \eta I)^{-1} X L_c Z^{*T}$. Output $V^*$ and $Z^*$.

Table 2.

The proposed LGR.

Figure 3.

Flowchart by utilizing the proposed LGR for face recognition.

2.5. Discussion and relative work

In this section, we discuss the relationship of the proposed LGR with other state-of-the-art methods, including MR, Flexible Manifold Embedding (FME), and Local Regression and Global Alignment (LRGA).

2.5.1. Relationship to manifold regularization (Lap-RLS/L) [1]

The goal of MR [1] is to develop a semisupervised learning strategy by extending original supervised methods, such as RLS and SVM, to their semisupervised versions, i.e., Laplacian RLS and Laplacian SVM. For example, Lap-RLS/L fits a linear model $y_j = V^T x_j + b^T$ by regressing $X$ on $Y$ and simultaneously preserving the manifold smoothness in the embeddings of both the labeled and the unlabeled sets. The objective function of Lap-RLS/L can be given as

$$J(V, b) = \min_{V, b} \sum_{j=1}^{l} \|V^T x_j + b^T - y_j\|_F^2 + \alpha_t \|V\|_F^2 + \alpha_m \mathrm{Tr}\big(V^T X L X^T V\big) \qquad (E20)$$

However, it can be observed that Lap-RLS/L cannot sufficiently train the classification function, because it regresses only on the labeled samples and uses the manifold term merely as a complement. Hence, the proposed LGR is superior to Lap-RLS/L.

2.5.2. Relationship to FME [7, 10]

Nie et al. have proposed another unified framework, i.e., FME [7, 10], for semisupervised dimensionality reduction, in which they show that LLGC, GFHF, and Lap-RLS/L are special cases of the framework. The basic objective function of FME can be given as

$$J(V, Z, b) = \min_{V, Z, b} \sum_{i=1}^{l} \|z_i - y_i\|_F^2 + \alpha_m \mathrm{Tr}\big(Z L Z^T\big) + \alpha_r \big( \|V^T X + b^T e - Z\|_F^2 + \eta \|V\|_F^2 \big) \qquad (E21)$$

It can be observed that Eq. (21) is almost the same as the objective function of LGR in Eq. (8) if $L_d$ is replaced by the standard graph Laplacian $L$. However, LGR utilizes a weighted and normalized local discriminative Laplacian matrix to preserve the manifold and discriminative structure of the dataset, which is better than relying only on the neighborhood graph.

2.5.3. Relationship to LRGA [13, 14]

Recently, Yang et al. have proposed a semisupervised transductive learning method, namely LRGA [13, 14], for multimedia retrieval, which shares a similar concept with the proposed method. The basic objective function of LRGA can be given as

$$J(Z) = \min_{Z, V_j, b_j} \sum_{i=1}^{l} \|z_i - y_i\|_F^2 + \alpha_m \sum_{j=1}^{l+u} \Big( \sum_{i=0}^{k-1} \|V_j^T x_{j_i} + b_j^T - z_{j_i}\|_F^2 + \eta \|V_j\|_F^2 \Big) \qquad (E22)$$

It can be noted that LRGA is a special case of LGR with $\alpha_r = 0$. Therefore, LRGA is only a transductive learning method and cannot handle the out-of-sample problem, while LGR is both a transductive and an inductive learning method. Another advantage of LGR over LRGA is that LGR adopts weighted and normalized local regression terms. Thus, as shown in the simulation results, LGR can handle outliers and multi-density datasets remarkably well.


3. Simulation results

In this section, we evaluate the proposed LGR based on three synthetic datasets and three real-world datasets.

3.1. Synthetic datasets

Figure 4.

Toy examples for transductive learning: (a) and (d) the original data of two-moon and two-cycle datasets; (b) and (e) the results of LGR; (c) and (f) the results of SLP.

In this section, we evaluate the performance of the proposed LGR and SLP for transductive learning. SLP is an extension of GFHF, LLGC, and Random Walk (RW); hence, it is representative. Here, we utilize the two-moon and two-cycle datasets in Figure 1(a and b) for evaluation. Figure 4 shows the results of LGR and SLP for transductive learning. From Figure 4, we can see that LGR achieves better results than SLP, in that fewer data points are misclassified by LGR than by SLP. This indicates that the proposed LGR is robust to outliers.

We also evaluate the inductive performance of the proposed LGR for handling the out-of-sample problem. Figure 5 shows the gray images of the decision surfaces and boundaries learned by LGR, which are formed as follows: for each pixel, we set its gray value to the difference between its distances to the nearest labeled data of the two classes in the reduced subspace. Here, we set the reduced dimensionality to 1. The decision boundary is then formed by the pixels with value 0. From Figure 5, we can observe that the proposed LGR learns a clear decision boundary that separates the two classes well, which verifies the effectiveness of LGR for handling the out-of-sample problem.
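For readers who wish to reproduce this visualization, the following sketch (our own, for the two-class, one-dimensional-subspace case described above) assigns each grid pixel the difference between its distances to the nearest labeled sample of each class in the reduced subspace; the zero level set of the resulting image is the decision boundary.

```python
import numpy as np

def decision_surface(grid_pts, X_lab, y_lab, V, b):
    """grid_pts: (2, m) pixel coordinates; X_lab: (2, l) labeled samples;
    y_lab: (l,) binary labels; V: (2, 1) projection, b: scalar bias
    (reduced dimensionality 1, as in the toy example)."""
    proj = lambda P: (V.T @ P).ravel() + b                 # map points to the 1-D subspace
    g, z0, z1 = proj(grid_pts), proj(X_lab[:, y_lab == 0]), proj(X_lab[:, y_lab == 1])
    d0 = np.min(np.abs(g[:, None] - z0[None, :]), axis=1)  # distance to nearest class-0 sample
    d1 = np.min(np.abs(g[:, None] - z1[None, :]), axis=1)  # distance to nearest class-1 sample
    return d0 - d1                                         # zero level set = decision boundary
```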

To show the merit of normalization, we utilize the two-plate dataset in Figure 1(c) for evaluation. Our goal is to show that LGR can handle multi-density datasets. Figure 6 shows the gray images of the decision surfaces and boundaries learned by LGR without and with normalization. From Figure 6, we can observe that LGR without normalization cannot find a proper boundary, whereas LGR with normalization achieves better performance: fewer data points are misclassified by the decision boundary, which becomes more distinctive and accurate. The improvement is attributed to the fact that normalization strengthens the local regressions in the low-density region and weakens those in the high-density region, which proves advantageous for multi-density datasets.

Figure 5.

Toy examples for inductive learning: decision surfaces and boundaries learned by LGR. (a) and (c) Two-moon dataset; (b) and (d) two-cycle dataset.

Figure 6.

Gray image of reduced space learned by LGR without normalization and LGR with normalization: two-plate dataset. (a) Original dataset; (b) LGR without normalization; and (c) LGR with normalization.

3.2. Semisupervised face recognition based on real-world benchmark datasets

For the face recognition problem, we use three real-world face datasets to evaluate the performance of the methods: the UMIST [24], Extended Yale-B [25], and Massachusetts Institute of Technology Center for Biological and Computational Learning (MIT-CBCL) [26] datasets. The UMIST dataset is a multi-view face dataset consisting of 1012 images of 20 people, each covering a wide range of poses from profile to frontal views. Therefore, UMIST has been widely used for general-purpose face recognition under different face poses. The size of each image is 112×92 with 256 gray levels per pixel. In our simulation, we down-sample each image to 28×23 and perform no other preprocessing. The Extended Yale-B dataset contains 16,123 images of 38 human subjects under 9 poses and 64 illumination conditions. Because of the illumination variability, the same subject can appear dramatically different even when viewed in a fixed pose. Hence, this is another challenge for face recognition, and the Extended Yale-B dataset is extensively used for testing appearance-based face recognition methods. Similar to the UMIST dataset, the images are cropped and resized to 32×32 pixels. This dataset has around 64 near-frontal images under different illuminations per individual. The MIT-CBCL dataset provides 3240 synthetic images rendered from 3D head models of 10 people. The head models are generated by fitting a morphable model to high-resolution training images. Different from the UMIST dataset, the MIT-CBCL dataset is based on the 3D morphable model and is rendered under varying pose and illumination conditions, making the face recognition task more challenging. The size of each image is originally 200×200 with 256 gray levels per pixel. In our simulation, we down-sample each image to 32×32 and perform no other preprocessing. The detailed information of the datasets and some sample images are shown in Table 3 and Figure 7. For each dataset, we randomly select 10, 50, and 30 samples from each class as training samples for the UMIST, Extended Yale-B, and MIT-CBCL datasets, respectively. The test set is then formed by the selected or all remaining samples. The data partitioning for each dataset is also given in Table 3.

Next, we compare our method with other supervised and semisupervised dimensionality reduction methods. These methods include Regularized Linear Discriminant Analysis (RLDA), SDA [2], Lap-RLS/L [1], the least-square solution of SDA (referred to as LS-SDA) [28], FME [7, 10], and the proposed LGR. Note that Principal Component Analysis (PCA) is an unsupervised method, RLDA is a supervised method, and the remaining methods, including the proposed LGR, are all semisupervised methods. The simulation settings are as follows: for SDA and Lap-RLS/L, two parameters, i.e., $\alpha_t$ and $\alpha_m$, need to be determined to balance the trade-off between the manifold and Tikhonov terms. We use fivefold cross-validation to determine the best values, with the candidate set $\{10^{-9}, 10^{-6}, 10^{-3}, 10^{0}, 10^{3}, 10^{6}, 10^{9}\}$. The same candidate set is used to determine the best value of the Tikhonov regularization parameter $\alpha_t$ in RLDA and the additional regularization parameter $\alpha_r$ in FME and LGR. In order to eliminate the null space before performing dimensionality reduction, the training sets of all datasets are first processed with PCA. Since most of the methods, such as RLDA, SDA, Lap-RLS/L, FME, and the proposed LGR, have a limited rank of c−1, we simply reduce the dimensionality of all methods to c−1. Each method uses the labeled set in its reduced subspace to train a nearest-neighbor classifier, which is then used to evaluate the classification accuracy on the test set. We also report the nearest-neighbor classifier on the original features as a baseline.
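A hedged sketch of this protocol for a single random split is given below; it uses scikit-learn for the PCA preprocessing and the nearest-neighbor classifier, and `fit_projection` is a hypothetical stand-in for any of the compared dimensionality reduction methods (RLDA, SDA, Lap-RLS/L, FME, or LGR), not an API defined in the chapter.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

def evaluate_split(X_train, y_train, labeled_mask, X_test, y_test, fit_projection, n_classes):
    """X_*: samples in rows. fit_projection(X, y, labeled_mask, dim) -> (D, dim) matrix V
    is a stand-in for any compared method; dim is fixed to c - 1."""
    pca = PCA(n_components=min(X_train.shape) - 1).fit(X_train)    # remove the null space
    Xtr, Xte = pca.transform(X_train), pca.transform(X_test)
    V = fit_projection(Xtr, y_train, labeled_mask, n_classes - 1)  # reduce to c - 1 dimensions
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(Xtr[labeled_mask] @ V, y_train[labeled_mask])          # train 1-NN on the labeled set
    return knn.score(Xte @ V, y_test)                              # accuracy on the test set
```

The reported numbers in Table 4 are then averages of such accuracies over 20 random splits.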

Dataset         | Database type | #Samples | #Dim | #Class | #Training per class | #Test per class
UMIST           | Face          | 1012     | 1024 | 20     | 20                  | Remaining
Extended Yale-B | Face          | 16,123   | 1024 | 38     | 50                  | Remaining
MIT-CBCL        | Face          | 3240     | 1024 | 10     | 30                  | 30

Table 3.

Dataset information and data partition for each dataset.

Figure 7.

Sample images of the real-world datasets: (a) UMIST dataset, (b) Extended Yale-B dataset, (c) MIT-CBCL dataset.

The average accuracies over 20 random splits with the above parameter settings for each dataset are shown in Table 4. From the simulation results, we can make the following observations: (1) given sufficient labeled samples, all the supervised and semisupervised dimensionality reduction methods outperform the nearest-neighbor baseline due to the utilization of label information and feature extraction; (2) the semisupervised dimensionality reduction methods are better than the corresponding supervised methods. For example, SDA outperforms RLDA by about 5–6% on the COIL100 dataset with two labeled samples per class, and by 2–3% on the other datasets. This indicates that by incorporating the unlabeled set into the training procedure, the classification performance can be markedly improved, as the manifold structure embedded in the dataset is preserved; (3) we also observe that SDA and its least-square solution LS-SDA achieve the same classification results; (4) the proposed LGR delivers better accuracies than the other semisupervised dimensionality reduction methods, such as SDA and Lap-RLS/L, by about 3–4% on most datasets. The improvement can even reach almost 8% on the ETH80 dataset with two labeled samples per class. The improvement is attributed to the fact that LGR characterizes both the local and global discriminative information embedded in the dataset, which is better suited to the classification problem; (5) we observe that LGR outperforms FME by about 2% in most cases. The main reason is that LGR utilizes a weighted and normalized local discriminative Laplacian matrix to preserve both the manifold and discriminative structures of the dataset, which is better than relying only on the neighborhood graph.

UMIST dataset (mean±std, %):
Method    | Unlabeled (4) | Test (4) | Unlabeled (7) | Test (7) | Unlabeled (10) | Test (10)
Baseline  | 81.1±0.9 | 80.2±1.0 | 88.6±0.7 | 88.3±0.7 | 93.1±0.6 | 93.0±0.7
RLDA      | 85.2±0.6 | 85.0±0.7 | 90.7±0.5 | 90.4±0.6 | 95.3±0.4 | 94.4±0.5
SDA       | 86.4±0.7 | 86.3±0.7 | 92.1±0.6 | 91.7±0.7 | 96.2±0.4 | 95.4±0.5
LS-SDA    | 86.4±0.7 | 86.3±0.7 | 92.1±0.6 | 91.7±0.7 | 96.2±0.4 | 95.4±0.5
Lap-RLS/L | 86.6±0.7 | 86.0±0.8 | 91.9±0.3 | 91.9±0.4 | 95.7±0.5 | 95.3±0.6
FME       | 88.2±0.6 | 87.7±0.6 | 93.1±0.3 | 92.9±0.4 | 96.7±0.5 | 96.1±0.5
LGR       | 89.2±0.4 | 88.9±0.5 | 94.2±0.2 | 93.8±0.4 | 97.9±0.6 | 97.2±0.4

Extended Yale-B dataset (mean±std, %):
Method    | Unlabeled (4) | Test (4) | Unlabeled (7) | Test (7) | Unlabeled (10) | Test (10)
Baseline  | 50.5±1.9 | 50.1±1.9 | 63.9±1.5 | 63.6±1.7 | 69.8±1.5 | 69.4±1.6
RLDA      | 74.8±2.0 | 74.5±2.1 | 81.6±1.6 | 81.1±1.6 | 89.6±1.5 | 89.3±1.7
SDA       | 86.4±1.9 | 86.0±2.0 | 89.8±1.7 | 89.5±1.8 | 92.8±1.5 | 92.6±1.5
LS-SDA    | 86.4±1.9 | 86.0±2.0 | 89.8±1.7 | 89.5±1.8 | 92.8±1.5 | 92.6±1.5
Lap-RLS/L | 86.7±2.0 | 86.2±2.2 | 90.1±1.8 | 89.7±1.9 | 93.1±1.6 | 92.8±1.7
FME       | 88.2±1.8 | 88.0±2.0 | 91.3±1.7 | 91.1±1.9 | 93.9±1.5 | 93.6±1.6
LGR       | 89.5±1.7 | 89.1±2.2 | 92.2±1.6 | 92.1±1.6 | 94.8±1.4 | 94.5±2.0

MIT-CBCL dataset (mean±std, %):
Method    | Unlabeled (4) | Test (4) | Unlabeled (7) | Test (7) | Unlabeled (10) | Test (10)
Baseline  | 69.5±3.5 | 68.9±3.9 | 80.5±6.9 | 79.8±7.0 | 90.3±4.0 | 90.1±3.8
RLDA      | 72.6±2.9 | 73.8±4.6 | 82.5±7.6 | 82.2±7.2 | 92.1±3.2 | 92.9±3.0
SDA       | 75.7±2.9 | 76.9±4.7 | 84.7±6.7 | 84.2±6.5 | 95.0±3.3 | 94.9±3.1
LS-SDA    | 75.7±2.9 | 76.9±4.7 | 84.7±6.7 | 84.2±6.5 | 95.0±3.3 | 94.9±3.1
Lap-RLS/L | 75.2±4.0 | 76.1±4.1 | 84.7±6.5 | 84.1±6.5 | 93.7±4.3 | 93.7±3.6
FME       | 78.4±2.5 | 78.1±3.2 | 85.3±5.4 | 85.1±5.0 | 95.1±3.7 | 94.9±3.1
LGR       | 81.0±2.2 | 80.8±3.2 | 87.2±5.1 | 86.5±5.1 | 96.5±3.4 | 96.4±3.2

(Columns report results on the unlabeled set and the test set with 4, 7, and 10 labeled samples per class.)

Table 4.

Average classification accuracy over 20 random splits on the unlabeled set and test set of the different datasets (mean ± standard deviation).


4. Conclusion

In this chapter, we have proposed a semisupervised method, namely LGR, for face recognition. With the above analysis, the following conclusions can be drawn: (1) the proposed LGR achieves better results in face recognition than other state-of-the-art methods, as more discriminative information is captured by the local and global regressions; (2) the proposed LGR is robust to outliers and can handle imbalanced data; and (3) the proposed LGR can deal with out-of-sample extrapolation, estimating the labels of new-coming face data by casting them into the subspace given by the global projection matrix.


Appendix

In order to prove that $L_d$ is a graph Laplacian matrix, we need to prove that $L_d$ is a positive semidefinite matrix and that the sum of each row or column of $L_d$ is equal to zero. We first have the following lemmas:

Lemma 1. For each local patch Xj, Lj can be reformulated as follows:

$$L_j = \eta G_j (G_j^T X_j^T X_j G_j + \eta I)^{-1} G_j^T \qquad (E23)$$

where $G_j = \big(I - \Delta_j e_k^T e_k / (e_k \Delta_j e_k^T)\big) \Delta_j^{1/2} \in \mathbb{R}^{k \times k}$.

Proof. First, it can be easily noted that $G_j G_j^T = H_j$, which is verified as follows:

$$G_j G_j^T = \Big(I - \frac{\Delta_j e_k^T e_k}{e_k \Delta_j e_k^T}\Big) \Delta_j \Big(I - \frac{\Delta_j e_k^T e_k}{e_k \Delta_j e_k^T}\Big)^T = \Delta_j - \frac{2 \Delta_j e_k^T e_k \Delta_j}{e_k \Delta_j e_k^T} + \frac{\Delta_j e_k^T (e_k \Delta_j e_k^T) e_k \Delta_j}{(e_k \Delta_j e_k^T)^2} = \Delta_j - \frac{\Delta_j e_k^T e_k \Delta_j}{e_k \Delta_j e_k^T} = H_j \qquad (E24)$$

Then, we have

$$\begin{aligned} L_j &= H_j - H_j X_j^T (X_j H_j X_j^T + \eta I)^{-1} X_j H_j \\ &= G_j G_j^T - G_j G_j^T X_j^T (X_j G_j G_j^T X_j^T + \eta I)^{-1} X_j G_j G_j^T \\ &= G_j G_j^T - G_j G_j^T X_j^T X_j G_j (G_j^T X_j^T X_j G_j + \eta I)^{-1} G_j^T \\ &= G_j \big(G_j^T X_j^T X_j G_j + \eta I - G_j^T X_j^T X_j G_j\big) (G_j^T X_j^T X_j G_j + \eta I)^{-1} G_j^T \\ &= \eta G_j (G_j^T X_j^T X_j G_j + \eta I)^{-1} G_j^T \end{aligned} \qquad (E25)$$

The second equality holds since $A (A^T A + \lambda I)^{-1} = (A A^T + \lambda I)^{-1} A$ for any matrix $A$. Thus, Lemma 1 is proved.

Lemma 2. Given a positive semidefinite matrix C, DCDT is a positive semidefinite matrix for any matrix D.

Lemma 3. Given a set of positive semidefinite matrices $\{C_1, C_2, \ldots, C_n\}$, $\sum_{j=1}^{n} C_j$ is a positive semidefinite matrix.

We omit the proofs of Lemmas 2 and 3, which can be found in reference [15]. Then, with Lemmas 1–3, we can easily prove that $L_d$ is a graph Laplacian matrix as follows:

Proof. Following Lemma 1, we reformulate each $L_j$ as $L_j = \eta G_j (G_j^T X_j^T X_j G_j + \eta I)^{-1} G_j^T$. Note that $(G_j^T X_j^T X_j G_j + \eta I)^{-1}$ is a positive semidefinite matrix; then, following Lemmas 2 and 3, each $\eta S_j G_j (G_j^T X_j^T X_j G_j + \eta I)^{-1} G_j^T S_j^T$ is a positive semidefinite matrix, and $L_d$, i.e.,

$$L_d = \sum_{j=1}^{l+u} (S_j L_j S_j^T) = \sum_{j=1}^{l+u} \big( \eta S_j G_j (G_j^T X_j^T X_j G_j + \eta I)^{-1} G_j^T S_j^T \big) \qquad (E26)$$

is also a positive semidefinite matrix. In addition, for each $\eta S_j G_j (G_j^T X_j^T X_j G_j + \eta I)^{-1} G_j^T S_j^T$, we have $S_j^T e^T = e_k^T$ and

$$G_j^T e_k^T = \Delta_j^{1/2} \Big( e_k^T - \frac{e_k^T (e_k \Delta_j e_k^T)}{e_k \Delta_j e_k^T} \Big) = \Delta_j^{1/2} (e_k^T - e_k^T) = 0 \;\Rightarrow\; \eta S_j G_j (G_j^T X_j^T X_j G_j + \eta I)^{-1} G_j^T S_j^T e^T = 0 \;\Rightarrow\; L_d e^T = \sum_{j=1}^{l+u} \big( \eta S_j G_j (G_j^T X_j^T X_j G_j + \eta I)^{-1} G_j^T S_j^T \big) e^T = 0 \qquad (E27)$$

which indicates that the sum of each row or column of $L_d$ is equal to zero. We thus prove that $L_d$ is a graph Laplacian matrix.

References

  1. M. Belkin, P. Niyogi, V. Sindhwani. Manifold regularization: a geometric framework for learning from labeled and unlabeled samples. Journal of Machine Learning Research, 7:2399–2434, 2006.
  2. D. Cai, X. He, J. Han. Semi-supervised discriminant analysis. IEEE International Conference on Computer Vision, Rio de Janeiro, Brazil, IEEE, 1–7, 2007.
  3. X. Zhu, Z. Ghahramani, J. D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of ICML, Washington DC, USA, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.
  4. D. Zhou, O. Bousquet, T. N. Lal, J. Weston, B. Scholkopf. Learning with local and global consistency. In Proceedings of NIPS, Vancouver, Canada, MIT Press, Cambridge, MA, USA, 2004.
  5. M. Szummer, T. Jaakkola. Partially labeled classification with Markov random walks. In Proceedings of NIPS, Vancouver, Canada, MIT Press, Cambridge, MA, USA, 2002.
  6. F. Nie, H. Huang, X. Cai, C. Ding. Efficient and robust feature selection via joint L2,1-norms minimization. In Proceedings of NIPS, Vancouver, Canada, MIT Press, Cambridge, MA, USA, 2010.
  7. F. Nie, D. Xu, I. W. H. Tsang, C. Zhang. Flexible Manifold Embedding: a framework for semi-supervised and unsupervised dimension reduction. IEEE Transactions on Image Processing, 19(7):1921–1932, 2010.
  8. F. Nie, S. Xiang, Y. Liu, C. Zhang. A general graph-based semi-supervised learning with novel class discovery. Neural Computing and Applications, 19(4):549–555, 2010.
  9. F. Nie, D. Xu, X. Li, S. Xiang. Semi-supervised dimensionality reduction and classification through virtual label regression. IEEE Transactions on Systems, Man and Cybernetics, Part B, 41(3):675–685, 2011.
  10. F. Nie, D. Xu, I. W. H. Tsang, C. Zhang. A flexible and effective linearization method for subspace learning. In: Graph Embedding for Pattern Analysis, 177–203, Yun Fu, Yunqian Ma, Eds. Springer, New York, 2013.
  11. F. Wang, C. Zhang. Label propagation through linear neighborhoods. IEEE Transactions on Knowledge and Data Engineering, 20(1):55–67, 2008.
  12. J. Wang, F. Wang, C. Zhang, H. C. Shen, L. Quan. Linear neighborhood propagation and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(9):1600–1615, 2009.
  13. Y. Yang, F. Nie, D. Xu, J. Luo, Y. Zhuang, Y. Pan. A multimedia retrieval framework based on semi-supervised ranking and relevance feedback. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(5):723–742, 2012.
  14. Y. Yang, D. Xu, F. Nie, J. Luo, Y. Zhuang. Ranking with local regression and global alignment for cross media retrieval. In Proceedings of ACM MM, Beijing, China, ACM, New York, NY, USA, 2009.
  15. Y. Yang, F. Nie, S. Xiang, Y. Zhuang, W. Wang. Image clustering using local discriminant models and global integration. IEEE Transactions on Image Processing, 19(10):2761–2773, 2010.
  16. D. Wang, F. Nie, H. Huang. Large-scale adaptive semi-supervised learning via unified inductive and transductive model. In Proceedings of KDD, New York, NY, USA, ACM, New York, NY, USA, 2014.
  17. X. He, S. Yan, Y. Hu, P. Niyogi, H. Zhang. Face recognition using Laplacianfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3):328–340, 2005.
  18. S. Xiang, F. Nie, C. Zhang. Semi-supervised classification via local spline regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(11):2039–2053, 2010.
  19. S. Xiang, F. Nie, C. Zhang. Nonlinear dimensionality reduction with local spline embedding. IEEE Transactions on Knowledge and Data Engineering, 21(9):1285–1298, 2009.
  20. S. T. Roweis, L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
  21. S. A. Nene, S. K. Nayar, H. Murase. Columbia object image library (COIL-100). Technical Report CUCS-005-96, Columbia University, New York, NY, 1996.
  22. B. Leibe, B. Schiele. Analyzing appearance and contour based methods for object categorization. In Proceedings of CVPR, Madison, Wisconsin, USA, IEEE, 2003.
  23. R. Johnson, T. Zhang. On the effectiveness of Laplacian normalization for graph semi-supervised learning. Journal of Machine Learning Research, 8:1489–1517, 2007.
  24. D. B. Graham, N. M. Allinson. Characterizing virtual eigensignatures for general purpose face recognition. In: Face Recognition: From Theory to Applications, NATO ASI Series F, Computer and Systems Sciences, 163:446–456, 1998.
  25. K. C. Lee, J. Ho, D. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):947–963, 2005.
  26. B. Weyrauch, J. Huang, B. Heisele, V. Blanz. Component-based face recognition with 3D morphable models. First IEEE Workshop on Face Processing in Video, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington DC, USA, IEEE, 2004.
  27. M. Zhao, Z. Zhang, T. W. S. Chow. Trace ratio criterion based generalized discriminative learning for semi-supervised dimensionality reduction. Pattern Recognition, 45(4):1482–1499, 2012.
  28. M. Zhao, Z. Zhang, H. Zhang. Learning from local and global discriminative information for semi-supervised dimensionality reduction. The International Joint Conference on Neural Networks (IJCNN), 1–8, Dallas, TX, USA, IEEE, 2013.
  29. M. Zhao, Z. Zhang, T. W. S. Chow, B. Li. Soft label based linear discriminant analysis for image recognition and retrieval. Computer Vision and Image Understanding, 121:86–99, 2014.
  30. M. Zhao, Z. Zhang, T. W. S. Chow, B. Li. A general soft label based linear discriminant analysis for semi-supervised dimension reduction. Neural Networks, 55:83–97, 2014.
  31. M. Zhao, T. W. S. Chow, Z. Zhang, B. Li. Automatic image annotation via compact graph based semi-supervised learning. Knowledge-Based Systems, 76:148–165, 2015.
  32. M. Zhao, C. Zhan, Z. Wu, P. Tang. Semi-supervised image classification based on local and global regression. IEEE Signal Processing Letters, 22(10):1666–1670, 2015.
  33. M. Zhao, T. W. S. Chow, Z. Wu, Z. Zhang, B. Li. Learning from normalized local and global discriminative information for semi-supervised regression and dimensionality reduction. Information Sciences, 324:286–309, 2015.
