## 1. Introduction

Manifold learning has become well known for its ability to learn a representative geometry in a low-dimensional embedding, which significantly benefits data analysis and visualization. Observation of nonlinear data suggests that a low-dimensional smooth manifold (differentiable manifold) is embedded in the original high-dimensional space, and this structure remains implicit if only the metrics of the original space are considered. The purpose of a manifold learning algorithm is to learn such an embedding according to certain protocols, e.g., local linearity and global structure preservation. For the remainder of this chapter, the term manifold refers to smooth (differentiable) manifolds for convenience. As complex high-dimensional data, face images are a difficult topic in the field of computer vision due to complicated facial appearance variations, among which the head pose challenges many face-related applications. Accurate head pose estimation benefits face alignment and recognition, because frontal or near-frontal faces are easier to handle than other poses. It has been found that facial appearance lies on a manifold embedded in the original high-dimensional space of face images. Correspondingly, the head pose can also be represented as a low-dimensional embedding, which is more representative and discriminative for modeling the variation. Therefore, head pose estimation can be implemented by manifold learning.

In principle, the head pose refers to the orientation of the face relative to the imaging system, i.e., the camera center. 3D head transformation involves 6 degrees of freedom (DOF), which can be interpreted as the 3D translations plus the 3D rotations of pitch, yaw, and roll, as illustrated in **Figure 1**. Discarding the translations reduces the head pose to 3 DOF, which are sufficient to model most of the in-plane and out-of-plane rotations of the head. It can be found that pitch and yaw generate more self-occlusions, while roll can be easily corrected from the positions of the eyes. So, in this chapter, the 2 DOF of yaw and pitch are considered. Usually, the head poses of yaw and pitch lead to the problem of self-occlusion, which subsequently results in the loss of informative features, e.g., facial texture and shape. By comparison, frontal or near-frontal faces are relatively easier to deal with. Examples of various head poses [2] are given in **Figure 2**. The task of head pose estimation is actually to determine the yaw and pitch of an unknown face image, or to search for the frontal face in a database. Once the head pose is obtained, further applications such as face alignment and recognition benefit in efficiency and accuracy. Therefore, modern face-related research prefers to estimate the head pose first and then correct the head orientation, either by repositioning the face to near frontal or by warping faces in various poses to a frontal face template [3].

Basically, head pose estimation methods fall into several broad categories. Template-based methods treat head pose estimation as a verification (or classification) problem: the testing face is compared against a data set labeled with known poses, and the pose of the most similar sample, measured by various metrics, is retrieved as the estimate [4, 5]. Furthermore, pose detectors can be learned to simultaneously localize the face and recognize the pose [6]. Regression-based methods estimate a linear or nonlinear function with the original faces or extracted facial features as input variables and discrete or continuous poses as output [2]. Deformable models learn flexible facial modes [7–9]: by manipulating a set of parameters that specifies the pose, a specific face example can be generated and matched against the testing face. With the development of manifold learning [10–13], more promising results for head pose estimation have been achieved. The essence of such methods is the assumption that the discriminative modes of head pose lie on low-dimensional manifolds embedded in a high-dimensional space, i.e., the original color space or another low-level feature space [14]. The low-dimensional representation of head pose images can be learned by unsupervised or supervised manifold learning.

In contrast, template-based methods depend heavily on the training data: if no pose similar to the query exists in the training set, the estimate will be biased. Regression-based methods often require complicated regression models, for example, high-order polynomials; however, a complicated nonlinear function can overfit, which results in poor generalization. Deformable models require the localization of dense facial features, such as landmarks of facial components, which are themselves seriously influenced by the head pose. Manifold learning-based methods are somewhat limited by problems such as identity and noise sensitivity; however, simple efforts can efficiently improve their performance [15]. More importantly, manifold learning-based methods show promising generalization, and the head pose can be easily modeled and better visualized with low-dimensional features. More details will be given in the following sections.

According to the previous analysis, the main focus of this chapter is manifold learning-based head pose estimation. The main notations used in this chapter are listed and interpreted in Section 2. In Section 3, classical manifold learning algorithms are elaborated. In Section 4, adaptations and extensions of manifold learning algorithms that are more suitable for head pose estimation are discussed. Section 5 summarizes the work and lists some available resources for manifold learning.

## 2. Notations

**x**_{i} denotes the *i*th data point in the original *D*-dimensional space. **x** without subscript is used to represent an arbitrary data point.

**X** = [**x**_{1}, …, **x**_{M}] denotes the collection of *M* data points.

*N*(*i*) represents the set of *K*-nearest neighbors of the *i*th data point.

**y**_{i} denotes the *d*-dimensional representation of the *i*th data point after dimensionality reduction. Similarly, **y** without subscript is used to represent an arbitrary data point.

*w_{ij}* denotes the weight assigned when the *j*th data point is one of the *K* nearest neighbors of the *i*th data point, and **W** = {*w_{ij}*} denotes the weight matrix.

*λ*_{1}, …, *λ*_{D} denote the *D* eigenvalues (ranked decreasingly) of a *D×D* matrix.

**V**_{D×d} denotes the matrix whose columns are the corresponding top *d* eigenvectors; the matrix of the bottom *d* eigenvectors is denoted analogously.

**I** denotes the identity matrix.

## 3. Characteristics of manifold learning algorithms

Given a set of data points, for example, face images, it is difficult to directly estimate or extract the most significant modes from such a high-dimensional representation of the data. If the distribution of the data in the original feature space is linearly structured, the classical principal component analysis (PCA) is able to estimate the discriminative modes and then reduce the feature dimensions. An example of such a type of data is shown in **Figure 3**. However, if the distribution of the original data is nonlinear, for example, the famous “swiss roll” shown in **Figure 4(a)**, which is a smooth, continuous but nonlinear surface embedded in 3D space, the structure interpreted through Euclidean distance is less suitable for representing the distribution of the data. Taking the two circled points sampled from the manifold shown in **Figure 4(b)** for instance, their Euclidean distance is small, while their distance along the 3D surface is not guaranteed to be small. The embedded structure can be explored with the help of nonlinear dimensionality reduction, such as manifold learning algorithms. The learned low-dimensional representation can approximately model the real distances between the sampled data points, as shown in **Figure 4(c)**.

In this section, in order to reveal the essence of manifold learning, the PCA is initially detailed. Other classical manifold learning algorithms will be elaborated in the following.

### 3.1. Principal component analysis (PCA)

PCA is one of the most popular unsupervised linear dimensionality reduction algorithms. The intrinsic feature of PCA is to estimate a linear space whose basis preserves the maximum variations in the original data. Mathematically, the low-dimensional data can be obtained by a linear transformation of the original data, as denoted in Eq. (1):

$$\mathbf{y} = \mathbf{V}^{T}\mathbf{x} \tag{1}$$

where **x** is the centered data point. Each column **v** of the projection matrix **V** represents a principal component of the projection space. Let us take one of the principal components for instance; the objective is to preserve the maximum variation in the transformed data:

$$\mathbf{v}^{*} = \arg\max_{\|\mathbf{v}\|=1} \mathbf{v}^{T}\mathbf{C}\,\mathbf{v}, \qquad \mathbf{C} = \frac{1}{M}\sum_{i=1}^{M}\mathbf{x}_{i}\mathbf{x}_{i}^{T} \tag{2}$$

where **C** approximates the covariance matrix of the data given *M* sufficiently large. Eq. (2) is a form of Rayleigh quotient, which can be maximized by eigenvector decomposition of the covariance matrix **C**. The *d* eigenvectors corresponding to the top *d* positive eigenvalues are taken to construct the low-dimensional space **V**_{D×d}, onto which the original data are projected. Actually, Eq. (2) can be converted to

$$\mathbf{x} \approx \mathbf{V}\mathbf{y} = \sum_{k=1}^{d} y_{k}\,\mathbf{v}_{k} \tag{3}$$

which means that the original data can be linearly represented as a combination of the principal components.
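As a concrete illustration of Eqs. (1) and (2), the eigendecomposition route to PCA can be written in a few lines of NumPy. This is an illustrative sketch, not the chapter's implementation; rows of `X` stand in for vectorized face images:

```python
import numpy as np

def pca(X, d):
    """PCA sketch: rows of X are data points in R^D.
    Returns the projection matrix V (D x d) and the embedding Y (M x d)."""
    Xc = X - X.mean(axis=0)            # center the data (Eq. 1 assumes centered x)
    C = (Xc.T @ Xc) / len(X)           # covariance matrix C (D x D)
    vals, vecs = np.linalg.eigh(C)     # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:d]
    V = vecs[:, order]                 # top-d principal components as columns
    Y = Xc @ V                         # low-dimensional representation (Eq. 1)
    return V, Y
```

An unseen (centered) sample `x` is then embedded by `V.T @ x`, matching Eq. (1).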

Taking the head pose images of one identity shown in **Figure 2** as an example, PCA is applied to the vectorized images. **Figure 5** visualizes the low-dimensional representation of the face images in the first three dimensions. One can find obvious transitions for pitch and yaw along a 3D valley-shaped surface. The three principal components are visualized in **Figure 6**: each face image is decomposed into the mean face plus a weighted accumulation of the variations. The first and third eigenfaces (principal components) clearly show the variation in yaw. Therefore, PCA can model the head poses through some of the discriminative principal components.

### 3.2. Locally linear embedding (LLE)

From the observation of the data shown in **Figure 4**, the smooth manifold is globally nonlinear but can be seen as linear within a local neighborhood. On the basis of this observation, LLE attempts to represent each data point by a weighted linear combination of a number of its neighbors [11]. The weight matrix **W** can be obtained by minimizing the following objective function:

$$\varepsilon(\mathbf{W}) = \sum_{i=1}^{M} \Big\| \mathbf{x}_{i} - \sum_{j \in N(i)} w_{ij}\,\mathbf{x}_{j} \Big\|^{2} \tag{4}$$

where *w_{i}* denotes the *i*th row vector of the matrix **W**. Eq. (4) shows that LLE aims to minimize the total error of reconstructing each data point from its nearest neighbors, subject to the constraint that each row of weights sums to one. Specifically, **W** is a sparse matrix that assigns optimal weights to neighboring data points and zeros to nonneighboring data points. As a result, both the global nonlinearity and the local linearity are included in one identical form. A closed form of the weight matrix is

$$w_{ij} = \frac{\sum_{n} \big(\mathbf{C}^{(i)}\big)^{-1}_{jn}}{\sum_{m,n} \big(\mathbf{C}^{(i)}\big)^{-1}_{mn}}, \qquad C^{(i)}_{mn} = (\mathbf{x}_{i} - \mathbf{x}_{m})^{T}(\mathbf{x}_{i} - \mathbf{x}_{n}) \tag{5}$$

where *K* is the number of nearest neighbors, tuned for the specific problem, and **x**_{m} and **x**_{n} are the *m*th and *n*th neighbors of the *i*th data point.

Moreover, the weight matrix **W** is locally invariant to linear transformations, i.e., translation, rotation, and scaling. Therefore, it is reasonable to propose that the low-dimensional representation of the data can preserve the local geometry featured in the original space with the same weight matrix **W**. The next step is then to estimate such a low-dimensional (*d*-dimensional) representation **Y** by minimizing

$$\varepsilon(\mathbf{Y}) = \sum_{i=1}^{M} \Big\| \mathbf{y}_{i} - \sum_{j \in N(i)} w_{ij}\,\mathbf{y}_{j} \Big\|^{2} \tag{6}$$

Eq. (6) can be rewritten as

$$\varepsilon(\mathbf{Y}) = \mathrm{tr}\big(\mathbf{Y}^{T}\mathbf{M}\,\mathbf{Y}\big), \qquad \mathbf{M} = (\mathbf{I} - \mathbf{W})^{T}(\mathbf{I} - \mathbf{W}) \tag{7}$$

from which **Y** can be obtained: its columns correspond to the bottom *d* eigenvectors of **M**, discarding the trivial constant eigenvector.
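The whole LLE pipeline, neighbor search, weight solve, and final eigendecomposition, can be sketched in NumPy. The regularization of the local Gram matrix is a standard numerical safeguard, not part of Eq. (4):

```python
import numpy as np

def lle(X, K=8, d=2, reg=1e-3):
    """Minimal LLE sketch following Eqs. (4)-(7): rows of X are data points."""
    M = len(X)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(D2, np.inf)               # exclude each point from its own neighbors
    nbrs = np.argsort(D2, axis=1)[:, :K]
    W = np.zeros((M, M))
    for i in range(M):
        Z = X[nbrs[i]] - X[i]                  # neighbors shifted to the origin
        C = Z @ Z.T                            # local Gram matrix (Eq. 5)
        C += reg * np.trace(C) * np.eye(K)     # regularize when K > D
        w = np.linalg.solve(C, np.ones(K))
        W[i, nbrs[i]] = w / w.sum()            # weights sum to one
    Mmat = (np.eye(M) - W).T @ (np.eye(M) - W)  # Eq. (7)
    vals, vecs = np.linalg.eigh(Mmat)
    return vecs[:, 1:d + 1]                    # bottom d, skipping the constant vector
```

On a curve sampled in 3D, the 1D embedding recovers the ordering along the curve, mirroring how the pose angle is recovered from face images.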

The same data set used in the last experiment is processed with LLE, and the first three dimensions are shown in **Figure 7**. The variation of the head pose in yaw under different pitches is obvious, and the transition from one pose to another tends to be continuous and easy to locate. The learned manifold is smoother and more discriminative than that learned by PCA.

### 3.3. Isomap

Isomap [10] is an abbreviation of isometric feature mapping [17], an extension of the classical algorithm of multidimensional scaling (MDS) [18]. From the previous section, one can learn that LLE represents the nonlinearity of the original data by preserving local geometrical linearity. In contrast, Isomap proposes a global solution by constructing a graph over all pairwise data, which preserves the global structure of the manifold.

Specifically, Isomap first constructs a graph *G* = (*V*, *E*) with *V* as the vertices (the data points) and *E* as the edges (the adjacency connections). Adjacent vertices are connected with edges according to particular metrics. Taking the *i*th and *j*th data points for instance, an edge is added between them if their distance is smaller than a threshold *ϵ* (the *ϵ*-neighborhood criterion) or if one is among the *K*-nearest neighbors of the other (the *K*-nearest-neighbor criterion). The graph is then assigned distances among all pairwise vertices according to the edges: the distance between vertices joined by an edge is their Euclidean distance, while the distance between the remaining pairs is the length of the shortest path through the graph, i.e., the geodesic distance.

From the geodesic distance matrix **D**_{G}, the classical MDS is applied:

$$\tau(\mathbf{D}_{G}) = -\frac{\mathbf{H}\mathbf{S}\mathbf{H}}{2}, \qquad S_{ij} = (D_{G})_{ij}^{2}, \qquad \mathbf{H} = \mathbf{I} - \frac{1}{M}\mathbf{1}\mathbf{1}^{T} \tag{8}$$

where *τ* is defined as the double-centering operator, **S** is the matrix of squared geodesic distances, and **H** is the centering matrix. The *d*-dimensional representation **Y** is obtained by decomposing the matrix *τ*(**D**_{G}) and taking its top *d* eigenvectors, scaled by the square roots of the corresponding eigenvalues.
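A compact sketch of this procedure using SciPy's shortest-path routine for the geodesic distances (the neighborhood size `K` is an illustrative parameter):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, K=6, d=2):
    """Isomap sketch: KNN graph -> geodesic distances -> classical MDS."""
    M = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    G = np.full((M, M), np.inf)                # inf = no edge for csgraph
    idx = np.argsort(D, axis=1)[:, 1:K + 1]    # K nearest neighbors, excluding self
    for i in range(M):
        G[i, idx[i]] = D[i, idx[i]]
    G = np.minimum(G, G.T)                     # symmetrize the graph
    DG = shortest_path(G, method="D", directed=False)  # geodesic distances
    H = np.eye(M) - np.ones((M, M)) / M        # centering matrix
    tau = -H @ (DG ** 2) @ H / 2               # tau(D_G), classical MDS
    vals, vecs = np.linalg.eigh(tau)
    order = np.argsort(vals)[::-1][:d]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))
```

On points sampled evenly along an arc, the 1D Isomap embedding unrolls the arc almost perfectly, which is exactly the behavior exploited for pose angles.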

**Figure 8** shows the low-dimensional head poses obtained by Isomap. An interesting “bowl”-shaped embedding surface is obtained for the head pose images.

### 3.4. Laplacian eigenmaps (LE)

Compared to Isomap, the algorithm of LE also adopts the idea of a graph representation of the data. The difference is that the latter attempts to construct a weighted graph (rather than a distance graph) for the data, which is then represented as a Laplacian [12].

The first step of LE is to construct an adjacency graph whose vertices are the data points and whose edges are the adjacency connections between neighbors. A pair of points **x**_{i} and **x**_{j} is connected if ||**x**_{i} − **x**_{j}||² ≤ *ε*. The other criterion to connect or disconnect the pair of points is to check whether they are among the *K*-nearest neighbors of each other. The second step is to choose appropriate weights for the graph. There are two options: the heat kernel defines the weight as *w_{ij}* = e^{−||**x**_{i} − **x**_{j}||²/*t*} for connected edges, while the simple-minded option sets *w_{ij}* = 1 for connected edges; both options set the weight to zero otherwise.

The third step is to minimize the objective function

$$\varepsilon(\mathbf{Y}) = \sum_{i,j} \|\mathbf{y}_{i} - \mathbf{y}_{j}\|^{2}\, w_{ij} \tag{9}$$

whose minimization under the constraint **Y**^{T}**A****Y** = **I**, with **A** the diagonal degree matrix of **W** whose entries are *a_{ii}* = Σ_{j} *w_{ji}*, reduces to

$$\mathbf{L}\mathbf{y} = \lambda\,\mathbf{A}\mathbf{y} \tag{10}$$

where **L** = **A** − **W** is the Laplacian matrix of the weighted graph. This form is a generalized eigenvector decomposition, and the matrix **Y** whose columns are the bottom *d* eigenvectors decomposed from Eq. (10) (discarding the trivial constant eigenvector) is the *d*-dimensional representation of the data. By comparison, LE is less sensitive to outliers and noise due to its locality-preserving property: the weights for nonedges are set to zero, which diminishes the problem of short circuiting.
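The three LE steps translate directly into NumPy/SciPy; a sketch with illustrative neighborhood size and heat-kernel width:

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(X, K=6, d=2, t=1.0):
    """LE sketch: KNN graph, heat-kernel weights, solve L y = lambda A y (Eq. 10)."""
    M = len(X)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(D2, np.inf)
    idx = np.argsort(D2, axis=1)[:, :K]
    W = np.zeros((M, M))
    for i in range(M):
        W[i, idx[i]] = np.exp(-D2[i, idx[i]] / t)   # heat-kernel weights
    W = np.maximum(W, W.T)                          # symmetric adjacency
    A = np.diag(W.sum(axis=1))                      # degree matrix
    L = A - W                                       # graph Laplacian
    vals, vecs = eigh(L, A)                         # generalized eigenproblem (Eq. 10)
    return vecs[:, 1:d + 1]                         # drop the constant eigenvector
```

On an evenly sampled arc, the first nontrivial eigenvector varies monotonically along the arc, which is what makes the pose transition visible in the embedding.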

As shown in **Figure 9**, LE generates an embedding surface with the shape of a parabola, which is similar to the result obtained by LLE, but the latter produces a smoother and more symmetric surface. The variation in yaw from left to right is shown symmetrically, and the frontal face is located approximately at the bottom vertex.

### 3.5. Locality preserving projections (LPP)

The previously introduced algorithms do not clarify how an unseen data point is projected into the low-dimensional space. To solve this problem, LPP reformulates LE by representing the dimensionality reduction as a linear projection from the original to the low-dimensional data. The first two steps of LPP are exactly the same as in LE: construct the adjacency graph and compute the weights for each connection. The most significant difference is that LPP represents the dimensionality reduction from the original to the low-dimensional space as a projection **y** = **V**^{T}**x**. The problem is thereby converted to finding a projection space instead of directly computing the low-dimensional features. The generalized eigenvector decomposition defined in Eq. (10) is then reformulated as follows:

$$\mathbf{X}\mathbf{L}\mathbf{X}^{T}\mathbf{v} = \lambda\,\mathbf{X}\mathbf{A}\mathbf{X}^{T}\mathbf{v} \tag{11}$$

The bottom *d* eigenvectors decomposed from Eq. (11) construct the projection matrix **V**_{D×d}, and any (possibly unseen) data point is embedded by **y** = **V**^{T}**x**.
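The projection step of Eq. (11) can be sketched in NumPy/SciPy, assuming the adjacency weights **W** have already been built as in LE. The small ridge term is a numerical convenience, not part of the formulation:

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, W, d=2):
    """LPP sketch (Eq. 11): rows of X are data points, W a precomputed
    adjacency weight matrix as in LE; returns the projection V (D x d)."""
    A = np.diag(W.sum(axis=1))                     # degree matrix
    L = A - W                                      # graph Laplacian
    D = X.shape[1]
    Sl = X.T @ L @ X + 1e-6 * np.eye(D)            # X L X^T side (+ ridge)
    Sa = X.T @ A @ X + 1e-6 * np.eye(D)            # X A X^T side (+ ridge)
    vals, vecs = eigh(Sl, Sa)
    return vecs[:, :d]                             # bottom d eigenvectors
```

Unlike LE, the result is a projection matrix, so an unseen point `x` is embedded simply by `V.T @ x`.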

More advanced nonlinear manifold learning algorithms have been developed [13, 19], but in this section the core is the main idea of how to derive the low-dimensional representation of the head poses. Details of the advanced versions of the manifold learning algorithms can be explored in the original references.

## 4. Head pose estimation via manifold learning

The manifold learning methods can successfully model the head pose variations in both yaw and pitch, as discussed in the previous sections. However, several difficulties remain. The introduction of noise, for example, identity and illumination variations, affects the performance of those methods on head pose estimation. Another point is that they do not specify how the low-dimensional representation of an unseen head pose image is obtained (except LPP) or how the pose is estimated. In this section, more sophisticated methods are introduced to solve these problems based on the original or extended manifold learning algorithms.

### 4.1. PCA-based head pose estimation

In Ref. [20], PCA has been shown to be robust to identity variation. Another important conclusion is that an angular difference of 10° is found to be the lower bound for poses to be discriminative. For a data set constructed following this finding, PCA produces promising results for head pose estimation.

A kernel machine-based method using kernel PCA (KPCA) and the kernel support vector classifier (KSVC) is proposed in Ref. [21]. KPCA is an extension of classical PCA in which the covariance matrix **C** is replaced by a kernel matrix computed in a high-dimensional feature space. For a testing sample **x**_{ts}, it is first mapped by the kernel, and then the low-dimensional features are obtained with the projection matrix learned from KPCA; the KSVC finally classifies these features into pose classes.
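A minimal NumPy sketch of KPCA with an RBF kernel (the kernel width `gamma` is an illustrative parameter); the resulting features would then be fed to a classifier such as the KSVC:

```python
import numpy as np

def kpca(X, d=2, gamma=1.0):
    """KPCA sketch with an RBF kernel: eigendecompose the centered kernel
    matrix in place of the covariance matrix C of linear PCA.
    Returns the d-dimensional training features (M x d)."""
    M = len(X)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * D2)                        # kernel matrix
    H = np.eye(M) - np.ones((M, M)) / M
    Kc = H @ K @ H                                 # centering in feature space
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:d]
    alphas = vecs[:, order] / np.sqrt(np.maximum(vals[order], 1e-12))
    return Kc @ alphas                             # projections of training data
```

Projecting an unseen sample additionally requires evaluating (and centering) its kernel row against the training data, which is omitted here for brevity.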

### 4.2. View representation by Isomap

The derivation of how Isomap reduces the dimensionality of the original data has been introduced in the previous section. Now the problem is how to connect the head pose to the features. A pose parameter map **F** is proposed in Ref. [22] to build such a connection:

$$\boldsymbol{\Theta} = \mathbf{F}\,\mathbf{Y} \tag{12}$$

where **Y** is the low-dimensional representation of the data obtained from Isomap and **Θ** collects the pose angles. Actually, the matrix **F** can be seen as a set of linear transformations that map the features to the corresponding pose angles. At training time, the head poses **Θ** are given as annotations and the low-dimensional features **Y** are learned by manifold learning; then, **F** can be obtained using the singular value decomposition (SVD) of **Y**^{T}.

$$\mathbf{F} = \boldsymbol{\Theta}\,\mathbf{P}_{\mathbf{Y}}\,\mathbf{W}_{\mathbf{Y}}^{-1}\,\mathbf{U}_{\mathbf{Y}}^{T} \tag{13}$$

where **P**_{Y}, **W**_{Y}, and **U**_{Y} are given by the SVD of **Y**^{T}, i.e., **Y**^{T} = **P**_{Y}**W**_{Y}**U**_{Y}^{T}.

Given a testing image **x**_{ts}, its low-dimensional representation must first be located in the embedding **Y** learned from the training data. The first step is to construct a geodesic distance vector from the testing image to all the training images; the embedding of the testing image is then obtained by applying the learned MDS transformation to this vector:

$$\mathbf{y}_{ts} = \frac{1}{2}\,\mathbf{Y}^{+}\big(\bar{\boldsymbol{\delta}} - \boldsymbol{\delta}_{ts}\big) \tag{14}$$

where **δ**_{ts} is the vector of squared geodesic distances from the testing image to the training images, **δ̄** is the mean of the corresponding squared distances over the training set, and **Y**^{+} is the pseudoinverse of **Y**.

Finally, the estimated pose of the testing image is computed from

$$\boldsymbol{\theta}_{ts} = \mathbf{F}\,\mathbf{y}_{ts} \tag{15}$$
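The algebra can be checked on synthetic data: given training embeddings and pose annotations, the linear map **F** from embeddings to pose angles is recovered through the SVD (equivalently, the pseudoinverse) of the embedding matrix. `F_true` is a hypothetical ground-truth map used only to verify the recovery:

```python
import numpy as np

# Synthetic check of the pose parameter map: Theta = F Y, solved via the SVD of Y^T.
rng = np.random.default_rng(4)
Y = rng.normal(size=(3, 100))                  # d x M low-dimensional features
F_true = rng.normal(size=(2, 3))               # hypothetical true map (yaw, pitch)
Theta = F_true @ Y                             # pose annotations (2 x M)

P, w, Ut = np.linalg.svd(Y.T, full_matrices=False)  # Y^T = P diag(w) Ut
F = Theta @ (P / w) @ Ut                       # F = Theta P W^{-1} U^T
```

Since `Y` has full row rank here, `F` matches `F_true` up to numerical precision, and a "test" embedding is mapped back to its annotated pose by `F @ y`.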

The insight of this method focuses on the conversion from testing data to the subspace learned by nonlinear manifold learning. The algorithms of LLE and LE can also be generalized by the proposed idea.

### 4.3. The biased manifold embedding (BME)

Head pose estimation is affected by identity variation. The ideal case is to eliminate such negative effects, meaning that face images with close pose angles should stay nearer and those with quite different poses should stay farther apart in the low-dimensional manifold, even if they come from the same identity. Based on this statement, the BME is proposed to modify the distance matrix according to the pose angles, and it can be combined with almost all the classical algorithms [23].

The modified distance between a pair of data points is defined as

$$\tilde{D}(i,j) = \frac{P(i,j)}{\max_{m,n} P(m,n) - P(i,j)} \cdot \lambda \cdot D(i,j) \tag{16}$$

where *P*(*i*,*j*) = |*p_{i}* − *p_{j}*| is simply the absolute difference of the angles of the two poses, *D*(*i*,*j*) is the original distance, and *λ* is a constant. From the modified distance matrix, one can find that the distance between images with close poses is biased to be proportionally small, and images with the same pose are defined to be at zero distance.

In fact, the BME can be seen as a naïve version of supervised manifold learning: the head pose information is used as supervision to enhance the construction of the graph. For the head pose estimation stage, the generalized regression neural network (GRNN) [24] is applied to learn the nonlinear mapping for unseen data points, and linear multivariate regression is applied to estimate the head pose angle. This idea can be easily extended to the classical algorithms, e.g., Isomap, LLE, and LE, among which the biased LE achieves the lowest error rate on the FacePix data set [25].
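A sketch of the biased distance computation, assuming the proportional-bias form described above (the constant `lam` and the guard against division by zero are illustrative details):

```python
import numpy as np

def biased_distances(D, poses, lam=1.0):
    """BME sketch: scale each pairwise distance by a factor that grows with
    pose difference, so images with close poses are pulled together."""
    P = np.abs(poses[:, None] - poses[None, :])    # |p_i - p_j|
    Pmax = P.max()
    bias = P / np.maximum(Pmax - P, 1e-12)         # 0 for equal poses
    return lam * bias * D
```

The biased matrix then replaces the Euclidean distance matrix inside Isomap, LLE, or LE.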

### 4.4. Head pose estimation as frontal view search

The two remarkable head pose angles, yaw and pitch, cause the problem of self-occlusion, and compared with pitch, yaw makes the problem more serious. An extended manifold learning (EML) method is proposed to specialize head pose estimation to the variation in yaw only [15]. This work resorts to searching for the frontal view instead of directly estimating the head pose, which is more efficient and robust. The idea is based on the observation that the frontal face is located near the vertex of the symmetric embedding shape. However, if the pose distribution of the data is asymmetric, the location of the frontal face in the manifold shifts away from the vertex. Therefore, the first component of the EML method is data enhancement: all images are horizontally flipped, and both the original and flipped images are used for manifold learning. In order to make the method more robust to environmental variations, for example, illumination, the localized edge orientation histogram (LEOH) is presented to convert the original color maps into more representative features; the idea is inspired by the classical HoG feature [14]. The first step of LEOH is to apply a Canny edge detector to the original images. Then, the whole image is divided into *M* × *N* cells, and the gradient orientation is quantized into *N_{B}* bins. Next, the histograms of gradient orientation of the cells located in a block consisting of *P* × *Q* cells are accumulated and normalized. Finally, the LEOH feature is obtained by concatenating the block features. The proposed ideas can be easily incorporated into various manifold learning methods and improve the performance of frontal view searching.
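To make the pipeline concrete, here is a loose NumPy sketch of an LEOH-style descriptor. It is an assumption-laden stand-in: a simple gradient-magnitude threshold replaces the Canny detector, and the parameter names (`cells`, `n_bins`, `block`, `thresh`) are illustrative, not those of Ref. [15]:

```python
import numpy as np

def leoh(img, cells=(4, 4), n_bins=8, block=(2, 2), thresh=0.1):
    """LEOH-style sketch: threshold edge pixels, histogram their orientations
    per cell, normalize over P x Q blocks, and concatenate."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # orientation in [0, pi)
    edges = mag > thresh * mag.max()                 # crude edge map (not Canny)
    H, Wd = img.shape
    ch, cw = H // cells[0], Wd // cells[1]
    hist = np.zeros((cells[0], cells[1], n_bins))
    for r in range(cells[0]):
        for c in range(cells[1]):
            sl = (slice(r * ch, (r + 1) * ch), slice(c * cw, (c + 1) * cw))
            a = ang[sl][edges[sl]]                   # orientations of edge pixels
            if a.size:
                hist[r, c], _ = np.histogram(a, bins=n_bins, range=(0, np.pi))
    feats = []
    P, Q = block
    for r in range(cells[0] - P + 1):                # overlapping P x Q blocks
        for c in range(cells[1] - Q + 1):
            b = hist[r:r + P, c:c + Q].ravel()
            feats.append(b / (np.linalg.norm(b) + 1e-9))  # block normalization
    return np.concatenate(feats)
```

The concatenated vector then replaces the raw pixel values as the input to manifold learning.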

### 4.5. Head pose estimation by supervised manifold learning

A taxonomy of methods, which structures the general framework of manifold learning into several stages, is proposed in Ref. [26] to incorporate the head pose angles in one or more of the stages and thereby enable supervised manifold learning. A straightforward solution is the adaptation of the distance and weight matrices according to the angle difference between pairwise face images. The head pose estimation problem is then interpreted as a regression problem, whereas it was usually solved as a classification problem. As a result, continuous head poses can be generalized by the model.

The general framework of manifold learning can be represented as follows: Stage 1, neighborhood searching; Stage 2, graph weighting; Stage 3, low-dimensional manifold computation; and Stage 4, projection from unseen data to the manifold and pose estimation.

In Stage 1, the distance matrix **D** = {*d_{ij}*} can be adapted as follows:

$$\tilde{d}_{ij} = \frac{d_{ij}}{f(|\theta_{i} - \theta_{j}|)} \tag{17}$$

where *θ_{i}* and *θ_{j}* are the angles of the two poses, keeping the same denotation as in previous sections, and *f* is some positive function of reciprocal form, decreasing in the angle difference. Such an *f* encourages the distance to decrease for nearer poses and to increase for farther poses: the farther apart the poses are, the more penalty the distance gains.

In Stage 2, the weight matrix **W** = {*w_{ij}*} can be adapted by a similar incorporation of the supervision information:

$$\tilde{w}_{ij} = w_{ij} \cdot g(|\theta_{i} - \theta_{j}|) \tag{18}$$

where *g* is defined as some positive decreasing function, similar in spirit to the *f* applied in Stage 1.
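The Stage 1 and Stage 2 adaptations can be sketched as follows; the concrete `f` (a reciprocal) and `g` (a heat kernel) are illustrative choices, since the chapter leaves their exact forms to the cited methods:

```python
import numpy as np

def adapt_distance(D, theta, eps=1e-3):
    """Stage 1 sketch: divide distances by a reciprocal f so that pairs with
    similar poses come closer; f(u) = 1/(u + eps) is one illustrative choice."""
    diff = np.abs(theta[:, None] - theta[None, :])
    f = 1.0 / (diff + eps)
    return D / f                                   # equals D * (diff + eps)

def adapt_weights(W, theta, t=10.0):
    """Stage 2 sketch: multiply weights by a decreasing g, here a heat kernel
    on the pose difference."""
    diff = np.abs(theta[:, None] - theta[None, :])
    return W * np.exp(-diff ** 2 / t)
```

Both adapted matrices are then fed to the chosen manifold learning algorithm in place of the unsupervised ones.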

In Stage 3, let us take LLE for an instance. The original objective function of LLE shown in Eq. (6) can be adapted as follows:

$$\varepsilon(\mathbf{Y}) = \sum_{i} \Big\| \mathbf{y}_{i} - \sum_{j} w_{ij}\,\mathbf{y}_{j} \Big\|^{2} + \lambda \sum_{i,j} \|\mathbf{y}_{i} - \mathbf{y}_{j}\|^{2}\,\Lambda_{ij} \tag{19}$$

where **Λ** = {Λ_{ij}} measures the similarity between the angles of pairwise poses. A possible form of Λ is the heat kernel, Λ_{ij} = e^{−(θ_{i} − θ_{j})²/*t*}.

The adapted objective function preserves the local linearity of the original data and additionally enhances the similarity of neighborhoods that share similar poses; this is implemented by the second term of Eq. (19), which introduces the supervision information. Following the derivation from Eq. (6) to Eq. (7), Eq. (19) can be simplified as

$$\varepsilon(\mathbf{Y}) = \mathrm{tr}\big(\mathbf{Y}^{T}(\mathbf{M} + \lambda\mathbf{L})\,\mathbf{Y}\big) \tag{20}$$

where **L** is the Laplacian matrix computed from **Λ**. For the low-dimensional embedding, eigenvector decomposition of **M** + *λ***L** can be performed. By incorporating the supervision information, the method is much more capable of imposing a discriminative projection on the learned embedding.
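The supervised objective can be minimized exactly as in unsupervised LLE, by eigendecomposing the combined matrix **M** + *λ***L**; a NumPy sketch (the heat-kernel width `t` and the trade-off `lam` are illustrative):

```python
import numpy as np

def supervised_lle_embedding(W, theta, lam=0.1, d=2, t=100.0):
    """Sketch: combine the LLE matrix M with the Laplacian L of a
    pose-similarity graph Lambda, then take the bottom eigenvectors."""
    Mn = len(W)
    I = np.eye(Mn)
    Mmat = (I - W).T @ (I - W)                    # unsupervised LLE term
    diff = theta[:, None] - theta[None, :]
    Lam = np.exp(-diff ** 2 / t)                  # heat-kernel pose similarity
    L = np.diag(Lam.sum(axis=1)) - Lam            # Laplacian of Lambda
    vals, vecs = np.linalg.eigh(Mmat + lam * L)
    return vecs[:, 1:d + 1]                       # skip the trivial eigenvector
```

Here `W` is a row-stochastic LLE weight matrix and `theta` the vector of annotated pose angles.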

In Stage 4, the GRNN algorithm is applied to produce the mapping from unseen data to the low-dimensional embedding. At testing time, support vector regression (SVR) with an RBF kernel and smoothing cubic splines are adopted to estimate the pose.

A novel method of supervised manifold learning for head pose estimation [27, 28] is proposed based on the framework of the former method. Similarly, the pose angles are incorporated in the first three stages of the general manifold learning structure.

In Stage 1, an improved version of *f* is proposed:

$$f(|\theta_{i} - \theta_{j}|) = \frac{1}{|\theta_{i} - \theta_{j}| + \varepsilon} \tag{21}$$

where *f* takes a rectified reciprocal form and *ε* is an arbitrarily small positive constant that prevents the denominator of *f* from being zero. This adaptation of the distance matrix further enhances the effect of the supervision information during the neighbor search.

In Stage 2, taking LLE (NPE [29]) for an example, the local distance matrix of Eq. (4) is modified by weighting each neighbor's contribution with a pose-dependent term *g*(|*θ_{i}* − *θ_{j}*|), where *θ_{i}* is the angle of the reference face image **x**_{i} and *θ_{j}* that of its neighbor. This operation enhances the supervision during the computation of the locally correlated matrix.

In Stage 3, a supervised neighborhood-based Fisher discriminant analysis (SNFDA) is proposed. The basic idea is to make neighboring data points as close as possible and nonneighboring data points as far apart as possible in the low-dimensional embedding; the SNFDA can be seen as a postprocessing procedure in this stage. Based on the low-dimensional data **Y** obtained from the original LLE or the LLE modified in Stages 1 and 2, the within- and between-neighborhood scatter matrices are defined as:

$$\mathbf{S}_{w} = \frac{1}{2}\sum_{i,\,j \in N(i)} A_{ij}\,(\mathbf{y}_{i} - \mathbf{y}_{j})(\mathbf{y}_{i} - \mathbf{y}_{j})^{T}, \qquad \mathbf{S}_{b} = \frac{1}{2}\sum_{i,\,j \notin N(i)} A_{ij}\,(\mathbf{y}_{i} - \mathbf{y}_{j})(\mathbf{y}_{i} - \mathbf{y}_{j})^{T} \tag{22}$$

where *A_{ij}* is the affinity between **y**_{i} and **y**_{j}, defined in the form of the heat kernel *A_{ij}* = e^{−||**y**_{i} − **y**_{j}||²/*t*}. Details about the derivation of the scatter matrices can be found in the original reference. The transform matrix **T**_{SNFDA} of SNFDA is computed from the generalized eigenvector decomposition problem

$$\mathbf{S}_{b}\,\mathbf{t} = \lambda\,\mathbf{S}_{w}\,\mathbf{t} \tag{23}$$

The top *d* eigenvectors span **T**_{SNFDA}, and the transformed feature is obtained by **z** = **T**_{SNFDA}^{T}**y**. This supervised learning manner successfully introduces the supervision information into one framework, providing a “good” projection from the original data to the low-dimensional space. Owing to the supervised learning, more discriminative features for head pose estimation are obtained when the projection is applied to the original data.
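A sketch of the SNFDA step in NumPy/SciPy, under the assumed scatter-matrix forms above; the `neighbors` sets, ridge term, and heat-kernel width `t` are illustrative details:

```python
import numpy as np
from scipy.linalg import eigh

def snfda(Y, neighbors, d=2, t=1.0):
    """SNFDA sketch: within-/between-neighborhood scatter matrices with
    heat-kernel affinities, solved as S_b t = lambda S_w t."""
    M, dim = Y.shape
    D2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    A = np.exp(-D2 / t)                            # heat-kernel affinity
    Sw = np.zeros((dim, dim))
    Sb = np.zeros((dim, dim))
    for i in range(M):
        for j in range(M):
            if i == j:
                continue
            diff = (Y[i] - Y[j])[:, None]
            if j in neighbors[i]:
                Sw += A[i, j] * (diff @ diff.T)    # pull neighbors together
            else:
                Sb += A[i, j] * (diff @ diff.T)    # push non-neighbors apart
    Sw += 1e-6 * np.eye(dim)                       # regularize for invertibility
    vals, vecs = eigh(Sb, Sw)
    T = vecs[:, np.argsort(vals)[::-1][:d]]        # top d eigenvectors
    return T                                       # transformed feature: z = T.T @ y
```

The returned matrix plays the role of **T**_{SNFDA}: applying `T.T` to an embedded point yields the final discriminative feature.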

In Stage 4, at testing time, the GRNN is applied to map the unseen data point into the low-dimensional embedding, and the relevance vector machine (RVM) [30] is adopted to accomplish the pose estimation. Experimental results of the proposed method on the FacePix [25] and MIT-CBCL [31] databases show significant improvements over other state-of-the-art algorithms [23, 26], both with the Stage 3 adaptation alone and with Stages 1 + 2 + 3 combined. This indicates that the method is more robust to identity and illumination variations.

## 5. Summary

In this chapter, head pose estimation, one of the most challenging tasks in computer vision, was introduced, and the main types of methods were demonstrated and compared. In particular, the manifold learning-based methods received the most attention. In reality, data distributions in high-dimensional spaces, e.g., head pose images, are usually nonlinear: potential structures lie on nonlinear but smooth manifolds embedded in the original space. Manifold learning algorithms are able to discover and visualize such embeddings, and almost all of them are formalized on the assumption of local linearity of the nonlinear data. These algorithms greatly benefit head pose estimation, because the face orientations (yaw and pitch) are found to be distributed along specific manifolds. Promising performance is achieved by the classical manifold learning methods and is further improved by supervised manifold learning, which proves that the supervision information represented by the head pose angles is helpful for head pose estimation. However, there are still hurdles to overcome. Most of the methods are tested in different settings, e.g., each method uses a different database; a common evaluation framework would offer fairer comparisons. Features other than the simple color space can also be considered to better represent the face images. Some useful tools are available online to help better understand the work [16, 32, 33].