For the last decades, manifold learning has shown its advantage of efficient non-linear dimensionality reduction in data analysis. Based on the assumption that informative and discriminative representation of the data lies on a low-dimensional smooth manifold which implicitly embedded in the original high-dimensional space, manifold learning aims to learn the low-dimensional representation following some geometrical protocols, such as preserving piecewise local structure of the original data. Manifold learning also plays an important role in the applications of computer vision, i.e., face image analysis. According to the observations that many face-related research is benefitted by the head pose estimation, and the continuous variation of head pose can be modelled and interpreted as a low-dimensional smooth manifold, we will focus on the head pose estimation via manifold learning in this chapter. Generally, head pose is hard to directly explore from the high-dimensional space interpreted as face images, which is, however, can be efficiently represented in low-dimensional manifold. Therefore, in this chapter, classical manifold learning algorithms are introduced and the corresponding application on head pose estimation are elaborated. Several extensions of manifold learning algorithms which are developed especially for head pose estimation are also discussed and compared.
- manifold learning
- head pose estimation
- nonlinear feature reduction
- supervised manifold learning
- local linearity
- global geometry
Manifold learning becomes well known due to its property to learn the representative geometry in low-dimensional embedding, with which data analysis and visualization are significantly benefitted. From the observation of some nonlinear data, a low-dimensional smooth manifold (differentiable manifold) is embedded in the original high-dimensional space, which is implicit if we only consider the metrics of the original space. Manifold learning algorithm purpose is to learn such embedding according to some protocols, e.g., local linearity and global structure preserving. For the remainder of this chapter, the term manifold is used to refer as the smooth manifolds (differentiable manifolds) for convenience. As a complex high-dimensional data, face image analysis is a difficult topic in the field of computer vision due to the complicated facial appearance variations, among which the head pose challenges many face-related applications. Accurate head pose estimation is advantageous to face alignment and recognition, because frontal- or near-frontal faces are easier to handle compared with other poses. It has been found that facial appearance is lying on a manifold embedded form in the original high-dimensional space represented as face images. Correspondingly, the head pose can also be represented as a low-dimensional embedding, which is more representative and discriminative to model the variation. Therefore, the head pose estimation can be implemented by the manifold learning.
In principle, head pose refers to the view of the face to the imaging system, i.e., the camera center. 3D head transformation involves 6 degrees of freedom (DOF), which can be interpreted as the 3D translations and rotations from the head to the camera center. Among the six variables, the 3D rotations that are formally represented by pitch, roll, and yaw are taken as the head pose . A schematic demonstration taken from reference  is shown in Figure 1. This definition reduces the head pose to 3 DOF which are sufficient to model most of the in- and out-plane rotation of the head. It can be found that the pitch and yaw generate more self-occlusions and roll can be easily corrected by the position of eyes. So, in this chapter, the 2 DOF including yaw and pitch are considered. Usually, the head poses of yaw and pitch lead to the problem of self-occlusion, which subsequently results in the loss of informative features, e.g., facial texture and shapes. By comparing, frontal- or near-frontal faces are relatively easier to deal with. Examples of various head poses  are given in Figure 2. The task of the head pose estimation is actually to determine the yaw and pitch of an unknown face image or search the frontal face in a database. Once the head pose is obtained, further applications such as face alignment and recognition will be benefitted in efficiency and accuracy. Therefore, modern development of face-related research prefers to estimate the head pose and correct the head orientation by positioning the face to near frontal or warping the faces in various poses to a frontal face template .
Basically, head pose estimation methods broadly fall into several categories. Template-based methods treat the head pose estimation as a verification (or classification) problem. The testing face is projected to the data set labeled with known poses, the one from which the most significant similarity measured by various metrics is retrieved for the testing pose [4, 5]. Furthermore, pose detectors can be learned to simultaneously localize the face and recognize the pose . Regression-based methods estimate a linear or nonlinear function with the original faces or extracted facial features as input variables and discrete or continuous poses as output . Deformable models learn flexible facial modes [7–9]. By manipulating a set of parameters which specifies the pose, specific face example can be generated, which will be used to match the testing face. With the development of manifold learning [10–13], more promising results of head pose estimation are achieved. The essence of such methods is based on the assumption that the discriminative modes for head pose lie on low-dimensional manifolds embedded in high-dimensional space, i.e., the original color space or other low level feature space . The low-dimensional representation of the head pose images can be learned by unsupervised or supervised manifold learning.
In contrast, the template-based methods have the problem of serious dependence on training data. If similar poses to the query pose do not exist in the training set, the estimated result would be biased. The regression-based methods often require to use complicated regression models, for example, a high-order polynomial. However, complicated nonlinear function would cause the problem of overfitting, which will result in poor generalization of the model. The deformable models require the localization of dense facial features, such as landmarks of facial components, which are seriously influenced by the head pose. The manifold learning-based methods are somehow limited by some problems, such as identity and noise sensitivity; however, simple efforts can be made to efficiently improve the performance . More importantly, the manifold learning-based methods show promising performance of generalization. And the head pose can be easily modeled and better visualized with low-dimensional features. More details will be given in following sections.
According to the previous analysis, the main focus of this chapter will be on the manifold learning based on head pose estimation. The main notations used in this chapter are listed and interrupted in Section 2. In Section 3, classical manifold learning algorithms will be elaborated. In Section 4, adaptions and extensions of manifold learning algorithms, which are more suitable for head pose estimation, are discussed. Section 4 summaries the work, and some available resources of manifold learning are given.
denotes the data collection in the low-dimensional space.
is the covariance matrix for the centered data, where . Usually, this can be enabled by mean subtraction: , and .
is the weight matrix to model the graph for pairwise face images, which will be specified by different metrics, e.g. if the
is the distance matrix measuring the pairwise distances among the data points, which can be Euclidean or other distance metrics.
is a diagonal matrix whose elements are the
is the projection matrix, which is consisted of the eigenvectors corresponding to the ranked eigenvalues. are the top
3. Characteristics of manifold learning algorithms
Given a set of data points, for example, face images, it is difficult to directly estimate or extract the most significant modes from such high-dimensional representation of the data. If the distribution of data in the original feature space can be linearly structured, the classical principal component analysis (PCA) will be able to estimate the discriminative modes and then reduce the feature dimensions. An example of such a type of data is shown in Figure 3. However, if the data distribution of the original data is nonlinear, for example, the famous “swiss roll” shown in Figure 4(a), which is a smooth, continuous but nonlinear surface embedded in the 3D space, the structure interpreted as Euclidean distance is less preferable to represent the distribution of the data. Taking the two circled points sampled from the manifold shown in Figure 4(b), for instance, their Euclidean distance is close, while this is not guaranteed if the 3D structure is considered. The embedded structure can be explored with the help of nonlinear dimensionality reduction, such as manifold learning algorithms. The learned low-dimensional representation can approximately model the real distance of the sampled data points as shown in Figure 4(c).
In this section, in order to reveal the essence of manifold learning, the PCA is initially detailed. Other classical manifold learning algorithms will be elaborated in the following.
3.1. Principal component analysis (PCA)
PCA is one of the most popular unsupervised linear dimensionality reduction algorithms. The intrinsic feature of PCA is to estimate a linear space whose basis will be able to preserve the maximum variations in the original data. Mathematically, the low-dimensional data can be obtained by a linear transformation from the original data as denoted in Eq. (1).
makes Eq. (2) an unbiased estimation, which can be replaced by
which means that the original data can be linearly represented as a combination of the principal components.
Taking the head pose images of one identity shown in Figure 2, for example, the PCA is applied on the vectorized images. Figure 5 visualizes the low-dimensional representation of the face images in the first 3D dimensions. One can find obvious transitions for pitch and yaw along a 3D shape of valley. The three principal components are visualized in Figure 6, from which one face image is decomposed into a weighted accumulation of variations in the mean face. The first and third eigenfaces (principal component) clearly show the variation in yaw. Therefore, PCA can model the head poses as some of the discriminative principal components.
3.2. Locally linear embedding (LLE)
From the observation of the data shown in Figure 4, the smooth manifold is globally nonlinear but can be seen as linear from a local neighborhood. On the basis of this observation, the LLE attempts to represent each of the data by a weighted linear combination of a number of neighbors . The weight matrix
Moreover, the weight matrix
Eq. (6) can be rewritten as
where . Two constraints are made to center the data and avoid degenerate solutions: and . Then
The same data set used in the last experiment is processed with LLE and shown with the first three dimensions in Figure 7. The variation of the head pose in yaw with different pitch is obviously shown. The transition from a pose to another pose tends to be continuous and easy to locate. The learned manifold is smoother and more discriminative than PCA.
Isomap  is an abbreviation of isometric feature mapping , which is an extension of the classical algorithm of multidimensional scaling (MDS) . From the previous section, one can learn that the LLE represents the nonlinearity of the original data by preserving the local geometrical linearity. In contrast, the algorithm of Isomap proposed a global solution by constructing a graph for all pairwise data. This idea ensures the global optimum.
Specifically, Isomap firstly constructs a graph that can be represented as
From the distance matrix , an objective function is defined as
where and denote the distances conversion to inner products for the original data and the low-dimensional data, respectively. The operator
Figure 8 shows the results of the low-dimensional head poses obtained from Isomap. An interesting shape of “bowl” of the embedding surface obtained for the head pose images.
3.4. Laplacian eigenmaps (LE)
Compared to Isomap, the idea of graph representation of the data is also taken by the algorithm of LE. However, the difference is the later attempts to construct a weighted graph (other than distance graph) for the data, which is then represented as a Laplacian .
The first step of LE is to construct an adjacent graph whose vertices are the data points and edges are the adjacent connections for neighbors. A pair of points and are -neighbors and will be connected with edge if ||
The third step is to minimize an objective function
where the diagonal matrix of is computed by column sums of
As shown in Figure 9, the embedding surface with the shape of parabolais generated by LE, which is similar to the results obtained by LLE. But the latter produces smoother and more symmetric shape of the surface. The variation in yaw from left to right is shown symmetrically, and the frontal face approximately locates on the vertex of bottom.
3.5. Laplacian preserving projections (LPP)
The previously introduced algorithms do not clarify how an unseen data is projected to the low-dimensional space. To solve this problem, LPP reformulates the LE by representing the dimensionality reduction as a linear projection from the original to the low-dimensional data. The first two steps of LPP are exactly the same as LE, which construct the adjacent graph and compute the weights for each connection. The most significant difference is the LPP representing the dimensionality reduction from the original to the low-dimensional space as a projection
More improved nonlinear manifold learning algorithms are developed [13, 19], but in this section, the main idea of how to derive the low-dimensional representation of the head poses is the core. Details of the advanced versions of the manifold learning algorithms can be explored in the original references.
4. Head pose estimation via manifold learning
The manifold learning methods can successfully model the head pose variations in both yaw and pitch as discussed in the previous sections. However, there are still several difficulties to state. The introduction of noise, for example, identity, and illumination variations will affect the performance of those methods on the head pose estimation. Another point is that they do not infer how the low-dimensional representation of an unseen head pose image is obtained (except LPP) and how the pose is estimated. In this section, more sophisticated methods are introduced to solve these problems based on the original or extended manifold learning algorithms.
4.1. PCA-based head pose estimation
In Ref. , the PCA has been turned to be robust to invariance of identity. Another important conclusion is that the angle of 10° is found to be the lower bound to be discriminative. For the data set constructed following this finding, the PCA would produce promising results for head pose estimation.
A kernel machine-based method is proposed using the kernel PCA (KPCA) and kernel support vector classifier (KSVC) . The KPCA is an extension of the classical PCA. Let be a kernel that maps the original dimensional into a higher dimensional, which makes the nonlinearly separable data linearly separable in the higher dimensional space. Correspondingly, the covariance matrix
4.2. View representation by Isomap
The derivation of how the Isomap reduces the dimensionality of the original data to a low-dimension has been introduced in the previous section. Now the problem is how to connect the head pose to the features. A pose parameter map
where denotes the angles of the head poses from the training data and
Given a testing image , the goal now is to obtain the low-dimensional feature according to the embedding
Finally, the estimated pose of the testing image is computed from
The insight of this method focuses on the conversion from testing data to the subspace learned by nonlinear manifold learning. The algorithms of LLE and LE can also be generalized by the proposed idea.
4.3. The biased manifold embedding (BME)
The head pose estimation is subjected to the identity variation. The ideal case is to eliminate such negative effects, which means the face images with close pose angles should maintain nearer and the ones with quite different poses should stay farther in the low-dimensional manifold, even the poses are from the same identity. Based on this statement, the BME is proposed to modify the distance matrix according to the pose angles, which can be extended with almost all the classical algorithms .
The modified distance between a pair of data points and is given by:
In fact, the BME can be seen as a naÏve version of the supervised manifold learning. The head pose information is used as the supervision to enhance the construction of the graph. For the head pose estimation stage, the generalized regression neural network (GRNN)  is applied to learn the nonlinear mapping for the unseen data points, and linear multivariate regression is applied to estimate the head pose angle. This idea can be easily extended to the classical algorithms, e.g., Isomap, LLE, and LE, among which the biased LE achieves the lowest error rate on the data set of FacePix .
4.4. Head pose estimation as frontal view search
The two remarkable head poses, i.e., yaw and pitch, cause the problem self-occlusion. Compared with pitch, the yaw makes the problem more serious. An extended manifold learning (EML) method is proposed to specify the head pose estimation only considering the variation in the yaw . This work resorts to the frontal view search instead of directly estimating the head pose, which is more efficient and robust. The idea is based on the observation that the frontal face locates nearly at the vertex in the symmetrical shape of the embedding. However, if the pose distribution of the data is asymmetric, the location of the frontal face in the manifold will shift from the vertex. Therefore, the first trial of the EML method is data enhancement. All the images are horizontally flipped and both the original and flipped images are used for manifold learning. In order to make the method more robust to variations in environment, for example, illumination, the localized edge orientation histogram (LEOH) is presented to represent the original color mappings as more representative features. The idea is inspired by the classical HoG feature . The first step of LEOH is to apply a Canny edge detector on the original images. Then, the whole image is divided into
4.5. Head pose estimation by supervised manifold learning
A taxonomy of methods, which structures the general framework of manifold learning into several stages, is proposed to incorporate the head pose angles in one or some of the stages to enable the supervised manifold learning . A straightforward solution could be the adaption of the distance and weight matrix according to the angle difference between pairwise face images. The head pose estimation problem is then interrupted as a regression problem, which was usually solved as a classification problem. As a result, continuous head poses can be generalized by the model.
The general framework of manifold learning can be represented as follows: Stage 1, neighborhood searching; Stage 2, graph weighting; Stage 3, low-dimensional manifold computation; and Stage 4, projection from unseen data to the manifold and pose estimation.
In Stage 1, the distance matrix of
In Stage 2, the weight matrix of
In Stage 3, let us take the LLE for an instance. The original objective function of LLE shown in Eq. (6) can be adapted as follows:
The adaption of the objective function can preserve the local linearity of the original data and enhance the similarity for neighborhoods, which are facilitated with similar poses. This is implemented by the second term of Eq. (19) that introduces the supervision information. Following the derivation from Eq. (6) to Eq. (7), Eq. (19) can be simplified as:
where is the Laplacian matrix of
In Stage 4, the GRNN algorithm is applied to produce the mapping from unseen data to the low-dimensional embedding. During testing time, the support vector regression (SVR) with RBF kernel and smoothing cubic splines are taken.
A novel method of supervised manifold learning for head pose estimation [27, 28] is proposed based on the framework from the former method. Similarly, angles of poses are incorporated in all three stages of the general manifold learning structure.
In Stage 1, an improved version of
where . θi is the angle of the reference face image
In Stage 3, a supervised neighborhood-based fisher discriminant analysis (SNFDA) is proposed. The basic idea is to make the neighboring data points as close as possible and the nonneighboring data points as far as possible in the low-dimensional embedding. The SNFDA can be seen as a postprocessing procedure in this stage. Based on the low-dimensional represented data
Details about the inference of the scatter matrices can be found in the original reference. The transformed matrix
In Stage 4, during testing time, the GRNN is applied to map the unseen data point to the low-dimensional embedding and the relevance vector machine (RVM)  is adopted to accomplish the pose estimation. Experimental results obtained by the proposed method performing on the database of FacePix  and MIT-CBCL  show big improvements compared with other state-of-the-art algorithms [23, 26] in Stage 3 and Stages 1 + 2 + 3. This means that this method is more robust for identity and illumination variations.
In this chapter, the head pose estimation, one of the most challenging tasks in the area of computer vision, is introduced, and the main types of methods are demonstrated and compared. Particularly, the manifold learning-based methods are attracted more attention. In reality, data distribution is usually nonlinear in high-dimensional represented space, e.g., the head pose images. Some potential structures are lying on nonlinear but smooth manifolds which are embedded in the original space. The manifold learning algorithms are able to discover and visualize such embedding. Almost all the algorithms are formalized based on the assumption of the local linearity of the nonlinear data. Those algorithms highly benefit the application of head pose estimation, because the face orientations (yaw and pitch) are found to be distributed along some specific manifolds. Promising performance is achieved by the classical manifold learning methods, which, however, are highly improved by the supervised manifold learning. It proves that the supervised information represented as angles of head poses is helpful in head pose estimation. However, there are still hurdles to take. Most of the methods are tested in different settings, e.g., different database is used in different method. A common framework could help to offer fair justifications. Other feature instead of the simple color space can be considered to better represent the face images. Some useful tools are available online to help better understand the work [16, 32, 33].