Abstract
Over the last few decades, manifold learning has demonstrated its advantages for efficient nonlinear dimensionality reduction in data analysis. Based on the assumption that an informative and discriminative representation of the data lies on a low-dimensional smooth manifold implicitly embedded in the original high-dimensional space, manifold learning aims to learn the low-dimensional representation following geometrical protocols, such as preserving the piecewise local structure of the original data. Manifold learning also plays an important role in computer vision applications such as face image analysis. Motivated by the observations that much face-related research benefits from head pose estimation, and that the continuous variation of head pose can be modeled and interpreted as a low-dimensional smooth manifold, this chapter focuses on head pose estimation via manifold learning. In general, the head pose is hard to extract directly from the high-dimensional space of face images, but it can be efficiently represented in a low-dimensional manifold. Therefore, in this chapter, classical manifold learning algorithms are introduced and their application to head pose estimation is elaborated. Several extensions of manifold learning algorithms developed especially for head pose estimation are also discussed and compared.
Keywords
- manifold learning
- head pose estimation
- nonlinear feature reduction
- supervised manifold learning
- local linearity
- global geometry
1. Introduction
Manifold learning has become well known for its ability to learn representative geometry in a low-dimensional embedding, which significantly benefits data analysis and visualization. Observations of nonlinear data suggest that a low-dimensional smooth manifold (differentiable manifold) is embedded in the original high-dimensional space, which remains implicit if only the metrics of the original space are considered. The purpose of a manifold learning algorithm is to learn such an embedding according to certain protocols, e.g., local linearity and preservation of the global structure. For the remainder of this chapter, the term manifold is used to refer to smooth (differentiable) manifolds for convenience. As complex high-dimensional data, face images are a difficult topic in computer vision due to the complicated facial appearance variations, among which head pose challenges many face-related applications. Accurate head pose estimation is advantageous for face alignment and recognition, because frontal or near-frontal faces are easier to handle than other poses. It has been found that facial appearance lies on a manifold embedded in the original high-dimensional space of face images. Correspondingly, the head pose can also be represented as a low-dimensional embedding, which is more representative and discriminative for modeling its variation. Therefore, head pose estimation can be implemented by manifold learning.
In principle, head pose refers to the orientation of the face with respect to the imaging system, i.e., the camera center. A 3D head transformation involves 6 degrees of freedom (DOF), which can be interpreted as the 3D translations along, and the 3D rotations about, the camera axes. Since the translations mainly change the position rather than the appearance pattern of the face, head pose estimation usually concerns only the three rotation angles, i.e., pitch, yaw, and roll, which are illustrated in Figure 1; examples of faces under various yaw and pitch angles are shown in Figure 2.

Figure 1.
The 3 DOF of head pose, i.e., pitch, yaw, and roll, as described in reference [1].

Figure 2.
Examples of various head poses. The images are cropped, centered, and resized to 64×64 pixels from the originals. One individual is selected and shown at different yaw and pitch angles. From left to right, the yaw varies over −90, −60, −30, 0, 30, 60, and 90°. From top to bottom, the pitch varies over 30, 0, and −30°. One can see that self-occlusion effects grow with increasing yaw and pitch. The frontal face (center image) shows a full overview of the face.
Basically, head pose estimation methods fall into several broad categories. Template-based methods treat head pose estimation as a verification (or classification) problem: the testing face is compared against a data set labeled with known poses, and the pose whose template gives the most significant similarity, measured by various metrics, is retrieved as the estimate [4, 5]. Furthermore, pose detectors can be learned to simultaneously localize the face and recognize the pose [6]. Regression-based methods estimate a linear or nonlinear function with the original faces or extracted facial features as input and discrete or continuous poses as output [2]. Deformable model methods learn flexible facial models [7–9]: by manipulating a set of parameters that specify the pose, a specific face instance can be generated and matched against the testing face. With the development of manifold learning [10–13], more promising head pose estimation results have been achieved. The essence of such methods is the assumption that the discriminative modes of head pose lie on low-dimensional manifolds embedded in a high-dimensional space, i.e., the original color space or some other low-level feature space [14]. The low-dimensional representation of the head pose images can then be learned by unsupervised or supervised manifold learning.
Each category has its limitations. Template-based methods depend heavily on the training data: if no pose similar to the query exists in the training set, the estimate will be biased. Regression-based methods often require complicated regression models, for example, high-order polynomials; however, a complicated nonlinear function can overfit, resulting in poor generalization. Deformable models require the localization of dense facial features, such as landmarks of facial components, which are themselves strongly influenced by head pose. Manifold learning-based methods are limited by certain problems, such as sensitivity to identity and noise; however, simple measures can efficiently improve their performance [15]. More importantly, manifold learning-based methods generalize well, and the head pose can be easily modeled and better visualized with low-dimensional features. More details are given in the following sections.
According to the previous analysis, the main focus of this chapter is manifold learning-based head pose estimation. The main notations used in this chapter are listed and explained in Section 2. In Section 3, classical manifold learning algorithms are elaborated. In Section 4, adaptations and extensions of manifold learning algorithms that are more suitable for head pose estimation are discussed. Section 5 summarizes the work and lists some available resources on manifold learning.
2. Notations

Throughout this chapter, matrices are written as bold uppercase letters and vectors as bold lowercase letters. The original data set is denoted by $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_N] \in \mathbb{R}^{D \times N}$, where $N$ is the number of samples (e.g., vectorized face images) and $D$ is the original dimensionality; the corresponding low-dimensional representation is $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_N] \in \mathbb{R}^{d \times N}$ with $d \ll D$. $\mathbf{W}$ denotes a weight (or projection) matrix, $\mathbf{D}$ a distance or degree matrix, $\mathbf{L}$ a graph Laplacian, $\mathbf{I}$ the identity matrix, and $k$ the number of nearest neighbors.
3. Characteristics of manifold learning algorithms
Given a set of data points, for example, face images, it is difficult to directly estimate or extract the most significant modes from such a high-dimensional representation. If the data in the original feature space are linearly structured, classical principal component analysis (PCA) is able to estimate the discriminative modes and reduce the feature dimensionality; an example of such data is shown in Figure 3. However, if the original data distribution is nonlinear, as in the famous "swiss roll" shown in Figure 4(a), a smooth, continuous but nonlinear surface embedded in 3D space, a structure based on Euclidean distance represents the data poorly. Taking the two circled points sampled from the manifold in Figure 4(b) as an example, their Euclidean distance is small, yet they lie far apart when the distance is measured along the embedded surface. The embedded structure can be explored with the help of nonlinear dimensionality reduction, such as manifold learning algorithms; the learned low-dimensional representation can approximately model the true distances between the sampled data points, as shown in Figure 4(c).
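As a quick illustration of this gap between Euclidean and manifold distances, the following sketch compares the straight-line distance between two swiss-roll points with their shortest-path (approximate geodesic) distance along a neighborhood graph. It assumes scikit-learn and SciPy are available, and the chosen point indices are arbitrary placeholders.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

X, t = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)

i, j = 0, 1  # illustrative indices; any pair of points can be inspected

# Straight-line (Euclidean) distance in the ambient 3D space.
euclid = np.linalg.norm(X[i] - X[j])

# Approximate geodesic distance: shortest path on a k-nearest-neighbor graph.
graph = kneighbors_graph(X, n_neighbors=10, mode='distance')
geodesic = shortest_path(graph, method='D', directed=False, indices=i)[j]

print(f"Euclidean: {euclid:.2f}, geodesic: {geodesic:.2f}")
```

For points on adjacent sheets of the roll, the geodesic distance is much larger than the Euclidean one, which is exactly the discrepancy illustrated in Figure 4(b).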

Figure 3.
A data set sampled from a multivariate Gaussian distribution. The most significant modes, indicated by the red orthogonal axes, can be learned by PCA, which preserves the largest variations in the original data.

Figure 4.
An example of a data set containing a latent "swiss roll" structure. The figures are produced based on the code from [16].
In this section, in order to reveal the essence of manifold learning, PCA is detailed first. Other classical manifold learning algorithms are elaborated in the following subsections.
3.1. Principal component analysis (PCA)
PCA is one of the most popular unsupervised linear dimensionality reduction algorithms. The intrinsic idea of PCA is to estimate a linear subspace whose basis preserves the maximum variation of the original data. Mathematically, assuming the data are centered, the low-dimensional data can be obtained by a linear transformation of the original data, as denoted in Eq. (1):

$$\mathbf{Y} = \mathbf{W}^{\top}\mathbf{X} \tag{1}$$

where the columns of the transformation matrix $\mathbf{W} \in \mathbb{R}^{D \times d}$ are the $d$ leading eigenvectors of the sample covariance matrix of $\mathbf{X}$, i.e., the principal components. Conversely, $\mathbf{X} \approx \mathbf{W}\mathbf{Y}$, which means that the original data can be linearly represented as a combination of the principal components.
Taking the head pose images of one identity shown in Figure 2 as an example, PCA is applied to the vectorized images. Figure 5 visualizes the low-dimensional representation of the face images in the first three dimensions. One can find obvious transitions in pitch and yaw along a 3D valley-shaped surface. The first three principal components are visualized in Figure 6, which shows that one face image can be decomposed into the mean face plus a weighted combination of the eigenfaces (principal components). The first and third eigenfaces clearly show the variation in yaw. Therefore, PCA can model the head poses with some of the discriminative principal components.
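The eigenface decomposition described above can be reproduced with a few lines of scikit-learn. Here `images` is a random placeholder standing in for the cropped 64×64 faces of Figure 2, so the sketch runs end to end while the actual loading step remains hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder for the 21 cropped 64x64 faces of Figure 2 (hypothetical data).
images = np.random.rand(21, 64, 64)

X = images.reshape(len(images), -1)     # vectorize each image to 4096-D
pca = PCA(n_components=3)
Y = pca.fit_transform(X)                # 3-D coordinates as plotted in Figure 5

mean_face = pca.mean_.reshape(64, 64)
eigenfaces = pca.components_.reshape(3, 64, 64)   # the eigenfaces of Figure 6

# One face = mean face + weighted sum of eigenfaces, inverting Eq. (1).
recon = mean_face + np.tensordot(Y[0], eigenfaces, axes=1)
```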

Figure 5.
Visualization of the low-dimensional features obtained by PCA. A valley-shaped surface can be found. Blue dots show the face images sampled from the surface, and some face images are selected and shown. (a) The variation in pitch is shown at a yaw of −90° in a specific view of the surface. (b) The variation in yaw is shown at a pitch of 0° in another view of the surface.

Figure 6.
Representation of one face image by the mean face and first three eigenfaces obtained from PCA.
3.2. Locally linear embedding (LLE)
From the observation of the data shown in Figure 4, the smooth manifold is globally nonlinear but can be regarded as linear within a local neighborhood. On the basis of this observation, LLE attempts to represent each data point by a weighted linear combination of a number of its neighbors [11]. The weight matrix $\mathbf{W}$ is estimated by minimizing the reconstruction error

$$\varepsilon(\mathbf{W}) = \sum_{i=1}^{N} \Big\| \mathbf{x}_i - \sum_{j} W_{ij}\,\mathbf{x}_j \Big\|^2 \tag{2}$$

where $W_{ij} = 0$ if $\mathbf{x}_j$ is not among the $k$ nearest neighbors of $\mathbf{x}_i$, and each row of $\mathbf{W}$ sums to one, i.e., $\sum_j W_{ij} = 1$. Under these constraints, the optimal weights of each point have the closed-form solution

$$W_{ij} = \frac{\sum_{k} (\mathbf{G}^{-1})_{jk}}{\sum_{l,m} (\mathbf{G}^{-1})_{lm}} \tag{3}$$

where $\mathbf{G}$ is the local Gram (distance) matrix of the neighborhood of $\mathbf{x}_i$,

$$G_{jk} = (\mathbf{x}_i - \mathbf{x}_j)^{\top}(\mathbf{x}_i - \mathbf{x}_k) \tag{4}$$

Moreover, the weight matrix $\mathbf{W}$ is then held fixed, and the low-dimensional embedding $\mathbf{Y}$ is required to be centered with unit covariance,

$$\sum_{i} \mathbf{y}_i = \mathbf{0}, \qquad \frac{1}{N}\sum_{i} \mathbf{y}_i\mathbf{y}_i^{\top} = \mathbf{I} \tag{5}$$

and is found by minimizing the embedding cost

$$\Phi(\mathbf{Y}) = \sum_{i=1}^{N} \Big\| \mathbf{y}_i - \sum_{j} W_{ij}\,\mathbf{y}_j \Big\|^2 \tag{6}$$

Eq. (6) can be rewritten as

$$\Phi(\mathbf{Y}) = \mathrm{tr}\big(\mathbf{Y}\mathbf{M}\mathbf{Y}^{\top}\big), \qquad \mathbf{M} = (\mathbf{I}-\mathbf{W})^{\top}(\mathbf{I}-\mathbf{W}) \tag{7}$$

where $\mathbf{M}$ is sparse and symmetric, so the optimal embedding is given by the eigenvectors of $\mathbf{M}$ corresponding to the $d$ smallest nonzero eigenvalues.
The same data set used in the previous experiment is processed with LLE, and the first three dimensions are shown in Figure 7. The variation of the head pose in yaw at different pitch angles is clearly visible. The transition from one pose to another tends to be continuous and easy to locate, and the learned manifold is smoother and more discriminative than that of PCA.
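A minimal sketch of this experiment with scikit-learn's LLE implementation; `X` is again a random placeholder for the vectorized face images.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

X = np.random.rand(200, 4096)    # placeholder for vectorized face images

lle = LocallyLinearEmbedding(n_neighbors=12, n_components=3)
Y_lle = lle.fit_transform(X)     # 3-D embedding as visualized in Figure 7
print(lle.reconstruction_error_) # value of the cost of Eq. (7) at the optimum
```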

Figure 7.
Visualization of the low-dimensional features obtained by LLE. A surface of "wings" is observed. (a) The variation in yaw at a pitch of −30° is found along the edge of one "wing" of the 3D surface. (b) Another variation in yaw at a pitch of 0° is found along the ridge of the 3D surface.
3.3. Isomap
Isomap [10] is an abbreviation of isometric feature mapping [17], an extension of the classical multidimensional scaling (MDS) algorithm [18]. From the previous section, one can see that LLE represents the nonlinearity of the original data by preserving local geometric linearity. In contrast, Isomap offers a global solution by constructing a graph over all the data and preserving the pairwise geodesic distances, which favors a globally consistent embedding.

Specifically, Isomap first constructs a graph $G = (V, E)$ whose vertices are the data points and whose edges connect each point to its $k$ nearest neighbors (or to all points within a radius $\epsilon$), weighted by the Euclidean distance. The geodesic distance between two arbitrary points is then approximated by the length of the shortest path in $G$, computed, e.g., with Dijkstra's or Floyd's algorithm, and collected in a matrix $\mathbf{D}_G$.

From the distance matrix $\mathbf{D}_G$, classical MDS recovers the low-dimensional coordinates by eigendecomposing the doubly centered matrix

$$\tau(\mathbf{D}_G) = -\frac{1}{2}\,\mathbf{H}\mathbf{S}\mathbf{H}$$

where $\mathbf{S}$ contains the squared geodesic distances, $S_{ij} = (D_G)_{ij}^2$, and $\mathbf{H} = \mathbf{I} - \frac{1}{N}\mathbf{e}\mathbf{e}^{\top}$ is the centering matrix with $\mathbf{e}$ the all-ones vector. The embedding is formed from the top $d$ eigenvectors of $\tau(\mathbf{D}_G)$, scaled by the square roots of the corresponding eigenvalues.
Figure 8 shows the low-dimensional head pose features obtained by Isomap. An interesting "bowl"-shaped embedding surface is obtained for the head pose images.
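A corresponding sketch with scikit-learn's Isomap; besides the embedding, the fitted model also exposes the graph-based geodesic distance matrix used internally.

```python
import numpy as np
from sklearn.manifold import Isomap

X = np.random.rand(200, 4096)   # placeholder for vectorized face images

iso = Isomap(n_neighbors=12, n_components=3)
Y_iso = iso.fit_transform(X)    # 3-D embedding as visualized in Figure 8
D_geo = iso.dist_matrix_        # pairwise shortest-path (geodesic) distances
```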

Figure 8.
Visualization of the low-dimensional features obtained by Isomap. (a) The variation in yaw at a pitch of −30° is found along the edge of the shape. (b) Another variation in yaw at a pitch of 0° is found along the geodesic path in the middle of the shape. Interestingly, the frontal face is located approximately at the center.
3.4. Laplacian eigenmaps (LE)
Compared to Isomap, the idea of a graph representation of the data is also adopted by LE. However, the difference is that the latter constructs a weighted graph (rather than a distance graph) for the data, which is then represented by its graph Laplacian [12].

The first step of LE is to construct an adjacency graph whose vertices are the data points and whose edges connect neighboring points. A pair of points $\mathbf{x}_i$ and $\mathbf{x}_j$ is connected if $\mathbf{x}_j$ is among the $k$ nearest neighbors of $\mathbf{x}_i$ (or if $\|\mathbf{x}_i - \mathbf{x}_j\| < \epsilon$). The second step assigns a weight to each edge, typically by the heat kernel

$$W_{ij} = \exp\!\big(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / t\big)$$

or simply $W_{ij} = 1$ for connected pairs. The third step is to minimize an objective function that penalizes mapping neighboring points far apart:

$$\frac{1}{2}\sum_{i,j} \|\mathbf{y}_i - \mathbf{y}_j\|^2\, W_{ij} = \mathrm{tr}\big(\mathbf{Y}\mathbf{L}\mathbf{Y}^{\top}\big)$$

where the diagonal matrix $\mathbf{D}$ has entries $D_{ii} = \sum_j W_{ij}$, and $\mathbf{L} = \mathbf{D} - \mathbf{W}$ is the graph Laplacian. Under the constraint $\mathbf{Y}\mathbf{D}\mathbf{Y}^{\top} = \mathbf{I}$, the solution is given by the generalized eigenvalue problem

$$\mathbf{L}\mathbf{f} = \lambda\,\mathbf{D}\mathbf{f}$$

where the eigenvectors corresponding to the $d$ smallest nonzero eigenvalues form the low-dimensional embedding.
As shown in Figure 9, the parabola-shaped embedding surface generated by LE is similar to the result obtained by LLE, but LE produces a smoother and more symmetric surface. The variation in yaw from left to right appears symmetrically, and the frontal face is located approximately at the bottom vertex.
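scikit-learn implements LE under the name SpectralEmbedding; a minimal sketch under the same placeholder data:

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

X = np.random.rand(200, 4096)   # placeholder for vectorized face images

le = SpectralEmbedding(n_components=3, affinity='nearest_neighbors', n_neighbors=12)
Y_le = le.fit_transform(X)      # 3-D embedding as visualized in Figure 9
```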

Figure 9.
Visualization of the low-dimensional features obtained by LE.
3.5. Locality preserving projections (LPP)
The previously introduced algorithms do not clarify how unseen data are projected to the low-dimensional space. To solve this problem, LPP reformulates LE by representing the dimensionality reduction as a linear projection from the original to the low-dimensional space. The first two steps of LPP are exactly the same as in LE: constructing the adjacency graph and computing the weights for each connection. The most significant difference is that LPP models the mapping as a linear projection $\mathbf{y}_i = \mathbf{A}^{\top}\mathbf{x}_i$, which turns the LE objective into the generalized eigenvalue problem

$$\mathbf{X}\mathbf{L}\mathbf{X}^{\top}\mathbf{a} = \lambda\,\mathbf{X}\mathbf{D}\mathbf{X}^{\top}\mathbf{a}$$

The bottom $d$ eigenvectors, i.e., those associated with the $d$ smallest eigenvalues, form the columns of the projection matrix $\mathbf{A}$, so that an unseen sample $\mathbf{x}$ can be directly embedded as $\mathbf{y} = \mathbf{A}^{\top}\mathbf{x}$.
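Since LPP has no off-the-shelf implementation in scikit-learn, the following NumPy/SciPy sketch solves the generalized eigenproblem above directly. The heat-kernel bandwidth, neighborhood size, and the (usual) PCA-reduced input dimensionality are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.linalg import eigh

def lpp(X, n_components=3, n_neighbors=12, t=1.0):
    """X: (n_samples, n_features), ideally PCA-reduced beforehand."""
    G = kneighbors_graph(X, n_neighbors, mode='distance').toarray()
    G = np.maximum(G, G.T)                        # symmetrize the k-NN graph
    W = np.where(G > 0, np.exp(-G**2 / t), 0.0)   # heat-kernel edge weights
    D = np.diag(W.sum(axis=1))
    L = D - W
    A = X.T @ L @ X
    B = X.T @ D @ X + 1e-6 * np.eye(X.shape[1])   # regularize for stability
    vals, vecs = eigh(A, B)                       # generalized eigenproblem
    return vecs[:, :n_components]                 # bottom d eigenvectors

X = np.random.rand(200, 100)   # placeholder for PCA-reduced face features
A = lpp(X)
y_unseen = A.T @ X[0]          # an unseen sample is embedded by a matrix product
```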
More advanced nonlinear manifold learning algorithms have also been developed [13, 19], but the core of this section is the main idea of deriving the low-dimensional representation of head poses. Details of the advanced versions of the manifold learning algorithms can be explored in the original references.
4. Head pose estimation via manifold learning
The manifold learning methods can successfully model the head pose variations in both yaw and pitch, as discussed in the previous sections. However, several difficulties remain. Noise, for example, from identity and illumination variations, affects the performance of those methods in head pose estimation. Another point is that they do not specify how the low-dimensional representation of an unseen head pose image is obtained (except LPP) or how the pose is estimated. In this section, more sophisticated methods that solve these problems based on the original or extended manifold learning algorithms are introduced.
4.1. PCA-based head pose estimation
In Ref. [20], PCA has been shown to be robust to identity variation. Another important conclusion is that an angular separation of 10° is found to be the lower bound for poses to be discriminative. For a data set constructed following this finding, PCA produces promising results for head pose estimation.
A kernel machine-based method is proposed using kernel PCA (KPCA) and a kernel support vector classifier (KSVC) [21]. KPCA is an extension of classical PCA. Let $\phi(\cdot)$ denote a nonlinear mapping from the input space to a high-dimensional feature space; KPCA performs linear PCA in that feature space, where, thanks to the kernel trick, only the inner products $k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^{\top}\phi(\mathbf{x}_j)$ are required and $\phi$ never needs to be evaluated explicitly: the nonlinear principal components are obtained from the eigendecomposition of the kernel (Gram) matrix. The face images are projected onto these components, and KSVC then classifies the view from the reduced features.
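A sketch of such a KPCA-plus-kernel-classifier pipeline, with scikit-learn's SVC standing in for the KSVC; the data, labels, and kernel parameters are placeholders.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.svm import SVC

X = np.random.rand(200, 4096)                 # placeholder for vectorized faces
pose_labels = np.random.randint(0, 7, 200)    # hypothetical discrete view labels

kpca = KernelPCA(n_components=50, kernel='rbf', gamma=1e-4)
Z = kpca.fit_transform(X)                     # nonlinear principal components
clf = SVC(kernel='rbf').fit(Z, pose_labels)   # kernel classifier on KPCA features

view = clf.predict(kpca.transform(X[:1]))     # classify the view of one image
```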
4.2. View representation by Isomap
The derivation of how Isomap reduces the dimensionality of the original data has been introduced in the previous section. Now the problem is how to connect the head pose to the low-dimensional features. In [22], a pose parameter map is learned that relates the embedding coordinates of the training images to their known pose angles, so that positions on the learned manifold can be read as poses. Given a testing image, its low-dimensional representation is first obtained by mapping it into the trained embedding, e.g., through its nearest training neighbors. Finally, the estimated pose of the testing image is computed by evaluating the pose parameter map at the obtained low-dimensional coordinates.
The insight of this method lies in the conversion from testing data to the subspace learned by nonlinear manifold learning. The algorithms of LLE and LE can also be generalized with the same idea.
4.3. The biased manifold embedding (BME)
Head pose estimation is affected by identity variation. The ideal case is to eliminate such negative effects, which means that face images with close pose angles should stay nearer and those with quite different poses should stay farther apart in the low-dimensional manifold, regardless of whether the poses come from the same identity. Based on this statement, BME is proposed to modify the distance matrix according to the pose angles, and it can be combined with almost all the classical algorithms [23].
The modified distance between a pair of data points is obtained by biasing their original distance with a factor that grows with the difference between their pose angles, where images with similar poses are consequently pulled closer together and images with dissimilar poses are pushed farther apart; the exact form of the biasing function is given in [23].
In fact, BME can be seen as a naïve version of supervised manifold learning: the head pose information is used as supervision to enhance the construction of the graph. For the head pose estimation stage, a generalized regression neural network (GRNN) [24] is applied to learn the nonlinear mapping for unseen data points, and linear multivariate regression is applied to estimate the head pose angle. This idea can be easily combined with the classical algorithms, e.g., Isomap, LLE, and LE, among which the biased LE achieves the lowest error rate on the FacePix data set [25].
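The biasing idea can be sketched as follows. The concrete biasing function below is illustrative, not the exact form used in [23]; the resulting matrix can replace the distance graph inside Isomap, LLE, or LE.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def biased_distances(X, poses, alpha=1.0):
    """Shrink distances between images whose pose labels are close.

    X: (n, d) feature matrix; poses: (n,) pose angles in degrees.
    """
    D = squareform(pdist(X))                      # Euclidean distance matrix
    P = np.abs(poses[:, None] - poses[None, :])   # pairwise pose differences
    P = P / (P.max() + 1e-12)                     # normalize to [0, 1]
    return D * (alpha * P + 1e-3)                 # similar pose -> smaller distance
```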
4.4. Head pose estimation as frontal view search
The two remarkable head pose angles, i.e., yaw and pitch, cause the problem of self-occlusion, and compared with pitch, yaw makes the problem more serious. An extended manifold learning (EML) method is proposed that restricts head pose estimation to the variation in yaw [15]. This work resorts to frontal view search instead of directly estimating the head pose, which is more efficient and robust. The idea is based on the observation that the frontal face is located nearly at the vertex of the symmetric shape of the embedding. However, if the pose distribution of the data is asymmetric, the location of the frontal face in the manifold will shift away from the vertex. Therefore, the first step of the EML method is data augmentation: all images are horizontally flipped, and both the original and flipped images are used for manifold learning. In order to make the method more robust to environmental variations, for example, illumination, the localized edge orientation histogram (LEOH) is presented to replace the original color values with more representative features; the idea is inspired by the classical HoG feature [14]. The first step of LEOH is to apply a Canny edge detector to the original images. Then, the whole image is divided into a grid of cells: within each cell, a histogram of edge orientations is accumulated over the detected edge pixels, and the normalized cell histograms are concatenated to form the final descriptor.
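A rough sketch of an LEOH-style descriptor using scikit-image; the grid size, bin count, and gradient operators are illustrative choices, and the exact design is described in [15].

```python
import numpy as np
from skimage.feature import canny
from skimage.filters import sobel_h, sobel_v

def leoh(image, cells=(4, 4), n_bins=8):
    """image: 2-D float array; returns concatenated per-cell edge histograms."""
    edges = canny(image)                         # binary edge map
    gy, gx = sobel_h(image), sobel_v(image)
    theta = np.mod(np.arctan2(gy, gx), np.pi)    # unsigned orientation in [0, pi)
    h, w = image.shape
    ch, cw = h // cells[0], w // cells[1]
    feats = []
    for r in range(cells[0]):
        for c in range(cells[1]):
            block = np.s_[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]
            mask = edges[block]                  # edge pixels within this cell
            hist, _ = np.histogram(theta[block][mask], bins=n_bins, range=(0, np.pi))
            feats.append(hist / (mask.sum() + 1e-12))  # per-cell normalization
    return np.concatenate(feats)

descriptor = leoh(np.random.rand(64, 64))        # placeholder 64x64 face image
```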
4.5. Head pose estimation by supervised manifold learning
A taxonomy of methods that structures the general framework of manifold learning into several stages is proposed, in which the head pose angles can be incorporated in one or several of the stages to enable supervised manifold learning [26]. A straightforward solution is the adaptation of the distance and weight matrices according to the angle difference between pairwise face images. The head pose estimation problem is then formulated as a regression problem, whereas it was usually solved as a classification problem; as a result, continuous head poses can be generalized by the model.
The general framework of manifold learning can be represented as follows: Stage 1, neighborhood searching; Stage 2, graph weighting; Stage 3, low-dimensional manifold computation; and Stage 4, projection from unseen data to the manifold and pose estimation.
In Stage 1, the distance matrix of the training images is adapted so that the distance between a pair of images reflects not only their appearance difference but also the difference of their pose angles, where the pose difference acts as a bias that draws images with similar poses into each other's neighborhoods before the nearest neighbors are searched.
In Stage 2, the weight matrix of the neighborhood graph is computed accordingly, where the weight of each edge decays with both the appearance distance and the pose difference of the connected images, thereby injecting the supervision into the graph.
In Stage 3, let us take LLE as an instance. The original objective function of LLE shown in Eq. (6) can be adapted as follows:

$$\Phi(\mathbf{Y}) = \sum_{i=1}^{N} \Big\| \mathbf{y}_i - \sum_{j} W_{ij}\,\mathbf{y}_j \Big\|^2 + \lambda \sum_{i,j} S_{ij}\,\|\mathbf{y}_i - \mathbf{y}_j\|^2 \tag{19}$$

where the supervised weights $S_{ij}$ are large for neighboring images with similar pose angles, and $\lambda$ balances the two terms.

The adaptation of the objective function preserves the local linearity of the original data and, in addition, enhances the similarity of neighborhoods that share similar poses; this is implemented by the second term of Eq. (19), which introduces the supervision information. Following the derivation from Eq. (6) to Eq. (7), Eq. (19) can be simplified to

$$\Phi(\mathbf{Y}) = \mathrm{tr}\big(\mathbf{Y}(\mathbf{M} + \lambda\,\mathbf{L}_S)\mathbf{Y}^{\top}\big)$$

where $\mathbf{M} = (\mathbf{I}-\mathbf{W})^{\top}(\mathbf{I}-\mathbf{W})$ as in Eq. (7), and $\mathbf{L}_S$ is the graph Laplacian of the similarity matrix $\mathbf{S}$, so the embedding is again obtained from the bottom eigenvectors of a sparse symmetric matrix.
In Stage 4, the GRNN algorithm is applied to produce the mapping from unseen data to the low-dimensional embedding. At testing time, support vector regression (SVR) with an RBF kernel and smoothing cubic splines are adopted for the pose estimation.
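GRNN [24], which the supervised methods in this section use to map unseen images into the learned embedding, is essentially Nadaraya-Watson kernel regression and can be sketched in a few lines; the bandwidth value is illustrative.

```python
import numpy as np

def grnn_predict(X_train, Y_train, x, sigma=1.0):
    """X_train: (n, d) inputs; Y_train: (n, m) embeddings; x: (d,) query."""
    d2 = np.sum((X_train - x) ** 2, axis=1)       # squared distances to query
    w = np.exp(-d2 / (2.0 * sigma ** 2))          # Gaussian kernel weights
    return (w[:, None] * Y_train).sum(axis=0) / (w.sum() + 1e-12)
```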
A novel method of supervised manifold learning for head pose estimation [27, 28] is proposed based on the framework of the former method. Similarly, the pose angles are incorporated in all three stages of the general manifold learning structure.
In Stage 1, an improved version of the nearest-neighbor search is adopted, where the neighbors of each image are selected with a distance biased by the pose difference, so that the resulting neighborhood consists of images with similar poses rather than merely similar appearance.
In Stage 2, taking LLE (NPE [29]) as an example, the local Gram (distance) matrix shown in Eq. (4) is modified by weighting the contribution of each neighbor, where the weight decreases as the pose difference between the neighbor and the center point increases.
In Stage 3, a supervised neighborhood-based Fisher discriminant analysis (SNFDA) is proposed. The basic idea is to make the neighboring data points as close as possible and the nonneighboring data points as far apart as possible in the low-dimensional embedding; SNFDA can be seen as a postprocessing procedure in this stage. Based on the low-dimensional representation of the data, SNFDA builds a within-neighborhood scatter matrix over neighboring pairs and a between-neighborhood scatter matrix over nonneighboring pairs, where the two matrices play the roles of the within-class and between-class scatter of classical Fisher discriminant analysis.

Details about the derivation of the scatter matrices can be found in the original reference. The transformation matrix is obtained by maximizing the Fisher-type ratio of between- to within-neighborhood scatter, which reduces to a generalized eigenvalue problem of the two scatter matrices.

The top eigenvectors, i.e., those associated with the largest eigenvalues, form the columns of the SNFDA transformation applied to the embedded data.
In Stage 4, at testing time, GRNN is applied to map unseen data points to the low-dimensional embedding, and the relevance vector machine (RVM) [30] is adopted to accomplish the pose estimation. Experimental results on the FacePix [25] and MIT-CBCL [31] databases show large improvements over other state-of-the-art algorithms [23, 26] when replacing Stage 3 alone and Stages 1 + 2 + 3 together, which means that this method is more robust to identity and illumination variations.
5. Summary
In this chapter, head pose estimation, one of the most challenging tasks in computer vision, is introduced, and the main types of methods are demonstrated and compared. In particular, the manifold learning-based methods receive the most attention. In reality, data distributions in high-dimensional representation spaces, e.g., head pose images, are usually nonlinear, and some latent structures lie on nonlinear but smooth manifolds embedded in the original space. Manifold learning algorithms are able to discover and visualize such embeddings, and almost all of them are formalized based on the assumption of local linearity of the nonlinear data. These algorithms greatly benefit head pose estimation, because the face orientations (yaw and pitch) are found to be distributed along specific manifolds. Promising performance is achieved by the classical manifold learning methods and is further improved by supervised manifold learning, which proves that supervision in the form of head pose angles is helpful for head pose estimation. However, hurdles remain. Most methods are tested in different settings, e.g., different databases are used for different methods, and a common framework would help to offer fair comparisons. Features other than the simple color space can also be considered to better represent the face images. Some useful tools are available online to help better understand the work [16, 32, 33].
References
- 1. Murphy-Chutorian, Erik and Trivedi, Mohan Manubhai. Head pose estimation in computer vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2009;31(4):607–626.
- 2. Gourier, Nicolas, Hall, Daniela and Crowley, James L. Estimating face orientation from robust detection of salient facial structures. In: 2004 ICPR Workshop on Visual Observation of Deictic Gestures. Cambridge, UK: IEEE; 2004.
- 3. Taigman, Yaniv, Yang, Ming, Ranzato, Marc'Aurelio and Wolf, Lior. Deepface: Closing the gap to human-level performance in face verification. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE; 2014. pp. 1701–1708.
- 4. Beymer, David James. Face recognition under varying pose. In: 1994 IEEE Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA: IEEE; 1994. pp. 756–761.
- 5. Niyogi, Sourabh and Freeman, William T. Example-based head tracking. In: 2nd IEEE International Conference on Automatic Face and Gesture Recognition. Killington, VT, USA: IEEE; 1996. pp. 374–378.
- 6. Viola, Paul and Jones, Michael. Rapid object detection using a boosted cascade of simple features. In: 2001 IEEE Conference on Computer Vision and Pattern Recognition. Kauai, HI, USA: IEEE; 2001. pp. I-511–I-518.
- 7. Felzenszwalb, Pedro F and Huttenlocher, Daniel P. Pictorial structures for object recognition. International Journal of Computer Vision. 2005;61(1):55–79.
- 8. Cootes, Timothy F, Edwards, Gareth J and Taylor, Christopher J. Active appearance models. In: 1998 European Conference on Computer Vision. Freiburg, Germany: Springer; 1998. pp. 484–498.
- 9. Matthews, Iain and Baker, Simon. Active appearance models revisited. International Journal of Computer Vision. 2004;60(2):135–164.
- 10. Tenenbaum, Joshua B, De Silva, Vin and Langford, John C. A global geometric framework for nonlinear dimensionality reduction. Science. 2000;290(5500):2319–2323.
- 11. Roweis, Sam T and Saul, Lawrence K. Nonlinear dimensionality reduction by locally linear embedding. Science. 2000;290(5500):2323–2326.
- 12. Belkin, Mikhail and Niyogi, Partha. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation. 2003;15(6):1373–1396.
- 13. Maaten, Laurens van der and Hinton, Geoffrey. Visualizing data using t-SNE. Journal of Machine Learning Research. 2008;9:2579–2605.
- 14. Dalal, Navneet and Triggs, Bill. Histograms of oriented gradients for human detection. In: 2005 IEEE Conference on Computer Vision and Pattern Recognition. San Diego, CA, USA: IEEE; 2005. pp. 886–893.
- 15. Wang, Chao and Song, Xubo. Robust frontal view search using extended manifold learning. Journal of Visual Communication and Image Representation. 2013;24(7):1147–1154.
- 16. Available from: http://isomap.stanford.edu/
- 17. Balasubramanian, Mukund and Schwartz, Eric L. The Isomap algorithm and topological stability. Science. 2002;295(5552):7.
- 18. Cox, Trevor F and Cox, Michael AA. Multidimensional Scaling. USA: CRC Press; 2000.
- 19. Weinberger, Kilian Q and Saul, Lawrence K. Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision. 2006;70(1):77–90.
- 20. Sherrah, Jamie, Gong, Shaogang and Ong, Eng-Jon. Face distributions in similarity space under varying head pose. Image and Vision Computing. 2001;19(12):807–819.
- 21. Li, Stan Z, Fu, Qingdong, Gu, Lie, Scholkopf, Bernhard, Cheng, Yimin and Zhang, Hongjiang. Kernel machine based learning for multi-view face detection and pose estimation. In: 2001 IEEE International Conference on Computer Vision. Vancouver, Canada: IEEE; 2001.
- 22. Raytchev, Bisser, Yoda, Ikushi and Sakaue, Katsuhiko. Head pose estimation by nonlinear manifold learning. In: 17th International Conference on Pattern Recognition. Cambridge, UK: IEEE; 2004. pp. 462–466.
- 23. Balasubramanian, Vineeth Nallure, Ye, Jieping and Panchanathan, Sethuraman. Biased manifold embedding: A framework for person-independent head pose estimation. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition. Minneapolis, MN, USA: IEEE; 2007. pp. 1–7.
- 24. Specht, Donald F. A general regression neural network. IEEE Transactions on Neural Networks. 1991;2(6):568–576.
- 25. Little, Danny, Krishna, Sreekar, Black Jr, John A and Panchanathan, Sethuraman. A methodology for evaluating robustness of face recognition algorithms with respect to variations in pose angle and illumination angle. In: ICASSP (2). Citeseer; 2005. pp. 89–92.
- 26. BenAbdelkader, Chiraz. Robust head pose estimation using supervised manifold learning. In: 2010 European Conference on Computer Vision. Crete, Greece: Springer; 2010. pp. 518–531.
- 27. Wang, Chao and Song, Xubo. Robust head pose estimation via supervised manifold learning. Neural Networks. 2014;53:15–25.
- 28. Wang, Chao and Song, Xubo. Robust head pose estimation using supervised manifold projection. In: 2012 19th IEEE International Conference on Image Processing. Orlando, FL, USA: IEEE; 2012. pp. 161–164.
- 29. He, Xiaofei, Cai, Deng, Yan, Shuicheng and Zhang, Hong-Jiang. Neighborhood preserving embedding. In: 2005 IEEE International Conference on Computer Vision. Beijing, China: IEEE; 2005. pp. 1208–1213.
- 30. Tipping, Michael E. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research. 2001;1:211–244.
- 31. Huang, Jennifer, Heisele, Bernd and Blanz, Volker. Component-based face recognition with 3D morphable models. In: International Conference on Audio- and Video-Based Biometric Person Authentication. Guildford, UK: Springer; 2003. pp. 27–34.
- 32. Available from: https://www.cs.nyu.edu/~roweis/lle/code.html
- 33. Available from: https://lvdmaaten.github.io/drtoolbox/