Towards Unconstrained Face Recognition Using 3D Face Model

Over the last couple of decades, many commercial systems have become available to identify human faces. However, face recognition remains an outstanding challenge in the presence of real-world variations, especially facial poses, non-uniform lighting and facial expressions. Meanwhile, face recognition technology has extended its role from biometrics and security applications to human robot interaction (HRI). Establishing person identity is one of the key tasks while interacting with intelligent machines/robots, since it provides non-intrusive system security and authentication of the human interacting with the system. This capability further helps machines to learn person-dependent traits and interaction behavior and to use this knowledge for task manipulation. In such scenarios the acquired face images contain large variations, which demands an unconstrained face recognition system.


Introduction
By nature, face recognition systems are widely used due to their high universality, collectability and acceptability. However, the face as a biometric has natural challenges in terms of uniqueness, performance and circumvention. Although several commercial biometric systems use face recognition as a key tool, most of them perform in constrained environments. Real-world challenges to face recognition involve varying lighting, poses, facial expressions, aging effects, occlusions (which also include make-up, facial scars or cuts and facial hair), low resolution images and, most recently, spoofing. Due to their non-intrusiveness, faces are the most important biometric to be employed in real-life systems. Figure 1 shows the contribution of face recognition systems to the biometrics market. Over the last decade, faces have consistently been the most used biometric after fingerprints and AFIS (automated fingerprint identification system)/live scans. Face recognition finds its major applications in document classification, security access control, surveillance, web applications and human robot interaction.
Fingerprints, however, can easily be spoofed (research is still in progress to deal with this issue using live scans), are intrusive to acquire, suffer from aging effects in the form of damage, and raise template security and update issues.
Fig. 2. Detailed process for model based face recognition system. The texture map is generated by using a homogeneous 2D point p in texture coordinates and a projection q of a general 3D point in homogeneous coordinates, related by the transformation H × p = q, where H is the homography matrix Riaz et al. (2010).
In this chapter we focus on an unconstrained face recognition system and also study other soft biometric traits of human faces which can be useful for man machine interaction applications. The proposed system can be applied equally well in major biometric applications. The goal of this chapter is twofold. Firstly, it serves as a self-contained and compact tutorial for face recognition systems, from design to development. It describes a brief history of face recognition systems, the major approaches used and the challenges. Secondly, it describes in detail an approach towards the development of a face recognition system that is robust against varying poses, facial expressions and lighting. This approach uses a 3D wireframe model, which is useful for such applications. In order to proceed, a comprehensive overview of the face models currently used in computer vision applications is provided. This covers deformable models, point distribution models, photorealistic models and finally wireframe models. In order to automatically generate a realistic human face model, a 3D wireframe model called Candide-III Ahlberg (2001) is used. This model has advantages over other models in terms of speed, texture realization and well-defined facial animation units. On the other hand, realizing this model is more challenging since it is less detailed than the 3D morphable model, which consists of a dense point distribution of facial points acquired from laser scans. Candide-III consists of a coarse mesh containing only 184 triangles and defines action units following the facial action coding system (FACS) Ekman & Friesen (1978), which can easily be used for facial animations. Figure 2 shows an overview of the different modules working in our designed system.

System overview
The system mainly comprises different modules contributing towards the final feature extraction. It starts with a face detection module followed by face model fitting. Candide-III is fitted to face images using robust objective functions Wimmer et al. (2006). This fitting algorithm has been compared with other algorithms and is adopted here due to its real-time efficiency and robustness. The model can be fitted to a face image with arbitrary pose; however, the prerequisite for this module is face detection. If the face detection module fails to find any face inside the image, the system searches again for a face in the next image. After model fitting, the structural information extracted from the image contains shape and pose information. The fitted model is used for extracting the texture from the given image. We use graphics tools to render the texture onto the model surface and warp the texture to a standard template. This standard template is a block based texture map shown in Figure 2. Texture variations are mainly caused by illumination and shadows, which are dealt with by textural features and image filtering. We use principal component analysis (PCA) to parameterize the extracted textures. Cognitive science explains that temporal information is an essential constituent of human perception and learning Sinha et al. (2006). In this regard, we choose local descriptors on the face image and track their motion to find temporal features. This motion provides local variations and deformations in the face image caused by facial expressions. Finally we construct a feature vector by concatenating these three different features for any given image:
$$u = (b_s, b_g, b_t) \tag{1}$$
where $b_s$, $b_g$ and $b_t$ are the structural, textural and temporal features. For texture extraction we compare different methods including discrete cosine transform (DCT), principal component analysis (PCA) and local binary patterns (LBP). Of these three types of textural features, PCA is observed to outperform the other two and is hence used for further experimentation. The results are given in detail in section 10.

Related work
Face recognition has always been a focus of attention for researchers and has been addressed in different ways: using holistic and heuristic features, dense and sparse feature representations, parametric and non-parametric models, and content- and context-aware methods Zhao et al. (2003); Zhao & Chellappa (2005). Due to its universality, collectability and non-intrusiveness, face recognition is currently used as a baseline algorithm in several biometric systems, either as a standalone technology or together with other biometrics in so-called multibiometric systems. In 3D space, faces are complex objects consisting of a regularized structure and pigmentation, and are mostly observed performing various actions and conveying a set of meaningful information O'Toole (2009). Besides this meaningful information, faces pose several challenges which are under consideration by the research community. The human facial recognition system is unconstrained and provides stability against varying poses, facial expressions, changing illumination, partial occlusions (including facial hair, scars and make-up) and temporal effects (aging). Traditional recognition systems recognize humans using various techniques such as feature based recognition, face geometry based recognition, classifier design and model based methods. In Zhao et al. (2003) the authors give a comprehensive survey of face recognition and of some commercially available face recognition software. A subspace projection method like PCA was first used by Sirovich & Kirby (1987) and was later adopted by M. Turk and A. Pentland, introducing the famous idea of eigenfaces Turk & Pentland (1991). This chapter focuses on modeling the human face using a three dimensional model for shape model fitting, texture and temporal information extraction, and then low dimensional parameters for recognition purposes.
A model using shape and texture parameters is called an Active Appearance Model (AAM), introduced by Cootes et al. (1998). For face recognition using AAMs, Edwards et al. use a classifier based on a weighted distance, the Mahalanobis distance. In Edwards et al. (1996) the authors used separate information for shape and gray level texture. They isolate the sources of variation by maximizing the interclass variations using discriminant analysis, similar to Linear Discriminant Analysis (LDA), the technique which was used for the Fisherfaces representation Belhumeur et al. (1997). The Fisherface approach is similar to the eigenface approach but outperforms it in the presence of illumination changes. In Wimmer et al. (2009) the authors utilized shape and temporal features collectively to form a feature vector for facial expression recognition. These models utilize shape information based on a point distribution of various landmark points marked on the face image. Blanz & Vetter (2003) use a state-of-the-art morphable model built from laser scanner data for face recognition by synthesizing 3D faces. This model is not as efficient as the AAM but is more realistic. In our approach a wireframe model known as Candide-III Ahlberg (2001) is used.

Spatio-temporal Multifeatures (STMF)
We address the problem in which a 3D model extracts a common feature set automatically from face images and performs unconstrained face recognition. This system can be used not only for biometric applications but also for soft biometric traits, document classification, access control and medical care. In such applications an automatic and efficient feature extraction technique is necessary to interpret the maximum available information from face image sequences. Further, the extracted feature set should be robust enough to be directly applied in real world applications. In such scenarios faces are seen from different views under varying facial deformations and poses. This issue is treated by using 3D modeling of the faces. The invariance to facial poses and expressions is discussed in detail in section 10. For face recognition, textural information plays a key role as a feature Zhao & Chellappa (2005); Li & Jain (2005), whereas facial expressions are mostly person independent and require motion and structural components Fasel & Luettin (2003). Similarly, facial structure and texture vary significantly between gender classes. On the basis of this knowledge and a literature survey, we categorize the three major features as primary and secondary contributors to three different facial classifications. Table 1 summarizes the significance of these constituents of the feature vector with their primary and secondary contributions towards the feature set formation. Since our feature set consists of all three kinds of information, it can successfully represent facial identity, expressions and gender. The results are discussed in detail in section 10. Model parameters are obtained in an optimal way to maximize information within the face region in the presence of different facial poses and expressions. We use a 3D wireframe model; however, any other comparable model can be used here.

Model fitting and structural features
Our proposed algorithm is initialized by applying a face locator to the given image. We use the Viola and Jones face detector Viola & Jones (2004). If a face is found, the system proceeds to face model fitting. For model fitting, local objective functions are calculated using Haar-like features. An objective function is a cost function, given by equation 2. A fitting algorithm searches for the optimal parameters which minimize the value of the objective function. For a given image $I$, if $E(I, c_i(p))$ represents the magnitude of the edge at point $c_i(p)$, where $p$ represents the set of parameters describing the model, then the objective function is given by:
$$f(I, p) = \frac{1}{n} \sum_{i=1}^{n} f_i(I, c_i(p)) = \frac{1}{n} \sum_{i=1}^{n} \left(1 - E(I, c_i(p))\right) \tag{2}$$
where $i = 1, \ldots, n$ and $n = 113$ is the number of vertices $c_i$ describing the face model. This approach is less prone to errors because of the better quality of the annotated images provided to the system for training. Further, it is less laborious because the objective function design is replaced with automated learning. For details we refer to Wimmer et al. (2008). The geometry of the model is controlled by a set of action units and shape units. Any shape $s$ can be written as the sum of the mean shape $\bar{s}$ and a set of action units and shape units:
$$s(\alpha, \sigma) = \bar{s} + \phi_a \alpha + \phi_s \sigma \tag{3}$$
where $\phi_a$ is the matrix of action unit vectors and $\phi_s$ is the matrix of shape vectors, while $\alpha$ denotes the action unit parameters and $\sigma$ denotes the shape parameters Li & Jain (2005). Model deformation is governed by facial action coding system (FACS) principles Ekman & Friesen (1978). The scaling, rotation and translation of the model are described by
$$s(\alpha, \sigma, \pi) = m R \, s(\alpha, \sigma) + t \tag{4}$$
where $R$ and $t$ are the rotation matrix and translation vector respectively, $m$ is the scaling factor and $\pi$ contains six pose parameters plus a scaling factor. By changing the model parameters, it is possible to generate global rotations and translations. We extract 85 parameters to control the structural deformation.
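The deformation and pose steps can be illustrated with a small sketch. It assumes the linear form described in the text (mean shape plus weighted action and shape units, followed by scaling, rotation and translation). The two-vertex mesh, the "jaw drop" unit and the restriction to a rotation about the y axis are hypothetical simplifications and are not taken from Candide-III.

```python
import math


def deform_shape(mean_shape, units, weights):
    """Return mean_shape plus the weighted sum of per-vertex displacement units."""
    shape = [list(vertex) for vertex in mean_shape]
    for weight, unit in zip(weights, units):
        for vertex, delta in zip(shape, unit):
            for axis in range(3):
                vertex[axis] += weight * delta[axis]
    return [tuple(vertex) for vertex in shape]


def apply_pose(shape, scale, yaw, translation):
    """Apply scale * R * s + t, where R rotates about the y axis by `yaw` radians."""
    c, s = math.cos(yaw), math.sin(yaw)
    posed = []
    for x, y, z in shape:
        rx, ry, rz = c * x + s * z, y, -s * x + c * z
        posed.append((scale * rx + translation[0],
                      scale * ry + translation[1],
                      scale * rz + translation[2]))
    return posed


# Toy data (assumption): two vertices and one hypothetical action unit.
mean_shape = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
jaw_drop = [(0.0, 0.0, 0.0), (0.0, -0.5, 0.0)]
deformed = deform_shape(mean_shape, [jaw_drop], [2.0])
posed = apply_pose(deformed, scale=2.0, yaw=math.pi / 2, translation=(0.0, 0.0, 10.0))
```

Action units and shape units are treated identically here (both are displacement fields), so they can simply be concatenated into `units` with their parameters in `weights`.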

Textural representation and parameterization
For texture extraction from the face images after model fitting, two different approaches are studied in this chapter. In section 6.1 texture extraction is performed using the conventional AAM method, whereas in section 6.2 a texture mapping approach is studied. The texture map is formed by storing each triangular patch in a block in memory. This block represents the surface texture extracted from the 3D surface of the face. Once the texture $g$ is extracted, it is parametrized using the mean texture $g_m$ and the matrix of eigenvectors $P_g$ to obtain the parameter vector $b_g$ Li & Jain (2005):
$$g = g_m + P_g b_g \tag{5}$$
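As a minimal sketch of this parameterization, the functions below project a texture vector onto a given eigenvector basis and reconstruct it again. They assume that the eigenvectors are orthonormal and already computed; the names `project` and `reconstruct` and the three-texel toy data are illustrative only.

```python
def project(texture, mean, eigenvectors):
    """Compute b = P^T (g - g_m) for orthonormal eigenvectors given as rows."""
    centered = [g - m for g, m in zip(texture, mean)]
    return [sum(e * c for e, c in zip(vector, centered)) for vector in eigenvectors]


def reconstruct(params, mean, eigenvectors):
    """Approximate g as g_m + P b using the retained eigenvectors."""
    texture = list(mean)
    for b, vector in zip(params, eigenvectors):
        for index, e in enumerate(vector):
            texture[index] += b * e
    return texture


# Toy data (assumption): a three-texel texture and two retained eigenvectors.
mean_texture = [1.0, 1.0, 1.0]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_g = project([3.0, 0.0, 1.0], mean_texture, basis)
approximation = reconstruct(b_g, mean_texture, basis)
```

The same projection applies unchanged to the temporal features, with velocity vectors in place of textures.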

Image warping
Once the structural information of the image is obtained from model fitting, we extract texture from the face region by mapping it to a reference shape. The reference shape is obtained by computing the mean shape over the dataset. Image texture is extracted using planar subdivisions of the reference and the example shapes. We use Delaunay triangulation of the convex hull of the facial landmarks. Texture warping between the triangles is performed using an affine transformation. This texture warping is used for the Cohn-Kanade facial expressions database (CKFED) Kanade et al. (2000) and the MMI database Maat et al. (2009). By warping the texture to a reference shape, facial expressions are neutralized, which is useful for face recognition.
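A common way to realize a piecewise affine warp is through barycentric coordinates: a pixel in a reference triangle is expressed as a weighted combination of that triangle's vertices, and the same weights applied to the example triangle give the location to sample. The sketch below illustrates that general technique under this assumption; it is not necessarily the authors' implementation, and the triangle coordinates are made up.

```python
def barycentric(point, triangle):
    """Return the barycentric coordinates of a 2D point with respect to a triangle."""
    (x1, y1), (x2, y2), (x3, y3) = triangle
    x, y = point
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    l1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    l2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    return l1, l2, 1.0 - l1 - l2


def map_point(point, source_triangle, target_triangle):
    """Affinely map a point from source_triangle into target_triangle."""
    weights = barycentric(point, source_triangle)
    x = sum(w * vertex[0] for w, vertex in zip(weights, target_triangle))
    y = sum(w * vertex[1] for w, vertex in zip(weights, target_triangle))
    return x, y


# Toy data (assumption): a reference triangle and the matching example triangle.
reference = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
example = [(10.0, 10.0), (12.0, 10.0), (10.0, 12.0)]
sample_location = map_point((2.0, 2.0), reference, example)
```

In a full warp, every pixel of each reference triangle would be mapped this way and the example image sampled (for instance bilinearly) at the returned location.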

Optimal texture representation
Each triangular patch represents meaningful texture which is stored in a square block of the texture map. A single unit of the texture map represents a triangular patch. We experiment with three different sizes of the texture blocks and choose an optimal size for our experimentation. These three block sizes are $2^3 \times 2^3$, $2^4 \times 2^4$ and $2^5 \times 2^5$. We calculate an energy function from these texture maps of individual persons and observe the energy spectrum of the images in our database for each triangular patch. If $N$ is the total number of images and $p_i$ is a texel value (which is equal to a single pixel value) in the texture map, then we define the energy function as:
$$E_j = \frac{1}{N} \sum_{i=1}^{N} (p_i - \bar{p}_j)^2 \tag{6}$$
where $\bar{p}_j$ is the mean value of the pixels in the $j$th block, $j = 1, \ldots, M$, and $M = 184$ is the number of blocks in a texture map. In addition to Equation 6, we find the variance energy by using PCA for each block and observe the energy spectrum. The variation within a given block behaves similarly for the two kinds of energy functions, except for a slight variation in the energy values. Figure 3 shows the energy values for two different subjects randomly chosen from our experiments. It can be seen from Figure 3 that the behavior of the textural components is similar between different texture sizes.
Fig. 3. Energy spectrum of two randomly selected subjects from PIE database. Energy values for each patch are comparatively calculated and observed for three different texture sizes.
The size of the raw feature vector extracted directly from the texture map grows quadratically with the block side length, roughly fourfold for each doubling. If $d \times d$ is the size of the block, then the length of the raw feature vector is $M \cdot d(d+1)/2$. This vector length depends upon how the texture is stored in the texture map, as can be seen in Figure 4. We store each triangular patch from the face surface in the upper triangle of the texture block. The size of the raw feature vector extracted for $d = 2^3$, $d = 2^4$ and $d = 2^5$ is 6624, 25024 and 97152 respectively. Any larger block size further inflates the raw vector without any improvement in the texture energy, so we do not consider higher values. The overall recognition rate produced by different texture sizes for eight randomly selected subjects with 2145 images from the PIE database is shown in Figure 5. The results are obtained using decision trees and Bayesian networks for classification. The classification procedure is given in detail in section 10. Trading off performance against the size of the feature vectors, we choose a texture block size of 16 × 16 for our experiments.
Fig. 4. Texture from each triangular patch is stored as the upper triangle of the texture block in the texture map. A raw feature vector is obtained by concatenating the pixel values from each block.
Fig. 5. Comparison over eight random subjects from the database with three different sizes of texture blocks. The recognition rate improves slightly as the texture size is increased, but at the cost of a large increase in the length of the raw feature vector. We compromise on a texture block of size 16 × 16.
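The raw feature vector lengths quoted above (6624, 25024 and 97152) can be checked with a short calculation. The sketch assumes that the upper triangle of each block includes the diagonal, which is the reading that reproduces the stated numbers. The helper names are illustrative.

```python
def raw_feature_length(d, blocks=184):
    """Texels in the upper triangle (diagonal included) of a d x d block, times the block count."""
    return blocks * d * (d + 1) // 2


def upper_triangle(block):
    """Flatten the upper triangle (row <= column) of a square block, row by row."""
    size = len(block)
    return [block[row][col] for row in range(size) for col in range(row, size)]


lengths = [raw_feature_length(2 ** k) for k in (3, 4, 5)]
texels_per_block = len(upper_triangle([[0] * 16 for _ in range(16)]))
```

For the chosen 16 × 16 block this gives 136 texels per triangle, so doubling the block side roughly quadruples the vector length.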

Affine vs. perspective transformation
We consider perspective transformation because affine warping of the rendered triangle is not invariant to 3D rigid transformations Riaz et al. (2010). In general, texture warping is performed using an affine transformation from a given image to a reference shape Cootes et al. (1998); Riaz, Mayer, Wimmer, Beetz & Radig (2009); Riaz, Mayer, Beetz & Radig (2009b). This preserves affinity after the transformation. However, for faces seen from different views, triangular patches on the edges are not well defined, and these triangles are tilted such that they contain much less texture information than frontal triangles. In order to weight all triangles equally, we use a homogeneous transformation.
Fig. 6. Detailed texture extraction approach. Each triangular patch from the face surface is stored as a block in a texture map. This texture map is further used for feature extraction.
The homogeneous transformation M is built from the camera matrix K together with the rotation R and the translation t of the triangle, where R and t are the unknowns to be calculated. An undistorted texture map of the face region is calculated in two steps. Firstly, we find the homography H by obtaining the rotation and translation of the triangle, supposing that the initial triangle lies on the texture plane, with its first vertex at the origin and its first edge along the x-axis. Secondly, an affine transformation A is calculated so that the mapped triangle on the texture plane fits the upper triangle of the rectangular texture block; in this way any arbitrary triangle can be fitted to this upper triangle. The lower triangular area is not considered in this regard. This process is shown in Figure 6. For details, refer to Riaz et al. (2010).
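The practical difference between an affine and a perspective (homography) mapping can be seen on a toy example. The sketch below applies a 3 × 3 matrix to a 2D point in homogeneous coordinates and divides by the last component. Both matrices are invented for illustration and are not derived from a camera model.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 matrix H and divide by the homogeneous coordinate."""
    x, y = point
    qx = H[0][0] * x + H[0][1] * y + H[0][2]
    qy = H[1][0] * x + H[1][1] * y + H[1][2]
    qw = H[2][0] * x + H[2][1] * y + H[2][2]
    return qx / qw, qy / qw


def is_affine(H, tol=1e-12):
    """A homography is affine when its last row is (0, 0, w)."""
    return abs(H[2][0]) < tol and abs(H[2][1]) < tol


# Hypothetical matrices: a scale plus shift, and a mild perspective foreshortening.
affine = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]]
perspective = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.0, 1.0]]

start, middle, end = (0.0, 0.0), (1.0, 0.0), (2.0, 0.0)
affine_middle = apply_homography(affine, middle)
perspective_middle = apply_homography(perspective, middle)
perspective_end = apply_homography(perspective, end)
```

Under the affine matrix the midpoint of the segment stays the midpoint of the mapped segment, whereas under the perspective matrix it does not. This is the foreshortening that an affine warp of a tilted triangle cannot undo.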

Temporal features
Further, temporal features of the facial changes are also calculated, taking movement over time into consideration. Local motion of feature points is observed using optical flow. We do not specify the location of these feature points manually but distribute them equally over the whole face region. The number of feature points is chosen such that the system is still capable of performing in real time, and therefore involves a tradeoff between accuracy and runtime performance. Since the motion of the feature points is relative, we choose 140 points in total to observe the optical flow. We again use PCA over the motion vectors to reduce the descriptors. If $t$ is the velocity vector, then
$$t = t_m + P_t b_t$$
where the temporal parameters $b_t$ are computed using the matrix of eigenvectors $P_t$ and the mean velocity vector $t_m$.
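A minimal sketch of how such a velocity vector might be assembled is given below. It assumes a 10 × 14 grid as one possible way of spreading 140 points evenly over a face bounding box (the text does not give the actual layout), and it takes the tracked positions as given: the optical flow tracker itself is not implemented here.

```python
def grid_points(left, top, width, height, cols=10, rows=14):
    """Spread cols * rows points evenly over a bounding box (cell centres)."""
    return [
        (left + (col + 0.5) * width / cols, top + (row + 0.5) * height / rows)
        for row in range(rows)
        for col in range(cols)
    ]


def velocity_vector(previous_points, next_points, dt=1.0):
    """Stack the per-point displacements (dx, dy) / dt into one flat vector."""
    flat = []
    for (x0, y0), (x1, y1) in zip(previous_points, next_points):
        flat.extend(((x1 - x0) / dt, (y1 - y0) / dt))
    return flat


# Toy data (assumption): every point moves by (+1, -2) pixels between two frames.
points = grid_points(0.0, 0.0, 100.0, 140.0)
tracked = [(x + 1.0, y - 2.0) for x, y in points]
t = velocity_vector(points, tracked)
```

The resulting vector has two entries per point and would then be reduced with PCA as described in the text.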

Feature fusion
We combine all extracted features into a single feature vector. Single image information is captured by the structural and textural features, whereas image sequence information is captured by the temporal features. The overall feature vector becomes:
$$u = (b_{s,1}, \ldots, b_{s,m}, b_{g,1}, \ldots, b_{g,n}, b_{t,1}, \ldots, b_{t,p}) \tag{9}$$
where $b_s$, $b_g$ and $b_t$ are the shape, textural and temporal parameters respectively, with $m$, $n$ and $p$ being the number of parameters retained from the subspace in each case. Equation 9 is called the multi-feature. We extract 85 structural features, 74 textural features and 12 temporal features to form a combined feature vector for each image. These features are then passed to a decision tree (DT) and a Bayesian network (BN) for the different classifications. The face feature vector consists of the shape, texture and temporal variations, which sufficiently define global and local variations of the face. All the subjects in the database are labeled for classification. Since the features arise from different sources, it is not straightforward to fuse them into one feature set: features with larger values can dominate while those with small values are ignored. We use a simple scaling of the features to [0, 1]. However, any suitable method for feature fusion can be applied here.
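The fusion step can be sketched as concatenation followed by min-max scaling of each feature across the dataset. Per-feature min-max scaling is an assumption about what "simple scaling to [0, 1]" means here, and the tiny feature vectors below are invented (the chapter uses 85 + 74 + 12 = 171 features per image).

```python
def fuse(structural, textural, temporal):
    """Concatenate the three feature groups into one vector."""
    return list(structural) + list(textural) + list(temporal)


def scale_columns(samples):
    """Min-max scale each feature (column) to [0, 1]; constant columns map to 0."""
    columns = list(zip(*samples))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(value - low) / (high - low) if high > low else 0.0
         for value, low, high in zip(row, lows, highs)]
        for row in samples
    ]


# Toy data (assumption): three images with 2 structural, 1 textural, 1 temporal feature.
samples = [
    fuse([1.0, 10.0], [100.0], [0.5]),
    fuse([3.0, 10.0], [300.0], [1.5]),
    fuse([2.0, 10.0], [200.0], [1.0]),
]
scaled = scale_columns(samples)
```

After scaling, the textural feature with values in the hundreds no longer dominates the others. In practice the minima and maxima would be taken from the training data only and reused for test data.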

Comparative texture descriptors
From the above texture representations, features are extracted using three different approaches: a) PCA, b) discrete cosine transform (DCT) and c) local binary patterns (LBP). Each texture map consists of 184 texture blocks, where each texture block corresponds to the texture of a triangular surface. The size of each block is 16 × 16 pixels. This size is chosen by trading off between accuracy and efficiency. DCT coefficients are extracted in a zig-zag pattern from the top-left corner of each block. We extract five coefficients per block and obtain a feature set of length 5 × 184 = 920. The advantage of using DCT over the conventional approach is twofold: 1) it reduces the dimensions to a great extent, and 2) DCT coefficients contain low frequency information which is robust to distortions and noise. For the LBP descriptor, we code only those pixels which lie inside the face area. An LBP histogram of 255 gray levels and a color histogram are calculated and used as texture features. The results of the three feature types on all subjects of the PIE database session from November 2000 to December 2000 Terence et al. (2002) are shown in Table 2. A J48 decision tree from Weka Witten & Frank (2005) is used as the classifier. Details of the classifier specification are given in section 10. It can be seen that PCA outperforms the other two feature types. For further experimentation, we use PCA for feature extraction.
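The DCT feature extraction can be sketched as follows. The code computes only the five lowest-frequency coefficients of each block in the usual zig-zag order, using an orthonormal 2D DCT-II. The normalization and the exact zig-zag convention are assumptions, since the text does not specify them, and the constant-valued blocks are toy data.

```python
import math

# First five positions of the standard zig-zag scan, as (row, column) frequencies.
ZIGZAG_FIRST_FIVE = [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1)]


def dct_coefficient(block, u, v):
    """Orthonormal 2D DCT-II coefficient (u, v) of a square block."""
    n = len(block)
    scale_u = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
    scale_v = math.sqrt(1.0 / n) if v == 0 else math.sqrt(2.0 / n)
    total = 0.0
    for x in range(n):
        for y in range(n):
            total += (block[x][y]
                      * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                      * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
    return scale_u * scale_v * total


def dct_features(blocks):
    """Concatenate the first five zig-zag DCT coefficients of every block."""
    return [dct_coefficient(block, u, v)
            for block in blocks
            for (u, v) in ZIGZAG_FIRST_FIVE]


# Toy data (assumption): 184 constant 16 x 16 blocks with gray value 2.
flat_block = [[2.0] * 16 for _ in range(16)]
features = dct_features([flat_block] * 184)
```

For a constant block only the first (DC) coefficient is non-zero, which reflects the fact that these low-order coefficients summarize the smooth, low-frequency content of a patch.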

Local energy based descriptors
From section 6.2, we have the texture extracted from the face images and stored in a texture map as triangular patches of the same size. Each patch represents a specific area of the face. Instead of using conventional subspace learning, which does not preserve localization information, we find heuristic features representative of each patch. If the training data consists of $M$ images, then we can calculate the variance energy of each patch using:
$$E_j = \frac{1}{N} \sum_{i=1}^{N} (p_i - \bar{p}_j)^2$$
where $i = 1, \ldots, N$ and $N = 136$ is the number of pixels in a triangle of the texture map, and $\bar{p}_j$ is the average value of the $j$th patch, with $j = 1, 2, \ldots, 184$. Finally a feature vector $E$ is formed by finding the energy descriptor for each triangular patch and is written as:
$$E = (E_1, E_2, \ldots, E_{184})$$
There are three major benefits of using a patch based representation.
• The extracted feature vector is compact yet sufficient to perform well in view invariant face recognition.
• It avoids subspace learning, so new faces can easily be added to the database.
• Since such a representation preserves the localization of the facial features, it outperforms the conventional AAM and holistic approaches.
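The patch based energy descriptor can be sketched in a few lines. The code assumes that the energy of a patch is the mean squared deviation of its texels from the patch mean, which is the reading of the variance energy used in this section. The four-texel patches are toy data (real patches have 136 texels and there are 184 of them).

```python
def patch_energy(texels):
    """Mean squared deviation of the texels of one patch from the patch mean."""
    mean = sum(texels) / len(texels)
    return sum((value - mean) ** 2 for value in texels) / len(texels)


def energy_descriptor(patches):
    """One energy value per triangular patch, in patch order."""
    return [patch_energy(texels) for texels in patches]


# Toy data (assumption): a flat patch and a textured patch.
descriptor = energy_descriptor([[5.0, 5.0, 5.0, 5.0], [1.0, 2.0, 3.0, 4.0]])
```

Because each entry belongs to a fixed triangle of the mesh, the descriptor keeps the spatial layout of the face, and a new face can be described without retraining any subspace.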

Experimental evaluation
In order to validate the extracted features, we have used different subjects from three databases: the CMU-PIE database Terence et al. (2002), the MMI database Maat et al. (2009) and the Cohn-Kanade facial expressions database (CKFED) Kanade et al. (2000). These databases consist of face images with different variations, such as varying poses, facial expressions, gender information and talking faces. MMI and CKFED contain image sequences with temporal information. CKFED consists of 97 subjects ranging in age from 18 to 30 years. Sixty-five percent are female, 15 percent are African-American and three percent are Asian or Latino. The MMI facial expression database holds over 2000 videos and over 500 images of about 50 subjects displaying various facial expressions on command. For the CKFED and MMI databases we compute spatio-temporal features from the image sequences. The CMU-PIE database was collected between October and December 2000 and consists of 41,368 images of 68 people. Each person is captured under 13 different poses, 43 different illumination conditions and with 4 different expressions. We take spatial features and test them against frontal and half-profile poses. The texture extracted in this case is stored as a texture map after removing perspective distortion.
During all experiments, we use two-thirds of the feature set for building the classifier model, with 10-fold cross validation to avoid overfitting. The remaining feature set is used for testing. We use all subjects from MMI and CKFED and part of the PIE database to perform face recognition, person dependent facial expression classification and gender classification. Since the STMF set arises from different sources, a decision tree (DT) is applied for classification. However, other classifiers can also be applied depending upon the application. We choose the J48 decision tree algorithm for experimentation, which uses a tree pruning method called subtree raising and recursively classifies until the last leaf is pure. We use the same configuration for all classifiers trained during the experiments. The parameters used in the decision tree are: confidence factor C = 0.25, a minimum of two instances per leaf and the C4.5 approach for reduced-error pruning Witten & Frank (2005). For further validation, we use random forests for classification with default Weka Witten & Frank (2005) parameters and 10-fold cross validation. The results coincide with those from decision trees. During all experimentation, we use the same training and testing approach. For subspace learning, one-third of the database is used while the remaining part is projected to this space. We retain 97% of the eigenvalues during subspace learning.
Table 3. Comparison of three different approaches used for face recognition. The overall recognition rate shows the number of images correctly classified. The PCA representation of the surface texture from a face outperforms the conventional AAM approach; however, energy based descriptors are not only compact but perform even better than the two other approaches.
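One common reading of the 97% retention criterion is to keep the smallest number of leading components whose eigenvalues sum to at least 97% of the total. The sketch below implements that reading as an assumption; the function name and the eigenvalues are illustrative.

```python
def components_to_keep(eigenvalues, fraction=0.97):
    """Smallest number of leading eigenvalues whose sum reaches `fraction` of the total."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= fraction * total:
            return count
    return len(ordered)


# Toy spectrum (assumption): cumulative shares are 50%, 80%, 95%, 99%, 100%.
kept = components_to_keep([50.0, 30.0, 15.0, 4.0, 1.0])
```

With this toy spectrum four components are kept, since the first three cover only 95% of the total.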

Expression invariant face recognition
The MMI and CKFED databases contain six basic facial expressions in the form of image sequences. Although the neutral expression is present as a seventh expression, we exclude it during experiments and treat the task as a six-class problem. All images are frontal and hence we use spatio-temporal features with texture warped to the reference shape. Most of the face recognition information is available in the textural components, and hence our features yield stable face recognition results in the presence of facial expressions. Texture warping neutralizes the effect of facial expressions. The recognition results using a decision tree and Bayesian networks are shown in Table 4. However, the structural and temporal parts of the same feature set contain sufficient facial expression information (refer to Table 1). In this way, a single STMF is representative of facial expression, face recognition and gender information.

Pose invariant face recognition
In section 6.2, we explained the detailed process for texture extraction. Since the model is defined over a coarse mesh of vertices, it is useful to consider the texture map as an image with undistorted texture patches. In the presence of different facial poses, triangles at the face edges are tilted such that texture information is extremely distorted. In order to solve this problem, we project each triangle onto a block of 16 × 16 pixels. The block size is chosen by trading off between efficiency and accuracy. In this procedure, each triangle is weighted equally toward the feature calculation and face edges are not destroyed but rather provide the detailed texture information that might be lost during the conventional image warping approach. The recognition results with different approaches are given in Table 3.
Table 5. Comparison of facial expression recognition accuracy with state-of-the-art approaches.
Approach                                  % Accuracy
Model based Mayer et al. (2009)           87.1%
TAN Cohen et al. (2003)                   83.3%
LBP + SVM Shan et al. (2009)              92.6%
IEBM Asthana et al. (2009)                92.9%
Fixed Jacobian Asthana et al. (2009)      89.6%
STMF + DT                                 93.2%

Facial expressions and gender classification
Facial expression recognition is performed on CKFED with six universal facial expressions: anger, disgust, fear, laugh, sadness and surprise. Each video sequence starts from a neutral face and reaches the peak of the particular expression. We exclude the neutral expression during the experiments because it is included in all image sequences and causes more confusion. However, a neutral expression can also be considered as a seventh expression during classification. It can be automatically segmented using the magnitudes of the velocity vectors. STMF features with their three constituents (structural, textural and temporal) are used for the experiments. Finally, the results are compared in Table 5 with state-of-the-art approaches which use comparable systems, along with the confusion matrix from our experiments. We further estimated age using the FG-NET database FG-NET AGING DATABASE (n.d.). This database contains 1002 images of 62 subjects at different ages ranging from 0 to 69 years. We divide the whole dataset into seven classes. Since the database consists of static images, we experiment only with the shape and textural components of the feature set. A classification rate of 49.70% is achieved with texture, whereas the classification rate improved to 57.29% using support vector machine based classification. The mean absolute error (MAE) was 0.769.

Conclusions and future work
This chapter explained the STMF for unconstrained face recognition. The spatial part of this feature set consists of structural (section 5) and textural (section 6) information. Two different texture extraction approaches are discussed in detail in section 6.1 and section 6.2. Further, a comparative study of three different textural features shows that PCA outperforms LBP and DCT. Since PCA is a global representation of a face, it does not contain local information. In section 9.1 a local representation for each triangular patch is calculated. Since this local representation is added to the extracted features, it further improves the results and outperforms holistic PCA. Since the feature set given in equation 9 is consistent with Table 1, it can be used for facial expression recognition, gender classification and age estimation. The results are shown in section 10.1. This chapter provides a comprehensive overview and a compact description of 3D face modeling, face recognition and the classification of soft-biometric traits including facial expressions, gender and age. However, such systems require more memory and are relatively slower than conventional image based approaches. The future goal of this work is to enhance its efficiency so that it can be applied in real time for interactive systems. Further, more diverse conditions with large variabilities remain to be tested.