Biometric systems for human recognition are in constant demand. Among the biometric technologies employed so far, face recognition is one of the most widespread. Because nearly everyone relies on faces daily as the primary means of recognizing other people, face recognition feels natural and has become a well-accepted method. Furthermore, acquiring a face image is not considered as intrusive as acquiring other biometric traits.
Nonetheless, in spite of the many facial recognition systems that already exist, a large number of them have failed to live up to expectations. 2D facial recognition systems are limited by changes in physical appearance, aging, pose and lighting intensity. Recently, 3D facial recognition systems have emerged to overcome these challenges, showing a high level of accuracy and reliability and greater robustness to the facial variation caused by these factors.
A face-based biometric system consists of acquisition devices, preprocessing, feature extraction, data storage and a comparator. An acquisition device may be a 2D, 3D or infra-red camera that records the facial information. The preprocessing stage detects facial landmarks, aligns the facial data and crops the facial area. It can filter out irrelevant information such as hair and background, and reduce facial variation due to pose change. In 2D images, landmarks such as the eyes, eyebrows and mouth can be reliably detected; in 3D face recognition, in contrast, the nose is the most important landmark.
The 3D information (depth and texture maps) corresponding to the surface of the face may be acquired in different ways: a multi-camera system (stereoscopy), range cameras, or 3D laser and scanner devices. Approaches that exploit 3D data fall into two categories. The first category comprises all approaches that require the same 3D data format in the training and testing stages. The second comprises approaches that take advantage of 3D data during the training stage but use 2D data in the recognition stage. Approaches of the first category report better results than those of the second; however, their main drawback is that the acquisition conditions and the elements of the test scenario must be well synchronized and controlled in order to acquire accurate 3D data. Thus, they are not suitable for surveillance applications or access control points where only one “normal” 2D texture image (from any view), acquired from a single camera, is available. The second category comprises model-based approaches. Their main drawback is the high computational burden required to fit the images to the 3D models.
In this chapter, we study 3D face recognition: we describe the most recent 3D-based face recognition techniques and coarsely classify them into the categories presented in the following sections.
2. Iterative closest point
(Maurer et al., 2005) presented a multimodal algorithm that uses Iterative Closest Point (ICP) to extract a distance map, that is, the distances between the reference mesh and the probe mesh. The method comprises face finding, landmark finding and template computation. They used a weighted sum rule to fuse shape and texture scores; if the 3D score is high, the algorithm uses only shape for evaluation. In experimental tests using 4007 faces in the FRGC v2 database, a verification rate of 87.0% was achieved at a 0.1% false accept rate (FAR). (Kakadiaris et al., 2007) performed face recognition with an annotated model that is non-rigidly registered to face meshes through a combination of ICP, simulated annealing and elastically adapted deformable model fitting. A limitation of this approach is the constraints it imposes on the initial orientation of the face.
The performance of 3D methods depends heavily on registration, for which ICP is commonly used. ICP performs rigid registration, and its result is highly dependent on the initial alignment. Consequently, expression variations degrade registration success. To overcome this problem, (Faltemier et al., 2008) divided the face into overlapping regions and registered each region independently. The distance between regions was used as a similarity measure, and the results were fused using a modified Borda count. They achieved a rate of 97.2% on the FRGC v2 database. Other approaches to discard the effect of expressions divide the face into separate parts and extract features from each part in 2D and range images (Cook et al., 2006; McCool et al., 2008).
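To make the ICP step concrete, the following is a minimal sketch of point-to-point ICP on raw point sets, returning the aligned probe together with a per-point distance map. It is not the implementation of any cited work: the function names, the brute-force nearest-neighbour search and the fixed iteration count are all assumptions chosen for brevity.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst (Kabsch method)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)            # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(probe, reference, n_iter=30):
    """Rigidly align probe (N, 3) to reference (M, 3).

    Returns the aligned probe and the distance from each aligned probe
    point to its closest reference point (a simple distance map).
    """
    aligned = np.asarray(probe, dtype=float).copy()
    reference = np.asarray(reference, dtype=float)
    for _ in range(n_iter):
        # brute-force closest reference point for every probe point
        d2 = ((aligned[:, None, :] - reference[None, :, :]) ** 2).sum(axis=2)
        nearest = reference[d2.argmin(axis=1)]
        R, t = best_rigid_transform(aligned, nearest)
        aligned = aligned @ R.T + t
    d2 = ((aligned[:, None, :] - reference[None, :, :]) ** 2).sum(axis=2)
    return aligned, np.sqrt(d2.min(axis=1))
```

Because the closest-point correspondences are only reliable when the two surfaces start close to each other, a sketch like this converges only from a reasonable initial alignment.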
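Rank-level fusion of per-region results can be illustrated with the plain Borda count. The sketch below is the standard rule, not the modified variant used by (Faltemier et al., 2008), whose details are not given here; the function name and the matrix layout are assumptions.

```python
import numpy as np

def borda_fusion(distance_matrix):
    """Fuse per-region rankings with the standard Borda count.

    distance_matrix[r, g] is the distance between the probe and gallery
    subject g measured on face region r (smaller means more similar).
    Returns the gallery indices ordered from best to worst, and the points.
    """
    d = np.asarray(distance_matrix, dtype=float)
    n_gallery = d.shape[1]
    # rank 0 is the smallest distance within each region
    ranks = d.argsort(axis=1).argsort(axis=1)
    # a subject ranked k-th in a region receives (n_gallery - 1 - k) points
    points = (n_gallery - 1 - ranks).sum(axis=0)
    return np.argsort(-points, kind="stable"), points
```

Working on ranks rather than raw distances means that a single badly registered region cannot dominate the fused decision.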
3. Geometric approach
The early work on applying invariant functions to 3D face recognition was done over a decade ago. At that time, researchers began with the geometrical properties introduced in differential geometry, such as principal curvatures and Gaussian curvature. Basically, these approaches use invariant functions (e.g., Gaussian curvature, which is invariant under Euclidean transformations) to extract information from the face surface and then perform classification based on the extracted information. (Riccio & Dugelay, 2007) proposed a 2D-3D face recognition method based on 16 geometric invariants calculated from a number of “control points”. The 2D face images and 3D face data are related through those geometric invariants. The method is invariant to pose and illumination, but its performance depends closely on the accuracy of the “control point” localization.
In the approach proposed by (Elyan & Ugail, 2009), the first goal was to automatically determine the symmetry profile along the face. This was done by computing the intersection between the symmetry plane and the facial mesh, resulting in a planar curve that accurately represents the symmetry profile. Once the symmetry profile is determined, a few feature points along it are computed. These feature points are essential for computing other facial features, which can then be used to locate the central region of the face and extract a set of profiles from that region (Fig. 1). To locate the symmetry profile, it was assumed that it passes through the tip of the nose. The nose tip was considered the easiest feature point to recover, and it was located using a bilinearly blended Coons surface patch. A Coons patch is a parametric surface defined by four given boundary curves. In (Elyan & Ugail, 2009), the four boundaries of the Coons patch were determined from a boundary curve that encloses an approximate central region of interest, which is simply the region of the face that contains, or is likely to contain, the nose area. This region was approximated based on the centre of mass of the 3D facial image. They computed the Fourier coefficients of the designated profiles and stored them in a database rather than storing the actual points of the profiles. The result is a database of images representing different individuals, in which each person is represented by two profiles stored by means of their Fourier coefficients.
Moreover, several works in the literature propose mapping 3D face models into a low-dimensional space, including the local isometric facial representation (Bronstein et al., 2007) and conformal mapping (Wang et al., 2006). For simplification, some works also investigate partial human biometry, meaning recognition based on only part of a face; for example, (Drira et al., 2009) used the nose region for identification. (Szeptycki et al., 2010) explored how conformal mapping to 2D space (Wang et al., 2006) can be applied to partial face recognition. To reduce the computational cost of 3D face recognition, they used conformal maps of the 3D surface to a 2D domain, thus simplifying the 3D problem to a 2D one. The principal issue addressed in (Szeptycki et al., 2010) was the creation of facial feature maps that can be used for recognition by applying previously developed 2D recognition techniques. Creating 2D maps from 3D face surfaces can handle model rotation and translation. However, their technique can be applied only to images with variation in pose and lighting; expression changes were not considered. To create the face maps that are later used for recognition, they started with model preprocessing (hole and spike removal). The next step was to segment the rigid part of the face, which is less likely to change during expression. Finally, they performed UV conformal parameterization as well as shape index calculation for every vertex; the process is shown in Fig. 2.
(Song et al., 2009) detected three regions of the human face (eyes, nose and mouth) and then calculated geometric characteristics of these regions: straight-line Euclidean distance, curvature distance, area, angle and volume.
Another face recognition system based on 3D geometric features was developed by (Tin & Sein, 2009). It is based on the perspective projection of a triangle constructed from three nodal points extracted from the corners of the two eyes and the lips (Fig. 3). A set of non-linear equations was established using the nodal points of a triangle built from any three points in a 2D scene.
An automatic 3D face recognition system using a geometric invariant feature was proposed by (Guo et al., 2009). They used two kinds of features: the angle between neighbouring facets, which serves as the spatial geometric feature, and the local shape representation vector, which serves as the local variation feature. Combining the two yields the geometric invariant feature. Before feature extraction, they applied a regularization method to build regular mesh models. The angle between neighbouring facets is invariant to scale and pose, while the local shape feature represents the distinctive shape of the individual.
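As an illustration of such an invariant, the sketch below estimates the Gaussian curvature of a range image z(x, y) with finite differences, using the standard formula for a surface given as a height field, K = (z_xx z_yy - z_xy^2) / (1 + z_x^2 + z_y^2)^2. The function name and the regular-grid assumption are ours, and real scans would normally be smoothed first.

```python
import numpy as np

def gaussian_curvature(z, spacing=1.0):
    """Gaussian curvature of a height field z sampled on a regular grid.

    Rows of z index y, columns index x; `spacing` is the grid step.
    """
    z = np.asarray(z, dtype=float)
    zy, zx = np.gradient(z, spacing)        # first derivatives (axis 0 = y, axis 1 = x)
    zxy, zxx = np.gradient(zx, spacing)     # derivatives of z_x along y and x
    zyy, _ = np.gradient(zy, spacing)       # derivative of z_y along y
    return (zxx * zyy - zxy ** 2) / (1.0 + zx ** 2 + zy ** 2) ** 2
```

For example, the paraboloid z = (x^2 + y^2) / 2 has curvature 1 at its apex, while any plane has curvature 0 everywhere.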
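Storing a profile through its Fourier coefficients can be sketched as follows. This is an illustrative assumption about the representation, not the authors' exact procedure: the profile is taken to be a uniformly sampled 1D signal (for instance depth along the curve), and the function names and the number of retained coefficients are hypothetical.

```python
import numpy as np

def profile_descriptor(profile, n_coeffs=16):
    """Keep the first n_coeffs Fourier coefficients of a sampled 1D profile."""
    return np.fft.rfft(np.asarray(profile, dtype=float))[:n_coeffs]

def reconstruct_profile(coeffs, n_samples):
    """Approximate the profile from its truncated Fourier coefficients."""
    full = np.zeros(n_samples // 2 + 1, dtype=complex)
    full[:len(coeffs)] = coeffs
    return np.fft.irfft(full, n=n_samples)

def profile_distance(c1, c2):
    """Euclidean distance between two coefficient vectors of equal length."""
    return float(np.linalg.norm(np.asarray(c1) - np.asarray(c2)))
```

A smooth profile is dominated by its low frequencies, so a short coefficient vector is a much more compact record than the raw sample points.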
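Three of these measurements (straight-line distance, angle and area) are simple to compute from 3D landmark coordinates, as the sketch below shows. The function names and the choice of landmarks are hypothetical, and the curvature distance and volume features are not covered.

```python
import numpy as np

def euclidean_distance(p, q):
    """Straight-line distance between two 3D landmarks."""
    return float(np.linalg.norm(np.asarray(p, float) - np.asarray(q, float)))

def angle_at(vertex, a, b):
    """Angle in radians at `vertex` between the directions to a and b."""
    u = np.asarray(a, float) - np.asarray(vertex, float)
    v = np.asarray(b, float) - np.asarray(vertex, float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def triangle_area(a, b, c):
    """Area of the triangle spanned by three 3D landmarks."""
    a, b, c = (np.asarray(p, float) for p in (a, b, c))
    return 0.5 * float(np.linalg.norm(np.cross(b - a, c - a)))
```

All three quantities depend only on the relative positions of the landmarks, so they do not change when the face is rotated or translated.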
(Passalis et al., 2007) focused on intra-class object retrieval problems, specifically on face recognition. By considering the human face as a class of objects, the task of verifying a person’s identity can be expressed as an intra-class retrieval operation. The fundamental idea behind their method is to convert raw polygons in R^3 space into a compact 2D description that retains the geometry information, and then perform the retrieval operation in R^2 space. This offers two advantages:
working in R^2 space is easier, and
the system can apply the existing 2D techniques.
A 3D model is first created to describe the selected class. Apart from the geometry, the model also includes any additional features that characterise the class (e.g., area annotation, landmarks). Additionally, the model has a regularly sampled mapping from R^3 to R^2 (UV parameterization) that can be used to construct the equivalent 2D description, the geometry image. Subsequently, a subdivision-based model is fitted onto all the objects of the class using a deformable model framework. The result is converted to a geometry image and wavelet decomposition is applied. The wavelet coefficients are stored for matching and retrieval purposes (Fig. 4).
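The wavelet step can be illustrated with one level of the 2D Haar transform applied to a single channel of a geometry image. The Haar wavelet is chosen here only because it is the simplest case; the wavelet family actually used by the authors is not stated above, and the function name is hypothetical.

```python
import numpy as np

def haar2d(img):
    """One level of the orthonormal 2D Haar transform.

    img must have even height and width. Returns the approximation
    sub-band followed by three detail sub-bands, each half the size.
    """
    a = np.asarray(img, dtype=float)
    s = np.sqrt(2.0)
    # filter along rows: pairwise sums (low-pass) and differences (high-pass)
    lo = (a[:, 0::2] + a[:, 1::2]) / s
    hi = (a[:, 0::2] - a[:, 1::2]) / s
    # filter along columns
    ll = (lo[0::2] + lo[1::2]) / s
    lh = (lo[0::2] - lo[1::2]) / s
    hl = (hi[0::2] + hi[1::2]) / s
    hh = (hi[0::2] - hi[1::2]) / s
    return ll, lh, hl, hh
```

Applying the function again to the approximation sub-band gives a multi-level decomposition, and the resulting coefficients can be compared with any 2D distance measure.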
(Zaeri, 2011) investigated a new 3D face image acquisition and capturing system and demonstrated a test-bed for 3D face image feature characterization and extraction. (Wong et al., 2007) proposed a multi-region algorithm for 3D face recognition. They identified multiple sub-regions over a given range facial image and extracted summation invariant features from each sub-region. For each sub-region and its summation invariant feature, a matching score was calculated. Then, a linear fusion method combined the matching scores of the individual regions into a final matching score. (Samir et al., 2006) described the face surface using contour lines, or iso-contours, of the depth function, using the nose tip as a reference point for alignment. The face surface is represented as a 2D image (e.g., a depth map), and 2D image classification techniques are then applied. This approach requires that the surfaces be aligned by the iterative closest point algorithm or by feature-based techniques. Then, the deformable parts of the face are detected and either excluded from the matching stage or given a reduced contribution during matching. This, however, may lead to the loss of information that is important for classification (e.g., when the lower part of the face is excluded). A different approach is to use an active appearance model or, in the general case, a 3D deformable model that is fitted to the face surface. The difficulty in this case lies in building a (usually linear) model that can capture all the degrees of freedom hidden in facial expressions and in fitting the model to the surface at hand.
The approach of (Mpiperis et al., 2007) relies on the assumption that the face is approximately isometric, which means that geodesic distances among points on the surface are preserved, and tries to establish an expression-invariant representation of the face. This technique does not have the disadvantages outlined in some other methods (loss of information and dealing with face variability). (Mpiperis et al., 2007) have considered the face surface as a 2D manifold embedded in the 3D Euclidean space, characterized by a Riemannian metric and described by intrinsic properties, namely geodesics (Figures 5 and 6).
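Geodesic distances on a triangle mesh can be approximated by shortest paths along mesh edges, as in the sketch below. This is a coarse stand-in chosen for brevity: it overestimates true geodesics, and it is not the computation used by (Mpiperis et al., 2007). The function name is hypothetical.

```python
import heapq
import numpy as np

def geodesic_distances(vertices, triangles, source):
    """Approximate geodesic distances from `source` to every vertex.

    Runs Dijkstra's algorithm on the graph of mesh edges, weighting
    each edge by its Euclidean length.
    """
    vertices = np.asarray(vertices, dtype=float)
    neighbours = {i: set() for i in range(len(vertices))}
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            neighbours[u].add(v)
            neighbours[v].add(u)
    dist = np.full(len(vertices), np.inf)
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale queue entry
        for v in neighbours[u]:
            nd = d + float(np.linalg.norm(vertices[u] - vertices[v]))
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

If a mesh is bent without stretching its edges, these path lengths stay the same even though straight-line distances between vertices change, which is the property that expression-invariant representations rely on.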
(Li et al., 2009) proposed a 3D face recognition approach using Harmonic Mapping and ABF++ as the mesh parameterization techniques. This approach represents the face surface in both local and global manners, encoding the intrinsic attributes of the surface in planar regions. Therefore, coarse surface registration and matching can be handled in a low-dimensional space. The basic idea is to map 3D surface patches to a 2D parameterization domain and encode the shape and texture information of a 3D surface into a 2D image, so that complex geometric processing can be carried out in a low-dimensional space. The mean curvature is employed to characterize the points of the surface. Then, both a local shape description and a global shape description with curvature texture are constructed to represent the surface. For selected surface patches in local regions, Harmonic Mapping is used to construct the local shape description. Harmonic mappings are the solutions to the partial differential equations derived from the Dirichlet energy defined on Riemannian manifolds. An example of the constructed local shape description at a feature point on a facial surface is shown in Fig. 7, while the global shape description is shown in Fig. 8. For the overall meshes of probe or gallery images, the nonlinear parameterization ABF++ with free boundary, proposed by (Sheffer et al., 2005), is used to create the global shape description.
The method presented by (Guo et al., 2010) is based on conformal geometric maps and does not need registration of the 3D models. It maps the 3D facial shape to a 2D domain through a global optimization, and the mapping is a diffeomorphism. The resulting 2D maps, called Intrinsic Shape Description Maps, integrate geometric and appearance information and are able to describe the intrinsic shape of the 3D facial model (Fig. 9).
(Harguess & Aggarwal, 2009) compared the use of the average-half-face with the use of the original full face for 6 different algorithms applied to two- and three-dimensional (2D and 3D) databases. The average-half-face is constructed from the full frontal face image in two steps: first the face image is centred and divided in half, and then the two halves are averaged together (after reversing the columns of one of the halves). The resulting average-half-face is then used as the input to face recognition algorithms. (Harguess & Aggarwal, 2009) compared the results using the following algorithms: eigenfaces, multi-linear principal components analysis (MPCA), MPCA with linear discriminant analysis (MPCALDA), Fisherfaces (LDA), independent component analysis (ICA), and support vector machines (SVM).
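The two construction steps translate directly into array operations. The sketch below assumes that the face image is already centred so that the symmetry axis coincides with the middle column; the function name is hypothetical, and for images of odd width the middle column is simply dropped.

```python
import numpy as np

def average_half_face(face):
    """Average the left half of a centred face with its mirrored right half."""
    face = np.asarray(face, dtype=float)
    half = face.shape[1] // 2
    left = face[:, :half]
    right = face[:, -half:][:, ::-1]     # reverse the columns of the right half
    return 0.5 * (left + right)
```

The result has half as many columns as the input, so any recognition algorithm applied to it works on roughly half the data.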
4. Active appearance model approach
Many researchers have used the active appearance model (AAM) (Cootes et al., 2001) in modelling 3D face images. The AAM is a generative and parametric model that allows representation of a variety of shapes and appearances of human faces. It uses the basis vectors that are obtained by applying principal component analysis (PCA) to the input images and tries to find the maximum amount of variance. Although AAM is simple and fast, fitting it to an input image is not an easy task because it requires nonlinear optimization that finds a set of suitable parameters simultaneously, and its computation is basically conducted in an iterative manner. Usually, the fitting is performed by a variety of standard nonlinear optimization methods.
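The PCA step that produces the basis vectors can be sketched as follows. This covers only the linear basis computation, not AAM fitting, and the function names are hypothetical; each row of `samples` is assumed to be one vectorized training shape or appearance.

```python
import numpy as np

def pca_basis(samples, n_components):
    """Return the mean, the leading principal directions and their variances.

    samples has shape (n_samples, n_features); the basis has shape
    (n_components, n_features) with orthonormal rows.
    """
    X = np.asarray(samples, dtype=float)
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    variances = s ** 2 / (len(X) - 1)
    return mean, Vt[:n_components], variances[:n_components]

def pca_project(sample, mean, basis):
    """Model parameters of a sample: its coordinates in the PCA basis."""
    return basis @ (np.asarray(sample, dtype=float) - mean)

def pca_reconstruct(params, mean, basis):
    """Instance generated by the linear model for the given parameters."""
    return mean + basis.T @ params
```

Fitting then amounts to searching for the parameter vector whose reconstruction best explains the input image, which is where the nonlinear optimization mentioned above comes in.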
(Abboud et al., 2004) proposed a facial expression synthesis and recognition system based on a face model built with AAM. After extracting the AAM appearance parameters, they recognized facial expressions in the Euclidean and Mahalanobis spaces of these parameters. Also, (Abboud & Davoine, 2004) proposed a bilinear factorization expression classifier and compared it to linear discriminant analysis (LDA). Their results showed that bilinear factorization is useful when only a small number of training samples is available. (Ishikawa et al., 2004) used AAM for tracking around the eye region and recognized the direction of gaze.
(Matthews et al., 2004) suggested that, for the pose and illumination problems, an AAM built with single-person data performs better than an AAM built with multiple-person data. (Xiao et al., 2004) employed 3D shapes in the AAM in order to solve the pose problem and used a nonrigid structure-from-motion algorithm to compute this 3D shape from 2D images. The 3D shape provides constraints on the 2D shape, which can be more deformable, and these constraints make fitting more reliable. (Hu et al., 2004) proposed another extension of the 2D + 3D AAM fitting algorithm, called the multiview AAM fitting algorithm. It fits a single 2D + 3D AAM to multiple view images obtained simultaneously from multiple affine cameras. (Mittrapiyanuruk et al., 2004) proposed the use of stereo vision to construct a 3D shape and estimate the 3D pose of a rigid object using AAM. (Cootes et al., 2002) proposed using several face models to fit an input image. They estimated the pose of an input face image by a regression technique and then fitted the input face image to the face model closest to the estimated pose. However, their approach requires pose estimation, which is itself a difficult problem: the estimate may be incorrect when the appearance of the test face image differs slightly from the training images because of different lighting conditions or facial expressions. (Sung & Kim, 2008) proposed an extension of the 2D + 3D AAM to a view-based approach for pose-robust face tracking and facial expression recognition. They used the PCA with missing data (PCAMD) technique to obtain the 2D and 3D shape basis vectors, since some face models have missing data. Then, they developed an appropriate model selection method for the input face image. This model selection method uses the pose angle that is estimated directly from the 2D + 3D AAM.
(Park et al., 2010) proposed a method for aging modelling in the 3D domain. Facial aging is a complex process that affects both the shape and texture (e.g., skin tone or wrinkles) of a face. This aging process also appears in different manifestations in different age groups. While facial aging is mostly represented by facial growth in younger age groups (e.g., ≤ 18 years old), it is mostly represented by relatively large texture changes and minor shape changes (e.g., due to change of weight or stiffness of skin) in older age groups (e.g., >18). Therefore, an age correction scheme needs to be able to compensate for both types of aging processes. (Park et al., 2010) have shown how to build a 3D aging model given a 2D face aging database. Further, they have compared three different modelling methods, namely, shape modelling only, separate shape and texture modelling, and combined shape and texture modelling (e.g., applying second level PCA to remove the correlation between shape and texture after concatenating the two types of feature vectors).
5. Filtering based approach
(Yang et al., 2008) applied canonical correlation analysis (CCA) to learn the mapping between the 2D face image and the 3D face data. The proposed method consists of two phases. In the learning phase, given the 2D-3D face data pairs of the training subjects, PCA is first applied to both the 2D face images and the 3D face data to avoid the curse of dimensionality and reduce noise. Then CCA regression is performed between the 2D and 3D features in the resulting PCA subspaces. In the recognition phase, given an input 2D face image as a probe, the correlation between the probe and the gallery is computed as the matching score using the learnt regression. Furthermore, to simplify the mapping between the 2D face image and the 3D face data, a patch-based strategy is proposed to boost the matching accuracy. (Huang et al., 2010) presented an asymmetric 3D-2D face recognition method that uses textured 3D face images for enrolment while performing automatic identification using only 2D facial images. The goal is to limit the use of 3D data to where it really helps to improve face recognition accuracy. The proposed method contains two separate matching steps: a Sparse Representation Classifier (SRC), which is applied to 2D-2D matching, and CCA, which is exploited to learn the mapping between range local binary pattern (LBP) faces (3D) and texture LBP faces (2D). Both matching scores are combined for the final decision.
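The core of CCA-based 2D-3D matching can be sketched with plain linear algebra. The code below is a textbook formulation (whitening followed by an SVD) with a small ridge term for numerical stability; it is not the CCA regression of (Yang et al., 2008), and the function names, the `reg` parameter and the cosine-style score are assumptions. The inputs are taken to be PCA-reduced feature vectors.

```python
import numpy as np

def fit_cca(X, Y, n_components, reg=1e-6):
    """Canonical correlation analysis between paired features X (n, p) and Y (n, q)."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = len(X)
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    Lx, Ly = np.linalg.cholesky(Cxx), np.linalg.cholesky(Cyy)
    # whitened cross-covariance; its singular values are the canonical correlations
    M = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(M)
    Wx = np.linalg.solve(Lx.T, U[:, :n_components])
    Wy = np.linalg.solve(Ly.T, Vt.T[:, :n_components])
    return (mx, Wx), (my, Wy), s[:n_components]

def cca_score(x2d, y3d, model_x, model_y):
    """Correlation-like matching score between a 2D probe and a 3D gallery entry."""
    (mx, Wx), (my, Wy) = model_x, model_y
    a = (np.asarray(x2d, dtype=float) - mx) @ Wx
    b = (np.asarray(y3d, dtype=float) - my) @ Wy
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

At recognition time the probe would be scored against every gallery entry and the highest score taken as the match.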
(Günlü & Bilge, 2010) divided 3D faces into smaller voxel regions and applied 3D transformation to extract features from these voxel regions, as shown in Fig. 10. The number of features selected from each voxel region is not constant and depends on their discrimination.
(Dahm & Gao, 2010) presented a novel face recognition approach that implements cross-dimensional comparison to solve the issue of pose invariance. The approach implements a Gabor representation during comparison to allow for variations in texture, illumination, expression and pose. Kernel scaling is used to reduce comparison time during the branching search, which determines the facial pose of input images. This approach creates 2D rendered views of the 3D model from different angles, which are then compared against the 2D probe. Each rendered view is created by deforming the 3D model’s texture with the 3D shape information, as shown in Fig. 11.
(Wang et al., 2010) proposed another multi-stage scheme for 3D face recognition. They used iterative closest point to align all 3D face images with the first person. Then a region defined by a sphere of radius 100 mm centred at the nose tip was cropped to construct the depth image. A Gabor filter was used to capture the useful local structure of the depth images.
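The cropping step and a Gabor kernel can be sketched as follows. Only the 100 mm radius comes from the description above; the function names, the isotropic Gaussian envelope and all kernel parameters are illustrative assumptions, and only the real (cosine) part of the Gabor function is generated.

```python
import numpy as np

def crop_face(points, nose_tip, radius_mm=100.0):
    """Keep the 3D points (N, 3), in millimetres, inside a sphere around the nose tip."""
    points = np.asarray(points, dtype=float)
    dist = np.linalg.norm(points - np.asarray(nose_tip, dtype=float), axis=1)
    return points[dist <= radius_mm]

def gabor_kernel(size, wavelength, theta, sigma):
    """Real part of a Gabor kernel of odd side length `size`.

    theta is the orientation in radians, wavelength the period of the
    cosine carrier in pixels, sigma the width of the Gaussian envelope.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)      # coordinate along the carrier
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return envelope * np.cos(2.0 * np.pi * xr / wavelength)
```

Convolving the depth image with a bank of such kernels at several orientations and wavelengths yields the local structure responses used as features.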
Another approach that deals with 3D face recognition was presented by (Cook et al., 2007), where they used multi-scale techniques to partition the information contained in the frequency domain prior to dimensionality reduction. In this manner, it is possible to increase the information available for classification and, hence, increase the discriminative performance of both Eigenfaces and Fisherfaces techniques, which were used for dimensionality reduction. They have used the Gabor filters as a partitioning scheme, and compared their results against the discrete cosine transform and the discrete wavelet transform.
6. Statistical approach
(Rama & Tarrés, 2005) presented Partial Principal Component Analysis (P2CA) for 3D face recognition. Its main advantage over the model-based approaches is its low computational complexity, since P2CA does not require any fitting process. However, one of the main problems of their work is the enrolment of new persons in the database (gallery set), since a total of five different images is needed to obtain the 180° texture map. Later, they presented a method that automatically creates 180° texture maps from only two images (frontal and profile views) (Rama & Tarrés, 2007). Nevertheless, this method has another constraint: it needs a normalization (registration) process in which both eyes must be perfectly aligned at a fixed distance. Thus, errors in the registration of the profile view lead to noisy areas in the reconstructed 180° images (Fig. 12).
(Gupta et al., 2007) presented a systematic procedure for selecting facial fiducial points associated with diverse structural characteristics of a human face. They have identified such characteristics from the existing literature on anthropometric facial proportions. Also, they have presented effective face recognition algorithms, which employ Euclidean/geodesic distances between these anthropometric fiducial points as features along with linear discriminant analysis (LDA) classifiers. They have demonstrated how the choice of facial fiducial points critically affects the performance of 3D face recognition algorithms that employ distances between them as features.
Anthropometry is the branch of science that deals with the quantitative description of physical characteristics of the human body. Anthropometric cranio-facial proportions are ratios of pairs of straight-line and/or along-the-surface distances between specific cranial and facial fiducial points (Fig. 13).
(Ming et al., 2010) proposed an algorithm for 3D-based face recognition that represents the facial surface by a so-called Bending Invariant (BI), which is invariant to isometric deformations resulting from expressions and postures. To encode relationships between neighbouring mesh nodes, Gaussian-Hermite moments are computed on the obtained geometric invariant; they provide a rich representation thanks to their mathematical orthogonality and their effectiveness in characterizing local details of the signal. Then, the signature images are decomposed into their principal components using Spectral Regression Kernel Discriminant Analysis (SRKDA), resulting in a large saving in time.
7. Local binary patterns
In (Zhou et al., 2010), the Local Binary Patterns (LBP) method was used to represent 3D face images. LBP describes the local texture pattern with a binary code. The code is built by thresholding a neighbourhood of P points at radius R (typically the 8 surrounding pixels) with the gray value g of its centre c. Also, (Ming et al., 2010) proposed a framework for 3D face recognition based on 3D Local Binary Patterns (3D LBP). In the feature extraction stage, 3D LBP is adopted to describe the intrinsic geometric information, effectively negating the effect of expression variations. 3D LBP encodes relationships between neighbouring mesh nodes and has more power to describe the structure of faces than individual points. In the learning stage, Spectral Regression is adopted to learn principal components from each 3D facial image. With dimensionality reduction based on Spectral Regression, more useful and significant features can be produced for a face, resulting in a large saving in computational cost. Finally, face recognition is achieved using Nearest Neighbour Classifiers.
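The basic 2D LBP operator with P = 8 and R = 1 can be sketched as below. The starting neighbour and the bit order are conventions that differ between implementations, so the ones used here are assumptions, as are the function names; the mesh-based 3D LBP variant is not covered.

```python
import numpy as np

def lbp_code(patch):
    """8-bit LBP code of the centre pixel of a 3x3 patch (P = 8, R = 1)."""
    patch = np.asarray(patch)
    centre = patch[1, 1]
    # neighbours visited clockwise, starting at the top-left pixel
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    bits = [1 if patch[r, c] >= centre else 0 for r, c in order]
    # the i-th visited neighbour contributes bit i of the code
    return sum(b << i for i, b in enumerate(bits))

def lbp_histogram(image):
    """256-bin histogram of the LBP codes of all interior pixels."""
    image = np.asarray(image)
    hist = np.zeros(256, dtype=int)
    for r in range(1, image.shape[0] - 1):
        for c in range(1, image.shape[1] - 1):
            hist[lbp_code(image[r - 1:r + 2, c - 1:c + 2])] += 1
    return hist
```

For the patch [[5, 9, 1], [4, 6, 7], [2, 8, 3]] the centre value is 6, the neighbours 9, 7 and 8 pass the threshold, and the resulting code is 2 + 8 + 32 = 42.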
8. Other 3D face recognition approaches
In order to enhance robustness to expression variations, a procedure for 3D face recognition based on the depth image and the Speeded-Up Robust Features (SURF) operator was proposed by (Yunqi et al., 2010). First, they applied the Fisher Linear Discriminant (FLD) method to the depth image to perform coarse recognition and select the highly ranked 3D faces. Then, they extracted the SURF features of the 2D gray images corresponding only to those highly ranked 3D faces to carry out the refined recognition. The SURF algorithm was first proposed by (Bay et al., 2008) and has since been applied to image registration, camera calibration and object recognition. Furthermore, (Kim & Dahyot, 2008) presented another approach for 3D face recognition using SVM and the SURF descriptor.
On the other hand, (Wang et al., 2009) used a spherical harmonic representation with the morphable model for 2D face recognition. The method uses a 2D image to build a 3D model for the gallery, based on a 3D statistical morphable model. Also, (Biswas et al., 2009) proposed a method for albedo estimation for face recognition using two-dimensional images. However, they assumed that the image did not contain shadows. (Zhou et al., 2008) used nearest-subspace patch matching to warp near frontal face images to frontal and project this face image into a pre-trained low-dimensional illumination subspace. Their method requires training of patches in many different illumination conditions.
9. 3D face fitting
A 3D Morphable Model (3DMM) consists of a parameterized generative 3D shape, and a parameterized albedo model together with an associated probability density on the model coefficients. Together with projection and illumination parameters, a rendering of the face can be generated. Given a face image, one can also solve the inverse problem of finding the coefficients which most likely generated the image. Identification and manipulation tasks in coefficient space are trivial, because the generating factors (light, pose, camera, and identity) have been separated. Solving this inverse problem is termed “model fitting”, and was introduced for faces by (Blanz & Vetter, 1999). A similar method has also been applied to stereo data (Amberg et al., 2007) and 3D scans (Amberg et al., 2008).
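The generative side of such a model, and a heavily simplified version of the inverse problem, can be sketched as follows. The sketch assumes that an observed shape vector is already in correspondence with the model, so pose, projection and illumination are ignored and the fit reduces to a regularized linear least-squares problem; real 3DMM fitting to images is nonlinear. All names and the Gaussian noise assumption are ours.

```python
import numpy as np

def synthesize(mean, basis, coeffs):
    """Linear generative model: mean (d,), basis (d, k), coeffs (k,)."""
    return mean + basis @ coeffs

def fit_coefficients(observation, mean, basis, sigmas, noise_std=1.0):
    """Most likely coefficients under a zero-mean Gaussian prior.

    sigmas (k,) are the prior standard deviations of the coefficients and
    noise_std is the assumed standard deviation of the observation noise.
    """
    sigmas = np.asarray(sigmas, dtype=float)
    A = basis.T @ basis / noise_std ** 2 + np.diag(1.0 / sigmas ** 2)
    b = basis.T @ (observation - mean) / noise_std ** 2
    return np.linalg.solve(A, b)
```

The prior term pulls the estimate towards the mean face, which keeps the solution plausible when the observation is noisy or incomplete.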
A 3D deformation modelling scheme was proposed by (Lu & Jain, 2008) to handle the expression variations. They proposed a facial surface modelling and matching scheme to match 2.5D facial scans in the presence of both nonrigid deformations and pose changes (multiview) to a stored 3D face model with neutral expression.
They collected data for learning 3D facial deformations from only a small group of subjects, called the control group. Each subject in the control group provides a scan with neutral expression and several scans with nonneutral expressions. The deformations (between neutral scan and nonneutral scans) learned from the control group are transferred to and synthesized for all the 3D neutral face models in the gallery, yielding deformed templates with synthesized expressions (Fig. 14). For each subject in the gallery, deformable models are built based on the deformed templates. In order to learn deformation from the control group, a set of fiducial landmarks is needed. Besides the fiducial facial landmarks such as eye and mouth corners, landmarks in the facial area with little texture, for example, cheeks are extracted in order to model the 3D surface movement due to expression changes. A hierarchical geodesic-based resampling scheme constrained by fiducial landmarks is designed to derive a new landmark-based surface representation for establishing correspondence across expressions and subjects.
(Wang et al., 2009) proposed an improved algorithm aiming at recognizing faces of different poses when each face class has only one frontal training sample. For each sample, a 3D face is constructed by using 3DMM. The shape and texture parameters of 3DMM are recovered by fitting the model to the 2D face sample which is a non-linear optimization problem. The virtual faces of different views are generated from the 3DMM to assist face recognition. They have located 88 sparse points from the 2D face sample by automatic face fitting and used their correspondence in the 3D face as shape constraint (Fig. 15).
(Daniyal et al., 2009) proposed a compact face signature for 3D face recognition that is extracted without prior knowledge of scale, pose, orientation or texture. The automatic extraction of the face signature is based on fitting a trained Point Distribution Model (PDM) (Nair & Cavallaro, 2007). First, a facial representation based on testing extensive sets of manually selected landmarks is chosen. Next, a PDM is trained to identify the selected set of landmarks (Fig. 16). The recognition algorithm represents the geometry of the face by a set of Inter-Landmark Distances (ILDs) between the selected landmarks. These distances are then compressed using PCA and projected onto the classification space using LDA. The classification of a probe face is finally achieved by projecting the probe onto the LDA subspace and using the nearest mean classifier.
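The ILD representation and the nearest mean classifier can be illustrated with a minimal sketch. It assumes that the landmarks are already available as 3D points, and it deliberately omits the PCA compression and LDA projection, so the classifier here works directly on raw ILD vectors. The function names are hypothetical and are not taken from (Daniyal et al., 2009).

```python
from itertools import combinations
from math import dist


def inter_landmark_distances(landmarks):
    """Return the Euclidean distance between every pair of 3D landmarks."""
    return [dist(p, q) for p, q in combinations(landmarks, 2)]


def class_means(gallery):
    """Average the ILD vectors of each subject.

    gallery maps a subject label to a list of scans, where each scan is a
    list of 3D landmark points given in the same landmark order.
    """
    means = {}
    for label, scans in gallery.items():
        vectors = [inter_landmark_distances(scan) for scan in scans]
        means[label] = [sum(column) / len(column) for column in zip(*vectors)]
    return means


def nearest_mean(probe_landmarks, means):
    """Assign the probe to the subject whose mean ILD vector is closest."""
    probe = inter_landmark_distances(probe_landmarks)
    return min(means, key=lambda label: dist(probe, means[label]))
```

Because pairwise distances do not change under rotation or translation of the scan, a probe that is shifted in space still matches the correct subject in this toy setting. Invariance to scale would need an extra normalization step that is not shown.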
(Paysan et al., 2009) proposed a generative 3D shape and texture model, the Basel Face Model (BFM). The model construction passes through four steps: 3D face scanning, registration, texture extraction and inpainting, and model building. The model is based on parameterizing the faces using triangular meshes. A face is then represented by two vectors, one for shape and one for texture, from which two independent linear models are constructed. Finally, a Gaussian distribution is fit to the data using PCA (Fig. 17).
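A PCA-based Gaussian model of this kind is generative: a new shape (or texture) vector is obtained by adding to the mean a combination of principal components whose coefficients are drawn from a normal distribution. The following is a minimal sketch under that assumption. The argument layout and the function name are hypothetical and do not reproduce the actual BFM data format.

```python
import random


def sample_face(mean, components, stddevs, rng):
    """Draw one vector from a PCA-based Gaussian model.

    mean:       list of length d (mean shape or texture vector)
    components: list of k principal components, each a list of length d
    stddevs:    list of k standard deviations, one per component
    rng:        a random.Random instance, passed in for reproducibility
    """
    # One standard-normal coefficient per principal component.
    coeffs = [rng.gauss(0.0, 1.0) for _ in components]
    face = list(mean)
    for coeff, sigma, component in zip(coeffs, stddevs, components):
        for i, value in enumerate(component):
            face[i] += coeff * sigma * value
    return face, coeffs
```

Since shape and texture are treated as two independent linear models, the same function would be called once with the shape statistics and once with the texture statistics.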
(Toderici et al., 2010) proposed a face recognition method which utilizes 3D face data for enrolment, while it requires only 2D data for authentication. During enrolment, 2D+3D data (2D texture plus 3D shape) is used to build subject-specific annotated 3D models. First, an Annotated Face Model (AFM) is fitted to the raw 2D+3D data using a subdivision based deformable framework. Then, a geometry image representation is extracted using the UV parameterization of the model. In the authentication phase, a single 2D image is used as the input to map the subject-specific 3D AFM. After that, an Analytical Skin Reflectance Model (ASRM) is applied to the gallery AFM in order to transfer the lighting from the probe to the texture in the gallery.
10. Face recognition in video
Face recognition in video has gained wide attention as a covert method for surveillance to enhance security in a variety of application domains (e.g., airports). A video contains temporal information (e.g., movements of facial features) as well as multiple instances of a face, so it is expected to lead to a better face recognition performance compared to still face images. However, faces appearing in a video have substantial variations in pose and lighting. These pose and lighting variations can be effectively modelled using 3D face models (Yin et al., 2006). Given the trajectories of facial feature movement, face recognition is performed based on the similarities of the trajectories. The trajectories can also be captured as nonlinear manifolds and the distance between clusters of faces in the feature space establishes the identity associated with the face. Production of 3D faces from video can be performed using morphable models, stereography, or structure from motion (SFM).
(Park et al., 2005) proposed a face recognition system that identifies faces in a video using a 3D face model. Ten video files were recorded for ten subjects under four different lighting conditions at various poses with yaw and pitch motion. Recognition using multiple images and temporal cues was explored, and majority voting and score sum were used to fuse the recognition results from multiple frames. To use temporal cues for the recognition, an LDA-based classifier was used. After the face pose in a video was estimated, frames of different poses under a specific lighting condition and in a specific order were extracted to form a probe sequence.
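The two fusion rules mentioned above, majority voting and score sum, can be sketched as follows. The sketch assumes that every frame yields a dictionary of similarity scores (higher is better) over the same set of enrolled subjects. This data layout and the function names are assumptions made for illustration, not details of (Park et al., 2005).

```python
from collections import Counter


def majority_vote(frame_scores):
    """Each frame votes for its best-scoring subject; the most voted subject wins.

    Ties are resolved in favour of the subject that received a vote first.
    """
    votes = Counter(max(scores, key=scores.get) for scores in frame_scores)
    return votes.most_common(1)[0][0]


def score_sum(frame_scores):
    """Add the per-frame scores of each subject and return the largest total."""
    totals = {}
    for scores in frame_scores:
        for subject, score in scores.items():
            totals[subject] = totals.get(subject, 0.0) + score
    return max(totals, key=totals.get)
```

The two rules can disagree. With the frames `[{"A": 0.6, "B": 0.5}, {"A": 0.55, "B": 0.5}, {"A": 0.1, "B": 0.9}]`, majority voting selects A (two of three frames), whereas score sum selects B because one very confident frame dominates the total.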
(Von Duhn et al., 2007) designed a 3D face analyzer using regular CCTV videos. They used a three view tracking approach to build 3D face models over time. The proposed system detects, tracks and estimates the facial features. For the tracking, an Active Appearance Model approach is adapted to decrease the amount of manual work that must be done. After the tracking stage, a generic model is adapted to the different views of the face using a face adaptation algorithm, which includes two steps: feature point adaptation and non-feature point interpolation. Finally, the multiple views of models are combined to create an individualized face model. To track the facial motion under three different views, i.e., front view, side view, and angle view, predefined fiducial points are used.
Also, (Roy-Chowdhury & Xu, 2006) estimated the pose and lighting of face images contained in video frames and compared them against synthetic 3D face models exhibiting similar pose and lighting. However, the 3D face models were registered manually with the face image in the video. (Lee et al., 2003) proposed an appearance manifold based approach where each database or gallery image was matched against the appearance manifold obtained from the video. The manifolds were obtained from each sequence of pose variations. (Zhou et al., 2003) proposed to obtain statistical models from video using low level features (e.g., by PCA) contained in sample images. The matching was performed between a single frame and the video or between two video streams using the statistical models.
(Park et al., 2007) explored the adaptive use of multiple face matchers in order to enhance the performance of face recognition in video. To extract the dynamic information in video, the facial poses in various frames are explicitly estimated using an Active Appearance Model and a factorization-based 3D face reconstruction technique. The motion blur is estimated using the Discrete Cosine Transform (DCT). The performance of the proposed system could be improved by dynamically fusing the matching results from multiple frames and multiple matchers.
Further, (Wang et al., 2004) developed a hierarchical framework for tracking high-density 3D facial expression sequences captured from a structured-light imaging system. (Chang et al., 2005) utilized six 3D model sequences for facial analysis and editing; the work was mainly aimed at facial expression analysis. (Papatheodorou & Rueckert, 2004) evaluated a so-called 4D face recognition approach, which, however, used only static 3D data plus texture; no temporal information was explored. (Li et al., 2003) reported a model fitting approach to generate facial identity surfaces through video sequences. The application of this model to face recognition relies on the quality of the tracked low-resolution face model.
(Sun & Yin, 2008) proposed to use a Spatio-Temporal Hidden Markov Model (HMM) which incorporates 3D surface feature characterization to learn the spatial and temporal information of faces. They have created a face database including 606 3D model sequences with six prototypic expressions. To evaluate the usability of such data for face recognition, they applied a generic model to track the range model sequences and establish the correspondence of range model frames over time. After the tracking model labelling and LDA transformation, they trained two HMM models (S-HMM and T-HMM) for each subject to learn the spatial and temporal information of the 3D model sequence. The query sequence was classified based on the results of the two HMMs.
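Classification with per-subject HMMs amounts to evaluating how likely the query sequence is under each subject's models and selecting the best-scoring subject. The sketch below uses the scaled forward algorithm for a discrete-output HMM. Two points are assumptions made only for illustration: the observations are taken to be discrete symbols, and the spatial and temporal HMM scores are combined by adding their log-likelihoods, since the combination rule of (Sun & Yin, 2008) is not detailed here.

```python
from math import log


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | HMM) for a discrete-output HMM.

    start[s]    : initial probability of state s
    trans[r][s] : transition probability from state r to state s
    emit[s][o]  : probability that state s emits symbol o
    """
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the sequence is impossible under this model
        total += log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return total


def classify(obs_spatial, obs_temporal, models):
    """Return the subject with the highest combined log-likelihood.

    models maps a subject to (spatial_hmm, temporal_hmm); each HMM is a
    (start, trans, emit) tuple. Summing the two scores is an assumption.
    """
    def score(subject):
        s_hmm, t_hmm = models[subject]
        return log_likelihood(obs_spatial, *s_hmm) + log_likelihood(obs_temporal, *t_hmm)

    return max(models, key=score)
```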
(Medioni et al., 2007) utilized synthetic stereo to model faces in a 3048 x 4560 video stream. By tracking the pose and location of the face, a synthetic stereo rig based upon the different poses between two frames is initialized. Multiple point clouds from different stereo pairs are created and integrated into a single model. (Russ et al., 2006) utilized a 3D PCA based approach for face recognition. The approach determines a correspondence that utilizes a reference face aligned via ICP to determine a unique vector input into PCA. The coefficients from PCA are used to determine the identity as in 2D PCA face recognition. (Kakadiaris et al., 2006) converted the 3D model into a depth map image for wavelet analysis. This approach performs well and does not utilize ICP as the basis for each match score computation, but does for the depth map production.
Moreover, (Boehnen & Flynn, 2008) presented an approach to combine multiple noisy low-density 3D face models obtained from uncalibrated video into a higher resolution 3D model using the SFM method. SFM is a method for producing 3D models from a calibrated or uncalibrated video stream utilizing equipment that is inexpensive and widely available. The approach first generates ten 3D face models (containing a few hundred vertices each) of each subject using 136 frames of video data in which the subject's face moves in a range of approximately 15 degrees from frontal. By aligning, resampling, and merging these models, a new 3D face model containing over 50,000 points is produced. An ICP face matcher employing the entire face achieved a 75% rank one recognition rate.
Using a data set of varying facial expressions and lighting conditions, (Bowyer et al., 2006) reported an improvement in rank one recognition rate from 96.11% with two frames per subject to 100% with four frames per subject. In another study, (Thomas et al., 2007) observed that the recognition rate generally increases as the number of frames per subject increases, regardless of the type of camera being used. They also found that the optimal number of frames per subject is between 12 and 18, given the particular data sets used.
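The rank one recognition rate quoted in these studies is the fraction of probes whose best-matching gallery entry belongs to the correct subject. A minimal sketch of this computation is given below. It assumes a precomputed probe-by-gallery similarity matrix in which higher values mean a better match. The function name and the data layout are hypothetical.

```python
def rank_one_rate(similarity, probe_labels, gallery_labels):
    """Fraction of probes whose most similar gallery entry has the correct label.

    similarity[i][j] is the match score between probe i and gallery entry j
    (higher means more similar).
    """
    hits = 0
    for scores, truth in zip(similarity, probe_labels):
        # Index of the gallery entry with the highest score for this probe.
        best = max(range(len(scores)), key=scores.__getitem__)
        if gallery_labels[best] == truth:
            hits += 1
    return hits / len(probe_labels)
```

For a distance-based matcher such as ICP, where lower values indicate a better match, the distances would have to be negated or `max` replaced by `min`.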
(Canavan et al., 2007) discussed that the 3D geometry of a rotating face can be embedded in the continuous intensity changes of an image stream, and therefore the recognition algorithm does not require an explicit 3D face model. Further, multiple video frames that capture the face at different pose angles can be combined to provide a more reliable and comprehensive 3D representation of the face than any single view image. Also, they have discussed that a video sequence of a face with different poses might help alleviate the adverse effect of lighting changes on recognition accuracy. For instance, a light source can cast shadows on a face, but at the same time, it also reveals the 3D curvatures of the face by creating sharp intensity contrasts (such as silhouette).
(Dornaika & Davoine, 2006) introduced a view- and texture-independent approach that exploits the temporal facial action parameters estimated by an appearance-based 3D face tracker. The facial expression recognition is carried out using learned dynamical models based on auto-regressive processes. These learned models can also be utilized for the synthesis and prediction tasks. In their study, they used the 3D face model Candide (Ahlberg, 2001). This 3D deformable wireframe model is given by the 3D coordinates of the vertices $P_i$, $i = 1, \dots, n$, where $n$ is the number of vertices. Thus, the shape up to a global scale can be fully described by the $3n$-vector $g$, the concatenation of the 3D coordinates of all vertices $P_i$. The vector $g$ can be written as:
$$g = \bar{g} + S\tau_s + A\tau_a$$
where $\bar{g}$ is the standard shape of the model, and the columns of $S$ and $A$ are the shape and action units, respectively. Thus, the term $S\tau_s$ accounts for shape variability (inter-person variability) while the term $A\tau_a$ accounts for the facial action (intra-person variability).
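The decomposition of the shape vector into a standard shape, a shape term and an action term can be evaluated directly. The sketch below assumes that the matrices are stored as lists of rows and that the parameter vectors are plain lists. The function name is hypothetical, and the toy numbers in the usage note are not Candide data.

```python
def candide_shape(g_bar, S, A, tau_s, tau_a):
    """Evaluate g = g_bar + S tau_s + A tau_a.

    g_bar : standard shape, list of length 3n
    S     : 3n x m matrix (list of rows); its columns are the shape units
    A     : 3n x k matrix (list of rows); its columns are the action units
    tau_s : m shape parameters (inter-person variability)
    tau_a : k action parameters (intra-person variability)
    """
    return [
        g0
        + sum(s * t for s, t in zip(s_row, tau_s))
        + sum(a * t for a, t in zip(a_row, tau_a))
        for g0, s_row, a_row in zip(g_bar, S, A)
    ]
```

For a single vertex with `g_bar = [0, 0, 0]`, `S = [[1], [0], [0]]`, `A = [[0], [2], [0]]`, `tau_s = [0.5]` and `tau_a = [0.25]`, the shape unit moves the vertex along the first coordinate and the action unit moves it along the second. During tracking, the shape parameters of a given person stay fixed while the action parameters change from frame to frame.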
In this chapter, we have presented a study of the most recent advancements in the 3D face recognition field. Despite the huge developments made in this field, there are still some problems and issues that need to be resolved.
Due to its computational complexity, cumbersome preprocessing, and expensive equipment, 3D technology is still not widely used in practical applications. To acquire accurate 3D face data, costly equipment such as a 3D laser scanner or a stereo camera system must be used. Such devices are also still not as stable and efficient as 2D cameras, and some of them, such as the stereo camera system, require calibration before use. Moreover, they take longer to acquire (or reconstruct) the data than a 2D camera. Further, 3D data require much more storage space. Other challenges include feature point localization (still a debated topic), which is also sensitive to the quality of the data. The sampling density of the facial surface and the accuracy of the depth are among the issues that require more investigation. Furthermore, no standard testing protocol is available for comparing different 3D face recognition systems.
On the other hand, in video-based face recognition, experiments have shown that multi-frame fusion is an effective method to improve the recognition rate. The performance gain is probably related to the use of 3D face geometry embedded in video sequences. However, it is not clear how the inter-frame variation contributes to the observed performance increase. Will multi-frame fusion work for videos with strong shadows? How many frames are necessary to maximize the recognition rate without incurring a heavy computational cost? To address these issues, more exploration is needed from the research community.
The author would like to acknowledge and thank Kuwait Foundation for the Advancement of Sciences (KFAS) for financially supporting this work.