Even with present-day computer technology, biometric recognition of the human face remains a difficult task and a continually evolving area of biometrics. Face recognition is well described in many papers and books, e.g. (Delac et al., 2008), (Li & Jain, 2005), (Oravec et al., 2010). It is generally accepted that two-dimensional still-image face recognition in a controlled environment is already a solved task, and several benchmarks evaluating recognition results have been carried out in this area (e.g. Face Recognition Vendor Tests, FRVT 2000, 2002, 2006, http://www.frvt.org/). Nevertheless, many tasks remain to be solved, such as recognition in unconstrained environments, recognition of non-frontal images, the single sample per person problem, etc.
This chapter deals with single sample per person face recognition (also called the one sample per person problem). This topic is related to the small sample size problem in pattern recognition. Although a single sample has its advantages (fast and easy creation of a face database and modest storage requirements), face recognition methods usually fail to work if only one training sample per person is available.
In this chapter, we concentrate on the following items:
Mapping the state-of-the-art of single sample face recognition approaches after year 2006 (the period till 2006 is covered by the detailed survey (Tan et al., 2006)).
Generating new face patterns in order to enlarge the database containing single samples per subject only.
Such approaches can include modifications of original face samples using e.g. noise, mean filtering, a suitable image transform (forward transform, then neglecting some coefficients and image reconstruction by inverse transform), or generating synthetic samples by ASM (active shape model) and AAM (active appearance model).
Comparing recognition efficiency using single and multiple samples per subject.
We illustrate the influence of the number of training samples per subject on recognition efficiency for selected methods. We use PCA (principal component analysis), MLP (multilayer perceptron), RBF (radial basis function) network, kernel methods and LBP (local binary patterns). We compare results using single and multiple training samples per person for a large image set selected from the FERET database.
Highlighting other relevant important facts related to single sample recognition.
We analyze some relevant facts that can influence further development in this area. We also outline possible directions for further research.
2. Face recognition based on a single sample per person
2.1. General remarks
Generally, we can divide the face recognition methods into three groups (Tan et al., 2006): holistic methods, local methods and hybrid methods.
Holistic methods like PCA (eigenfaces), LDA (fisherfaces) or SVM in principle need more image samples per person in the training phase. There are basically two ways to solve the one sample problem:
To extend the classical methods to be trained from single sample more efficiently – e.g. 2D-PCA (Yang et al., 2004), (PC)2A (Wu & Zhou, 2002), E(PC)2A (Chen et al., 2004a), SPCA (Zhang, et al., 2005), APCA (Chen & Lovell, 2004), FLDA (Chen, et al., 2004b), Gabor+PCA+WSCM (Xie & Lam, 2006).
To enlarge the training set by new representations or generating new views.
Local methods can be divided into 2 groups:
Local feature based, which mostly work with some type of graph spread over the face regions with corners in important face features – face recognition is formulated as a problem of graph matching. These methods deal with the one sample problem better than the typical holistic methods (Tan et al., 2006). EBGM (Elastic Bunch Graph Matching) or DCP (directional corner points) are examples of this type of methods.
Local appearance-based methods extract information from defined local regions. The features are extracted by known methods for texture classification (Gabor wavelets, LBP, etc.) and the feature space is reduced by known methods like PCA or LDA.
An excellent introduction to the single sample problem and survey of related methods mapping state-of-the-art till 2006 is described and discussed in (Tan et al., 2006).
2.2. State-of-the-art in single sample per person face recognition from 2006
After year 2006, new approaches were proposed. They are based mainly on enhancement of various conventional methods.
Principal Component Analysis (PCA) is still one of the most popular methods used to deal with the one sample problem. Despite its popularity, calculating a representative covariance matrix from one sample is a very difficult task. In contrast to the conventional application of PCA, 2DPCA (Yang et al., 2004) is based on two-dimensional matrices, so the image does not need to be transformed into a 1D vector beforehand.
In (Que et al., 2008) a new face recognition algorithm MW(2D)2PCA was proposed. Modular Weighted (2D)2PCA (MW(2D)2PCA) is based on the study of (2D)2PCA. Weighting method (W) emphasizes the different influence of different eigenvectors and image blocking method (M) can extract detailed information of face image more effectively. Modularization of image into several blocks according to face elements provides more detailed information of face and assigns this approach rather to local appearance than holistic methods. The best recognition rate achieved by this method was 74.14%.
Similar approach, that deals with the single sample problem from human perception point of view, was proposed in (Zhan et al., 2009) where modularized image was processed by 2D DCT to extract features, instead of (2D)2PCA. Gabor filters can be applied even to the image divided into several areas to reduce illumination impact as it is shown in (Nguyen & Bai, 2009).
Standard way to solve single sample problem is to use local facial representations. Conventional procedure in local methods is face image partitioning into several segments. In (Akbari et al., 2010), an algorithm based on single image per person, with input images segmented into 7 partitions was proposed. The moment feature vectors of a definite order for all images are extracted and distance measure is used to recognize the person.
Another way to get better results of recognition is a fusion of more biometrics. In (Ma et al., 2009) a new multi-modal biometrics fusion approach was presented. They used face and palmprint biometrics and combined the normalized Gaborface and Gaborpalm images at the pixel level. They presented a kernel PCA plus RBF classifier (KPRC) to classify the fused images. Using both face and palmprint samples, the average recognition results were improved from 42.60% and 52.36% (single-modal biometrics) to 87.01% (multi-modal biometrics).
In (Xie & Lam, 2006) novel Gabor-based kernel principal component analysis with doubly nonlinear mapping for human face recognition was proposed. The algorithm is evaluated using 4 databases: Yale, AR, ORL and YaleB database. The best of the proposed variations of the algorithm GW+DKPCA get very good results even under varying lighting, expression and perspective conditions.
(Kanan & Faez, 2010) presents a new approach for face representation and recognition based on Adaptively Weighted Sub-Gabor Array (AWSGA). The proposed algorithm utilizes a local Gabor array to represent faces partitioned into sub-patterns. It employs an adaptively weighting scheme to weight the Sub-Gabor features extracted from local areas based on the importance of the information they contain and their similarities to the corresponding local areas in the general face image. Experiments on AR and Yale databases show, that the proposed method significantly outperforms eigenfaces and modular eigenfaces in most of the benchmark scenarios under both ideal conditions and varying expressions and lighting conditions and this method achieves better results under partial occlusion conditions than the local probabilistic approach.
A novel feature extraction method named uniform pursuit (UP) was proposed in (Deng et al., 2010). A standardized procedure on the large-scale FERET and FRGC databases was applied to evaluate the one sample problem. Experimental results show that the robustness, accuracy and efficiency of the proposed UP method can compete successfully with the state-of-the-art one sample based methods.
In (Qiao et al., 2010), a new graph-based semi-supervised dimensionality reduction algorithm called sparsity preserving discriminant analysis (SPDA) based on SDA was developed. Experiments on AR, PIE and YaleB databases show that proposed method outperforms the SDA method.
Solution for single sample problem based on Fisherface method on generic dataset was presented in (Majumdar & Ward, 2008). The method was also extended to multiscale transform domains like wavelet, curvelet and contourlet. Results on Faces94 and the AT&T database show, that this approach outperforms SPCA and Eigenface Selection methods. Best results came from the Pseudo-fisherface method in the wavelet domain.
In (Gao et al., 2008), a method based on singular value decomposition (SVD) was used to evaluate the within-class scatter matrix so that the FLDA could be applied for face recognition with only one sample image in training set. The experiments on FERET, UMIST, ORL and Yale databases show, that the proposed method outperforms other state-of-the-art methods like E(PC)2A, SVD perturbation and different FLDA implementations.
A novel local appearance feature extraction method based on multi-resolution Dual Tree Complex Wavelet Transform (DT-CWT) was presented in (Priya & Rajesh, 2010). Experiments with ORL and Yale databases show, that this method and its block-based modification get very good results under illumination, perspective and expression variations conditions compared to PCA and global DT-CWT, while keeping low computational complexity.
In (Tan & Triggs, 2010) original LBP method used for face recognition was extended. More efficient preprocessing was proposed to eliminate illumination variances using LTP (local ternary patterns) – generalization and enhancement of the original LBP texture descriptor. By replacing the local histogram with a distance transform based similarity metrics the performance of the LBP/LTP face recognition was further improved. Experiments under difficult lighting conditions with Face Recognition Grand Challenge, Extended Yale-B, and CMU PIE databases provide results comparable to up to date methods.
Another extension of the LBP algorithm was presented in (Lei et al., 2008). The face image is first decomposed by multi-scale and multi-orientation Gabor filters. Local binary pattern analysis is then applied on the derived Gabor magnitude responses. Using the FERET database with 1 image per person in the gallery, the method achieved results outperforming LBP, PCA and FLDA.
To improve the recognition accuracy, it helps to add some synthetic samples of each subject to the learning process. Standard procedures to create synthetic samples are the parallel deformation method (generating novel views of a single face image under different poses) (Tan et al., 2006) and modification of the original images by noise or filtering. In (Xu & Yang, 2009) a feature extraction technique called Local Graph Embedding Discriminant Analysis (LGEDA) was proposed, where the imitated images were generated using a mean filter.
In (Su et al., 2010) an Adaptive Generic Learning (AGL) method was described. To better distinguish the persons with single face sample, a generic discriminant model was adopted. As a specific implementation of the AGL, a Coupled Linear Representation (CLR) algorithm was proposed to infer, based on the generic training set, the within-class scatter matrix and the class mean of each person given its single enrolled sample. Thus, the traditional Fisher’s Linear Discriminant (FLD) can be applied to one sample problem task. Experiments are taken on images from FERET, XM2VTS, CAS-PEAL databases and a private passport database. The results show, that the Adaptive Gabor-FLD outperforms other methods like E(PC)2A, LBP and other FLD implementations. The proposed method is related to methods using virtual sample generation although it does not explicitly generate any virtual sample.
3. Face recognition methods
We use various methods in order to deeply explore the behavior of face recognition methods for single sample problem and to compare the methods using multiple face samples - both real-world samples and virtually generated samples. Used methods are briefly introduced in this subchapter.
3.1. Methods based on principal component analysis - PCA (PCA, 2D PCA and KPCA)
3.1.1. Principal component analysis - PCA
One of the most successful techniques used in face recognition is principal component analysis (PCA). The method based on PCA is named eigenface and was pioneered by Turk and Pentland (Turk & Pentland, 1991). In this method, each input image must be transformed into one dimensional image vector and set of these vectors forms input matrix. So the main idea behind PCA is that each n-dimensional face image can be represented as a linearly weighted sum of a set of orthonormal basis vectors.
This standard statistical method can be used for feature extraction. Principal component analysis reduces the dimension of input data by a linear projection that maximizes the scatter of all projected samples (Bishop, 1995).
For classification of projected samples Euclidean distance or other metrics can be used. Mahalanobis Cosine (MahCosine) is defined as the cosine of the angle between the image vectors that were projected into the PCA feature space and were further normalized by the variance estimates (Beveridge et al., 2003).
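The projection and the MahCosine comparison described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used in the experiments: the function names (`fit_pca`, `project`, `mahcosine_distance`) are hypothetical, the principal axes are obtained by an SVD of the centered data, and the sign convention (negative cosine, so that smaller values mean more similar faces) is an assumption.

```python
import numpy as np

def fit_pca(X, n_components):
    """X: (n_samples, n_pixels) matrix of vectorized training images."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centered data yields the principal axes (rows of Vt)
    # without forming the large n_pixels x n_pixels covariance matrix.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    basis = Vt[:n_components]                      # orthonormal eigenfaces
    variances = (s[:n_components] ** 2) / max(len(X) - 1, 1)
    return mean, basis, variances

def project(x, mean, basis):
    # Coefficients of the image in the eigenface basis.
    return basis @ (x - mean)

def mahcosine_distance(u, v, variances, eps=1e-12):
    # Normalize each PCA coefficient by its standard deviation, then
    # return the negative cosine of the angle (smaller = more similar).
    m = u / np.sqrt(variances + eps)
    n = v / np.sqrt(variances + eps)
    return -float(m @ n / (np.linalg.norm(m) * np.linalg.norm(n) + eps))
```

A probe image would then be assigned to the gallery subject whose projected sample has the smallest distance to the projected probe.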
3.1.2. Two-dimensional PCA – 2D PCA
PCA is a well-known feature extraction method, mostly used as a baseline for comparison purposes. Several extensions of PCA have been proposed. A major problem of PCA lies in the computation of the covariance matrix, which is computationally expensive. This computation can be significantly reduced by computing PCA features for columns (or rows) without previous matrix-to-vector conversion. This approach is called two-dimensional PCA (Yang et al., 2004). The main idea behind 2D PCA is the projection of image columns (rows) onto the eigenvectors of a covariance matrix computed as the average of the covariance matrices of each column for all training images. Let $A_i$ be an $m \times n$ image matrix and let the average image be defined as $\bar{A} = \frac{1}{k}\sum_{i=1}^{k} A_i$, where $k$ is the number of all training images. Then the image covariance matrix can be calculated by
$$G = \frac{1}{k}\sum_{i=1}^{k} (A_i - \bar{A})^T (A_i - \bar{A}) \qquad (1)$$
Equation (1) reveals that the image covariance matrix can be obtained from the outer product of column (row) vectors of images, assuming the training images have zero mean.
For that reason, we claim that original 2D PCA works in the column direction of images. Result of feature extraction is then a matrix instead of a vector. Feature matrix has the same number of columns (rows) as width (height) of face image.
The extraction of image features is computationally more efficient using 2D PCA than PCA, since the image covariance matrix is quite small compared to the covariance matrix in PCA (whose size, with Turk and Pentland's optimization, depends on the number of training images). 2D PCA is not only more efficient than PCA, it can also reach higher recognition accuracy (Yang et al., 2004).
Despite its better efficiency, 2D PCA has one disadvantage: it needs more coefficients for image representation than PCA. On the other hand, because the size of the image covariance matrix for 2D PCA equals the width of the images, which is quite small compared to the size of the covariance matrix in PCA, 2D PCA evaluates the image covariance matrix more accurately and computes the corresponding eigenvectors more efficiently than PCA.
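The following NumPy sketch illustrates the 2D PCA idea. It assumes that the image covariance matrix is the average of $(A_i - \bar{A})^T (A_i - \bar{A})$ over the training images, as in (Yang et al., 2004); the function names and the array layout (height by width) are hypothetical choices made for illustration.

```python
import numpy as np

def fit_2dpca(images, n_components):
    """images: array of shape (k, m, n) holding k training images."""
    mean_img = images.mean(axis=0)
    centered = images - mean_img
    # Image covariance matrix: average of (A_i - mean)^T (A_i - mean).
    # Its size is n x n, i.e. it depends only on the image width.
    G = np.einsum('kij,kil->jl', centered, centered) / len(images)
    eigvals, eigvecs = np.linalg.eigh(G)           # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean_img, eigvecs[:, order]             # (n, d) projection axes

def extract_features(image, axes):
    # The result is an m x d feature matrix, not a vector.
    return image @ axes
```

For images of 65x75 pixels stored as 75 rows by 65 columns, the matrix `G` is only 65 x 65, whereas the covariance matrix of the vectorized images would be 4875 x 4875. The price is the larger feature representation: `d` columns of 75 coefficients each.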
3.1.3. Kernel PCA – KPCA
PCA is a linear algorithm and cannot capture nonlinear structure in the data. Kernel PCA (Müller et al., 2001) computes a nonlinear form of PCA. Instead of performing nonlinear PCA directly, it implicitly computes linear PCA in a high-dimensional feature space that is nonlinearly related to the input space.
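A compact sketch of kernel PCA is given below. The Gaussian (RBF) kernel, the parameter `gamma` and all function names are assumptions made for illustration; the text does not state which kernel was used in the experiments. The key step is that linear PCA in the feature space reduces to an eigendecomposition of the centered Gram matrix.

```python
import numpy as np

def rbf_kernel_matrix(X, Y, gamma):
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_kpca(X, n_components, gamma):
    n = len(X)
    K = rbf_kernel_matrix(X, X, gamma)
    one = np.full((n, n), 1.0 / n)
    # Center the data in feature space by centering the Gram matrix.
    Kc = K - one @ K - K @ one + one @ K @ one
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:n_components]
    # Scale coefficients so that feature-space eigenvectors have unit norm.
    alphas = eigvecs[:, idx] / np.sqrt(eigvals[idx])
    return {"X": X, "K": K, "alphas": alphas, "gamma": gamma}

def transform_kpca(model, Xnew):
    K, n = model["K"], len(model["K"])
    K_new = rbf_kernel_matrix(Xnew, model["X"], model["gamma"])
    one_n = np.full((n, n), 1.0 / n)
    one_m = np.full((len(Xnew), n), 1.0 / n)
    # Apply the same feature-space centering to the new kernel values.
    Kc = K_new - one_m @ K - K_new @ one_n + one_m @ K @ one_n
    return Kc @ model["alphas"]
```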
3.2. Support vector machine - SVM
Support vector machines (SVM) (Asano, 2006; Hsu et al., 2003; Müller et al., 2001; Boser et al., 1992) are based on the concept of decision planes that define optimal boundaries. Their fundamental idea is very simple: the boundary is placed so as to achieve the largest possible distance to the vectors of the different sets. An example is shown in Fig. 1, which illustrates a linearly separable problem. For linearly nonseparable problems, kernel methods are used. The concept of a kernel method is a transformation of the vector space into a higher-dimensional space.
The kernel function is defined as follows:
$$K(\mathbf{x}, \mathbf{x}') = \Phi(\mathbf{x})^T \Phi(\mathbf{x}')$$
The kernel function is thus the inner product of $\mathbf{x}$ and $\mathbf{x}'$ evaluated in the higher-dimensional space produced by the nonlinear mapping $\Phi$; it plays the role of a similarity (distance-like) measure in that space.
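A small worked example may help to see why the kernel avoids the explicit transformation. The sketch below uses a homogeneous polynomial kernel of degree 2 on two-dimensional inputs, chosen here only because its mapping can be written out by hand; the function names are hypothetical. Evaluating the kernel in the input space gives the same number as an inner product in the three-dimensional feature space.

```python
import math

def phi(x):
    # Explicit mapping from 2-D input space to a 3-D feature space.
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def poly2_kernel(x, y):
    # Same value as dot(phi(x), phi(y)), computed without building phi.
    return dot(x, y) ** 2
```

For example, with x = (1, 2) and y = (3, -1) both `poly2_kernel(x, y)` and `dot(phi(x), phi(y))` evaluate to 1 (up to rounding in the second case).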
3.3. Methods based on neural networks (MLP, RBF network)
Neural network (Bishop, 1995; Haykin, 1994; Oravec et al., 1998) is a massive parallel processor that is inspired by biological nervous systems. Neural network is able to learn and to adapt its free parameters (connections between neurons known as synaptic weights are adjusted during the learning process).
3.3.1. Multilayer perceptron
Multilayer perceptron operates with functional and error signals. The functional signal propagates forward starting at the network input and ending at the network output as an output signal. The error signal originates at output neurons during the learning and propagates backward. MLP is trained by backpropagation algorithm.
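MLP represents a nested sigmoidal scheme (Haykin, 1994). Its form for a single output neuron is
$$F(\mathbf{x}, \mathbf{w}) = \varphi\Big(\sum_j w_{oj}\,\varphi\Big(\sum_k w_{jk}\,\varphi\Big(\cdots\,\varphi\Big(\sum_i w_{li}\,x_i\Big)\Big)\Big)\Big)$$
where $\varphi(\cdot)$ is a sigmoidal activation function, $w_{oj}$ is the synaptic weight from neuron $j$ in the last hidden layer to the single output neuron $o$, and so on for the other synaptic weights, and $x_i$ is the $i$-th element of the input vector $\mathbf{x}$. The weight vector $\mathbf{w}$ denotes the entire set of synaptic weights ordered by layer, then neurons in a layer, and then number in a neuron.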
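The nested sigmoidal scheme corresponds directly to the forward pass of the network: each layer applies the sigmoid to a weighted sum of the outputs of the previous layer. The sketch below shows only this forward computation (no backpropagation); the function names and the explicit bias terms are assumptions made for illustration.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def mlp_forward(x, weights):
    """weights: list of (W, b) pairs ordered from the first hidden layer
    to the output layer; the last W has a single row (one output neuron)."""
    signal = np.asarray(x, dtype=float)
    for W, b in weights:
        # Sigmoid of a weighted sum of the previous layer's outputs;
        # repeating this layer by layer yields the nested form.
        signal = sigmoid(W @ signal + b)
    return float(signal[0])
```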
3.3.2. Radial basis function network
Radial basis function network (RBF) (Oravec et al., 1998; Hlaváčková, 1993) is a feedforward network consisting of input, one hidden and output layer. Input layer distributes input vectors into the network, hidden layer represents RBFs hi. Linear output neurons compute linear combinations of their inputs. RBF network topology is shown in Fig. 2.
RBF network is trained in three steps:
Determination of centers of the hidden neurons
Computation of additional parameters of RBFs
Computation of output layer weights.
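The output of the network is
$$f(\mathbf{x}) = \sum_{i} w_i\, h_i(\mathbf{x})$$
where $\mathbf{x}$ is the input of the RBF activation functions $h_i$ and $w_i$ are the output weights. The network output is thus a linear combination of RBFs.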
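The three training steps can be sketched as follows. The concrete choices are assumptions made only for illustration and are not taken from the experiments: centers are a random subset of the training vectors, a single shared Gaussian width is set to the mean distance between centers, and the output weights are found by linear least squares.

```python
import numpy as np

def hidden_outputs(X, centers, sigma):
    # Gaussian radial basis functions h_i evaluated for every input row.
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def train_rbf(X, T, n_centers, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: centers, here simply a random subset of the training vectors.
    centers = X[rng.choice(len(X), size=n_centers, replace=False)]
    # Step 2: additional parameters, here one shared width taken as the
    # mean distance between distinct centers.
    d = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
    sigma = d[d > 0].mean() if n_centers > 1 else 1.0
    # Step 3: output layer weights by linear least squares.
    H = hidden_outputs(X, centers, sigma)
    W, *_ = np.linalg.lstsq(H, T, rcond=None)
    return centers, sigma, W

def rbf_predict(X, centers, sigma, W):
    # Network output: linear combination of the radial basis functions.
    return hidden_outputs(X, centers, sigma) @ W
```

With one-hot target rows in `T` (one column per subject), the predicted identity is the column with the largest output.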
3.4. Local binary patterns – LBP
Local binary patterns (LBP) were first described in (Ojala et al., 1996). LBP is a computationally efficient descriptor that captures micro-structural properties and was proposed for texture classification. The operator labels the pixels of an image by thresholding the 3x3 neighbourhood of each pixel with the center value and considering the result as a binary number. Later, the LBP operator was extended to use circular neighborhoods of different sizes, with the pixel values bilinearly interpolated (Fig. 3).
Another extension uses just uniform patterns. A local binary pattern is called uniform if it contains at most two bitwise transitions from 0 to 1 or vice versa when the binary string is considered circular. For example, 00000000, 00011110 and 10000011 are uniform patterns. Such patterns represent important features in the image like corners or edges. Uniform patterns account for most of the patterns in images (Ojala et al., 1996).
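The basic operator and the uniformity test are simple enough to write down directly. In the sketch below, the starting neighbor, the clockwise ordering and the use of "greater than or equal" for thresholding are assumptions (conventions differ between implementations); the function names are hypothetical.

```python
def lbp_code(patch):
    """patch: 3x3 nested list of gray values; returns the 8-bit LBP code."""
    center = patch[1][1]
    # Neighbors visited clockwise, starting at the top-left pixel.
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for r, c in order:
        code = (code << 1) | (1 if patch[r][c] >= center else 0)
    return code

def is_uniform(code, bits=8):
    # Count 0/1 transitions around the circular bit string.
    transitions = 0
    for i in range(bits):
        a = (code >> i) & 1
        b = (code >> ((i + 1) % bits)) & 1
        transitions += a != b
    return transitions <= 2
```

With this test, the three example patterns 00000000, 00011110 and 10000011 are classified as uniform, while a pattern such as 01010101 is not. Of the 256 possible 8-bit codes, 58 are uniform.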
A system using LBP for face recognition is proposed in (Ahonen et al., 2004, 2006). Image is divided into non-overlapping regions. In each region a histogram of uniform LBP patterns is computed, the histograms are concatenated into one histogram (see Fig. 4 for illustration), which represents features extracted from the image in 3 levels (pixel, region and whole image).
The χ² metric is used as the distance metric for comparing the histograms:
$$\chi^2(S, M) = \sum_i \frac{(S_i - M_i)^2}{S_i + M_i}$$
where $S$ and $M$ are the histograms to be compared and $i$ is the index of the histogram bin.
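Matching with LBP then reduces to a nearest-neighbor search over concatenated histograms. The sketch below computes the χ² distance as the sum over bins of the squared bin difference divided by the bin sum; skipping bins that are empty in both histograms is an assumption made here to avoid division by zero, and the function names are hypothetical.

```python
def chi_square(s, m):
    # Sum over bins of (S_i - M_i)^2 / (S_i + M_i); empty bins are skipped.
    total = 0.0
    for si, mi in zip(s, m):
        if si + mi > 0:
            total += (si - mi) ** 2 / (si + mi)
    return total

def nearest_neighbor(probe, gallery):
    """gallery: dict mapping subject id -> concatenated LBP histogram."""
    return min(gallery, key=lambda subject: chi_square(probe, gallery[subject]))
```

Because no model is fitted, adding a sample to the gallery only adds one more candidate to this search.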
4. Face database
We used images selected from the FERET image database (Phillips et al., 1998). The FERET face image database is the de facto standard database in face recognition research. It is a complex and large database, which contains 14126 images of 1199 subjects with dimensions of 256 x 384 pixels. Images differ in head position, lighting conditions, beard, glasses, hairstyle, expression and age of subjects.
We worked with grayscale images from Gray FERET (FERET Database, 2001). We selected an image set containing a total of 665 images from 82 subjects. It consists of all available subjects from the whole FERET database that have more than 4 frontal images with corresponding eye coordinates (i.e. the largest possible set fulfilling these conditions was chosen). The image sets used are visualized in Fig. 5.
The images were preprocessed. Our preprocessing consists of
geometric normalization (aligning according to eye coordinates)
masking (cropping an ellipse around the face)
resizing to 65x75pix
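The masking and resizing steps can be illustrated as below. This is a simplified sketch under stated assumptions: the input is taken to be already aligned by the eye coordinates (the geometric normalization itself is not shown), the ellipse is the one inscribed in the image, pixels outside it are set to zero, and nearest-neighbor resampling stands in for whatever resizing method was actually used. All function names are hypothetical.

```python
import numpy as np

def elliptical_mask(height, width):
    # Boolean mask that is True inside the ellipse inscribed in the image.
    rows, cols = np.ogrid[:height, :width]
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    ry, rx = height / 2.0, width / 2.0
    return ((rows - cy) / ry) ** 2 + ((cols - cx) / rx) ** 2 <= 1.0

def resize_nearest(image, out_h, out_w):
    # Nearest-neighbor resampling; a stand-in for an interpolating resize.
    r = np.arange(out_h) * image.shape[0] // out_h
    c = np.arange(out_w) * image.shape[1] // out_w
    return image[r[:, None], c[None, :]]

def mask_and_resize(aligned, out_h=75, out_w=65):
    masked = aligned.astype(float)                 # astype returns a copy
    masked[~elliptical_mask(*aligned.shape)] = 0.0
    return resize_nearest(masked, out_h, out_w)
```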
Fig. 6 shows an example of the original image and the image after preprocessing.
5. Simulation results for 1 - 4 original training images per subject
In our experiments with original training images we compared the efficiency of several algorithms in scenarios with 1 (single sample problem), 2, 3 and 4 images/subject. We carefully selected algorithms generally considered to play a major role in today's face recognition research. Standard PCA was also included for comparison purposes. All these methods are briefly reviewed in subchapter 3. Face recognition methods.
In each test with different number of images in training set we made 4 runs with different selection of the images into the training set: original one with choosing the first images alphabetically by name and 3 additional training/testing collections with randomly shuffled images. The final test results are the average from these 4 values.
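The evaluation protocol described above (one alphabetical selection plus three random selections, averaged) can be sketched as follows. The data layout, the function names and the fixed random seed are hypothetical; `evaluate` stands for any of the recognition methods and is assumed to return accuracy in percent.

```python
import random

def make_splits(images_by_subject, n_train, n_random_runs=3, seed=0):
    """images_by_subject: dict subject -> list of image file names."""
    rng = random.Random(seed)
    splits = []
    for run in range(1 + n_random_runs):
        train, test = [], []
        for subject, names in sorted(images_by_subject.items()):
            ordered = sorted(names)                # run 0: alphabetical order
            if run > 0:
                rng.shuffle(ordered)               # later runs: random order
            train += [(subject, n) for n in ordered[:n_train]]
            test += [(subject, n) for n in ordered[n_train:]]
        splits.append((train, test))
    return splits

def mean_accuracy(splits, evaluate):
    # evaluate(train, test) must return recognition accuracy in percent.
    scores = [evaluate(train, test) for train, test in splits]
    return sum(scores) / len(scores)
```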
Our results are summarized in Table 1 and in Fig. 7. All figures and tables in this chapter contain values whose meaning is recognition accuracy in % achieved on test sets. The notation n_train means n images (samples) per subject in training sets.
Presented results are summarized as follows:
Neural networks and SVM
For single sample per person training sets, methods based on neural networks (RBF network and MLP) and also SVM achieved less favorable results (below 70%). Extending the training sets by a second sample per person slightly increased the test results for the MLP and SVM methods. For the RBF network, the second sample improved the result to above 85%. Adding a third sample per person to the training sets caused a significant improvement of test results (above 90% accuracy was achieved for RBF and SVM). Adding more than four samples per person into training sets has only a minimal effect on the recognition results and has a negative impact on computational and time complexity. In general, the larger the training set, the better the recognition results.
PCA with the Euclidean distance metric as a reference method shows that more images per subject in the training set lead to more accurate recognition, improving from 68% with 1 img./subj. to 89% with 4 img./subj. Although it has been reported that 2D PCA can reach higher accuracy, PCA slightly outperformed 2D PCA in our experiments. However, 2D PCA still has a big advantage over PCA in its faster training, due to the smaller covariance matrix. As shown in (Li-wei et al., 2005), 2D PCA is equivalent to block-based PCA, which means that it uses only several parts of the covariance matrix used in PCA. In other words, we lose information from the rest of the covariance matrix, which can lead to worse recognition rates. KPCA achieved slightly better results compared to 2D PCA (KPCA is included for comparison purposes here and it will not be used further within this chapter).
PCA+SVM method is a two-stage setup including both feature extraction and classification. Features are first efficiently extracted by PCA with optimal truncating the vectors from the transform matrix. The parameters for the selection of the transformation vectors are based on our previous research (Oravec et al., 2010). The classification stage is performed by SVM. SVM model is created with the best parameters found using cross-validation on the training set. PCA+SVM has very good recognition rate even with 1 img./subj. and with 3 and 4 img./subj. it outperforms all other methods in our tests reaching 97% recognition rate with 4 img./subj.
In our experiments, we used local binary patterns method for face recognition in 3 different modifications. The image is divided into 5x5 or 7x7 blocks from which the concatenated histogram is computed. The “LBP 7x7w” modification adds also weighting of the histogram with different weights according to corresponding image regions. This weighting has been proposed in (Ahonen et al., 2004).
Results for all LBP methods are the best in our tests and were outperformed only slightly by the PCA+SVM method with 3 and 4 img./subj. The main characteristic of LBP is that the recognition results are very good even for 1 img./subj. From the graph in Fig. 7 we see that the recognition rates for the three LBP methods run parallel to each other. The LBP 5x5 variant starts at 83% and reaches 94% accuracy with 4 img./subj. LBP 7x7 is approximately 1.5% better than the 5x5 modification, and LBP 7x7w is more than 2% better, reaching almost 97% accuracy with 4 img./subj.
Within this chapter, we work with images of size 65x75 pixels after preprocessing. In Table 2, results for image size 130x150pix (FERET default standard) are shown for illustration. Generally, larger size of images can yield slightly better recognition rates.
6. Simulation results for training sets enlarged by generating new samples
In the previous subchapter, we presented recognition results for methods trained by 1 img./subj. in training sets. We also presented the comparison to results for 2, 3 and 4 img./subj. in training sets, while 2nd, 3rd and 4th images were the original images, i.e. the images were real, taken from the original face database.
Herein we consider different situation: only 1 original sample is available and we try to enhance recognition accuracy by generating new samples to the training sets in artificial manner. Thus, we try to enlarge the training sets by generating new (virtual, artificial) samples. We propose to generate new samples by modifying single available original image in different ways – this is why we will use the term image modification (or modified image). Natural continuation of such approach leads to generating synthetic face images.
In our tests we use different modifications of available single per person images: adding noise, applying wavelet transform and performing geometric transformation.
6.1. Modifications of face images by adding Gaussian noise
Noise in face images can seriously affect the performance of face recognition systems (Oravec et al., 2010). Each image capturing generates digital or analog noise of diverse intensity. The noise is also generated while transmitting and copying analog images. Noise generation is a natural property for image scanning systems. Herein we use noise for generating modified samples of original image. In our modifications, we use Gaussian (Truax, 1999) noise.
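Gaussian noise was generated using the Gaussian distribution function
$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where $\mu$ is the mean and $\sigma^2$ is the variance of the noise.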
Gaussian noise was applied on each image with zero mean and in two random intervals of variance. Examples of images degraded by Gaussian noise can be seen in Fig. 8. The labels 03-06noise and 08-16noise mean that the variance of Gaussian noise is random between values 0.03 - 0.06 and 0.08 - 0.16, respectively. The same notation is used also in presented graphs and tables (Tab. 3a and 3b, Fig. 9a and 9b). Noise parameters settings for our simulations were determined empirically. Training sets were created by noise modification of samples added to the original one (1+1noise, 1+2noise and 1+3noise).
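Generating the noisy training samples can be sketched as below. It is assumed here that pixel intensities are scaled to the range [0, 1], so that variances such as 0.03 to 0.06 are meaningful, and that the variance is drawn uniformly from the given interval for each new sample; both points, like the function names, are assumptions rather than details stated in the text.

```python
import numpy as np

def add_gaussian_noise(image, var_range, rng):
    """image: float array scaled to [0, 1]; var_range: (low, high) variance."""
    variance = rng.uniform(*var_range)
    noisy = image + rng.normal(0.0, np.sqrt(variance), size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def enlarge_with_noise(image, n_new, var_range=(0.03, 0.06), seed=0):
    # Builds a 1+n training set: the original plus n_new noisy copies.
    rng = np.random.default_rng(seed)
    return [image] + [add_gaussian_noise(image, var_range, rng) for _ in range(n_new)]
```

Calling `enlarge_with_noise(image, 3, var_range=(0.08, 0.16))` would correspond to a 1+3noise training set with the stronger 08-16noise setting.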
Neural networks and SVM
The improvement for RBF, MLP and SVM is clearly visible. In both noise modifications (03-06noise and 08-16noise), the most significant increase in accuracy of test results is achieved by RBF network (about 80% for 1+3 training sets). Similarly to the tests in subchapter 5, adding more samples into training sets has a constant effect on the recognition results.
The results of the PCA and 2D PCA methods are only slightly affected when additional images with different amounts of noise are added to the training set. The results with the noise images added are approximately 1% worse than the original recognition rate with 1 img./subj. The reason for this effect probably lies in the fact that the transformation matrix computed from training samples with added noise represents the variances in the space worse than a matrix computed from original images only. Adding samples to the training set is also very uneconomical from the point of view of PCA methods, since the time needed to compute the transform matrix grows.
The effect observed with PCA can also be observed with the PCA+SVM method. Adding the noise images to the training set leads to results about 1% worse than with the original training set in every scenario. The SVM classification model is influenced by the features extracted from noisy samples, but this accuracy drop is not dramatic.
The results of LBP methods are not influenced by the noisy samples at all. This has two reasons:
With the LBP method, no model or transformation is calculated from the training images, so there cannot be a global effect on the recognition results such as with PCA or SVM.
The histograms of LBP patterns in noisy images change rapidly, so the distance between the noisy image and the original image of the same person is higher than the distance between two original images of different persons. The consequence is that the minimal distances between the testing and training images do not change and the results are the same as without the noisy images in the training set. See Table 4 for an illustration of the distances between original and noisy images.
6.2. Modifications of face images based on wavelets
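The discrete wavelet transform (DWT) of a function f(x) can be written as
$$W(j, k) = \sum_{x} f(x)\, 2^{-j/2}\, \psi(2^{-j}x - k)$$
where j is the power of binary scaling, k is a constant of the filter and function ψ is a basic wavelet, f(x) is a function which is to be transformed.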
Our modifications of face images were done by three steps:
Forward transform of image by DWT
Setting horizontal, diagonal and vertical details in frequency spectrum
Image reconstruction by inverse DWT
We used two types of wavelets: Reverse biorthogonal 2.4 (Vargic & Procháska, 2005) and Symlets 4 (Puyati et al., 2006) (Fig. 11.). These wavelets were chosen empirically – our aim was to produce slight change in the expression of a face. The training sets were created similarly to those with the noise modification (1+1, 1+2 and 1+3), see subchapter 6.1. An example of new samples is shown in Fig. 12.
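The three-step modification (forward DWT, changing the detail coefficients, inverse DWT) is sketched below. To keep the example self-contained, a single-level Haar wavelet is swapped in for the Reverse biorthogonal 2.4 and Symlets 4 wavelets used in the experiments, and the detail coefficients are simply scaled by a factor `detail_gain`; both the wavelet and the way the details are set are assumptions for illustration only. The sketch also requires even image dimensions, so a 65x75 image would first have to be padded or cropped.

```python
import numpy as np

def haar_dwt2(img):
    # Single-level 2-D Haar transform of an image with even height and width.
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    approx = (a + b + c + d) / 2.0
    horiz = (a + b - c - d) / 2.0
    vert = (a - b + c - d) / 2.0
    diag = (a - b - c + d) / 2.0
    return approx, (horiz, vert, diag)

def haar_idwt2(approx, details):
    horiz, vert, diag = details
    out = np.empty((2 * approx.shape[0], 2 * approx.shape[1]))
    out[0::2, 0::2] = (approx + horiz + vert + diag) / 2.0
    out[0::2, 1::2] = (approx + horiz - vert - diag) / 2.0
    out[1::2, 0::2] = (approx - horiz + vert - diag) / 2.0
    out[1::2, 1::2] = (approx - horiz - vert + diag) / 2.0
    return out

def modify_by_wavelet(img, detail_gain):
    # Step 1: forward transform; step 2: set the horizontal, vertical and
    # diagonal details (scaled here); step 3: inverse transform.
    approx, details = haar_dwt2(img)
    scaled = tuple(detail_gain * band for band in details)
    return haar_idwt2(approx, scaled)
```

A gain of 1.0 reproduces the input image, while gains below or above 1.0 attenuate or emphasize fine detail and so yield slightly different versions of the same face.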
Neural networks and SVM
Experiment with wavelet transform demonstrated improvement of one sample per person face recognition using neural network methods - RBF network and MLP. These methods confirmed increase of recognition rate with extending the training sets with images modified by wavelet transform. Improvement above 10% was achieved for RBF network with adding three samples per person (1+3_train) into training sets. On the other hand, SVM method achieved very low face recognition accuracy.
Experiments with extending the training set with images modified by wavelet transform show that there is only a small influence to the results of PCA and 2D PCA methods. The accuracy increases when adding the images with stronger wavelet modification. The accuracy of recognition results is only 1% higher than original 1img./subj. The modified images do not cause any significant change in recognition and so there is almost no gain by adding new sample.
In contrast to PCA, accuracy decreases when SVM is also involved. When 3 images modified by wavelets are added to the training set, the recognition result is almost 30% worse than when using the original image only. In this case, only 50% accuracy can be obtained.
LBP
The effect of the wavelet modifications on the LBP histogram is similar to that of the noisy images, so the LBP results stay the same as with the original training set.
6.3. Modifications of face images based on geometry
One of the most successful approaches to sample generation is based on geometric transformation. The idea is to learn some suitable manifolds and extend the training set by new synthetic poses or expressions based on the original image (Wen et al., 2003). Because the generation of new samples is based on facial features and their position on the face, these features must be localized first.
After all facial features are properly localized and represented by contour and middle points, the next step is to generate target expressions. Because a change of expression involves moving the detected feature points, the texture information needs to be changed as well. Real expressions and the direction of movements during an expression depend on the strength of muscle contractions. We divided each face image into triangles according to the direction of these contractions. The localization of face features and the division into triangles (also called triangulation) is fully automated (unlike the usual manual method described in (Yang & Chiang, 2007)) using active shape models (Milborrow, 2008). Using active shape models produces very precise positions of facial features and facial boundaries. The result of triangulation is a facial graph consisting only of triangles between the detected points that determine the facial features.
Using a rule-based system similar to the one described in (Yang & Chiang, 2007), we generated different expressions from each training sample by moving the locations of points in the facial graph. The texture in each triangle containing moved points is then interpolated from the original according to the new coordinates. This procedure, applied with different rules, creates new “smile” and “sad” expressions (Fig. 14) and represents a more sophisticated approach to generating additional training samples.
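The texture interpolation for a single triangle can be sketched as follows. This is a minimal illustration, assuming an affine mapping via barycentric coordinates, nearest-neighbour sampling and integer pixel coordinates given as (x, y); the function names and the example triangles are hypothetical and do not reproduce the rules of our system.

```python
def barycentric(p, tri):
    """Barycentric coordinates of point p = (x, y) with respect to triangle tri."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    l1 = ((y2 - y3) * (p[0] - x3) + (x3 - x2) * (p[1] - y3)) / det
    l2 = ((y3 - y1) * (p[0] - x3) + (x1 - x3) * (p[1] - y3)) / det
    return l1, l2, 1.0 - l1 - l2

def warp_triangle(src, dst, src_tri, dst_tri, eps=1e-9):
    """Fill the pixels of dst inside dst_tri with texture taken from src_tri in src."""
    height, width = len(src), len(src[0])
    xs = [x for x, _ in dst_tri]
    ys = [y for _, y in dst_tri]
    for y in range(max(0, min(ys)), min(height - 1, max(ys)) + 1):
        for x in range(max(0, min(xs)), min(width - 1, max(xs)) + 1):
            weights = barycentric((x, y), dst_tri)
            if min(weights) < -eps:
                continue  # pixel lies outside the destination triangle
            # The same weights applied to the source vertices give the source position.
            sx = sum(w * vx for w, (vx, _) in zip(weights, src_tri))
            sy = sum(w * vy for w, (_, vy) in zip(weights, src_tri))
            sx = min(width - 1, max(0, int(round(sx))))
            sy = min(height - 1, max(0, int(round(sy))))
            dst[y][x] = src[sy][sx]  # nearest-neighbour sampling
    return dst

# Toy 5x5 "image" whose pixel value encodes its position (10 * y + x).
image = [[10 * y + x for x in range(5)] for y in range(5)]
src_tri = [(0, 0), (4, 0), (0, 4)]   # triangle in the original image
dst_tri = [(0, 0), (4, 0), (2, 4)]   # third vertex moved from (0, 4) to (2, 4)
warped = warp_triangle(image, [row[:] for row in image], src_tri, dst_tri)
```

Applying such a warp to every triangle that contains a moved point yields the new sample; bilinear sampling would give smoother textures than the nearest-neighbour choice made here.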
Simulation results for geometric modifications are summarized in Table 6 and Fig. 15. Only the results for the SMILE expression were included in the graph, since this expression helps to improve recognition. This agrees with the fact that the face database contains more faces with smiles than sad faces. It also makes it possible to present results consistent with the other graphs – 1, 2 and 3 samples per face.
The results are summarized as follows:
Neural networks and SVM
Both the RBF network and MLP achieved better recognition accuracy using SMILE face expression images (an increase of about 10% compared with one sample per person). Tests with extending the training set by SMILE+SAD face expression images were most effective for the MLP method (75.61%). For the SVM method, these new samples caused a drop in recognition rate of about 25%, similar to the wavelet transform.
For PCA and 2D PCA, the geometric transformation results show an influence comparable to that of the wavelet modifications. The accuracy increases when samples with the SMILE expression are added, but it is only 1% higher than for the original 1 img./subj. The modified images do not cause any significant change in recognition. An improvement could be expected when more face expressions are taken into account.
Adding one image modified by geometry to the training set (either SAD or SMILE modification) improved the recognition rate by only about 0.2-0.3% (adding the SMILE transformation helps slightly more). Surprisingly, when both transformed images were added to the training set, the recognition rate dropped by almost 5%.
As expected, adding transformed images with an artificial change of expression (SAD and SMILE emotion) to the training set improves recognition. The LBP method reaches better results because the system becomes more robust to changes in expression. Better results are reached when both transformed images (SAD+SMILE) are used. When the images in the test set are also transformed (for every sample, distances for its SAD and SMILE transformations are computed as well), the results are even better, yielding 87.22% accuracy for the LBP 7x7w method with 1 img./subj.
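Matching with transformed test images can be sketched as a nearest-neighbour search over all variants of the test sample. This is a minimal illustration: the short feature vectors stand in for LBP histograms, and the L1 distance, the labels and the function names are assumptions made for the example.

```python
def l1_distance(u, v):
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def classify(test_variants, gallery, distance=l1_distance):
    """Return (label, distance) of the closest gallery sample over all test variants."""
    best_label, best_dist = None, float("inf")
    for label, samples in gallery.items():
        for sample in samples:
            for variant in test_variants:
                d = distance(variant, sample)
                if d < best_dist:
                    best_label, best_dist = label, d
    return best_label, best_dist

# One training feature vector per subject (hypothetical values).
gallery = {"subject_a": [[0.0, 1.0]], "subject_b": [[1.0, 0.0]]}
# Original test sample followed by one transformed variant of it.
test_variants = [[0.9, 0.9], [0.1, 0.8]]
label, dist = classify(test_variants, gallery)
```

Here the original test vector is equally far from both subjects, and only the transformed variant produces a clear match.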
6.4. Comments and summary for methods that are influenced significantly by enlarging training sets by adding modified samples
This subchapter deals with methods for which extending the training set by modified images influences recognition results significantly (compared to recognition using multiple original images). The modifications of images described above (noise, wavelets and geometric transformations) may be most helpful to neural networks. The comparison of recognition results for original training sets and extended training sets for the RBF network and MLP is shown in Fig. 16. In Fig. 16 (and similarly in Fig. 17), the horizontal axis represents the number of images per person in the training sets: for methods using original images, it means 1, 2, 3 and 4 original images in the training set; for modified images, it means 1 original image, or 1 original plus 1, 2 or 3 modified images. For the RBF network, an improvement above 10% was achieved using modified images. For MLP, geometric transformation was the most successful modification of face images (75.61%).
Fig. 17 shows the negative effects of adding newly generated samples to the training sets. This effect is clearly visible for PCA+SVM and SVM when the training sets are extended by images modified by the wavelet transform and geometric transformation.
In this chapter, we considered relevant issues related to one sample per person problem in the area of face recognition. We focused mainly on recognition efficiency of several methods working with single and multiple samples per subject. We researched techniques for enlargement of the training set by new (artificial, virtual or nearly synthetic) samples, in order to improve recognition accuracy. Such samples can be generated in many ways – we concentrated on modifications of the original samples by noise, wavelets and geometric transformation. We proposed methods for modifying expression of a subject by geometric transformation and by wavelet transform. We examined the impact of these extensions on various methods (PCA, 2D PCA, SVM, PCA+SVM, MLP, RBF and LBP variants).
Methods such as PCA+SVM or LBP achieved recognition results above 80% for single sample per person in the training set. For these methods, adding new samples (modified images) did not help significantly. On the other hand, the utilization of the extended training sets for neural networks (MLP and RBF network) always increased the face recognition rate. This confirms that an appropriate extension of the input data set enhances the learning process and the recognition accuracy. Adding more than three new samples per person into the training sets has almost no influence on the recognition rate and has a negative impact on the computational and time complexity. The SVM method improved recognition accuracy only for extension of the training set by noise modification of images.
Experimental results for PCA and 2D PCA show only negligible influence of adding modified samples. We can conclude that the use of modified samples for PCA and 2D PCA has no added value, especially when samples are modified by Gaussian noise only.
PCA+SVM (a two-stage method with PCA for feature extraction and SVM for classification) achieved very good results even for 1 img./subj. Adding any modified images to the training set did not improve the recognition rates, but the results were still among the best of the compared methods.
Our experiments show that LBP is one of the most efficient state-of-the-art methods in face recognition. Adding noise-modified and wavelet-modified images to the training set does not have any effect on the recognition rates of LBP – unlike other methods that use the training samples to compute models or transformation matrices. This is caused by the nature of the method: the histogram of LBP patterns of a noisy image differs too much from that of the original images. This can also be a disadvantage when the images in the test set are corrupted with noise. On the other hand, adding images with transformed face expression helps, and the system becomes more robust to expression changes in the images.
LBP for face recognition has obvious advantages, such as state-of-the-art recognition rates even with 1 img./subj. in the training set, no need to train models or transformation matrices, and good computational efficiency. There is still potential to improve the results by modifications and optimization, which can be researched further: selection of LBP patterns, different preprocessing or modifications of the LBP operator. The geometric transformation of images (emotional expression or head pose) and the generation of synthetic samples seem to be good ways to improve the results. Further research is needed, since a simple extension of the training set with modified images does not always help.
We are currently working on a more sophisticated geometric transformation to cover more facial expressions. Although the results in section 6.3 show only a small improvement (with the exception of MLP, where the improvement was significant), we believe there is great potential in using samples with synthetic expressions. The triangular model of the face makes it possible to extend the generation algorithm with other possibilities, such as the generation of samples with different poses and illumination conditions. In the future, we also plan to publish modules generating new samples (with different expressions, poses and illumination) for our universal biometric system BioSandbox (used in our experiments).
Modification of images using the wavelet transform also has great potential for generating new samples. One way to create new samples by the wavelet transform is a fusion of two face images, where a new image is generated by applying the wavelet transform to two original images, followed by suitable manipulations of the coefficients in the transformed space and finally merging the images by the inverse transform.
Using a mean filter (Xu & Yang, 2009) is another simple way of creating modified images. By using a mean filter with different kernels (2x2, 3x3…15x15), we achieved results close to those of the modifications by wavelet transform.
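A mean filter with a k x k kernel can be sketched as follows. This is a minimal illustration in plain Python; clipping the window at the image borders and the window placement for even kernel sizes are assumptions, and the function name is hypothetical.

```python
def mean_filter(img, k):
    """Smooth img (a list of rows) with a k x k mean filter.

    The window is clipped at the image borders, so border pixels are
    averaged over fewer values.
    """
    height, width = len(img), len(img[0])
    start = -(k // 2)  # for even k the window extends further up and to the left
    out = []
    for r in range(height):
        row = []
        for c in range(width):
            values = [img[rr][cc]
                      for rr in range(max(0, r + start), min(height, r + start + k))
                      for cc in range(max(0, c + start), min(width, c + start + k))]
            row.append(sum(values) / len(values))
        out.append(row)
    return out

# A single bright pixel is spread over its neighbourhood.
spike = [[0, 0, 0],
         [0, 9, 0],
         [0, 0, 0]]
smoothed = mean_filter(spike, 3)

# One modified training sample per kernel size.
variants = [mean_filter(spike, k) for k in (2, 3)]
```

Larger kernels remove more detail, so the kernel size plays a role similar to the strength of the wavelet modification.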
Evaluating face recognition under single sample per subject conditions reflects the real-world scenario. Other effects, such as various occlusions or lighting variation, also need to be taken into account when trying to reflect real conditions. We also need to test our methods using face databases that contain samples with these variations. Face databases such as ORL or AR could be used for this purpose.
For authentication and identification purposes, face recognition with only 1 img./subj. may not be enough, because its accuracy does not necessarily reach the required level. Therefore, face recognition methods can be combined with different biometrics to form a multimodal system with much better characteristics than any of the individual biometrics alone (Ross & Jain, 2004).
Research described in this paper was done within the grants 1/0214/10 and 1/0961/11 of the Slovak Grant Agency VEGA. Portions of the research in this paper use the FERET database of facial images collected under the FERET program, sponsored by the DOD Counterdrug Technology Development Program Office. We would like to thank our colleague Radoslav Vargic for valuable consultation regarding the practical use of wavelets. We also thank our student Ján Režnák for the preparation of the KPCA results.
- BioSandbox project page – http://biosandbox.fei.stuba.sk