Performance of the adaptive fitness approach (AFA) using different numbers of training images from the BANCA database with fixed and updated gallery sets.
1. Introduction
Over the last two decades, face recognition has emerged as an important research area with many potential applications that can ease and help safeguard many aspects of everyday life (Zhao et al., 2003; Kirby & Sirovich, 1990; Turk & Pentland, 1991; Martinez & Kak, 2001; Belhumeur et al., 1997; Philipps et al., 2000; Eleyan & Demirel, 2007, 2011; Brunelli & Poggi, 1993; Wiskott et al., 1997). The face recognition problem from still images has been extensively studied (Sinha et al., 2006; Eleyan et al., 2008). Face recognition from video has more recently attracted the attention of many researchers (Zhou et al., 2003; Li & Chellappa, 2002; Wechsler et al., 1997; Steffens et al., 1998; Eleyan et al., 2009). Video is inherently richer in information content than still images and has important properties that still images lack, among them temporal continuity, dynamics, and the possibility of constructing 3D models of faces. On the other hand, facial data acquired from video are normally of very low quality and low resolution, which degrades the performance of recognition algorithms. The temporal continuity and dynamics of a person captured on video make it easier for humans to recognize people, and humans are usually able to recognize faces even in very low resolution images. This is not the case for computer based techniques, which have been shown to be quite capable of recognizing faces from still images but struggle with low quality video. Exploiting these properties in more efficient, higher performance face recognition algorithms requires approaches that differ from the traditional ones.
There are many reasons why humans are so successful at recognizing faces in video while computers are not. Some of these are: 1) Humans use a collection (flow) of data over time, rather than an individual video image, during both training and testing. 2) Humans are superbly capable of tracking objects, and in so doing can make excellent use of the flow of data.
In the training stage, when a new person is to be “memorized”, many features such as appearance, gestures and gait are encoded. Each person in the human memory (gallery) is encoded differently, and humans memorize quite a large number of people. In the testing (recognition) stage, human beings compare these features and make a decision on the identity of a person. This process, however, is not a “one shot” comparison; it is made continually, based on the flow of data. When the person is far away, for example, it is difficult to discern the facial features. However, from the gait and gestures the human brain is able to extract important information to identify an approaching person. Based on this information, the human brain automatically deems some of the people in the memory unlikely candidates to match the approaching person, and those candidates are not considered in further comparisons/associations. As the person approaches, the human brain restricts the comparison to a reduced set of likely candidates in the memory.
Inspired by this biological process of making comparisons and decisions based on a reduced set of candidates at the testing stage, we propose in this chapter an analogous structure for computer based face recognition from video, in which the gallery is continually updated as the frames of the probe video are processed. To demonstrate the effectiveness of the proposed approach we employ features derived from PCA or LBP. After every probe frame, the feature vector is compared with the feature vectors of the gallery images, and the unlikely images in the gallery are discarded based on the accumulated fitness of the gallery images. An updated set of features is derived using the remaining images in the gallery, and this updated set is used to test the next frame in the probe video. The results obtained using the updated gallery set indicate that a significant improvement in recognition performance can be achieved. The adaptive fitness approach (AFA) is also tested without updating the gallery set. This scheme with a fixed gallery set gives performance comparable to that of the scheme with an updated gallery set.
The rest of the chapter is organized as follows. Section 2 briefly reviews feature extraction. Section 3 presents the face video database. Section 4 introduces the adaptive fitness update approach. Section 5 reports our experimental results and discussions, and Section 6 concludes this chapter.
2. Feature extraction
Feature extraction is a crucial stage of data preparation for later processing such as detection, estimation and recognition. It is one of the main factors determining the robustness and performance of the system that uses those features, so it is important to choose the feature extractors carefully for the desired application. As a pattern often contains redundant information, mapping it to a feature vector can remove this redundancy while preserving most of the intrinsic information content of the pattern. The extracted features play a major role in distinguishing input patterns.
In this work, instead of using more biologically oriented features, for reasons of simplicity we employ features derived from principal component analysis (PCA) (Kirby & Sirovich, 1990; Turk & Pentland, 1991) and local binary patterns (LBP) (Ahonen et al., 2004; Ojala et al., 2002). However, the recognition framework does allow the incorporation of other features. In the PCA case, one needs to prepare a projection space using the training set and use it to prepare the feature vectors of both the training and test sets. In LBP, every image is processed independently to form its feature vector. So if the size of the training set changes, as it does in AFA, a new projection space has to be prepared when PCA is used to form the feature vectors, while the feature vectors stay the same when LBP is used.
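The point that LBP features depend only on the image itself can be illustrated with a minimal sketch. It assumes the basic 8-neighbour, 3×3 LBP operator and a single 256-bin histogram over the whole image; the chapter does not state which LBP variant (for example uniform patterns or block-wise histograms) was used, and the function name `lbp_histogram` is hypothetical.

```python
def lbp_histogram(image):
    """Return a normalized 256-bin histogram of basic 3x3 LBP codes.

    `image` is a list of rows of grayscale intensities (ints).
    Assumption: basic 8-neighbour operator, one histogram per image.
    """
    rows, cols = len(image), len(image[0])
    # Offsets of the 8 neighbours, clockwise from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbour is at least as bright as the center.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    total = sum(hist)
    # Normalize so that images of different sizes are comparable.
    return [h / total for h in hist] if total else hist
```

Because the histogram is computed from one image alone, removing other images from the gallery leaves it unchanged, whereas a PCA projection would have to be recomputed from the remaining images.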
3. Video face database
In this study we used the BANCA database (Popovici et al., 2003), a multimodal database recorded with various acquisition devices (2 cameras and 2 microphones) under several scenarios (controlled, degraded and adverse). The videos were collected for 52 individuals (26 male and 26 female) on 12 different occasions (4 recordings for each scenario). In our work we use the video sequences of the 52 individuals in the three scenarios. In the degraded scenario a web cam was used, while a higher quality camera was used in the controlled and adverse scenarios. Figure 1 shows samples from the database for the three scenarios.
As it was computationally expensive to use all the frames in each individual’s video sequence, we selected 60 frames which correspond to every other frame in the video sequence. The face images from the first
Face detection had to be run on the extracted frames in the pre-processing stage in order to prepare them for the face recognition process. For this purpose, the local Successive Mean Quantization Transform (SMQT) (Nilsson et al., 2007) was adopted for face detection and cropping because of its robustness to illumination changes. Cropped faces were converted to grayscale and histogram equalized to minimize illumination problems. Bicubic interpolation was used to resize the resulting face images to the reference resolution (the size of the gallery images, 128 × 128). Figure 2 shows an example of the face detection, cropping and resizing steps for one of the images in the BANCA database.
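The histogram equalization step mentioned above can be sketched as follows. This is a minimal illustration using the standard cumulative-distribution mapping on a non-empty 8-bit grayscale image stored as nested lists; it is not the authors' implementation, and the function name `equalize_histogram` is hypothetical. Face detection with SMQT and bicubic resizing are not shown.

```python
def equalize_histogram(image, levels=256):
    """Histogram-equalize a grayscale image given as a list of rows of ints."""
    pixels = [p for row in image for p in row]
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Cumulative distribution of intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(v for v in cdf if v > 0)
    total = len(pixels)
    if total == cdf_min:
        # Constant image: there is nothing to stretch.
        return [row[:] for row in image]
    scale = (levels - 1) / (total - cdf_min)
    # Lookup table mapping each input level to its equalized level.
    lut = [max(0, round((v - cdf_min) * scale)) for v in cdf]
    return [[lut[p] for p in row] for row in image]
```

The darkest intensity present is mapped to 0 and the brightest to `levels - 1`, which spreads the intensities of a poorly lit face crop over the full range.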
4. Adaptive fitness based updating
4.1. Adaptive Fitness Approach (AFA)
The features of each subject in the gallery are derived from the first
Inspired by this biological process, which is employed by the human brain in recognition tasks, we propose a simple approach to adaptively shrink the size of the gallery set after each frame of the test video is processed. A fitness measure i,k is defined using the Euclidean distance as
The accumulated fitness measure forms the basis for shrinking the gallery by discarding the candidates that are unlikely to match the probe video frame. After unlikely candidates are eliminated from the gallery, a new set of features is formed from the remaining, fitter candidates. For example, if PCA is used for feature extraction, the existing eigenspace is updated after images are eliminated from the gallery, and new feature vectors are formed for the remaining images. If LBP is used, on the other hand, throwing out an image amounts to throwing out the corresponding feature vector, so there is no need to recalculate a new set of feature vectors. Eventually the continuous updating of the gallery promises to leave behind a few candidates that are very likely to match the person under test.
This approach has several advantages. The first is its resemblance to recognition by human beings. Second, it promises to speed up the recognition process, because unfit images are discarded from the gallery. It should be pointed out, however, that discarding images from the gallery may, although this is very unlikely, throw out some of the correct images.
The number of images discarded from the gallery set at each processed frame depends on the standard deviation of the accumulated fitness values at that frame. The standard deviation of this distribution is used to establish a fitness threshold c for discarding gallery images. The critical fitness value c is picked conservatively to ensure with almost 100% confidence that the correct gallery images are not eliminated. This forces one to process almost all the frames before reaching a decision, since with a low c few images are discarded from the gallery. It also leads to a higher computational burden. This undesirable situation can be avoided by picking a higher threshold c.
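The discarding step can be sketched as below. The text says only that the threshold c is derived from the standard deviation of the accumulated fitness values, so the specific rule used here, c = mean − k·std with a hypothetical multiplier `k`, is an assumption made for illustration, as is the function name `prune_gallery`.

```python
from statistics import mean, pstdev

def prune_gallery(accumulated, k=2.0):
    """Keep gallery entries whose accumulated fitness is at least c.

    `accumulated` maps a gallery image id to its accumulated fitness.
    Assumed rule (not from the chapter): c = mean - k * std.
    """
    values = list(accumulated.values())
    c = mean(values) - k * pstdev(values)
    # Entries below the threshold are treated as unlikely candidates.
    return {gid: f for gid, f in accumulated.items() if f >= c}
```

Under this assumed rule, a larger `k` gives a lower (more conservative) c, so fewer images are discarded per frame and more frames are needed before the gallery shrinks to a few candidates; a smaller `k` raises c and prunes more aggressively.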
The adaptive fitness approach can also be used without updating the gallery. In this scheme one simply processes all the frames in the probe video and accumulates the fitness measure with the originally prepared feature vectors. This scheme, where the gallery is fixed and no updating is required, is computationally more efficient than the scheme where the gallery and the feature vectors are updated. However, this advantage is not significant, since the feature vectors can be updated incrementally after the gallery is reduced in size, without much computational burden. Furthermore, in the scheme with gallery updating one does not need to process all the video frames to reach a decision. Figures 3 and 4 give the algorithms of these two schemes step by step.
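The fixed-gallery scheme can be sketched in a few lines. The chapter's fitness equation is not reproduced in this text, so the `fitness` function below, 1 / (1 + Euclidean distance), is a stand-in assumption that only preserves the property that closer feature vectors score higher. Accumulating per gallery image and returning the label of the top-scoring image is likewise an illustrative choice, and both function names are hypothetical.

```python
import math

def fitness(probe, gallery_vec):
    """Hypothetical fitness: larger when the Euclidean distance is smaller."""
    return 1.0 / (1.0 + math.dist(probe, gallery_vec))

def afa_fixed_gallery(probe_frames, gallery):
    """Accumulate fitness over all probe frames and return the best label.

    `gallery` is a list of (label, feature_vector) pairs;
    `probe_frames` is a list of feature vectors, one per video frame.
    """
    accumulated = [0.0] * len(gallery)
    for frame in probe_frames:
        for i, (_, vec) in enumerate(gallery):
            accumulated[i] += fitness(frame, vec)
    # The gallery entry with the highest accumulated fitness decides the identity.
    best = max(range(len(gallery)), key=accumulated.__getitem__)
    return gallery[best][0], accumulated
```

A single poorly matching frame has limited influence on the decision, because the identity is chosen from the sum over all frames rather than from any one comparison.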
An example of how the accumulated fitness measure is employed in the video recognition process with updating of the gallery is depicted in Figures 5 and 6. The feature vectors in Figure 5 are derived from PCA, whereas in Figure 6 they come from LBP. In this example the probe video belongs to person # 1. The accumulated fitness measure in both figures shows clearly that the accumulated fitness corresponding to person # 1 increases, while for all other people it is insignificant. The number of training images for each person in this example was n=1, using the controlled scenario (see the first row of Table 1).
4.2. Adaptive Fitness Fusion Approach (AFFA)
To recognize an individual, human beings use more than one feature, such as gait, face, body shape and even clothing. A simple fusion technique is employed here: the individual fitness measures coming from PCA and LBP are simply added. The recognition system based on feature vector fusion is otherwise the same as before. In the same manner, after all the frames have been processed, the individual with the highest fitness value is declared to be the correct subject. Figures 7 and 8 show the pseudocode for the proposed fitness fusion idea with fixed and updated gallery sets, respectively.
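The additive fusion described above can be sketched as follows, assuming the accumulated fitness values for each feature type are already available per subject. The dictionary layout and the function names `fuse_fitness` and `decide` are hypothetical.

```python
def fuse_fitness(pca_fitness, lbp_fitness):
    """Add per-subject accumulated fitness values from two feature types."""
    return {s: pca_fitness[s] + lbp_fitness[s] for s in pca_fitness}

def decide(fused):
    """Return the subject with the highest fused fitness."""
    return max(fused, key=fused.get)

# Illustrative values only: PCA alone would pick "p2", the fused score picks "p1".
pca = {"p1": 2.0, "p2": 2.5}
lbp = {"p1": 3.0, "p2": 1.0}
winner = decide(fuse_fitness(pca, lbp))
```

Since the values are summed without weighting, the feature type with the larger fitness range has more influence on the fused decision.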
5. Simulation results and discussions
Figures 9 to 14 show the performance of the proposed AFA with updated and fixed galleries. AFA used both LBP and PCA for feature extraction, and the results were compared against single frame based PCA and LBP methods, respectively. The three scenarios are shown in these figures with 1 and 5 training images in the gallery set. Both the updated and the fixed gallery give highly competitive results.
The performance of the system is tested using the BANCA database under three scenarios: controlled, degraded and adverse. For each scenario there are 52 people, and for each individual there are 4 videos. The initial gallery is formed from a varying number of training images per individual. For this study the number ranged from 1 to 10, as the second column of Table 1 shows.
Usually human beings recognize people by fusing more than one feature. Here we show how the simple approach can be extended to benefit from different feature vectors. This fusion further improves the performance significantly. Again we employ features derived from PCA and LBP for simplicity and convenience.
Because the performance of the AFA without fusion was very high (almost 100%), we enlarged the video database in order to see the improvement due to fusion faithfully. As explained in Section 3, the BANCA database consists of 52 people with 3 scenarios and 4 recordings for each scenario. We treated the 4 recordings of each individual in each scenario as a different individual. This modification amounts to using 208 subjects with 3 different scenarios. This is far more challenging, since the 4 recordings of each individual are quite similar in terms of feature vectors. The results of this modification in database size, together with the adaptive fitness fusion approach (AFFA) results between LBP and PCA with updated and fixed gallery sets, are shown in Table 2.

| Scenario | # of gallery images per individual (n) | Recognition performance (%): updated gallery set | Recognition performance (%): fixed gallery set | |
|---|---|---|---|---|
| Controlled | 1 | 95.67 / 97.12 | 97.60 / 100 | 66.25 / 85.62 |
| | 2 | 97.60 / 98.56 | 98.56 / 100 | 77.17 / 90.51 |
| | 3 | 98.08 / 99.04 | 99.04 / 100 | 82.13 / 93.28 |
| | 5 | 99.04 / 100 | 99.52 / 100 | 89.14 / 96.58 |
| | 10 | 100 / 100 | 100 / 100 | 96.16 / 98.38 |
| Degraded | 1 | 90.39 / 94.23 | 89.90 / 97.60 | 63.06 / 84.09 |
| | 2 | 95.67 / 95.67 | 96.15 / 98.08 | 73.48 / 88.30 |
| | 3 | 96.63 / 98.56 | 96.63 / 98.56 | 78.25 / 91.30 |
| | 5 | 98.08 / 99.52 | 97.12 / 99.04 | 84.76 / 94.43 |
| | 10 | 100 / 100 | 100 / 100 | 97.15 / 97.63 |
| Adverse | 1 | 91.83 / 96.63 | 92.31 / 97.60 | 68.17 / 87.24 |
| | 2 | 95.19 / 99.52 | 96.63 / 98.56 | 78.65 / 92.76 |
| | 3 | 98.56 / 100 | 98.08 / 99.56 | 84.25 / 95.41 |
| | 5 | 99.52 / 100 | 99.04 / 100 | 90.35 / 97.39 |
| | 10 | 100 / 100 | 100 / 100 | 99.04 / 98.89 |

| Scenario | # of gallery images per individual | Recognition performance (%): updated gallery set | Recognition performance (%): fixed gallery set |
|---|---|---|---|
The graphs in Figures 15 to 20 show examples of the performance of the proposed fitness fusion in the three database scenarios (controlled, degraded, adverse) with 1 and 5 training images. The results shown in these figures are obtained using the scheme with a fixed gallery set. It is clear in all figures that fusing the fitness values obtained separately from the PCA and LBP feature vectors helped to improve the performance of the system. For example, in Figure 15 the performance of AFA in the degraded scenario with 1 training image was 71.16% and 75.85% using PCA and LBP, respectively. When the fusion technique AFFA was applied, the performance increased to 79.33%. In Figure 16, the number of training images was increased from 1 to 5 in the same scenario. With AFA the performance was 88.46% and 92.79% using PCA and LBP, respectively, and with AFFA it reached 94.23%. The same observation can be made for the other two database scenarios with different numbers of training images in Figures 17 to 20.
6. Conclusion
In this chapter a new biologically inspired approach, called the Adaptive Fitness Approach (AFA), is proposed for identifying faces from video sequences. The fitness value of each image in the gallery set is calculated and accumulated as the probe video frames are processed. Two schemes are used with the AFA. The first scheme discards unfit images from the gallery and then updates the feature vectors. In the second scheme the gallery, and thus the feature vectors, are kept fixed.
In order to demonstrate the proposed AFA with updated and fixed gallery schemes, PCA and LBP derived features are employed for convenience. The performance of both schemes is far superior to that of single frame based PCA or LBP approaches, even for a very small number of training images. The adaptive fitness framework is also shown to conveniently accommodate the fusion of different feature vectors, with a further and significant improvement in recognition performance over the AFA with a single feature.