Recognition rate (%) of the proposed eBGP descriptor on the SCface dataset using the patch-based topology.
A reliable face recognition system for surveillance cameras requires a distinctive and robust face descriptor. The binary gradient pattern (BGP) descriptor is well suited to facial feature extraction. However, exploiting local features only from a small region, or microstructure, does not capture a complete facial representation. In this paper, an extended binary gradient pattern (eBGP) is proposed to capture both micro- and macrostructure information of a local region and thereby improve the descriptor's performance and discriminative power. Two topologies, patch-based and circular-based, are combined with the eBGP to test its robustness against illumination, image quality, and uncontrolled capture conditions using the SCface database. Experimental results show that fusing micro- and macrostructure information significantly improves descriptor performance. They also show that the proposed eBGP descriptor outperforms the conventional BGP on both topologies. Furthermore, fusing information from two image types, the orientational image gradient magnitude (OIGM) image and the grayscale image, attained better performance than using the OIGM image alone. Overall, the results indicate that the proposed eBGP descriptor improves recognition performance relative to the baseline BGP descriptor.
- surveillance system
- face recognition
- binary gradient pattern (BGP)
- facial feature extraction
- patch-based topology
- circular-based topology
1. Introduction
Face recognition is a biometric verification method with a wide range of applications, such as law enforcement, forensics, biometric authentication, surveillance, and health monitoring. It has also been used to authenticate payments made with mobile wallets, and social media companies such as Facebook use face recognition algorithms for image tagging. One advantage of face recognition is that it requires no contact between the subject and the camera. Given these advantages and the growth in computing power, a significant body of research and many methods have been proposed in the face recognition domain over the years. A robust face recognition system must work under real-life, unconstrained conditions, including but not limited to variations in pose, lighting, image or camera quality, occlusion, rotation, and translation. The system must also perform well when only a limited number of samples is available. In surveillance monitoring applications, a typical approach is to sample faces appearing in videos and then match them against facial models generated from high-quality target face images [3, 4].
Feature extraction is the process of capturing features of interest from the face and representing them in the form of a feature vector. The extraction is usually performed by a face descriptor, which must cope with multiple variations such as illumination, occlusion, facial expression, and image quality. Many face descriptors have been proposed over the years, such as the scale-invariant feature transform (SIFT), speeded-up robust features (SURF), local binary pattern (LBP), and histogram of oriented gradients (HOG). In terms of facial feature representation, descriptors have evolved around two types: global and local. Global feature extraction methods, such as principal component analysis, linear discriminant analysis, and independent component analysis, preserve the statistical information of the face by turning each face image into a high-dimensional feature vector. Local feature extraction, by contrast, splits the input image into smaller patches and extracts the micro-textural details from each patch before fusing these features to form the global shape information. Local feature extraction has been shown to be resilient to multiple variations by enforcing spatial locality at both the pixel and patch levels. For instance, a local feature descriptor is robust to local deformation caused by expression and occlusion. LBP is an example of a feature extraction method that works on this principle; it achieves reasonably good performance but is heuristic in nature. Recently, LBP has drawn great attention as a face descriptor due to its reputation as a powerful texture descriptor. LBP extracts the local spatial structure of an image by thresholding the intensity of each neighborhood pixel against that of the center pixel.
The result of this operation is a local binary pattern, and the distribution of these binary patterns over the whole image forms the LBP histogram vector, or feature vector. Neighborhood pixels are sampled on a circle, and the intensity of any neighbor that does not fall exactly on the center of a pixel is computed by interpolation. LBP has several shortcomings: it produces a long histogram and is therefore memory-consuming, it is very sensitive to image rotation and noise, and it captures only the microstructure and ignores the macrostructure of the texture, which costs it additional discriminative power. To address these shortcomings, several variants of LBP have been proposed in the literature, for example, rotation-invariant LBP, median robust extended local binary pattern (MRELBP), and binary gradient pattern (BGP). This paper touches on a number of relevant existing LBP-based descriptors. The rest of this paper is organized as follows. Section 2 briefly reviews two state-of-the-art descriptors, LBP and its variant BGP, on which the proposed extended BGP (eBGP) builds. Section 3 describes the proposed eBGP descriptor. The evaluation results are analyzed and discussed in Section 4. Finally, conclusions are drawn in Section 5.
2. From local binary pattern (LBP) to binary gradient pattern (BGP)
LBP is one of various texture descriptors and is known for being computationally efficient. It extracts the local spatial structure of an image by thresholding the intensity of each neighborhood pixel against that of the center pixel
where and are the gray values of the center pixel and its neighbors, respectively,
The success of LBP has continued since then. A variety of LBP-based descriptors have recently been proposed to overcome its shortcomings with respect to noise, illumination, color, and temporal information. Huang and Yin proposed an improved version of LBP, called binary gradient pattern (BGP), by introducing structural patterns and computing image gradient orientation (IGO) in multiple directions rather than only in the X and Y directions, as is done conventionally. Computing IGO in multiple directions helps to improve the discriminative power of the descriptor. Figure 2 shows how BGP encodes a binary string from a region of interest (ROI). Given the grayscale intensity values of 9 pixels as in Figure 2(a), BGP computes binary correlations between symmetric neighbors of the central pixel from multiple
Binary string for the ROI is constructed from four principal binary numbers which is equivalent to 0111, and the label
The number of structural labels
Based on results obtained on multiple databases, such as Extended Yale B, AR, CMU Multi-PIE, FERET, and LFW, against a wide range of descriptors, BGPM was reported to be the best descriptor on each database. The BGPM descriptor achieves invariance to illumination changes and local distortions while reducing the vector dimensionality. Its compact representation makes BGP extremely fast, and it uses far fewer pattern labels than LBP at any spatial resolution. For instance, at a spatial resolution of (8,1), the BGP histogram needs only 9 bins, 8 for structural patterns and 1 for nonstructural patterns, in contrast to LBP, which requires 59 bins. BGP and BGPM have been demonstrated to possess strong spatial locality and orientation properties, which lead to effective discrimination.
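As a rough illustration of why the (8,1) histogram needs only nine bins, the following sketch labels every interior pixel of a small grayscale image. It is a minimal sketch under stated assumptions, not the reference BGP implementation: the neighbor ordering, the tie rule (`>=`), the bit weights, the definition of a structural pattern as a ring with one contiguous run of ones, and all function names are assumptions made for illustration.

```python
# Hypothetical sketch of BGP labelling at (8,1); ordering, tie rule and
# the structural-pattern test are assumptions, not the authors' code.

# Offsets (row, col) of the 8 neighbours, ordered so that entry i and
# entry i + 4 are symmetric about the centre pixel.
NEIGHBOURS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
              (0, -1), (1, -1), (1, 0), (1, 1)]


def principal_bits(img, r, c):
    """Compare each neighbour with its symmetric partner in four directions."""
    bits = []
    for i in range(4):
        dr1, dc1 = NEIGHBOURS[i]
        dr2, dc2 = NEIGHBOURS[i + 4]
        bits.append(1 if img[r + dr1][c + dc1] >= img[r + dr2][c + dc2] else 0)
    return bits


def bgp_label(bits):
    """Weight the four principal bits by powers of two (labels 0..15)."""
    return sum(b << i for i, b in enumerate(bits))


def is_structural(bits):
    """Assumed test: the 8-bit ring made of the principal bits followed by
    their complements has exactly two 0/1 transitions."""
    ring = bits + [1 - b for b in bits]
    transitions = sum(ring[i] != ring[(i + 1) % 8] for i in range(8))
    return transitions == 2


# Labels whose bit pattern passes the structural test (eight of the sixteen).
STRUCTURAL_LABELS = [n for n in range(16)
                     if is_structural([(n >> i) & 1 for i in range(4)])]


def bgp_histogram(img):
    """9-bin histogram: one bin per structural label plus one shared bin
    for all nonstructural labels. Border pixels are skipped."""
    hist = [0] * (len(STRUCTURAL_LABELS) + 1)
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            bits = principal_bits(img, r, c)
            if is_structural(bits):
                hist[STRUCTURAL_LABELS.index(bgp_label(bits))] += 1
            else:
                hist[-1] += 1
    return hist
```

Under these assumptions, half of the sixteen possible labels are structural, so pooling the remaining labels into a single bin gives the nine-bin histogram described above.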
Although BGP has been shown to be efficient in processing time and to achieve outstanding results on several databases, it has never been tested on a proper surveillance database such as SCface, which consists of low-resolution non-frontal face images taken by cameras of different quality. Like most other local descriptors, BGP exploits information from the microstructure only. However, exploiting facial features from the macrostructure to complement the microstructure features results in a more complete image representation [23, 24], especially for surveillance applications where noise, occlusion, and head position may affect descriptor performance. In this paper, information from both the micro- and macrostructures is captured and integrated into the BGP descriptor to improve its performance for video surveillance applications. The proposed descriptor is termed extended BGP (eBGP).
3. Extended binary gradient pattern (eBGP)
The eBGP extends the BGP descriptor by exploiting macrostructure information from a topology with larger spatial resolution. Many different macrostructure topologies have been proposed for other LBP variants. In this paper, the patch-based topology with eight neighborhood patches and the circular-based topology are combined with the proposed eBGP descriptor. Both topologies have been implemented in [24, 26], and each has its own pros and cons. Regardless of the topology, the microstructure information is always extracted using the same approach as in BGP. The following subsections explain how the eBGP extracts macrostructure features under each of the two topologies.
3.1 Patch-based topology
The patch-based topology is inspired by the multi-scale block local binary pattern (MBLBP). In this topology, the macrostructure is made up of nine patches of pixels, as in Figure 5. All patches have the same size, and the center patch represents the ROI microstructure. A default BGP operator is therefore applied to the center patch to extract the microstructure information, whereas the macrostructure information is extracted from the eight neighborhood patches. Patches of multiple sizes can be selected in this topology, and the size of the structure is determined by the spatial resolution of the center patch.
For instance, when microstructure information is exploited at (8,1) spatial resolution, the center patch is 3 × 3 pixels, as illustrated in Figure 5(b). In this implementation, all patches have the same size and do not overlap; the macrostructure is therefore formed from nine patches of 3 × 3 pixels. Figure 5(a) depicts the macrostructure topology formed from nine patches of 5 × 5 pixels when microstructure information is exploited at (16,2) spatial resolution. For comparison purposes, this research evaluates the two structures illustrated in Figure 5(a) and (b), to match the BGP results obtained at (8,1) and (16,2) spatial resolution. Using Figure 5(a) as an example, each neighborhood patch contains 25 pixels, each with its own grayscale value. Unlike the center patch, no feature is extracted from an individual neighborhood patch. Instead, each neighborhood patch is represented by a single intensity value, which is used for thresholding. In this topology, either the mean or the median of the patch is used to represent the patch intensity. The patch mean (G) of a neighborhood patch (P), computed from the 25 pixels of a single 5 × 5 patch, is given as follows:
The patch median, on the other hand, is the middle value of the ordered pixel values. Additional experiments are conducted in this research to find the best representation for the patch-based topology. As an example, feature extraction from the macrostructure is illustrated in Figure 6. Figure 6(a) shows the patch-based topology with patches of 3 × 3 pixels and their intensity values. In each patch, the median of all pixels within the patch is calculated, and this median represents the image intensity of the patch, as shown in Figure 6(b). The remaining steps are the same as in BGP. By thresholding each patch against its symmetric neighbor in four directions using Eqs. (2) and (3), four pairs of binary numbers are generated, as shown in Figure 6(c). Once all the principal bits are computed, the label is calculated using Eq. (4). In general, macrostructure extraction follows the same flow as microstructure extraction except for the representative value used during thresholding: the microstructure information is extracted from neighborhood pixels, while the macrostructure information is extracted from neighborhood patches.
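The macrostructure steps described above can be sketched as follows. This is a minimal sketch, assuming a nested-list grayscale image; the function names, the order of the four directions, the `>=` tie rule, and the bit weights are illustrative assumptions rather than the exact form of Eqs. (2) to (4).

```python
from statistics import mean, median

# Hypothetical sketch: names, direction order and tie rule are assumptions.


def patch_values(img, top, left, size, reducer=median):
    """Reduce each of the 3 x 3 non-overlapping patches, whose top-left
    corner is (top, left), to a single representative intensity."""
    grid = []
    for pr in range(3):
        row = []
        for pc in range(3):
            pixels = [img[top + pr * size + r][left + pc * size + c]
                      for r in range(size) for c in range(size)]
            row.append(reducer(pixels))
        grid.append(row)
    return grid


def macro_label(grid):
    """Threshold symmetric neighbour patches in four directions and weight
    the resulting principal bits by powers of two. The centre patch value
    grid[1][1] is not used here; it is handled by the microstructure step."""
    pairs = [((1, 2), (1, 0)),   # east vs west
             ((0, 2), (2, 0)),   # north-east vs south-west
             ((0, 1), (2, 1)),   # north vs south
             ((0, 0), (2, 2))]   # north-west vs south-east
    label = 0
    for i, ((r1, c1), (r2, c2)) in enumerate(pairs):
        if grid[r1][c1] >= grid[r2][c2]:
            label |= 1 << i
    return label
```

Passing `reducer=mean` instead of the default `median` switches between the two patch representations compared in this section, while the thresholding step stays unchanged.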
Since there are only eight neighbor patches, regardless of the structures’ size, the generated histogram vector which represents the macrostructure information is bound to the maximum of 16 bins. Observing only a structural pattern will greatly reduce the dimensionality of macrostructure information to eight bins. The total length of the histogram vector (
Subsequently, the micro- and macrostructure information is fused by concatenating the feature vectors of the microstructure and the macrostructure, as illustrated in Figure 7. At this point, both feature vectors are given the same weight. Figure 8 shows an example of a face image represented using the patch-based topology. It illustrates that eBGP on the patch-based topology is capable of capturing the micro-textural details, while the macrostructure provides information complementary to these small details. Moreover, the macrostructure information is less detailed and may reduce the noise or outliers embedded in the image.
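The equal-weight fusion described above amounts to a plain concatenation. The sketch below is illustrative only: the function names are hypothetical, and concatenating the fused vectors block by block is an assumption about how the per-region histograms are assembled into one image descriptor.

```python
def fuse_histograms(micro_hist, macro_hist):
    """Concatenate the two histograms with equal weight (no rescaling)."""
    return list(micro_hist) + list(macro_hist)


def describe_image(block_histograms):
    """Concatenate the fused (micro, macro) histogram pair of every block.

    block_histograms is an iterable of (micro_hist, macro_hist) tuples,
    one per image block (an assumed layout)."""
    feature = []
    for micro_hist, macro_hist in block_histograms:
        feature.extend(fuse_histograms(micro_hist, macro_hist))
    return feature
```

Because neither histogram is rescaled, the fused vector length is simply the sum of the two input lengths.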
3.2 Circular-based topology
Circular-based topology borrows the basic implementation of LBP which identifies a neighborhood as a set of pixels on a circular ring. In this topology, two levels of information are extracted from neighborhood at two different spatial resolutions. The first level of information is the microstructure information, which is extracted from a set of pixels on a circular ring with radius
Figure 10(a) shows a sample of image intensity that falls on circular rings
In the BGP scheme, the length of the histogram vector is equal to the number of neighbors at any spatial resolution. As in the patch-based topology, the histogram vectors that embed the micro- and macrostructure information are concatenated to form the final feature representation for each ROI. The total length of the histogram vector in this scheme can be computed using:
Figure 11 illustrates the general flow of feature extraction in the circular-based topology. Overall, this topology applies the BGP operator at two different spatial resolutions, where the smaller resolution captures the microstructure information and the larger resolution captures the macrostructure information. In this research, no interpolation is applied to neighboring pixels where the circle does not fall exactly on the center of a pixel. Figure 12 presents a sample image that is extracted from the two spatial resolutions
As in the patch-based topology, BGP captures the micro-oriented edges from the small structure while capturing less detailed information at a much larger spatial resolution. The two kinds of information complement each other in providing a complete face representation.
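A two-ring extraction of this kind can be sketched as follows. It is a minimal sketch under stated assumptions: rounding each sampling point to the nearest pixel is one possible way of avoiding interpolation and is not confirmed by the text, and the function names, default resolutions, starting angle, and `>=` tie rule are all hypothetical.

```python
import math

# Hypothetical sketch: nearest-pixel rounding, names and defaults are assumptions.


def ring_offsets(num_neighbours, radius):
    """Integer (row, col) offsets of neighbours on a ring. Coordinates are
    rounded to the nearest pixel instead of being interpolated. The second
    half is the negation of the first so that offsets i and i + P/2 are
    exactly symmetric about the centre."""
    half = num_neighbours // 2
    first_half = []
    for p in range(half):
        angle = 2.0 * math.pi * p / num_neighbours
        first_half.append((round(-radius * math.sin(angle)),
                           round(radius * math.cos(angle))))
    return first_half + [(-dr, -dc) for dr, dc in first_half]


def ring_bits(img, r, c, num_neighbours, radius):
    """Principal bits from thresholding symmetric neighbours on one ring."""
    offsets = ring_offsets(num_neighbours, radius)
    half = num_neighbours // 2
    bits = []
    for i in range(half):
        (dr1, dc1), (dr2, dc2) = offsets[i], offsets[i + half]
        bits.append(1 if img[r + dr1][c + dc1] >= img[r + dr2][c + dc2] else 0)
    return bits


def two_level_bits(img, r, c, micro=(8, 1), macro=(16, 2)):
    """Micro bits from the inner ring and macro bits from the outer ring."""
    return ring_bits(img, r, c, *micro), ring_bits(img, r, c, *macro)
```

The pixel at `(r, c)` must lie at least the outer radius away from the image border, since the sketch does no bounds checking.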
4. Results, discussion, and analysis
To represent a real-world video surveillance system, the effectiveness of the proposed eBGP descriptor was evaluated using the Surveillance Camera Face (SCface) database. The SCface database consists of low-resolution non-frontal face images taken by cameras of different quality. A series of experiments was planned to test all proposed topologies and structures on the SCface database. The performance of the proposed eBGP descriptor was evaluated against illumination, image quality, single sample per person, and real-world capture conditions.
The SCface database is among the most challenging databases for face recognition, as its images were taken in an uncontrolled indoor environment. It consists of 4160 images of 130 subjects. All images were taken at three distinct distances from the cameras, which were installed 2.25 m above the floor. At distance 1 the subject stood 4.20 m from the camera, whereas at distances 2 and 3 the subject stood 2.60 and 1.00 m away, respectively. Outdoor light, which came through a window on one side, was the only source of illumination. The images were captured by five commercial surveillance video cameras of different quality and two infrared night-vision cameras, in uncontrolled lighting so as to mimic real-world conditions. Furthermore, a full frontal mug shot of each subject was captured using a high-quality photo camera under the capture conditions that would be expected for law enforcement use. The high-quality photo camera for capturing visible-light mug shots was installed in the same way as the infrared camera but in a separate room with standard indoor lighting, and it was equipped with an adequate flash. In our experiments, the high-quality mug shot of each person was used as the training gallery, while the remaining images from the five surveillance cameras and three distances were used as test images, as depicted in Figure 13. Because this research focuses on visible-spectrum images and a single sample per person, as in real-world surveillance systems, the images taken by the IR night-vision cameras and the rotated mug shots were not used. As preprocessing steps, all images in the SCface database were aligned based on the provided eye coordinates so that the eyes lie on a horizontal line. The images were then scaled and cropped to 64 × 64 pixels, as implemented in previous work.
The performance of the proposed eBGP descriptor was evaluated using histogram intersection, which computes the similarity between two discretized probability distributions or histogram vectors. Given
It is important to stress that the classifier plays a decisive role in achieving a better recognition rate. In this research, the experiments were designed to focus on the recognition rate improvement due to macrostructure information fusion. Hence, the recognition rates of the proposed eBGP descriptor and its baseline BGP descriptor were computed and compared to verify the improvement. For comparative analysis, the results of the BGP descriptor on the SCface database were produced by running the BGP code requested from its authors, so that the analysis can be carried out without concern about the validity of the baseline results. Because Huang and Yin do not use the SCface database in their work, the BGP code was adapted to work with it.
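The histogram-intersection matching used in these experiments can be sketched as follows. The intersection itself is the standard sum of bin-wise minima. Assigning each probe to the gallery subject with the highest intersection, and scoring the recognition rate as the percentage of correctly identified probes, are assumptions made for this sketch and may differ in detail from Eqs. (8) and (9). All names are hypothetical.

```python
def histogram_intersection(h1, h2):
    """Similarity of two histograms: sum of the bin-wise minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def identify(probe_hist, gallery):
    """Return the gallery subject whose single reference histogram is most
    similar to the probe. gallery maps subject id -> histogram."""
    return max(gallery,
               key=lambda sid: histogram_intersection(probe_hist, gallery[sid]))


def recognition_rate(probes, gallery):
    """Percentage of probes assigned to their true subject.

    probes is a list of (true_subject_id, histogram) pairs."""
    correct = sum(identify(hist, gallery) == sid for sid, hist in probes)
    return 100.0 * correct / len(probes)
```

With one mug shot per subject, `gallery` holds exactly one histogram per identity, which matches the single-sample-per-person setting of the experiments.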
4.1 Experiment settings and preprocessing
As a preprocessing step, each image is first transformed into OIGM images using the same method used by the BGP descriptor. OIGM images are then divided into
4.2 Results of patch-based topology
For better presentation, several notations are used to describe the experiment setup and implementation. BGPM(
Table 1 shows the performance of the proposed descriptor on the SCface database, where eBGPM(16;2) and eBGPM(8;1) represent the extended BGPM (eBGPM) with the structures of Figure 5(a) and Figure 5(b), respectively. BGPM(16;2) and BGPM(8;1) represent the baseline descriptor. As mentioned earlier in this section, the SCface images were captured by five cameras at three different distances. Table 1 shows the recognition rate for each set and the average recognition rate over all cameras. The recognition rate for each set was calculated using Eqs. (8) and (9).
Table 1 shows that none of the descriptors achieved a recognition rate higher than 35% across all cameras and distances. In particular, the images of distance 1 recorded the lowest recognition rate, with an average of 4.58%, while the images of distances 2 and 3 achieved better recognition rates, with averages of 14.89 and 15.73%, respectively. Table 1 also shows that eBGPM(8;1) slightly improved on BGPM(8;1) at all distances, with its largest gain at distance 2, where the average improvement was 3.54%. In contrast, eBGPM(16;2) shows mixed results with respect to its baseline BGPM(16;2): a performance drop can be observed in the camera 1 results, where distances 1, 2, and 3 all show a lower recognition rate than the baseline descriptor. Like eBGPM(8;1), eBGPM(16;2) achieved its highest recognition rate on the images of distance 2 rather than on those of distances 1 and 3. This is because the images of distance 1, acquired from 4.20 m, are low in resolution and small in size. Moreover, scaling and cropping the images to 64 × 64 pixels leads to a loss of quality and of some dominant features. The images of distance 3, on the other hand, are higher in quality and detail. However, because the subjects are closer to the camera, which is installed 2.25 m above the floor, the upper half of the subject's face is more dominant in the captured images in the most natural head position, as depicted in Figure 14. Figure 14 shows that the images of distance 2 are slightly better in quality than those of the other two distances, although they still suffer from the head position. This explains the superiority of the descriptors at this distance.
Owing to these discouraging results from both the proposed eBGP descriptor and its baseline BGP, extra experiments were conducted on the SCface database. Since Table 1 indicates that the recognition rate improves as the spatial resolution increases, the BGPM descriptor was first extended to a larger spatial resolution of (24,3). Even though including the macrostructure in eBGP increased the recognition rate, the overall recognition rate is still too low for realistic applications. A possible reason is that the structural patterns and OIGM image were extracted from low-resolution and deformed images (after scaling and cropping). Hence, two additional descriptors were designed to investigate the effectiveness of structural patterns and the OIGM image when exploiting macrostructure information from low-resolution images. These descriptors still use BGPM to exploit information from the microstructure, but they extract the macrostructure information in a different way.
The first additional descriptor, denoted as Type I
The results in Table 2 show that the Type IIP descriptor achieved a better recognition rate than the other descriptors. They also show that Type IIP performed better on the images of distance 2 than on those of distances 1 and 3. Furthermore, it is worth noting that employing BGPM(24;3) at a larger spatial resolution did not improve the recognition rate as much as Type IIP did.
4.3 Results of circular-based topology
As described in Section 3.2, the macrostructure information are exploited from the outer circle which always has larger spatial resolution (
The performance of the Type Ic and Type IIc descriptors on the SCface dataset at distance 1, distance 2, and distance 3 is presented in Tables 3, 4, and 5, respectively. As with the patch-based topology, the average recognition rate over all cameras is lowest for the images of distance 1, compared with those of distances 2 and 3, as shown in Table 3. One noteworthy observation is that most Type IIc descriptors, at any spatial resolution, achieved a better recognition rate than the Type Ic descriptors. Looking more closely at Table 5, the Type IIc descriptor with spatial resolution of and recorded the best results for all cameras on the test gallery of distance 3. On the other hand, for the distance 2 test gallery, the Type IIc descriptor with spatial resolution of and achieved the best result against other combinations.
For further evaluation, Table 6 compares the proposed eBGP descriptor with state-of-the-art descriptors such as PCA, SIFT and sparse representation-based classification (SRC), and edge-preserving super-resolution (SR), on the SCface database at distance 2. All descriptors were evaluated under the same test conditions, where only one mug shot image per subject is used for training, while the remaining low-resolution images from all cameras are used as probe images. The results show that the proposed descriptors based on eBGP achieved the highest recognition rates among all descriptors, especially eBGPM(16;2) (Type IIP), which has the best recognition rate over all camera images. Exploiting information from the macrostructure raised the BGPM results from the fifth highest to the highest. This indicates the importance of the macrostructure information in shaping a complete face representation in the single-reference face recognition problem.
|Descriptor|Cam 1|Cam 2|Cam 3|Cam 4|Cam 5|Average|
|Edge-preserving SR|26.92|21.54|15.38|24.61|15.38|20.77|
|eBGPM(16;2) (Type IIP)|34.62|25.38|20.00|25.38|21.54|25.38|
5. Conclusions
In this paper, an extended BGP (eBGP) descriptor, which incorporates macrostructure information into the BGP descriptor, has been proposed to improve overall descriptor performance in the single-reference face recognition problem. Results from a series of experiments on the SCface database showed that fusing information extracted from the micro- and macrostructures boosts the performance of the BGP descriptor. The proposed eBGP descriptor was tested with the patch-based and circular-based topologies; overall, the circular-based topology outperformed the patch-based topology in terms of recognition rate. In the patch-based topology, the 5 × 5 structure yielded a larger gain in recognition rate than the 3 × 3 structure, while in the circular-based topology, larger spatial resolutions yielded larger gains in recognition performance. Moreover, fusing micro- and macrostructure information extracted from the OIGM and grayscale images, respectively, raised the recognition rate further. In most cases, the Type IIc setup showed a larger performance boost than Type Ic. With regard to thresholding, it is worth mentioning that the local mean is on par with the local median and offers no additional boost in the patch-based topology.
The authors gratefully acknowledge Universiti Sains Malaysia for funding this work through the Universiti Sains Malaysia Research University Grant (RUI) no. 1001/PELECT/8014056.