Spatial Domain Representation for Face Recognition

Spatial domain representation for face recognition characterizes a face by spatial features extracted from the facial image. This chapter provides an overview of well-known and some recently explored spatial domain representations for face recognition. Over the last two decades, scale-invariant feature transform (SIFT), histogram of oriented gradients (HOG) and local binary patterns (LBP) have emerged as promising spatial feature extraction techniques for face recognition. SIFT and HOG are effective for face recognition under variations in scale, rotation, and illumination. LBP is a texture-based analysis that is effective for extracting texture information from a face. Other relevant spatial domain representations are spatial pyramid learning (SPLE), local phase quantization (LPQ), and variants of LBP such as improved local binary pattern (ILBP), compound local binary pattern (CLBP), local ternary pattern (LTP), three-patch local binary patterns (TPLBP), and four-patch local binary patterns (FPLBP). These representations are improved versions of SIFT and LBP and have improved results for face recognition. A detailed analysis of these methods, basic results for face recognition, and possible applications are presented in this chapter.


Introduction
Face recognition is a powerful biometric technique in today's highly technological world. It is widely preferred over other biometric systems such as fingerprint, iris, or speech recognition for security, surveillance, and commercial applications. A face recognition system generally consists of several major stages: face detection, preprocessing, feature extraction, and verification. The complete structure of a face recognition system is shown in Figure 1. Face detection locates a single face or multiple faces present in a given image. The Viola-Jones face detection algorithm using Haar features [1], the Faster R-CNN face detector [2], and face detection based on histograms of oriented gradients [3] are popular methods for detecting faces in an image. Images are generally captured in unconstrained environments and hence need to be preprocessed before being fed to the feature extraction stage. Preprocessing mainly aims to reduce the effects of noise and of differences in illumination, color intensity, background, and orientation. Correct recognition depends on the quality of the captured image, the lighting conditions, etc. [4], so the recognition rate can be improved by preprocessing the captured image. Commonly used preprocessing techniques include cropping, image resizing, histogram equalization, and de-noising filtering, as described below.
1. Face Detection and Cropping: Face detection involves locating the face region within the whole image. Cropping can be done based on one or more features of the image such as eyes, lips, nose etc.
2. Image Resizing: Variation in face image size, shape, pose etc. makes it difficult to design face recognition algorithms, so it is important to resize the image before feature extraction. For this, face images are cropped again to a standard size. An affine transformation with bilinear interpolation can be applied to the face.
3. Image Equalization: The illumination variation problem in the resized image is overcome by using histogram equalization.
4. Image De-noising and Filtering: Raw images are corrupted by noise both at the time of capture and afterwards. Wiener and median filters are used to remove noise [5].
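Steps 3 and 4 above can be illustrated with a minimal pure-Python sketch. It assumes an 8-bit grayscale image stored as a list of rows; the function names `equalize_histogram` and `median_filter3` are illustrative and not taken from the chapter, and the Wiener filter is not covered.

```python
def equalize_histogram(image, levels=256):
    """Histogram-equalize a grayscale image given as a list of rows of ints."""
    pixels = [p for row in image for p in row]
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Cumulative distribution of gray levels.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    if cdf_min == len(pixels):          # constant image: nothing to stretch
        return [row[:] for row in image]
    scale = (levels - 1) / (len(pixels) - cdf_min)
    return [[round((cdf[p] - cdf_min) * scale) for p in row] for row in image]


def median_filter3(image):
    """3x3 median filter; border pixels are copied unchanged (an assumption)."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            window = sorted(image[r + dr][c + dc]
                            for dr in (-1, 0, 1) for dc in (-1, 0, 1))
            out[r][c] = window[4]       # middle of the 9 sorted values
    return out
```

Equalization stretches the occupied gray levels over the full range, while the median filter suppresses isolated impulse noise without blurring edges as strongly as an averaging filter would.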
Next is feature extraction, which is considered the most prominent stage in a face recognition system; it extracts discriminative facial features. The extracted features are represented as a feature vector and fed to the verification stage. Feature selection is an optional stage before verification which reduces the feature vector dimension using dimensionality reduction techniques [6]. The final stage is verification, which identifies an unknown face by finding its closest match in the gallery.

Existing face databases
A number of benchmark face databases are available for fair evaluation of face recognition methods. These databases contain images or videos of a number of individuals captured under varying conditions and resolutions. A summary of benchmark face databases is tabulated in Table 1.
The detailed structure of some of these face databases is provided below.

AT&T Database
The AT&T database, originally known as the ORL database, has face images captured between April 1992 and April 1994. This database was collected by researchers of the Cambridge University Engineering Department for a face recognition project. There are 400 images in total in the AT&T database, obtained by taking 10 different images of each of 40 individuals. All images are captured against a dark homogeneous background with a resolution of 92 × 112 pixels. The conditions varied across images are: capture time, lighting, open or closed eyes, smiling or not smiling, and glasses or no glasses; some images also have rotation variation. The database has 40 directories, each with 10 images of one individual stored in .pgm format. Sample images of the AT&T database are shown in Figure 2.

CAS-PEAL-R1
This chapter mainly focuses on feature extraction stage in face recognition. It presents some well-known and recently explored spatial domain representations for

Histogram of oriented gradients (HOG)
Histogram of oriented gradients (HOG) was introduced by Dalal et al. [13] in 2005 for human detection. HOG is an effective descriptor for face recognition that computes normalized histograms of gradient orientations over a dense grid [14]. HOG characterizes the local appearance and shape of a face through the distribution of local intensity gradients. It is based on gradient computation, fine orientation binning, normalization, and descriptor blocks.
A detailed implementation for extracting HOG features for face recognition is given as:
1. The first step is gradient computation. The horizontal and vertical gradients of the facial image are obtained by filtering it with derivative masks, typically the centered masks $D_x = [-1 \;\; 0 \;\; 1]$ and $D_y = [-1 \;\; 0 \;\; 1]^T$ [13]. Results for a sample facial image using the horizontal ($D_x$) and vertical ($D_y$) derivative masks are shown in Figure 5.
2. The next step is fine orientation binning. Histogram channels are evenly spaced over 0-180° for unsigned gradients and 0-360° for signed gradients. Each pixel in a cell can contribute a vote in the form of gradient magnitude, its square root, or its square. In general, gradient magnitude yields the best results while the square root reduces performance [13].
4. Different normalization schemes are presented in [15] for block normalization. Let $\nu$ represent the un-normalized block vector, $\|\nu\|_k$ its $k$-norm for $k = 1, 2$, and $\varrho$ a small constant. The normalization schemes used are:
$$\text{L1-norm: } \nu \rightarrow \frac{\nu}{\|\nu\|_1 + \varrho}, \qquad \text{L1-sqrt: } \nu \rightarrow \sqrt{\frac{\nu}{\|\nu\|_1 + \varrho}}, \qquad \text{L2-norm: } \nu \rightarrow \frac{\nu}{\sqrt{\|\nu\|_2^2 + \varrho^2}}$$
and L2-hys. Generally, L2-hys is used for block normalization. L2-hys is obtained by first computing the L2-norm, then clipping so that the maximum value of $\nu$ is limited to 0.2, and then renormalizing.
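The gradient, binning, and normalization steps can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions: centered $[-1, 0, 1]$ masks, 9 unsigned orientation bins, hard assignment of each pixel to a single bin (no interpolation between bins), and a single cell; the function names are hypothetical.

```python
import math


def gradients(image):
    """Centered [-1, 0, 1] derivatives; border pixels get zero gradient."""
    h, w = len(image), len(image[0])
    gx = [[0.0] * w for _ in range(h)]
    gy = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if 0 < c < w - 1:
                gx[r][c] = image[r][c + 1] - image[r][c - 1]
            if 0 < r < h - 1:
                gy[r][c] = image[r + 1][c] - image[r - 1][c]
    return gx, gy


def cell_histogram(gx, gy, bins=9):
    """Unsigned (0-180 degree) orientation histogram, votes weighted by magnitude."""
    hist = [0.0] * bins
    width = 180.0 / bins
    for row_x, row_y in zip(gx, gy):
        for dx, dy in zip(row_x, row_y):
            magnitude = math.hypot(dx, dy)
            angle = math.degrees(math.atan2(dy, dx)) % 180.0
            hist[int(angle // width) % bins] += magnitude
    return hist


def l2_hys(v, eps=1e-5, clip=0.2):
    """L2-norm, clip at 0.2, then renormalize."""
    norm = math.sqrt(sum(x * x for x in v) + eps ** 2)
    v = [min(x / norm, clip) for x in v]
    norm = math.sqrt(sum(x * x for x in v) + eps ** 2)
    return [x / norm for x in v]
```

In a full descriptor, the histograms of all cells in a block would be concatenated before `l2_hys` is applied, and the normalized blocks would be concatenated over the whole face.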

Gradients in each cell
Sample input facial image and resultant HOG features are shown in Figure 6.

Scale invariant feature transform (SIFT)
Scale invariant feature transform (SIFT) was introduced by Lowe [16] for extracting discriminative invariant features from an image. The SIFT descriptor is widely used for facial feature representation by extracting blob-like local features [17]. These features are invariant to scale, translation and rotation, resulting in reliable matching. SIFT is described in four steps: (1) detection of scale-space extrema, (2) detection of local extrema, (3) orientation assignment, and (4) keypoint descriptor representation.

Detection of scale-space extrema
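The first step is to identify keypoints in the scale-space of the grayscale input image $f(a,b)$, which is defined as:
$$L(a,b,\sigma) = G(a,b,\sigma) * f(a,b)$$
$$G(a,b,\sigma) = \frac{1}{2\pi\sigma^2}\, e^{-(a^2+b^2)/(2\sigma^2)}$$
where $*$ denotes convolution and $\sigma$ is the standard deviation of the Gaussian $G(a,b,\sigma)$. Two nearby scales of the image, separated by a constant multiplicative factor $k$, are used to efficiently detect extrema in scale-space. The difference of Gaussian (DoG) is computed as the difference of these two Gaussians convolved with the original image:
$$D(a,b,\sigma) = \big(G(a,b,k\sigma) - G(a,b,\sigma)\big) * f(a,b) = L(a,b,k\sigma) - L(a,b,\sigma)$$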
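A minimal sketch of the difference-of-Gaussian computation is given below. The values $\sigma = 1.6$ and $k = \sqrt{2}$, the kernel truncation at three standard deviations, and the replicated borders are assumptions made for illustration, and the function names are hypothetical.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    kernel = [math.exp(-(x * x) / (2 * sigma * sigma))
              for x in range(-radius, radius + 1)]
    total = sum(kernel)
    return [v / total for v in kernel]


def blur(image, sigma):
    """Separable Gaussian blur L = G * f with replicated borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    h, w = len(image), len(image[0])

    def clamp(i, n):
        return min(max(i, 0), n - 1)

    # Horizontal pass, then vertical pass.
    tmp = [[sum(kernel[j + radius] * image[r][clamp(c + j, w)]
                for j in range(-radius, radius + 1))
            for c in range(w)] for r in range(h)]
    return [[sum(kernel[j + radius] * tmp[clamp(r + j, h)][c]
                 for j in range(-radius, radius + 1))
             for c in range(w)] for r in range(h)]


def difference_of_gaussians(image, sigma=1.6, k=math.sqrt(2)):
    """D = L(k * sigma) - L(sigma), computed pixel by pixel."""
    low = blur(image, sigma)
    high = blur(image, k * sigma)
    return [[hv - lv for hv, lv in zip(hr, lr)] for hr, lr in zip(high, low)]
```

Repeating this for a sequence of scales $\sigma, k\sigma, k^2\sigma, \dots$ yields the stack of DoG images in which extrema are searched.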

Detection of local extrema
Local extrema (maxima/minima) of $D(a,b,\sigma)$ are found by comparing each sample pixel with its eight neighbors in the 3 × 3 patch at the same scale as well as the nine neighbors in each of the scales above and below. A sample point is selected as a local minimum if it is smaller than all 26 neighbors, and as a local maximum if it is larger than all of them. After keypoint localization, low-contrast and poorly localized points are removed by computing $|D(a,b,\sigma)|$ and discarding points whose value is below a defined threshold.
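The 26-neighbor test can be sketched as follows, assuming three adjacent DoG images of equal size stored as lists of rows and an interior pixel position. The function name and the `contrast_threshold` parameter are illustrative.

```python
def is_keypoint(dog_below, dog_current, dog_above, r, c, contrast_threshold=0.0):
    """True if pixel (r, c) of dog_current is a strict extremum among its
    26 scale-space neighbors and passes the contrast threshold."""
    value = dog_current[r][c]
    neighbors = []
    for index, layer in enumerate((dog_below, dog_current, dog_above)):
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if index == 1 and dr == 0 and dc == 0:
                    continue            # skip the sample point itself
                neighbors.append(layer[r + dr][c + dc])
    is_max = all(value > n for n in neighbors)
    is_min = all(value < n for n in neighbors)
    return (is_max or is_min) and abs(value) >= contrast_threshold
```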

Orientation assignment
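Orientation assignment to each keypoint results in rotation invariance. For each Gaussian-smoothed image $L(a,b)$, an orientation is assigned by computing the gradient magnitude $m(a,b)$ and gradient direction $\theta(a,b)$ from neighboring pixels using Eqs. (8) and (9) respectively:
$$m(a,b) = \sqrt{\big(L(a+1,b) - L(a-1,b)\big)^2 + \big(L(a,b+1) - L(a,b-1)\big)^2} \tag{8}$$
$$\theta(a,b) = \tan^{-1}\left(\frac{L(a,b+1) - L(a,b-1)}{L(a+1,b) - L(a-1,b)}\right) \tag{9}$$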

Keypoint descriptor representation
Finally, each detected keypoint is represented as a 128-dimensional feature vector. This is obtained by computing the magnitude and orientation of the gradient at each point in a 16 × 16 patch of the image. Each 16 × 16 patch is subdivided into 4 × 4 non-overlapping regions, and each region is represented by an 8-bin orientation histogram. Hence, each keypoint descriptor is a vector of length 4 × 4 × 8 = 128. Figure 7 shows an example of the assignment of a SIFT descriptor for an 8 × 8 neighborhood. The length of each arrow corresponds to the sum of gradient magnitudes in a specific direction within a 4 × 4 region.
The processing flow to generate SIFT features for face recognition is shown in Figure 8. The input image is first preprocessed and a difference of Gaussian pyramid is generated as in Figure 8(c). The resulting SIFT keypoints are then represented as a feature vector to be fed to a classifier for face recognition.
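The 4 × 4 × 8 layout of the keypoint descriptor can be sketched as below. This is a deliberately simplified illustration: it assumes precomputed 16 × 16 arrays of gradient magnitude and orientation (in radians), and it omits the Gaussian weighting, the interpolation between bins, and the rotation of the patch to the keypoint orientation. The function name is hypothetical.

```python
import math


def sift_like_descriptor(magnitude, orientation):
    """Build a 128-D vector: 4x4 regions, each with an 8-bin orientation histogram."""
    descriptor = [0.0] * 128
    for r in range(16):
        for c in range(16):
            region = (r // 4) * 4 + (c // 4)           # index 0..15 of the 4x4 region
            angle = orientation[r][c] % (2 * math.pi)
            bin_index = int(angle / (2 * math.pi) * 8) % 8
            descriptor[region * 8 + bin_index] += magnitude[r][c]
    norm = math.sqrt(sum(v * v for v in descriptor))
    return [v / norm for v in descriptor] if norm > 0 else descriptor
```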

Local phase quantization (LPQ)
Local phase quantization (LPQ), introduced by Ojansivu et al. [18,19], is a blur-tolerant texture-based descriptor. LPQ is based on the blur invariance property of the frequency domain phase spectrum of an image. LPQ for face recognition was investigated by Ahonen et al. [20], who reported improved results for blurred facial images.
LPQ at an image pixel is computed by applying the short-term Fourier transform (STFT) over an M × M patch centered on the pixel at four scalar frequencies. The imaginary and real components are then whitened and binary quantized to generate the LPQ code for that pixel. The complete process is detailed in Figure 9, where the LPQ code is obtained for an image pixel [21]. The final LPQ feature vector is obtained by shifting the M × M patch over the entire image.
Spatial blurring is modeled as the convolution of the grayscale input image $f(a,b)$ with a point spread function (PSF). In the frequency domain this is represented as:
$$H(u,v) = F(u,v) \cdot P(u,v) \tag{10}$$
where $H(u,v)$, $F(u,v)$ and $P(u,v)$ are the discrete Fourier transforms of the blurred image, the original image and the PSF, respectively. The phase spectrum is obtained as:
$$\angle H(u,v) = \angle F(u,v) + \angle P(u,v) \tag{11}$$
Now, if the PSF is positive and even, then $P(u,v)$ is real-valued and $\angle P(u,v)$ must be either 0 or $\pi$, such that $\angle P(u,v) = 0$ for $P(u,v) \geq 0$ while $\angle P(u,v) = \pi$ for $P(u,v) < 0$. Since the shape of $P(u,v)$ generally selected is similar to a Gaussian function, the low-frequency values of $P(u,v)$ are positive. This results in $\angle P(u,v) = 0$, and Eq. (11) becomes $\angle H(u,v) = \angle F(u,v)$. Hence, it can be stated that LPQ possesses the blur invariance property. A detailed mathematical analysis of LPQ can be obtained from [21].
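The phase argument can be checked numerically with a small one-dimensional sketch. It assumes circular convolution, a hand-picked even PSF, and an arbitrary test signal; all names and values are illustrative. For every frequency at which the PSF spectrum is positive, the phase of the blurred signal should match that of the original.

```python
import cmath
import math


def dft(signal):
    """Plain O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * u * t / n) for t in range(n))
            for u in range(n)]


def circular_convolve(f, p):
    """Circular convolution, so that DFT(h) = DFT(f) * DFT(p)."""
    n = len(f)
    return [sum(f[(t - s) % n] * p[s] for s in range(n)) for t in range(n)]


def max_phase_shift(f, p, tol=1e-9):
    """Largest |angle(H) - angle(F)| over frequencies where P(u) > 0."""
    h = circular_convolve(f, p)
    F, P, H = dft(f), dft(p), dft(h)
    shifts = [abs(cmath.phase(H[u] / F[u]))
              for u in range(len(f))
              if P[u].real > tol and abs(F[u]) > tol]
    return max(shifts)


signal = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
psf = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]   # even (circularly symmetric) PSF
shift = max_phase_shift(signal, psf)                # expected to be numerically zero
```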

Local binary patterns (LBP)
Local binary patterns (LBP) was introduced by Ojala et al. [22] as a rotation invariant texture-based feature descriptor. LBP as a feature representation for face recognition was proposed by Ahonen et al. [23]. They stated that texture analysis of a local facial region represents its local appearance, and that fusing all regions generates an encoded global geometry of the face.
Consider an input image and let $f(a,b)$ be its preprocessed version. The basic LBP operator on a 3 × 3 neighborhood of $f(a,b)$ and the decimal code generated for the center pixel are shown in Figure 10. The LBP operator replaces each pixel of $f(a,b)$ with a calculated decimal code, resulting in the LBP encoded image $f_{LBP}(a,b)$. This is done by thresholding each pixel of the 3 × 3 neighborhood against its center pixel. The result is a binary code, which is converted into the corresponding decimal code and assigned to the center pixel. The LBP code assigned to the center pixel is given by Eq. (12):
$$LBP(i_c) = \sum_{n=0}^{7} s(c_n - c_p)\, 2^n, \qquad s(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases} \tag{12}$$
Here, $i_c$ represents the center pixel, $c_n$ is the gray level of the $n$-th neighbor pixel, and $c_p$ is the gray level of the center pixel.
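The thresholding step for a single 3 × 3 neighborhood can be sketched as below. The neighbor ordering (clockwise from the top-left pixel) and the bit weights are assumptions, since they depend on the convention of Figure 10; the function name and sample values are illustrative.

```python
# Neighbor positions in a 3x3 block, clockwise from the top-left corner (assumed order).
NEIGHBOR_ORDER = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]


def lbp_code(block):
    """Return the binary pattern and decimal LBP code of a 3x3 block."""
    center = block[1][1]
    bits = [1 if block[r][c] >= center else 0 for r, c in NEIGHBOR_ORDER]
    code = sum(bit << n for n, bit in enumerate(bits))   # bit n has weight 2**n
    return bits, code


block = [[12, 35, 70],
         [85, 50, 65],
         [9, 44, 72]]
bits, code = lbp_code(block)
```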
Ahonen et al. [23] proposed that the LBP operator can be used with varying neighborhood size M × M and radius R to deal with different image scales. The notation $(P, R)$ represents P sampling points (neighbor pixels) around the center pixel at radius R. Thresholding is then performed by comparing the center pixel with the P neighbor pixels. Examples of some selected values of $(P, R)$ are shown in Figure 11.
LBP for face recognition proceeds by building local LBP descriptors that represent local regions and then combining them to obtain a global representation of the entire face. The encoded image $f_{LBP}(a,b)$ is evenly divided into non-overlapping blocks. A histogram is calculated for each block, and the final LBP feature vector is built by concatenating all regional histograms. The LBP operator provides essential spatial information that plays a key role in face recognition. The complete processing flow to generate the LBP feature vector is shown in Figure 12. Major advantages of LBP over other spatial feature representations are simple calculations, a comparatively small feature vector, and greater robustness to noise and illumination variation.
In recent years, several variants of LBP have been widely used in texture analysis. Local ternary patterns (LTP), proposed by Tan et al. [24], is based on a ternary threshold operator. LTP is an improved LBP variant that uses two LBP vectors to build one LTP representation. Other variants of LBP are compound local binary pattern (CLBP) [25], three-patch LBP (TPLBP) [26], four-patch LBP (FPLBP) [26] and improved local binary pattern (ILBP) [27]. These representations are reported to be more robust than LBP to illumination and noise conditions.
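The block-histogram construction of the LBP feature vector can be sketched as follows, using the basic 3 × 3 operator. The 2 × 2 grid, the clockwise neighbor order, and the choice to leave border pixels unencoded are assumptions made for illustration; the function names are hypothetical.

```python
# Offsets of the 8 neighbors, clockwise from the top-left (assumed order).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_image(image):
    """LBP-encode every interior pixel; the result is 2 rows and 2 columns smaller."""
    h, w = len(image), len(image[0])
    encoded = []
    for r in range(1, h - 1):
        row = []
        for c in range(1, w - 1):
            center = image[r][c]
            code = 0
            for n, (dr, dc) in enumerate(OFFSETS):
                if image[r + dr][c + dc] >= center:
                    code |= 1 << n
            row.append(code)
        encoded.append(row)
    return encoded


def lbp_feature_vector(image, grid=(2, 2)):
    """Concatenate 256-bin histograms of non-overlapping blocks of the encoded image."""
    encoded = lbp_image(image)
    h, w = len(encoded), len(encoded[0])
    bh, bw = h // grid[0], w // grid[1]
    feature = []
    for br in range(grid[0]):
        for bc in range(grid[1]):
            hist = [0] * 256
            for r in range(br * bh, (br + 1) * bh):
                for c in range(bc * bw, (bc + 1) * bw):
                    hist[encoded[r][c]] += 1
            feature.extend(hist)
    return feature
```

For a grid of $g_r \times g_c$ blocks the vector has $256 \cdot g_r \cdot g_c$ entries, which shows how the feature length grows with the number of regions.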

Local ternary patterns (LTP)
Local ternary patterns (LTP) [24] is a generalization of LBP with reduced sensitivity to noise and illumination variations. LTP generates a 3-valued code by including a threshold zone around zero, which improves resistance to noise. LTP works well for noisy images and different lighting conditions. In LBP, neighbor pixels are compared with the center pixel directly. Hence, a small variation in pixel values due to noise can drastically change the LBP code. To overcome this limitation, LTP introduces a threshold $\pm t$ around the center pixel $i_c$, and neighbor pixels are compared against it to generate a 3-valued ternary code as:
$$s'(c_n, c_p, t) = \begin{cases} 1, & c_n > c_p + t \\ 0, & |c_n - c_p| \leq t \\ -1, & c_n < c_p - t \end{cases} \tag{14}$$
Here, $c_p$ and $c_n$ represent the gray levels of the center pixel and the neighbor pixels respectively. The LTP encoding scheme that generates the ternary LTP code is illustrated in Figure 13. Here, the threshold $t$ is set to 5; hence, with a center pixel value of 40, the tolerance range is [35, 45]. Neighbor pixels with gray levels in this range are replaced by 0, those above it by 1, and those below it by −1, as described in Eq. (14).
The resultant ternary LTP code is split into two binary sub-LTP codes, which are treated as two separate channels as shown in Figure 14. The upper sub-LTP code is obtained by replacing '−1' in the ternary code with '0', and the lower sub-LTP code by replacing '1' with '0' and '−1' with '1'. Hence, LTP represents each original image by two encoded images.
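A minimal sketch of the ternary encoding and its split is given below. It assumes that gray levels on the boundary of the tolerance range are treated as inside it, matching the [35, 45] example, and it follows the split of Tan et al. [24], in which the upper code marks the +1 entries and the lower code marks the −1 entries. The neighbor order, function name, and sample block are illustrative.

```python
# Neighbor positions in a 3x3 block, clockwise from the top-left corner (assumed order).
NEIGHBOR_ORDER = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]


def ltp_codes(block, t=5):
    """Return (ternary, upper, lower) patterns for a 3x3 block."""
    center = block[1][1]
    ternary = []
    for r, c in NEIGHBOR_ORDER:
        diff = block[r][c] - center
        if diff > t:
            ternary.append(1)
        elif diff < -t:
            ternary.append(-1)
        else:
            ternary.append(0)           # inside the tolerance range [center - t, center + t]
    upper = [1 if v == 1 else 0 for v in ternary]
    lower = [1 if v == -1 else 0 for v in ternary]
    return ternary, upper, lower


block = [[45, 30, 60],
         [35, 40, 46],
         [34, 41, 20]]
ternary, upper, lower = ltp_codes(block, t=5)
```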

Compound local binary pattern (CLBP)
Compound local binary pattern (CLBP), proposed by Ahmed et al. [25], is an improved variant of LBP that uses a 2P-bit code. CLBP overcomes a limitation of LBP by improving performance on flat image regions. LBP performs poorly on images with bright spots or dark patches, i.e., it fails in flat image regions, as shown in Figure 15.
The original LBP generates a P-bit code by taking the gray level difference between the center pixel and P neighbor pixels (sampling points). CLBP extends LBP by generating a 2P-bit code for the P neighbor pixels. Here, the extra P bits encode the magnitude of the difference between the center pixel and the P neighbor pixels. This way, CLBP increases the robustness of the texture representation, mainly in flat image regions.
To generate the 2P-bit code, CLBP represents each neighbor pixel with two bits carrying sign and magnitude information. The first bit is the same as the LBP bit and represents the sign of the difference between the center pixel and the respective neighbor pixel. The second bit encodes the magnitude of the difference with respect to a calculated threshold $M_{ab}$. This threshold is the mean of the magnitudes of the differences between the center pixel and all P neighbor pixels.
The first bit is set to '1' if the gray level of the neighbor pixel is greater than or equal to that of the center pixel and '0' otherwise. The second bit is '1' if the absolute difference between the neighbor pixel and the center pixel is greater than the threshold and '0' otherwise. The CLBP encoding scheme that generates the 2P-bit code for a 3 × 3 neighborhood of an image is shown in Figure 16. A 16-bit CLBP code is generated after thresholding using Eq. (16). The resultant CLBP code is then split into two 8-bit sub-CLBP codes to reduce the number of possible binary patterns from $2^{16}$ to $2 \times 2^{8}$. The first 8-bit code is the concatenation of the bits from the pixels marked red in Figure 16(c). The second 8-bit code is obtained by concatenating the bit values from the remaining pixels. Finally, these sub-CLBP codes are treated as channels for the final feature vector representation.
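The two-bit encoding can be sketched as below. Which four neighbors form each sub-CLBP code depends on Figure 16(c), which is not reproduced here, so the split into diagonal and horizontal/vertical neighbors is an assumption; the neighbor order, function name, and sample block are also illustrative.

```python
# Neighbor positions in a 3x3 block, clockwise from the top-left corner (assumed order).
NEIGHBOR_ORDER = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]


def clbp_code(block):
    """Return the magnitude threshold, the (sign, magnitude) bit pairs,
    and two 8-bit sub-codes for a 3x3 block."""
    center = block[1][1]
    diffs = [block[r][c] - center for r, c in NEIGHBOR_ORDER]
    threshold = sum(abs(d) for d in diffs) / len(diffs)    # mean absolute difference
    pairs = [(1 if d >= 0 else 0, 1 if abs(d) > threshold else 0) for d in diffs]
    # Assumed split: even positions are the diagonal neighbors, odd positions the rest.
    sub_diagonal = [bit for pair in pairs[0::2] for bit in pair]
    sub_axial = [bit for pair in pairs[1::2] for bit in pair]
    return threshold, pairs, sub_diagonal, sub_axial


block = [[12, 35, 70],
         [85, 50, 65],
         [9, 44, 72]]
threshold, pairs, sub_diagonal, sub_axial = clbp_code(block)
```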
The processing flow to generate histograms of the CLBP encoded image for face recognition is shown in Figure 17. It explains how each pixel of the original image is converted into the CLBP encoded image. Figure 17(c) shows the two sub-CLBP encoded images. The histogram of each encoded image is obtained as in Figure 17(d). These histograms can be used individually as separate feature vectors for face recognition or concatenated into a single final vector.

Three-patch LBP (TPLBP)
The original LBP and its variants described so far generate a 1-bit value (or a 2-bit value for CLBP) by comparing two pixels: the center pixel and one of the P neighbor pixels. Wolf et al. [26] proposed two further variants of LBP, namely three-patch LBP (TPLBP) and four-patch LBP (FPLBP), which compare more than two pixels to generate each bit. TPLBP produces each bit of the code by comparing the gray levels of three patches. For each center pixel $i_c$, an M × M patch is considered, and P additional patches of the same size are selected with their centers on a ring of radius R. The center pixel $i_c$ is compared with the center pixels of two patches that are $\delta$ patches apart along the ring of radius R. This way, TPLBP generates a P-bit code for $i_c$ as:
$$TPLBP_{R,M,\delta}(i_c) = \sum_{m=0}^{P-1} f\big(d(c_m, c_p) - d(c_{(m+\delta) \bmod P}, c_p)\big)\, 2^m \tag{17}$$
Here, $c_p$ is the gray level of $i_c$, and $c_m$ and $c_{(m+\delta) \bmod P}$ are the gray levels of the center pixels of the $m$-th and $(m+\delta)$-th patches respectively. $d(\cdot)$ is the $L_2$ norm and $f$ is given as:
$$f(x) = \begin{cases} 1, & x \geq \tau \\ 0, & x < \tau \end{cases}$$
$\tau$ is a user-specified threshold selected slightly greater than zero (say $\tau = 0.01$) to obtain stability in flat regions. Figure 18 shows a sample TPLBP code generation using Eq. (17) for P = 8, δ = 2, and M = 3. The processing flow to obtain the TPLBP feature vector for face recognition is shown in Figure 19. The input facial image of size 64 × 64 is first represented as a TPLBP encoded image as in Figure 19(c). The TPLBP encoded image is then divided into non-overlapping patches of the same size and a histogram is obtained for each patch. These histograms are then concatenated to form the final TPLBP feature vector.
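A sketch of the three-patch comparison for one pixel follows. It is written under several assumptions: the distance $d$ is taken as the $L_2$ distance between whole M × M patches (as in Wolf et al. [26]), ring positions are rounded to the nearest pixel, patch indices wrap modulo P, and the pixel lies far enough from the border for all patches to fit. The function names are hypothetical.

```python
import math


def patch(image, r, c, m):
    """Flattened m x m patch centered at (r, c)."""
    half = m // 2
    return [image[r + dr][c + dc]
            for dr in range(-half, half + 1) for dc in range(-half, half + 1)]


def l2_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def tplbp_code(image, r, c, P=8, R=2, M=3, delta=2, tau=0.01):
    """P-bit three-patch LBP code of the pixel at (r, c)."""
    center = patch(image, r, c, M)
    ring = []
    for m in range(P):
        angle = 2 * math.pi * m / P
        rr = r + int(round(-R * math.sin(angle)))
        cc = c + int(round(R * math.cos(angle)))
        ring.append(patch(image, rr, cc, M))
    code = 0
    for m in range(P):
        diff = (l2_distance(ring[m], center)
                - l2_distance(ring[(m + delta) % P], center))
        if diff >= tau:                 # f(x) = 1 when x >= tau
            code |= 1 << m
    return code
```

With $\tau$ slightly above zero, a perfectly flat region yields the code 0 rather than a pattern driven by numerical noise.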

Four-patch LBP (FPLBP)
Four-patch LBP (FPLBP) [26] is an extension of TPLBP that compares the center pixels of four patches to generate a 1-bit value. Two rings with radii R1 and R2 (R1 < R2) are selected around the center pixel $i_c$, with P patches of size M × M on each ring. Two center-symmetric patches in the inner ring are compared with the corresponding patches in the outer ring at distance δ along the circle. This way, FPLBP generates a P/2-bit code for $i_c$ from P/2 pairs, as given by Eq. (20). Figure 20 shows a sample FPLBP code generation using Eq. (20) for P = 8, δ = 2, and M = 3. The processing flow to obtain the FPLBP feature vector for a sample facial image, similar to that of TPLBP, is shown in Figure 21.
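For reference, the four-patch code can be written in the form given by Wolf et al. [26], adapted to this chapter's symbols. The names $c_{1,m}$ and $c_{2,m}$, denoting the $m$-th patch on the inner and outer ring respectively, are notation assumed here for illustration; $d(\cdot)$ and $f(\cdot)$ are the distance and threshold functions used for TPLBP.

```latex
FPLBP_{R_1,R_2,P,M,\delta}(i_c) =
  \sum_{m=0}^{P/2-1}
  f\Big( d\big(c_{1,m},\; c_{2,(m+\delta) \bmod P}\big)
       - d\big(c_{1,m+P/2},\; c_{2,(m+P/2+\delta) \bmod P}\big) \Big)\, 2^{m}
```

Each term compares one inner-ring patch and its center-symmetric counterpart against outer-ring patches shifted by $\delta$ positions, which is why only $P/2$ bits are produced.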

Improved LBP (ILBP)
Improved LBP (ILBP), originally named CLBP (completed LBP), was proposed by Guo et al. [27]. It is termed ILBP here to distinguish its abbreviation from compound LBP (CLBP). In ILBP, a local region is represented by its center pixel and a local difference sign-magnitude transform (LDSMT). The complete processing flow to generate the ILBP code is shown in Figure 22. ILBP generates a 3P-bit code for P neighbor pixels. An original image is first represented in terms of a local threshold and a global threshold. The local threshold is then further decomposed into sign and magnitude components. Consequently, three representations of P bits are obtained, namely ILBP_Sign (ILBP_S), ILBP_Magnitude (ILBP_M) and ILBP_Global (ILBP_G), and combined to form the 3P-bit ILBP code.
Let $c_p$ and $c_n$ represent the gray levels of the center pixel $i_c$ and the P neighbor pixels respectively. The local threshold is generated by taking the difference $s_p = c_n - c_p$. The difference vector $s_p$ is further divided into two components, namely the magnitude of subtraction ($m_p$) and the sign of subtraction ($q_p$), as:
$$s_p = q_p \cdot m_p, \qquad m_p = |s_p|, \qquad q_p = \begin{cases} 1, & s_p \geq 0 \\ -1, & s_p < 0 \end{cases}$$
The ILBP encoding scheme that generates the 3P-bit ILBP code is illustrated in Figure 23. Figure 23(a) shows a 3 × 3 neighborhood with center pixel value 50. The result of local thresholding is shown in Figure 23(b) as [−38, −15, 20, 15, 22, −6, −41, 35]. After LDSMT, the sign and magnitude vectors are obtained. The original LBP uses only the sign vector, with −1 encoded as 0, so the LBP code for the above sample block is [0, 0, 1, 1, 1, 0, 0, 1]. Hence, LBP considers only the sign components of the subtraction, while ILBP combines three representations: ILBP_S, ILBP_M and ILBP_G. The local region around the center pixel is represented by LDSMT; thresholding with respect to sign leads to ILBP_S and thresholding with respect to magnitude leads to ILBP_M. Similarly, the image encoded using the global threshold is termed ILBP_G.
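The sign-magnitude decomposition can be reproduced for the sample block with a short sketch. The neighbor gray levels below are back-computed from the quoted differences and the center value 50. The magnitude threshold is passed in as a parameter because its choice (for example a mean magnitude) is an assumption not fixed by the text above; the function names are hypothetical, and ILBP_G is not covered.

```python
def ldsmt(center, neighbors):
    """Local difference sign-magnitude transform: s = q * m."""
    diffs = [n - center for n in neighbors]
    signs = [1 if d >= 0 else -1 for d in diffs]
    magnitudes = [abs(d) for d in diffs]
    return diffs, signs, magnitudes


def sign_bits(signs):
    """ILBP_S pattern: identical to the LBP bits (-1 encoded as 0)."""
    return [1 if q == 1 else 0 for q in signs]


def magnitude_bits(magnitudes, threshold):
    """ILBP_M pattern: 1 where the magnitude reaches the given threshold."""
    return [1 if m >= threshold else 0 for m in magnitudes]


neighbors = [12, 35, 70, 65, 72, 44, 9, 85]
diffs, signs, magnitudes = ldsmt(50, neighbors)
s_bits = sign_bits(signs)
m_bits = magnitude_bits(magnitudes, threshold=24)   # 24 is the mean magnitude of this block
```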
A comparative analysis of various spatial domain feature representations is given in Table 2.

Table 2. Comparative analysis of spatial domain feature representations.

LBP
Disadvantages:
• Not invariant to rotations.
• Size of the feature vector increases exponentially with the number of neighbors, leading to an increase in computational complexity in terms of time and space.
• The structural information captured is limited. Only the pixel difference is used; magnitude information is ignored.
• Performance decreases for flat images.

LPQ
Advantages:
• Performs better than LBP on images with blur, illumination and facial expression variations.
Disadvantages:
• The LPQ vector is about four times longer than an LBP vector with 8 neighbor pixels.

CLBP
Advantages:
• Gives better performance than LBP as it uses both the sign and the magnitude of the difference.
Disadvantages:
• The feature vector is long, which increases computation time.

LTP
Advantages:
• Resistant to noise.
Disadvantages:
• Not invariant under gray-scale transforms of intensity values, as its encoding is based on a fixed predefined threshold.

TPLBP
Advantages:
• Rotation invariant texture descriptor.
• Captures information about not only microstructure but also macrostructure.

FPLBP
Advantages:
• Rotation invariant texture descriptor.
• Captures information about not only microstructure but also macrostructure.
Disadvantages:
• More complex.

Result analysis for face recognition
Face recognition has been explored for many years, hence a large body of research exists in this domain. In this section, we present existing face recognition results and analysis based on different spatial domain representations. Deniz et al. [28] proposed face recognition using HOG features extracted from image patches of varying size, which resulted in improved accuracy. Recognition accuracy is evaluated on the FERET database with a best result of 95.4%. Other related work includes [29], which used EBGM-HOG and showed robustness to changes in illumination, rotation and small displacements. Some existing works on face recognition using SIFT features are [30,31]. These works have also used variants of SIFT such as volume-SIFT (VSIFT) and partial-descriptor-SIFT (PDSIFT), and learned SIFT at specific locations to improve verification accuracy. Face recognition using the LPQ feature representation is inspired by [18,19], which used LPQ as a blur invariant descriptor. Damane et al. [32] presented face recognition using LPQ under varying conditions of light, blur, and illumination. Experiments are performed on the extended YALE-B, CMU-PIE, and CAS-PEAL-R1 face databases, and the results showed that LPQ is more robust to light and illumination variation. Chan et al. [33] presented multiscale LPQ for face recognition and evaluated results on the FERET and BANCA face databases. Multiscale LPQ is obtained by applying varying filter sizes and combining the LPQ images, which are then projected into LDA space. Best results of 99.2% for FB, 92% for DP1 and 88% for DP2 are achieved on the FERET probe sets.
Face recognition using the LBP feature representation is one of the most researched areas [34][35][36][37][38]. Tan et al. [24] evaluated face recognition under varying lighting conditions using the LTP feature representation on the Extended Yale-B and CMU PIE face databases. They showed that LTP is more discriminant and less sensitive to noise in uniform regions, and that it improves results for flat images. Wolf et al. [26] proposed TPLBP and FPLBP features for face recognition. Accuracy results are validated on two well-known databases, labeled faces in the wild (LFW) and multi PIE. They showed that combining several descriptors from the same LBP family boosts the recognition rate. They reported best accuracies of 80.75% for TPLBP and 75.57% for FPLBP, obtained with the combination of ITML with MultiOSS ID and pose variation. Ahmed et al. [25] proposed CLBP features, an extension of LBP features, for facial expression recognition. Results are verified on the Cohn-Kanade (CK) facial expression database, with CLBP features classified using an SVM classifier. They showed that the classification rate can be affected by adjusting the number of regions into which expression images are partitioned. For this, they considered three cases, dividing images into 3 × 3, 5 × 5, and 7 × 6 patches. The best accuracy for CLBP is 94.4%, obtained with the 5 × 5 partition.

Conclusion
This chapter presents well-known and some recently explored spatial feature representations for face recognition. These feature representations are scale, translation and rotation invariant for 2-D face images. The chapter covers the HOG, SIFT and LBP feature representations and the complete processing flow to generate feature vectors using these representations for face recognition. SIFT and HOG, based on computing image gradients and local extrema, are commonly used feature representations for face recognition. LBP performs texture-based analysis to represent local facial appearance as an encoded facial image. Other relevant spatial domain representations, such as LPQ and variants of LBP, are explained and analyzed for face recognition. LPQ possesses the blur invariance property and provides improved results for blurred facial images. Different variants of LBP, such as LTP, CLBP, TPLBP and FPLBP, are more robust to noise and lighting conditions. These representations characterize facial features more effectively and yield discriminative feature vectors for face recognition.

the research grant. The sanctioned project title is "Design and development of an Automatic Kinship Verification system for Indian faces with possible integration of AADHAR Database." with reference no. ECR/2016/001659.