Open access peer-reviewed chapter

Feature Extraction Using Observer Gaze Distributions for Gender Recognition

Written By

Masashi Nishiyama

Reviewed: 13 December 2021 Published: 04 February 2022

DOI: 10.5772/intechopen.101990

From the Edited Volume

Recent Advances in Biometrics

Edited by Muhammad Sarfraz


Abstract

We determine and use the gaze distribution of observers viewing images of subjects for gender recognition. In general, people look at informative regions when determining the gender of subjects in images. Based on this observation, we hypothesize that the regions corresponding to the concentration of the observer gaze distributions contain discriminative features for gender recognition. We generate the gaze distribution from observers while they perform the task of manually recognizing gender from subject images. Next, our gaze-guided feature extraction assigns high weights to the regions corresponding to clusters in the gaze distribution, thereby selecting discriminative features. Experimental results show that the observers mainly focused on the head region, not the entire body. Furthermore, we demonstrate that the gaze-guided feature extraction significantly improves the accuracy of gender recognition.

Keywords

  • gaze distribution
  • region of interest
  • feature extraction
  • pedestrian image
  • gender recognition

1. Introduction

Gender recognition, which is of interest in the field of soft biometrics, plays a role in collecting statistical data about people in public spaces. Furthermore, gender recognition has many potential applications, such as video surveillance and consumer behavior analysis. Gender recognition experiments are often conducted on pedestrians captured on video. Researchers have proposed several methods for automatically recognizing gender in pedestrian images; many of these techniques use convolutional neural networks (CNNs) [1]. The existing methods can extract discriminative features for gender recognition and obtain highly accurate results when many training samples containing diverse pedestrian images are acquired in advance. However, collecting a sufficient number of training samples is very time-consuming. Unfortunately, deep learning methods typically require these large training sets to maintain suitable recognition performance.

People quickly and correctly recognize gender; thus, we believe that people effectively extract visual features from subjects in images. For instance, people correctly recognize gender from facial images [2, 3]. It may be possible to reproduce human visual abilities in a computer algorithm with a small number of training samples and achieve a recognition performance equivalent to that of humans. Existing methods [4, 5] attempt to mimic human visual abilities for object recognition tasks. These methods use a saliency map generated from low-level features [6, 7, 8]. However, such saliency maps do not sufficiently represent human visual abilities because they are not directly measured from human observers. We therefore consider that the existing methods disregard important mechanisms of human vision.

An increasing number of pattern recognition studies, specifically those attempting to mimic human visual ability, have measured the gaze distributions of observers [9, 10, 11, 12]. These gaze distributions have great potential for collecting informative features for various recognition tasks. Several techniques [13, 14] have demonstrated that the gaze distribution facilitates the extraction of informative features. Sattar et al. [13] applied the gaze distribution to analyze fashion in images. Murrugarra-Llerena and Kovashka [14] applied the gaze distribution to attribute prediction in facial images. However, none of these existing methods that use observer gaze distributions address gender recognition from pedestrian images. We consider that the regions of interest measured from observers’ gaze are also effective for gender recognition.

Here, we conduct a gaze measurement experiment in which observers perform a gender recognition task on images of subjects. We investigate whether the gaze distribution measured from the observers facilitates gender recognition. Figure 1 shows an overview of our gaze-guided feature extraction. We generate a task-oriented gaze distribution from the gaze locations recorded while observers manually determined the genders of subjects in images. High values in a task-oriented gaze distribution correspond to regions that observers frequently view. We assume that these regions contain discriminative features for gender recognition because they appear to be useful when the observers are determining the subject’s gender. When extracting features to train the gender classifier, larger weights are assigned to the regions of the pedestrian images corresponding to the attention regions of the task-oriented gaze distribution. The experimental results indicate that our gaze-guided feature extraction improves gender recognition accuracy when using a CNN technique with a small number of training samples.

Figure 1.

Overview of our gaze-guided feature extraction. We consider that the regions where the gaze distribution concentrates contain discriminative features for gender recognition because they appear to be useful when the observers are tackling the gender recognition task.


2. Generating a task-oriented gaze distribution for gender recognition

2.1 Observer gaze distribution in gender recognition

We discuss which body regions of subjects in images observers frequently view when recognizing gender. In an analytical study of facial images, Hsiao et al. [15] reported that people looked at the nose region when they recognized others. We consider that the human face is a key factor in gender recognition. Furthermore, we consider that the entire body, including the chest, waist, and legs, is also helpful. Thus, we aim to reveal the body regions that tend to attract the observers’ gaze during a gender recognition task. Note that we assume that the pedestrian images have been pre-aligned using pedestrian detection techniques. The details of our method are described below.

2.2 Generating a task-oriented gaze distribution

To generate a task-oriented gaze distribution, we use a gaze tracker to acquire gaze locations while the observer views a pedestrian image on a screen. We briefly describe our method in Figure 2. We work with $P$ participating observers and $N$ pedestrian images. Given a gaze location $(x_f, y_f)$ in a certain frame $f$, the gaze distribution $g_{p,n,f}(x,y)$ is computed as

Figure 2.

Overview of our method for generating a gaze distribution $\tilde{g}(x,y)$.

$$
g_{p,n,f}(x,y) =
\begin{cases}
1 & \text{if } x = x_f \text{ and } y = y_f, \\
0 & \text{otherwise},
\end{cases}
\tag{1}
$$

where $p$ is an observer, and $n$ is a pedestrian image. Note that the observer not only looks at the point $(x_f, y_f)$ on each pedestrian image, but also at the region surrounding this point. Thus, we apply a Gaussian kernel to the measured gaze distribution $g_{p,n,f}(x,y)$. Figure 3 illustrates the parameters used to determine the size $k$ of the Gaussian kernel. We compute the following equation:

Figure 3.

Parameters used to determine the kernel size for generating the gaze distribution.

$$
k = \frac{2dh}{l}\tan\frac{\theta}{2},
\tag{2}
$$

where $\theta$ represents the angle of the region surrounding $(x_f, y_f)$, $l$ represents the screen’s vertical length, $h$ represents the screen’s vertical resolution, and $d$ represents the distance from the participant to the screen. We aggregate the per-frame distributions $g_{p,n,f}(x,y)$ into $g_{p,n}(x,y)$, which represents the gaze distribution for a particular pedestrian image, as

$$
g_{p,n}(x,y) = \sum_{f=1}^{F_{p,n}} k(u,v) * g_{p,n,f}(x,y),
\tag{3}
$$

where $F_{p,n}$ corresponds to the time (in frames) taken by observer $p$ to recognize the gender of the subject in image $n$. Function $k(u,v)$ represents a Gaussian kernel of size $k \times k$, and the operator $*$ represents convolution. Our method then performs L1-norm normalization so that $\sum_{x,y} g_{p,n}(x,y) = 1$. We aggregate $g_{p,n}(x,y)$ into a single gaze distribution across all observers and all pedestrian images. The aggregated gaze distribution $g(x,y)$ is represented as

$$
g(x,y) = \sum_{p=1}^{P}\sum_{n=1}^{N} g_{p,n}(x,y).
\tag{4}
$$

Note that we apply a scaling technique to the aggregated gaze distribution as follows: $\tilde{g}(x,y) = g(x,y) / \max_{x,y} g(x,y)$. The result $\tilde{g}(x,y)$ is the final task-oriented gaze distribution.
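To make the computation concrete, the following Python sketch implements Eqs. (1)–(4) and the final scaling. It is a minimal sketch under some assumptions not stated above: gaze locations are already given in image coordinates, the $k \times k$ Gaussian kernel is approximated by scipy's gaussian_filter with a standard deviation tied to $k$, and the nested data layout gaze_records[p][n] is hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def task_oriented_gaze_distribution(gaze_records, height, width, kernel_size):
    """Compute the task-oriented gaze distribution g~(x, y) of Eqs. (1)-(4).

    gaze_records[p][n] is a list of per-frame gaze locations (x_f, y_f)
    for observer p viewing pedestrian image n (hypothetical layout).
    """
    sigma = kernel_size / 6.0  # assumption: keep most kernel mass within k x k
    g = np.zeros((height, width), dtype=np.float64)
    for per_observer in gaze_records:            # observers p = 1..P
        for gaze_points in per_observer:         # images n = 1..N
            g_pn = np.zeros((height, width), dtype=np.float64)
            for x_f, y_f in gaze_points:         # frames f = 1..F_{p,n}, Eq. (1)
                if 0 <= int(y_f) < height and 0 <= int(x_f) < width:
                    g_pn[int(y_f), int(x_f)] += 1.0
            g_pn = gaussian_filter(g_pn, sigma=sigma)  # Gaussian smoothing, Eq. (3)
            if g_pn.sum() > 0:
                g_pn /= g_pn.sum()               # L1-norm normalization
            g += g_pn                            # aggregation, Eq. (4)
    return g / g.max()                           # scaling to obtain g~(x, y)
```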


3. Experiments to generate a task-oriented gaze distribution

3.1 Setup

We evaluated the task-oriented gaze distributions for gender recognition. We acquired the gaze locations of $P = 14$ participating observers (seven men and seven women, Japanese students, average age 22.6 ± 1.3 years). We used a display screen (size 53.1 × 29.9 cm, 1920 × 1080 pixels). The vertical distance between the screen and the participant was set to 65 cm, as illustrated in Figure 4. The height from the floor to the eyes of the participant was between 110 cm and 120 cm. The participants sat on a chair in a room with no direct sunlight (illuminance 825 lx). We used a standing eye tracker (GP3 Eye Tracker, sampling rate 60 Hz). We asked the participants to perform a gender recognition task: to determine whether the pedestrian in an image is a man or a woman. We then determined which regions of the entire body the participants viewed to complete this task.

Figure 4.

Setup used to acquire the gaze distribution in a gender recognition task.

We used 4563 pedestrian images from the CUHK dataset included in the PETA dataset [16] with gender labels (woman or man). From this dataset, we used the $N = 8$ pedestrian images shown in Figure 5 in the observer experiment to generate the gaze distribution map. We selected the four pedestrian images at the top of Figure 5 so that the four directions (front, back, left, and right) were equally represented. We selected the remaining pedestrian images in Figure 5 in the same manner. When displaying the stimulus images on the screen, the pedestrian images were enlarged from 80×160 pixels to 480×960 pixels. We shifted the stimulus images’ positions by adding random offsets to avoid a center bias [17, 18].

Figure 5.

Pedestrian images for generating task-oriented gaze distributions during the gender recognition task.

We acquired the gaze distribution while the participants performed the gender recognition task according to the following procedure (a minimal presentation-loop sketch is given after the list):

P1. A gray image is shown on the screen for one second.

P2. A pedestrian stimulus image is shown on the screen for two seconds.

P3. A black image is shown on the screen for two seconds, and the participant replied whether the pedestrian was a woman or a man.

P4. We repeated P1 to P3 until all eight pedestrian images had been displayed in random order.
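A minimal sketch of this presentation loop, assuming OpenCV is used for display and that the random-offset placement described above is handled by a separate, hypothetical helper passed in as place_with_random_offset:

```python
import random

import cv2
import numpy as np


def run_session(stimulus_paths, place_with_random_offset, window="stimulus"):
    """Present the eight pedestrian stimuli following steps P1-P4."""
    gray = np.full((1080, 1920, 3), 128, np.uint8)
    black = np.zeros((1080, 1920, 3), np.uint8)
    order = list(stimulus_paths)
    random.shuffle(order)                                  # P4: random order
    for path in order:
        cv2.imshow(window, gray)
        cv2.waitKey(1000)                                  # P1: gray image, 1 s
        stim = place_with_random_offset(cv2.imread(path))  # random offset (hypothetical helper)
        cv2.imshow(window, stim)
        cv2.waitKey(2000)                                  # P2: pedestrian image, 2 s
        cv2.imshow(window, black)
        cv2.waitKey(2000)                                  # P3: black image, 2 s (verbal reply)
    cv2.destroyAllWindows()
```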

In our preliminary experiment, we observed that participants first assessed the position of the pedestrian image on the screen and then, after establishing the position of the image, attempted to complete the gender recognition task. To determine $F_{p,n}$, we therefore set the start time to the point at which the gaze first stayed on the pedestrian image for more than 440 ms, and the end time to the point at which the pedestrian image disappeared from the screen. In this scenario, the average $F_{p,n}$ between the start and end times was 1.56 ± 0.38 s. The participating observers achieved a gender recognition accuracy of 100.0%.
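One possible implementation of this start-time rule is sketched below. It assumes that gaze samples arrive at the tracker's 60 Hz rate and that the stimulus region is given as a bounding box; neither detail is specified in this exact form above.

```python
def start_frame(gaze_xy, bbox, fps=60, min_ms=440):
    """Return the first frame index at which the gaze has stayed inside the
    stimulus bounding box (x_min, y_min, x_max, y_max) for more than min_ms."""
    need = int(round(min_ms / 1000.0 * fps))
    run = 0
    for f, (x, y) in enumerate(gaze_xy):
        inside = bbox[0] <= x < bbox[2] and bbox[1] <= y < bbox[3]
        run = run + 1 if inside else 0
        if run >= need:
            return f - need + 1   # start of the qualifying fixation
    return None                   # gaze never settled on the stimulus
```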

We set $\theta = 3°$ in Eq. (2) by considering the range of the fovea, which is approximately two degrees (as described in [19]), and the error of the eye tracker, which is about one degree (as described in the tracker’s specification sheet). We used a kernel size of $k = 125$ for the enlarged pedestrian images (480×960 pixels). The gaze distribution images were then downsized to 80×160 pixels to match the original size of the pedestrian images. This standardized the size of the test samples and training samples input to the gender classifier.
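As a quick check, plugging the values above into Eq. (2) reproduces a kernel size close to the one used:

```python
import math

d = 65.0                     # distance from participant to screen [cm]
l = 29.9                     # vertical screen length [cm]
h = 1080                     # vertical screen resolution [pixels]
theta = math.radians(3.0)    # visual angle around the gaze location

k = 2.0 * d * h / l * math.tan(theta / 2.0)   # Eq. (2)
print(round(k))              # ~123 pixels, i.e., roughly the k = 125 used here
```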

3.2 Results

Figure 6 shows examples of the measured gaze distributions $g_{p,n}(x,y)$ for the gender recognition task on a pedestrian image. We show the gaze distribution maps from two participants for the pedestrian image shown in Figure 6(a). The dark regions in the gaze distribution maps represent the gaze locations recorded from the participants by the eye tracker. The minimum (black) and maximum (white) intensities in Figure 6 represent the maximum and minimum values of the measured $g_{p,n}(x,y)$, respectively. We observed that participants frequently concentrated their gaze on the head region to complete the gender recognition task.

Figure 6.

Examples of measured gaze distributions $g_{p,n}(x,y)$ from two participants. (a) Stimulus image of a pedestrian. (b) and (c) Gaze distributions measured from each participant viewing the pedestrian image in (a).

Figure 7 shows the overall task-oriented gaze distribution $\tilde{g}(x,y)$ for gender recognition synthesized from all of the participating observers. To study the properties of the task-oriented gaze distribution, we verify how the gaze distribution aligns with the pedestrian images of Figure 5. We see that the region corresponding to the head gathered a large number of gaze locations, while the regions around the lower body and the background gathered few gaze locations.

Figure 7.

Task-oriented gaze distribution $\tilde{g}(x,y)$ for the gender recognition task.


4. Feature extraction algorithm using the task-oriented gaze distribution for gender recognition

4.1 Overview of our gaze-guided feature extraction

Here, we describe our method for extracting features using the task-oriented gaze distribution for gender recognition. The regions corresponding to high values in the distribution $\tilde{g}(x,y)$ appear to contain informative features because participants focus on these regions when manually recognizing gender in the pedestrian images. Thus, we assume that these regions contain discriminative features for the gender classifiers. Based on this assumption, we extract these features by assigning higher weights to the regions corresponding to high values in the task-oriented gaze distribution. Figure 8 provides an overview of our method. Our method assigns weights using $\tilde{g}(x,y)$ to both the test samples and the training samples; therefore, we do not need to acquire gaze distributions for the test samples. Our method extracts the weighted features and applies deep learning and machine learning techniques to obtain the final classification.

Figure 8.

Overview of our gaze-guided feature extraction using the gaze distribution $\tilde{g}(x,y)$.

4.2 Procedure

Given a gaze distribution $\tilde{g}(x,y)$, our method computes the weight $\tilde{w}(x,y)$ for each pixel as

$$
\tilde{w}(x,y) = C\bigl(\tilde{g}(x,y)\bigr).
\tag{5}
$$

We use a correction function C that weakens or emphasizes values according to the density of the gaze distribution.

We calculate a weighted intensity $i_w(x,y)$ from an original intensity $i(x,y)$ as follows:

$$
i_w(x,y) = \tilde{w}(x,y)\, i(x,y).
\tag{6}
$$

We generate a feature vector for gender recognition by raster scanning $i_w(x,y)$. The RGB images are converted to the CIE L*a*b* color space. Note that our method weights the L* values and does not change the a* and b* values. We modify only the lightness, without any color changes, because a numerical change in the L* channel corresponds to a change in lightness as perceived by humans.
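A sketch of this weighting and feature extraction step, assuming OpenCV's 8-bit L*a*b* conversion and a weight map of the same resolution as the pedestrian image; gaze_weighted_feature and its correction argument are illustrative names, not part of the method's stated interface.

```python
import cv2
import numpy as np


def gaze_weighted_feature(image_bgr, g_tilde, correction=lambda z: z):
    """Apply Eqs. (5)-(6): weight only the lightness channel, then raster-scan.

    image_bgr : pedestrian image (H x W x 3, BGR as loaded by OpenCV)
    g_tilde   : task-oriented gaze distribution g~(x, y), values in [0, 1]
    correction: correction function C applied pixel-wise, Eq. (5)
    """
    lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    w = correction(g_tilde.astype(np.float32))          # w~(x, y) = C(g~(x, y))
    lab[:, :, 0] *= w                                    # weight L* only, Eq. (6)
    weighted = cv2.cvtColor(lab.clip(0, 255).astype(np.uint8),
                            cv2.COLOR_LAB2BGR)
    return weighted.reshape(-1).astype(np.float32)       # raster-scanned feature vector
```

In the experiments of Section 5.1, the weighted image would additionally be down-sampled (e.g., to 40 × 80 pixels) before raster scanning.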


5. Evaluation of the gender recognition performance using the gaze distribution

5.1 Comparison of weight correction functions for feature extraction

We evaluated the accuracy of gender recognition using various correction functions. We used the gaze distribution $\tilde{g}(x,y)$ shown in Figure 7. We randomly selected pedestrian images from the CUHK dataset, which is included in the PETA dataset [16]. We equalized the ratio of woman and man samples in the test sets and training sets to avoid problems associated with imbalanced data. The same individual did not appear in both the training and test samples. We used 2720 pedestrian images as training samples and test samples, and applied 10-fold cross-validation for gender recognition. Both the training and test samples contained not only frontal poses but also side and back poses. We evaluated the gender recognition performance as the accuracy of predicting the woman or man class labels. We generated feature vectors by raster scanning RGB values with downsampling (40×80×3 dimensions) from the weighted pedestrian images. We used a linear support vector machine classifier (penalty parameter $C = 1$) to establish the baseline performance of gender recognition; results for other classifiers are shown in Section 5.2. We compared the accuracy of the following correction functions:

F1. $C(z) = z$,

F2. $C(z) = \min(1, z^{a} + b)$,

F3. $C(z) = 1 - \min(1, z^{a} + b)$, and

F4. $C(z) = 1$.

Figure 9 shows a visualization of the correction functions $C(z)$. We determined the parameters of the gender classifier using a grid search over the validation sets. These validation sets consisted of the remaining pedestrian images from the CUHK dataset not used in the test sets and training sets. Parameters $\{a, b\}$ were set to $\{0.75, 0.21\}$.

Figure 9.

Visualization of the correction functions $C(z)$.
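Written out in Python, the four correction functions take the form below (following the reconstruction of F2 and F3 given above, with a = 0.75 and b = 0.21). Each is applied pixel-wise to $\tilde{g}(x,y)$ to obtain the weight map of Eq. (5), after which the weighted feature vectors are fed to the linear SVM described above.

```python
import numpy as np

a, b = 0.75, 0.21  # parameters chosen by grid search on the validation sets


def f1(z):
    return z                                 # F1: weight directly by g~(x, y)


def f2(z):
    return np.minimum(1.0, z**a + b)         # F2: emphasize frequently viewed regions


def f3(z):
    return 1.0 - np.minimum(1.0, z**a + b)   # F3: inverse of F2 (weakens those regions)


def f4(z):
    return np.ones_like(z)                   # F4: no weighting (original image)
```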

Figure 10(a) shows pedestrian images after applying $C(z)$. Function F1 outputs an intensity weighted by the gaze distribution at each pixel. Function F2 emphasizes the intensity around the face using the gaze distribution. In contrast, function F3 weakens this intensity. Function F4 directly outputs the intensity of the original pedestrian image.

Figure 10.

Gender recognition accuracy. (a) Examples of test pedestrian images after applying correction functions. F1 and F2 show the results of our gaze-based feature extraction. (b) Comparison of gender recognition accuracy using each gaze-guided weight correction function with a linear support vector machine classifier.

Figure 10(b) shows the gender recognition accuracy of each gaze-guided weight correction function. We confirmed that the accuracy of F1 and F2 was superior to that of F4. Thus, the use of the gaze distribution $\tilde{g}(x,y)$ appears to increase the performance of gender recognition. F2 yielded better performance than F1, indicating that the correction function further improves gender recognition accuracy. The inverse weights of F3 decreased the accuracy compared with the other correction functions. Thus, we demonstrate that the regions corresponding to the observer gaze distribution $\tilde{g}(x,y)$, measured from participants completing a gender recognition task, contain discriminative features for the gender classifier.

5.2 Combining our gaze-guided feature extraction with existing classifiers

We investigated the gender recognition performance by combining our gaze-based feature extraction technique with representative classifiers. We used Mini-CNN architecture [20], which is a small network with few convolutional layers. We also used a large margin nearest neighbor (LMNN) classifier [21], which is a metric learning technique. The test samples and training samples described in Section 5.1 were used in the evaluation. We applied 10-fold cross-validation. Table 1 shows the accuracy for gender recognition with and without our gaze-guided feature extraction. Our gaze-based feature extraction method leads to improved gender recognition for both classifiers.

Condition | Accuracy using CNN | Accuracy using LMNN
With our gaze-guided feature extraction | 79.6 ± 2.2 | 78.5 ± 1.1
Without our gaze-guided feature extraction | 75.3 ± 3.1 | 76.0 ± 2.7

Table 1.

Accuracy (%) of gender recognition by combining our gaze-guided feature extraction with existing classifiers.
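The Mini-CNN architecture of [20] is not detailed in this chapter; as a stand-in, the following PyTorch sketch shows a comparably small network that accepts the gaze-weighted 80 × 160 pedestrian images and predicts the woman/man label. All layer sizes are illustrative assumptions, not the architecture used in the experiments.

```python
import torch
import torch.nn as nn


class SmallGenderCNN(nn.Module):
    """A small CNN in the spirit of Mini-CNN [20]; layer sizes are assumptions."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 160 x 80 -> 80 x 40
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 80 x 40 -> 40 x 20
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 2)),
        )
        self.classifier = nn.Linear(64 * 4 * 2, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: gaze-weighted images, shape (batch, 3, 160, 80)
        h = self.features(x)
        return self.classifier(h.flatten(1))
```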

5.3 Evaluation of assigning weights using saliency maps

We evaluated the gender recognition accuracy of a method that uses saliency maps. We used the existing methods of Zhang et al. [7] and Zhu et al. [8] to generate the saliency maps. Figure 11 shows the saliency maps used in the evaluation of gender recognition. We scaled the intensity of each saliency map to fit the range [0, 1]. We performed feature extraction using the saliency map instead of the task-oriented gaze distribution $\tilde{g}(x,y)$; that is, our method assigned large weights to the regions of the test samples and training samples corresponding to high saliency values before using a CNN classifier. We evaluated the accuracy under the same conditions as in Section 5.2. Table 2 compares the gender recognition accuracy obtained using our task-oriented gaze distribution with that obtained using the existing saliency map approaches. The results indicate that our gaze-guided feature extraction method outperforms the use of saliency maps for gender recognition.

Figure 11.

Examples of saliency maps used in gender recognition. (a) Test pedestrian images. (b), (c) generated saliency maps.

Our gaze distribution | Zhang et al.’s saliency map | Zhu et al.’s saliency map
79.6 ± 2.2% | 66.9 ± 2.5% | 66.8 ± 2.8%

Table 2.

Gender recognition accuracy (%) using our task-oriented gaze distribution compared with using the existing saliency maps.
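Swapping a saliency map in for the gaze distribution only changes the weight map; a brief sketch of the scaling step, assuming an arbitrary-range saliency map s(x, y), is shown below. The helper gaze_weighted_feature and correction f2 refer to the illustrative sketches given earlier, not to the authors' implementation.

```python
import numpy as np


def normalize_saliency(s):
    """Scale a saliency map to [0, 1] so it can replace g~(x, y) in Eq. (5)."""
    s = s.astype(np.float32)
    s -= s.min()
    if s.max() > 0:
        s /= s.max()
    return s

# Example (names from the earlier sketches):
# feature = gaze_weighted_feature(image_bgr, normalize_saliency(saliency), correction=f2)
```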

5.4 Visualization of the regions of focus when using CNNs

We conducted an experiment to visualize the regions of focus in a pedestrian image during gender recognition. To this end, we used gradient-weighted class activation mapping (Grad-CAM) [22]. Figure 12 shows the visualization results of the regions of focus of the CNN method. In (a), we show the pedestrian test images for gender recognition. In (b), we show the visualization results without our gaze-guided feature extraction, using only a conventional CNN (a fine-tuned VGG16 model). In the woman test samples, the model emphasized the leg and waist regions. In the man test samples, the model emphasized the shoulder and head regions. This indicates that the conventional CNN emphasizes various body part regions for gender recognition, but in a different manner from the participating observers in the experiments of Section 3.2. In (c), we show the visualization results using our gaze distribution maps for gender recognition. We used our gaze-guided feature extraction with Mini-CNN, as described in Section 5.2. We confirmed that our method mainly emphasizes the head region, mimicking the human observers’ gaze behavior. In particular, we consider that our method recognizes gender by focusing on the hairstyle of the subject in an image, because it emphasized the regions containing the boundary between the head and the background.

Figure 12.

Regions of focus of the gender classifier when performing gender recognition. We used CNNs and Grad-CAM. (a) Test pedestrian images. (b) Results without the use of the gaze distribution $\tilde{g}(x,y)$. (c) Results with our gaze-guided feature extraction.
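A compact Grad-CAM sketch in PyTorch, implemented directly with forward and backward hooks rather than a dedicated library; the model, target layer, and class index are supplied by the caller and are assumptions here.

```python
import torch
import torch.nn.functional as F


def grad_cam(model, image, target_layer, class_idx):
    """Return a Grad-CAM heat map [22] for one image of shape (1, 3, H, W)."""
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["value"] = output

    def bwd_hook(_, __, grad_output):
        gradients["value"] = grad_output[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        scores = model(image)
        model.zero_grad()
        scores[0, class_idx].backward()          # gradient of the class score
    finally:
        h1.remove()
        h2.remove()

    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)   # pooled gradients
    cam = F.relu((weights * activations["value"]).sum(dim=1))     # weighted activation maps
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[2:],
                        mode="bilinear", align_corners=False).squeeze()
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)     # normalize to [0, 1]
```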


6. Conclusions

We hypothesized that the gaze distribution measured from observers performing a gender recognition task facilitates the extraction of discriminative features. We demonstrated that the gaze distribution measured during a manual gender recognition task tended to concentrate on specific regions of the pedestrian’s body. We represented this informative region as a task-oriented gaze distribution for a gender classifier. Owing to the efficacy of the task-oriented gaze distribution for feature extraction, our method increased gender recognition accuracy compared with the same classifiers without gaze guidance and with weighting based on saliency maps.

As part of our future work, we will expand our analytical study to explore the differences in gaze distributions with respect to observer nationality and ethnicity. Furthermore, we intend to generate gaze distributions for various tasks beyond gender recognition, such as evaluating impressions of subjects’ clothing in images.


Acknowledgments

This work was partially supported by JSPS KAKENHI Grant No. JP20K11864.

References

  1. Fayyaz M, Yasmin M, Sharif M, Raza M. J-LDFR: Joint low-level and deep neural network feature representations for pedestrian gender classification. Neural Computing and Applications. 2021;33:361-391
  2. Bruce V, Burton AM, Hanna E, Healey P, Mason O, Coombes A, et al. Sex discrimination: How do we tell the difference between male and female faces? Perception. 1993;22(2):131-152
  3. Burton AM, Bruce V, Dench N. What’s the difference between men and women? Evidence from facial measurement. Perception. 1993;22(2):153-176
  4. Walther D, Itti L, Riesenhuber M, Poggio T, Koch C. Attentional selection for object recognition—A gentle way. In: Proceedings of the Second International Workshop on Biologically Motivated Computer Vision. Berlin Heidelberg: Springer; 2002. pp. 472-479
  5. Zhu JY, Wu J, Xu Y, Chang E, Tu Z. Unsupervised object class discovery via saliency-guided multiple class learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2015;37(4):862-875
  6. Itti L, Koch C, Niebur E. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1998;20(11):1254-1259
  7. Zhang J, Sclaroff S, Lin X, Shen X, Price B, Mech R. Minimum barrier salient object detection at 80 fps. In: Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society; 2015. pp. 1404-1412
  8. Zhu W, Liang S, Wei Y, Sun J. Saliency optimization from robust background detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society; 2014. pp. 2814-2821
  9. Xu M, Ren Y, Wang Z. Learning to predict saliency on face images. In: Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society; 2015. pp. 3907-3915
  10. Fathi A, Li Y, Rehg JM. Learning to recognize daily actions using gaze. In: Proceedings of the 12th European Conference on Computer Vision. Berlin Heidelberg: Springer; 2012. pp. 314-327
  11. Xu J, Mukherjee L, Li Y, Warner J, Rehg JM, Singh V. Gaze-enabled egocentric video summarization via constrained submodular maximization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society; 2015. pp. 2235-2244
  12. Karessli N, Akata Z, Schiele B, Bulling A. Gaze embeddings for zero-shot image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society; 2017. pp. 4525-4534
  13. Sattar H, Bulling A, Fritz M. Predicting the category and attributes of visual search targets using deep gaze pooling. In: Proceedings of the IEEE International Conference on Computer Vision Workshops. IEEE Computer Society; 2017. pp. 2740-2748
  14. Murrugarra-Llerena N, Kovashka A. Learning attributes from human gaze. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision. IEEE Computer Society; 2017. pp. 510-519
  15. Hsiao JH, Cottrell G. Two fixations suffice in face recognition. Psychological Science. 2008;19(10):998-1006
  16. Deng Y, Luo P, Loy CC, Tang X. Pedestrian attribute recognition at far distance. In: Proceedings of the 22nd ACM International Conference on Multimedia. Association for Computing Machinery; 2014. pp. 789-792
  17. Bindemann M. Scene and screen center bias early eye movements in scene viewing. Vision Research. 2010;50(23):2577-2587
  18. Buswell GT. How People Look at Pictures: A Study of the Psychology of Perception in Art. Chicago, IL: University of Chicago Press; 1935
  19. Fairchild MD. Color Appearance Models. 3rd ed. New York City: Wiley; 2013
  20. Antipov G, Berrani SA, Ruchaud N, Dugelay JL. Learned vs. hand-crafted features for pedestrian gender recognition. In: Proceedings of the 23rd ACM International Conference on Multimedia. Association for Computing Machinery; 2015. pp. 1263-1266
  21. Weinberger KQ, Saul LK. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research. 2009;10:207-244
  22. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society; 2017. pp. 618-626
