Abstract
We measure the gaze distribution of observers viewing images of subjects and use it for gender recognition. In general, people look at informative regions when determining the gender of subjects in images. Based on this observation, we hypothesize that the regions where the observer gaze distribution concentrates contain discriminative features for gender recognition. We generate the gaze distribution from observers while they perform the task of manually recognizing gender from subject images. Next, our gaze-guided feature extraction assigns high weights to the regions corresponding to clusters in the gaze distribution, thereby selecting discriminative features. Experimental results show that the observers mainly focused on the head region, not the entire body. Furthermore, we demonstrate that the gaze-guided feature extraction significantly improves the accuracy of gender recognition.
Keywords
- gaze distribution
- region of interest
- feature extraction
- pedestrian image
- gender recognition
1. Introduction
Gender recognition, a topic of interest in soft biometrics, supports the collection of statistical data about people in public spaces. Furthermore, gender recognition has many potential applications, such as video surveillance and consumer behavior analysis. Gender recognition experiments are often conducted on pedestrians captured on video. Researchers have proposed several methods for automatically recognizing gender in pedestrian images; many of these techniques use convolutional neural networks (CNNs) [1]. The existing methods can extract discriminative features for gender recognition and achieve highly accurate results when many training samples containing diverse pedestrian images are acquired in advance. However, collecting a sufficient number of training samples is very time-consuming. Unfortunately, deep learning methods typically require such large training sets to maintain suitable recognition performance.
People quickly and correctly recognize gender; thus, we believe that people effectively extract visual features from subjects in images. For instance, people correctly recognize gender from facial images [2, 3]. It may be possible to reproduce human visual abilities in a computer algorithm with a small number of training samples and achieve a recognition performance equivalent to that of humans. Existing methods [4, 5] have been proposed to mimic human visual abilities for object recognition tasks. These methods used a saliency map generated from low-level features [6, 7, 8]. However, such saliency maps do not sufficiently represent human visual abilities because they are not directly measured from human observers. We thus consider that the existing methods disregard the underlying mechanisms of human vision.
An increasing number of pattern recognition studies, specifically those attempting to mimic human visual abilities, have measured the gaze distribution of observers [9, 10, 11, 12]. This gaze distribution has great potential for collecting informative features for various recognition tasks. Several techniques [13, 14] have demonstrated that the gaze distribution facilitates the extraction of informative features. Sattar et al. [13] applied the gaze distribution to analyze fashion in images. Murrugarra-Llerena and Kovashka [14] applied the gaze distribution to attribute prediction in facial images. However, existing methods that use observer gaze distributions have not addressed gender recognition from pedestrian images. We consider that the region of interest measured from observers’ gaze is also effective for gender recognition.
Here, we conduct a gaze measurement experiment in which observers perform a gender recognition task on images of subjects. We investigate whether the gaze distribution measured from the observers facilitates gender recognition. Figure 1 shows an overview of our gaze-guided feature extraction. We generate a task-oriented gaze distribution from the gaze locations recorded while observers manually determined the genders of subjects in images. High values in a task-oriented gaze distribution correspond to regions that observers frequently view. We assume that these regions contain discriminative features for gender recognition because they appear to be useful when observers determine the gender of a subject. When extracting features to train the gender classifier, larger weights are assigned to the regions of the pedestrian images corresponding to the attention regions of the task-oriented gaze distribution. The experimental results indicate that our gaze-guided feature extraction improves gender recognition accuracy when using a CNN technique with a small number of training samples.
2. Generating a task-oriented gaze distribution for gender recognition
2.1 Observer gaze distribution in gender recognition
We discuss which body regions of subjects in images are frequently viewed by observers during gender recognition. In an analytical study of facial images, Hsiao et al. [15] reported that people look at the nose region when recognizing others. We consider the human face to be a key factor in gender recognition. Furthermore, we consider that the entire body, including the chest, waist, and legs, is also helpful. Thus, we aim to reveal the body regions that tend to attract the observer gaze distribution during a gender recognition task. Note that we assume that the pedestrian images have been pre-aligned using pedestrian detection techniques. The details of our method are described below.
2.2 Generating a task-oriented gaze distribution
To generate a task-oriented gaze distribution, we use a gaze tracker to acquire gaze locations while the observer views a pedestrian image on a screen. We briefly describe our method in Figure 2. The gaze locations recorded for each pedestrian image are aggregated into a gaze distribution over the image, and the distributions obtained from all observers and stimulus images are combined into a single task-oriented gaze distribution. Note that we apply a scaling technique to the aggregated gaze distributions so that their values share a common range.
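As a concrete illustration of this aggregation, the minimal sketch below pools gaze locations in image coordinates, smooths them with a Gaussian kernel, and min-max scales the result; the Gaussian smoothing, the bandwidth `sigma`, and the [0, 1] range are assumptions made for this sketch rather than the exact formulation of our method.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def task_oriented_gaze_distribution(gaze_points, image_shape, sigma=15.0):
    """Aggregate gaze locations into a smoothed, scaled distribution.

    gaze_points: list of (x, y) gaze locations in pixels, pooled over all
        observers and stimulus images that share the same alignment.
    image_shape: (height, width) of the aligned pedestrian images.
    sigma: smoothing bandwidth in pixels (an assumed value, not a published setting).
    """
    h, w = image_shape
    counts = np.zeros((h, w), dtype=np.float64)
    for x, y in gaze_points:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < h and 0 <= xi < w:
            counts[yi, xi] += 1.0          # accumulate raw gaze counts per pixel
    dist = gaussian_filter(counts, sigma)  # smooth counts into a density
    # Scale so the distribution can be used directly as per-pixel weights.
    dist -= dist.min()
    if dist.max() > 0:
        dist /= dist.max()
    return dist
```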
3. Experiments to generate a task-oriented gaze distribution
3.1 Setup
We evaluated the task-oriented gaze distributions for gender recognition. We acquired gaze locations from the participants using a gaze tracker while they viewed pedestrian stimulus images on a screen.
We used 4563 pedestrian images from the CUHK dataset included in the PETA dataset [16] with gender labels (woman or man). From this dataset, we selected images to serve as stimuli in the gaze measurement experiment.
We acquired the gaze distribution while participants performed the gender recognition task according to the following procedure (a minimal sketch of the trial loop is given after the list):
P1. A gray image is shown on the screen for one second.
P2. A pedestrian stimulus image is shown on the screen for two seconds.
P3. A black image is shown on the screen for two seconds, and the participant replies whether the pedestrian is a woman or a man.
P4. Steps P1 to P3 are repeated until all eight pedestrian images have been displayed in random order.
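In the sketch below, `show_image` and `collect_response` are hypothetical helpers standing in for the actual stimulus presentation and response logging, and the gaze tracker is assumed to record continuously in the background.

```python
import random

def run_gender_recognition_session(stimulus_paths, show_image, collect_response):
    """Run one gaze-measurement session following steps P1-P4."""
    responses = []
    order = random.sample(stimulus_paths, len(stimulus_paths))  # P4: random order
    for path in order:
        show_image("gray", 1.0)    # P1: gray screen for one second
        show_image(path, 2.0)      # P2: pedestrian stimulus for two seconds
        show_image("black", 2.0)   # P3: black screen while the participant answers
        responses.append((path, collect_response()))  # recorded "woman" or "man"
    return responses
```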
In our preliminary experiment, we observed that participants first assessed the position of the pedestrian image on the screen and then, after establishing the position of the image, attempted to complete the gender recognition task.
3.2 Results
Figure 6 shows examples of the measured gaze distributions.
Figure 7 shows the overall task-oriented gaze distribution aggregated over all participants and stimulus images. The distribution concentrates mainly on the head region of the subjects rather than on the entire body.
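For reference, overlays of the gaze distribution on pedestrian images, similar in spirit to Figures 6 and 7, can be produced with a short matplotlib snippet such as the following; the colormap and blending weight are arbitrary visualization choices.

```python
import matplotlib.pyplot as plt

def overlay_gaze_distribution(image, gaze_distribution, alpha=0.5):
    """Overlay a [0, 1]-scaled gaze distribution on a pedestrian image."""
    fig, ax = plt.subplots()
    ax.imshow(image)                                        # pedestrian image (H x W x 3)
    ax.imshow(gaze_distribution, cmap="jet", alpha=alpha)   # gaze heat map
    ax.axis("off")
    return fig
```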
4. Feature extraction algorithm using the task-oriented gaze distribution for gender recognition
4.1 Overview of our gaze-guided feature extraction
Here, we describe our method for extracting features using the task-oriented gaze distribution for gender recognition. The regions corresponding to high values in the distribution receive larger weights during feature extraction because we assume that they contain discriminative features for gender recognition.
4.2 Procedure
We take the task-oriented gaze distribution generated in Section 2 as a per-pixel attention map aligned with the input pedestrian image.
We use a correction function to convert the values of the gaze distribution into per-pixel weights.
We calculate a weighted intensity for each pixel by multiplying the pixel intensity of the pedestrian image by the corresponding weight.
We generate a feature vector for gender recognition by raster scanning the weighted intensities (a sketch of this pipeline follows).
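The following minimal sketch assumes the gaze distribution has already been resized to the image resolution and uses an identity correction function as a placeholder; the correction functions compared in Section 5 are not reproduced here.

```python
import numpy as np

def gaze_guided_features(image_gray, gaze_distribution, correction=None):
    """Extract a gaze-weighted feature vector from a grayscale pedestrian image.

    image_gray: 2-D array of pixel intensities.
    gaze_distribution: 2-D array in [0, 1], assumed already resized to the
        image resolution (the resizing step is omitted in this sketch).
    correction: optional function mapping gaze values to weights; the
        identity is used as a placeholder correction function.
    """
    if correction is None:
        correction = lambda g: g                         # placeholder correction
    weights = correction(gaze_distribution)              # per-pixel weights from gaze values
    weighted = image_gray.astype(np.float64) * weights   # weighted intensities
    return weighted.ravel()                              # raster scan into a feature vector
```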
5. Evaluation of the gender recognition performance using the gaze distribution
5.1 Comparison of weight correction functions for feature extraction
We evaluated the accuracy of gender recognition using various correction functions. We used the task-oriented gaze distribution generated in Section 3 and compared the following four correction functions:
F1.
F2.
F3.
F4.
Figure 9 shows a visualization of the correction functions.
Figure 10(a) shows pedestrian images after applying each correction function.
Figure 10(b) shows the gender recognition accuracy obtained with each gaze-guided weight correction function. We confirmed that the accuracy of F1 and F2 was superior to that of F4. Thus, weighting features with the gaze distribution improves gender recognition accuracy.
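Purely as an illustration of what such correction functions can look like, the hypothetical examples below implement an identity weighting, a power-law emphasis, and a hard threshold; they are stand-ins for exposition and should not be read as the F1 to F4 functions evaluated in Figure 10.

```python
import numpy as np

# Hypothetical correction functions; stand-ins, not the evaluated F1-F4.
def identity_correction(g):
    return g                        # use the scaled gaze values directly as weights

def power_correction(g, gamma=0.5):
    return np.power(g, gamma)       # compress the range so weakly viewed regions keep some weight

def threshold_correction(g, tau=0.2, floor=0.1):
    return np.where(g >= tau, g, floor)  # suppress rarely viewed regions to a floor value
```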
5.2 Combining our gaze-guided feature extraction with existing classifiers
We investigated the gender recognition performance obtained by combining our gaze-guided feature extraction technique with representative classifiers. We used a fine-tuned VGG16 CNN and a large margin nearest neighbor (LMNN) classifier [21].
Condition | Accuracy using CNN | Accuracy using LMNN
---|---|---
With our gaze-guided feature extraction | |
Without our gaze-guided feature extraction | |

Accuracy (%) of gender recognition by combining our gaze-guided feature extraction with existing classifiers.
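For the CNN condition, a minimal fine-tuning sketch is given below; it assumes the gaze-weighted images are already wrapped in a standard PyTorch DataLoader yielding (image, label) pairs, and the hyperparameters and optimizer are placeholder choices rather than the settings used in our experiments.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_finetuned_vgg16(num_classes=2):
    """VGG16 pre-trained on ImageNet with a replaced final layer."""
    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    model.classifier[6] = nn.Linear(4096, num_classes)  # woman / man
    return model

def finetune(model, loader, epochs=5, lr=1e-4, device="cpu"):
    """Fine-tune on gaze-weighted images; hyperparameters are placeholders."""
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```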
5.3 Evaluation of assigning weights using saliency maps
We evaluated the gender recognition accuracy of a method that uses saliency maps. We used the existing methods of Zhang et al. [7] and Zhu et al. [8] to generate saliency maps. Figure 11 shows the saliency maps used in the evaluation of gender recognition. We scaled the intensity of each saliency map to fit the range [0, 1]. We then performed feature extraction using the saliency map instead of the task-oriented gaze distribution.
Our gaze distribution | Zhang et al.’s saliency map | Zhu et al.’s saliency map
---|---|---
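A minimal sketch of this substitution is given below; the min-max scaling follows the description above, while the reuse of the gaze_guided_features sketch from Section 4 is an assumption for illustration.

```python
import numpy as np

def scale_to_unit_range(saliency_map):
    """Min-max scale a saliency map so its intensities lie in [0, 1]."""
    s = saliency_map.astype(np.float64)
    s -= s.min()
    if s.max() > 0:
        s /= s.max()
    return s

# The scaled saliency map is then used in place of the gaze distribution,
# e.g. with the gaze_guided_features sketch from Section 4:
# features = gaze_guided_features(image_gray, scale_to_unit_range(saliency))
```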
5.4 Visualization of the regions of focus when using CNNs
We conducted an experiment to visualize the regions of focus in a pedestrian image during gender recognition. To this end, we used gradient-weighted class activation mapping (Grad-CAM) [22]. Figure 12 shows the visualization results of the regions of focus of the CNN method. In (a), we show the pedestrian test images for gender recognition. In (b), we show the visualization results without our gaze-guided feature extraction; here, we used only the conventional CNN of the VGG16 model with fine-tuning. In the woman test samples, the model emphasized the leg and waist regions. In the man test samples, the model emphasized the shoulder and head regions. This indicates that the conventional CNN emphasizes various body part regions for gender recognition, but in a different manner than that used by the participating observers in the experiments of Section 3.2. In (c), we show the visualization results using our gaze distribution maps for gender recognition. Here, we used our gaze-guided feature extraction with the same fine-tuned VGG16 model.
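A compact sketch of the Grad-CAM computation [22] is given below; the choice of target layer and the normalization are assumptions for illustration and are not necessarily those used to produce Figure 12.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class, target_layer):
    """Minimal Grad-CAM sketch for a fine-tuned VGG16.

    image: tensor of shape (1, 3, H, W); target_layer: a convolutional module,
    e.g. model.features[28] for torchvision's VGG16 (an assumed layer choice).
    """
    activations, gradients = {}, {}

    def fwd_hook(_m, _inp, out):
        activations["value"] = out.detach()

    def bwd_hook(_m, _gin, gout):
        gradients["value"] = gout[0].detach()

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        model.eval()
        scores = model(image)
        model.zero_grad()
        scores[0, target_class].backward()
    finally:
        h1.remove()
        h2.remove()

    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)  # pooled gradients
    cam = F.relu((weights * activations["value"]).sum(dim=1))    # weighted activation map
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[2:],  # upsample to image size
                        mode="bilinear", align_corners=False)[0, 0]
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam  # (H, W) heat map in [0, 1]
```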
6. Conclusions
We hypothesized that the gaze distribution measured from observers performing a gender recognition task facilitates the extraction of discriminative features. We demonstrated that the gaze distribution measured during a manual gender recognition task tended to concentrate on specific regions of the pedestrian’s body. We represented the informative region as a task-oriented gaze distribution for a gender classifier. Owing to the efficacy of the task-oriented gaze distribution for feature extraction, our method achieved higher gender recognition accuracy than representative classifiers without gaze guidance and than weighting based on saliency maps.
As part of our future work, we will expand our analytical study to explore the differences in gaze distributions with respect to observer nationality and ethnicity. Furthermore, we intend to generate gaze distributions for various tasks beyond gender recognition, such as evaluating impressions of subjects’ clothing in images.
Acknowledgments
This work was partially supported by JSPS KAKENHI Grant No. JP20K11864.
References
- 1. Fayyaz M, Yasmin M, Sharif M, Raza M. J-LDFR: Joint low-level and deep neural network feature representations for pedestrian gender classification. Neural Computing and Applications. 2021;33:361-391
- 2. Bruce V, Burton AM, Hanna E, Healey P, Mason O, Coombes A, et al. Sex discrimination: How do we tell the difference between male and female faces? Perception. 1993;22(2):131-152
- 3. Burton AM, Bruce V, Dench N. What’s the difference between men and women? Evidence from facial measurement. Perception. 1993;22(2):153-176
- 4. Walther D, Itti L, Riesenhuber M, Poggio T, Koch C. Attentional selection for object recognition—A gentle way. In: Proceedings of the Second International Workshop on Biologically Motivated Computer Vision. Berlin Heidelberg: Springer; 2002. pp. 472-479
- 5. Zhu JY, Wu J, Xu Y, Chang E, Tu Z. Unsupervised object class discovery via saliency-guided multiple class learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2015;37(4):862-875
- 6. Itti L, Koch C, Niebur E. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1998;20(11):1254-1259
- 7. Zhang J, Sclaroff S, Lin X, Shen X, Price B, Mech R. Minimum barrier salient object detection at 80 fps. In: Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society; 2015. pp. 1404-1412
- 8. Zhu W, Liang S, Wei Y, Sun J. Saliency optimization from robust background detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society; 2014. pp. 2814-2821
- 9. Xu M, Ren Y, Wang Z. Learning to predict saliency on face images. In: Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society; 2015. pp. 3907-3915
- 10. Fathi A, Li Y, Rehg JM. Learning to recognize daily actions using gaze. In: Proceedings of the 12th European Conference on Computer Vision. Berlin Heidelberg: Springer; 2012. pp. 314-327
- 11. Xu J, Mukherjee L, Li Y, Warner J, Rehg JM, Singh V. Gaze-enabled egocentric video summarization via constrained submodular maximization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society; 2015. pp. 2235-2244
- 12. Karessli N, Akata Z, Schiele B, Bulling A. Gaze embeddings for zero-shot image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society; 2017. pp. 4525-4534
- 13. Sattar H, Bulling A, Fritz M. Predicting the category and attributes of visual search targets using deep gaze pooling. In: Proceedings of the IEEE International Conference on Computer Vision Workshops. IEEE Computer Society; 2017. pp. 2740-2748
- 14. Murrugarra-Llerena N, Kovashka A. Learning attributes from human gaze. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision. IEEE Computer Society; 2017. pp. 510-519
- 15. Hsiao JH, Cottrell G. Two fixations suffice in face recognition. Psychological Science. 2008;19(10):998-1006
- 16. Deng Y, Luo P, Loy CC, Tang X. Pedestrian attribute recognition at far distance. In: Proceedings of the 22nd ACM International Conference on Multimedia. Association for Computing Machinery; 2014. pp. 789-792
- 17. Bindemann M. Scene and screen center bias early eye movements in scene viewing. Vision Research. 2010;50(23):2577-2587
- 18. Buswell GT. How People Look at Pictures: A Study of the Psychology of Perception of Art. Chicago, IL: University of Chicago Press; 1935
- 19. Fairchild MD. Color Appearance Models. 3rd ed. New York City: Wiley; 2013
- 20. Antipov G, Berrani SA, Ruchaud N, Dugelay JL. Learned vs. hand-crafted features for pedestrian gender recognition. In: Proceedings of the 23rd ACM International Conference on Multimedia. Association for Computing Machinery; 2015. pp. 1263-1266
- 21. Weinberger KQ, Saul LK. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research. 2009;10:207-244
- 22. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society; 2017. pp. 618-626