Open access peer-reviewed chapter

Perspective on Dark-Skinned Emotion Recognition Using Deep-Learned and Handcrafted Feature Techniques

Written By

Martins E. Irhebhude, Adeola O. Kolawole and Goshit Nenbunmwa Amos

Submitted: 05 December 2022 Reviewed: 30 December 2022 Published: 25 January 2023

DOI: 10.5772/intechopen.109739

From the Edited Volume

Emotion Recognition - Recent Advances, New Perspectives and Applications

Edited by Seyyed Abed Hosseini


Abstract

Image recognition has been widely used in various fields of application such as human-computer interaction, where it can enhance the fluency, accuracy, and naturalness of interaction. The need to automate decisions on human expression is high. This paper presents a technique for emotion recognition and classification based on a combination of deep-learned and handcrafted features. Residual Network (ResNet) and Rotation Invariant Local Binary Pattern (RILBP) features were combined and used as features for classification. The aim is to identify and classify emotions from dark-skinned facial images. The Facial Expression Recognition 2013 (FER2013) dataset and a self-captured dark-skinned dataset were used for the experiments and validation. The results showed 93.4% accuracy on the FER2013 dataset and 95.5% on the self-captured dataset, demonstrating the effectiveness of the proposed model.

Keywords

  • emotion recognition
  • facial expression
  • ResNet learned features
  • facial emotion
  • self-constructed features

1. Introduction

Understanding and classifying images is a very simple job for humans but a costly task for computers [1]. Computer vision gives the computer a capability similar to the human brain's for understanding information from images [2]. Identifying and classifying human facial expressions is a challenging and interesting research area; it involves understanding facial features and their behavior. A number of facial characteristics must be retrieved from the expression of a particular individual in order to perform expression recognition. Emotion recognition has been widely used in several fields of application such as security surveillance, teaching, and neuromarketing. Classifying facial features into one of the many different categories of emotion is necessary for emotion recognition [3].

Emotion is a state of mind that includes multiple behaviors, acts, thoughts, and feelings; throughout communication, emotion plays a central role [4]. According to Pal and Sudeep [5], emotion recognition is the process of identifying human emotion, most typically from facial expressions. The different types of expressions, namely joy, sadness, surprise, and anger, are spontaneous, consciously felt mental states accompanied by physical changes in the muscles of the face that produce facial expressions. Happy, sad, angry, disgusted, fearful, surprised, and neutral are a few of the crucial facial expressions used to recognize emotion [3].

The terms Deep Neural Network (DNN) and Deep Learning (DL) refer to a multi-layered Artificial Neural Network (ANN). DL has been one of the most discussed Artificial Intelligence (AI) techniques for image classification in the last few decades; it offers powerful tools and is very common in the literature because of its ability to manage large amounts of data. DL achieved its successes over classical machine-learning models because of its multiple deeper layers [6]. Several DL models are trained using large labeled datasets and neural network architectures that learn features automatically from the data without manual feature extraction.

Recent technological advances have led to the use of Artificial Intelligence systems, as these systems are capable of understanding and realizing emotion recognition through facial features. This is also an attempt to demonstrate the latest technical advances for human-computer interaction using Deep Learning or Convolutional Neural Network (CNN) models [7]. Recently, Deep Learning has become a vital tool for numerous applications; DL models are made up of various processing layers for data representations with various degrees of abstraction. These layers are composed of simple but non-linear modules, each transforming the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level [8]. Deep-learning techniques perform well in tackling a wide range of computer vision problems that cannot be solved by conventional machine-learning methods. The accuracy of deep-learning applications has surpassed that of traditional applications in a number of computer-vision tasks, along with breaking previous records for tasks like image recognition [9]. Though deeper neural networks are more difficult to train, He and Zhang [10] offered a residual learning framework (ResNet) to make it simpler to train networks that are much deeper than those previously employed. Instead of learning unreferenced functions, the study deliberately reformulated the layers to learn residual functions with reference to the layer inputs. The authors offered in-depth empirical proof that these residual networks are simpler to optimize and can improve accuracy over considerably increased depth. The currently popular CNN algorithm, combined with the ResNet-50 residual network, has achieved good results in multi-classification tasks [11].

In image recognition, the need to automate decisions on human expression is particularly pressing, bearing in mind that determining which features to extract manually is next to impossible because of the variance in facial features across races and genders and individual-specific differences such as accident victims or those disfigured from birth [12].

In this research, image classification is carried out using a combination of deep-learned and handcrafted features for the classification of facial expressions. A Rotation Invariant Local Binary Pattern (RILBP) and a pre-trained ResNet representation of images are employed for feature extraction. Deep residual networks with normalized initialization have been shown to increase accuracy when training deeper neural networks by largely eliminating the problems of degradation and vanishing/exploding gradients.

Limited work has been done with images of some races, dark-skinned people specifically, leading to little or no such datasets in recent times, which, in turn, affects the accuracy of results when models are tested on this group [13]. Hence, automating this process would go a long way toward balancing demand and service, and training such a dataset with a deep-learning algorithm is most likely the optimal solution to this problem. This study will also generate a dataset that will help handle emotion classification for dark-skinned people.

The remainder of this paper is organized as follows: Section 2 provides the background of the study; Section 3 reviews the related works. The proposed method is introduced and detailed in Section 4. The experiments and results are presented in Section 5, and a comparison with selected state-of-the-art methods is given in Section 6. Finally, Section 7 provides the conclusion and some perspectives for future research.

2. Background of the study

The background of the Residual Network and the Rotation Invariant Local Binary Pattern is discussed in this section.

2.1 Residual networks architecture

The accuracy of CNN-based emotion classification systems has been improved through pre- or post-processing and the development of new algorithms and architectures. Szegedy and Liu [14] reveal that network depth (stacking more layers) is of great importance, but this comes with the problem of vanishing/exploding gradients, which hampers convergence from the beginning. Normalized initialization and intermediate normalization layers largely address this problem [15].

In deep-learning models, more layers are added to learn more complex problems, but this addition leads to performance degradation and saturated accuracy; the degradation is not caused by overfitting [16]. When a network overfits, training error decreases while test error increases; with degradation, higher training error is reported because deeper networks are more difficult to train [17].

He and Zhang [10] introduced the deep ResNet learning framework, made from residual blocks, to address these problems. The core idea of the residual block is the skip connection, in which a connection skips one or more layers. Skipping effectively simplifies the network by using fewer layers in the initial training stage. ResNet has different architectures that differ in the number of layers, such as ResNet-34 with 34 layers, ResNet-152 with 152 layers, and ResNet-18 with 18 layers. These are plain network architectures inspired by VGG-19 to which the shortcut connection is added [16]. ResNet is made up of convolutional layers, residual blocks, and fully connected layers. It uses the concept of residual learning, in which the residual of a feature with respect to the input of that layer is learned via shortcut connections. It has been proven that residual learning can improve the performance of model training and also resolve the problem of degrading accuracy in deep networks [18].

ResNet makes it simpler to train networks that are much deeper than those previously employed. Instead of learning unreferenced functions, its authors deliberately reformulated the layers to learn residual functions with reference to the layer inputs. They offered in-depth empirical proof that these residual networks are simpler to optimize and can improve accuracy over considerably increased depth. The currently popular convolutional neural network algorithm, combined with the ResNet-50 residual network, has achieved good results in multi-classification tasks [11]. The ResNet-18 model has been used by Huang and Liu [19] for the identification of different grades of aluminum scrap with improved identification efficiency and reduced equipment cost. The ResNet-18 network model was trained on three different datasets using RGB, HSV, and LBP representations, and the results showed that RGB was the best. The authors concluded that, with hyperparameter optimization of the ResNet-18 model, the final classification and recognition accuracy could reach 100% and effectively achieve the classification of different grades of aluminum scrap.

Figure 1 shows the ResNet-18 architecture; the structure depicts an aluminum block image with an input size of 224 pixels × 224 pixels × 3 channels. In the neural network structure, conv represents a convolutional layer, which uses 3 × 3 filters; downsampling is performed by convolutional layers with a stride of 2.

Figure 1.

Architecture of ResNet [20].

Max pool is the maximum pooling layer, avg. pool is the average pooling layer, and FC is the fully connected layer; the first convolutional layer uses a 7 × 7 kernel with 64 channels and a stride of 2 [19]. A total of eighteen layers exist in the architecture (17 convolutional layers, a fully connected layer, and an additional softmax layer to perform the classification task). Throughout the network, residual shortcut connections are inserted between layers. There are two types of connections: the first, denoted by solid lines, is used when the input and output have the same dimensions; the second, denoted by dotted lines, is used when the dimensions increase. The layers are stacked to learn a residual mapping; the mapping function, denoted by H(x) and depicted in Eq. (1), is fitted by a few stacked layers. The hypothesis behind the residual layer is that if several nonlinear layers can asymptotically approximate a challenging mapping function, then they can likewise approximate the residual function, denoted F(x) in Eq. (2) [21].

The underlying mapping is given by:

H(x) = F(x) + x    (1)

The residual function is given by

F(x) = H(x) - x    (2)

with x representing the input to the layer.
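For illustration, a minimal sketch of a basic residual block implementing Eq. (1) is shown below. PyTorch is assumed here purely for illustration; the chapter does not specify an implementation framework, and the sketch covers only the solid-line (identity) shortcut case.

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x), as in Eq. (1)."""

    def __init__(self, channels):
        super().__init__()
        # F(x): two 3x3 convolutions with batch normalization
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # solid-line (identity) shortcut
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))           # F(x)
        return self.relu(out + identity)          # H(x) = F(x) + x
```

When the dimensions increase (the dotted-line shortcuts in Figure 1), the identity would instead pass through a strided 1 × 1 convolution so that F(x) and x can be added.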

2.2 Rotation invariant local binary pattern (RILBP)

The RILBP is based on uniform local binary pattern (LBP) histograms. The LBP descriptor is widely used in texture analysis because of its computational simplicity and robustness to illumination changes. The LBP approach labels the image pixels by thresholding the 3 × 3 neighborhood of each pixel with the center value and summing the thresholded values weighted by powers of two [22]. RILBP extends LBP through the use of different neighborhood sizes. Rotation-invariant texture analysis provides texture features that are invariant to the rotation angle of the input texture image. These features are used to train different classifiers such as neural networks, Nearest Neighbor, and SVM [23]. Further details are as reported in [24].
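As a concrete sketch (an illustration only, not the exact procedure of [24]), a rotation-invariant LBP histogram can be computed with scikit-image; with 8 neighbors there are 36 distinct rotation-invariant codes, which matches the 36-dimensional RILBP descriptor used later in Section 4.3.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def rotation_invariant_codes(points=8):
    """Canonical (minimum-rotation) codes of all P-bit patterns; 36 codes for P=8."""
    mask = (1 << points) - 1
    codes = set()
    for c in range(1 << points):
        rotations = [((c >> r) | (c << (points - r))) & mask for r in range(points)]
        codes.add(min(rotations))
    return sorted(codes)

def rilbp_histogram(gray_image, points=8, radius=1):
    """Normalized histogram of rotation-invariant LBP codes (length 36 for P=8)."""
    lbp = local_binary_pattern(gray_image, P=points, R=radius, method="ror")
    codes = rotation_invariant_codes(points)
    return np.array([(lbp == c).mean() for c in codes])
```

The histogram is normalized so that descriptors computed from images of different sizes remain comparable.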

3. Review of related literature

Emotion identification aims at recognizing a human's emotions; the emotion may be inferred either from the face or from verbal communication. Bodapati and Veeranjaneyulu [25] recognized human emotions from facial expressions using the Extended Cohn-Kanade (CK+) benchmark dataset. Based on the experimental results, the authors stated that the images were better portrayed by unsupervised features than by handcrafted features. When a face-detection algorithm was used, the accuracy was 86.04%; without it, the proposed model gave an accuracy of 81.36%.

To maximize the efficiency of the emotion recognition system, Cai and Hu [26] suggested a multimodal model of emotion recognition from speech and text. A CNN and long short-term memory (LSTM) were combined in the form of binary channels to learn acoustic emotion features, while textual features were captured with an effective bidirectional long short-term memory (Bi-LSTM) network. To learn and identify the fused features, they used a deep neural network (DNN). Experiments were carried out on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, increasing the accuracy of text emotion recognition by 6.70% and that of speech emotion recognition by 13.85%. Their experimental findings show that the multimodal recognition performance is higher than that of the single modality and outperforms other multimodal models reported on the test datasets, with an accuracy of 71.25%.

Fei and Jiao [27] proposed real-time facial expression classification based on a voting mechanism to increase recognition rates of facial expression classification in real time; various neural network models were designed to learn the facial features. Experiments showed that the average recognition rates for the FER2013, CK+, and JAFFE databases were 74.58, 100, and 100%, respectively. Compared with other models, their recognition method had superior performance, improved recognition, and algorithm robustness.

Ansari and Singh [28] presented the implementation of a deep-learning CNN model. The architecture was an adaptation of an image-processing CNN, programmed in Python using the Keras model-level library and a TensorFlow backend. For five emotions (happiness, fear, sadness, neutral, anger), the model achieved a mean accuracy of 79.33%, which was comparable with performances reported in the scientific literature.

Sujanaa and Palanivel [29] used datasets comprising mouth images in the form of video frames to categorize emotions into happy, normal, and surprise. Histogram of Oriented Gradients (HOG) and LBP features were extracted, while an SVM and a one-dimensional convolutional neural network were trained to detect these emotions, achieving accuracies of 97.44 and 98.51%, respectively.

Minaee and Minaei [30] proposed a deep-learning approach based on attentional convolutional network that concentrated on important sections of the face and made substantial improvements on multiple datasets, including Facial Expression Recognition 2013 (FER2013), Cohn-Kanade (known as CK+), Facial Expression Research Group Database (FERG), and Japanese Female Facial Expression (JAFFE), compared with previous models. A visualization technique was employed based on the performance of the classifier, and it was able to identify key facial regions for different emotions, showing sensitivity to different parts of the face.

A Haar cascade classifier and a CNN model were used by Shirisha and Buddha [31]. The study reported an accuracy of 62% for facial emotion detection on the FER2013 dataset and suggested the use of transfer learning, more datasets, and different combinations of convolutional layers.

Kalaivani and Sathyapriya [12] addressed methods for the extraction and detection of mouth regions using Viola-Jones and image-cropping techniques for facial expression detection. The mouth area was extracted, and facial emotions were graded according to white pixel values in the facial image. Edge-based segmentation was applied to extract mouth-region features. The expression was detected by measuring the area of the mouth region and from the shape and size of the region.

According to Kim and Saurous [32], machine-based emotion recognition is a daunting job, but it has great potential to enable empathic human-machine communication. The authors proposed a model that incorporated features proven to be useful for emotion recognition and a DNN to leverage temporal information when recognizing emotional status. The Berlin Emotion Speech Database (EMO-DB) benchmark was used for evaluation, achieving a recognition rate of 88.9% and outperforming other state-of-the-art algorithms.

Santhoshkumar and Kalaiselvi [33] analyzed human emotional states from full-body movements using a feed-forward deep CNN architecture and a Block Average Intensity Value (BAIV) feature. Both models were tested on the emotion-action dataset (University of York) with 15 forms of emotion. The experimental results showed better recognition efficiency for the feed-forward deep CNN architecture, with 90.09% accuracy, compared to BAIV with 80.03% accuracy.

According to Selvapriya and Maria [34], identifying human facial expressions is not a simple task due to lighting, facial occlusions, face color/shape, and other circumstances. Social emotion classification was addressed in their research with an artificial neural network (ANN), deep learning, and a rich hybrid neural network (HNN). The study also focused on a state-of-the-art hybrid deep-learning approach, incorporating a CNN for the spatial features of individual frames and long short-term memory (LSTM) for the temporal features of consecutive frames. The analyzed methodologies were implemented in MATLAB. The CNN model achieved greater classification accuracy compared to the rich HNN and ANN schemes. With a specific amount of data, CNN, HNN, and ANN performed with accuracies of 90, 70, and 58%, respectively. The performance assessment showed specific advantages and disadvantages for each approach.

Classification is very important for organizing data so that it is easily available. Mohamed [35] explored four popular machine-learning and data-mining techniques used for classification: Decision Tree, ANN, K-Nearest-Neighbor (KNN), and SVM. The study applied each technique to different datasets; each technique had its own advantages and disadvantages, and it was discovered that it was very difficult to find a classifier that could identify all the datasets with the same accuracy. SVM reported the highest overall accuracy of all the learning algorithms with 76.3%. The other approaches also performed well and could be a fair choice as they were all over 70% accurate. The learning algorithm's output was highly dependent on the dataset used. Krishna and Neelima [1] described the classification of images as a classic problem in the fields of image processing, computer vision, and machine learning. They investigated image classification using deep learning and AlexNet. Four test images were chosen from the ImageNet database and were classified correctly with 95% accuracy, demonstrating the efficacy of AlexNet-based deep-learning image classification.

Luna-Jimenez and Kleinlein [36] proposed an automatic emotion-recognizer system that had a speech emotion recognizer (SER) and a facial emotion recognizer (FER). Eight emotions were classified, and they achieved 86.70% accuracy on the RAVDESS dataset using a subject-wise 5-CV evaluation.

Kaviya and Arumugaprakash [37] proposed a human group facial sentiment recognition system using a deep-learning approach. A Haar filter was used to detect and extract facial features, and a CNN model was developed to recognize facial expressions and classify them into five basic emotional states, namely, happy, sad, anger, surprise, and neutral; the predicted group emotions were then fed into an audio synthesizer to produce audio output. An accuracy rate of 65% was achieved for Facial Expression Recognition (FER)-2013 and 60% for custom datasets.

Babajee and Suddul [38] analyzed how the CNN algorithm can be used to identify human facial expressions with deep learning. Their system employed a labeled dataset of over 32,298 images with varied facial expressions for training and testing. A noise-reduction facial detection subsystem with feature extraction was part of the pre-training process. Without the use of optimization techniques, an accuracy of 79.8% was recorded in recognizing each of the seven basic human emotions.

Awatramani and Hasteer [39] trained the fundamental architecture of a CNN to recognize human emotions in children with Autism Spectrum Disorder (ASD). The model was validated using a pre-existing dataset from the literature, and an accuracy of 67.50% was attained.

The Shanghai Jiao Tong University (SJTU) Emotion EEG dataset (SEED) was used with ResNet-50 and the Adam optimizer; a CNN model was presented by Ahmad and Zhang [40] to simultaneously learn features and recognize positive, neutral, and negative emotional states from pure electroencephalogram (EEG) signals. Negative emotion had the highest accuracy of 94.86%, while neutral and positive had 94.29 and 93.25%, respectively. An average accuracy of 94.13% was reported; this showed that the model's classification abilities were excellent and could improve emotion recognition.

Sandhu and Malhotra [41] classified human emotions into subcategories using a hybrid CNN approach. To achieve the best accuracy and loss, the study used the FER2013 dataset for emotion recognition and trained the model accordingly. Seven fundamental emotion classes were effectively recognized by the system. The suggested approach was therefore demonstrated to be successful in terms of increased accuracy, with an average rate of 88.10%, and minimal loss for facial emotion recognition.

Santoso and Kusuma [42] adopted state-of-the-art ImageNet models and modified the classification layer with the SpinalNet [43] and ProgressiveSpinalNet [44] architectures to improve accuracy. The categorization was done using the FER2013 dataset; after the training procedure was completed and the hyperparameters adjusted, an accuracy of 74.4% was achieved. The study in [45] provided an end-to-end system that used residual blocks to identify emotions and improve accuracy in the research field. After receiving a facial image, the framework returned its emotional state. The accuracy obtained on the test set of the FERGIT dataset (an extension of the FER2013 dataset with 49,300 images) was 75%.

Zhu and Fu [46] integrated a CNN with VGGNet, AlexNet, and LeNet-5. They also introduced an optimized Central Local Binary Pattern (CLBP) algorithm into the CNN to construct a CNN-CLBP algorithm for facial emotion recognition. The experiment yielded an accuracy of 88.16% for the hybrid CNN-CLBP, while LBP, LeNet-5, and VGGNet gave 48.63, 73.22, and 83.17% accuracy, respectively.

Durga and Rajesh [47] proposed a 2D-ResNet convolutional neural network to detect maskable images of facial emotions; better performance metrics were obtained, with an accuracy of 99.3%, recall of 99.12%, F1 score of 0.98, and sensitivity of 99.16%. The proposed model reduced the problem of overfitting.

Irhebhude and Kolawole [24] focused on presenting a technique that categorized gender among dark-skinned people. The classification was done using SVM on sets of images gathered locally and publicly. The analysis included face detection using the Viola-Jones algorithm, extraction of Histogram of Oriented Gradients (HOG) and Rotation Invariant LBP (RILBP) features, and training with an SVM classifier. PCA was performed on both the HOG and RILBP descriptors to handle the high-dimensional features. Various success rates were recorded; however, PCA on RILBP performed best, with accuracies of 99.6 and 99.8% on the public and local datasets, respectively.

Irhebhude and Kolawole [48], in their study, implemented an age estimation system from facial images using the Rotation Invariant Local Binary Pattern (RILBP) feature combined with Principal Component Analysis (PCA) to handle high-dimensional feature data, with the Support Vector Machine (SVM) algorithm used for classification. The facial images were grouped into four classes, namely, class 1 (0–10 years), class 2 (11–20 years), class 3 (21–35 years), and class 4 (above 35 years). Experiments were carried out on a local dataset captured within the Kaduna metropolis in the northern part of Nigeria and on the publicly available FGNET dataset to test the performance of the proposed method. They reported that the system achieved overall accuracies of 95.0 and 95.7% on the two datasets. This was reported to show the impact that self-constructed features could have on the overall accuracy of a recognition task.

In this work, the conventional ResNet features will be extracted and combined with RILBP features and used as a feature set to recognize seven different facial emotions.

4. Methodology

The proposed method is based on the combination of information from both the deep-learned ResNet features and the handcrafted RILBP features, as shown in Figure 2.

Figure 2.

Proposed methodology.

The proposed methodology is made up of the following steps: input image, face detection and cropping, feature extraction and concatenation, and feature classification for emotion recognition, as illustrated in Figure 2.

4.1 Input image

In this step, captured facial images of various sizes were loaded into the system for further processing. The images in their original form were fed to the next stage of the methodology; these images contain the entire captured face.

For this study, experiments were carried out on a self-captured dataset and the FER2013 database [49]. The focus is to recognize and classify seven (7) different emotions from facial images using a deep convolutional neural network combined with handcrafted features.

The self-captured database consists of 5509 images of Black faces. The images were obtained from snapshots of different people, both male and female, using a digital camera. The faces, containing facial details and expressions, were saved and used to make up the dataset. The images were of various sizes, consisting of 652 ‘angry’ images, 778 ‘disgust’ images, 519 ‘fear’ images, 1168 ‘happy’ images, 908 ‘neutral’ images, 830 ‘sad’ images, and 654 ‘surprise’ images.

4.2 Face detection & cropping

The face region of each image was detected automatically, and the detected face position was cropped accordingly [50]. The Viola-Jones algorithm, as described in Irhebhude and Kolawole [24], was used for face detection. The algorithm works with rectangular regions of the image, each containing multiple pixels. Each region is then processed, yielding values that indicate dark and light areas. These values serve as the basis for further image processing. To improve image quality, the pictures were cropped and standardized to 224 × 224 × 3 (the ResNet input specification).
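A minimal sketch of this detection-and-cropping step is shown below, using OpenCV's Haar cascade implementation of the Viola-Jones detector (assumed here for illustration; the cascade file and helper names are not taken from the chapter).

```python
import cv2

# Haar cascade (Viola-Jones) frontal-face detector shipped with OpenCV
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_and_crop(bgr_image, size=(224, 224)):
    """Detect the largest face, crop it, and resize it to the ResNet input size."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                     # no face detected
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest detection
    face = bgr_image[y:y + h, x:x + w]
    return cv2.resize(face, size)                       # 224 x 224 x 3
```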

Figure 3 shows sample images of each of the seven classes of emotion in the dataset: angry, disgust, fear, happy, neutral, sad, and surprised. These seven facial emotions were acted out by each subject after careful examination of the FER2013 dataset. This was done so that dark-skinned faces could be adequately captured for the purpose of the experiment. Each class has two sample images, one in original form and the other after detection; the detected faces are used for the experiment.

Figure 3.

Sample images of different classes of emotions before and after face detection from self-captured dataset.

The FER2013 dataset consists of grayscale images of 48 × 48 pixels. The faces have been automatically registered so that each face is centered and occupies about the same amount of space in each image. Each face is categorized based on the emotion shown in the facial expression into one of seven categories: angry, disgust, fear, happy, sad, surprise, and neutral [49]. The training set consists of 28,709 images; the database was created using the Google image search API, and FER2013 has more variation, including occlusion, partial faces, eyeglasses, and low contrast [30]. Sample images are shown in Figure 4.

Figure 4.

Sample images of different classes of emotions from FER2013 dataset [49].

4.3 Feature extraction

Two feature extraction algorithms, ResNet and RILBP, were used to extract facial emotion features. The ResNet features are extracted from the network's pooling layer applied to the input image, while RILBP extracts texture features that are invariant to rotation [23].

ResNet-18 was used to extract deep-learned features from the detected and cropped facial images; the network uses an 18-layer plain architecture inspired by VGG-19, to which the shortcut connections are added [16]. This network is adapted for the emotion classification task, as it is suitable for extracting learned image features. Hyperparameters were tuned with Bayesian optimization, and features were extracted automatically before the fully connected layer, giving a feature length of 512.
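A sketch of extracting the 512-dimensional pooled ResNet-18 features is given below, assuming an ImageNet-pretrained network from torchvision (the chapter does not name a framework, and the Bayesian hyperparameter optimization step is omitted here).

```python
import torch
from torchvision import models, transforms

# Pretrained ResNet-18 with the final fully connected layer replaced by identity,
# so the output is the 512-dimensional global-average-pooled feature vector.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),                      # cropped 224 x 224 x 3 face (RGB)
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def resnet_features(face_rgb):
    """face_rgb: 224 x 224 x 3 uint8 array -> 512-d feature vector (numpy)."""
    x = preprocess(face_rgb).unsqueeze(0)       # shape (1, 3, 224, 224)
    with torch.no_grad():
        return backbone(x).squeeze(0).numpy()   # shape (512,)
```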

To enhance the descriptive ability, the RILBP descriptor was used; this encodes the local facial features in a multi-resolution spatial histogram and combines the distribution of local intensity with spatial information [23]. Studies have shown good results for gender recognition and age estimation using the RILBP descriptor [24]. Following the steps reported by the authors, the technique extracted texture features with a dimension of 36 for further classification.

Concatenating the RILBP and ResNet features takes advantage of both to yield good performance. The combined/concatenated features were presented to the SVM classifier.
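The concatenation itself simply joins the two vectors into a single 548-dimensional descriptor (512 ResNet dimensions plus 36 RILBP dimensions); a sketch reusing the illustrative helpers from the earlier snippets is shown below.

```python
import cv2
import numpy as np

def combined_features(face_bgr):
    """Concatenate 512-d ResNet features and 36-d RILBP features -> 548-d vector."""
    rgb = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2RGB)
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY)
    deep = resnet_features(rgb)           # 512 dimensions (earlier sketch)
    handcrafted = rilbp_histogram(gray)   # 36 dimensions (earlier sketch)
    return np.concatenate([deep, handcrafted])
```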

4.4 Feature classification: SVM classifier

The extracted feature sets are used as input in the classification stage. SVM was used for model training because of its popularity and simplicity. An optimizable SVM was used to train the classifier to help select the best parameters for classifying facial images into the seven different emotion categories. The best hyperplane that can separate samples of one class from those of other classes using different types of support vectors was obtained using the SVM method as in Dammak and Mliki [23]; hyperparameter optimization automatically selects the hyperparameter values for classification. This model seeks to minimize the classification error and returns a model with the optimized hyperparameters. In the proposed method, the optimizable parameters were the kernel function, box constraint level, multiclass method, and data standardization. A linear kernel function with the box constraint parameter set to 1 (to prevent overfitting) and a one-vs-one multiclass method were selected, with the model hyperparameters set to optimize. These parameters were used in training the model, and the minimum classification error plot (Figure 5) shows the optimized results.

Figure 5.

Minimum classification error plot. (a) FER dataset; (b) Self-captured dataset.

Figure 5 illustrates the minimum classification error plot. At each iteration, a different combination of hyperparameter values is tried, and the plot is updated with the minimum validation classification error observed up to that iteration, indicated in dark blue. On completion of the optimization process, a set of optimized hyperparameters is selected, indicated by a red square.

The red square indicates the iteration that corresponds to the optimized hyperparameters. The yellow point indicates the iteration that corresponds to the hyperparameters that yield the observed minimum classification error.
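As an approximate analogue of the optimizable SVM described above (an assumption for illustration only; the Bayesian optimization is replaced here by a simple grid search over the kernel and box constraint), a scikit-learn sketch is:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_emotion_svm(features, labels):
    """features: n_samples x 548 array, labels: emotion classes 1..7."""
    pipeline = make_pipeline(
        StandardScaler(),                        # "standardize data" set to true
        SVC(probability=True))                   # multiclass SVC is inherently one-vs-one
    search = GridSearchCV(
        pipeline,
        param_grid={"svc__kernel": ["linear"],   # linear kernel, as selected above
                    "svc__C": [0.1, 1.0, 10.0]}, # box constraint candidates (assumed grid)
        cv=5)
    search.fit(features, labels)
    return search.best_estimator_
```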

5. Experimental results

To evaluate the performance of the proposed technique, 70% of the data was used for training, while 30% was used as test data. The test results were visualized and explained using confusion matrices, scatter plots, and ROC curves.

Experiments were conducted to validate the use of concatenated ResNet and RILBP features on each of the datasets. An optimizable SVM was used, and the hyperparameters were automatically set as follows: kernel function: linear; box constraint: 1; multiclass method: one-vs-one; standardize data: true. An accuracy of 93.4% was obtained on the FER2013 dataset and 95.5% on the self-captured dataset; this indicates the percentage of correctly classified observations.
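A sketch of the corresponding evaluation protocol (assumed details: a stratified split and scikit-learn metrics), reusing the train_emotion_svm helper from the previous sketch, is shown below.

```python
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

def evaluate(features, labels):
    """70% training / 30% test split, then accuracy and confusion matrix."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.30, stratify=labels, random_state=0)
    model = train_emotion_svm(X_train, y_train)
    predictions = model.predict(X_test)
    print("Accuracy: %.3f" % accuracy_score(y_test, predictions))
    print(confusion_matrix(y_test, predictions))  # rows: true class, columns: predicted class
    return model, X_test, y_test
```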

Figure 6 gives a visual representation of the feature space as a scatter plot; the plot uses different colors to represent the classes of emotions.

Figure 6.

Scatter plot visualization of sample images. (a) FER dataset; (b) Self-captured dataset.

The scatter plots in Figure 6 help to visualize the training data and misclassified points for emotion detection on the FER dataset (Figure 6a) and the self-captured dataset (Figure 6b). Each colored dot represents an observation; the tighter the data points cluster, the stronger the relationship between the variables. The values also tend to rise together, indicating a positive correlation, and a few outliers are observed in the plot. The results show a similar pattern for the two datasets. The x marks indicate misclassified instances.

The confusion matrix and ROC curve are used to check the performance of the classifier for each class. Figure 7 displays the confusion matrix, showing the number of observations in each cell. The matrix is plotted as the true class against the predicted class; the rows represent the true class, and the columns correspond to the predicted class.

Figure 7.

Confusion matrix showing number of observations. (a) FER 2013 (b) Self-captured dataset.

The classes are labeled 1 to 7; angry is represented by class 1, while disgust, fear, happy, neutral, sad, and surprise are represented by classes 2, 3, 4, 5, 6, and 7, respectively. The blue diagonal cells show observations with the correctly predicted class; the off-diagonal cells correspond to incorrectly classified observations. For the FER dataset (Figure 7a), class 7 has the lowest classification error rate, with 222 observations correctly classified as class 7 and a total of 10 observations wrongly placed, compared with the other classes. Similarly, for the self-captured data (Figure 7b), classes 6 and 7 had the lowest error rates, with 3 wrong observations each.

Figure 8 illustrates the per-class performance of the classifier, indicating the True Positive Rate (TPR) and False Negative Rate (FNR). The TPR is the proportion of correctly classified observations per true class. The FNR is the proportion of incorrectly classified observations per true class. In Figure 8(a), class 2 has the highest FNR, with 15.2% incorrectly classified points, as shown in the FNR column. Class 4 has the highest TPR, with 96.0% correctly classified points. For the self-captured set in Figure 8(b), class 6 recorded the lowest FNR and highest TPR, at 1.2 and 98.8%, respectively.

Figure 8.

Confusion matrix showing true positive rate and false negative rate. (a) FER 2013 (b) Self-captured dataset.

Figure 9 shows the confusion matrix per predicted class to investigate the False Discovery Rate (FDR). The Positive Predictive Value (PPV) is the proportion of correctly classified observations per predicted class. The FDR is the proportion of incorrectly classified observations per predicted class. PPV is indicated in blue for the correctly predicted points in each class, and the FDR is shown in orange for the incorrectly predicted points in each class. Class 5 had the highest FDR for the FER dataset, with 15.5% incorrectly predicted observations, while class 1 had the highest FDR of 11.4% for the self-captured dataset, as shown in Figure 9.

Figure 9.

Confusion matrix of positive predictive values and false discovery rates recognition. (a) FER dataset (b) Self-captured dataset.

The ROC curve in Figure 10 shows the plot of TPR against FPR for the classification scores computed by the classifier. For the FER dataset, an FPR of 0.02 and a TPR of 0.94 showed that 2% of observations were incorrectly assigned to other classes, while 94% were correctly classified into their classes. The self-captured dataset recorded a similar pattern, with FPR and TPR values of 0.02 and 0.92, respectively. The overall performance is indicated by an Area Under the Curve (AUC) value of 0.99 for both datasets. This shows that the classifier has an overall performance of 99%; AUC values range from 0 to 1, and larger values indicate better classifier performance.

Figure 10.

ROC curve showing the classifier performance. (a) FER dataset; (b) Self-captured dataset.
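Because the AUC summarizes the ROC curve across all thresholds, it can be computed directly from the classifier's class scores; a sketch using scikit-learn is shown below, assuming the SVM was fitted with probability estimates enabled (as in the earlier training sketch) and that model, X_test, and y_test come from the evaluation sketch.

```python
from sklearn.metrics import roc_auc_score

# Macro-averaged one-vs-rest AUC over the seven emotion classes.
probabilities = model.predict_proba(X_test)        # shape (n_samples, 7)
auc = roc_auc_score(y_test, probabilities, multi_class="ovr", average="macro")
print("Macro AUC: %.3f" % auc)
```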

6. Comparison with selected state-of-the-art methods

The proposed technique used in this study showed a more promising result when compared with other selected methods, as presented in Table 1.

| Author | Method | Accuracy | Dataset |
|---|---|---|---|
| Zhu et al. [46] | CNN-CLBP | 88.61% | CK+ and JAFFE |
| | LBP | 73.22% | |
| | VGGNet | 83.17% | |
| Fei et al. [27] | VGG19, ResNet18, and DNN + SVM (vote) | 74.58% | FER2013 |
| | | 100% | CK+ |
| | | 100% | JAFFE |
| Proposed method | RILBP + ResNet18 | 93.4% | FER2013 |
| | | 95.5% | Self-captured |

Table 1.

Comparison with other methods.

With a fusion of handcrafted and deep-learned features, [46] reported 88.61% accuracy, indicating that the proposed technique performed better, with an improvement of about 5 percentage points in accuracy. Considering the datasets and ResNet methods used, the proposed method achieved a considerably higher accuracy (93.4% versus 74.58%) than the study in [27], which used a voting technique on the FER2013 dataset. From the experimental results, the study concludes that the proposed method provides a more promising result in terms of recognition accuracy. This was a result of combining transfer-learned and rotation-invariant features. The approach showed considerable improvements when compared with the existing methods in [27, 46].

7. Conclusion

Facial emotion recognition among dark-skinned people has been addressed in this paper by adopting the technique of concatenating handcrafted and deep-learned ResNet features extracted from facial images. The ResNet transfer learning model was used to extract deep-learned features, which were combined with RILBP features that helped capture local features invariant to scale and rotation. The method was evaluated on two datasets: the self-captured dataset (which comprised dark-skinned facial images) and the FER2013 dataset (which formed the base dataset used to validate the technique). The SVM classifier was used for classification into the various emotion categories. The study showed that ResNet and RILBP complement each other in achieving good recognition accuracy. Future work will look into data balancing and generating more datasets, especially of dark-skinned people.

Acknowledgments

The authors would like to thank the management of the Nigerian Defense Academy for their support and encouragement throughout the period of this study.

Conflict of interest

The authors declare no conflict of interest.

References

  1. Krishna M et al. Image classification using deep learning. International Journal of Engineering & Technology. 2018;7(2):614-617
  2. Patil MP, Chokkalingam S. Deep convolutional neural networks (CNN) for medical image analysis. International Journal of Engineering and Advanced Technology (IJEAT). 2019;8(3S):607-610
  3. Verma N, Tiwari S. A review on facial expression recognition system using deep learning. Journal of Emerging Technologies and Innovative Research (JETIR). 2021;8(7):963-968
  4. Ruiz-Garcia A et al. Deep learning for emotion recognition in faces. In: The 25th International Conference on Artificial Neural Networks (ICANN 2016). Barcelona, Spain: Springer Verlag; 2016
  5. Pal KK, Sudeep K. Preprocessing for image classification by convolutional neural networks. In: 2016 IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT). Bangalore, India: IEEE; 2016
  6. Albawi S, Mohammed TA, Al-Zawi S. Understanding of a convolutional neural network. In: 2017 International Conference on Engineering and Technology (ICET). Antalya, Turkey: IEEE; 2017
  7. Mohammadi F, Abadeh MS. Image steganalysis using a bee colony based feature selection algorithm. Engineering Applications of Artificial Intelligence. 2014;31:35-43
  8. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436-444
  9. Ranganathan H, Chakraborty S, Panchanathan S. Multimodal emotion recognition using deep learning architectures. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). Lake Placid, NY, USA: IEEE; 2016
  10. He K et al. Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE; 2016
  11. Li B, Lima D. Facial expression recognition via ResNet-50. International Journal of Cognitive Computing in Engineering. 2021;2:57-64
  12. Kalaivani G, Sathyapriya S, Anitha DD. A literature review on emotion recognition for various facial emotional extraction. IOSR Journal of Computer Engineering. 2018:30-33
  13. Perez A. Recognizing human facial expressions with machine learning. 2018. Available from: https://thoughtworksarts.io/blog/recognizing-facial-expressions-machine-learning/
  14. Szegedy C et al. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE; 2015
  15. He K et al. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2015;37(9):1904-1916
  16. Team GL. Introduction to ResNet or Residual Network. Great Learning Blog; 2023. Available from: https://www.mygreatlearning.com/blog/resnet/
  17. He K, Sun J. Convolutional neural networks at constrained time cost. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE; 2015
  18. Han SS et al. Deep neural networks show an equivalent and often superior performance to dermatologists in onychomycosis diagnosis: Automatic construction of onychomycosis datasets by region-based convolutional deep neural network. PLoS One. 2018;13(1):e0191493
  19. Huang B et al. Identification and classification of aluminum scrap grades based on the Resnet18 model. Applied Sciences. 2022;12(21):1-16
  20. Wang S et al. Automatic detection and classification of steel surface defect using deep convolutional neural networks. Metals. 2021;11(3):388
  21. Ramzan F et al. A deep learning approach for automated diagnosis and multi-class classification of Alzheimer’s disease stages using resting-state fMRI and residual neural networks. Journal of Medical Systems. 2020;44(2):1-16
  22. Ahonen T et al. Rotation invariant image description with local binary pattern histogram fourier features. In: Image Analysis. Berlin, Heidelberg: Springer; 2009
  23. Dammak S, Mliki H, Fendri E. Gender estimation based on deep learned and handcrafted features in an uncontrolled environment. Multimedia Systems. 2022
  24. Irhebhude ME, Kolawole AO, Goma HK. A gender recognition system using facial images with high dimensional data. Malaysian Journal of Applied Sciences. 2021;6(1):27-45
  25. Bodapati JD, Veeranjaneyulu N. Facial emotion recognition using deep CNN based features. International Journal of Innovative Technology and Exploring Engineering. 2019;8(7):1928-1931
  26. Cai L et al. Audio-textual emotion recognition based on improved neural networks. Mathematical Problems in Engineering. 2019;2019(6):1-9
  27. Fei Y, Jiao G. Research on facial expression recognition based on voting model. In: IOP Conference Series: Materials Science and Engineering. Beijing, China: IOP Publishing; 2019
  28. Ansari AA, Singh AK, Singh A. Speech emotion recognition using CNN. International Research Journal of Engineering and Technology (IRJET). 2020;7(6):4302-4308
  29. Sujanaa J, Palanivel S, Balasubramanian M. Emotion recognition using support vector machine and one-dimensional convolutional neural network. Multimedia Tools and Applications. 2021;80(18):27171-27185
  30. Minaee S, Minaei M, Abdolrashidi A. Deep-emotion: Facial expression recognition using attentional convolutional network. Sensors. 2021;21(9):3046
  31. Shirisha K, Buddha M. Facial emotion detection using convolutional neural network. International Journal of Scientific & Engineering Research. 2020;11(3):51-55
  32. Kim JW, Saurous RA. Emotion recognition from human speech using temporal information and deep learning. In: INTERSPEECH. Hyderabad, India: ISCA; 2018
  33. Santhoshkumar R, Geetha MK. Deep learning approach for emotion recognition from human body movements with feedforward deep convolution neural networks. Procedia Computer Science. 2019;152:158-165
  34. Selvapriya M, Maria GP. A review of classification methods for social emotion analysis. International Journal of Scientific Research in Computer Science Engineering and Information Technology. 2018;3(3):1737-1750
  35. Mohamed A. Comparative study of four supervised machine learning techniques for classification. International Journal of Applied Science and Technology. 2017;7(2):5-18
  36. Luna-Jimenez C et al. A proposal for multimodal emotion recognition using aural transformers and action units on RAVDESS dataset. Applied Sciences. 2022;12(1):1-23
  37. Kaviya P, Arumugaprakash T. Group facial emotion analysis system using convolutional neural network. In: 2020 4th International Conference on Trends in Electronics and Informatics (ICOEI). Tirunelveli, India: IEEE; 2020
  38. Babajee P et al. Identifying human emotions from facial expressions with deep learning. In: 2020 Zooming Innovation in Consumer Technologies Conference (ZINC). Novi Sad, Serbia: IEEE; 2020. pp. 36-39
  39. Awatramani J, Hasteer N. Facial expression recognition using deep learning for children with autism spectrum disorder. In: 2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA). Greater Noida, India: IEEE; 2020
  40. Ahmad IS et al. Deep learning based on CNN for emotion recognition using EEG signal. WSEAS Transactions on Signal Processing. 2021;17:28-40
  41. Sandhu N, Malhotra A, Bedi MK. Human emotions detection using hybrid CNN approach. International Journal of Computer Science and Mobile Computing. 2020;9(10):1-9
  42. Santoso BE, Kusuma GP. Facial emotion recognition on FER2013 using VGGSPINALNET. Journal of Theoretical and Applied Information Technology. 2022;100(7):2008-2102
  43. Kabir HD et al. SpinalNet: Deep neural network with gradual input. IEEE Transactions on Artificial Intelligence. 2022;03347:1-10
  44. Chopra P. ProgressiveSpinalNet architecture for FC layers. arXiv preprint arXiv:2103.11373; 2021
  45. Bah I, Xue Y-Z. Facial expression recognition using adapted residual based deep neural network. Intelligence and Robotics. 2022;2(1):72-88
  46. Zhu D et al. Facial emotion recognition using a novel fusion of convolutional neural network and local binary pattern in crime investigation. Computational Intelligence and Neuroscience. 2022;2022:2249417
  47. Durga BK, Rajesh V. A ResNet deep learning based facial recognition design for future multimedia applications. Computers and Electrical Engineering. 2022;104:108384
  48. Irhebhude ME, Kolawole AO, Abdullahi F. Northern Nigeria human age estimation from facial images using rotation invariant local binary pattern features with principal component analysis. Egyptian Computer Science Journal. 2021;45(1):12-28
  49. Sambare M. FER-2013. 2020. Available from: https://www.kaggle.com/datasets/msambare/fer2013
  50. Zahara L et al. The facial emotion recognition (FER-2013) dataset for prediction system of micro-expressions face using the convolutional neural network (CNN) algorithm based raspberry Pi. In: 2020 Fifth International Conference on Informatics and Computing (ICIC). Gorontalo, Indonesia: IEEE; 2020. pp. 1-9
