Open access peer-reviewed chapter

Application of Machine and Deep Learning Techniques to Facial Emotion Recognition in Infants

Written By

Uma Maheswari Pandyan, Mohamed Mansoor Roomi Sindha, Priya Kannapiran, Senthilarasi Marimuthu and Vinora Anbunathan

Submitted: 06 December 2022 Reviewed: 27 December 2022 Published: 07 March 2023

DOI: 10.5772/intechopen.109725

From the Edited Volume

Emotion Recognition - Recent Advances, New Perspectives and Applications

Edited by Seyyed Abed Hosseini


Abstract

Infant facial expression recognition is one of the most significant areas of research in the field of computer vision and surveillance-based parental care. It is essential both for the early diagnosis of medical conditions and for intelligent interpersonal interaction. Despite recent improvements in face detection, feature extraction techniques, and expression categorization methods, it is still difficult to develop an automated system employing deep learning methods that recognizes infant emotions reliably. The prime aim of this chapter is to present a comprehensive framework for recognizing infant emotions using machine learning and deep learning algorithms on the currently accessible infant emotion datasets. The proposed model directs future research on early detection of infant emotions and has the potential to identify emotion-related medical problems. This chapter incorporates the findings on infant emotion recognition required to support parental supervision and enhance intelligent interpersonal relationships.

Keywords

  • infant emotion recognition
  • machine learning
  • deep learning
  • facial emotion
  • surveillance parental care

1. Introduction

Facial expressions are a vital aspect of communication and one of the most important ways humans convey information nonverbally. There is a lot to comprehend about the messages we transmit and receive through nonverbal communication, even when nothing is said explicitly. Nonverbal indicators are vital in interpersonal relationships, and facial expressions communicate them. Seven universal facial expressions are employed as nonverbal indicators: laugh, cry, fear, disgust, anger, contempt, and surprise.

From the moment of birth, babies can convey their interest, pain, disgust, and enjoyment through their body language and facial expressions. Around 2–3 months old, babies start smiling spontaneously, and around 4 months old, they start laughing. Although your infant may make eye contact with you, it is likely that crying will be the predominant behavior your baby exhibits. For instance, your baby may scream just because they want to be cuddled or because they are hungry, upset, wet, or uncomfortable.

Facial expressions are one of the key methods that infants communicate their needs and emotions. As a result, it’s critical to comprehend their facial expressions and pay attention to them in order to provide appropriate treatment. Understanding their emotions is essential for early diagnosis and treatment of diseases like Autism Spectrum Disorder (ASD) and Attention-Deficit Hyperactivity Disorder (ADHD). Empirical evidence suggests that early intervention for certain problems affects children’s development in the long run.

Recently, the field of computer vision has accorded facial emotion recognition a lot of attention. However, adult facial expressions are the main focus of the research. Adult and newborn face structures differ from one another. Infants have rounder faces, eyes that are closer together and much bigger, shorter lips, and lips that resemble a "Cupid's bow." Their faces feature large fat pads and elastic skin, which prevents folds and wrinkles while allowing them to portray any emotion. Many newborn emotional expressions, such as anxiety, anger, and disgust, are not morphologically the same as the corresponding adult expressions. These factors led to the development of the Baby Facial Action Coding System (Baby FACS), which is dedicated to the analysis of infants' Action Units (AUs), and the Emotional Facial Action Coding System (EMFACS).

Automatic facial expression recognition using these universal expressions could be a key component of natural human-machine interfaces, as well as in cognitive science and healthcare practice. Even though humans understand facial expressions almost instantly and without effort, reliable expression identification by machines remains a challenge. Infant facial expression recognition, in particular, is emerging as a significant and technically demanding computer vision problem compared with adult facial expression recognition. The ability to accurately interpret infant facial expressions is important for providing professional parental care through surveillance footage analysis. Since infant facial expression data are scarce, recognition depends heavily on the construction of a dataset; no datasets are publicly available or created specifically to analyze the expressions of infants, and building one for infant facial expression analysis is a large and challenging task. Accurately interpreting a baby's facial expressions is critical because many expressions closely resemble one another, and doing so makes it possible to identify the action behind a given scenario. Despite recent advances in face detection, feature extraction procedures, and expression categorization approaches, designing an automated system that accomplishes this objective remains challenging.

This chapter provides an overview of the different datasets that are available for baby emotions. Additionally, it recommends a process for recognizing newborn emotions through shot boundary detection, key frame extraction, face detection, and emotion classification using machine learning and deep learning approaches. The video sequence serves as input to the proposed methodology and is collected from a wide variety of available environments, including videos with known surroundings for infants and adults, cluttered backgrounds, stimulated situations, and videos with complex backgrounds. The video sequence is then separated into frames in order to retrieve key frames. From the retrieved key frames, faces are detected using an integral image-based technique, and the detected faces are then divided into infants and adults using a CNN classifier model.


2. Available datasets

2.1 The city infant faces database

This database consists of 195 infant face images, including 40 images of neutral infant faces, 54 images of negative infant faces, and 60 images of positive infant faces. The images show high criterion validity and good test-retest reliability. There are 154 portrait images in the database, available in both color and black and white (Figure 1).

Figure 1.

Sample dataset images (city infant faces database) [1].

2.2 Babyexp

The dataset contains 5400 images depicting three types of infant facial expressions: cry, laugh, and neutral. To appropriately describe additional universal facial expressions, these three expressions must first be recognized. For that purpose, the images of infant actions such as crying and laughing were initially gathered from a private database named the infant action database. This database contains footage of children performing various actions, from which images of the different actions have been taken. The images of neutral activity and some images of laughing were collected from the internet (Figure 2).

Figure 2.

Sample dataset images (Babyexp).

2.3 Rebel dataset

It comprises 50 videos of infants aged 6–10 months that were gathered from the Department of Psychology at the University of Nevada, Las Vegas (UNLV). The Rebel collection contains many unlabeled videos of infants that still need to be labeled [2].

2.4 Tromso infant faces (TIF) database

Over 700 adults rated the images for intensity, clarity, and valence and sorted them into seven emotion categories: joyful, sad, disgusted, furious, terrified, astonished, and neutral.

2.5 The child emotion facial expression set

The seven induced and posed universal emotions as well as a neutral expression were utilized to build a video and image database of 4- to 6-year-old children. Participants were involved in video and image shoots intended to evoke certain emotions, and the resulting photos were then judged in two rounds by impartial judges. For each emotion, there were 87 stimuli for neutrality, 363 stimuli for joy, 170 stimuli for disgust, 104 stimuli for surprise, 152 stimuli for fear, 144 stimuli for sadness, 157 stimuli for anger, and 183 stimuli for contempt [3].

2.6 EmoReact

Children between the ages of 4 and 14 make up this multimodal emotion dataset. The collection includes 1102 audio-visual clips with annotations for 17 different emotional states, including 9 complicated emotions like frustration, doubt, and curiosity, as well as neutral and valence [4].

2.7 Child affective facial expression set (CAFE)

The CAFE collection includes 1192 color images of a racially and culturally varied group of children aged 2–8 who posed for six emotional facial expressions: angry, afraid, sad, joyful, astonished, and disgusted [5].

2.8 The multimodal dyadic behavior dataset

This is a unique collection of multimodal (video, audio, and physiological) recordings of infants' and toddlers' social and communicative behavior, collected during a semi-structured play interaction with an adult. The sessions were videotaped in the Georgia Tech Child Study Lab (CSL) under an IRB process endorsed by the university.

2.9 The NIMH Child Emotional Faces Picture Set (NIMH-ChEFS)

The database contains 482 images of child faces in two gaze conditions, direct gaze and averted gaze, depicting fearful, angry, joyful, sad, and neutral expressions [6].


3. Shot boundary detection method

A video is composed of a variety of scenes that capture the order of events, shots, and frames. As a result, it consists of interconnected images taken from various camera angles. A shot, the smallest unit of temporal visual information, is composed of a series of related frames continuously recorded by a single camera; these frames represent acts or events related in time and space [7]. In order to manage the immense volumes of video data created by massive multimedia applications, video abstraction technologies became necessary with the rapid expansion of network infrastructure and the use of advanced digital video technology. As a result, users may readily access and retrieve the necessary portions of a video without having to watch the whole thing. Video abstraction typically involves two modules: the shot boundary detection module, which divides the video frames into shots, and the key frame extraction module, where a number of representative frames are identified and selected.

To streamline video analysis and processing, shot boundary detection (also called temporal video segmentation) is the technique of dividing video frames into several shots by identifying the boundaries between subsequent video shots. The primary objective of shot boundary detection methods is to identify differences in visual content: the differences between succeeding frames are calculated and compared against a threshold. Three fundamental components make up shot boundary detection (SBD) algorithms: frame representation, a dissimilarity measure, and thresholding. One of the biggest challenges for these SBD approaches is finding transitions in the presence of abrupt illumination changes and significant camera/object movement, which can result in the extraction of incorrect keyframes.
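As a concrete illustration of these three components, the minimal sketch below represents each frame by a colour histogram, measures dissimilarity with the Bhattacharyya distance, and declares a hard cut whenever the distance exceeds a fixed threshold. The histogram bins and the threshold value of 0.5 are assumptions chosen for demonstration, not the chapter's exact method.

```python
# Minimal hard-cut shot boundary detector: colour-histogram dissimilarity
# between consecutive frames, thresholded with a fixed value.
import cv2

def detect_shot_boundaries(video_path, threshold=0.5):
    """Return frame indices where the histogram difference exceeds the threshold."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        # 2D hue/saturation histogram as the frame representation
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Bhattacharyya distance: 0 = identical, 1 = completely different
            dist = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if dist > threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```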


4. Key frame extraction

Key frame extraction techniques may be loosely grouped according to whether they rely on shot boundaries, visual information, motion analysis, or clustering. By removing the duplicated frames from the source video and extracting a group of representative frames, keyframe extraction is an appropriate technique for communicating the key components of a video clip effectively. The extracted keyframes are expected to represent and offer thorough visual information for the entire video [8]. To make indexing, retrieval, storage management, and video data recognition more convenient and effective, the keyframe technique is used to reduce the computational cost and amount of data required for video processing. The following subsections describe three main classes of approaches: sampling-based, shot-based, and clustering-based techniques.

4.1 Sampling-based technique

This sort of technique, which does not prioritize the video content, chooses representative frames by uniformly or randomly sampling the frames of the original video. The idea behind this method is to select every kth frame from the source video [9], where k is determined by the desired summary length. A typical video summary covers 5–15% of the entire video: for 5% summarization every 20th frame is chosen as a keyframe, whereas for 15% summarization roughly every 7th frame is chosen. Although these keyframes are extracted from the video, they do not accurately depict everything in it, and they can also result in duplicate frames with the same content.
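A minimal sketch of this uniform-sampling strategy is shown below; the summary-ratio parameter and the use of OpenCV for frame reading are assumptions for illustration only.

```python
# Uniform sampling: keep every k-th frame, with k derived from the summary ratio
# (e.g. 5% -> every 20th frame, 15% -> roughly every 7th frame).
import cv2

def sample_keyframes(video_path, summary_ratio=0.05):
    k = max(1, round(1.0 / summary_ratio))
    cap = cv2.VideoCapture(video_path)
    keyframes, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % k == 0:
            keyframes.append(frame)
        idx += 1
    cap.release()
    return keyframes
```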

4.2 Shot-based technique

In this method, the shot boundaries/transitions are first detected using an effective SBD method. The keyframe extraction step is then carried out after the video frames have been divided into multiple shots. Different key frame selection strategies have been described in the literature; in the conventional approach, the first and last frames of each candidate shot are chosen as the key frames. These selected key frames act as the shots' representative frames, which results in a more concise synopsis of the original video.

4.3 Clustering-based technique

Unsupervised learning techniques such as clustering group together collections of related data points. With this technique, video frames with comparable visual content are divided into a number of clusters. The frame extracted as the key frame from each cluster is the one closest to the candidate cluster's center. The similarities between frames are defined by the attributes the frames exhibit, such as color histograms, texture, saliency maps, and motion [10]. The fundamental problem with the clustering-based approach is that it can be challenging to determine the number of clusters in each video file before performing the clustering operation.
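The sketch below illustrates this idea with k-means over colour histograms, keeping the frame nearest to each cluster centre; the fixed cluster count and histogram binning are assumptions, which also reflects the main practical limitation noted above.

```python
# Clustering-based keyframe extraction: cluster frames by colour histogram and
# keep, for each cluster, the frame closest to the cluster centre.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def cluster_keyframes(frames, n_clusters=5):
    feats = []
    for f in frames:
        hist = cv2.calcHist([f], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        feats.append(cv2.normalize(hist, hist).flatten())
    feats = np.array(feats)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(feats)
    keyframes = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
        keyframes.append(frames[members[np.argmin(dists)]])
    return keyframes
```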


5. Face detection algorithm

Object detection is one of the computer technologies that is connected to image processing and computer vision. It is concerned with detecting instances of an object such as human faces, buildings, trees, cars, etc. The primary aim of face detection algorithms is to determine whether there is any face in an image or not.

  • Viola-Jones: In order to find Haar-like features, this method slides a square window of a predefined size across the image; these features can then be identified as components of a face (a minimal detection sketch based on this approach follows the list below).

  • Single Shot Detector (SSD): This one overlays the image with a grid and many "anchor boxes," the latter of which are produced during the training phase. These boxes are used to identify the characteristics and locations of the target items, such as faces.

  • You Only Look Once (YOLO): Because it takes only one "look" at the image to detect all the objects of interest, it boasts better performance than SSD.
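As referenced in the Viola-Jones bullet above, the following minimal sketch applies OpenCV's pretrained frontal-face Haar cascade; the scale factor and minimum-neighbour settings are illustrative assumptions rather than values from the chapter.

```python
# Viola-Jones-style face detection with OpenCV's bundled Haar cascade.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Returns a list of (x, y, w, h) bounding boxes for detected faces
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                    minSize=(30, 30))
```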


6. Classification methods

6.1 Machine learning

Researchers have made significant progress on automatic facial expression classifiers. The Facial Action Coding System (FACS) was developed to classify facial movements by AU [11]. Traditional machine learning classifiers such as the Hidden Markov Model, the Support Vector Machine (SVM), and Bayesian networks have been proposed for facial expression recognition. Audio and video clips are used to recognize and classify emotions with SVMs and decision-level fusion [12]. Utilizing compound emotion recognition of children experiencing meltdown crises, a preventive strategy was developed and implemented: unusual facial expressions linked to complex emotions are clearly connected to the symptoms of meltdowns. Experimental evaluation has been done on several deep spatiotemporal geometric features of autistic children's micro-expressions during a meltdown. To choose the qualities that most clearly distinguish compound emotion in a meltdown crisis in autistic children from compound emotion in a normal state, compound emotion recognition performance is compared across several collections of micro-expression features. For learning and categorizing the features extracted from many images, the nearest neighbor method has also been used.
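As a minimal illustration of such a traditional pipeline, the sketch below trains an SVM on HOG features extracted from grayscale face crops. The 64 × 64 crop size, HOG parameters, and the three-class cry/laugh/neutral labels are assumptions for demonstration, not the setups of the cited works.

```python
# Classical pipeline sketch: HOG features + SVM for facial expression classes.
import cv2
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# HOG descriptor for 64x64 grayscale face crops
hog = cv2.HOGDescriptor((64, 64), (16, 16), (8, 8), (8, 8), 9)

def hog_features(face_gray_64x64):
    return hog.compute(face_gray_64x64).flatten()

def train_svm(faces, labels):
    """faces: list of 64x64 uint8 grayscale crops; labels: e.g. 0=cry, 1=laugh, 2=neutral."""
    X = np.array([hog_features(f) for f in faces])
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2,
                                              random_state=0)
    clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
    print("Held-out accuracy:", clf.score(X_te, y_te))
    return clf
```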

6.2 Deep learning

Deep learning-based facial expression detection has gained popularity as a result of the growth of big data and computing power. YOLOv3-tiny has been used to detect the face and body of infants, with a classification accuracy of 94.46% for the face and 86.53% for the body. For extracting local temporal and spatial features, a two-stream CNN model has been used. Models based on transfer learning, including VGG16, ResNet-18, and ResNet-50, have been suggested for recognizing adult facial emotions [13, 14, 15]. Because infant facial expression recognition is necessary in parenting care, a deep neural network must be built to distinguish a newborn's facial emotions from images. Transfer learning-based techniques suffer from overfitting since there is a dearth of data on newborn facial expressions. A real-time infant facial expression detection system has been proposed based on IoT edge computing and a multi-headed one-dimensional convolutional neural network (1D-CNN). Face recognition and emotion recognition algorithms have been suggested for monitoring toddlers' emotions, and a lightweight network structure built with the deep learning approach has been proposed to lower the number of parameters and save computational resources. A methodology for AI-based facial emotion recognition has also been described that uses many datasets, feature extraction methods, and algorithms; the datasets are broken down into three groups (children, adults, and senior citizens) in order to better understand the broad applicability of facial expression identification. Modern CNN models are utilized for preprocessing, feature extraction, and classification using a variety of techniques, and the benchmark accuracy of various CNN models as well as some of their architectural traits are evaluated.
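To make the transfer learning idea concrete, the sketch below fine-tunes an ImageNet-pretrained ResNet-18 for three infant expression classes. It is a generic illustration of the transfer-learning baselines discussed above, not the authors' exact setup; the frozen backbone, class count, and torchvision weights API (torchvision ≥ 0.13) are assumptions.

```python
# Transfer learning sketch: replace the classifier head of a pretrained ResNet-18.
import torch.nn as nn
from torchvision import models

def build_transfer_model(num_classes=3, freeze_backbone=True):
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    if freeze_backbone:
        for p in model.parameters():
            p.requires_grad = False          # keep pretrained features fixed
    # New 3-way head for cry / laugh / neutral (assumed class set)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```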


7. Shallow CNN architecture

An eleven-layer shallow convolutional neural network was developed to recognize newborn facial expressions. The proposed shallow network architecture is shown in Figure 3. The network is composed of two groups of convolutional layers, two maxpool layers, a fully connected layer, a softmax layer, and a classification output layer. Each group contains one convolution layer, batch normalization, and a ReLU activation function. The input layer is fixed at 224 × 224 × 3 with zero-center normalization. Following that, 64 filters of size 7 × 7 are convolved with the input layer. Batch normalization of the convolution filter output is done for independent learning, and then a ReLU activation layer is applied to provide non-linearity.

Figure 3.

The proposed shallow CNN model [16].

Following that, the maxpool operation is used to extract the low-level characteristics. This suppresses blur and draws attention to the key details of an infant image. The features are then convolved again with 64 filters of size 3 × 3, followed by batch normalization, an activation function, and a maxpool layer. This structure enhances the feature representation and decreases the number of learning parameters. The fully connected layer, a softmax layer, and the classification output layer are attached to the end of the structure.
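A minimal PyTorch sketch of a network following this layer sequence is given below. The original model was implemented in MATLAB, so the padding choices, the second-block stride, and the resulting feature-map sizes are assumptions and differ slightly from the sizes quoted in the text.

```python
# Shallow CNN sketch: two conv/BN/ReLU/maxpool groups followed by a 3-way classifier.
import torch
import torch.nn as nn

class ShallowCNN(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),   # 224 -> 112
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # 112 -> 56
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),   # 56 -> 56
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # 56 -> 28
        )
        self.classifier = nn.Linear(64 * 28 * 28, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        # Softmax is applied by the cross-entropy loss during training
        return self.classifier(x)
```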

7.1 Training

During the training phase, the data is first augmented so that a constant image size is maintained. The learning rate, which scales the gradient descent update, is one of the most significant hyperparameters to adjust. Dynamic learning rate adjustment has been employed to obtain the best feature learning of the infant's facial expressions: the initial learning rate is set at 0.01, and the learning rate is multiplied by 0.1 after every 100 iterations. In order to prevent overfitting, a simple and generalized architecture trained with stochastic gradient descent and momentum has been developed, which is well suited to the new infant images. The best training model is chosen after the model has undergone several modifications.
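The sketch below expresses this training recipe (SGD with momentum, initial learning rate 0.01, decayed by a factor of 0.1 every 100 iterations) in PyTorch, reusing the ShallowCNN sketch above. The momentum value, epoch count, and data loader are assumptions; the original training was done in MATLAB.

```python
# Training-loop sketch: SGD + momentum with a step learning-rate schedule.
import torch
import torch.nn as nn

def train(model, train_loader, num_epochs=10):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    # step_size=100 with per-iteration stepping -> LR x 0.1 every 100 iterations
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)
    for epoch in range(num_epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            scheduler.step()
    return model
```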

7.2 Testing

The model is trained and tested using MATLAB 2021b with a GPU processor. The dataset primarily includes the three newborn facial expressions of cry, laugh, and neutral, as seen in Figure 4. Each infant has a different set of facial expressions; however, they possess a few defining characteristics that make identification difficult. There are roughly 1800 images in each class. Resizing is one of the data augmentation techniques used to equalize the size distribution, because the images collected in the dataset and from the internet are of different sizes. This approach increases the diversity and adaptability of the model. Along with validation and testing accuracy, the evaluation includes precision and recall, as defined in Eqs. (1)–(3).

Figure 4.

Sample dataset images (TIF) [17].

$$\text{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{False Positive} + \text{True Negative} + \text{False Negative}} \tag{1}$$

$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \tag{2}$$

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \tag{3}$$
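These three measures translate directly into code; the small helper functions below are an illustrative sketch of Eqs. (1)–(3) computed from confusion-matrix counts.

```python
# Metric helpers computing Eqs. (1)-(3) from confusion-matrix counts.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)
```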

ResNet-18, ResNet-50, and VGG-16 are traditional, benchmark facial expression recognition networks. These are more sophisticated network models that employ complex elements to deliver the best outcomes for facial expression recognition. As a result, this study compares these topologies with the suggested shallow network. Table 1 displays the architecture of the suggested shallow network alongside ResNet-18, ResNet-50, and VGG-16.

| Layer name | ResNet-18 | ResNet-50 | VGG-16 | Shallow CNN |
| --- | --- | --- | --- | --- |
| Conv1 | 7 × 7, 64, stride 2 | 7 × 7, 64, stride 2 | [3 × 3, 64] × 2 | 7 × 7, 64, stride 2; batch normalization, ReLU; 3 × 3 max pool, stride 2 |
| Conv2 | 3 × 3 max pool, stride 2; [3 × 3, 64; 3 × 3, 64] × 2 | 3 × 3 max pool, stride 2; [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3 | 3 × 3 max pool, stride 2; [3 × 3, 128] × 2 | 3 × 3, 64, stride 2; batch normalization, ReLU; 3 × 3 max pool, stride 2 |
| Conv3 | [3 × 3, 128; 3 × 3, 128] × 2 | [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 4 | 3 × 3 max pool, stride 2; [3 × 3, 256] × 3 | — |
| Conv4 | [3 × 3, 256; 3 × 3, 256] × 2 | [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 6 | 3 × 3 max pool, stride 2; [3 × 3, 512] × 3 | — |
| Conv5 | [3 × 3, 512; 3 × 3, 512] × 2 | [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3 | 3 × 3 max pool, stride 2; [3 × 3, 512] × 3 | — |
| Output | Average pool, 1000-d fc; 6-d fc, softmax | Average pool, 1000-d fc; 6-d fc, softmax | fc with 4096 nodes; fc with 4096 nodes; softmax with 1000 nodes | 3-d fc, softmax, classification |

Table 1.

Architectures of ResNet-18, ResNet-50, VGG16, and the proposed shallow CNN.

The suggested shallow network's input image has a size of 224 × 224 pixels. The Conv1 layer uses 64 convolution kernels of size 7 × 7 with stride 2, so the output size is reduced to 112 × 112 pixels. Next, batch normalization with the same scale and offset, a ReLU activation layer, and a max pool layer are applied. The Conv2 group is built in the same way, with only the convolution kernel size changed to 3 × 3; this shrinks the output to a 56 × 56 size. To determine the type of infant facial expression, a fully connected layer with three categories and a classification layer are implemented.

The overall design contributes to stabilizing learning and significantly reduces the number of epochs required to train the network, avoiding exponential growth in the computation needed to learn the network. Other networks require more time to execute and train since they are spatially more complex. The suggested network has a less complex structure than the other networks in Table 1; by using fewer hardware resources, it also saves time and makes the training process less demanding. The proposed approach is therefore more computationally efficient and yields better performance results. The dataset's images are divided following the Pareto principle, with 70% of the images used for training, 15% for validation, and the remaining 15% reserved for testing. Thus, 1260 images are selected for training, 270 for validation, and 270 for testing out of a total of 1800 images. The suggested method's training curve, depicted in Figure 5, comprises the loss curve as well as the training and validation accuracy.

Figure 5.

Training and validation curve.
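For reference, the 70/15/15 split described above can be produced with two stratified splits, as in the sketch below; the use of scikit-learn and the fixed random seed are assumptions. With 1800 images this yields 1260 training, 270 validation, and 270 test images.

```python
# 70/15/15 stratified split: 30% is held out, then divided equally into
# validation and test sets.
from sklearn.model_selection import train_test_split

def split_dataset(images, labels):
    X_train, X_rest, y_train, y_rest = train_test_split(
        images, labels, test_size=0.30, stratify=labels, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```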

The experiments have been conducted using a local dataset, trained and validated against current techniques (Table 2). In general, learning capacity grows with the number of layers; however, overfitting difficulties can occur if the learning capacity is too large, so the model performs well during training but poorly during testing. Performance of the proposed network is contrasted with that of current networks, as shown in Table 3, which demonstrates that the suggested strategy produces improved accuracy. The accuracy, precision, and recall are calculated from the values (True Positive, True Negative, False Positive, False Negative) taken from the confusion matrix. For example, when the accuracy over 1260 samples is calculated, the proposed model obtains 1136 True Positives and 94 True Negatives and achieves an average training accuracy of 97.16%.

| Method | Laugh precision | Laugh recall | Cry precision | Cry recall | Neutral precision | Neutral recall | Average accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VFESO-DLSE [18] | 93.18 | 78.5 | 85.07 | 96.27 | 86.71 | 88.86 | 93.6 |
| Shallow CNN | 97.8 | 98 | 96 | 97.2 | 98 | 97.8 | 97.4 |
| Shallow CNN (10-fold validation) | 95.6 | 96.4 | 95.3 | 96.8 | 97 | 96.2 | 96.21 |

Table 2.

Performance comparison of the proposed shallow CNN vs. VFESO-DLSE [18].

| Method | Laugh precision | Laugh recall | Cry precision | Cry recall | Neutral precision | Neutral recall | Testing accuracy | Average accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet-18 | 91.21 | 94.7 | 93 | 92.7 | 89 | 93.2 | 96 | 94 |
| ResNet-50 | 94.3 | 95.6 | 91.74 | 90.6 | 94.3 | 95.6 | 92.6 | 90.26 |
| VGG-16 | 92.1 | 94 | 89.91 | 90.3 | 94.26 | 94 | 95 | 93.6 |
| Shallow CNN | 97.8 | 98 | 96 | 97.2 | 98 | 97.8 | 97.8 | 97.4 |

Table 3.

Performance comparison (pretrained CNNs vs. proposed shallow CNN).

The suggested network was evaluated on the local dataset and the BabyExp dataset, respectively. Tables 2 and 3 compare the networks and demonstrate that the suggested approach performs better. Compared with VFESO-DLSE [18], the proposed shallow CNN obtains a better average accuracy of 97.4% and a cross-validation accuracy of 96.21%. The computational advantage of the proposed method comes from its compact size, with approximately 1.54M floating-point operations. The experimental findings demonstrate that the proposed strategy extracts more useful features than alternative approaches, while the other models show similar relationships between actual and predicted facial expressions. The confusion matrix shows that certain neutral images are incorrectly grouped with other classes; the similarity between the images, which can be seen by examining them, increases the complexity. The suggested approach, however, produces greater accuracy while avoiding the overfitting issue.


8. Conclusion

In the field of computer vision, the ability to recognize infant emotion is significant, as it provides prognostic data for diagnosing ADHD and ASD. This chapter outlines methods for identifying and addressing these conditions at an early stage. It also provides empirical support for studying infant development by learning subtle information from their faces. Infant facial expression recognition raises some significant issues, such as the decreased learning capacity of transfer learning models and the low stability of recognition systems. A thorough framework for early medical condition diagnosis and parental oversight using machine learning and deep learning techniques is presented in this chapter. The suggested network resolves these problems through a two-stage model with a shallow neural network to save space. The shallow network model uses the minimal quantity of data generated from the videos and obtained from websites. With 97.8% accuracy during testing, this model performs well, and it needs less time to train because it has an appropriate learning capacity. Therefore, this chapter supports superior intelligent interpersonal interactions and is well suited to the field of parental surveillance and care.

References

  1. 1. Webb R, Ayers S, Endress A. The city infant faces database: a validated set of infant facial expressions. Behavior Research Methods. 2018;50(1):151-159. DOI: 10.3758/s13428-017-0859-9. PMID: 28205132; PMCID: PMC5809537
  2. 2. Huguet Cabot P-L, Navigli R. REBEL: Relation Extraction By End-to-end Language generation, 2021, Findings of the Association for Computational Linguistics: EMNLP. 2021
  3. 3. Gioia NJ, Alexandra Caldas OA, Rinaldo Focaccia S, Renne Gerber LV, Elisa Harumi L, D'Antino MEF. The child emotion facial expression set: A database for emotion recognition in children. Frontiers in Psychology. 2021;12:1664-1078. Available from: https://www.frontiersin.org/articles/10.3389/fpsyg.2021.666245
  4. 4. Nojavanasghari B, Baltrušaitis T, Hughes C, Morency L-P. Emoreact: A Multimodal Approach and Dataset for Recognizing Emotional Responses in Children. In: Proceedings of the ACM International Conference on Multimodal Interaction (ICMI). 2016
  5. 5. Vanessa L, Cat T. The Child Affective Facial Expression (CAFE) set: validity and reliability from untrained adults. Frontiers in Psychology. 2015;5:1664-1078. Available from: https://www.frontiersin.org/articles/10.3389/fpsyg.2014.01532
  6. 6. Egger HL, Pine DS, Nelson E, Leibenluft E, Ernst M, Towbin KE, Angold A. The NIMH Child Emotional Faces Picture Set (NIMH-ChEFS): a new set of children’s facial emotion stimuli. International Journal of Methods in Psychiatric Research. 2011;20(3):145-156. DOI: 10.1002/mpr.343. PMID: 22547297; PMCID: PMC3342041
  7. 7. Chakraborty S, Thounaojam DM, Sinha N. A shot boundary detection technique based on visual colour information. Multimedia Tools and Applications. 2021;80:4007-4022. DOI: 10.1007/s11042-020-09857-8
  8. 8. Sheena CV, Narayanan NK. Key-frame extraction by analysis of histograms of video frames using statistical methods. Procedia Computer Science. 2015;70:36-40. ISSN 18770509. DOI: 10.1016/j.procs.2015.10.021
  9. 9. Kingston Z, Moll M, Kavraki LE. Sampling-based methods for motion planning with constraints. Annual Review of Control, Robotics, and Autonomous Systems. 2018;1:159-185
  10. 10. Ming Z, Bugeau A, Rouas J, Shochi T. Facial action units intensity estimation by the fusion of features with multi-kernel Support Vector Machine. In: 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG). 2015. pp. 1-6
  11. 11. Lee Y, Kim KK, Kim JH. Prevention of Safety Accidents through Artificial Intelligence Monitoring of Infants in the Home Environment. In: International Conference on Information and Communication Technology Convergence ICTC). 2019. pp. 474-477
  12. 12. Li B, Lima D. Facial expression recognition via ResNet-50. International Journal of Cognitive Computing in Engineering. 2021;2:57-64. ISSN 2666-3074
  13. 13. Altamura M, Padalino FA, Stella E. Facial emotion recognition in bipolar disorder and healthy aging. Journal of Nervous and Mental Disease. 2016;204(3):188-193
  14. 14. Majumder A, Behera L, Subramanian V. Automatic facial expression recognition system using deep network-based data fusion. IEEE Transactions on Cybernetics. 2016:1-12. DOI: 10.1109/TCYB.2016.2625419
  15. 15. Lin Q, He R, Jiang P. Feature Guided CNN for Baby’s Facial Expression Recognition. 2020. Article ID 8855885. DOI: 10.1155/2020/8855885
  16. 16. Uma Maheswari P, Mohamed Mansoor Roomi S, Senthilarasi M, Priya K, Shankar Mahadevan G. Shallow CNN Model for Recognition of Infants Facial Expression. In: 4th International Conference on Machine Intelligence and Signal Processing, MISP. 2022
  17. 17. Maack JK, Bohne A, Nordahl D, Livsdatter L, Lindahl AAW, Overvoll M, et al. The Tromso Infant Faces Database (TIF): development, validation and application to assess parenting experience on clarity and intensity ratings. Frontiers in Psychology. 2017. Sec. Quantitative Psychology and Measurement. DOI: 10.3389/fpsyg.2017.00409
  18. 18. Lin Q, He R, Jiang P. Feature Guided CNN for Baby's Facial Expression Recognition. Complexity. Hindawi; 2020. Article ID 8855885, 10 pp
