Facial expression recognition process.
Emotion recognition enables real-time analysis, tagging, and inference of cognitive-affective states from human facial expressions, speech and tone, body posture, and physiological signals, as well as from text on social network platforms. Emotion patterns can be decoded through computational modeling of explicit and implicit features extracted through wearable and other devices. Emotion recognition and computation are also critical to the detection and diagnosis of potential patients with mood disorders. This chapter summarizes the main findings in affective recognition and its applications in major depressive disorder (MDD), an area that has made rapid progress in the last decade.
- emotion recognition
- computational modeling
- machine learning
- depressive disorder
The idea of making computers capable of emotional computing was first proposed by Minsky of MIT, one of the founders of artificial intelligence. In his book The Society of Mind, he wrote that “The question is not whether intelligent machines can have any emotions, but whether machines can be intelligent without emotions.” Picard proposed the concept of affective computing (AC) in 1995. Her monograph “Affective Computing,” published in 1997, defined affective computing as “computing that relates to, arises from, or deliberately influences emotions.” She divided the research content of affective computing into nine aspects: mechanism of emotion, acquisition of emotion information, recognition of emotion patterns, modeling and understanding of emotion, synthesis and expression of emotion, application of emotion computing, interface of emotion computers, transmission and communication of emotion, and wearable computers. Among these aspects, practical research on emotion recognition is largely based on theories of the mechanism of emotion and the acquisition of emotion information.
The mechanism of emotion is described from phenomenal and mechanistic views of emotion. The phenomenal views typically involve two approaches: the discrete and the dimensional view of emotion. The discrete view proposes that emotions can be labeled with a limited set of basic emotions, which can be combined into complex emotions. This approach is problematic because the labels may be too restrictive to reflect complex emotions. Additionally, the labels may be culture dependent and may therefore fail to reflect the common substrates of different affective labels. The dimensional view proposes that emotions are distributed in a multidimensional space in which they evolve continuously. Two common dimensions are valence (pleasantness) and arousal (activation level). Emotion recognition algorithms that represent emotion with labels are intuitive, but the labels are ambiguous for computer processing. Additionally, label-based recognition of emotion patterns requires classifying emotional data into a large group of labels. For these reasons, researchers have developed a number of dimensional models of emotion, such as Russell’s circumplex model, Whissell’s evaluation-activation space model, and Plutchik’s wheel of emotions.
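The contrast between dimensional and label-based representations can be made concrete with a small sketch. It assumes that valence and arousal are both scaled to [-1, 1]; the class name, the function name, and the example coordinates are hypothetical and chosen only for illustration, not taken from any of the cited models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffectPoint:
    valence: float  # pleasantness, assumed scaled to [-1, 1]
    arousal: float  # activation level, assumed scaled to [-1, 1]


def quadrant(point: AffectPoint) -> str:
    """Map a continuous (valence, arousal) point to a coarse quadrant label."""
    v = "positive" if point.valence >= 0 else "negative"
    a = "high" if point.arousal >= 0 else "low"
    return f"{v}-valence/{a}-arousal"


# Hypothetical coordinates, for illustration only.
EXAMPLES = {
    "joy": AffectPoint(0.8, 0.6),
    "anger": AffectPoint(-0.7, 0.8),
    "sadness": AffectPoint(-0.6, -0.5),
    "calm": AffectPoint(0.5, -0.6),
}

for name, point in EXAMPLES.items():
    print(name, "->", quadrant(point))
```

A continuous point keeps information that a single label discards, while the quadrant mapping shows how discrete categories can still be recovered when a classifier needs them.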
According to the mechanistic views of emotion, emotion pattern recognition relies not only on semantic labels but also on physiological signals that originate in the dynamics of the peripheral nervous system (PNS) and the central nervous system (CNS). (1) The PNS emotion patterns. The PNS includes the autonomic and the somatic nervous systems (ANS and SNS). According to Schachter and Singer’s peripheral theory of emotion (the cognition-arousal theory), people assess their emotional state by their physiological arousal. Emotional states are inherent in these physiological dynamics and can feasibly be recognized from PNS physiological data, according to work from the lab led by Picard. Ekman and colleagues provided the first evidence of PNS differences (including hand temperature, heart rate, skin conductance, and forearm tension) among four negative emotions. However, their algorithms are based on intentionally expressed emotion and are user dependent, which may restrict generalization to other users. (2) The CNS emotion patterns. The large majority of computational models of emotion stem from the appraisal theory of emotion, which emphasizes the CNS processes of emotion. Frijda criticized the arousal theory of emotion and proposed that awareness of autonomic responding is not a prerequisite for emotional experience or behavior. The differentiation of emotions is explained as the result of the sequential appraisal of an affective stimulus. Scherer suggests that there may be as many emotions as there are different appraisal outcomes. Thus there exists a minimal set of appraisal criteria necessary for the differentiation of primary emotional states. However, it should be noted that physiological changes are determined not only by appraisal meaning but also by factors outside the appraisal or emotion realm. Therefore, there is not adequate evidence for a consistent and specific PNS response during emotional episodes.
Practically, emotion recognition requires the acquisition of emotional information. Emotional information comprises a variety of physiological or behavioral reactions that accompany changes in emotional state, and it includes internal and external emotional features. (1) Internal emotional information. This refers to physiological reactions that cannot be detected from outside the human body, such as the electrical, mechanical, or chemical output of brain activity (EEG), heart muscle activity (ECG, heart rate, pulse), skeletal muscle activity (EMG), breathing activity (respiration), and blood vessel activity (blood pressure, vasodilation). (2) External emotional information. This refers to reactions that can be directly observed from a person's appearance, such as facial expression, speech, and posture. The extraction of common features from highly individualized emotion information constitutes the fundamental basis of emotion recognition. A large number of features can be extracted from internal and external emotional signals, for example by calculating their mean and standard deviation, applying transforms, computing band power, and detecting peaks.
2. Methods for emotion recognition
The main methods for emotion recognition involve the following emotion indexes: (1) emotional behavior, namely, facial expression recognition, speech emotion recognition, and posture recognition (see Sections 2.1–2.3); (2) physiological patterns, that is, objective emotional indexes obtained by measuring PNS and CNS physiological signals (see Section 2.5); and (3) psychological measures and multimodal emotion signals, such as textual information and multimodal emotion information (see Sections 2.4 and 2.6).
2.1 Facial expression recognition
Faces are one of the most important channels for the visual communication of emotion. Although it started only in the 1970s, facial expression recognition is the most studied field in the machine recognition of natural emotions, especially in the USA and Japan, where it has become a hotspot of AI emotion recognition. In 1971, the American psychologists Ekman and Friesen categorized facial expressions into six types: anger, disgust, happiness, fear, surprise, and sadness. In 1978 they also established the Facial Action Coding System (FACS), which is among the earliest work on facial expression recognition. Facial expressions are deemed observable indicators of internal emotional states, which makes emotion differentiation possible.
Currently, the most-used facial expression resources include Ekman’s FACS and its updated version; automated facial image analysis (AFIA), developed by Carnegie Mellon University; the Japanese female facial expression database JAFFE and its expansion set from the ATR Media Information Science Laboratory in Japan; the Cohn-Kanade and CK+ expression databases established by the CMU Robotics Institute, USA; and the RaFD facial expression database. Common facial expression picture libraries in China include the USTC-NVIE image library, the CFAPS facial emotion stimulus materials, and the Chinese facial expression intensity grading picture library.
Facial expression recognition includes the following steps: (1) facial image acquisition, (2) image preprocessing, (3) feature extraction, and (4) emotion classification (Table 1).
|Processes||Sub-processes and related work|
|Facial image acquisition||Facial images are obtained from images and videos, including static expressions and dynamic expressions|
|Image preprocessing||Face detection and positioning, face adjustment, editing, scale normalization, histogram equalization, dimming, light compensation, homomorphic filtering, graying, Gaussian smoothing|
|Feature extraction||(a) Static image: Gabor wavelet transform, local binary patterns (LBP), scale-invariant feature transform (SIFT), discrete cosine transform (DCT), regional covariance matrix. (b) Dynamic image: optical flow method, difference image method, feature point tracking, model-based method, elastic graph matching.|
|Emotion classification||Canonical correlation analysis, sparse representation classification, expert rule-based method|
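As one illustration of the static-image feature extraction step in the table, the following is a minimal sketch of the basic 3x3 local binary pattern (LBP) histogram. It assumes a grayscale image given as a list of rows of integer intensities; the function name and the neighbour ordering are illustrative choices rather than the setup of any cited study, and practical systems usually compute such histograms per face region after face detection and normalization.

```python
def lbp_histogram(image):
    """Return the 256-bin histogram of basic 3x3 LBP codes.

    `image` is a list of equal-length rows of integer intensities.
    Border pixels are skipped because they lack a full 8-neighbourhood.
    """
    # Neighbour offsets (row, col), clockwise from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # A neighbour at least as bright as the centre sets its bit.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    return hist


# Toy 4x4 image with uniform intensity: every interior pixel gets code 255.
flat = [[7] * 4 for _ in range(4)]
print(lbp_histogram(flat)[255])
```

The resulting histogram is a fixed-length texture descriptor that can be passed to any of the classifiers listed in the last row of the table.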
Apart from Ekman’s discrete emotion model, facial expression recognition has also been conducted under other emotion models (such as the dimensional model). Ballano et al. proposed a method for continuous facial affect recognition from videos based on the evaluation-activation 2D model proposed by Whissell. The evaluation dimension defines the valence of the emotion, while the activation dimension defines the action tendency (e.g., active versus passive) under the emotional state. Their model extended the emotion information to a continuous emotional trajectory.
Micro-expressions are quick, unconscious, and spontaneous facial movements that occur when people experience strong emotions. A micro-expression lasts about 1/25 to 1/2 s. The fleeting micro-expression involves only small movements and does not appear in the upper face and the lower face at the same time, so it is quite difficult to observe and recognize correctly. Therefore, the collection and selection of micro-expression data sets are very important. Micro-expression recognition requires (1) acquiring and preprocessing the face image, (2) detecting micro-expressions in the face and extracting their features, and (3) classifying and recognizing the category of the micro-expression. Different research teams have developed different automatic micro-expression recognition systems and established databases.
Polikovsky et al. explored the 3D gradient histogram method for feature extraction of facial micro-expressions in video sequences based on the Polikovsky expression library. They proposed a new approach that captures micro-expressions with a 200 fps high-speed camera. Shreve et al. established the USF-HD database and applied the optical flow method to automatic micro-expression recognition. They developed a method for automatically spotting continuously changing facial expressions in long videos. The University of Oulu in Finland developed the spontaneous micro-expression corpus (SMIC) and SMIC2. Yan et al. improved the micro-expression elicitation paradigm and developed the Chinese micro-expression database CASME. Later they expanded the sample size, improved the frame rate and image quality of CASME, and created CASME II. They differentiated full suppression of facial movements from self-perceived suppression of facial movements. The micro-expressions were elicited in a well-controlled laboratory context and recorded at high temporal resolution (200 frames/s). The best performance was 63.41% for 5-class classification.
2.2 Speech emotion recognition
As the easiest, most basic, and most direct means of communication, speech contains rich emotional information. Speech conveys not only semantic information but also the speaker’s emotional state. For instance, an angry person may speak loudly, with heavy tones and at an accelerated pace, whereas a sad person may speak slowly with a sullen intonation. Therefore, to make computers understand people’s emotions better and interact with people more naturally, it is necessary to study speech emotion. Speech emotion recognition is widely applied in man-machine interaction. For example, an automatic customer service system can transfer emotional users to manual service; a driver’s emotional fluctuations can be monitored from speech speed and volume so that the driver is reminded to stay calm, which helps to prevent accidents; the technology can help the disabled to speak; and it can detect the emotional state of patients with mental disorders based on context analysis.
Most studies use prosodic features as the characteristic parameters for speech emotion recognition. For example, Gharavian et al. extracted parameters such as fundamental frequency, formants, and Mel coefficients and then analyzed the correlations among them. The resulting 25-dimensional vectors were classified with the FAMNN classification algorithm to obtain a more credible emotion recognition result. Devi et al. summarized speech signal preprocessing techniques, common features such as short-term energy and MFCC, and their applications in speech emotion recognition. Zhang et al. used a multilayer deep belief network (DBN) to automatically extract emotional features from speech signals, concatenated consecutive speech frames to form an abstract high-dimensional feature, fed the features trained by the DBN into an extreme learning machine (ELM) classifier, and thereby established a speech emotion recognition system. Zhu et al. proposed a track-based space-time spectral signature method for speech emotion recognition and obtained relatively accurate results. Liu and Qin studied the application of speech emotion recognition in manned space flight, established a stress emotion corpus, and built a speech emotion recognition model and software through feature extraction and a Gaussian mixture model (GMM) to verify the accuracy of speech emotion recognition.
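Two of the prosodic features mentioned above, short-term energy and fundamental frequency, can be sketched in a few lines. This is a minimal illustration that assumes a mono signal given as a list of floats; the autocorrelation-peak pitch estimator and the 50–500 Hz search range are simplifying assumptions, not the procedure of any of the cited studies.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a signal into overlapping frames of `frame_len` samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def short_term_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def estimate_f0(frame, sample_rate, f_min=50.0, f_max=500.0):
    """Estimate the fundamental frequency from the autocorrelation peak."""
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_val = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag


# Synthetic 200 Hz tone sampled at 8 kHz, standing in for a voiced frame.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 200 * t / SAMPLE_RATE) for t in range(800)]
frames = frame_signal(tone, frame_len=400, hop=200)
print(short_term_energy(frames[0]), estimate_f0(frames[0], SAMPLE_RATE))
```

Per-frame values like these are usually summarized over an utterance (for example by their mean and range) before being passed to a classifier.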
The emotional speech features extracted in the abovementioned studies are mostly targeted at personalized speech emotion recognition, while feature extraction for non-personalized speech emotion recognition remains a challenge. Recent efforts have been made toward the development of large corpora. Current speech emotion recognition research is limited by the lack of a unified, public, and standard Mandarin emotion corpus, as well as of an authoritative and unified standard for building emotion corpora. Many studies are based on self-recorded databases that vary in age, gender, number of participants, text content, and the scale of the final corpus, which makes it difficult to compare results across studies. Furthermore, most studies are based on discrete emotions, and few consider corpora annotated with emotional dimensions.
2.3 Posture emotion recognition
Posture refers to the expressive actions of parts of the human body other than the face. It can coordinate with or supplement speech content and effectively convey emotional information. Postures can be divided into body expressions and gestures. Body expression is one of the ways to express emotions. People adopt different postures under different emotional states, such as a belly laugh when happy, arched shoulders when scared, and fidgeting when nervous. Gestures such as raising the hands or standing with arms akimbo can also express individual emotions. Because postures differ with emotional state and intensity, it is possible to analyze and predict emotional state by observing different expressions and their intensity. Early researchers pointed out that posture and movement can reflect only the intensity of an emotion, not its nature or type. Later, some proposed that posture is conducive to expressing emotional intensity although it cannot accurately reflect emotional state. Some scholars have studied the ability of subjects to understand six basic postures; the subjects were expressionless throughout the test, and the results showed that posture can be used to identify certain emotional states, such as sorrow and fear.
Generally, there are two posture recognition methods: (1) recognizing affective content of daily behavior through analysis and (2) using the temporal and spatial characteristics of gestures (such as the rhythm, amplitude, and strength of the motion) to analyze the affective content. For example, Castellano et al.  proposed a method for recognizing emotions based on human motion indicators (such as amplitude, velocity, and mobility) and establishing emotional models with image sequences and motion test indicators; Bernhardt and Robinson  used segmentation techniques to quantify high-dimensional motion into a set of simple motion data, extract motion features, and pair them with corresponding emotions; Liu et al.  classified the body movement and combined motion and velocity parameters for weighting function calculation to identify the emotion expressed by certain movement. Shao and Wang  extracted two 3D texture features by processing the image sequences of body movement and used this as a basis for emotion classification. The recognition rate can reach 77.0% in experiment which tests seven common natural emotions in the FABO database.
The posture recognition process mainly includes four steps: motion data acquisition, preprocessing, motion feature extraction, and emotion classification. The first step is motion data acquisition. Generally, there are two types of collection methods: (1) the contact type, which uses wearable devices embedded with various sensors, such as electronic gloves and data shoe covers, and (2) the noncontact type, which generally obtains image information through a camera. Contact recognition technology has a high equipment cost and an uncomfortable user experience, and it goes against the objective of natural man–machine interaction. The second step is data preprocessing. This generally includes human body detection, image denoising, image segmentation, image binarization, time windowing, filtering, and others. Among them, human body detection mainly relies on basic image segmentation, the background difference method, the interframe difference method, the optical flow method, and the energy minimization method. The third step is motion feature extraction. Generally speaking, motion features can be divided into four categories: (1) static features, which include size, color, outline, shape, and depth; (2) dynamic features, which include speed, optical flow, direction, and trajectory; (3) spatiotemporal features, which include spatiotemporal context, spatiotemporal shape, and spatiotemporal interest points; and (4) descriptive features, which include scenes, attributes, objects, and poses. The three most-used types of methods for motion feature extraction are time domain analysis, frequency domain analysis, and time-frequency domain analysis. The fourth step is emotion classification. Besides the commonly used classifiers, options include dynamic time warping, dynamic programming, latent Dirichlet allocation, probabilistic latent semantic analysis, context-free grammars, finite state machines, conditional random fields, and others.
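Of the classifiers listed for the fourth step, dynamic time warping (DTW) is perhaps the easiest to show in code. The sketch below assumes that each movement has already been reduced to a one-dimensional feature sequence (for example hand height over time); the template names and values are hypothetical, and nearest-template matching is only one simple way of using DTW for classification.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def classify_by_dtw(query, templates):
    """Return the label of the template closest to `query` under DTW."""
    return min(templates,
               key=lambda label: dtw_distance(query, templates[label]))


# Hypothetical motion templates, for illustration only.
TEMPLATES = {
    "fast_wave": [0, 2, 0, 2, 0, 2],
    "slow_raise": [0, 0.5, 1, 1.5, 2, 2],
}
print(classify_by_dtw([0, 1, 2, 2], TEMPLATES))
```

Because DTW allows sequences to be stretched in time, the same gesture performed at different speeds can still match its template.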
2.4 Textual emotion recognition
Emotions are not exactly linguistic constructs. However, the most convenient access to emotion is through language. With the advent of social media, its platforms are becoming a rich source of multimodal affective information, including text, videos, images, and audio. Text is one of these sources. Affect recognition from text analysis is often used for public opinion mining. The process of textual emotion recognition contains four steps: material collection, text preprocessing, feature extraction, and emotion classification. (1) The first step is material collection. Web crawlers are commonly used to collect materials from blogs, e-commerce sites, and news sites. (2) The second step is text preprocessing, which includes word segmentation, part-of-speech tagging, tag filtering, affix trimming, simplification and replacement, and so on. (3) The third step is feature extraction. The main text features involve words, phrases, n-grams, concepts, and others. Words carrying general features can be extracted automatically, while others need to be identified manually before an emotional glossary is created. Other methods used are frequent pattern mining and association rule mining techniques. (4) The fourth step is emotion classification. In addition to some commonly used classifiers, the methods include central vector classification, maximum entropy, emotion-based word labeling, and word frequency-weighted statistics.
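The preprocessing, feature extraction, and classification steps above can be sketched with a toy lexicon-based classifier. Everything specific here is an assumption made for illustration: the lexicon entries are hypothetical, whitespace tokenization only suits languages that separate words with spaces (Chinese text would need a dedicated word segmenter), and counting lexicon hits is just the simplest form of word frequency-weighted statistics.

```python
from collections import Counter

# Hypothetical toy lexicon; real systems use curated emotion glossaries.
EMOTION_LEXICON = {
    "happy": "joy", "great": "joy", "love": "joy",
    "sad": "sadness", "lonely": "sadness", "cry": "sadness",
    "angry": "anger", "hate": "anger",
}


def preprocess(text):
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " "
                      for ch in text.lower())
    return cleaned.split()


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def classify(text):
    """Pick the emotion whose lexicon words occur most often, else None."""
    counts = Counter(EMOTION_LEXICON[t]
                     for t in preprocess(text) if t in EMOTION_LEXICON)
    return counts.most_common(1)[0][0] if counts else None


tokens = preprocess("I love this, great!")
print(tokens, ngrams(tokens, 2), classify("I love this, great!"))
```

A supervised classifier would replace the lexicon lookup with weights learned over word and n-gram features, but the surrounding pipeline stays the same.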
Research in China on textual emotion recognition mainly centers on social platforms such as microblogs. For example, Hao et al. proposed a microblog emotion recognition method based on the wording features of microblogs and verified its validity. Hao et al. proposed a supervised learning method for classifying and predicting emotional polarity in microblogs, and the accuracy in the experimental analysis reached 79.9%. Huang et al. proposed TMMMF, a multifeature fusion-based model for mining microblog themes and emotions, and verified its validity. Zhang et al. proposed a neural network-based joint model for microblog emotion recognition and emotion incentive extraction. The experiment shows that the F-score of the model is 82.70% in the emotion incentive extraction task and 74.74% in the emotion recognition task.
2.5 Physiological pattern recognition
William James proposed that emotions derive from peripheral physiological responses. Kreibig examined the patterns of autonomic nervous system activity under different emotions and showed specificity in the physiological responses. For example, fear causes an accelerated heart rate and respiratory rhythm and a strengthened galvanic skin response. James’s theory affirms the role of autonomic physiological activity in emotional expression but ignores the role of the brain centers in emotion. In 1929, Cannon questioned James’s theory and, together with Bard, put forward the Cannon-Bard theory (also known as the thalamic theory). According to this theory, emotions and their corresponding physiological changes occur simultaneously, both are controlled by the thalamus, and the central brain determines the nature of emotions, which affirms the role of the central nervous system in regulating and controlling emotions. In conclusion, the occurrence of emotions is accompanied by a certain degree of physiological activation of the central and peripheral nervous systems. This provides a theoretical basis for studying emotion recognition from different physiological patterns.
Early studies of emotion recognition mainly focused on PNS physiological signals such as skin temperature, blood pressure, electrocardiogram, electromyography, respiratory action, galvanic skin response, and blood volume fluctuation. Picard et al. collected four physiological signals (galvanic skin response, blood volume fluctuation, electromyographic signal, and respiratory action) under different emotional states and reached a recognition accuracy of 81% for eight emotions. Kim and Andre developed a short-term monitoring emotion recognition system based on the physiological signals of multiple users. They used a support vector machine (SVM) to classify four emotions, namely sadness, depression, surprise, and anger, and achieved a classification rate of 95%. Yan et al. collected a variety of physiological signals with the multipurpose polygraph MP150, used Fisher, k-NN, and other intelligent algorithms for feature extraction and analysis, and identified six basic emotional states with recognition rates of 60–90%. Li et al. proposed emotion recognition based on recurrence quantification analysis of physiological signals. They extracted 10 sets of nonlinear features from the recurrence plots of skin conductance, myoelectric, and respiratory signals and achieved a higher emotion recognition rate. Jin et al. used an updated LSTSVM for emotion recognition based on electroencephalography, skin conductance, myoelectricity, and respiration signals and obtained higher recognition accuracy.
In recent years, with the development of neurophysiology and the rise of brain imaging technology, CNS brain signals have attracted the attention of researchers and been used in emotion recognition because of their high temporal resolution and strong functional specificity. In the early stage of the study, the most common measurement index was electroencephalogram (EEG). Some scholars pointed out that the frontal brain asymmetry is closely related to emotional valence. Studies have shown that high-frequency parts of EEG can reflect people’s emotional and cognitive states and the γ and β bands can better tell the change of emotional state than the low-frequency band . Jie et al.  realized the recognition of high and low arousal and high and low pleasure through nonlinear feature sample entropy. Duan et al.  used differential entropy in machine classification learning for emotion recognition, and the classification accuracy rate was up to 84.22%. It is shown that as a nonlinear EEG feature, differential entropy shows higher classification efficiency. Later, some scholars combined spontaneous physiological signals with EEG and used comprehensive information to improve the recognition rate [43, 44].
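For readers unfamiliar with the differential entropy feature mentioned above: for a Gaussian-distributed signal with variance σ², the differential entropy has the closed form h = ½ ln(2πeσ²). The sketch below computes this quantity from samples. It assumes that the input is a segment of a single EEG channel, typically already filtered to one frequency band; the text does not give the exact preprocessing used in the cited study, so that part is an assumption.

```python
import math


def differential_entropy(samples):
    """Differential entropy in nats, under a Gaussian assumption."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    return 0.5 * math.log(2 * math.pi * math.e * variance)


# Doubling the amplitude quadruples the variance and adds ln 2 to the entropy.
segment = [-1.0, 1.0, -1.0, 1.0]
louder = [2 * x for x in segment]
print(differential_entropy(segment), differential_entropy(louder))
```

Under the Gaussian assumption the feature is simply a logarithmic function of band variance, which makes it cheap to compute per channel and per band.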
However, the EEG acquisition process is relatively complicated, and the recordings are often contaminated by external noise and electromyographic activity. Measurement of cerebral blood oxygen parameters with functional near-infrared spectroscopy (fNIRS) is gaining popularity in emotion recognition because of its high portability, its insensitivity to noise and motion, and its suitability for long-term continuous measurement. Tai and Chau extracted time domain features of prefrontal signals during affective states to identify positive and negative emotions elicited by emotional pictures. The recognition rate across 13 subjects ranged from 75.0 to 96.67%.
The most critical steps in emotion recognition based on physiological signals are signal preprocessing, feature extraction and optimization, and classification identification.
Emotion signal preprocessing. This step mainly retains the valid data segments in which the induced emotion is at its strongest and then removes noise and artifacts from the signal. Artifact removal methods mainly include filtering, normalization, independent component analysis, and so on. (a) Filters with different frequency band parameters, such as adaptive filters and Butterworth filters, are commonly used for denoising physiological signals; for example, smoothing filters remove high-frequency glitches from the galvanic skin response. (b) Normalization can reduce the adverse effects of individual baseline differences on emotion recognition. (c) Independent component analysis or principal component analysis can remove electro-oculogram and other artifacts.
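Steps (a) and (b) can be sketched as follows. Note the simplifications: a centred moving average stands in for the adaptive or Butterworth filters named above, and the normalization is a z-score against a resting-baseline recording from the same subject. The function names and the window length are illustrative assumptions.

```python
from statistics import mean, pstdev


def moving_average(signal, window=5):
    """Smooth with a centred moving average; edges use a shrunken window."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def baseline_normalize(signal, baseline):
    """Z-score a recording against the subject's own resting baseline."""
    mu, sigma = mean(baseline), pstdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has zero variance")
    return [(x - mu) / sigma for x in signal]


# A single-sample glitch in a toy skin conductance trace is spread out and damped.
trace = [0, 0, 10, 0, 0]
print(moving_average(trace, window=5))
print(baseline_normalize([3, 5], baseline=[1, 3]))
```

Normalizing against each subject's own baseline is one way to reduce the individual differences mentioned in (b) before features are compared across subjects.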
Feature extraction. There are four main types of features: time domain, frequency domain, time-frequency, and nonlinear features.
Time domain. Time domain feature extraction was developed first and is relatively simple. It obtains information such as amplitude, mean value, standard deviation, skewness, and kurtosis by analyzing the time domain waveform of the signal. Little information is lost in this processing. Common time domain analysis methods include zero-crossing analysis, histogram analysis, analysis of variance, correlation analysis, peak detection, waveform parameter analysis, and waveform recognition. Emotion recognition studies that use cerebral blood oxygen parameters more often rely on time domain feature analysis and extraction.
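The statistics named in this paragraph are straightforward to compute. The following minimal sketch uses population moments, counts zero crossings on the mean-centred signal, and treats a sample larger than both neighbours as a peak; these conventions and the dictionary keys are illustrative choices, and real pipelines typically add thresholds to peak detection.

```python
import math


def time_domain_features(signal):
    """Compute basic time domain descriptors of a 1-D signal."""
    n = len(signal)
    mu = sum(signal) / n
    std = math.sqrt(sum((x - mu) ** 2 for x in signal) / n)
    # Standardized third and fourth moments (0.0 for a constant signal).
    skew = sum((x - mu) ** 3 for x in signal) / n / std ** 3 if std else 0.0
    kurt = sum((x - mu) ** 4 for x in signal) / n / std ** 4 if std else 0.0
    centred = [x - mu for x in signal]
    zero_crossings = sum(1 for a, b in zip(centred, centred[1:]) if a * b < 0)
    peaks = sum(1 for a, b, c in zip(signal, signal[1:], signal[2:])
                if b > a and b > c)
    return {
        "mean": mu,
        "std": std,
        "skewness": skew,
        "kurtosis": kurt,
        "amplitude": max(signal) - min(signal),
        "zero_crossings": zero_crossings,
        "peaks": peaks,
    }


print(time_domain_features([1, -1, 1, -1]))
```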
Frequency domain. Frequency domain feature extraction is based on power spectrum analysis and is widely used in the analysis of ECG, respiration, EEG, and other signals. Typical features are the power spectrum ratio, power spectrum energy, and sub-band power spectral density in different frequency bands.
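A minimal sketch of band power features follows. It uses a direct discrete Fourier transform for self-containment (an FFT library would be used in practice), and the alpha (8–13 Hz) and beta (13–30 Hz) boundaries are conventional values assumed here; band definitions vary between studies.

```python
import cmath
import math


def power_spectrum(signal, sample_rate):
    """Return (frequencies, power) of the one-sided spectrum via a direct DFT."""
    n = len(signal)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        freqs.append(k * sample_rate / n)
        power.append(abs(coeff) ** 2 / n)
    return freqs, power


def band_power(freqs, power, low, high):
    """Sum the power of all bins with low <= frequency < high."""
    return sum(p for f, p in zip(freqs, power) if low <= f < high)


def relative_band_power(freqs, power, low, high):
    """Band power as a fraction of the total one-sided power."""
    return band_power(freqs, power, low, high) / sum(power)


# One second of a synthetic 10 Hz oscillation sampled at 128 Hz.
FS = 128
wave = [math.sin(2 * math.pi * 10 * t / FS) for t in range(FS)]
freqs, power = power_spectrum(wave, FS)
print(band_power(freqs, power, 8, 13), band_power(freqs, power, 13, 30))
```

Ratios between bands, such as the power spectrum ratio mentioned above, follow by dividing two `band_power` values, with a guard for empty bands.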
Time-frequency feature. Time-frequency feature extraction considers the joint distribution of information in the time domain and the frequency domain. It describes how the frequency content of a signal changes over time and therefore contains more comprehensive information. Commonly used analysis methods are the wavelet transform, the short-time Fourier transform, the Hilbert-Huang transform, and others. The wavelet transform offers multiresolution analysis with an adjustable sliding time window and good resolution in both the time and frequency domains, and it has become an effective tool for analyzing nonstationary signals, such as the EEG, ECG, EMG, and other signals underlying emotion processes.
Nonlinear feature. EEG signals are generated in the complex limbic system and show noticeable nonlinear and chaotic characteristics, so the extraction of EEG features is more complex and diverse than that of other physiological signals. In recent years, the analysis of nonlinear features such as entropy, correlation dimension, and fractal dimension has gradually increased in the study of emotional EEG recognition. Konstantinidis et al. calculated the correlation dimension of emotional EEG for online recognition research; Liu et al. extracted nonlinear features such as the fractal dimension of EEG, obtained good recognition performance, and built an online application.
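As an example of an entropy-type nonlinear feature, the following is a minimal sketch of sample entropy. It follows the usual definition (template matching under the Chebyshev distance, self-matches excluded, tolerance set to a fraction `r` of the signal's standard deviation); the defaults `m=2` and `r=0.2` are common conventions assumed here, not parameters reported in the text. The quadratic loop is fine for short segments only.

```python
import math


def sample_entropy(signal, m=2, r=0.2):
    """Sample entropy with embedding dimension m and tolerance r * std."""
    n = len(signal)
    mu = sum(signal) / n
    std = math.sqrt(sum((x - mu) ** 2 for x in signal) / n)
    tol = r * std

    def count_matches(length):
        # The same n - m starting points are used for both template lengths.
        templates = [signal[i:i + length] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                dist = max(abs(a - b)
                           for a, b in zip(templates[i], templates[j]))
                if dist <= tol:
                    count += 1
        return count

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        return float("inf")  # undefined when no matches are found
    return -math.log(a / b)


# A strictly periodic signal is perfectly predictable: entropy 0.
print(sample_entropy([0, 1] * 20))
# A late deviation breaks one of two matches: entropy ln 2.
print(sample_entropy([0, 1, 0, 1, 0, 2]))
```

Lower values indicate a more regular signal, which is why such measures are used to characterize the complexity of EEG under different emotional states.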
2.6 Multimodal emotion recognition
Most recent research has focused on multimodal emotion recognition using visual and aural information. Human expression of emotion is mostly multimodal, combining visual, audio, and textual modalities for effective communication. Furthermore, physiological signals can reveal emotional state objectively, even if the subject conceals his or her expression of emotion for complex reasons. Hence emotion recognition that integrates multiple modalities has gained increasing attention, and in practical applications the research focus has shifted from single-modality to multimodal emotion recognition. D’Mello and Kory used statistical methods to compare the accuracy of single-modality and multimodal recognition on different databases. Multimodal recognition outperformed single-modality recognition in the experiments. The McGurk effect reveals that, during perception, the brain automatically and unconsciously combines different senses to process information, and any lack or inaccuracy of sensory information leads to deviations in the brain’s understanding of external information. Therefore, multimodal feature fusion has become a research hotspot in the past few years.
The widely used multimodal emotion databases are HUMAINE database , the Belfast database , the large-scale audiovisual database SEMAINE , the IEMOCAP emotional database , the audiovisual database eNTERFACE , the Acted Facial Expression in the Wild database (AFEW)  composed of audio and video clips from English movies and TV programs, and the Chinese multimodal emotional data set CHEAVD .
Multichannel information fusion can be divided into three levels: the data layer, the feature layer, and the decision layer. (1) Data layer fusion refers to fusing the collected raw data, extracting a feature vector from the fused data, and finally classifying the emotion. (2) Feature layer fusion refers to preprocessing and extracting features from the collected data of each channel first, then fusing the extracted emotion features into one feature vector, and finally classifying the emotion. (3) Decision layer fusion refers to making a separate emotion classification decision for the collected data of each channel and then fusing the single-modality recognition results to obtain the final classification result. The commonly used information fusion methods are D-S evidence theory, artificial neural networks, fuzzy set theory, Bayesian inference, cluster analysis, expert system methods, and others.
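The difference between feature layer and decision layer fusion can be sketched as follows. The example assumes that each modality has already produced either a feature vector or a set of class probabilities; the modality names, probabilities, and weights are hypothetical, and weighted averaging is used only as a simple stand-in for the fusion methods listed above (it is not D-S evidence theory or Bayesian inference).

```python
def feature_level_fusion(feature_vectors):
    """Concatenate per-modality feature vectors into one vector."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused


def decision_level_fusion(modality_probs, weights=None):
    """Fuse per-modality class probabilities by weighted averaging.

    Returns the winning label and the fused score of every label.
    """
    names = list(modality_probs)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    labels = modality_probs[names[0]].keys()
    scores = {
        label: sum(weights[n] * modality_probs[n][label] for n in names) / total
        for label in labels
    }
    return max(scores, key=scores.get), scores


# Hypothetical single-modality classifier outputs.
probs = {
    "face": {"happy": 0.6, "sad": 0.4},
    "speech": {"happy": 0.3, "sad": 0.7},
}
print(feature_level_fusion([[0.1, 0.2], [0.9]]))
print(decision_level_fusion(probs))
print(decision_level_fusion(probs, weights={"face": 3.0, "speech": 1.0}))
```

In feature layer fusion a single classifier is trained on the concatenated vector, whereas in decision layer fusion each modality keeps its own classifier and only their outputs are combined, so a missing or unreliable channel can be down-weighted.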
Current studies on postures mainly concentrate on bimodal emotion recognition of facial expressions and postures. Castellano et al. conducted a comparative study of the processing of body language and facial expression and found that the two have similar visual processing mechanisms. They are highly similar in terms of event-related potential (ERP) components, psychological functions, and influencing factors, and their potential neural bases partially overlap or are adjacent to each other. Gunes and Piccardi [59, 60] conducted long-term research on bimodal emotion recognition of facial expressions and postures and established the Bi-modal Face and Body Gesture Database for Automatic Analysis of Human Nonverbal Affective Behavior (FABO). Yan et al. studied video-based bimodal emotion recognition of facial expressions and postures and proposed a method based on bilateral sparse partial least squares, which has low computational complexity but also a low recognition rate. To recognize human emotions from video data, Wang and Shao extracted emotional features of facial expressions and body movements from the FABO database, fused the two feature sets with an algorithm based on canonical correlation analysis (CCA), and then used a nearest neighbor classifier and a support vector machine for emotion recognition. Using an updated sparsity preserving CCA (SPCCA) to combine the emotion features of facial expressions and body movements, they achieved an emotion recognition rate of 90.48%. Wang et al. addressed the high computational complexity of video emotion recognition and proposed a bimodal emotion recognition method based on the temporal-spatial local binary pattern moment (TSLBPM), which has been proven effective. Jiang et al. proposed a spatiotemporal local ternary orientational pattern (SLTOP) feature description method and a cloud-weighted decision fusion classification method for bimodal emotion recognition of facial expressions and postures in video sequences, achieving better recognition results than other classification methods in comparative experiments.
3. Application of emotion recognition in depressive disorder
Depressive disorder is characterized by negative mood and anhedonia, the two core symptoms for diagnosis of the disease. Traditionally, clinical diagnosis of depression requires clinicians to assess the severity of depressive symptoms from the verbal statements of patients as well as nonverbal indicators such as voice (pitch, speaking rate, and volume) and facial expressions. Additionally, structured questionnaires (such as the Beck Depression Inventory and the Hamilton Depression Rating Scale) have been developed and validated in clinical populations to assess the severity of depressive symptoms. However, the physiological biomarkers of depression are still unclear. Since the 1950s a consensus has emerged that psychiatric diagnoses could be defined according to relevant biological characteristics. Nevertheless, empirical diagnostic categories such as depressive disorder have not been reified and objectified by valid biological measures. The Research Domain Criteria (RDoC) initiative attempted to link physiologic mechanisms (esp. at the circuit level) to dimensional constructs (e.g., positive/negative valence) rather than diagnostic categories (e.g., MDD), with the potential for alternative diagnostic processes.
3.1 Physiological emotion recognition
Ample evidence shows that specific brain regions, including the prefrontal cortex (PFC), amygdala, anterior cingulate, and insula, play a major role in the neuropathological basis of affective disorders. Recent meta-analyses found evidence against the locationist account of emotion and suggested that brain regions corresponding to basic psychological operations are involved in emotion processing across emotional categories and are not specifically localized to discrete brain networks. With its superior soft tissue contrast, high spatial resolution, and noninvasive detection, magnetic resonance imaging (MRI) has become a promising tool for detecting neurological alterations in mental disorders such as depression.
Using an experimental therapeutics approach coupled with machine learning, Liu et al. investigated the effect of a pharmacological challenge aiming to enhance dopaminergic signaling on the whole-brain response to reward-related stimuli in MDD. Artificial intelligence technology combined with MRI technology was used to find objective biological markers of depression. The brain regions with diagnostic value included the anterior cuneate lobe, cingulate gyrus, inferior marginal angular gyrus, insula, thalamus, and hippocampus. The brain regions with preventive value included the precuneus, postcentral gyrus, dorsolateral prefrontal lobe, orbitofrontal lobe, and middle temporal gyrus. The brain regions predictive of therapeutic response included the precuneus, cingulate gyrus, inferior marginal angular gyrus, middle frontal gyrus, middle occipital gyrus, inferior occipital gyrus, and lingual gyrus.
Machine learning and deep learning techniques have been widely used in the diagnosis, prevention, and treatment of depression and other neuropsychiatric diseases in recent years. Abnormal brain regions may be used as predictors of diagnosis and therapeutic response. Research has mainly focused on cortical areas rather than the midbrain limbic system or the dopamine system. Collectively, the literature suggests that the cingulate gyrus and precuneus may be the most important candidate brain regions among the objective biological markers of depression. Because of the complex pathophysiological changes and etiological heterogeneity of depression, combining imaging biomarkers with other indicators (e.g., biochemical, genetic) is necessary to achieve a more objective assessment of the course and prognosis of depression.
3.2 Textual emotion recognition
With the growing amount of emotional information from social media, including text, photos, and videos, emotion recognition from multimodal information using machine learning techniques is becoming a trend. Absolutist thinking represents a form of cognitive distortion typical of anxiety and depression. Al-Mosaiwi and Johnstone conducted a text analysis of 63 Internet forums (over 6400 members) using the Linguistic Inquiry and Word Count software to examine absolutist thinking. The results suggested that absolutist words, rather than negative emotion words, tracked the severity of affective disorder across forums. They also found elevated levels of absolutist words in depression recovery forums, which suggests that absolutist thinking may be a vulnerability factor for relapse of affective disorder.
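The dictionary-based word counting that underlies this kind of analysis can be sketched in a few lines of Python. The sketch below is a simplified illustration, not the Linguistic Inquiry and Word Count software: the word list is a short hypothetical sample rather than the validated absolutist dictionary used in the study, and the tokenizer is deliberately naive.

```python
import re

# Hypothetical sample list for illustration only; the study used its own
# validated dictionary of absolutist words.
ABSOLUTIST_WORDS = frozenset(
    {"always", "never", "nothing", "everything", "completely", "totally", "entirely"}
)


def absolutist_percentage(text: str) -> float:
    """Return the percentage of tokens in `text` found in ABSOLUTIST_WORDS."""
    # Naive tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in ABSOLUTIST_WORDS)
    return 100.0 * hits / len(tokens)


# Hypothetical forum post: 2 of its 7 tokens are in the sample list.
post = "I always fail and nothing ever works"
share = absolutist_percentage(post)
```

Expressing the count as a percentage of all tokens makes posts of different lengths comparable, which is needed before averaging over the members of a forum.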
The Proactive Suicide Prevention Online (PSPO) project identified suicide-prone individuals in order to provide further crisis management. Users who commented on the microblog of a Sina microblogger who had committed suicide were identified as a high-risk population and were assessed for suicidal thoughts and behaviors. The frequency of death-oriented words significantly decreased after the intervention, while the frequency of future-oriented words significantly increased. This model may help people who have suicidal thoughts and behaviors but low motivation to seek help.
The modeling of textual and visual features from Instagram photos successfully identified individuals diagnosed with depression. The results showed that depressed people are more likely to upload photos that are bluer, grayer, and darker. The human rating of photo attributes (happiness, sadness, interestingness, and likability) is a weak predictor of depression . These findings suggest new avenues for early screening and detection of mental illness.
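Color properties such as "bluer, grayer, and darker" are commonly quantified as hue, saturation, and brightness (value). The following Python sketch shows one way such photo-level features might be computed with the standard library. The function name and the pixel format (a list of 8-bit RGB tuples) are assumptions made for illustration, and the sketch is not the feature pipeline of the cited study.

```python
import colorsys
from typing import Iterable, Tuple


def mean_hsv(pixels: Iterable[Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Average hue, saturation, and value (each in 0..1) over 8-bit RGB pixels.

    Lower saturation corresponds to a grayer image and lower value to a
    darker one. Caveat: hue is circular (red sits at both 0 and 1), so the
    plain arithmetic mean used here is only a rough summary.
    """
    h_sum = s_sum = v_sum = 0.0
    count = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        h_sum += h
        s_sum += s
        v_sum += v
        count += 1
    if count == 0:
        raise ValueError("at least one pixel is required")
    return h_sum / count, s_sum / count, v_sum / count


# Two hypothetical single-color "photos".
blue_photo = mean_hsv([(0, 0, 255)])      # saturated, bright blue
gray_photo = mean_hsv([(64, 64, 64)])     # dark gray: zero saturation, low value
```

The three averages per photo could then serve as input features for a classifier, alongside the textual features mentioned above.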
3.3 Facial expression and speech recognition
Physiological approaches that use specific sensors for emotion signals have the advantage of being more precise, but they are generally more costly and require more effort in a clinical context. Facial and speech information is easier to collect in natural environments. Chronic stress, anxiety, and depressive states are three intertwined processes that constitute the vicious circle in common affective disorders such as depression. Chronic stress may induce autonomic responses concurrent with anxiety states, and anxiety may lead to depressive states when stress continues and coping strategies are ineffective. Gavrilescu and Vizireanu were the first to propose a neural network-based architecture for predicting levels of stress, anxiety, and depression based on the Facial Action Coding System (FACS) in a nonintrusive and real-time manner. Their method allows experts to monitor the three emotional states in real time. Additionally, 93% accuracy was achieved in discriminating between healthy individuals and those with major depressive disorder (MDD) or post-traumatic stress disorder (PTSD). This method is an attractive alternative to traditional self-report measurements based on questionnaires.
A new approach based on deep convolutional neural networks (DCNN) has been proposed to predict depressive symptoms, measured as Beck Depression Inventory II (BDI-II) scores, from video data. The proposed framework is designed to capture both the facial appearance and the dynamics in the video data by integrating two deep networks into one. The method could predict depressive behavior with over 80% accuracy, achieving performance comparable to most methods combining video and audio data. Thus the method provided a more efficient and convenient way of prediction than multimodal methods.
Harati et al. used several metrics of variability to extract unsupervised features from video recordings of patients before and after deep brain stimulation (DBS) treatment for major depressive disorder (MDD). Their goal was to quantify the treatment effects on emotion indicated with facial expression. Their preliminary results indicate that unsupervised features learned from these video recordings using dynamic latent variable model (DLVM) based on multiscale entropy (MSE) of pixel intensities can distinguish different phases of depression and recovery . Therefore, their methods may provide more precise markers of treatment response.
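Multiscale entropy itself is a generic procedure: the signal is averaged over non-overlapping windows of increasing length (coarse-graining), and sample entropy is computed at each scale. The Python sketch below illustrates that generic procedure on a one-dimensional series. It does not reproduce the dynamic latent variable model or the video preprocessing of the cited study; the function names, the default parameters, and the idea of feeding it a per-frame intensity series are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def coarse_grain(series: Sequence[float], scale: int) -> List[float]:
    """Average the series over non-overlapping windows of length `scale`."""
    if scale < 1:
        raise ValueError("scale must be >= 1")
    n_windows = len(series) // scale
    return [sum(series[i * scale:(i + 1) * scale]) / scale for i in range(n_windows)]


def sample_entropy(series: Sequence[float], m: int = 2, r: float = 0.2) -> float:
    """Sample entropy with template length m and absolute tolerance r.

    Counts pairs of templates (self-matches excluded) whose Chebyshev
    distance is <= r, for lengths m and m + 1, and returns -ln(A / B).
    Returns infinity when either count is zero (entropy undefined).
    """
    n = len(series)

    def count_matches(length: int) -> int:
        # Use the first n - m start positions for both template lengths.
        templates = [series[i:i + length] for i in range(n - m)]
        matches = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                distance = max(abs(a - b) for a, b in zip(templates[i], templates[j]))
                if distance <= r:
                    matches += 1
        return matches

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        return float("inf")
    return -math.log(a / b)


def multiscale_entropy(series: Sequence[float], scales: Sequence[int],
                       m: int = 2, r: float = 0.2) -> List[float]:
    """Sample entropy of the coarse-grained series at each requested scale."""
    return [sample_entropy(coarse_grain(series, s), m, r) for s in scales]
```

In practice the tolerance r is often set relative to the standard deviation of the original series and kept fixed across scales; here it is passed as an absolute value to keep the sketch short.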
As a relatively objective and easily available variable, speech has potential value in the diagnosis of depression. Acoustic analysis of patients with mental illness showed a greater than moderate correlation between speech-related variables and symptom indicators [76, 77]. Pan et al. built a speech-based depression recognition model with a logistic regression (LR) classification method. The results show that the recognition accuracy reached 82.9%. They found that four voice features (PC1, PC6, PC17, PC24, P < 0.05, corrected) made a significant contribution to depression and that the contribution effect of the voice features alone reached 35.65%. These results demonstrate that voice features have great potential in applications such as clinical diagnosis and prediction.
3.4 Multimodal emotion recognition
Facial, video, and textual information are the most readily available affective information in a clinical context. Therefore, recent studies have explored multimodal emotion recognition methods that use several kinds of emotion information as predictors to improve accuracy and specificity. Haque et al. presented a machine learning method for measuring the severity of depressive symptoms. Their multimodal method uses 3D facial expressions and spoken language, both commonly available from modern cell phones. It demonstrates an average error of 3.67 points (15.3% relative) on the clinically validated Patient Health Questionnaire (PHQ) scale. For detecting major depressive disorder, their model demonstrates 83.3% sensitivity and 82.6% specificity. Yang et al. proposed new text and video features and hybridized deep and shallow models for depression estimation and classification from audio, video, and text descriptors. They demonstrated that the proposed hybrid framework effectively improves the accuracy of both depression estimation and depression classification. SimSensei Kiosk is a virtual human interviewer that aims to automatically assess the verbal and nonverbal behaviors indicative of depression, anxiety, or post-traumatic stress disorder (PTSD). A multimodal real-time sensing system was used to simultaneously capture different modalities (e.g., smile intensity, 3D head position and orientation, intensity or lack of facial expressions like anger, disgust, and joy) to model the relation between mental states and human behavior.
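The evaluation measures quoted above follow from simple ratios, which the short Python sketch below makes explicit. Two assumptions are made here and are not stated in the chapter: the relative error is taken to be the average error divided by a 24-point scale maximum (as in the 8-item version of the PHQ), and the confusion-matrix counts are hypothetical values chosen only because their ratios round to the reported percentages.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of actual cases that the model detects: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of non-cases that the model correctly rejects: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)


def relative_error(average_error: float, scale_max: float) -> float:
    """Average absolute error expressed as a fraction of the scale maximum."""
    return average_error / scale_max


# Assumed 24-point scale maximum (8-item PHQ); not stated in the chapter.
rel = relative_error(3.67, 24.0)

# Hypothetical counts, not taken from the study.
sens = sensitivity(true_pos=5, false_neg=1)
spec = specificity(true_neg=19, false_pos=4)

print(f"relative error: {100 * rel:.1f}%")
print(f"sensitivity: {100 * sens:.1f}%, specificity: {100 * spec:.1f}%")
```

Reporting sensitivity and specificity separately matters for screening: a model can reach high overall accuracy on an imbalanced sample while still missing many of the depressed participants.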
This chapter summarized the recognition of human affect based on internal and external signals of emotion, which has gained intensive attention in research fields such as artificial intelligence, psychology, cognitive neuroscience, and physiology. The reviewed empirical studies rarely deal with “social emotions” such as guilt, shame, and embarrassment. In contrast to the more traditional cognitive and biological perspectives on emotion, the sociological perspective focuses on the functions of emotions in controlling social interactions and sustaining the social order. Future studies need to extend emotion recognition to social emotions and the relevant theoretical foundations.
Emotion recognition is based on discrete and dimensional views of emotion, with underlying central nervous system (CNS) and peripheral nervous system (PNS) dynamics. Single-modality as well as multimodal emotion recognition relies on facial, speech, posture, physiological, and textual emotional information, which can function separately or concurrently. Integrating multimodal emotion information for emotion recognition remains challenging, and much research is needed on how these modalities relate to human affect.
Furthermore, the application of emotion recognition in depressive disorder may pave the way for more precise diagnosis of the syndrome and prediction of its disease course. Identifying specific physiological substrates of depressive disorder, combined with emotion classification techniques such as machine learning, may help identify the dimensional constructs of RDoC, which are implicit in the clinical phenomena of depressive disorder.
Conflict of interest
The authors declare no conflict of interest.