Automatic Speech Emotion Recognition Using Machine Learning

This chapter presents a comparative study of speech emotion recognition (SER) systems. Theoretical definition, categorization of affective state and the modalities of emotion expression are presented. To achieve this study, an SER system, based on different classifiers and different methods for features extraction, is developed. Mel-frequency cepstrum coefficients (MFCC) and modulation spectral (MS) features are extracted from the speech signals and used to train different classifiers. Feature selection (FS) was applied in order to seek for the most relevant feature subset. Several machine learning paradigms were used for the emotion classification task. A recurrent neural network (RNN) classifier is used first to classify seven emotions. Their performances are compared later to multivariate linear regression (MLR) and support vector machines (SVM) techniques, which are widely used in the field of emotion recognition for spoken audio signals. Berlin and Spanish databases are used as the experimental data set. This study shows that for Berlin database all classifiers achieve an accuracy of 83% when a speaker normalization (SN) and a feature selection are applied to the features. For Spanish database, the best accuracy (94 %) is achieved by RNN classifier without SN and with FS.


Introduction
Emotion plays a significant role in daily interpersonal human interactions. This is essential to our rational as well as intelligent decisions. It helps us to match and understand the feelings of others by conveying our feelings and giving feedback to others. Research has revealed the powerful role that emotion play in shaping human social interaction. Emotional displays convey considerable information about the mental state of an individual. This has opened up a new research field called automatic emotion recognition, having basic goals to understand and retrieve desired emotions. In prior studies, several modalities have been explored to recognize the emotional states such as facial expressions [1], speech [2], physiological signals [3], etc. Several inherent advantages make speech signals a good source for affective computing. For example, compared to many other biological signals (e.g., electrocardiogram), speech signals usually can be acquired more readily and economically. This is why the majority of researchers are interested in speech emotion recognition (SER). SER aims to recognize the underlying emotional state of a speaker from her voice. The area has received increasing research interest all through current years. There are many applications of detecting the emotion of the persons like in the interface with robots, audio surveillance, web-based E-learning, commercial applications, clinical studies, entertainment, banking, call centers, cardboard systems, computer games, etc. For classroom orchestration or E-learning, information about the emotional state of students can provide focus on the enhancement of teaching quality. For example, a teacher can use SER to decide what subjects can be taught and must be able to develop strategies for managing emotions within the learning environment. That is why learner's emotional state should be considered in the classroom.
Three key issues need to be addressed for successful SER system, namely, (1) choice of a good emotional speech database, (2) extracting effective features, and (3) designing reliable classifiers using machine learning algorithms. In fact, the emotional feature extraction is a main issue in the SER system. Many researchers [4] have proposed important speech features which contain emotion information, such as energy, pitch, formant frequency, Linear Prediction Cepstrum Coefficients (LPCC), Mel-frequency cepstrum coefficients (MFCC), and modulation spectral features (MSFs) [5]. Thus, most researchers prefer to use combining feature set that is composed of many kinds of features containing more emotional information [6]. However, using a combining feature set may give rise to high dimension and redundancy of speech features; thereby, it makes the learning process complicated for most machine learning algorithms and increases the likelihood of overfitting. Therefore, feature selection is indispensable to reduce the dimensions redundancy of features. A review for feature selection models and techniques is presented in [7]. Both feature extraction and feature selection are capable of improving learning performance, lowering computational complexity, building better generalizable models, and decreasing required storage. The last step of speech emotion recognition is classification. It involves classifying the raw data in the form of utterance or frame of the utterance into a particular class of emotion on the basis of features extracted from the data. In recent years in speech emotion recognition, researchers proposed many classification algorithms, such as Gaussian mixture model (GMM) [8], hidden Markov model (HMM) [9], support vector machine (SVM) [10][11][12][13][14], neural networks (NN) [15], and recurrent neural networks (RNN) [16][17][18]. Some other types of classifiers are also proposed by some researchers such as a modified brain emotional learning model (BEL) [19] in which the adaptive neuro-fuzzy inference system (ANFIS) and multilayer perceptron (MLP) are merged for speech emotion recognition. Another proposed strategy is a multiple kernel Gaussian process (GP) classification [17], in which two similar notions in the learning algorithm are presented by combining the linear kernel and radial basis function (RBF) kernel. The Voiced Segment Selection (VSS) algorithm also proposed in [20] deals with the voiced signal segment as the texture image processing feature which is different from the traditional method. It uses the Log-Gabor filters to extract the voiced and unvoiced features from spectrogram to make the classification.
In previous work [21], we present a system for the recognition of «seven acted emotional states (anger, disgust, fear, joy, sadness, and surprise)». To do that, we extracted the MFCC and MS features and used them to train three different machine learning paradigms (MLR, SVM, and RNN). We demonstrated that the combination of both features has a high accuracy above 94% on the Spanish database. All previously published works generally use the Berlin database. To our knowledge, the Spanish emotional database has never been used before. For this reason, we have chosen to compare them. In this chapter, we concentrate to improve accuracy; more experiments have been performed. This chapter mainly makes the following contributions: • The effect of speaker normalization (SN) is also studied, which removes the mean of features and normalizes them to unit variance. Experiments are performed under a speaker-independent condition.
• Additionally, a feature selection technique is assessed to obtain good features from the set of features extracted in [21].
The rest of the chapter is organized as follows. In the next section, we start by introducing the nature of speech emotions. Section 3 describes features we extracted from a speech signal. A feature selection method and machine learning algorithms used for SER are presented. Section 4 reports on the databases we used and presents the simulation results obtained using different features and different machine learning (ML) paradigms. Section 5 closes this chapter by analyses and conclusion.

Emotion and classification
This section is concerned with defining the term emotion, presenting its different models. Also for recognizing emotions, there are several techniques and inputs that can be used. A brief description of all of the techniques is presented here.

Definition
A definition is both important and difficult because the everyday word "emotion" is a notoriously fluid term in meaning. Emotion is one of the most difficult concepts to define in psychology. In fact, there are different definitions of emotions in the scientific literature. In everyday speech, emotion is any relatively brief conscious experience characterized by intense mental activity and a high degree of pleasure or displeasure [22,23]. Scientific discourse has drifted to other meanings and there is no consensus on a definition. Emotion is often entwined with temperament, mood, personality, motivation, and disposition. In psychology, emotion is frequently defined as a complex state of feeling that results in physical and psychological changes. These changes influence thought and behavior. According to other theories, emotions are not causal forces but simply syndromes of components such as motivation, feeling, behavior, and physiological changes [24]. In 1884, in What is an emotion? [25], American psychologist and philosopher William James proposed a theory of emotion whose influence was considerable. According to his thesis, the feeling of intense emotion corresponds to the perception of specific bodily changes. This approach is found in many current theories: the bodily reaction is the cause and not the consequence of the emotion. The scope of this theory is measured by the many debates it provokes. This illustrates the difficulty of agreeing on a definition of this dynamic and complex phenomenon that we call emotion. "Emotion" refers to a wide range of affective processes such as moods, feelings, affects, and wellbeing [26]. The term "emotion" in [6] has been also referred to an extremely complex state associated with a wide variety of mental, physiological, and physical events.

Categorization of emotions
The categorization of emotions has long been a hot subject of debate in different fields of psychology, affective science, and emotion research. It is mainly based on two popular approaches: categorical (termed discrete) and dimensional (termed continuous). In the first approach, emotions are described with a discrete number of classes. Many theorists have conducted studies to determine which emotions are basic [27]. A most popular example is Ekman [28] who proposed a list of six basic emotions, which are anger, disgust, fear, happiness, sadness, and surprise. He explains that each emotion acts as a discrete category rather than an individual emotional state. In the second approach, emotions are a combination of several psychological dimensions and identified by axes. Other researchers define emotions according to one or more dimensions. Wilhelm Max Wundt proposed in 1897 that emotions can be described by three dimensions: (1) strain versus relaxation, (2) pleasurable versus unpleasurable, and (3) arousing versus subduing [29]. PAD emotional state model is another three-dimensional approach by Albert Mehrabian and James Russell where PAD stands for pleasure, arousal, and dominance. Another popular dimensional model was proposed by James Russell in 1977. Unlike the earlier three-dimensional models, Russell's model features only two dimensions which include (1) arousal (or activation) and (2) valence (or evaluation) [29].
The categorical approach is commonly used in SER [30]. It characterizes emotions used in everyday emotion words such as joy and anger. In this work, a set of six basic emotions (anger, disgust, fear, joy, sadness, and surprise) plus neutral, corresponding to the six emotions of Ekman's model, were used for the recognition of emotion from speech using the categorical approach.

Sensory modalities for emotion expression
There is vigorous debate about what exactly individual can express nonverbally. Humans can express their emotions through many different types of nonverbal communication including facial expressions, quality of speech produced, and physiological signals of the human body. In this section, we discuss each of these categories.

Facial expressions
The human face is extremely expressive, able to express countless emotions without saying a word [31]. And unlike some forms of nonverbal communication, facial expressions are universal. The facial expressions for happiness, sadness, anger, surprise, fear, and disgust are the same across cultures.

Speech
In addition to faces, voices are an important modality for emotional expression. Speech is a relevant communicational channel enriched with emotions: the voice in speech not only conveys a semantic message but also the information about the emotional state of the speaker. Some important voice feature vectors that have been chosen for research such as fundamental frequency, mel-frequency cepstral coefficient (MFCC), prediction cepstral coefficient (LPCC), etc.

Physiological signals
The physiological signals related to autonomic nervous system allow to assess objectively emotions. These include electroencephalogram (EEG), heart rate (HR), electrocardiogram (ECG), respiration (RSP), blood pressure (BP), electromyogram (EMG), skin conductance (SC), blood volume pulse (BVP), and skin temperature (ST) [32]. Using physiological signals to recognize emotions is also helpful to those people who suffer from physical or mental illness thus exhibit problems with facial expressions or tone of voice.
3. Speech emotion recognition (SER) system 3.1 Block diagram Our SER system consists of four main steps. First is the voice sample collection. The second features vector that is formed by extracting the features. As the next step, we tried to determine which features are most relevant to differentiate each emotion. These features are introduced to machine learning classifier for recognition. This process is described in Figure 1.

Feature extraction
The speech signal contains a large number of parameters that reflect the emotional characteristics. One of the sticking points in emotion recognition is what features should be used. In recent research, many common features are extracted, such as energy, pitch, formant, and some spectrum features such as linear prediction coefficients (LPC), mel-frequency cepstrum coefficients (MFCC), and modulation spectral features. In this work, we have selected modulation spectral features and MFCC, to extract the emotional features.
Mel-frequency cepstrum coefficient (MFCC) is the most used representation of the spectral property of voice signals. These are the best for speech recognition as it takes human perception sensitivity with respect to frequencies into consideration. For each frame, the Fourier transform and the energy spectrum were estimated and mapped into the Mel-frequency scale. The discrete cosine transform (DCT) of the Mel log energies was estimated, and the first 12 DCT coefficients provided the MFCC values used in the classification process. Usually, the process of calculating MFCC is shown in Figure 2.
In our research, we extract the first 12 order of the MFCC coefficients where the speech signals are sampled at 16 KHz. For each order coefficients, we calculate the mean, variance, standard deviation, kurtosis, and skewness, and this is for the other all the frames of an utterance. Each MFCC feature vector is 60-dimensional.
Modulation spectral features (MSFs) are extracted from an auditory-inspired long-term spectro-temporal representation. These features are obtained by emulating the spectro-temporal (ST) processing performed in the human auditory system and consider regular acoustic frequency jointly with modulation frequency. The steps for computing the ST representation are illustrated in Figure 3. In order to obtain the ST representation, the speech signal is first decomposed by an auditory filterbank (19 filters in total). The Hilbert envelopes of the critical-band outputs are computed to form the modulation signals. A modulation filterbank is further applied to the Hilbert envelopes to perform frequency analysis. The spectral contents of the modulation signals are referred to as modulation spectra, and the proposed features are thereby named modulation spectral features (MSFs) [5]. Lastly, the ST representation is formed by measuring the energy of the decomposed envelope signals, as a function of regular acoustic frequency and modulation frequency. The energy, taken over all frames in every spectral band, provides a feature. In our experiment, an auditory filterbank with N ¼ 19 filters and a modulation filterbank with M ¼ 5 filters are used. In total, 95 19 Â 5 ð ÞMSFs are calculated in this work from the ST representation.

Feature selection
As reported by Aha and Bankert [34], the objective of feature selection in ML is to "reduce the number of features used to characterize a dataset so as to improve a learning algorithm's performance on a given task." The objective will be the maximization of the classification accuracy in a specific task for a certain learning algorithm; as a collateral effect, the number of features to induce the final classification model will be reduced. Feature selection (FS) aims to choose a subset of the relevant features from the original ones according to certain relevance evaluation criterion, which usually leads to higher recognition accuracy [35]. It can drastically reduce the running time of the learning algorithms. In this section, we present an  effective feature selection method used in our work, named recursive feature elimination with linear regression (LR-RFE).
Recursive feature elimination (RFE) uses a model (e.g., linear regression or SVM) to select either the best-or worst-performing feature and then excludes this feature. These estimators assign weights to features (e.g., the coefficients of a linear model), so the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features, and the predictive power of each feature is measured [36]. Then, the least important features are removed from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached. In this work, we implemented the recursive feature elimination method of feature ranking via the use of basic linear regression (LR-RFE) [37]. Other research also uses RFE with another linear model such as SVM-RFE that is an SVM-based feature selection algorithm created by [38]. Using SVM-RFE, Guyon et al. selected key and important feature sets. In addition to improving the classification accuracy rate, it can reduce classification computational time.

Classification methods
Many machine learning algorithms have been used for discrete emotion classification. The goal of these algorithms is to learn from the training samples and then use this learning to classify new observation. In fact, there is no definitive answer to the choice of the learning algorithm; every technique has its own advantages and limitations. For this reason, here we chose to compare the performance of three different classifiers.
Multivariate linear regression classification (MLR) is a simple and efficient computation of machine learning algorithms, and it can be used for both regression and classification problems. We have slightly modified the LRC algorithm described as follow Algorithm 1 [39]. We calculated (in step 3) the absolute value of the difference between original and predicted response vectors (|y À y i |), instead of the Euclidean distance between them (ky À y i k). Support vector machines (SVM) are an optimal margin classifier in machine learning. It is also used extensively in many studies that related to audio emotion recognition which can be found in [10,13,14]. It can have a very good classification performance compared to other classifiers especially for limited training data [11]. SVM theoretical background can be found in [40]. A MATLAB toolbox implementing SVM is freely available in [41]. A polynomial kernel is investigated in this work.

Algorithm 1. Linear Regression Classification (LRC)
Inputs: Class models X i ∈ R qÂp i , i ¼ 1, 2, …, N and a test speech vector y ∈ R qÂ1 Output: Class of y 3. Distance calculation between original and predicted response variables d i y ð Þ ¼ |y À y i |, i ¼ 1, 2, …, N; 4. Decision is made in favor of the class with the minimum distance d i y ð Þ Recurrent neural networks (RNN) are suitable for learning time series data, and it has shown improved performance for classification task [42]. While RNN models are effective at learning temporal correlations, they suffer from the vanishing gradient problem which increases with the length of the training sequences. To resolve this problem, long short-term memory (LSTM) RNNs were proposed by Hochreiter et al. [43]; it uses memory cells to store information so that it can exploit long-range dependencies in the data [17]. Figure 4 shows a basic concept of RNN implementation. Unlike traditional neural network that uses different parameters at each layer, the RNN shares the same parameters (U, V, and W are presented in Figure 4) across all steps. The hidden state formulas and variables are as follows: where x t , s t , and o t are respectively the input, the hidden state, and the output at time step t and U, V, W are parameters matrices.

Experimental results and analysis 4.1 Emotional speech databases
The performance and robustness of the recognition systems will be easily affected if it is not well trained with a suitable database. Therefore, it is essential to have sufficient and suitable phrases in the database to train the emotion recognition system and subsequently evaluate its performance. There are three main types of databases: acted emotions, natural spontaneous emotions, and elicited emotions [27,44]. In this work, we used an acted emotion databases because they contain strong emotional expressions. The literature on speech emotion recognition [45] shows that the majority of studies have been conducted with emotional acted speech. In this section, we detailed the two emotional speech databases used for classifying discrete emotions in our experiments: Berlin Database and Spanish Database.

Berlin database
The Berlin database [46] is widely used in emotional speech recognition. It contains 535 utterances spoken by 10 actors (5 female, 5 male) in 7 simulated emotions (anger, boredom, disgust, fear, joy, sadness, and neutral). This database was chosen for the following reasons: (i) the quality of its recording is very good, and (ii) it is public [47] and popular database of emotion recognition that is recommended in the literature [19].

Spanish database
The INTER1SP Spanish emotional database contains utterances from two professional actors (one female and one male speaker).The Spanish corpus that we have the right to access (free for academic and research use) [48] was recorded twice in the «six basic emotions plus neutral (anger, sadness, joy, fear, disgust, surprise and neutral/normal)». Four additional neutral variations (soft, loud, slow, and fast) were recorded once. This is preferred to other created database because it is available for researchers use and it contains more data (6041 utterances in total). This paper has focused on only seven main emotions from the Spanish database in order to achieve a higher and more accurate rate of recognition and to make the comparison with the Berlin database detailed above.

Results and analysis
In this section, experimentation results are presented and discussed. We report the recognition accuracy of using MLR, SVM, and RNN classifiers. Experimental evaluation is performed on the Berlin and Spanish databases. All classification results are obtained under tenfold cross-validation. Cross-validation is a common practice used in performance analysis that randomly partitions the data into N complementary subsets, with N À 1 of them used for training in each validation and the remaining one used for testing. The neural network structure used is a simple LSTM. It consists of two consecutive LSTM layers with hyperbolic tangent activation followed by two classification dense layers. Features from data are scaled to À1; 1 ½ before applying classifiers. Scaling features before recognition is important, because when a learning phase is fit on unscaled data, it is possible for large inputs to slow down the learning and convergence and in some cases prevent the used classifier from effectively learning for the classification problem. The effect of speaker normalization (SN) step prior to recognition is investigated, and there are three different SN schemes that are defined in [6]. SN is useful to compensate for the variations due to speaker diversity rather than the change of emotional state. We used in this section the SN scheme that has given the best results in [6]. The features of each speaker are normalized with a mean of 0 and a standard deviation of 1. Tables 1-3 show the recognition rate for each combination of various features and classifiers based on Berlin and Spanish databases. These experiments use feature set without feature selection. As shown in Table 1, SVM classifier yields better results above 81%, with feature combination of MFCC and MS for Berlin database. Our results have improved compared to previous results in [21] because we changed the SVM parameters for each type of features to develop a good model.
From Table 1, it can be concluded that applying SN improves recognition results for Berlin database. But this is not the case for the Spanish database, as demonstrated in Tables 2 and 3. Results are the same with the three different classifiers. This can be explained by the number of speakers in each database. The Berlin database contains 10 different speakers, compared to the Spanish database that contains only two speakers and probably the language impact. As regarding the RNN method, we found that combining both types of features has the worst recognition rate for the Berlin database, as shown in Table 3. That is because the RNN model has too many parameters (155 coefficients in total) and a poor training data. This is the phenomena of overfitting. This is confirmed by the fact that when we reduced the number of features from 155 to 59 features, the results show an increase of above 13%, as shown in Table 4. To investigate whether a smaller feature space leads to better recognition performance, we repeated all evaluations on the development set by applying a recursive feature elimination (LR-RFE) for each modality combination. The stability of RFE depends heavily on the type of model that is used for feature ranking at each iteration. In our case, we tested the RFE based on an SVM and regression models; we found that using linear regression provides more stable results. We observed from the previous results that the combination of the features gives the best results. So we applied LR-RFE feature selection only for this combination to improve accuracy. In this work, a total of 155 features were used;  best features were chosen from feature selection. Fifty-nine features were selected by RFE feature selection method based on LR from the Berlin database and 110 features from the Spanish database. The corresponding results of LR-RFE can be seen in Table 4. For most setting using the Spanish database, LR-RFE does not significantly improve the average accuracy. However, for recognition based on Berlin database using the three classifiers, LR-RFE leads to a remarkable performance gain, as shown in Figure 5. This increases the average of MFCC combined with MS features from 63.67 to 78.11% for RNN classifier. These results are illustrated in Table 4. For the Spanish database, the feature combination of MFCC and MS after applying LR-RFE selection using RNN has the best recognition rate which is above 94.01%.  Table 3.

SN
Recognition results using RNN classifier based on Berlin and Spanish databases.
The confusion matrix for the best recognition of emotions using MFCC and MS features with RNN based on Spanish database is shown in Table 5. The rate column lists per class recognition rates and precision for a class are the number of samples correctly classified divided by the total number of samples classified to the class. It can be seen that Neutral was the emotion that was least difficult to recognize from speech as opposed to Disgust which was the most difficult and it forms the most notable confusion pair with Fear.

Conclusion
In this current study, we presented an automatic speech emotion recognition (SER) system using three machine learning algorithms (MLR, SVM, and RNN) to classify seven emotions. Thus, two types of features (MFCC and MS) were extracted from two different acted databases (Berlin and Spanish databases), and a combination of these features was presented. In fact, we study how classifiers and features impact recognition accuracy of emotions in speech. A subset of highly discriminant features is selected. Feature selection techniques show that more information is not always good in machine learning applications. The machine learning models were trained and evaluated to recognize emotional states from   Table 5.
Confusion matrix for feature combination after LR-RFE selection based on Spanish database. these features. SER reported the best recognition rate of 94% on the Spanish database using RNN classifier without speaker normalization (SN) and with feature selection (FS). For Berlin database, all of the classifiers achieve an accuracy of 83% when a speaker normalization (SN) and a feature selection (FS) are applied to the features. From this result, we can see that RNN often perform better with more data and it suffers from the problem of very long training times. Therefore, we concluded that the SVM and MLR models have a good potential for practical usage for limited data in comparison with RNN . Enhancement of the robustness of emotion recognition system is still possible by combining databases and by fusion of classifiers. The effect of training multiple emotion detectors can be investigated by fusing these into a single detection system. We aim also to use other feature selection methods because the quality of the feature selection affects the emotion recognition rate: a good emotion feature selection method can select features reflecting emotion state quickly. The overall aim of our work is to develop a system that will be used in a pedagogical interaction in classrooms, in order to help the teacher to orchestrate his class. For achieving this goal, we aim to test the system proposed in this work.