Quantification of Emotions for Facial Expression: Generation of Emotional Feature Space Using Self-Mapping

The shape (static diversity) and motion (dynamic diversity) of facial components, such as the eyebrows, eyes, nose, and mouth, manifest expression. From the viewpoint of static diversity, owing to the individual variation in facial configurations, it is presumed that a facial expression pattern due to the manifestation of a facial expression includes subject-specific features. In addition, from the viewpoint of dynamic diversity, because the dynamic changes in facial expressions originate from subject-specific facial expression patterns, it is presumed that the displacement vector of facial components has subject-specific features.

On the other hand, although an emotionally generated facial expression pattern of an individual is unique, internal emotions expressed and recognized by humans via facial expressions are considered person-independent and universal. For example, one person may express the common emotion of happiness using various facial expressions, while another person may recognize happiness from these facial expressions. Pantic et al. argued that a natural facial expression always includes various emotions, and that a pure facial expression rarely appears [1]. Furthermore, they suggested that it is not realistic to classify all facial expressions into the six basic emotion categories: anger, sadness, disgust, happiness, surprise, and fear. Instead, they proposed quantitative classification into many more emotion categories.
Pioneering studies on the quantification of emotions recognized from facial expressions have been conducted in the field of psychology. In particular, the mental space model of Russell et al. is well known: each facial expression is arranged in a space centering on "pleasantness" and "arousal," particularly addressing the semantic antithetical nature of emotion [8]. Russell et al. discovered that facial expression stimuli can be conceptualized as a circular arrangement in the mental space described above (the circumplex model). Yamada found a significant correlation between the "slantedness" and "curvedness/openness" of facial components and the "pleasantness" and "arousal" in the mental space [9]. This observation highlights the importance of clarifying the correspondence between changes in facial components accompanying emotional expressions (physical parameters) and recognized emotions (psychological parameters).
We address the following issues related to the recognition of emotions from facial expressions.
First, facial expression patterns are considered as physical parameters. Expressions convey personality, and as physical parameters, facial expression patterns vary among individuals. Hence, the classification of facial expressions is fundamentally a problem with an unknown number of categories. Accordingly, the extraction of subject-specific facial expression categories using a common person-independent technique is an important issue.
Second, emotions are considered as psychological parameters. The facial expression pattern of an individual is unique, but as a psychological parameter, emotion is person-independent and universal. Moreover, the grade of a recognized emotion changes according to the grade of physical change in a facial expression pattern. Therefore, it is important to match the amount of physical change in a subject-specific facial expression pattern with the corresponding amount of mental change in order to estimate the grade of emotion.
Previously, we proposed a method for generating a subject-specific feature space to estimate the grade of emotion, i.e., an emotional feature space that expresses the correspondence between physical and psychological parameters [10,11]. In this chapter, we improve the abovementioned method. In addition, we develop a method for generating a feature space that can express a level of detailed emotion.

Previous studies
A method for generating a subject-specific emotional feature space using self-organizing maps (SOMs) [12] and counter propagation networks (CPNs) [13] has been proposed in previous studies [10,11]. The feature space expresses the correspondence between the changes in facial expression patterns and the degree of emotions in a two-dimensional space centered on "pleasantness" and "arousal." For practical purposes, we created two types of feature spaces, a facial expression map (FEMap) and an emotion map (EMap), by learning facial images using CPNs. When a facial image is fed into the CPN after the learning process, the FEMap can assign the image to a unique emotional category. Furthermore, the EMap can quantize the level of emotion in the image according to the level of change in the facial patterns.
Figures 1 and 2 show the FEMap and EMap, respectively, generated using the proposed method. Figure 3 shows the recognition result for the expressions of "fear" and "surprise". These results indicate that the pleasantness and arousal values gradually change with changes in facial expression patterns. Moreover, the changes in the pleasantness and arousal values of two individuals are similar, even though their facial expression patterns are different.
Developments and Applications of Self-Organizing Maps

Self-Organizing Maps (SOM)
An SOM is a learning algorithm that models the self-organizing and adaptive learning capabilities of the human brain [12]. It comprises two layers: an input layer, to which training data are supplied, and a Kohonen layer, in which self-mapping is performed via competitive learning. The learning procedure of an SOM is described below.
1. Let $w_{i,j}(t)$ be the weight from input layer unit $i$ to Kohonen layer unit $j$ at time $t$. The weights $w_{i,j}$ are initialized using random numbers.
2. Let $x_i(t)$ be the data input to input layer unit $i$ at time $t$; calculate the Euclidean distance $d_j$ between $x_i(t)$ and $w_{i,j}(t)$ using (1).
$$d_j = \sqrt{\sum_{i} \left( x_i(t) - w_{i,j}(t) \right)^2} \tag{1}$$
3. Search for the Kohonen layer unit that minimizes $d_j$; this unit is designated as the winner unit.
4. Update the weights $w_{i,j}(t)$ of the Kohonen layer units contained in the neighborhood region $N_c(t)$ of the winner unit using (2), where $\alpha(t)$ is a learning coefficient.
$$w_{i,j}(t+1) = w_{i,j}(t) + \alpha(t)\left( x_i(t) - w_{i,j}(t) \right) \tag{2}$$
5. Repeat steps 2-4 until the maximum number of learning iterations is reached.
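The SOM learning procedure above can be sketched in Python with NumPy. This is a minimal illustration, not the authors' implementation: the one-dimensional Kohonen layer, the linear decay of the learning coefficient and of the neighborhood radius, the random choice of the input at each step, and all function and parameter names are assumptions made for the example.

```python
import numpy as np

def train_som(data, n_units=5, n_iter=1000, alpha0=0.5, radius0=2.0, seed=0):
    """Train a one-dimensional SOM; returns weights of shape (n_units, n_features)."""
    rng = np.random.default_rng(seed)
    weights = rng.random((n_units, data.shape[1]))   # step 1: random initialization
    positions = np.arange(n_units)
    for t in range(n_iter):
        x = data[rng.integers(len(data))]            # step 2: present one input vector
        d = np.linalg.norm(x - weights, axis=1)      # Euclidean distance to every unit
        winner = int(np.argmin(d))                   # step 3: winner unit
        frac = 1.0 - t / n_iter
        alpha = alpha0 * frac                        # assumed linear decay of alpha(t)
        radius = radius0 * frac                      # assumed shrinking neighborhood N_c(t)
        in_nbhd = np.abs(positions - winner) <= radius
        # step 4: move the winner and its neighbors toward the input
        weights[in_nbhd] += alpha * (x - weights[in_nbhd])
    return weights

def winner_unit(weights, x):
    """Index of the Kohonen unit closest to input vector x."""
    return int(np.argmin(np.linalg.norm(x - weights, axis=1)))
```

After training, `winner_unit(weights, x)` assigns an input vector to one of the Kohonen units, which is how the training data are classified into units in the later sections.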

Counter Propagation Network (CPN)
A CPN is a learning algorithm that combines the Grossberg learning rule with an SOM [13]. It comprises three layers: an input layer to which training data are supplied, a Kohonen layer in which self-mapping is performed via competitive learning, and a Grossberg layer, which labels the Kohonen layer by the counter propagation of teaching signals. A CPN is useful for automatically determining the label of a Kohonen layer when the category to which training data belongs is predetermined. This labeled Kohonen layer is designated as a category map. The learning procedure of a CPN is described below.
1. Let $w^{i}_{n,m}(t)$ and $w^{j}_{n,m}(t)$ be the weights to Kohonen layer unit $(n, m)$ at time $t$ from input layer unit $i$ and from Grossberg layer unit $j$, respectively. The weights $w^{i}_{n,m}$ and $w^{j}_{n,m}$ are initialized using random numbers.
2. Let $x_i(t)$ be the data input to input layer unit $i$ at time $t$, and calculate the Euclidean distance $d_{n,m}$ between $x_i(t)$ and $w^{i}_{n,m}(t)$ using (3).
$$d_{n,m} = \sqrt{\sum_{i} \left( x_i(t) - w^{i}_{n,m}(t) \right)^2} \tag{3}$$
3. Search for the Kohonen layer unit that minimizes $d_{n,m}$; this unit is designated as the winner unit.
4. Update the weights $w^{i}_{n,m}(t)$ and $w^{j}_{n,m}(t)$ of the Kohonen layer units contained in the neighborhood region $N_c(t)$ of the winner unit using (4) and (5), where $\alpha(t)$ and $\beta(t)$ are learning coefficients, and $t_j(t)$ is the teaching signal to Grossberg layer unit $j$.
$$w^{i}_{n,m}(t+1) = w^{i}_{n,m}(t) + \alpha(t)\left( x_i(t) - w^{i}_{n,m}(t) \right) \tag{4}$$
$$w^{j}_{n,m}(t+1) = w^{j}_{n,m}(t) + \beta(t)\left( t_j(t) - w^{j}_{n,m}(t) \right) \tag{5}$$
5. Repeat steps 2-4 until the maximum number of learning iterations is reached.
6. After learning is completed, compare the weights $w^{j}_{n,m}$ observed from each unit of the Kohonen layer, and let the teaching signal of the Grossberg layer with the maximum value be the label of the unit.

The proposed method consists of the following three steps. First, facial expression images are hierarchically classified using SOMs, and subject-specific facial expression categories are extracted. Next, the CPN is used for data expansion of the facial expression patterns on the basis of the similarity and continuity of each facial expression category. The CPN is a supervised learning algorithm that combines Grossberg's learning rule with the SOM. A category map generated by the method described above is defined as a subject-specific FEMap. Then, a subject-specific emotion feature space is generated. The space matches physical and psychological parameters by inputting the coordinate values of the circumplex model proposed by Russell [8] as teaching signals for the CPN. This complex plane is defined as a subject-specific EMap.
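The CPN learning procedure listed in this section (steps 1 to 6) can be sketched as follows. This is a hedged illustration under stated assumptions rather than the authors' code: the square (Chebyshev) neighborhood without wrap-around, the linear decay schedules, the default grid size, and the names `train_cpn`, `w_in`, and `w_out` are all chosen for the example.

```python
import numpy as np

def train_cpn(data, targets, grid=(6, 6), n_iter=2000,
              alpha0=0.5, beta0=0.5, radius0=3.0, seed=0):
    """Train a CPN with a 2-D Kohonen layer; targets are the teaching signals."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    w_in = rng.random((rows, cols, data.shape[1]))      # input -> Kohonen weights
    w_out = rng.random((rows, cols, targets.shape[1]))  # Grossberg -> Kohonen weights
    nn, mm = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    for t in range(n_iter):
        k = rng.integers(len(data))
        x, teach = data[k], targets[k]
        d = np.linalg.norm(w_in - x, axis=2)            # distance of x to every unit
        wn, wm = np.unravel_index(np.argmin(d), d.shape)  # winner unit (n, m)
        frac = 1.0 - t / n_iter                         # assumed linear decay
        nbhd = np.maximum(np.abs(nn - wn), np.abs(mm - wm)) <= radius0 * frac
        w_in[nbhd] += alpha0 * frac * (x - w_in[nbhd])        # input-side update
        w_out[nbhd] += beta0 * frac * (teach - w_out[nbhd])   # Grossberg-side update
    labels = np.argmax(w_out, axis=2)                   # step 6: label each unit
    return w_in, w_out, labels
```

With one-hot teaching signals, `labels` is the category map: each Kohonen unit carries the index of the Grossberg unit with the largest weight.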

Extraction of facial expression category
The proposed method extracts subject-specific facial expression categories hierarchically by using an SOM with a narrow mapping space. An SOM is an unsupervised learning algorithm, and it classifies the given facial expression images in a self-organizing manner, according to their topological characteristics. Hence, it is suitable for classification with an unknown number of categories. Moreover, an SOM compresses the topological information of facial expression images using a narrow mapping space, and it performs classification based on features that roughly divide the training data. We speculate that repeating these steps hierarchically renders the classified amount of change in facial expression patterns comparable; hence, a subject-specific facial expression category can be extracted. Figure 5 shows the extraction of a facial expression category, the details of which are provided below.
1. The expression images described in Section 5.1 were used as training data. The following processing was performed for each facial expression. The training data are assumed to comprise $N$ frames.
2. Learning was conducted using an SOM with a Kohonen layer of five units and an input layer of 40 × 48 units (Fig. 5(a)), where the number of learning sessions was set to 10,000.
3. The weights of the Kohonen layer $W_{i,j}$ ($0 \le W_{i,j} \le 1$) were converted to values of 0-255 at the end of learning, and visualized images were generated (Fig. 5(b)), where $n_1$ to $n_5$ denote the training data classified into each unit.
4. The five visualized images can be considered as representative vectors of the training data classified into each unit ($n_1$ to $n_5$). Therefore, a thresholding process was adopted to judge whether a visualized image was suitable as a representative vector. Specifically, for the upper and lower parts of the face shown in Fig. 5(c), the correlation coefficient between a visualized image and each item of training data classified into that unit was determined for each unit. The standard deviation of these values was computed. When the standard deviation of both regions was 0.005 or less in all five units, the visualized image was considered to represent the training data, and further hierarchization was stopped.
5. The correlation coefficient of the weights $W_{i,j}$ between each pair of adjacent units in the Kohonen layer was computed. The Kohonen layer was divided into two parts between the pair of units with the minimum correlation coefficient (Fig. 5(b)).
6. The training data ($N_1$ and $N_2$) classified into the two sides of the partition were used as new training data, and the processing described above was repeated recursively. Consequently, the hierarchical structure of the SOM was generated (Fig. 5(b) and Fig. 5(d)).
7. The lowest category of the hierarchical structure was defined as a facial expression category (Fig. 5(e)). Five visualized images were defined as representative images of each category at the end of learning. Then, the photographer of the facial expression images visually confirmed each facial expression category and assigned it to an emotion category, i.e., the neutral facial expression or one of the six basic facial expressions.
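Steps 3, 5, and 6 of the extraction procedure can be illustrated with a short NumPy sketch. It assumes that the five Kohonen units are stored as rows of a weight array and that a 40 × 48 input corresponds to 48 rows by 40 columns; the helper names are hypothetical, and the threshold test of step 4 is not included.

```python
import numpy as np

def weights_to_image(w, shape=(48, 40)):
    """Step 3: scale weights in [0, 1] to 8-bit grey levels for visualization."""
    return np.round(np.clip(w, 0.0, 1.0) * 255).astype(np.uint8).reshape(shape)

def split_index(weights):
    """Step 5: return k so that units 0..k and k+1..end form the two partitions."""
    corr = [np.corrcoef(weights[k], weights[k + 1])[0, 1]
            for k in range(len(weights) - 1)]
    return int(np.argmin(corr))   # boundary with the least similar adjacent units

def partition(assignments, k):
    """Step 6: split sample indices by the side of the boundary their unit lies on."""
    assignments = np.asarray(assignments)
    return np.flatnonzero(assignments <= k), np.flatnonzero(assignments > k)
```

Here `assignments` holds the winner unit of every training frame; the two index arrays returned by `partition` would be the new training sets for the next level of the hierarchy.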

Generation of facial expression map
The recognition of a natural facial expression requires the generation of facial expression patterns (mixed facial expressions) that interpolate between the emotion categories. In the proposed method, the representative images obtained in Section 4.1 were used as training data, and data expansion of facial expression patterns between the emotion categories was performed using a CPN with a large mapping space. A CPN is adopted because the teaching signal of the training data is known from the processing described in Section 4.1. The mapping space of the CPN comprises more units than the training data, and it has a torus structure, because a large mapping space is assumed to enable the CPN to perform data expansion based on the similarity and continuity of the training data. Figure 6 shows the CPN architecture for generating an FEMap. The details of the processing are provided below.
1. The CPN structure comprises an input layer of 40 × 48 units and a two- or three-dimensional Kohonen layer. In addition, Grossberg layer 1 of seven units was prepared; teaching signals for the six basic facial expressions and the neutral facial expression were input to it.
2. The representative images obtained in Section 4.1 were used as training data, and learning was carried out for each subject. As the teaching signal to Grossberg layer 1, 1 was input to the unit that represents the emotion category of the representative image, and 0 was input to the other units. The number of learning sessions was set to 20,000.
3. The weights ($W_{g1}$) of Grossberg layer 1 were compared for each unit of the Kohonen layer at the end of learning; the emotion category with the greatest value was used as the label of the unit. A category map generated by the processing described above was defined as a subject-specific FEMap.
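The teaching signal of step 2 and the labeling of step 3 can be sketched as follows. The order of the seven categories in `CATEGORIES` and the function names are assumptions made for this example; the chapter does not specify how the Grossberg units are ordered.

```python
import numpy as np

# Assumed ordering of the seven Grossberg layer 1 units.
CATEGORIES = ["neutral", "anger", "sadness", "disgust",
              "happiness", "surprise", "fear"]

def teaching_signal(category):
    """Step 2: 1 for the unit of the given emotion category, 0 elsewhere."""
    t = np.zeros(len(CATEGORIES))
    t[CATEGORIES.index(category)] = 1.0
    return t

def label_map(w_g1):
    """Step 3: label each Kohonen unit with the category of largest weight.

    w_g1 has shape (..., 7): one weight vector per Kohonen unit.
    """
    return np.array(CATEGORIES)[np.argmax(w_g1, axis=-1)]
```

Because the labeling only uses the last axis, the same function applies to a two-dimensional or a three-dimensional Kohonen layer.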

Generation of emotion map
Although the facial expression patterns of an individual are unique, emotions expressed and recognized from facial expressions by humans are person-independent and universal. Therefore, it is necessary to match the grade of emotion, based on an index common to all subjects, with the grade of change in the facial expression patterns described in Section 4.2. The proposed method uses the circumplex model proposed by Russell as this common index. Specifically, the coordinate values based on the circumplex model are input as teaching signals for the CPN, and the processing described in Section 4.2 is carried out simultaneously. Then, an EMap is generated for matching the grade of change in facial expression patterns with the grade of emotion. Figure 7 shows the procedure for generating the EMap, the details of which are provided below.

1. Grossberg layer 2, consisting of one unit that receives the coordinate values of the circumplex model, was added to the CPN structure (Fig. 7(a)).
2. Each facial expression stimulus was arranged in a circle on a plane centered on "pleasantness" and "arousal" in the circumplex model (Fig. 7(b)). The proposed method expresses this circular space as the complex plane shown in Fig. 7(c), and complex numbers based on the figure were input to Grossberg layer 2 as teaching signals. For example, when the input training data represent the emotion category of happiness, the teaching signal for Grossberg layer 2 is cos(π/4) + i sin(π/4).
3. This processing was repeated up to the maximum number of learning iterations.
4. Each unit of the Kohonen layer was plotted onto the complex plane at the end of learning, according to the values of the real and imaginary parts of the weight ($W_{g2}$) on Grossberg layer 2. This complex plane was defined as a subject-specific EMap.
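The complex-valued teaching signal of step 2 and the plotting of step 4 can be sketched with the Python standard library. Only the angle for happiness (π/4) is taken from the text; the angles of the other categories come from Fig. 7(c) and are not reproduced here. Treating the real part as "pleasantness" and the imaginary part as "arousal" is an assumption about the axis assignment, and the function names are hypothetical.

```python
import cmath
import math

HAPPINESS_ANGLE = math.pi / 4   # the only angle stated in the text

def circumplex_signal(angle):
    """Teaching signal on the unit circle: cos(angle) + i sin(angle)."""
    return cmath.exp(1j * angle)

def grossberg_update(w, teach, beta):
    """One Grossberg-style update of a complex weight toward the teaching signal."""
    return w + beta * (teach - w)

def emap_coordinates(w_g2):
    """Step 4: (real, imaginary) parts of each unit's weight, read here as
    (pleasantness, arousal) under the assumed axis assignment."""
    return [(w.real, w.imag) for w in w_g2]
```

A unit whose weight has been pulled only part of the way toward a teaching signal lies inside the unit circle, so its distance from the origin can be read as a grade of the corresponding emotion.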

Expression images
In general, open facial expression databases are used in conventional studies [14,15]. These databases contain a few images per expression and subject. For this study, we acquired our own facial expression images because the proposed method extracts subject-specific facial expression categories and representative images of each category from large quantities of data.
We discuss a neutral facial expression and six basic facial expressions, namely, anger, sadness, disgust, happiness, surprise, and fear. These expressions are deliberately manifested by a subject. The basic facial expressions were acquired as motion videos, including a process in which the neutral and basic facial expressions were manifested five times; each facial expression was manifested in turn. The motion videos were converted into static images (30 fps, 8-bit gray, 320 × 240 pixels). We processed a region containing facial components. Therefore, a face region image was extracted and normalized according to the following procedure.

1. A face was detected using Haar-like features [16]; a face region image normalized to a size of 80 × 96 pixels was extracted.
2. The image was processed using a median filter for noise removal. Then, smoothing was performed after dimension reduction of the image using coarse-grain processing (40 × 48 pixels).
3. A pseudo-outline that is common to all the subjects was generated; the face region containing facial components was extracted.
4. Histogram linear transformation was performed for brightness value correction.

Figure 8 shows an example of face region images after extraction and normalization. Table 1 lists the number of acquired frames and the number of frames extracted by the SOM as the training data for the CPN.
The data used in the study were acquired in accordance with ethical regulations regarding research on humans at Akita University, Japan.
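Two of the normalization steps above, the dimension reduction of step 2 and the brightness correction of step 4, can be sketched with NumPy. The chapter does not specify the exact operations, so 2 × 2 block averaging for the coarse-grain processing and min-max stretching for the histogram linear transformation are assumptions; face detection, median filtering, smoothing, and the pseudo-outline mask are omitted, and the function names are hypothetical.

```python
import numpy as np

def coarse_grain(img, factor=2):
    """Reduce resolution by averaging non-overlapping factor x factor blocks."""
    h, w = img.shape
    h2, w2 = h // factor, w // factor
    blocks = img[:h2 * factor, :w2 * factor].reshape(h2, factor, w2, factor)
    return blocks.mean(axis=(1, 3))

def stretch_histogram(img):
    """Linearly map the darkest pixel to 0 and the brightest to 255."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:                      # flat image: nothing to stretch
        return np.zeros(img.shape, dtype=np.uint8)
    return np.round((img - lo) / (hi - lo) * 255).astype(np.uint8)

def to_input_vector(face):
    """Turn a 96-row by 80-column face image into a 1920-element vector in [0, 1]."""
    small = coarse_grain(face)        # 96 x 80 -> 48 x 40
    return stretch_histogram(small).ravel() / 255.0
```

The resulting vector has 40 × 48 = 1,920 elements, matching the size of the input layer used by the SOM and the CPN.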

Experiment details
This study examined the training data input method and the number of dimensions of the CPN mapping space. In particular, the following were examined.
i. Method 1: Learning was conducted using a CPN with a two-dimensional Kohonen layer of 30 × 30 units. Moreover, training data were randomly selected and input.
ii. Method 2: The Kohonen layer of the CPN was set to 30 × 30 units, as in Method 1. However, the training data for each emotion category were input in the same ratio.
iii. Method 3: Learning was conducted using a CPN with a three-dimensional Kohonen layer of 10 × 10 × 10 units. The training data input method is the same as that of Method 2.
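The difference between the two input methods can be sketched as follows. The chapter states only that each emotion category was input in the same ratio; cycling through the categories in round-robin order is one assumed way to achieve this, and the function names and the `(image_id, category)` sample format are hypothetical.

```python
import random
from collections import Counter

def random_order(samples, n_steps, rng):
    """Method 1: draw training samples uniformly at random from the pooled data."""
    return [rng.choice(samples) for _ in range(n_steps)]

def balanced_order(samples, n_steps, rng):
    """Methods 2 and 3: present every emotion category equally often."""
    by_category = {}
    for image_id, category in samples:
        by_category.setdefault(category, []).append((image_id, category))
    categories = sorted(by_category)
    # Round-robin over categories, random choice within the current category.
    return [rng.choice(by_category[categories[step % len(categories)]])
            for step in range(n_steps)]

# Toy data: the neutral category dominates the pool.
samples = [(i, "neutral") for i in range(90)] + [(i, "happiness") for i in range(90, 100)]
rng = random.Random(0)
print(Counter(c for _, c in balanced_order(samples, 1000, rng)))
```

With pooled random selection, a category that dominates the training data also dominates the inputs; the balanced order removes this dependence on the number of frames per category.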

Discussion on training data input method
Tables 2 and 3 list the number of Kohonen layer units on the FEMap for Methods 1 and 2, respectively. Figures 9 and 10 show the FEMaps and the EMaps generated using Methods 1 and 2. Table 2 shows that the percentage of the neutral facial expression category is high. Moreover, as shown in Fig. 10(a), although mixed facial expressions between the neutral expression and the six basic expressions are generated, mixtures among the six basic expressions are not. On the other hand, the number of units of each emotion category on the FEMap is roughly constant, as shown in Table 3, and many mixed facial expressions are generated between the expressions on the EMap, as shown in Fig. 10(b). These results suggest that the input ratio of the training data should be constant for every emotion category to effectively generate many mixed facial expressions.

Discussion on number of dimensions of CPN mapping space
The EMap generated by Method 3 is shown in Fig. 11. Figure 12 shows the enlargement of the happiness region in the EMaps generated by Methods 2 and 3. Although the numbers of Kohonen layer units in Methods 2 and 3 are almost equal (900 units and 1,000 units, respectively), the generated EMaps differ significantly. In particular, with Method 3, many mixed facial expressions are generated radially from the coordinates of the teaching signal on the circumference, as shown in Fig. 11 and Fig. 12(b).
The number of neighboring emotion categories on the FEMap increases as a result of an increase in the number of dimensions of the CPN mapping space.
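One way to see why a higher-dimensional mapping space allows more neighboring categories is to count the units adjacent to a given unit on a torus-shaped grid. The sketch below assumes a neighborhood of Chebyshev radius 1, which the chapter does not specify; the function name is hypothetical.

```python
from itertools import product

def torus_neighbors(unit, shape, radius=1):
    """Units within Chebyshev distance `radius` of `unit` on a wrap-around grid."""
    offsets = product(range(-radius, radius + 1), repeat=len(shape))
    neighbors = {tuple((u + o) % s for u, o, s in zip(unit, off, shape))
                 for off in offsets}
    neighbors.discard(tuple(unit))    # a unit is not its own neighbor
    return neighbors

print(len(torus_neighbors((0, 0), (30, 30))))         # 8
print(len(torus_neighbors((0, 0, 0), (10, 10, 10))))  # 26
```

Under this assumption, each unit of the 10 × 10 × 10 layer touches 26 other units, compared with 8 in the 30 × 30 layer, so a region labeled with one category can border more regions of other categories.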

Conclusion
In this chapter, we proposed a method for generating a feature space that expresses the correspondence between the changes in facial expression patterns and the degree of emotions.
In addition, we investigated the training data input method and the number of dimensions of the CPN mapping space. The results show that the input ratio of the training data should be constant for every emotion category and that the number of dimensions of the CPN mapping space should be extended to effectively express a level of detailed emotion. We plan to experimentally evaluate emotion estimation using the generated feature spaces.

Figure 4 shows the procedure for generating the FEMap and EMap.

Figure 4. Flow chart of proposed method.

Figure 5. Extraction procedure of facial expression categories.

Figure 8. Examples of facial expression images.

Figure 9. Generation results of FEMap using Methods 1 and 2.

Figure 10. Generation results of EMap using Methods 1 and 2.

Figure 12. Enlargement of the happiness region in the EMap.

Table 1. Number of acquired frames and training data.