Bio-Inspired Architecture for Clustering into Natural and Non-Natural Facial Expressions


Introduction
Over the past years there has been a surge of interest in automated facial expression analysis. However, in several situations the classical approaches cannot analyse the expressions successfully because of variations in the environmental conditions, such as illumination changes.
One way to cope with these variations is to look at existing systems that efficiently locate, extract and analyse face information. The human brain is an example of such a system.
According to several researchers [2,27,36], emotion processing takes place in the amygdala, an area located within the medial temporal lobes of the brain. Experiments carried out by [35] demonstrated that the face recognition process can be achieved with simple facial features (eyebrows and eyes vs. eyebrows only) of threatening and friendly faces. These experiments confirm that emotion processing involves other areas such as the visual cortex, e.g. V1 [32]. Later, [17] demonstrated that the fast, automatic and parallel processing of unattended emotional faces provides important insights into the specific and dissociated neural pathways in emotion and face perception.
Although different areas of the human brain are involved in the analysis of facial expressions, there is no consensus about how this process is carried out.
Clues from the psychological point of view, like those proposed by Ekman and Friesen [9], point out the importance of symmetry in this process: facial asymmetry (with the left part of the face stronger than the right part) is apparent only in deliberate, non-spontaneous (non-natural) expressions. According to this work, symmetrical expressions show a true or natural emotion, whereas asymmetrical expressions show a false or non-natural emotion.
In this research, we take advantage of the psychological point of view to interpret the symmetrical and asymmetrical face measures that we obtain when we model the visual areas of the human brain. Furthermore, we are inspired by the early visual areas for processing the visual stream. The interaction between neurons of the visual cortex then leads us to our proposed architecture, which is based on the detection of asymmetrical facial expressions and their clustering into natural/non-natural expressions.
In this chapter, we present an architecture composed of five stages that clusters the facial expressions in a sequence of images into natural/non-natural expressions.
The first stage processes a sequence of images to detect the eye and mouth corners, inspired by the sensitivity of the cone cells in the eyes. With these corners we build an anthropometrical grid of six vertical bands.
The second stage is a bio-inspired treatment in three steps:
• Contrast and orientation detection, inspired by the simple neurons in V1 (primary visual cortex), which are modelled by Gabor-like filters.
• Integration and smoothing of contours, following the behaviour of the complex neurons in V1.
• Motion detection through a temporal interaction of consecutive complex neurons, inspired by the coherence and integration activity of neurons in MT (middle temporal area).
The third stage computes, for each of the six bands built in the first stage, the ratio between the quantity of active neurons and the maximum quantity of neurons. This process is inspired by the visual stream integration in IT (infero-temporal area).
The fourth stage compares each symmetrical band with an empirical threshold κ obtained in our experiments.
The final stage clusters the expressions into natural/non-natural expressions. This stage feeds the results of the fourth stage to a bio-inspired neural network model.
In the following sections, we will present an overview of the state of the art in research of facial expressions. In addition, we present our proposed architecture to analyse the asymmetrical responses of the faces from a sequence of images. We also mention the experiments, discuss the results and finally comment on the future work.

State of the art
The analysis of facial expressions constitutes a critical and complex part of our non-verbal social interactions. Therefore, over the past years there has been growing interest in creating tools that analyse facial expressions automatically. This analysis is a complex task. On the one hand, the structure of facial expressions differs from one human being to another. On the other hand, faces are analysed from sequences of images captured in different environments, with variations in lighting, pose and scale that make it difficult to extract their structure. Traditionally, the analysis of facial expressions has focused on identifying emotions; however, both deformation and face information can originate from factors other than emotions (like verbal communication or fatigue). Emotions can be evaluated from two main perspectives: the psychological, physiological and neurophysiological points of view on the one side, and the computational point of view on the other.

Psychological and physiological perspective
From the psychological and physiological perspective, emotions can be analysed by considering the symmetry of the expressions. According to Ekman and Friesen [9], the veracity (naturalness) of an emotion can be evaluated through the analysis of symmetry. In the brain, the information about emotions is processed in the amygdala [2,27,36], an area located within the medial temporal lobes. Using fMRI, some researchers [1,13,32] have discovered that the processing in this area is modulated by the significance of faces, particularly by fearful expressions but also by other cues such as gaze direction. Visual search paradigms have provided evidence of the enhanced capture of attention by threatening faces. Experiments carried out by [35] demonstrated that the recognition process can be achieved with simple facial features (expressions of eyebrows and eyes together vs. only eyebrows) of threatening and friendly faces. These experiments confirm that emotion processing involves other areas such as the visual cortex, e.g. V1 [32].

Neurophysiological perspective
According to neurophysiological studies [15,16], the task of facial expression recognition also occurs in some regions of the visual cortex (ventral and dorsal pathways). The neurons in these areas respond mainly to gesticulations and identity. In the case of gesticulations, the neurons in the superior temporal sulcus process the dynamics of the face. In the case of identity, the neurons in the inferior temporal gyrus analyse invariant facial features. In both cases the processing is modulated by attentional mechanisms [37]. Later, [17] demonstrated that the fast, automatic and parallel processing of unattended emotional faces provides important insights into the specific and dissociated neural pathways in emotion and face perception.
Figure 1 shows the main pathways that process the visual stream in the human cortex. The dorsal pathways analyse the motion and localisation of information, while the ventral pathways analyse information about form and colour. Both pathways have the same source, V1. The visual cortex is the last specialised layer of evolution in mammals; the limbic system (not shown in the figure), where the amygdala is situated in the temporal lobes, is the first and most primitive layer in mammals and is covered by the visual cortex.

Computational perspective
From the computational point of view, facial expression recognition implies the classification of facial motion and facial structure deformation into abstract representations based entirely on visual information. This analysis requires two main tasks: face detection and facial feature extraction. For the first one, techniques have been developed to extract and normalize the face under variations in pose [10,28] and illumination [3,12]. However, according to [11] the main effort has been focused on facial feature extraction, where techniques like appearance-based models [6,18,19] work with significant variations in the acquired face images without normalization.
Basically, facial feature extraction has been approached by two main streams [11,23]: facial deformation extraction models and facial motion extraction models. Motion extraction approaches focus on the facial changes produced by facial expressions, whereas deformation-based methods contrast the visual information against face models to extract the features produced by expressions and not by age-related deformations like wrinkles. The main difference between these two methods is that deformation-based methods can be applied to single images as well as to image sequences, while motion methods need a sequence of images. In both cases the facial features may be processed holistically (the face is analysed as a whole) or locally (focusing on features from specific areas). In the case of deformation methods, the extraction of features relies on shape and texture changes over a period of time; these features are extracted either holistically, from the whole face [7,25], or locally, from partial information about regions of the face like the mouth or eyes [14,21,31]. For the motion extraction methods, the features are extracted by analysing motion vectors obtained holistically [8,29] or locally [26,34,39].

Proposed bio-inspired architecture
A general diagram of the proposed methodology is shown in figure 2. First, a sequence of images or a video is captured in RGB format. Next, a processing based on the green and red colours is carried out to detect the eye and mouth corners. In this stage, we generate a 6 × 5 grid based on the coordinates of the eye and mouth corners. In this chapter, we use only the six vertical bands in our experiments; the other combinations of the anthropomorphical grid are left for future analysis.
Our bio-inspired treatment obtains the simple and complex neuron responses following the function of the basic neurons of the primary visual cortex (V1). Next, a temporal integration of the complex neuron responses simulates the neurons in the middle temporal cortex (MT).
Then, we extract the active neurons of each vertical band and compute the ratio of the quantity of active neurons. Next, each vertical band is compared with its symmetrical band.
Finally, we detect the asymmetries for each pair of vertical bands and cluster them into three classes using self-organizing maps (SOM).
In the next subsections we briefly explain the steps of our approach, assuming that the face has been located correctly.

Description of input image sequences
In our proposed approach, the images of each sequence are in RGB colour, because we use colour during the pre-processing to detect the eye and mouth corners. The process is described below. We suggest capturing at 25 frames per second with 1024 × 768 pixels; a high sampling rate and a high resolution are desirable.

Pre-processing
In this stage we carry out three steps: (1) we detect the eyes and mouth, (2) we build the grid and (3) we keep only six regions, the six vertical bands (6-RVB), for the analysis of asymmetry (with the left side of the face stronger than the right one). These steps are briefly explained in the following paragraphs.
In the first step, let I = {I_1, I_2, I_3, ..., I_t} be a face image sequence in which the eye and mouth corners are correctly detected using the red and green colour bands, following the sensitivity of the cone cells in our eyes. We work on the rectangle that contains the face; next, we define the point corresponding to the centre of the face (x_c, y_c) and two parameters (width and height) for the size of the face. We use these anthropometric measures to define three regions of interest that probably contain the mouth and the eyes (left and right). For each region we find the coordinates of the detected corners of both eyes and the mouth; these coordinates are used in the next step.
In the second step, we use the coordinates of the eye and mouth corners to build an anthropomorphical grid on the face. This processing yields twenty-five small regions; we also split the central part into two regions, as can be seen in figure 2 (the anthropometrical segmentation step, drawn with black dotted lines), obtaining thirty small regions.
Finally, in this stage we use only the six symmetrical columns (R_1, R_2, R_3, R_4, R_5, R_6) of the proposed grid to analyse the expressions in the face.
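The construction of the six columns can be sketched as follows. This is a minimal illustration under stated assumptions, not the chapter's implementation: it assumes the columns are delimited by the two vertical edges of the face rectangle, the four eye-corner x-coordinates and the midline between the two inner eye corners (the split of the central part). The function names and the example coordinates are hypothetical.

```python
def vertical_band_bounds(face_left, face_right, eye_corner_xs):
    """Return the seven x-coordinates that delimit the six bands R1..R6.

    Assumption (not stated in the chapter): the bands are delimited by the
    face rectangle edges, the four eye-corner x-coordinates and the midline
    between the two inner eye corners.
    """
    if len(eye_corner_xs) != 4:
        raise ValueError("expected four eye-corner x-coordinates")
    outer_l, inner_l, inner_r, outer_r = sorted(eye_corner_xs)
    if not (face_left <= outer_l and outer_r <= face_right):
        raise ValueError("eye corners must lie inside the face rectangle")
    midline = (inner_l + inner_r) / 2  # splits the central column in two
    return [face_left, outer_l, inner_l, midline, inner_r, outer_r, face_right]


def symmetric_pairs():
    """Zero-based band indices of the pairs R1-R6, R2-R5 and R3-R4."""
    return [(0, 5), (1, 4), (2, 3)]


# Hypothetical coordinates, in pixels, inside a face rectangle 120 pixels wide.
bounds = vertical_band_bounds(0, 120, [100, 20, 70, 50])
```

Each consecutive pair of values in `bounds` delimits one band, so the seven values give the six columns, and `symmetric_pairs()` lists which columns mirror each other.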

Bio-inspired treatment
Facial expressions carry much more information than traditional facial expression analysis can extract. We therefore propose, from the bio-inspired point of view, a methodology to analyse the symmetry of facial expressions. The objective is to achieve both an interpretation of the relevance of facial motion that agrees with brain mechanisms and a methodology that takes advantage of the tolerance of these mechanisms to illumination changes. The treatment can be divided into three main processing steps:
• Inspired by the simple neurons of V1 (primary visual cortex), which we model with Gabor-like filters, we extract the active simple neurons in eight different orientations and two different phases.
• Two simple neurons with different phases are merged to generate a complex neuron.
• A temporal integration between two consecutive complex neurons models the MT (middle temporal area) neurons. The neurons whose response exceeds a threshold are considered active neurons.
Figure 2. General architecture of the proposed bio-inspired model. After video acquisition, a pre-processing generates an anthropometrical grid. Next, a treatment inspired by V1 determines the active neurons. Then, we compare the symmetrical regions (ears, eyes and nose) to extract the vector composed of the ratios of the quantity of active neurons. Finally, the detected asymmetries are evaluated with SOM to cluster into natural or non-natural facial expressions.
In the first step, following the processing done by the visual cortex, the first neurons that receive the stimuli from the eyes are the simple cells. Physiological evidence [5,38] affirms that neuronal populations in the primary visual cortex (V1) of mammals exhibit contrast normalization. Neurons that respond strongly to simple visual stimuli, such as sinusoidal gratings, respond less well to the same stimuli when they are presented as part of a more complex stimulus that also excites other, neighbouring neurons. These neurons show a preference for specific orientations, which can be modelled computationally by oriented Gabor-like filters [4,22]. The simple-cell responses S_θ(x, y, t) are the result of the convolution between a pool of Gabor functions G_θ(x̂, ŷ), in our case with 8 orientations and 2 phases, and image t of the image sequence; (x̂, ŷ) are the rotated coordinates and (x, y) is a position in the image. In the Gabor function G_θ(x̂, ŷ), x̂ = x cos θ + y sin θ and ŷ = −x sin θ + y cos θ, λ is the wavelength, θ represents the orientation, ϕ the phase, σ the standard deviation and γ the aspect ratio.
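For reference, the convolution described in this step and a Gabor function with these five parameters are commonly written as follows. The second expression is the standard textbook form and is given here as an assumption; the chapter's own normalisation may differ.

```latex
S_{\theta}(x, y, t) = G_{\theta}(\hat{x}, \hat{y}) * I_{t}(x, y)

G_{\theta}(\hat{x}, \hat{y}) =
  \exp\!\left(-\frac{\hat{x}^{2} + \gamma^{2}\hat{y}^{2}}{2\sigma^{2}}\right)
  \cos\!\left(2\pi\frac{\hat{x}}{\lambda} + \phi\right),
\qquad
\hat{x} = x\cos\theta + y\sin\theta,\quad
\hat{y} = -x\sin\theta + y\cos\theta
```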
In the second step, the responses of the two different phases of the simple cells S_θ(x, y, t) are integrated in the complex cells by a non-linear model that merges the responses. This yields the complex-cell responses C_θ(x, y, t), computed from the simple-cell responses at the phases π/2 and −π/2, the symmetric and anti-symmetric phases, respectively. Then, the complex-cell responses C_θ(x, y, t) of the eight orientations are integrated to obtain a single matrix, the map of active complex neurons C(x, y, t) for image t.
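The first two steps can be sketched in plain Python as below. Several choices are assumptions made only for illustration: the standard Gabor form, the classical energy model (square root of the sum of the squared responses) as the non-linear merge, a quadrature pair of phases (0 and −π/2) rather than the two phases named in the text, a sum over orientations as the integration rule, and the kernel size and default parameter values. All function names are hypothetical.

```python
import math


def gabor_kernel(size, lam, theta, phi, sigma, gamma):
    """Sample a Gabor function on a size x size grid (size must be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotated coordinates, as defined in the text.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / lam + phi))
        kernel.append(row)
    return kernel


def convolve(image, kernel):
    """2D convolution with zero padding; image and kernel are lists of rows."""
    h, w = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for j, krow in enumerate(kernel):
                for i, k in enumerate(krow):
                    yy, xx = y - (j - half), x - (i - half)
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += k * image[yy][xx]
            out[y][x] = acc
    return out


def complex_cells(image, theta, size=9, lam=4.0, sigma=2.0, gamma=0.5):
    """Merge an even and an odd simple-cell response (assumed energy model)."""
    even = convolve(image, gabor_kernel(size, lam, theta, 0.0, sigma, gamma))
    odd = convolve(image, gabor_kernel(size, lam, theta, -math.pi / 2, sigma, gamma))
    return [[math.hypot(e, o) for e, o in zip(row_e, row_o)]
            for row_e, row_o in zip(even, odd)]


def complex_map(image, n_orientations=8, **params):
    """Sum the complex responses over the orientations (assumed integration rule)."""
    h, w = len(image), len(image[0])
    total = [[0.0] * w for _ in range(h)]
    for k in range(n_orientations):
        resp = complex_cells(image, k * math.pi / n_orientations, **params)
        for y in range(h):
            for x in range(w):
                total[y][x] += resp[y][x]
    return total
```

With a quadrature pair the merged response depends little on the exact position of a contour inside the receptive field, which is why this pairing is the usual choice in energy models.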
Finally, in the third step, the neurons in the human brain are connected to MT neurons, which allows a temporal processing defined as the temporal integration between C(x, y, t) and C(x, y, t − 1). This integration yields the map of active complex neurons D(x, y, t) for image t, where C(x, y, t) and C(x, y, t − 1) are the complex-neuron maps of the current image and of the previous image, respectively. We use D(x, y, t) to compute the number of active neurons.
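A minimal sketch of this temporal step is given below. It assumes that the integration is the absolute difference between the two consecutive maps and that a neuron is active when this difference exceeds a threshold; neither the operator nor the threshold value is given in the text, so both, like the function name, are assumptions.

```python
def active_map(c_curr, c_prev, threshold):
    """Binary map D(x, y, t): 1 where |C(t) - C(t-1)| exceeds the threshold.

    Assumption: absolute temporal difference followed by thresholding.
    """
    return [[1 if abs(a - b) > threshold else 0 for a, b in zip(row_c, row_p)]
            for row_c, row_p in zip(c_curr, c_prev)]


# Hypothetical 2 x 3 complex-neuron maps for two consecutive images.
previous = [[0.0, 1.0, 2.0], [3.0, 3.0, 3.0]]
current = [[0.0, 1.5, 0.5], [3.0, 5.0, 3.1]]
d_map = active_map(current, previous, threshold=1.0)
```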

Extraction of active neurons
In this stage, we extract the active complex neurons from the map of active complex neurons D(x, y, t). For this, we overlay the six vertical bands (6-RVB) on the map of active neurons. We then compute the quantity of active neurons (QAN) for each region (R_1, R_2, R_3, R_4, R_5, R_6) and find the region with the maximum QAN. The QAN is the number of active complex neurons in a region of the image.
For each of the six regions we thus obtain two quantities: the QAN obtained and the QAN expected. The latter is the total possible QAN in the region, while the former counts only the neurons that are actually active, so that their ratio is always between 0 and 1. This information is used to compare two symmetric regions and to detect temporal asymmetries.
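The ratio can be illustrated with the short sketch below. It assumes that the active map is binary, that each band spans the full height of the map, and that the QAN expected is the number of positions in the band; the function name and the toy data are hypothetical.

```python
def qan_ratios(active, bounds):
    """Ratio QAN obtained / QAN expected for each vertical band.

    active: binary map as a list of rows (0 or 1 per position).
    bounds: integer x-limits of the bands, one more value than bands.
    """
    height = len(active)
    ratios = []
    for x0, x1 in zip(bounds, bounds[1:]):
        expected = height * (x1 - x0)                      # QAN expected
        obtained = sum(sum(row[x0:x1]) for row in active)  # QAN obtained
        ratios.append(obtained / expected if expected else 0.0)
    return ratios


# Toy map: 2 rows, 6 columns, one column per band.
toy_map = [[1, 0, 1, 1, 0, 0],
           [1, 0, 0, 1, 0, 1]]
ratios = qan_ratios(toy_map, [0, 1, 2, 3, 4, 5, 6])
```

Because the obtained count can never exceed the number of positions in the band, every ratio lies between 0 and 1, as stated above.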

Asymmetry detection
In the next stage, we detect the asymmetries for each region (6-RVB) of the map of active complex neurons. To detect asymmetries in each image during two seconds using the 6-RVB, we follow these steps:
• First, we compute the difference between the symmetric regions, for example the difference between regions R_1 and R_6, which are the extreme regions.
• Then, we compute the average and the standard deviation of each difference of regions.
• Next, we calculate a threshold (a minimum and a maximum) using the average and the standard deviation.
• Then, we detect the asymmetries for each difference of regions in the image sequence using the following rule: for each difference of regions, we verify whether the difference is outside the threshold (i.e., lower than the minimum or greater than the maximum). We count each such outlier in the sequence, Out.
• Finally, we compute Out for each pair of regions in the image sequence.
The asymmetries detected (Out) in the facial expression during the image sequence are weighted for each pair of symmetrical regions (R_1 with R_6, R_2 with R_5 and R_3 with R_4). For each database, we propose an empirical threshold κ that we apply to all the image sequences of that database. This κ considers various aspects such as the sampling frequency (we used two seconds in our experiments), the personal conditions, head motion and luminance-capture conditions.
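The counting rule in the steps above can be sketched as follows. The text does not say how the minimum and maximum are derived from the average and the standard deviation, so this sketch assumes the band average ± k standard deviations with k = 1; that choice, the use of the population standard deviation and the function name are all assumptions.

```python
from statistics import mean, pstdev


def count_asymmetries(ratios_a, ratios_b, k=1.0):
    """Count the frames whose difference between two symmetric bands is an outlier.

    ratios_a, ratios_b: per-frame QAN ratios of two symmetric bands.
    Assumption: tolerated interval = average +/- k * standard deviation.
    """
    diffs = [a - b for a, b in zip(ratios_a, ratios_b)]
    avg, std = mean(diffs), pstdev(diffs)
    low, high = avg - k * std, avg + k * std
    return sum(1 for d in diffs if d < low or d > high)


# Hypothetical ratios for ten frames: one frame departs from the rest.
left_band = [0.2] * 9 + [0.7]
right_band = [0.2] * 10
out = count_asymmetries(left_band, right_band)
```

Running the function once per symmetric pair gives the three Out values that are then weighted.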

Clustering in natural/non-natural expressions
For each database we obtain a κ adapted to the number of images in each sequence. We then use equation 5 to cluster the facial expressions into natural/non-natural expressions. In this equation, C_A/C_E is the ratio of the quantity of active neurons (QAN) in a region, T is the total number of images in a typical sequence of the database used to compute κ, F is the capture frequency of the sequences in this database and κ is the index. With the resulting C_κ values we cluster into natural and non-natural expressions.
These values are sent to a 1D SOM that runs for 5000 epochs, with a learning rate η decreasing from 1.000 to 0.0001, and the results were analysed statistically.
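A one-dimensional SOM on scalar inputs can be sketched as below. Only the number of epochs and the two learning-rate limits come from the text; the linear decay of η, the Gaussian neighbourhood with a shrinking radius, the initialisation of the nodes between the minimum and maximum input, the number of nodes and all names are assumptions made for illustration.

```python
import math
import random


def train_som_1d(values, n_nodes=2, epochs=5000,
                 eta_start=1.0, eta_end=0.0001, seed=0):
    """Train a 1D SOM on scalar inputs and return the node weights."""
    if n_nodes < 2:
        raise ValueError("need at least two nodes")
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    # Assumed initialisation: nodes evenly spread over the input range.
    weights = [lo + (hi - lo) * i / (n_nodes - 1) for i in range(n_nodes)]
    samples = list(values)
    for epoch in range(epochs):
        frac = epoch / max(1, epochs - 1)
        eta = eta_start + (eta_end - eta_start) * frac    # assumed linear decay
        radius = max(1e-3, (n_nodes / 2) * (1.0 - frac))  # shrinking neighbourhood
        rng.shuffle(samples)
        for v in samples:
            winner = min(range(n_nodes), key=lambda i: abs(weights[i] - v))
            for i in range(n_nodes):
                h = math.exp(-((i - winner) ** 2) / (2 * radius ** 2))
                weights[i] += eta * h * (v - weights[i])
    return weights


def assign(values, weights):
    """Index of the closest node for each input value."""
    return [min(range(len(weights)), key=lambda i: abs(weights[i] - v))
            for v in values]


# Hypothetical scores forming two well-separated groups.
scores = [1.0, 2.0, 3.0, 20.0, 21.0, 22.0]
nodes = train_som_1d(scores, epochs=300)
labels = assign(scores, nodes)
```

Each node ends up near the centre of one group of scores, and `assign` gives the cluster of each sequence.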

Experimental results
We take the six columns of the proposed face grid. The symmetrical regions are: R1 with R6 (ears), R2 with R5 (eyes) and R3 with R4 (nose).
The tests of our approach used only the colour image sequences of the CK+ database, while for FG-Net and LTI-HIT we used all the image sequences. The last database was split per question in each interview (sequence of images). Figure 3 shows an image sequence of the CK+ database for a non-natural expression. There are 11 asymmetries in both the ears and nose regions and 9 asymmetries in the eyes region, so the weighted sum is 10 = 11 * 0.3 + 11 * 0.2 + 9 * 0.5. Since this value is higher than the fixed threshold κ = 7, the sequence is classified as non-natural. We ran all the experiments on two seconds of video, independently of the sampling frequency (see table 1). We fix κ according to table 1 to test all our available databases. In our experiments this threshold tolerates the asymmetries generated by illumination, environment and personal conditions. We tested the Cohn-Kanade (CK+) (only colour images) [20], FG-Net [33] and LTI-HIT databases, as shown in table 2. We confirm that FG-Net is the most natural database. Our classification error was 12% on average.
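The decision in this worked example amounts to a weighted sum compared with κ, which can be written as the short sketch below. The counts, weights and κ are the ones quoted in the text for the CK+ sequence; the function names are hypothetical.

```python
def weighted_asymmetry(out_counts, weights):
    """Weighted sum of the asymmetry counts of the three pairs of regions."""
    if len(out_counts) != len(weights):
        raise ValueError("one weight per pair of regions is required")
    return sum(count * weight for count, weight in zip(out_counts, weights))


def is_non_natural(out_counts, weights, kappa):
    """A sequence is taken as non-natural when the weighted sum exceeds kappa."""
    return weighted_asymmetry(out_counts, weights) > kappa


# CK+ example from the text: 11 * 0.3 + 11 * 0.2 + 9 * 0.5, with kappa = 7.
ck_score = weighted_asymmetry([11, 11, 9], [0.3, 0.2, 0.5])
ck_non_natural = is_non_natural([11, 11, 9], [0.3, 0.2, 0.5], kappa=7)
```

The sum is 3.3 + 2.2 + 4.5 = 10, which is above κ = 7, so the sequence falls in the non-natural class.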

Description of databases
We used three databases to test our proposed approach: FG-Net, CK+, and LTI-HIT.

Cohn-Kanade (CK+) database
The CK database (2000, presented at an IEEE conference held in Grenoble, France) and its extension CK+ (2010; the same CK database plus 107 new sequences, of which 33 are in colour) comprise 123 persons with 7 different expressions per person: anger, contempt, disgust, fear, happiness, sadness and surprise. The image sequences vary in duration (from 10 to 60 frames) and run from the onset (a neutral frame) to the peak formation of the facial expression. This database was captured at 30 frames per second with a resolution of 640 × 480 pixels. The final frame of each image sequence was coded using FACS (Facial Action Coding System), which describes a person's expression in terms of action units (AUs) [20]. Participants were instructed by an experimenter to perform a series of 23 facial displays, including single action units and specified emotion-like expressions of joy, surprise, anger, disgust, fear and sadness. Each display began and ended in a neutral face.

FG-Net database
The FG-Net database covers emotions including surprise, disgust, sadness and happiness, plus a neutral expression for each person. It consists of 19 persons with 21 sequences per person, so it contains 399 different sequences. The image sequences were captured at 25 frames per second with a resolution of 640 × 480 pixels [33]. This database aims to show people reacting as naturally as possible. Consequently, real emotions were elicited by playing video clips or still images after a short introductory phase, instead of telling the person to play a role. This is in contrast to the asymmetry of luminance of the face, which includes not only asymmetry introduced by lighting, but also asymmetry introduced by the face itself.

LTI-HIT database
Finally, in 2010, the Children's Hospital of Tamaulipas created the LTI-HIT database. This database is taken from interviews with 52 persons (11 men and 41 women), of whom 5 were placed in one type of environment and 47 in the other. Each interview has between 27 and 33 different questions. The participants were between 18 and 66 years old. Each image sequence was captured at 30 frames per second with a resolution of 720 × 480 pixels and a duration of 1.6 to 2.5 minutes per interview. The interviews were conducted in natural conditions.

Remarks
The three databases differ greatly in their acquisition, illumination and capture conditions. To weight these databases, we propose the κ index according to table 1.
In natural conditions, facial changes on the left side are only about 2% greater than overall right-side changes [30]. We establish a level of 1 to 5 for the personal conditions (hair-style, ...).
Table 2. Natural/non-natural percentages according to the κ index. The quantity of active neurons is processed in two ways: statistical analysis of active neurons (DA) and self-organizing maps (SOM). The percentages correspond to the results obtained, with a 12% ± 8% error with respect to the responses of two experts.
Figure 4 shows another test example, for an image sequence of the FG-Net database. The vertical dotted lines mark the asymmetries (only those of the left side are shown). The horizontal red lines are the ratios of quantity, while the horizontal green lines are the differences between symmetrical regions. Finally, the horizontal blue lines are the tolerable thresholds obtained from this image sequence. In this case, the boy is suspected of showing deceitful clues, because all the graphs show high values (22, 33 and 28, one for each pair of vertical bands) and the weighted sum, 29.3 = 22 * 0.2 + 33 * 0.5 + 28 * 0.3, is higher than the index κ = 18.0.
Table 2 summarises the ratios obtained for natural/non-natural facial expressions with the statistical analysis and with SOM. The SOM percentages show FG-Net with a greater share of natural facial expressions than the other two databases, while CK+ has the greatest percentage of non-natural facial expressions, followed by LTI-HIT.
For the statistical analysis we obtain the opposite results. This difference is due to the concentration and the dispersion of the values. With a simple statistical analysis, all responses are separated based on the average and the standard deviation, while the SOM places its centroids guided by density and dispersion.

Conclusion
We have shown a bio-inspired approach to processing facial gestures for clustering into natural and non-natural facial expressions. It takes advantage of the early primary visual areas of the human brain for feature extraction. Furthermore, it simulates the integration and discrimination of the higher visual areas by considering the symmetrical and asymmetrical measures within the respective time span. This process ends in a bio-inspired neural network for clustering into natural and non-natural facial expressions.
The asymmetry between symmetrical regions of the face, analysed during two seconds, allows us to classify a facial expression as natural or non-natural. The quantity of asymmetries and its relations with the different symmetrical regions is applied as a new biometric measure [24], where precise measurement is necessary. We apply it instead to the classification of natural or non-natural expressions, where such precision is not necessary.
In our proposed bio-inspired architecture, after face detection, an anthropomorphical grid is built. For each region, we obtain the ratio of the quantity of active neurons (QAN) from the simple and complex neurons modelled by our Gabor-like filters.
The proposed model can work with any database; here the FG-Net and Cohn-Kanade databases were tested. More than 63% of the CK+ video sequences contain asymmetries. This database was built with simulated (non-natural) expressions, and we confirm that our results are modelled correctly with regard to the capture, illumination and personal conditions.
The different conditions of each database do not allow a simple characterization. We proposed an experimental index, κ, to model these conditions (e.g. personal conditions, head motion and luminance-capture). In the future, we will determine this index automatically for each database and, more precisely, for each image sequence in a database.
Using the index κ we have shown that our approach is independent of the database. This independence was tested in our experiments.
The preliminary steps of our approach show the feasibility of detecting suspected persons in nervous or altered situations. This is the beginning of our methodology to detect deceitful clues in facial expressions.

Author details
Claudio