Overview of methods used in facial expression recognition
In the Human Computer Interaction (HCI) community, the ultimate goal is to create a natural, harmonious means of bi-directional communication between machine and human. As is well known, many machine intelligence applications are based on machine vision technology, for instance industrial inspection, face recognition, medical image computer aided diagnosis (MICAD) and fingerprint recognition. In terms of artificial intelligence, such applications build on digital image processing and video frame analysis to extract feature information and recognize objects of interest. Moreover, a high-level machine learning system should be capable of identifying human emotional states and interacting accordingly. Multimodal human emotion recognition, involving facial expression and motion recognition, could be applied to intelligent video surveillance systems to provide an early warning mechanism when potentially unsafe actions occur. The growing number and variety of crimes in modern society degrade our living environment and call for intelligent, automatic security precautions that offer people more convenient and relaxed living conditions. Meanwhile, automatic intelligent video-based surveillance systems have received a lot of interest in the computer vision and human computer interaction communities in recent years. CMU’s Video Surveillance and Monitoring (VSAM) project and MIT AI Lab’s Forest of Sensors project are examples of recent research efforts in this field. In real applications, a security guard has to watch many screens in a control center. The video displayed on the screens is captured by cameras distributed in various security-sensitive areas such as elevators, airports, railway stations and other public places.
However, because of the inherent limits of human information processing, it is impossible to pick up all of the useful information from the monitors promptly and simultaneously. That means some potentially dangerous information could be missed. We expect the human computer interaction system to take the information-processing burden off the human. An ideal and effective intelligent surveillance system should work automatically, with minimal or no human intervention. In addition, we believe that an intelligent surveillance system should be capable of predicting and preventing crimes through biometric recognition, rather than identifying the suspect after an attack has happened. Figure 1 shows pictures of a subject in an anxious or stressed emotional state during an interview with a human resources manager, although no crime is involved.
A number of universities, research groups and commercial companies are conducting meaningful projects and have made significant progress in visual surveillance monitoring systems. Safehouse Technology Pty Ltd in Australia developed the Clarity Visual Surveillance System (CVSS), which provides security professionals with the next generation in visual surveillance management. Although state-of-the-art computer vision technology and artificial intelligence algorithms were used, the system has not yet integrated affective computing methods. Picard defined “Affective Com
2. Previous work
Psychological research findings suggest that humans make judgments based on the combined visual modalities of face and body more effectively than on any other channel. Since Paul Ekman and Friesen divided human emotion into six primary categories (happiness, sadness, disgust, anger, surprise and fear), there has been a great amount of research on facial expression recognition. Ekman revealed that facial expression shows the inner emotional state and the specific feeling at that moment, and that it can hardly be controlled deliberately. In automatic facial expression recognition, feature extraction and classifier design are the two major procedures. As shown in Table 1, the most frequently used feature extraction method is the Gabor wavelet feature. A subset of these filters is chosen using AdaBoost and passed to the SVM for training as a reduced feature representation. A wide range of surveillance systems have been developed for different applications. G. Lu, X. Li and H. Li proposed a surveillance system that helps nurses distinguish neonatal pain expressions from non-pain expressions automatically, using the Gabor wavelet transform and a support vector machine. They reported an 85.29% recognition rate for pain versus non-pain. Safehouse Technology Pty Ltd in Australia developed the Clarity Visual Surveillance System. It employs state-of-the-art computer vision algorithms and artificial intelligence, but it does not yet involve affective computing technology or emotion recognition. Bouchrika introduced gait analysis for visual surveillance, considering that identifying individuals who are suspected of committing crimes is more important than predicting when a crime is about to happen. P. Rani, N. Sarkar and J. Adams built an anxiety-recognition system capable of interpreting the information contained in physiological signals to predict the probable anxiety state.
H. Kage et al. introduced pattern recognition technologies for video surveillance and physical security. Their system detects the occurrence of a relevant action via image motion analysis. Optical flow is used to analyze motion and AdaBoost is used for face detection. H. Gunes and M. Piccardi presented an approach to automatic visual recognition of expressive face and upper-body gestures from video sequences, which is suitable for analyzing human communicative behavior. They first defined rules for facial expression recognition and for body motion recognition so that the right decision can be made. A min-max analysis method is used to detect the eyebrows, eyes, mouth and chin by evaluating the topographic gray-level relief. Body features are extracted using traditional image processing methods such as background subtraction and dilation.
In order to improve the accuracy of classifiers and provide a fast, pertinent response, many researchers have proposed optimized approaches. Typically, a Genetic Algorithm or Particle Swarm Optimization is used for feature selection and parameter optimization [19,20,21,22]. In our work, the key problems are facial expression, head motion and hand gesture analysis. We also pay particular attention to run-time cost, which affects recognition efficiency.
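The Gabor wavelet features mentioned above can be illustrated with a small filter bank. This is only a minimal sketch under assumed settings: the function names, kernel size, wavelengths, number of orientations and envelope width are illustrative choices, not values taken from the cited systems.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5, psi=0.0):
    """Real part of a 2-D Gabor filter sampled on a size x size grid (size odd)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate the coordinate frame by the filter orientation theta.
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    # Gaussian envelope multiplied by a cosine carrier.
    envelope = np.exp(-(x_r ** 2 + (gamma * y_r) ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * x_r / wavelength + psi)
    return envelope * carrier

def gabor_bank(size=15, wavelengths=(4.0, 8.0), n_orientations=4, sigma=3.0):
    """Bank of filters over the assumed wavelengths and evenly spaced orientations."""
    thetas = [k * np.pi / n_orientations for k in range(n_orientations)]
    return [gabor_kernel(size, lam, th, sigma) for lam in wavelengths for th in thetas]

def gabor_features(patch, bank):
    """One feature per filter: magnitude of the filter response at the patch centre."""
    return np.array([abs(np.sum(patch * k)) for k in bank])
```

A feature vector built this way grows with the number of filters and image locations, which is why a selection step such as AdaBoost is typically applied before training the classifier.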
|Feature Extraction Methods||Classifier design||Optimized Feature|
|Gabor wavelet [2,4,5,6,10,16], Optical Flow, PCA, FLD, Eigenfaces||Hidden Markov Model, Support Vector Machine [2,6,7,8], Artificial Neural Network [4,5]|| |
3.1. System framework
A visual automated surveillance system consists of three main phases: face and body motion detection, facial feature and motion feature extraction, and classification and recognition, as shown in Figure 2. By extracting sensitive facial expressions, the subject’s intention and prospective behaviour are analyzed so that any suspicious expression or activity can be reported to the control center. The system would be able to reduce potential crimes by recognizing the emotional state of suspicious individuals before a security threat materializes.
The front end inputs involve three different parts: (1) static facial image database input; (2) static face image database input; and (3) real face image detection captured by cameras using image sequence analysis. All three channels have to pass through the facial expression feature extraction procedure. AdaBoost, Optical Flow and Gabor features can be used to extract facial expression features. In the middle of the system framework, both automatic facial expression recognition and human motion recognition are implemented through classifier learning and testing, which involve PSO-SVM feature extraction and parameter optimization. In our system, we use a cascaded SVM architecture combined with the particle swarm optimization algorithm to train and test the SVM, obtaining an optimized classifier that runs faster under computationally demanding conditions.
The bi-modality feature fusion classification for affective state recognition is composed of two uni-modality signals, as shown in Fig 3. One is facial expression feature classification and fusion; the other is motion feature classification and fusion. For a computer to understand human expression, the essential step is to capture the head’s rigid motion and the face’s non-rigid motion, through which the machine can track and analyze changes in facial expression. The head’s rigid motion can be described by six parameters: three rotation parameters R(rx, ry, rz), describing rotation of the head about the x, y and z axes, and three translation parameters, describing movement up and down, left and right, and forward and backward. Non-rigid motion of the face includes movement of the mouth, eyes and eyebrows, and wrinkling of the forehead.
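The six-parameter rigid head motion described above can be sketched as a rotation followed by a translation applied to 3-D head points. The function names, the radian units and the rotation order (x, then y, then z) are assumptions made for illustration; the source does not specify them.

```python
import numpy as np

def rotation_matrix(rx, ry, rz):
    """Rotation about the x, y and z axes (radians), composed as Rz @ Ry @ Rx."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return Rz @ Ry @ Rx

def apply_rigid_motion(points, rotation, translation):
    """Apply R(rx, ry, rz) and a translation (tx, ty, tz) to an (N, 3) array of points."""
    R = rotation_matrix(*rotation)
    return np.asarray(points, dtype=float) @ R.T + np.asarray(translation, dtype=float)
```

Because a rigid motion preserves distances between points, any change in distances between tracked facial landmarks after removing this motion can be attributed to the non-rigid part, that is, to the expression itself.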
3.2. Cascaded SVM architecture
The support vector machine (SVM) was developed by Vapnik from the theory of Structural Risk Minimization. However, the classification performance of SVMs as implemented in practice is often far from what theory predicts. In order to improve the classification performance of real SVMs, some researchers have attempted to employ ensemble methods, such as conventional Bagging and AdaBoost. However, as reported in [30,31], the AdaBoost algorithm cannot always be expected to improve the performance of SVMs, and in some cases it even worsens it. This is because an SVM is essentially a stable and strong classifier.
Considering the problem of classifying a set of training vectors belonging to two separate classes,
SVM can be trained by solving the following optimization problem:
where $\xi_i$ is the i-th slack variable and $C$ is the regularization parameter.
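For reference, the standard soft-margin formulation that matches this description can be written as follows. The notation is assumed rather than taken from the source: a training set $(x_i, y_i)$, $i = 1, \dots, l$, with labels $y_i \in \{-1, +1\}$, weight vector $w$ and bias $b$.

```latex
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{l} \xi_i
\qquad \text{subject to} \qquad
y_i \,( w \cdot x_i + b ) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, l
```

A larger $C$ penalizes margin violations more heavily, while a smaller $C$ favors a wider margin at the cost of more training errors.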
Nonlinear SVMs are known to achieve excellent classification accuracy in a wide range of tasks. A nonlinear SVM uses a set of support vectors to define the boundary between classes, and this boundary depends on a kernel function. However, as a classifier, an SVM is usually slower at run time than a neural network. The reason is that the run-time complexity is proportional to the number of support vectors, i.e. to the number of training examples that the SVM algorithm uses in the expansion of the decision function.
The optimization problem of the nonlinear SVM with a kernel function is denoted by
If $\alpha^{*}$ is a solution, then
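The dependence of run-time cost on the number of support vectors can be made concrete with a sketch of the kernel decision function. The names `rbf_kernel` and `svm_decision` and the RBF kernel choice are illustrative assumptions; `dual_coefs` stands for the products of the multipliers and labels that a trained SVM would provide.

```python
import numpy as np

def rbf_kernel(a, b, sigma):
    """Gaussian RBF kernel K(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, dual_coefs, bias, sigma):
    """Decision value f(x) = sum_i coef_i * K(s_i, x) + bias.

    The loop performs exactly one kernel evaluation per support vector,
    so the cost of classifying one sample grows linearly with their number.
    """
    total = bias
    for s, coef in zip(support_vectors, dual_coefs):
        total += coef * rbf_kernel(s, x, sigma)
    return total
```

Halving the number of support vectors therefore roughly halves the classification time, which is the motivation for placing cheap linear stages in front of the nonlinear SVM.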
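For reference, the standard kernel (dual) form of the SVM problem and the resulting classifier are given below. This is the textbook formulation in assumed notation, with multipliers $\alpha_i$, kernel $K$ and bias $b^{*}$; it is not a transcription of the original equations.

```latex
\max_{\alpha}\quad \sum_{i=1}^{l} \alpha_i
 - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)
\qquad \text{subject to} \qquad
0 \le \alpha_i \le C, \quad \sum_{i=1}^{l} \alpha_i y_i = 0

f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{l} \alpha_i^{*} \, y_i \, K(x_i, x) + b^{*} \right)
```

Only training points with $\alpha_i^{*} > 0$, the support vectors, contribute to the sum in $f(x)$.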
According to Yang and Honavar, the choice of features used to represent the patterns presented to a classifier has a great impact on several properties of pattern recognition, including the accuracy of the learned classification algorithm, the time needed to learn a classification function, the number of examples needed for learning, and the cost associated with the features. In addition to feature selection, C. Huang and C. Wang suggested that proper parameter settings can also improve SVM classification accuracy. The parameters include the penalty parameter C and the kernel function parameter for RBF, which should be optimized before training. D. Iakovidis et al. proposed a novel intelligent system for the classification of multiclass gene expression data. It is based on a cascading support vector machine scheme and uses Welch’s t-test to detect differentially expressed genes. In their system, a 5-block cascading SVM architecture was used for the 6-class classification problem, as shown in Figure 4. The classification performance was evaluated using a Leave-One-Out (LOO) cross validation approach. LOO is commonly used when the available dataset is small, because it provides an almost unbiased estimate of the generalization ability of a classifier.
In order to select the optimal feature subset and SVM-RBF parameters, Hsu and Lin proposed a grid algorithm to find the best C and sigma for the RBF kernel. However, their method is computationally expensive and does not perform well. The genetic algorithm is an alternative tool, which has the potential to generate both the optimal feature subset and the SVM parameters at the same time. Huang and Wang conducted experiments on the UCI database using a GA-based approach. Their results show better accuracy with fewer features than the grid algorithm. Compared to the Genetic Algorithm, Particle Swarm Optimization has no evolution operators such as crossover and mutation, and it has few parameters to adjust. It works well in a wide variety of applications with slight variations.
In this paper, we propose a cascaded SVM structure, as shown in Fig 6, to speed up the body gesture classifier relative to conventional SVM-based methods without reducing the detection rate too much. The hierarchical architecture of the detector also reduces the complexity of training the nonlinear SVM classifier.
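The Leave-One-Out procedure can be sketched in a few lines. To keep the example self-contained, a nearest-centroid rule stands in for the SVM; that substitution and the function names are assumptions made for illustration only.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, x):
    """Stand-in classifier: predict the label of the closest class centroid."""
    labels = np.unique(y_train)
    centroids = [X_train[y_train == c].mean(axis=0) for c in labels]
    dists = [np.linalg.norm(x - c) for c in centroids]
    return labels[int(np.argmin(dists))]

def leave_one_out_accuracy(X, y, predict=nearest_centroid_predict):
    """Hold out each sample once, train on the rest, and average the hits."""
    n = len(X)
    correct = 0
    for i in range(n):
        mask = np.arange(n) != i          # every sample except the i-th
        if predict(X[mask], y[mask], X[i]) == y[i]:
            correct += 1
    return correct / n
```

LOO trains the classifier once per sample, so it is affordable mainly for small datasets, which is the situation described above.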
3.3. Standard particle swarm optimization algorithm
Particle Swarm Optimization (PSO) was first introduced by James Kennedy and Russell Eberhart. It is a population-based evolutionary computation search technique. In PSO, each potential solution is assigned a randomized velocity, and the potential solutions, called particles, fly through the problem space by following the current optimum particle. Each particle keeps track of the coordinates in hyperspace that are associated with the best solution (fitness) it has achieved so far. This value is called pbest. Another “best” value is also tracked. The “global” version of PSO keeps track of the overall best value and its location, which is called gbest. Each particle is treated as a point in a D-dimensional space. The original PSO algorithm is described below:
The ith particle’s location vector is represented as $X_i = (x_{i1}, \dots, x_{iD})$; the velocity is denoted by $V_i = (v_{i1}, \dots, v_{iD})$. $P_i = (p_{i1}, \dots, p_{iD})$ and $P_g = (p_{g1}, \dots, p_{gD})$ are pbest and gbest respectively. $r_1$ and $r_2$ are two random numbers in the range [0,1].
Equation (8) describes the flying trajectory of a population of particles, that is, how the velocity and the location of a particle are dynamically updated. Equation (8.1) consists of three parts. The first part is the momentum part: the velocity is updated starting from its current value. The second part is the cognitive part, which represents the particle’s capability to learn from its own experience. The third part is the social part, which represents the collaboration among all particles.
In order to improve the convergence performance of the PSO algorithm, Shi and Eberhart proposed a modified particle swarm optimizer. An inertia weight w is introduced into the original PSO algorithm. This w balances the global search and the local search. It can be a positive constant or a positive linear or nonlinear function of time.
Simulations have been conducted on this modified PSO to illustrate the impact of the new parameter. It was concluded that PSO with an inertia weight in the range [0.9,1.2] performs better on average. A time-decreasing inertia weight can also bring a significant improvement in PSO performance, as shown in Equation (10):
In this paper, PSO is used to search for the optimal solution of an RBFN-SVM by minimizing the cost function Φ, which serves as the fitness function. Each particle is encoded as a real-valued string representing the kernel centers (z) and widths (σ) as well as the linear model coefficients w and b. As the particles move through the solution space, the optimal solution with the minimum value of the cost function Φ is obtained. Jointly optimizing the kernel centers and widths together with the weights of the SVM model relating the feature variables keeps the model from getting trapped in a local optimum and improves model performance. Two key factors determine the optimized hyperparameters when using PSO. One is how to represent the hyperparameters as the particle’s position, namely how to encode them. The other is how to define the fitness function that evaluates the goodness of a particle.
In this section, we describe the proposed SVM system for the classification task. The aim of this system is to optimize the accuracy of the SVM classifier by detecting the subset of the most discriminative features and solving the SVM model selection problem.
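For reference, the original PSO update rule of Kennedy and Eberhart is usually written as below, for particle $i$ and dimension $d$. The symbols $x_{id}$, $v_{id}$, $p_{id}$, $p_{gd}$, the acceleration constants $c_1$, $c_2$ and the random numbers $r_1$, $r_2$ are assumed standard notation, and the tag (8.2) on the position update is inferred from the numbering used in the text.

```latex
v_{id} = v_{id} + c_1 \, r_1 \, (p_{id} - x_{id}) + c_2 \, r_2 \, (p_{gd} - x_{id}) \tag{8.1}

x_{id} = x_{id} + v_{id} \tag{8.2}
```

The three terms on the right-hand side of (8.1) are the momentum, cognitive and social parts, in that order.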
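A PSO loop with a time-decreasing inertia weight can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear schedule from `w_start` to `w_end`, the acceleration constants `c1` and `c2`, the swarm size and the clipping of positions to the search bounds are all illustrative choices.

```python
import numpy as np

def pso_minimize(objective, bounds, n_particles=20, n_iters=100,
                 c1=1.5, c2=1.5, w_start=0.9, w_end=0.4, seed=0):
    """Global-best PSO with a linearly decreasing inertia weight (assumed schedule)."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    dim = lo.size
    x = rng.uniform(lo, hi, size=(n_particles, dim))   # positions
    v = np.zeros((n_particles, dim))                   # velocities
    pbest = x.copy()
    pbest_val = np.array([objective(p) for p in x])
    g = int(np.argmin(pbest_val))
    gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    for t in range(n_iters):
        # Inertia weight decreases linearly from w_start to w_end.
        w = w_start - (w_start - w_end) * t / max(n_iters - 1, 1)
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Momentum part + cognitive part + social part.
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([objective(p) for p in x])
        improved = vals < pbest_val
        pbest[improved] = x[improved]
        pbest_val[improved] = vals[improved]
        g = int(np.argmin(pbest_val))
        if pbest_val[g] < gbest_val:
            gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    return gbest, gbest_val
```

A large weight early on lets particles explore widely, while a small weight late in the run damps the velocities so the swarm settles around the best position found.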
3.3.1. PSO setup
The position of each particle in the swarm is regarded as a vector that encodes two things:
Feature subset selection: f is a candidate subset of features, and F is the feature set consisting of the d available input features.
KBNF-based SVM parameter optimization, which covers $C$ and $\sigma$.
The position vector of each particle can also be written in the following form:
A fitness function is evaluated for the ith particle. As suggested by Melgani, the choice of the fitness function is important, because PSO relies on it to evaluate each candidate solution when designing the SVM classifier. They explored a simple support vector (SV) count as the fitness criterion in the PSO optimization framework.
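One possible way to decode such a particle is sketched below. Everything specific here is an assumption made for illustration: positions are taken to lie in [0, 1], a feature is selected when its coordinate exceeds 0.5, the last two coordinates map log-uniformly to C and sigma within hypothetical ranges, and `train_and_count_svs` is a hypothetical user-supplied callback that trains an SVM and returns its support vector count.

```python
import numpy as np

def decode_particle(position, n_features,
                    c_range=(0.01, 1000.0), sigma_range=(0.01, 10.0)):
    """Split a position in [0, 1]^(d+2) into (feature mask, C, sigma)."""
    position = np.asarray(position, dtype=float)
    mask = position[:n_features] > 0.5                 # selected feature subset
    u_c, u_s = position[n_features], position[n_features + 1]
    # Log-uniform mapping of the last two coordinates (assumed ranges).
    c = c_range[0] * (c_range[1] / c_range[0]) ** u_c
    sigma = sigma_range[0] * (sigma_range[1] / sigma_range[0]) ** u_s
    return mask, c, sigma

def particle_fitness(position, n_features, train_and_count_svs):
    """Fitness to minimize: the SV count reported by the hypothetical callback."""
    mask, c, sigma = decode_particle(position, n_features)
    if not mask.any():
        return float("inf")                            # an empty subset is invalid
    return float(train_and_count_svs(mask, c, sigma))
```

Using the SV count as fitness favors classifiers with fewer support vectors, which also tends to reduce classification time.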
3.3.2. SVM training and classification with PSO
The pseudo code of the proposed method for SVM-PSO classification is given below.
Many researchers have proposed cascaded SVM architectures to address the computational complexity problem. There are mainly two kinds of cascaded SVM architecture: parallel SVMs and serial SVMs. The parallel cascaded structure relies more heavily on the hardware architecture of the computer. It decomposes the original complex problem into a number of independent, simplified smaller problems, and the partial results are combined into a final output at the end. Compared to parallel SVMs, the serial cascaded structure is more feasible and easier to implement. Y. Ma and X. Ding proposed a cost-sensitive SVM classifier for face detection. In many classification cases, the cost of a False Negative is far higher than that of a False Positive. In their cost-sensitive SVMs, different costs are assigned to the two types of misclassification when training the cascaded SVMs at the different stages of the face detector. The optimization goal is given by
where $C_f$ is the cost for face samples and $C_{nf}$ is the cost for non-face samples; usually $C_f > C_{nf}$.
We propose here a novel approach, named the PSO-Cascaded SVMs classification method, which combines cascaded SVMs with the particle swarm optimization algorithm. As shown in Figure 6, several linear SVMs are cascaded to form a serial structure at the front end, with a nonlinear SVM at the end of the system.
We consider two possible strategies for introducing particle swarm optimization into the cascaded architecture of serial SVMs: one is to integrate PSO into the serial linear SVMs, and the other is to integrate PSO into the final nonlinear SVM only. Since the serial SVM structure at the front end is easier to construct and implement than the nonlinear SVM, we adopt the latter strategy. The nonlinear SVM classification is more complicated because it has to deal with feature subset selection and kernel function parameter optimization.
In our experiment, only lip, eye and forehead changes are detected and analyzed. There are 4 states for the lip, 3 states for the eye and 3 states for the forehead. We use a 10-dimensional feature vector to express the 4*3*3 (36) possible combinations of states, which means that only 36 facial expressions are considered in our experiment, as shown in Table 3.
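For reference, a cost-sensitive soft-margin objective of this kind is commonly written as below. The notation is assumed: face samples carry the label $y_i = +1$, non-face samples $y_j = -1$, and $C_f$ and $C_{nf}$ are the two misclassification costs.

```latex
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\lVert w \rVert^{2}
 + C_f \sum_{i:\, y_i = +1} \xi_i
 + C_{nf} \sum_{j:\, y_j = -1} \xi_j
\qquad \text{subject to} \qquad
y_k \,( w \cdot x_k + b ) \ge 1 - \xi_k, \quad \xi_k \ge 0
```

Setting $C_f > C_{nf}$ makes a missed face more expensive than a false alarm, so the early stages of the cascade rarely reject true faces.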
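The serial structure described above, with linear stages in front and a nonlinear SVM at the end, can be sketched as an early-rejection loop. The function name, the (w, b) representation of each linear stage and the sign convention (negative score means rejection) are assumptions for illustration.

```python
import numpy as np

def cascade_predict(x, linear_stages, final_classifier):
    """Classify x with a serial cascade.

    linear_stages: list of (w, b) pairs; a sample is rejected (label -1)
    as soon as one stage gives w.x + b < 0.
    final_classifier: callable returning the nonlinear SVM decision value;
    it is only evaluated for samples that survive every linear stage.
    """
    for w, b in linear_stages:
        if np.dot(w, x) + b < 0.0:
            return -1                      # early rejection, no kernel evaluations
    return 1 if final_classifier(x) >= 0.0 else -1
```

Because most negative samples are discarded by the cheap linear stages, the expensive kernel expansion of the final SVM runs on only a fraction of the inputs.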
|Facial Regions||State Code||State Description (Positive Class)||State Description (Negative Class)|
|lip||I.4||Bend up at the corner of the mouth||Bend up at the corner of the mouth|
|eye||II.1||Eyelid Open||Eyelid Mouth|
|eye||II.2||Eyeball turn left||Eyeball turn left|
|eye||II.3||Eyeball oversee||Eyeball look up|
|forehead||III.1||Eyebrow raiser||Eyebrow lower|
|forehead||III.2||Eyebrow tighter||Eyebrow stretcher|
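One way a 10-dimensional vector can express the 36 state combinations is a one-hot code per facial region (4 + 3 + 3 = 10). The source does not state how the vector is built, so this encoding and the names below are assumptions.

```python
from itertools import product

LIP_STATES, EYE_STATES, FOREHEAD_STATES = 4, 3, 3

def encode_state(lip, eye, forehead):
    """One-hot encode a (lip, eye, forehead) state triple into 10 dimensions."""
    vec = [0] * (LIP_STATES + EYE_STATES + FOREHEAD_STATES)
    vec[lip] = 1                                   # positions 0..3: lip state
    vec[LIP_STATES + eye] = 1                      # positions 4..6: eye state
    vec[LIP_STATES + EYE_STATES + forehead] = 1    # positions 7..9: forehead state
    return vec

# Enumerate every combination of region states.
all_codes = [encode_state(*s)
             for s in product(range(LIP_STATES), range(EYE_STATES), range(FOREHEAD_STATES))]
```

Under this encoding every combination maps to a distinct vector with exactly three active entries, one per region.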
The overall accuracy of facial expression recognition with our proposed method reaches 83.5% under uniform illumination. For feature selection and parameter optimization, the PSO-SVM classifier outperformed the PCA-SVM and PCA-RBF methods in learning and testing: its overall accuracy reached 91.46%, compared with 85.54% for PCA-SVM and 83.27% for PCA-RBF.
4. Conclusion and future work
In this paper we have proposed a fusion method for facial expression and gesture recognition to build a surveillance system. Unlike the traditional six basic emotions on which many researchers have worked, we are concerned only with the emotion of anxiety. Because facial expression information alone is not enough for a computer to recognize this state, we fuse in body gesture recognition to obtain a combined approach. Among the many kinds of classifiers, the SVM shows good performance for classification with small samples. At the same time, the SVM is computationally expensive, so we have to apply optimizations such as feature subset selection before training and classification to improve its performance. Especially for the nonlinear SVM, the selection of the kernel function and the optimization of its parameters are important. The cascaded SVM structure shows superior performance in pattern classification.
Furthermore, as a simple stochastic global optimization technique inspired by the social behavior of bird flocking, PSO can be introduced into the cascaded SVMs to select the feature subset and optimize the kernel function parameters. The PSO-SVM strategies proposed in this paper are intended to enhance the learning and classification capability of the SVM.
This work is sponsored by the Science and Technology Foundations of Chongqing Municipal Education Commission under Grant No. KJ091216, by the Excellent Science and Technology Program for Overseas Studying Talents of Chongqing Municipal Human Resources and Social Security Bureau under Grant No. 09958023, and by the Key Project of Science and Research Foundation of Chongqing University of Arts and Sciences under Grant No. Z2009JS07.