Face detection rates for the Motor Trend Magazine's Best Driver Car of the Year.
Observation analysis of vehicle operators has the potential to address the growing trend of motor vehicle accidents. Methods are needed to automatically detect heavy cognitive load and distraction to warn drivers in poor psychophysiological state. Existing methods to monitor a driver have included prediction from steering behavior, smart phone warning systems, gaze detection, and electroencephalogram. We build upon these approaches by detecting cues that indicate inattention and stress from video. The system is tested and developed on data from Motor Trend Magazine's Best Driver Car of the Year 2014 and 2015. It was found that face detection and facial feature encoding posed the most difficult challenges to automatic facial emotion recognition in practice. The chapter focuses on two important parts of the facial emotion recognition pipeline: (1) face detection and (2) facial appearance features. We propose a face detector that unifies state‐of‐the‐art approaches and provides quality control for face detection results, called reference‐based face detection. We also propose a novel method for facial feature extraction that compactly encodes the spatiotemporal behavior of the face and removes background texture, called local anisotropic‐inhibited binary patterns in three orthogonal planes. Real‐world results show promise for the automatic observation of driver inattention and stress.
- facial emotion recognition
- local appearance features
- face detection
In this chapter, we focus on the development of a system to track cognitive distraction and stress from facial expressions. The ultimate goal of our work is to create an early warning system to alert a driver when he/she is stressed or inattentive. This advanced facial emotion recognition technology has the potential to evolve into a human automotive interface that grants nonverbal understanding to smart cars. Motor Trend Magazine's The Enthusiast Network has collected data of a driver operating a motor vehicle on the Mazda Speedway race track for the Best Driver Car of the Year 2014 and 2015 . A GoPro camera was mounted on the windshield facing the driver so that gestures and expressions could be captured naturalistically during operation of the vehicle. Attention and valence were annotated by experts according to the Fontaine/PAD model . The initial goal of both tests was to detect the stress and attention of the driver as metrics for ranking cars, automatically with computer algorithms. However, affective analysis of a driver is a great challenge due to a myriad of intrinsic and extrinsic imaging conditions, extreme gaze, pose, and occlusion from gestures. In 2014, two institutions were invited to apply automatic algorithms to the task but failed. It proved too difficult to detect face region of interest (ROI) with standard algorithms  and it was difficult to find a facial feature‐encoding scheme that gave satisfactory results. Quantification of emotion was instead carried out manually by a human expert due to these problems. In this chapter, we discuss groundbreaking findings from analysis of the Motor Trend data and share promising, novel methods for overcoming the technical challenges posed by the data.
According to the U.S. Centers for Disease Control and Prevention (CDC), motor vehicle accidents (MVA) are a leading cause of injury and death in the U.S. Prevention strategies are being implemented to prevent deaths and injuries and to save medical costs. Despite this, the U.S. Department of Transportation reported that MVA increased in 2012 after 6 consecutive years of declining fatalities. Video‐based technologies to monitor the emotion and attention of automobile drivers have the potential to curb this growing trend. Existing methods to prevent MVA include smartphone collision detection from video, intelligent cruise control systems, and gaze detection. The missing link in all these prevention strategies is holistic video monitoring of the driver, who is the key participant in MVA, and the detection of cues indicating inattention and stress. The introduction of intelligent transportation systems and automotive augmented reality will exacerbate the growing problem of MVA. While one would expect autonomous/self‐driving cars to decrease MVA from inattention, intelligent transportation systems will return control of the vehicle to the driver in emergency situations. This handoff can only occur safely if the vehicle operator is sufficiently attentive, though his/her attention may be elsewhere because of complacency induced by the autopilot system. Augmented reality systems seek to enhance the driving experience with heads‐up displays and/or head‐mounted displays that can distract the vehicle operator. In short, driver inattention will continue to be a significant issue with cars into the future.
2. Related work
The field of affect analysis dates back to 1872, when Charles Darwin studied the relationship between apparent expression and underlying emotional state in the book “The Expression of the Emotions in Man and Animals.” Communication between humans is a complex process beyond the delivery of semantic understanding. During conversation, we communicate nonverbally with gestures, pose, and expressions. One of the first works in automatic affect analysis by computers dates to 1975. Since this seminal work, emotion recognition has found many applications in medicine [10–12], observation analysis (marketing), and deception detection [14–16].
Systems to monitor the emotion and attention of vehicle operators date as far back as a 1962 patent that used steering wheel corrections as a predictor of attention and mental state. Currently, there is much interest in the observation analysis of driver cognitive load, attention, and/or stress from video or biometric signals. While gaze has become a popular method for measuring the attention of a driver, there is no consensus on how gaze should be monitored. Wang et al. found that a driver's horizontal gaze dispersion was the most significant indicator of concentration under heavy cognitive load. Mert et al. studied gaze during the handoff between manual vehicle control and autonomous piloting systems. It was found that if a driver was out of the loop, it took more time to recover control of the vehicle, increasing the risk of MVA. However, a drawback of both methods is that it may not be possible to obtain an accurate measurement of driver gaze from video. A collaboration between AUDI AG, Volkswagen, and UC San Diego developed a video‐based system for the detection of attention [20, 21]. This system focused on extracting head position and rotation using an array of cameras. We build upon the state of the art with an improved system that detects attention from only a single front‐facing camera. In the following, we discuss the two most significant challenges to the system: face detection and facial feature encoding.
2.1. Related work in face detection
Detection of ROI is the first step of pattern recognition. In face detection, a rectangular bounding box must be computed that contains the face of an individual in the video frame. Despite significant advances to the state‐of‐the‐art, detection of face in unconstrained facial emotion recognition scenarios is a challenging task. Occlusion, pose, and facial dynamics reduce the effectiveness of face ROI detectors. Imprecise face detection causes spurious, unrepresentative features during classification. This is a major challenge to practical applications of facial expression analysis. In Motor Trend Magazine’s Best Driver Car of the Year 2014 and 2015, emotion was a metric for rating cars. In 2014, two institutions were invited to apply automatic algorithms to the task but all algorithms failed to sufficiently detect face ROI. Quantification of emotion was carried out manually by a human expert due to this problem .
Over the past 5 years, face detection has commonly been carried out with the Viola and Jones algorithm (VJ) [10, 23–27]. Since the release of VJ, there have been numerous advances in face detection. Dollár et al. proposed a nonrigid transformation of a model representing the face that is iteratively refined using a different regressor at each iteration. Sanchez‐Lozano et al. proposed a novel discriminative parameterized appearance model (PAM) with an efficient regression algorithm. In discriminative PAMs, a machine‐learning algorithm detects a face by fitting a model representing the object. Cootes et al. proposed fitting a PAM using random forest regression voting. De la Torre and Nguyen proposed a novel generative PAM with a kernel‐based PCA. A generative PAM models parameters such as pose and expression, whereas a discriminative PAM computes the model directly.
While the field of pattern recognition has historically been about features, ROI extraction is arguably the most important part of the entire pipeline. The adage “garbage in, garbage out” applies. In the AV+EC 2015 grand challenge, the Viola and Jones face detector has a 6.5% detection rate and Google Picasa has a 0.07% detection rate. How does one infer the missing 93.95% of face ROIs? Among the “successfully” extracted faces, what is their quality? If one were to fill in the missing values with poor ROIs, the extracted features would be erroneous and lead to a poor decision model. To address this, we propose a system that unifies current approaches and provides quality control of extraction results, called reference‐based face detection.
2.2. Related work in facial appearance features
Local binary patterns (LBP) are one of the most commonly used facial appearance features. They were originally proposed by Ojala et al.  as static feature descriptors that capture texture features within a single frame. LBP encode microtextures by comparing the current pixel to neighboring pixels. Differences are recorded at the bit level, e.g., if the top pixel is greater than the middle pixel a specific bit is set. Identical microtextures will take on the same integer value. There have been many improvements and variations of LBP over the years as the problems within computer vision became more complex. Independent frame‐by‐frame analysis is no longer sufficient for analysis of continuous videos.
A variation of LBP that was developed to address the need of a dynamic texture descriptor was volume local binary patterns (VLBP) . VLBP are an extension of LBP into the spatiotemporal domain. VLBP capture dynamic texture by using three parallel frames centered on the current pixel. The need for a dynamic texture descriptor with a lower dimensionality than VLBP inspired the development of local binary patterns in three orthogonal planes (LBP‐TOP) . The dimensionality of LBP‐TOP is significantly less than VLBP and is computationally less costly than VLBP.
LBP were not always the most popular local appearance feature. Some of the first, most significant works in facial expression analysis by computers used Gabor filters . Gabor filters have historical significance, and they continue to be used in many approaches . Nascent convolutional neural network approaches eventually learn structures similar to a Gabor filter . The Gabor filters are bioinspired and were developed to mimic the V1 cortex of the human visual system. The V1 cortex responds to the gradient images of different orientation and magnitude. It is essentially an appearance‐based feature descriptor that captures all edge information within an image. However, state‐of‐the‐art feature descriptors are known for their compactness and ability to generalize over external and intrinsic factors. The original Gabor filter does not have the ability to generalize in unconstrained settings because it captures all edges within an image, noise included. Furthermore, the Gabor filter is not computationally efficient. The filter produces a response for each filter within its bank. The Gabor filter has been developed into the anisotropic inhibited Gabor filter (AIGF) to model the human visual system's nonclassical receptive field . AIGF generalizes better than the original Gabor filter because of its ability to suppress background noise. A combined Gabor filter with LBP‐TOP has been shown to improve accuracy in the classification of facial expressions .
A thorough search of the literature found no work that combines the anisotropic‐inhibited Gabor filter with LBP‐TOP; this combination is one of the foci of this chapter. The novel method compactly encodes the spatiotemporal behavior of a face and also removes background texture. It is called local anisotropic‐inhibited binary patterns in three orthogonal planes (LAIBP‐TOP).
3. Technical approach
Automatic facial emotion recognition by computers has four steps: (1) region‐of‐interest (ROI) extraction, also known as face detection, (2) registration, colloquially known as alignment, (3) feature extraction, and (4) classification/regression of emotion. This chapter will focus on two important parts of the facial emotion recognition pipeline: face region‐of‐interest extraction and facial appearance features.
3.1. Reference‐based face detection
Reference‐based face detection consists of two phases: (1) In the training phase, a reference face is computed with the avatar reference image concept. This face represents a well‐extracted face and is used to quantify the quality of detection results in the next phase. (2) In testing, multiple candidate face ROIs are detected, and the candidate ROI that best matches the reference face in the least‐squares sense is selected for further processing. Three different methodologies for finding the face ROI are combined: a boosted cascade of Haar‐like features (Viola and Jones, VJ), a discriminative parameterized appearance model (SIFT landmark points matched with iterative least squares), and a parts‐based deformable model. VJ was selected because of its ubiquitous use in the field of face analysis. Discriminative parameterized appearance models were recently deployed in commercial software. Parts‐based deformable models showed promise for face ROI extraction in the wild. Despite the success of currently used methods, there is still much room for improvement. In the Motor Trend data, there are segments of video where one extractor succeeds when the others fail. Therefore, better performance can be achieved by unifying these three methods to generate multiple candidate face ROIs and quantitatively determining which candidate is the best ROI. Note that Refs. [38, 39] use VJ for an initial bounding box, so running more than one face detector is not excessive for state‐of‐the‐art approaches.
3.1.1. Reference‐based face detection in training
The avatar reference image concept generates a reference image of an expressionless face. It was previously used for registration and learning. A proof of optimality of the avatar image concept is given in previous work. Let $I$ be an image in the training data $\mathcal{D}$. To estimate the avatar reference image $A$, take the mean across all face images:

$$A(\mathbf{x}) = \frac{1}{N}\sum_{i=1}^{N} I_i(\mathbf{x}) \qquad (1)$$

where $N$ is the number of training images; $\mathbf{x}$ is a pixel location; and $I_i$ is the $i$‐th image in the dataset $\mathcal{D}$. The process iterates by rewarping $\mathcal{D}$ to $A$ to create a more refined estimate of the reference face. The procedure is as follows: (1) compute the reference $A$ using Eq. (1) from all training ROIs $I_i$, (2) warp all $I_i$ to the reference, and (3) recompute Eq. (1) using the warped images from the previous step. Steps (2) and (3) are iterated three times, a count that was empirically selected in previous work. Results of the reference face at different iterations are shown in Figure 1. SIFT‐Flow warps the images in step (2); the reader is referred to the original SIFT‐Flow work for a full description. In short, a dense, per‐pixel SIFT feature warp is computed with loopy belief propagation. After this point, $A$ represents a well‐extracted reference face.
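The iterative averaging described above can be sketched in a few lines. This is a minimal illustration rather than the chapter's implementation: images are plain nested lists of grayscale values, and `identity_warp` is a hypothetical stand‐in for SIFT‐Flow, which is not reproduced here. Warping the original images (rather than the previously warped ones) at each pass is also an assumption.

```python
def mean_image(images):
    """Per-pixel mean over a list of equally sized grayscale images (Eq. 1)."""
    n = len(images)
    height, width = len(images[0]), len(images[0][0])
    return [[sum(img[y][x] for img in images) / n for x in range(width)]
            for y in range(height)]


def identity_warp(image, reference):
    """Hypothetical placeholder for SIFT-Flow: returns the image unchanged."""
    return image


def avatar_reference(images, warp=identity_warp, iterations=3):
    """Estimate a reference face by alternating warping and averaging."""
    reference = mean_image(images)            # step (1)
    for _ in range(iterations):
        warped = [warp(img, reference) for img in images]  # step (2)
        reference = mean_image(warped)        # step (3)
    return reference
```

Passing a real dense‐warp function as `warp` is what would make successive iterations sharpen the reference; with the placeholder, every pass simply returns the plain mean.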
3.1.2. Reference‐based face detection in testing
To robustly detect a face, three different pipelines simultaneously extract the ROI. We fuse a discriminative parameterized appearance model, a part‐based deformable model, and the Viola and Jones framework. In Viola and Jones (VJ), detection of the face is carried out with a boosted cascade of Haar‐like features. Because of the near‐standard use of VJ, we omit an in‐depth explanation of the method. The reader is referred to  for the details of the algorithm.
3.1.2.1. Discriminative parameterized appearance model
Consider a sparse appearance model of the face. The face detection problem can be framed as an optimization problem that fits the landmark points representing the face. A face is successfully detected when the gradient descent in the fitness space of the optimization problem is complete. Traversing the fitness space can be viewed as a supervised learning problem , rather than carrying out a gradient descent with Gauss‐Newton algorithm . In the training phase the following equation is minimized:
3.1.2.2. Parts‐based deformable models
Parts‐based deformable models represent a face as a collection of landmark points similar to PAMs. The difference is that the most likely locations of the parts are calculated with a probabilistic framework. The landmark points are represented as a mixture of trees of landmark points on the face . Let be the set of landmark points on the face. A facial configuration is modeled as . Alignment of the landmark points is achieved by maximizing the posterior likelihood of appearance and shape. The objective function is formulated as follows:
where is the objective function to be minimized; is the video frame; is the mixture index; is the landmark point indexes; is the template of mixture at point ;
Inference is carried out by maximizing the following:
which enumerates over all mixtures and configurations. The maximum likelihood of the model which best fits the parameters is computed with the Chow‐Liu algorithm .
3.1.2.3. Least square selection
We compare the results of all three pipelines to check whether a face has been properly detected. This requires quantifying the accuracy of each extraction pipeline. We select the candidate face ROI that is closest to the reference face in the least‐squares sense:

$$\hat{c} = \arg\min_{c \in \mathcal{C}} \sum_{\mathbf{x}} \big(c(\mathbf{x}) - A(\mathbf{x})\big)^2 \qquad (7)$$

where $c$ is a candidate face ROI from one of the face extraction pipelines $\mathcal{C}$ and $A$ is the reference face. It is possible that Eq. (7) fails to yield a valid face. There are two causes for this: (A) no candidate face ROIs are generated, or (B) the selected face is a false alarm, i.e., it is not a face or the bounding box is poorly centered. To prevent (B), the face selected in Eq. (7) must have a distance to the reference no greater than a threshold parameter $\tau$, which is empirically selected in training. If the detector fails because of (A), or the distance exceeds $\tau$, the last successfully extracted face is used for further processing in the recognition pipeline. Note that when comparing the proposed method to other detectors in Table 1, we count (A) and (B) as failures of the method.
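The selection rule and its two failure cases can be sketched as follows. This is a hedged illustration: the function names, the `threshold` argument, and the `last_good` fallback are hypothetical, and candidates are assumed to have already been resized to the resolution of the reference face.

```python
def squared_distance(candidate, reference):
    """Sum of squared pixel differences between two equally sized images."""
    return sum((c - r) ** 2
               for c_row, r_row in zip(candidate, reference)
               for c, r in zip(c_row, r_row))


def select_face(candidates, reference, threshold, last_good=None):
    """Return (roi, ok). ok is False for failure (A) or (B)."""
    if not candidates:                       # failure (A): nothing detected
        return last_good, False
    best = min(candidates, key=lambda c: squared_distance(c, reference))
    if squared_distance(best, reference) > threshold:
        return last_good, False              # failure (B): likely false alarm
    return best, True
```

Returning the success flag separately from the ROI lets a caller reuse the previous face for downstream processing while still counting the frame as a detection failure.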
|%||Viola and Jones (VJ)||Constrained local models (CLM)||Supervised descent method (SDM)||Proposed face detector|
|True positive rate|
3.2. Local anisotropic inhibited binary patterns in three orthogonal planes
3.2.1. Gabor filter
A Gabor filter is a bandpass filter that is used for edge detection at a specific orientation and scale. Images are typically filtered by many Gabor filters with different parameters, called a bank. The filter is a Gaussian envelope modulated by a sinusoid. When it is modulated by a cosine, the Gabor filter is symmetric and finds symmetric edges. When it is modulated by a sine, the Gabor filter is antisymmetric and finds antisymmetric edges. According to Grigorescu et al., a Gabor filter at a specific orientation and magnitude is:

$$g_{\lambda,\sigma,\theta,\varphi}(x,y) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\cos\left(2\pi\frac{x'}{\lambda} + \varphi\right)$$

where $\gamma$ is the spatial aspect ratio that affects the eccentricity of the filter; $\theta$ is the angle parameter that tunes the orientation; and $\lambda$ is the wavelength parameter that tunes the filter to a specific spatial frequency, or magnitude. In pattern recognition this is also referred to as scale. $\sigma^2$ is the variance of the Gaussian envelope; it determines the size of the filter. $\varphi$ is the phase offset, which is taken at $0$ and $-\pi/2$. $x'$ and $y'$ are defined as follows:

$$x' = x\cos\theta + y\sin\theta$$

$$y' = -x\sin\theta + y\cos\theta$$
The Gabor filter can be used as local appearance filter by tuning the filter to a local neighborhood while still varying the orientation: and varying . For the rest of the chapter, represents with , and , and with varying and . Given an image , the Gabor energy filter is given by:
which corresponds to the magnitude of filtering the image at the phase values of and .
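As a concrete illustration of the filter and its energy, the sketch below builds a Gabor kernel in pure Python and combines the responses at two phases a quarter period apart into a single magnitude. The kernel form and the phase pair $0$ and $-\pi/2$ follow the formulation of Grigorescu et al.; the kernel size, the zero padding, and all function names are assumptions made for this example, not values from the chapter.

```python
import math


def gabor_kernel(size, lam, theta, sigma, gamma, phi):
    """Return an odd-sized square Gabor kernel as a list of rows."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)    # rotated x'
            yr = -x * math.sin(theta) + y * math.cos(theta)   # rotated y'
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / lam + phi))
        kernel.append(row)
    return kernel


def response_at(image, kernel, cx, cy):
    """Correlate the kernel with the image centred at (cx, cy), zero padded."""
    half = len(kernel) // 2
    height, width = len(image), len(image[0])
    total = 0.0
    for ky, row in enumerate(kernel):
        for kx, weight in enumerate(row):
            ix, iy = cx + kx - half, cy + ky - half
            if 0 <= ix < width and 0 <= iy < height:
                total += weight * image[iy][ix]
    return total


def gabor_energy_at(image, cx, cy, lam, theta, sigma, gamma=0.5, size=9):
    """Magnitude of the symmetric (phase 0) and antisymmetric (phase -pi/2) responses."""
    even = gabor_kernel(size, lam, theta, sigma, gamma, 0.0)
    odd = gabor_kernel(size, lam, theta, sigma, gamma, -math.pi / 2)
    return math.hypot(response_at(image, even, cx, cy),
                      response_at(image, odd, cx, cy))
```

On an image containing a vertical step edge, the energy at $\theta = 0$ is much larger than at $\theta = \pi/2$, which is the orientation selectivity a filter bank relies on.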
3.2.2. Anisotropic‐inhibited Gabor filter
The original formulation of the Gabor energy filter does not generalize well. It captures all edges and magnitudes within the image, including edges due to noisy background texture, for example, MPEG block encoding artifacts that present as a grid‐like repeating pattern. In facial expression recognition, face morphology causes creases along the face that are not part of the background texture; thus, a better contour map can be extracted by removing the background texture of the face. To eliminate the background texture detected by the Gabor filter, we build upon the anisotropic Gabor energy filter. To suppress the background texture, we take a weighted Gabor filter:
where the weighted function is:
where , where is the Heaviside step function; is the difference of Gaussians:
resembles a ring. Eq. (12) retrieves the background texture of without the texture of itself by weighting by the ring‐like filter . The resulting anisotropic‐inhibited Gabor filter is described as follows:
where is a parameter that affects how much of the background texture is removed. ranges from 0 to 1, where 0 indicates no background texture removal and 1 indicates complete background texture removal. The first term of Eq. (15) defines the original Gabor energy filter that captures all edges including background edges. The second term subtracts the weighted Gabor filter with a specified alpha, depending on how much background suppression is needed. We follow  where a value of was empirically selected.
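The inhibition step can be illustrated with a small sketch. It assumes the surround‐suppression form of Grigorescu et al.: a ring‐shaped weight obtained from a difference of Gaussians with standard deviations $4\sigma$ and $\sigma$, clipped at zero and normalized to sum to one, followed by subtraction of the weighted surround energy scaled by $\alpha$. The factor 4, the clipping of negative values, and the function names are assumptions of this example.

```python
import math


def dog_ring_weights(sigma, size):
    """Ring-shaped weights: positive part of a difference of Gaussians, sum 1."""
    half = size // 2

    def gauss(x, y, s):
        return math.exp(-(x * x + y * y) / (2 * s * s)) / (2 * math.pi * s * s)

    raw = [[max(0.0, gauss(x, y, 4 * sigma) - gauss(x, y, sigma))
            for x in range(-half, half + 1)]
           for y in range(-half, half + 1)]
    total = sum(map(sum, raw))
    return [[value / total for value in row] for row in raw]


def inhibit(energy, weights, alpha):
    """Subtract alpha times the ring-weighted surround energy, clipped at zero."""
    height, width = len(energy), len(energy[0])
    half = len(weights) // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            surround = 0.0
            for ky, row in enumerate(weights):
                for kx, weight in enumerate(row):
                    iy, ix = y + ky - half, x + kx - half
                    if 0 <= iy < height and 0 <= ix < width:
                        surround += weight * energy[iy][ix]
            out[y][x] = max(0.0, energy[y][x] - alpha * surround)
    return out
```

With $\alpha = 1$, a response embedded in uniform texture is cancelled by its surround, whereas an isolated response with an empty surround passes through unchanged; with $\alpha = 0$, the original energy is returned.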
To obtain an image that contains only the strongest edges and corresponding orientations, we take the edges with the strongest magnitude across different orientations:
The resulting output of Anisotropic Inhibited Gabor Filter is an image that is . Results are given in Figure 2.
We build upon the work in Ref. , but the proposed approach is significantly different. The anisotropic Gabor energy filter (AIGF) further computes the orientations corresponding to the maximum edges as follows:
A soft histogram is computed from with votes weighted by the maximal edge response . For the proposed approach, we use and do not compute a soft histogram.
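Taking the strongest edge across orientations, together with the orientation that produced it, amounts to a per‐pixel maximum and argmax over the filter bank. The sketch below assumes the inhibited responses are available as a dictionary keyed by orientation; that data layout and the function name are illustrative choices, not part of the original method description.

```python
def strongest_edges(responses):
    """Per-pixel maximum response and the orientation that achieved it.

    responses maps an orientation (radians) to a 2D list of edge magnitudes.
    """
    thetas = list(responses)
    first = responses[thetas[0]]
    height, width = len(first), len(first[0])
    magnitude = [[0.0] * width for _ in range(height)]
    orientation = [[thetas[0]] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            best = max(thetas, key=lambda t: responses[t][y][x])
            magnitude[y][x] = responses[best][y][x]
            orientation[y][x] = best
    return magnitude, orientation
```

Only the `magnitude` image is needed for the binary‐pattern encoding that follows; the `orientation` image corresponds to the quantity used for the soft histogram in the earlier work.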
3.2.3. Local binary patterns
Local binary patterns (LBP) encode local appearance as a microtexture code. The code is a function of comparison to the intensity values of neighboring pixels. Some formulations are invariant to rotation and monotonic grayscale transformations. At present, LBP and its many variations are among the most widely used feature descriptors for facial expression recognition. LBP result in a texture descriptor with dimensionality of $2^P$, where $P$ is a parameter that controls the number of pixel neighbours. The LBP code of a pixel at $c = (x_c, y_c)$ in an image $I$ is given as follows:

$$\mathrm{LBP}(c) = \sum_{p \in N(c)} s\big(I(p) - I(c)\big)\,2^{n} \qquad (18)$$

where $p$ iterates over points in the neighborhood of $c$; $s(\cdot)$ is the sign of the expression, taken as 1 when the difference is nonnegative and 0 otherwise; $n$ is a counter starting from 0 that increments on each iteration; and $N(c)$ is the neighborhood of points about $c$ (see Figure 3A). The factor $2^n$ encodes the result of the intensity difference in a specific bit. A histogram is taken for further compactness and tolerance of registration errors. Each pixel in $I$ is encoded with an LBP code from Eq. (18), then a $2^P$‐level histogram is extracted from the resulting code image. Typically, the image is segmented into nonoverlapping regions and a histogram is extracted from each region. While powerful and effective for static images, LBP lacks the ability to capture temporal changes in continuous video data.
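A minimal sketch of the basic LBP encoding and its histogram is given below. It assumes the simplest configuration, eight neighbours on the 3×3 square around each pixel with no interpolation, and treats a neighbour that is greater than or equal to the centre as a set bit; the neighbour ordering is an arbitrary choice for this example.

```python
# Offsets (dx, dy) of the eight neighbours, visited clockwise from the top-left.
NEIGHBOURS = [(-1, -1), (0, -1), (1, -1), (1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0)]


def lbp_code(image, x, y):
    """8-bit LBP code of the pixel at column x, row y (must not be on the border)."""
    centre = image[y][x]
    code = 0
    for n, (dx, dy) in enumerate(NEIGHBOURS):
        if image[y + dy][x + dx] >= centre:   # sign of the intensity difference
            code |= 1 << n                    # store the result in bit n
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all non-border pixels."""
    hist = [0] * 256
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            hist[lbp_code(image, x, y)] += 1
    return hist
```

For the patch `[[9, 9, 9], [1, 5, 1], [1, 1, 1]]`, only the three top neighbours exceed the centre, so bits 0 to 2 are set and the code is 7.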
3.2.4. Volumetric local binary patterns
Volume local binary patterns (VLBP) and local binary patterns in three orthogonal planes (LBP‐TOP) are variations of LBP that were developed to capture dynamic textures for video data. In VLBP, the circle of neighboring points in LBP is scaled up to a cylinder. VLBP computes code values as a function of three parallel planes centered at . That is, the middle plane contains the center pixel. VLBP coding is obtained by the following equation:
where iterates over three time points: , , and . is the set of spatiotemporal neighbours of (see Figure 3B). A large set of results in a large feature vector while a small results in a small feature vector. As with LBP, a histogram is taken for further compactness. The maximum grey‐level from Eq. (19) is , thus VLBP are more computationally expensive to calculate and require larger feature vector.
3.2.5. Local binary patterns in three orthogonal planes
LBP‐TOP was developed as an alternative to VLBP. VLBP and LBP‐TOP differ in two ways. First, LBP‐TOP uses three orthogonal planes that intersect at the center pixel, whereas VLBP uses three parallel planes. Second, VLBP considers the co‐occurrences of all neighboring points from three parallel frames, which makes for a larger feature vector. LBP‐TOP only considers features from each separate plane and then concatenates them, making the feature vector much shorter than that of VLBP for large numbers of neighboring points. LBP‐TOP performs LBP on the three orthogonal planes corresponding to the XY, XT, and YT axes of the video volume.
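The plane‐wise encoding and concatenation can be sketched as follows. This is a simplified illustration: the video is a nested list indexed as `volume[t][y][x]`, each plane uses eight neighbours on a 3×3 square, and histograms are accumulated over every slice of each orientation rather than only over planes through interior voxels. Those simplifications and all names are assumptions of this example.

```python
# Offsets (dc, dr) of the eight neighbours within a 2D plane.
OFFSETS = [(-1, -1), (0, -1), (1, -1), (1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0)]


def plane_histogram(plane):
    """256-bin LBP histogram of one 2D plane (non-border pixels only)."""
    hist = [0] * 256
    for r in range(1, len(plane) - 1):
        for c in range(1, len(plane[0]) - 1):
            code = 0
            for n, (dc, dr) in enumerate(OFFSETS):
                if plane[r + dr][c + dc] >= plane[r][c]:
                    code |= 1 << n
            hist[code] += 1
    return hist


def accumulate(total, hist):
    """Add hist into total, bin by bin."""
    for i, count in enumerate(hist):
        total[i] += count


def lbp_top(volume):
    """Concatenate LBP histograms from the XY, XT, and YT planes of a video."""
    frames, height, width = len(volume), len(volume[0]), len(volume[0][0])
    xy, xt, yt = [0] * 256, [0] * 256, [0] * 256
    for t in range(frames):                                   # XY: one plane per frame
        accumulate(xy, plane_histogram(volume[t]))
    for y in range(height):                                   # XT: one plane per row
        accumulate(xt, plane_histogram(
            [[volume[t][y][x] for x in range(width)] for t in range(frames)]))
    for x in range(width):                                    # YT: one plane per column
        accumulate(yt, plane_histogram(
            [[volume[t][y][x] for y in range(height)] for t in range(frames)]))
    return xy + xt + yt
```

With eight neighbours the concatenated descriptor has 3 × 256 = 768 bins, which illustrates why the feature vector grows with the sum of three histograms rather than with the joint co‐occurrence of all neighbours.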
3.2.6. Local anisotropic inhibited Gabor patterns in three orthogonal planes
In the proposed method, the computational efficiency of LBP‐TOP is applied to images filtered with the anisotropic‐inhibited Gabor filter. The suppression of background texture provides an image that only contains the edges separate from the background texture. These edges are the significant boundaries of facial features that are useful when determining expression and emotion. Local anisotropic binary patterns’ (LAIBP) code values are computed as follows:
where is the maximal edge magnitude from Eq. (16). LAIBP‐TOP features are extracted in a similar fashion to LBP‐TOP: Compute codes from Eq. (20) in
4. Experimental results
Data in this work have been provided by Motor Trend Magazine from their Best Driver Car of the Year 2014 and 2015. They consist of frontal face video of a test driver as he drives one of 10 automobiles around a racetrack. Parts of the video will be released publicly on YouTube at a later date. The videos are 1080p HD quality, captured with a GoPro Hero 4, and range from 231 to 720 seconds in length. The camera is mounted on the windshield of the car facing the driver's face. The dataset was labeled with the Fontaine emotional model rather than facial action units or emotional categories to quantify emotion. Emotions such as happiness and sadness occupy a region in a two‐dimensional Euclidean space defined by valence and arousal. The objective of the dataset is to detect the valence and arousal of an individual on a per‐frame basis. Valence, also known as evaluation‐pleasantness, describes the positivity or negativity of the person's feelings or feelings about the situation, e.g., happiness versus sadness. Arousal, also known as activation‐arousal, describes a person's interest in the situation, e.g., eagerness versus anxiety.
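For face detection results, we use true positive rate and $F_1$ score. The $F_1$ score is given by:

$$F_1 = \frac{2\,TP}{2\,TP + FP + FN}$$

where $TP$, $FP$, and $FN$ are the numbers of true positives, false positives, and false negatives, respectively.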
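For both metrics, higher is better. For full recognition results, we use root mean squared (RMS) error and correlation. The correlation coefficient is given by:

$$\rho = \frac{E\big[(y - \mu_y)(\hat{y} - \mu_{\hat{y}})\big]}{\sigma_y\,\sigma_{\hat{y}}}$$

where $E$ is the expectation operator; $y$ and $\hat{y}$ are the ground‐truth and predicted labels; and $\mu$ and $\sigma$ denote the corresponding means and standard deviations.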
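The evaluation metrics named above can be computed as in the following sketch. It assumes the standard definitions of the $F_1$ score, RMS error, and Pearson correlation; the function names are illustrative.

```python
import math


def f1_score(tp, fp, fn):
    """F1 score from counts of true positives, false positives, false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def rms_error(truth, predicted):
    """Root mean squared error between two equally long label sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, predicted)) / len(truth))


def pearson(truth, predicted):
    """Pearson correlation coefficient between two label sequences."""
    n = len(truth)
    mean_t = sum(truth) / n
    mean_p = sum(predicted) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(truth, predicted))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_t * var_p)
```

Correlation rewards predictions that follow the trend of the ground truth even when they are offset or scaled, whereas RMS error penalizes any absolute deviation, which is why the two are reported together.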
4.3. Results comparing different face detectors
Face detection results are given in Table 1. In general, VJ is the worst performer with the highest variance. Though CLM and SDM have acceptable detection rates, they too have a high variance and some videos are a total failure with no face extraction. The proposed algorithm improves detection rates on both datasets and reduces variance.
4.4. Results comparing different facial appearance features
For the full recognition pipeline, the landmarks for the inner corners of the eyes and the tip of the nose are used as control points for a coarse registration. These points are the least affected by face morphology. An ε‐SVR is used for prediction of valence and arousal values.
Full regression results and a comparison to other state‐of‐the‐art facial appearance features are given in Table 2. Experiments employed a 9‐fold, leave‐one‐video‐out cross‐validation. For correlation, higher is better; for RMS, lower is better. In Table 2, the correlation achieved by the proposed method is the best for valence and the second best for arousal. The RMS values for the proposed method are the best for arousal and the second best for valence. Overall, the proposed method has the best average correlation and the lowest average RMS value, which indicates that removing background noise before applying LBP‐TOP improves results. Graphs comparing the ground‐truth and predicted labels are given in Figure 5. It was found that frames with extreme head rotation tended to have lower correlation and higher error due to the difficulty of registering these frames.
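Leave‐one‐video‐out cross‐validation holds out every frame of one video per fold, so that no frames from the test video are seen in training. A minimal sketch of how such folds could be generated is shown below; the function name and the per‐frame list of video identifiers are assumptions of this example.

```python
def leave_one_video_out(video_ids):
    """Yield (held_out_video, train_indices, test_indices) for each video."""
    for held_out in sorted(set(video_ids)):
        train = [i for i, v in enumerate(video_ids) if v != held_out]
        test = [i for i, v in enumerate(video_ids) if v == held_out]
        yield held_out, train, test
```

With nine distinct video identifiers this produces the nine folds used in the experiments, each training on eight videos and testing on the remaining one.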
In this chapter, we proposed a system to perform facial expression recognition on a brand new dataset. This dataset is unconstrained and unique. We proposed a new feature vector that is robust to background noise and capable of capturing dynamic textures. We also proposed a novel method for fusing the output of many face detectors. Both approaches provided better results than other state‐of‐the‐art methods. In future work, the face detection scheme will be scaled up to a 3D model to better detect extreme out‐of‐plane head rotations.