1. Introduction
Biometric recognition systems for personal identification commonly process input data acquired from irises, voiceprints, fingerprints, signatures, human faces, and so on. The recognition of irises, voiceprints, fingerprints, and signatures belongs to the passive methods, which require a high-resolution camera or must capture people's biometric information at short range. These methods are not suitable for the person following robot to be developed here, because they are inconvenient for users. Face recognition is one of the active methods, in which users only need to keep a certain distance from a camera. In this chapter, face recognition is regarded as a kind of human computer interface (HCI) applied to the interaction between humans and robots. Thus, we attempt to develop an automatic real-time multiple-face recognition and tracking system equipped on the robot that can detect human faces and confirm a target in an image sequence captured from a PTZ camera, keep tracking the target identified as the master or a stranger, and then employ a laser range finder to maintain a proper distance between the target and the robot.
Generally, three main procedures are implemented on such a system: face detection, face recognition, and face tracking. In the literature, a large number of face localization techniques have been proposed. According to the literature (Hjelmås & Low, 2001), the methods for face detection can be basically grouped into feature-based and image-based approaches. The development of the feature-based approaches can be further divided into three areas: active shape models, feature analysis, and low-level analysis of edges, grey-levels, skin colours, motions, and so forth. On the other hand, the image-based approaches can be categorized into linear subspace methods, neural networks, and statistical approaches.
Face recognition has advanced considerably over the past twenty years (Zhao et al., 2003), and each system has its own solution. At present, most research on face recognition seeks features that are robust enough to represent different human faces. A novel classification method for face recognition, called the nearest feature line (NFL), was proposed by Li et al. (1999). The derived feature line can capture more variations of face images than the original feature points do, and it thus expands the capacity of an available face image database; however, when the database contains many face images, the recognition accuracy decreases. Brunelli and Poggio (1993) compared two simple but general strategies on a common face image database and developed two new algorithms: the first is based on the computation of a set of geometrical features, such as nose width and length, mouth position, and chin shape, and the second is based on almost-grey-level template matching. Nevertheless, the geometrical characteristics change under different lighting conditions. In the literature (Er et al., 2002), the researchers combined the techniques of principal component analysis (PCA), Fisher's linear discriminant, and radial basis function networks to increase the correct rate of face recognition; however, their approach cannot run in real time.
On the whole, methods of tracking objects can be categorized in three ways: match tracking, predictive tracking, and energy functions (Fan et al., 2002). Match tracking has to detect moving objects in the entire image; although its accuracy is considerable, it is very time-consuming and its performance cannot be improved effectively. Energy functions are often adopted in a snake model, where they provide good support for contour tracking. In general, there are two methods of predictive tracking: the Kalman filter and the particle filter. The Kalman filter works well when objects move along linear paths, but it is not appropriate for non-linear and non-Gaussian movements of objects. On the contrary, the particle filter performs well for non-linear and non-Gaussian problems. A research team completed a face tracking system that exploits a simple linear Kalman filter (An et al., 2005). They used a small number of critical rectangle features selected and trained by an AdaBoost learning algorithm, and then detected the initial position, size, and incline angle of a human face correctly. Once a human face is reliably detected, they extract the colour distributions of the face and the upper body from the detected facial regions and the upper body regions for creating respective robust colour modelling by virtue of
In this chapter, an automatic real-time multiple-face recognition and tracking system installed on a person following robot is presented, which comprises face detection, face recognition, and face tracking procedures. To identify human faces quickly and accurately, an AdaBoost algorithm is used to train a strong classifier for face detection (Viola & Jones, 2001). As to face recognition, we modify the discriminative common vectors (DCVs) algorithm (Cevikalp & Wilkes, 2004; Cevikalp et al., 2005; Gulmezoglu et al., 2001) and employ the minimum Euclidean distance to measure the similarity of a detected face image and a candidate one. To raise the confidence level, the most likely person is determined by majority voting over ten successive recognition results from a face image sequence. Subsequently, the recognition results are assigned to two classes: "master" and "stranger." In our system, the robot tracks the master unceasingly unless the power is turned off. In the face tracking procedure, we propose an improved particle filter to dynamically locate multiple faces. According to the position of the target in an image, we issue a series of commands (moving forward, turning left, or turning right) to drive the wheel motors of the robot, and evaluate the distance between the master and the robot by means of a laser range finder to issue another set of commands (stop or move backward) until the robot follows at a suitable distance in front of the master.
2. Hardware description
The following describes the hardware system of our experimental robot, whose frame is 40 cm long, 40 cm wide, and 130 cm high, as Fig. 1 shows. The robot has three wheels: the motors of the two front wheels drive the robot to move forward/backward and turn left/right, whereas the unpowered rear wheel only supports the robot. By controlling the turning directions of the two front wheels, we can change the moving direction of the robot. To help the robot keep its balance, we mount a ball caster at the rear.
Fig. 2 illustrates the hardware architecture of our experimental robot. The Industrial PC receives the distance data of the laser range finder via RS-232. In the camera system, the images are acquired through a USB 2.0 video grabber. After processing the captured images and receiving the distance data from the laser range finder, the program sends commands through RS-232 to the motor system. The H8/3694F microprocessor links to the micro serial servo controller, which then drives the left and right motors of the front wheels.
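As a rough illustration of how such motion commands might be framed and sent over the RS-232 link, the sketch below uses a hypothetical one-byte command encoding with STX/ETX framing; the byte values, port name, and baud rate are assumptions for illustration, not the actual protocol of the H8/3694F firmware.

```python
def build_command(action):
    """Encode a motion command as a framed byte string.
    The byte values here are placeholders, not the real protocol."""
    codes = {"forward": b"F", "backward": b"B", "left": b"L",
             "right": b"R", "stop": b"S"}
    return b"\x02" + codes[action] + b"\x03"   # STX + payload + ETX

def send_command(action, port="/dev/ttyS0"):
    """Write one framed command to the serial port (requires pyserial
    and the actual hardware; port name and baud rate are assumed)."""
    import serial
    with serial.Serial(port, 9600, timeout=1) as link:
        link.write(build_command(action))
```

Separating command construction from transmission keeps the encoding testable without the robot attached.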
3. Face detection
Face detection is a crucial step of the human face recognition and tracking system. To enable the system to extract facial features efficiently and accurately, we must first narrow the search range for face detection, while the detection method must still perform well. In order to detect human faces quickly and accurately, we take advantage of an AdaBoost (Adaptive Boosting) algorithm to build a strong classifier with simple rectangle features computed from an integral image (Viola & Jones, 2001). Therefore, no matter what the size of a rectangle feature is, the execution time is always constant. Some rectangle features are fed to the strong classifier that distinguishes positive from negative images. Although the AdaBoost algorithm spends a lot of time on training the strong classifier, the resulting face detection attains high performance overall. Fig. 3 depicts the boosting learning algorithm that we adopt for face detection.
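The constant-time property of rectangle features comes from the integral image: once it is built, the pixel sum over any rectangle takes only four array references, regardless of the rectangle's size. A minimal NumPy sketch of this idea (not the authors' code):

```python
import numpy as np

def integral_image(img):
    # S[y, x] holds the sum of img[0:y, 0:x]; an extra zero row and
    # column simplify the corner lookups at the image border.
    S = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    S[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return S

def rect_sum(S, x, y, w, h):
    # Sum of the w x h rectangle with top-left corner (x, y),
    # obtained from four corner values of the integral image.
    return S[y + h, x + w] - S[y, x + w] - S[y + h, x] + S[y, x]
```

A rectangle feature is then just a signed combination of two to four such `rect_sum` calls.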
The cascaded structure of weak classifiers using the AdaBoost algorithm is constructed one by one, and each weak classifier has its own order. A positive result is processed by the next weak classifier, whereas a negative result at any point leads to the immediate rejection of the corresponding sub-window. First, we input the rectangle features of different sub-windows into the weak classifiers; the first weak classifier, whose detection error rate is the minimum, deletes a large number of negative examples, and the remaining sub-windows, which are more difficult to remove, are further processed by the successive weak classifiers. Such an operation is repeated until the last weak classifier is performed, and the remainders are face images.
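The early-rejection behaviour of the cascade described above can be sketched as follows; the stage structure and the classifier form are simplified placeholders, not the trained detector itself:

```python
def cascade_detect(window, stages):
    """Each stage is (weak_classifiers, threshold), where a weak
    classifier is a (weight, h) pair and h maps a window to 0 or 1.
    A window is accepted only if it passes every stage, so most
    negative sub-windows exit after the first few cheap stages."""
    for classifiers, threshold in stages:
        score = sum(weight * h(window) for weight, h in classifiers)
        if score < threshold:
            return False   # rejected immediately; later stages never run
    return True            # survived all stages: reported as a face
```

In a real detector the early stages use very few features, which is what makes scanning every sub-window affordable.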
In order to take advantage of the integral image, we adopt the concept like processing a pyramid of images. That is, a sub-window constituting the detector scans the input image on many scales. For example, the first detector is composed of
4. Face recognition
Through the face detection procedure, we obtain face images that are then fed to the face recognition procedure. To begin with, we execute image normalization to make the sizes of the face images the same. The intensity of the images is also adjusted to reduce the lighting effect. After the size normalization and intensity adjustment, we subsequently perform the feature extraction process to obtain the feature vector of a face image. The idea of common vectors was originally introduced for isolated word recognition problems in the case that the number of samples in each class is less than or equal to the dimensionality of the sample space. The approaches to solving these problems extract the common properties of classes in the training set by eliminating the differences of the samples in each class. A common vector for each individual class is obtained by removing all the features that are in the directions of the eigenvectors corresponding to the nonzero eigenvalues of the scatter matrix of its own class. The common vectors are then used for pattern recognition. In our case, instead of employing a given class's own scatter matrix, we exploit the within-class scatter matrix of all classes to obtain the common vectors. We also present an alternative algorithm based on the subspace method and the Gram-Schmidt orthogonalization procedure to acquire the common vectors. Therefore, a new set of vectors called the discriminative common vectors (DCVs), which results from the common vectors, will be used for classification. What follows elaborates the algorithms for obtaining the common vectors and the discriminative common vectors (Gulmezoglu et al., 2001).
4.1. The DCV algorithm
Let the training set be composed of
and
where
In the special case,
To find the optimal projection vectors
Let
and
where
Consider the matrices
where
To do this, we need to find a basis in
Let
That is, the vector on the right-hand side of Eq. (10) is independent of the sample indexed
The theorem states that it is enough to project a single sample from each class. This will greatly reduce the computational load of the calculations. After acquiring the common vectors
where
where
In this case, the optimal projection vectors
There will be
We call the feature vectors
which is then compared with the discriminative common vector
Since
The above method can be summarized as follows:
Step 1. Compute nonzero eigenvalues and their corresponding eigenvectors of
Step 2. Choose any sample from each class and project it onto the null space of
Step 3. Compute the eigenvectors
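The three steps above can be sketched numerically as follows. This is an illustrative NumPy implementation of the null-space recipe (with the range and null spaces obtained via SVD rather than Gram-Schmidt), not the authors' exact code; the variable names and the rank tolerance are assumptions:

```python
import numpy as np

def dcv_train(X_by_class, tol=1e-10):
    """X_by_class: list of (d, n_i) arrays, one per class, with d large.
    Returns the projection matrix W and one projected common vector
    (discriminative common vector) per class."""
    # Step 1: the range of the within-class scatter S_w is spanned by the
    # centered sample differences of every class; get a basis via SVD.
    diffs = np.hstack([X - X.mean(axis=1, keepdims=True) for X in X_by_class])
    U, s, _ = np.linalg.svd(diffs, full_matrices=False)
    R = U[:, s > tol]                       # orthonormal basis of range(S_w)
    # Step 2: project one sample of each class onto the null space of S_w
    # (subtract the component lying in the range space).
    common = [x - R @ (R.T @ x) for x in (X[:, 0] for X in X_by_class)]
    C = np.column_stack(common)
    # Step 3: eigenvectors of the scatter of the common vectors with
    # nonzero eigenvalues form the projection W (at most C-1 vectors).
    Cc = C - C.mean(axis=1, keepdims=True)
    W, sc, _ = np.linalg.svd(Cc, full_matrices=False)
    W = W[:, sc > tol]
    return W, [W.T @ c for c in common]
```

Because every sample of a class shares the same null-space component, projecting any sample of that class with `W` yields the same feature vector, which is what makes a single sample per class sufficient.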
4.2. Similarity measurement
Many methods of similarity measurement exist. In order to implement face recognition on the robot in real time, we select the Euclidean distance to measure the similarity of a face image to those in the face database after feature extraction with the aid of the DCV transformations.
Let the training set be composed of
And the Euclidean distance between the feature vectors of the test image
After the test image is compared with the feature vectors of all training models, we can find the class whose Euclidean distance is the minimum. In other words, the test image is identified as the most probable person, namely
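The minimum-distance decision can be sketched as below; the optional rejection threshold for treating a poor best match as an unknown person is an assumption added for illustration:

```python
import numpy as np

def classify(feature, class_features, threshold=None):
    """Return the index of the class whose discriminative common vector
    is nearest (minimum Euclidean distance) to the test feature, or
    None when even the best match exceeds the rejection threshold."""
    dists = [np.linalg.norm(feature - f) for f in class_features]
    best = int(np.argmin(dists))
    if threshold is not None and dists[best] > threshold:
        return None    # treated as an unknown person (stranger)
    return best
```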
5. Face tracking
Our detection method can effectively find face regions, and the recognition method can identify the master among the people in front of the robot. Then the robot tracks the master using our face tracking system. On the other hand, we can exchange the roles of the master and a stranger, and the robot will resume its tracking to follow the stranger. Until now, many tracking algorithms have been proposed; of them, the Kalman filter and the particle filter are applied extensively. The former is often used to predict the state sequence of a target for linear predictions, and it is not suitable for non-linear and non-Gaussian movements of objects. The latter, however, is based on the Monte Carlo integration method and is suited to nonlinear predictions. Considering that the postures of a human body are nonlinear motions, our system chooses a particle filter to realize the tracker (Foresti et al., 2003; Montemerlo et al., 2002; Nummiaro et al., 2003). The key idea of this technique is to represent probability densities by sets of samples. Its structure is divided into four major parts: probability distribution, dynamic model, measurement, and factored sampling.
5.1. Particle filter
Each interested target will be defined before the face tracking procedure operates. The feature vector of the
where
According to the above target parameters, the tracking system builds some face models and their associated candidate models, each of which is represented by a particle. Herein, we select the Bhattacharyya coefficient to measure the similarity of two discrete distributions resulting from the respective cues of a face model and its candidate models existing in two consecutive frames.
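The Bhattacharyya coefficient between two discrete distributions mentioned above can be computed directly from their (normalized) histograms:

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two discrete distributions:
    1.0 for identical histograms, approaching 0 for disjoint ones."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(np.sqrt(p * q)))
```

A candidate region whose histogram yields a coefficient close to 1 is considered highly similar to the face model.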
Three kinds of cues are considered in the observation model: the colour cue, edge cue, and motion cue. The colour cue employed in the observation model is still generated following the practice presented in the literature cited above, instead of modelling the target by a colour histogram. The raw image of each frame is first converted to a colour probability distribution image via the face colour model, and then the samples are measured on the colour probability distribution image. In this manner, the face colour model is adapted to the interested regions found by the face detection procedure, then handles the possible changes of the face regions due to variable illumination frame by frame, and the Bhattacharyya coefficient is used to calculate the similarity defined as
where
In comparison with our earlier proposed methods (Fahn et al., 2009; Fahn & Lin, 2010; Fahn & Lo, 2010), the edge cue is a newly used clue. Edges are a basic feature in images, and they are less affected by illumination than the colour model. First, the Sobel filter is applied to obtain all the edges in each frame, and the Bhattacharyya coefficient is employed for the similarity measurement just like the colour model:
where
As for the motion feature, our tracking system records the centre positions of interested objects in the previous
and
The motion has two states: accelerated velocity and decelerated velocity. If the interested face moves with an accelerated velocity, the distance between the current centre position and the average centre position will be larger than the average moving distance. Conversely, if the interested face moves with a decelerated velocity, this distance will be less than the average moving distance.
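The acceleration/deceleration test described above can be sketched as follows; the function name and return labels are illustrative:

```python
import math

def motion_state(history, current):
    """history: centre positions (x, y) in the previous k frames.
    Compares the distance from the current centre to the average
    centre against the average frame-to-frame moving distance."""
    k = len(history)
    avg = (sum(p[0] for p in history) / k, sum(p[1] for p in history) / k)
    steps = [math.dist(history[i], history[i + 1]) for i in range(k - 1)]
    avg_step = sum(steps) / len(steps)
    d = math.dist(current, avg)
    return "accelerating" if d > avg_step else "decelerating"
```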
In addition to the distance of face motion, the direction of face motion is another important factor. The angle
It also depends on the average centre positions of the
where
When the occlusion of human faces happens, the motion cue is obvious and becomes more reliable. On the contrary, when the occlusion does not happen, we can weigh particles mainly by the observation of the colour and edge cues. As a result, we take a linear combination of colour, edge, and motion cues as follows:
where
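A sketch of the linear combination of the three cues for weighing a particle; the weight values and the occlusion-dependent switch are assumptions for illustration, not the exact values used in the chapter:

```python
def particle_weight(colour_sim, edge_sim, motion_sim, occluded):
    """Linearly combine the three cue similarities. The weights are
    assumed values: when occlusion is detected the motion cue
    dominates, otherwise the colour and edge cues do."""
    if occluded:
        wc, we, wm = 0.2, 0.2, 0.6
    else:
        wc, we, wm = 0.5, 0.3, 0.2
    return wc * colour_sim + we * edge_sim + wm * motion_sim
```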
After the current filtering process is finished, we execute the next filtering process again until a local search is achieved. This method tracks faces steadily even if humans move very fast. In the first filtering process, we establish thirty particles to track faces, but in the subsequent filtering processes, only ten particles are set up to facilitate the face tracking operation.
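The factored-sampling step that redraws particles in proportion to their weights between filtering passes can be sketched as below (the first pass uses n = 30, subsequent passes n = 10, as stated above):

```python
import random

def resample(particles, weights, n):
    """Draw n particles with replacement, in proportion to their
    weights (factored sampling); heavily weighted particles are
    duplicated, lightly weighted ones tend to disappear."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(particles, weights=probs, k=n)
```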
5.2. Occlusion handling
Occlusion handling is a major problem in visual surveillance (Foresti et al., 2003). The occlusion problem usually occurs in a multi-target system. When multiple moving objects occlude each other, both the colour and edge information become less reliable than the velocities and directions of the moving objects. In order to solve the occlusion problem, we have improved the strategies of the filtering process and modified the weights of the particles depicted in the previous subsection. Compared with the earlier proposed methods (Fahn et al., 2009; Fahn & Lin, 2010; Fahn & Lo, 2010), the major improvement is that the directions of moving objects in a number of past frames are incorporated into the constituents of the motion cue. We also check whether the occlusion ceases to exist. If only one human face is detected in the face detection procedure, occlusion cannot occur. Nevertheless, if two or more faces are detected, occlusion will occur when the moving faces come near each other, and the system then goes into the occlusion handling mode.
5.3. Robot control
We can predict the newest possible position of a human face by means of the particle filtering technique. According to the possible position, we issue a control command to make the robot track the
5.3.1. PTZ camera control
The principal task of the PTZ camera control is to keep the
where
5.3.2. Wheel motor control
The main task of the wheel motor control is to direct the robot to track the
where
The second information is the laser range data. We equip a laser range finder on the front of the robot to measure the distance
By controlling the robot, we can increase its capability of interacting with people. If the robot can recognize humans' identities and follow a person continuously, the relation between the robot and human beings becomes closer, and its applications become more extensive. Hence, we control the movement of the robot according to state changes between the face tracking and face recognition modes. After face detection and target confirmation, the system switches to the recognition mode and starts the robot to execute a certain task.
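A minimal sketch of a distance-keeping rule driven by the laser range data, issuing the stop/backward commands mentioned in the introduction; the target distance and the tolerance band are assumed values, not taken from the chapter:

```python
def range_command(distance_cm, target_cm=100, band_cm=10):
    """Map a laser range reading to a wheel command. target_cm and
    band_cm are hypothetical: the robot drives forward when too far,
    backs up when too close, and stops inside the tolerance band."""
    if distance_cm > target_cm + band_cm:
        return "forward"
    if distance_cm < target_cm - band_cm:
        return "backward"
    return "stop"
```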
6. Experimental results
Table 1 lists both the hardware and software used for the development of the real-time face tracking and recognition system installed on our robot. All the experimental results presented in this section were obtained by averaging the outcome data of 10 different runs to demonstrate the effectiveness of the proposed methods.
6.1. Face detection
The training database for face detection consists of 861 face images segmented by hand and 4,320 non-face images labelled by a random process from a set of photographs taken outdoors. The sizes of these face and non-face images are all
We perform the experiments on face detection on two different kinds of image sequences: ordinary and jumble. An image sequence is regarded as a jumble if the background in the sample video is cluttered; that is, the environment contains many pixels resembling skin colours and/or the illumination is not appropriate. We define the "error rate" as the probability of detecting human faces incorrectly, for example, regarding a non-face region as a human face, and the "miss rate" as the probability that a target appears in the frame but is not detected by the system. Table 2 shows the face detection rates for the above experiments.
From the experimental results, we observe that the ordinary image sequences have better correct rates because they encounter less interference, and the error rates are lower in this situation. The performance on the jumble image sequences is inferior to that on the ordinary ones. The main influence is due to the difference of luminance, whereas the effect of the colour factor is comparatively slight, particularly for the many skin-like regions existing in the background. It is noted that the error rate of the jumble image sequences is larger than that of the ordinary ones because our database has a small number of negative examples, which raises the error rate. Fig. 6 demonstrates some face detection results from using our real-time face detection system in different environments.
6.2. Face recognition
Our face image database comprises 5 subjects to be recognized, and each of them has 20 samples. The nicknames are Craig, Eric, Keng-Li, Jane, and Eva (3 males and 2 females) respectively. The facial expressions (open or closed eyes and smiling or non-smiling) are also varied. The images of one subject included in the database are shown in Fig. 7.
We have our robot execute face recognition in real time using the DCV method. After comparing the feature vector of the captured and then normalized face image with those of all training models, we can find the class whose Euclidean distance is the minimum; in other words, the image is identified as the most probable person. Table 3 presents the range of thresholds for each class used to eliminate the corresponding person from the possible subjects included in the database. If the feature distance of a captured and normalized face image is lower than or equal to the threshold value of some class, the image is assigned to that class. For each period of 10 frames, we rank the frequency of assignments for each class to determine the recognition result. Fig. 8 graphically shows some face recognition results from using our real-time face recognition system installed on the robot.
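The majority voting over a window of per-frame recognition labels (a 10-frame window in our system) can be sketched as:

```python
from collections import Counter

def vote(results):
    """Return the most frequent label in a window of per-frame
    recognition results, e.g. 10 successive frames."""
    label, _ = Counter(results).most_common(1)[0]
    return label
```

Voting over several frames smooths out occasional single-frame misclassifications.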
6.3. Face tracking
Several different types of accuracy measures are required to evaluate the performance of tracking results. In this subsection, we utilize track cardinality measures for estimating the tracking rates. The reference data should be first determined in order to obtain any of the accuracy measures. Let
and
The types of our testing image sequences are roughly classified into three kinds: a single face (Type A), multiple faces without occlusion (Type B), and multiple faces with occlusion (Type C). Each type has three different image sequences, and we will perform five tests on each of them. In these experiments, we use 30 particles to track human faces in each frame for the first filtering process, followed by 10 particles. Then we analyze the performance of face tracking on the three kinds of testing image sequences according to the aforementioned evaluation methods. Table 4 shows the face tracking performance for the above experiments.
From the experimental results, the tracking accuracy of Type C is lower than those of the other two types. This is because when occlusion happens, the particles are predicted on the other faces, which are also skin-colour regions. On the other hand, the correct rates of face tracking for Types A and B are similar to each other; that is, the tracking performance for a single face is almost the same as that for multiple faces without occlusion. When the robot and the target move simultaneously, some objects whose colours are close to the skin colour may exist in the background. In this situation, we must consider the speed of the target and assume that a target disappearing on the left side of the image does not appear on the right side immediately. Through simple distance limits, the robot can track the target continuously. Moreover, at regular intervals we either abandon the currently tracked targets and detect the face regions once more, or keep tracking the wrong targets until the head moves near the resampling range of the particle filter. Enlarging the resampling range can slightly alleviate this problem; however, the wider the resampling range is, the sparser the particles are. This helps the robot track faces, but sometimes reduces the tracking correctness. Fig. 9 shows some face tracking results from using our real-time face tracking system equipped on the robot.
7. Conclusion and future works
In this chapter, we present a completely automatic real-time multiple-face recognition and tracking system installed on a robot that captures an image sequence from a PTZ camera, uses the face detection technique to locate face positions, identifies the detected faces as the master or strangers, and subsequently tracks a target and guides the robot to stay near the target continuously. Such a system not only allows robots to interact with human beings adequately, but also makes robots react more like humans.
Some future works are worth investigating to attain better performance. In the face recognition procedure, if the background is too cluttered to capture a clear foreground, the recognition rate decreases. Because most of our previous training samples were captured in a simple environment, static objects in an uncomplicated background are sometimes identified as the foreground. We can add some special training samples captured in cluttered backgrounds to lower the miss rate during face detection. Of course, this will raise the face recognition accuracy, but it needs a lot of experiments to collect special and proper training samples.
References
- 1. An, K. H., Yoo, D. H., Jung, S. U., & Chung, M. J. (2005). "Robust multi-view face tracking," in Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Alberta, Canada, pp. 1905-1910.
- 2. Bing, Y., Lianfu, J., & Ping, C. (2002). "A new LDA-based method for face recognition," in Proceedings of the International Conference on Pattern Recognition, Quebec, Canada, vol. 1, pp. 168-171.
- 3. Brunelli, R. & Poggio, T. (1993). "Face recognition: features versus templates," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 10, pp. 1042-1052.
- 4. Cevikalp, H., Neamtu, M., Wilkes, M., & Barkana, A. (2005). "Discriminative common vectors for face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 1, pp. 4-13.
- 5. Cevikalp, H. & Wilkes, M. (2004). "Face recognition by using discriminative common vectors," in Proceedings of the International Conference on Pattern Recognition, Cambridge, United Kingdom, vol. 1, pp. 326-329.
- 6. Chen, L. F., Liao, H. Y. M., Ko, M. T., Lin, J. C., & Yu, G. J. (2000). "A new LDA-based face recognition system which can solve the small sample size problem," Pattern Recognition, vol. 33, no. 10, pp. 1713-1726.
- 7. Er, M. J., Wu, S., Lu, J., & Toh, H. L. (2002). "Face recognition with radial basis function (RBF) neural networks," IEEE Transactions on Neural Networks, vol. 13, no. 3, pp. 697-710.
- 8. Fahn, C. S., Kuo, M. J., & Wang, K. Y. (2009). "Real-time face tracking and recognition based on particle filtering and AdaBoosting techniques," in Proceedings of the 13th International Conference on Human-Computer Interaction, LNCS 5611, San Diego, California, pp. 198-207.
- 9. Fahn, C. S. & Lin, Y. T. (2010). "Real-time face detection and tracking techniques used for the interaction between humans and robots," in Proceedings of the 5th IEEE Conference on Industrial Electronics and Applications, Taichung, Taiwan.
- 10. Fahn, C. S. & Lo, C. S. (2010). "A high-definition human face tracking system using the fusion of omnidirectional and PTZ cameras mounted on a mobile robot," in Proceedings of the 5th IEEE Conference on Industrial Electronics and Applications, Taichung, Taiwan.
- 11. Fan, K. C., Wang, Y. K., & Chen, B. F. (2002). "Introduction of tracking algorithms," Image and Recognition, vol. 8, no. 4, pp. 17-30.
- 12. Foresti, G. L., Micheloni, C., Snidaro, L., & Marchiol, C. (2003). "Face detection for visual surveillance," in Proceedings of the 12th IEEE International Conference on Image Analysis and Processing, Mantova, Italy, pp. 115-120.
- 13. Gulmezoglu, M. B., Dzhafarov, V., & Barkana, A. (2001). "The common vector approach and its relation to principal component analysis," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 6, pp. 655-662.
- 14. Hjelmås, E. & Low, B. K. (2001). "Face detection: a survey," Computer Vision and Image Understanding, vol. 83, no. 3, pp. 236-274.
- 15. Huang, R., Liu, Q., Lu, H., & Ma, S. (2002). "Solving the small size problem of LDA," in Proceedings of the International Conference on Pattern Recognition, Quebec, Canada, vol. 3, pp. 29-32.
- 16. Li, S. Z. & Lu, J. (1999). "Face recognition using the nearest feature line method," IEEE Transactions on Neural Networks, vol. 10, no. 2, pp. 439-443.
- 17. Montemerlo, M., Thrun, S., & Whittaker, W. (2002). "Conditional particle filters for simultaneous mobile robot localization and people-tracking," in Proceedings of the IEEE International Conference on Robotics and Automation, Washington, D.C., vol. 1, pp. 695-701.
- 18. Nummiaro, K., Koller-Meier, E., & Van Gool, L. (2003). "An adaptive color-based particle filter," Image and Vision Computing, vol. 21, no. 1, pp. 99-110.
- 19. Turk, M. & Pentland, A. (1991). "Face recognition using eigenfaces," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Maui, Hawaii, pp. 586-591.
- 20. Viola, P. & Jones, M. (2001). "Rapid object detection using a boosted cascade of simple features," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, vol. 1, pp. 511-518.
- 21. Zhao, W., Chellappa, R., Phillips, P. J., & Rosenfeld, A. (2003). "Face recognition: a literature survey," ACM Computing Surveys, vol. 35, no. 4, pp. 399-458.