Biometric recognition systems for personal identification commonly process input data acquired from irises, voiceprints, fingerprints, signatures, human faces, and so on. The recognition of irises, voiceprints, fingerprints, and signatures belongs to the passive methods, which require a high-resolution camera or must capture people's biometric information at short range. These methods are not suitable for the person following robot we aim to develop, because they are inconvenient for users. Face recognition belongs to the active methods, which only require users to keep a certain distance from a camera. In this chapter, face recognition is regarded as a kind of human-computer interface (HCI) applied to the interaction between humans and robots. We therefore develop an automatic real-time multiple-face recognition and tracking system mounted on the robot. The system detects human faces and confirms a target in an image sequence captured from a PTZ camera, keeps tracking the target that has been identified as a master or stranger, and then employs a laser range finder to maintain a proper distance between the target person and the robot.
Such a system generally implements three main procedures: face detection, face recognition, and face tracking. A large number of face localization techniques have been proposed in the literature. According to the literature (Hjelmås & Low, 2001), the methods for face detection can be grouped into feature-based and image-based approaches. The feature-based approaches can be further divided into three areas: active shape models, feature analysis, and low-level analysis of edges, grey levels, skin colours, motions, and so forth. The image-based approaches, on the other hand, can be categorized into linear subspace methods, neural networks, and statistical approaches.
Face recognition has advanced steadily over the past twenty years (Zhao et al., 2003), and each system has its own solution. At present, research on face recognition relies on features that are robust enough to represent different human faces. A classification method called the nearest feature line (NFL) was proposed for face recognition (Li et al., 1999). The derived feature line (FL) can capture more variations of face images than the original feature points do, and it thus expands the capacity of an available face image database. However, if the database contains a lot of face images, the recognition accuracy decreases. Two simple but general strategies were developed and compared on a common face image database (Brunelli & Poggio, 1993); the first one is based on the computation of a set of geometrical features, such as nose width and length, mouth position, and chin shape, and the second one is based on almost-grey-level template matching. Nevertheless, the geometrical characteristics change under different lighting conditions. In the literature (Er et al., 2002), the researchers combined the techniques of principal component analysis (PCA), Fisher’s linear discriminant, and radial basis functions to increase the correct rate of face recognition. However, their approach cannot be carried out in real time.
On the whole, the methods of tracking objects can be categorized into three kinds: match tracking, predictive tracking, and energy functions (Fan et al., 2002). Match tracking has to detect moving objects in the entire image. Although its accuracy is considerable, it is very time-consuming and its performance is hard to improve. Energy functions are often adopted in snake models, where they are helpful for contour tracking. In general, there are two methods of predictive tracking: the Kalman filter and the particle filter. The Kalman filter works well when objects move along linear paths, but it is not appropriate for non-linear and non-Gaussian movements of objects. By contrast, the particle filter performs well on non-linear and non-Gaussian problems. A research team completed a face tracking system that exploits a simple linear Kalman filter (An et al., 2005). They used a small number of critical rectangle features, selected and trained by an AdaBoost learning algorithm, to detect the initial position, size, and incline angle of a human face. Once a human face is reliably detected, they extract the colour distributions of the face and the upper body from the detected facial and upper body regions and build a robust colour model for each by means of k-means clustering and multiple Gaussian models. Fast and efficient multi-view face tracking is then executed using several critical features and a simple linear Kalman filter. However, two critical problems, changes in lighting conditions and the choice of the number of clusters in the k-means clustering, are not yet well solved. In addition to the Kalman filter, some real-time face tracking systems based on particle filtering techniques were proposed (Fahn et al., 2009; Fahn & Lin, 2010; Fahn & Lo, 2010). The researchers utilized a particle filter to localize human faces in image sequences. Since they also considered the hair colour information of a human head, tracking continues even if the person turns his or her back to the camera.
In this chapter, we present an automatic real-time multiple-face recognition and tracking system installed on a person following robot, which comprises face detection, face recognition, and face tracking procedures. To detect human faces quickly and accurately, an AdaBoost algorithm is used to train a strong classifier for face detection (Viola & Jones, 2001). For face recognition, we modify the discriminative common vectors (DCVs) algorithm (Cevikalp & Wilkes, 2004; Cevikalp et al., 2005; Gulmezoglu et al., 2001) and employ the minimum Euclidean distance to measure the similarity between a detected face image and a candidate one. To raise the confidence level, the most likely person is determined by majority voting over ten successive recognition results from a face image sequence. The recognition results are then assigned to two classes: “master” and “stranger.” In our system, the robot tracks the master continuously unless the power is turned off. In the face tracking procedure, we propose an improved particle filter to dynamically locate multiple faces. According to the position of the target in an image, we issue a series of commands (moving forward, turning left, or turning right) to drive the wheel motors of the robot. We also evaluate the distance between the master and the robot by means of a laser range finder and issue a further set of commands (stopping or moving backward) until the robot settles at a suitable distance in front of the master.
2. Hardware description
The following describes the hardware system of our experimental robot, whose frame is 40 cm long, 40 cm wide, and 130 cm high, as Fig. 1 shows. The robot has three wheels: the motors of the two front wheels drive the robot forward and backward and turn it left and right, whereas the unpowered rear wheel only supports the robot. By controlling the rotation directions of the two front wheels, we can change the moving direction of the robot. To help the robot keep its balance, a ball caster is fitted at the rear.
Fig. 2 illustrates the hardware architecture of our experimental robot. The industrial PC receives the distance data of the laser range finder via RS-232. In the camera system, the images are acquired through a USB 2.0 video grabber. After processing the captured images and receiving the distance data from the laser range finder, the program sends commands through RS-232 to the motor system. The H8/3694F microprocessor is linked to the micro serial servo controller, which then drives the left and right motors of the front wheels.
3. Face detection
Face detection is a crucial step of the human face recognition and tracking system. To enable the system to extract facial features efficiently and accurately, we must first narrow the search range for faces, and the face detection method itself must perform well. To detect human faces quickly and accurately, we use an AdaBoost (Adaptive Boosting) algorithm to build a strong classifier from simple rectangle features computed on an integral image (Viola & Jones, 2001). Because the integral image yields the pixel sum of any rectangle from a few table look-ups, the time needed to evaluate a rectangle feature is constant regardless of its size. The selected rectangle features are fed to the strong classifier, which distinguishes positive from negative images. Although the AdaBoost algorithm spends a lot of time on training the strong classifier, the resulting face detector attains high performance overall. Fig. 3 depicts the boosting learning algorithm that we adopt for face detection.
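The constant-time property mentioned above can be illustrated with a minimal sketch of an integral image. The function names (`integral_image`, `rect_sum`, `two_rect_feature`) and the plain nested-list image format are assumptions made for this illustration; they are not taken from the system described in this chapter.

```python
def integral_image(img):
    """Build a (h+1) x (w+1) summed-area table for a 2-D list of pixel values."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Sum of all pixels above and to the left of (x, y), inclusive.
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Pixel sum of the rectangle with top-left corner (x, y): four look-ups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Illustrative two-rectangle feature: left half minus right half."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)
```

Whatever the rectangle size, `rect_sum` touches exactly four table entries, which is why the feature evaluation cost does not grow with the feature size.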
The classifiers trained with the AdaBoost algorithm are arranged one after another in a cascade, so that each weak classifier has its own position in the order. A sub-window that receives a positive result is passed on to the next weak classifier, whereas a negative result at any point leads to the immediate rejection of the corresponding sub-window. We first feed the rectangle features of the different sub-windows into the first weak classifier, which has the minimum detection error rate and discards a large number of negative examples. The remaining sub-windows are more difficult to reject and are processed further by the successive weak classifiers. This operation is repeated until the last weak classifier has been applied, and the sub-windows that remain are regarded as face images.
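The early-rejection behaviour of the cascade can be sketched as follows. This is only an illustration under stated assumptions: each stage is modelled as a list of threshold classifiers `(feature_index, threshold, polarity, alpha)` with a stage threshold, in the style of Viola & Jones (2001); the function name `cascade_accepts` and the data layout are hypothetical.

```python
def cascade_accepts(feature_values, stages):
    """Return True only if the sub-window passes every stage of the cascade.

    feature_values: rectangle-feature values computed for one sub-window.
    stages: list of (weak_classifiers, stage_threshold), where each weak
            classifier is a tuple (feature_index, threshold, polarity, alpha).
    """
    for weak_classifiers, stage_threshold in stages:
        score = 0.0
        for idx, theta, polarity, alpha in weak_classifiers:
            # A weak classifier votes "face" when the feature lies on the
            # face side of its threshold; alpha is its boosting weight.
            if polarity * feature_values[idx] < polarity * theta:
                score += alpha
        if score < stage_threshold:
            return False  # rejected immediately; later stages are skipped
    return True
```

Because most sub-windows are rejected by the first stages, the later and more expensive stages run on only a small fraction of the image.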
In order to take advantage of the integral image, we adopt an approach similar to processing a pyramid of images. That is, a sub-window constituting the detector scans the input image at many scales. Starting from the input image, each subsequent image is smaller than the previous one by a factor of 1.25, and a fixed-scale detector is then employed to scan each of these images. When the detector scans an image, subsequent locations are obtained by shifting the window by a number of pixels that depends on the current scale. Note that the choice of this shift affects both the speed of the detector and the accuracy of the detection. Detection results are clustered into a class to determine the representative position of the target by comparing the distance between the centre of a detection result and the centre of the class with a given threshold. Like the shift mentioned above, the threshold changes with the current scale. If several detection results overlap with each other, the final region is computed as the average of the positions of the overlapping results.
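The multi-scale scan can be sketched in a few lines. The factor 1.25 comes from the text; the window size, the step, and the function names `pyramid_sizes` and `window_positions` are assumptions chosen for illustration only.

```python
def pyramid_sizes(width, height, window, factor=1.25):
    """Image sizes of the pyramid, shrinking by `factor` until the window no longer fits."""
    sizes = []
    w, h = float(width), float(height)
    while w >= window and h >= window:
        sizes.append((int(w), int(h)))
        w /= factor
        h /= factor
    return sizes


def window_positions(width, height, window, step):
    """Top-left corners visited by a fixed-size window shifted by `step` pixels."""
    return [(x, y)
            for y in range(0, height - window + 1, step)
            for x in range(0, width - window + 1, step)]
```

A larger `step` reduces the number of sub-windows to classify and therefore speeds up detection, at the cost of possibly missing faces that fall between two window positions.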
4. Face recognition
The face detection procedure yields face images that are then fed to the face recognition procedure. To begin with, we normalize the images so that all face images have the same size. The intensity of the images is also adjusted to reduce the effect of lighting. After the size normalization and intensity adjustment, we perform the feature extraction process to obtain the feature vector of a face image. The idea of common vectors was originally introduced for isolated word recognition problems in the case that the number of samples in each class is less than or equal to the dimensionality of the sample space. The approaches to solving these problems extract the common properties of classes in the training set by eliminating the differences of the samples in each class. A common vector for each individual class is obtained by removing all the features that are in the directions of the eigenvectors corresponding to the nonzero eigenvalues of the scatter matrix of its own class. The common vectors are then used for pattern recognition. In our case, instead of employing a given class’s own scatter matrix, we exploit the within-class scatter matrix of all classes to get the common vectors. We also present an alternative algorithm based on the subspace method and the Gram-Schmidt orthogonalization procedure to acquire the common vectors. A new set of vectors derived from the common vectors, called the discriminative common vectors (DCVs), is then used for classification. What follows elaborates the algorithms for obtaining the common vectors and the discriminative common vectors (Gulmezoglu et al., 2001).
4.1. The DCV algorithm
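Let the training set be composed of $C$ classes, where each class contains $N$ samples, and let $x_m^i$ be a $d$-dimensional column vector which denotes the $m$-th sample from the $i$-th class. There will be a total of $M = NC$ samples in the training set. Supposing that $d > M - C$, three scatter matrices, $S_W$, $S_B$, and $S_T$, are respectively defined below:
$$S_W = \sum_{i=1}^{C}\sum_{m=1}^{N}(x_m^i-\mu_i)(x_m^i-\mu_i)^T \tag{1}$$
$$S_B = \sum_{i=1}^{C} N(\mu_i-\mu)(\mu_i-\mu)^T \tag{2}$$
$$S_T = \sum_{i=1}^{C}\sum_{m=1}^{N}(x_m^i-\mu)(x_m^i-\mu)^T = S_W + S_B \tag{3}$$
where $\mu$ is the mean of all samples and $\mu_i$ is the mean of samples in the $i$-th class.
In the special case where $w^T S_W w = 0$ and $w^T S_B w \neq 0$ for all $w \in \mathbb{R}^d \setminus \{0\}$, the modified Fisher’s linear discriminant criterion attains a maximum. However, a projection vector $w$ satisfying the above conditions does not necessarily maximize the between-class scatter. The following is a better criterion (Bing et al., 2002; Turk & Pentland, 1991):
$$J(W_{opt}) = \arg\max_{|W^T S_W W| = 0} |W^T S_B W| = \arg\max_{|W^T S_W W| = 0} |W^T S_T W| \tag{4}$$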
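To find the optimal projection vectors $w$ in the null space of $S_W$, we project the face samples onto the null space of $S_W$ and then obtain the projection vectors by performing principal component analysis (PCA). To accomplish this, the vectors that span the null space of $S_W$ must first be computed. Nevertheless, this task is computationally intractable since the dimension of this null space is probably very large. A more efficient way of realizing this task is to resort to the orthogonal complement of the null space of $S_W$, which is typically a significantly lower-dimensional space.
Let $\mathbb{R}^d$ be the original sample space, $V$ be the range space of $S_W$, and $V^{\perp}$ be the null space of $S_W$. Equivalently,
$$V = \mathrm{span}\{\alpha_k \mid S_W \alpha_k \neq 0,\ k = 1, \ldots, r\} \tag{5}$$
$$V^{\perp} = \mathrm{span}\{\alpha_k \mid S_W \alpha_k = 0,\ k = r+1, \ldots, d\} \tag{6}$$
where $r < d$ is the rank of $S_W$, $\{\alpha_1, \ldots, \alpha_d\}$ is an orthonormal set, and $\{\alpha_1, \ldots, \alpha_r\}$ is the set of orthonormal eigenvectors corresponding to the nonzero eigenvalues of $S_W$.
Consider the matrices $Q = [\alpha_1 \ \ldots \ \alpha_r]$ and $\bar{Q} = [\alpha_{r+1} \ \ldots \ \alpha_d]$. Since $\mathbb{R}^d = V \oplus V^{\perp}$, every face image $x_m^i \in \mathbb{R}^d$ has a unique decomposition of the following form
$$x_m^i = y_m^i + z_m^i \tag{7}$$
where $y_m^i = P x_m^i = Q Q^T x_m^i \in V$, $z_m^i = \bar{P} x_m^i = \bar{Q}\bar{Q}^T x_m^i \in V^{\perp}$, and $P$ and $\bar{P}$ are the orthogonal projection operators onto $V$ and $V^{\perp}$, respectively. Our goal is to compute
$$z_m^i = x_m^i - y_m^i = x_m^i - P x_m^i \tag{8}$$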
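To do this, we need to find a basis of $V$, which can be accomplished by an eigenanalysis of $S_W$. In particular, the normalized eigenvectors $\alpha_k$ corresponding to the nonzero eigenvalues of $S_W$ will be an orthonormal basis of $V$. The eigenvectors can be obtained by calculating the eigenvectors of the smaller $M \times M$ matrix $A^T A$, defined such that $S_W = A A^T$, where $A$ is a $d \times M$ matrix of the form depicted below:
$$A = [\,x_1^1 - \mu_1 \ \ \ldots \ \ x_N^1 - \mu_1 \ \ x_1^2 - \mu_2 \ \ \ldots \ \ x_N^C - \mu_C\,] \tag{9}$$
Let $\lambda_k$ and $v_k$ be the $k$-th nonzero eigenvalue and the corresponding eigenvector of $A^T A$, where $k \leq M - C$. Then $\alpha_k = A v_k$, after normalization to unit length, will be the orthonormal eigenvector that corresponds to the $k$-th nonzero eigenvalue of $S_W$. The sought-for projection onto $V^{\perp}$ is achieved using Eq. (8). In this manner, it turns out that we obtain the same unique vector for all samples of a class as follows:
$$x_{com}^i = x_m^i - Q Q^T x_m^i = \bar{Q}\bar{Q}^T x_m^i, \quad m = 1, \ldots, N, \quad i = 1, \ldots, C \tag{10}$$
That is, the vector on the right-hand side of Eq. (10) is independent of the sample index $m$. We refer to the vectors $x_{com}^i$ as the common vectors.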
This property means that it is enough to project a single sample from each class, which greatly reduces the computational load. After acquiring the common vectors $x_{com}^i$, the optimal projection vectors will be those that maximize the total scatter of the common vectors:
$$J(W_{opt}) = \arg\max_{|W^T S_W W| = 0} |W^T S_B W| = \arg\max_{|W^T S_W W| = 0} |W^T S_T W| = \arg\max_{W} |W^T S_{com} W| \tag{11}$$
where $W$ is a matrix whose columns are the orthonormal optimal projection vectors $w_k$, and $S_{com}$ is the scatter matrix of the common vectors:
$$S_{com} = \sum_{i=1}^{C}(x_{com}^i - \mu_{com})(x_{com}^i - \mu_{com})^T \tag{12}$$
where $\mu_{com}$ is the mean of all common vectors:
$$\mu_{com} = \frac{1}{C}\sum_{i=1}^{C} x_{com}^i \tag{13}$$
In this case, the optimal projection vectors $w_k$ can be found by an eigenanalysis of $S_{com}$. Particularly, all eigenvectors corresponding to the nonzero eigenvalues of $S_{com}$ will be the optimal projection vectors. $S_{com}$ is typically a large $d \times d$ matrix. Thus, we can use the smaller $C \times C$ matrix $A_{com}^T A_{com}$ to find the nonzero eigenvalues and the corresponding eigenvectors of $S_{com} = A_{com} A_{com}^T$, where $A_{com}$ is the $d \times C$ matrix of the form expressed as
$$A_{com} = [\,x_{com}^1 - \mu_{com} \ \ \ldots \ \ x_{com}^C - \mu_{com}\,] \tag{14}$$
There will be $C - 1$ optimal projection vectors since the rank of $S_{com}$ is $C - 1$ if all common vectors are linearly independent. If two common vectors are identical, then the two classes represented by this vector cannot be distinguished from each other. Since the optimal projection vectors $w_k$ belong to the null space of $S_W$, it follows that when the image samples $x_m^i$ of the $i$-th class are projected onto the linear span of the projection vectors $w_k$, the feature vector $\Omega_i = [\langle x_m^i, w_1 \rangle \ \ldots \ \langle x_m^i, w_{C-1} \rangle]^T$ of the projection coefficients will also be independent of the sample index $m$. Therefore, we have
$$\Omega_i = W^T x_m^i, \quad m = 1, \ldots, N, \quad i = 1, \ldots, C \tag{15}$$
We call the feature vectors $\Omega_i$ discriminative common vectors, and they will be used for the classification of face images. To recognize a test image $x_{test}$, its feature vector is found by
$$\Omega_{test} = W^T x_{test} \tag{16}$$
which is then compared with the discriminative common vector $\Omega_i$ of each class using the Euclidean distance. The discriminative common vector found to be the closest to $\Omega_{test}$ is adopted to identify the test image.
Since $\Omega_{test}$ is only compared to a single vector for each class, the classification is very efficient for real-time face recognition tasks. In the Eigenface, Fisherface, and Direct-LDA methods, the test sample feature vector is typically compared to the feature vectors of all samples in the training set, which makes these methods impractical for real-time applications with large training sets. These methods must also deal with the small sample size problem in other ways (Chen et al., 2000; Huang et al., 2002).
The above method can be summarized as follows:
Step 1. Compute the nonzero eigenvalues and their corresponding eigenvectors of $S_W$ using the matrix $A^T A$, where $S_W = A A^T$ and $A$ is given by Eq. (9). Set the matrix $Q = [\alpha_1 \ \ldots \ \alpha_r]$, where $\alpha_1, \ldots, \alpha_r$ are the orthonormal eigenvectors of $S_W$ and $r$ is its rank.
Step 2. Choose any sample from each class and project it onto the null space of $S_W$ to obtain the common vectors:
$$x_{com}^i = x_m^i - Q Q^T x_m^i, \quad m = 1, \ldots, N, \quad i = 1, \ldots, C \tag{17}$$
Step 3. Compute the eigenvectors $w_k$ of $S_{com}$, associated with the nonzero eigenvalues, by use of the matrix $A_{com}^T A_{com}$, where $S_{com} = A_{com} A_{com}^T$ and $A_{com}$ is given in Eq. (14). There are at most $C - 1$ eigenvectors corresponding to the nonzero eigenvalues; they constitute the projection matrix $W$, which will be employed to obtain the feature vectors as shown in Eqs. (15) and (16).
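The three steps summarized above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used on the robot: the function name `train_dcv`, the input layout (one array of shape samples x dimension per class), and the relative tolerance `tol` for deciding which eigenvalues count as nonzero are all hypothetical choices.

```python
import numpy as np


def train_dcv(classes, tol=1e-10):
    """Return (W, omegas) for a list of per-class sample arrays of shape (N_i, d).

    W has one column per optimal projection vector; omegas[i] is the
    discriminative common vector of class i.
    """
    means = [X.mean(axis=0) for X in classes]
    # Step 1: columns of A are the within-class differences (d x M).
    A = np.hstack([(X - m).T for X, m in zip(classes, means)])
    # Eigen-decompose the small M x M matrix A^T A instead of S_W = A A^T.
    evals, evecs = np.linalg.eigh(A.T @ A)
    keep = evals > tol * evals.max()
    Q = A @ evecs[:, keep]
    Q /= np.linalg.norm(Q, axis=0)  # orthonormal basis of the range of S_W
    # Step 2: project one sample per class onto the null space of S_W.
    x_com = np.array([X[0] - Q @ (Q.T @ X[0]) for X in classes])  # C x d
    # Step 3: eigenanalysis of S_com through the small C x C matrix.
    A_com = (x_com - x_com.mean(axis=0)).T  # d x C
    evals_c, evecs_c = np.linalg.eigh(A_com.T @ A_com)
    keep_c = evals_c > tol * evals_c.max()
    W = A_com @ evecs_c[:, keep_c]
    W /= np.linalg.norm(W, axis=0)
    omegas = x_com @ W  # row i is the feature vector of class i
    return W, omegas
```

A test image `x_test` would then be mapped to `W.T @ x_test` and compared with each row of `omegas`. With more dimensions than training samples, every training sample of a class projects to the same row, which is the property the common vectors are named for.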
4.2. Similarity measurement
Many methods of similarity measurement exist. In order to implement face recognition on the robot in real time, we select the Euclidean distance to measure the similarity of a face image to those in the face database after feature extraction with the aid of the DCV transformations.
Let the training set be composed of $C$ classes, where each class contains $N$ samples, and let $x_m^i$ be a column vector which denotes the $m$-th sample from the $i$-th class. If $\Omega_{test}$ is the feature vector of a test image, the similarity of the test image and a given sample is defined as
And the Euclidean distance between the feature vectors of the test image and the $i$-th class is
After the test image has been compared with the feature vectors of all training models, we can find the class whose Euclidean distance is the minimum. In other words, the test image is identified as the most probable person. Note that a threshold range is prescribed for every class to reject a person who does not belong to the face database, as Eq. (20) shows:
5. Face tracking
Our detection method can effectively find face regions, and the recognition method can identify the master among the people in front of the robot. The robot then tracks the master using our face tracking system. We can also exchange the roles of the master and a stranger, in which case the robot resumes its tracking to follow the stranger. Many tracking algorithms have been proposed to date. Of them, the Kalman filter and the particle filter are applied extensively. The former is often used to predict the state sequence of a target under linear models, and it is not suitable for the non-linear and non-Gaussian movement of objects. The latter, however, is based on the Monte Carlo integration method and is suited to non-linear predictions. Considering that the movements of a human body are non-linear, our system uses a particle filter to realize the tracker (Foresti et al., 2003; Montemerlo et al., 2002; Nummiaro et al., 2003). The key idea of this technique is to represent probability densities by sets of samples. Its structure is divided into four major parts: probability distribution, dynamic model, measurement, and factored sampling.
5.1. Particle filter
Each interested target is defined before the face tracking procedure operates. The feature vector of each interested face, enclosed by an elliptical window, in a given frame is denoted as follows:
where the components are the centre of the elliptical window in the frame (i.e., at the corresponding time step), the minor and major axes of the ellipse, and the person’s identity assigned from the face recognition result.
According to the above target parameters, the tracking system builds some face models and their associated candidate models, each of which is represented by a particle. Herein, we select the Bhattacharyya coefficient to measure the similarity of two discrete distributions resulting from the respective cues of a face model and its candidate models existing in two consecutive frames.
Three kinds of cues are considered in the observation model: the colour cue, the edge cue, and the motion cue. The colour cue employed in the observation model is still generated in the way presented in the literature cited above, rather than by modelling the target with a colour histogram. The raw image of each frame is first converted to a colour probability distribution image via the face colour model, and the samples are then measured on the colour probability distribution image. In this manner, the face colour model is adapted to the interested regions found by the face detection procedure, which lets it handle frame-by-frame changes of the face regions due to variable illumination. The Bhattacharyya coefficient is used to calculate the similarity defined as
where the two distributions compared are the discrete distributions of the colour cues of the face model and of its candidate model at two consecutive time steps.
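For reference, the standard definition of the Bhattacharyya coefficient between two discrete distributions can be sketched as below. The helper names `normalize` and `bhattacharyya`, and the representation of a cue as a list of bin counts, are assumptions for illustration; how the chapter's system bins the colour probability image is not specified here.

```python
import math


def normalize(hist):
    """Scale non-negative bin counts so that they sum to one."""
    total = sum(hist)
    return [h / total for h in hist]


def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two discrete distributions of equal length.

    The value is 1 for identical distributions and 0 when they do not overlap.
    """
    return sum(math.sqrt(pu * qu) for pu, qu in zip(p, q))
```

A candidate model whose distribution yields a coefficient closer to 1 is more similar to the face model and would receive a larger weight.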
In comparison to our earlier proposed methods (Fahn et al., 2009; Fahn & Lin, 2010; Fahn & Lo, 2010), the edge cue is newly introduced. Edges are a basic feature in images, and they are less affected by illumination than the colour model is. First, the Sobel filter is applied to obtain all the edges in each frame, and the Bhattacharyya coefficient is then employed as the similarity measurement, just as for the colour model:
where the two distributions compared are the discrete distributions of the edge cues of the face model and of its candidate model at two consecutive time steps.
As for the motion feature, our tracking system records the centre positions of interested objects in a number of previous consecutive frames, say twenty. Both the distance and direction are then computed in the face tracking procedure for each interested object acting as a particle. At each time step, the average moving distance of the particle (i.e., a candidate model), which refers to its average centre position in the horizontal and vertical directions over the previous frames, is expressed below:
The motion has two states: acceleration and deceleration. If the interested face accelerates, the distance between the current centre position and the average centre position will be larger than the average moving distance. Conversely, if the interested face decelerates, the distance between the current centre position and the average centre position will be less than the average moving distance.
In addition to the distance of face motion, the direction of face motion is another important factor. The angle symbolizing the direction of the particle at a given time step is expressed as
It also depends on the average centre positions of the particle in the previous frames. The full range of angles is partitioned into four orientations of equal width. In order to suppress the effect of an irregular face motion trajectory, we take the most probable direction by majority voting on the number of times each orientation occurs. Thus, the likelihood of face motion at a given time step is determined by the following equation:
where the quantities involved are the distance between the centre position in the current frame and the average centre position in the previous frames, the average moving distance given in Eq. (26), and the maximum frequency of the four orientations appearing in the previous frames. Therefore, the likelihood of the motion feature combines the measurements of both the moving distance and the moving direction.
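The ingredients of the motion cue described above can be sketched as follows. Several points are assumptions made for this illustration only: the direction is computed with `atan2` between consecutive centre positions, the four orientations are taken as 90-degree sectors starting at 0 degrees, and the function names are hypothetical. The exact forms of the chapter's distance, angle, and likelihood equations are not reproduced here.

```python
import math
from collections import Counter


def average_step(centres):
    """Average moving distance between consecutive centre positions."""
    steps = [math.dist(a, b) for a, b in zip(centres, centres[1:])]
    return sum(steps) / len(steps)


def orientation(dx, dy):
    """Quantize a displacement into one of four 90-degree orientations (0 to 3)."""
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    return min(int(angle // 90), 3)  # guard against rounding up to 360


def dominant_orientation(centres):
    """Majority vote over the orientations seen in the recorded frames.

    Returns (orientation, frequency) for the most frequent orientation.
    """
    votes = Counter(orientation(x1 - x0, y1 - y0)
                    for (x0, y0), (x1, y1) in zip(centres, centres[1:]))
    return votes.most_common(1)[0]
```

Voting over many frames means that a single erratic displacement does not change the dominant orientation, which is the purpose of the majority vote in the text.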
When human faces occlude one another, the motion cue stands out and becomes more reliable than the other cues. Conversely, when no occlusion happens, we can weigh particles mainly by the observation of the colour and edge cues. As a result, we take a linear combination of colour, edge, and motion cues as follows:
where the coefficients weight the three cues. Fig. 4 shows the iteration procedure for tracking the interested face using our improved particle filter.
After the current filtering process is finished, we execute the next filtering process, and so on until a local search is achieved. Such a method tracks faces steadily even if humans move very fast. In the first filtering process, we establish thirty particles to track faces. In the subsequent filtering processes, however, only ten particles are set up to facilitate the face tracking operation.
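The weighting and resampling steps of a particle filter with combined cues can be sketched as below. The cue weights `(0.4, 0.4, 0.2)`, the requirement that they sum to one, and the function names are assumptions for illustration; the chapter's actual coefficients and sampling scheme are not given in the text. The particle counts 30 and 10 mentioned above would simply be the lengths of the particle lists.

```python
import random


def combined_likelihood(colour, edge, motion, weights=(0.4, 0.4, 0.2)):
    """Linear combination of the three cue likelihoods (weights are assumed)."""
    a, b, c = weights
    if abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("cue weights are assumed to sum to one")
    return a * colour + b * edge + c * motion


def estimate(particles, weights):
    """Weighted mean state of the particle set."""
    total = sum(weights)
    dims = range(len(particles[0]))
    return tuple(sum(w * p[i] for p, w in zip(particles, weights)) / total
                 for i in dims)


def resample(particles, weights, rng=random):
    """Factored sampling: draw particles with probability proportional to weight."""
    return rng.choices(particles, weights=weights, k=len(particles))
```

In an occlusion handling mode, one would expect the motion weight to be raised relative to the colour and edge weights, in line with the reasoning given above.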
5.2. Occlusion handling
Occlusion handling is a major problem in visual surveillance (Foresti et al., 2003). The occlusion problem usually occurs in a multi-target system. When multiple moving objects occlude each other, both the colour and edge information are less reliable than the velocities and directions of the moving objects. In order to solve the occlusion problem, we have improved the strategies of the filtering process and modified the weights of the particles described in the previous subsection. Compared with the earlier proposed methods (Fahn et al., 2009; Fahn & Lin, 2010; Fahn & Lo, 2010), the major improvement is that the directions of moving objects in a number of past frames are incorporated into the motion cue. We also check whether the occlusion has ceased to exist. If only one human face is detected in the face detection procedure, no occlusion can occur. If two or more faces are detected, however, occlusion occurs when moving faces come close to each other, and the system enters the occlusion handling mode.
5.3. Robot control
We can predict the newest possible position of a human face by means of the particle filtering technique. According to this position, we issue a control command to make the robot track the face continuously. Such robot control comprises two parts: PTZ camera control and wheel motor control. The following is the command set of robot control described at time step t, where the first two commands belong to the PTZ camera control and the remaining ones are used for the wheel motor control:
5.3.1. PTZ camera control
The principal task of the PTZ camera control is to keep the face in the centre of the screen. It prevents the target from drifting to the upper or lower side of the screen. If the vertical coordinate of the face centre is lower than or equal to 50, we assign one PTZ command to the camera at time step t; if it is greater than or equal to 190, we assign the other PTZ command to the camera at time step t, as shown in Eq. (31):
where the vertical coordinate is that of the centre of the elliptic window in the y-direction at time step t.
5.3.2. Wheel motor control
The main task of the wheel motor control is to direct the robot to track the face continuously. We utilize two kinds of information to determine which command is issued. The first is the position of the face on the screen, according to which we assign the commands that activate the wheel motors and make the robot move appropriately. If the horizontal coordinate of the face centre is lower than or equal to 100, we assign one turning command to the wheel motors at time step t; if it is greater than or equal to 220, we assign the opposite turning command to the wheel motors at time step t, as Eq. (32) shows:
where the horizontal coordinate is that of the centre of the elliptic window in the x-direction at time step t.
The second kind of information is the laser range data. We mount a laser range finder on the front of the robot to measure the distance between the robot and a target, which is allowed to lie within a prescribed range of incline angles. In accordance with the distance, we assign commands to the wheel motors to control the movement of the robot. If the distance is lower than or equal to 60, one command is assigned to the wheel motors at time step t; if it is greater than or equal to 120, a second command is assigned; otherwise, a third command is assigned, as stated in Eq. (33):
By controlling the robot in this way, we increase its ability to interact with people. If the robot can recognize a person’s identity and follow him or her persistently, the relation between the robot and the human being becomes closer, and the robot can be put to more extensive use. We therefore control the movement of the robot according to the state changes between the face tracking and face recognition modes. After face detection and target confirmation, the system switches to the recognition mode and starts the robot on a certain task.
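The threshold rules of Sections 5.3.1 and 5.3.2 can be sketched as simple decision functions. The numeric thresholds (50, 190, 100, 220, 60, 120) come from the text, but the command names and the mapping of each threshold to a particular command are assumptions made for illustration: the text does not state which command is issued in each case, and the unit of the laser distance is not given.

```python
def ptz_command(centre_y, upper=50, lower=190):
    """PTZ tilt decision from the vertical face centre (command names assumed)."""
    if centre_y <= upper:
        return "tilt_up"      # assumption: small y means the face is near the top
    if centre_y >= lower:
        return "tilt_down"
    return None               # face is vertically centred; no PTZ command


def steering_command(centre_x, left=100, right=220):
    """Wheel steering decision from the horizontal face centre (names assumed)."""
    if centre_x <= left:
        return "turn_left"
    if centre_x >= right:
        return "turn_right"
    return None


def range_command(distance, near=60, far=120):
    """Wheel drive decision from the laser distance (mapping assumed)."""
    if distance <= near:
        return "backward"     # assumption: too close, so back away
    if distance >= far:
        return "forward"      # assumption: too far, so approach
    return "stop"
```

Keeping a dead band between the two thresholds in each rule avoids issuing commands for small fluctuations of the face position or the measured distance.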
6. Experimental results
Table 1 lists both the hardware and software used for the development of the real-time face tracking and recognition system installed on our robot. All the experimental results presented in this section were obtained by averaging the outcome data over 10 different runs to demonstrate the effectiveness of the proposed methods.
6.1. Face detection
The training database for face detection consists of 861 face images segmented by hand and 4,320 non-face images selected by a random process from a set of photographs taken outdoors. These face and non-face images all have the same size in pixels. Fig. 5 illustrates some examples of positive and negative images used for training. Notice that each positive image contains a face region extending from beneath the eyebrows to beyond the halfway point between the mouth and the chin.
We perform the experiments on face detection with two different kinds of image sequences: ordinary and jumble. An image sequence is regarded as a jumble if the background in the sample video is cluttered; that is, the environment contains many skin-coloured pixels and/or the illumination is not appropriate. We define the “error rate” as the probability of detecting human faces incorrectly; for example, regarding a non-face region as a human face. The “miss rate” is the probability that a target appears in the frame but is not detected by the system. Table 2 shows the face detection rates for the above experiments.
From the experimental results, we can observe that the ordinary image sequences have better correct rates and lower error rates, because they encounter less interference. The performance on the jumble image sequences is inferior to that on the ordinary ones. The main influence comes from differences in luminance, whereas the effect of the colour factor is comparatively slight, even when many skin-like regions exist in the background. Note that the error rate of the jumble image sequences is larger than that of the ordinary ones because our database has a small number of negative examples, which raises the error rate. Fig. 6 demonstrates some face detection results from using our real-time face detection system in different environments.
6.2. Face recognition
Our face image database comprises 5 subjects to be recognized, and each of them has 20 samples. The nicknames are Craig, Eric, Keng-Li, Jane, and Eva (3 males and 2 females) respectively. The facial expressions (open or closed eyes and smiling or non-smiling) are also varied. The images of one subject included in the database are shown in Fig. 7.
We have our robot execute face recognition in real time using the DCV method. After comparing the feature vector of the captured and normalized face image with those of all training models, we find the class with the minimum Euclidean distance; in other words, the face is identified as the most likely person. Table 3 presents the range of thresholds for each class, which is used to eliminate the corresponding person from the possible subjects included in the database. If the feature distance of a captured and normalized face image is lower than or equal to the threshold value of its nearest class, the corresponding person is assigned to that class; otherwise the person is eliminated from the possible subjects. Over every period of 10 frames, we rank the frequency of assignments for each class to determine the recognition result. Fig. 8 shows some face recognition results obtained with our real-time face recognition system installed on the robot.
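The nearest-class decision, the per-class threshold, and the 10-frame ranking described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: it assumes each class is represented by a single feature vector, and the names `class_models`, `thresholds`, `classify_frame`, `vote`, and the label `"stranger"` are hypothetical. The DCV feature extraction itself is not shown.

```python
import math
from collections import Counter


def nearest_class(feature, class_models):
    """Return (label, distance) of the class model closest to `feature`.

    `class_models` maps a label to one feature vector per class
    (an assumption made for this sketch).
    """
    best_label, best_dist = None, math.inf
    for label, model in class_models.items():
        dist = math.dist(feature, model)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist


def classify_frame(feature, class_models, thresholds, unknown="stranger"):
    """Assign one frame to its nearest class, or reject it.

    The face is accepted only if its distance to the nearest class is
    lower than or equal to that class's threshold (hypothetical values).
    """
    label, dist = nearest_class(feature, class_models)
    return label if dist <= thresholds[label] else unknown


def vote(frame_labels):
    """Return the most frequent label over one period of frames."""
    return Counter(frame_labels).most_common(1)[0][0]


if __name__ == "__main__":
    models = {"Craig": (0.0, 0.0), "Eric": (10.0, 0.0)}
    limits = {"Craig": 2.0, "Eric": 2.0}
    # Ten per-frame feature vectors (made-up numbers for illustration).
    frames = [(1.0, 0.0)] * 6 + [(9.5, 0.0)] * 3 + [(5.5, 0.0)]
    labels = [classify_frame(f, models, limits) for f in frames]
    print(vote(labels))
```

Voting over a period of frames, rather than trusting a single frame, makes the reported identity less sensitive to an occasional misclassified frame.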
6.3. Face tracking
Several different types of accuracy measures are required to evaluate the performance of tracking results. In this subsection, we utilize track cardinality measures to estimate the tracking rates. The reference data must first be determined in order to obtain any of the accuracy measures. Let $N_a$ be the total number of actual tracks in the reference data, $N_r$ be the total number of reported tracks, $N_c$ be the number of actual tracks that have corresponding reported tracks, and $N_f$ be the number of reported tracks that do not correspond to true tracks. The measures are the fraction of true tracks $P_t$, the fraction of reported tracks that are false alarms $P_f$ (reported tracks that do not correspond to true tracks), and the miss rate $P_m$. They are defined as follows:
$$P_t = \frac{N_c}{N_a}, \qquad P_f = \frac{N_f}{N_r}, \qquad P_m = 1 - P_t = \frac{N_a - N_c}{N_a}.$$
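As a worked illustration of the track cardinality measures, the following minimal sketch computes the three rates from the four counts described above. The function and parameter names are hypothetical, the numbers in the usage example are made up, and the miss rate is assumed to be the complement of the fraction of true tracks.

```python
def track_cardinality_measures(actual_tracks, reported_tracks,
                               matched_tracks, false_tracks):
    """Compute track cardinality measures from four counts.

    actual_tracks   -- total number of actual (reference) tracks
    reported_tracks -- total number of reported tracks
    matched_tracks  -- actual tracks that have a corresponding reported track
    false_tracks    -- reported tracks that match no true track
    """
    if actual_tracks <= 0 or reported_tracks <= 0:
        raise ValueError("track counts must be positive")
    true_fraction = matched_tracks / actual_tracks
    false_alarm_fraction = false_tracks / reported_tracks
    # Assumption: the miss rate is the share of actual tracks left unmatched.
    miss_rate = 1.0 - true_fraction
    return {
        "true_fraction": true_fraction,
        "false_alarm_fraction": false_alarm_fraction,
        "miss_rate": miss_rate,
    }


if __name__ == "__main__":
    # Made-up counts: 10 actual tracks, 8 reported, 7 matched, 1 false alarm.
    for name, value in track_cardinality_measures(10, 8, 7, 1).items():
        print(f"{name}: {value:.3f}")
```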
The testing image sequences are roughly classified into three kinds: a single face (Type A), multiple faces without occlusion (Type B), and multiple faces with occlusion (Type C). Each type has three different image sequences, and we perform five tests on each of them. In these experiments, we use 30 particles to track human faces in each frame in the first filtering process, followed by 10 particles in the second. We then analyze the performance of face tracking on the three kinds of testing image sequences according to the aforementioned evaluation methods. Table 4 shows the face tracking performance for these experiments.
From the experimental results, the tracking accuracy for Type C is lower than for the other two types. This is because, when occlusion happens, the particles are predicted onto the other faces, which are also skin-colour regions. On the other hand, the correct rates of face tracking for Types A and B are similar; that is, the tracking performance for a single face is almost the same as that for multiple faces without occlusion. When the robot and the target move simultaneously, some objects whose colours are close to the skin colour may exist in the background. In this situation, we must consider the speed of the target and assume that a target disappearing on the left side of the image does not immediately reappear on the right side. With such simple distance limits, the robot can track the target continuously. Moreover, at regular intervals we either abandon the currently tracked targets and detect the face regions once more, or keep tracking the wrong targets until the head moves near the resampling range of the particle filter. Enlarging the resampling range can partly alleviate this problem. However, the wider the resampling range is, the sparser the particles are. A wider range helps the robot keep track of faces, but it sometimes reduces the tracking correctness. Fig. 9 shows some face tracking results obtained with our real-time face tracking system equipped on the robot.
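The simple distance limit mentioned above can be illustrated with a small gating check: a candidate position is accepted only if it lies within a maximum displacement of the previous target position, so a skin-coloured region on the opposite side of the image is ignored. This is a minimal sketch under assumed names (`within_distance_limit`, `update_target`, `max_jump`); the pixel values are made up, and the source does not state how the limit is actually chosen.

```python
import math


def within_distance_limit(prev_center, new_center, max_jump):
    """True if the target moved no more than `max_jump` pixels in one frame."""
    return math.dist(prev_center, new_center) <= max_jump


def update_target(prev_center, candidates, max_jump):
    """Pick the nearest candidate inside the distance limit.

    If no candidate is reachable, keep the previous position instead of
    jumping to a distant skin-coloured region.
    """
    reachable = [c for c in candidates
                 if within_distance_limit(prev_center, c, max_jump)]
    if not reachable:
        return prev_center
    return min(reachable, key=lambda c: math.dist(prev_center, c))


if __name__ == "__main__":
    previous = (20.0, 100.0)          # target near the left image border
    candidates = [(300.0, 100.0),     # skin-like region on the right side
                  (28.0, 104.0)]      # plausible new position of the target
    print(update_target(previous, candidates, max_jump=40.0))
```

In practice the limit would depend on the frame rate and on the expected speed of the target relative to the robot.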
7. Conclusion and future work
In this chapter, we present a completely automatic real-time multiple-face recognition and tracking system installed on a robot. The system captures an image sequence from a PTZ camera, uses the face detection technique to locate face positions, identifies the detected faces as the master or strangers, and then tracks a target and continuously guides the robot towards it. Such a system not only allows robots to interact with human beings adequately but also makes robots react in a more human-like way.
Some future work is worth investigating to attain better performance. In the face recognition procedure, if the background is too cluttered to capture a clear foreground, the recognition rate decreases. Because most of our training samples were captured in a simple environment, static objects in even an uncomplicated background are sometimes identified as the foreground. We can add special training samples captured against cluttered backgrounds to lower the miss rate during face detection. This should also raise the face recognition accuracy, but it requires many experiments to collect special and proper training samples.