Real-time Multi-Face Recognition and Tracking Techniques Used for the Interaction between Humans and Robots

Chin-Shyurng Fahn; Chih-Hsin Wang

doi:10.5772/19589

Author Information

Chin-Shyurng Fahn*
- National Taiwan University of Science and Technology, Taipei, Taiwan, R. O. C.
Chih-Hsin Wang*
- National Taiwan University of Science and Technology, Taipei, Taiwan, R. O. C.

*Address all correspondence to:

1. Introduction

Biometric recognition systems for personal identification commonly process input data acquired from irises, voiceprints, fingerprints, signatures, human faces, and so on. The recognition of irises, voiceprints, fingerprints, and signatures belongs to the passive methods, which require a high-resolution camera or capture people's biometric information at short range. These methods are not suitable for the person-following robot we intend to develop, because they are inconvenient for users. Face recognition belongs to the active methods, for which users only need to keep a certain distance from the camera. In this chapter, face recognition is regarded as a kind of human-computer interface (HCI) applied to the interaction between humans and robots. We therefore attempt to develop an automatic real-time multiple-face recognition and tracking system mounted on the robot. The system detects human faces and confirms a target in an image sequence captured from a PTZ camera, keeps tracking the target once it has been identified as a master or a stranger, and employs a laser range finder to maintain a proper distance between the target person and the robot.

Generally, three main procedures are implemented in such a system: face detection, face recognition, and face tracking. A large number of face localization techniques have been proposed in the literature. According to a survey (Hjelmås & Low, 2001), the methods for face detection can basically be grouped into feature-based and image-based approaches. The feature-based approaches can be further divided into three areas: active shape models, feature analysis, and low-level analysis of edges, grey levels, skin colours, motions, and so forth. The image-based approaches, on the other hand, can be categorized into linear subspace methods, neural networks, and statistical approaches.

Face recognition has advanced considerably over the past twenty years (Zhao et al., 2003), and each system has its own solution. At present, research on face recognition relies on features that are robust enough to represent different human faces. A novel classification method for face recognition, called the nearest feature line (NFL), was proposed (Li et al., 1999). The derived feature line (FL) can capture more variations of face images than the original feature points do, and it thus expands the capacity of an available face image database. However, if the database contains a lot of face images, the recognition accuracy decreases. Two simple but general strategies were developed and compared on a common face image database (Brunelli & Poggio, 1993); the first is based on the computation of a set of geometrical features, such as nose width and length, mouth position, and chin shape, and the second is based on almost-grey-level template matching. Nevertheless, the geometric characteristics change under different lighting conditions. In the literature (Er et al., 2002), the researchers combined principal component analysis (PCA), Fisher’s linear discriminant, and radial basis functions to increase the correct rate of face recognition. However, their approach cannot be carried out in real time.

On the whole, object tracking methods can be categorized into three types: match tracking, predictive tracking, and energy functions (Fan et al., 2002). Match tracking has to detect moving objects in the entire image. Although its accuracy is considerable, it is very time-consuming, and its performance cannot be improved effectively. Energy functions are often adopted in snake models, where they are a great help for contour tracking. In general, there are two methods of predictive tracking: the Kalman filter and the particle filter. The Kalman filter works well when objects move along linear paths, but it is not appropriate for non-linear and non-Gaussian object movements. In contrast, the particle filter performs well for non-linear and non-Gaussian problems. A research team completed a face tracking system that exploits a simple linear Kalman filter (An et al., 2005). They used a small number of critical rectangle features, selected and trained by an AdaBoost learning algorithm, to detect the initial position, size, and incline angle of a human face correctly. Once a human face is reliably detected, they extract the colour distributions of the face and the upper body from the detected facial and upper-body regions, and create robust colour models for each by means of k-means clustering and multiple Gaussian models. Fast and efficient multi-view face tracking is then executed using several critical features and a simple linear Kalman filter. However, two critical problems, namely changes in lighting conditions and the choice of the number of clusters in k-means clustering, have not yet been solved well. In addition to the Kalman filter, some real-time face tracking systems based on particle filtering techniques have been proposed (Fahn et al., 2009; Fahn & Lin, 2010; Fahn & Lo, 2010). The researchers utilized a particle filter to localize human faces in image sequences. Since they also considered the hair colour information of a human head, the system keeps tracking even if the person has his or her back to the camera.

In this chapter, we present an automatic real-time multiple-face recognition and tracking system installed on a person-following robot, which comprises face detection, face recognition, and face tracking procedures. To identify human faces quickly and accurately, an AdaBoost algorithm is used to train a strong classifier for face detection (Viola & Jones, 2001). For face recognition, we modify the discriminative common vectors (DCVs) algorithm (Cevikalp & Wilkes, 2004; Cevikalp et al., 2005; Gulmezoglu et al., 2001) and employ the minimum Euclidean distance to measure the similarity between a detected face image and a candidate one. To raise the confidence level, the most likely person is determined by majority voting over ten successive recognition results from a face image sequence. The recognition results are then assigned to two classes: “master” and “stranger.” In our system, the robot tracks the master continuously unless the power is turned off. In the face tracking procedure, we propose an improved particle filter to dynamically locate multiple faces. According to the position of the target in an image, we issue a series of commands (moving forward, turning left, or turning right) to drive the wheel motors of the robot. We also measure the distance between the master and the robot with a laser range finder and issue a further set of commands (stop or move backward) until the robot stays at a suitable distance in front of the master.

2. Hardware description

This section describes the hardware of our experimental robot, whose frame is 40 cm long, 40 cm wide, and 130 cm high, as Fig. 1 shows. The robot has three wheels: the motors of the two front wheels drive the robot forward and backward and turn it left and right, whereas the single unpowered rear wheel only supports the robot. By controlling the turning directions of the two front wheels, we can change the moving direction of the robot. To help the robot keep its balance, we mount a ball caster at the rear.

Figure 1.
Illustration of our experimental robot: (a) the robot frame; (b) the concrete robot.

Fig. 2 illustrates the hardware architecture of our experimental robot. The industrial PC receives the distance data from the laser range finder via RS-232. In the camera system, the images are acquired through a USB 2.0 video grabber. After processing the captured images and receiving the distance data from the laser range finder, the program sends commands through RS-232 to the motor system. The H8/3694F microprocessor is linked to the micro serial servo controller, which in turn drives the left and right motors of the front wheels.

Figure 2.
Hardware architecture of our experimental robot.

3. Face detection

Face detection is a crucial step of the human face recognition and tracking system. To enable the system to extract facial features more efficiently and accurately, we must first narrow the range of face detection, and the performance of the face detection method must not be too low. In order to detect human faces quickly and accurately, we take advantage of an AdaBoost (Adaptive Boosting) algorithm to build a strong classifier from simple rectangle features computed on an integral image (Viola & Jones, 2001). Because of the integral image, the execution time is constant no matter what size of rectangle feature we use. Some rectangle features are fed to the strong classifier, which distinguishes positive images from negative ones. Although the AdaBoost algorithm spends a lot of time on training the strong classifier, the resulting face detector attains high performance overall. Fig. 3 depicts the boosting learning algorithm that we adopt for face detection.
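The constant-time property of rectangle features can be illustrated with a minimal sketch. The function names (`integral_image`, `rect_sum`, `two_rect_feature`) and the tiny test image are illustrative assumptions and are not taken from the chapter's implementation.

```python
def integral_image(img):
    """Build an integral image with one extra row and column of zeros,
    so that ii[y][x] is the sum of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle whose top-left pixel is (x, y).
    Always four lookups, whatever the rectangle size."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Horizontal two-rectangle feature: left half minus right half.
    The width w is assumed to be even."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


# Hypothetical 4x3 grey-level image used only for illustration.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
ii = integral_image(image)
feature_value = two_rect_feature(ii, 0, 0, 4, 3)
```

Once the integral image is built in a single pass, every rectangle sum costs the same four table lookups, which is why the feature size does not affect the evaluation time.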

Figure 3.
The boosting learning algorithm used in face detection.

The cascade of weak classifiers produced by the AdaBoost algorithm is constructed one classifier at a time, and each weak classifier has its own position in the order. A positive result is passed on to the next weak classifier, whereas a negative result at any point leads to the immediate rejection of the corresponding sub-window. First, we feed the rectangle features from different sub-windows into the weak classifiers. The first weak classifier has the minimum detection error rate and removes a large number of negative examples; the remaining images, which are more difficult to reject, are further processed by the successive weak classifiers. This operation is repeated until the last weak classifier has been applied, and the sub-windows that remain are regarded as face images.
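The early-rejection behaviour of the cascade can be sketched as follows. The two scoring functions and their thresholds are hypothetical stand-ins for trained classifiers; only the control flow reflects the description above.

```python
def cascade_accepts(stages, window):
    """Run a sub-window through the cascade.

    stages: list of (score_function, threshold) pairs in cascade order.
    Returns True only if every stage accepts the sub-window.
    """
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False  # rejected: later stages are never evaluated
    return True


# Hypothetical stages that record when they are called.
calls = []


def mean_intensity(window):
    calls.append("mean")
    return sum(window) / len(window)


def intensity_range(window):
    calls.append("range")
    return max(window) - min(window)


stages = [(mean_intensity, 50), (intensity_range, 30)]
accepted = cascade_accepts(stages, [60, 100, 90])
```

Because most sub-windows of an image contain no face, rejecting them at the first stage keeps the average cost per sub-window low.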

In order to take advantage of the integral image, we adopt a concept similar to processing a pyramid of images. That is, a sub-window constituting the detector scans the input image at many scales. For example, the first detector is composed of 20×20 pixels, and an image of 320×240 pixels is scanned by the detector. After that, each successive image is smaller than the previous one by a factor of 1.25. A fixed-scale detector is then employed to scan each of these images. When the detector scans an image, subsequent locations are obtained by shifting the window by some number of pixels, s×Δ, where s is the current scale. Note that the choice of Δ affects both the speed of the detector and the accuracy of the detection. Detection results are clustered to determine the representative position of the target: a detection result is assigned to a class by comparing the distance between its centre and the centre of the class with a given threshold. The threshold changes with the current scale, in the same way as the pixel shift mentioned above. If two or more detection results overlap, the final region is computed by averaging the positions of the overlapping detections.
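The multi-scale scan described above can be sketched as follows. The step of 2 pixels for Δ and the helper names are assumptions made for illustration; the 20×20 window, the 320×240 image, and the factor of 1.25 come from the text.

```python
def pyramid_sizes(width, height, window=20, factor=1.25):
    """Image sizes of the pyramid, shrinking by `factor` per level
    until the detector window no longer fits."""
    sizes = []
    w, h = float(width), float(height)
    while w >= window and h >= window:
        sizes.append((int(w), int(h)))
        w /= factor
        h /= factor
    return sizes


def window_positions(width, height, window=20, step=2):
    """Top-left corners visited by a fixed-size detector.
    `step` plays the role of the shift (an assumed value)."""
    return [(x, y)
            for y in range(0, height - window + 1, step)
            for x in range(0, width - window + 1, step)]


def merge_detections(boxes):
    """Average a group of overlapping (x, y, w, h) detections."""
    n = len(boxes)
    return tuple(sum(box[k] for box in boxes) / n for k in range(4))


sizes = pyramid_sizes(320, 240)
positions_per_level = [len(window_positions(w, h)) for w, h in sizes]
merged = merge_detections([(10, 10, 20, 20), (14, 12, 20, 20)])
```

A shift of Δ pixels in an image shrunk by the scale s corresponds to a shift of s×Δ pixels in the original image, so a larger Δ speeds up the scan at the cost of coarser localisation.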

4. Face recognition

The face detection procedure yields face images that are then fed to the face recognition procedure. To begin with, we normalize the images so that all face images have the same size. The intensity of the images is also adjusted to reduce the effect of lighting. After the size normalization and intensity adjustment, we perform the feature extraction process to obtain the feature vector of a face image. The idea of common vectors was originally introduced for isolated word recognition problems in the case that the number of samples in each class is less than or equal to the dimensionality of the sample space. The approaches to solving these problems extract the common properties of classes in the training set by eliminating the differences between the samples in each class. A common vector for each individual class is obtained by removing all the features that lie in the directions of the eigenvectors corresponding to the nonzero eigenvalues of the scatter matrix of its own class. The common vectors are then used for pattern recognition. In our case, instead of employing a given class’s own scatter matrix, we exploit the within-class scatter matrix of all classes to get the common vectors. We also present an alternative algorithm based on the subspace method and the Gram-Schmidt orthogonalization procedure to acquire the common vectors. A new set of vectors derived from the common vectors, called the discriminative common vectors (DCVs), is then used for classification. What follows elaborates the algorithms for obtaining the common vectors and the discriminative common vectors (Gulmezoglu et al., 2001).

4.1. The DCV algorithm

Let the training set be composed of $C$ classes, where each class contains $M$ samples, and let $x_m^c$ be a $d$-dimensional column vector which denotes the $m$-th sample from the $c$-th class. There will be a total of $N=MC$ samples in the training set. Supposing that $d>N-C$, three scatter matrices $S_W$, $S_B$, and $S_T$ are respectively defined below:

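$$S_W=\sum_{c=1}^{C}\sum_{m=1}^{M}(x_m^c-\bar{x}^c)(x_m^c-\bar{x}^c)^T, \tag{1}$$

$$S_B=\sum_{c=1}^{C}M(\bar{x}^c-\bar{x})(\bar{x}^c-\bar{x})^T, \tag{2}$$

and

$$S_T=\sum_{c=1}^{C}\sum_{m=1}^{M}(x_m^c-\bar{x})(x_m^c-\bar{x})^T=S_W+S_B, \tag{3}$$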

where $\bar{x}$ is the mean of all samples and $\bar{x}^c$ is the mean of the samples in the $c$-th class.

In the special case where $w^TS_Ww=0$ and $w^TS_Bw\neq 0$ for all $w\in\mathbb{R}^d\setminus\{0\}$, the modified Fisher’s linear discriminant criterion attains a maximum. However, a projection vector $w$ satisfying the above conditions does not necessarily maximize the between-class scatter. The following is a better criterion (Bing et al., 2002; Turk & Pentland, 1991):

$$J(W_{opt})=\arg\max_{|W^TS_WW|=0}|W^TS_BW|=\arg\max_{|W^TS_WW|=0}|W^TS_TW|. \tag{4}$$

To find the optimal projection vectors $w$ in the null space of $S_W$, we project the face samples onto the null space of $S_W$ and then obtain the projection vectors by performing principal component analysis (PCA). To accomplish this, the vectors that span the null space of $S_W$ must first be computed. Nevertheless, this task is computationally intractable, since the dimension of this null space is probably very large. A more efficient way of realizing this task is to work with the orthogonal complement of the null space of $S_W$, which is typically a much lower-dimensional space.

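Let $\mathbb{R}^d$ be the original sample space, $V$ be the range space of $S_W$, and $V^{\perp}$ be the null space of $S_W$. Equivalently,

$$V=\mathrm{span}\{\alpha_k \mid S_W\alpha_k\neq 0,\ k=1,2,\ldots,r\} \tag{5}$$

and

$$V^{\perp}=\mathrm{span}\{\alpha_k \mid S_W\alpha_k=0,\ k=r+1,r+2,\ldots,d\}, \tag{6}$$

where $r<d$ is the rank of $S_W$, $\{\alpha_1,\alpha_2,\ldots,\alpha_d\}$ is an orthonormal set, and $\{\alpha_1,\alpha_2,\ldots,\alpha_r\}$ is the set of orthonormal eigenvectors corresponding to the nonzero eigenvalues of $S_W$.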

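Consider the matrices $Q=[\alpha_1\ \alpha_2\ \ldots\ \alpha_r]$ and $\bar{Q}=[\alpha_{r+1}\ \alpha_{r+2}\ \ldots\ \alpha_d]$. Since $\mathbb{R}^d=V\oplus V^{\perp}$, every face image $x_m^c\in\mathbb{R}^d$ has a unique decomposition of the following form

$$x_m^c=y_m^c+z_m^c, \tag{7}$$

where $y_m^c=Px_m^c=QQ^Tx_m^c\in V$, $z_m^c=\bar{P}x_m^c=\bar{Q}\bar{Q}^Tx_m^c\in V^{\perp}$, and $P$ and $\bar{P}$ are the orthogonal projection operators onto $V$ and $V^{\perp}$, respectively. Our goal is to compute

$$z_m^c=x_m^c-y_m^c=x_m^c-Px_m^c. \tag{8}$$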

To do this, we need to find a basis of $V$, which can be accomplished by an eigenanalysis of $S_W$. In particular, the normalized eigenvectors $\alpha_k$ corresponding to the nonzero eigenvalues of $S_W$ form an orthonormal basis of $V$. These eigenvectors can be obtained by calculating the eigenvectors of the smaller $N\times N$ matrix $A^TA$, where $S_W=AA^T$ and $A$ is a $d\times N$ matrix of the form depicted below:

$$A=[x_1^1-\bar{x}^1\ \cdots\ x_M^1-\bar{x}^1\ \ x_1^2-\bar{x}^2\ \cdots\ x_M^C-\bar{x}^C]. \tag{9}$$

Let $\lambda_k$ and $v_k$ be the $k$-th nonzero eigenvalue and the corresponding eigenvector of $A^TA$, where $k\le N-C$. Then $\alpha_k=Av_k$, normalized to unit length, will be the orthonormal eigenvector that corresponds to the $k$-th nonzero eigenvalue of $S_W$. The sought-for projection onto $V^{\perp}$ is achieved using Eq. (8). In this manner, it turns out that we obtain the same unique vector for all samples of a class as follows:

$$x_{com}^c=x_m^c-QQ^Tx_m^c=\bar{Q}\bar{Q}^Tx_m^c,\quad m=1,2,\ldots,M,\ c=1,2,\ldots,C. \tag{10}$$

That is, the vector on the right-hand side of Eq. (10) is independent of the sample index $m$. We refer to the vectors $x_{com}^c$ as the common vectors.

This result shows that it is enough to project a single sample from each class, which greatly reduces the computational load. After acquiring the common vectors $x_{com}^c$, the optimal projection vectors will be those that maximize the total scatter of the common vectors:

$$J(W_{opt})=\arg\max_{W}|W^TS_{com}W|, \tag{11}$$

where $W$ is a matrix whose columns are the orthonormal optimal projection vectors $w_k$, and $S_{com}$ is the scatter matrix of the common vectors:

$$S_{com}=\sum_{c=1}^{C}(x_{com}^c-\bar{x}_{com})(x_{com}^c-\bar{x}_{com})^T,\quad c=1,2,\ldots,C, \tag{12}$$

where $\bar{x}_{com}$ is the mean of all common vectors:

$$\bar{x}_{com}=\frac{1}{C}\sum_{c=1}^{C}x_{com}^c. \tag{13}$$

In this case, the optimal projection vectors $w_k$ can be found by an eigenanalysis of $S_{com}$. In particular, all eigenvectors corresponding to the nonzero eigenvalues of $S_{com}$ will be the optimal projection vectors. $S_{com}$ is typically a large $d\times d$ matrix. Thus, we can use the smaller $C\times C$ matrix $A_{com}^TA_{com}$ to find the nonzero eigenvalues and the corresponding eigenvectors of $S_{com}=A_{com}A_{com}^T$, where $A_{com}$ is the $d\times C$ matrix of the form expressed as

$$A_{com}=[x_{com}^1-\bar{x}_{com}\ \cdots\ x_{com}^C-\bar{x}_{com}]. \tag{14}$$

There will be $C-1$ optimal projection vectors, since the rank of $S_{com}$ is $C-1$ if all common vectors are linearly independent. If two common vectors are identical, then the two classes represented by this vector cannot be distinguished from each other. Since the optimal projection vectors $w_k$ belong to the null space of $S_W$, it follows that when the image samples $x_m^c$ of the $c$-th class are projected onto the linear span of the projection vectors $w_k$, the feature vector $\Omega_c=[\langle x_m^c,w_1\rangle\ \cdots\ \langle x_m^c,w_{C-1}\rangle]^T$ of the projection coefficients $\langle x_m^c,w_k\rangle$ will also be independent of the sample index $m$. Therefore, we have

$$\Omega_c=W^Tx_m^c,\quad m=1,2,\ldots,M,\ c=1,2,\ldots,C. \tag{15}$$

We call the feature vectors $\Omega_c$ discriminative common vectors, and they will be used for the classification of face images. To recognize a test image $x_{test}$, its feature vector is found by

$$\Omega_{test}=W^Tx_{test}, \tag{16}$$

which is then compared with the discriminative common vector Ωc of each class using the Euclidean distance. The discriminative common vector found to be the closest to Ωtest is adopted to identify the test image.

Since $\Omega_{test}$ is only compared to a single vector for each class, the classification is very efficient for real-time face recognition tasks. In the Eigenface, Fisherface, and Direct-LDA methods, the test sample feature vector $\Omega_{test}$ is typically compared to the feature vectors of all samples in the training set. This makes these methods impractical for real-time applications with large training sets. To overcome this, they have to solve the small sample size problem in an alternative way (Chen et al., 2000; Huang et al., 2002).

The above method can be summarized as follows:

Step 1. Compute the nonzero eigenvalues and their corresponding eigenvectors of $S_W$ using the matrix $A^TA$, where $S_W=AA^T$ and $A$ is given by Eq. (9). Set the matrix $Q=[\alpha_1\ \alpha_2\ \ldots\ \alpha_r]$, where $\alpha_k$, $k=1,2,\ldots,r$, are the orthonormal eigenvectors of $S_W$ with rank $r$.

Step 2. Choose any sample from each class and project it onto the null space of $S_W$ to obtain the common vectors:

$$x_{com}^c=x_m^c-QQ^Tx_m^c,\quad m=1,2,\ldots,M,\ c=1,2,\ldots,C. \tag{17}$$

Step 3. Compute the eigenvectors $w_k$ of $S_{com}$ associated with the nonzero eigenvalues by use of the matrix $A_{com}^TA_{com}$, where $S_{com}=A_{com}A_{com}^T$ and $A_{com}$ is given in Eq. (14). There are at most $C-1$ eigenvectors corresponding to the nonzero eigenvalues to constitute the projection matrix $W=[w_1\ w_2\ \ldots\ w_{C-1}]$, which will be employed to obtain feature vectors as shown in Eqs. (15) and (16).
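The key property of Step 2, that every sample of a class yields the same common vector, can be checked with a small sketch. Note that this sketch does not use the eigen-decomposition of Step 1; it builds an orthonormal basis of the range space of $S_W$ with Gram-Schmidt orthogonalization applied to within-class difference vectors, which span the same space. The function names and the toy data (two classes of two 5-dimensional samples) are assumptions for illustration, and Step 3 is omitted.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def remove_component(x, q):
    """Subtract from x its projection onto the unit vector q."""
    coeff = dot(x, q)
    return [xi - coeff * qi for xi, qi in zip(x, q)]


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt orthogonalization; nearly dependent vectors are skipped."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            w = remove_component(w, q)
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def project_onto_null_space(x, basis):
    """Remove the range-space part of x, leaving its null-space part."""
    for q in basis:
        x = remove_component(x, q)
    return x


def common_vectors(classes):
    """classes: list of classes, each a list of equal-length sample vectors."""
    # Differences to the first sample of each class span the range of S_W.
    diffs = [[a - b for a, b in zip(x, cls[0])]
             for cls in classes for x in cls[1:]]
    basis = orthonormal_basis(diffs)
    return [project_onto_null_space(cls[0], basis) for cls in classes], basis


# Hypothetical training set: C = 2 classes, M = 2 samples, d = 5.
classes = [
    [[1.0, 0.0, 2.0, 0.0, 1.0], [2.0, 1.0, 2.0, 0.0, 1.0]],
    [[0.0, 3.0, 1.0, 1.0, 0.0], [0.0, 3.0, 2.0, 2.0, 0.0]],
]
commons, basis = common_vectors(classes)
```

Projecting the second sample of either class with `project_onto_null_space` returns the same vector as the first sample, up to rounding, which is why a single sample per class suffices.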

4.2. Similarity measurement

Many methods exist for similarity measurement. In order to implement face recognition on the robot in real time, we select the Euclidean distance to measure the similarity of a face image to those in the face database after feature extraction with the aid of the DCV transformations.

Let the training set be composed of $C$ classes, where each class contains $M$ samples, and let $y_m^c$ be a $d$-dimensional column vector which denotes the $m$-th sample from the $c$-th class. If $y$ is the feature vector of a test image, the distance between the test image and a given sample is defined as

$$d(y,y_m^c)=\|y-y_m^c\|. \tag{18}$$

The Euclidean distance between the feature vector $y$ of the test image and class $c$ is then

$$d(y,c)=\min\{d(y,y_1^c),d(y,y_2^c),\ldots,d(y,y_M^c)\}. \tag{19}$$

After the test image has been compared with the feature vectors of all training models, we can find the class whose Euclidean distance is minimum. In other words, the test image is identified as the most likely person, namely $Id_y$. Note that a threshold $T_c$ is prescribed for every class to reject a person who does not belong to the face database, as Eq. (20) shows:

$$Id_y=\begin{cases}\arg\min_{c}d(y,c) & \text{if } \min\{d(y,c)\}\le T_c,\ c=1,2,\ldots,C\\ \text{NULL} & \text{otherwise.}\end{cases} \tag{20}$$
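Eqs. (18) to (20) amount to a nearest-class rule with a rejection threshold. A minimal sketch follows; the labels, feature vectors, and threshold values are hypothetical, and `None` stands in for NULL.

```python
import math


def euclidean(u, v):
    """Eq. (18): Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def identify(y, class_features, thresholds):
    """Return the label of the nearest class, or None if it is too far.

    class_features: dict mapping a label to its list of feature vectors.
    thresholds: dict mapping a label to its threshold T_c.
    """
    best_label, best_dist = None, float("inf")
    for label, samples in class_features.items():
        dist = min(euclidean(y, s) for s in samples)  # Eq. (19)
        if dist < best_dist:
            best_label, best_dist = label, dist
    # Eq. (20): accept only if the minimum distance is within the threshold.
    if best_label is not None and best_dist <= thresholds[best_label]:
        return best_label
    return None


# Hypothetical two-dimensional features for two enrolled people.
database = {"master": [[0.0, 0.0], [1.0, 0.0]], "guest": [[5.0, 5.0]]}
limits = {"master": 0.5, "guest": 0.5}
known = identify([0.1, 0.0], database, limits)
unknown = identify([3.0, 3.0], database, limits)
```

With the DCV features of Section 4.1, each class is represented by a single vector, so the inner minimum reduces to one distance per class.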

5. Face tracking

Our detection method can effectively find face regions, and the recognition method can identify the master among the people in front of the robot. The robot then tracks the master using our face tracking system. We can also exchange the roles of the master and a stranger, in which case the robot switches its tracking to follow the stranger. Many tracking algorithms have been proposed by researchers; of them, the Kalman filter and the particle filter are applied most extensively. The former is often used to predict the state sequence of a target for linear predictions, and it is not suitable for the non-linear and non-Gaussian movement of objects. The latter, however, is based on the Monte Carlo integration method and is suited to nonlinear predictions. Considering that the motions of a human body are nonlinear, our system uses a particle filter to realize the tracker (Foresti et al., 2003; Montemerlo et al., 2002; Nummiaro et al., 2003). The key idea of this technique is to represent probability densities by sets of samples. Its structure is divided into four major parts: probability distribution, dynamic model, measurement, and factored sampling.

5.1. Particle filter

Each target of interest is defined before the face tracking procedure operates. The feature vector of the $i$-th face of interest, enclosed by an elliptical window, in the $t$-th frame is denoted as follows:

$$F_{t,i}=\{x_{t,i},y_{t,i},w_{t,i},h_{t,i},Id_{t,i}\}, \tag{21}$$

where $(x_{t,i},y_{t,i})$ is the centre of the $i$-th elliptical window in the $t$-th frame (i.e., at time step $t$), $w_{t,i}$ and $h_{t,i}$ symbolize the minor and major axes of the ellipse, respectively, and $Id_{t,i}$ is the person’s identity assigned from the face recognition result.

According to the above target parameters, the tracking system builds some face models and their associated candidate models, each of which is represented by a particle. Herein, we select the Bhattacharyya coefficient to measure the similarity of two discrete distributions resulting from the respective cues of a face model and its candidate models existing in two consecutive frames.

Three kinds of cues are considered in the observation model: the colour cue, the edge cue, and the motion cue. The colour cue employed in the observation model is still generated as in the literature cited above, rather than by modelling the target with a colour histogram. The raw image of each frame is first converted to a colour probability distribution image via the face colour model, and the samples are then measured on the colour probability distribution image. In this manner, the face colour model is adapted to the regions of interest found by the face detection procedure, so that it can handle possible frame-by-frame changes of the face regions due to variable illumination. The Bhattacharyya coefficient is used to calculate the similarity, defined as

$$M^C_{t,i}=BC\left(C_{face}^{t-1,i}(l,m),C_{face}^{t,i}(l,m)\right), \tag{22}$$

where $C_{face}^{t-1,i}(l,m)$ and $C_{face}^{t,i}(l,m)$ denote the discrete distributions of the colour cues of the $i$-th face model and its candidate model at time steps $t-1$ and $t$, respectively.

In comparison with our earlier proposed methods (Fahn et al., 2009; Fahn & Lin, 2010; Fahn & Lo, 2010), the edge cue is a newly used clue. Edges are a basic feature of images, and they are less affected by illumination than the colour model is. First, the Sobel filter is applied to obtain all the edges in each frame, and the Bhattacharyya coefficient is then employed as the similarity measure, just as for the colour model:

$$M^E_{t,i}=BC\left(E_{face}^{t-1,i}(l,m),E_{face}^{t,i}(l,m)\right), \tag{23}$$

where $E_{face}^{t-1,i}(l,m)$ and $E_{face}^{t,i}(l,m)$ stand for the discrete distributions of the edge cues of the $i$-th face model and its candidate model at time steps $t-1$ and $t$, respectively.
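The Bhattacharyya coefficient used in Eqs. (22) and (23) is the sum, over all bins, of the square root of the product of the two bin probabilities. The sketch below uses flattened one-dimensional histograms for brevity, whereas the distributions in the text are indexed by $(l,m)$; the function names and example histograms are assumptions.

```python
import math


def normalise(hist):
    """Scale a non-negative histogram so that its bins sum to one."""
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram must contain some mass")
    return [h / total for h in hist]


def bhattacharyya_coefficient(p, q):
    """Similarity of two discrete distributions, between 0 and 1."""
    p, q = normalise(p), normalise(q)
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


# Hypothetical cue histograms of a face model and a candidate model.
model = [4, 10, 6, 0]
candidate = [3, 9, 7, 1]
similarity = bhattacharyya_coefficient(model, candidate)
```

The coefficient equals 1 for identical distributions and 0 for distributions with no overlapping bins, so it can be used directly as a particle weight component.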

As for the motion feature, our tracking system records the centre positions of the objects of interest in the previous $num$ consecutive frames, say twenty. Both the distance and the direction are then computed in the face tracking procedure for each object of interest acting as a particle. At time step $t$, the average centre positions $Ax_{t,i}$ and $Ay_{t,i}$ of the $i$-th particle (i.e., a candidate model) over the previous $num$ frames and its average moving distance $Adis_{t,i}$ are expressed below:

$$Ax_{t,i}=\frac{\sum_{k=1}^{num}x_{t-k,i}}{num}, \tag{24}$$

$$Ay_{t,i}=\frac{\sum_{k=1}^{num}y_{t-k,i}}{num}, \tag{25}$$

and

$$Adis_{t,i}=\frac{\sum_{k=1}^{num}(x_{t,i}-x_{t-k,i})^2+(y_{t,i}-y_{t-k,i})^2}{num}. \tag{26}$$

There are two states of the motion, including the accelerated velocity and decelerated velocity. If the interested face moves with the accelerated velocity, the distance between the current centre position and the average centre position will be larger than the average moving distance. If the interested face moves with the decelerated velocity, conversely, the distance between the current centre position and the average centre position will be less than the average moving distance.

In addition to the distance of face motion, the direction of face motion is another important factor. The angle $\theta_{t,i}$ symbolizing the direction of the $i$-th particle at time step $t$ is expressed as

$$\theta_{t,i}=\left(\tan^{-1}\frac{y_{t,i}-Ay_{t,i}}{x_{t,i}-Ax_{t,i}}\right)\times\frac{180}{\pi}. \tag{27}$$

It also depends on the average centre positions of the $i$-th particle in the previous $num$ frames. The whole range of angles from −90° to +90° is partitioned into four orientations, each of which covers an angle of 45°. In order to get rid of an irregular face motion trajectory, we take the most likely direction by majority voting on the number of times each orientation occurs among $\theta_{t-k,i}$, $k=1,2,\ldots,num$. Thus, the likelihood of face motion at time step $t$ is determined by the following equation:

$$M^v_{t,i}=\begin{cases}\dfrac{dis_{t,i}}{Adis_{t,i}}\times 0.25+vote_{t,i}\times 0.2+0.6 & \text{if } dis_{t,i}\le Adis_{t,i}\\[2ex] \dfrac{dis_{t,i}}{Adis_{t,i}}\times 0.35+vote_{t,i}\times 0.2+0.2 & \text{otherwise,}\end{cases} \tag{28}$$

where $dis_{t,i}$ is the distance between the centre position in the current frame and the average centre position in the previous $num$ frames, $Adis_{t,i}$ is given by Eq. (26), and $vote_{t,i}$ is the maximum frequency of the four orientations appearing in the previous $num$ frames. Therefore, the likelihood of the motion feature combines the measurements of both the moving distance and the moving direction.
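The motion cue of Eqs. (24), (25), (27), and (28) can be sketched as follows. Several details are assumptions made for illustration: the vote is normalised to the range 0 to 1 by dividing by the number of past angles, a zero horizontal displacement is mapped to ±90° or 0°, and the distances `dis` and `adis` are passed in as already computed values.

```python
import math
from collections import Counter


def average_centre(history):
    """Eqs. (24) and (25): mean of the recorded (x, y) centres."""
    num = len(history)
    return (sum(x for x, _ in history) / num,
            sum(y for _, y in history) / num)


def direction_angle(x, y, ax, ay):
    """Eq. (27): direction in degrees relative to the average centre."""
    dx, dy = x - ax, y - ay
    if dx == 0:
        # Assumed convention when the arctangent argument is undefined.
        return 90.0 if dy > 0 else (-90.0 if dy < 0 else 0.0)
    return math.degrees(math.atan(dy / dx))


def orientation_bin(angle_deg):
    """Map an angle in [-90, 90] degrees to one of four 45-degree sectors."""
    return min(int((angle_deg + 90.0) // 45.0), 3)


def orientation_vote(past_angles):
    """Share of past frames that fall in the most frequent sector."""
    counts = Counter(orientation_bin(a) for a in past_angles)
    return max(counts.values()) / len(past_angles)


def motion_likelihood(dis, adis, vote):
    """Eq. (28), with dis and adis supplied by the caller."""
    if adis <= 0:
        raise ValueError("average moving distance must be positive")
    ratio = dis / adis
    if dis <= adis:
        return ratio * 0.25 + vote * 0.2 + 0.6
    return ratio * 0.35 + vote * 0.2 + 0.2


# Hypothetical history of centres and past direction angles.
ax, ay = average_centre([(100, 80), (104, 80), (108, 82)])
vote = orientation_vote([10.0, 20.0, -80.0])
likelihood = motion_likelihood(dis=5.0, adis=10.0, vote=0.5)
```

Keeping the distance term and the direction vote as separate inputs makes it easy to see how each contributes to the final motion likelihood.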

When human faces are occluded, the motion cue is prominent and becomes more reliable. Conversely, when no occlusion happens, we can weigh particles mainly by the observation of the colour and edge cues. As a result, we take a linear combination of the colour, edge, and motion cues as follows:

$$M_{t,i}=(1-\alpha_t)\beta_tM^C_{t,i}+\alpha_t\beta_tM^E_{t,i}+(1-\beta_t)M^v_{t,i}, \tag{29}$$

where $0\le\alpha_t,\beta_t\le 1$ and $\alpha_t+\beta_t=1$. Fig. 4 shows the iteration procedure for tracking the $i$-th face of interest using our improved particle filter.

Figure 4.
The iteration procedure for tracking the i-th interested face by the improved particle filter.

After the current filtering process is finished, we execute the filtering process again, and repeat this until a local search has been completed. Such a method tracks faces steadily even if people move very fast. In the first filtering process, we establish thirty particles to track faces. In the subsequent filtering processes, however, only ten particles are set up to facilitate the face tracking operation.
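A generic particle filtering pass can be sketched as follows. This is not the improved filter of Fig. 4: the random-walk dynamic model, the Gaussian toy likelihood, the spread values, and the way the particle set is cut from thirty to ten are all assumptions chosen only to show the propagate, weigh, estimate, and resample cycle.

```python
import math
import random


def particle_filter_step(particles, likelihood, rng, spread):
    """One filtering pass over a list of (x, y) particles."""
    # Dynamic model (assumed): random walk with Gaussian noise.
    moved = [(x + rng.gauss(0.0, spread), y + rng.gauss(0.0, spread))
             for x, y in particles]
    # Measurement: one likelihood value per particle.
    weights = [likelihood(p) for p in moved]
    total = sum(weights)
    if total <= 0.0:
        # Degenerate case: fall back to uniform weights.
        weights = [1.0] * len(moved)
        total = float(len(moved))
    # State estimate: weighted mean of the particle centres.
    estimate = (sum(w * x for w, (x, _) in zip(weights, moved)) / total,
                sum(w * y for w, (_, y) in zip(weights, moved)) / total)
    # Factored sampling: redraw particles in proportion to their weights.
    resampled = rng.choices(moved, weights=weights, k=len(moved))
    return estimate, resampled


def toy_likelihood(p, target=(160.0, 120.0), sigma=20.0):
    """Hypothetical measurement that peaks at an assumed face centre."""
    d2 = (p[0] - target[0]) ** 2 + (p[1] - target[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))


rng = random.Random(0)
# First filtering process: thirty particles around the last known centre.
particles = [(150.0, 110.0)] * 30
estimate, particles = particle_filter_step(particles, toy_likelihood, rng, 10.0)
# Subsequent filtering processes: ten particles and a narrower search.
particles = particles[:10]
for _ in range(3):
    estimate, particles = particle_filter_step(particles, toy_likelihood, rng, 3.0)
```

In the tracker described in this section, the toy likelihood would be replaced by the combined cue of Eq. (29).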

5.2. Occlusion handling

Occlusion handling is a major problem in visual surveillance (Foresti et al., 2003). The occlusion problem usually occurs in a multi-target system. When multiple moving objects occlude each other, both the colour and edge information are less reliable than the velocities and directions of the moving objects. In order to solve the occlusion problem, we have improved the strategies of the filtering process and modified the weights of the particles described in the previous subsection. Compared with the earlier proposed methods (Fahn et al., 2009; Fahn & Lin, 2010; Fahn & Lo, 2010), the major improvement is that the directions of moving objects in a number of past frames are incorporated into the motion cue. We also check whether an occlusion is present. If only one human face is detected in the face detection procedure, no occlusion can occur. If two or more faces are detected, however, an occlusion occurs when moving faces come close to each other, and the system enters the occlusion handling mode.

5.3. Robot control

We can predict the newest possible position of a human face by means of the particle filtering technique. According to this position, we issue a control command to make the robot track the $i$-th face continuously. Such robot control comprises two parts: PTZ camera control and wheel motor control. The following is the command set of robot control at time step $t$, where the first two commands belong to the PTZ camera control and the remaining ones are used for the wheel motor control:

$$\mathit{SCommand}_{t,i}=\{up_{t,i},down_{t,i},forward_{t,i},stop_{t,i},backward_{t,i},left_{t,i},right_{t,i}\}. \tag{30}$$

5.3.1. PTZ camera control

The principal task of the PTZ camera control is to keep the $i$-th face in the centre of the screen. It prevents the target from drifting to the upper or lower edge of the screen. If $y_{t,i}$ is lower than or equal to 50, then we assign the command $up_{t,i}$ to the PTZ camera at time step $t$; if $y_{t,i}$ is greater than or equal to 190, then we assign the command $down_{t,i}$ to the PTZ camera at time step $t$, as shown in Eq. (31):

$$\mathit{Command}_{t,i}=\begin{cases}up_{t,i} & \text{if } y_{t,i}\le 50\\ down_{t,i} & \text{if } y_{t,i}\ge 190\\ \text{no-operation} & \text{otherwise,}\end{cases} \tag{31}$$

where $y_{t,i}$ is the centre of the $i$-th elliptical window in the $y$-direction at time step $t$.

5.3.2. Wheel motor control

The main task of the wheel motor control is to direct the robot to track the $i$-th face continuously. We utilize two kinds of information to determine which command is issued. The first is the position of the face on the screen. According to this, we assign the commands that activate the wheel motors to make the robot move appropriately. If $x_{t,i}$ is lower than or equal to 100, then we assign the command $left_{t,i}$ to the wheel motors at time step $t$; if $x_{t,i}$ is greater than or equal to 220, then we assign the command $right_{t,i}$ to the wheel motors at time step $t$, as Eq. (32) shows:

$$\mathit{Command}_{t,i}=\begin{cases}left_{t,i} & \text{if } x_{t,i}\le 100\\ right_{t,i} & \text{if } x_{t,i}\ge 220\\ \text{no-operation} & \text{otherwise,}\end{cases} \tag{32}$$

where $x_{t,i}$ is the centre of the $i$-th elliptical window in the $x$-direction at time step $t$.

The second information is the laser range data. We equip a laser range finder on the front of the robot to measure the distance Dt,i between the robot and a target which is allowed to have the incline angle ranged from −10° to +10°. In accordance with the distanceDt,i, we can assign the commands to the wheel motors to control the movement of the robot. If Dt,i is lower than or equal to 60, then we assign the command backwardt,i to the wheel motors at time step t; if Dt,i is greater than or equal to 120, then we assign the command forwardt,i to the wheel motors at time step t; otherwise, we assign the command stopt,i to the wheel motors at time step t, as stated in Eq. (33):

Commandt,i={backwardt,iifDt,i≤60forwardt,iifDt,i≥120stopt,iotherwise.E33
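The two wheel-motor rules in Eqs. (32) and (33) can be sketched together as follows. The function and constant names are hypothetical. Returning the steering and driving decisions as a pair is an assumption made for illustration, since the text does not state how the two commands are prioritized or merged.

```python
from typing import NamedTuple

# Thresholds from Eqs. (32) and (33); the constant names are hypothetical.
X_LEFT_LIMIT = 100
X_RIGHT_LIMIT = 220
D_NEAR_LIMIT = 60
D_FAR_LIMIT = 120


class WheelCommands(NamedTuple):
    """Steering and driving decisions, kept separate (an assumption)."""
    steer: str
    drive: str


def steer_command(x_center: float) -> str:
    """Eq. (32): choose a turn from the x-centre of the elliptic window."""
    if x_center <= X_LEFT_LIMIT:
        return "left"
    if x_center >= X_RIGHT_LIMIT:
        return "right"
    return "no-operation"


def drive_command(distance: float) -> str:
    """Eq. (33): choose forward/backward motion from the laser range."""
    if distance <= D_NEAR_LIMIT:
        return "backward"
    if distance >= D_FAR_LIMIT:
        return "forward"
    return "stop"


def wheel_commands(x_center: float, distance: float) -> WheelCommands:
    """Evaluate both rules for one time step."""
    return WheelCommands(steer_command(x_center), drive_command(distance))
```

For example, `wheel_commands(90, 150)` yields a left turn together with a forward command, while a face at `x_center = 160` measured at a distance of 90 yields no steering and a stop.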

Controlling the robot in this way extends its capacity to interact with people. If the robot can recognize a person's identity and follow him or her persistently, the relationship between the robot and the person becomes closer, and the robot becomes more widely applicable. We therefore control the movement of the robot according to the state changes between the face tracking and face recognition modes. After face detection and target confirmation, the system switches to the recognition mode and starts the robot on a given task.

6. Experimental results

Table 1 lists the hardware and software used to develop the real-time face tracking and recognition system installed on our robot. All the experimental results presented in this section are averages over 10 runs, in order to demonstrate the effectiveness of the proposed methods.

Table 1.
The Developing Environment of Our Face Tracking and Recognition System Installed on the Robot

6.1. Face detection

The training database for face detection consists of 861 face images segmented by hand and 4,320 non-face images labelled by a random process from a set of photographs taken outdoors. All of these face and non-face images are 20×20 pixels in size. Fig. 5 illustrates some examples of positive and negative images used for training. Notice that each positive image contains a face region extending from beneath the eyebrows to beyond the half-way point between the mouth and the chin.

Figure 5.
Some examples of positive and negative images used for training. The first row is face images and the others are non-face images.

We perform the face detection experiments on two different kinds of image sequences: ordinary and jumble. An image sequence is regarded as a jumble if the background in the sample video is cluttered; that is, the environment contains many skin-coloured pixels and/or the illumination is not appropriate. We define the “error rate” as the probability of detecting human faces incorrectly, for example, regarding a non-face region as a human face. The “miss rate” is the probability that a target appears in the frame but is not detected by the system. Table 2 shows the face detection rates for the above experiments.

From the experimental results, we can observe that the ordinary image sequences have better correct rates because they encounter less interference, and the error rates are also lower in this situation. The performance on the jumble image sequences is inferior to that on the ordinary ones. The main cause is the difference in luminance; the effect of the colour factor is comparatively slight, even when many skin-like regions exist in the background. Note that the error rate for the jumble image sequences is larger than that for the ordinary ones because our database has a small number of negative examples, which causes the error rate to rise. Fig. 6 demonstrates some face detection results from using our real-time face detection system in different environments.

Table 2.
The Face Detection Rates of Two Different Kinds of Image Sequences

Figure 6.
Some face detection results from using our real-time face detection system.

6.2. Face recognition

Our face image database comprises 5 subjects to be recognized, each with 20 samples. The subjects, nicknamed Craig, Eric, Keng-Li, Jane, and Eva, comprise 3 males and 2 females. The facial expressions (open or closed eyes, smiling or non-smiling) are also varied. The images of one subject included in the database are shown in Fig. 7.

Figure 7.
Images of one subject included in our database.

Table 3.
The Range of Thresholds for Each Class to Eliminate the Corresponding Person from the Possible Subjects

Our robot executes face recognition in real time using the DCV method. After comparing the feature vector of the captured and then normalized face image with those of all training models, we find the class with the minimum Euclidean distance; in other words, that class is identified as the most likely person. Table 3 presents the range of thresholds for each class, used to eliminate the corresponding person from the possible subjects included in the database. If the feature similarity measure of a captured and normalized face image is lower than or equal to the threshold value of a class, the corresponding person is assigned to that class. Over each period of 10 frames, we rank the frequency of assignments for each class to determine the recognition result. Fig. 8 graphically shows some face recognition results from using our real-time face recognition system installed on the robot.

Figure 8.
Some face recognition results from using our real-time face recognition system installed on the robot.
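The decision procedure described in this subsection (nearest class by Euclidean distance, a per-class threshold, then a frequency ranking over 10 frames) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: each class is assumed to be represented by a single feature vector, the feature vectors and thresholds in the usage note are made up, and a rejected frame is represented by `None`. The DCV feature extraction itself is not shown.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Optional, Sequence


def nearest_class(
    feature: Sequence[float],
    class_vectors: Dict[str, Sequence[float]],
    thresholds: Dict[str, float],
) -> Optional[str]:
    """Return the class with the minimum Euclidean distance to `feature`.

    The class is accepted only if that distance does not exceed the
    class's own threshold; otherwise None is returned (assumption).
    """
    best_label, best_dist = None, math.inf
    for label, vector in class_vectors.items():
        dist = math.dist(feature, vector)
        if dist < best_dist:
            best_label, best_dist = label, dist
    if best_label is not None and best_dist <= thresholds[best_label]:
        return best_label
    return None


def vote(frame_labels: Iterable[Optional[str]]) -> Optional[str]:
    """Pick the most frequent per-frame assignment within one period."""
    counts = Counter(frame_labels)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

As a usage example with made-up two-dimensional features, take `class_vectors = {"Craig": (0.0, 0.0), "Eric": (10.0, 0.0)}` and a threshold of 2.0 for both classes. A frame with feature `(1.0, 0.0)` is assigned to "Craig", whereas `(5.5, 0.0)` is rejected because its nearest class lies farther away than the threshold. Calling `vote` on the 10 per-frame labels of a period then gives the recognition result for that period.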

6.3. Face tracking

Several types of accuracy measures are required to evaluate the performance of the tracking results. In this subsection, we use track cardinality measures to estimate the tracking rates. The reference data must first be determined before any of the accuracy measures can be obtained. Let $A$ be the total number of actual tracks (the reference data), $R$ be the total number of reported tracks, $A'$ be the number of actual tracks that have corresponding reported tracks, and $R'$ be the number of reported tracks that do not correspond to true tracks. The measures are $m_c$, the fraction of true tracks that have corresponding reported tracks; $f_c$, the fraction of reported tracks that are false alarms, that is, that do not correspond to true tracks; and $m_d$, the miss rate. These variables are defined as follows:

$$
m_c = \frac{A'}{A}, \tag{34}
$$

$$
f_c = \frac{R'}{A}, \tag{35}
$$

and

$$
m_d = 1 - \frac{R}{A}. \tag{36}
$$
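The three track cardinality measures can be computed directly from the four counts. The sketch below mirrors Eqs. (34) to (36) as printed; the function name, argument names, and the counts in the usage note are hypothetical.

```python
from typing import Tuple


def track_cardinality(
    actual: int, reported: int, actual_matched: int, reported_false: int
) -> Tuple[float, float, float]:
    """Return (m_c, f_c, m_d) following Eqs. (34) to (36).

    actual          A : total number of actual (reference) tracks
    reported        R : total number of reported tracks
    actual_matched  A': actual tracks that have a corresponding reported track
    reported_false  R': reported tracks that correspond to no true track
    """
    if actual <= 0:
        raise ValueError("the number of actual tracks must be positive")
    m_c = actual_matched / actual   # Eq. (34)
    f_c = reported_false / actual   # Eq. (35)
    m_d = 1 - reported / actual     # Eq. (36)
    return m_c, f_c, m_d
```

For instance, with made-up counts of 10 actual tracks, 9 reported tracks, 8 matched actual tracks, and 1 false report, the measures are approximately 0.8, 0.1, and 0.1 respectively.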

Our testing image sequences are roughly classified into three kinds: a single face (Type A), multiple faces without occlusion (Type B), and multiple faces with occlusion (Type C). Each type has three different image sequences, and we perform five tests on each of them. In these experiments, we use 30 particles to track human faces in each frame for the first filtering process, followed by 10 particles for the subsequent one. We then analyze the face tracking performance on the three kinds of testing image sequences according to the aforementioned evaluation measures. Table 4 shows the face tracking performance for the above experiments.

Table 4.
The Face Tracking Performance of Three Different Kinds of Image Sequences

The experimental results show that the tracking accuracy for Type C is lower than for the other two types. This is because, when occlusion happens, particles are predicted onto other faces, which are also skin-colour regions. On the other hand, the correct tracking rates for Types A and B are similar; that is, the tracking performance for a single face is almost the same as that for multiple faces without occlusion. When the robot and the target move simultaneously, objects whose colours are close to the skin colour may exist in the background. In this situation, we must take the speed of the target into account and assume that a target disappearing on the left side of the image does not immediately reappear on the right side. With such simple distance limits, the robot can track the target continuously. Moreover, at regular intervals the system either abandons the currently tracked targets and detects the face regions once more, or keeps tracking the wrong targets until the head moves near the resampling range of the particle filter. Enlarging the resampling range can partly alleviate this problem. However, the wider the resampling range is, the sparser the particles are. A wider range helps the robot keep track of faces, but it sometimes reduces the tracking correctness. Fig. 9 shows some face tracking results from using our real-time face tracking system equipped on the robot.

Figure 9.
Some face tracking results from using our real-time face tracking system for: (a) a single human; (b) multiple humans within the field of view of the PTZ camera equipped on the robot.
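The "simple distance limits" mentioned in the discussion of the tracking results can be illustrated with a small gating function: a candidate position is accepted only if it lies within a maximum displacement of the previous position, so a face that leaves on the left cannot be matched to a skin-coloured region on the far right in the next frame. The text does not give the actual limit or matching rule, so `max_jump` and the nearest-candidate choice below are assumptions made for illustration.

```python
import math
from typing import Iterable, Optional, Tuple

Point = Tuple[float, float]


def select_candidate(
    previous: Point, candidates: Iterable[Point], max_jump: float
) -> Optional[Point]:
    """Return the candidate nearest to `previous` within `max_jump` pixels.

    Candidates farther away than `max_jump` are treated as implausible
    jumps and ignored; None is returned when no candidate qualifies.
    """
    plausible = [c for c in candidates if math.dist(previous, c) <= max_jump]
    if not plausible:
        return None
    return min(plausible, key=lambda c: math.dist(previous, c))
```

With a previous position of `(20, 120)` and a limit of 40 pixels, a candidate at `(35, 125)` is accepted while one at `(300, 118)` is discarded. When the function returns `None`, the tracker could fall back to face detection, in line with the periodic re-detection described above.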

7. Conclusion and future work

In this chapter, we have presented a fully automatic real-time multiple-face recognition and tracking system installed on a robot. The system captures an image sequence from a PTZ camera, uses the face detection technique to locate face positions, identifies the detected faces as the master or strangers, and subsequently tracks a target and continuously guides the robot towards it. Such a system not only allows robots to interact with human beings adequately, but also makes robots react more like humans.

Several directions for future work are worth investigating to attain better performance. In the face recognition procedure, if the background is too cluttered to capture a clear foreground, the recognition rate decreases. Because most of our previous training samples were captured in a simple environment, static objects in an uncomplicated background are sometimes identified as the foreground. We can add special training samples captured against cluttered backgrounds to lower the miss rate during face detection. This should also raise the face recognition accuracy, but it requires many experiments to collect special and proper training samples.

References

1. An, K. H., Yoo, D. H., Jung, S. U. & Chung, M. J. 2005. “Robust multi-view face tracking,” in Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Alberta, Canada, pp. 1905-1910.
2. Bing, Y., Lianfu, J. & Ping, C. 2002. “A new LDA-based method for face recognition,” in Proceedings of the International Conference on Pattern Recognition, Quebec, Canada, vol. 1, pp. 168-171.
3. Brunelli, R. & Poggio, T. 1993. “Face recognition: features versus templates,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 10, pp. 1042-1052.
4. Cevikalp, H., Neamtu, M., Wilkes, M. & Barkana, A. 2005. “Discriminative common vectors for face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 1, pp. 4-13.
5. Cevikalp, H. & Wilkes, M. 2004. “Face recognition by using discriminative common vectors,” in Proceedings of the International Conference on Pattern Recognition, Cambridge, United Kingdom, vol. 1, pp. 326-329.
6. Chen, L. F., Liao, H. Y. M., Ko, M. T., Lin, J. C. & Yu, G. J. 2000. “A new LDA-based face recognition system which can solve the small sample size problem,” Pattern Recognition, vol. 33, no. 10, pp. 1713-1726.
7. Er, M. J., Wu, S., Lu, J. & Toh, H. L. 2002. “Face recognition with radial basis function (RBF) neural networks,” IEEE Transactions on Neural Networks, vol. 13, no. 3, pp. 697-710.
8. Fahn, C. S., Kuo, M. J. & Wang, K. Y. 2009. “Real-time face tracking and recognition based on particle filtering and AdaBoosting techniques,” in Proceedings of the 13th International Conference on Human-Computer Interaction, LNCS 5611, San Diego, California, pp. 198-207.
9. Fahn, C. S. & Lin, Y. T. 2010. “Real-time face detection and tracking techniques used for the interaction between humans and robots,” in Proceedings of the 5th IEEE Conference on Industrial Electronics and Applications, Taichung, Taiwan.
10. Fahn, C. S. & Lo, C. S. 2010. “A high-definition human face tracking system using the fusion of omnidirectional and PTZ cameras mounted on a mobile robot,” in Proceedings of the 5th IEEE Conference on Industrial Electronics and Applications, Taichung, Taiwan.
11. Fan, K. C., Wang, Y. K. & Chen, B. F. 2002. “Introduction of tracking algorithms,” Image and Recognition, vol. 8, no. 4, pp. 17-30.
12. Foresti, G. L., Micheloni, C., Snidaro, L. & Marchiol, C. 2003. “Face detection for visual surveillance,” in Proceedings of the 12th IEEE International Conference on Image Analysis and Processing, Mantova, Italy, pp. 115-120.
13. Gulmezoglu, M. B., Dzhafarov, V. & Barkana, A. 2001. “The common vector approach and its relation to principal component analysis,” IEEE Transactions on Speech and Audio Processing, vol. 9, no. 6, pp. 655-662.
14. Hjelmås, E. & Low, B. K. 2001. “Face detection: a survey,” Computer Vision and Image Understanding, vol. 83, no. 3, pp. 236-274.
15. Huang, R., Liu, Q., Lu, H. & Ma, S. 2002. “Solving the small size problem of LDA,” in Proceedings of the International Conference on Pattern Recognition, Quebec, Canada, vol. 3, pp. 29-32.
16. Li, S. Z. & Lu, J. 1999. “Face recognition using the nearest feature line method,” IEEE Transactions on Neural Networks, vol. 10, no. 2, pp. 439-443.
17. Montemerlo, M., Thrun, S. & Whittaker, W. 2002. “Conditional particle filters for simultaneous mobile robot localization and people-tracking,” in Proceedings of the IEEE International Conference on Robotics and Automation, Washington, D.C., vol. 1, pp. 695-701.
18. Nummiaro, K., Koller-Meier, E. & Van Gool, L. 2003. “An adaptive color-based particle filter,” Image and Vision Computing, vol. 21, no. 1, pp. 99-110.
19. Turk, M. & Pentland, A. 1991. “Face recognition using eigenfaces,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Maui, Hawaii, pp. 586-591.
20. Viola, P. & Jones, M. 2001. “Rapid object detection using a boosted cascade of simple features,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, vol. 1, pp. 511-518.
21. Zhao, W., Chellappa, R., Phillips, P. J. & Rosenfeld, A. 2003. “Face recognition: a literature survey,” ACM Computing Surveys, vol. 35, no. 4, pp. 399-458.



Reviews, Refinements and New Ideas in Face Recognition
