Comparison of person identification approaches using multimodal biometric traits under different challenges.
The main aims of this chapter are to show the importance and role of human identification and recognition in the field of human-robot interaction, to discuss the two families of person identification systems, namely traditional and biometric systems, and to compare the biometric traits most commonly used in recognition systems, such as face, ear, palmprint, iris, and speech. By comparing the requirements, advantages, disadvantages, recognition algorithms, challenges, and experimental results for each trait, the chapter then discusses which biometric trait is the most suitable and efficient for human-robot interaction. It also discusses the cases of human-robot interaction that call for a unimodal biometric system and explains why a multimodal biometric system is also required. Finally, two fusion methods for multimodal biometric systems are presented and compared.
- person identification
- multimodal biometrics
- face recognition
- iris recognition
- palmprint recognition
- ear recognition
- speech recognition
Human identification is one of the oldest human activities: people have always needed to distinguish one another. In earlier times, misidentifying a person was unusual because communities were small, so it was possible to remember everyone one dealt with. Seeing a person's face or hearing his or her voice was enough to recognize that person; therefore, human identification was not considered a hard problem. Population growth and the rise of commercial and financial transactions forced people to find new, reliable methods of human identification in order to prevent unauthorized persons from accessing protected information. These methods are classified into two main approaches: traditional and biometric. Matching in these methods is conducted not only by humans but also by automated systems, which speed up the matching process and offer a much larger memory capacity.
2. Person identification approaches (traditional vs. biometrics)
Traditional human identification approaches depend on changeable parameters such as passwords or magnetic/ID cards. These parameters can easily be used by unauthorized persons who know the password or hold the card. Loss, forgetting, and theft are weaknesses common to all traditional identification methods, which makes them unreliable and inaccurate, especially in high-precision settings such as forensic, financial, banking, and border control systems. The need for more robust person identification, together with the development of sensors and automated systems, was the incentive to build systems that depend on the unique features of each person. These features are extracted from a human trait such as fingerprint, face, or speech. Human recognition using features extracted from inherent physical or behavioral traits of individuals is defined as biometrics. In addition to improving the efficiency and capability of recognition systems, biometrics simplifies the process of claiming and verifying an identity, since there is no need to memorize passwords or to carry ID documents such as passports or driver's licenses.
Biometrics is the science of establishing the identity of an individual based on a vector of features derived from a behavioral characteristic or a specific physical attribute of that person. The behavioral category covers how the person interacts and moves, such as speaking style, hand gestures, and signature. The physiological category covers physical human traits such as fingerprints, iris, face, veins, eyes, hand shape, palmprint, and many more. Evaluating these traits supports the recognition process in biometric systems.
A biometric system includes two main phases: enrollment and recognition. In the enrollment phase, biometric data (image, video, or speech) are captured and stored in a database. The recognition phase mainly consists of extracting the salient features and generating matching scores by comparing the query features against the stored templates. After matching, the decision process reports an identity, namely that of the most closely resembling person in the database.
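The enrollment and recognition phases described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the feature vectors are taken as already extracted, cosine similarity is chosen here as the matching score, and the names `BiometricDatabase`, `enroll`, and `identify` are hypothetical and do not come from the chapter.

```python
import math


def cosine_similarity(a, b):
    """Matching score between two feature vectors (an assumed choice of score)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class BiometricDatabase:
    """Hypothetical template store: one feature vector per enrolled identity."""

    def __init__(self):
        self.templates = {}

    def enroll(self, identity, features):
        # Enrollment phase: store the template of a known person.
        self.templates[identity] = list(features)

    def identify(self, query):
        # Recognition phase: score the query against every stored template
        # and report the identity of the most closely resembling person.
        if not self.templates:
            raise ValueError("no enrolled templates")
        scores = {
            identity: cosine_similarity(query, template)
            for identity, template in self.templates.items()
        }
        best = max(scores, key=scores.get)
        return best, scores[best]


# Toy feature vectors, invented for illustration only.
db = BiometricDatabase()
db.enroll("alice", [0.9, 0.1, 0.3])
db.enroll("bob", [0.2, 0.8, 0.5])
identity, score = db.identify([0.85, 0.15, 0.35])
print(identity, round(score, 3))
```

Note that this closed-set sketch always returns some enrolled identity; a practical system would usually also compare the best score with a threshold before accepting it.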
3. Common biometric traits
This section gives a brief overview of the most commonly used unimodal biometric traits, together with their requirements, advantages, and disadvantages.
Face recognition is one of the most important abilities that we use in our daily lives. Face recognition has been an active research area over the last 40 years, and the first automated face recognition system was developed by Takeo Kanade in 1973. The increasing interest in face recognition research is driven by its satisfactory performance in many widely used applications, such as public security, commercial, and multimedia data management applications that use the face as a biometric trait. Besides being natural and nonintrusive, face recognition has several advantages over other biometrics such as fingerprint and iris. First, and most importantly, the face can be captured at a distance and in a covert manner. Second, in addition to identity, the face can also show the expression and emotion of the individual, such as sadness, wonder, or fear, and it provides biographic data such as gender and age. Third, large databases of face images are already available, because users must provide a face image in order to acquire a driver's license or ID card. Finally, people are generally more willing to share their face images in the public domain, as evinced by the increasing interest in social media applications (e.g., Facebook) with functionalities like face tagging.
A face recognition system generally consists of four modules, namely face detection, preprocessing, feature extraction, and matching, as shown in Figure 1. An original face image and its preprocessed variant are shown in Figure 2.
Iris recognition is one of the most reliable methods for personal identification. The use of iris texture analysis for biometric identification is well established, with the advantages of uniqueness and stability. Iris recognition has been successfully applied in access control systems that manage large databases. The United Arab Emirates has been using iris biometrics for border control and expellee tracking for the past decade.
The iris is one of the most valuable traits for the automatic identification of human beings. A number of reasons justify this interest. First of all, the iris is a protected internal organ of the eye that is visible from the exterior. The iris is an annular structure and planar shape that turns easily, and it has a rich texture. Furthermore, iris texture is predominantly phenotypic, with limited genetic penetrance. Its appearance is stable over a lifetime, which holds tremendous promise for leveraging iris recognition in diverse application scenarios such as border control, forensic investigations, and cryptosystems.
Iris recognition also has some drawbacks. It needs considerable user cooperation for data acquisition, and it is often sensitive to occlusion. Iris data acquisition needs a controlled environment, and the acquisition devices are quite costly. Iris recognition also cannot be used in a covert situation.
A typical iris recognition system has four modules: acquisition, segmentation, normalization, and matching. These modules are shown in Figure 3 for a general iris recognition system.
The palmprint recognition system is considered one of the most successful biometric systems, being both reliable and effective. It identifies a person based on the principal lines, wrinkles, and ridges on the surface of the palm. More than 10 years of studies and research have shown that the features of interest in a palmprint are fixed and invariant and that the palmprint of each person is unique, so the palmprint can be relied on as a biometric trait.
Some of the advantages of palmprint recognition compared with other biometric traits are its invariant line structure, its low intrusiveness, and the low cost of the capturing device. Palmprint identification uses either high-resolution images (400 dpi or more) or low-resolution images (150 dpi or less): high-resolution images are suitable for forensic applications such as criminal detection, and low-resolution images are more suitable for civil and commercial applications such as access control. High-resolution and low-resolution palmprint images are demonstrated in Figure 4. Additionally, the area of the palmprint is larger than that of the fingerprint; consequently, more distinctive features can potentially be captured from it.
Because of its low cost, user friendliness, high speed, and high accuracy, palmprint recognition can be considered one of the most reliable and suitable biometric recognition systems. A lot of work has already been done on palmprint recognition, since it is a very interesting research area. However, more research is needed to obtain an efficient palmprint system.
Three groups of marks are used in palmprint identification: geometric features, line features (e.g., principal lines, wrinkles), and point features (e.g., minutiae points). A typical palmprint recognition system consists of palmprint acquisition, preprocessing, feature extraction, and matching phases.
The modern history of fingerprint identification begins in the 19th century with the development of identification bureaus charged with keeping accurate records about indexed individuals. Fingerprints were first acquired using the ink technique.
The main application of fingerprint identification is the forensic investigation of crimes. John Maloy performed a forensic identification in the late 1850s by designing a high-security identification system, which has always been the main goal in the security business.
The main reasons for the popularity of fingerprint recognition are as follows:
- The pattern of a fingerprint is unique to each individual and immutable throughout life, from infancy to old age, and the patterns of no two hands resemble each other.
- Its success in various applications in the forensic, government, and civilian domains.
- The fact that criminals often leave their fingerprints at crime scenes.
- The existence of large legacy databases, such as those of the National Institute of Standards and Technology (NIST) and the Fingerprint Verification Competition (FVC) evaluation databases from 2000, 2002, and 2004.
- The availability of compact and relatively inexpensive fingerprint readers.
Typical fingerprint features called minutiae are extracted from fingerprint images, as shown in Figure 5, and used in the matching process of a fingerprint recognition system.
Recognizing people by their ears has recently received significant attention in the literature. Several factors have made the ear a widely used biometric. First, the shape of the ear and the structure of the cartilaginous tissue of the pinna are highly discriminative. The pinna is formed by the outer helix, the antihelix, the lobe, the tragus, the antitragus, and the concha. Ear recognition approaches are based on matching the distances of salient points on the pinna from a landmark location. Second, the ear has a structure that does not vary with facial expression or time, and it remains stable until the end of life. It has been shown that the recognition rate is not affected by aging. Third, ear biometrics is convenient because acquisition is easy: the ear is larger than the fingerprint, iris, and retina and smaller than the face. Ear data can also be captured from a distance without the knowledge or cooperation of the user; therefore, it can be used in passive environments. This makes ear recognition especially interesting for smart surveillance tasks and for forensic image analysis, because ear images can typically be extracted from profile head shots or video footage.
The main drawback of ear biometrics is occlusion: the ear can be partially or fully covered by hair or by other items such as headdresses, hearing aids, jewelry, or headphones. In an active identification system this is not critical, as the subject can pull his or her hair back, but in passive identification it is a problem, as nobody informs the subject. Other challenges for ear recognition are different poses (angles), left and right rotation, and different lighting conditions.
Automatic speaker verification and identification have a long history going back to the early 1960s. Dragon systems were among the early applications used as speech recognizers; they focused on the ability of the recognition system to provide acoustic knowledge about the speaker. These systems employed Baum-Welch HMM procedures to train their models.
Speech or voice is one of the behavioral traits that can be used in biometric systems to identify the user based on the voice stored in the enrollment phase, since voice characteristics such as pronunciation style and voice texture are unique and distinctive for each person. On the other hand, voice can also be considered a physiological feature in addition to a behavioral one, because it depends on the shape of the vocal tract.
3.6.1. Advantages and disadvantages of voice recognition
Generally, voice recognition is nonintrusive, and people readily accept a speech-based biometric system because it causes little inconvenience. It is also a cheap recognition technology, because general-purpose voice recorders can be used to acquire the data, and noise can be canceled by dedicated software. As a result, speech recognition is used in many applications such as financial applications, security, retail, crime investigation, and entertainment. However, a person's voice can easily be recorded and then used for unauthorized access.
Speech-based features are sensitive to a number of factors such as background noise, room reverberation, the channel through which the speech is acquired (such as cellular, land-line, and VoIP), overlapping speech, and Lombard or hyper-articulated speech. The emotional and physical state of the speaker also matters: an illness such as flu can change a person's voice and make voice recognition difficult. Speech-based authentication is currently restricted to low-security applications because of the high variability in an individual's voice and the poor accuracy of a typical speech-based authentication system. Existing techniques are able to reduce the variability caused by additive noise or linear distortions, and to compensate for slowly varying linear channels.
3.6.2. Speech recognition
The speech recognition process starts by acquiring sound from a user with a microphone; the series of acoustic signals is then converted into a set of identifying words. Speech recognition depends on many factors such as the language model, vocabulary size, speaking style, speaker enrollment, and transducer. A speech recognition system is classified as a "speaker dependent system" if the user must train the system before using it, and as a "speaker independent system" if the system can recognize any speaker's speech without a training phase. Speech recognition systems can also be divided into "isolated word speech" and "continuous speech" systems based on the vocabulary used for the identification process.
Speaker models [16, 17] make it possible to generate the scores on which decisions are based. As in any pattern recognition problem, the choices are numerous, and the most popular and dominant technique of the last two decades is the Hidden Markov Model (HMM). Other techniques used in speech recognition systems include Artificial Neural Networks (ANN), the Back Propagation Algorithm (BPA), the Fast Fourier Transform (FFT), Learning Vector Quantization (LVQ), and Neural Networks (NN). A typical speech recognition system is shown in Figure 6.
3.7. Performance evaluation of biometrics systems
Different measures can be used to evaluate the performance of biometric systems. The best-known measure is the recognition rate, defined as the percentage of correctly matched samples out of all tested samples. Another popular measure is the False Reject Rate (FRR) versus the False Accept Rate (FAR) at various threshold values, where FRR is the expected probability that two mated samples are wrongly declared a non-match and FAR is the expected probability that two non-mated samples are incorrectly declared a match.
A single-valued, threshold-independent measure, the Equal Error Rate (EER), can also be used to evaluate the performance of recognition systems. The EER is the error rate at the operating point where FRR and FAR are equal.
Detection Error Trade-off (DET) and Receiver Operating Characteristic (ROC) curves are also used to compare the performance of biometric systems. Both curves plot FRR against FAR, the DET curve on a normal deviate scale and the ROC curve on a linear scale.
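The measures defined in this section can be computed directly from lists of match scores. The following is a minimal sketch, assuming that scores are similarities (higher means a better match) and that a comparison is accepted when its score is at least the threshold. The function names and the toy scores are illustrative only, and the EER is approximated at the candidate threshold where FAR and FRR are closest.

```python
def recognition_rate(predicted, actual):
    """Percentage of test samples whose predicted identity is correct."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)


def far_frr(genuine, impostor, threshold):
    """FAR and FRR at one threshold (accept when score >= threshold)."""
    # FRR: mated (genuine) comparisons that are wrongly rejected.
    frr = sum(1 for s in genuine if s < threshold) / len(genuine)
    # FAR: non-mated (impostor) comparisons that are wrongly accepted.
    far = sum(1 for s in impostor if s >= threshold) / len(impostor)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Approximate EER: scan observed scores as candidate thresholds."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy scores, invented for illustration only.
genuine = [0.91, 0.85, 0.78, 0.66, 0.95]
impostor = [0.12, 0.35, 0.41, 0.70, 0.25]

eer, eer_threshold = equal_error_rate(genuine, impostor)
print("EER:", eer, "at threshold", eer_threshold)
print("FAR, FRR at 0.5:", far_frr(genuine, impostor, 0.5))
```

Sweeping the threshold over its whole range and collecting the (FAR, FRR) pairs gives the points from which a DET or ROC curve is drawn.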
4. Biometric challenges
Several challenges and key factors can significantly affect recognition performance and degrade the extraction of robust and discriminant features. Some of these challenges, such as pose, illumination, aging, facial expression variations, and occlusions, are briefly described below and illustrated in Figure 7.
Pose variation: the images of a face or ear vary with the camera pose (different viewpoints), as shown in Figure 7a. In this condition, some facial parts such as the eyes or nose may become partially or fully occluded. Pose variation has a strong influence on the recognition process because it introduces projective deformations and self-occlusion. Thus, images of the same person taken from two different poses (intra-user variation) may appear more different than images of two different people taken with the same pose (inter-user variation). Many studies deal with pose variation challenges [18, 19, 20].
Illumination variation: a captured image is affected to some degree by many factors. The appearance of the human face or ear is affected by lighting factors, which include spectra, source distribution, and intensity, and by camera characteristics such as sensor response and lenses. Illumination variations can also affect the appearance because of skin reflectance properties and the internal camera control. Illumination variation is considered one of the main technical challenges in biometric systems, especially for face and ear traits, where the face of a person can appear dramatically different, as shown in Figure 7b. In order to handle variations in lighting conditions or pose, an image relighting technique based on pose-robust albedo estimation can be used to generate multiple frontal images of the same person with variable lighting.
Aging: changes in appearance can have a natural cause, namely age progression, or an artificial cause, such as the use of makeup. Facial appearance changes more drastically at ages below 18 years because of changes in the subject's weight or the stiffness of the skin. All aging-related variations, such as wrinkles, speckles, skin tone, and shape, degrade face recognition performance. One of the main reasons for the small number of studies on face recognition in the context of aging was the absence of a public-domain database for studying its effect, since it was very difficult to collect a dataset containing face images of the same subject taken at different ages throughout his or her life. An example set of images of the same person at different ages is presented in Figure 7c.
Occlusion: faces may be partially occluded by objects or facial hair such as a scarf, hat, spectacles, beard, or mustache, as shown in Figure 7e. This makes face detection a difficult task, and recognition itself may be difficult because hidden facial parts make features hard to recognize. For these reasons, in surveillance and commercial applications, face recognition engines reject an image when some part of the face is not detected. In the literature, local-feature-based methods have been proposed to overcome these occlusion problems. The iris, for its part, can be occluded by eyelashes, eyelids, shadows, or specular reflections, and these occlusions can lead to higher false non-match rates.
Facial expression: the appearance of a face is directly affected by the person's facial expression, such as anger, surprise, or disgust, as shown in Figure 7d. Additionally, facial hair such as a beard or mustache can change facial appearance, specifically near the mouth and chin regions. Moreover, facial expression causes large intra-class variations. Local-feature-based approaches and 3D-model-based approaches have been designed to handle these facial expression problems.
5. Human-robot interaction (HRI)
Human-robot interaction (HRI) is the study of how people can interact with robots and to what extent robots can be used for successful interaction with human beings. It can also be defined as a field of study dedicated to understanding, designing, and evaluating robotic systems for use by or with humans. In general, interaction is based on communication between, or reaction of, two parties to each other, whether people or things, as shown in Figure 8.
5.1. The importance and the role of person identification in human-robot interaction
Person identification is a very important function for robots that work with humans in the real world. Human identification by a robot may enhance the extent of interaction and communication between the two, since identifying the user involves not only an ID but also other information such as the age, gender, interests or hobbies, and language of each user. Knowing the age of the user helps the robot to choose its tone of voice: a child may prefer a childlike voice to an adult male voice, and vice versa. Addressing a person as "Mr, Ms, Sir, Madam" depends on gender, which is therefore also important. Additionally, identifying the interests or hobbies of the user greatly enhances the interaction, since it is not appropriate to discuss boxing with a person whose interest is ballet. Finally, communicating with a person in his or her native language promotes the interaction.
5.2. The most appropriate biometric traits of a person that can easily be identified by robot
Interaction depends on the extent of communication between robots and humans. A human and a robot can communicate with each other in several forms, and their proximity to each other is the main factor that determines the form of communication. Thus, communication and interaction can be classified into two general categories:
- Remote interaction: the human and the robot are not at the same place and are separated spatially or even temporally (different rooms, countries, or planets).
- Proximate interaction: the humans and the robots are collocated (same room).
The biometric traits that a robot uses to identify the user should be compatible with the aforementioned interaction categories. For remote interaction, biometric traits whose raw data are images, such as face, ear, and iris, are not convenient choices, since the majority of remote interaction is conducted by voice communication. Speech recognition may therefore be the best choice, since it is suitable for both direct (different room) and mobile calling. For proximate (face-to-face) interaction, and in order to make the interaction more natural, the identification process should use a biometric trait that can be captured from a distance without direct contact with the user, such as face, ear, or voice.
6. Multibiometric systems
Some of the limitations imposed by unimodal biometric systems (that is, biometric systems that rely on the evidence of a single biometric trait) can be overcome by using multiple biometric modalities. Increasing the discriminant information and constraints decreases the error in the recognition process. More information can be acquired by using different sources of information simultaneously, and these sources may be of several types, such as multiple biometric traits, algorithms, instances, samples, and sensors. Various scenarios in a multimodal biometric system are demonstrated in Figure 9.
A multibiometric system is defined as a person recognition system that consolidates multiple features acquired from different biometric sources. For example, fingerprint and palmprint traits, the right and left iris of an individual, or two different samples of the same ear may be fused together to recognize the person more accurately and reliably than a unimodal biometric system. Because they use more than one biometric source, multimodal biometric systems can overcome many of the limitations of unimodal systems.
Multibiometric systems are able to compensate for a shortcoming of one source by using another source of information. In addition, the difficulty of circumventing multiple biometric sources simultaneously makes them more reliable than unimodal systems. On the other hand, unimodal biometric systems cost less and require less enrollment and recognition time than multimodal systems. Hence, it is essential to carefully analyze the tradeoff between the added cost and the benefits gained when making a business case for the use of multibiometrics in a specific application, such as commercial or forensic applications and biometric systems that cover a large population.
The information used in the recognition process can be fused at five different levels:
Sensor level fusion: information of the individual is captured by multiple sensors in order to generate new data that is afterward subjected to feature extraction phase. For instance, in the case of iris biometrics, samples from “Panasonic BM-ET 330” and “LG IrisAccess 4000” sensors may be fused to obtain one sample.
Feature level fusion: in this level, the extracted features from multiple biometric sources are fused to obtain a single feature vector that contains rich biometric information about a client. Integration at feature level is expected to offer good recognition accuracy because it detects the correlated feature values generated by different biometric algorithms, thereby identifying a set of distinguished features.
Score level fusion: it is the most commonly used fusion technique because match scores are easy to fuse in multibiometric systems. In score-level fusion, the match scores of multiple classifiers are integrated to produce a single match score, which is used to reach the final decision. Score-level fusion requires score normalization, which converts the scores to a common scale. The fused match score is then calculated with one of three categories of methods, namely likelihood ratio-based score fusion, transformation-based score fusion, and classifier-based score fusion.
Rank level fusion: it is defined as consolidating the ranks assigned by multiple classifiers in order to derive a consensus rank for each identity and establish the final decision. Rank-level fusion provides less information than score-level fusion, and it is relevant in identification mode. The final decision of rank-level fusion is obtained by three well-known methods, namely the Highest Rank, Borda Count, and Logistic Regression methods.
Decision level fusion: the outputs (decisions) of different matchers may be fused to obtain a single/final decision (genuine or imposter in a verification system or the identity of the client in an identification system). A single class label can be obtained by employing techniques like majority voting, behavior knowledge space, etc.
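Three of the fusion levels listed above are simple enough to sketch in code. The example below is an illustrative sketch, not the method of any system cited in this chapter. It assumes min-max normalization followed by a weighted sum as one instance of transformation-based score fusion, a plain Borda count for rank-level fusion, and majority voting for decision-level fusion. All function names and toy scores are hypothetical.

```python
from collections import Counter


def min_max_normalize(scores):
    """Map one matcher's scores to the common scale [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {identity: 0.0 for identity in scores}
    return {identity: (s - lo) / (hi - lo) for identity, s in scores.items()}


def score_level_fusion(score_sets, weights=None):
    """Weighted sum of normalized match scores from several matchers."""
    if weights is None:
        weights = [1.0 / len(score_sets)] * len(score_sets)
    fused = {}
    for weight, scores in zip(weights, score_sets):
        for identity, s in min_max_normalize(scores).items():
            fused[identity] = fused.get(identity, 0.0) + weight * s
    return fused


def borda_count(rankings):
    """Rank-level fusion: each list orders identities from best to worst."""
    points = Counter()
    for ranking in rankings:
        n = len(ranking)
        for position, identity in enumerate(ranking):
            points[identity] += n - 1 - position
    # Consensus ranking: most points first, ties broken alphabetically.
    return [i for i, _ in sorted(points.items(), key=lambda kv: (-kv[1], kv[0]))]


def majority_vote(decisions):
    """Decision-level fusion: the label output by most matchers wins
    (ties go to the label that was seen first)."""
    return Counter(decisions).most_common(1)[0][0]


# Toy match scores on different scales, invented for illustration only.
face_scores = {"alice": 0.82, "bob": 0.40, "carol": 0.35}
voice_scores = {"alice": 12.0, "bob": 15.0, "carol": 3.0}

fused = score_level_fusion([face_scores, voice_scores])
print(max(fused, key=fused.get))
print(borda_count([["alice", "bob", "carol"],
                   ["bob", "alice", "carol"],
                   ["alice", "carol", "bob"]]))
print(majority_vote(["alice", "bob", "alice"]))
```

Normalization matters in the score-level case because the two matchers report scores on different scales; without it, the matcher with the larger numeric range would dominate the sum.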
Among the aforementioned fusion techniques, the most popular are score-level fusion and feature-level fusion. Most person identification systems use these fusion techniques because of their simplicity and high performance. Table 1 compares several state-of-the-art multibiometric systems and lists their details.
|Identification approach||Biometric traits||Databases and challenges||Fusion strategy||Recognition rate (%)|
|Toygar et al. ||Face|
(P, I, E, O, N)
Face + Voice: 94.24
Face + Voice: 97.43
|Eskandari and Toygar ||Iris|
(I, O, N, D)
(P, I, E, O, N)
(I, O, N)
|Feature-level and Score-level fusion|
Face + Iris: 98.66
|Farmanbar and Toygar ||Palmprint|
(P, I, E)
|Feature-level and Score-level fusion|
Palmprint + Face: 99.17
|Hezil and Boukrouche ||Ear|
Palmprint + Ear: 100
|Ghoualmi et al. ||Iris|
Iris + Ear: 99.67
|Telgad et al. ||Face|
Fingerprint-Gabor Filter: 95
Face + Fingerprint: 97.5
|Patil and Bhalke ||Fingerprint|
Fingerprint + Palmprint + Iris = 95.23
The results shown in Table 1 indicate that consolidating different unimodal biometric systems yields a recognition system that is robust against many challenges such as occlusion, pose, and nonuniform illumination. Additionally, the studies presented in Table 1 demonstrate that score-level fusion of more than one biometric trait overcomes the limitations of unimodal biometric systems, and in most of the studies, score-level fusion outperforms feature-level fusion for person identification.
7. Fusion of face and speech traits
Depending on the purpose of the robot, either a unimodal or a multimodal recognition system can be selected for human-robot interaction. For example, a military robot should be more accurate than a household robot. As mentioned in Section 5.2, the one trait that can be used for human identification by a robot in both remote and proximate interaction is voice. On the other hand, the face is the most realistic biometric trait for proximate interaction.
It is appropriate to fuse face and voice in human-robot interaction, since both traits are contactless and can be captured without the user being aware that recognition is being performed. Many studies have shown that the fusion of face and speech is appropriate for many purposes [37, 38, 39]; face and speech are the best choices since neither needs physical or direct contact with sensors [40, 41]. An advantage of speech over face is that speech can be recognized even when the human and the robot are not in the same physical place. This is useful for voice recognition over a mobile phone or when the user and the robot are in two different rooms in the same place. Consequently, a realistic human-robot interaction system is achieved, whether HRI is conducted through face-to-face, blind, or invisible interaction.
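The choice between voice alone and fused face and voice can be expressed as a small selection rule. The sketch below is hypothetical: it assumes that both matchers already output normalized scores in [0, 1], that remote interaction provides voice only, and that a simple weighted sum is used for fusion. The function name `hri_scores` and the parameter `face_weight` are introduced here purely for illustration.

```python
def hri_scores(interaction, voice_scores, face_scores=None, face_weight=0.5):
    """Return per-identity scores for 'remote' or 'proximate' interaction."""
    if interaction not in ("remote", "proximate"):
        raise ValueError("interaction must be 'remote' or 'proximate'")
    # Remote interaction (or no usable face image): fall back to voice only.
    if interaction == "remote" or face_scores is None:
        return dict(voice_scores)
    # Proximate interaction: fuse face and voice scores with a weighted sum,
    # using only identities scored by both matchers.
    return {
        identity: face_weight * face_scores[identity]
        + (1.0 - face_weight) * voice_scores[identity]
        for identity in voice_scores.keys() & face_scores.keys()
    }


# Toy normalized scores, invented for illustration only.
voice = {"alice": 0.6, "bob": 0.7}
face = {"alice": 0.9, "bob": 0.3}

remote = hri_scores("remote", voice)
proximate = hri_scores("proximate", voice, face)
print(max(remote, key=remote.get), max(proximate, key=proximate.get))
```

In this toy case the voice-only decision and the fused decision differ, which illustrates how the second modality can change the outcome when the robot and the user share the same space.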
This chapter discussed multimodal biometrics in the context of human-robot interaction under different challenges. The most commonly used biometric traits, namely face, iris, fingerprint, ear, palmprint, and voice, were discussed. Various challenges such as pose, illumination, expression, aging variations, and occlusion were explained, and many state-of-the-art biometric systems involving these challenges were presented and compared. The comparison of these systems shows that multimodal biometrics overcomes the limitations of unimodal systems and achieves better person identification performance. Additionally, score-level fusion applied to more than one biometric trait obtains higher recognition rates for person identification. Finally, fusion of face and speech is an appropriate choice for human-robot interaction, since the enrollment phase of face and speech biometric systems does not require physical or direct contact with sensors: the face image or speech of a person can be captured by a robot even if the person is far away from it.