Open access peer-reviewed chapter

Posture and Gesture Recognition for Human-Computer Interaction

By Mahmoud Elmezain, Ayoub Al-Hamadi, Omer Rashid and Bernd Michaelis

Published: October 1st 2009

DOI: 10.5772/8221

1. Introduction

Although automatic hand posture and gesture recognition technologies have been applied successfully to real-world applications, several problems must still be solved before Human-Computer Interaction (HCI) can be used more widely. One such problem, which arises in real-time hand gesture recognition, is extracting (spotting) meaningful gestures from a continuous sequence of hand motions. Another is that the same gesture varies in shape, trajectory and duration, even for the same person. A gesture is a spatio-temporal pattern that may be static, dynamic or both. Static configurations of the hands are called postures, and hand movements are called gestures. The goal of gesture interpretation is to advance human-computer communication so that the performance of HCI approaches that of human-human interaction. Sign language recognition is one application area of HCI, in which sign language symbols are detected so that users can communicate with computers. Sign language is categorized into three main groups: finger spelling, word-level signs and non-manual features (Bowden et al., 2003). Finger spelling conveys words letter by letter. Most communication takes place through the word-level sign vocabulary, while non-manual features include facial expressions and mouth and body position.

Posture recognition techniques for finger spelling in sign languages are reviewed here to clarify the research issues. The motivation for this review is to develop a recognition system that works more robustly and achieves high recognition rates. In practice, hand segmentation and the computation of good features are important for recognition. Different models have been used to classify alphabets and numbers in sign language recognition. For example, in (Hussain, 1999), an Adaptive Neuro-Fuzzy Inference System (ANFIS) model is used to recognize Arabic Sign Language. That technique uses colored gloves to avoid the segmentation problem, which helps the system obtain good features. Handouyahia et al. (Handouyahia et al., 1999) present a recognition system for the International Sign Language (ISL) that uses a Neural Network (NN) to train the alphabets. An NN is used for recognition because it can easily be trained on the features computed for the sign languages. Another approach is the Elliptic Fourier Descriptor (EFD) used by Malassiotis and Strintzis (Malassiotis & Strintzis, 2008) for 3D hand posture recognition; their system uses the orientation and silhouettes of the hand to recognize 3D hand postures. Similarly, Licsar and Sziranyi (Licsar & Sziranyi, 2002) used Fourier coefficients to represent hand shape, which enables their system to analyze hand gestures for recognition. Freeman and Roth (Freeman & Roth, 1994) used orientation histograms to classify gesture symbols, but a large amount of training data is needed to solve the orientation problem and to avoid misclassification between symbols.

In the last decade, several methods with potential applications in advanced hand gesture interfaces for HCI have been suggested (Deyou, 2006; Elmezain et al., 2008a; Kim et al., 2007; Mitra & Acharya, 2007; Yang et al., 2007), but they differ from one another in their models. These models include Neural Networks (Deyou, 2006), Hidden Markov Models (HMMs) (Elmezain et al., 2008a; Elmezain et al., 2008b) and Dynamic Time Warping (DTW) (Takahashi et al., 1992). In 1999, Lee et al. (Lee & Kim, 1999) proposed an ergodic model based on an adaptive threshold to spot the start and end points of input patterns and to classify the meaningful gestures by combining all states from all trained gesture models using HMMs. Kang et al. (Kang et al., 2004) developed a method to spot and recognize meaningful movements while concurrently separating unintentional movements from a given image sequence. Alon et al. (Alon et al., 2005) proposed a gesture spotting and recognition algorithm that uses a pruning method, allowing the system to evaluate a relatively small number of hypotheses compared to Continuous Dynamic Programming (CDP). Yang et al. (Yang et al., 2007) presented a method for recognizing whole-body key gestures in Human-Robot Interaction (HRI) using HMMs and a garbage model for non-gesture patterns.

Most previous approaches use a backward spotting technique that first detects the end point of a gesture by comparing the probabilities of the maximal gesture model and the non-gesture model. They then track back through the optimal path to find the start point of the gesture, and the segmented gesture is sent to the recognizer. There is therefore an inevitable time delay between spotting a meaningful gesture and recognizing it, which makes this technique unsuitable for on-line applications. Above all, few researchers have addressed the problem of non-sign patterns (which include out-of-vocabulary signs, epentheses, and other movements that do not correspond to signs) in sign language spotting, because non-sign patterns are difficult to model (Lee & Kim, 1999).

This chapter makes contributions in two parts: the first concerns hand posture and the second hand gesture spotting. For posture recognition, an approach is proposed for recognizing ASL alphabets and numbers that can deal with a large number of hand shapes against complex backgrounds and lighting conditions. The approach is based on Hu-Moments, whose features are invariant to translation, rotation and scaling; geometric features are also incorporated. These feature vectors are used to train a Support Vector Machine (SVM), and the recognition process then identifies the hand posture of segmented hands with the trained SVM. For hand gesture, a robust technique is proposed that performs gesture spotting and recognition simultaneously. The technique recognizes isolated and meaningful hand gestures in stereo color image sequences using HMMs. Color and a 3D depth map are used to detect the hands, and the hand trajectory is then obtained with a robust stereo tracking algorithm to generate 3D dynamic features. This part also covers the design of a non-gesture model, which provides a confidence limit for the likelihoods calculated by the other gesture models. The confidence measures are then used as an adaptive threshold for selecting the proper gesture model or spotting meaningful gestures. The proposed techniques automatically recognize postures and isolated and meaningful hand gestures with superior performance and low computational complexity when applied to several video samples containing confusing situations such as partial occlusion and overlapping. The rest of the chapter is organized as follows. Section 2 formulates Hidden Markov Models and Section 3 Support Vector Machines. Section 4 discusses the posture and gesture approach in three subsections. Experimental results, obtained on image sequences for hand posture and on video sequences for gesture spotting, are given in Section 5. Finally, Section 6 ends with a summary and conclusion.

2. Hidden Markov Models

A Markov model is a mathematical model of a stochastic process that generates random sequences of outcomes according to certain probabilities (Elmezain et al., 2007; Rabiner, 1989). Here the stochastic process is a sequence of feature extraction codewords, and the outcomes are the classification of the hand gesture path. In compact form, a discrete HMM can be written as λ = (A, B, Π) and is described as follows:

  1. The set of states S = {s1, s2, …, sN}, where N is the number of states.

  2. The set of observation symbols V = {v1, v2, …, vM}, where M is the number of distinct symbols observable in each state.

  3. An initial probability Πi for each state, i = 1, 2, ..., N, such that:

     $$\Pi_i = P(s_i \text{ at } t = 1), \qquad \sum_{i=1}^{N} \Pi_i = 1$$

  4. An N-by-N transition matrix A = {aij}, where aij is the probability of taking a transition from state i to state j at moment t:

     $$a_{ij} = P(s_j \text{ at } t+1 \mid s_i \text{ at } t), \qquad \sum_{j=1}^{N} a_{ij} = 1$$

  5. An N-by-M observation symbol matrix B = {bim}, where bim is the probability of emitting symbol vm in state i:

     $$b_{im} = P(v_m \text{ at } t \mid s_i \text{ at } t), \qquad \sum_{m=1}^{M} b_{im} = 1$$

  6. The observation sequence O = {o1, o2, …, oT}, where T is the length of the gesture path.

The HMM-based statistical strategy has many advantages, among them a rich mathematical framework, powerful learning and decoding methods, good sequence-handling capabilities, and a flexible topology for the statistical phonology and the syntax. Its disadvantages are the poor discrimination between models and the unrealistic assumptions that must be made to construct the HMM theory, namely the independence of successive feature frames (input vectors) and the first-order Markov process (Goronzy, 2002). The algorithms developed in the statistical framework for HMMs are rich and powerful, which explains why hidden Markov models are today the most widely used approach in practice for implementing gesture recognition and understanding systems. The main problems that can be solved with HMMs are:

  1. Given the observation sequence O = (o1, o2, …, oT) and a model λ = (A, B, Π), how do we efficiently compute P(O|λ), the probability of the observation sequence given the model? This is the "evaluation problem". The forward and backward procedure provides a solution.

  2. Given the observation sequence O = (o1, o2, …, oT) and the model λ, how do we choose a corresponding state sequence S = (s1, s2, …, sT) that is optimal in some sense (i.e. best explains the observation)? The Viterbi algorithm provides a solution by finding the optimal path.

  3. How do we adjust the model parameters λ = (A, B, Π) to maximize P(O|λ)? This is by far the most difficult problem of HMMs. We choose λ = (A, B, Π) such that its likelihood P(O|λ) is locally maximized, using an iterative procedure such as the Baum-Welch method (Rabiner, 1989).
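The first two problems can be illustrated with a minimal sketch in plain Python. It assumes a discrete HMM given as nested lists (`A`, `B`) and a list `pi`, with observations encoded as integer symbol indices; the function names are illustrative and not taken from the chapter, and no scaling or log-domain arithmetic is applied, so it is only suitable for short sequences.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def forward_probability(A: Matrix, B: Matrix, pi: Sequence[float],
                        obs: Sequence[int]) -> float:
    """Evaluation problem: P(O | lambda) computed with the forward procedure."""
    n = len(pi)
    # Initialisation: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(o_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    # Termination: sum over all final states
    return sum(alpha)


def viterbi_path(A: Matrix, B: Matrix, pi: Sequence[float],
                 obs: Sequence[int]) -> Tuple[List[int], float]:
    """Decoding problem: most likely state sequence and its probability."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    back_pointers = []
    for o in obs[1:]:
        prev = delta
        # Best predecessor state for every state j at this time step
        best = [max(range(n), key=lambda i: prev[i] * A[i][j]) for j in range(n)]
        delta = [prev[best[j]] * A[best[j]][j] * B[j][o] for j in range(n)]
        back_pointers.append(best)
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    # Follow the stored predecessors back to the first time step
    for best in reversed(back_pointers):
        path.append(best[path[-1]])
    path.reverse()
    return path, delta[last]
```

For longer gesture paths, a practical implementation would typically work with scaled or log probabilities to avoid numerical underflow.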

HMMs have three topologies. The first is the Ergodic model (fully connected model), in which every state of the model can be reached in a finite number of steps from every other state (Figure 1(a)). Other types of HMMs have been found to account for observed properties of the signal being modelled better than the standard Ergodic model. One such model is shown in Figure 8. It is called a Left-Right Banded (LRB) model because the underlying state sequence has the property that, as time increases, the state index increases or stays the same (i.e. no transitions are allowed to states whose indices are lower than the current state). Each state in the LRB model can go back to itself or to the next state only. The last topology is the Left-Right (or Bakis) model, in which each state can go back to itself or to the following states. The constraints of the LRB and Bakis models have essentially no effect on the re-estimation procedure, because any HMM parameter set to zero initially remains zero throughout re-estimation.

Figure 1.

(a) Ergodic topology with 4 states. (b) Simplified Ergodic with fewer transitions.
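The three topologies differ only in which entries of the transition matrix are allowed to be non-zero. The following sketch builds an initial transition matrix for each one; the uniform initial values and the `stay` parameter are assumptions made for illustration, not values from the chapter.

```python
from typing import List

Matrix = List[List[float]]


def ergodic_transitions(n_states: int) -> Matrix:
    """Fully connected model: every state can reach every other state."""
    return [[1.0 / n_states] * n_states for _ in range(n_states)]


def lrb_transitions(n_states: int, stay: float = 0.5) -> Matrix:
    """Left-Right Banded model: a state may only stay or move to the next state."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        A[i][i] = stay
        A[i][i + 1] = 1.0 - stay
    A[-1][-1] = 1.0  # the final state can only loop on itself
    return A


def left_right_transitions(n_states: int) -> Matrix:
    """Left-Right model as described in the text: stay or jump to any following state."""
    A = []
    for i in range(n_states):
        allowed = n_states - i  # state i itself plus all following states
        A.append([0.0] * i + [1.0 / allowed] * allowed)
    return A
```

Because Baum-Welch re-estimates each transition from expected counts that are proportional to its current value, the zero entries produced here stay at zero during training, which is what preserves the chosen topology.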

3. Support Vector Machines

A Support Vector Machine (SVM) is a supervised learning method for optimal modelling of the data (Lin & Weng, 2004). It learns a decision function that separates the data classes with the maximum width. Fundamentally, an SVM is a two-class (binary) classifier, but it can be extended to multiclass problems. The literature describes two types of extension. The all-together approach treats the multiclass problem as a single optimization problem; it lacks scalability and faces optimization complexity. The second approach works in a binary fashion, combining multiple hyper-planes into a single classifier. There are two alternatives for this combination: one-against-all and one-against-one. For binary classification, the SVM learns from labelled training samples of the form:

$$(x_1, y_1), \ldots, (x_n, y_n), \qquad y_i \in \{-1, +1\}$$

The SVM's linearly learned decision function f(x) is described as:

$$f(x) = \operatorname{sign}(\langle w, x \rangle + b)$$

where w is a weight vector, b is the threshold and x is the input sample.

The SVM learner defines hyper-planes for the data and finds the maximum margin between them. Because it maximises the separation of the hyper-planes, it is also regarded as a margin classifier. The margin of the hyper-plane is the minimum distance between the hyper-plane and the support vectors, and this margin is maximised. It can be formulated as follows:

$$\gamma = \min_{i} \frac{y_i(\langle w, x_i \rangle + b)}{\lVert w \rVert}, \qquad \max_{w, b} \; \gamma$$

where γ is the margin of the hyper-plane. Maximisation of the margin of the hyper-plane is depicted in Figure 2.

Figure 2.

Margin of the hyper plane.
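As a small illustration of the linear decision function and the margin, the sketch below evaluates both for a given weight vector and threshold. It assumes labels in {-1, +1} and treats the margin as the smallest signed distance of any training sample to the hyper-plane; it does not perform the optimization that finds the maximum-margin hyper-plane, and the function names are illustrative.

```python
import math
from typing import Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, label in {-1, +1})


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed score <w, x> + b of sample x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Class label given by the sign of the decision value."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def geometric_margin(w: Sequence[float], b: float,
                     samples: Sequence[Sample]) -> float:
    """Smallest signed distance between the hyper-plane and any sample."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) / norm for x, y in samples)
```

A negative return value from `geometric_margin` indicates that at least one sample lies on the wrong side of the hyper-plane.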

The SVM maps the input data into a high-dimensional domain where it is, as far as possible, linearly separable, as shown in Figure 3. This mapping does not affect the training time because the dot product is computed implicitly through the kernel trick (Cristianini & Taylor, 2001; Suykens et al., 2005). This is also why the SVM is well suited to problems with a large number of features: it is robust to the curse of dimensionality. A kernel function computes the inner product Φ(x)·Φ(y) directly from the input, so there is no need to represent the mapped feature space explicitly. The kernel function is described mathematically as follows:

$$K(x, y) = \langle \Phi(x), \Phi(y) \rangle$$

The following kernel functions are commonly used to convert the input features into a new feature space.

Linear kernel

$$K(x, y) = \langle x, y \rangle$$

RBF Gaussian kernel

$$K(x, y) = \exp\left(-\frac{\lVert x - y \rVert^2}{2\sigma^2}\right)$$

Polynomial kernel

$$K(x, y) = (\langle x, y \rangle + 1)^d$$

Sigmoid kernel

$$K(x, y) = \tanh(\kappa \langle x, y \rangle + c)$$

where σ is the width of the Gaussian, d is the degree of the polynomial, κ is a scaling factor and c is a shifting factor that controls the mapping. As discussed above, the SVM outputs only the class label for an input sample, not the probability information for the classes. Lin et al. describe a method to compute the class probabilities using SVM. Chang et al. developed the library "LIBSVM", which provides tools for SVM functionality including class probability estimation (Chang & Lin, 2001).

Figure 3.

Mapping from input data to a richer feature space through kernel function.
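The four kernels can be sketched in a few lines of plain Python. The parameter names (`sigma`, `degree`, `coef0`, `kappa`, `c`) and their default values are assumptions chosen for illustration; libraries such as LIBSVM use their own parameterization. The helper `explicit_quadratic_map` is a hypothetical example of a feature map Φ for 2D inputs, included only to show that a kernel reproduces an inner product in the mapped space without constructing it.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(x: Vector, y: Vector) -> float:
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x: Vector, y: Vector) -> float:
    return dot(x, y)


def rbf_kernel(x: Vector, y: Vector, sigma: float = 1.0) -> float:
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def polynomial_kernel(x: Vector, y: Vector, degree: int = 2,
                      coef0: float = 1.0) -> float:
    return (dot(x, y) + coef0) ** degree


def sigmoid_kernel(x: Vector, y: Vector, kappa: float = 1.0,
                   c: float = 0.0) -> float:
    return math.tanh(kappa * dot(x, y) + c)


def explicit_quadratic_map(x: Vector) -> List[float]:
    """Feature map whose inner product equals (<x, y>)^2 for 2D inputs."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]
```

For example, `polynomial_kernel(x, y, degree=2, coef0=0.0)` gives the same value as `dot(explicit_quadratic_map(x), explicit_quadratic_map(y))`, which is the kernel trick in its simplest form.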

SVMs have been studied extensively and are used across a large problem domain, including novelty detection, regression and optimization as well as learning and classification. The basic architecture can be adapted to the problem domain through the margin, kernel type and duality characteristics. The SVM avoids several problems that affect other learners, such as the handling of non-linear functions and the problem of local minima. It not only distinguishes between classes but also learns to separate them optimally. On the other hand, the performance of SVM declines with non-scaled data, and the multi-class solution is still an open research topic (Burges, 1998).

4. Posture and Gesture Approach

In this chapter, an approach is developed for the recognition of hand postures and gestures. In addition, improvements are made to the existing gesture recognition system provided by IESK (Magdeburg University, Germany), whose purpose is to recognize alphabet characters (A-Z) and numbers (0-9). The proposed approach is based on the analysis of stereo color image sequences supported by 3D depth information. A Gaussian distribution detects the skin pixels in the image sequences, and depth information helps it build the region of interest and overcome the difficulties of overlapping regions. A framework is established to extract the posture and gesture features by combining various image processing techniques (Figure 4).

Figure 4.

Simplified structure showing the main modules for the posture and gesture approach.

The statistical and geometrical features computed for the hand are invariant to scale, rotation and translation. These features are used to classify posture symbols. Classification is divided into two steps. The first step forms classes for sets of alphabets for hand posture. In particular, curvature analysis determines the peaks of the hand (i.e. fingertips), which reduces computation by skipping classes that need not be tested for a given posture symbol. This grouping also reduces misclassification, which helps in recognizing the correct symbol. An SVM is then applied to the respective set of classes to train and test the symbols. In the second step, the hand trajectory is obtained using the Mean-shift algorithm and a Kalman filter (Comaniciu et al., 2003) to generate 3D dynamic features for hand gesture. The k-means clustering algorithm (Ding & He, 2004) is then employed to produce the HMM codewords. To spot meaningful gestures (i.e. Arabic numbers from 0 to 9) accurately, a non-gesture model is proposed, which provides a confidence limit for the likelihoods calculated by the other gesture models. The confidence measures are used as an adaptive threshold for spotting meaningful gestures.

4.1. Depth Map

The image acquisition step provides 2D image sequences and depth image sequences. For the skin segmentation of hands and face in stereo color image sequences, an algorithm is used that calculates the depth value in addition to the skin color information. The depth information can be gathered by passive stereo measurement based on cross correlation and the known calibration data of the cameras. The resulting 3D points are grouped into several clusters. The clustering algorithm can be regarded as a kind of region growing in 3D that uses two criteria: skin color and Euclidean distance (Scott, 1992; Niese et al., 2007). This method is more robust to unfavorable lighting and partial occlusion, which occur in real-time environments (for instance, in gesture recognition). The classification of the skin pixels is improved, as shown in Figure 5, by exploiting the depth information, which contains the depth value associated with each 2D image pixel. In the proposed approach, the depth image is used to select the region of interest in the image, which lies in the range from a minimum depth of 30 cm to a maximum depth of 200 cm. The depth range is adaptive and can be changed. The depth information not only narrows the search for the object of interest but also increases the processing speed. Skin pixels detected outside the region of interest are removed. Figure 6(a) and (b) show the normalized 2D and 3D depth images for ranges up to 10 m. The normalized depth images are presented for visualization in the range from 0 to 255. Figure 6(c) and (d) show the normalized 2D and 3D depth images for the range of interest (i.e. from 30 cm to 200 cm). Note that the region of interest should include the hands and face. The improved results obtained using the depth information are shown in Figure 5(c).
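The effect of the depth range on the skin mask can be sketched as a simple per-pixel filter. This is only an illustration of the region-of-interest step, assuming that a binary skin mask and a depth image in centimetres are already available as nested lists of equal size; the function name and the list-based image representation are not from the chapter, and the 3D clustering itself is not shown.

```python
from typing import List, Sequence


def skin_in_depth_range(skin_mask: Sequence[Sequence[int]],
                        depth_cm: Sequence[Sequence[float]],
                        min_depth: float = 30.0,
                        max_depth: float = 200.0) -> List[List[bool]]:
    """Keep a skin pixel only if its depth lies inside the region of interest."""
    return [
        [bool(skin) and min_depth <= depth <= max_depth
         for skin, depth in zip(skin_row, depth_row)]
        for skin_row, depth_row in zip(skin_mask, depth_cm)
    ]
```

Pixels classified as skin but lying nearer than 30 cm or farther than 200 cm are dropped, and the two bounds can be changed to adapt the range.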

Figure 5.

(a) Original 2D image. (b) Skin pixel detection without using depth map. (c) Yellow color shows the detection of skin pixels in the image after applying depth information.

Given the 3D depth map from the camera set-up, the overlapping problem between hands and face is solved, since the hand regions are closer to the camera than the face region (Figures 12 and 13).

Figure 6.

(a) & (b) show the normalized 2D and 3D depth images respectively. (c) & (d) show the normalized 2D and 3D depth images for the region of interest (30 cm to 200 cm). F refers to the face; HL and HR represent the left and right hand respectively.

4.2. Feature Extraction

Selecting good features to recognize the hand posture and gesture path plays a significant role in the performance of any system. The posture and gesture features are therefore described in some detail below.

4.2.1. Posture Features

In the proposed approach, statistical and geometrical features are computed for the hand postures. They are described below.

Statistical Feature Vectors

Hu-Moments (Hu, 1962) are used as statistical feature vectors and are derived from basic moments. More specifically, moments describe the properties of an object's shape statistically. In image analysis, a binarized or grey-level image is regarded as a 2D density distribution function, so that an image segment can be characterized with the help of moments. The properties extracted from the moments are area, mean, variance, covariance and skewness.

Central Moments

If f(x, y) is a digital image of dimension M-by-N, the central moment of order (p+q) is defined as:

$$\mu_{pq} = \sum_{x} \sum_{y} (x - \bar{x})^p (y - \bar{y})^q f(x, y), \qquad \bar{x} = \frac{m_{10}}{m_{00}}, \quad \bar{y} = \frac{m_{01}}{m_{00}}$$

where the raw moment m00 gives the area of the object, m10 and m01 are used to locate the center of gravity of the object, and x̄ and ȳ are the coordinates of the center of gravity (i.e. the centroid). It can be seen from the above equation that central moments are translation invariant.

Normalized Central Moment

The normalized central moments are defined as:

$$\eta_{pq} = \frac{\mu_{pq}}{m_{00}^{\Upsilon}}, \qquad \Upsilon = \frac{p+q}{2} + 1, \qquad p + q \in \{2, 3, \ldots\} \tag{13}$$

By normalizing the central moments, the moments are scale invariant. The normalization is different for different order moments.
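The two invariance properties can be checked numerically with a short sketch. It assumes an image stored as nested lists of pixel values (row index y, column index x) and uses the normalization exponent (p+q)/2 + 1; the function names are illustrative. On a pixel grid, scale invariance holds only approximately and improves with object size.

```python
from typing import Sequence, Tuple

Image = Sequence[Sequence[float]]


def raw_moment(img: Image, p: int, q: int) -> float:
    """m_pq = sum over pixels of x^p * y^q * f(x, y)."""
    return sum((x ** p) * (y ** q) * v
               for y, row in enumerate(img) for x, v in enumerate(row))


def centroid(img: Image) -> Tuple[float, float]:
    """Center of gravity (x_bar, y_bar) = (m10 / m00, m01 / m00)."""
    m00 = raw_moment(img, 0, 0)
    return raw_moment(img, 1, 0) / m00, raw_moment(img, 0, 1) / m00


def central_moment(img: Image, p: int, q: int) -> float:
    """mu_pq, computed relative to the centroid (translation invariant)."""
    x_bar, y_bar = centroid(img)
    return sum(((x - x_bar) ** p) * ((y - y_bar) ** q) * v
               for y, row in enumerate(img) for x, v in enumerate(row))


def normalized_central_moment(img: Image, p: int, q: int) -> float:
    """eta_pq = mu_pq / m00^((p+q)/2 + 1) (approximately scale invariant)."""
    m00 = raw_moment(img, 0, 0)
    return central_moment(img, p, q) / m00 ** ((p + q) / 2 + 1)
```

Shifting an object inside the image leaves `central_moment` unchanged, and doubling the size of a rectangle changes `normalized_central_moment` only by a small discretization error.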


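Hu (Hu, 1962) derived a set of seven moments that are invariant to translation, orientation and scale. They are computed from the second and third order moments. Maitra (Maitra, 1979) extended the Hu invariants to be invariant under image contrast. Later, Flusser and Suk (Flusser & Suk, 1993) derived moment invariants that are invariant under general affine transformation. The Hu-Moments are defined as:

$$\phi_1 = \eta_{20} + \eta_{02}$$

$$\phi_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2$$

$$\phi_3 = (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2$$

$$\phi_4 = (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2$$

$$\phi_5 = (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]$$

$$\phi_6 = (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03})$$

$$\phi_7 = (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]$$

Zero and first order moments are not used in this process. The first six Hu-Moments are invariant to reflection (Davis & Bradski, 1999), whereas the seventh changes sign. The statistical feature vector contains the following set:

$$F_{Hu} = (\phi_1, \phi_2, \phi_3, \phi_4, \phi_5, \phi_6, \phi_7)^T$$

where ϕ1 is the first Hu-Moment; the notation is analogous for all other features in this set.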
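The following sketch computes the seven invariants for a small binary image using the standard Hu formulas. It assumes an image stored as nested lists (row index y, column index x); the function name is illustrative, and a production system would more likely rely on an image processing library.

```python
from typing import List, Sequence


def hu_moments(img: Sequence[Sequence[float]]) -> List[float]:
    """Seven Hu invariants from the normalized central moments of orders 2 and 3."""
    pixels = [(x, y, v) for y, row in enumerate(img)
              for x, v in enumerate(row) if v]
    m00 = sum(v for _, _, v in pixels)
    x_bar = sum(x * v for x, _, v in pixels) / m00
    y_bar = sum(y * v for _, y, v in pixels) / m00

    def eta(p: int, q: int) -> float:
        mu = sum((x - x_bar) ** p * (y - y_bar) ** q * v for x, y, v in pixels)
        return mu / m00 ** ((p + q) / 2 + 1)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = n30 + n12, n21 + n03          # sums that recur in phi_4 to phi_7
    c, d = n30 - 3 * n12, 3 * n21 - n03  # differences that recur in phi_3, phi_5, phi_7
    return [
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        c ** 2 + d ** 2,
        a ** 2 + b ** 2,
        c * a * (a ** 2 - 3 * b ** 2) + d * b * (3 * a ** 2 - b ** 2),
        (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b,
        d * a * (a ** 2 - 3 * b ** 2) - c * b * (3 * a ** 2 - b ** 2),
    ]
```

Rotating the image by 90 degrees leaves all seven values unchanged, while mirroring it preserves the first six and flips the sign of the seventh, in line with the reflection property noted in the text.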

Geometrical Feature Vectors

The geometrical feature set contains two features: circularity and rectangularity. These features are computed to compare the hand shape with standard shapes such as the circle and the rectangle. This feature set varies from letter to letter and is useful for recognizing the alphabets and numbers. The geometrical feature set is:

$$F_{Geo} = (Cir, Rect)^T$$

Circularity: Circularity measures how close the object's shape is to a circle. In the ideal case, a circle gives a circularity of one. The range of circularity varies from 1 to infinity. Circularity Cir is defined as:

$$Cir = \frac{Perimeter^2}{4\pi \times Area}$$

where Perimeter is the length of the hand contour and Area is the total number of hand pixels.
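A minimal numerical check of the circularity measure is given below, assuming the common definition Perimeter² / (4π · Area), which is consistent with the stated ideal value of one for a circle and the range from 1 to infinity. The function name is illustrative.

```python
import math


def circularity(perimeter: float, area: float) -> float:
    """Perimeter^2 / (4 * pi * area): 1 for a circle, larger for other shapes."""
    if area <= 0:
        raise ValueError("area must be positive")
    return perimeter ** 2 / (4.0 * math.pi * area)
```

For a circle of any radius the value is 1, while a square gives 4/π (about 1.27) and elongated or spiky contours, such as a hand with extended fingers, give larger values.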

Rectangularity: Rectangularity measures how close the shape of the object is to a rectangle. The orientation of the object is calculated by computing the angle of all contour points using central moments. The length l and width w are calculated from the difference between the largest and smallest angle in the rotation. In the ideal case, the rectangularity (Rect) is 1 for a rectangle; it varies from 0.5 to infinity and is calculated as:


where Area is the total number of hand pixels, l is the length and w is the width.

The statistical and geometrical feature vectors are combined to form the complete feature set, denoted as:

$$F_{total} = (\phi_1, \phi_2, \ldots, \phi_7, Cir, Rect)^T$$

Ftotal contains all the features used for hand posture recognition.

Curvature Feature

An important feature for the recognition of alphabets is the curvature feature which tells us about the peaks (i.e. fingertips) in hand. Therefore, before classifying the alphabets by SVM, four groups are made according to the numbers of fingertips detected in the hand. For ASL numbers, we classify them with a single classifier.


The features are normalized to keep them in a particular range. The geometrical features range up to infinity and differ greatly from one another, which creates a scaling problem. To keep them in the same range and to combine them with the statistical feature vector, normalization is carried out and is defined as:


where minCir and maxCir are the minimum and maximum circularity of the hand over all classes of feature vectors, and Cirnorm is the normalized circularity. The same notation applies to rectangularity. Hu-Moments are normalized by the following equation:


where ϕi is the ith Hu-Moment feature, and minϕall and maxϕall are the minimum and maximum values from the set of all classes respectively.
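One common way to bring features with very different ranges onto the same scale from their minimum and maximum values is min-max normalization. The sketch below assumes that form; it may differ in detail from the normalization used by the authors, and the function name is illustrative.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Map values linearly so that the minimum becomes 0 and the maximum 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to normalize, return zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Applied separately to the circularity values, the rectangularity values and each Hu-Moment over all classes, this keeps every feature in the range from 0 to 1 before the vectors are passed to the classifier.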

4.2.2. Gesture Features

There are three basic features: location, orientation and velocity. We combine these three basic features and use them as the main feature. A gesture path is a spatio-temporal pattern consisting of hand centroid points (xhand, yhand), whose coordinates in Cartesian space can be extracted directly from the gesture frames. We consider two types of location feature. The first, Lc, measures the distance from the centroid to a point of the hand gesture, because different location features are generated for the same gesture according to the different starting points (Eq. 29). The second, Lsc, is computed from the start point to the current point of the hand gesture path (Eq. 31).

$$Lc_t = \sqrt{(x_{t+1} - C_x)^2 + (y_{t+1} - C_y)^2}, \qquad t = 1, 2, \ldots, T-1 \tag{29}$$

$$(C_x, C_y) = \frac{1}{n}\left(\sum_{t=1}^{n} x_t, \; \sum_{t=1}^{n} y_t\right)$$

$$Lsc_t = \sqrt{(x_{t+1} - x_1)^2 + (y_{t+1} - y_1)^2} \tag{31}$$

where (xt, yt) is the hand centroid in frame t, T represents the length of the hand gesture path and (Cx, Cy) is the center of gravity at point n. So that the method can run in real time, the center of gravity is recomputed after each image frame.

The second basic feature is the orientation, which gives the direction in which the hand moves through space while the gesture is being made. The orientation feature is based on the hand displacement vector at every point and is represented by the orientation with respect to the center of gravity (θ1t), the orientation between two consecutive points (θ2t) and the orientation between the start point and the current hand gesture point (θ3t).

$$\theta_{1t} = \arctan\left(\frac{y_{t+1} - C_y}{x_{t+1} - C_x}\right), \qquad \theta_{2t} = \arctan\left(\frac{y_{t+1} - y_t}{x_{t+1} - x_t}\right), \qquad \theta_{3t} = \arctan\left(\frac{y_{t+1} - y_1}{x_{t+1} - x_1}\right)$$

The third basic feature is the velocity, which plays an important role during the gesture recognition phase, particularly in some critical situations. The velocity V reflects the fact that each gesture is made at a different speed and that the velocity of the hand decreases at the corner points of a gesture path. It is calculated as the Euclidean distance between two successive points divided by the time, measured in video frames:

$$V_t = \sqrt{(x_{t+1} - x_t)^2 + (y_{t+1} - y_t)^2}$$
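The three basic features can be computed frame by frame from the sequence of hand centroids. The sketch below is one possible reading of the description, with several assumptions: the center of gravity is a running mean over the points seen so far (including the current one), angles are obtained with `atan2` and expressed in degrees from 0 to 360 to avoid division by zero, and the time between successive points is one frame. The function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def angle_deg(dy: float, dx: float) -> float:
    """Direction of the vector (dx, dy) in degrees, in the range [0, 360)."""
    return math.degrees(math.atan2(dy, dx)) % 360.0


def gesture_features(path: Sequence[Point]) -> List[Tuple[float, ...]]:
    """Per-frame vectors (Lc, Lsc, theta1, theta2, theta3, V) for a centroid path."""
    x1, y1 = path[0]
    sum_x, sum_y = float(x1), float(y1)
    features = []
    for n in range(2, len(path) + 1):
        (xp, yp), (x, y) = path[n - 2], path[n - 1]
        sum_x += x
        sum_y += y
        cx, cy = sum_x / n, sum_y / n          # running center of gravity
        lc = math.hypot(x - cx, y - cy)        # distance to center of gravity
        lsc = math.hypot(x - x1, y - y1)       # distance to start point
        theta1 = angle_deg(y - cy, x - cx)     # orientation w.r.t. center of gravity
        theta2 = angle_deg(y - yp, x - xp)     # orientation between consecutive points
        theta3 = angle_deg(y - y1, x - x1)     # orientation w.r.t. start point
        v = math.hypot(x - xp, y - yp)         # distance per frame
        features.append((lc, lsc, theta1, theta2, theta3, v))
    return features
```

A path of T points yields T-1 feature vectors, one for each pair of successive frames.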


Each frame contributes a feature vector at time t, (Lct, Lsct, θ1t, θ2t, θ3t, Vt), where the dimension of the space is proportional to the size of the feature vector. A gesture is thus represented as an ordered sequence of feature vectors, which are projected and clustered in the feature space to obtain discrete codewords that are used as input to the HMMs. This is done with the k-means clustering algorithm (Ding & He, 2004; Kanungo et al., 2002), which classifies the gesture pattern into K clusters in the feature space. The algorithm is based on the minimum distance between the center of each cluster and the feature point. We divide the set of feature vectors into a set of clusters. This allows us to model the hand trajectory in the feature space by one cluster. The calculated cluster index is used as input (i.e. observation symbol) to the HMMs. The best number of clusters in a data set is usually not known in advance. To specify the number of clusters K for each execution of the k-means algorithm, we considered K = 28, 29, ..., 37, based on the number of segmented parts in all numbers (0-9), where each straight-line segment is classified into a single cluster.

Suppose we have n trained feature vectors x1, x2, ..., xn, all from the same class, and we know that they fall into K compact clusters, K < n. Let mi be the mean of the vectors in cluster i. If the clusters are well separated, a minimum distance classifier can be used to separate them: x is in cluster i if ‖x − mi‖ is the minimum of all the K distances. The procedure for finding the k means is as follows:

  1. Build up randomly an initial Vector Quantization Codebook for the means m1, m2,..., mk

  2. Until there are no changes in any mean

    1. Use the estimated means to classify each sample of train vectors into one of the clusters mi

    2. for i=1 to K

      1. Replace mi with the mean of all of the samples of trained vector for cluster i

    3. end (for)

  3. end (Until)
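The procedure above can be written as a short plain-Python sketch. It assumes that the initial codebook is drawn at random from the training vectors (with a fixed seed for repeatability) and that an empty cluster simply keeps its previous mean; both choices, and the function names, are illustrative. The `quantize` helper shows how a sequence of feature vectors becomes the sequence of codewords fed to the HMMs.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_cluster(x: Vector, means: Sequence[Vector]) -> int:
    """Index of the mean closest to x (minimum distance classifier)."""
    return min(range(len(means)), key=lambda i: squared_distance(x, means[i]))


def k_means(samples: Sequence[Vector], k: int, seed: int = 0,
            max_iter: int = 100) -> List[List[float]]:
    rng = random.Random(seed)
    # Step 1: random initial codebook taken from the training vectors
    means = [list(m) for m in rng.sample(list(samples), k)]
    # Step 2: repeat until no mean changes (or max_iter is reached)
    for _ in range(max_iter):
        labels = [nearest_cluster(x, means) for x in samples]
        new_means = []
        for i in range(k):
            members = [x for x, label in zip(samples, labels) if label == i]
            if members:
                new_means.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_means.append(means[i])  # empty cluster keeps its old mean
        if new_means == means:
            break
        means = new_means
    return means


def quantize(feature_vectors: Sequence[Vector],
             means: Sequence[Vector]) -> List[int]:
    """Map each feature vector to its cluster index (the HMM observation symbol)."""
    return [nearest_cluster(x, means) for x in feature_vectors]
```

In practice the feature vectors would first be normalized, since distances and angles live on different scales and would otherwise dominate the clustering unevenly.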

A general observation is that different gestures have different trajectories in the cluster space, while instances of the same gesture show very similar trajectories.

4.3. Classification

4.3.1. Hand posture via SVM

In the classification, a symbol is assigned to one of the predefined classes using a fusion of statistical and geometrical feature vectors. A set of thirteen ASL alphabets (i.e. A, B, C, D, H, I, L, P, Q, U, V, W and Y) and seven ASL numbers (i.e. 0-6) are recognized using SVM; they are shown in Figure 7(a) and Figure 7(b) respectively. The classification phase has two parts. In the first part, curvature is analyzed for ASL alphabets; in the second part, an SVM classifier is used for both ASL alphabets and numbers. The numbers are not grouped together with the alphabets because some numbers are very similar to alphabets and are hard to tell apart. For example, 'D' and '1' are the same apart from a small change of the thumb. Therefore, unlike alphabets, ASL numbers are not categorized into groups, and classification is carried out for a single group. In this way, the first part is skipped for ASL numbers, which go through the SVM classifier part only.

Figure 7.

(a) & (b) Sets of ASL alphabets and numbers; rectangles mark the posture signs used.

Curvature Analysis

In the classification phase, we use the number of detected fingertips to create the groups for ASL alphabets. These groups are shown in Table 1. The analysis reduces the number of signs in each group and helps avoid misclassification. In the second part, the SVM classifies the posture signs within the group selected by the detected fingertips.

Group Nr.   Fingers   Posture Symbols
1           0         A, B
2           1         A, B, D, H, I, U
3           2         C, L, P, Q, V, Y

Table 1.

The number of detected fingertips in posture alphabets.

4.3.2. Hand Gesture via HMMs

To spot meaningful gestures, we construct a gesture spotting network as shown in Figure 8. The vocabulary of the network can easily be expanded by adding a new meaningful gesture HMM and then rebuilding the non-gesture model. Below, we describe how to model gesture patterns discriminately and how to model non-gesture patterns effectively. Each reference pattern for the Arabic numbers (0-9) is modeled by an LRB model with 3 to 5 states, depending on its complexity, because an excessive number of states can cause over-fitting if the number of training samples is insufficient compared to the number of model parameters. It is not easy to obtain the set of non-gesture patterns because there are infinite varieties of meaningless motion. So all patterns other than the reference patterns are modeled by a single HMM called a non-gesture model (garbage model) (Lee & Kim, 1999; Yang et al., 2007; Elmezain et al., 2009). The non-gesture model is constructed by collecting the states of all gesture models in the system as follows:

  1. Duplicate all states from all gesture models, each with its output observation probabilities. Then re-estimate those probabilities with a Gaussian distribution smoothing filter so that the states can represent any pattern.

  2. Self-transition probabilities are kept as in the gesture models.

  3. All outgoing transitions are assigned equal probability:

$$\hat{a}_{ij} = \frac{1 - a_{ii}}{N - 1}, \qquad \text{for all } j, \; i \neq j \tag{34}$$

where âij is the transition probability of the non-gesture model from state i to state j, aii is the self-transition probability of state i in the gesture models, and N is the number of states in all gesture models.
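The construction of the non-gesture transition matrix can be sketched as follows. The sketch assumes that each copied state keeps its self-transition probability and that the remaining probability mass, 1 minus that self-transition, is shared equally among the other N-1 states so that every row still sums to one. The function name is illustrative, and the smoothing of the observation probabilities is not shown.

```python
from typing import List, Sequence


def non_gesture_transitions(self_probs: Sequence[float]) -> List[List[float]]:
    """Transition matrix of the non-gesture model.

    self_probs[i] is the self-transition probability of the i-th state
    copied from the gesture models.
    """
    n = len(self_probs)
    if n < 2:
        raise ValueError("at least two states are required")
    return [
        [self_probs[i] if i == j else (1.0 - self_probs[i]) / (n - 1)
         for j in range(n)]
        for i in range(n)
    ]
```

Since every state can reach every other state, the result is an ergodic model; as N grows, each individual outgoing transition becomes small, which is why the non-gesture model scores lower than the dedicated model on a genuine gesture.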

Figure 8.

Gesture spotting network which contains ten gesture models and one non-gesture model with two null states (Start: ST; End: ET).

The non-gesture model (Figure 1(b) & Figure 8) is a weak model for all trained gesture models and represents every possible pattern. For a given gesture, its likelihood is smaller than that of the dedicated gesture model because of the reduced forward transition probabilities. The likelihood of the non-gesture model also provides a confidence limit for the likelihoods calculated by the gesture models. Thereby, we can use this confidence measure as an adaptive threshold for selecting the proper gesture model or for gesture spotting. The number of states of the non-gesture model increases as the number of gesture models increases. Moreover, many states in the non-gesture model have similar probability distributions, which wastes time and space. To alleviate this problem, relative entropy (Cover & Thomas, 1991) is used. The relative entropy is a measure of the distance between two probability distributions.

Consider two random probability distributions P = (p1, p2, ..., pM)T and Q = (q1, q2, ..., qM)T; the symmetric relative entropy D(P||Q) is defined as:


The proposed state reduction is based on Eq. 35 and works as follows:

1. Calculate the symmetric relative entropy between each pair of probability distributions p(l) and q(n) of states l and n, respectively.


2. Determine the state pair (l, n) with the minimum symmetric relative entropy D(P(l)||Q(n)).

3. Recalculate the output probability distribution by merging these two states over the M discrete observation symbols as:


4. If the number of states is greater than a threshold value, go to step 1; otherwise, re-estimate the output probability distributions with a Gaussian distribution smoothing filter so that the states can represent any pattern.
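The reduction loop in steps 1 to 4 can be sketched as below. Two details are assumptions, since the corresponding formulas are not reproduced in this text: the symmetric relative entropy is taken in its common form, half the sum of the two directed Kullback-Leibler divergences, and two states are merged by averaging their output distributions. The function names, the `eps` guard against zero probabilities, and the omission of the final smoothing filter are choices made for this example only.

```python
import math


def symmetric_relative_entropy(p, q, eps=1e-12):
    """Assumed form: 0.5 * sum(p*log(p/q) + q*log(q/p)) over M symbols."""
    total = 0.0
    for p_k, q_k in zip(p, q):
        p_k = max(p_k, eps)  # avoid log(0) for empty bins
        q_k = max(q_k, eps)
        total += p_k * math.log(p_k / q_k) + q_k * math.log(q_k / p_k)
    return 0.5 * total


def reduce_states(distributions, max_states):
    """Merge the closest pair of states until max_states remain."""
    if max_states < 1:
        raise ValueError("max_states must be at least 1")
    states = [list(d) for d in distributions]
    while len(states) > max_states:
        best = None  # (distance, l, n) of the closest pair found so far
        for l in range(len(states)):
            for n in range(l + 1, len(states)):
                d = symmetric_relative_entropy(states[l], states[n])
                if best is None or d < best[0]:
                    best = (d, l, n)
        _, l, n = best
        # Assumed merge rule: average the two output distributions.
        states[l] = [(a + b) / 2.0 for a, b in zip(states[l], states[n])]
        del states[n]
    return states
```

The pairwise search is quadratic in the number of states, which is acceptable here because the reduction is done once, offline, when the non-gesture model is built.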

The proposed gesture spotting system contains two main modules: a segmentation module and a recognition module. In the gesture segmentation module, we use a sliding window which calculates the observation probability of all gesture models and of the non-gesture model for the segmented parts. The start (end) point of a gesture is spotted from the competitive differential observation probability value between the maximal gesture model (λg) and the non-gesture model (Figure 9). The maximal gesture model is the gesture model whose observation probability p(O|λg) is the largest among all ten gesture models. When this differential value changes from negative to positive (Eq. 38, O can possibly be gesture g), the gesture starts. Similarly, the gesture ends around the time that this value changes from positive to negative (Eq. 39, O cannot be a gesture).


Figure 9.

Simplified structure showing the main module for hand gesture spotting via HMMs.
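The start and end point rule of the segmentation module can be sketched as a sign-change detector. This is an illustrative sketch only: it assumes that the sliding window has already produced, for each time step, one log-likelihood per gesture model and one for the non-gesture model. The function name and the input layout are hypothetical.

```python
def spot_gestures(gesture_scores, non_gesture_scores):
    """Return (start, end) time steps of spotted gestures.

    gesture_scores[t] holds the log-likelihoods of all gesture models
    at step t; non_gesture_scores[t] is the non-gesture log-likelihood.
    A gesture starts when the difference between the best gesture model
    and the non-gesture model turns positive and ends when it turns
    non-positive again. A sequence that is already positive at the
    first step produces no start point in this simplified version.
    """
    segments = []
    start = None
    previous = None
    for t, (scores, ng) in enumerate(zip(gesture_scores, non_gesture_scores)):
        diff = max(scores) - ng
        if previous is not None:
            if previous <= 0 < diff and start is None:
                start = t
            elif previous > 0 >= diff and start is not None:
                segments.append((start, t))
                start = None
        previous = diff
    return segments
```

Between a spotted start and end point, the recognition module would be run on the accumulated observations.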

After a start point is spotted in the continuous image sequence, the gesture recognition module is activated. It performs the recognition task for the segmented part accumulatively until it receives the gesture end signal. At this point, the type of the observed gesture is decided by the Viterbi algorithm frame by frame. The following steps show how the Viterbi algorithm works on gesture model λg:

1. Initialization:


2. Recursion (accumulative observation probability computation):


3. Termination:


where $a_{ij}^{g}$ is the transition probability from state i to state j, $b_{j}^{g}(o_t)$ refers to the probability of emitting o at time t in state j, and $\delta_{t}^{g}(j)$ is the maximum likelihood value in state j at time t.
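The three steps can be sketched in a few lines. Since the step formulas are not reproduced in this text, the sketch uses the textbook form of the Viterbi recursion for a discrete HMM, and it terminates by taking the maximum over all states; both points are assumptions about the details. The names `pi`, `A`, `B` and `recognize` are chosen for this example.

```python
def viterbi_likelihood(pi, A, B, observations):
    """Best-path likelihood of a discrete observation sequence.

    pi[i]   : initial probability of state i
    A[i][j] : transition probability from state i to state j
    B[j][o] : probability of emitting symbol o in state j
    """
    if not observations:
        raise ValueError("observation sequence must not be empty")
    n = len(pi)
    # 1. Initialization with the first observation.
    delta = [pi[i] * B[i][observations[0]] for i in range(n)]
    # 2. Recursion: accumulate the best path probability frame by frame.
    for o in observations[1:]:
        delta = [
            max(delta[i] * A[i][j] for i in range(n)) * B[j][o]
            for j in range(n)
        ]
    # 3. Termination: likelihood of the best path ending in any state.
    return max(delta)


def recognize(models, observations):
    """Return the label of the model with the highest Viterbi likelihood."""
    return max(
        models,
        key=lambda g: viterbi_likelihood(*models[g], observations),
    )
```

For long sequences, a practical implementation would work with log-probabilities to avoid numerical underflow.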

5. Experiments Discussion

A method for the detection and segmentation of hands in stereo color images with complex backgrounds is used, where hand segmentation and tracking rely on a 3D depth map, color information, a Gaussian Mixture Model (GMM) (Elmezain et al., 2008b; Ming-Hsuan & Narendra, 1999; Phung et al., 2002) and the mean-shift algorithm in conjunction with a Kalman filter (Comaniciu et al., 2003). Firstly, the segmentation of skin colored regions becomes robust if only the chrominance is used in the analysis. Therefore, the YCbCr color space is used in our approach, where the Y channel represents brightness and the (Cb, Cr) channels represent chrominance. We ignore the Y channel to reduce the effect of brightness variation and use only the chrominance channels, which fully represent the color information. A large database of skin and non-skin pixels is used to train the Gaussian model. The training set contains 18972 skin pixels from 36 persons of different races and 88320 non-skin pixels from 84 different images. The GMM technique begins with modeling skin using the skin database, where a variant of the k-means clustering algorithm performs the model training to determine the initial configuration of the GMM parameters.
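The chrominance-only skin model can be illustrated with a small sketch. The RGB to CbCr conversion uses the full-range ITU-R BT.601 (JFIF) coefficients, which is an assumption about the conversion used in the experiments. The mixture parameters in the usage note are hypothetical placeholders and not values trained on the skin database, and the k-means initialization is not shown.

```python
import math


def rgb_to_cbcr(r, g, b):
    """Full-range BT.601 chrominance; the Y channel is discarded."""
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return cb, cr


def gaussian_2d(x, mean, cov):
    """Density of a bivariate Gaussian with a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Squared Mahalanobis distance via the closed-form 2x2 inverse.
    m = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * m) / (2.0 * math.pi * math.sqrt(det))


def gmm_likelihood(x, components):
    """components: list of (weight, mean, covariance) tuples."""
    return sum(w * gaussian_2d(x, mean, cov) for w, mean, cov in components)
```

A pixel would be labeled as skin when `gmm_likelihood(rgb_to_cbcr(r, g, b), components)` exceeds a chosen threshold, for example with a hypothetical single component `(1.0, (110.0, 150.0), ((80.0, 0.0), (0.0, 60.0)))`.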

Additionally, blob analysis is used to derive the hand boundary area, the bounding box and the hand centroid point (Figure 12 & 13). Secondly, after localization of the hand target from the segmentation step, we compute its color histogram with an Epanechnikov kernel (Comaniciu et al., 2003). This kernel assigns smaller weights to pixels further from the center, which increases the robustness of the density estimation. To find the best match of the hand target in sequential frames, the Bhattacharyya coefficient (Khalid et al., 2006) is used to measure the similarity by maximizing the Bayes error arising from the comparison of the hand target and the candidate. We take into consideration the mean depth value of the hand region computed from the previous frame to resolve overlaps between the hands and the face. The mean-shift procedure is defined recursively and performs the optimization to compute the mean-shift vector. After each mean-shift optimization, which gives the measured location of the hand target, the uncertainty of the estimate can also be computed, followed by the Kalman iteration, which yields the predicted position of the hand target. Thereby, the hand gesture path is obtained by taking the correspondences of the detected hand between successive image frames (Figure 12). The input images were captured by a Bumblebee stereo camera system with a 6 mm focal length at 15 FPS and an image resolution of 240 × 320 pixels; the system is implemented in Matlab and C++. Our experiments cover isolated gesture recognition and meaningful gesture spotting tests.
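The similarity measure used for matching the hand target can be sketched as follows. The sketch assumes that each pixel has already been assigned a color bin index and a squared distance from the region center, normalized so that the region border lies at 1. The kernel profile is left unnormalized because the histogram is normalized afterwards, and all function names are chosen for this example.

```python
import math


def epanechnikov_weight(r2):
    """Epanechnikov profile: weight falls linearly with squared distance."""
    return 1.0 - r2 if r2 <= 1.0 else 0.0


def weighted_histogram(bin_indices, r2_values, n_bins):
    """Kernel-weighted color histogram, normalized to sum to one."""
    hist = [0.0] * n_bins
    for u, r2 in zip(bin_indices, r2_values):
        hist[u] += epanechnikov_weight(r2)
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two normalized histograms (0 to 1)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))
```

A coefficient close to 1 means that the candidate histogram is nearly identical to the target histogram; the mean-shift iterations move the candidate window towards the location that maximizes this value.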

5.1. Experimental results

5.1.1. Hand Posture

For training, a database is built which contains 3000 samples of posture symbols taken from eight persons on a set of thirteen ASL alphabets and seven numbers. Classification results are based on 2000 test samples from five persons, and the test data is entirely different from the training data. The computed feature set is invariant to translation, orientation and scaling; therefore, posture signs are tested for these properties, which is an important contribution of this work. The experimental results show the probability of posture classification for each class in a group, obtained for the test data by analyzing the confusion matrices. The calculated results include the test posture samples (i.e. alphabets and numbers) with rotation, scaling and under occlusion. The diagonal elements in the confusion matrices represent the percentage probability of each class in the group. Misclassifications between the different classes are shown by the non-diagonal elements. The feature vector set for posture recognition contains the statistical feature vectors and the geometrical feature vectors, so the confusion matrix computed from these features gives an insight into how similar different posture symbols are to each other. The confusion matrices and classification probabilities of the groups for ASL alphabets are described here:

Group 1 (No Fingertip Detected): Table 2 shows the confusion matrix of ASL alphabet ‘A’ and ‘B’. It is to be noted that there is no misclassification between these two classes. It shows that these posture symbols are very different from each other.


Table 2.

Confusion Matrix for no fingertip detection. The alphabets in this group are completely different from one another.

Group 2 (One Fingertip Detected): Table 3 shows the confusion matrix of the classes with one fingertip detected. The misclassification results show the tendency of a posture symbol towards its nearby posture class. Posture symbols are tested at different orientations and with back and forth movements. It can be seen that alphabet ‘A’ has the least misclassification with the other posture symbols because it is different from the other postures in this group. ‘H’/‘U’ has the maximum misclassification with the other posture alphabets. It is observed that the misclassification of ‘H’/‘U’ with ‘B’ occurs during the back and forth movement. In general, there are very few misclassifications between these posture signs because the features are translation, rotation and scale invariant.


Table 3.

Confusion Matrix of the alphabets for one detected fingertip.

Group 3 (Two Fingertips Detected): Table 4 shows the confusion matrix of the classes with two fingertips detected. The posture symbols in this group are tested for scaling and rotation. The presented results show that the highest misclassification exists between ‘P’ and ‘Q’, because these two signs are not very different in shape and geometry. Besides, the statistical features in this group are not very different from each other. Therefore, a strong correlation exists between the symbols in this group, which leads to misclassifications between them.


Table 4.

Confusion Matrix for the signs having two fingertips detected.

Group 4 (Three Fingertips Detected): The posture symbol ‘W’ only falls in the category of three fingertips detections. Therefore, it always results in the classification of alphabet ‘W’.

ASL Numbers: Table 5 shows the confusion matrix of the classes for ASL numbers, which are tested for scaling and rotation. The presented results show the least misclassification of number ‘0’ with the other classes because its geometrical features are entirely different from those of the other classes. The highest misclassification exists between numbers ‘4’ and ‘5’, as these signs are very similar (i.e. the thumb in number ‘5’ is open). Other misclassifications exist between the numbers ‘3’ and ‘6’.


Table 5.

Confusion Matrix of ASL numbers. The maximum and the minimum classification percentages are for the numbers ‘0’ and ‘5’.

The classification results based on statistical and geometrical feature vectors for posture recognition are shown in Figure 10. This result shows that the SVM clearly defines the boundaries between different classes in a group. In this figure, the Y-axis shows the probability of the classes and the X-axis represents the time domain (frames). The probabilities computed by the SVM for the resultant posture alphabets are higher due to the separation between the posture classes in the respective group.

Test Sequence 1 with Classification Results: In Figure 10, the major part of the graph covers the posture signs ‘A’ and ‘B’, and some frames at the end show the posture symbol ‘D’. Posture signs ‘A’ and ‘B’ are categorized in two groups (i.e. no fingertip detected and one fingertip detected). The symbols in this sequence are tested for rotation and back and forth movement. During occlusion, posture symbol ‘B’ is detected and recognized robustly. Figure 10 presents the test sequence with the detected contour and fingertips of the left hand. The left hand and the right hand can present different posture signs, but the results presented here show only the left hand. It can be seen that the features of the posture signs are not much affected by rotation, scaling and occlusion. Figure 11 presents the classification probabilities for the test sequence in Figure 10. The classification gives good results because the probability of the resultant class is high with respect to the other classes. The discrimination power of the SVM can be seen from this behavior, and it classifies the posture signs ‘A’ and ‘B’ correctly. In the sequence, the posture sign changes from ‘A’ to ‘B’ at frame 90, followed by another symbol change from ‘B’ to ‘D’ at frame 380. Posture sign ‘B’ is detected robustly despite changes in orientation, scaling and occlusion. However, misclassifications between the groups can be seen in the graph due to false fingertip detection and segmentation. For example, in the frames where no fingertip is detected, posture signs ‘A’ and ‘B’ are classified correctly, but misclassifications with other signs are observed in the group with one fingertip detected.

Figure 10.

(a) The graph shows the feature set of the posture signs ‘A’, ‘B’ and ‘D’. (b) Test sequence “ABD-Sequence” for the posture signs ‘A’, ‘B’ and ‘D’ with different rotations and scaling. Yellow dotted circles show rotation, whereas back and forth and scaling movements are shown by red dotted circles.

Figure 11.

Classification probability of the test sequence. The blue curve shows the highest probability in the initial frames, which classifies ‘A’; the classification of the ‘B’ sign is shown by the brown curve, and ‘D’ is shown in the last frames by the light green curve.

5.1.2. Hand Gesture

In our experiments, each isolated gesture number from 0 to 9 was based on 60 video sequences, of which 42 video samples were used for training by the Baum-Welch algorithm and 18 video samples for testing (in total, our database contains 420 video samples for training and 180 video samples for testing). The gesture recognition module matches the tested gesture against the database of reference gestures to decide which class it belongs to.

Figure 12.

(a) & (b) Isolated gesture ‘3’ with the three highest priorities, where the probability of the non-gesture model before and after state reduction is the same.

The highest priority was computed by the Viterbi algorithm to recognize the numbers in real time, frame by frame, over the LRB topology with a number of states ranging from 3 to 5 depending on gesture complexity. We evaluated the gesture recognition for cluster numbers from 28 to 37, based on the number of segmented parts in all numbers (0-9), where each straight-line segment is classified into a single cluster. Our experiments showed that the optimal number of clusters is 33, where the highest recognition rate is achieved. Figure 12(a) & (b) show the isolated gesture ‘3’ with the three highest priorities, where the probability of the non-gesture model before and after state reduction is the same (the number of states of the non-gesture model is 40 before reduction and 28 after reduction). Additionally, our database contains 280 video samples of continuous hand motion. Each video sample contains one or more meaningful gestures. We measured the gesture spotting accuracy for sliding window sizes from 1 to 8 (Figure 13(a)). The gesture spotting accuracy initially improves as the sliding window size increases, but degrades as the window size increases further. Therefore, the optimal sliding window size was empirically found to be 5. The result of spotting one meaningful gesture ‘6’ is also shown in Figure 13(a), where the start point is detected at frame 15 and the end point at frame 50.

Figure 13.

(a) One meaningful gesture spotting ‘6’ with spotting accuracy for different sliding window sizes (1-8). (b) Gesture spotting ‘78’ where the mean-shift iteration is 1.52 per frame.

Figure 13(b) shows the result for a continuous gesture path that contains two meaningful gestures, ‘7’ and ‘8’. In addition, the mean-shift iteration of the continuous gesture path ‘78’ is 1.25 per frame, which makes it suitable for real-time implementation. In the automatic gesture spotting task, there are three types of errors, namely insertion, substitution and deletion. An insertion error occurs when the spotter detects a nonexistent gesture. A substitution error occurs when a meaningful gesture is classified falsely. A deletion error occurs when the spotter fails to detect a meaningful gesture. Here, we note that some insertion errors cause substitution or deletion errors, and that insertion errors affect the gesture spotting ratio directly. The reliability of the automatic gesture spotting approach is computed by Eq. 43 and reaches 94.35% (Table 6).

$$\text{Reliability} = \frac{\#\text{ of correctly recognized gestures}}{\#\text{ of test gestures} + \#\text{ of insertion errors}} \times 100\% \tag{43}$$

Gesture path   Train Data   Spotting meaningful gestures results
                            Test   Insert   Delete   Substitute   Correct   Rel. (%)

Table 6.

Result of spotting meaningful hand gestures for numbers from 0 to 9 using Hidden Markov Models.
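The reliability measure of Eq. 43 is a one-line computation, sketched here for clarity. The function name is chosen for this example, and the counts in the usage note are made-up numbers for illustration, not the values of Table 6.

```python
def spotting_reliability(correct, test_gestures, insertions):
    """Reliability in percent: insertion errors enlarge the denominator."""
    return 100.0 * correct / (test_gestures + insertions)
```

For instance, 90 correctly recognized gestures out of 100 test gestures with 5 insertion errors would give `spotting_reliability(90, 100, 5)`, roughly 85.7%. This is lower than the plain 90% recognition rate because falsely detected gestures are penalized.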

6. Summary and Conclusion

This chapter is divided into two parts: the first part deals with hand posture recognition and the second part with hand gesture spotting. For hand posture, the database contains 3000 samples for training the posture signs and 2000 samples for testing. The recognition process identifies the hand shape using an SVM classifier on the features computed from the segmented hands. The recognition rate is 98.65% for the thirteen ASL alphabets and 98.60% for the seven ASL numbers. For hand gestures, an automatic hand gesture spotting approach for Arabic numbers from 0 to 9 in stereo color image sequences using HMMs is proposed. The gesture spotting network finds the start and end points of meaningful gestures that are embedded in the input stream from the difference between the observation probability values of the maximal gesture model and the non-gesture model. Moreover, it performs the hand gesture spotting and recognition tasks simultaneously, which makes it suitable for real-time applications and removes the time delay between the segmentation and recognition tasks. The database for hand gestures contains 60 video sequences for each isolated gesture number (42 video sequences for training and 18 video sequences for testing) and 280 video sequences for continuous gestures. The results show that the proposed approach can successfully recognize isolated gestures and spot meaningful gestures that are embedded in the input video stream with 94.35% reliability. In short, the proposed approach can automatically recognize postures as well as isolated and meaningful hand gestures with superior performance and low computational complexity when applied to several video samples containing confusing situations such as partial occlusion.


This work was supported by Transregional Collaborative Research Centre SFB/TRR 62 "Companion-Technology for Cognitive Technical Systems" funded by the German Research Foundation (DFG).

How to cite and reference


Mahmoud Elmezain, Ayoub Al-Hamadi, Omer Rashid and Bernd Michaelis (October 1st 2009). Posture and Gesture Recognition for Human-Computer Interaction, Advanced Technologies, Kankesu Jayanthakumaran, IntechOpen, DOI: 10.5772/8221. Available from:
