1. Introduction
At present, robotics researchers are increasingly developing social robots for a variety of application domains. Socially intelligent robots are capable of natural interaction with a human by engaging in complex social functions. The challenging issue is how to transfer these social functions into a robot. This requires the development of computational modalities with intelligent and autonomous capabilities for reacting to a human partner within different contexts. More importantly, a robot needs to interact with a human partner through human-trusted social cues, which create the interface for natural communication. To achieve these goals, robotics researchers have proposed a variety of concepts that are biologically inspired or based on theoretical concepts from psychology and cognitive science. Recent robotics research has transferred social behaviors into a robot through imitation-based learning (Ito et al., 2007) (Takano & Nakamura, 2006), and the related learning algorithms have helped in acquiring a variety of natural social cues. The acquired social behaviors equip robots for natural and trusted human interaction, which can be used to develop a wide range of robotic applications (Tapus et al., 2007).
The transference of a variety of skills into a robot involves several small but essential processes: an efficient medium for gathering human motion precisely, the elicitation of the key characteristics of the motion, a generic approach to generating robot motion from those key characteristics, and an approach to evaluating the generated robot motions or skills. The medium used for gathering human motion is a crucial factor in obtaining an agent's motion with little noise. Current imitation research has explored ways of capturing accurate human motions for robot imitation through a motion capture system (Calinon & Billard, 2007(a)) or through image processing techniques (Riley et al., 2003). A motion capture system provides accurate data that is less noisy than data from image processing techniques (Calinon & Billard, 2007(b)). However, approaches using existing motion capture systems or image processing techniques face practical problems. For example, when using a current motion capture system, markers must be placed on the subject's body, which sometimes makes it uncomfortable to express natural motion. Also, image processing techniques utilize more than five cameras to detect human motions, and processing information from that many cameras simultaneously is technically difficult.
The earlier stage of imitation research (Hovel et al., 1996) (Ikeuchi et al., 1993) focused on action recognition and detection of task sequences to teach a demonstrator's task to robots. This work mostly developed perceptual algorithms for visual recognition and analysis of human action sequences. Perceptions were segmented into actions that define the demonstrator's sub-tasks, and these sub-tasks (sequences) were repeated by the robot's arm. Dealing only with a robot's arm made motion generation more convenient than it is for a robot's whole body. A human's body motions are complex when performing tasks or behaviors: the angles of the body parts change dynamically (the kinematics of body motion), and the body angles are related to one another. To transfer a demonstrator's motions into a robot, we must consider the above points, including the characteristics of motions.
In essence, an imitation approach must identify the characteristics of an agent's motion: the speed of the motion, the acceleration of the motion, the distribution of motions, the changing points of motion direction, etc. Since recent robotic platforms have focused on developing suitable mathematical models for extracting the characteristics of human motion, these extractions have become convenient for transferring human motion into a robot (Aleotti & Caselli, 2005) (Dillmann, 2004). Kuniyoshi et al. (1994) proposed a robot imitation framework that reproduces a performer's motion by observing the characteristics of motion patterns. A robot has also reproduced a complex motion pattern through a recurrent neural network model.
Inamura et al. (2004) proposed a robot learning framework based on motion segmentation. A Hidden Markov Model (HMM) was applied to the motion segments to acquire proto-symbols that represent body motion. These motion segments with their proto-symbols were then used to generate a robot's motions. A problem with these contributions is that the patterns of motion were identified by observing the entire motion in each time interval. Instead of identifying the characteristics of motion via observation, it is important to design a mathematical model that selects the characteristics of motion autonomously.
Another line of work uses motion primitives as a framework for robot learning of complex human motions (Kajita et al., 2003) (Mataric, 2000). Recognizing the basic motion primitives in each time interval is a decisive issue, because the whole robotic motion is generated by combining the extracted motion primitives. In (Shiratori et al., 2004), the proposed robot learns dancing through motion primitives, under the imposed assumption that an entire dance motion is a combination of fixed motion primitives. To detect the motion primitives, the speed of the hands and legs during dancing and the rhythm of the music are used. Most extracted motion primitives are not meaningful and are difficult to replicate. Motion primitive-based techniques must cope with a variety of problems when motion primitives are extracted. There is a need to define diverse motion primitives and to produce the whole motion from the defined motion primitives. This procedure can yield motion patterns that are dissimilar to the original agent's motions. Also, a motion primitive-based technique has to rely on the start and end points of each motion primitive to generate a robot's motion accurately, and determining these points is contested and arduous in this field.
Calinon & Billard (2007(c)) proposed a robot imitation algorithm that projects motion data into a latent space, and the resulting data is modeled by a Gaussian Mixture Model (GMM) in order to generate the robot's motion. In addition, a demonstrator refines the motion while the robot reproduces the skills. Several statistical techniques, together with the demonstrator's motion and a motion-refinement strategy, were employed for generating the robot's motions. The approach must process the demonstrator's motion and the most recent motion-refinement information simultaneously in order to implement the imitation task successfully. We believe this makes the imitation task too complicated, and that another mathematical approach, one that combines the demonstrator's motion with a motion-refinement task (the robot's motor information) for determining the robot's motions, should be considered. The main emphasis of our robot imitation algorithm is that it relies on less motion data (selected symbolic postures); in addition, it is necessary to account for the robot's limitations and environment using a simple mathematical framework in order to imitate human motion precisely.
In our approach, the robot does not use an agent's entire body motion to generate its motion. Instead, it selects suitable symbolic postures through dissimilarity values, without any prior knowledge of social cues, and re-generates its motion from them. Most existing imitation research attempts to transfer an agent's entire motion without considering a robot's limitations (e.g., motor information, body angles, and limits on the robot's motion). Such methods are applicable only in predefined contexts and are inconvenient as a general framework for robot imitation in different contexts.
In contrast, our approach aims to extract symbolic postures, and from these elicited postures the robot generates the rest of the motion while taking its own limitations into account. Therefore, our proposed approach attempts to generate robot motion in different contexts without changing the general framework. Reinforcement Learning (RF) (Kaelbling et al., 1996) is utilized for finding optimal symbolic postures between two selected consecutive dissimilar postures.
2. Human motion tracking
Our approach needs to acquire a human's motion information in order to transfer natural social cues into a robot. To accomplish this, we propose a single-camera image processing technique to accurately obtain an agent's upper body motion. We attach a small color patch to the agent's head, right shoulder, right elbow, right wrist, body/navel, left wrist, and left elbow (see Fig. 1). Through these markers, we estimate the agent's 12 upper body angles: hip front angle, shoulder front/rear angle (both left and right hand), shoulder twist angle (both left and right hand), elbow angle (both left and right hand), head front angle, neck twist angle, and neck tilt angle (see Fig. 1 for more details).
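As an illustration of how a joint angle can be recovered from tracked color patches, the following minimal sketch computes the angle at a middle marker from three 2D image positions. It is a planar simplification under our own assumptions: the function name `joint_angle` and the pixel coordinates are hypothetical, and the sketch does not include the correction for the angle between the body and the camera.

```python
import math

def joint_angle(a, b, c):
    """Return the angle at point b (radians) between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0.0:
        raise ValueError("markers coincide; the angle is undefined")
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))

# Hypothetical pixel coordinates (x, y) of three color patches in one frame.
shoulder, elbow, wrist = (100.0, 100.0), (100.0, 200.0), (200.0, 200.0)
elbow_angle = joint_angle(shoulder, elbow, wrist)
print(f"elbow angle: {elbow_angle:.3f} rad")
```

In this toy frame the upper arm and forearm are perpendicular in the image plane, so the computed elbow angle is a right angle.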
3. The extraction of symbolic postures
In this paper, we propose an approach capable of learning and eliciting the segmentation points of a motion through posture dissimilarity values, without any prior knowledge of the motion.
Our approach assumes that the postures (points) with the highest dissimilarity are those that can change the direction or the pattern of the motion. Here we assume that the characteristics of a posture can be extracted from the 12 upper body angles, using the mean and variance of the posture in each frame. The posture dissimilarity values are computed from the correlation of two consecutive postures. In this phase we explore the possible key-motion points which are capable of changing the motion pattern or motion direction.
First, we estimated the dissimilarity of two consecutive postures, and the highest dissimilarity values were used to elicit dissimilar postures from the entire motion. During this phase, we selected only the postures with higher dissimilarity, i.e., those that fulfill the condition $0.8 < \rho_{i,i+1} \le 1$. Then, the earliest postures of consecutive postures were selected; for example, if posture number
The significance of our approach was to estimate the possible key-motion points which are common for 12 upper body angles.
However, a study by Calinon & Billard (2007(d)) showed that it was necessary to consider each of the joint angles separately for extracting key-motion points. We believe that we have to consider the structure of the posture (the combination of joint angles) to elicit key-motion points, since a posture provides information about how the joint angles are related in a particular frame. Accordingly, the selected key-motion points were considered as segmentation points of the demonstrator's motions.
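The selection step of this section can be sketched as follows. Since the dissimilarity formula itself is not reproduced in this text, the sketch assumes a stand-in measure that maps the Pearson correlation r of two consecutive posture vectors to (1 - r) / 2, so that values lie in [0, 1]; the function names and the toy three-angle postures are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length angle vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0  # assumption: a constant posture vector counts as uncorrelated
    return cov / (sx * sy)

def dissimilarity(p, q):
    # Assumed stand-in for the dissimilarity measure: maps r in [-1, 1] to [0, 1].
    return (1.0 - pearson(p, q)) / 2.0

def candidate_key_postures(postures, threshold=0.8):
    """Indices i whose posture is highly dissimilar to posture i+1
    (the earlier posture of each dissimilar pair is kept)."""
    return [i for i in range(len(postures) - 1)
            if dissimilarity(postures[i], postures[i + 1]) > threshold]

# Toy motion with 3 angles per posture (the approach itself uses 12).
motion = [
    [0.1, 0.5, 0.9],
    [0.2, 0.6, 1.0],  # same pattern as frame 0: low dissimilarity
    [1.0, 0.6, 0.2],  # reversed pattern: high dissimilarity to frame 1
    [0.9, 0.5, 0.1],
]
keys = candidate_key_postures(motion)
print(keys)
```

Because every angle of a posture enters the comparison, a disturbance in a single angle changes the dissimilarity value only slightly, which is the property this section relies on.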
4. Elicitation of optimal symbolic postures from reinforcement learning
In studies by Calinon & Billard (2007(d)) and Inamura et al. (2004), an HMM was used for extracting dynamic features of a demonstrator's motions at the states of the HMM to construct a robot's motions. Calinon & Billard (2007(d)) used an HMM with the Viterbi algorithm to elicit key-motion points from the entire motion. Here, the Viterbi algorithm searches for the most significant state combinations among the inflexion points, which are selected as local minimum or maximum points. As is generally known, the Viterbi algorithm searches for an optimal state sequence to model motion or behavior. Moreover, the approach forces the Viterbi algorithm to select the best state sequence from the inflexion points. One problem is that the mechanism of the Viterbi algorithm is not designed to elicit the state sequence that contains the best key-motion points for constructing a robot's motion. In that sense, there is a limitation in using an HMM for eliciting the key-motion points best suited to generating a robot's motion, although an HMM does provide the best sequence of states for modeling a human's motion or behaviors.
In our approach we used a Reinforcement Learning (RF) (Kaelbling et al., 1996) algorithm to learn and extract the most significant postures, taking into account the individual differences between postures. An RF mechanism is capable of directly considering the posture dissimilarity values to find the optimum postures (key motions) in order to construct the robot's motion for a given demonstrator's motion. This is the motivation for and advantage of using RF compared to an HMM, since RF learning extracts a few postures that have maximum individual differences compared with all other postures. We estimated the posture dissimilarity values ($\rho_{i,i+1}$) through equation 1. The estimated values are considered as the states in Q-learning ($\rho_{i,i+1} \rightarrow s_i$), and the action is defined as the movement between states, $s_i \rightarrow s_{i+1}$. We can define the Q-learning function as:
where $R(s_i, a_i)$ is the reward matrix for each of the actions. The action $a_i$ is defined as the movement from one state (posture) to another state (posture), and each element of the reward matrix is the value of the state transition (action), which is estimated using posture dissimilarity. In the Q-learning function, the action policy is an essential part of finding the optimal postures, i.e., those that have a maximum individual difference when compared with the other postures (motion points), or the optimal solution of the Q-learning (see Fig.3). Accordingly, we defined two action policies: a state transition can move from one state $s_i$ to another state $s_k$ only with $i<k$, and a state transition cannot remain at the same state (no link between $s_i$ and $s_i$).
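For orientation, the standard one-step Q-learning update described in the survey by Kaelbling et al. (1996) can be written in this section's notation as below. This is the textbook form and not necessarily the exact expression used in our formulation; the symbols $\alpha_t$ (learning rate) and $\gamma$ (discount factor) are assumed names.

$$
\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \alpha_t \left[ R(s_t, a_t) + \gamma \max_{a} \hat{Q}(s_{t+1}, a) - \hat{Q}(s_t, a_t) \right]
$$

When both $\alpha_t$ and $\gamma$ equal 1, the update reduces to $\hat{Q}(s_t, a_t) \leftarrow R(s_t, a_t) + \max_{a} \hat{Q}(s_{t+1}, a)$, i.e., the immediate reward of the transition plus the best value reachable from the next state.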
To process Q-learning, we must initialize the reward matrix $R(s_i, a_i)$, whose estimation is based on the individual difference of postures, given by $\rho_{ik} = R(s_i \rightarrow s_k, a_i)$, where $i<k$.
Consequently, if the element $R(s_i \rightarrow s_k, a_i) > 0$, the initial reward matrix has a connection between $s_i$ and $s_k$; otherwise, it has no connection between $s_i$ and $s_k$.
These policies are applied to the initial reward matrix. Here, we set both the learning rate $\alpha_t$ and the discount factor $\gamma$ to 1. In the initial stage, we set up the Q-matrix $Q(s_t, a_t)$ as a zero matrix. Afterwards, we update $\hat{Q}(s_t, a_t)$ using the reward matrix. After updating $\hat{Q}(s_t, a_t)$, we employed the epsilon-greedy policy to find the optimal state, and the corresponding key state was used as a guide to extract the optimal key-motion points (postures). In our use, RF extracts the postures that have the largest individual difference values from the motion sequences. In extracting these postures (key-motion points), we assumed that the changing points of motion direction or motion pattern are also significant for constructing a robot's motion.
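The procedure above can be sketched as a small tabular Q-learning loop. This is a minimal sketch under stated assumptions, not our exact implementation: the reward values in `R` are invented toy numbers, the names `q_learning` and `greedy_path` are hypothetical, and the episode count, exploration rate, and random seed are arbitrary choices. The two action policies are encoded by allowing only transitions from state i to a later, linked state k.

```python
import random

def q_learning(R, episodes=2000, epsilon=0.3, seed=0):
    """Tabular Q-learning with learning rate 1 and discount factor 1."""
    rng = random.Random(seed)
    n = len(R)
    # Allowed actions: move from state i to a later state k with a link (R > 0).
    actions = [[k for k in range(i + 1, n) if R[i][k] > 0] for i in range(n)]
    Q = [[0.0] * n for _ in range(n)]          # Q-matrix starts as a zero matrix
    starts = [i for i in range(n) if actions[i]]
    for _ in range(episodes):
        s = rng.choice(starts)
        while actions[s]:                      # stop at a state with no outgoing link
            if rng.random() < epsilon:
                a = rng.choice(actions[s])                   # explore
            else:
                a = max(actions[s], key=lambda k: Q[s][k])   # exploit
            future = max((Q[a][k] for k in actions[a]), default=0.0)
            Q[s][a] = R[s][a] + future         # update with alpha = gamma = 1
            s = a
    return Q, actions

def greedy_path(Q, actions, start=0):
    """Follow the highest-valued action from each state until no link remains."""
    path = [start]
    while actions[path[-1]]:
        s = path[-1]
        path.append(max(actions[s], key=lambda k: Q[s][k]))
    return path

# Toy reward matrix: R[i][k] is the dissimilarity-based reward for i -> k (i < k).
R = [
    [0.0, 0.1, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1, 0.8],
    [0.0, 0.0, 0.0, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
Q, actions = q_learning(R)
path = greedy_path(Q, actions)
print(path)
```

With these toy rewards, the path with the highest cumulative reward from state 0 passes through states 0, 2, and 4, so those postures would be kept as key-motion points while the weakly dissimilar states 1 and 3 are skipped.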
A similar mechanism is applied to the rest of the unlearned postures to extract optimal symbolic postures from the entire range of human motions. After extracting the optimal symbolic postures, our approach applies the divisional cubic spline interpolation to generate the robot's motion, treating each angle separately. Please refer to Fig.2 for further understanding of the proposed algorithm.
3. Generating robot motions
In this phase, we consider the trajectory of each of the demonstrator's angles separately in the task space to construct each of the robot's angles, since the body scales of the robot and the demonstrator are totally different. Indeed, robot motion and demonstrator motion are proportional to each other when motion is captured through joint angles, because the body joint angles do not depend on the scale of the body. To construct the robot's motion, each of the angle trajectories is considered separately in task space, and the selected key-motion points (common to every angle) are used as reference points in the spline interpolation. We can define the selected reference motion points as $(\theta_0, \theta_1, \ldots, \theta_n)$, where $\theta_i$, $i = 0, 1, \ldots, n$, represents the selected key-motion points, and the corresponding times as $(t_0, t_1, \ldots, t_n)$. The divisional cubic spline interpolation is defined as:
where $t_j < t < t_{j+1}$, $j = 0, 1, \ldots, n-1$; also $a_j$, $b_j$, $c_j$, and $d_j$ are unknown parameters. Each cubic spline is generated by considering two consecutive points. To estimate $a_j$, $b_j$, $c_j$, and $d_j$, we need to define $u_j$, $h_j$, and $v_j$:
After estimating uj, we compute aj, bj, and c
By estimating the above parameters at each time $t_j$, where $j = 0, 1, \ldots, n$, we can generate a smooth trajectory for one of the robot's angles. A similar approach is used to generate the data of the other angles, so that the robot's entire motion is obtained smoothly and precisely.
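To make the interpolation step concrete, the following sketch fits a piecewise cubic spline through a handful of key-motion points for a single angle. It uses the standard natural cubic spline construction (zero second derivative at both ends), which is an assumption on our part: the helper names (`alpha`, `diag`, `mu`, `z`) are textbook intermediates rather than the $u_j$, $h_j$, $v_j$ quantities of this section, and the key times and angles are invented. Each segment has the form $a_j + b_j (t - t_j) + c_j (t - t_j)^2 + d_j (t - t_j)^3$.

```python
import bisect

def natural_cubic_spline(t, y):
    """Return a function evaluating the natural cubic spline through (t[i], y[i]).

    t must be strictly increasing and contain at least two points.
    """
    n = len(t) - 1
    h = [t[j + 1] - t[j] for j in range(n)]
    alpha = [0.0] * (n + 1)
    for j in range(1, n):
        alpha[j] = 3.0 * (y[j + 1] - y[j]) / h[j] - 3.0 * (y[j] - y[j - 1]) / h[j - 1]
    # Forward sweep of the tridiagonal system for the c coefficients.
    diag, mu, z = [1.0] * (n + 1), [0.0] * (n + 1), [0.0] * (n + 1)
    for j in range(1, n):
        diag[j] = 2.0 * (t[j + 1] - t[j - 1]) - h[j - 1] * mu[j - 1]
        mu[j] = h[j] / diag[j]
        z[j] = (alpha[j] - h[j - 1] * z[j - 1]) / diag[j]
    # Back substitution; c[0] = c[n] = 0 is the natural boundary condition.
    b, c, d = [0.0] * n, [0.0] * (n + 1), [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (y[j + 1] - y[j]) / h[j] - h[j] * (c[j + 1] + 2.0 * c[j]) / 3.0
        d[j] = (c[j + 1] - c[j]) / (3.0 * h[j])

    def evaluate(x):
        # Locate the segment [t[j], t[j+1]] that contains x (clamped to the ends).
        j = min(max(bisect.bisect_right(t, x) - 1, 0), n - 1)
        dx = x - t[j]
        return y[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate

# Hypothetical key-motion points for one joint angle (times in s, angles in rad).
key_times = [0.0, 0.5, 1.0, 1.5]
key_angles = [0.0, 0.8, 0.3, 0.5]
angle_at = natural_cubic_spline(key_times, key_angles)
trajectory = [angle_at(0.1 * k) for k in range(16)]  # dense samples for the robot
```

The same function would be called once per joint angle, always with the shared key times, which mirrors the use of key-motion points that are common to every angle.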
5. Experimental protocol
Non-verbal communication channels help to transfer information interactively and to clarify the meaning of verbal language, since non-verbal communication is an essential channel in human communication for language understanding. Among these channels, the gesture-based channel plays a dominant role in human-human communication.
Recently, robotics research has moved toward developing robots that embody social cues in order to improve the interface for natural human-robot interaction. A gesture-based channel can be used to create natural social cues embodied in a robot more effectively and attractively than other communication channels, e.g., facial expressions. Since the gesture-based channel plays a major role in human-human communication, we believe that it will work in a similar manner in human-robot interaction.
The experiment was conducted with a Fujitsu HOAP-3 robot with 28 degrees of freedom. The robot's leg joints were held at a constant position. The human agent wore eight color patches and expressed three social cues in a natural way. Through an image processing technique, we estimated the position of each color patch within each frame. During the process, we first estimated the angle between the human body and the camera position, which helped to estimate the 12 body angles.
In our experiment, we attempted to transfer three social cues: a "pointing" gesture (see Fig. 4), "a gesture for explaining something attractively" (see Fig. 5), and a gesture for expressing "I don't know" (see Fig. 6). The human agent demonstrated these selected social cues, which were transferred to the robot through the proposed imitation algorithm. These gesture-based social cues are frequently used in human-human communication, and consequently the robot can use them to create better natural human-robot interactions.
The reinforcement learning method was applied to the dissimilarity values to elicit symbolic key postures from the entire motions. Finally, we utilized the divisional cubic spline interpolation for generating robot motion, considering each of the 12 angles separately. Fig. 4, Fig. 5, and Fig. 6 illustrate the agent's expression of the social cues and the corresponding robot social cues generated by the proposed imitation algorithm. Our proposed algorithm transferred the social cues into the robot precisely: the robot's motion patterns were similar to those of the motions expressed by the agent.
6. Experimental results
The novel part of the proposed method is its use of simple and accurate mathematical concepts with a few symbolic gestures for generating the whole robot motion. The robot required little computation to generate natural social cues precisely. The generated robot social cues are commensurate with the patterns of the agent's social cues, and this can be validated by comparing the body angle data of the robot with the actual human body angle data (refer to Fig.7 to Fig.10 for a further description of the proposed algorithm).
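The comparison of angle data described above can also be summarized numerically. As one possible measure, which is our illustrative choice here rather than a metric reported in the experiments, the root-mean-square error between a human angle trajectory and the generated robot trajectory could be computed as in the sketch below; the sample values are invented.

```python
import math

def rmse(human, robot):
    """Root-mean-square error between two equally sampled angle trajectories."""
    if len(human) != len(robot) or not human:
        raise ValueError("trajectories must be non-empty and of equal length")
    return math.sqrt(sum((h - r) ** 2 for h, r in zip(human, robot)) / len(human))

# Hypothetical samples of one joint angle (radians) at the same time stamps.
human_angle = [0.10, 0.20, 0.35, 0.50]
robot_angle = [0.10, 0.22, 0.33, 0.50]
error = rmse(human_angle, robot_angle)
print(f"RMSE: {error:.4f} rad")
```

A small value indicates that the generated trajectory follows the human trajectory closely; computing it per angle would show which joints deviate most.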
Fig.7 illustrates the left hand front/rear angle and Fig. 8 the right elbow angle for the "pointing" gesture. In the figures, the dashed line represents the original human angles and the solid line represents the generated robot angles. In addition, the x-axis represents time and the y-axis represents the angle in radians. The pointing gesture is a quite simple motion compared to the other social cues. Nevertheless, the figures substantiate our claim that the robot-generated social cues had almost the same pattern as the social cues expressed by the human agent.
Also, according to the experimental results, some time intervals contained noisy data (see Fig. 7 at time 0.3 < t < 0.4). However, our proposed approach did not use these noisy data points in generating the robot's motion. The reason is that we compared the posture dissimilarity values and extracted the key symbolic postures from all 12 body angles together.
A similar pattern is shown in Fig. 8 in the time ranges 0.3 < t < 0.4 and 0.3 < t < 0.4. These results support our claim that the noisy data did not have a significant effect on generating accurate robot motion. In order to validate our proposed algorithm, the final social cue transferred was the "I don't know" social cue. A careful analysis of the right elbow angle (Fig. 9) and the left front/rear angle (Fig. 10) shows that the robot generated these motions more precisely than the other social cues.
The results of our experiment provide further evidence that the noisy data did not have a significant effect on generating the robot motion precisely. This is demonstrated in Fig.11 and Fig.12, which show the right hand elbow angle (Fig.11) and the right hand shoulder twist angle (Fig.12). The angle data were obtained when the human demonstrator expressed the "gesture for explaining something attractively." Here, the circle symbols represent the key motion points selected for the cubic spline in generating robot motions. Furthermore, Fig.12 shows certain noisy data that were not selected as key motion points.
When the right hand twist angle (Fig.12) is considered separately, that noisy point still resembles a motion changing point. However, our mechanism did not select that point as a motion changing point, because the concept of our proposed method is to consider and compare all body angles together to determine the key motion points (symbolic postures).
This demonstrates that our approach is capable of ignoring noisy data efficiently. Overall, our results showed that the proposed imitation algorithm was able to generate the robot's social cues precisely, in correspondence with the agent's social cues, except during certain small time intervals.
7. Conclusion
In this paper, we presented a framework to transfer the natural gestural behaviors of a human agent to a robot through a robust imitation algorithm. The novelty of our proposed algorithm is the use of symbolic postures to generate the gestural behaviors of a robot without using any training data or trained model. The idea behind using symbolic postures is that a robot is flexibly able to generate its own motion.
The main challenge in robot imitation is identifying the changing points of motion direction at each time interval. In our approach, we estimated the changing points of motion direction through posture dissimilarity values and reinforcement learning at each time interval.
The image processing-based method produced some noisy estimates of the positions of the colored patches. The noisy data did not have a significant effect on the accurate generation of the robot's motion, because the imitation algorithm generated the robot's motion through only a small number of symbolic postures. Overall, the experimental results revealed that the proposed imitation algorithm imitated the human gestural behaviors quite accurately, except during a few time intervals.
Acknowledgments
This research has been supported by both the Grant-in-Aid for Young Scientists (B)(19700477) from the Japan Society for the Promotion of Science (JSPS) and the Grant-in-Aid for Sustainable Research Center of the Ministry of Education, Science, Sports and Culture of Japan.
References
1. Aleotti, J. & Caselli, S. (2005). Trajectory clustering and stochastic approximation for robot programming by demonstration, pp. 1029-1034, August 2005, IEEE Computer Society.
2. Calinon, S. & Billard, A. (2007a). Active teaching in robot programming by demonstration, pp. 702-707, August 2007, IEEE Computer Society.
3. Calinon, S. & Billard, A. (2007b). What is the teacher’s role in robot programming by demonstration? Toward benchmarks for improved learning, Vol. 8, No. 3, pp. 441-464.
4. Calinon, S. & Billard, A. (2007c). Incremental learning of gestures by imitation in a humanoid robot, pp. 255-262, 2007, ACM.
5. Calinon, S. & Billard, A. (2004d). Stochastic gesture production and recognition model for a humanoid robot, pp. 2769-2774, 2004, IEEE Computer Society.
6. Dillmann, R. (2004). Teaching and learning of robot tasks via observation of human performance, Vol. 47, No. 3, pp. 109-116.
7. Hovel, G., Sikka, P. & Mccarragher, B. (1996). Skill acquisition from human demonstration using a hidden Markov model, pp. 2706-2711, 1996, IEEE Computer Society.
8. Ikeuchi, K., Kawade, M. & Suehiro, T. (1993). Assembly task recognition with planar, curved, and mechanical contacts, pp. 688-694, 1993, IEEE Computer Society.
9. Inamura, T., Tanie, H. & Nakamura, Y. (2004). Embodied symbol emergence based on mimesis theory, Vol. 23, No. 5, pp. 363-377.
10. Ito, M., Noda, K., Hoshino, Y. & Tani, J. (2007). Dynamic and interactive generation of object handling behaviours by a small humanoid robot using a dynamic neural network model, Vol. 19, No. 3, pp. 323-337.
11. Kaelbling, L., Littman, M. & Moore, A. (1996). Reinforcement learning: a survey, Vol. 4, No. 2, pp. 237-285.
12. Kajita, S., Kanehiro, F., Kaneko, K., Fujiwara, K., Harada, K., Yokoi, K. & Hirukawa, H. (2003). Biped walking pattern generation by using preview control of zero-moment point, pp. 1620-1626, 2003, IEEE Computer Society.
13. Kuniyoshi, Y., Inaba, M. & Inoue, H. (1994). Learning by watching: extracting reusable task knowledge from visual observation of human performances, Vol. 10, No. 6, pp. 799-822.
14. Mataric, M. (2000). Getting humanoids to move and imitate, IEEE, Vol. 15, No. 4, pp. 18-24.
15. Riley, M., Ude, A., Wade, K. & Atkeson, C. (2003). Enabling real-time full-body imitation: a natural way of transferring human movements to humanoids, pp. 2368-2374, May 2003, IEEE Computer Society.
16. Shiratori, T., Nakazawa, A. & Ikeuchi, K. (2004). Detecting dance motion structure through music analysis, pp. 857-862, 2004, IEEE Computer Society.
17. Takano, W. & Nakamura, Y. (2006). Humanoid robot’s autonomous acquisition of proto-symbols through motion segmentation, pp. 425-431, December 2006, IEEE Computer Society.
18. Tapus, A., Mataric, M. & Scassellati, B. (2007). Vol. 14, No. 1, pp. 35-42.