Action Types and Action Commands
In this chapter, I introduce a new concept, “multimodal command language to direct home-use robots,” an example language for Japanese speakers, some recent user studies on robots that can be commanded in the language, and possible future directions.
First, I briefly explain why such a language helps users of home-use robots and what properties it should have, taking into account both the usability and the cost of home-use robots. Then, I introduce RUNA (Robot Users’ Natural Command Language), a multimodal command language for directing home-use robots that is carefully designed for nonexpert Japanese speakers. It allows them to speak to robots while simultaneously using hand gestures, touching the robots' body parts, or pressing remote control buttons. The language illustrated here comprises grammar rules and words for spoken commands based on the Japanese language and a set of nonverbal events including body touch actions, button press actions, and single-hand and double-hand gestures. In this command language, one can specify action types such as
2. Multimodal command language
Many scientists predict that home-use robots which serve us at home will be affordable in the future. They will have a number of sensors and actuators and a wireless connection with intelligent home electric devices and the internet, and will help us in various ways. Their duties will be classified into physical assistance, operation of home electric devices, information service using the network connection, entertainment, healing, teaching, and so on.
How can we communicate with them? A remote controller with many buttons and a graphical user interface with a screen and pointing device are practical choices, but they are not suited for home-use robots which are given many kinds of tasks. Those interfaces require experience and skill, and even experienced users need time to send a single message by pressing buttons or selecting nested menu items. Another choice which comes to mind is a speech interface. Researchers and companies have already developed many robots which have speech recognition and synthesis capabilities; they recognize spoken words of users and respond to them in spoken messages (Prasad et al., 2004). However, they do not understand every request in a natural language such as English for a number of reasons. Therefore, users of those robots must know which word sequences they understand and which they do not. In general, it is not easy for us to learn the vast set of verbal messages a multi-purpose home-use robot would understand, even if it is a subset of a natural language. Another problem with spoken messages is that utterances in natural human communication are often ambiguous. It is computationally expensive for a computer to understand them (Jurafsky & Martin, 2000) because inferences based on different knowledge sources (Bos & Oka, 2007) and observations of the speaker and environment are required to grasp the meaning of natural language utterances. For example, think about a spoken command “Place this book on the table”, which requires identification of a book and a table in the real world; there may be several books and two tables around the speaker. If the speaker is pointing at one of the books and looking at one of the tables, these nonverbal messages may help a robot understand the command. Moreover, requests such as “Give the book back to me” with no information about
Now, think about a
One should not consider only sets of verbal messages but also
Graphical user interfaces are computationally inexpensive and enable unambiguous commands using menus, sliders, buttons, text fields, etc. However, as I have already pointed out, they are not usable for all kinds of users and they do not allow us to choose among a large number of commands in a short period of time.
Since character user interfaces require key typing skills, spoken language interfaces are preferable for nonexperts, although they are more expensive and there are risks of speech recognition errors. As I pointed out, verbal messages in human communication are often ambiguous due to multi-sense or obscure words, misleading word orders, unmentioned information, etc. Ambiguous verbal messages should be avoided because it is computationally expensive to find and choose among many possible interpretations. One may insist that home-use robots can ask clarification questions. However, such questions increase the time needed for a single command, and home-use robots which often ask clarification questions are annoying.
Keyword spotting is a well-known and popular method to
Verbal messages that are not ambiguous tend to contain many words because one needs to put everything in words. Spoken messages including many words are not very natural and more likely to be misrecognised by speech recognisers. Nonverbal modes such as body movement, posture, body touch, button press, and paralanguage, can cover such weaknesses of a verbal command language. Thus, a well-defined multimodal command set combining verbal and nonverbal messages would help users of home-use robots.
Perzanowski et al. developed a multimodal human-robot interface that enables users to give commands combining spoken commands and pointing gestures (Perzanowski et al., 2001). In the system, spoken commands are analysed using a speech-to-text system and a natural language understanding system that parses text strings. The system can disambiguate grammatical spoken commands such as “Go over there” and “Go to the door over there” by detecting a gesture. It can detect invalid text strings and inconsistencies between verbal and nonverbal messages. However, the details of the multimodal language, its grammar and valid gesture set, are not discussed. It is unclear how easy it is to learn to give grammatical spoken commands or valid multimodal commands in the language.
Iba et al. proposed an approach to programming a robot interactively through a multimodal interface (Iba et al., 2004). They built a vacuum-cleaning robot that one can interactively control and program using symbolic hand gestures and spoken words. However, their semantic analysis method is similar to keyword spotting and does not distinguish valid and invalid commands. There are more examples of robots that receive multimodal messages, but no well-defined multimodal languages in which humans can communicate with robots have been proposed or discussed.
Is it possible to design a multimodal language that has the desirable properties? In the next section, I illustrate a well-defined multimodal language I designed taking into account cost, usability, and learnability.
3. RUNA: a command language for Japanese speakers
The multimodal language, RUNA, comprises a set of grammar rules and a lexicon for spoken commands, and a set of nonverbal events detected using visual and tactile sensors on the robot and buttons or keys on a pad in the user's hand. Commands in RUNA are given as time series of nonverbal events and utterances of the spoken language. The spoken command language defined by the grammar rule set and lexicon enables users to direct home-use robots with no ambiguity. The lexicon and grammar rules are tailored for Japanese speakers to give home-use robots directions. Nonverbal events function as alternatives to spoken phrases and create multimodal commands. Thus, the language enables users to direct robots in fewer words using gestures, touching robots, pressing buttons, and so on.
3.2. Commands and actions
In RUNA, one can command a home-use robot to move forward, backward, left and right, turn left and right, look up, down, left and right, move to a goal position, switch on and off a home electric device, change the settings of a device, pick up and place an object, push and pull an object, and so on. In the latest version, there are two types of commands: action commands and repetition commands. An action command consists of an
|Action Type||Action Command||Meaning in English|
|standup||standup_s||Stand up slowly!|
|moveforward||moveforward_f_1m moveforward_m_long||Move forward quickly by 1m! Move a lot forward!|
|walk||walk_s_3steps walk_f_10m||Take 3 steps slowly! Walk fast to a point 10m ahead!|
|look||look_f_l||Look left quickly!|
|turn||turn_m_r_30degrees turn_f_l_much||Turn right by 30 degrees! Turn a lot to the left quickly!|
|turnto||turnto_s_back||Turn back slowly!|
|sidestep||sidestep_s_r_2steps||Take 2 steps to the right!|
|highfive||highfive_s_rh||Give me a highfive with your right hand!|
|kick||kick_f_l_rf||Kick left with your foot!|
|wavebp||wavebp_f_hips||Wave your hips quickly!|
|settemp||settemp_aircon_24||Set the air conditioner at 24 degrees!|
|lowertemp||lowertemp_room_2||Lower the temperature of the room by 2 degrees!|
|switchon||switchon_aircon||Switch on the air conditioner!|
|query||query_room||Give me some information about the room!|
|pickup||pickup_30cm_desk||Pick up something 30cm in width on the desk!|
|place||place_floor||Place it on the floor!|
|moveto||moveto_fridge||Go to the fridge!|
|clean||clean_50cm_powerful_2||Vacuum-clean around you twice powerfully!|
|shuttle||shuttle_1m_silent_10||Shuttle silently 10 times within 1m in length!|
|Class||Action Types||Action Parameters|
|AC1||standup, hug, crouch, liedown, squat||speed|
|AC2||moveforward, movebackward||speed, distance|
|AC4||look, lookaround, turnto||speed, target|
|AC5||turn||speed, direction, angle|
|AC6||sidestep||speed, direction, distance|
|AC7||move||speed, direction, distance|
|AC8||highfive, handshake||speed, hand|
|AC9||punch||speed, hand, direction|
|AC10||kick||speed, foot, direction|
|AC11||turnbp, raisebp, lowerbp, wavebp||speed, body part, direction|
|AC13||raisetemp, lowertemp||room, temperature|
|AC19||push, pull||object, height, distance|
|AC20||slide||object, height, direction, distance|
|AC23||clean||area, repetition, mode|
|AC24||shuttle||distance, repetition, mode|
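The relation between the action command strings in Table 1 and the parameter lists in Table 2 can be illustrated with a minimal Python sketch. It assumes that the fields of an action command are separated by underscores and appear in the order listed in Table 2, which holds for the `turn`, `sidestep`, and `moveforward` examples above but is not stated as a general rule. The names `ACTION_PARAMETERS` and `parse_action_command` are hypothetical and are not part of RUNA.

```python
# Parameter names for a few action types, copied from Table 2.
# Only types whose Table 1 examples match this order are included.
ACTION_PARAMETERS = {
    "moveforward": ("speed", "distance"),
    "movebackward": ("speed", "distance"),
    "turn": ("speed", "direction", "angle"),
    "sidestep": ("speed", "direction", "distance"),
}


def parse_action_command(command: str) -> dict:
    """Split an action command string into its type and named parameters.

    Assumption: fields are underscore-separated and follow the
    parameter order given in Table 2.
    """
    action_type, *values = command.split("_")
    if action_type not in ACTION_PARAMETERS:
        raise ValueError(f"unknown action type: {action_type}")
    names = ACTION_PARAMETERS[action_type]
    if len(values) != len(names):
        raise ValueError(
            f"{action_type} expects {len(names)} parameters, got {len(values)}"
        )
    return {"type": action_type, **dict(zip(names, values))}


if __name__ == "__main__":
    print(parse_action_command("turn_m_r_30degrees"))
    print(parse_action_command("sidestep_s_r_2steps"))
```

Because every action type has a fixed parameter list, a lookup table of this kind is enough to name each field; no search over alternative readings is needed.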
3.3. Syntax of spoken commands
There are more than 300 generative rules for spoken commands in the latest version of RUNA (see Table 3 for some of them). These rules allow Japanese speakers to command robots in a natural way by speech alone, even though there are no recursive rules. A spoken action command in the language is an imperative utterance including a word or phrase which determines the action type and other words that contain information about action parameters. There must be a word or phrase for the action type of the spoken command, although one can leave out parameter values. Figure 1 illustrates a parse tree for a spoken command of the action type
There are more than 250 words (terminal symbols), each of which has its own pronunciation. They are categorized into about 100 parts of speech, identified by nonterminal symbols (Table 4). One can choose among synonymous words to specify an action type or a parameter value.
|1||S → ACTION||action command|
|2||S → REPETITION||repetition command|
|3||ACTION → AC3||class 3 action command|
|4||AC3 → P3 AT3||parameters and type (class 3)|
|5||AT3 → AT_WALK||action type walk|
|6||P3 → SPEED||phrase for speed|
|7||P3 → DIST||phrase for distance|
|8||P3 → SPEED DIST||speed + distance|
|9||P3 → DIST SPEED||distance + speed|
|10||DIST → NUMBER LUNIT PE||number + length unit + PE|
|11||DIST → DISTANCE_AMOUNT PE||short, long|
|12||P17 → OBJECT17 HEIGHTKARA||parameters for class 17 action|
|13||HEIGHTKARA → HEIGHTS KARA||height of object to pick up|
|14||HEIGHTS → PLACE||desk, floor, etc.|
|15||HEIGHTS → HEIGHT NUMBER LUNIT||height in mm/cm/m|
|16||HEIGHTS → BODYPARTNO HEIGHT||knee, hips|
|17||OBJECT17 → OBJWIDTH OBJECT||object for class 17 action|
|18||OBJWIDTH → WIDTH NUMBER LUNIT NO||width in mm/cm/m|
|19||OBJWIDTH → OBJSIZE||small, large|
|20||DIR → DIR_DEICTIC PE||deictic expression for direction|
|21||REPETITION → REPEAT||repeat last action|
|Part of Speech||Words|
|AT_WALK||at_walk_aruke, at_walk_hoko, at_walk_hokosihro|
|KNEE||bp_knee_hiza|
|LUNIT||lu_mm_mm, lu_cm_cm, lu_m_m|
|DIR_LR||dir_r_migi, dir_r_migigawa, dir_r_miginoho, dir_l_hidari, ...|
|DIR_DEICTIC||dir_deictic_koko, dir_deictic_kocchi, dir_deictic_kochira|
|DIGIT||number_1_ichi, number_1_iq, number_2_ni, ...|
|SPEEDW||sp_f_hayaku, sp_f_isoide, sp_s_yukkuri, sp_m_futsuni|
|DISTANCE_AMOUNT||dst_long_okiku, dst_short_sukoshi, dst_short_chotto, ...|
|ANGLE_AMOUNT||ang_much_okiku, ang_little_sukoshi, ang_little_chisaku, ...|
|PE (silence or hesitant voice)||mk_pe_q, mk_pe_a:, mk_pe_e:|
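Since the grammar has no recursive rules, the set of part-of-speech sequences it generates is finite and can be enumerated. The following Python sketch illustrates this for the class 3 (walk) rules 4 to 11 of Table 3 and a handful of words from Table 4. It is only an illustration under stated assumptions: `SPEED` and `NUMBER` are treated as parts of speech although their defining rules are not shown in the tables, the `sp_` words are assumed to belong to `SPEED`, and the names `RULES`, `LEXICON`, `expand`, and `is_valid_walk_command` are hypothetical.

```python
from itertools import product

# Rules 4-11 of Table 3. Symbols without an entry are treated as
# parts of speech (assumption for SPEED and NUMBER).
RULES = {
    "AC3": [["P3", "AT3"]],
    "AT3": [["AT_WALK"]],
    "P3": [["SPEED"], ["DIST"], ["SPEED", "DIST"], ["DIST", "SPEED"]],
    "DIST": [["NUMBER", "LUNIT", "PE"], ["DISTANCE_AMOUNT", "PE"]],
}

# A few words from Table 4 with their parts of speech.
LEXICON = {
    "at_walk_aruke": "AT_WALK",
    "sp_s_yukkuri": "SPEED",  # assumed category for sp_ words
    "sp_f_hayaku": "SPEED",
    "dst_short_sukoshi": "DISTANCE_AMOUNT",
    "lu_m_m": "LUNIT",
    "mk_pe_q": "PE",
}


def expand(symbol):
    """Return every part-of-speech sequence derivable from symbol."""
    if symbol not in RULES:
        return [(symbol,)]
    sequences = []
    for right_hand_side in RULES[symbol]:
        # Combine the expansions of each symbol on the right-hand side.
        for combo in product(*(expand(s) for s in right_hand_side)):
            sequences.append(tuple(tag for part in combo for tag in part))
    return sequences


def is_valid_walk_command(words):
    """Check a word sequence against the enumerated class 3 sequences."""
    try:
        tags = tuple(LEXICON[word] for word in words)
    except KeyError:
        return False  # word not in the lexicon
    return tags in set(expand("AC3"))


if __name__ == "__main__":
    for sequence in expand("AC3"):
        print(" ".join(sequence))
```

With these rules, `expand("AC3")` yields seven sequences, each ending in `AT_WALK`. A full recogniser would also need the rules that are not reproduced in Table 3, for example those that allow parameters to be left out.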
3.4. Nonverbal events
In RUNA, a set of
|Event Type||Event Parameters||Example|
|button press||button id, iteration, duration||buttonpress_b4_3_124ms|
|body touch||position, iteration, duration||bodytouch_leftwrist_1_700ms|
|singlehand waving||direction, iteration, stroke, frequency||singlehandwaving_left_3_long_120 singlehandwaving_up_4_10cm_90|
|doublehand gesture||width, direction, iteration, stroke||doublehandgesture_wide_left_3_short|
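The event strings in the Example column can be decoded in the same spirit as action commands. The sketch below is a hypothetical Python representation, assuming underscore-separated fields in the order of the Event Parameters column and durations written in milliseconds, as in the examples. `NonverbalEvent`, `EVENT_PARAMETERS`, and `parse_event` are illustrative names, not part of RUNA.

```python
from dataclasses import dataclass

# Parameter names per event type, following the table above.
EVENT_PARAMETERS = {
    "buttonpress": ("button_id", "iteration", "duration"),
    "bodytouch": ("position", "iteration", "duration"),
    "singlehandwaving": ("direction", "iteration", "stroke", "frequency"),
    "doublehandgesture": ("width", "direction", "iteration", "stroke"),
}


@dataclass
class NonverbalEvent:
    event_type: str
    parameters: dict


def parse_event(text: str) -> NonverbalEvent:
    """Decode an event string such as 'buttonpress_b4_3_124ms'."""
    event_type, *values = text.split("_")
    names = EVENT_PARAMETERS.get(event_type)
    if names is None or len(names) != len(values):
        raise ValueError(f"cannot parse event: {text}")
    parameters = dict(zip(names, values))
    # Iteration counts are plain integers in all examples.
    parameters["iteration"] = int(parameters["iteration"])
    # Durations carry an 'ms' suffix in the examples (assumption: always ms).
    if "duration" in parameters:
        duration = parameters.pop("duration")
        if duration.endswith("ms"):
            duration = duration[:-2]
        parameters["duration_ms"] = int(duration)
    return NonverbalEvent(event_type, parameters)


if __name__ == "__main__":
    print(parse_event("buttonpress_b4_3_124ms"))
    print(parse_event("singlehandwaving_up_4_10cm_90"))
```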
Since the language described above is syntactically unambiguous and simple, it is computationally inexpensive to identify action types and parameters in spoken commands.
As I have already mentioned, each spoken action command in RUNA includes a word specifying an action type, which can be distinguished by its own first string element
After a spoken command is divided into phrases and its action type is determined, a parameter value can be extracted from each phrase. It is always possible to determine which parameter the phrase is about by finding a keyword of a category such as LUNIT, AUNIT, DIR_LR, WIDTH, HEIGHT, and ANGLE_AMOUNT. If the keyword contains the parameter value, a string for the value of the parameter,
Note that in RUNA there are deictic words and some parameter values can be left out in spoken commands. For instance, one may say “Turn slowly” without mentioning the direction or “Look this way” using a deictic expression perhaps with a gesture. In such cases, undecided parameters are resolved by nonverbal events described in the previous subsection. There are rules to map parameter values of nonverbal events (Table 5) to parameter values of action commands (Table 2). Designing these mapping rules is a key to a good multimodal command language that is natural and easy to learn. Table 6 shows examples of event parameters that correspond to action parameters.
If a spoken command has some parameters which cannot be resolved by nonverbal events, those parameter slots are filled with default values. Therefore, a command “Kick” is interpreted as “Kick slowly straight with your right foot” using the default parameter values for the action type
|Event Type||Event Parameter||Action Type||Action Parameter|
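|button press||button_id||turn / sidestep||speed, direction|
|button press||button_id||moveforward||speed, distance|
|button press||iteration||raisetemp||temperature|
|button press||iteration||turn||angle|
|button press||duration||turn||angle|
|button press||duration||walk||distance|
|body touch||position||kick||foot|
|body touch||position||raisebp||body part|
|body touch||position||turn / sidestep||direction|
|single hand waving||direction||turn / sidestep||direction|
|single hand waving||stroke||walk||distance|
|single hand waving||stroke||turn||angle|
|double hand gesture||width||pickup||width|
|double hand gesture||iteration||pickup / place||height|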
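The resolution order described above (spoken values first, then nonverbal events, then defaults) can be sketched in Python as follows. This is an illustration only, not the interpreter used in the studies. The `kick` defaults follow the “Kick slowly straight with your right foot” example, with `rf` assumed to denote the right foot; the `turn` defaults are hypothetical placeholders. Event values are copied unchanged, whereas a real system would presumably translate them into action parameter values. `DEFAULTS`, `EVENT_TO_ACTION`, and `resolve` are illustrative names.

```python
# Default parameter values per action type.
DEFAULTS = {
    # From the "Kick" example: slowly, straight, right foot.
    "kick": {"speed": "s", "direction": "straight", "foot": "rf"},
    # Hypothetical defaults, for illustration only.
    "turn": {"speed": "m", "direction": "l", "angle": "30degrees"},
}

# (event type, event parameter, action type) -> action parameter,
# taken from two of the correspondences in the table above.
EVENT_TO_ACTION = {
    ("bodytouch", "position", "kick"): "foot",
    ("singlehandwaving", "direction", "turn"): "direction",
}


def resolve(action_type, spoken, events):
    """Fill the parameter slots of an action command.

    spoken: parameters extracted from the utterance.
    events: list of (event type, event parameter dict) pairs.
    """
    resolved = dict(spoken)  # spoken values take priority
    for event_type, event_parameters in events:
        for name, value in event_parameters.items():
            target = EVENT_TO_ACTION.get((event_type, name, action_type))
            if target is not None and target not in resolved:
                resolved[target] = value  # copied verbatim (simplification)
    for name, value in DEFAULTS[action_type].items():
        resolved.setdefault(name, value)  # remaining slots get defaults
    return resolved


if __name__ == "__main__":
    # "Turn slowly" plus a hand wave to the left.
    wave = ("singlehandwaving", {"direction": "left", "iteration": 3})
    print(resolve("turn", {"speed": "s"}, [wave]))
    # "Kick" with no parameters and no events.
    print(resolve("kick", {}, []))
```

Giving spoken values priority over event values is a design assumption of this sketch; the text only states that undecided parameters are resolved by nonverbal events.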
3.6. Command execution by home-use robots
A complete action command with its type and parameter values is executed by a home-use robot if the action is in the robot’s action repertoire. Quantitative parameter values in action commands, e. g.
4. User studies
4.1. Objectives and methods
In the earlier part of this chapter I pointed out that a multimodal command language to direct home-use robots must have several properties. This opinion raises some fundamental questions:
Is the language easy for non-experts to learn and use?
How much time does it take to give a command in the language?
How expensive are robots that can execute commands in the language without significant delay and frequent misinterpretations?
How can the language be improved?
To answer these questions, one must collect data by conducting user studies that record multimodal commands by a wide range of users, speech recognition results, nonverbal events, system interpretations, reactions of home-use robots, user opinions, and so on.
The first question is about learning the command language. One can estimate a user’s ability to give multimodal commands in the language (the user’s linguistic performance) by giving various tasks. Fluency, human errors, command success rates, time required for each command, and self-assessments can be indicators of performance. The second question is about the language’s efficiency. One must investigate the time required for commands by a wide range of users at several stages of learning. The third question can be answered by developing home-use robots and using them in user studies. The last question is related to the other questions and should be answered by finding all sorts of problems, including human and system errors. Constructive criticism from users also plays a great role.
My colleagues and I have built a command interpretation system on a personal computer, small real humanoids (Oka et al., 2008), simulated humanoids, and a simulated robot cleaner that can be commanded in different versions of RUNA, and have conducted some user studies. In these studies, more than a hundred users, mostly young students, commanded one of the robots within 90 minutes. Some of them were asked to give spoken and/or multimodal commands printed on a sheet of paper. Many of them were given one or more goals to be achieved by giving spoken or multimodal commands: checking a room, changing the settings of an air conditioner, moving a box, cleaning a dusty area, etc. We video-recorded the users and robots, and recorded speech recognition results, nonverbal events, and command interpretations. Each user was asked to fill in a question sheet after commanding the robot.
Before asking each user to command one of the robots, we showed the person a short demonstration movie and handed them a leaflet that illustrates how to command each action in diagrams and pictures (Figure 2). We also prepared some short exercise programs to improve users’ success rates and reduce human errors within 20 minutes.
4.2. Summary of results
In the user studies, the novice users were able to command our robots in RUNA consulting one of the leaflets and complete their tasks. In fact, there were many users who were able to direct the robot without their leaflet later in their tests.
Most of the users spoke clearly and fluently after practice, although there were a small number of hesitations, fluffs, and hashes, especially in commands including more spoken phrases. Only a few users misused words. In the latest studies, 92 to 98% of spoken messages were correctly recognized with no word misrecognition thanks to the latest version (4.1.1) of an open source grammar-based speech recognition engine (Lee et al., 2001).
Most nonverbal messages were given properly after the users learned how to use them. There were few human errors such as pressing a wrong button, touching a wrong part of the robot body, or making a wrong gesture. However, there are problems in specifying some action parameter values in nonverbal messages. For example, most users made errors in choosing one of five length values by pressing a button, even after some practice to learn durations. There were also failures in specifying action parameters using hand gestures due to errors in our gesture detector, which uses a web camera.
A majority of the users in the latest studies recorded a command success rate higher than 90%. Most user commands were completed within 10 seconds and our robots responded to them within a second or so. In fact, there were users who repeatedly gave multimodal commands very quickly without looking at the leaflet. Those users spoke immediately after pressing buttons or moving their hand(s) to the camera. About 77% of the users who were asked a question about their preference answered that they preferred multimodal commands to speech-only commands in RUNA. Those users selected multimodal commands to achieve their tasks more often than the others. In one of the latest studies, all of the 20 users felt that they understood how to direct robots, although some of them did not find it easy.
In the question sheets filled in by the users, there were some important opinions about the language. Some of them pointed out that it was difficult to learn to specify action parameter values in nonverbal messages. There were users who thought speech recognition errors caused problems for them.
Further studies are needed to prove the effectiveness of the language for a wide range of non-experts, but our results imply that the current version of RUNA is fairly easy for Japanese speakers to learn. Although the novice users made some errors, they would not need a long time or much effort to fully master the spoken language and the set of nonverbal messages. With more experience, they would be able to give spoken commands specifying three or more action parameters fluently, use default parameter values whenever possible, and choose among nonverbal modes. As even novice users were able to command robots within a short period of time, experienced users would not take an unnecessarily long time for a command.
Can users of home-use robots teach themselves the language? A demonstration movie and an introductory leaflet which illustrates examples of multimodal commands would help a novice user to grasp the principles of the language. Although it will take a while to master all types of actions, I suppose that it will be easier and easier to learn a new command.
Multimodal commands in the current version of RUNA can be interpreted using a microphone, a web camera, a controller or a keypad, tactile sensors, and a personal computer. Therefore, home-use robots in the future would not need extra hardware for understanding commands in RUNA. Besides, more sophisticated speech recognisers and gesture detectors would reduce misinterpretations of user commands.
One should be able to make the language more natural and easier to learn by both amending the mapping between nonverbal events and action parameters and introducing new types of nonverbal events. In the current version of RUNA, some action parameter values are naturally mapped to event parameter values, and can be specified without acquiring skills. For example, it is easy for anyone to specify a direction by pressing a button, touching the robot, or using a gesture. Likewise, no special skill is necessary to give robots information about body parts, repetition counts, modes, and qualitative values such as
Word misuses found in the user studies prove the importance of choice of words for the lexicon of the spoken language. One can prevent word misuses by including as many Japanese words as possible. However, homonyms will increase risks of syntactically or semantically ambiguous utterances and speech recognition errors. Therefore, only frequent word misuses should be removed by adding new words to the lexicon.
5. Future directions
RUNA can be extended by adding grammar rules, spoken words, and types of nonverbal events. Without doubt, multimodal command languages to direct multipurpose home-use robots in the future must have more classes and types of actions. However, RUNA's framework based on type and parameter should work well for the purpose of giving home-use robots various kinds of action commands, goals, and missions despite its simplicity and limitations. Certainly, one must avoid syntactically or semantically ambiguous utterances and select types of nonverbal events suitable for specifying parameter values of actions, goals, and missions, taking into account both cost and usability.
Nonverbal messages can help human-robot communication in the same ways that they help human-human communication. They can not only segment and disambiguate verbal messages, but also convey the current status of humans and robots. Eye contact, hand gestures, postures, body touches, and button press actions can be clues to detect and segment spoken commands, phrases, and words; paralanguage may play important roles in disambiguation; nonverbal messages that convey emotional and physical status may help robots’ decision making.
Multimodal languages for responses from home-use robots are also among my interests. Most importantly, robots can send nonverbal messages to convey their status and whether or not they can receive a new command at present. Another interesting future work would be fusing nonverbal and verbal messages. Redundant action parameter values in multiple modes may reduce risks of misinterpretations.
Development of RUNA was supported by KAKENHI Grant-in-Aid for Scientific Research (19500171). I would like to thank all my colleagues who worked on and discussed the subject with me at Fukuoka Institute of Technology.
Bos, J. & Oka, T. (2007). Meaningful conversation with mobile robots.
Iba, S., Paredis, C. J. J., Adams, W. & Khosla, P. K. (2004). Interactive multi-modal robot programming, Proceedings of the 9th International Symposium on Experimental Robotics (ISER’04), pp. 503-512, ISBN 3-54028-816-3, March 2006, Springer, Berlin
Jurafsky, D. & Martin, J. H. (2000).
Lee, A., Kawahara, T. & Shikano, K. (2001). Julius - an open source real-time large vocabulary recognition engine, Proceedings of the 7th European Conference on Speech Communication and Technology, pp. 1691-1694, Aalborg, September 2001, International Speech Communication Association
Oka, T., Abe, T., Shimoji, M., Nakamura, T., Sugita, K. & Yokota, M. (2008). Directing humanoids in a multimodal command language, Proceedings of the 17th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN’08), pp. 580-585, ISBN 978-1-42442-213-5, Munich, August 2008, IEEE
Perzanowski, D., Schultz, A. C., Adams, W., Marsh, E. & Bugajska, M. (2001). Building a multimodal human-robot interface.
Prasad, R., Saruwatari, H. & Shikano, K. (2004). Robots that can hear, understand and talk.