In this chapter, I introduce a new concept, “multimodal command language to direct home-use robots,” an example language for Japanese speakers, some recent user studies on robots that can be commanded in the language, and possible future directions.
First, I briefly explain why such a language helps users of home-use robots and what properties it should have, taking into account both usability and cost of home-use robots. Then, I introduce RUNA (Robot Users’ Natural Command Language), a multimodal command language to direct home-use robots carefully designed for nonexpert Japanese speakers, which allows them to speak to robots while simultaneously using hand gestures, touching the robots’ body parts, or pressing remote control buttons. The language illustrated here comprises grammar rules and words for spoken commands based on the Japanese language and a set of nonverbal events including body touch actions, button press actions, and single-hand and double-hand gestures. In this command language, one can specify action types such as walk, turn, switchon, push, and moveto, in spoken words and action parameters such as speed, direction, device, and goal in spoken words or nonverbal messages. For instance, one can direct a humanoid robot to turn left quickly by waving the hand to the left quickly and saying just “Turn” shortly after the hand gesture. Next, I discuss how to evaluate such a multimodal language and robots commanded in the language, and show some results of recent studies that investigated how easy it is for novice users to command robots in RUNA and how cost-effective home-use robots that understand the language are. My colleagues and I have developed real and simulated home-use robot platforms in order to conduct user studies, which include a grammar-based speech recogniser, nonverbal event detectors, a multimodal command interpreter and action generation systems for humanoids and mobile robots. Without much training, users of various ages who had no prior knowledge about the language were able to command robots in RUNA, and achieve tasks such as checking a remote room, operating intelligent home appliances, and cleaning a region in a room.
Although there were some invalid commands and unsuccessful valid commands, most of the users were able to command robots consulting a leaflet without taking too much time. Although the early versions of RUNA need some modifications, especially in the nonverbal parts, many of the users appeared to prefer multimodal commands to speech-only commands. Finally, I give an overview of possible future directions.
2. Multimodal command language
Many scientists predict that home-use robots which serve us at home will be affordable in the future. They will have a number of sensors and actuators and a wireless connection with intelligent home electric devices and the internet, and help us in various ways. Their duties will be classified into physical assistance, operation of home electric devices, information service using the network connection, entertainment, healing, teaching, and so on.
How can we communicate with them? A remote controller with many buttons and a graphical user interface with a screen and pointing device are practical choices, but are not suited for home-use robots which are given many kinds of tasks. Those interfaces require experience and skill in using them, and even experienced users need time to send a single message by pressing buttons or selecting nested menu items. Another choice which will come to one’s mind is a speech interface. Researchers and companies have already developed many robots which have speech recognition and synthesis capabilities; they recognize spoken words of users and respond to them in spoken messages (Prasad et al., 2004). However, they do not understand every request in a natural language such as English for a number of reasons. Therefore, users of those robots must know what word sequences they understand and what they do not. In general, it is not easy for us to learn the vast set of verbal messages a multi-purpose home-use robot would understand, even if it is a subset of a natural language. Another problem with spoken messages is that utterances in natural human communication are often ambiguous. It is computationally expensive for a computer to understand them (Jurafsky & Martin, 2000) because inferences based on different knowledge sources (Bos & Oka, 2007) and observations of the speaker and environment are required to grasp the meaning of natural language utterances. For example, think about a spoken command “Place this book on the table” which requires identification of a book and a table in the real world; there may be several books and two tables around the speaker. If the speaker is pointing at one of the books and looking at one of the tables, these nonverbal messages may help a robot understand the command. Moreover, requests such as “Give the book back to me” with no information about the book are common in natural communication.
Now, think about a language for a specific purpose, commanding home-use robots. What properties should such a language have? First, it must be easy to give home-use robots commands without ambiguity in the language. Second, it should be easy for nonexperts to learn the language. Third, we should be able to give a single command in a short period of time. Fourth, the fewer misinterpretations, false alarms, and human errors, the better. From a practical point of view, cost problems cannot be ignored; both computational cost for command understanding and hardware cost push up the prices of home-use robots.
One should consider not only sets of verbal messages but also multimodal command languages that combine verbal and nonverbal messages. Here, I define a multimodal command language as a set of verbal and nonverbal messages which convey information about commands. Generally speaking, spoken utterances, typed texts, mouse clicks, button press actions, touches, and gestures can all constitute a command. Therefore, messages sent using character/graphical user interfaces and speech interfaces can be thought of as elements of multimodal command languages.
Graphical user interfaces are computationally inexpensive and enable unambiguous commands using menus, sliders, buttons, text fields, etc. However, as I have already pointed out, they are not usable for all kinds of users and they do not allow us to choose among a large number of commands in a short period of time.
Since character user interfaces require key typing skills, spoken language interfaces are preferable for nonexperts although they are more expensive and there are risks of speech recognition errors. As I pointed out, verbal messages in human communication are often ambiguous due to multi-sense or obscure words, misleading word orders, unmentioned information, etc. Ambiguous verbal messages should be avoided because it is computationally expensive to find and choose among many possible interpretations. One may insist that home-use robots can ask clarification questions. However, such questions increase the time needed for a single command, and home-use robots which often ask clarification questions are annoying.
Keyword spotting is a well-known and popular method to guess the meaning of verbal messages. Semantic analysis based on the method has been employed in many voice activated robotic systems, because it is computationally inexpensive and because it works well for a small set of messages (Prasad et al., 2004). However, since those systems do not distinguish valid and invalid utterances, it is unclear what utterances are acceptable. In other words, those systems are not based on a well-defined command language. For this reason, it is difficult for users to learn to give many kinds of tasks or commands to such robots and for system developers to avoid misinterpretations.
Verbal messages that are not ambiguous tend to contain many words because one needs to put everything in words. Spoken messages including many words are not very natural and more likely to be misrecognised by speech recognisers. Nonverbal modes such as body movement, posture, body touch, button press, and paralanguage, can cover such weaknesses of a verbal command language. Thus, a well-defined multimodal command set combining verbal and nonverbal messages would help users of home-use robots.
Perzanowski et al. developed a multimodal human-robot interface that enables users to give commands combining spoken commands and pointing gestures (Perzanowski et al., 2001). In the system, spoken commands are analysed using a speech-to-text system and a natural language understanding system that parses text strings. The system can disambiguate grammatical spoken commands such as “Go over there” and “Go to the door over there,” by detecting a gesture. It can detect invalid text strings and inconsistencies between verbal and nonverbal messages. However, the details of the multimodal language, its grammar and valid gesture set, are not discussed. It is unclear how easy it is to learn to give grammatical spoken commands or valid multimodal commands in the language.
Iba et al. proposed an approach to programming a robot interactively through a multimodal interface (Iba et al., 2004). They built a vacuum-cleaning robot one can interactively control and program using symbolic hand gestures and spoken words. However, their semantic analysis method is similar to keyword spotting, and does not distinguish valid and invalid commands. There are more examples of robots that receive multimodal messages, but no well-defined multimodal languages in which humans can communicate with robots have been proposed or discussed.
Is it possible to design a multimodal language that has the desirable properties? In the next section, I illustrate a well-defined multimodal language I designed taking into account cost, usability, and learnability.
3. RUNA: a command language for Japanese speakers
The multimodal language, RUNA, comprises a set of grammar rules and a lexicon for spoken commands, and a set of nonverbal events detected using visual and tactile sensors on the robot and buttons or keys on a pad in the user’s hand. Commands in RUNA are given as time series of nonverbal events and utterances of the spoken language. The spoken command language defined by the grammar rule set and lexicon enables users to direct home-use robots with no ambiguity. The lexicon and grammar rules are tailored for Japanese speakers to give home-use robots directions. Nonverbal events function as alternatives to spoken phrases and create multimodal commands. Thus, the language enables users to direct robots in fewer words using gestures, touching robots, pressing buttons, and so on.
3.2. Commands and actions
In RUNA, one can command a home-use robot to move forward, backward, left and right, turn left and right, look up, down, left and right, move to a goal position, switch on and off a home electric device, change the settings of a device, pick up and place an object, push and pull an object, and so on. In the latest version, there are two types of commands: action commands and repetition commands. An action command consists of an action type such as walk, turn, and move, and action parameters such as speed, direction and angle. Table 1 shows examples of action types and commands represented in character string lists. The 38 action types are categorized into 24 classes based on the way in which action parameters are specified naturally in the Japanese language (Table 2). In other words, actions of different classes are commanded using different modifiers. A repetition command requests the most recently executed action.
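The two command types described above can be pictured with a small data structure. The following Python sketch is only an illustration under stated assumptions: the class names ActionCommand, RepetitionCommand and Robot, and the parameter names used in the example, are hypothetical and are not taken from the RUNA implementation or from Tables 1 and 2.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union


@dataclass
class ActionCommand:
    """An action type plus named parameter values (names are hypothetical)."""
    action_type: str
    params: Dict[str, str] = field(default_factory=dict)


@dataclass
class RepetitionCommand:
    """Requests the most recently executed action; carries no parameters."""


class Robot:
    """Toy executor that only remembers the last executed action command."""

    def __init__(self) -> None:
        self.last_action: Optional[ActionCommand] = None

    def execute(
        self, command: Union[ActionCommand, RepetitionCommand]
    ) -> Optional[ActionCommand]:
        if isinstance(command, RepetitionCommand):
            if self.last_action is None:
                return None  # nothing has been executed yet
            command = self.last_action
        self.last_action = command
        return command


robot = Robot()
turn_left = ActionCommand("turn", {"direction": "left", "speed": "fast"})
robot.execute(turn_left)
repeated = robot.execute(RepetitionCommand())  # same as turn_left
```

Keeping the repetition command free of parameters reflects the description above, in which it simply requests the most recently executed action.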
3.3. Syntax of spoken commands
There are more than 300 generative rules for spoken commands in the latest version of RUNA (see Table 3 for some of them). These rules allow Japanese speakers to command robots in a natural way by speech alone, even though there are no recursive rules. A spoken action command in the language is an imperative utterance including a word or phrase which determines the action type and other words that contain information about action parameters. There must be a word or phrase for the action type of the spoken command, although one can leave out parameter values. Figure 1 illustrates a parse tree for a spoken command of the action type walk which has speed and distance as parameters. The fourth rule in Table 3 generates an action command of AC3 in Table 2. The nonterminal symbol P3 corresponds to phrases about speed and distance. There are degrees of freedom in the order of phrases for parameters, and one can use symbolic, deictic, qualitative and quantitative expressions for them (see rules in Table 3).
There are more than 250 words (terminal symbols), each of which has its own pronunciation. They are categorized into about 100 parts of speech, identified by nonterminal symbols (Table 4). One can choose among synonymous words to specify an action type or a parameter value.
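Because the rule set has no recursive rules, the language it generates is finite, so every valid utterance can in principle be enumerated. The Python sketch below illustrates this with a toy grammar. Only the symbol names AC3 and P3 come from the text; the rule bodies, the other nonterminals (SPEED_P, DIST_P, AT_WALK) and the English placeholder words are assumptions made for illustration, not the rules of Table 3 or the Japanese lexicon of Table 4.

```python
from itertools import product
from typing import Dict, List

# Toy non-recursive grammar. The verb-final order loosely mirrors Japanese.
GRAMMAR: Dict[str, List[List[str]]] = {
    "AC3": [["P3", "AT_WALK"], ["AT_WALK"]],  # parameters may be omitted
    "P3": [
        ["SPEED_P", "DIST_P"],  # either order of the two phrases is allowed
        ["DIST_P", "SPEED_P"],
        ["SPEED_P"],
        ["DIST_P"],
    ],
    "SPEED_P": [["slowly"], ["quickly"]],
    "DIST_P": [["two", "steps"]],
    "AT_WALK": [["walk"]],
}


def expand(symbol: str) -> List[List[str]]:
    """Return every word sequence derivable from a symbol."""
    if symbol not in GRAMMAR:
        return [[symbol]]  # terminal word
    sentences = []
    for rhs in GRAMMAR[symbol]:
        for parts in product(*(expand(s) for s in rhs)):
            sentences.append([word for part in parts for word in part])
    return sentences


VALID = {" ".join(sentence) for sentence in expand("AC3")}


def is_valid(utterance: str) -> bool:
    """A command is valid only if the grammar generates it exactly."""
    return utterance in VALID
```

Unlike keyword spotting, a check of this kind separates valid from invalid utterances: "slowly two steps walk" is accepted, whereas "walk walk" is rejected even though it contains a known keyword.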
3.4. Nonverbal events
In RUNA, a set of nonverbal events is defined and used for commanding robots. These events are lists of character strings representing their own type and parameter values. Table 5 shows examples of nonverbal events. These events can be detected using sensors on home-use robots or buttons and sensing devices in the user’s hand without much hardware and computational cost.
Since the language described above is syntactically unambiguous and simple, it is computationally inexpensive to identify action types and parameters in spoken commands.
As I have already mentioned, each spoken action command in RUNA includes a word specifying an action type, which can be distinguished by its own first string element at (Table 4). It can be divided into phrases expressing each parameter value and the action type using words which indicate the end of a parameter phrase, i.e. PE words (Figure 1, Table 4). Therefore, it is straightforward and computationally inexpensive to identify the action type of a spoken command.
After a spoken command is divided into phrases and its action type is determined, a parameter value can be extracted from each phrase. It is always possible to determine which parameter the phrase is about by finding a keyword of a category such as LUNIT, AUNIT, DIR_LR, WIDTH, HEIGHT, and ANGLE_AMOUNT. If the keyword contains the parameter value, a string for the value of the parameter, such as left or much, is constructed. Otherwise, one must find a numerical expression to compose a string such as 1m and 2degrees. Thus, the spoken command in Figure 1 is converted to a semantic representation walk_s_2steps.
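The phrase splitting and parameter extraction just described can be sketched in a few lines of Python. This is a minimal illustration, not the actual interpreter. It assumes the recogniser returns (word, category) pairs. The categories PE, LUNIT, AUNIT and DIR_LR appear in the text, while SPEED, NUM and AT, the parameter names, and the English placeholder words are hypothetical. The output string only imitates the style of walk_s_2steps and does not reproduce the real value encoding.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, category)

# Categories whose keyword itself carries the value (SPEED is hypothetical).
VALUE_CATEGORIES = {"DIR_LR": "direction", "SPEED": "speed"}
# Unit categories that need a number in the same phrase (NUM is hypothetical).
UNIT_CATEGORIES = {"LUNIT": "distance", "AUNIT": "angle"}


def split_phrases(tokens: List[Token]) -> List[List[Token]]:
    """Cut the token list at every PE (phrase end) word."""
    phrases, current = [], []
    for word, category in tokens:
        if category == "PE":
            if current:
                phrases.append(current)
            current = []
        else:
            current.append((word, category))
    if current:
        phrases.append(current)
    return phrases


def interpret(tokens: List[Token]) -> Tuple[str, Dict[str, str]]:
    """Return the action type and the parameter values found in the phrases."""
    action_type = None
    params: Dict[str, str] = {}
    for phrase in split_phrases(tokens):
        by_category = {category: word for word, category in phrase}
        if "AT" in by_category:  # hypothetical tag for action type words
            action_type = by_category["AT"]
        for category, name in VALUE_CATEGORIES.items():
            if category in by_category:
                params[name] = by_category[category]
        for category, name in UNIT_CATEGORIES.items():
            if category in by_category:
                if "NUM" not in by_category:
                    raise ValueError("unit without a numerical expression")
                params[name] = by_category["NUM"] + by_category[category]
    if action_type is None:
        raise ValueError("no action type word in the command")
    return action_type, params


def to_semantic_string(action_type: str, params: Dict[str, str]) -> str:
    return "_".join([action_type, *params.values()])


tokens = [("slowly", "SPEED"), ("-", "PE"), ("2", "NUM"), ("steps", "LUNIT"),
          ("-", "PE"), ("walk", "AT")]
action, values = interpret(tokens)
```

Each phrase is inspected once and no backtracking is needed, which is consistent with the claim that interpretation is computationally inexpensive.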
Note that in RUNA there are deictic words and some parameter values can be left out in spoken commands. For instance, one may say “Turn slowly” without mentioning the direction or “Look this way” using a deictic expression perhaps with a gesture. In such cases, undecided parameters are resolved by nonverbal events described in the previous subsection. There are rules to map parameter values of nonverbal events (Table 5) to parameter values of action commands (Table 2). Designing these mapping rules is a key to a good multimodal command language that is natural and easy to learn. Table 6 shows examples of event parameters that correspond to action parameters.
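A minimal Python sketch of this resolution step follows, using the hand-wave example from the introduction. The event encoding, the EVENT_SCHEMA table and the parameter names are assumptions made for illustration and do not reproduce Tables 5 and 6. The sketch also assumes that a deictic or omitted parameter is represented as None and that a value given in speech is never overridden by an event.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical mapping from event type to the action parameters
# that its string elements describe, in order.
EVENT_SCHEMA = {
    "handwave": ("direction", "speed"),
    "buttonpress": ("direction",),
}


def event_params(event: Sequence[str]) -> Dict[str, str]:
    """Turn an event such as ["handwave", "left", "fast"] into named values."""
    event_type, *values = event
    return dict(zip(EVENT_SCHEMA[event_type], values))


def resolve(
    spoken: Dict[str, Optional[str]],
    required: Sequence[str],
    events: List[Sequence[str]],
) -> Dict[str, Optional[str]]:
    """Fill undecided parameters (missing or None) from nonverbal events."""
    resolved = dict(spoken)
    for event in events:
        for name, value in event_params(event).items():
            if name in required and resolved.get(name) is None:
                resolved[name] = value
    return resolved


# "Turn" spoken shortly after waving the hand to the left quickly.
turn = resolve({}, ("direction", "speed"), [["handwave", "left", "fast"]])
```

In this sketch the earliest event wins when several events supply the same parameter. That is a design choice of the example, not a documented property of RUNA.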
If a spoken command has some parameters which cannot be resolved by nonverbal events, those parameter slots are filled with default values. Therefore, a command “Kick” is interpreted as “Kick slowly straight with your right foot” using the default parameter values for the action type kick.
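The default filling can be illustrated with the kick example. In the Python sketch below, the default values (slow, straight, right foot) follow the example in the text, while the parameter names and the DEFAULTS table layout are assumptions.

```python
from typing import Dict, Optional

# Default parameter values per action type; parameter names are hypothetical.
DEFAULTS: Dict[str, Dict[str, str]] = {
    "kick": {"speed": "slow", "direction": "straight", "foot": "right"},
}


def fill_defaults(
    action_type: str, params: Dict[str, Optional[str]]
) -> Dict[str, str]:
    """Keep every decided value and use defaults for the remaining slots."""
    decided = {name: value for name, value in params.items() if value is not None}
    return {**DEFAULTS.get(action_type, {}), **decided}


plain_kick = fill_defaults("kick", {})
fast_kick = fill_defaults("kick", {"speed": "fast"})
```

Here plain_kick corresponds to "Kick slowly straight with your right foot", and fast_kick differs from it only in speed.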
3.6. Command execution by home-use robots
A complete action command with its type and parameter values is executed by a home-use robot if the action is in the robot’s action repertoire. Qualitative parameter values in action commands, e.g. short, are converted to quantitative values, e.g. 20cm, when robots execute the commands. The robot starts the action immediately if it has completed the previous command. Otherwise, the robot makes a decision depending on various conditions. It may start the new command immediately after completing the ongoing command, abort it and start the new command, or reject the new command explaining the reason. There is no good theory about the decision making yet, so we describe task-specific rules for humanoids, robot cleaners, etc.
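The execution-time behaviour can be sketched as follows. This Python fragment is purely illustrative. The short to 20cm conversion comes from the example above, but the parameter name distance, the idea of a set of urgent action types, and the order of the checks are assumptions standing in for the task-specific rules, which are not given in this chapter.

```python
from enum import Enum
from typing import Optional, Set


class Decision(Enum):
    START_NOW = "start immediately"
    START_AFTER_CURRENT = "start after completing the ongoing command"
    ABORT_AND_START = "abort the ongoing command and start the new one"
    REJECT = "reject the new command and explain why"


# Robot-specific conversion table (only this one entry is from the text).
QUANTITIES = {("distance", "short"): "20cm"}


def quantify(name: str, value: str) -> str:
    """Convert a qualitative value to a quantity; pass other values through."""
    return QUANTITIES.get((name, value), value)


def decide(
    ongoing: Optional[str], new: str, repertoire: Set[str], urgent: Set[str]
) -> Decision:
    """Hypothetical rule set: one possible way to order the checks."""
    if new not in repertoire:
        return Decision.REJECT
    if ongoing is None:
        return Decision.START_NOW
    if new in urgent:
        return Decision.ABORT_AND_START
    return Decision.START_AFTER_CURRENT


repertoire = {"walk", "turn", "stop"}
choice = decide("walk", "stop", repertoire, urgent={"stop"})
```

A real robot would replace decide with rules suited to its tasks; the point of the sketch is only that the four outcomes named above can be kept explicit and separate from the conversion of parameter values.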
4. User studies
4.1. Objectives and methods
In the earlier part of this chapter I pointed out that a multimodal command language to direct home-use robots must have several properties. This opinion raises some fundamental questions:
Is the language easy for non-experts to learn and use?
How much time does it take to give a command in the language?
How expensive are robots that can execute commands in the language without significant delay and frequent misinterpretations?
How can the language be improved?
To answer these questions, one must collect data by conducting user studies that record multimodal commands by a wide range of users, speech recognition results, nonverbal events, system interpretations, reactions of home-use robots, user opinions, and so on.
The first question is about learning the command language. One can estimate a user’s ability to give multimodal commands in the language (the user’s linguistic performance) by giving various tasks. Fluency, human errors, command success rates, time required for each command, and self-assessments can be indicators of performance. The second question is about the language’s efficiency. One must investigate times required for commands by a wide range of users at several stages of learning. The third question can be answered by developing home-use robots and using them in user studies. The last question is related to the other questions and should be answered by finding all sorts of problems including human and system errors. Constructive criticism by users also plays a great role.
My colleagues and I have built a command interpretation system on a personal computer, small real humanoids (Oka et al., 2008), simulated humanoids, and a simulated robot cleaner that can be commanded in different versions of RUNA, and have conducted some user studies. In these studies, more than a hundred users, mostly young students, commanded one of the robots within 90 minutes. Some of them were asked to give spoken and/or multimodal commands printed on a sheet of paper. Many of them were given one or more goals to be achieved by giving spoken or multimodal commands: checking a room, changing the settings of an air conditioner, moving a box, cleaning a dusty area, etc. We video-recorded the users and robots, and recorded speech recognition results, nonverbal events, and command interpretations. Each user was asked to fill in a question sheet after commanding the robot.
Before asking each user to command one of the robots, we showed the person a short demonstration movie and handed over a leaflet that illustrates how to command each action in diagrams and pictures (Figure 2). We also prepared some short exercise programs to improve users’ success rates and reduce human errors within 20 minutes.
4.2. Summary of results
In the user studies, the novice users were able to command our robots in RUNA consulting one of the leaflets and complete their tasks. In fact, there were many users who were able to direct the robot without their leaflet later in their tests.
Most of the users spoke clearly and fluently after practice, although there were a small number of hesitations, fluffs, and hashes especially in commands including more spoken phrases. Only a few users misused words. In the latest studies, 92-98% of spoken messages were correctly recognized with no word misrecognition thanks to the latest version (4.1.1) of an open-source grammar-based speech recognition engine (Lee et al., 2001).
Most nonverbal messages were given properly after the users learned how to use them. There were few human errors such as pressing a wrong button, touching a wrong part of the robot body, and wrong gestures. However, there were problems in specifying some action parameter values in nonverbal messages. For example, most users made errors in choosing one of five length values by pressing a button, even after some practice to learn durations. There were also failures in specifying action parameters using hand gestures due to errors in our gesture detector using a web camera.
A majority of the users in the latest studies recorded a command success rate higher than 90%. Most user commands were completed within 10 seconds and our robots responded to them within a second or so. In fact, there were users who repeatedly gave multimodal commands very quickly without looking at the leaflet. Those users spoke immediately after pressing buttons or moving their hand(s) to the camera. About 77% of the users who were asked a question about their preference answered that they preferred multimodal commands to speech-only commands in RUNA. Those users selected multimodal commands to achieve their tasks more often than the others. In one of the latest studies, all of the 20 users felt that they understood how to direct robots, although some of them did not find it easy.
In the question sheets filled in by the users, there were some important opinions about the language. Some of them pointed out that it was difficult to learn to specify action parameter values in nonverbal messages. There were users who thought speech recognition errors caused problems for them.
Further studies are needed to prove the effectiveness of the language for a wide range of non-experts, but our results imply that the current version of RUNA is fairly easy for Japanese speakers to learn. Although the novice users made some errors, they would not need a long time or much effort to fully master the spoken language and the set of nonverbal messages. With more experience, they would be able to give spoken commands specifying three or more action parameters fluently, use default parameter values whenever possible, and choose among nonverbal modes. As even novice users were able to command robots within a short period of time, experienced users would not take an unnecessarily long time for a command.
Can users of home-use robots teach themselves the language? A demonstration movie and an introductory leaflet which illustrates examples of multimodal commands would help a novice user to grasp the principles of the language. Although it will take a while to master all types of actions, I suppose that it will be easier and easier to learn a new command.
Multimodal commands in the current version of RUNA can be interpreted using a microphone, a web camera, a controller or a keypad, tactile sensors, and a personal computer. Therefore, home-use robots in the future would not need extra hardware for understanding commands in RUNA. Besides, more sophisticated speech recognisers and gesture detectors would reduce misinterpretations of user commands.
One should be able to make the language more natural and easier to learn by both amending the mapping between nonverbal events and action parameters and introducing new types of nonverbal events. In the current version of RUNA, some action parameter values are naturally mapped to event parameter values, and can be specified without acquiring skills. For example, it is easy for anyone to specify a direction by pressing a button, touching the robot, or using a gesture. Likewise, no special skill is necessary to give robots information about body parts, repetition counts, modes, and qualitative values such as long and short, in nonverbal messages. However, it is difficult for inexperienced users to specify angles, lengths, heights, and temperatures using buttons or gestures. There are three reasons for this. First, users need skills to specify quantities using durations, frequencies, or lengths; how long should I press the button to turn the robot by 30 degrees? Second, users must remember arbitrary mappings; how many times should I wave my hand to get a turn by 180 degrees? Third, our gesture detector cannot measure lengths with precision. This problem can be remedied by making use of a pen tablet, a touch panel, dials, or a screen to display parameter values. Another possible solution is to start an endless action and stop it by pressing a button, touching the robot, raising the right hand, saying “Stop,” and so on. One should notice the existing methods are still helpful when users do not need high precision.
Word misuses found in the user studies prove the importance of choice of words for the lexicon of the spoken language. One can prevent word misuses by including as many Japanese words as possible. However, homonyms will increase risks of syntactically or semantically ambiguous utterances and speech recognition errors. Therefore, only frequent word misuses should be removed by adding new words to the lexicon.
5. Future directions
RUNA can be extended by adding grammar rules, spoken words, and types of nonverbal events. Without doubt, multimodal command languages to direct multipurpose home-use robots in the future must have more classes and types of actions. However, the framework of RUNA, based on type and parameter, should work well for the purpose of giving home-use robots various kinds of action commands, goals, and missions despite its simplicity and limitations. Certainly, one must avoid syntactically or semantically ambiguous utterances and select types of nonverbal events suitable for specifying parameter values of actions, goals, and missions, taking into account both cost and usability.
Nonverbal messages can help human-robot communication in the same ways that they help human-human communication. They can not only segment and disambiguate verbal messages, but also convey the current status of humans and robots. Eye contact, hand gestures, postures, body touches, and button press actions can be clues to detect and segment spoken commands, phrases, and words; paralanguage may play important roles in disambiguation; nonverbal messages that convey emotional and physical status may help robots’ decision making.
Multimodal languages for responses from home-use robots are also among my interests. Most importantly, robots can send nonverbal messages to convey their status and whether or not they can receive a new command at present. Another interesting future work would be fusing nonverbal and verbal messages. Redundant action parameter values in multiple modes may reduce risks of misinterpretations.