1. Introduction
Human-Computer Interfaces are now widely studied and have become one of the major interests of the scientific community. More and more electronic devices surround people in their day-to-day lives. This exponential incursion of electronics into homes, cars and workplaces is not only due to its ability to ease the achievement of common and tedious tasks or to continuously decreasing prices, but also to the increasing "user-friendliness" of interfaces, which makes these devices easier to use.
Having been studied for more than fifty years, speech and natural language processing have made major progress during the last two decades. It is now feasible to build real Spoken Dialogue Systems (SDS) interacting with human users through voice-enabled interactions. Speech often appears as a natural way for a human being to interact, and it provides potential benefits such as hands-free access to machines, better ergonomics and greater efficiency of interaction. Yet the design of speech-based interfaces has long been an expert job: it requires good skills in speech technologies and low-level programming. Moreover, rapid design and reuse of previously designed systems are almost impossible. For these reasons, among others, people are less accustomed to interacting with speech-based interfaces, which are therefore perceived as less intuitive.
Designing and optimizing a SDS is not only a matter of combining speech and language processing systems such as Automatic Speech Recognition (ASR) (Rabiner & Juang 1993), Spoken Language Understanding (SLU) (Allen 1998), Natural Language Generation (NLG) (Reiter & Dale 2000), and Text-to-Speech (TTS) synthesis (Dutoit 1997). It also requires the development of dialogue strategies taking into account at least the performances of these subsystems (and others), the nature of the task (e.g. form filling (Pietquin & Dutoit 2006a), tutoring (Graesser et al 2001), robot control, or database querying (Pietquin 2006b)), and the user's behaviour (e.g. cooperativeness, expertise (Pietquin 2004)). The great variability of these factors makes rapid design of dialogue strategies and reuse of previous work across tasks very complex. Most often, such a design is a cyclic process composed of strategy hand-coding, prototype releases, and expensive, time-consuming user tests. In addition, there is no guarantee that hand-crafted strategies developed by experts are anywhere close to optimal. Because it provides data-driven methods and objective clues about performance, statistical learning of optimal dialogue strategies has become a leading domain of research (Lemon & Pietquin, 2007). The goal of such approaches is to reduce the number of design cycles (Fig. 1).
Supervised learning for such an optimization problem would require examples of ideal (sub)strategies, which are typically unknown. Indeed, no one can actually provide an example of what would objectively have been the perfect sequencing of exchanges after having participated in a dialogue; humans have a greater propensity to criticize what is wrong than to provide positive proposals. In this context, reinforcement learning using Markov Decision Processes (MDPs) (Levin et al 1998, Singh et al 1999, Scheffler & Young 2001, Pietquin & Dutoit 2006a, Frampton & Lemon 2006) and Partially Observable MDPs (POMDPs) (Poupart et al 2005, Young 2006) has become a particular focus.
2. Formalism
2.1. Definitions
In the rest of this chapter, the following notations will be used: at turn t, the dialogue manager's internal state is denoted s_t and its action a_t; the corresponding system utterance is denoted sys_t, the user's spoken response u_t, and the observation resulting from the processing of this response o_t.
2.2. Formal description of man-machine spoken dialog
A man-machine spoken dialog can be considered as a sequential process in which a human user and a Dialog Manager (DM) communicate using speech through speech and language processing modules (Fig. 2). The role of the DM is to define the sequencing of spoken interactions and therefore to take decisions about what to do at a given time (providing information, asking for information, closing the dialog, etc.). A Spoken Dialog System is often meant to provide information to a user, which is why it is generally connected to a Knowledge Base (KB) through its DM. The dialog is therefore regarded as a turn-taking process in which pieces of information are processed sequentially by a set of modules, performing a cycle going from the DM to the user and back. At each turn, the DM selects an action, the corresponding system utterance is synthesized and presented to the user, and the user's spoken response is processed back through the recognition and understanding modules to produce a new observation for the DM.
The following paragraphs use this description of a man-machine dialog as the basis for a probabilistic model.
2.3. MDP and Reinforcement Learning
In our vision of the MDP formalism, a discrete-time system interacting with its stochastic environment through actions is described by a finite or infinite set of states $\{s_i\} \in \mathcal{S}$. At each time step $t$, the system is in a state $s_t$ and chooses an action $a_t$ from an action set $\mathcal{A}$; the environment then moves to a new state $s_{t+1}$ and emits an immediate reward $r_t$. The one-step dynamics are described by the transition probabilities

$\mathcal{T}^{a}_{ss'} = P(s_{t+1} = s' \mid s_t = s, a_t = a)$   (1)

and by the expected rewards

$\mathcal{R}^{a}_{ss'} = E[r_t \mid s_t = s, a_t = a, s_{t+1} = s']$   (2)

These last expressions assume that the Markov property is met, which means that the system's functioning is fully defined by its one-step dynamics and that its behavior from state $s_t$ depends only on $s_t$ and on the action taken in it, not on the path followed to reach it.
In this framework, an RL agent aims at learning a policy $\pi : \mathcal{S} \to \mathcal{A}$, mapping states to actions, that maximizes the expected discounted return $E[\sum_{k \geq 0} \gamma^k r_{t+k}]$, where $\gamma \in [0, 1)$ is a discount factor weighting the importance of future rewards.
If the probabilities of equations (1) and (2) are known, an analytical solution can be computed by dynamic programming; otherwise the system has to learn the optimal strategy by a trial-and-error process. RL is therefore about how to optimally map situations to actions by trying actions and observing the environment's feedback. In the most challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. Trial-and-error search and delayed rewards are the two main features of RL. Different techniques are described in the literature; in the following, Watkins' Q(λ) algorithm (Watkins 1989) will be used (a minimal sketch is given below).
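To make this concrete, here is a minimal sketch of Watkins' Q(λ) with an ε-greedy exploration policy (Python is used for all illustrative sketches in this chapter). The environment interface (`actions`, `reset()`, `step()`) and all parameter values are assumptions for illustration, not part of the original system:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def watkins_q_lambda(env, episodes=10000, alpha=0.1, gamma=0.95,
                     lam=0.9, epsilon=0.1):
    """Watkins' Q(lambda): eligibility traces are cut after exploratory
    (non-greedy) actions. `env` is a hypothetical episodic environment
    exposing `actions`, `reset() -> state`, `step(a) -> (state, r, done)`,
    with hashable states."""
    Q = defaultdict(float)                     # Q[(state, action)] values
    actions = list(env.actions)
    for _ in range(episodes):
        traces = defaultdict(float)            # eligibility traces e[(s, a)]
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, s2, actions, epsilon)
            greedy = max(actions, key=lambda b: Q[(s2, b)])
            target = r if done else r + gamma * Q[(s2, greedy)]
            delta = target - Q[(s, a)]
            traces[(s, a)] += 1.0
            for key in list(traces):
                Q[key] += alpha * delta * traces[key]
                # Traces survive only as long as the greedy action is followed.
                traces[key] = gamma * lam * traces[key] if a2 == greedy else 0.0
            s, a = s2, a2
    return Q
```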
3. Human-Machine Dialogue and Markov Decision Process
From the point of view of the dialogue manager, the interaction can probabilistically be described by the joint probability of the signals crossing its interface, namely the observation $o_t$ resulting from the processing of the user's utterance and the new internal state $s_{t+1}$, given the current state $s_t$ and the chosen action $a_t$:

$P(s_{t+1}, o_t \mid s_t, a_t) = P(s_{t+1} \mid o_t, s_t, a_t) \cdot P(o_t \mid s_t, a_t)$   (5)

In (5), the first term stands for the task model

$P(s_{t+1} \mid o_t, s_t, a_t)$   (6)

which describes how the dialogue manager updates its internal state after each new observation, while the second term stands for the environment model, which describes the way the environment (the user together with the speech and language processing modules) reacts to the system's actions.
3.1. Markov Property and Random Noise
In the case of a SDS, the Markov property is met if the dialogue manager's choice of the action $a_t$ relies only on the current state $s_t$, that is, if the state captures all the information about the interaction history that is relevant to this choice.
Let's illustrate this with a train ticket booking system. When accessing such a system, a customer can book a ticket by providing information about the cities of departure and arrival and a desired time of departure. As in a 3-slot-filling application, three pieces of information (sometimes called attributes or slots) therefore have to be gathered by the system. The dialogue state can be represented as a vector of three booleans, each set to true once the corresponding slot has been filled, as in the ideal dialogue of Table 1.
Speaker | Spoken Utterance | Dialogue state
System | Hello, how may I help you? | [false false false]
User | I'd like to go to Edinburgh. |
System | What's your departure city? | [false true false]
User | I want to leave from Glasgow. |
System | When do you want to go from Glasgow to Edinburgh? | [true true false]
User | On Saturday morning. |
System | Ok, seats are available in train n° xxx … | [true true true]

Table 1. Ideal dialogue in a train ticket booking application.
To meet the Markov property with such a state representation, we have to assume that the system behaves the same whatever the order in which the slots were filled (and, incidentally, whatever the values of the attributes). The Markov assumption is also made about the environment; that is, the user is assumed to behave the same whatever the filling order as well. These are of course strong assumptions, but we will see later that they lead to satisfactory results. A sketch of such a slot-based state update is given below.
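For illustration, the state of this 3-slot task can be encoded as a boolean vector updated after each user utterance. This is a sketch; the slot names are taken from the example above and the parsed-utterance format is an assumption:

```python
SLOTS = ("departure_city", "arrival_city", "departure_time")

def update_state(state, parsed_utterance):
    """state: dict slot -> bool; parsed_utterance: dict slot -> value.
    Only slot presence is kept, so the Markov state ignores both the
    filling order and the actual attribute values, as assumed above."""
    return {slot: state[slot] or (slot in parsed_utterance) for slot in SLOTS}

state = {slot: False for slot in SLOTS}            # [false false false]
state = update_state(state, {"arrival_city": "Edinburgh"})
state = update_state(state, {"departure_city": "Glasgow"})
print([state[s] for s in SLOTS])                   # [True, True, False]
```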
Finally, the noise introduced by the speech and language processing modules is most often considered to be random, so that the errors corrupting the observations at a given turn can be assumed independent of those introduced at previous turns.
3.2. Dialogue Management as an MDP
As claimed in section 2.2 and depicted in Fig. 2, a task-oriented (or goal-directed) man-machine dialogue can be seen as a turn-taking process in which a human user and a dialogue manager exchange information through different channels processing speech inputs and outputs (ASR, TTS...). In our problem, it is the dialogue manager's action (or dialogue act) selection strategy that has to be optimized.
As shown in Fig. 3, the dialogue manager can thus be cast as the agent of an MDP whose environment comprises everything on the other side of its interface: the user as well as the speech and language processing modules.
3.3. Reward Function
To fit the MDP formalism completely, a reward function has to be defined. Following the PARADISE framework (Walker et al 1997), the contribution of each dialogue to the user's satisfaction can be approximated by a linear combination of a task completion measure and of objective dialogue costs:

$R = \alpha \cdot \mathcal{N}(TC) - \sum_i w_i \cdot \mathcal{N}(c_i)$

where $TC$ is an objective measure of task completion (for instance the kappa coefficient (Carletta 1996)), the $c_i$ are cost measures (such as the dialogue duration or the number of speech recognition errors), and $\mathcal{N}$ is a normalization function,

where the weights $\alpha$ and $w_i$ can be estimated by linear regression of user satisfaction ratings against these measures.
3.4. Partial Observability
If a direct mapping between observations and system (or dialogue) states exists, the process is completely observable and the task model (see Eq. 6) can easily be built. Yet it is rarely the case that the observations are directly linked to the dialogue state. Indeed, the real dialogue state at time $t$ (including, for instance, the user's actual goal and the true meaning of his/her utterances) is hidden from the dialogue manager, which only perceives the error-prone outputs of the ASR and NLU modules.
This is typically what happens in partially observable environments, where only a probability distribution over possible states can be maintained given the observations. For this reason, emerging research focuses on the optimization of spoken dialogue systems in the framework of Partially Observable Markov Decision Processes (POMDPs) (Poupart et al 2005, Young 2006).
4. Learning Dialogue Policies using Simulation
Using the framework described previously, it is theoretically possible to automatically learn spoken dialogue policies allowing natural conversation between human users and computers. This learning process should be realised online, through real interactions with users. One could even imagine building the reinforcement signal from direct queries to the user about his/her satisfaction after each interaction (Fig. 4).
For several reasons, direct learning through interactions is difficult. First, a human user would probably react badly to some of the exploratory actions the system would choose, since they might be completely incoherent. Moreover, a very large number of interactions is required to train such a system (typically tens of thousands of dialogues for standard dialogue systems). This is why data-driven learning has been proposed, so as to take advantage of existing databases for bootstrapping the learning process. Two methods were initially investigated: learning the state transition probabilities and the reward distribution from data (Singh et al, 1999), or learning the parameters of a simulation environment mainly reproducing the behaviour of the user (Levin et al 2000). The second method is today preferred (Fig. 5): whatever the data set available, it is unlikely to contain every possible state transition, whereas simulation allows exploring the entire state and policy space. Dialogue simulation is therefore necessary for expanding the existing data sets and learning optimal policies.
Most often, the dialogue is simulated at the intention level rather than at the word-sequence or speech-signal level, as it would be in the real world (an exception can be found in (Lopez Cozar et al 2003)). Here, we regard an intention as the minimal unit of information that a dialogue participant can express independently. Intentions are closely related to concepts, speech acts or dialogue acts. For example, the sentence "I'd like to buy a desktop computer" is based on the concept buy(desktop). Modelling the environment's behaviour at a lower level is considered unnecessary, because strategy optimization takes place at this higher level. Additionally, concept-based communication allows error modelling of all the parts of the system, including natural language understanding (Pietquin & Renals 2002, Pietquin & Dutoit 2006b). More pragmatically, it is simpler to automatically generate concepts than word sequences (and certainly than speech signals), since a large number of different utterances can express the same intention without this influencing the dialogue manager's strategy. Table 2 illustrates such a simulation process; the intentions have been expanded in the last column for readability, and the signals column uses the notations of section 2.2.
Signals | Intentions | Expanded Intentions
sys0 | greeting | Hello! How may I help you?
u0 | arr_city = 'Paris' | I'd like to go to Paris.
sys1 | const(arr_time) | When do you prefer to arrive?
u1 | arr_time = '1.00 PM' | I want to arrive around 1 PM.
sys2 | rel(arr_time) | Don't you prefer to arrive later?
u2 | rel = false | No.
sys3 | conf(arr_city) | Can you confirm you want to go to Paris?
u3 | conf = true | Yes!
… | … | …

Table 2. Intention-level simulation of a dialogue.
This approach requires modelling the environment of the dialogue manager as a stochastic system and learning the parameters of this model from data. It has been a topic of research since the early 2000s (Levin et al 2000, Scheffler & Young 2001, Pietquin 2004). Most of the research is now focused on simulating the user (Georgila et al 2005, Pietquin 2006a, Schatzmann et al 2007a), and assessing the quality of a user model for training a reinforcement learning agent is an important research track (Schatzmann et al 2005, Rieser & Lemon 2006, Georgila et al 2006). Modelling the errors introduced by the ASR and NLU systems is also a major topic of research (Scheffler & Young 2001, Lopez Cozar et al 2003, Pietquin & Beaufort 2005, Pietquin & Dutoit 2006b). A minimal sketch of an intention-level user model is given below.
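The following sketch illustrates what such an intention-level user model might look like. The goal structure, the probability value and the behaviour rules (always confirming, never relaxing) are illustrative assumptions, not parameters estimated from data as a real user model would require:

```python
import random

class IntentionUserModel:
    """Simulates user dialogue acts at the intention level. The user draws
    a goal (a set of attribute-value pairs) and answers system acts from it."""

    def __init__(self, goal, p_extra=0.3):
        self.goal = dict(goal)     # e.g. {"arr_city": "Paris", "arr_time": "1.00 PM"}
        self.p_extra = p_extra     # chance of volunteering a second slot

    def respond(self, system_act, arg=None):
        if system_act in ("greet", "openQ"):
            # Provide one goal attribute, sometimes two (mixed initiative).
            k = 2 if random.random() < self.p_extra and len(self.goal) > 1 else 1
            slots = random.sample(list(self.goal), k)
            return {s: self.goal[s] for s in slots}
        if system_act == "constQ":
            return {arg: self.goal[arg]}     # answer the constraining question
        if system_act in ("expC", "allC"):
            return {"conf": True}            # an ideal, always-confirming user
        if system_act == "rel":
            return {"rel": False}            # refuses to relax its constraints
        return {}
```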
5. Speech-Based Database Querying
We will illustrate reinforcement-learning-based dialogue optimization on the particular task of a speech-based database querying system. In such an application, the user wants to extract one record or a list of records from a database, selected according to specific features provided through speech-based interaction.
In the following, several experiments made on a database containing 350 computer configurations are described. The database is split into 2 tables (one for notebooks, one for desktops), each of them containing 6 fields: pc_mac (PC or Mac), processor_type, processor_speed, ram_size, hdd_size and brand. The goal of the dialogue system is therefore to extract one or a short list of computer configurations from the database and to present it to the user. To do so, the system has to gather information about which computer features the user wants. In the following, the application is described in terms of actions, states and rewards so as to be mapped onto the Markov decision process paradigm.
5.1. Action Set
The task involves database querying. Possible system actions therefore imply not only interactions with the user (such as spoken questions, confirmation requests or assertions) but also interactions with the database (queries). The action set contains 8 generic actions:
greet: greeting (e.g. "How may I help you?").
constQ(arg): ask a constraining question, i.e. ask the user to provide a value for the argument arg (e.g. "How much memory would you like?").
openQ: ask an open-ended question (e.g. "Do you have any other preference?").
expC(arg): ask for an explicit confirmation of the value of the argument arg.
allC: ask for a confirmation of all the arguments.
rel(arg): ask the user to relax the constraint on the value of the argument arg.
dbQ([args]): query the database using the current values of the listed arguments.
close: present data and close the dialogue session.
The value of arg can be the table type (notebook or desktop) or any of the 6 fields of the tables, so that 7 different arguments can parameterize the actions. A sketch of one possible encoding of this action set is given below.
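As an illustration, the grounded action set might be enumerated as follows. This is a sketch of one possible encoding under the stated assumptions (7 arguments, generic actions instantiated per argument); the class and field names are ours:

```python
from dataclasses import dataclass
from typing import Tuple

ARGS = ("table", "pc_mac", "processor_type", "processor_speed",
        "ram_size", "hdd_size", "brand")   # the 7 possible arguments

@dataclass(frozen=True)
class Action:
    kind: str                    # one of the 8 generic actions
    args: Tuple[str, ...] = ()   # argument(s) the action bears on

def action_space():
    """Enumerates the grounded actions: argument-free actions plus the
    parameterized ones instantiated for each of the 7 arguments."""
    acts = [Action("greet"), Action("openQ"), Action("allC"), Action("close")]
    for a in ARGS:
        acts += [Action("constQ", (a,)), Action("expC", (a,)), Action("rel", (a,))]
    # dbQ's argument list is built at run time from the high-CL values.
    acts.append(Action("dbQ"))
    return acts
```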
5.2. State Space
The way the state space is built is very important and several state variables can be envisioned for describing the same task. Yet, some general considerations might be taken into account:
The state representation should contain enough information about the history of the dialogue so as to assume the Markov property to be met.
State spaces are often said to be informational, in the sense that they are built from the amount of information the DM has retrieved from the environment up to the current state.
The state representation must embed enough information so as to give an accurate representation of the situation to which an action has to be associated (it is not as obvious as it sounds).
The state space must be kept as small as possible since the reinforcement learning algorithms converge in linear time with the number of states of the underlying Markov decision process.
According to these considerations and to the particular task of database querying, two slightly different state spaces were built to describe the task as an MDP, so as to illustrate the sensitivity of the method to the state-space representation. In the first representation, referred to as S1, a state contains two kinds of variables:
A vector of 7 boolean values (one for each possible argument arg), each set to true once the corresponding value has been provided by the user.
Information about the Confidence Level (CL) of each provided value, discretized into two categories: high and low.
Notice that 'dbQ' actions will only include values with a high confidence level.
The second way to represent the state space is built on the same basis, but an additional state variable (denoted DB in the example tables below) is added, which accounts for the number of records returned by the last database query; its value is also discretized into two categories (high and low). This second representation will be referred to as S2. A sketch of both encodings follows.
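A possible encoding of the two representations is sketched below; the CL and DB discretization thresholds are illustrative assumptions, not values from the original experiments:

```python
from typing import Dict, Optional

ARGS = ("table", "pc_mac", "processor_type", "processor_speed",
        "ram_size", "hdd_size", "brand")      # the 7 possible arguments

def encode_state(values: Dict[str, str], cl: Dict[str, float],
                 db_count: Optional[int] = None,
                 cl_threshold: float = 0.7, db_threshold: int = 10) -> tuple:
    """Encodes S1 as (filled flags, high-CL flags); appending the
    discretized DB variable yields S2. Both thresholds are assumptions."""
    filled = tuple(a in values for a in ARGS)
    cl_high = tuple(a in values and cl.get(a, 0.0) >= cl_threshold
                    for a in ARGS)
    if db_count is None:
        return filled + cl_high                        # S1 representation
    return filled + cl_high + (db_count >= db_threshold,)  # S2 adds DB flag
```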
5.3. Reward Function
Once again, there is not a unique way to build the reward function, and slight differences in these choices can result in large variations in the learned strategy. To illustrate this, some simple functions are described in the following. According to (Walker et al, 1997), the reward function (which is here a cost function that we will try to minimize) should rely on an estimate of the dialogue time duration (D), on the ASR performances (CL) and on the task completion (TC):

$C = w_D \cdot D - w_{CL} \cdot CL - w_{TC} \cdot TC$

In this last expression, the $w$ terms are positive weights balancing the relative importance of each contribution, and the duration D will be measured either by the number of user turns (NU) or by the number of system turns (NS), as illustrated by the experiments below.
On the other hand, the task completion is not always easy to define. The kappa coefficient (Carletta 1996) is a common choice, but for this database querying task two simpler measures were used:

$TC_{max} = \max_{r \in R} \#(G \cap r) \qquad TC_{av} = \mathrm{mean}_{r \in R}\, \#(G \cap r)$

In these last expressions, $\#(G \cap r)$ is the number of attribute values shared by the user's goal $G$ and a record $r$ of the result set $R$ presented to the user. TCmax thus measures how close the best presented record is to the user's goal, while TCav also penalizes the presentation of large result sets containing many irrelevant records.
Finally, the ASR performance measures will be provided by the confidence levels (CL) computed by the ASR system after each recognition task. A sketch of a possible cost computation follows.
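For illustration, the cost of a finished dialogue might be computed as in this sketch; the weight values are arbitrary placeholders, not the ones used in the experiments below:

```python
def dialogue_cost(n_turns, mean_cl, tc, w_d=1.0, w_cl=1.0, w_tc=1.0):
    """Cost to minimize: duration minus ASR-confidence and task-completion
    contributions. n_turns is NU or NS; tc is TCmax or TCav."""
    return w_d * n_turns - w_cl * mean_cl - w_tc * tc
```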
6. Experiments
The number of required interactions between an RL agent and its environment is quite large (at least 10,000 dialogues in our case), so it has been mandatory to simulate most of the dialogues, for the reasons explained in section 4. An intention-based simulation environment has therefore been built, as described in (Pietquin & Dutoit 2006a). It simulates ASR errors using a constant Word Error Rate (WER) and generates confidence levels as real numbers ranging between 0 and 1, according to a distribution measured on a real system. If the system has to recognize more than one argument at a time, the error model is applied to each argument and an overall confidence level is derived for the turn. A sketch of such an error model follows.
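The ASR error model can be sketched as follows; the WER value and the Beta-distributed confidence scores stand in for the distribution measured on the real system, so they are assumptions of this sketch:

```python
import random

def simulate_asr(attributes, vocabulary, wer=0.1):
    """Substitutes each attribute value with probability `wer` and draws a
    confidence level in [0, 1], higher on average for correct values.
    `vocabulary` maps each argument to its possible values."""
    recognized, cls = {}, {}
    for arg, value in attributes.items():
        if random.random() < wer:
            # Confusion: replace the value by another one from the vocabulary.
            value = random.choice([v for v in vocabulary[arg]
                                   if v != attributes[arg]])
            cls[arg] = random.betavariate(2, 5)   # errors: low CL on average
        else:
            cls[arg] = random.betavariate(5, 2)   # correct: high CL on average
        recognized[arg] = value
    return recognized, cls
```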
Several experimental results obtained with different settings of the state space and of the reward function are exposed in the following. These settings are obtained by combining in three different ways the parameters defined above: the state-space representation (S1 or S2), the duration measure (NU or NS) and the task completion measure (TCmax or TCav).
6.1. First Experiment: S1, NU, TCmax
The first experiment is based on the smaller state space S1, with the number of user turns NU as the duration measure and TCmax as the task completion measure in the cost function. Table 3 shows the average performance measures of the learned strategy, and Table 4 the average frequency of each action per dialogue.
NU | NS | TCmax | TCav
2.25 | 3.35 | 6.7 | 1.2

Table 3. Average performance measures of the first learned strategy.

greet | constQ | openQ | expC | allC | rel | dbQ | close
1.00 | 0.06 | 0.0 | 0.14 | 0.0 | 0.05 | 1.10 | 1.00

Table 4. Average frequency of each action per dialogue (first strategy).
When looking at the first three columns of the performance table (Table 3), the learned strategy doesn't look so bad. It actually leads to short dialogues, in terms of user turns as well as of system turns, and reaches a very high task completion rate in terms of TCmax.
When looking at the average frequency of actions in Table 4, one can see that the only action addressed to the user that happens frequently during a dialogue is the greeting; the others almost never occur. Actually, the learned strategy consists in uttering the greeting prompt, to which the user should answer by providing some arguments. The system then performs a database query with the retrieved attributes and presents the results to the user. Sometimes the user doesn't provide any attribute when answering the greeting prompt, or the value is not recognized at all by the ASR model (very low confidence level), and the query is then performed with few or no constraints.
This results in presenting almost the whole database when the user only provides one argument in response to the greeting. This is why there is such a big difference between TCmax and TCav: when almost all the records are presented, the best one is very close to the user's goal (high TCmax), but on average the presented records share almost nothing with it (low TCav). Table 5 shows a typical dialogue obtained with this strategy.
Signals | Intentions | Expanded Intentions
a0 → sys0 | greeting | Hello! How may I help you?
u0 | Table = 'Notebook' | I'd like to buy a Notebook.
o0 | Table = 'Notebook', CL = high |
a1 | dbQ |
o1 | DB = 97 (high) |
a2 → sys2 | close | Ok, here are the computers corresponding to your request: (proposes the 97 Notebooks in the DB) …

Table 5. Example of a dialogue obtained with the first strategy.
6.2. Second Experiment: S2, NU, TCav
Here, the same settings are used, except that the state space is now S2 (which includes the DB variable) and that the TCav measure replaces TCmax in the cost function. Tables 6 and 7 show the corresponding performance measures and average action frequencies.
NU | NS | TCmax | TCav
5.75 | 8.88 | 6.7 | 6.2

Table 6. Average performance measures of the second learned strategy.

greet | constQ | openQ | expC | allC | rel | dbQ | close
1.00 | 0.87 | 1.24 | 0.31 | 1.12 | 0.21 | 3.13 | 1.00

Table 7. Average frequency of each action per dialogue (second strategy).
Results show that the dialogues get longer but that TCav now almost reaches the value of TCmax: the system only presents result sets whose records closely match the user's goal.
The mean number of user turns shows that only 5.75 turns are usually needed to reach an accurate result set, because the computer configurations are sufficiently different from one another that a database query does not need many attributes to return accurate results. Thus the system doesn't ask the user for all the attribute values. Further investigation would show that the system takes advantage of the structure of the database and asks for the attributes that allow extracting the desired records as fast as possible. Table 8 shows a typical dialogue obtained with this second strategy.
Signals | Intentions | Expanded Intentions
a0 → sys0 | greeting | Hello! How may I help you?
u0 | Table = 'Notebook' | I'd like to buy a Notebook.
o0 | Table = 'Notebook', CL = high |
a1 | dbQ |
o1 | DB = high |
a2 → sys2 | openQ | Do you have any other preference?
u2 | pc_mac = 'PC', proc_type = 'Pentium III' | I'd rather like a PC with a Pentium III processor.
o2 | pc_mac = 'PC', proc_type = 'Pentium III', CL = high |
a3 | dbQ |
o3 | DB = high |
a4 → sys4 | constQ(ram) | How much memory would you like?
u4 | ram = 128 | 128 MB.
a5 | dbQ |
o5 | DB = low |
a6 → sys6 | allC | You asked for a PC Notebook with a Pentium III processor and 128 MB memory.
u6 | conf_table = true … | Yes.
a7 → sys7 | close | Ok, here are the computers corresponding to your request: (proposes the 3 results of the DB query) …

Table 8. Example of a dialogue obtained with the second strategy.
6.3. Third Experiment: S2, NS, TCav
The same experiment as the previous one has been performed, but replacing the NU duration measure by the number of system turns NS in the cost function; database queries are therefore explicitly penalized, since they count as system turns. Tables 9 and 10 show the results.
NU | NS | TCmax | TCav
6.77 | 7.99 | 6.6 | 6.1

Table 9. Average performance measures of the third learned strategy.

greet | constQ | openQ | expC | allC | rel | dbQ | close
1.00 | 1.57 | 1.24 | 0.33 | 1.32 | 0.31 | 1.22 | 1.00

Table 10. Average frequency of each action per dialogue (third strategy).
This obviously results in a decrease of the number of database queries and in a proportional decrease of the number of system turns NS. Yet an increase of the number of user turns NU can be observed: the system now asks more constraining questions (the constQ frequency rises from 0.87 to 1.57) so as to gather more attributes before querying the database, making each query as conclusive as possible.
This last strategy is actually optimal for the considered simulation environment (constant word error rate for all tasks) and is suitable for use with this simple application.
7. Conclusion
This chapter proposed a formal description of man-machine spoken dialogue suitable for mapping man-machine dialogues onto (partially observable) Markov decision processes. This allows data-driven optimization of a dialogue manager's interaction strategy using the reinforcement learning paradigm. Yet such an optimization process often requires tens of thousands of dialogues, which cannot be obtained through real interactions with human users because of time and economic constraints. Expanding existing databases by means of dialogue simulation is a solution to this problem, and several approaches can be envisioned, as discussed in section 4. In this context, we described the particular task of speech-based database querying and its mapping onto the MDP paradigm in terms of actions, states and rewards. Three experiments on a very simple task have shown the influence of the parameterization of the MDP on the learned conversational strategy. From this, one can say first that the state-space representation is a crucial point, since it embeds the knowledge the system has about the interaction. Second, the reward function is also of major importance, since it measures how well the system performs on the task by simulating the perception of dialogue quality from the user's point of view; performance measurement is a key ingredient of RL. The three experiments showed the influence of these parameters on the learned strategy: a correctly parameterized RL algorithm can result in an acceptable dialogue strategy, while small changes in the parameters can lead to degenerate strategies unsuitable for use in real conditions.
8. Future Works
Data-driven optimization of spoken dialogue strategies is an emerging area of research and many problems still remain. One of the first is to find tractable algorithms for training real-size dialogue systems. Indeed, the standard RL algorithms are suitable for small tasks such as the one described in section 5, but real applications can exhibit up to several million states, possibly with continuous observations (Williams et al 2005). The curse of dimensionality is therefore of particular interest in the area of spoken dialogue systems. Several attempts to tackle this problem can be found in the literature, by scaling up MDPs using supervised learning (Henderson et al 2005) or hierarchical learning (Cuayáhuitl et al 2007), and algorithms for tractable optimization of spoken dialogue systems via the POMDP paradigm can be found in (Poupart et al 2005, Young 2006). This preliminary work in the field of generalization and hierarchical learning shows the interest of the community in these techniques. Another problem to tackle is the development of realistic user models that are easily trainable from data and suitable for training RL-based dialogue managers. Different approaches are being studied, such as the recently proposed agenda-based user model (Schatzmann et al 2007b), which can be trained from data by an Expectation-Maximisation algorithm, or user models based on dynamic Bayesian networks (Pietquin & Dutoit 2006a). Assessing such user models in terms of the quality of the trained strategies and of similarity with real user behaviour is of course essential (Schatzmann et al 2005, Georgila et al 2006, Rieser & Lemon 2006). On the other hand, it might be interesting to see how learned strategies can help human developers to design optimal strategies; indeed, the solution may lie in computer-aided design more than in fully automated design (Pietquin & Dutoit 2003). Finally, designing a complete dialogue system within an end-to-end probabilistic framework, from speech recognition to speech synthesis, automatically trained on real data, is probably the next step (Lemon & Pietquin 2007).
Acknowledgments
The research presented in this chapter has been funded by the ‘First Europe’ program of the Belgian Walloon Region, the SIMILAR European Network of Excellence and the French Lorraine Region.
References
1. Allen, J. (1994). Natural Language Understanding, Benjamin Cummings, 1987, Second Edition, 1994.
2. Carletta, J. (1996). Assessing Agreement on Classification Tasks: the Kappa Statistic, Computational Linguistics, 22(2), 249-254.
3. Cuayáhuitl, H., Renals, S., Lemon, O. & Shimodaira, H. (2007). Hierarchical Dialogue Optimization Using Semi-Markov Decision Processes, in Proceedings of Interspeech 2007, Antwerp (Belgium).
4. Dutoit, T. (1997). An Introduction to Text-to-Speech Synthesis, Kluwer Academic Publishers, Dordrecht. ISBN 0-7923-4498-7.
5. Frampton, M. & Lemon, O. (2006). Learning more effective dialogue strategies using limited dialogue move features, in Proceedings of COLING/ACL 2006.
6. Georgila, K., Henderson, J. & Lemon, O. (2005). Learning User Simulations for Information State Update Dialogue Systems, in Proceedings of Interspeech 2005, Lisbon (Portugal).
7. Georgila, K., Henderson, J. & Lemon, O. (2006). User simulation for spoken dialogue systems: Learning and evaluation, in Proceedings of Interspeech 2006, Pittsburgh (USA).
8. Graesser, A., VanLehn, K., Rosé, C., Jordan, P. & Harter, D. (2001). Intelligent Tutoring Systems with Conversational Dialogue, AI Magazine, 22(4), 39-52.
9. Henderson, J., Lemon, O. & Georgila, K. (2005). Hybrid Reinforcement/Supervised Learning for Dialogue Policies from COMMUNICATOR data, in Proceedings of the IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, 68-75.
10. Lemon, O. & Pietquin, O. (2007). Machine learning for spoken dialogue systems, in Proceedings of Interspeech 2007, Antwerp (Belgium), August 2007.
11. Levin, E., Pieraccini, R. & Eckert, W. (1997). Learning dialogue strategies within the Markov decision process framework, in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU'97), December 1997.
12. Levin, E., Pieraccini, R. & Eckert, W. (2000). A stochastic model of human-machine interaction for learning dialog strategies, IEEE Transactions on Speech and Audio Processing, 8(1), 11-23.
13. Lopez-Cozar, R., de la Torre, A., Segura, J. & Rubio, A. (2003). Assessment of dialogue systems by means of a new simulation technique, Speech Communication, 40(3), 387-407, May 2003.
14. Pietquin, O. & Renals, S. (2002). ASR system modelling for automatic evaluation and optimization of dialogue systems, in Proceedings of ICASSP 2002, Orlando (FL, USA), May 2002.
15. Pietquin, O. & Dutoit, T. (2003). Aided Design of Finite-State Dialogue Management Systems, in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME 2003), Baltimore (MD, USA).
16. Pietquin, O. (2004). A Framework for Unsupervised Learning of Dialogue Strategies, Presses Universitaires de Louvain. ISBN 2-930344-63-6.
17. Pietquin, O. (2005). A probabilistic description of man-machine spoken communication, in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME 2005), Amsterdam (The Netherlands), July 2005.
18. Pietquin, O. & Beaufort, R. (2005). Comparing ASR Modeling Methods for Spoken Dialogue Simulation and Optimal Strategy Learning, in Proceedings of Interspeech 2005, Lisbon (Portugal).
19. Pietquin, O. (2006a). Consistent goal-directed user model for realistic man-machine task-oriented spoken dialogue simulation, in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME 2006), Toronto (Canada), July 2006.
20. Pietquin, O. (2006b). Machine learning for spoken dialogue management: an experiment with speech-based database querying, in Artificial Intelligence: Methodology, Systems and Applications, J. Euzenat & J. Domingue, Eds., vol. 4183 of Lecture Notes in Artificial Intelligence, 172-180, Springer Verlag.
21. Pietquin, O. & Dutoit, T. (2006a). A probabilistic framework for dialog simulation and optimal strategy learning, IEEE Transactions on Audio, Speech and Language Processing, 14(2), 589-599, March 2006.
22. Pietquin, O. & Dutoit, T. (2006b). Dynamic Bayesian networks for NLU simulation with applications to dialog optimal strategy learning, in Proceedings of ICASSP 2006, May 2006.
23. Poupart, P., Williams, J. & Young, S. (2006). Partially observable Markov decision processes with continuous observations for dialogue management, 2006.
24. Rabiner, L. & Juang, B.-H. (1993). Fundamentals of Speech Recognition, Prentice Hall, Signal Processing Series.
25. Reiter, E. & Dale, R. (2000). Building Natural Language Generation Systems, Cambridge University Press, Cambridge.
26. Rieser, V. & Lemon, O. (2006). Cluster-based user simulations for learning dialogue strategies and the super evaluation metric, in Proceedings of Interspeech 2006.
27. Schatzmann, J., Georgila, K. & Young, S. (2005). Quantitative evaluation of user simulation techniques for spoken dialogue systems, in Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue, September 2005.
28. Schatzmann, J., Weilhammer, K., Stuttle, M. & Young, S. (2007a). A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies, The Knowledge Engineering Review, 21(2), 97-126.
29. Schatzmann, J., Thomson, B. & Young, S. (2007b). Statistical User Simulation with a Hidden Agenda, in Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue, Antwerp (Belgium), 2007.
30. Scheffler, K. & Young, S. (2001). Corpus-based dialogue simulation for automatic strategy learning and evaluation, in Proceedings of the NAACL Workshop on Adaptation in Dialogue Systems, 2001.
31. Singh, S., Kearns, M., Litman, D. & Walker, M. (1999). Reinforcement learning for spoken dialogue systems, in Advances in Neural Information Processing Systems (NIPS), 1999.
32. Young, S. (2006). Using POMDPs for dialog management, in Proceedings of the IEEE/ACL Workshop on Spoken Language Technology (SLT 2006), 2006.
33. Young, S., Schatzmann, J., Weilhammer, K. & Ye, H. (2007). The hidden information state approach to dialog management, in Proceedings of ICASSP 2007, April 2007.
34. Walker, M., Litman, D., Kamm, C. & Abella, A. (1997). PARADISE: A Framework for Evaluating Spoken Dialogue Agents, in Proceedings of ACL/EACL 1997, Madrid (Spain), 271-280.
35. Watkins, C. (1989). Learning from Delayed Rewards, PhD thesis, Psychology Department, Cambridge University, Cambridge (England).
36. Williams, J. & Young, S. (2005). Scaling up POMDPs for dialogue management: the summary POMDP method, in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 2005), 2005.
37. Williams, J., Poupart, P. & Young, S. (2005). Partially Observable Markov Decision Processes with Continuous Observations for Dialogue Management, in Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue, Lisbon (Portugal), 2005.