Open access peer-reviewed chapter

Optimising Spoken Dialogue Strategies within the Reinforcement Learning Paradigm

By Olivier Pietquin

Published: January 1st 2008

DOI: 10.5772/5284

Downloaded: 3044

1. Introduction

Human-Computer Interfaces are now widely studied and become one of the major interests among the scientific community. More and more electronic devices surround people in their day-to-day life. This exponential incursion of electronics in homes, cars and work places is not only due to its ability to ease the achievement of common and boring tasks or the continuously decreasing prices but also because the increasing “user-friendliness” of interfaces makes it easier to use.

Being studied for more than fifty years, speech and natural language processing knew major progresses during the two last decades. It is now feasible to build real Spoken Dialogue Systems (SDS) interacting with human users through voice-enabled interactions. Speech often appears as a natural way to interact for a human being and it provides potential benefits such as hand-free access to machines, ergonomics and greater efficiency of interaction. Yet, speech-based interfaces design has been an expert job for a long time. It necessitates good skills in speech technologies and low-level programming. Moreover, rapid design and reusability of previously designed systems are almost impossible. For these reasons, but not only, people are less used to interact with speech-based interfaces which are therefore thought as less intuitive.

Designing and optimizing a SDS is not only the combination of speech and language processing systems such as Automatic Speech Recognition (ASR) (Rabiner & Juang 1993), Spoken Language Understanding (SLU) (Allen 1998), Natural Language Generation (NLG) (Reiter & Dale 2000), and Text-to-Speech (TTS) synthesis (Dutoit 1997) systems. It also requires the development of dialogue strategies taking at least into account the performances of these subsystems (and others), the nature of the task (e.g. form filling (Pietquin & Dutoit 2006a), tutoring (Graesser et al 2001), robot control, or database querying (Pietquin 2006b)), and the user’s behaviour (e.g. cooperativeness, expertise (Pietquin 2004)). The great variability of these factors makes rapid design of dialogue strategies and reusability across tasks of previous work very complex. Most often, such a design is a cyclic process composed of strategy hand-coding, prototype releases and expansive and time consuming user tests. In addition, there is also no guarantee that hand-crafted strategies developed by experts are anything close to optimal. Because it provides data-driven methods and objective clues about performances, statistical learning of optimal dialogue strategies became a leading domain of research (Lemon & Pietquin, 2007). The goal of such approaches is to reduce the number of design cycles (Fig. 1).

Figure 1.

1. Optimization for minimizing the number of design cycles

Supervised learning for such an optimization problem would require examples of ideal (sub)strategies which are typically unknown. Indeed, no one can actually provide an example of what would have objectively been the perfect sequencing of exchanges after having participated to a dialogue. Humans have a greater propensity to criticize what is wrong than to provide positive proposals. In this context, reinforcement learning using Markov Decision Processes (MDPs) (Levin et al 1998, Singh et al 1999, Scheffler & Young 2001, Pietquin & Dutoit 2006a, Frampton & Lemon 2006) and Partially Observable MDP (POMDPs) (Poupart et al 2005, Young 2006) has become a particular focus.


2. Formalism

2.1. Definitions

In the rest of this chapter, a dialoguewill be referred to as an interaction between two agentsbased on sequential turn taking. In most of the cases, this interaction is goal-directedand both agents cooperate in order to achieve an aim (or accomplish a task). In the case of a man-machine dialog, one of the agents is a human userwhile the other is a computer (or system). In the particular case in which the interaction uses speech as the main communication mean, the computer implements a Spoken Dialogue System(SDS) while a system using several means of communication is referred to as a Multimodal Dialogue System(MMDS). When the man-machine dialog is dedicated to the realisation of a particular task (or set of tasks) it is called a task-orienteddialogue system. When one of the agents is an SDS, the dialogue consists of a sequence of utterancesexchanged at each turn. A spoken utteranceis the acoustic realisation of the intentionsor conceptsor dialog actsone of the agents wants to communicate to the other and is expressed as a wordsequence.

2.2. Formal description of man-machine spoken dialog

A man-machine spoken dialog can be considered as a sequential process in which a human user and a Dialog Manager (DM) are communicating using speech through speech and language processing modules (Fig. 2). The role of the DM is to define the sequencing of spoken interactions and therefore to take decisions about what to do at a given time (providing information, asking for information, closing the dialog, etc.). A Spoken Dialog System is often meant to provide information to a user; this is why it is generally connected to a Knowledge Base (KB) through its DM. The dialog is therefore regarded as a turn-taking process in which pieces of information are processed sequentially by a set of modules and perform a cycle going from the DM to the user and back. At each turn tthe DM generates a communicative act set ataccording to its internal state stand corresponding to its decision about what to do in that state. Dialogue communicative acts (shortly dialogue acts) can be of different kind such as greeting the user, ask a constraining question to the user, provide information to the user, ask for confirmation about some information to the user, query a database, close the dialogue. This act set is then transformed into a linguistic representation lt (generally a text) by a natural language processing module. The textual representation lt serves as an input to a text-to-speech synthesizer to produce a system spoken output syst. The TTS and the NLG modules are therefore spoken output generation modules. To this spoken solicitation, the user answers by a new spoken utterance ut according to what he could understand from syst, to his/her knowledge kt (about the task, the interaction history, the world in general) and to the goal gt s/he is trying to achieve by interacting with the system. Both spoken utterances syst and ut can be mixed with some background noise nt. The noisy user utterance is in turn processed by an automatic speech recognition system, which produces a written word sequence wt as a result and a confidence measure CLASR about this result. A natural language understanding module subsequently tries to extract communicative acts (or concepts) ct from wt (possibly helped by CLASR). The NLU module also provides some confidence measure CLNLU about this processing. The NLU and ASR sub-systems are speech input processing modules. The set {ct, CLASR, CLNLU} composes an observation ot which can be considered as the result of the processing of the DM communicative acts atby its environment. The dialogue manager then computes a new internal state st+1 according to this observation.

The following paragraphs will use this description of a man-machine dialog as a base to build a probabilistic model.

Figure 2.

2. Man-Machine Spoken Communication.

2.3. MDP and Reinforcement Learning

In our vision of the MDP formalism, a discrete-time system interacting with its stochastic environment through actions is described by a finite or infinite number of states {si} in which a given number of actions {aj} can be performed. To each state-action pair is associated a transition probability T giving the probability of stepping from state sat time tto state s’ at time t+1 after having performed action awhen in state s. To this transition is also associated a reinforcement signal (or reward) rt+1 describing how good was the result of action awhen performed in state s. Formally, an MDP is thus completely defined by a 4-tuple {S, A, T, R} where Sis the state space, Ais the action set, T is a transition probability distribution over the state space and R is the expected reward distribution. The couple {T, R} defines the dynamics of the system:


These last expressions assume that the Markov property is met, which means that the system’s functioning is fully defined by its one-step dynamics and that its behavior from state swill be identical whatever the path followed before reaching s. To control a system described as an MDP (choosing actions to perform in each state), one would need a strategyor policymapping states to actions: (s) = P(a|s) (or (s) = aif the strategy is deterministic).

In this framework, a RL agentis a system aiming at optimally mapping states to actions, that is finding the best strategy * so as to maximize an overall return Rwhich is a function (most often a discounted return is used i.e. a weighted sum of immediate rewards) of all the immediate rewards rt.


If the probabilities of equations (1) and (2) are known, an analytical solution can be computed by dynamic programming, otherwise the system has to learn the optimal strategy by a trial-and-error process. RL is therefore about how to optimally map situations to actions by trying and observing environment’s feedback. In the most challenging cases, actions may affect not only the immediate reward, but also the next situation and, through that, all subsequent rewards. Trial-and-error search and delayed rewards are the two main features of RL. Different techniques are described in the literature, in the following the Watkin’s Q() algorithm (Watkin 1989) will be used.

3. Human-Machine Dialogue and Markov Decision Process

From the point of view of the dialogue manager, the interaction can probabilistically be described by the joint probability of the signals at, otand st+1 given the history of the interaction (Pietquin 2005):

P(st+1,ot,at|st,nt,at1,st1,nt1,,a0,s0,n0)= P(st+1|ot,at,st,nt,at1,st1,nt1,,a0,s0,n0)Task Model       P(ot|at,st,nt,at1,st1,nt1,,a0,s0,n0)Environment P(at|st,nt,at1,st1,nt1,,a0,s0,n0)DME5

In (5), the first term stands for the task modelthat helps building a new dialogue manager internal state thanks to the perceived observation, the second term stands for the response of the environment to the dialogue manager stimulation, and the third stands for the dialogue manager decision process.

3.1. Markov Property and Random Noise

In the case of a SDS, the Markov Property is met if the dialogue manager choice about the action atto perform at time tand the according transition probability for stepping to state st+1 at time t+1 are only conditioned by the state stat time tand not of previous states and actions. From now on, the Markov Property will be assumed. It can anyway be met by a judicious choice of the DM state representation, which should embed the history of the interaction into the current state. Such a state representation is said informational.

Let’s illustrate this on a train ticket booking system. When accessing such a system a customer can book a ticket by providing information about the cities of departure and arrival and a desired time of departure. Like in a 3-slot-filling application, three bits of information (sometimes called attributes) should therefore be transferred from the user to the system. A very simple way to build the state space is to represent the dialogue state as a vector of three Boolean values (e.g. [dep arr time]) set to trueif the corresponding attribute is supposed to be known by the system and to falseotherwise. An ideal dialogue for such an application and the corresponding dialogue state evolution is shown in Table 1.

SpeakerSpoken UtteranceDialogue state
SystemHello, how may I help you?[false false false]
UserI’d like to go to Edinburgh.
SystemWhat’s your departure city?[false true false]
UserI want to leave from Glasgow.
SystemWhen do you want to go from Glasgow to Edinburgh ?[true true false]
UserOn Saturday morning.
SystemOk, seats are available in train n° xxx …[true true true]

Table 1.

Ideal dialogue in a train ticket booking application.

To meet the Markov property with such a state representation, we have to assume that the system behaves the same whatever the order in which the slots where filled (and by the way, whatever the values of the attributes). The Markov assumption is also made about the environment; that is the user behaves the same whatever the filling order as well. These are of course strong assumptions but we will see later that they lead to satisfactory results.

Finally, most often the noise is considered as being random so as to have independence between ntand nt-1. Eq. (5) then simplifies as follow:

P(st+1,ot,at|st,nt)=P(st+1|ot,at,st,nt)Task Model P(ot|at,st,nt)Environment.P(at|st,nt)DME6

3.2. Dialogue Management as an MDP

As claimed in paragraph 0 and depicted on Fig. 2, a task-oriented (or goal-directed) man-machine dialogue can be seen as a turn-taking process in which a human user and a dialogue manager exchange information through different channels processing speech inputs and outputs (ASR, TTS...). In our problem, the dialogue manager’s action (or dialogue act) selection strategyhas to be optimized, the dialogue manager should thus be our learning agent.

Figure 3.

Dialogue management as an MDP.

As shown on Fig. 3, the environmentmodeled by the MDP comprises everything but the dialogue manager, i.e. the human user, the communication channels (ASR, TTS …), and any external information source (database, sensors etc.). In this context, at each turn tthe dialogue manager has to choose an action ataccording to its interaction strategyso as to complete the task it has been designed for. The RL agent has therefore to choose an action among greetings, spoken utterances (constraining questions, confirmations, relaxation, data presentation etc.), database queries, dialogue closure etc. They result in a response from the DM environment (user speech input, database records etc.), considered as an observation ot, which usually leads to a DM internal stateupdate according to the task model (Eq. 6).

3.3. Reward Function

To fit totally to the MDP formalism, a reinforcement signal rtis required. In (Singh et al1999) it is proposed to use the contribution of an action to the user’s satisfaction. Although this seems very subjective, some studies have shown that such a reward could be approximated by a linear combination of the task completion (TC) and objective measures cirelated to the system performances (Walker et al 1997):


where Nis a Z-score normalization function that normalises the results to have mean 0 and standard deviation 1 and wiare non-zero weights. This way, each weight ( and wi) expresses the relative importance of each term of the sum in the performance of the system. The task completion can for example be measured by the kappa () coefficient (Carletta 1996):


where P(A) is the proportion of correct interpretations of user’s utterances by the system and P(E) is the proportion of correct interpretations occurring by chance. One can see that = 1 when the system performs perfect interpretation (P(A) = 1) and = 0 when the only correct interpretations were obtained by chance (P(A) = P(E)). In order to compute weights and wi, a large number of users are asked to answer a satisfaction survey after having used the system while costs ci are measured during the interaction. The questionnaire comprises around 9 statements on a five-point Likert scale and the overall satisfaction is computed as the mean value of collected ratings. A Multivariate Linear Regression is then applied using the results of the survey as the dependent variable and the weights as independent variables. In practice, the significant performance measures ci are mainly the duration of the dialogue and the ASR and NLU performances.

3.4. Partial Observability

If a direct mapping between observations and system (or dialogue) states exists, the process is completely observable and the task model (see Eq. 6) can easily be built. Yet, it is rarely the case that the observations are directly linked to the dialogue state. Indeed, the real dialogue state at time tis related to the information the user intended to transmit to the system until time tduring the interaction. This information being transmitted through error prone statistical speech recognition and understanding systems, it can occur that the observation doesn’t contain only the information meant by the user but a probability distribution over a set of possible bits of information. Indeed, the output of a speech recognition system is usually a list of N word sequences (named N-bests list), each of them being associated with a confidence level that can be considered as a probability of the word sequence being correct given the spoken utterance (and maybe the context). This N-bests list serves as an input to the natural language understanding module which in turn provides a list of concept sequences associated to confidence levels.

This is typically what happens in partially observable environments where a probability distribution is drawn over possible states given the observations. For this reason, emerging research is focused on the optimization of spoken dialogue systems in the framework of Partially Observable Markov Decision Processes (POMDPs) (Poupart et al 2005, Young 2006)

4. Learning Dialogue Policies using Simulation

Using the framework described previously, it is theoretically possible to automatically learn spoken dialogue policies allowing natural conversation between human users and computers. This learning process should be realised online, through real interactions with users. One could even imagine building the reinforcement signal from direct queries to the user about his/her satisfaction after each interaction (Fig. 4).

Figure 4.

Ideal learning process.

For several reasons, direct learning through interactions is made difficult. First, a human user would probably react badly to some of the exploratory actions the system would choose since they might be completely incoherent. Anyway a very large number of interactions are required (typically tens of thousands of dialogues for standard dialogue systems) to train such a system. This is why data driven learning as been proposed so as to take advantage of existing databases for bootstrapping the learning process. Two methods were initially investigated: learning the state transition probabilities and the reward distribution from data (Singh et al, 1999) or learning parameters of a simulation environment mainly reproducing the behaviour of the user (Levin et al 2000). The second method is today preferred (Fig. 5). Indeed, whatever the data set available, it is unlikely that it contains every possible state transitions and it allows exploring the entire state and policy space. Dialogue simulation is therefore necessary for expanding the existing data sets and learning optimal policies. Most often, the dialogue is simulated at the intention level rather than at the word sequence or speech signal level, as it would be in the real world. An exception can be found in (Lopez Cozar et al 2003). Here, we regard an intention as the minimal unit of information that a dialogue participant can express independently. Intentions are closely related to concepts, speech acts or dialogue acts. For example, the sentence "I'd like to buy a desktop computer" is based on the concept buy(desktop). It is considered as unnecessary to model environment behavior at a lower level, because strategy optimization is a high level concept. Additionally, concept-based communication allows error modeling of all the parts of the system, including natural language understanding (Pietquin & Renals 2002, Pietquin & Dutoit 2006b). More pragmatically, it is simpler to automatically generate concepts compared with word sequences (and certainly speech signals), as a large number of utterances can express the same intention while it should not influence the dialogue manager strategy. Table 2 describes such a simulation process. The intentions have been expanded in the last column for comprehensiveness purposes. The signals column refers to notations of section 0.

SignalsIntentionsExpanded Intentions
sys0greetingHello! How may I help you?
u0arr_city = ‘Paris’I’d like to go to Paris.
sys1const(arr_time)When do you prefer to arrive?
u1arr_time = ‘1.00 PM’I want to arrive around 1 PM.
sys2rel(arr_time)Don’t you prefer to arrive later?
u2rel = falseNo.
sys3conf(arr_city)Can you confirm you want to go to Paris?
u3conf = trueYes !

Table 2.

Simulated dialogue at the intention level (‘const’ stands for constraining question, ‘rel’ for relaxation and ‘conf’ for confirmation).

This approach requires modelling the environment of the dialogue manager as a stochastic system and to learn the parameters of this model from data. It has been a topic of research since the early 2000’s (Levin et al 2000, Scheffler & Young 2001, Pietquin 2004). Most of the research is now focused on simulating the user (Georgila et al 2005, Pietquin 2006a, Schatzmann et al 2007a) and assessing the quality of a user model for training a reinforcement learning agent is an important track (Schatzmann et al 2005, Rieser & Lemon 2006, Georgila et al 2006). Modelling the errors introduced by the ASR and NLU systems is also a major topic of research (Scheffler & Young 2001, Lopez Cozar et al 2003, Pietquin & Beaufort 2005, Pietquin & Dutoit 2006b).

Figure 5.

Learning via simulation.

5. Speech-Based Database Querying

We will illustrate reinforcement learning based dialogue optimization on the particular task of a speech-based database querying system. In such an application, the user wants to extract from a database one or a list of records selected according to specific features provided by a user through speech-based interactions.

In the following, several experiments made on a database containing 350 computer configurations are described. The database is split into 2 tables (for notebooks and desktops), each of them containing 6 fields: pc_mac (pc or mac), processor_type, processor_speed, ram_size, hdd_size and brand. The goal of the dialogue system is therefore to extract one or a short list of computer configurations from the database and to present it the user. To do so, the system will have to gather information about which computer features the user wants. In the following, the application is described in terms of actions, states and rewards so as to be mapped to the Markov decision processes paradigm.

5.1. Action Set

The task involves database querying. Therefore possible systems actions do not only imply interactions with the user (such as spoken questions, confirmation requests or assertions) but also with the database (such as database querying). The action set contains 8 generic actions:

greet: greeting (e.g. “How may I help you ?”).

constQ(arg): ask to constrain the value of arg.

openQ: ask an open ended question.

expC(arg): ask to confirm the value of arg.

allC: ask for a confirmation of all the arguments.

rel(arg): ask to relax the value of arg.

dbQ([args]): perform a database query thanks to retrieved information.

close: present data and close the dialogue session.

The value of argmay be the table’s type (notebook or desktop) or one of the 6 table fields. Notice that there is no data presentation action because it will be considered that the data presentation is included in the ‘close’ action. This means that, when the dialogue is closed by the dialogue manager, it systematically presents to the user the content of the last retrieved recordset.

5.2. State Space

The way the state space is built is very important and several state variables can be envisioned for describing the same task. Yet, some general considerations might be taken into account:

The state representation should contain enough information about the history of the dialogue so as to assume the Markov property to be met.

State spaces are often considered as informational in that sense that they are built thanks to the amount of information the DM could retrieve from the environment until it reached the current state.

The state representation must embed enough information so as to give an accurate representation of the situation to which an action has to be associated (it is not as obvious as it sounds).

The state space must be kept as small as possible since the reinforcement learning algorithms converge in linear time with the number of states of the underlying Markov decision process.

According to these considerations and the particular task of database querying, two slightly different state spaces where built to describe the task as an MDP so as to illustrate the sensitivity of the method to the state space representation. In the first representation, referred to as S1in the following, each state is represented by two features.

A vector of 7 boolean values (one for each value of arg). Like in the example of paragraph 0, each of these values is set to trueif the corresponding value of argis known (for example if the user specified to search in the notebooks table, the fist value is set to true). This is a way to meet the Markov property (informational state).

Information about the Confidence Level (CL) of each value set to true. The confidence level is usually a real number ranging between 0 and 1 computed by the speech and/or language analysis subsystems (ASR and NLU) and providing information about the confidence of the system in the result of its processing. To keep the size of the state space reasonable, we only considered 2 possible values for the confidence level: Highor Low(i.e. Highmeans CL ≥ 0.8 and Lowmeans CL < 0.8).

Notice that ‘dbQ’ actions will only include values with a Highconfidence level. For each value of arg, there are 3 different possibilities for the corresponding slot in the state representation: {value = false, CL= undef}, {value = true, CL= Low}, {value = true, CL= High}. This leads to 37 possible states.

The second way to represent the state space is built on the same basis but an additional state variable NDBis added to take the number of records returned by the last database query into account. This variable can also take only two values (Highor Low) and is set according to the comparison of the query result size and a predefined threshold. If no ‘dbQ’ action has been performed, the NDBvariable is initialized with the Highvalue (an empty query would provide the whole database as a result). This state space representation will be referred to as S2in the following.

5.3. Reward Function

Once again, there is not a unique way to build the reward function and slight differences in the choices can result in large variations in the learned strategy. To illustrate this, some simple functions will be described in the following. According to (Walker et al, 1997), the reward function (which is here a cost function that we will try to minimize) should rely on an estimate of the dialogue time duration (D), the ASR performances (ASR) and the task completion (TC) so as to approximate the user’s satisfaction using objective measures:


In this last expression, the wxfactors are positive weights. Considering the estimate of the time duration, two values are actually available: the number of user turns D= NU(the number of turns perceived by the user) and the number of system turns D= NS(including database queries as well).

On another hand, the task completion is not always easy to define. The kappacoefficient defined in (Carletta 1996) and section 0 is one possibility but didn’t always prove to correlate well with the perceived task completion. For the purpose of this experiment, two simple task completion measures will be defined:


In these last expressions, #(GUR) is the number of common values in the user’s goal GU(the user goal is supposed to have the same structure as an existing database record and is set before the dialogue begins) and one of the records Rpresented to the user at the end of a dialogue. When a value is not present in the user goal it is considered as common (if a field is not important to the user, it is supposed to match any value). The first task completion measure TCmaxindicates how close the closest record in the presented results is. The second TCavmeasures the mean number of common values between the user’s goal and each presented record.

Finally, the ASR performance measures will be provided by the confidence levels (CL) computed by the ASR system after each speech recognition task.


6. Experiments

The number of required interactions between a RL agent and its environment is quite large (104 dialogues at least in our case). So, it has been mandatory to simulate most of the dialogues for reasons explained in section 0. An intention-based simulation environment has therefore been built as described in (Pietquin & Dutoit 2006a). It simulates ASR errors using a constant Word Error Rate (WER). It generates confidence levels as real numbers ranging between 0 and 1 according to a distribution measured on a real system. If the system has to recognize more than one argument at a time, the CLis the product of individual CLs obtained for each recognition task (so it usually decreases). Other ASR simulation models can be considered (Pietquin & Beaufort 2005) but it is out of the scope of this introduction to the technique.

Several experimental results obtained with different settings of the state space and the reward function will be exposed in the following. These settings are obtained by combining in three different ways the parameters S1, S2, NU, NS, TCmax, TCavmentioned before.Results are described in terms of average number of turns (user and system turns), average task completion measures (TCmaxand TCav) for the performance and action occurrence frequency during a dialogue session to get a clue about the learned strategy. These results are obtained by simulating 10,000 dialogues with the learned strategy.

6.1. First Experiment: S1, NU, TCmax

The first experiment is based on the smaller state space S1(without any information about the number of retrieved records). The dialogue cost is computed thanks to the number of user turns NUas a measure of the time duration and the TCmaxvalue as the task completion measure. Results are shown in the following tables.


Table 3.

Performances of the learned strategy for the {S1, NU, TCmax} configuration.


Table 4.

Learned strategy for the {S1, NU, TCmax} configuration.

When looking at the three first columns of the performance table (Table 3), the learned strategy doesn’t look so bad. It actually has a short duration in terms of user turns as well as in system turns and has a very high task completion rate in terms of TCmaxmeasure. Yet the TCavshows a very low mean value.

When looking to the average frequency of actions in table, one can see that the only action addressed to the user that happens frequently during a dialogue is the greeting action. Others almost never occur. Actually, the learned strategy consists in uttering the greeting prompt to which the user should answer by providing some arguments. Then the system performs a database query with the retrieved attributes and presents the results to the user. Sometimes, the user doesn’t provide any attribute when answering to the greeting prompt or the value is not recognized at all by the ASR model (very low CLvalue), so the strategy is to perform a constraining question (and not an open-ended question) that will provide an argument with a better CL. Sometimes the provided arguments have still a poor CLand an explicit confirmation is requested. Sometimes the provided arguments don’t correspond to any valid record in the database so the strategy is to ask for relaxation of one argument (this also explains why the number of database queries is greater than 1). The value of TCmaxis not maximal because sometimes the dialogue fails.

This results in presenting almost the whole database when the user only provides one argument when prompted by the greeting. This is why there is a so big difference between TCmaxand TCav. The desired record is actually in the presented data (TCmaxis high) but is very difficult to find (TCavis low). The learned strategy is definitely not suitable for a real system, specially if the record set has to be presented vocally. An example of dialogue is shown in Table 5, where the signal column refers to signals used on Fig. 2 and in section 0.

SignalsIntentionsExpanded Intentions
a0 → sys0greetingHello! How may I help you?
u0Table= ‘Notebook’I’d like to buy a Notebook.
o0Table = ‘Notebook’
CL = high
o1DB = 97 (high)
a2 → sys2As = closeOk, here are the computers corresponding to your request: (proposes the 97 Notebooks in the DB) …

Table 5.

Dialogue sample for the {S1, NU, TCmax} configuration.

6.2. Second Experiment: S2, NU, TCav

Here, the same settings are used except that the NDBvariable is added to the state variables and the task completion is measured with TCav.


Table 6.

Performances of the learned strategy for the {S2, NU, TCav} configuration.


Table 7.

Learned strategy for the {S2, NU, TCav} configuration.

Results show that TCmaxand TCavare close to each other, so the presented results are more accurate but the number of turns has increased. The number of system turns particularly exhibits higher values. This is obviously explained by the increase of database queries. Looking at Table 7 one can see that the learned strategy tries to maximize the TCavvalue while minimizing the number of user turns and maximizing recognition performance. To do so, it systematically performs a database query after having retrieved information from the user. The number of results being among the state variables, the agent learned not to present the results when in a state with a high NDBvalue. If this value is too high after the greeting, the learner tries to reach a state where it is lower. Thus it almost systematically performs an ‘openQ’ action after the greeting so as to get as much information as possible in a minimum of turns (this explains the 1.24 value). Yet, this often results in lower CL values, thus it also performs a confirmation of all the fields before presenting any result. Sometimes, more information is provided after the greeting and only a constraining question is required to gather enough information to reach a state with less results. A constraining question is preferred in this case because it leads to better recognition results.

The mean number of user turns shows that only 5.75 turns are usually needed to reach an accurate result set because the computer configurations are sufficiently different so as not to need too many attributes in the database query to provided accurate results. Thus, the system doesn’t ask for all the attribute values to the user. Further investigations would show that the system takes advantage of the structure of the database and asks for attributes allowing extracting the desired records as fast as possible.

SignalsIntentionsExpanded Intentions
a0 → sys0greetingHello! How may I help you?
u0Table = ‘Notebook’I’d like to buy a Notebook.
o0Table = ‘Notebook’
CL = high
o1DB = high
a2 → sys2openQDo you have any other preference?
u2pc_mac = ‘PC’
proc_type = ‘Pentium III’
I’d rather like a PC with a Pentium III processor.
o2pc_mac = ‘PC’
proc_type = ‘Pentium III’
CL = high
o3DB = high
a4 → sys4constQ(ram)How much memory would you like?
u4ram = 128128 MB.
o5DB = low
a6 → sys6allCYou asked for a PC Notebook with a Pentium III processor and 128 MB memory.
u6conf_table = true
a7 → sys7closeOk, here are the computers corresponding to your request: (proposes the 3 results of the DB query) …

Table 8.

Dialogue sample for the {S2, NU, TCav} configuration.

6.3. Third Experiment: S2, NS, TCav

The same experiment as the previous one has been performed but replacing the NUmeasure of time duration by the NSmeasure. It actually makes sense since in a real application, the database could be much larger than the one used here. Thus, the database queries could be much more time consuming.


Table 9.

Performances of the learned strategy for the {S2, NS, TCav} configuration.


Table 10.

Learned strategy for the {S2, NS, TCav} configuration.

This obviously results in a decrease of the number of database queries involving a proportional decrease of the number of system turns NS. Yet, an increase of the number of user turns NUis also observed. By examining the action frequencies, one can notice that the number of constraining questions increased resulting in an increase of NU. Indeed, the learned strategy implies gathering enough information from the user before performing a database query. This explains why the systems ask more constraining questions.

This last strategy is actually optimal for the considered simulation environment (constant word error rate for all tasks) and is suitable for using with this simple application.

7. Conclusion

This chapter described a formal description of a man-machine spoken dialogue suitable to introduce a mapping between man-machine dialogues and (partially observable) Markov decision processes. This allows data-driven optimization of a dialogue manager’s interaction strategy using the reinforcement learning paradigm. Yet, such an optimization process often requires tenths of thousands of dialogues which are not accessible through real interactions with human users because of time and economical constraints. Expanding existing databases by means of dialogue simulation is a solution to this problem and several approaches can be envisioned as discussed in section 0. In this context, we described the particular task of speech-based database querying and its mapping into the MDP paradigm in terms of actions, states and rewards. Three experiments on a very simple task have shown the influence of parameterization of the MDP on the learned conversational strategy. From this, one can say first that the state space representation is a crucial point since it embeds the knowledge of the system about the interaction. Second, the reward function is also of major importance since it measures how well the system performs on the task by simulating the perception of the dialogue quality from the users’ point of view. Performance measure is a key of RL. The three experiments described in the last section showed the influence of these parameters on the learned strategy and concluded that a correctly parameterized RL algorithm could result in an acceptable dialogue strategy while little changes in the parameters could lead to silly strategies unsuitable for use in real conditions.

8. Future Works

Data-driven optimization of spoken dialogue strategies is an emerging area of research and lots of problems still remain. One of the first is to find tractable algorithms to train real size dialogue systems. Indeed, the standard RL algorithms are suitable for small tasks as described in section 5 but real applications can exhibit up to several million of states, possibly with continuous observations (Williams et al 2005). The curse of dimensionality is therefore of particular interest in the area of spoken dialogue systems. Several attempts to tackle this problem in the framework of spoken dialogue systems can be found in the literature by scaling up MDPs using supervised learning (Henderson et al 2005) and hierarchical learning (Cuayáhuitl et al 2007). Also algorithms for tractable solutions to the optimization of spoken dialogue systems via the POMDP paradigm can be found in (Poupart et al 2005, Young 2006). This preliminary work in the field of generalisation and hierarchical learning shows the interest of the community in these techniques. Another problem to tackle is the development of realistic user models, easily trainable from data and suitable for training RL-based dialogue managers. Different approaches are being studied such as the recently proposed agenda-based user model (Schatzmann et al 2007b) that can be trained by an Expectation-Maximisation algorithm from data, or user models based on dynamic Bayesian networks (Pietquin & Dutoit 2006a). Assessing such user models in terms of quality of the trained strategies and similarity with real user behavior is of course primordial (Schatzmann et al 2005, Georgila et al 2006, Rieser & Lemon 2006). On another hand, it might be interesting to see how to use learned strategies to help human developers to design optimal strategies. Indeed, the solution may be in computer-aided design more than fully automated design (Pietquin & Dutoit 2003). Finally, designing a complete dialogue system using an end-to-end probabilistic framework, from speech recognition to speech synthesis systems automatically trained on real data, is probably the next step (Lemon & Pietquin 2007).


The research presented in this chapter has been funded by the ‘First Europe’ program of the Belgian Walloon Region, the SIMILAR European Network of Excellence and the French Lorraine Region.

© 2008 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike-3.0 License, which permits use, distribution and reproduction for non-commercial purposes, provided the original is properly cited and derivative works building on this content are distributed under the same license.

How to cite and reference

Link to this chapter Copy to clipboard

Cite this chapter Copy to clipboard

Olivier Pietquin (January 1st 2008). Optimising Spoken Dialogue Strategies within the Reinforcement Learning Paradigm, Reinforcement Learning, Cornelius Weber, Mark Elshaw and Norbert Michael Mayer, IntechOpen, DOI: 10.5772/5284. Available from:

chapter statistics

3044total chapter downloads

More statistics for editors and authors

Login to your personal dashboard for more detailed statistics on your publications.

Access personal reporting

Related Content

This Book

Next chapter

Water Allocation Improvement in River Basin Using Adaptive Neural Fuzzy Reinforcement Learning Approach

By Abolpour B., Javan M. and Karamouz M.

Related Book

First chapter

Different Tools on Multi-Objective Optimization of a Hybrid Artificial Neural Network – Genetic Algorithm for Plasma Chemical Reactor Modelling

By Nor Aishah Saidina Amin and I. Istadi

We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists. Our readership spans scientists, professors, researchers, librarians, and students, as well as business professionals. We share our knowledge and peer-reveiwed research papers with libraries, scientific and engineering societies, and also work with corporate R&D departments and government entities.

More About Us