Open access peer-reviewed chapter

# Iteration Algorithms in Markov Decision Processes with State-Action-Dependent Discount Factors and Unbounded Costs

By Fernando Luque-Vásquez and J. Adolfo Minjárez-Sosa

Submitted: April 18th 2016Reviewed: July 28th 2016Published: December 28th 2016

DOI: 10.5772/65044

## Abstract

This chapter concerns discrete time Markov decision processes under a discounted optimality criterion with state-action-dependent discount factors, possibly unbounded costs, and noncompact admissible action sets. Under mild conditions, we show the existence of stationary optimal policies and we introduce the value iteration and the policy iteration algorithms to approximate the value function.

### Keywords

• discounted optimality
• non-constant discount factor
• value iteration
• policy iteration
• Markov decision processes

AMS 2010 subject classifications: 93E10, 90C40

## 1. Introduction

In this chapter we study Markov decision processes (MDPs) with Borel state and action spaces under a discounted criterion with state-action–dependent discount factors, possibly unbounded costs and noncompact admissible action sets. That is, we consider discount factors of the form

α(xn,an),E1

where xnand anare the state and the action at time n,respectively, playing the following role during the evolution of the system. At the initial state x0,the controller chooses an action a0and a cost c(x0,a0)is incurred. Then the system moves to a new state x1according to a transition law. Once the system is in state x1the controller selects an action a1and incurs a discounted cost α(x0,a0)c(x1,a1).Next the system moves to a state x2and the process is repeated. In general, for the stage n1,the controller incurs the discounted cost

k=0n1α(xk,ak)c(xn,an),E2

and our objective is to show the existence of stationary optimal control policies under the corresponding performance index, as well as to introduce approximation algorithms, namely, value iteration and policy iteration.

In the scenario of assuming a constant discount factor, the discounted optimality criterion in stochastic decision problems is the best understood of all performance indices, and it is widely accepted in several application problems (see, e.g., [13] and reference there in). However, such assumption might be strong or unrealistic in some economic and financial models. Indeed, in these problems the discount factors are typically functions of the interest rates, which in turn depend on the amount of currency and the decision-makers actions. Hence, we have state-action–dependent discount factors, and it is indeed these kinds of situations we are dealing with.

MDPs with non constant discount factors have been studied under different approaches (see, e.g., [48]). In particular, our work is a sequel to [8] where is studied the control problem with state-dependent discount factor. In addition, randomized discounted criteria have been analyzed in [912] where the discount factor is modeled as a stochastic process independent of the state-action pairs.

Specifically, in this chapter we study control models with state-action-dependent discount factors, focusing mainly on introducing approximation algorithms for the optimal value function (value iteration and policy iteration). Furthermore, an important feature in this work is that there is no compactness assumption on the sets of admissible actions neither continuity conditions on the cost, which, in most of the papers on MDPs, are needed to show the existence of measurable selectors and continuity or semi-continuity of the minima function. Indeed, in contrast to the previously cited references, in this work, we assume that the cost and discount factor functions satisfy the K-inf-compactness condition introduced in [13]. Then, we use a generalization of Berge’s Theorem, given in [13], to prove the existence of measurable selectors. To the best of our knowledge there are no works dealing with MDPs in the context presented in this chapter.

The remainder of the chapter is organized as follows. Section 2 contains the description of the Markov decision model and the optimality criterion. In Section 3 we introduce the assumptions on the model and we prove the convergence of the value iteration algorithm (Theorem 3.5). In Section 4 we define the policy iteration algorithm and the convergence is stated in Theorem 4.1.

Notation. Throughout the paper we shall use the following notation. Given a Borel space S—that is, a Borel subset of a complete separable metric space — B(S)denotes the Borel σ-algebra and “measurability” always means measurability with respect to B(S). Given two Borel spaces Sand S, a stochastic kernel φ(|)on Sgiven Sis a function such that φ(|s)is a probability measure on Sfor each sS, and φ(B|)is a measurable function on Sfor each BB(S).Moreover, N(N0) denotes the positive (nonnegative) integers numbers. Finally, L(S)stands for the class of lower semicontinuous functions on Sbounded below and L+(S)denotes the subclass of nonnegative functions in L(S).

## 2. Markov decision processes

Markov control model. Let

M:=(X,A,{A(x)A|xX},Q,α,c)E3

be a discrete-time Markov control model with state-action-dependent discount factors satisfying the following conditions. The state space Xand the action or control space Aare Borel spaces. For each state xX, A(x)is a nonempty Borel subset of Adenoting the set of admissible controls when the system is in state x. We denote by Kthe graph of the multifunction xA(x),that is,

K={(x,a):xX,aA(x)},E4

which is assumed to be a Borel subset of the Cartesian product of Xand A.The transition law Q(|)is a stochastic kernel on Xgiven K.Finally, α:K(0,1)and c:K(0,)are measurable functions representing the discount factor and the cost-per-stage, respectively, when the system is in state xXand the action aA(x)is selected.

The model Mrepresents a controlled stochastic system and has the following interpretation. Suppose that at time nN0the system is in the state xn=xX.Then, possibly taking into account the history of the system, the controller selects an action an=aA(x),and a discount factor α(x, a)is imposed. As a consequence of this the following happens:

1. A cost c(x,a)is incurred;

2. The system visits a new state xn+1=xXaccording to the transition law

Q(B|x,a):=Pr[xn+1B|xn=x,an=a],BB(X).E5

Once the transition to state xoccurs, the process is repeated.

Typically, in many applications, the evolution of the system is determined by stochastic difference equations of the form

xn+1=F(xn,an,ξn),nN0,E6

where {ξn}is a sequence of independent and identically distributed random variables with values in some Borel space S,independent of the initial state x0,and F:X×A×SXis a given measurable function. In this case, if θdenotes the common distribution of ξn,that is

θ(D):=P[ξnD],DB(S),nN0,E7

then the transition kernel can be written as

Q(B|x,a)=Pr[F(xn,an,ξn)B|xn=x,an=a]=θ{sS:F(x,a,s)B}=S1B[F(x,a,s)]θ(ds),BB(X),(x,a)K,E8

where 1B()represents the indicator function of the set B.

Control policies. The actions applied by the controller are chosen by mean of rules known as control policies defined as follows. Let H0:=Xand Hn:=Kn×X,n1be the spaces of admissible histories up to time n. A generic element of Hnis written as hn=(x0,a0,...,xn1,an1,xn).

Definition 2.1 A control policy (randomized, history-dependent) is a sequence π={πn} of stochastic kernels πn on A given Hn such that πn(A(xn)|hn)=1, for all hn∈ Hn,  n∈N0.

We denote by Π the set of all control policies.

Let Fbe the set of measurable selectors, that is, Fis the set of measurable function f:XAsuch that f(x)A(x)for all xX.

Definition 2.2 A control policy π={πn} is said to be:

a. deterministic if there exists a sequence {gn} of measurable functions gn:Hn→A such that

πn(C|hn)=1C[gn(hn)],hnHn,nN0,CB(A);E9

b. a Markov control policy if there exists a sequence {fn} of functions fn∈F such that

πn(C|hn)=1C[fn(xn)],hnHn,nN0,CB(A).E10

c. A Markov control policy is stationary if there exists f∈F such that fn=f for all n∈N0.

If necessary, see for example [13, 1416] for further information on those policies.

Observe that a Markov policy πis identified with the sequence {fn},and we denote π={fn}.In this case, the control applied at time nis an=fn(xn)A(xn).In particular, a stationary policy is identified with the function fF, and following a standard convention we denote by Fthe set of all stationary control policies.

To ease the notation, for each fFand xX, we write

c(x,f):=c(x,f(x)),Q(|x,f):=Q(|x,f(x)),E11

and

α(x,f):=α(x,f(x)).E12

The underlying probability space. Let (Ω, F)be the canonical measurable space consisting of the sample space Ω=K:=K×K×and its product σalgebra F.Then, under standard arguments (see, e.g., [1, 14]) for each πΠand initial state xX,there exists a probability measure Pxπon (Ω,F)such that, for all hnHn,anA(xn),nN0,CB(A),and BB(X),

Pxπ[x0=x]=1;Pxπ[anC|hn]=πn(C|hn);E13

and the Markov-like property is satisfied

Pxπ[xn+1B|hn,an]=Q(B|xn,an).E14

The stochastic process (Ω, F, Pxπ, {xn})is called Markov decision process.

Optimality criterion. We assume that the costs are discounted in a multiplicative discounted rate. That is, a cost Cincurred at stage nis equivalent to a cost CΓnat time 0,where

Γn:=k=0n1α(xk,ak)ifn1,andΓ0=1.E15

In this sense, when using a policy πΠ,given the initial state x0=x,we define the total expected discounted cost (with state-action–dependent discount factors) as

V(π,x):=Exπ[n=0Γnc(xn,an)],E16

where Exπdenotes the expectation operator with respect to the probability measure Pxπinduced by the policy π,given x0=x.

The optimal control problem associated to the control model M, is then to find an optimal policy πΠsuch that V(π,x)=V(x)for all xX,where

V(x):=infπΠV(π,x)E17

is the optimal value function (see [10]).

## 3. The value iteration algorithm

In this section we give conditions on the model that imply: (i) the convergence of the value iteration algorithm; (ii) the value function Vis a solution of the corresponding optimality equation; and (iii) the existence of stationary optimal policies. In order to guarantee that V(x)is finite for each initial state xwe suppose the following.

Assumption 3.1. There exists π0∈Π such that for all x∈X, V(π0,x)<∞.

At the end of Section 4 we give sufficient conditions for Assumption 3.1. We also require continuity and (inf-) compactness conditions to ensure the existence of "measurable minimizers." The following definition was introduced in [13].

Definition 3.2. A function u:K→R is said to be K-inf-compact on K if for each compact subset K of X and r∈R, the set

{(x,a)GrK(A):u(x,a)r}E18

is a compact subset of X×A, where GrK(A):={(x,a):x∈K,a∈A(x)}.

Assumption 3.3. (a) The one-stage cost c and the discount factor α are K-inf-compact functions on K. In addition, c is nonnegative.

(b) The transition law Qis weakly continuous, that is, the mapping

(x,a)Xu(y)Q(dy|x,a)E19

is continuous for each bounded and continuous function on X.

For each measurable function uon X,xX,and fF,we define the operators

Tu(x):=infaA(x){c(x,a)+α(x,a)Xu(y)Q(dy|x,a)}E20

and

Tfu(x):=c(x,f)+α(x,f)Xu(y)Q(dy|x,f).E21

A consequence of Assumption 3.3 is the following.

Lemma 3.4. Let u be a function in L+(X). If Assumption 3.3 holds then the function v:K→R defined by

v(x,a):=c(x,a)+α(x,a)Xu(y)Q(dy|x,a)E22

is K-inf-compact on K

Proof. First note that by the K-inf-compactness hypothesis c(,)and α(,)are l.s.c on GrK(A)for each compact subset Kof X. Then, since αand uare nonnegative functions, from Assumption 3.3 we have that v(,)is l.s.c on GrK(A). Thus, for each rR, the set

{(x,a)GrK(A):v(x,a)r}E23

is a closed subset of the compact set {(x,a)GrK(A):c(x,a)r}.Then, vis K-inf-compact on K.

Observe that the operator Tis monotone in the sense that if vuthen TvTu.In addition, from Assumption 3.3 and ([13], Theorem 3.3), we have that Tmaps L+(X)into itself. Furthermore, there exists f˜Fsuch that

Tu(x)=Tf˜u(x),xX.E24

To state our first result we define the sequence {vn}L+(X)of value iteration functions as:

v00;vn(x)=Tvn1(x),xX.E25

Since Tis monotone, note that {vn}is a nondecreasing sequence.

Theorem 3.5. Suppose that Assumptions 3.1 and 3.3 hold. Then

1. vnV.

2. Vis the minimal solution in L+(X) of the Optimality Equation, i.e.,

V(x)=TV(x)=infaA(x){c(x,a)+α(x,a)XV(y)Q(dy|x,a)}.E26
3. There exists a stationary policy f∗∈F such that, for all x∈X,V(x)=Tf∗V(x), that is

V(x)=c(x,f)+α(x,f)Xu(y)Q(dy|x,f),E27

and f∗ is an optimal policy.

Proof. Since {vn}is nondecreasing, there exists vL+(X)such that vnv.Hence, from Monotone Convergence Theorem, ([13], Lemmas 2.2, 2.3), and ([1], Lemma 4.2.4), we obtain, for each xX,vn(x)=Tvn1(x)Tv(x),as n,which, in turn implies

Tv=v.E28

Therefore, to get (a)-(b) we need to prove that v=V.To this end, observe that for all xXand π∈Π

vn(x)Ac(x,a)π(da|x)+Aα(x,a)Xvn1(x1)Q(dx1|x,a)π(da|x).E29

Then, iterating (29) we obtain

vn(x)Vn(π,x),nN,E30

where

Vn(π,x)=Exπ[t=0n1Γtc(xt,at)],E31

is the nstage discounted cost Vn.Then, letting nwe get v(x)V(π,x),for all πΠand xX.Thus,

v(x)V(x),xX.E32

On the other hand, from (28) and (24), let fFsuch that v(x)=Tfv(x),xX.Iterating this equation, we have (see (31))

v(x)=Exf[c(x,f)+t=1n1k=0t1α(xk,f)c(xt,f)]+Exf[k=0n1α(xk,f)v(xn)]Vn(f,x).E33

Hence, letting n,

v(x)V(f,x)V(x),xX.E34

Combining (32) and (34) we get v=V.

Now, let uL+(X)be an arbitrary solution of the optimality equation, that is, Tu=u.Then, applying the arguments in the proof of (34) with uinstead of vwe conclude that uV.That is, Vis minimal in L+(X).

Part (c) follows from (b) and ([13], Theorem 3.3). Indeed, there exists a stationary policy fFsuch that V(x)=TfV(x),xX.Then, iteration of this equation yields V(x)=V(f,x),which implies that fis optimal.

## 4. Policy iteration algorithm

In Theorem 3.5 is established an approximation algorithm for the value function Vby means of the sequence of the value iteration functions {vn}.In this case the sequence {vn}increase to Vand it is defined recursively. Now we present the well-known policy iteration algorithm which provides a decreasing approximation to Vin the set of the control policies.

To define the algorithm, first observe that from the Markov property (14) and applying properties of conditional expectation, for any stationary policy fFand xX,the corresponding cost V(f,x)satisfies

V(f,x)=c(x,f)+α(x,f)Exf[t=1k=0t1α(xk,f)c(xt,f)]=c(x,f)+α(x,f)XEf[c(x1,f)+t=2k=0t1α(xk,f)c(xt,f)|x1=y]Q(dy|x,f)=c(x,f)+α(x,f)XV(f,y)Q(dy|x,f)=TfV(f,x),xX.E35

Let f0Fbe a stationary policy with a finite valued cost Then, from (35),

w0(x)=c(x,f0)+α(x,f0)Xw0(y)Q(dy|x,f0)=Tf0w0(x),xX.E36

Now, let f1Fbe such that

Tw0(x)=Tf1w0(x),E37

and define

In general, we define a sequence {wn}in L+(X)as follows. Given fnF,compute Next, let fn+1Fbe such that

Tfn+1wn(x)=Twn(x),xX,E38

that is,

Tfn+1wn(x)=c(x,fn+1)+α(x,fn+1)Xwn(y)Q(dy|x,fn+1)=minaA(x){c(x,a)+α(x,a)Xwn(y)Q(dy|x,a)}=Twn(x),xX.E39

Then we define

Theorem 4.1. Under Assumptions 3.1 and 3.3, there exists a measurable nonnegative function such that and Moreover, if w satisfies

limnExπ[Γnw(xn)]=0πΠ,xX,E40

then

To prove the Theorem 4.1 we need the following result.

Lemma 4.2. Under Assumption 3.3, if u:X→R is a measurable function such that Tu is well defined, u≤Tu, and

limnExπ[Γnu(xn)]=0πΠ,xX,E41

then u≤V.

Proof. From the Markov property (14), for each π∈Π and xX,

Exπ[Γn+1u(xn+1)|hn,an]=Γn+1Xu(y)Q(dy|xn,an)E42
=Γn[c(xn,an)+α(xn,an)Xu(y)Q(dy|xn,an)c(xn,an)]E43
ΓninfaA(xn)[c(xn,a)+α(xn,a)Xu(y)Q(dy|xn,a)]Γnc(xn,an)E44
=ΓnTu(xn)Γnc(xn,an)Γnu(xn)Γnc(xn,an),E45

which, in turn implies

Γnc(xn,an)Exπ[Γnu(xn)Γn+1u(xn+1)|hn,an].E46

Therefore, for all kN(see (31)),

Vk(π,x)=Exπn=0k1Γnc(xn,an)u(x)Exπ[Γku(xk)].E47

Finally, letting k,(41) yields V(π,x)u(x),and since πis arbitrary we obtain V(x)u(x).

Proof of Theorem 4.1. According to Lemma 4.2, it is sufficient to show the existence of a function such that and To this end, from (36)–(38),

w0(x)minaA(x){c(x,a)+α(x,a)Xw0(y)Q(dy|x,a)}=Tf1w0(x)=c(x,f1)+α(x,f1)Xw0(y)Q(dy|x,f1).E48

Iterating this inequality, a straightforward calculation as in (34) shows that

w0(x)V(f1,x)=w1(x),xX.E49

In general, similar arguments yield

wnTwnwn+1,nN.E50

Therefore, there exists a nonnegative measurable function wsuch that In addition, since nN0,we have Next, letting nin (47) and applying ([17], Lemma 3.3), we obtain which implies

### 4.1. Sufficient conditions for Assumption 3.1 and (40)

An obvious sufficient condition for Assumption 3.1 and (40) is the following:

C1 (a) There exists α¯∈(0,1) such that for all (x,a)K, α(x,a)<α¯.

(b) For some constant m,0c(x,a)mfor all (x,a)K.

Indeed, under condition C1, V(π,x)m/(1α¯)for all xXand πΠ, and {wn}is a bounded sequence which in turn implies (since the boundedness of the function This fact clearly yields (40).

Other less obvious sufficient conditions are the following (see, e.g., [15, 16, 2]).

C2 (a) Condition C1 (a).

(b) There exist a measurable function W:X(1,)and constants M>0,β(1,1/α¯),such that for all (x,a)K,

supA(x)c(x,a)MW(x)E51

and

XW(y)Q(dy|x,a)βW(x).E52

First note that by condition C2 and the Markov property (14), for any policy πΠand initial state x0=xX,

Exπ[W(xn+1)|hn,an]=XW(y)Q(dy|xn,an)βW(xn),nN0.E53

Then, using properties of conditional expectation,

Exπ[W(xn+1)]βExπ[W(xn)],nN0.E54

Iterating inequality (51) we get

Exπ[W(xn)]βnW(x),nN0.E55

Therefore, by condition C2, for any policy πΠand xX,

V(π,x)Exπn=0α¯nc(xn,an)n=0Mα¯nExπW(xn)M1α¯βW(x).E56

Thus, Assumption 3.1 holds.

On the other hand, if L+W(X)denotes the subclass of all functions uin L+(X)such that

uW:=supxXh(x)W(x)<,E57

then, because wk()=V(fk+1,),from (53) and condition C2, we have that wkL+W(X)for all k=1,2,...and

limnExπ[Γnwk(xn)]=0πΠ,xX.E58

Since wwk,(40) follows from (55).

## Acknowledgments

Work supported partially by Consejo Nacional de Ciencia y Tecnologa (CONACYT) under grant CB2015/254306.

## More

© 2016 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## How to cite and reference

### Cite this chapter Copy to clipboard

Fernando Luque-Vásquez and J. Adolfo Minjárez-Sosa (December 28th 2016). Iteration Algorithms in Markov Decision Processes with State-Action-Dependent Discount Factors and Unbounded Costs, Operations Research - the Art of Making Good Decisions, Kuodi Jian, IntechOpen, DOI: 10.5772/65044. Available from:

### Related Content

Next chapter

#### Mathematical Modeling of Isothermal Drying and its Potential Application in the Design of the Industrial Drying Regimes of Clay Products

By Miloš Vasić, Zagorka Radojević and Robert Rekecki

#### Real-time Systems

Edited by Kuodi Jian

First chapter

#### Introductory Chapter: Real-Time Systems

By Kuodi Jian

We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists. Our readership spans scientists, professors, researchers, librarians, and students, as well as business professionals. We share our knowledge and peer-reveiwed research papers with libraries, scientific and engineering societies, and also work with corporate R&D departments and government entities.

View all Books