
# Iteration Algorithms in Markov Decision Processes with State-Action-Dependent Discount Factors and Unbounded Costs

Written By

Fernando Luque-Vásquez and J. Adolfo Minjárez-Sosa

Submitted: April 18th, 2016 Reviewed: July 28th, 2016 Published: December 28th, 2016

DOI: 10.5772/65044

From the Edited Volume

## Operations Research

Edited by Kuodi Jian


## Abstract

This chapter concerns discrete time Markov decision processes under a discounted optimality criterion with state-action-dependent discount factors, possibly unbounded costs, and noncompact admissible action sets. Under mild conditions, we show the existence of stationary optimal policies and we introduce the value iteration and the policy iteration algorithms to approximate the value function.

### Keywords

• discounted optimality
• non-constant discount factor
• value iteration
• policy iteration
• Markov decision processes

AMS 2010 subject classifications: 93E10, 90C40

## 1. Introduction

In this chapter we study Markov decision processes (MDPs) with Borel state and action spaces under a discounted criterion with state-action–dependent discount factors, possibly unbounded costs and noncompact admissible action sets. That is, we consider discount factors of the form

$$\alpha(x_n,a_n),\tag{1}$$

where $x_n$ and $a_n$ are the state and the action at time $n$, respectively, playing the following role during the evolution of the system. At the initial state $x_0$, the controller chooses an action $a_0$ and a cost $c(x_0,a_0)$ is incurred. Then the system moves to a new state $x_1$ according to a transition law. Once the system is in state $x_1$, the controller selects an action $a_1$ and incurs a discounted cost $\alpha(x_0,a_0)c(x_1,a_1)$. Next the system moves to a state $x_2$ and the process is repeated. In general, at stage $n\ge 1$, the controller incurs the discounted cost

$$\prod_{k=0}^{n-1}\alpha(x_k,a_k)\,c(x_n,a_n),\tag{2}$$

and our objective is to show the existence of stationary optimal control policies under the corresponding performance index, as well as to introduce approximation algorithms, namely, value iteration and policy iteration.

In the scenario of a constant discount factor, the discounted optimality criterion in stochastic decision problems is the best understood of all performance indices, and it is widely accepted in several application problems (see, e.g., [1–3] and references therein). However, such an assumption might be strong or unrealistic in some economic and financial models. Indeed, in these problems the discount factors are typically functions of the interest rates, which in turn depend on the amount of currency and the decision-makers' actions. Hence, we have state-action-dependent discount factors, and it is precisely these kinds of situations we are dealing with.

MDPs with nonconstant discount factors have been studied under different approaches (see, e.g., [4–8]). In particular, our work is a sequel to [8], where the control problem with a state-dependent discount factor is studied. In addition, randomized discounted criteria have been analyzed in [9–12], where the discount factor is modeled as a stochastic process independent of the state-action pairs.

Specifically, in this chapter we study control models with state-action-dependent discount factors, focusing mainly on introducing approximation algorithms for the optimal value function (value iteration and policy iteration). Furthermore, an important feature of this work is that we impose neither compactness assumptions on the sets of admissible actions nor continuity conditions on the cost, which, in most papers on MDPs, are needed to show the existence of measurable selectors and the continuity or semicontinuity of the minimum function. Indeed, in contrast to the previously cited references, in this work we assume that the cost and discount factor functions satisfy the K-inf-compactness condition introduced in [13]. Then, we use a generalization of Berge's theorem, given in [13], to prove the existence of measurable selectors. To the best of our knowledge there are no works dealing with MDPs in the context presented in this chapter.

The remainder of the chapter is organized as follows. Section 2 contains the description of the Markov decision model and the optimality criterion. In Section 3 we introduce the assumptions on the model and we prove the convergence of the value iteration algorithm (Theorem 3.5). In Section 4 we define the policy iteration algorithm and the convergence is stated in Theorem 4.1.

Notation. Throughout the paper we shall use the following notation. Given a Borel space $S$ (that is, a Borel subset of a complete separable metric space), $\mathcal{B}(S)$ denotes its Borel $\sigma$-algebra, and "measurability" always means measurability with respect to $\mathcal{B}(S)$. Given two Borel spaces $S$ and $S'$, a stochastic kernel $\varphi(\cdot\mid\cdot)$ on $S$ given $S'$ is a function such that $\varphi(\cdot\mid s')$ is a probability measure on $S$ for each $s'\in S'$, and $\varphi(B\mid\cdot)$ is a measurable function on $S'$ for each $B\in\mathcal{B}(S)$. Moreover, $\mathbb{N}$ ($\mathbb{N}_0$) denotes the set of positive (nonnegative) integers. Finally, $L(S)$ stands for the class of lower semicontinuous functions on $S$ bounded below, and $L^+(S)$ denotes the subclass of nonnegative functions in $L(S)$.

## 2. Markov decision processes

Markov control model. Let

$$M:=(X,A,\{A(x)\subset A\mid x\in X\},Q,\alpha,c)\tag{3}$$

be a discrete-time Markov control model with state-action-dependent discount factors satisfying the following conditions. The state space $X$ and the action or control space $A$ are Borel spaces. For each state $x\in X$, $A(x)$ is a nonempty Borel subset of $A$ denoting the set of admissible controls when the system is in state $x$. We denote by $K$ the graph of the multifunction $x\mapsto A(x)$, that is,

$$K=\{(x,a):x\in X,\ a\in A(x)\},\tag{4}$$

which is assumed to be a Borel subset of the Cartesian product of $X$ and $A$. The transition law $Q(\cdot\mid\cdot)$ is a stochastic kernel on $X$ given $K$. Finally, $\alpha:K\to(0,1)$ and $c:K\to[0,\infty)$ are measurable functions representing the discount factor and the cost per stage, respectively, when the system is in state $x\in X$ and the action $a\in A(x)$ is selected.

The model $M$ represents a controlled stochastic system and has the following interpretation. Suppose that at time $n\in\mathbb{N}_0$ the system is in the state $x_n=x\in X$. Then, possibly taking into account the history of the system, the controller selects an action $a_n=a\in A(x)$, and a discount factor $\alpha(x,a)$ is imposed. As a consequence of this the following happens:

1. A cost c(x,a) is incurred;

2. The system visits a new state $x_{n+1}=x'\in X$ according to the transition law

$$Q(B\mid x,a):=\Pr[x_{n+1}\in B\mid x_n=x,a_n=a],\quad B\in\mathcal{B}(X).\tag{5}$$

Once the transition to state $x'$ occurs, the process is repeated.

Typically, in many applications, the evolution of the system is determined by stochastic difference equations of the form

$$x_{n+1}=F(x_n,a_n,\xi_n),\quad n\in\mathbb{N}_0,\tag{6}$$

where $\{\xi_n\}$ is a sequence of independent and identically distributed random variables with values in some Borel space $S$, independent of the initial state $x_0$, and $F:X\times A\times S\to X$ is a given measurable function. In this case, if $\theta$ denotes the common distribution of $\xi_n$, that is,

$$\theta(D):=P[\xi_n\in D],\quad D\in\mathcal{B}(S),\ n\in\mathbb{N}_0,\tag{7}$$

then the transition kernel can be written as

$$Q(B\mid x,a)=\Pr[F(x_n,a_n,\xi_n)\in B\mid x_n=x,a_n=a]=\theta\{s\in S:F(x,a,s)\in B\}=\int_S\mathbf{1}_B[F(x,a,s)]\,\theta(ds),\quad B\in\mathcal{B}(X),\ (x,a)\in K,\tag{8}$$

where $\mathbf{1}_B(\cdot)$ represents the indicator function of the set $B$.
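To make (8) concrete, the kernel induced by a difference equation can be estimated by sampling the noise distribution $\theta$. The inventory-style dynamics $F$, the exponential demand, and all numbers below are our own illustrative choices, not from the chapter:

```python
import numpy as np

def F(x, a, s):
    # Hypothetical dynamics: next stock = current stock + order - demand, clipped at 0.
    return max(x + a - s, 0.0)

def empirical_Q(x, a, B, theta_sampler, n=100_000, seed=0):
    """Monte Carlo estimate of Q(B | x, a) = theta{s : F(x, a, s) in B}."""
    rng = np.random.default_rng(seed)
    ys = np.array([F(x, a, s) for s in theta_sampler(rng, n)])
    return np.mean([B(y) for y in ys])

# Demand xi_n ~ Exp(1); estimate Pr[x_{n+1} <= 1 | x = 2, a = 1].
sampler = lambda rng, n: rng.exponential(1.0, size=n)
p_hat = empirical_Q(2.0, 1.0, lambda y: y <= 1.0, sampler)
# Exact value: max(3 - s, 0) <= 1 iff s >= 2, so Q = exp(-2).
print(round(p_hat, 3))
```

The estimate converges to $\theta\{s : F(x,a,s)\in B\}$ by the law of large numbers, which is exactly the integral representation in (8).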

Control policies. The actions applied by the controller are chosen by means of rules known as control policies, defined as follows. Let $H_0:=X$ and $H_n:=K^n\times X$, $n\ge 1$, be the spaces of admissible histories up to time $n$. A generic element of $H_n$ is written as $h_n=(x_0,a_0,\dots,x_{n-1},a_{n-1},x_n)$.

Definition 2.1 A control policy (randomized, history-dependent) is a sequence $\pi=\{\pi_n\}$ of stochastic kernels $\pi_n$ on $A$ given $H_n$ such that $\pi_n(A(x_n)\mid h_n)=1$ for all $h_n\in H_n$, $n\in\mathbb{N}_0$.

We denote by Π the set of all control policies.

Let $F$ be the set of measurable selectors, that is, the set of measurable functions $f:X\to A$ such that $f(x)\in A(x)$ for all $x\in X$.

Definition 2.2 A control policy π={πn} is said to be:

a. deterministic if there exists a sequence $\{g_n\}$ of measurable functions $g_n:H_n\to A$ such that

$$\pi_n(C\mid h_n)=\mathbf{1}_C[g_n(h_n)],\quad h_n\in H_n,\ n\in\mathbb{N}_0,\ C\in\mathcal{B}(A);\tag{9}$$

b. a Markov control policy if there exists a sequence $\{f_n\}$ of functions $f_n\in F$ such that

$$\pi_n(C\mid h_n)=\mathbf{1}_C[f_n(x_n)],\quad h_n\in H_n,\ n\in\mathbb{N}_0,\ C\in\mathcal{B}(A).\tag{10}$$

c. A Markov control policy is stationary if there exists $f\in F$ such that $f_n=f$ for all $n\in\mathbb{N}_0$.

If necessary, see for example [1–3, 14–16] for further information on these policies.

Observe that a Markov policy $\pi$ is identified with the sequence $\{f_n\}$, and we write $\pi=\{f_n\}$. In this case, the control applied at time $n$ is $a_n=f_n(x_n)\in A(x_n)$. In particular, a stationary policy is identified with the function $f\in F$, and following a standard convention we denote by $F$ the set of all stationary control policies.

To ease the notation, for each $f\in F$ and $x\in X$, we write

$$c(x,f):=c(x,f(x)),\qquad Q(\cdot\mid x,f):=Q(\cdot\mid x,f(x)),\tag{11}$$

and

$$\alpha(x,f):=\alpha(x,f(x)).\tag{12}$$

The underlying probability space. Let $(\Omega,\mathcal{F})$ be the canonical measurable space consisting of the sample space $\Omega=K^{\infty}:=K\times K\times\cdots$ and its product $\sigma$-algebra $\mathcal{F}$. Then, under standard arguments (see, e.g., [1, 14]), for each $\pi\in\Pi$ and initial state $x\in X$ there exists a probability measure $P_x^{\pi}$ on $(\Omega,\mathcal{F})$ such that, for all $h_n\in H_n$, $a_n\in A(x_n)$, $n\in\mathbb{N}_0$, $C\in\mathcal{B}(A)$, and $B\in\mathcal{B}(X)$,

$$P_x^{\pi}[x_0=x]=1;\qquad P_x^{\pi}[a_n\in C\mid h_n]=\pi_n(C\mid h_n);\tag{13}$$

and the Markov-like property is satisfied

$$P_x^{\pi}[x_{n+1}\in B\mid h_n,a_n]=Q(B\mid x_n,a_n).\tag{14}$$

The stochastic process $(\Omega,\mathcal{F},P_x^{\pi},\{x_n\})$ is called a Markov decision process.

Optimality criterion. We assume that costs are discounted at a multiplicative rate. That is, a cost $C$ incurred at stage $n$ is equivalent to a cost $C\Gamma_n$ at time $0$, where

$$\Gamma_n:=\prod_{k=0}^{n-1}\alpha(x_k,a_k)\ \text{ if }n\ge 1,\qquad \Gamma_0:=1.\tag{15}$$

In this sense, when using a policy πΠ, given the initial state x0=x, we define the total expected discounted cost (with state-action–dependent discount factors) as

$$V(\pi,x):=E_x^{\pi}\left[\sum_{n=0}^{\infty}\Gamma_n c(x_n,a_n)\right],\tag{16}$$

where $E_x^{\pi}$ denotes the expectation operator with respect to the probability measure $P_x^{\pi}$ induced by the policy $\pi$, given $x_0=x$.

The optimal control problem associated with the control model $M$ is then to find an optimal policy $\pi^*\in\Pi$ such that $V(\pi^*,x)=V(x)$ for all $x\in X$, where

$$V(x):=\inf_{\pi\in\Pi}V(\pi,x)\tag{17}$$

is the optimal value function (see [10]).
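On a finite state space, the cost (16) under a fixed stationary policy can be estimated by simulating the running products $\Gamma_n$; it also solves the linear system $V(f,\cdot)=c_f+\operatorname{diag}(\alpha_f)P_fV(f,\cdot)$ (cf. (35)), which gives an exact cross-check. The three-state model below is a hypothetical illustration of ours, not from the chapter:

```python
import numpy as np

rng = np.random.default_rng(1)
nS = 3
# Hypothetical finite model under a fixed stationary policy f:
# c_f[x] = c(x, f(x)), alpha_f[x] = alpha(x, f(x)), P_f[x, y] = Q(y | x, f(x)).
c_f = np.array([1.0, 2.0, 0.5])
alpha_f = np.array([0.9, 0.7, 0.8])
P_f = np.array([[0.5, 0.5, 0.0],
                [0.1, 0.6, 0.3],
                [0.2, 0.2, 0.6]])

def mc_cost(x0, n_paths=2000, horizon=100):
    """Monte Carlo estimate of V(f, x0) = E[sum_n Gamma_n c(x_n, a_n)], truncated
    at `horizon` (the tail is negligible since all discount factors are < 1)."""
    total = 0.0
    for _ in range(n_paths):
        x, gamma, acc = x0, 1.0, 0.0
        for _ in range(horizon):
            acc += gamma * c_f[x]
            gamma *= alpha_f[x]          # Gamma_{n+1} = Gamma_n * alpha(x_n, a_n)
            x = rng.choice(nS, p=P_f[x])
        total += acc
    return total / n_paths

# V(f, .) solves the linear system V = c_f + diag(alpha_f) P_f V.
V_exact = np.linalg.solve(np.eye(nS) - alpha_f[:, None] * P_f, c_f)
print(V_exact.round(2), round(mc_cost(0), 2))
```

The two numbers agree up to Monte Carlo error, illustrating that the state-dependent products $\Gamma_n$ in (16) play exactly the role of the usual geometric weights $\alpha^n$.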

## 3. The value iteration algorithm

In this section we give conditions on the model that imply: (i) the convergence of the value iteration algorithm; (ii) the value function V is a solution of the corresponding optimality equation; and (iii) the existence of stationary optimal policies. In order to guarantee that V(x) is finite for each initial state x we suppose the following.

Assumption 3.1. There exists $\pi_0\in\Pi$ such that $V(\pi_0,x)<\infty$ for all $x\in X$.

At the end of Section 4 we give sufficient conditions for Assumption 3.1. We also require continuity and (inf-) compactness conditions to ensure the existence of "measurable minimizers." The following definition was introduced in [13].

Definition 3.2. A function $u:K\to\mathbb{R}$ is said to be K-inf-compact on $K$ if for each compact subset $K'$ of $X$ and each $r\in\mathbb{R}$, the set

$$\{(x,a)\in\mathrm{Gr}_{K'}(A):u(x,a)\le r\}\tag{18}$$

is a compact subset of $X\times A$, where $\mathrm{Gr}_{K'}(A):=\{(x,a):x\in K',\ a\in A(x)\}$.

Assumption 3.3. (a) The one-stage cost c and the discount factor α are K-inf-compact functions on K. In addition, c is nonnegative.

(b) The transition law Q is weakly continuous, that is, the mapping

$$(x,a)\mapsto\int_X u(y)\,Q(dy\mid x,a)\tag{19}$$

is continuous for each bounded and continuous function $u$ on $X$.

For each measurable function $u$ on $X$, each $x\in X$, and each $f\in F$, we define the operators

$$Tu(x):=\inf_{a\in A(x)}\left\{c(x,a)+\alpha(x,a)\int_X u(y)\,Q(dy\mid x,a)\right\}\tag{20}$$

and

$$T_fu(x):=c(x,f)+\alpha(x,f)\int_X u(y)\,Q(dy\mid x,f).\tag{21}$$

A consequence of Assumption 3.3 is the following.

Lemma 3.4. Let $u$ be a function in $L^+(X)$. If Assumption 3.3 holds, then the function $v:K\to\mathbb{R}$ defined by

$$v(x,a):=c(x,a)+\alpha(x,a)\int_X u(y)\,Q(dy\mid x,a)\tag{22}$$

is K-inf-compact on $K$.

Proof. First note that, by the K-inf-compactness hypothesis, $c(\cdot,\cdot)$ and $\alpha(\cdot,\cdot)$ are l.s.c. on $\mathrm{Gr}_{K'}(A)$ for each compact subset $K'$ of $X$. Then, since $\alpha$ and $u$ are nonnegative functions, from Assumption 3.3 we have that $v(\cdot,\cdot)$ is l.s.c. on $\mathrm{Gr}_{K'}(A)$. Thus, for each $r\in\mathbb{R}$, the set

$$\{(x,a)\in\mathrm{Gr}_{K'}(A):v(x,a)\le r\}\tag{23}$$

is a closed subset of the compact set $\{(x,a)\in\mathrm{Gr}_{K'}(A):c(x,a)\le r\}$, since $v\ge c$. Then, $v$ is K-inf-compact on $K$. ∎

Observe that the operator $T$ is monotone in the sense that if $v\le u$, then $Tv\le Tu$. In addition, from Assumption 3.3 and ([13], Theorem 3.3), we have that $T$ maps $L^+(X)$ into itself. Furthermore, there exists $\tilde{f}\in F$ such that

$$Tu(x)=T_{\tilde{f}}u(x),\quad x\in X.\tag{24}$$

To state our first result we define the sequence $\{v_n\}\subset L^+(X)$ of value iteration functions as:

$$v_0\equiv 0;\qquad v_n(x)=Tv_{n-1}(x),\quad x\in X.\tag{25}$$

Since T is monotone, note that {vn} is a nondecreasing sequence.
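On a finite model, the operator $T$ in (20) and the recursion (25) reduce to array operations. The sketch below (with hypothetical numbers of our own, not from the chapter) iterates $v_n=Tv_{n-1}$ from $v_0\equiv 0$ and reads off a minimizing selector as in (24):

```python
import numpy as np

# Hypothetical finite model: 3 states, 2 actions, with a state-action-dependent
# discount factor alpha(x, a); all numbers are illustrative.
c = np.array([[1.0, 3.0],
              [2.0, 0.5],
              [4.0, 1.0]])                  # c[x, a]
alpha = np.array([[0.90, 0.60],
                  [0.70, 0.85],
                  [0.50, 0.95]])            # alpha[x, a]
Q = np.array([[[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
              [[0.3, 0.4, 0.3], [0.0, 0.5, 0.5]],
              [[0.5, 0.0, 0.5], [0.2, 0.2, 0.6]]])  # Q[x, a, y]

def T(v):
    """Tv(x) = min_a { c(x,a) + alpha(x,a) * sum_y Q(y|x,a) v(y) }, as in (20)."""
    return (c + alpha * (Q @ v)).min(axis=1)

# Value iteration: v_0 = 0, v_n = T v_{n-1}; the v_n increase to V.
v = np.zeros(3)
for _ in range(2000):
    v_new = T(v)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
f = (c + alpha * (Q @ v)).argmin(axis=1)   # a minimizing selector, cf. (24)
print(v.round(3), f)
```

Since every $\alpha(x,a)$ is bounded away from 1 here, $T$ is a contraction and the iterates converge geometrically; the chapter's setting is more general and relies on monotone convergence instead.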

Theorem 3.5. Suppose that Assumptions 3.1 and 3.3 hold. Then

1. $v_n\uparrow V$.

2. $V$ is the minimal solution in $L^+(X)$ of the optimality equation, i.e.,

$$V(x)=TV(x)=\inf_{a\in A(x)}\left\{c(x,a)+\alpha(x,a)\int_X V(y)\,Q(dy\mid x,a)\right\}.\tag{26}$$
3. There exists a stationary policy $f\in F$ such that $V(x)=T_fV(x)$ for all $x\in X$, that is,

$$V(x)=c(x,f)+\alpha(x,f)\int_X V(y)\,Q(dy\mid x,f),\tag{27}$$

and f is an optimal policy.

Proof. Since $\{v_n\}$ is nondecreasing, there exists $v\in L^+(X)$ such that $v_n\uparrow v$. Hence, from the Monotone Convergence Theorem, ([13], Lemmas 2.2, 2.3), and ([1], Lemma 4.2.4), we obtain, for each $x\in X$, $v_n(x)=Tv_{n-1}(x)\to Tv(x)$ as $n\to\infty$, which in turn implies

$$Tv=v.\tag{28}$$

Therefore, to get parts 1 and 2 we need to prove that $v=V$. To this end, observe that for all $x\in X$ and $\pi\in\Pi$,

$$v_n(x)\le\int_A c(x,a)\,\pi(da\mid x)+\int_A\alpha(x,a)\int_X v_{n-1}(x_1)\,Q(dx_1\mid x,a)\,\pi(da\mid x).\tag{29}$$

Then, iterating (29) we obtain

$$v_n(x)\le V_n(\pi,x),\quad n\in\mathbb{N},\tag{30}$$

where

$$V_n(\pi,x)=E_x^{\pi}\left[\sum_{t=0}^{n-1}\Gamma_t c(x_t,a_t)\right]\tag{31}$$

is the $n$-stage discounted cost. Then, letting $n\to\infty$, we get $v(x)\le V(\pi,x)$ for all $\pi\in\Pi$ and $x\in X$. Thus,

$$v(x)\le V(x),\quad x\in X.\tag{32}$$

On the other hand, from (28) and (24), let $f\in F$ be such that $v(x)=T_fv(x)$, $x\in X$. Iterating this equation, we have (see (31))

$$v(x)=E_x^{f}\left[c(x,f)+\sum_{t=1}^{n-1}\prod_{k=0}^{t-1}\alpha(x_k,f)\,c(x_t,f)\right]+E_x^{f}\left[\prod_{k=0}^{n-1}\alpha(x_k,f)\,v(x_n)\right]\ge V_n(f,x).\tag{33}$$

Hence, letting $n\to\infty$,

$$v(x)\ge V(f,x)\ge V(x),\quad x\in X.\tag{34}$$

Combining (32) and (34) we get v=V.

Now, let $u\in L^+(X)$ be an arbitrary solution of the optimality equation, that is, $Tu=u$. Then, applying the arguments in the proof of (34) with $u$ instead of $v$, we conclude that $u\ge V$. That is, $V$ is minimal in $L^+(X)$.

Part 3 follows from part 2 and ([13], Theorem 3.3). Indeed, there exists a stationary policy $f\in F$ such that $V(x)=T_fV(x)$, $x\in X$. Then, iteration of this equation yields $V(x)=V(f,x)$, which implies that $f$ is optimal. ∎

## 4. Policy iteration algorithm

Theorem 3.5 establishes an approximation of the value function $V$ by means of the sequence of value iteration functions $\{v_n\}$, which is defined recursively and increases to $V$. We now present the well-known policy iteration algorithm, which provides a decreasing approximation to $V$ built from a sequence of stationary policies.

To define the algorithm, first observe that from the Markov property (14) and properties of conditional expectation, for any stationary policy $f\in F$ and $x\in X$, the corresponding cost $V(f,x)$ satisfies

$$\begin{aligned}
V(f,x)&=c(x,f)+\alpha(x,f)\,E_x^{f}\left[\sum_{t=1}^{\infty}\prod_{k=1}^{t-1}\alpha(x_k,f)\,c(x_t,f)\right]\\
&=c(x,f)+\alpha(x,f)\int_X E^{f}\left[c(x_1,f)+\sum_{t=2}^{\infty}\prod_{k=1}^{t-1}\alpha(x_k,f)\,c(x_t,f)\;\Big|\;x_1=y\right]Q(dy\mid x,f)\\
&=c(x,f)+\alpha(x,f)\int_X V(f,y)\,Q(dy\mid x,f)=T_fV(f,\cdot)(x),\quad x\in X.
\end{aligned}\tag{35}$$

Let $f_0\in F$ be a stationary policy with a finite-valued cost, and define $w_0(\cdot):=V(f_0,\cdot)$.

Then, from (35),

$$w_0(x)=c(x,f_0)+\alpha(x,f_0)\int_X w_0(y)\,Q(dy\mid x,f_0)=T_{f_0}w_0(x),\quad x\in X.\tag{36}$$

Now, let $f_1\in F$ be such that

$$Tw_0(x)=T_{f_1}w_0(x),\tag{37}$$

and define $w_1(\cdot):=V(f_1,\cdot)$.

In general, we define a sequence $\{w_n\}$ in $L^+(X)$ as follows. Given $f_n\in F$, compute $w_n(\cdot):=V(f_n,\cdot)$.

Next, let $f_{n+1}\in F$ be such that

$$T_{f_{n+1}}w_n(x)=Tw_n(x),\quad x\in X,\tag{38}$$

that is,

$$T_{f_{n+1}}w_n(x)=c(x,f_{n+1})+\alpha(x,f_{n+1})\int_X w_n(y)\,Q(dy\mid x,f_{n+1})=\min_{a\in A(x)}\left\{c(x,a)+\alpha(x,a)\int_X w_n(y)\,Q(dy\mid x,a)\right\}=Tw_n(x),\quad x\in X.\tag{39}$$

Then we define $w_{n+1}(\cdot):=V(f_{n+1},\cdot)$.

Theorem 4.1. Under Assumptions 3.1 and 3.3, there exists a measurable nonnegative function $w$ such that $w_n\downarrow w$ and $w\ge V$. Moreover, if $w$ satisfies

$$\lim_{n\to\infty}E_x^{\pi}[\Gamma_n w(x_n)]=0\quad\text{for all }\pi\in\Pi,\ x\in X,\tag{40}$$

then $w=V$.

To prove Theorem 4.1 we need the following result.

Lemma 4.2. Under Assumption 3.3, if $u:X\to\mathbb{R}$ is a measurable function such that $Tu$ is well defined, $u\le Tu$, and

$$\lim_{n\to\infty}E_x^{\pi}[\Gamma_n u(x_n)]=0\quad\text{for all }\pi\in\Pi,\ x\in X,\tag{41}$$

then $u\le V$.

Proof. From the Markov property (14), for each $\pi\in\Pi$ and $x\in X$,

$$\begin{aligned}
E_x^{\pi}[\Gamma_{n+1}u(x_{n+1})\mid h_n,a_n]&=\Gamma_{n+1}\int_X u(y)\,Q(dy\mid x_n,a_n)\\
&=\Gamma_n\left[c(x_n,a_n)+\alpha(x_n,a_n)\int_X u(y)\,Q(dy\mid x_n,a_n)-c(x_n,a_n)\right]\\
&\ge\Gamma_n\inf_{a\in A(x_n)}\left[c(x_n,a)+\alpha(x_n,a)\int_X u(y)\,Q(dy\mid x_n,a)\right]-\Gamma_n c(x_n,a_n)\\
&=\Gamma_n Tu(x_n)-\Gamma_n c(x_n,a_n)\ge\Gamma_n u(x_n)-\Gamma_n c(x_n,a_n),
\end{aligned}\tag{42–45}$$

which, in turn implies

$$\Gamma_n c(x_n,a_n)\ge E_x^{\pi}[\Gamma_n u(x_n)-\Gamma_{n+1}u(x_{n+1})\mid h_n,a_n].\tag{46}$$

Therefore, for all $k\in\mathbb{N}$ (see (31)),

$$V_k(\pi,x)=E_x^{\pi}\sum_{n=0}^{k-1}\Gamma_n c(x_n,a_n)\ge u(x)-E_x^{\pi}[\Gamma_k u(x_k)].\tag{47}$$

Finally, letting $k\to\infty$, (41) yields $V(\pi,x)\ge u(x)$, and since $\pi$ is arbitrary, we obtain $V(x)\ge u(x)$. ∎

Proof of Theorem 4.1. According to Lemma 4.2, it is sufficient to show the existence of a measurable nonnegative function $w$ such that $w_n\downarrow w$ and $w\le Tw$. To this end, from (36)–(38),

$$w_0(x)\ge\min_{a\in A(x)}\left\{c(x,a)+\alpha(x,a)\int_X w_0(y)\,Q(dy\mid x,a)\right\}=T_{f_1}w_0(x)=c(x,f_1)+\alpha(x,f_1)\int_X w_0(y)\,Q(dy\mid x,f_1).\tag{48}$$

Iterating this inequality, a straightforward calculation as in (34) shows that

$$w_0(x)\ge V(f_1,x)=w_1(x),\quad x\in X.\tag{49}$$

In general, similar arguments yield

$$w_n\ge Tw_n\ge w_{n+1},\quad n\in\mathbb{N}.\tag{50}$$

Therefore, there exists a nonnegative measurable function $w$ such that $w_n\downarrow w$. Moreover, since $w_n(\cdot)=V(f_n,\cdot)\ge V(\cdot)$ for all $n\in\mathbb{N}_0$, we have $w\ge V$. Next, letting $n\to\infty$ in (50) and applying ([17], Lemma 3.3), we obtain $Tw=w$, which implies $w\le Tw$. Hence, if $w$ satisfies (40), Lemma 4.2 yields $w\le V$, and therefore $w=V$. ∎
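On a finite model, the policy iteration scheme (36)–(39) alternates exact policy evaluation, a linear solve by (35), with policy improvement as in (38). A minimal sketch with illustrative numbers of our own, not from the chapter:

```python
import numpy as np

# Hypothetical 3-state, 2-action model with state-action-dependent discounting.
c = np.array([[1.0, 3.0], [2.0, 0.5], [4.0, 1.0]])          # c[x, a]
alpha = np.array([[0.90, 0.60], [0.70, 0.85], [0.50, 0.95]])  # alpha[x, a]
Q = np.array([[[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
              [[0.3, 0.4, 0.3], [0.0, 0.5, 0.5]],
              [[0.5, 0.0, 0.5], [0.2, 0.2, 0.6]]])            # Q[x, a, y]
nS = 3

def evaluate(f):
    """Policy evaluation: w = V(f, .) solves w = c_f + diag(alpha_f) P_f w, cf. (35)."""
    idx = np.arange(nS)
    c_f, a_f, P_f = c[idx, f], alpha[idx, f], Q[idx, f]
    return np.linalg.solve(np.eye(nS) - a_f[:, None] * P_f, c_f)

def improve(w):
    """Policy improvement: f_{n+1}(x) attains the minimum in (38)-(39)."""
    return (c + alpha * (Q @ w)).argmin(axis=1)

f = np.zeros(nS, dtype=int)        # f_0: an arbitrary stationary policy
for _ in range(50):
    w = evaluate(f)                # w_n = V(f_n, .)
    f_new = improve(w)             # f_{n+1}
    if np.array_equal(f_new, f):   # no further improvement: w_{n+1} = w_n
        break
    f = f_new
print(f, w.round(3))
```

Each evaluation $w_n$ dominates the next, mirroring the decreasing sequence (50); on a finite model the loop terminates at a policy whose cost solves the optimality equation (26).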

### 4.1. Sufficient conditions for Assumption 3.1 and (40)

An obvious sufficient condition for Assumption 3.1 and (40) is the following:

C1 (a) There exists $\bar\alpha\in(0,1)$ such that $\alpha(x,a)<\bar\alpha$ for all $(x,a)\in K$.

(b) For some constant $m$, $0\le c(x,a)\le m$ for all $(x,a)\in K$.

Indeed, under condition C1, $V(\pi,x)\le m/(1-\bar\alpha)$ for all $x\in X$ and $\pi\in\Pi$, and $\{w_n\}$ is a bounded sequence, which in turn implies (since $w_n\downarrow w$) the boundedness of the function $w$. This fact clearly yields (40).

Other less obvious sufficient conditions are the following (see, e.g., [15, 16, 2]).

C2 (a) Condition C1 (a).

(b) There exist a measurable function $W:X\to(1,\infty)$ and constants $M>0$ and $\beta\in(1,1/\bar\alpha)$ such that for all $(x,a)\in K$,

$$\sup_{a\in A(x)}c(x,a)\le M\,W(x)\tag{51}$$

and

$$\int_X W(y)\,Q(dy\mid x,a)\le\beta W(x).\tag{52}$$

First note that by condition C2 and the Markov property (14), for any policy $\pi\in\Pi$ and initial state $x_0=x\in X$,

$$E_x^{\pi}[W(x_{n+1})\mid h_n,a_n]=\int_X W(y)\,Q(dy\mid x_n,a_n)\le\beta W(x_n),\quad n\in\mathbb{N}_0.\tag{53}$$

Then, using properties of conditional expectation,

$$E_x^{\pi}[W(x_{n+1})]\le\beta E_x^{\pi}[W(x_n)],\quad n\in\mathbb{N}_0.\tag{54}$$

Iterating inequality (54) we get

$$E_x^{\pi}[W(x_n)]\le\beta^n W(x),\quad n\in\mathbb{N}_0.\tag{55}$$

Therefore, by condition C2, for any policy πΠ and xX,

$$V(\pi,x)\le E_x^{\pi}\sum_{n=0}^{\infty}\bar\alpha^{\,n}c(x_n,a_n)\le\sum_{n=0}^{\infty}M\bar\alpha^{\,n}E_x^{\pi}W(x_n)\le\frac{M}{1-\bar\alpha\beta}\,W(x).\tag{56}$$

Thus, Assumption 3.1 holds.
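The drift condition (52) can be checked numerically for concrete dynamics. The example below is ours, not the chapter's: it takes $x_{n+1}=0.5x_n+\xi_n$ with standard normal noise and $W(x)=|x|+2$; then $\int W(y)\,Q(dy\mid x,a)\le 0.5|x|+E|\xi|+2$, so $\beta=1.45$ satisfies (52), and C2 holds whenever $\bar\alpha<1/1.45$:

```python
import numpy as np

# Illustrative check of drift condition (52); this example is hypothetical.
# Dynamics: x_{n+1} = 0.5 x_n + xi_n with xi_n ~ N(0, 1); weight W(x) = |x| + 2.
# Analytically: E|0.5 x + xi| + 2 <= 0.5|x| + sqrt(2/pi) + 2 <= 1.45 (|x| + 2).
rng = np.random.default_rng(0)
W = lambda x: np.abs(x) + 2.0
beta = 1.45

for x in [-5.0, -1.0, 0.0, 0.5, 3.0]:
    xi = rng.normal(size=200_000)
    lhs = W(0.5 * x + xi).mean()   # Monte Carlo estimate of int W(y) Q(dy | x)
    assert lhs <= beta * W(x)      # drift inequality (52) holds at this x
    print(x, round(lhs, 3), round(beta * W(x), 3))
```

The tightest case is $x=0$, where the left side is $\sqrt{2/\pi}+2\approx 2.80$ against $\beta W(0)=2.9$; larger $|x|$ only increases the slack because the drift coefficient $0.5$ is well below $\beta$.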

On the other hand, if $L_W^+(X)$ denotes the subclass of all functions $u$ in $L^+(X)$ such that

$$\|u\|_W:=\sup_{x\in X}\frac{u(x)}{W(x)}<\infty,\tag{57}$$

then, because $w_k(\cdot)=V(f_k,\cdot)$, from (56) and condition C2 we have that $w_k\in L_W^+(X)$ for all $k=1,2,\dots$ and

$$\lim_{n\to\infty}E_x^{\pi}[\Gamma_n w_k(x_n)]=0\quad\text{for all }\pi\in\Pi,\ x\in X.\tag{58}$$

Since $w\le w_k$, (40) follows from (58).

## Acknowledgments

Work supported partially by Consejo Nacional de Ciencia y Tecnología (CONACYT) under grant CB2015/254306.

## References

1. O. Hernández-Lerma, J.B. Lasserre, Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer-Verlag, New York, NY, 1996.
2. O. Hernández-Lerma, J.B. Lasserre, Further Topics on Discrete-Time Markov Control Processes. Springer-Verlag, New York, NY, 1999.
3. M.L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, NY, 1994.
4. Y. Carmon, A. Shwartz, Markov decision processes with exponentially representable discounting. Oper. Res. Lett. 37 (2009), 51–55.
5. E.A. Feinberg, A. Shwartz, Constrained dynamic programming with two discount factors: applications and an algorithm. IEEE Trans. Autom. Control 44 (1999), 628–631.
6. K. Hinderer, Foundations of Non-Stationary Dynamic Programming with Discrete Time Parameter. Lecture Notes Oper. Res. 33, Springer, New York, NY, 1970.
7. M. Schäl, Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal. Z. Wahr. Verw. Geb. 32 (1975), 179–196.
8. Q. Wei, X. Guo, Markov decision processes with state-dependent discount factors and unbounded rewards/costs. Oper. Res. Lett. 39 (2011), 369–374.
9. J. González-Hernández, R.R. López-Martínez, R. Pérez-Hernández, Markov control processes with randomized discounted cost in Borel space. Math. Meth. Oper. Res. 65 (2007), 27–44.
10. J. González-Hernández, R.R. López-Martínez, J.A. Minjárez-Sosa, Adaptive policies for stochastic systems under a randomized discounted criterion. Bol. Soc. Mat. Mex. 14 (2008), 149–163.
11. J. González-Hernández, R.R. López-Martínez, J.A. Minjárez-Sosa, Approximation, estimation and control of stochastic systems under a randomized discounted cost criterion. Kybernetika 45 (2009), 737–754.
12. J. González-Hernández, R.R. López-Martínez, J.A. Minjárez-Sosa, J.R. Gabriel-Arguelles, Constrained Markov control processes with randomized discounted cost criteria: occupation measures and extremal points. Risk Decis. Anal. 4 (2013), 163–176.
13. E.A. Feinberg, P.O. Kasyanov, N.V. Zadoianchuk, Berge's theorem for noncompact image sets. J. Math. Anal. Appl. 397 (2013), 255–259.
14. E.B. Dynkin, A.A. Yushkevich, Controlled Markov Processes. Springer-Verlag, New York, NY, 1979.
15. E.I. Gordienko, O. Hernández-Lerma, Average cost Markov control processes with weighted norms: existence of canonical policies. Appl. Math. (Warsaw) 23 (1995), 199–218.
16. E.I. Gordienko, O. Hernández-Lerma, Average cost Markov control processes with weighted norms: value iteration. Appl. Math. (Warsaw) 23 (1995), 219–237.
17. O. Hernández-Lerma, W. Runggaldier, Monotone approximations for convex stochastic control problems. J. Math. Syst. Estim. Control 4 (1994), 99–140.
