
# Iteration Algorithms in Markov Decision Processes with State-Action-Dependent Discount Factors and Unbounded Costs

Written By

Fernando Luque-Vásquez and J. Adolfo Minjárez-Sosa

Submitted: April 18th, 2016 Reviewed: July 28th, 2016 Published: December 28th, 2016

DOI: 10.5772/65044

From the Edited Volume

## Operations Research

Edited by Kuodi Jian


## Abstract

This chapter concerns discrete time Markov decision processes under a discounted optimality criterion with state-action-dependent discount factors, possibly unbounded costs, and noncompact admissible action sets. Under mild conditions, we show the existence of stationary optimal policies and we introduce the value iteration and the policy iteration algorithms to approximate the value function.

### Keywords

• discounted optimality
• non-constant discount factor
• value iteration
• policy iteration
• Markov decision processes

AMS 2010 subject classifications: 93E10, 90C40

## 1. Introduction

In this chapter we study Markov decision processes (MDPs) with Borel state and action spaces under a discounted criterion with state-action–dependent discount factors, possibly unbounded costs and noncompact admissible action sets. That is, we consider discount factors of the form

$$\alpha(x_n,a_n),\tag{1}$$

where $x_n$ and $a_n$ are the state and the action at time $n$, respectively, playing the following role during the evolution of the system. At the initial state $x_0$, the controller chooses an action $a_0$ and a cost $c(x_0,a_0)$ is incurred. Then the system moves to a new state $x_1$ according to a transition law. Once the system is in state $x_1$, the controller selects an action $a_1$ and incurs a discounted cost $\alpha(x_0,a_0)c(x_1,a_1)$. Next the system moves to a state $x_2$ and the process is repeated. In general, at stage $n\ge 1$, the controller incurs the discounted cost

$$\prod_{k=0}^{n-1}\alpha(x_k,a_k)\,c(x_n,a_n),\tag{2}$$

and our objective is to show the existence of stationary optimal control policies under the corresponding performance index, as well as to introduce approximation algorithms, namely, value iteration and policy iteration.

In the scenario of a constant discount factor, the discounted optimality criterion in stochastic decision problems is the best understood of all performance indices, and it is widely accepted in several application problems (see, e.g., [1–3] and references therein). However, such an assumption might be strong or unrealistic in some economic and financial models. Indeed, in these problems the discount factors are typically functions of the interest rates, which in turn depend on the amount of currency and the decision-makers' actions. Hence, we have state-action-dependent discount factors, and it is precisely these kinds of situations we are dealing with.

MDPs with nonconstant discount factors have been studied under different approaches (see, e.g., [4–8]). In particular, our work is a sequel to [8], where the control problem with a state-dependent discount factor is studied. In addition, randomized discounted criteria have been analyzed in [9–12], where the discount factor is modeled as a stochastic process independent of the state-action pairs.

Specifically, in this chapter we study control models with state-action-dependent discount factors, focusing mainly on introducing approximation algorithms for the optimal value function (value iteration and policy iteration). Furthermore, an important feature of this work is that we impose neither compactness assumptions on the sets of admissible actions nor continuity conditions on the cost, which, in most papers on MDPs, are needed to show the existence of measurable selectors and the continuity or semicontinuity of the minimum function. Indeed, in contrast to the previously cited references, in this work we assume that the cost and discount factor functions satisfy the K-inf-compactness condition introduced in [13]. Then, we use a generalization of Berge's theorem, given in [13], to prove the existence of measurable selectors. To the best of our knowledge there are no works dealing with MDPs in the context presented in this chapter.

The remainder of the chapter is organized as follows. Section 2 contains the description of the Markov decision model and the optimality criterion. In Section 3 we introduce the assumptions on the model and we prove the convergence of the value iteration algorithm (Theorem 3.5). In Section 4 we define the policy iteration algorithm and the convergence is stated in Theorem 4.1.

Notation. Throughout the paper we shall use the following notation. Given a Borel space $S$ (that is, a Borel subset of a complete separable metric space), $\mathcal{B}(S)$ denotes its Borel $\sigma$-algebra, and "measurability" always means measurability with respect to $\mathcal{B}(S)$. Given two Borel spaces $S$ and $S'$, a stochastic kernel $\varphi(\cdot\mid\cdot)$ on $S$ given $S'$ is a function such that $\varphi(\cdot\mid s')$ is a probability measure on $S$ for each $s'\in S'$, and $\varphi(B\mid\cdot)$ is a measurable function on $S'$ for each $B\in\mathcal{B}(S)$. Moreover, $\mathbb{N}$ ($\mathbb{N}_0$) denotes the set of positive (nonnegative) integers. Finally, $L(S)$ stands for the class of lower semicontinuous functions on $S$ bounded below, and $L^+(S)$ denotes the subclass of nonnegative functions in $L(S)$.

## 2. Markov decision processes

Markov control model. Let

$$M:=(X,A,\{A(x)\subset A\mid x\in X\},Q,\alpha,c)\tag{3}$$

be a discrete-time Markov control model with state-action-dependent discount factors satisfying the following conditions. The state space $X$ and the action or control space $A$ are Borel spaces. For each state $x\in X$, $A(x)$ is a nonempty Borel subset of $A$ denoting the set of admissible controls when the system is in state $x$. We denote by $K$ the graph of the multifunction $x\mapsto A(x)$, that is,

$$K=\{(x,a):x\in X,\ a\in A(x)\},\tag{4}$$

which is assumed to be a Borel subset of the Cartesian product of $X$ and $A$. The transition law $Q(\cdot\mid\cdot)$ is a stochastic kernel on $X$ given $K$. Finally, $\alpha:K\to(0,1)$ and $c:K\to[0,\infty)$ are measurable functions representing the discount factor and the cost per stage, respectively, when the system is in state $x\in X$ and the action $a\in A(x)$ is selected.

The model $M$ represents a controlled stochastic system and has the following interpretation. Suppose that at time $n\in\mathbb{N}_0$ the system is in the state $x_n=x\in X$. Then, possibly taking into account the history of the system, the controller selects an action $a_n=a\in A(x)$, and a discount factor $\alpha(x,a)$ is imposed. As a consequence of this the following happens:

1. A cost c(x,a) is incurred;

2. The system visits a new state $x_{n+1}=x'\in X$ according to the transition law

$$Q(B\mid x,a):=\Pr[x_{n+1}\in B\mid x_n=x,a_n=a],\quad B\in\mathcal{B}(X).\tag{5}$$

Once the transition to state $x'$ occurs, the process is repeated.

Typically, in many applications, the evolution of the system is determined by stochastic difference equations of the form

$$x_{n+1}=F(x_n,a_n,\xi_n),\quad n\in\mathbb{N}_0,\tag{6}$$

where $\{\xi_n\}$ is a sequence of independent and identically distributed random variables with values in some Borel space $S$, independent of the initial state $x_0$, and $F:X\times A\times S\to X$ is a given measurable function. In this case, if $\theta$ denotes the common distribution of $\xi_n$, that is,

$$\theta(D):=P[\xi_n\in D],\quad D\in\mathcal{B}(S),\ n\in\mathbb{N}_0,\tag{7}$$

then the transition kernel can be written as

$$Q(B\mid x,a)=\Pr[F(x_n,a_n,\xi_n)\in B\mid x_n=x,a_n=a]=\theta\{s\in S:F(x,a,s)\in B\}=\int_S\mathbf{1}_B[F(x,a,s)]\,\theta(ds),\quad B\in\mathcal{B}(X),\ (x,a)\in K,\tag{8}$$

where $\mathbf{1}_B(\cdot)$ represents the indicator function of the set $B$.
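To make (8) concrete, the kernel induced by a difference equation can be estimated by sampling the noise distribution $\theta$. The inventory-style dynamics $F$, the exponential demand, and all numbers below are our own illustrative choices, not from the chapter:

```python
import numpy as np

def F(x, a, s):
    # Hypothetical dynamics: next stock = current stock + order - demand, clipped at 0.
    return max(x + a - s, 0.0)

def empirical_Q(x, a, B, theta_sampler, n=100_000, seed=0):
    """Monte Carlo estimate of Q(B | x, a) = theta{s : F(x, a, s) in B}."""
    rng = np.random.default_rng(seed)
    ys = np.array([F(x, a, s) for s in theta_sampler(rng, n)])
    return np.mean([B(y) for y in ys])

# Demand xi_n ~ Exp(1); estimate Pr[x_{n+1} <= 1 | x = 2, a = 1].
sampler = lambda rng, n: rng.exponential(1.0, size=n)
p_hat = empirical_Q(2.0, 1.0, lambda y: y <= 1.0, sampler)
# Exact value: max(3 - s, 0) <= 1 iff s >= 2, so Q = exp(-2).
print(round(p_hat, 3))
```

The estimate converges to $\theta\{s : F(x,a,s)\in B\}$ by the law of large numbers, which is exactly the integral representation in (8).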

Control policies. The actions applied by the controller are chosen by means of rules known as control policies, defined as follows. Let $H_0:=X$ and $H_n:=K^n\times X$, $n\ge 1$, be the spaces of admissible histories up to time $n$. A generic element of $H_n$ is written as $h_n=(x_0,a_0,\dots,x_{n-1},a_{n-1},x_n)$.

Definition 2.1 A control policy (randomized, history-dependent) is a sequence $\pi=\{\pi_n\}$ of stochastic kernels $\pi_n$ on $A$ given $H_n$ such that $\pi_n(A(x_n)\mid h_n)=1$ for all $h_n\in H_n$, $n\in\mathbb{N}_0$.

We denote by Π the set of all control policies.

Let $F$ be the set of measurable selectors, that is, the set of measurable functions $f:X\to A$ such that $f(x)\in A(x)$ for all $x\in X$.

Definition 2.2 A control policy π={πn} is said to be:

a. deterministic if there exists a sequence $\{g_n\}$ of measurable functions $g_n:H_n\to A$ such that

$$\pi_n(C\mid h_n)=\mathbf{1}_C[g_n(h_n)],\quad h_n\in H_n,\ n\in\mathbb{N}_0,\ C\in\mathcal{B}(A);\tag{9}$$

b. a Markov control policy if there exists a sequence $\{f_n\}$ of functions $f_n\in F$ such that

$$\pi_n(C\mid h_n)=\mathbf{1}_C[f_n(x_n)],\quad h_n\in H_n,\ n\in\mathbb{N}_0,\ C\in\mathcal{B}(A).\tag{10}$$

c. A Markov control policy is stationary if there exists $f\in F$ such that $f_n=f$ for all $n\in\mathbb{N}_0$.

If necessary, see for example [1–3, 14–16] for further information on these policies.

Observe that a Markov policy $\pi$ is identified with the sequence $\{f_n\}$, and we write $\pi=\{f_n\}$. In this case, the control applied at time $n$ is $a_n=f_n(x_n)\in A(x_n)$. In particular, a stationary policy is identified with the function $f\in F$, and following a standard convention we denote by $F$ the set of all stationary control policies.

To ease the notation, for each $f\in F$ and $x\in X$, we write

$$c(x,f):=c(x,f(x)),\qquad Q(\cdot\mid x,f):=Q(\cdot\mid x,f(x)),\tag{11}$$

and

$$\alpha(x,f):=\alpha(x,f(x)).\tag{12}$$

The underlying probability space. Let $(\Omega,\mathcal{F})$ be the canonical measurable space consisting of the sample space $\Omega=K^{\infty}:=K\times K\times\cdots$ and its product $\sigma$-algebra $\mathcal{F}$. Then, under standard arguments (see, e.g., [1, 14]), for each $\pi\in\Pi$ and initial state $x\in X$ there exists a probability measure $P_x^{\pi}$ on $(\Omega,\mathcal{F})$ such that, for all $h_n\in H_n$, $a_n\in A(x_n)$, $n\in\mathbb{N}_0$, $C\in\mathcal{B}(A)$, and $B\in\mathcal{B}(X)$,

$$P_x^{\pi}[x_0=x]=1;\qquad P_x^{\pi}[a_n\in C\mid h_n]=\pi_n(C\mid h_n);\tag{13}$$

and the Markov-like property is satisfied

$$P_x^{\pi}[x_{n+1}\in B\mid h_n,a_n]=Q(B\mid x_n,a_n).\tag{14}$$

The stochastic process $(\Omega,\mathcal{F},P_x^{\pi},\{x_n\})$ is called a Markov decision process.

Optimality criterion. We assume that costs are discounted at a multiplicative rate. That is, a cost $C$ incurred at stage $n$ is equivalent to a cost $C\Gamma_n$ at time $0$, where

$$\Gamma_n:=\prod_{k=0}^{n-1}\alpha(x_k,a_k)\ \text{ if }n\ge 1,\qquad \Gamma_0:=1.\tag{15}$$

In this sense, when using a policy πΠ, given the initial state x0=x, we define the total expected discounted cost (with state-action–dependent discount factors) as

$$V(\pi,x):=E_x^{\pi}\left[\sum_{n=0}^{\infty}\Gamma_n c(x_n,a_n)\right],\tag{16}$$

where $E_x^{\pi}$ denotes the expectation operator with respect to the probability measure $P_x^{\pi}$ induced by the policy $\pi$, given $x_0=x$.

The optimal control problem associated with the control model $M$ is then to find an optimal policy $\pi^*\in\Pi$ such that $V(\pi^*,x)=V(x)$ for all $x\in X$, where

$$V(x):=\inf_{\pi\in\Pi}V(\pi,x)\tag{17}$$

is the optimal value function (see [10]).
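On a finite state space, the cost (16) under a fixed stationary policy can be estimated by simulating the running products $\Gamma_n$; it also solves the linear system $V(f,\cdot)=c_f+\operatorname{diag}(\alpha_f)P_fV(f,\cdot)$ (cf. (35)), which gives an exact cross-check. The three-state model below is a hypothetical illustration of ours, not from the chapter:

```python
import numpy as np

rng = np.random.default_rng(1)
nS = 3
# Hypothetical finite model under a fixed stationary policy f:
# c_f[x] = c(x, f(x)), alpha_f[x] = alpha(x, f(x)), P_f[x, y] = Q(y | x, f(x)).
c_f = np.array([1.0, 2.0, 0.5])
alpha_f = np.array([0.9, 0.7, 0.8])
P_f = np.array([[0.5, 0.5, 0.0],
                [0.1, 0.6, 0.3],
                [0.2, 0.2, 0.6]])

def mc_cost(x0, n_paths=2000, horizon=100):
    """Monte Carlo estimate of V(f, x0) = E[sum_n Gamma_n c(x_n, a_n)], truncated
    at `horizon` (the tail is negligible since all discount factors are < 1)."""
    total = 0.0
    for _ in range(n_paths):
        x, gamma, acc = x0, 1.0, 0.0
        for _ in range(horizon):
            acc += gamma * c_f[x]
            gamma *= alpha_f[x]          # Gamma_{n+1} = Gamma_n * alpha(x_n, a_n)
            x = rng.choice(nS, p=P_f[x])
        total += acc
    return total / n_paths

# V(f, .) solves the linear system V = c_f + diag(alpha_f) P_f V.
V_exact = np.linalg.solve(np.eye(nS) - alpha_f[:, None] * P_f, c_f)
print(V_exact.round(2), round(mc_cost(0), 2))
```

The two numbers agree up to Monte Carlo error, illustrating that the state-dependent products $\Gamma_n$ in (16) play exactly the role of the usual geometric weights $\alpha^n$.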

## 3. The value iteration algorithm

In this section we give conditions on the model that imply: (i) the convergence of the value iteration algorithm; (ii) the value function V is a solution of the corresponding optimality equation; and (iii) the existence of stationary optimal policies. In order to guarantee that V(x) is finite for each initial state x we suppose the following.

Assumption 3.1. There exists $\pi_0\in\Pi$ such that $V(\pi_0,x)<\infty$ for all $x\in X$.

At the end of Section 4 we give sufficient conditions for Assumption 3.1. We also require continuity and (inf-) compactness conditions to ensure the existence of "measurable minimizers." The following definition was introduced in [13].

Definition 3.2. A function $u:K\to\mathbb{R}$ is said to be K-inf-compact on $K$ if for each compact subset $K'$ of $X$ and each $r\in\mathbb{R}$, the set

$$\{(x,a)\in\mathrm{Gr}_{K'}(A):u(x,a)\le r\}\tag{18}$$

is a compact subset of $X\times A$, where $\mathrm{Gr}_{K'}(A):=\{(x,a):x\in K',\ a\in A(x)\}$.

Assumption 3.3. (a) The one-stage cost c and the discount factor α are K-inf-compact functions on K. In addition, c is nonnegative.

(b) The transition law Q is weakly continuous, that is, the mapping

$$(x,a)\mapsto\int_X u(y)\,Q(dy\mid x,a)\tag{19}$$

is continuous for each bounded and continuous function $u$ on $X$.

For each measurable function $u$ on $X$, each $x\in X$, and each $f\in F$, we define the operators

$$Tu(x):=\inf_{a\in A(x)}\left\{c(x,a)+\alpha(x,a)\int_X u(y)\,Q(dy\mid x,a)\right\}\tag{20}$$

and

$$T_fu(x):=c(x,f)+\alpha(x,f)\int_X u(y)\,Q(dy\mid x,f).\tag{21}$$

A consequence of Assumption 3.3 is the following.

Lemma 3.4. Let $u$ be a function in $L^+(X)$. If Assumption 3.3 holds, then the function $v:K\to\mathbb{R}$ defined by

$$v(x,a):=c(x,a)+\alpha(x,a)\int_X u(y)\,Q(dy\mid x,a)\tag{22}$$

is K-inf-compact on $K$.

Proof. First note that, by the K-inf-compactness hypothesis, $c(\cdot,\cdot)$ and $\alpha(\cdot,\cdot)$ are l.s.c. on $\mathrm{Gr}_{K'}(A)$ for each compact subset $K'$ of $X$. Then, since $\alpha$ and $u$ are nonnegative functions, from Assumption 3.3 we have that $v(\cdot,\cdot)$ is l.s.c. on $\mathrm{Gr}_{K'}(A)$. Thus, for each $r\in\mathbb{R}$, the set

$$\{(x,a)\in\mathrm{Gr}_{K'}(A):v(x,a)\le r\}\tag{23}$$

is a closed subset of the compact set $\{(x,a)\in\mathrm{Gr}_{K'}(A):c(x,a)\le r\}$, since $v\ge c$. Then, $v$ is K-inf-compact on $K$. ∎

Observe that the operator $T$ is monotone in the sense that if $v\le u$, then $Tv\le Tu$. In addition, from Assumption 3.3 and ([13], Theorem 3.3), we have that $T$ maps $L^+(X)$ into itself. Furthermore, there exists $\tilde{f}\in F$ such that

$$Tu(x)=T_{\tilde{f}}u(x),\quad x\in X.\tag{24}$$

To state our first result we define the sequence $\{v_n\}\subset L^+(X)$ of value iteration functions as:

$$v_0\equiv 0;\qquad v_n(x)=Tv_{n-1}(x),\quad x\in X.\tag{25}$$

Since T is monotone, note that {vn} is a nondecreasing sequence.
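On a finite model, the operator $T$ in (20) and the recursion (25) reduce to array operations. The sketch below (with hypothetical numbers of our own, not from the chapter) iterates $v_n=Tv_{n-1}$ from $v_0\equiv 0$ and reads off a minimizing selector as in (24):

```python
import numpy as np

# Hypothetical finite model: 3 states, 2 actions, with a state-action-dependent
# discount factor alpha(x, a); all numbers are illustrative.
c = np.array([[1.0, 3.0],
              [2.0, 0.5],
              [4.0, 1.0]])                  # c[x, a]
alpha = np.array([[0.90, 0.60],
                  [0.70, 0.85],
                  [0.50, 0.95]])            # alpha[x, a]
Q = np.array([[[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
              [[0.3, 0.4, 0.3], [0.0, 0.5, 0.5]],
              [[0.5, 0.0, 0.5], [0.2, 0.2, 0.6]]])  # Q[x, a, y]

def T(v):
    """Tv(x) = min_a { c(x,a) + alpha(x,a) * sum_y Q(y|x,a) v(y) }, as in (20)."""
    return (c + alpha * (Q @ v)).min(axis=1)

# Value iteration: v_0 = 0, v_n = T v_{n-1}; the v_n increase to V.
v = np.zeros(3)
for _ in range(2000):
    v_new = T(v)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
f = (c + alpha * (Q @ v)).argmin(axis=1)   # a minimizing selector, cf. (24)
print(v.round(3), f)
```

Since every $\alpha(x,a)$ is bounded away from 1 here, $T$ is a contraction and the iterates converge geometrically; the chapter's setting is more general and relies on monotone convergence instead.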

Theorem 3.5. Suppose that Assumptions 3.1 and 3.3 hold. Then

1. $v_n\uparrow V$.

2. $V$ is the minimal solution in $L^+(X)$ of the optimality equation, i.e.,

$$V(x)=TV(x)=\inf_{a\in A(x)}\left\{c(x,a)+\alpha(x,a)\int_X V(y)\,Q(dy\mid x,a)\right\}.\tag{26}$$
3. There exists a stationary policy $f\in F$ such that $V(x)=T_fV(x)$ for all $x\in X$, that is,

$$V(x)=c(x,f)+\alpha(x,f)\int_X V(y)\,Q(dy\mid x,f),\tag{27}$$

and f is an optimal policy.

Proof. Since $\{v_n\}$ is nondecreasing, there exists $v\in L^+(X)$ such that $v_n\uparrow v$. Hence, from the Monotone Convergence Theorem, ([13], Lemmas 2.2, 2.3), and ([1], Lemma 4.2.4), we obtain, for each $x\in X$, $v_n(x)=Tv_{n-1}(x)\to Tv(x)$ as $n\to\infty$, which in turn implies

$$Tv=v.\tag{28}$$

Therefore, to get parts 1 and 2 we need to prove that $v=V$. To this end, observe that for all $x\in X$ and $\pi\in\Pi$,

$$v_n(x)\le\int_A c(x,a)\,\pi(da\mid x)+\int_A\alpha(x,a)\int_X v_{n-1}(x_1)\,Q(dx_1\mid x,a)\,\pi(da\mid x).\tag{29}$$

Then, iterating (29) we obtain

$$v_n(x)\le V_n(\pi,x),\quad n\in\mathbb{N},\tag{30}$$

where

$$V_n(\pi,x)=E_x^{\pi}\left[\sum_{t=0}^{n-1}\Gamma_t c(x_t,a_t)\right]\tag{31}$$

is the $n$-stage discounted cost. Then, letting $n\to\infty$, we get $v(x)\le V(\pi,x)$ for all $\pi\in\Pi$ and $x\in X$. Thus,

$$v(x)\le V(x),\quad x\in X.\tag{32}$$

On the other hand, from (28) and (24), let $f\in F$ be such that $v(x)=T_fv(x)$, $x\in X$. Iterating this equation, we have (see (31))

$$v(x)=E_x^{f}\left[c(x,f)+\sum_{t=1}^{n-1}\prod_{k=0}^{t-1}\alpha(x_k,f)\,c(x_t,f)\right]+E_x^{f}\left[\prod_{k=0}^{n-1}\alpha(x_k,f)\,v(x_n)\right]\ge V_n(f,x).\tag{33}$$

Hence, letting $n\to\infty$,

$$v(x)\ge V(f,x)\ge V(x),\quad x\in X.\tag{34}$$

Combining (32) and (34) we get v=V.

Now, let $u\in L^+(X)$ be an arbitrary solution of the optimality equation, that is, $Tu=u$. Then, applying the arguments in the proof of (34) with $u$ instead of $v$, we conclude that $u\ge V$. That is, $V$ is minimal in $L^+(X)$.

Part 3 follows from part 2 and ([13], Theorem 3.3). Indeed, there exists a stationary policy $f\in F$ such that $V(x)=T_fV(x)$, $x\in X$. Then, iteration of this equation yields $V(x)=V(f,x)$, which implies that $f$ is optimal. ∎

## 4. Policy iteration algorithm

Theorem 3.5 establishes an approximation of the value function $V$ by means of the sequence of value iteration functions $\{v_n\}$, which is defined recursively and increases to $V$. We now present the well-known policy iteration algorithm, which provides a decreasing approximation to $V$ built from a sequence of stationary policies.

To define the algorithm, first observe that from the Markov property (14) and properties of conditional expectation, for any stationary policy $f\in F$ and $x\in X$, the corresponding cost $V(f,x)$ satisfies

$$\begin{aligned}
V(f,x)&=c(x,f)+\alpha(x,f)\,E_x^{f}\left[\sum_{t=1}^{\infty}\prod_{k=1}^{t-1}\alpha(x_k,f)\,c(x_t,f)\right]\\
&=c(x,f)+\alpha(x,f)\int_X E^{f}\left[c(x_1,f)+\sum_{t=2}^{\infty}\prod_{k=1}^{t-1}\alpha(x_k,f)\,c(x_t,f)\;\Big|\;x_1=y\right]Q(dy\mid x,f)\\
&=c(x,f)+\alpha(x,f)\int_X V(f,y)\,Q(dy\mid x,f)=T_fV(f,\cdot)(x),\quad x\in X.
\end{aligned}\tag{35}$$

Let $f_0\in F$ be a stationary policy with a finite-valued cost, and define $w_0(\cdot):=V(f_0,\cdot)$.

Then, from (35),

$$w_0(x)=c(x,f_0)+\alpha(x,f_0)\int_X w_0(y)\,Q(dy\mid x,f_0)=T_{f_0}w_0(x),\quad x\in X.\tag{36}$$

Now, let $f_1\in F$ be such that

$$Tw_0(x)=T_{f_1}w_0(x),\tag{37}$$

and define $w_1(\cdot):=V(f_1,\cdot)$.

In general, we define a sequence $\{w_n\}$ in $L^+(X)$ as follows. Given $f_n\in F$, compute $w_n(\cdot):=V(f_n,\cdot)$.

Next, let $f_{n+1}\in F$ be such that

$$T_{f_{n+1}}w_n(x)=Tw_n(x),\quad x\in X,\tag{38}$$

that is,

$$T_{f_{n+1}}w_n(x)=c(x,f_{n+1})+\alpha(x,f_{n+1})\int_X w_n(y)\,Q(dy\mid x,f_{n+1})=\min_{a\in A(x)}\left\{c(x,a)+\alpha(x,a)\int_X w_n(y)\,Q(dy\mid x,a)\right\}=Tw_n(x),\quad x\in X.\tag{39}$$

Then we define $w_{n+1}(\cdot):=V(f_{n+1},\cdot)$.

Theorem 4.1. Under Assumptions 3.1 and 3.3, there exists a measurable nonnegative function $w$ such that $w_n\downarrow w$ and $w\ge V$. Moreover, if $w$ satisfies

$$\lim_{n\to\infty}E_x^{\pi}[\Gamma_n w(x_n)]=0\quad\text{for all }\pi\in\Pi,\ x\in X,\tag{40}$$

then $w=V$.

To prove Theorem 4.1 we need the following result.

Lemma 4.2. Under Assumption 3.3, if $u:X\to\mathbb{R}$ is a measurable function such that $Tu$ is well defined, $u\le Tu$, and

$$\lim_{n\to\infty}E_x^{\pi}[\Gamma_n u(x_n)]=0\quad\text{for all }\pi\in\Pi,\ x\in X,\tag{41}$$

then $u\le V$.

Proof. From the Markov property (14), for each $\pi\in\Pi$ and $x\in X$,

$$\begin{aligned}
E_x^{\pi}[\Gamma_{n+1}u(x_{n+1})\mid h_n,a_n]&=\Gamma_{n+1}\int_X u(y)\,Q(dy\mid x_n,a_n)\\
&=\Gamma_n\left[c(x_n,a_n)+\alpha(x_n,a_n)\int_X u(y)\,Q(dy\mid x_n,a_n)-c(x_n,a_n)\right]\\
&\ge\Gamma_n\inf_{a\in A(x_n)}\left[c(x_n,a)+\alpha(x_n,a)\int_X u(y)\,Q(dy\mid x_n,a)\right]-\Gamma_n c(x_n,a_n)\\
&=\Gamma_n Tu(x_n)-\Gamma_n c(x_n,a_n)\ge\Gamma_n u(x_n)-\Gamma_n c(x_n,a_n),
\end{aligned}\tag{42–45}$$

which, in turn implies

$$\Gamma_n c(x_n,a_n)\ge E_x^{\pi}[\Gamma_n u(x_n)-\Gamma_{n+1}u(x_{n+1})\mid h_n,a_n].\tag{46}$$

Therefore, for all $k\in\mathbb{N}$ (see (31)),

$$V_k(\pi,x)=E_x^{\pi}\sum_{n=0}^{k-1}\Gamma_n c(x_n,a_n)\ge u(x)-E_x^{\pi}[\Gamma_k u(x_k)].\tag{47}$$

Finally, letting $k\to\infty$, (41) yields $V(\pi,x)\ge u(x)$, and since $\pi$ is arbitrary, we obtain $V(x)\ge u(x)$. ∎

Proof of Theorem 4.1. According to Lemma 4.2, it is sufficient to show the existence of a measurable nonnegative function $w$ such that $w_n\downarrow w$ and $w\le Tw$. To this end, from (36)–(38),

$$w_0(x)\ge\min_{a\in A(x)}\left\{c(x,a)+\alpha(x,a)\int_X w_0(y)\,Q(dy\mid x,a)\right\}=T_{f_1}w_0(x)=c(x,f_1)+\alpha(x,f_1)\int_X w_0(y)\,Q(dy\mid x,f_1).\tag{48}$$

Iterating this inequality, a straightforward calculation as in (34) shows that

$$w_0(x)\ge V(f_1,x)=w_1(x),\quad x\in X.\tag{49}$$

In general, similar arguments yield

$$w_n\ge Tw_n\ge w_{n+1},\quad n\in\mathbb{N}.\tag{50}$$

Therefore, there exists a nonnegative measurable function $w$ such that $w_n\downarrow w$. Moreover, since $w_n(\cdot)=V(f_n,\cdot)\ge V(\cdot)$ for all $n\in\mathbb{N}_0$, we have $w\ge V$. Next, letting $n\to\infty$ in (50) and applying ([17], Lemma 3.3), we obtain $Tw=w$, which implies $w\le Tw$. Hence, if $w$ satisfies (40), Lemma 4.2 yields $w\le V$, and therefore $w=V$. ∎
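On a finite model, the policy iteration scheme (36)–(39) alternates exact policy evaluation, a linear solve by (35), with policy improvement as in (38). A minimal sketch with illustrative numbers of our own, not from the chapter:

```python
import numpy as np

# Hypothetical 3-state, 2-action model with state-action-dependent discounting.
c = np.array([[1.0, 3.0], [2.0, 0.5], [4.0, 1.0]])          # c[x, a]
alpha = np.array([[0.90, 0.60], [0.70, 0.85], [0.50, 0.95]])  # alpha[x, a]
Q = np.array([[[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
              [[0.3, 0.4, 0.3], [0.0, 0.5, 0.5]],
              [[0.5, 0.0, 0.5], [0.2, 0.2, 0.6]]])            # Q[x, a, y]
nS = 3

def evaluate(f):
    """Policy evaluation: w = V(f, .) solves w = c_f + diag(alpha_f) P_f w, cf. (35)."""
    idx = np.arange(nS)
    c_f, a_f, P_f = c[idx, f], alpha[idx, f], Q[idx, f]
    return np.linalg.solve(np.eye(nS) - a_f[:, None] * P_f, c_f)

def improve(w):
    """Policy improvement: f_{n+1}(x) attains the minimum in (38)-(39)."""
    return (c + alpha * (Q @ w)).argmin(axis=1)

f = np.zeros(nS, dtype=int)        # f_0: an arbitrary stationary policy
for _ in range(50):
    w = evaluate(f)                # w_n = V(f_n, .)
    f_new = improve(w)             # f_{n+1}
    if np.array_equal(f_new, f):   # no further improvement: w_{n+1} = w_n
        break
    f = f_new
print(f, w.round(3))
```

Each evaluation $w_n$ dominates the next, mirroring the decreasing sequence (50); on a finite model the loop terminates at a policy whose cost solves the optimality equation (26).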

### 4.1. Sufficient conditions for Assumption 3.1 and (40)

An obvious sufficient condition for Assumption 3.1 and (40) is the following:

C1 (a) There exists $\bar\alpha\in(0,1)$ such that $\alpha(x,a)<\bar\alpha$ for all $(x,a)\in K$.

(b) For some constant $m$, $0\le c(x,a)\le m$ for all $(x,a)\in K$.

Indeed, under condition C1, $V(\pi,x)\le m/(1-\bar\alpha)$ for all $x\in X$ and $\pi\in\Pi$, and $\{w_n\}$ is a bounded sequence, which in turn implies (since $w_n\downarrow w$) the boundedness of the function $w$. This fact clearly yields (40).

Other less obvious sufficient conditions are the following (see, e.g., [15, 16, 2]).

C2 (a) Condition C1 (a).

(b) There exist a measurable function $W:X\to(1,\infty)$ and constants $M>0$ and $\beta\in(1,1/\bar\alpha)$ such that for all $(x,a)\in K$,

$$\sup_{a\in A(x)}c(x,a)\le M\,W(x)\tag{51}$$

and

$$\int_X W(y)\,Q(dy\mid x,a)\le\beta W(x).\tag{52}$$

First note that by condition C2 and the Markov property (14), for any policy $\pi\in\Pi$ and initial state $x_0=x\in X$,

$$E_x^{\pi}[W(x_{n+1})\mid h_n,a_n]=\int_X W(y)\,Q(dy\mid x_n,a_n)\le\beta W(x_n),\quad n\in\mathbb{N}_0.\tag{53}$$

Then, using properties of conditional expectation,

$$E_x^{\pi}[W(x_{n+1})]\le\beta E_x^{\pi}[W(x_n)],\quad n\in\mathbb{N}_0.\tag{54}$$

Iterating inequality (54) we get

$$E_x^{\pi}[W(x_n)]\le\beta^n W(x),\quad n\in\mathbb{N}_0.\tag{55}$$

Therefore, by condition C2, for any policy πΠ and xX,

$$V(\pi,x)\le E_x^{\pi}\sum_{n=0}^{\infty}\bar\alpha^{\,n}c(x_n,a_n)\le\sum_{n=0}^{\infty}M\bar\alpha^{\,n}E_x^{\pi}W(x_n)\le\frac{M}{1-\bar\alpha\beta}\,W(x).\tag{56}$$

Thus, Assumption 3.1 holds.
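The drift condition (52) can be checked numerically for concrete dynamics. The example below is ours, not the chapter's: it takes $x_{n+1}=0.5x_n+\xi_n$ with standard normal noise and $W(x)=|x|+2$; then $\int W(y)\,Q(dy\mid x,a)\le 0.5|x|+E|\xi|+2$, so $\beta=1.45$ satisfies (52), and C2 holds whenever $\bar\alpha<1/1.45$:

```python
import numpy as np

# Illustrative check of drift condition (52); this example is hypothetical.
# Dynamics: x_{n+1} = 0.5 x_n + xi_n with xi_n ~ N(0, 1); weight W(x) = |x| + 2.
# Analytically: E|0.5 x + xi| + 2 <= 0.5|x| + sqrt(2/pi) + 2 <= 1.45 (|x| + 2).
rng = np.random.default_rng(0)
W = lambda x: np.abs(x) + 2.0
beta = 1.45

for x in [-5.0, -1.0, 0.0, 0.5, 3.0]:
    xi = rng.normal(size=200_000)
    lhs = W(0.5 * x + xi).mean()   # Monte Carlo estimate of int W(y) Q(dy | x)
    assert lhs <= beta * W(x)      # drift inequality (52) holds at this x
    print(x, round(lhs, 3), round(beta * W(x), 3))
```

The tightest case is $x=0$, where the left side is $\sqrt{2/\pi}+2\approx 2.80$ against $\beta W(0)=2.9$; larger $|x|$ only increases the slack because the drift coefficient $0.5$ is well below $\beta$.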

On the other hand, if $L_W^+(X)$ denotes the subclass of all functions $u$ in $L^+(X)$ such that

$$\|u\|_W:=\sup_{x\in X}\frac{u(x)}{W(x)}<\infty,\tag{57}$$

then, because $w_k(\cdot)=V(f_k,\cdot)$, from (56) and condition C2 we have that $w_k\in L_W^+(X)$ for all $k=1,2,\dots$ and

$$\lim_{n\to\infty}E_x^{\pi}[\Gamma_n w_k(x_n)]=0\quad\text{for all }\pi\in\Pi,\ x\in X.\tag{58}$$

Since $w\le w_k$, (40) follows from (58).

## Acknowledgments

Work supported partially by Consejo Nacional de Ciencia y Tecnología (CONACYT) under grant CB2015/254306.

## References

1. O. Hernández-Lerma, J.B. Lasserre, Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer-Verlag, New York, NY, 1996.
2. O. Hernández-Lerma, J.B. Lasserre, Further Topics on Discrete-Time Markov Control Processes. Springer-Verlag, New York, NY, 1999.
3. M.L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, NY, 1994.
4. Y. Carmon, A. Shwartz, Markov decision processes with exponentially representable discounting. Oper. Res. Lett. 37 (2009), 51–55.
5. E.A. Feinberg, A. Shwartz, Constrained dynamic programming with two discount factors: applications and an algorithm. IEEE Trans. Autom. Control 44 (1999), 628–631.
6. K. Hinderer, Foundations of Non-Stationary Dynamic Programming with Discrete Time Parameter. Lecture Notes Oper. Res. 33, Springer, New York, NY, 1970.
7. M. Schäl, Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal. Z. Wahr. Verw. Geb. 32 (1975), 179–196.
8. Q. Wei, X. Guo, Markov decision processes with state-dependent discount factors and unbounded rewards/costs. Oper. Res. Lett. 39 (2011), 369–374.
9. J. González-Hernández, R.R. López-Martínez, R. Pérez-Hernández, Markov control processes with randomized discounted cost in Borel space. Math. Meth. Oper. Res. 65 (2007), 27–44.
10. J. González-Hernández, R.R. López-Martínez, J.A. Minjárez-Sosa, Adaptive policies for stochastic systems under a randomized discounted criterion. Bol. Soc. Mat. Mex. 14 (2008), 149–163.
11. J. González-Hernández, R.R. López-Martínez, J.A. Minjárez-Sosa, Approximation, estimation and control of stochastic systems under a randomized discounted cost criterion. Kybernetika 45 (2009), 737–754.
12. J. González-Hernández, R.R. López-Martínez, J.A. Minjárez-Sosa, J.R. Gabriel-Arguelles, Constrained Markov control processes with randomized discounted cost criteria: occupation measures and extremal points. Risk Decis. Anal. 4 (2013), 163–176.
13. E.A. Feinberg, P.O. Kasyanov, N.V. Zadoianchuk, Berge's theorem for noncompact image sets. J. Math. Anal. Appl. 397 (2013), 255–259.
14. E.B. Dynkin, A.A. Yushkevich, Controlled Markov Processes. Springer-Verlag, New York, NY, 1979.
15. E.I. Gordienko, O. Hernández-Lerma, Average cost Markov control processes with weighted norms: existence of canonical policies. Appl. Math. (Warsaw) 23 (1995), 199–218.
16. E.I. Gordienko, O. Hernández-Lerma, Average cost Markov control processes with weighted norms: value iteration. Appl. Math. (Warsaw) 23 (1995), 219–237.
17. O. Hernández-Lerma, W. Runggaldier, Monotone approximations for convex stochastic control problems. J. Math. Syst. Estim. Control 4 (1994), 99–140.
