Open access peer-reviewed chapter

Iteration Algorithms in Markov Decision Processes with State-Action-Dependent Discount Factors and Unbounded Costs

By Fernando Luque-Vásquez and J. Adolfo Minjárez-Sosa

Submitted: April 18th, 2016. Reviewed: July 28th, 2016. Published: December 28th, 2016.

DOI: 10.5772/65044



This chapter concerns discrete time Markov decision processes under a discounted optimality criterion with state-action-dependent discount factors, possibly unbounded costs, and noncompact admissible action sets. Under mild conditions, we show the existence of stationary optimal policies and we introduce the value iteration and the policy iteration algorithms to approximate the value function.


Keywords:

  • discounted optimality
  • non-constant discount factor
  • value iteration
  • policy iteration
  • Markov decision processes

AMS 2010 subject classifications: 93E10, 90C40

1. Introduction

In this chapter we study Markov decision processes (MDPs) with Borel state and action spaces under a discounted criterion with state-action-dependent discount factors, possibly unbounded costs, and noncompact admissible action sets. That is, we consider discount factors of the form

α(x_n, a_n) ∈ (0, 1),

where x_n and a_n are the state and the action at time n, respectively, playing the following role during the evolution of the system. At the initial state x_0, the controller chooses an action a_0 and a cost c(x_0, a_0) is incurred. Then the system moves to a new state x_1 according to a transition law. Once the system is in state x_1, the controller selects an action a_1 and incurs a discounted cost α(x_0, a_0)c(x_1, a_1). Next the system moves to a state x_2 and the process is repeated. In general, at stage n ≥ 1, the controller incurs the discounted cost

α(x_0, a_0)α(x_1, a_1)⋯α(x_{n−1}, a_{n−1})c(x_n, a_n),
and our objective is to show the existence of stationary optimal control policies under the corresponding performance index, as well as to introduce approximation algorithms, namely, value iteration and policy iteration.

In the scenario of a constant discount factor, the discounted optimality criterion in stochastic decision problems is the best understood of all performance indices, and it is widely accepted in several application problems (see, e.g., [1–3] and references therein). However, such an assumption might be strong or unrealistic in some economic and financial models. Indeed, in these problems the discount factors are typically functions of the interest rates, which in turn depend on the amount of currency and the decision-maker's actions. Hence, we have state-action-dependent discount factors, and it is precisely these kinds of situations we deal with.

MDPs with nonconstant discount factors have been studied under different approaches (see, e.g., [4–8]). In particular, our work is a sequel to [8], where the control problem with a state-dependent discount factor is studied. In addition, randomized discounted criteria have been analyzed in [9–12], where the discount factor is modeled as a stochastic process independent of the state-action pairs.

Specifically, in this chapter we study control models with state-action-dependent discount factors, focusing mainly on introducing approximation algorithms for the optimal value function (value iteration and policy iteration). Furthermore, an important feature of this work is that we impose neither a compactness assumption on the sets of admissible actions nor continuity conditions on the cost, which, in most papers on MDPs, are needed to show the existence of measurable selectors and the continuity or semicontinuity of the minimum function. Indeed, in contrast to the previously cited references, in this work we assume that the cost and discount factor functions satisfy the K-inf-compactness condition introduced in [13]. Then, we use a generalization of Berge's theorem, given in [13], to prove the existence of measurable selectors. To the best of our knowledge, there are no works dealing with MDPs in the context presented in this chapter.

The remainder of the chapter is organized as follows. Section 2 contains the description of the Markov decision model and the optimality criterion. In Section 3 we introduce the assumptions on the model and we prove the convergence of the value iteration algorithm (Theorem 3.5). In Section 4 we define the policy iteration algorithm and the convergence is stated in Theorem 4.1.

Notation. Throughout the chapter we use the following notation. Given a Borel space S (that is, a Borel subset of a complete separable metric space), B(S) denotes its Borel σ-algebra, and "measurability" always means measurability with respect to B(S). Given two Borel spaces S and S′, a stochastic kernel φ(·|·) on S given S′ is a function such that φ(·|s′) is a probability measure on S for each s′ ∈ S′, and φ(B|·) is a measurable function on S′ for each B ∈ B(S). Moreover, N (respectively, N_0) denotes the set of positive (respectively, nonnegative) integers. Finally, L(S) stands for the class of lower semicontinuous functions on S bounded below, and L⁺(S) denotes the subclass of nonnegative functions in L(S).


2. Markov decision processes

Markov control model. Let

M := (X, A, {A(x) : x ∈ X}, Q, α, c)

be a discrete-time Markov control model with state-action-dependent discount factors satisfying the following conditions. The state space X and the action (or control) space A are Borel spaces. For each state x ∈ X, A(x) is a nonempty Borel subset of A denoting the set of admissible controls when the system is in state x. We denote by K the graph of the multifunction x ↦ A(x), that is,

K := {(x, a) : x ∈ X, a ∈ A(x)},

which is assumed to be a Borel subset of the Cartesian product of X and A. The transition law Q(·|·) is a stochastic kernel on X given K. Finally, α : K → (0, 1) and c : K → [0, ∞) are measurable functions representing the discount factor and the cost-per-stage, respectively, when the system is in state x ∈ X and the action a ∈ A(x) is selected.

The model M represents a controlled stochastic system with the following interpretation. Suppose that at time n ∈ N_0 the system is in state x_n = x ∈ X. Then, possibly taking into account the history of the system, the controller selects an action a_n = a ∈ A(x), and a discount factor α(x, a) is imposed. As a consequence, the following happens:

  1. A cost c(x, a) is incurred;

  2. The system moves to a new state x_{n+1} = x′ ∈ X according to the transition law

Q(B|x, a) = Prob(x_{n+1} ∈ B | x_n = x, a_n = a), B ∈ B(X).

Once the transition to state x′ occurs, the process is repeated.

Typically, in many applications, the evolution of the system is determined by stochastic difference equations of the form

x_{n+1} = F(x_n, a_n, ξ_n), n ∈ N_0,

where {ξ_n} is a sequence of independent and identically distributed random variables with values in some Borel space S, independent of the initial state x_0, and F : X × A × S → X is a given measurable function. In this case, if θ denotes the common distribution of the ξ_n, that is,

θ(B′) := Prob(ξ_n ∈ B′), B′ ∈ B(S),

then the transition kernel can be written as

Q(B|x, a) = ∫_S 1_B(F(x, a, s)) θ(ds), (x, a) ∈ K, B ∈ B(X),

where 1_B(·) represents the indicator function of the set B.
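To make this setup concrete, the following sketch simulates such a system and accumulates the discounted cost Σ_n Γ_n c(x_n, a_n), updating Γ_n multiplicatively as in the introduction. The dynamics F, the cost c, the discount factor α, and the policy are all invented for illustration; they are not from the chapter.

```python
import random

def F(x, a, xi):
    """Hypothetical transition function: next state from state, action, noise."""
    return 0.7 * x + a + xi

def cost(x, a):
    """Hypothetical cost-per-stage c(x, a) >= 0."""
    return x * x + a * a

def alpha(x, a):
    """Hypothetical state-action-dependent discount factor in (0, 1)."""
    return 0.6 + 0.3 / (1.0 + abs(x) + abs(a))

def simulate(policy, x0, horizon, rng):
    """Accumulate the n-stage discounted cost sum_k Gamma_k c(x_k, a_k),
    with Gamma_0 = 1 and Gamma_{k+1} = Gamma_k * alpha(x_k, a_k)."""
    x, gamma, total = x0, 1.0, 0.0
    for _ in range(horizon):
        a = policy(x)
        total += gamma * cost(x, a)
        gamma *= alpha(x, a)          # multiplicative discounting
        x = F(x, a, rng.gauss(0.0, 0.1))
    return total

rng = random.Random(0)
v = simulate(lambda x: -0.5 * x, x0=1.0, horizon=200, rng=rng)
print(round(v, 3))
```

Averaging `simulate` over many independent noise realizations would give a Monte Carlo estimate of V(π, x) for the chosen stationary policy.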

Control policies. The actions applied by the controller are chosen by means of rules known as control policies, defined as follows. Let H_0 := X and H_n := K^n × X, n ≥ 1, be the spaces of admissible histories up to time n. A generic element of H_n is written as h_n = (x_0, a_0, ..., x_{n−1}, a_{n−1}, x_n).

Definition 2.1. A control policy (randomized, history-dependent) is a sequence π = {π_n} of stochastic kernels π_n on A given H_n such that π_n(A(x_n)|h_n) = 1 for all h_n ∈ H_n, n ∈ N_0.

We denote by Π the set of all control policies.

Let F be the set of measurable selectors, that is, the set of measurable functions f : X → A such that f(x) ∈ A(x) for all x ∈ X.

Definition 2.2. A control policy π = {π_n} is said to be:

a. deterministic if there exists a sequence {g_n} of measurable functions g_n : H_n → A such that

π_n({g_n(h_n)}|h_n) = 1 for all h_n ∈ H_n, n ∈ N_0;

b. a Markov control policy if there exists a sequence {f_n} of functions f_n ∈ F such that

π_n({f_n(x_n)}|h_n) = 1 for all h_n ∈ H_n, n ∈ N_0.

In addition:

c. A Markov control policy is stationary if there exists f ∈ F such that f_n = f for all n ∈ N_0.

If necessary, see, for example, [13, 14–16] for further information on these policies.

Observe that a Markov policy π is identified with the sequence {f_n}, and we write π = {f_n}. In this case, the control applied at time n is a_n = f_n(x_n) ∈ A(x_n). In particular, a stationary policy is identified with a function f ∈ F, and following a standard convention we denote by F the set of all stationary control policies.

To ease the notation, for each f ∈ F and x ∈ X, we write

c(x, f) := c(x, f(x)), α(x, f) := α(x, f(x)), Q(·|x, f) := Q(·|x, f(x)).

The underlying probability space. Let (Ω, F) be the canonical measurable space consisting of the sample space Ω = K^∞ := K × K × ⋯ and its product σ-algebra F. Then, under standard arguments (see, e.g., [1, 14]), for each π ∈ Π and initial state x ∈ X, there exists a probability measure P_x^π on (Ω, F) such that, for all h_n ∈ H_n, a_n ∈ A(x_n), n ∈ N_0, C ∈ B(A), and B ∈ B(X),

P_x^π(a_n ∈ C | h_n) = π_n(C|h_n),

and the Markov-like property is satisfied:

P_x^π(x_{n+1} ∈ B | h_n, a_n) = Q(B|x_n, a_n).   (14)

The stochastic process (Ω, F, P_x^π, {x_n}) is called a Markov decision process.

Optimality criterion. We assume that the costs are discounted at a multiplicative rate. That is, a cost C incurred at stage n is equivalent to a cost CΓ_n at time 0, where

Γ_0 := 1 and Γ_n := ∏_{k=0}^{n−1} α(x_k, a_k), n ∈ N.

In this sense, when using a policy π ∈ Π, given the initial state x_0 = x, we define the total expected discounted cost (with state-action-dependent discount factors) as

V(π, x) := E_x^π[ Σ_{n=0}^{∞} Γ_n c(x_n, a_n) ],

where E_x^π denotes the expectation operator with respect to the probability measure P_x^π induced by the policy π, given x_0 = x.

The optimal control problem associated with the control model M is then to find an optimal policy π* ∈ Π such that V(π*, x) = V*(x) for all x ∈ X, where

V*(x) := inf_{π∈Π} V(π, x), x ∈ X,

is the optimal value function (see [10]).

3. The value iteration algorithm

In this section we give conditions on the model that imply: (i) the convergence of the value iteration algorithm; (ii) that the value function V* is a solution of the corresponding optimality equation; and (iii) the existence of stationary optimal policies. In order to guarantee that V*(x) is finite for each initial state x, we suppose the following.

Assumption 3.1. There exists π_0 ∈ Π such that V(π_0, x) < ∞ for all x ∈ X.

At the end of Section 4 we give sufficient conditions for Assumption 3.1. We also require continuity and (inf-) compactness conditions to ensure the existence of "measurable minimizers." The following definition was introduced in [13].

Definition 3.2. A function u : K → R is said to be K-inf-compact on K if, for each compact subset K̄ of X and r ∈ R, the set

{(x, a) ∈ Gr_{K̄}(A) : u(x, a) ≤ r}

is a compact subset of X × A, where Gr_{K̄}(A) := {(x, a) : x ∈ K̄, a ∈ A(x)}.

Assumption 3.3. (a) The one-stage cost c and the discount factor α are K-inf-compact functions on K. In addition, c is nonnegative.

(b) The transition law Q is weakly continuous; that is, the mapping

(x, a) ↦ ∫_X u(y) Q(dy|x, a)

is continuous on K for each bounded and continuous function u on X.

For each measurable function u on X, x ∈ X, and f ∈ F, we define the operators

Tu(x) := inf_{a∈A(x)} [ c(x, a) + α(x, a) ∫_X u(y) Q(dy|x, a) ],

T_f u(x) := c(x, f) + α(x, f) ∫_X u(y) Q(dy|x, f).
A consequence of Assumption 3.3 is the following.

Lemma 3.4. Let u be a function in L⁺(X). If Assumption 3.3 holds, then the function v : K → R defined by

v(x, a) := c(x, a) + α(x, a) ∫_X u(y) Q(dy|x, a), (x, a) ∈ K,

is K-inf-compact on K.

Proof. First note that, by the K-inf-compactness hypothesis, c(·,·) and α(·,·) are l.s.c. on Gr_{K̄}(A) for each compact subset K̄ of X. Then, since α and u are nonnegative functions, from Assumption 3.3 we have that v(·,·) is l.s.c. on Gr_{K̄}(A). Thus, for each r ∈ R, the set

{(x, a) ∈ Gr_{K̄}(A) : v(x, a) ≤ r}

is a closed subset of the compact set {(x, a) ∈ Gr_{K̄}(A) : c(x, a) ≤ r}. Then, v is K-inf-compact on K.

Observe that the operator T is monotone in the sense that if v ≤ u then Tv ≤ Tu. In addition, from Assumption 3.3 and ([13], Theorem 3.3), we have that T maps L⁺(X) into itself. Furthermore, for each u ∈ L⁺(X) there exists f̃ ∈ F such that

Tu(x) = c(x, f̃) + α(x, f̃) ∫_X u(y) Q(dy|x, f̃) = T_{f̃} u(x), x ∈ X.   (24)
To state our first result we define the sequence {v_n} ⊂ L⁺(X) of value iteration functions as

v_0 := 0 and v_n := Tv_{n−1}, n ∈ N.

Since T is monotone, note that {v_n} is a nondecreasing sequence.
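For intuition, the value iteration scheme v_0 = 0, v_n = Tv_{n−1} can be sketched on a finite model, where the integral in T becomes a finite sum. The two-state, two-action data below (costs, discount factors, transition kernel) are invented for illustration and are not from the chapter:

```python
import numpy as np

# Invented finite model: c(x,a) >= 0, alpha(x,a) in (0,1), Q(y|x,a).
c = np.array([[1.0, 2.0], [0.5, 3.0]])
alpha = np.array([[0.8, 0.6], [0.7, 0.5]])
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])

def bellman(v):
    """Apply T: (Tv)(x) = min_a [ c(x,a) + alpha(x,a) * sum_y Q(y|x,a) v(y) ]."""
    q = c + alpha * (Q @ v)      # (Q @ v)[x, a] = sum_y Q[x, a, y] * v[y]
    return q.min(axis=1), q.argmin(axis=1)

v = np.zeros(2)                  # v_0 = 0
for n in range(500):             # v_n increases monotonically
    v_new, policy = bellman(v)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print(v, policy)
```

Because the costs are nonnegative and the discount factors in this toy model are bounded above by 0.8, the iterates increase to a fixed point of T, in line with Theorem 3.5 below.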

Theorem 3.5. Suppose that Assumptions 3.1 and 3.3 hold. Then:

  1. v_n ↑ V*.

  2. V* is the minimal solution in L⁺(X) of the optimality equation, i.e., V*(x) = TV*(x), x ∈ X.

  3. There exists a stationary policy f* ∈ F such that, for all x ∈ X, V*(x) = T_{f*} V*(x), that is,

V*(x) = c(x, f*) + α(x, f*) ∫_X V*(y) Q(dy|x, f*),

and f* is an optimal policy.

Proof. Since {v_n} is nondecreasing, there exists a function v ∈ L⁺(X) such that v_n ↑ v. Hence, from the Monotone Convergence Theorem, ([13], Lemmas 2.2, 2.3), and ([1], Lemma 4.2.4), we obtain, for each x ∈ X, v_n(x) = Tv_{n−1}(x) → Tv(x) as n → ∞, which in turn implies

v(x) = Tv(x), x ∈ X.   (28)
Therefore, to get (a)-(b) we need to prove that v = V*. To this end, observe that, for all x ∈ X, π ∈ Π, and n ∈ N,

v_n(x) = Tv_{n−1}(x) ≤ E_x^π[ c(x_0, a_0) + α(x_0, a_0) v_{n−1}(x_1) ].   (29)

Then, iterating (29) (recall that v_0 = 0) we obtain

v_n(x) ≤ V_n(π, x),   (30)

where

V_n(π, x) := E_x^π[ Σ_{k=0}^{n−1} Γ_k c(x_k, a_k) ]   (31)

is the n-stage discounted cost. Then, letting n → ∞ we get v(x) ≤ V(π, x) for all π ∈ Π and x ∈ X. Thus,

v(x) ≤ V*(x), x ∈ X.   (32)
On the other hand, from (28) and (24), let f ∈ F be such that v(x) = T_f v(x), x ∈ X. Iterating this equation, we have (see (31))

v(x) = V_n(f, x) + E_x^f[ Γ_n v(x_n) ] ≥ V_n(f, x), x ∈ X, n ∈ N.   (33)

Hence, letting n → ∞,

v(x) ≥ V(f, x) ≥ V*(x), x ∈ X.   (34)

Combining (32) and (34) we get v = V*.

Now, let u ∈ L⁺(X) be an arbitrary solution of the optimality equation, that is, Tu = u. Then, applying the arguments in the proof of (34) with u instead of v, we conclude that u ≥ V*. That is, V* is minimal in L⁺(X).

Part (c) follows from (b) and ([13], Theorem 3.3). Indeed, there exists a stationary policy f* ∈ F such that V*(x) = T_{f*} V*(x), x ∈ X. Then, iteration of this equation yields V*(x) = V(f*, x), which implies that f* is optimal.

4. Policy iteration algorithm

Theorem 3.5 establishes an approximation algorithm for the value function V* by means of the sequence of value iteration functions {v_n}. In this case the sequence {v_n} increases to V* and is defined recursively. Now we present the well-known policy iteration algorithm, which provides a decreasing approximation to V* within the set of control policies.

To define the algorithm, first observe that, from the Markov property (14) and properties of conditional expectation, for any stationary policy f ∈ F and x ∈ X, the corresponding cost V(f, x) satisfies

V(f, x) = c(x, f) + α(x, f) ∫_X V(f, y) Q(dy|x, f).   (35)
Let f_0 ∈ F be a stationary policy with a finite-valued cost w_0(x) := V(f_0, x), x ∈ X. Then, from (35),

w_0(x) = c(x, f_0) + α(x, f_0) ∫_X w_0(y) Q(dy|x, f_0), x ∈ X.   (36)

Now, let f_1 ∈ F be such that

T w_0(x) = T_{f_1} w_0(x), x ∈ X,   (37)

and define w_1(x) := V(f_1, x), x ∈ X.
In general, we define a sequence {w_n} in L⁺(X) as follows. Given f_n ∈ F, compute w_n(x) := V(f_n, x), x ∈ X. Next, let f_{n+1} ∈ F be such that

T w_n(x) = T_{f_{n+1}} w_n(x), x ∈ X,   (38)

that is,

c(x, f_{n+1}) + α(x, f_{n+1}) ∫_X w_n(y) Q(dy|x, f_{n+1}) = inf_{a∈A(x)} [ c(x, a) + α(x, a) ∫_X w_n(y) Q(dy|x, a) ].

Then we define w_{n+1}(x) := V(f_{n+1}, x), x ∈ X.
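On a finite model the scheme above reduces to alternating a policy evaluation step, solving the linear system (35) exactly, with a policy improvement step through T. The following sketch uses an invented two-state, two-action model (all data are for illustration only):

```python
import numpy as np

# Invented finite model: c(x,a) >= 0, alpha(x,a) in (0,1), Q(y|x,a).
c = np.array([[1.0, 2.0], [0.5, 3.0]])
alpha = np.array([[0.8, 0.6], [0.7, 0.5]])
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])
n_states = c.shape[0]

def evaluate(f):
    """Policy evaluation: solve w(x) = c(x,f) + alpha(x,f) sum_y Q(y|x,f) w(y)."""
    idx = np.arange(n_states)
    c_f = c[idx, f]
    A_f = alpha[idx, f, None] * Q[idx, f]      # alpha(x,f) * Q(y|x,f)
    return np.linalg.solve(np.eye(n_states) - A_f, c_f)

def improve(w):
    """Policy improvement: choose f_{n+1}(x) attaining the minimum in T w(x)."""
    return (c + alpha * (Q @ w)).argmin(axis=1)

f = np.zeros(n_states, dtype=int)              # f_0: always take action 0
while True:
    w = evaluate(f)                            # w_n = V(f_n, .)
    f_new = improve(w)                         # greedy step through T
    if np.array_equal(f_new, f):
        break                                  # no improvement: w = T w
    f = f_new
print(f, w)
```

At termination f_{n+1} = f_n, so w satisfies w = Tw; for this toy model the costs are bounded and the discount factors are at most 0.8, so the limit is the optimal value function.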

Theorem 4.1. Under Assumptions 3.1 and 3.3, there exists a measurable nonnegative function w such that w_n ↓ w and w ≥ V*. Moreover, if w satisfies

lim_{n→∞} E_x^π[ Γ_n w(x_n) ] = 0 for all π ∈ Π and x ∈ X,   (40)

then w = V*.
To prove Theorem 4.1 we need the following result.

Lemma 4.2. Under Assumption 3.3, if u : X → R is a measurable function such that Tu is well defined, u ≤ Tu, and u satisfies (40), then u ≤ V*.

Proof. From the Markov property (14), for each π ∈ Π and x ∈ X,

E_x^π[ Γ_{n+1} u(x_{n+1}) | h_n, a_n ] = Γ_n α(x_n, a_n) ∫_X u(y) Q(dy|x_n, a_n),

which, since u ≤ Tu, in turn implies

Γ_n u(x_n) ≤ Γ_n c(x_n, a_n) + E_x^π[ Γ_{n+1} u(x_{n+1}) | h_n, a_n ].

Therefore, for all k ∈ N (see (31)),

u(x) ≤ V_k(π, x) + E_x^π[ Γ_k u(x_k) ].   (41)

Finally, letting k → ∞, (40) and (41) yield u(x) ≤ V(π, x), and since π is arbitrary we obtain u(x) ≤ V*(x).

Proof of Theorem 4.1. Since w_n(·) = V(f_n, ·) ≥ V*(·) for all n ∈ N_0, any pointwise limit of {w_n} is bounded below by V*. Hence, according to Lemma 4.2, it is sufficient to show the existence of a measurable nonnegative function w such that w_n ↓ w and

w(x) ≤ Tw(x), x ∈ X.

To this end, from (36)–(38),

Tw_0(x) = T_{f_1} w_0(x) ≤ T_{f_0} w_0(x) = w_0(x), x ∈ X.

Iterating this inequality, a straightforward calculation as in (34) shows that

w_1(x) = V(f_1, x) ≤ Tw_0(x) ≤ w_0(x), x ∈ X.

In general, similar arguments yield

w_{n+1}(x) ≤ Tw_n(x) ≤ w_n(x), x ∈ X, n ∈ N_0.   (47)

Therefore, there exists a nonnegative measurable function w such that w_n(x) ↓ w(x) for all x ∈ X. In addition, since w ≤ w_n, n ∈ N_0, we have Tw ≤ Tw_n, n ∈ N_0. Next, letting n → ∞ in (47) and applying ([17], Lemma 3.3), we obtain

w(x) ≤ Tw(x), x ∈ X,

which implies, by Lemma 4.2, w ≤ V* whenever w satisfies (40); hence, in that case, w = V*.

4.1. Sufficient conditions for Assumption 3.1 and (40)

An obvious sufficient condition for Assumption 3.1 and (40) is the following:

C1. (a) There exists ᾱ ∈ (0, 1) such that α(x, a) < ᾱ for all (x, a) ∈ K.

(b) For some constant m, 0 ≤ c(x, a) ≤ m for all (x, a) ∈ K.

Indeed, under condition C1, V(π, x) ≤ m/(1 − ᾱ) for all x ∈ X and π ∈ Π, and {w_n} is a bounded sequence, which in turn implies (since w ≤ w_n) the boundedness of the function w. This fact clearly yields (40), because Γ_n < ᾱ^n → 0.
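The bound in condition C1 is a plain geometric-series estimate: Γ_n < ᾱ^n, so the discounted cost is dominated by m Σ_n ᾱ^n. A few lines can check this numerically (the values ᾱ = 0.9 and m = 2 are arbitrary choices for the sketch):

```python
# Numerical check of the C1 bound V(pi, x) <= m / (1 - alpha_bar):
# even charging the maximal cost m at every stage and discounting as
# slowly as C1 allows, the accumulated cost stays below the bound.
alpha_bar, m = 0.9, 2.0
bound = m / (1.0 - alpha_bar)

gamma, total = 1.0, 0.0
for n in range(10_000):
    total += gamma * m            # worst-case stage cost
    gamma *= alpha_bar            # worst-case (largest) discount factor
print(total, "<=", bound)
```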

Other less obvious sufficient conditions are the following (see, e.g., [15, 16, 2]).

C2. (a) Condition C1(a) holds.

(b) There exist a measurable function W : X → (1, ∞) and constants M > 0 and β ∈ (1, 1/ᾱ) such that, for all (x, a) ∈ K,

c(x, a) ≤ M W(x)   and   ∫_X W(y) Q(dy|x, a) ≤ β W(x).

First note that, by condition C2 and the Markov property (14), for any policy π ∈ Π and initial state x_0 = x ∈ X,

E_x^π[ W(x_{n+1}) | h_n, a_n ] = ∫_X W(y) Q(dy|x_n, a_n) ≤ β W(x_n), n ∈ N_0.

Then, using properties of conditional expectation,

E_x^π[ W(x_{n+1}) ] ≤ β E_x^π[ W(x_n) ], n ∈ N_0.   (51)

Iterating inequality (51) we get

E_x^π[ W(x_n) ] ≤ β^n W(x), n ∈ N_0.   (52)

Therefore, by condition C2, for any policy π ∈ Π and x ∈ X,

V(π, x) ≤ M Σ_{n=0}^{∞} ᾱ^n E_x^π[ W(x_n) ] ≤ M W(x) Σ_{n=0}^{∞} (ᾱβ)^n = M W(x) / (1 − ᾱβ).   (53)

Thus, Assumption 3.1 holds.

On the other hand, if L_W⁺(X) denotes the subclass of all functions u in L⁺(X) such that

‖u‖_W := sup_{x∈X} u(x)/W(x) < ∞,

then, because each w_k is the total expected discounted cost of a stationary policy (see (36)–(38)), from (53) and condition C2 we have that w_k ∈ L_W⁺(X) for all k = 1, 2, ..., and

E_x^π[ Γ_n w_k(x_n) ] ≤ ᾱ^n ‖w_k‖_W E_x^π[ W(x_n) ] ≤ (ᾱβ)^n ‖w_k‖_W W(x) → 0 as n → ∞.   (55)

Since w ≤ w_k, (40) follows from (55).


Work partially supported by Consejo Nacional de Ciencia y Tecnología (CONACYT) under grant CB2015/254306.

© 2016 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


How to cite this chapter: Fernando Luque-Vásquez and J. Adolfo Minjárez-Sosa (December 28th, 2016). Iteration Algorithms in Markov Decision Processes with State-Action-Dependent Discount Factors and Unbounded Costs. In: Kuodi Jian (ed.), Operations Research - the Art of Making Good Decisions. IntechOpen. DOI: 10.5772/65044.
