Iteration Algorithms in Markov Decision Processes with State-Action-Dependent Discount Factors and Unbounded Costs

This chapter concerns discrete time Markov decision processes under a discounted optimality criterion with state-action-dependent discount factors, possibly unbounded costs, and noncompact admissible action sets. Under mild conditions, we show the existence of stationary optimal policies and we introduce the value iteration and the policy iteration algorithms to approximate the value function.


Introduction
In this chapter we study Markov decision processes (MDPs) with Borel state and action spaces under a discounted criterion with state-action-dependent discount factors, possibly unbounded costs, and noncompact admissible action sets. That is, we consider discount factors of the form
$$\alpha(x_n, a_n), \quad n \in \mathbb{N}_0, \tag{1}$$
where $x_n$ and $a_n$ denote the state and the action at time $n$, respectively, and play the following role during the evolution of the system. At the initial state $x_0$, the controller chooses an action $a_0$ and a cost $c(x_0, a_0)$ is incurred. Then the system moves to a new state $x_1$ according to a transition law. Once the system is in state $x_1$, the controller selects an action $a_1$ and incurs a discounted cost $\alpha(x_0, a_0)\, c(x_1, a_1)$. Next the system moves to a state $x_2$ and the process is repeated. In general, at stage $n \geq 1$ the controller incurs the discounted cost $\alpha(x_0, a_0)\alpha(x_1, a_1)\cdots\alpha(x_{n-1}, a_{n-1})\, c(x_n, a_n)$, and our objective is to show the existence of stationary optimal control policies under the corresponding performance index, as well as to introduce approximation algorithms, namely, value iteration and policy iteration.
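To make the multiplicative discounting concrete, consider a short numerical instance (the values are illustrative only and not taken from any model in this chapter): suppose $\alpha(x_0, a_0) = 0.9$, $\alpha(x_1, a_1) = 0.8$, and stage costs $c(x_0, a_0) = 5$, $c(x_1, a_1) = 4$, $c(x_2, a_2) = 3$. The accumulated discounted cost over the first three stages is
$$c(x_0, a_0) + \alpha(x_0, a_0)\, c(x_1, a_1) + \alpha(x_0, a_0)\alpha(x_1, a_1)\, c(x_2, a_2) = 5 + (0.9)(4) + (0.72)(3) = 10.76.$$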
In the scenario of a constant discount factor, the discounted optimality criterion in stochastic decision problems is the best understood of all performance indices, and it is widely accepted in several application problems (see, e.g., [1][2][3] and references therein). However, such an assumption might be strong or unrealistic in some economic and financial models. Indeed, in these problems the discount factors are typically functions of the interest rates, which in turn depend on the amount of currency and the decision-makers' actions. Hence, we have state-action-dependent discount factors, and it is precisely these kinds of situations we deal with.
MDPs with nonconstant discount factors have been studied under different approaches (see, e.g., [4][5][6][7][8]). In particular, our work is a sequel to [8], where the control problem with a state-dependent discount factor is studied. In addition, randomized discounted criteria have been analyzed in [9][10][11][12], where the discount factor is modeled as a stochastic process independent of the state-action pairs. Specifically, in this chapter we study control models with state-action-dependent discount factors, focusing mainly on introducing approximation algorithms for the optimal value function (value iteration and policy iteration). Furthermore, an important feature of this work is that there is no compactness assumption on the sets of admissible actions nor continuity conditions on the cost, which, in most papers on MDPs, are needed to show the existence of measurable selectors and the continuity or semicontinuity of the minimum function. Indeed, in contrast to the previously cited references, in this work we assume that the cost and discount factor functions satisfy the $\mathbb{K}$-inf-compactness condition introduced in [13]. Then, we use a generalization of Berge's theorem, given in [13], to prove the existence of measurable selectors. To the best of our knowledge, there are no works dealing with MDPs in the context presented in this chapter.
The remainder of the chapter is organized as follows. Section 2 contains the description of the Markov decision model and the optimality criterion. In Section 3 we introduce the assumptions on the model and prove the convergence of the value iteration algorithm (Theorem 3.5). In Section 4 we define the policy iteration algorithm, whose convergence is stated in Theorem 4.1.
Notation. Throughout the chapter we shall use the following notation. Given a Borel space $X$ (that is, a Borel subset of a complete separable metric space), $\mathcal{B}(X)$ denotes its Borel $\sigma$-algebra, and "measurability" always means measurability with respect to $\mathcal{B}(X)$. Given two Borel spaces $X$ and $X'$, a stochastic kernel $Q(\cdot \mid \cdot)$ on $X$ given $X'$ is a function such that $Q(\cdot \mid x')$ is a probability measure on $X$ for each $x' \in X'$, and $Q(B \mid \cdot)$ is a measurable function on $X'$ for each $B \in \mathcal{B}(X)$. Moreover, $\mathbb{N}$ ($\mathbb{N}_0$) denotes the set of positive (nonnegative) integers. Finally, $\mathbb{L}(X)$ stands for the class of lower semicontinuous functions on $X$ that are bounded below, and $\mathbb{L}^+(X)$ denotes the subclass of nonnegative functions in $\mathbb{L}(X)$.

Markov decision processes
Let $\mathcal{M} = \left(X, A, \{A(x) : x \in X\}, Q, \alpha, c\right)$ be a discrete-time Markov control model with state-action-dependent discount factors satisfying the following conditions. The state space $X$ and the action (or control) space $A$ are Borel spaces. For each state $x \in X$, $A(x)$ is a nonempty Borel subset of $A$ denoting the set of admissible controls when the system is in state $x$. We denote by $\mathbb{K}$ the graph of the multifunction $x \mapsto A(x)$, that is,
$$\mathbb{K} := \{(x, a) : x \in X,\ a \in A(x)\},$$
which is assumed to be a Borel subset of the Cartesian product of $X$ and $A$. The transition law $Q(\cdot \mid \cdot)$ is a stochastic kernel on $X$ given $\mathbb{K}$. Finally, $\alpha : \mathbb{K} \to (0,1)$ and $c : \mathbb{K} \to (0,\infty)$ are measurable functions representing the discount factor and the cost-per-stage, respectively, when the system is in state $x \in X$ and the action $a \in A(x)$ is selected.
The model $\mathcal{M}$ represents a controlled stochastic system and has the following interpretation. Suppose that at time $n \in \mathbb{N}_0$ the system is in state $x_n = x \in X$. Then, possibly taking into account the history of the system, the controller selects an action $a_n = a \in A(x)$, and a discount factor $\alpha(x, a)$ is imposed. As a consequence of this, a cost $c(x, a)$ is incurred and the system visits a new state $x_{n+1} = x' \in X$ according to the transition law $Q(\cdot \mid x, a)$. Once the transition to state $x'$ occurs, the process is repeated.
Typically, in many applications, the evolution of the system is determined by stochastic difference equations of the form
$$x_{n+1} = F(x_n, a_n, \xi_n), \quad n \in \mathbb{N}_0,$$
where $\{\xi_n\}$ is a sequence of independent and identically distributed random variables with values in some Borel space $S$, independent of the initial state $x_0$, and $F : X \times A \times S \to X$ is a given measurable function. In this case, if $\theta$ denotes the common distribution of $\xi_n$, that is, $\theta(B) := \Pr[\xi_n \in B]$ for $B \in \mathcal{B}(S)$, then the transition kernel can be written as
$$Q(B \mid x, a) = \int_S \mathbf{1}_B\big(F(x, a, s)\big)\, \theta(ds), \quad B \in \mathcal{B}(X),\ (x, a) \in \mathbb{K},$$
where $\mathbf{1}_B(\cdot)$ represents the indicator function of the set $B$.
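To illustrate this representation, the following Python sketch estimates $Q(B \mid x, a) = \theta(\{s : F(x, a, s) \in B\})$ by sampling the noise distribution; the dynamics $F$, the noise law $\theta$, and all numerical values are hypothetical placeholders, not objects from the chapter.

```python
import numpy as np

def F(x, a, s):
    # Hypothetical system dynamics: next state = current state + action + noise.
    return x + a + s

def estimate_Q(x, a, B_low, B_high, n_samples=100_000, rng=None):
    """Monte Carlo estimate of Q(B | x, a) = theta({s : F(x, a, s) in B})
    for the interval B = [B_low, B_high], with theta = N(0, 1)."""
    rng = rng or np.random.default_rng(0)
    s = rng.standard_normal(n_samples)      # draws from theta
    next_states = F(x, a, s)                # F(x, a, xi) for each sample
    return np.mean((B_low <= next_states) & (next_states <= B_high))

# Probability that the next state lands in [0, 1] from state x = 0.5, action a = -0.2.
print(estimate_Q(0.5, -0.2, 0.0, 1.0))
```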
Control policies. The actions applied by the controller are chosen by means of rules known as control policies, defined as follows. Let $\mathbb{H}_0 := X$ and $\mathbb{H}_n := \mathbb{K} \times \mathbb{H}_{n-1}$, $n \geq 1$, be the spaces of admissible histories up to time $n$. A generic element of $\mathbb{H}_n$ is written as $h_n = (x_0, a_0, \ldots, x_{n-1}, a_{n-1}, x_n)$. A control policy is a sequence $\pi = \{\pi_n\}$ of stochastic kernels $\pi_n$ on $A$ given $\mathbb{H}_n$ such that $\pi_n(A(x_n) \mid h_n) = 1$ for all $h_n \in \mathbb{H}_n$, $n \in \mathbb{N}_0$.
We denote by Π the set of all control policies.
Let $\mathbb{F}$ be the set of measurable selectors, that is, $\mathbb{F}$ is the set of measurable functions $f : X \to A$ such that $f(x) \in A(x)$ for all $x \in X$.
Definition 2.2. A control policy $\pi = \{\pi_n\}$ is said to be:
a. deterministic if there exists a sequence $\{g_n\}$ of measurable functions $g_n : \mathbb{H}_n \to A$ such that $\pi_n(\cdot \mid h_n)$ is concentrated at $g_n(h_n)$ for all $h_n \in \mathbb{H}_n$, $n \in \mathbb{N}_0$;
b. a Markov control policy if there exists a sequence of functions $f_n \in \mathbb{F}$ such that $\pi_n(\cdot \mid h_n)$ is concentrated at $f_n(x_n)$ for all $h_n \in \mathbb{H}_n$, $n \in \mathbb{N}_0$.
In addition:
c. A Markov control policy is stationary if there exists $f \in \mathbb{F}$ such that $f_n = f$ for all $n \in \mathbb{N}_0$.
Observe that a Markov policy is identified with the sequence $\{f_n\} \subset \mathbb{F}$, and we write $\pi = \{f_n\}$.
In this case, the control applied at time $n$ is $a_n = f_n(x_n) \in A(x_n)$. In particular, a stationary policy is identified with a single function $f \in \mathbb{F}$, and following a standard convention we denote by $\mathbb{F}$ the set of all stationary control policies.
Optimality criterion. We assume that the costs are discounted at a multiplicative rate. That is, a cost $C$ incurred at stage $n$ is equivalent to a cost $\Gamma_n C$ at time $0$, where
$$\Gamma_0 := 1, \qquad \Gamma_n := \prod_{k=0}^{n-1} \alpha(x_k, a_k), \quad n \in \mathbb{N}.$$
In this sense, when using a policy $\pi \in \Pi$, given the initial state $x_0 = x$, we define the total expected discounted cost (with state-action-dependent discount factors) as
$$V(\pi, x) := E_x^{\pi}\left[\sum_{n=0}^{\infty} \Gamma_n\, c(x_n, a_n)\right],$$
where $E_x^{\pi}$ denotes the expectation operator with respect to the probability measure $P_x^{\pi}$ induced by the policy $\pi$, given $x_0 = x$.
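For intuition, here is a minimal Python sketch of a Monte Carlo approximation of $V(\pi, x)$ for a stationary policy, truncating the infinite sum at a finite horizon; the dynamics, cost, discount function, and policy below are hypothetical placeholders, not objects from the chapter.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical model ingredients (placeholders, not from the chapter).
def step(x, a):                      # draws x_{n+1} ~ Q(. | x, a)
    return 0.7 * x + a + rng.standard_normal()

def cost(x, a):                      # one-stage cost c(x, a) >= 0
    return x**2 + a**2

def alpha(x, a):                     # state-action-dependent discount in (0, 1)
    return 0.5 + 0.4 / (1.0 + x**2 + a**2)

def policy(x):                       # a stationary policy f(x)
    return -0.5 * x

def estimate_V(x0, horizon=200, n_paths=2000):
    """Estimate V(f, x0) = E[sum_n Gamma_n c(x_n, a_n)], truncated at `horizon`."""
    total = 0.0
    for _ in range(n_paths):
        x, gamma, path_cost = x0, 1.0, 0.0
        for _ in range(horizon):
            a = policy(x)
            path_cost += gamma * cost(x, a)   # Gamma_n * c(x_n, a_n)
            gamma *= alpha(x, a)              # Gamma_{n+1} = Gamma_n * alpha(x_n, a_n)
            x = step(x, a)
        total += path_cost
    return total / n_paths

print(estimate_V(1.0))
```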
The optimal control problem associated with the control model $\mathcal{M}$ is then to find an optimal policy $\pi^* \in \Pi$ such that $V(\pi^*, x) = V^*(x)$ for all $x \in X$, where
$$V^*(x) := \inf_{\pi \in \Pi} V(\pi, x), \quad x \in X,$$
is the optimal value function (see [10]).

The value iteration algorithm
In this section we give conditions on the model that imply: (i) the convergence of the value iteration algorithm; (ii) that the value function is a solution of the corresponding optimality equation; and (iii) the existence of stationary optimal policies. In order to guarantee that $V^*(x)$ is finite for each initial state $x$, we impose a finiteness condition on the model (Assumption 3.1).
At the end of Section 4 we give sufficient conditions for Assumption 3.1. We also require continuity and (inf-)compactness conditions to ensure the existence of "measurable minimizers." The following definition was introduced in [13].

Definition 3.2. A function $u : \mathbb{K} \to \mathbb{R}$ is $\mathbb{K}$-inf-compact on $\mathbb{K}$ if, for every compact set $K \subset X$, the restriction of $u$ to $\mathbb{K} \cap (K \times A)$ is inf-compact, that is, the level set $\{(x, a) \in \mathbb{K} \cap (K \times A) : u(x, a) \leq r\}$ is compact for each $r \in \mathbb{R}$.

Assumption 3.3.
(a) The one-stage cost $c$ and the discount factor $\alpha$ are $\mathbb{K}$-inf-compact functions on $\mathbb{K}$. In addition, $c$ is nonnegative.
(b) The transition law $Q$ is weakly continuous; that is, the mapping $(x, a) \mapsto \int_X u(y)\, Q(dy \mid x, a)$ is continuous on $\mathbb{K}$ for each bounded and continuous function $u$ on $X$.
For each measurable function $u$ on $X$, $f \in \mathbb{F}$, and $x \in X$, we define the operators
$$T u(x) := \inf_{a \in A(x)}\left\{ c(x, a) + \alpha(x, a) \int_X u(y)\, Q(dy \mid x, a) \right\}$$
and
$$T_f u(x) := c(x, f(x)) + \alpha(x, f(x)) \int_X u(y)\, Q(dy \mid x, f(x)).$$
The following properties are consequences of Assumption 3.3.
Observe that the operator $T$ is monotone, in the sense that if $u \geq v$ then $T u \geq T v$. In addition, from Assumption 3.3 and ([13], Theorem 3.3), $T$ maps $\mathbb{L}^+(X)$ into itself. Furthermore, there exists $f \in \mathbb{F}$ such that $T u(x) = T_f u(x)$ for all $x \in X$. To state our first result, we define the sequence $\{v_n\} \subset \mathbb{L}^+(X)$ of value iteration functions as
$$v_0 := 0, \qquad v_n := T v_{n-1}, \quad n \in \mathbb{N}.$$
Since $T$ is monotone and $v_1 \geq v_0 = 0$, the sequence $\{v_n\}$ is nondecreasing.
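To illustrate the scheme, the following Python sketch runs the iteration $v_n = T v_{n-1}$ on a small finite toy model; the states, actions, costs, discount factors, and transition probabilities are randomly generated placeholders, used only to make the operator $T$ concrete.

```python
import numpy as np

# Hypothetical finite toy model: 3 states, 2 actions (all data made up).
n_states, n_actions = 3, 2
rng = np.random.default_rng(2)
c = rng.uniform(0.0, 5.0, size=(n_states, n_actions))        # cost c(x, a) >= 0
alpha = rng.uniform(0.5, 0.9, size=(n_states, n_actions))    # discount alpha(x, a) in (0, 1)
Q = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # Q(y | x, a)

def T(v):
    """Bellman operator with state-action-dependent discount factors:
    (Tv)(x) = min_a { c(x, a) + alpha(x, a) * sum_y Q(y | x, a) v(y) }."""
    q_values = c + alpha * (Q @ v)       # shape (n_states, n_actions)
    return q_values.min(axis=1)

# Value iteration: v_0 = 0, v_n = T v_{n-1}; the sequence increases to V*.
v = np.zeros(n_states)
for n in range(1000):
    v_next = T(v)
    if np.max(np.abs(v_next - v)) < 1e-10:
        break
    v = v_next
print("approximate value function:", v)
```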
Theorem 3.5. Suppose that Assumptions 3.1 and 3.3 hold. Then:
a. $v_n \nearrow V^*$ as $n \to \infty$.
b. $V^*$ is the minimal solution in $\mathbb{L}^+(X)$ of the optimality equation, i.e., $V^* = T V^*$.
c. There exists a stationary policy $f^* \in \mathbb{F}$ such that, for all $x \in X$, $V^*(x) = T_{f^*} V^*(x)$, that is,
$$V^*(x) = c(x, f^*(x)) + \alpha(x, f^*(x)) \int_X V^*(y)\, Q(dy \mid x, f^*(x)),$$
and $f^*$ is an optimal policy.

Regarding the minimality in part b, let $u \in \mathbb{L}^+(X)$ be an arbitrary solution of the optimality equation, that is, $u = T u$. Then, applying the arguments in the proof of (34) with $u$ instead of $V^*$, we conclude that $u \geq V^*$. That is, $V^*$ is minimal in $\mathbb{L}^+(X)$.

Policy iteration algorithm
Theorem 3.5 establishes an approximation scheme for the value function by means of the sequence of value iteration functions $\{v_n\}$; in this case the sequence increases to $V^*$ and is defined recursively. We now present the well-known policy iteration algorithm, which provides a decreasing approximation of $V^*$ within the set of control policies.
To define the algorithm, first observe that from the Markov property (14) and applying properties of conditional expectation, for any stationary policy $f \in \mathbb{F}$ and $x \in X$, the corresponding cost $V(f, x)$ satisfies
$$V(f, x) = c(x, f(x)) + \alpha(x, f(x)) \int_X V(f, y)\, Q(dy \mid x, f(x)). \tag{35}$$
Let $f_0 \in \mathbb{F}$ be a stationary policy with a finite-valued cost $W_0(x) := V(f_0, x)$. Then, from (35), $W_0 = T_{f_0} W_0$. Now, let $f_1 \in \mathbb{F}$ be such that $T W_0 = T_{f_1} W_0$, and define $W_1(x) := V(f_1, x)$. In general, we define a sequence $\{W_n\}$ in $\mathbb{L}^+(X)$ as follows. Given $f_n \in \mathbb{F}$, compute $W_n(x) := V(f_n, x)$. Next, let $f_{n+1} \in \mathbb{F}$ be such that $T W_n = T_{f_{n+1}} W_n$.
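A minimal Python sketch of this scheme on a finite toy model (all data hypothetical, with the same structure as the value iteration sketch above) is given below; policy evaluation solves the linear system implied by (35), and policy improvement selects a minimizer in the definition of $T$.

```python
import numpy as np

# Hypothetical finite toy model (placeholders, not from the chapter).
n_states, n_actions = 3, 2
rng = np.random.default_rng(3)
c = rng.uniform(0.0, 5.0, size=(n_states, n_actions))
alpha = rng.uniform(0.5, 0.9, size=(n_states, n_actions))
Q = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def evaluate(f):
    """Policy evaluation: solve W = c_f + diag(alpha_f) Q_f W, cf. (35)."""
    idx = np.arange(n_states)
    c_f, alpha_f, Q_f = c[idx, f], alpha[idx, f], Q[idx, f]   # rows for a = f(x)
    A = np.eye(n_states) - alpha_f[:, None] * Q_f
    return np.linalg.solve(A, c_f)

def improve(W):
    """Policy improvement: pick f_{n+1}(x) attaining the minimum in (T W)(x)."""
    return np.argmin(c + alpha * (Q @ W), axis=1)

f = np.zeros(n_states, dtype=int)        # initial stationary policy f_0
for _ in range(100):
    W = evaluate(f)                      # W_n = V(f_n, .)
    f_next = improve(W)                  # T W_n = T_{f_{n+1}} W_n
    if np.array_equal(f_next, f):
        break                            # stable policy: optimal for this finite toy model
    f = f_next
print("policy:", f, "cost:", evaluate(f))
```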

Theorem 4.1. Suppose that Assumptions 3.1 and 3.3 hold. Then the sequence $\{W_n\}$ of policy iteration functions is nonincreasing and $W_n \searrow V^*$ as $n \to \infty$.

To prove Theorem 4.1 we need the following result, which follows from the Markov property (14) together with standard properties of conditional expectation.

Lemma 4.2. Under Assumption 3.3, if $u : X \to \mathbb{R}$ is a measurable function such that $E_x^{\pi}[\Gamma_n u(x_n)]$ is well defined for each $\pi \in \Pi$, $x \in X$, and $n \in \mathbb{N}_0$, then
$$E_x^{\pi}\!\left[\Gamma_{n+1}\, u(x_{n+1})\right] = E_x^{\pi}\!\left[\Gamma_n\, \alpha(x_n, a_n) \int_X u(y)\, Q(dy \mid x_n, a_n)\right].$$
Proof.