
Some Applications of the Partially Observable Markov Decision Processes

Written By

Doncho S. Donchev

Submitted: 01 December 2022 Reviewed: 28 January 2023 Published: 17 March 2023

DOI: 10.5772/intechopen.1001135

From the Edited Volume

Markov Model - Theory and Applications

Hammad Khalil and Abuzar Ghaffari


Abstract

We give two examples of the reduction of a POMDP model to an MDP model with complete information. In both MDP models, the corresponding Bellman equations allow us to construct optimal policies. The first example is the classical discrete time disorder problem. We treat it within the framework of POMDPs, in contrast to the standard approach based on optimal stopping rules. In the second model, we consider the steps of a transaction under threat as a trajectory of a POMDP. We show how to define the elements of the model in order to achieve an optimal balance between a successful end of the transaction and a minimal application of mitigating actions.

Keywords

  • transaction models
  • intrusion detection and information filtering
  • cost functions
  • partially observable Markov decision processes
  • disorder problem
  • Bellman’s equation

1. Introduction

Methods based on Markov decision processes (MDPs) have become an essential part of reinforcement learning in environments in which we can directly observe the evolution of the controlled object. In many situations, however, our ability to do so is limited. Partially observable Markov decision processes (POMDPs) extend the MDP framework, allowing for decision making under conditions of uncertainty. The classical approach to POMDPs suggests that, in order to find the optimal policy in such a model, one should reduce it to an MDP model with complete information, which consists of the following steps:

  1. Find suitable sufficient statistics that play the role of a controlled process in the MDP model with complete information;

  2. Define all elements of the MDP model;

  3. Solve Bellman’s equation for the model from step 2.

Unfortunately, there is a lack of examples for which this procedure can be completed successfully. In this chapter, we present two POMDP models, focusing on the algorithm described above, and show how the optimal policies for these POMDPs can be represented.

Firstly, we revisit the discrete time disorder problem, which was stated and investigated in detail by Shiryaev [1]. The formal setting of this problem is the following. On some measurable space $(\Omega,\mathcal{F})$ we are given random variables $\theta,\xi_1,\xi_2,\ldots$ and a probability measure $P^\pi$ such that

$$P^\pi(\theta=0)=\pi,\qquad P^\pi(\theta=n)=(1-\pi)\,p(1-p)^{n-1},\quad n\ge 1,$$

where $0<p<1$ and $0\le\pi\le 1$ are constants, and the probability of any event $A=\{\omega:\ \xi_1\le x_1,\ldots,\xi_n\le x_n\}$ is equal to

$$P^\pi(A)=\pi P^1(A)+(1-\pi)\sum_{i=0}^{n-1}p(1-p)^{i}\,P^0(\xi_1\le x_1,\ldots,\xi_i\le x_i)\times$$
$$\times\,P^1(\xi_{i+1}\le x_{i+1},\ldots,\xi_n\le x_n)+(1-\pi)(1-p)^{n}\,P^0(A).$$

In the last formula, both measures $P^0$ and $P^1$ define c.d.f.s $F_i(x)=P^i(\xi_1\le x)$, $i=0,1$, with densities $p_i(x)$, $i=0,1$. W.r.t. each $P^i$, $i=0,1$, all random variables $\xi_1,\xi_2,\ldots$ are i.i.d.

Let $\mathcal{F}^\xi=(\mathcal{F}_n^\xi)$, $\mathcal{F}_0^\xi=\{\emptyset,\Omega\}$, $\mathcal{F}_n^\xi=\sigma(\xi_1,\xi_2,\ldots,\xi_n)$, $n=1,2,\ldots$, be the filtration adapted to the observations $\xi_1,\xi_2,\ldots$ The problem is to find an $\mathcal{F}^\xi$-stopping time $\tau$ which minimizes the risk

$$\rho^\pi(\tau)=P^\pi(\tau<\theta)+c\,E^\pi(\tau-\theta)^+,$$

where c is a positive constant. A solution to that problem in terms of optimal stopping rules and martingale techniques can be found in [1].

In Section 3, we present a different approach to this problem by including it in the framework of POMDPs. We show that steps 1 and 2 of the reduction can be performed successfully. Moreover, we define asymptotic solutions to Bellman's equation and describe a procedure which allows us to find them for some densities $p_0(x)$ and $p_1(x)$.

Section 4 is devoted to applications of POMDPs in cybersecurity. Contemporary Intrusion Detection Systems (IDS) are widely used in network management and cybersecurity frameworks for detecting and classifying potential security threats of unauthorized intrusions. Unauthorized intrusions during transaction processing are particularly dangerous because they can lead to significant financial losses. There are a number of security measures which can be used to counteract them, but the success of their use depends on the precision and the timing of the detection. In both network-based IDS, which use signature information [2], and host-based IDS, which use behavioral information [3], the data analysis utilizes a variety of Machine Learning (ML) methods.

In our recent papers [4, 5], we developed an approach based on POMDPs which allows us to perform a quantitative assessment of the impact of the parameters of the model on different important characteristics of the transaction. A deficiency of the model presented in [4, 5] is the assumption of a constant intrusion rate. Here, we relax it. Moreover, we reveal the crucial role of the costs of counteractions in building the model. On the one hand, they should favor the successful end of the transaction, but, on the other hand, they should prevent the unnecessary use of counteractions when there is no attack. We show that the costs can be found as optimal solutions of a certain linear program. We complete this section by solving Bellman's equation and describing the optimal policy in the model.

In order to make this chapter self-contained, we give a general description of POMDP models in Section 2.


2. A general description of models with incomplete information

We recall the general framework of the model with incomplete information for controlled Partially Observable Markov Decision Processes (POMDPs). For more details we refer the reader to [6, 7]. The model consists of the following elements:

  1. The state space X. A non-empty Borel space.

  2. An action space A. A non-empty Borel space.

  3. An observation space Z. A non-empty Borel space.

  4. A transition kernel $t(dx_{n+1}\mid x_n,a_n)$. A probability measure on $X$ depending on both the last state $x_n$ and the last action $a_n$.

  5. An observation kernel $s(dz_{n+1}\mid x_{n+1},a_n)$. A probability measure on $Z$ depending on both the current state $x_{n+1}$ and the last action $a_n$.

  6. An initial observation kernel $s_0(dz_0\mid x_0)$.

  7. A running reward function $g(x_n,a_n)$.

  8. A horizon $N$. A positive integer or $\infty$. If $N<\infty$ we also introduce a final reward $R(x_N)$.

We assume that all transition kernels and rewards are measurable.

The evolution of the POMDP is the following. At time $n$, $n=0,1,2,\ldots$, it jumps from state $x_n$ to state $x_{n+1}$ according to the law $t(dx_{n+1}\mid x_n,a_n)$. We observe neither $x_n$ nor $x_{n+1}$. Instead, we observe $z_{n+1}$, whose distribution is $s(dz_{n+1}\mid x_{n+1},a_n)$, and add this observation to the information available up to time $n+1$: $i_{n+1}=(z_0,a_0,\ldots,z_n,a_n,z_{n+1})$. The initial information $i_0$ coincides with the initial observation $z_0$, which depends on the initial state $x_0$. We assume that the initial distribution $p$ of $x_0$ is also known.
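To make the abstract description above concrete, here is a minimal Python sketch of these elements and of one simulated trajectory in which the policy sees only the information $i_n$; the field and function names are ours, the kernels are represented by sampling callables, and the final reward is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Illustrative container for the model elements listed above.
@dataclass
class POMDP:
    init_state: Callable[[], Any]            # samples x_0 from the initial distribution p
    init_observation: Callable[[Any], Any]   # samples z_0 from s_0(dz_0 | x_0)
    transition: Callable[[Any, Any], Any]    # samples x_{n+1} from t(dx_{n+1} | x_n, a_n)
    observation: Callable[[Any, Any], Any]   # samples z_{n+1} from s(dz_{n+1} | x_{n+1}, a_n)
    reward: Callable[[Any, Any], float]      # running reward g(x_n, a_n)

def rollout(model: POMDP, policy: Callable[[List[Any]], Any], horizon: int) -> Tuple[float, List[Any]]:
    """Simulate one trajectory; the policy sees only the information i_n, never the states."""
    x = model.init_state()
    z = model.init_observation(x)
    info: List[Any] = [z]                    # i_0 = (z_0)
    total = 0.0
    for _ in range(horizon):
        a = policy(info)                     # a_n chosen from the information i_n
        total += model.reward(x, a)
        x = model.transition(x, a)           # the new state stays hidden from the controller
        z = model.observation(x, a)
        info += [a, z]                       # i_{n+1} = (z_0, a_0, ..., z_n, a_n, z_{n+1})
    return total, info
```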

Definition 1 A policy is a sequence $\pi=(\mu_0,\mu_1,\ldots,\mu_{N-1})$ of measures $\mu_n(da_n\mid p,i_n)$ such that $\mu_n(A_n\mid p,i_n)=1$, where $A_n$ is the set of all admissible actions corresponding to the initial distribution $p$ and the information $i_n$.

Besides the information $i_n$, we also consider the set of all histories $H_n$ up to time $n$. Each history $h_n\in H_n$ includes not only $i_n$ but also all states visited by the process up to time $n$: $h_n=(x_0,z_0,a_0,x_1,z_1,a_1,\ldots,x_n,z_n,a_n)$. According to the Ionescu-Tulcea theorem, to each initial distribution $p$ and policy $\pi$ there corresponds a measure $P_n^{\pi,p}$ on the set $H_n$ given by the formula:

$$P_n^{\pi,p}(X_0\times Z_0\times A_0\times\cdots\times X_n\times Z_n\times A_n)=\int_{X_0}\int_{Z_0}\int_{A_0}\cdots\int_{X_n}\int_{Z_n}\mu_n(A_n\mid p,i_n)\times$$
$$\times\,s(dz_n\mid x_n,a_{n-1})\,t(dx_n\mid x_{n-1},a_{n-1})\cdots\mu_0(da_0\mid p,z_0)\,s_0(dz_0\mid x_0)\,p(dx_0).$$

Here $X_0\times Z_0\times A_0\times\cdots\times X_n\times Z_n\times A_n$ is a cylindrical subset of $H_n$. The problem is to find a policy $\pi$ which maximizes the total reward

$$J_N(\pi,p)=E_N^{\pi,p}\left[\sum_{n=0}^{N-1}g(x_n,a_n)+R(x_N)\right], \tag{1}$$

where $E_N^{\pi,p}$ is the expectation w.r.t. the measure $P_N^{\pi,p}$, and the term $R(x_N)$ is missing if $N=\infty$.

If $N=\infty$, then in order to ensure convergence in (1) one often introduces a discounting. However, the infinite-horizon model which we study here is such that, with probability 1, the sum in (1) contains only a finite number of nonzero terms. So, we do not need to take care of convergence.

The general approach to the optimal control of POMDPs consists in a reduction of problem (1) to an MDP with complete information. We do not discuss that reduction in full generality, referring the reader to [6, 7]. However, in the next section we consider in detail all steps of the reduction.


3. The disorder problem

In this section we develop an approach to the discrete time disorder problem based on POMDPs. We define the elements of the general model described in Section 2 as follows:

  1. The state space $X=\{0,1,2\}$ consists of three states 0, 1, and 2.

  2. The action space $A=\{\text{wait},\text{stop}\}$ consists of the two possible actions we can apply.

  3. The observation space $Z$ coincides with the common support of the densities $p_0(z)$ and $p_1(z)$. We assume that the supports of $p_0(z)$ and $p_1(z)$ coincide since, otherwise, in some cases we would observe directly the current state of the controlled process.

  4. We define the transition kernel $t$ by:

    $t(1\mid 0,\text{wait})=p$, $t(0\mid 0,\text{wait})=1-p$, $t(1\mid 1,\text{wait})=1$,
    $t(2\mid 0,\text{stop})=t(2\mid 1,\text{stop})=t(2\mid 2,\cdot)=1$.

  5. The observation kernel is

    $s(dz\mid 0,\text{wait})=p_0(z)\,dz$, $s(dz\mid 1,\text{wait})=p_1(z)\,dz$.

  6. The running reward is:

    $g(0,\text{wait})=1$, $g(1,\text{wait})=-c<0$, $g(0,\text{stop})=g(1,\text{stop})=0$, $g(2,\cdot)=0$.

  7. The initial distribution on $X$ is

    $P(x_0=0)=1-\pi$, $P(x_0=1)=\pi$, $P(x_0=2)=0$.

  8. The horizon $N=\infty$.

Let us clarify how this model relates to the disorder problem as stated above. The moment of disorder $\theta$, which is $\mathrm{Ge}(p)$-distributed, is in fact the time of the first visit of the controlled process to state 1. Our goal is to fix $\theta$ as precisely as possible, making use of the action stop. If we are in state 0, then the disorder has not occurred yet, and the right decision is to wait. This yields an income equal to 1, and our total reward increases by the same amount. However, if we miss $\theta$ (and are therefore in state 1), then applying the action wait is certainly a wrong decision, which results in paying a positive penalty $c>0$. The only way to avoid such a situation is to stop the process and go to the absorbing state 2. Since we apply the action stop only once, say at time $\tau$, we can identify any policy with that time and consider the following problem: maximize the total expected reward

$$J_\tau(\pi)=E^\pi\sum_{n=0}^{\tau-1}g(x_n,\text{wait}), \tag{2}$$

where $E^\pi$ is the expectation corresponding to the initial distribution given by Element 7 of the model.

Remark. In contrast to the optimal stopping approach where we receive our income when applying the action stop, formula (2) shows that now the situation is completely different—the whole reward is a result of waiting actions whereas the action stop yields no income.

3.1 Filtration equations. Sufficient statistics. Bellman’s equation

The total probability formula implies the following representation of $J_\tau(\pi)$:

$$J_\tau(\pi)=(1-\pi)J_\tau(0)+\pi J_\tau(1). \tag{3}$$

Evidently, $J_\tau(1)=0$ because, being aware that the process starts at state 1, the optimal policy recommends an immediate stop. In order to find $J_\tau(0)$ we follow a standard procedure of reducing the POMDP model to a model with complete information.

The first step of this procedure consists in solving the filtration equations. Denote by $f_n(i)$, $i=0,1$, $n=1,2,\ldots$ (resp. $r_n(i)$, $i=0,1$, $n=1,2,\ldots$) the prior (resp. posterior) probabilities of the states $i=0,1$ at time $n$. Namely, $f_n(i)$ is the probability of being in state $i$ before observing $\xi_n$, whereas $r_n(i)$ is the same probability after the observation has occurred. Let $\xi_1=z$. Since at the beginning we know for certain that the process starts at state 0, applying the Bayes rule we get

$$f_1(0)=1-p,\qquad f_1(1)=p, \tag{4}$$
$$r_1(0)=\frac{f_1(0)\,p_0(z)}{f_1(0)\,p_0(z)+f_1(1)\,p_1(z)},\qquad r_1(1)=\frac{f_1(1)\,p_1(z)}{f_1(0)\,p_0(z)+f_1(1)\,p_1(z)}. \tag{5}$$

Similarly, under the assumption that $\xi_n=z$, for any $n>1$ we have

$$f_n(0)=(1-p)\,r_{n-1}(0),\qquad f_n(1)=p\,r_{n-1}(0)+r_{n-1}(1), \tag{6}$$
$$r_n(0)=\frac{f_n(0)\,p_0(z)}{f_n(0)\,p_0(z)+f_n(1)\,p_1(z)},\qquad r_n(1)=\frac{f_n(1)\,p_1(z)}{f_n(0)\,p_0(z)+f_n(1)\,p_1(z)}. \tag{7}$$

The general theory recommends using suitable posterior probabilities, say $r_n(1)$, $n=1,2,\ldots$, as sufficient statistics. However, in our case it is more convenient to introduce other sufficient statistics. Indeed, formulas (6) and (7) imply that

$$\frac{f_n(1)}{f_n(0)}=\frac{p\,r_{n-1}(0)+r_{n-1}(1)}{(1-p)\,r_{n-1}(0)}=\frac{p}{1-p}+\frac{1}{1-p}\cdot\frac{r_{n-1}(1)}{r_{n-1}(0)}, \tag{8}$$
$$\frac{r_n(1)}{r_n(0)}=\frac{p_1(z)}{p_0(z)}\cdot\frac{f_n(1)}{f_n(0)}=\frac{p_1(z)}{p_0(z)}\left(\frac{p}{1-p}+\frac{1}{1-p}\cdot\frac{r_{n-1}(1)}{r_{n-1}(0)}\right). \tag{9}$$

Let us set

$$X_n=\frac{r_n(1)}{p\,r_n(0)}+1,\qquad n=1,2,\ldots \tag{10}$$

Then, in view of (10), Eq. (9) reduces to

$$X_n=X_{n-1}/a(z)+1,\qquad a(z)=\frac{(1-p)\,p_0(z)}{p_1(z)}. \tag{11}$$

The linear relation between $X_n$ and $X_{n-1}$ in (11) makes the process $X_n$ very convenient to work with, and suitable to serve as the sufficient statistic for our problem.

In order to complete the reduction to a fully observable MDP it remains to express the reward $g_n$ in terms of $X_n$, and to find the conditional distribution of $\xi_n$ provided that $X_{n-1}=y$. Making use of (10), and the fact that $r_n(0)+r_n(1)=1$, we get

$$r_n(0)=\frac{1}{1+p(X_n-1)},\qquad r_n(1)=\frac{p(X_n-1)}{1+p(X_n-1)}. \tag{12}$$

On the other hand, the expected reward of the waiting action at step $n$ is equal to $r_n(0)-c\,r_n(1)$. Thus, in view of (12), the reward in the reduced MDP model should be

$$g(X_n)=1-\frac{(c+1)\,p(X_n-1)}{1+p(X_n-1)}. \tag{13}$$

Comparing (8) and (10) we get

$$\frac{f_n(1)}{f_n(0)}=\frac{p}{1-p}\,X_{n-1}. \tag{14}$$

Making use of the first equalities in (6) and (12), as well as (14), we obtain

$$P(\xi_n\in dz\mid X_{n-1}=y)=\big(f_n(1)\,p_1(z)+f_n(0)\,p_0(z)\big)\,dz$$
$$=f_n(0)\left(\frac{f_n(1)}{f_n(0)}\,p_1(z)+p_0(z)\right)dz \tag{15}$$
$$=f_n(0)\left(\frac{py}{1-p}\,p_1(z)+p_0(z)\right)dz$$
$$=\left(\frac{py}{1+p(y-1)}\,p_1(z)+\frac{1-p}{1+p(y-1)}\,p_0(z)\right)dz.$$

Thus, in the reduced fully observable MDP model the controlled process is the Markov chain $X_n=X_{n-1}/a(\xi_n)+1$. Its transitions are governed by formula (15), and each waiting action yields the income (13). The action stop yields no income and terminates the process. We want to maximize the total reward

$$R_\tau(y)=E\left[\sum_{n=1}^{\tau-1}g(X_n)\,\Big|\,X_1=y\right],\qquad \sum_{n=1}^{0}\equiv 0,$$

where $\tau$ is the time when we apply the action stop. Let us set

$$V(y)=\sup_\tau R_\tau(y).$$

The function $V(y)$ satisfies Bellman's equation

$$V(y)=\left[1-\frac{(c+1)\,p(y-1)}{1+p(y-1)}+\int_Z V\!\left(\frac{y}{a(z)}+1\right)\left(\frac{py\,p_1(z)}{1+p(y-1)}+\frac{(1-p)\,p_0(z)}{1+p(y-1)}\right)dz\right]^+. \tag{16}$$

It is strictly positive on some interval $[1,y_0)$, $y_0=y_0(c,p)$, and is equal to zero if $y\ge y_0$. If the threshold $y_0$ is known, then the optimal stopping rule is

$$\tau=\min\{n\ge 1:\ X_n>y_0\}.$$
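The recursion (11) and the threshold rule are easy to simulate. The following minimal sketch assumes illustrative unit-variance Gaussian densities $p_0=N(0,1)$, $p_1=N(1,1)$ and an illustrative value of $y_0$ (in practice $y_0$ has to be found from Eq. (16)); starting the recursion from $X_0=1$ reproduces $X_1=1/a(\xi_1)+1$, in agreement with (4), (5), and (10).

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.05                 # disorder intensity of the Ge(p) distribution
mu0, mu1 = 0.0, 1.0      # assumed means of p0 and p1 (unit variance)
y0 = 25.0                # illustrative threshold; in practice it comes from Eq. (16)

def a(z):
    # a(z) = (1 - p) * p0(z) / p1(z) for unit-variance Gaussian densities
    return (1 - p) * np.exp(0.5 * ((z - mu1) ** 2 - (z - mu0) ** 2))

theta = rng.geometric(p)          # moment of disorder: first visit to state 1
X, n = 1.0, 0                     # X_0 = 1 reproduces X_1 = 1/a(xi_1) + 1
while True:
    n += 1
    z = rng.normal(mu1 if n >= theta else mu0, 1.0)   # xi_n ~ p1 after the disorder, p0 before
    X = X / a(z) + 1.0                                # sufficient statistic, Eq. (11)
    if X > y0:                                        # stopping rule tau = min{n >= 1: X_n > y0}
        break

print(f"disorder at theta = {theta}, alarm raised at tau = {n}")
```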

3.2 Asymptotic solutions to Bellman’s equation

Multiplying (16) by $1+p(y-1)$ we get

$$(1+p(y-1))\,V(y)=\left[1-cp(y-1)+\int_Z V\!\left(\frac{y}{a(z)}+1\right)\big(py\,p_1(z)+(1-p)\,p_0(z)\big)\,dz\right]^+. \tag{17}$$

Making use of the representation $py\,p_1(z)+(1-p)\,p_0(z)=\big(1+\frac{py}{a(z)}\big)\,a(z)\,p_1(z)$, which follows from the second equality in (11), we see that the integral in the right-hand side of (17) is equal to

$$\int_Z V\!\left(\frac{y}{a(z)}+1\right)\left(1+\frac{py}{a(z)}\right)a(z)\,p_1(z)\,dz.$$

The substitution $a(z)=t$ reduces the last integral to

$$\int_{a(Z)} V\!\left(\frac{y}{t}+1\right)\left(1+\frac{py}{t}\right)t^2\,p_1^a(t)\,\frac{dt}{t}, \tag{18}$$

where $p_1^a(t)=p_1(a^{-1}(t))/a'(a^{-1}(t))$ is the transformation of the density $p_1$ by the function $a(t)$.

Let us introduce a new unknown function $U(y)=(1+p(y-1))\,V(y)$. Then, in terms of $U(y)$, and in view of (18), Eq. (17) reduces to

$$U(y)=\left[1-cp(y-1)+\int_{a(Z)} U\!\left(\frac{y}{t}+1\right)t^2\,p_1^a(t)\,\frac{dt}{t}\right]^+. \tag{19}$$

Let $c=0$. Then the optimal policy does not recommend using the action stop at all, since in this case states 1 and 2 become indistinguishable and we can only lose by stopping while the process is in state 0. Therefore, in this case the function $U$ satisfies the equation

$$U(y)=1+\int_{a(Z)} U\!\left(\frac{y}{t}+1\right)t^2\,p_1^a(t)\,\frac{dt}{t}.$$

This observation motivates us to introduce asymptotic solutions to Eq. (16).

Definition 2 The function $V(y)=\big[U(y)/(1+p(y-1))\big]^+$, $y\ge 1$, is an asymptotic solution to Eq. (16) as $c\to 0$, if the function $U(y)$ satisfies the equation

$$U(y)=1-cp(y-1)+\int_{a(Z)} U\!\left(\frac{y}{t}+1\right)t^2\,p_1^a(t)\,\frac{dt}{t}. \tag{20}$$

The asymptotic optimal threshold $y_0$ is the root of the equation $U(y)=0$.

We are going to apply the Mellin transform to Eq. (20). In order to be consistent with that transform, we assume that the set $a(Z)$ (and therefore $Z$) does not intersect the negative part of the real line.

Let us recall that the Mellin transform $M_f(s)$ of a function $f(x)$, $x>0$, is defined by the formula

$$M_f(s)=\int_0^\infty x^{s-1}f(x)\,dx,\qquad s\in\Omega_f. \tag{21}$$

It is an analytic function inside the domain $\Omega_f$, which is usually either a half-plane or a strip of the complex plane parallel to the imaginary axis. The Mellin transform is invertible, and the inversion formula is

$$f(x)=\frac{1}{2\pi i}\int_{\gamma-i\infty}^{\gamma+i\infty}M_f(s)\,x^{-s}\,ds,\qquad \gamma\in\Omega_f. \tag{22}$$
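As a small numerical illustration of definition (21), the following sketch checks the classical pair $M_{e^{-x}}(s)=\Gamma(s)$ with scipy; the helper name mellin is ours.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def mellin(f, s):
    # numerical evaluation of definition (21)
    value, _ = quad(lambda x: x ** (s - 1) * f(x), 0.0, np.inf)
    return value

for s in (0.5, 1.0, 2.5, 4.0):
    print(s, mellin(lambda x: np.exp(-x), s), gamma(s))
```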

We are looking for solutions to Eq. (20) which belong to the class of Meijer G-functions. These functions have the property that their Mellin transform is of the form

$$\Gamma\!\left[{b_1+s,\ldots,b_m+s,\ 1-a_1-s,\ldots,1-a_n-s\atop a_{n+1}+s,\ldots,a_r+s,\ 1-b_{m+1}-s,\ldots,1-b_q-s}\right]=\frac{\prod_{i=1}^{m}\Gamma(b_i+s)\,\prod_{i=1}^{n}\Gamma(1-a_i-s)}{\prod_{i=n+1}^{r}\Gamma(a_i+s)\,\prod_{i=m+1}^{q}\Gamma(1-b_i-s)}, \tag{23}$$

where $\Gamma$ is the Euler gamma function, $0\le m\le q$, $0\le n\le r$, and $a_i$ and $b_j$, $i=1,\ldots,r$, $j=1,\ldots,q$, are numbers. Formally, the Meijer G-function is given by the integral

$$G^{m,n}_{r,q}\!\left(z\,\Big|\,{(a_r)\atop(b_q)}\right)=G^{m,n}_{r,q}\!\left(z\,\Big|\,{a_1,\ldots,a_r\atop b_1,\ldots,b_q}\right)=\frac{1}{2\pi i}\int_{\gamma-i\infty}^{\gamma+i\infty}\Gamma\!\left[{b_1+s,\ldots,b_m+s,\ 1-a_1-s,\ldots,1-a_n-s\atop a_{n+1}+s,\ldots,a_r+s,\ 1-b_{m+1}-s,\ldots,1-b_q-s}\right]z^{-s}\,ds, \tag{24}$$

where $\gamma$ is a number such that all poles $-b_j-k$, $k=0,1,\ldots$ (resp. $1-a_i+k$, $k=0,1,\ldots$) remain in the left (resp. right) half-plane w.r.t. the line $\mathrm{Re}\,s=\gamma$. The properties of the Meijer G-function, as well as tables containing representations of many elementary and special functions in terms of Meijer G-functions, can be found in [8], Chapter 8.

The following formulas represent Meijer G-functions in terms of generalized hypergeometric functions:

$$G^{m,n}_{r,q}\!\left(z\,\Big|\,{(a_r)\atop(b_q)}\right)
=\sum_{k=1}^{m}\Gamma\!\left[{b_1-b_k,\hat{\ },\ldots,b_m-b_k,\ 1+b_k-a_1,\ldots,1+b_k-a_n\atop a_{n+1}-b_k,\ldots,a_r-b_k,\ 1+b_k-b_{m+1},\ldots,1+b_k-b_q}\right]z^{b_k}\,
{}_rF_{q-1}\!\left({1+b_k-(a_r)\atop 1+b_k-(b_q)'};(-1)^{r-m-n}z\right), \tag{25}$$
$$r<q,\qquad b_j-b_k\ne 0,\pm1,\pm2,\ldots,$$
$$=\sum_{k=1}^{n}\Gamma\!\left[{a_k-a_1,\hat{\ },\ldots,a_k-a_n,\ 1+b_1-a_k,\ldots,1+b_m-a_k\atop a_k-b_{m+1},\ldots,a_k-b_q,\ 1+a_{n+1}-a_k,\ldots,1+a_r-a_k}\right]z^{a_k-1}\,
{}_qF_{r-1}\!\left({1+(b_q)-a_k\atop 1+(a_r)'-a_k};(-1)^{q-m-n}z^{-1}\right), \tag{26}$$
$$r>q,\qquad a_j-a_k\ne 0,\pm1,\pm2,\ldots$$

Here, $(a_r)-a_k=a_1-a_k,\ldots,a_{k-1}-a_k,a_{k+1}-a_k,\ldots,a_r-a_k$, the prime means that the term with index $k$ is omitted from the list, and $\Gamma(a_1-a_k,\hat{\ },\ldots,a_m-a_k)\equiv\Gamma(a_1-a_k)\cdots\Gamma(a_{k-1}-a_k)\Gamma(a_{k+1}-a_k)\cdots\Gamma(a_m-a_k)$, i.e., the hat marks the omission of the term with index $k$.

In order to apply the Mellin transform to Eq. (20), we need the following lemma on the Mellin transform of the composition of two functions.

Lemma 1 Let $f(x)$ and $\varphi(x)$ be two functions, $F(s)=M_f(s)$, and assume that the Mellin transform of the function $\varphi(x)^{-u}$ exists for all $u$ from the line $\mathrm{Re}\,u=\beta$, which belongs to $\Omega_f$. Let us set $G(s,u)=M_{\varphi(x)^{-u}}(s)$. Then

$$M_{f(\varphi(x))}(s)=\frac{1}{2\pi i}\int_{\beta-i\infty}^{\beta+i\infty}G(s,u)\,F(u)\,du. \tag{27}$$

Proof. We apply the inversion formula (22) to the right-hand side of (27), taking $\gamma$ to be such that the line $\mathrm{Re}\,s=\gamma$ belongs to $\Omega_{\varphi^{-u}}$. Changing the order of integration, we get

$$\frac{1}{2\pi i}\int_{\gamma-i\infty}^{\gamma+i\infty}x^{-s}\,ds\,\frac{1}{2\pi i}\int_{\beta-i\infty}^{\beta+i\infty}G(s,u)\,F(u)\,du
=\frac{1}{2\pi i}\int_{\beta-i\infty}^{\beta+i\infty}F(u)\,du\,\frac{1}{2\pi i}\int_{\gamma-i\infty}^{\gamma+i\infty}G(s,u)\,x^{-s}\,ds$$
$$=\frac{1}{2\pi i}\int_{\beta-i\infty}^{\beta+i\infty}F(u)\,\varphi(x)^{-u}\,du=f(\varphi(x)),$$

which ends the proof.

Let us set $F(s)=M_U(s)$ and $A(s)=M_{t^2p_1^a(t)}(s)$. Applying the Mellin transform to (20), and making use of the formula for the transform of a convolution, we get

$$F(s)=\frac{cp}{s+1}-\frac{1+cp}{s}+M_{U(y+1)}(s)\,A(s). \tag{28}$$

Applying Lemma 1 to the functions $U(y)$ and $y+1$ we obtain

$$M_{U(y+1)}(s)=\frac{1}{2\pi i}\int_{\beta-i\infty}^{\beta+i\infty}B(s,u-s)\,F(u)\,du,$$

where $B(s,u-s)$ is the beta function, and $\beta$ is such that the line of integration belongs to $\Omega_U$. Substituting the last formula into (28) we arrive at the equation

$$F(s)=\frac{cp}{s+1}-\frac{1+cp}{s}+\frac{A(s)}{2\pi i}\int_{\beta-i\infty}^{\beta+i\infty}B(s,u-s)\,F(u)\,du. \tag{29}$$

If $U(y)$ is a G-function, then $F(s)$ is of the form (23), and so is the function under the integral in (29). The integral itself can be found by formula (24), and evaluated according to formulas (25) and (26). Thus, we can find $A(s)$, $p_1^a(t)$, and finally the densities $p_0(z)$ and $p_1(z)$ which correspond to the function $p_1^a(t)$. The function $V(y)=\big[U(y)/(1+p(y-1))\big]^+$ will be an asymptotic solution to Eq. (16), corresponding to these densities.

This approach has been applied in [9], where we found the stationary distribution of the Markov chain $X_n=\xi_nX_{n-1}+1$ in the case of some special i.i.d. random variables $\xi_n$, $n=1,2,\ldots$


4. The role of the cost functions for preventing intrusions in transactions under threat

Here, we briefly describe the model studied in [4, 5] which we denote by M. It consists of the following elements:

  1. State space $S=\{\text{safe},\text{danger},\text{deadend}\}$—corresponds to the different types of situations from the point of view of the risk they pose:

    1. safe—situations along the transaction in the absence of any threats;

    2. danger—situations in which the system is under the influence of security threats but is still able to recover, and

    3. deadend—situations in which the system suffers severely and crashes completely under the security threats.

  2. Action space $A=\{\text{noact},\text{respond}\}$—corresponds to the two types of actions from a risk perspective:

    1. noact—no intervention: the system goes straight to the next situation according to the recommended action and continues the normal track of execution of the current transaction, and

    2. respond—counteraction, which brings the system back to a safe situation after a malicious action deviating the transaction from its normal course.

  3. Observation space $Z=\{\text{nothreat},\text{threat},\text{crash}\}$—corresponds to the different types of events from a risk viewpoint:

    1. nothreat—an asynchronous event which is non-threatening and does not require a counteraction;

    2. threat—detection of a malicious intervention which requires a counteraction, and

    3. crash—losing control of the system without a chance of recovery.

  4. Transition kernel $q(s_{n+1}\mid s_n,a_{n+1})$—probability of the transition from situation $s_n$ to situation $s_{n+1}$ under control $a_{n+1}$, calculated as follows:

    • $q(\text{safe}\mid\text{safe})=p$, where $p$ is the probability of absence of threats after a transition from a safe situation;

    • $q(\text{danger}\mid\text{safe})=1-p$, where $1-p$ is the probability of presence of threats after a transition from a safe situation;

    • $q(\text{safe}\mid\text{danger},\text{respond})=1$, because the counteraction in a dangerous situation eliminates the threat;

    • $q(\text{deadend}\mid\text{danger},\text{noact})=1$, because the absence of a counteraction in a dangerous situation leads to an inevitable crash of the system, and

    • $q(\text{deadend}\mid\text{deadend})=1$, since there is no way out of the crash.

  5. Occurrence kernel $t(z_n\mid s_n)$—probability of the occurrence of event $z_n$ in situation $s_n$, calculated as follows:

    • $t(\text{nothreat}\mid\text{safe})=p_{11}$—probability of not observing a threat occurrence in a safe state (true negative);

    • $t(\text{nothreat}\mid\text{danger})=p_{12}$—probability of not observing a threat occurrence in a dangerous state (false negative);

    • $t(\text{threat}\mid\text{safe})=p_{21}$—probability of observing a threat occurrence in a safe state (false positive);

    • $t(\text{threat}\mid\text{danger})=p_{22}$—probability of observing a threat occurrence in a dangerous state (true positive), and

    • $t(\text{crash}\mid\text{deadend})=1$—probability of observing the system crash under threat.

  6. Costs—quantitative measures of the costs of taking actions, which can be interpreted differently depending on the needs; we consider them to be the delay caused by the additional counteractions needed to neutralize the detected threats:

    1. Running cost $r(a)$, calculated as follows: $r(\text{noact})=0$, $r(\text{respond})=c$, where $c>0$ is the penalty for executing the counteraction respond;

    2. Final reward $R(s_F)$. This is the reward we get in the final state $s_F$. We define it as follows: $R(\text{safe})=R(\text{danger})=1$, $R(\text{deadend})=0$. Thus, we consider the final state $s_F$ as an indicator of whether the transaction was successful or not.

  7. Horizon N—length of the transaction, measured by the number of situations along the transaction.

A security decision $\varphi_n$ is a function which at each step $n$ of the transaction chooses either noact or respond. A decision policy $\pi=(\varphi_1,\varphi_2,\ldots,\varphi_N)$ is a collection of security decisions such that at each step $n$ of the transaction, $\varphi_n$ depends only on the past observations and the prior probabilities of the states before the transaction has begun. We assume that the prior probability of the state deadend is 0, since otherwise any policy makes no sense. Therefore, the sum of the prior probabilities of the other two states is equal to 1, and the prior distribution of the states is determined by the prior probability $x$ of the state safe. In [4] we studied a problem with the criterion

$$v_\pi(x)=E_x^\pi\big[R(s_F)-cK\big], \tag{30}$$

where $E_x^\pi$ is the expectation corresponding to the policy $\pi$ and the prior probability $x$, and $K$ is the total number of times we apply the action respond. The value function of the POMDP model is the function

$$v(x)=\max_\pi v_\pi(x). \tag{31}$$

A policy $\pi^*$ such that $v(x)=v_{\pi^*}(x)$ is an optimal policy.

Element 6 of the model M, the running cost, is crucial. The point is that if we allow it to be arbitrary, we can face the following two extreme situations:

  • Case 1: $c$ is too small. Then the optimal policy may recommend applying the mitigating action respond permanently, without taking any observations into consideration;

  • Case 2: c is too large. Then we would not be able to neutralize all threats even if we know exactly when they occur, i.e., when both the false positives and the false negatives are equal to 0. This could happen because the total cost of the mitigating actions may exceed 1, which is the maximum reward we can get for a successful end of the transaction.

In [5], we suggested the following choice of $c$. Let $m$ be the maximum number of times the action respond can be used. Then, we require

$$mc\le 1,\qquad (m+1)c\ge 1, \tag{32}$$

and therefore,

$$\frac{1}{m+1}\le c\le\frac{1}{m}. \tag{33}$$

Since the number of attacks during the transaction has a binomial distribution with parameters $N$ and $1-p$, a suitable choice of $m$ is the $(1-\alpha)$-quantile of this distribution for a given significance level $\alpha$.
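A minimal sketch of this choice, for illustrative values of $N$, $p$, and $\alpha$ (the variable names are ours):

```python
from scipy.stats import binom

N, p, alpha = 12, 0.9, 0.05               # illustrative values
m = int(binom.ppf(1 - alpha, N, 1 - p))   # (1 - alpha)-quantile of Binomial(N, 1 - p)
c_low, c_high = 1 / (m + 1), 1 / m        # admissible interval (33) for the cost c
print(m, c_low, c_high)
```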

Remark. Formula (33) assigns to $c$ values inside the interval to which it must belong. However, in fact we should select between the end points of that interval. If we take $c=1/(m+1)$, then we favor the successful end of the transaction. Otherwise, if $c=1/m$, we focus on avoiding an unnecessary use of mitigating actions.

In the model described above, the probability of attacks, $1-p$, remains unchanged during the transaction. This makes the model too restrictive. Indeed, in real life, if the computer is infected before the transaction, it is more likely to expect intrusions at its early steps rather than at the end. Moreover, some transaction steps, like entering an IBAN for a money transfer or credentials, are more risky than the other, purely technical, steps of the transaction. Here, we relax this restriction, assuming that $p=(p_1,p_2,\ldots,p_N)$ is a vector whose entry $p_n$ is equal to the probability of staying in the state safe at the $n$-th step of the transaction. It is worth noting that $p_1$ is exactly the prior probability $x$ in the model $M$, and $1-p_1$ is a probability of attacks which, however, occur not during the transaction but before it. We shall refer to this model as a generalized model, and denote it by $M'$.

Let us discuss the running costs in this model. As we have seen, they are very important, since they control the actions we apply at different steps of the transaction. If the probability $p_n$ is large, then the corresponding running cost $c_n$ should be large as well, in order to prevent a useless choice of a mitigating action. Evidently, such an action should be applied in case of a probability $p_n$ which is small enough, and we favor its application by taking $c_n$ to be small, too. Thus, $c_n$ depends on $p_n$ (but not only), and we define the running costs in this model as a vector $c=(c_1,c_2,\ldots,c_N)$. The other elements of the models $M$ and $M'$ coincide. Eq. (30) in the model $M'$ reads as

$$v_\pi(x)=E_x^\pi\Big[R(s_F)-\sum_{i=1}^{N}c_i\,1_{\{a_i=\mathrm{respond}\}}\Big]. \tag{34}$$

We stress the dependence of cn on pn, making the following basic assumption.

Assumption 1 For any two steps $i$ and $j$ of the transaction, $c_i=c_j$ whenever $p_i=p_j$.

This assumption trivially holds in model M, and it implies that the running cost in that model is a constant.

4.1 Determining the costs $c_1,c_2,\ldots,c_N$

Here, we are going to determine the costs $c_1,c_2,\ldots,c_N$, trying to satisfy the following conditions:

  • guarantee a successful end of the transaction with a probability which is large enough;

  • avoid a useless application of the action respond when the current state of the transaction is safe.

Unlike the model $M$, where, given $N$, $p$, and a significance level $\alpha$, one should take $c$ according to the remark after formula (33), here the situation is more complicated.

We shall consider the transaction as a random trial whose sample space $\Omega$ is the set of all dyadic representations of the integers between 0 and $2^N-1$:

$$\Omega=\big\{\omega=(w_1,w_2,\ldots,w_N):\ w_n=0\ \text{or}\ 1,\ n=1,2,\ldots,N\big\}.$$

We assume that $w_n=1$ (resp. $w_n=0$) corresponds to the event that there is (resp. there is no) an intrusion at the $n$-th step of the transaction. Each elementary event $\omega\in\Omega$ is determined by the set

$$A(\omega)\in 2^{\{1,2,\ldots,N\}},\qquad A(\omega)=\{n:\ w_n=1\}, \tag{35}$$

where $2^{\{1,2,\ldots,N\}}$ is the set of all subsets of $\{1,2,\ldots,N\}$.

Formula (35) implies that there exists a one-to-one correspondence between $\Omega$ and $2^{\{1,2,\ldots,N\}}$. If necessary, we shall denote by $\omega_A$ the elementary event $\omega\in\Omega$ that corresponds to the set $A\subseteq\{1,2,\ldots,N\}$. The probability $q(\omega)$ of $\omega\in\Omega$ can be calculated by the formula

$$q(\omega)=\prod_{n=1}^{N}p_n^{1-w_n}(1-p_n)^{w_n},\qquad \omega\in\Omega. \tag{36}$$

Let us fix a significance level $\alpha$, and let

$$q^{(1)}\le q^{(2)}\le\cdots\le q^{(2^N)} \tag{37}$$

be the order statistics of the series $q(\omega)$, $\omega\in\Omega$. Denote by $A_i$ the set given by (35) that corresponds to $q^{(i)}$ (see the note at the end of the chapter), and define the integer $I$ by the formula:

$$I=\min\Big\{i:\ \sum_{j=1}^{i}q^{(j)}>\alpha\ \text{and}\ q^{(i)}<q^{(i+1)}\Big\}. \tag{38}$$
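For moderate $N$, the quantities (36)-(38) can be computed by direct enumeration. The following sketch uses the step probabilities of the example in Section 4.2 and an illustrative $\alpha$; the variable names are ours.

```python
from itertools import product
import numpy as np

p = np.array([0.6, 0.9, 0.9, 0.8, 0.7, 0.9, 0.9, 0.9, 0.9, 0.9, 0.8, 0.9])
N, alpha = len(p), 0.05

events = list(product((0, 1), repeat=N))                 # all omega = (w_1, ..., w_N)
q = np.array([np.prod(np.where(np.array(w) == 1, 1 - p, p)) for w in events])   # formula (36)

order = np.argsort(q, kind="stable")                     # ascending order statistics (37)
q_sorted = q[order]
cum = np.cumsum(q_sorted)
# smallest i with cumulative probability > alpha and q^(i) < q^(i+1), cf. (38)
I = next(i for i in range(1, 2 ** N)
         if cum[i - 1] > alpha and q_sorted[i - 1] < q_sorted[i])
print("I =", I, "cumulative probability up to I:", cum[I - 1])
```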

4.1.1 Case 1

All probabilities $p_n$, $n=1,2,\ldots,N$, are different.

First, we shall discuss the case when all probabilities $p_n$, $n=1,2,\ldots,N$, are different. Similarly to the inequalities in (32), we want the costs $c_1,c_2,\ldots,c_N$ to satisfy the constraints

$$\sum_{j=1}^{N}w_jc_j=\sum_{j\in A_i}c_j\ge 1,\qquad \omega=\omega_{A_i},\quad i=1,2,\ldots,I, \tag{39}$$
$$\sum_{j=1}^{N}w_jc_j=\sum_{j\in A_i}c_j\le 1,\qquad \omega=\omega_{A_i},\quad i=I+1,I+2,\ldots,2^N. \tag{40}$$

The mean cost of the transaction is equal to

$$C=\sum_{i=1}^{N}(1-p_i)\,c_i. \tag{41}$$

In view of the remark after formula (33), it is natural to try to minimize $C$, in order to favor the successful end of the transaction. Thus, we come to the following linear program: minimize (41) under the constraints (39) and (40). Evidently, if $N$ is large, this problem becomes hopeless to solve even with the help of supercomputers.

4.1.2 Case 2

Some of the probabilities $p_1,p_2,\ldots,p_N$ coincide.

Now, assume that some of the probabilities $p_1,p_2,\ldots,p_N$ coincide. At first glance, in this case the problem becomes even more complicated because, in view of Assumption 1, to the inequality constraints (39) and (40) one should add the equality constraints $c_i=c_j$ for all pairs of steps $(i,j)$ such that $p_i=p_j$. However, it turns out that the more equal probabilities there are among $p_1,p_2,\ldots,p_N$, the simpler the problem becomes. Let

$$P_1<P_2<\cdots<P_K,\qquad K<N,$$

be all the different numbers among $p_1,p_2,\ldots,p_N$. Then, we can represent the set $\{1,2,\ldots,N\}$ as a union of disjoint subsets $B_1,B_2,\ldots,B_K$ such that

$$B_k=\{n:\ p_n=P_k\},\qquad k=1,2,\ldots,K.$$

Assumption 1 implies that not only the probabilities $p_n$, $n=1,2,\ldots,N$, but also the costs $c_n$, $n=1,2,\ldots,N$, coincide on each of the sets $B_1,B_2,\ldots,B_K$. Denote them by $C_1,C_2,\ldots,C_K$. Then, in terms of $C_1,C_2,\ldots,C_K$, the constraints (39) and (40) as well as formula (41) reduce to

$$\sum_{k=1}^{K}d_{ik}C_k\ge 1,\qquad i=1,2,\ldots,I, \tag{42}$$
$$\sum_{k=1}^{K}d_{ik}C_k\le 1,\qquad i=I+1,I+2,\ldots,2^N, \tag{43}$$
$$C=\sum_{k=1}^{K}D_k(1-P_k)\,C_k, \tag{44}$$

where $d_{ik}$, $k=1,2,\ldots,K$, $i=1,2,\ldots,2^N$, and $D_k$, $k=1,2,\ldots,K$, are the cardinalities of the sets $A_i\cap B_k$ and $B_k$, respectively. Thus, the problem becomes: minimize (44) under the constraints (42) and (43). The dimension of this problem is $K$ rather than $N$. Moreover, the number of constraints will be less than $2^N$. In order to understand this, we give the following definition.

Definition 3 We say that the vector $a=(a_1,a_2,\ldots,a_K)$ majorizes the vector $b=(b_1,b_2,\ldots,b_K)$, and write $a\succeq b$, if $a_k\ge b_k$, $k=1,2,\ldots,K$.

Denote by $\Delta$ the matrix with entries $d_{ik}$, and let $\delta_i=(d_{i1},d_{i2},\ldots,d_{iK})$, $i=1,2,\ldots,2^N$, be its rows. If $i,j\le I$ and $\delta_i\succeq\delta_j$, then we can discard the constraint corresponding to $\delta_i$, because it will hold automatically if we satisfy the one corresponding to $\delta_j$. Similarly, if $i,j>I$ and $\delta_i\succeq\delta_j$, then we can discard the constraint corresponding to $\delta_j$, because it will hold automatically if we satisfy the one corresponding to $\delta_i$. Thus, the number of constraints in (42) (resp. (43)) is equal to the cardinality of the maximal subset of $\{1,2,\ldots,I\}$ (resp. $\{I+1,I+2,\ldots,2^N\}$) such that for any two $i$ and $j$ belonging to this subset neither $\delta_i\succeq\delta_j$ nor $\delta_j\succeq\delta_i$.
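The following sketch illustrates this reduction for an illustrative list of rows $\delta_i$; duplicate rows are assumed to have been removed beforehand, and the function names are ours.

```python
import numpy as np

def majorizes(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return bool(np.all(a >= b)) and not np.array_equal(a, b)

def essential(rows, kind):
    """Indices of the non-redundant rows: '>=' rows are dropped if they majorize
    another row, '<=' rows are dropped if they are majorized by another row."""
    keep = []
    for i, r in enumerate(rows):
        others = [rows[j] for j in range(len(rows)) if j != i]
        if kind == "ge":
            redundant = any(majorizes(r, o) for o in others)
        else:
            redundant = any(majorizes(o, r) for o in others)
        if not redundant:
            keep.append(i)
    return keep

# toy example with three '>=' rows: the first majorizes the second and is redundant
print(essential([[2, 1, 0], [1, 1, 0], [0, 2, 1]], "ge"))   # -> [1, 2]
```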

4.2 An example

A transaction consists of $N=12$ steps with probabilities

$$p=(0.6,\ 0.9,\ 0.9,\ 0.8,\ 0.7,\ 0.9,\ 0.9,\ 0.9,\ 0.9,\ 0.9,\ 0.8,\ 0.9).$$

We have $K=4$, $P=(0.6,0.7,0.8,0.9)$, $D=(1,1,2,8)$. The linear program (42)-(44) reduces to: minimize

$$0.4C_1+0.3C_2+0.4C_3+0.8C_4,$$

under the constraints

$$C_1+3C_4\ge 1,$$
$$C_2+3C_4\ge 1,$$
$$C_3+3C_4\ge 1,$$
$$C_3+C_4\ge 1/2,$$
$$C_2+C_3+2C_4\ge 1,$$
$$C_1+C_3+2C_4\le 1,$$
$$C_1+C_2+2C_4\le 1,$$
$$C_1+C_2+2C_3+C_4\le 1,$$

where

$$0\le C_1\le 1,\qquad 0\le C_2\le 1,\qquad 0\le C_3\le 1/2,\qquad C_4=1/3.$$

Its solution is $C_1=0$, $C_2=C_3=1/6$, $C_4=1/3$. Thus, at the first step of the transaction, which is the most dangerous one, one pays zero cost for counteracting. Therefore, we shall certainly prevent an attack at this step, if it occurs.
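The stated solution can be checked with a standard LP solver. The following sketch assumes, as reconstructed above, that the first five constraints are of the "$\ge$" type (42) and the last three of the "$\le$" type (43); since linprog expects "$\le$" rows, the "$\ge$" rows are multiplied by $-1$.

```python
import numpy as np
from scipy.optimize import linprog

c_obj = [0.4, 0.3, 0.4, 0.8]                     # objective (44)
A_ge = [[1, 0, 0, 3],                            # C1 + 3*C4 >= 1
        [0, 1, 0, 3],                            # C2 + 3*C4 >= 1
        [0, 0, 1, 3],                            # C3 + 3*C4 >= 1
        [0, 0, 1, 1],                            # C3 +   C4 >= 1/2
        [0, 1, 1, 2]]                            # C2 + C3 + 2*C4 >= 1
b_ge = [1, 1, 1, 0.5, 1]
A_le = [[1, 0, 1, 2],                            # C1 + C3 + 2*C4 <= 1
        [1, 1, 0, 2],                            # C1 + C2 + 2*C4 <= 1
        [1, 1, 2, 1]]                            # C1 + C2 + 2*C3 + C4 <= 1
b_le = [1, 1, 1]
A_ub = np.vstack([-np.array(A_ge), A_le])        # rewrite '>=' rows as '<=' rows
b_ub = np.concatenate([-np.array(b_ge), b_le])
bounds = [(0, 1), (0, 1), (0, 0.5), (1 / 3, 1 / 3)]   # 0<=C1,C2<=1, 0<=C3<=1/2, C4=1/3
res = linprog(c_obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(np.round(res.x, 4))                        # expected: [0, 1/6, 1/6, 1/3]
```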

4.3 Description of the optimal policy

Here, we briefly describe the optimal policy in the model $M'$, referring the reader for more details to [4], where, in the case of the model $M$, both the value function (31) and the optimal policy have been found by performing a standard reduction to a fully observable MDP. The sufficient statistics in this reduction are the posterior probabilities of the states of the transaction. Since the state deadend is absorbing, the total probability of the other two states, safe and danger, is 1, and we can consider the posterior probability of either of them, say safe, as the controlled process in the reduced MDP model. The state space of that process is $[0,1]\cup\{*\}$, where the isolated point $*$ corresponds to the state deadend. Consider the following functions defined for $x\in[0,1]$:

$$\Gamma_1(x,1-x)=\frac{p_{21}x}{p_{21}x+p_{22}(1-x)},\qquad \Gamma_2(x,1-x)=\frac{p_{11}x}{p_{11}x+p_{12}(1-x)}, \tag{45}$$
$$V_n(x)=\max_\pi E_x^\pi\Big[1_{(0,1]}(x_{N+1})-\sum_{k=n}^{N}c_k\,1_{\{a_k=\mathrm{respond}\}}\Big],\qquad n=1,\ldots,N, \tag{46}$$

where $x=x_n$ is the posterior probability of the state safe at time $n$. The function $\Gamma_1(x,1-x)$ (resp. $\Gamma_2(x,1-x)$) in (45) is the posterior probability of the state safe if we detect a threat (resp. no threat) and the prior probability of the state safe is $x$. The functions in (46) represent the maximum income we can get after the $n$-th step of the transaction. They satisfy Bellman's equation

$$V_n(x)=\max\big\{V_n^{\mathrm{na}}(x),\,V_n^{\mathrm{r}}(x)\big\}, \tag{47}$$

and the final condition

$$V_{N+1}(x)=1. \tag{48}$$

In (47), $V_n^{\mathrm{na}}(x)$ and $V_n^{\mathrm{r}}(x)$ are the one-step-ahead estimates of the actions noact and respond, respectively:

$$V_n^{\mathrm{na}}(x)=x\big(p_np_{21}+(1-p_n)p_{22}\big)V_{n+1}\big(\Gamma_1(p_n,1-p_n)\big)+x\big(p_np_{11}+(1-p_n)p_{12}\big)V_{n+1}\big(\Gamma_2(p_n,1-p_n)\big),$$
$$V_n^{\mathrm{r}}(x)=-c_n+\big(p_np_{21}+(1-p_n)p_{22}\big)V_{n+1}\big(\Gamma_1(p_n,1-p_n)\big)+\big(p_np_{11}+(1-p_n)p_{12}\big)V_{n+1}\big(\Gamma_2(p_n,1-p_n)\big).$$

The optimal strategy $\varphi_n$ at any moment of time $n=1,2,\ldots,N$ is the following:

$$\varphi_n(x)=\begin{cases}\text{noact}, & \text{if } V_n(x)=V_n^{\mathrm{na}}(x),\\ \text{respond}, & \text{if } V_n(x)=V_n^{\mathrm{r}}(x).\end{cases}$$

Eq. (47) can be solved backwards, starting with the boundary condition (48). For example, for $n=N$ we get:

$$V_N(x)=\max\{1-c_N,\ x\},$$
$$\varphi_N(x)=\begin{cases}\text{noact}, & \text{if } x\ge 1-c_N \text{ (above the threshold)},\\ \text{respond}, & \text{if } x<1-c_N \text{ (below the threshold)}.\end{cases}$$

The remaining iterations, until reaching the beginning of the transaction, can be performed recursively, taking the previously calculated solution as terminal. It is worth noting that not only $\varphi_N$, but also all $\varphi_n$, $n=1,2,\ldots,N-1$, are determined by thresholds $y_n$, $n=1,2,\ldots,N-1$, $y_N=1-c_N$, and the optimal policy recommends using a counteraction only if the posterior probability of the state safe at time $n$, $x_n$, is less than $y_n$.
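A minimal sketch of this backward recursion, in the $V_n^{\mathrm{na}}$, $V_n^{\mathrm{r}}$ notation used above, with assumed (illustrative) detection probabilities $p_{11},p_{12},p_{21},p_{22}$ and the step probabilities and costs of the example in Section 4.2. Under the formulas as reconstructed here, $V_n^{\mathrm{na}}(x)$ is linear in $x$ and $V_n^{\mathrm{r}}(x)$ does not depend on $x$, so each threshold is $y_n=1-c_n/K_n$, where $K_n$ is the common one-step continuation value; in particular, $y_N=1-c_N$ as above.

```python
import numpy as np

# assumed detection probabilities (rows of the occurrence kernel sum to 1)
p11, p21 = 0.9, 0.1      # P(nothreat | safe), P(threat | safe)
p12, p22 = 0.2, 0.8      # P(nothreat | danger), P(threat | danger)
p = [0.6, 0.9, 0.9, 0.8, 0.7, 0.9, 0.9, 0.9, 0.9, 0.9, 0.8, 0.9]            # p_n
c = [0.0, 1/3, 1/3, 1/6, 1/6, 1/3, 1/3, 1/3, 1/3, 1/3, 1/6, 1/3]            # c_n
N = len(p)

def gamma1(x):           # posterior of 'safe' after observing 'threat', prior x, cf. (45)
    return p21 * x / (p21 * x + p22 * (1 - x))

def gamma2(x):           # posterior of 'safe' after observing 'nothreat', prior x
    return p11 * x / (p11 * x + p12 * (1 - x))

K = np.zeros(N + 1)      # K[n]: slope of V_n^na, so V_n(x) = max(K[n]*x, K[n] - c_n)
y = np.zeros(N + 1)      # thresholds y_n

def V(n, x):             # value function V_n(x), with boundary condition (48)
    return 1.0 if n == N + 1 else max(K[n] * x, K[n] - c[n - 1])

for n in range(N, 0, -1):
    pn = p[n - 1]
    K[n] = (pn * p21 + (1 - pn) * p22) * V(n + 1, gamma1(pn)) \
         + (pn * p11 + (1 - pn) * p12) * V(n + 1, gamma2(pn))
    y[n] = 1 - c[n - 1] / K[n]        # respond iff the posterior of 'safe' is below y_n

print("thresholds y_1..y_N:", np.round(y[1:], 3))
print("y_N equals 1 - c_N:", np.isclose(y[N], 1 - c[-1]))
```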

After all thresholds have been found, the optimal policy recommends the following optimal behavior:

  1. At the beginning of the transaction ($n=1$), depending on whether we detect a threat or not, we calculate the posterior probability $x_1$ of the state safe by the formula

    $$x_1=\begin{cases}\Gamma_1(p_1,1-p_1), & \text{if } z_1=\text{threat},\\ \Gamma_2(p_1,1-p_1), & \text{if } z_1=\text{nothreat}.\end{cases}$$

    If $x_1$ is greater than the threshold $y_1$, then we do not apply the counteraction, $x_1$ remains unchanged, and with probability $1-x_1$ we fall into the state deadend. Otherwise, we use respond after paying the cost $c_1$, and the posterior probability becomes 1, since we know for certain that we are safe;

  2. At the next step we start with a prior probability equal to the posterior just found ($x_1$ or 1), multiplied by $p_2$, observe again whether a threat is detected, calculate the posterior probability $x_2$, compare it with the threshold $y_2$, and so on, till the end of the transaction.


5. Conclusions

We focus on two new applications of the theory of Partially Observable Markov Decision Processes (POMDPs).

Firstly, we revisit the discrete time disorder problem. The classical approach to it is based on optimal stopping rules and martingale techniques. We place it within the framework of POMDPs, which improves the quality of detection and allows us, in some cases, to find solutions to Bellman's equation.

The second application is in cybersecurity. A basic problem here is the assessment of the security risk during transaction processing. We consider the transaction as a controlled POMDP and construct an optimal policy that recommends the best behavior for neutralizing intrusions when they occur. We reveal the crucial role of the costs of counteractions in building the model. On the one hand, they should favor the successful end of the transaction, but, on the other hand, they should prevent the unnecessary use of counteractions when there is no attack. We show that the costs can be found as optimal solutions of a certain linear program.


Abbreviations

POMDP—Partially Observable Markov Decision Processes
MDP—Markov Decision Processes
IDS—Intrusion Detection Systems
ML—Machine Learning

References

  1. Shiryaev A. Optimal Stopping Rules. Berlin, Heidelberg: Springer; 2008
  2. Kulariya M et al. Performance analysis of network intrusion detection schemes using Apache Spark. In: Proc. 2016 Int. Conf. on Communication and Signal Processing (ICCSP). Melmaruvathur: IEEE; 2016. pp. 1973-1977
  3. Parmar JA. Classification based approach to create database policy for intrusion detection and respond anomaly requests. In: Proc. 2014 Conf. on IT in Business, Industry and Government (CSIBIG). Indore: IEEE; 2014. pp. 1-7
  4. Donchev D, Tonchev D, Vassilev V. Risk assessment in transactions under threat as a partially observable Markov decision process. In: Dellolmo P et al., editors. Optimization in Artificial Intelligence and Data Science, Proc. of Int. Conference on Optimization and Decision Science ODS2021, 14-17 Sep, 2021. Rome, Italy: Springer; 2021. pp. 199-212
  5. Donchev D, Tonchev D, Vassilev V. Impact of false positives and false negatives on security risks in transactions under threat. In: Fischer-Hubner S, Lambrinoudakis C, Kostis G, et al., editors. Trust, Privacy and Security in Digital Business, LNCS. Vol. 12927. Springer; 2021. pp. 50-68
  6. Bertsekas D, Shreve S. Stochastic Optimal Control: The Discrete Time Case. New York, San Francisco, London: Academic Press; 1978
  7. Dynkin E, Yushkevich A. Controlled Markov Processes. New York: Springer; 1979
  8. Brychkov U, Marychev O, Prudnikov A. Integrals and Series. Vol. 3. Moscow: Nauka; 1986
  9. Donchev D. Random series with time-varying discounting. Communications in Statistics - Theory and Methods. 2011;40(16):2866-2878

Notes

  • In fact, if the inequalities in (37) are not strict, then $q^{(i)}$ does not allow us to determine the corresponding $\omega$ and, therefore, $A_i$. We avoid this difficulty by defining $I$ by formula (38).
