
An Extension of Finite-state Markov Decision Process and an Application of Grammatical Inference

By Takeshi Shibata and Ryo Yoshinaka

Published: January 1st 2008

DOI: 10.5772/5276


1. Introduction

In this chapter, we introduce the notion of simple context-free decision processes, which are an extension of episodic finite-state Markov decision processes (MDPs). Intuitively, a simple context-free decision process can be thought of as an episodic finite-state MDP with a stack. In fact, many reinforcement learning methods can be applied to the class of simple context-free decision processes with natural modifications of their equations. On the other hand, in the grammatical inference area, some non-regular subclasses of simple grammars, such as very simple grammars and right-unique simple grammars, have been found to be efficiently identifiable in the limit from positive data. In particular, the class of right-unique simple decision processes, which are simple context-free decision processes based on right-unique simple grammars, is a superset of the class of episodic finite-state MDPs.

Because histories of episodic states can be regarded as positive data, one might expect that those positive results in the grammatical inference area could be applied to reinforcement learning directly.

However, one should note that grammars generating the same language can generate different probabilistic languages. While it is enough to find a process representing the target language in the scheme of identification in the limit, in reinforcement learning, one has to find a process representing the target probabilistic language.

Therefore, we need to modify the results from the grammatical inference area before applying them to reinforcement learning. A grammar can be more general than another in the sense that it generates all the probabilistic languages generated by the other; hence, finding a most general grammar gives a solution to this problem. This chapter, however, shows that neither the class of simple grammars nor the class of right-unique simple grammars admits most general grammars.

Besides, we show that there is an intermediate class between right-unique simple grammars and simple grammars that admits an algorithm computing a most general grammar from any two grammars whose languages coincide.

We present an algorithm that learns the optimal actions under right-unique simple context-free decision processes by combining the algorithm that learns right-unique simple grammars from positive data, the algorithm that computes a most general grammar, and the modified update equations of some usual reinforcement learning methods.

2. Notation and definitions

Before we give the definition of simple context-free MDPs, we fix some standard notation and definitions and introduce subclasses of simple grammars and probabilistic grammars.

A context-free grammar (CFG) is a quadruple denoted by <V, Σ, R, S>, where V is a finite set of nonterminal symbols, Σ is a finite set of terminal symbols, R ⊆ V × (V ∪ Σ)* is a finite set of production rules, and S ∈ V is the start symbol. Let G = <V, Σ, R, S> be a CFG. We write XAZ ⇒_G XYZ if there is a rule A → Y in R. When G is clearly identified, we simply write ⇒ instead of ⇒_G. ⇒* denotes the reflexive and transitive closure of ⇒. G is said to be reduced if and only if, for all A in V, there are some x, y, z in Σ* such that S ⇒* xAz ⇒* xyz.

L(G, X) denotes the language derived from X, i.e., { x ∈ Σ* | X ⇒_G* x }. L(G) = L(G, S) is called the language of G. Let ε denote the empty sequence. If x is a sequence, |x| denotes the length of x. For a set A, |A| denotes the cardinality of A, and |G| denotes Σ_{A→X ∈ R} (|A| + |X|).

For ease of reading, terminal symbols and nonterminal symbols are denoted by a, b, c, … and A, B, C, … respectively, and finite sequences of terminal symbols and of nonterminal symbols are denoted by x, y, z, … and α, β, γ, … respectively.

CFGs G = <V, Σ, R, S> and H = <V', Σ, R', S'> are equivalent modulo renaming of nonterminal symbols when there is a bijection φ : V → V' such that A → X is in R if and only if φ(A) → φ*(X) is in R', and φ(S) = S'. Here φ* is the homomorphism (V ∪ Σ)* → (V' ∪ Σ)* defined recursively by φ*(ε) = ε, φ*(aX) = aφ*(X) and φ*(AX) = φ(A)φ*(X).

Definition 1. Let G = <V, Σ, R, S> be a CFG. G is called a simple grammar (SG) if and only if

G is in Greibach normal form, that is, each rule of G is written as A → aα with a ∈ Σ and α ∈ V*, and

A → aα ∈ R and A → aβ ∈ R imply α = β.

The subclasses of SGs which will appear in this chapter are defined below.

Definition 2. Let G = <V, Σ, R, S> be an SG. G is called a right-unique simple grammar (RSG) if and only if A → aα ∈ R and B → aβ ∈ R imply α = β.

Definition 3. Let G = <V, Σ, R, S> be an SG. G is called a very simple grammar (VSG) if and only if A → aα ∈ R and B → aβ ∈ R imply A = B and α = β.

By the definitions, every VSG is an RSG, and every RSG is an SG.

Let G = <V, Σ, R, S> be an SG. A probability assignment P on G is a map from R to [0,1] such that Σ_{r ∈ R(A)} P(r) = 1 for all A in V, where R(A) = { A → X ∈ R }. A probabilistic simple grammar (PSG) is a pair G(P) of an SG G and a probability assignment P on G. The set of all sequences of production rules that are used in the left-most derivations S ⇒_G* x for x ∈ L(G) is called the left Szilard language of G. When G is an SG, every x in L(G) has a unique corresponding sentence in the left Szilard language; let us denote that sentence by r(G, x, 1), …, r(G, x, |x|). Then the probabilistic language of G(P), Pr[· | G(P)] : Σ* → [0,1], is defined as Pr[x | G(P)] = Pr[x | G(P), S], where

Pr[x | G(P), X] = ∏_{i=1}^{|x|} P(r(G, X, x, i)) if x ∈ L(G, X), and 0 otherwise,

and r(G, X, x, 1), …, r(G, X, x, |x|) is the sequence of rules used in the derivation X ⇒_G* x. G(P) is called consistent if and only if Σ_{x ∈ L(G)} Pr[x | G(P)] = 1. In order that Pr[· | G(P)] be regarded as a probability distribution on Σ*, G(P) is required to be consistent. A sufficient condition for consistency is known (Wetherell, 1980). Let M(G(P)) be the |V| × |V| matrix whose element m_ij(G(P)) is the expected number of occurrences of the nonterminal symbol A_j derived in one step from A_i. G(P) is consistent if ρ(M(G(P))) < 1, where ρ(M) is the spectral radius of M.
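As a concrete illustration (ours, not from the chapter; the grammar and probabilities are assumptions), the following sketch builds M(G(P)) for the PSG with rules S → aSS (probability 0.4) and S → b (probability 0.6) and checks the spectral-radius condition with numpy.

```python
# A sketch (ours): checking Wetherell's sufficient condition for consistency.
import numpy as np

# Rules of an SG as (lhs, terminal, rhs nonterminals, assumed probability):
rules = [("S", "a", ["S", "S"], 0.4),   # S -> aSS
         ("S", "b", [], 0.6)]           # S -> b
nonterminals = ["S"]
idx = {A: i for i, A in enumerate(nonterminals)}

# m_ij = expected number of occurrences of A_j produced in one step from A_i
M = np.zeros((len(nonterminals), len(nonterminals)))
for lhs, _, rhs, p in rules:
    for B in rhs:
        M[idx[lhs], idx[B]] += p

rho = max(abs(np.linalg.eigvals(M)))
print(rho, rho < 1)                     # 0.8 True: this G(P) is consistent
```

Here ρ(M) = 2 × 0.4 = 0.8; assigning S → aSS a probability above 0.5 would push ρ(M) above 1, and the derivation would no longer terminate with probability 1.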

3. An extension of finite-state Markov decision processes

3.1. A representation of episodic finite-state MDPs with grammatical formalism

In this section, we first describe the notion of simple context-free Markov decision processes using a simple example of an episodic finite-state MDP. After that, the definition of simple context-free Markov decision processes will be given.

Fig. 1 shows an episodic finite-state MDP with 5 states, {a, b, c, d, e}. 'S' indicates the initial state, and double circles indicate end states ({a, e}). The actions the robot can take are 'L' and 'R'. The robot is given a reward of -1 for every step, and an additional reward of 1 when it enters the end state 'e'. In this case, it is obvious that the best policy is to take action 'R' in every state.

Figure 1.

An episodic finite-state MDP

A possible history of states is a sequence of states representing the transition of the robot from the initial state to an end state, such as cba, cde, or cbdba. Let us call the set of possible histories the language generated by the finite-state MDP in Fig. 1. While this language is a regular language, some regular languages cannot be generated by any finite-state MDP. For example, the singleton {aae} cannot be the language of any MDP, because each letter identifies one state: the fact that aae is a possible history implies that the robot may move from the state a to a, and thus aⁿe is also a possible history for any n > 0.

Therefore, the possible histories can be described as the language of some regular grammar G, and the class of languages we need is not required to include all regular languages.

The class of simple grammars is a subclass of CFGs such that all languages generated by finite-state MDPs are generated by grammars in it. For example, a CFG whose rules are {S→cC, C→bB, B→a, B→cC, C→dD, D→e, D→cC} generates the language of the MDP in Fig. 1. The derivation of 'cbcde' is written as S ⇒ cC ⇒ cbB ⇒ cbcC ⇒ cbcdD ⇒ cbcde. This CFG is a simple grammar, and also a regular grammar, if nonterminal symbols are regarded as states (Fig. 2).

Figure 2.

Expressing the MDP in Fig. 1 by a regular grammar or simple grammar

A probabilistic CFG is defined by assigning a non-negative real number to every rule so that Σ_{r ∈ R(A)} P(r) = 1, where R(A) = { A → X ∈ R }. In a finite-state MDP, once a policy is decided, the probabilities of histories (the measure on the language) are determined. On the MDP in Fig. 1, suppose that the policy assigns probability 0.5 to the choice of 'R' or 'L' in every state. In that case, the probability of a sentence w in the language is 2^{-|w|}. The pair of a language and a measure on it is called a probabilistic language. When the policy taken by the robot changes, the probabilistic language of the MDP changes correspondingly.

Every context-free grammar generates various probabilistic languages by assigning various probabilities to its rules. The set of all probabilistic languages generated from a CFG G by assigning a probability to each rule of it is called the probabilistic generality of G. The probabilistic generality of G is a subset of { P : L(G) → [0,1] }. For instance, suppose that G is a CFG whose rules are {S→aS, S→b}. Its probabilistic generality is written as follows:

{ P : a*b → [0,1] | P(aⁿb) = qⁿ(1 - q), q ∈ [0,1] }

It is clear that the fact that two grammars generate the same language does not imply that their probabilistic generalities are the same. Suppose that H is a CFG that has the rules {S→aA, S→b, A→aA, A→b}. Obviously, L(G) = L(H), but the generality of H is

{ P : a*b → [0,1] | P(b) = 1 - q, P(aaⁿb) = q rⁿ(1 - r), q, r ∈ [0,1] },

so their generalities are different from each other.
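This difference can be checked numerically. The sketch below (our illustration, not from the chapter) fixes a member of the generality of H with q ≠ r and shows that the parameter of G forced by P(b) fails to reproduce P(ab):

```python
# A numeric sketch (ours): a distribution in the generality of H that is
# not in the generality of G. P_H(n), P_G(n) denote P(a^n b).
q, r = 0.5, 0.9                          # a member of Γ(H) with q != r
P_H = lambda n: (1 - q) if n == 0 else q * r ** (n - 1) * (1 - r)

# In Γ(G), P(b) = 1 - q' forces q' = 0.5, which then fixes P(ab):
q_prime = 1 - P_H(0)
P_G = lambda n: q_prime ** n * (1 - q_prime)
print(P_H(1), P_G(1))                    # 0.05 vs 0.25: no match in Γ(G)
```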

3.2. Simple context-free Markov decision processes

Simple context-free MDPs are formally defined as follows.

Definition 4. Let G be a simple grammar. G(U,P,C) = <V, Σ, R, S, U, P, C> is a simple context-free decision process if and only if U, P and C are the following set and functions:

U is a finite set of actions.

P : R × U → [0,1] is a probability assignment: Σ_{r ∈ R(A)} P(r, u) = 1 for all (A, u) in V × U.

C : Σ → (-∞, ∞) is a reward function.

In the following, when G is in a subclass of SGs, we call the simple context-free decision process G(U,P,C) a [name of the subclass]-DP.

Corresponding to a given SG-DP, a sequence of discrete random variables X(1), Y(1), X(2), Y(2), … is given as follows: X(1) = S, the domain of X(i) is Σ*V*, and the domain of Y(i) is U.

Pr[X(t) = x_t α_t | X(1) = S, Y(1) = u_1, …, X(t-1) = x_{t-1} α_{t-1}, Y(t-1) = u_{t-1}]
  = Pr[X(t) = x_t α_t | X(t-1) = x_{t-1} α_{t-1}, Y(t-1) = u_{t-1}]
  = P(r, u_{t-1})  if x_{t-1} α_{t-1} ⇒ x_t α_t with the rule r,
    1              if x_{t-1} α_{t-1} = x_t α_t with α_{t-1} = α_t = ε (the episode has ended),
    0              otherwise.

Clearly, X and Y form an infinite-state MDP. Every episodic finite-state MDP is equivalent to some SG-DP, as we have discussed in the beginning of this section. In fact, an episodic finite-state MDP whose states are a_1, …, a_{n+k} and whose end states are a_{n+1}, …, a_{n+k}, for some n > 0 and k ≥ 0, can be represented by the SG-DP <V, Σ, R, S, U, P, C> with

V = { A_1 (= S), …, A_n },  Σ = { a_1, …, a_{n+k} },
R = { A → a_j A_j | A ∈ V, 1 ≤ j ≤ n } ∪ { A → a_j | A ∈ V, n+1 ≤ j ≤ n+k },

where P(A_i → a_j A_j, u) equals the transition probability from i to j under action u, and P(A_i → a_{n+j}, u) equals the transition probability from i to the end state n+j. U is the same set of actions as in the MDP, and C is also the same.
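A state x_t α_t of this process is just the emitted history plus a stack of nonterminals, so an episode can be sampled with a few lines of code. The sketch below is ours; the rule encoding is an assumption, and the two actions of Fig. 1 are collapsed into one dummy action with the uniform-policy probabilities of Sec. 3.1.

```python
# A sketch (ours) of sampling one episode from an SG-DP. Each step pops the
# leftmost nonterminal and pushes the right-hand side of a sampled rule.
import random

# P[(A, u)]: list of (probability, terminal, nonterminals-to-push), encoding
# the grammar of Fig. 1 under a single dummy action and the uniform policy.
P = {
    ("S", "go"): [(1.0, "c", ["C"])],
    ("C", "go"): [(0.5, "b", ["B"]), (0.5, "d", ["D"])],
    ("B", "go"): [(0.5, "a", []), (0.5, "c", ["C"])],
    ("D", "go"): [(0.5, "e", []), (0.5, "c", ["C"])],
}

def sample_episode(policy, start="S", max_steps=1000):
    history, stack = [], [start]
    while stack and len(history) < max_steps:
        A = stack.pop(0)                       # leftmost nonterminal
        weights, outcomes = zip(*[(p, (a, alpha))
                                  for p, a, alpha in P[(A, policy(A))]])
        a, alpha = random.choices(outcomes, weights=weights)[0]
        history.append(a)
        stack = list(alpha) + stack            # left-most derivation
    return "".join(history)

print(sample_episode(lambda A: "go"))          # e.g. 'cba' or 'cde'
```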

Policies, value functions, the optimal value function, etc. are introduced below in analogy to those of MDPs. Let G(U,P,C) = <V, Σ, R, S, U, P, C> be an SG-DP.

Definition 5. A policy of G(U,P,C) is a map μ : V → U.

One of the main purposes of reinforcement learning is to determine a policy μ so as to maximise the expectation of the total reward from S.

Definition 6. A value function of G(U,P,C) under a policy μ, J_μ : V → (-∞, ∞), is defined as

J_μ(A) = Σ_{x ∈ L(G,A)} Pr[x | G(P_μ), A] Σ_{i=1}^{|x|} C(a_i),

where x = a_1 a_2 … a_{|x|} and P_μ is the probability assignment of G under μ, namely, for A → aα ∈ R,

P_μ(A → aα) = P(A → aα, μ(A)).

When ρ(M(G(P_μ))) < 1 for every policy μ, all value functions of G(U,P,C) are finite.

Definition 7. The optimal value function of G(U,P,C), denoted J* : V → (-∞, ∞), is defined as

J*(A) = sup_{μ ∈ Π} J_μ(A),

where Π is the set of all policies.

There exists some policy μ* such that J_{μ*}(A) = J*(A) for all A in V.

Definition 8. Such a μ* is called an optimal policy.

Definition 9. The optimal action-value function Q* : V × U → (-∞, ∞) is defined as

Q*(A, u) = Σ_{A → aB_1…B_k ∈ R(A)} P(A → aB_1…B_k, u) (C(a) + Σ_{i=1}^{k} J*(B_i)).

Definitions 5–9 are a natural extension of the usual definitions in reinforcement learning for finite-state MDPs whose discounting factor equals 1. Most well-known reinforcement learning methods, such as Q-learning and TD(λ), can be applied to SG-DPs under the above definitions. In the following theorem, an extended Q-learning for SG-DPs is introduced and its convergence to the optimal action-value function is established. A proof is in (Shibata et al., 2006).

Theorem 1. Assume that ρ(M(G(P_μ))) < 1 for every policy μ. Suppose a sequence (A_1, u_1, k_1), (A_2, u_2, k_2), … over V × U × [0,1] satisfies the following conditions for all (A, u) in V × U:

Σ_{(A_t,u_t) = (A,u)} k_t = ∞  and  Σ_{(A_t,u_t) = (A,u)} k_t² < ∞.

Then the sequence of random variables Q_1, Q_2, … defined by the following iteration (the extended Q-learning) converges to the optimal action-value function of G(U,P,C) as t → ∞ with probability 1:

Q_{t+1}(A_t, u_t) = (1 - k_t) Q_t(A_t, u_t) + k_t (C(a) + Σ_{i=1}^{m} max_{v ∈ U} Q_t(B_i, v)),

where a and the sequence B_1 … B_m are randomly chosen with probability P(A_t → aB_1…B_m, u_t).
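In code, the extended update differs from ordinary Q-learning only in its bootstrap term, which sums max_v Q_t(B_i, v) over every nonterminal pushed by the sampled rule; for episodic finite-state MDPs we have m ≤ 1 and the usual update is recovered. A minimal sketch (ours; the encodings and reward values are illustrative assumptions):

```python
# A sketch (ours) of one step of the extended Q-learning of Theorem 1.
from collections import defaultdict

Q = defaultdict(float)                  # Q[(nonterminal, action)] -> value

def q_update(A, u, a, Bs, k, C, U):
    """A,u were tried; the environment sampled the rule A -> a B_1...B_m."""
    bootstrap = sum(max(Q[(B, v)] for v in U) for B in Bs)
    Q[(A, u)] = (1 - k) * Q[(A, u)] + k * (C[a] + bootstrap)

# Illustrative rewards for Fig. 1: -1 per step; entering e nets -1 + 1 = 0.
C = {"a": -1, "b": -1, "c": -1, "d": -1, "e": 0}
q_update("C", "R", "d", ["D"], 0.1, C, ["L", "R"])
print(Q[("C", "R")])                    # -0.1 after one update from zero
```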

4. Identification in the limit of right-unique simple grammars from positive data

4.1. Learning simple grammars

A most rigid theoretical model for learning concepts would be identification in the limit, proposed by Gold (1967). Because histories of episodic states are regarded as positive data, one might expect that fruits of grammatical inference on identification in the limit from positive data could be directly applied to reinforcement learning. As simple context-free decision processes are a generalization of finite Markov decision processes, we should refer to results on grammatical inference of simple grammars. It is known, however, that simple languages are not identifiable in the limit from positive data, because every regular language with an endmarker is a simple language and the class of all regular languages is not identifiable in the limit from positive data (Gold, 1967).

On the other hand, Yokomori (2003, 2007) has shown that very simple grammars, which form a small subclass of simple grammars, are polynomial-time identifiable in the limit from positive data. A very simple grammar is a simple grammar such that each terminal symbol has exactly one production rule in which it occurs; therefore, the grammar has exactly as many production rules as terminal symbols. Decision processes constructed on very simple grammars are, however, too restricted, and they are no longer able to cover finite Markov decision processes.

Thus we need another, richer subclass of simple grammars that gives an extension of finite Markov decision processes and at the same time is efficiently identifiable in the limit from positive data. Here we introduce a new class of grammars, called right-unique simple grammars, which is located between simple grammars and very simple grammars. This class still defines a small proper subclass of simple languages, but it satisfies the two desired properties mentioned above.

In this section, we show the efficient learnability of right-unique simple grammars.

4.2. Identification in the limit from positive data

First we let the reader recall the notion of identification in the limit from positive data established by Gold (1967). A positive presentation of a language L* is an infinite sequence of strings where all and only elements of L* appear. Each string appearing in a positive presentation is called a positive example of L*. A learning algorithm 𝒜 is an algorithm which takes a positive presentation w_1, w_2, … as input and outputs some infinite sequence of grammars G_1, G_2, …; i.e., 𝒜 infinitely repeats the cycle where 𝒜 receives w_i and outputs G_i for i = 1, 2, …. A learning algorithm 𝒜 converges to G on a presentation w_1, w_2, … if for all but finitely many i, G_i = G holds. 𝒜 identifies a class 𝓛 of languages in the limit from positive data if, for every positive presentation of every L* ∈ 𝓛, 𝒜 converges to a grammar G generating exactly the language L*.

Usually learning algorithms are supposed to output a grammar consistent with the given positive examples, i.e., the conjectured grammar generates all the examples. Moreover, they do not change the conjecture unless the current conjecture is inconsistent with the newly given example. Our learning algorithm, which will be presented in Sec. 4.4, also has this standard property.

4.3. Right-unique simple grammars

Our learning target here is right-unique simple languages (RSLs), defined by right-unique simple grammars. A simple grammar G = <V, Σ, R, S> is called a right-unique simple grammar (RSG) if whenever both A → aα and B → aβ are rules of the grammar with a ∈ Σ and α, β ∈ V*, α = β holds. We note that G is a very simple grammar if moreover we have A = B in addition to α = β.

Let us see an example of an RSG. The grammar consisting of the following rules is an RSG:

S → ¬S | ∧SS | ∀US | pT | qTT,  T → fT | gTT | a | b | x | y,  U → x | y,

where S, T, U are nonterminal symbols and ¬, ∧, ∀, p, q, f, g, a, b, x, y are terminal symbols. The generated language is a set of formulae of first-order logic in Polish notation.

The definition of RSGs allows us to define the function #_G for each RSG G, called the shape of G, that assigns an integer to each terminal symbol:

#_G(a) = |α| - 1  if G has a rule A → aα for some A.

This function is homomorphically extended so that #_G(xy) = #_G(x) + #_G(y) for any x, y ∈ Σ* (with #_G(ε) = 0 for the empty string ε). It is easy to see that whenever α ⇒_G* xβ with x ∈ Σ* and α, β ∈ V*, we have |β| = |α| + #_G(x). Moreover, we have |α| + #_G(x') ≥ 1 for any proper prefix x' of x, because α ⇒_G* x'γ ⇒_G+ xβ entails γ ≠ ε. In particular, for w ∈ L(G), we have #_G(w) = -1 and #_G(w') ≥ 0 for any proper prefix w' of w. In this way, the function #_G strongly characterizes the derivations and the language of G. In general, we call any function # from Σ* to ℤ a shape if it holds that #(a) ≥ -1 and #(x) + #(y) = #(xy) (homomorphism) for all a ∈ Σ and x, y ∈ Σ*.

We also say that a shape # is compatible with a language L if #(w) = -1 and #(w') ≥ 0 for all w ∈ L and any proper prefix w' of w. Consequently, #_G is always compatible with L(G).

Here we note that any language L admits only a finite number of compatible shapes. This is because, if # is compatible with L and xay ∈ L, then

#(a) = #(xay) - #(x) - #(y) ≤ -1 - 0 + |y|,

because x is a proper prefix of xay and #(b) ≥ -1 for any b ∈ Σ. Therefore, a language L admits at most ∏_{a ∈ Σ} ℓ_a compatible shapes, where ℓ_a = min{ |ay| | xay ∈ L }.

To simplify our discussion, here we introduce a special form of RSGs, called canonical form. Every RSG can be transformed into canonical form preserving the language and the shape. The set of nonterminal symbols of an RSG in canonical form of shape # is exactly

V_# = { S_0 } ∪ { [a, i] | a ∈ Σ, 0 ≤ i ≤ #(a) },

and every rule has the form

A → a [a, 0] … [a, #(a)]

for some A ∈ V_#. For instance, an RSG G consisting of the rules

S → aSA,  S → bA,  A → c

is converted into G' in canonical form with the rules

S_0 → a[a,0][a,1],  S_0 → b[b,0],  [a,0] → a[a,0][a,1],  [a,0] → b[b,0],  [a,1] → c,  [b,0] → c.

Here the start symbol S of G is divided into S_0 and [a,0] in G', and A is divided into [a,1] and [b,0]. Therefore, it suffices to consider only RSGs in canonical form. Indeed, our learning algorithm for RSGs computes its conjecture in canonical form.
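Because right-uniqueness makes the right-hand side a function of the terminal symbol alone, the conversion to canonical form is mechanical: identify [a, i] with the (i+1)-th symbol of the unique right-hand side for a, and S_0 with S. A sketch under our own rule encoding (not the chapter's code):

```python
# A sketch (ours) converting an RSG into canonical form. By right-uniqueness
# the right-hand side is a function of the terminal alone, so [a, i] can be
# identified with the (i+1)-th symbol of the unique rhs for a.
def to_canonical(rules, start="S"):
    image = {("S0",): start}           # canonical nonterminal -> original one
    for _, a, alpha in rules:
        for i, A in enumerate(alpha):
            image[(a, i)] = A
    canonical = set()
    for X, B in image.items():
        for lhs, a, alpha in rules:
            if lhs == B:               # X inherits every rule of its image B
                canonical.add((X, a, tuple((a, i) for i in range(len(alpha)))))
    return canonical

# The example of the text: S -> aSA, S -> bA, A -> c
for rule in to_canonical([("S", "a", ["S", "A"]),
                          ("S", "b", ["A"]),
                          ("A", "c", [])]):
    print(rule)                        # the six rules of G'
```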

4.4. Learning algorithm

By definition, each shape admits only a finite number of RSGs in canonical form. Together with the fact that any language admits a finite number of compatible shapes, we see that only a finite number of RSLs are consistent with any given positive examples. This property is known as finite thickness. Angluin (1980) has shown that every class of languages with finite thickness is identifiable in the limit from positive data. Thus RSGs are identifiable in the limit from positive data.

Our learning algorithm outputs an RSG that generates a minimal language among all the RSLs containing the given positive examples. This strategy ensures that the algorithm finally converges to a grammar representing the target language. If the current conjecture G does not generate the target language L*, we never have L* ⊊ L(G), because of the minimality of the conjecture. Thus there is w ∈ L* − L(G), which will appear in the positive presentation of L*. The algorithm then eventually abandons the conjecture G, and the number of times the conjecture changes is finite by the finite thickness of the class of RSLs. Finally, the conjecture converges to a grammar generating the target language.

To enable us to analyze the efficiency of the learning task, we present a concrete learning algorithm for RSGs. The learning method for RSGs is basically the same as Yokomori's (2003, 2007) algorithm for very simple grammars. The first step of our algorithm is to find shapes compatible with the given positive examples. Then the algorithm computes consistent RSGs in canonical form whose shapes are those found at the first step. At last, a minimal grammar (with respect to its language) among those candidate RSGs is picked up. Fig. 3 represents this learning strategy.

Figure 3.

Flow of our learning algorithm.

To enumerate all the compatible shapes for a given set L of positive examples, it is enough to check the compatibility of shapes # satisfying -1 ≤ #(a) ≤ ℓ_a - 2, with ℓ_a = min{ |ay| | xay ∈ L }. Deciding the compatibility of a shape with a finite language is trivially done in linear time. Therefore, the enumeration of compatible shapes is done in O(ℓ^{|Σ|}) time simply by brute-force search, where ℓ = max{ |w| | w ∈ L }. This upper bound is polynomial if we fix Σ. It would be natural to ask whether a more efficient algorithm that finds a compatible shape in time polynomial also in |Σ| is possible. Concerning this question, it is known that deciding whether or not a finite language admits a compatible shape is NP-complete if we regard |Σ| as a variable.
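A direct implementation of this brute-force search is short. The following sketch (ours) computes the bounds ℓ_a from the sample and filters the candidates by the compatibility conditions; on the single example abbc it returns the two shapes #_1 and #_2 used in the example below.

```python
# A brute-force sketch (ours) enumerating all shapes compatible with a
# finite sample L, using the bound -1 <= #(a) <= l_a - 2 from the text.
from itertools import product

def compatible_shapes(L):
    sigma = sorted({a for w in L for a in w})
    # l_a = min{ |ay| : xay in L }, the shortest suffix starting with a
    l = {a: min(len(w) - i for w in L for i, b in enumerate(w) if b == a)
         for a in sigma}
    shapes = []
    for values in product(*(range(-1, l[a] - 1) for a in sigma)):
        shape = dict(zip(sigma, values))
        if all(sum(shape[b] for b in w) == -1 and
               all(sum(shape[b] for b in w[:i]) >= 0 for i in range(1, len(w)))
               for w in L):
            shapes.append(shape)
    return shapes

print(compatible_shapes(["abbc"]))
# [{'a': 0, 'b': 0, 'c': -1}, {'a': 2, 'b': -1, 'c': -1}]
```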

Once we obtain a compatible shape, one can straightforwardly construct the minimum RSG in canonical form of that shape that is consistent with the given positive examples. For a given set L of examples and a shape # compatible with L, the algorithm picks the fewest rules from the set

{ A → a[a,0]…[a,#(a)] | A ∈ V_#, a ∈ Σ }

such that the resultant grammar generates all the elements of L. For instance, when the two positive examples aabccc and bc are given, the only compatible shape # is such that #(a) = 1, #(b) = 0, #(c) = -1. To derive those two strings, exactly the rules of G' in the previous subsection are needed; thus the output of the learning algorithm is G'. This procedure can be done in almost linear time in ‖L‖, where ‖L‖ is the sum of the lengths of all the positive examples.
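The rule extraction can be implemented by simulating the unique canonical-form derivation of each example and collecting the rules it uses, as in this sketch (ours; it reuses the [a, i] encoding of the earlier sketches):

```python
# A sketch (ours): given a shape compatible with the sample, collect exactly
# the canonical-form rules used to derive each example.
def extract_rules(L, shape):
    rules = set()
    for w in L:
        stack = [("S0",)]                      # start symbol of canonical form
        for a in w:
            lhs = stack.pop(0)                 # leftmost nonterminal
            rhs = tuple((a, i) for i in range(shape[a] + 1))
            rules.add((lhs, a, rhs))
            stack = list(rhs) + stack
        assert not stack                       # guaranteed by compatibility
    return rules

# The example of the text: aabccc and bc under #(a)=1, #(b)=0, #(c)=-1
for rule in extract_rules(["aabccc", "bc"], {"a": 1, "b": 0, "c": -1}):
    print(rule)                                # the six rules of G'
```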

In general, multiple compatible shapes may be found. In that case, we compute as many grammars as there are compatible shapes and have to choose one among them as the conjecture. The criterion is to choose a grammar minimal with respect to its language; that is, we have to solve the inclusion problem of RSGs. This procedure is rather purely an issue of formal language theory, and thus we relegate this subroutine to (Yoshinaka, 2006), where it is shown that inclusion of two RSGs computed as output candidates is decidable in polynomial time in ‖L‖. Let us write the upper bound of the running time of this subroutine as p(‖L‖) for a polynomial p. In order to pick a minimal RSG among m output candidates, we execute this subroutine m - 1 times. Recall that we have m ≤ ℓ^{|Σ|}, where ℓ is the length of a longest positive example. Therefore, our algorithm updates its conjecture in O(p(‖L‖) ℓ^{|Σ|}) steps.

Let us see an example. Suppose that the first positive example is abbc. There are two compatible shapes, #_1 and #_2:

#_1 = { a ↦ 0, b ↦ 0, c ↦ -1 },  #_2 = { a ↦ 2, b ↦ -1, c ↦ -1 }.

The minimum consistent RSGs G_1 and G_2 constructed on #_1 and #_2, respectively, have the following production rules:

G_1: S_0 → a[a,0],  [a,0] → b[b,0],  [b,0] → b[b,0],  [b,0] → c;
G_2: S_0 → a[a,0][a,1][a,2],  [a,0] → b,  [a,1] → b,  [a,2] → c.

We have L(G_2) = {abbc} ⊊ L(G_1) = { abⁿc | n ≥ 1 }. Therefore, our algorithm outputs G_2.

5. Probabilistic generality of subclasses of simple grammars

5.1. Probabilistic generalities and unifiability

Definition 10. The probabilistic generality of an SG G is defined as

Γ(G) = { Pr[· | G(P)] | P is a probability assignment on G }.

An SG H is more general than G if and only if Γ(G) ⊆ Γ(H).

The following lemma establishes requirements for Γ(G) ⊆ Γ(H).

Lemma 1. Let G = <V, Σ, R, S> and H = <V', Σ, R', S'> be reduced SGs. Γ(G) ⊆ Γ(H) if and only if L(G) = L(H) and there is some map ψ : V'_2 → V_2 that satisfies the following condition, where V_2 = { A ∈ V | |R(A)| ≥ 2 } and V'_2 is defined analogously: for all A in V'_2 and all x in Σ*, S' ⇒_H* xAα implies S ⇒_G* xψ(A)β for some β.

Suppose that C is a subclass of SGs. In the following, we discuss whether C has a grammar more general than any two given grammars in C that generate the same language.

Definition 11. Let C and D be subclasses of SGs. C is unifiable within D if and only if, for all G, H in C such that L(G) = L(H), there is I in D such that

Γ(G) ∪ Γ(H) ⊆ Γ(I).

The main result of this section is the construction of an SG G* that is more general than a finite number of given RSGs whose languages are equivalent. However, neither the class of SGs nor the class of RSGs is unifiable within itself, as we see in what follows; the proofs all use Lemma 1. In the following, we say that C is unifiable when C is unifiable within C itself.

Theorem 2. The class of SGs is not unifiable.

A proof of Theorem 2 is given in (Shibata et al., 2006). The class of RSGs is also not unifiable; this fact is shown by considering the finite language L = (a|b)(c|d)(e|f) = {ace, acf, ade, adf, bce, bcf, bde, bdf}. On the other hand, the class of VSGs is unifiable. For any VSGs G and H, L(G) = L(H) implies Γ(G) = Γ(H). Suppose that S ⇒_G* xAα and S ⇒_G* yBβ. Then A = B if and only if xAα ⇒_G xaα' and yBβ ⇒_G yaβ' for some common terminal a, by the definition of a VSG. If H is a VSG such that L(G) = L(H), then S ⇒_G* xAα ⇒_G xaα' if and only if S' ⇒_H* xCγ ⇒_H xaγ' for some C. Thus, there is a bijection ψ : V → V' such that S ⇒_G* xAα if and only if S' ⇒_H* xψ(A)γ. From Lemma 1, Γ(G) = Γ(H).

5.2. A unifiable subclass of simple grammars

In this subsection, we introduce unifiable simple grammars (USGs). The class of USGs is unifiable and is a superclass of the class of RSGs. This implies that the class of RSGs is unifiable within the class of USGs. Because the proof of unifiability for the class of USGs is constructive, an algorithm that unifies a finite number of RSGs whose languages are equal is also obtained.

Let G = <V, Σ, R, S> be an SG. Let σ_G(A) = { a ∈ Σ | A → aα ∈ R }. The relation between A and B given by σ_G(A) = σ_G(B) is an equivalence relation; let Ā denote the equivalence class containing A, i.e., Ā = { A' ∈ V | σ(A) = σ(A') }. We also introduce the notation Ū = { A' ∈ V | σ(A) = σ(A') for some A ∈ U }, and the bar notation extends to sequences:

overline(A_1 … A_m) = Ā_1 … Ā_m.

Definition 12. An SG G is a unifiable simple grammar (USG) if and only if Ā = B̄, A → aα ∈ R and B → aβ ∈ R imply overline(α) = overline(β).

Neighbourhood pairs, introduced below, play the most important role in the proof of the unifiability of the class of USGs. Before we define neighbourhood pairs, we introduce two new notions.

Definition 13. For a subset U of V, up(U) = { A ∈ V | A ⇒* xB for some B ∈ U and x ∈ Σ* } is called the upstream of U.

Definition 14. Let T and U be subsets of V. W(T, U) denotes ((V − T) ∪ TU)*, i.e., the set of sequences over V in which every occurrence of a symbol of T is immediately followed by a symbol of U.
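Definition 13 suggests a simple fixed-point computation. The sketch below is ours and assumes the reduced-SG reading in which A ∈ up(U) exactly when A ∈ U or some rule A → aα has the last symbol of α already in up(U); this relies on every nonterminal of a reduced SG deriving some terminal string.

```python
# A fixed-point sketch (ours) for the upstream of U in a reduced SG:
# start from U and add every A with a rule A -> a A_1...A_k whose last
# symbol A_k is already known to be upstream.
def upstream(rules, U):
    up = set(U)
    changed = True
    while changed:
        changed = False
        for lhs, _, alpha in rules:
            if alpha and alpha[-1] in up and lhs not in up:
                up.add(lhs)
                changed = True
    return up

# Rules of the simple grammar for the MDP of Fig. 1
rules = [("S", "c", ["C"]), ("C", "b", ["B"]), ("B", "a", []),
         ("B", "c", ["C"]), ("C", "d", ["D"]), ("D", "e", []),
         ("D", "c", ["C"])]
print(sorted(upstream(rules, {"D"})))          # ['B', 'C', 'D', 'S']
```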

The upstream of U is easy to compute from R, e.g., by the fixed-point iteration sketched above. Using the above definitions, we define a neighbourhood pair as follows.

Definition 15. Let <T, U> be a pair of subsets of V. <T, U> is called a neighbourhood pair if and only if the following conditions hold:

T ∩ U = ∅;

U = Ā for some A in V;

T = up(B̄) for some B in V;

for all x ∈ Σ*, S ⇒* xα implies α ∈ W(T, U).

An intuitive meaning of a neighbourhood pair can be seen in the fourth condition of Definition 15: if a nonterminal symbol A ∈ T appears in α such that S ⇒* xα, some B ∈ U adjoins A on its right side. Fig. 4 shows an algorithm to find a neighbourhood pair. The computational cost required to find a neighbourhood pair in G is O(|G||V|).

Figure 4.

Finding a neighbourhood pair

We explain only an outline of the proof, which constructs a unified USG from two arbitrary USGs. For the complete proof, refer to (Shibata et al., 2006).

First, we find and eliminate all neighbourhood pairs from both USGs: while some neighbourhood pair is found, we eliminate it, keeping the generality of the grammar.

Figure 5.

Transformation of USGs

In Fig. 5, Φ(G, <T,U>) denotes the USG obtained by eliminating a neighbourhood pair while keeping the generality. Φ(G, <T,U>) is the USG obtained by eliminating useless nonterminal symbols and rules from the USG <V', Σ, R', S>, where

V' = (V − T) ∪ (T × U),
R' = { A → a φ(α) | A → aα ∈ R, A ∈ V − T } ∪ { (A, B) → a φ(αB) | A → aα ∈ R, (A, B) ∈ T × U }.

The map φ from V* to V'* is defined recursively as follows:

φ(ε) = ε,
φ(Aβ) = (A, B) φ(γ)  if A ∈ T and β = Bγ,
φ(Aβ) = A φ(β)  otherwise.

Φ(G, <T,U>) is more general than G.
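The displayed construction of Φ translates directly into code. In the sketch below (ours), φ pairs each T-symbol with the U-symbol that, by the neighbourhood condition, is guaranteed to follow it; the pruning of useless nonterminal symbols and rules is omitted.

```python
# A sketch (ours) of the core of Phi(G, <T,U>): every T-symbol is paired
# with the U-symbol adjoining it on the right.
def phi(seq, T):
    out, i = [], 0
    while i < len(seq):
        if seq[i] in T:
            # seq[i+1] exists by the neighbourhood-pair condition
            out.append((seq[i], seq[i + 1]))
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def eliminate_pair(rules, T, U):
    new_rules = []
    for A, a, alpha in rules:
        if A not in T:
            new_rules.append((A, a, phi(alpha, T)))       # A -> a phi(alpha)
        else:
            for B in U:                                   # (A,B) -> a phi(alpha B)
                new_rules.append(((A, B), a, phi(alpha + [B], T)))
    return new_rules
```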

Let G/σ denote the USG <V/σ, Σ, R/σ, S̄>, where

V/σ = { Ā | A ∈ V },  R/σ = { Ā → a overline(α) | A → aα ∈ R }.

Definition 16. USGs G and H are σ-isomorphic if and only if G/σ and H/σ are equivalent modulo renaming of nonterminal symbols.

When neither the USG G nor the USG H has a neighbourhood pair, L(G) = L(H) implies that G is σ-isomorphic to H. If G and H are σ-isomorphic, it is easy to unify them. In fact, let G = <V, Σ, R, S> and H = <V', Σ, R', S'> be σ-isomorphic. A unified USG, written G ⊗ H, is obtained by eliminating useless nonterminal symbols and rules from the USG <V ⊗ V', Σ, R ⊗ R', (S, S')>, where

V ⊗ V' = { (A, B) ∈ V × V' | σ(A) = σ(B) },
R ⊗ R' = { (A, B) → a (α ⊗ β) | A → aα ∈ R and B → aβ ∈ R' },
A_1…A_m ⊗ B_1…B_m = (A_1, B_1) … (A_m, B_m).

Thus we have the following theorem.

Theorem 3. The class of USGs is unifiable.

As a finite language admits a finite number of compatible shapes, any RSL has a finite number of compatible shapes. This entails that any RSL is generated by only a finite number of RSGs in canonical form. In fact, for any RSG G, we can compute all the RSGs in canonical form generating the same language. It is easy to see that if G is an RSG and H is its canonical form, i.e., L(G) = L(H), #_G = #_H and H is in canonical form, then H is more general than G. Since we have proven Theorem 3 in a constructive manner, we obtain the following theorem.

Theorem 4. For every RSG G, we can construct a USG G* which satisfies the following property. For any RSG H such that L(H) = L(G),

Γ(H) ⊆ Γ(G*).

Fig. 6 shows a unification algorithm of RSGs whose languages are equivalent.

Figure 6.

Unification of USGs

We finally describe only the order of |G*|. The time complexity of the algorithm is mainly dominated by |G*|, which is O(m(G)^{2·amb(G)}), where

m(G) = max{ |H| | H is an RSG and L(H) = L(G) }

and amb(G) is the number of equivalence classes, with respect to σ-isomorphism, of RSGs whose languages equal L(G), that is,

amb(G) = |{ H/σ modulo renaming of nonterminals | H is an RSG and L(H) = L(G) }|.

6. Implementation and experiments

6.1. Implementation of the learning algorithm for RSGs and the extension of QL

In the following, we assume that environments are RSG-DPs. The class of RSG-DPs is sufficiently large, as it includes the class of episodic finite-state MDPs, as we have described in Section 3. This is the reason why we assume that environments are RSG-DPs; compare the other classes described in this chapter. The class of VSG-DPs does not include the class of episodic finite-state MDPs. If we assume the class of SG-DPs instead of RSG-DPs as the class of environments, it is too large to learn; that is, the class of SGs is not learnable from positive data. The class of USGs is learnable in theory, but no efficient learning method for it is known yet.

Although sequences of observations can be identified with positive data of grammars and we have an efficient learning algorithm for RSGs, we cannot apply a reinforcement learning technique simply assuming that the environment is constructed on the grammar obtained by the learning algorithm in Section 4. As discussed in Section 5, it is possible that no RSG-DP based on the grammar output by the learning algorithm generates the same probabilistic language as the environment, even though both define the same (non-probabilistic) language. In this case, the RSG underlying the environment must be one of the candidates computed in the learning algorithm, though it is not chosen as the output by the nondeterministic choice. Here, we modify the learning algorithm in Section 4 so that it outputs all the RSGs generating the same language as the grammar output by the original algorithm. By applying the unification algorithm in Section 5 to the obtained RSGs, we get a USG that is more general than the RSG underlying the environment.

Combining the learning algorithm and the unification algorithm with the extended Q-learning, we obtain the algorithm in Fig. 7, which we call RSG-QL. RSG-QL does not require any grammar underlying the environment to be given; only the set of actions and the set of observations (= terminal symbols) are given.

Fig. 7 shows the update of RSG-QL for one episode, or sentence. At first, a USG G, which is intended to represent the environment, is set to an episodic finite-state MDP. This is just a tentative choice, and when a USG is computed by the learning algorithm and the unification algorithm, it is substituted for G. If G cannot derive a prefix x given by the environment, the algorithm abandons updating Q until the episode ends. After the episode ends, it is added to the set of histories, and a new USG is computed.

There are two external functions in Fig. 7; they play the same roles as in ordinary reinforcement learning. environ(x, u) returns an observation (= terminal symbol) when the prefix of the sequence of observations is x and the action that the learning robot takes is u. Supposing that the environment is identified with an RSG-DP H(U,P,C), environ(x, u) randomly returns a with probability P(A → aα, u), where S ⇒_H* xAβ. environ(x, u) itself is not controlled by the learning algorithm directly; it is controlled through the selection of actions.

The other function is strategy(Q, G, x), which decides which action the learning robot takes. It can be chosen arbitrarily within the conditions of Theorem 1. In practice, well-known strategies such as ε-greedy and softmax are often chosen (Sutton & Barto, 1998).

For an implementation of the function enumRSGs(positive_data), i.e., the learning algorithm of RSGs from positive data, we introduce a way to save computational cost (Yokomori, 2003, 2007). Suppose that the input positive data is w_1, …, w_t, and the terminal symbols which occur in the positive data at least once are a_1, …, a_n. Let M_t be the matrix whose element m_ij is the number of occurrences of a_j in w_i. As we have seen in Section 4, each compatible shape s = (#(a_1), …, #(a_n))^T satisfies M_t s = (-1, …, -1)^T. This implies that the number of independent variables is dim ker(M_t), if the other conditions for compatibility are ignored. In addition, every element of s is not less than -1. Thus the number of candidates for compatible shapes is O((max_i |w_i|)^{dim ker(M_t)}).

lim_{t→∞} dim ker(M_t) is constant for any positive data of G, so we denote it by dim(G). If an upper bound on dim(G) is known, the algorithm can postpone the enumeration of candidates until dim ker(M_t) becomes no greater than that bound.
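Since each example contributes the linear constraint Σ_j m_ij #(a_j) = -1, the free dimension dim ker(M_t) is n minus the rank of M_t. A numpy sketch (ours):

```python
# A numpy sketch (ours): the free dimension of the shape constraints.
import numpy as np

def dim_ker(examples, sigma):
    M = np.array([[w.count(a) for a in sigma] for w in examples])
    return len(sigma) - np.linalg.matrix_rank(M)

sigma = list("abcdefghij")
examples = ["bcefijhi", "bcecdijhi", "bcdhi", "acefijgi"]
print(dim_ker(examples, sigma))   # enumerate candidates once this is small
```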

Figure 7.

RSG-QL for one episode

Now let us see how our learning algorithm works through an example. Let G be an RSG whose rules are

S → a[a,0][a,1][a,2],  {[a,2], [c,0]} → d,  [a,1] → g,  [e,2] → j,
S → b[b,0][b,1],  [c,0] → e[e,0][e,1][e,2],  [b,1] → h[h,0],
{[a,0], [b,0], [e,0]} → c[c,0],  [e,0] → f,  {[a,2], [e,1], [h,0]} → i,

where {A, B} → x abbreviates the two rules A → x and B → x.

Positive data from G is, for instance, {bcefijhi, bcecdijhi, bcdhi, acefijgi, bcdhi, acececefijijijgi, …}. Fig. 8 shows the transition of dim ker(M_t), the number of terminal symbols appearing so far, |Σ_t|, and the number of compatible shapes after t sentences of some positive data.

We assumed that dim(G) < 5; when dim ker(M_t) becomes less than 5, candidates for possible shapes are enumerated.

Figure 8.

dim ker(M_t) and the number of candidates for possible shapes

After 100 sentences were input, the algorithm output the grammar H whose rules are

S → a[a,0][a,1],  {[g,0], [c,0]} → d,  [a,1] → g[g,0],  [e,2] → j,
S → b[b,0][b,1],  [c,0] → e[e,0][e,1][e,2],  [b,1] → h[h,0],
{[a,0], [b,0], [e,0]} → c[c,0],  [e,0] → f,  {[e,1], [g,0], [h,0]} → i.

L(G) = L(H), although G and H have different shapes. While there are 15 candidates for possible shapes, recall that not all of them yield grammars generating the same minimal language. The ambiguity amb(G) is usually smaller than the number of candidates; in this case, the ambiguity of G is 4.

6.2. An experiment for RSG-QL

Finally, as an experiment for RSG-QL, we consider the problem of maximizing total reward in a maze (Fig. 9).

A robot starts from the position st = (1,2) on the map of Fig. 9 and moves towards the goal gl = (9,2). The robot observes the position where it is. The robot can take four kinds of actions: left, right, up or down. The robot is given a reward of -1 per step.

Figure 9.

The map of the maze.

The difference from ordinary maze problems is the way reward is given after the goal. The robot is allowed to occupy the location f+ = (5,6) or f- = (5,2) at most once each. If the robot arrives at the goal, it is given one of two different kinds of reward depending on its path: if the robot has passed through f+, it observes h+ with probability 0.9 and h- with probability 0.1; on the other hand, if the robot has passed through f-, it observes h+ with probability 0.1 and h- with probability 0.9. The reward corresponding to the observation of h+ is 100, and that of h- is 50. So the best behaviour of the robot is to pass through f+.

Formally, the RSG G=<V,Σ,R,S> representing the environment is written as follows.

Σ = MAP ∪ { h+, h- },

V = { [a,0] | a ∈ MAP, a ≠ gl } ∪ { [f+,1], [f-,1], S },

R = { [a,0] → b[b,0] | a → b, b ≠ gl, f+, f- }
  ∪ { [a,0] → f+ [f+,0][f+,1] | a → f+ }
  ∪ { [a,0] → f- [f-,0][f-,1] | a → f- }
  ∪ { [a,0] → gl | a → gl }
  ∪ { [f+,1] → h+,  [f+,1] → h-,  [f-,1] → h+,  [f-,1] → h-,  S → st[st,0] },

where MAP = { (i,j) | (i,j) is a reachable position on the map in Fig. 9 }, and a → b if and only if the robot can move from a to b in one step. a → b → c denotes a → b and b → c. For instance, (4,6) → f+ → (6,6), while (6,6) → f+ does not hold, and gl → a holds for no a.

There is another RSG H = <V', Σ, R', S> such that L(G) = L(H), where V' and R' are as follows:

V' = { [a,0] | a ∈ MAP } ∪ { S },

R' = { [a,0] → b[b,0] | a → b } ∪ { [gl,0] → h+,  [gl,0] → h-,  S → st[st,0] }.

Fig. 10 shows the shapes of G and H for gl, f+ and f-. The RSG-DP based on G cannot be identified with any finite-state MDP, while the one based on H is identified with an episodic finite-state MDP.

Figure 10.

(Left) The shape of G. (Right) The shape of H.

G and H are the only RSGs in canonical form whose languages are equivalent to L(G). Thus both G and H, and only G and H, are enumerated by the learning algorithm for RSGs from positive data.

The USG G* = <V*, Σ, R*, S>, which is the result of unifyRSG(G, H), i.e., the algorithm in Fig. 6, is as follows:

V* = { [a,0] | a ∈ WEST } ∪ { [a,0]+, [a,0]- | a ∈ EAST } ∪ { [f+,0]+, [f-,0]-, [f+,1], [f-,1], S },

R* = { [a,0] → b[b,0] | a → b, b ∈ WEST }
   ∪ { [a,0]+ → b[b,0]+,  [a,0]- → b[b,0]- | a → b, b ∈ EAST }
   ∪ { [a,0]+ → gl[f+,1],  [a,0]- → gl[f-,1] | a → gl }
   ∪ { [(4,6),0] → f+ [f+,0]+,  [(4,2),0] → f- [f-,0]- }
   ∪ { [f+,1] → h+,  [f+,1] → h-,  [f-,1] → h+,  [f-,1] → h-,  S → st[st,0] },

where WEST denotes the left half of MAP, that is, { (i,j) ∈ MAP | i ≤ 4 }, and EAST denotes the right half of MAP, that is, { (i,j) ∈ MAP | i ≥ 6 }.

Figure 11.

(Left) Total reward and episode length of RSG-QL. (Right) Comparison of naive QL and RSG-QL.

Note that G* is not an RSG. The SG-DP based on G* is not an RSG-DP, and hence it cannot be identified with any episodic finite-state MDP.

The optimal episode length for this problem is 16 when the robot passes f+, and thus the maximum expected total reward is 79 (= 0.9 × 100 + 0.1 × 50 − 16). The left side of Fig. 11 shows a result of the experiment with the RSG-QL algorithm of Fig. 7 on the above maze problem. The algorithm converges to one USG as the basis of the environment at approximately the 200th episode. The result demonstrates that the robot approaches the optimal path and obtains the maximum total reward after the completion of the inference. On the right side of Fig. 11, our method is compared to the naive Q-learning method, in which the environment is assumed to be an episodic finite MDP (the same as H). The total reward obtained by naive Q-learning is approximately 40, indicating that the robot passes through f- and fails to maximize the total reward.

7. Conclusion

In this chapter, we presented two new notions. One is an extension of episodic finite-state MDPs from the point of view of grammatical formalism; well-known methods of reinforcement learning extend to this setting easily. The other is the probabilistic generality of grammars and their unifiability. This notion plays an important role in applying recent results from the grammatical inference area: the difficulty in applying them is the need to consider the probabilistic generality of grammars, because even if the languages of two grammars are equivalent, their generalities are not necessarily equivalent. We presented the idea of unifiability and a method to unify grammars in some grammar classes to overcome this difficulty.

Episodic finite-state MDPs can be extended by using the class of SGs and its subclasses: VSGs, RSGs, USGs and SGs themselves. Although the class of SG-DPs is large enough to contain all episodic finite-state MDPs, the class of SGs is neither learnable from positive data nor unifiable. The class of VSGs is efficiently learnable in the limit from positive data, but the class of VSG-DPs does not include the class of episodic finite-state MDPs. The class of USG-DPs contains all episodic finite-state MDPs and the class of USGs is unifiable, but no efficient learning algorithm for it is known yet. Among these subclasses of simple grammars, the class of RSGs is the only one that is efficiently learnable, is unifiable (within the class of USGs) and contains all episodic finite-state MDPs at the same time. The class of RSGs is not unifiable within itself, but it is unifiable within the class of USGs.

Finally, we presented the method RSG-QL for RSG-DPs, combining the extended Q-learning with the learning algorithm and the unification algorithm. Using a reward-maximization problem in a simple maze, we demonstrated that RSG-QL learns the best answer, while naive QL does not, when the environment is regarded as an RSG-DP. The advantage of RSG-QL is that it is applicable to a wider class of environments, requiring no prior knowledge except that the environment can be regarded as an RSG-DP. On the other hand, RSG-QL requires the environment to be precisely identified with some RSG-DP; otherwise, the learning algorithm for RSGs from positive data does not work well. In future work, it is necessary to find algorithms that are more robust against errors. That might be established by using statistical learning methods for grammatical inference.

© 2008 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike-3.0 License, which permits use, distribution and reproduction for non-commercial purposes, provided the original is properly cited and derivative works building on this content are distributed under the same license.
