## 1. Introduction

In this chapter, we introduce the notion of simple context-free decision processes, which are an extension of episodic finite-state Markov decision processes (MDPs). Intuitively, a simple context-free decision process can be thought of as an episodic finite-state MDP with a stack. In fact, many reinforcement learning methods can be applied to the class of simple context-free decision processes with natural modification on their equations.On the other hand, in grammatical inference area, some non-regular subclasses of simple grammars, such as very simple grammars and right-unique simple grammars, have been found to be efficiently identifiable in the limit from positive data. Especially, the class of right-unique simple decision processes, which are simple context-free processes based on right-unique simple grammars, is a superset of the class of episodic finite-state MDPs.

Because episodic states histories are regarded as positive data, one might expect that those positive results in grammatical inference area could be applied to reinforcement learning directly.

However, one should note that grammars generating the same language can generate different probabilistic languages. While it is enough to find a process representing the target language in the scheme of identification in the limit, in reinforcement learning, one has to find a process representing the target probabilistic language.

Therefore, we need to modify the results in grammatical inference area for applying them to reinforcement learning. Actually, a grammar can be more general than another in the sense that it generates all the probabilistic languages generated by the other. Hence, finding a most general grammar gives a solution to this problem. This chapter however shows that both classes of simple grammars and right-unique simple grammars do not admit most general grammars.

Besides, we show that there is an intermediate class between right-unique simple grammars and simple grammars that admits an algorithm computing a most general grammar from any two grammars whose languages coincide.

We present an algorithm that learns the optimal actions under right-unique simple context-free processes, by concatenating the algorithm learning right-unique simple grammars from positive data, the one computing a most general grammar and the modified update equations of some usual reinforcement learning methods.

## 2. Notation and definitions

Before we give the definition of simple context-free MDPs, we write some standard notation and definitions and introduce subclasses of simple grammars and probabilistic grammars.

A context-free grammar (CFG) is a quadruple denoted by * <V, Σ, R, S>*, where

*is a finite set of nonterminal symbols,*V

*is a finite set of terminal symbols, R*Σ

L(G, X) denotes a language derived from X, i.e.,

In order to be easy to read, terminal symbols and nonterminal symbols are denoted by a,b,c,… and A,B,C,… respectively, and finite sequence of terminal symbols and nonterminal symbols are denoted by …,x,y,z and ,β,γ,… respectively.

CFGs G=<V, Σ, R, S> and H=<V’, Σ, R’, S’> are equivalent modulo renaming of nonterminal symbols when there is a bijection φ:V→V’ such that A→X is in R if and only if φ(A)→φ*(X) is in R’, and φ(S)=S’. φ* is a homomorphism (V

Definition 1. Let G=<V, Σ, R, S> be a CFG. G is called a simple grammar (SG) if and only if

G is in Greibach normal form, that is, for each rule of G is written as

The subclasses of SGs which will appear in this chapter are defined below.

Definition 2. Let G=<V, Σ, R, S> be an SG. G is called a right-unique simple grammar (RSG) if and only ifFrom definitions, a VSG is an RSG, and an RSG is an SG.

Let G=<V, Σ, R, S> be an SG. A probability assignment P on G is a map from R to [0,1] such that

r(G,X,x,1),…,r(G,X,x,|x|) are the sequence of rules used in the derivation X

## 3. An extension of finite-state Markov decision processes

### 3.1. A representation of episodic finite-state MDPs with grammatical formalism

In this section, first, we describe the notion of the simple context-free Markov decision processes, by using a simple example of an episodic finite-state MDP. After that, the definition of simple context-free Markov decision process will be given.

Fig.1 is an episodic finite-state MDP, which contains 5 states, {a, b, c, d, e}. ‘S’ indicates the initial state, and double circles indicate end states ({a, e}). The actions the robot can take are ‘L’ and ‘R’. Reward given to the robot is -1 for every step, and if the robot gets in the end state ‘e’, 1 is given. In this case, it is obvious that, if the robot takes ‘R’ action for every state, the best policy is acquired.

A possible history of states is a sequence of states representing the transition of the robot from the initial state to an end state, such as {cba, cde, cbdba, … }. Let us call the set of possible histories the language generated by the finite-sate MDP in Fig. 1. While this language is a regular language, some regular languages cannot be generated by any finite-state MDPs. For example a singleton {aae} cannot be the language of any MDP, because each letter identifies one state. The fact that aae is a possible history implies that the robot may translate from the state a to a. Thus,

Therefore, the possible histories can be also described as a language for some regular grammar G, and its language class is not required to include the class of regular languages.

The class of simple grammars is one of the subclasses of CFGs such that all languages generated by finite-state MDPs are generated by them. For example, a CFG whose rules are {S→cC, C→bB, B→a, B→cC, C→dD, D→e, D→cC} generates the language of the MDP in Fig.1. The derivation of ‘cbcde’ is written as S

A probabilistic CFG is defined by assigning a non-negative real number to every rule, where

Every context-free grammar generates various probabilistic languages by assigning various probabilities to each rule of it. The set of all probabilistic languages generated from a CFG G by assigning a probability to each rule of it is called the probabilistic generality of G. The probabilistic generality of G is a subset of

It is clear that the fact that grammars A and B generate the same language does not imply its probabilistic generalities of them are the same. Suppose that H is a CFG that has rules {S→aA, S→b, A→aA, A→b}. Obviously, L(G) = L(H). But the generality of H is

So generalities of them are different from each other.

### 3.2. Simple context-free Markov decision processes

Simple context-free MDP is formally defined as follows

Definition 4. Let G be an simple grammar. G(U,P,C) =<V,Σ,R,S,U,P,C> is a simple context-free decision process if and only if U, P and C are the following set and functions.

U is a finite set of actions.

P : R×U → [0,1] is a probabilistic assignment. For all (A,u) in V×U,

C : Σ → (-∞,∞) is a reward function.

In the following, when G is in a subclass of SGs, we call the simple context-free decision process G(U,P,C) [the name of the subclass]-DP.

Corresponding to a given SG-DP, a sequence of discrete random variables X(1), Y(1), X(2), Y(2), … is given as follows. X(1) =S, the domain of X(i) is Σ*V* and the domain of Y(i) is U.

Clearly, X and Y are an infinite-state MDP. Every episodic finite-state MDP is equivalent to some SG-DP as we have discussed in the beginning of this section. In fact, an episodic finite-state MDP whose states are

Policies, value-function, optimal value function, etc. are introduced below in analogues to those of MDPs. Let G(U,P,C) = <V, Σ, R, S,U,P,C> be an SG-DP.

Definition 5. A policy of G(U,P,C) is a map

One of the main purposes of reinforcement learning is to determine a policy μ so as to maximise the expectation of the total reward from S.

Definition 6. A value function of G(U,P,C) under a policy μ,

where

When

Definition 7. The optimal value function of G(U,P,C) denoted as

where

There exists some policy

Definition 8.

Definition 9. The optimal action-value function

Definitions 5 - 9 are a natural extension of the usual definitions on reinforcement learning for finite-state MDPs, whose discounting factor equals 1. Most of well known reinforcement learning methods, such as Q-learning, TD(λ) can be applied to SG-DPs corresponding to the above definitions. In the following theorem, an extended Q-learning for SG-DPs is introduced and its convergence to the optimal action-value is established. A proof is in (Shibata et al., 2006).

Theorem 1. Assume that

The sequence of random variables

where the sequence

## 4. Identification in the limit of right-unique simple grammars from positive data

### 4.1. Learning simple grammars

A most rigid theoretical model for learning concepts would be identification in the limit proposed by Gold (1967). Because episodic states histories are regarded as positive data, one might expect that fruits of grammatical inference on identification in the limit from positive data could be directly applied to reinforcement learning. As simple context-free decision processes are a generalization of finite Markov decision processes, we should refer to results on grammatical inference of simple grammars. It is known that however simple languages are not identifiable in the limit from positive data, because every regular language with an endmarker is a simple language and the class of whole regular languages is not identifiable in the limit from positive data (Gold 1967).

On the other hand, Yokomori (2003, 2007) has shown that very simple grammars, which form a small subset of simple grammars, are polynomial-time identifiable in the limit from positive data. A very simple grammar is a simple grammar such that each terminal symbol has exactly one production rule in which it occurs. Therefore, the grammar has exactly the same number of production rules as terminal symbols. Decision processes constructed on very simple grammars are, however, too restricted and they are no longer able to cover finite Markov decision processes.

Thus we need another richer subclass of simple grammars that should give an extension of finite Markov decision processes and at the same time it should be efficiently identifiable in the limit from positive data. Here we introduce a new class of grammars, called right-unique simple grammars, which is located between simple grammars and very simple grammars. This class still defines a small proper subclass of simple languages, but actually it satisfies the two desired properties mentioned above.

In this section, we show the efficient learnability of right-unique simple grammars.

### 4.2. Identification in the limit from positive data

First we let the reader recall the notion of identification in the limit from positive data established by Gold (1967). A positive presentation of a language L_{*} is an infinite sequence of strings where all and only elements of L_{*} appear. Each string appearing in a positive presentation is called a positive example of L_{*}. A learning algorithm_{1},w_{2},… as input, and outputs some infinite sequence of grammars G_{1},G_{2},…, i.e.,_{i} and outputs G_{i} for i = 1, 2,…. A learning algorithm_{1},w_{2},… if for all but finitely many i, G_{i} = G holds._{*}.

Usually learning algorithms are supposed to output a grammar consistent with the given positive examples, i.e., the conjectured grammar generates all the examples. Moreover, they do not change the conjecture unless the current conjecture is inconsistent with the newly given example. Our learning algorithm, which will be presented in Sec. 3.4, also has this standard property.

### 4.3. Right-unique simple grammars

Our learning target here is right-unique simple languages defined by right-unique simple grammars. A simple grammar

Let us see an example of an RSG. The grammar consisting of the following rules is an RSG:

where S,T,U are nonterminal symbols and

The definition of RSGs allows us to define the function

This function is homomorphically extended so that

We also say that a shape

Here we note that any language L admits a finite number of compatible shapes. This is because, if

because x is a proper prefix of xay and

To simplify our discussion, here we introduce a special form of RSGs, called canonical form. Every RSG can be transformed into canonical form with preserving the language and the shape. The set of nonterminal symbols of an RSG G in canonical form of shape

and every rule has the form

for some.

is converted into G' in canonical form with the rules

Here the start symbol S of G is divided into S_{0} and [a,0] in G’ and A is divided into [a,1] and [b,0]. Therefore, it is allowed to consider only RSGs in canonical form. Actually our learning algorithm for RSGs computes its conjecture in canonical form.

### 4.4. Learning algorithm

By definition, each shape has a finite number of RSGs in canonical form. Together with the fact that any language admits a finite number of compatible shapes, we see that there is a finite number of RSLs consistent with the given positive examples. This property is known as finite thickness. Angluin (1980) has shown that every class of languages with finite thickness is identifiable in the limit from positive data. Thus RSGs are identifiable in the limit from positive data.

Our learning algorithm outputs an RSG that generates a minimal language among all the RSLs containing the given positive examples. This strategy ensures that the algorithm finally converges to a grammar representing the target language. If the current conjecture G does not generate the target language L_{*}, we never have,_{*}. Then the algorithm eventually abandons the conjecture G and the times changing the conjecture is finite by the finite thickness of the class of RSLs. Finally the conjecture converges to a grammar generating the target language.

To enable us to analyze the efficiency of the learning task, we present a concrete learning algorithm for RSGs. The learning method for RSGs is basically same as Yokomori's (2003, 2007) algorithm for very simple grammars. The first step of our algorithm for learning RSGs is to find shapes compatible with the given positive examples. Then the algorithm computes consistent RSGs in canonical form whose shapes are those found at the first step. At last a minimal (with respect to its language) grammar among those candidate RSGs is picked up. Fig. 3 represents this learning strategy.

To enumerate all the compatible shapes for a given set L of positive examples, it is enough to check the compatibility of shapes

Once we obtain a compatible shape, one can straightforwardly construct the minimum RSG in canonical form of that shape that is consistent with the given positive examples. For a given set L of examples and a shape

so that the resultant grammar generates all the elements of L. For instance, when two positive examples aabccc, bc are given, the only compatible shape

In general, multiple compatible shapes would be computed. In that case, we will compute grammars as many as the compatible shapes and have to choose one among those as the conjecture. The criterion is to choose a minimal grammar with respect to the language. That is, we have to solve the inclusion problem of RSGs. This procedure would be rather purely an issue of formal language theory and thus we relegate this subroutine to (Yoshinaka 2006), where it is shown that inclusion of two RSGs computed as output candidates is decidable in polynomial time in

Let us see an example. Suppose that the first positive example is abbc. There are two compatible shapes

The minimum consistent RSGs G_{1} and G_{2} constructed on

We have._{2}.

## 5. Probabilistic generality of subclasses of simple grammars

### 5.1. Probabilistic generalities and unifiablity

Definition 10. The Probabilistic generality of an SG G is defined asAn SG H is more general than G if and only if

The following lemma establishes requirements for

Suppose that C is a subclass of SGs. In the following, we will discuss whether C has a more general grammar than arbitrary two grammars that generate the same language.

Definition 11. Let C and D be subclasses of SGs. C is unifiable within D if and only if, for all G, H in C such that L(G) = L(H), there is I in D such that

The main result of this section is construction of an SG G* that is more general than a finite number of given RSGs whose languages are equivalent. However, neither the class of SGs nor the class of RSGs is unifiable within itself, as we see in what follows. The proofs for all of them use Lem. 1. In the following, we say that C is unifiable when C is unifiable within C itself.

Theorem 2. The class of SGs is not unifiable.A proof of Theorem 2 is written in (Shibata et al., 2006). The class of RSGs is also not unifiable. This fact is showed by considering the finite language L= (a|b)(c|d)(e|f) = {ace,acf,ade,adf,bce,bcf,bde,bdf}. On the other hand, the class of VSGs is unifiable. For any VSGs G and H, L(G)=L(H) implies

### 5.2. A unifiable subclass of simple grammars

In this subsection, we will introduce unifiable simple grammars. The class of USGs is unifiable and is a superclass of the class of RSGs. This implies that the class of RSGs is unifiable within the class of USGs. Because the proof of unifiability for the class of USGs is constructive, the algorithm that unifies a finite number of RSGs whose languages are equal is also given.

Let G = <V,Σ,R,S > be an SG. Let

Neighbourhood pairs introduced below play the most important role in constructing the proof of the unifiability of the class of USGs. Before we define neighbourhood pairs, it is necessary that two new notions are introduced.

Definition 13. For a subset U of V,

The upstream of U is easy to compute from R. Using the above definitions, we define a neighbourhood pair as follows.

Definition 15. Let <T,U> be a pair of subsets of V. <T,U> is called a neighbourhood pair if and only if the following conditions hold.

For some A in V,

For some B in V,

For all x,

An intuitive meaning of a neighbourhood pair can be seen in the fourth condition of Definition 15. If a nonterminal symbol

We explain only the abstract of the proof, which is constructing a unified USG from two arbitrary USGs. For the complete proof of it, refer to (Shibata et al., 2006).

First, we find and eliminate all neighbourhood pairs from both USGs. While some neighbourhood pair is found, we eliminate it keeping the generality of the grammar.

In Fig. 5, Φ(G, <T,U>) denotes the USG obtained by eliminating a neighbourhood pair keeping the generality. Φ(G, <T,U>) is a USG obtained by eliminating useless nonterminal symbols and rules from a USG <V’,Σ,R’,S > where

The above map φ from V* to V’* is defined recursively as the following.

Φ(G, <T,U>) is more general than G.

Let G/σ denote a USG <V/σ,Σ,R/σ,S>, where

When neither a USG G nor a USG H has neighbourhood pair, L(G) = L(H) implies that G is σ-isomorphic to H. If G and H are σ-isomorphic, it is easy to unify them. In fact, let G=< V,Σ,R,S > and H=< V’,Σ,R’,S’ > be σ-isomorphic. A USG defined as

Thus we have the following theorem.

Theorem 3. The class of USGs is unifiable.As a finite language admits a finite number of compatible shapes, any RSL has a finite number of compatible shapes. This entails that any RSL is generated by only a finite number of RSGs in canonical form. In fact, for any RSG G, we can compute all the RSGs in canonical form generating the same language. It is easy to see that if G is an RSG and H is the canonical form of G, i.e., L(G)=L(H),

Theorem 4. For every RSG G, we can construct a USG G* which satisfies the following property. For any RSG H such that L(H) = L(G),

Fig. 6 shows a unification algorithm of RSGs whose languages are equivalent.

We finally describe only the result of the order of |G*|. The time complexity of the algorithm is mainly dominated by |G*|. |G*| is

and amb(G) is the number of the equivalence classes with respect to σ-isomorphism whose languages equal to L(G), that is,

## 6. Implementation and experiments

### 6.1. Implementation of the learning algorithm for RSGs and the extension of QL

In the following, we assume that environments are RSG-DPs. The class of RSG-DPs is sufficiently large and includes the class of episodic finite-state MDPs, as we have described in Section 3. Those are the reasons why we assume that environments are RSG-DPs. For instance, let us consider other classes described in this chapter. The class of VSG-DPs does not include the class of episodic finite-state MDPs. If we assume the class of SG-DPs in stead of RSG-DPs as the class of environments, it is too large to learn, that is, the class of SGs is not learnable from positive data. The class of USGs is learnable theoretically, but efficient learning method is not known yet.

Although sequences of observations can be identified as positive data of grammars and we have an efficient learning algorithm for RSGs, we cannot apply a technique of reinforcement learning with assuming that the environment is constructed on the grammar obtained by the learning algorithm in Section 4. As discussed in Section 5, it is possible that any RSG-DP based on the grammar output by the learning algorithm does not generate the same probabilistic language as the environment, though both define the same (nonprobabilistic) language. In this case, actually the RSG that bases the environment must be one candidate computed in the learning algorithm, though it is not chosen as the output by the nondeterministic choice. Here, we modify the learning algorithm in Section 4 so that it outputs all the RSGs generating the same language as the grammar output by the original algorithm. By applying the unification algorithm in Section 5 to obtained RSGs, we get a USG that is more general than the RSG that bases the environment.

Combining the learning algorithm and the unification algorithm with an extended Q-learning, we obtain the algorithm in Fig. 7. Let us call that algorithm RSG-QL. RSG-QL does not require any grammar which bases the environment to be given. Only the set of actions and the set of observations (= terminal symbols) are given.

Fig. 7 shows update of the RSG-QL for one episode, or sentence. At first, a USG G, which intends to represent the environment, is set to an episodic finite-state MDP. This is just a tentative one, and when a USG is computed from the learning algorithm and the unification algorithm, it is substituted for G. If G cannot derive a prefix x given by the environment, the algorithm abandon updating Q until the episode ends. After the episode ends, it is added to the set of histories, and new USG is computed.

There are two new external functions in Fig. 7. Those are the same things in the ordinary reinforcement learning. environ(x,u) returns an observation (= terminal symbol) when a prefix of the sequence of observations is x and the action that the learning robot takes is u. Suppose that the environment is identified with an RSG-DP H(U,P,C), then environ(x,u) randomly returns a with the probability

The other function is strategy(Q,G,x). strategy(Q,G,x) is the function that decides which action the learning robot takes. It can be arbitrarily chosen within the condition of Theorem 1. Actually, famous strategies such as eps-greedy and soft-max are often chosen (Sutton & Barto 1998).

For an implementation of the function enumRSGs(positive_data), i.e., the learning algorithm of RSGs from positive data, we introduce the way to save computational cost (Yokomori 2003, 2007). Suppose that input positive data is

Now let us see how our learning algorithm works through an example. Let G be an RSG whose rules are

Positive data from G is, for instance, {bcefijhi, bcecdijhi, bcdhi, acefijgi, bcdhi, acececefijijijgi, …}. Fig. 8 shows the transition of dimker(

We assumed that

After 100 sentences were input, the algorithm output the grammar H whose rules are

L(G)=L(H) although G and H have different shapes. While there are 15 candidates for possible shape, recall that all of them do not give the same minimal language. The ambiguity of G is usually less than it. In this case, the ambiguity of G is 4.

### 6.2. An experiment for RSG-QL

Finally, as an experiment for RSG-QL, we consider the problem of maximizing total reword in a maze (Fig. 9).

A robot starts from the position st=(1,2) on the map of Fig. 9 and moves towards the goal gl=(9,2). The robot observes the position where it is. The robot can take four kinds of actions, left, right, up or down. The robot is given a reward -1 per single step.

The difference from ordinary maze problems is the way to give reward after the goal. The robot is allowed to occupy a location either f+=(5,6) or f-=(5,2) at most once. If the robot arrives at the goal, the robot is given two different kinds of reward that depend on its path. If the robot has passed through f+, it observes h+ with probability 0.9 and h- with 0.1. On the other hand, if the robot has passed through f-, it observes h+ with 0.1 and h- with 0.9. The reward corresponding to observation of h+ is 100, and of h- is 50. So, the best action of the robot is to pass through f+.

Formally, the RSG G=<V,Σ,R,S> representing the environment is written as follows.

where MAP = {(i,j)|(i,j) is a reachable position on the map in Fig. 9}, and

There is another H=<V’,Σ,R’,S> such that L(G) = L(H), where V’ and R’ are as follows.

Fig.10 shows the shapes of G and H for gl, f+ and f-. the RSG-DP based on G cannot identified with any finite-state MDP, while the one on H is identified with an episodic finite-state MDP.

G and H are all the RSGs in canonical form whose languages are equivalent to L(G). Thus both G and H, and only G and H are enumerated by the learning algorithm for RSGs from positive data.

The USG G*=<V*,Σ,R*,S> which is the result of unifyRSG(G,H), i.e. the algorithm in Fig. 6, is as follows.

where WEST denotes the left half of MAP, that is,

Note that G* is not an RSG. Because the SG-DP based on G* is not an RSG-DP, it is not identified with any episodic finite-state MDP.

The optimal length of episode of this problem is 16 when the robot passes f+, and thus the maximum total reward is 79. The left side of Fig. 11 shows a result of the experiment of the SG-QL algorithm in Fig. 7 on the above maze problem. The algorithm converges to one USG as the basis of the environment at approximately the 200th episode. The result demonstrates that the robot approaches the optimal path and obtains maximum total reward after the completion of the inference. In the right side of Fig. 11, our method is compared to the naive Q-Learning method, in which the environment is assumed to be an episodic finite MDP (the same thing as H). The total reward obtained by the naive Q-Learning method is approximately 40, indicating that the robot is passing through f- and failing to maximize total reward.

## 7. Conclusion

In this chapter, we presented two new notions. One is an extension of episodic finite-state MDPs from the point of view of grammatical formalism. We can extend well-known methods of reinforcement learning and apply them to this extension easily. The other is the probabilistic generalities of grammars and unifiability of them. This notion plays an important role to apply the recent results of grammatical inference area. The difficulty with applying them is the need of consideration for the probabilistic generality of grammars. The reason is that, if the languages of two grammars are equivalent, it is not necessary that the generalities of them are also equivalent. We presented the idea of unifiability and a method to unify grammars in some grammar class to overcome the difficulty.

Episodic finite-state MDPs can be extended by using the class of SGs and its subclasses; VSG, RSG, USG and SG itself. Although the class of SG-DPs is enough large to contain all episodic finite-state MDPs, the class of SGs is neither learnable from positive data nor unifiable. The class of VSGs is efficiently learnable in the limit from positive data, but the class of VSG-DPs does not include the class of episodic finite-state MDPs. The class of USG-DPs contains all episodic finite-state MDPs and the class of USGs is unifiable, but no efficient learning algorithm for it is known yet. In the four subclasses of simple grammars, the class of RSGs is the only class that satisfies efficient learnability and unifiablity and contains all episodic finite-state MDPs at the same time. The class of RSGs is not unifiable within itself, but it is unifiable within the class of USGs.

Finally, we presented the method RSG-QL for RSG-DPs, combining the extended Q-learning with the learning algorithm and the unification algorithm. Using a maximize-reward problem in a simple maze, we demonstrated that RSG-QL learns the best answer, but the naive QL does not, when the environment is regarded as an RSG-DP. The advantage of RSG-QL is that it is applicable for the wider class of environment with requiring no prior knowledge except that the environment is regarded as an RSG-DP. On the other hand, RSG-QL requires the environment to be precisely identified with some RSG-DP, otherwise the learning algorithm for RSGs from positive data does not work well. In future work, it is required to find algorithms that are stronger for errors. That might be established by using statistical learning methods for grammatical inferences.