InTech uses cookies to offer you the best online experience. By continuing to use our site, you agree to our Privacy Policy.

Computer and Information Science » Numerical Analysis and Scientific Computing » "New Achievements in Evolutionary Computation", book edited by Peter Korosec, ISBN 978-953-307-053-7, Published: February 1, 2010 under CC BY-NC-SA 3.0 license. © The Author(s).

Chapter 9

GA-Based Q-Learning to Develop Compact Control Table for Multiple Agents

By Tadahiko Murata and Yusuke Aoki
DOI: 10.5772/8054

Article top

Overview

Outline of QDSEGA
Figure 1. Outline of QDSEGA
Q-table created from a set of individuals
Figure 2. Q-table created from a set of individuals
Transportation task
Figure 3. Transportation task
Chromosome representation for the agents in Figure 3
Figure 4. Chromosome representation for the agents in Figure 3
An example of Q-table with a set of chromosomes
Figure 5. An example of Q-table with a set of chromosomes
Gain attained by Q-learning generated by QDSEGA
Figure 6. Gain attained by Q-learning generated by QDSEGA
The number of actions in Q-table generated by QDSEGA
Figure 7. The number of actions in Q-table generated by QDSEGA
Gain attained by Q-learning generated by QDSEGA applied the neighboring crossover when obtaining 100 reward
Figure 8. Gain attained by Q-learning generated by QDSEGA applied the neighboring crossover when obtaining 100 reward
The number of actions in Q-table generated by QDSEGA applied the neighboring crossover when obtaining 100 reward
Figure 9. The number of actions in Q-table generated by QDSEGA applied the neighboring crossover when obtaining 100 reward
Gain attained by Q-learning generated by QDSEGA applied the neighboring crossover and the deletion algorithm when obtaining 100 reward
Figure 10. Gain attained by Q-learning generated by QDSEGA applied the neighboring crossover and the deletion algorithm when obtaining 100 reward
The number of actions in Q-table generated by QDSEGA applied the neighboring crossover and the deletion algorithm when obtaining 100 reward
Figure 11. The number of actions in Q-table generated by QDSEGA applied the neighboring crossover and the deletion algorithm when obtaining 100 reward
Succession of the states to achieve the goal
Figure 12. Succession of the states to achieve the goal
Achievement of carrying the load to the goal
Figure 13. Achievement of carrying the load to the goal

GA-Based Q-Learning to DevelopCompact Control Table for Multiple Agents

Tadahiko Murata1 and Yusuke Aoki1

1. Introduction

Recently reinforcement learning has received much attention as a learning method (Sutton, 1988, Watkins & Dayan, 1992). It does not need a priori knowledge and has higher capability of reactive and adaptive behaviors. However there are some significant problems in applying it to real problems. Some of them are deep cost of learning and large size of action-state space. The Q-learning (Watkins & Dayan, 1992), known as one of effective reinforcement learning, has difficulty in accomplishing learning tasks when the size of action-state space is large. Therefore the application of the usual Q-learning is restricted to simple tasks with the small action-state space. Due to the large action-state space, it is difficult to apply the Q-learning directly to real problems such as control problem for robots with many redundant degrees of freedom or multiple agents moving cooperatively one another.

In order to cope with such difficulty of large action-state space, various structural and dividing algorithms of the action-state space were proposed (Holland, 1986; Svinin et al., 2001; Yamada et al., 2001). In the dividing algorithm, the state space is divided dynamically, however, the action space is fixed so that it is impossible to apply the algorithm to the task with a large action space. In the classifier system, “don’t care” attribute is introduced in order to create general rules. But, that causes partially observable problems. Furthermore, an ensemble system of general and special rules should be prepared in advance.

Considering these points, Ito & Matsuno (2002) proposed a GA-based Q-learning method called “Q-learning with Dynamic Structuring of Exploration Space Based on Genetic Algorithm (QDSEGA).” In their algorithm, a genetic algorithm is employed to reconstruct an action-state space which is learned by Q-learning. That is, the size of the action-state space is reduced by the genetic algorithm in order to apply Q-learning to the learning process of that space. They applied their algorithm to a control problem of multi-legged robot which has many redundant degrees of freedom and a large action-state space. By applying their restriction method for the action-state space, they successfully obtained the control rules for a multi-legged robot by their QDSEGA. However, the way to apply a genetic algorithm in their approach seems so straightforward. Therefore we have proposed a crossover for QDSEGA (Murata & Yamaguchi, 2005; Murata & Yamaguchi, 2008). Through their computer simulations on a control problem of a multi-legged robot, they could make about 50% reduction of the number of generations to obtain a target state of the problem.

In this chapter, we apply the QDSEGA with the neighboring crossover to control multiple agents. An application of QDSEGA to multiple agent system has been considered (Ito and Gofuku, 2003; Ito et al., 2004) though, they still applied genetic operators straightforward. We apply the neighboring crossover to Multi Agent Simulations (MAS) problem and show its effectiveness to reduce the number of actions in a Q-table. We also propose a deletion algorithm to make more compact Q-table in MAS problem. We employ the application in Ito et al. (2004) where a Q-table is developed for homogeneous multiple agents. Computer simulation results show that the size of Q-table can be reduced by introducing the proposed neighboring crossover and the deletion algorithm.

2. QDSEGA

In this section, we briefly explain the outline of QDSEGA (Ito & Matsuno, 2002; Ito & Gofuku, 2003; Ito et al., 2004; Murata & Yamaguchi, 2005). QDSEGA has two dynamics. One is a learning dynamics based on Q-learning and the other is a structural dynamics based on Genetic Algorithm. Figure 1 shows the outline of QDSEGA. In QDSEGA, each action is represented by an individual of a genetic algorithm. According to actions defined by a set of individuals, an action-state space called Q-table is created. Q-learning is applied to the created Q-table. Then the learned Q-table is evaluated through simulations. A fitness value for each action is assigned according to Q-table. After that, each individual (i.e., each action) is modified through genetic operations such as crossover and mutation. We show some details in these steps in the following subsections, and show our proposed method for crossover and a deletion algorithm in the next section.

2.1. Action encoding

Each individual expresses a selectable action on the learning dynamics. It means that a set of individuals is selected by genetic operations, and a learning dynamics is applied to the subset. After the evaluation of the subset of actions, a new subset is restructured by genetic operations.

media/image1.jpeg

Figure 1.

Outline of QDSEGA

media/image2.jpg

Figure 2.

Q-table created from a set of individuals

2.2. Q-table

An action-state space called Q-table is created from the set of individuals. When several individuals are the same code, only one action is used in the action-state space to avoid the redundancy of actions. Figure 2 shows this avoidance process.

2.3. Learning dynamics

In QDSEGA, the conventional Q-learning (Watkins & Dayan, 1992) is employed as a learning dynamics. The dynamics of Q-learning are written as follows:

Q(s,a)(1α)Q(s,a)+α{r(s,a)+γmaxaQ(s,a)}
(1)
,

where Q(s,a) is a Q-value of the state s and the action a , r is the reward, α is the learning rate, and γ is the discount rate.

2.4. Fitness

The fitness fit(ai) for each action is calculated by the following equation:

fit(ai)=fitQ(ai)+kffitu(ai)
(2)
,

where fitQ(ai) is a fitness value for action ai calculated from Q-table, fitu(ai) is a fitness value for action ai calculated from the frequency of use, and kf is a non-negative constant value to determine the ratio of fitQ(ai) and fitu(ai) . We show the detail explanation of these factors in this subsection.

(a) Fitness of Q-table

The fitness of Q-table fitQ(ai) is calculated from Q-values in the current Q-table. In order to calculate fitQ(ai) for each action ai the following normalization is taken place in advance as for the Q-values in the current Q-table.

First, calculate the maximum and minimum value of each state as follows:

Vmax(s)=maxa(Q(s,a))
(3)
,
Vmin(s)=mina(Q(s,a))
(4)
.

Then Q of the normalized Q-table is given as follows:

If Q(s,a)0 Then Q(s,a)=1pVmax(s)Q(s,a)+p
(5)
Else (Q(s,a)0)Then Q(s,a)=pVmin(s)Q(s,a)+p
(6)

where p is a constant value which means the ratio of reward to penalty. After this normalization process, we fix the action ai and sort Q(s,ai) according to their value from high to low for all states. We define the sorted Q(s,ai) as Qs(s,ai) , and Qs(1,ai) means the maximum value of Q(s,ai) , and Qs(Ns,ai) means the minimum value of Q(s,ai) , where Ns is the size of states. Using the normalized and sorted Q-value Qs(s,ai) , the fitness of action ai is calculated as follows:

fitQ(ai)=j=1Ns(wjk=1jQ's(k,ai)j)
(7)
,

where wj is a weight which decides the ratio of special actions to general actions.

(b) Fitness of Frequency of Use

The fitness of frequency of use fitu(ai) is introduced to save important actions. That fitness is defined as follows:

fitu(ai)=Nu(ai)/j=1NaNu(aj)
(8)
,

where Na is the number of all actions of one generation and Nu(ai) is the number of times which ai was used for in the Q-learning of this generation. Important actions are used frequently. Therefore the actions with high fitness value of fitu(ai) are preserved by this fitness value.

2.5. Genetic algorithm and neighboring crossover

Ito & Matsuno (2002) says “the method of the selection and reproduction is not main subject so the conventional method is used.” They employed a crossover that exchanges randomly selected bits between the parent individuals according to the crossover probability Pc . They mutated each bit according to the mutation probability Pm . They did not replace parent individuals with offspring. Therefore the number of individuals is increased by the genetic operations. As for the elite preserving strategy, they preserve 30% individuals with the highest fitness value.

Since they did not modify genetic operators for QDSEGA, we have proposed a crossover operation for the multi-legged robot control problem (MRC problem) in (Murata & Yamaguchi, 2005; Murata & Yamaguchi, 2008). We developed a neighboring crossover for QDSEGA for MRC problems.

The crossover employed in QDSEGA (Ito & Matsuno, 2002) causes drastic change in the phenotype of a solution since randomly selected bits are changed between two solutions. If the change of solution in phenotype is so drastic, the good part of the solution may be broken. In order to avoid causing such drastic change among solutions, we proposed a crossover between similar parent solutions. We define the similarity by the number of the same genes in the same locus of a chromosome. We introduced a parameter ksim to denote the similarity. Thus, the crossover is applied among individuals that have the same genes more than ksim .

This kind of the restriction for the crossover has been proposed in the research area of distributed genetic algorithms (DGAs). Researches on DGAs can be categorized into two areas: coarse-grained genetic algorithms (Tanese, 1989; Belding, 1995) and fine-grained genetic algorithms (Mandelick & Spiessens, 1989; Muhlenbein et al., 1991; Murata et al., 2000). In the coarse-grained GAs, a population, that is ordinarily a single, is divided into several subpopulations. Each of these subpopulations is individually governed by genetic operations such as crossover and mutation, and subpopulations communicate each other periodically. Algorithms in this type are called the island model because each subpopulation can be regarded as an island. On the other hand, several individuals are locally governed by genetic operations in fine-grained GAs. In a fine-grained GA, each individual exists in a cell, and genetic operations are applied to an individual with individuals in neighboring cells. The DGAs are known to have an advantage to keep the variety of individuals during the execution of an algorithm, and avoid converging prematurely.

While we don’t define any solution space such as cells or islands in our proposed crossover, our restriction in crossover operation may have the same effect of keeping variety in a population and attain the effective search.

3. Transportation task using Q-learning

3.1. Transportation task

We consider a transportation task shown in Ito & Gofuku (2003) and Ito et al. (2004). Figure 3 shows a transportation task used in this chapter. There is a world with 25 cells and a goal cell shown in “G” where five agents exist in Cell 0 and Cell 4. The aim of the transportation task is to convey a load shown in “L1” to the goal cell. In order to carry “L1” to “G”, the

media/image41.jpg

Figure 3.

Transportation task

other load shown in “L2” should be removed from Cell 22. Simultaneously, the door of “G” should be opened before carrying “L1”. To open the door, the switch shown in “SW” should be pushed by an agent in Cell 23.

Each load has a mobile direction. As shown in Figure 3, “L1” can be moved only in the vertical direction, and “L2” only in the horizontal direction. To move a load, more than one agent should push it toward the same movable direction. Therefore, to convey “L1” to “G”, agents should remove “L2” from Cell 22, open the door, and move “L1” to the goal.

In order to control actions of each agent, Ito & Gofuku, (2003); Ito et al., (2004) employed their QDSEGA where the state of the agent is handled as a chromosome of an individual to which genetic operators are applied. Figure 4 shows the chromosome representation of the agent location in Figure 3. Each chromosome consists of genes with the same number of agents. The figure in each gene shows the identification number of cell where the agent locates. Since the agents with odd number locate in Cell 0, all genes for those agents have 0 as its value.

media/image42.jpg

Figure 4.

Chromosome representation for the agents in Figure 3

media/image43.jpg

Figure 5.

An example of Q-table with a set of chromosomes

3.2. Q-learning in our simulation

Q-table of the Q-learning is generated using a set of chromosomes. Figure 5 shows an example of Q-table that shows the relations of agent locations.

In Figure 5, each column in the Q-table shows a chromosome generated by genetic operations. Rows of the table consist of the same chromosomes of the columns. That is, the chromosomes in the rows act as states in the Q-table, and the chromosomes in the column act as actions the agent can take in the fired state. When the ten agents locates in the start position (Cell 0 or Cell 4 as in Fig. 3), the current position is shown as (0, 4, 0, 4, 0, 4, 0, 4, 0, 4) in the table. If the target position (10, 3, 11, 4, 3, 13, 21, 11, 18, 22) is selected as an action from the current position, Agent “1” moves from Cell 0 to Cell 10, Agent “2” moves from Cell 4 to Cell 3, and so on.

With this Q-table, Q-learning is applied. In order to move each agent to the target position, Ito & Gofuku, (2003) and Ito et al., (2004) proposed the following rules.

<R1: The rule to decide a path to a target position>
                If  Then , ,
                Else if  Then , ,
                Else , .

where x(i)xt are the coordinates of the target position, and x(i+1)=x(i)sgn(x(i)xt)Δx are the coordinates of the current position of the agent in time i. Using this rule, the agent moves in horizontal direction first, then it moves vertically to the target position.

<R2: The rule to avoid a collision>
               If obstacle is on the course that is given by R1 Then
                  If the obstacle is load Then Employ R3
                  Else Don’t move
               Else Move using R1

Since the collision between agents is assumed to avoid using traffic rules, Ito & Gofuku, (2003); Ito et al., (2004) considered only the collision between an agent and a load. If the load can not be carried by the agent alone, it should stop until other agents come.

<R3: The rule to move the load>
               If Load is on the course that is given by R1 Then Push the Load to the way that the agent               has to go
               Else Move using R1

If the way that the agent has to go is not the direction to which the load can be moved, the agent should stop beside the load.

<R4: The rule to open the door>
                If Switch is in a cell where the agent stops Then Turn on the switch to open the door
                Else Nothing is done

3.3. A deletion algorithm to create more compact control table

When we observe a Q-table developed by QDSEGA, some actions or chromosomes are not used in moving multiple agents. That is, unnecessary actions are generated through genetic operations. In order to make a compact Q-table, we mark the chromosomes that are not used for a prespecified term in Q-Learning process.

4. Computer simulation

4.1. Parameter specifications

In this section, we show the simulation results to compare the conventional QDSEGA and the QDSEGA with the neighboring crossover shown in Subsection 2.5 and the deletion algorithm in Subsection 3.3. The neighboring crossover can be applied to the parent solutions that have the same genes more than y(i+1)=y(i) . In this paper, we employed y(i)yt . Since y(i+1)=y(i)sgn(y(i)yt)Δy means the crossover between the same chromosomes, we did not use. When x(i+1)=x(i) , the crossover is applied between any parent solutions. The deletion algorithm is applied when the reward for the developed Q-table becomes larger than 100. This means that the deletion algorithm is applied after attaining the goal by multiple agents.

We employed the same parameter specifications as shown in Ito & Gofuku (2003) and Ito et al. (2004) except the learning rate and the discount rate in Equation (1). We found better specifications for those parameters by preliminary simulations:

[Genetic Algorithm]

The number of individuals: 300,

Selection: Roulette selection,

Type of crossover: uniform crossover,

The probability of crossover: 0.2,

Type of mutation: change the value among valid cell number,

The probability of mutation: 0.001,

The number of generations: 100,

Weights in Equation (2): x(i+1)=x(i) ,

Weights in Equation (7): y(i+1)= y(i) , xt,yt .

[Q-learning]

Reward:When “L1” reaches the goal, x(i), y(i) ,

When “L1” moves up or “L2” is removed, ksim ,

When “L1” moves down or “L2” blocks the course of “L1”, ksim=0, 2, 4, 6, 8 ,

When any agent can not move to the target position, ksim=10 ,

Learning rate in Equation (1): ksim=0 ,

Discount rate in Equation (1): kf=200 ,

w1=0.5, wNs=0.5 -greedy action selection: 10% random action,

The number of trials of each learning dynamics: 10,000.

4.2. Simulation results

Figures 6 and 7 show that the average reward for the obtained Q-table and an average number of actions (or situations) in Q-table. The average reward for Q-table is calculated over the last 100 trials among 10,000 trials. The maximum average reward is 130. These figures show that the proposed QDSEGA with wi=0 (i=2,...,Ns1) could obtain the better or similar average reward with comparing to the algorithm without the neighboring crossover. As for the number of actions, the larger r=100 enables the less number of actions as shown in Figure 7. This shows that a compact Q-table can be obtained using the proposed neighboring crossover. Obtaining a compact Q-table enables users to find important actions to control the multiple agents.

In order to obtain a compact Q-table with high average reward, we apply our proposed neighboring crossover after the average reward becomes larger than 100. Since the neighboring crossover is applied to the similar parent solutions, that crossover often produces the offspring that is the same chromosome. This causes the reduction of the size of Q-table. As shown in Figure 7, the number of actions in the Q-table reduced rather than the previous QDSEGA. However, this reduction may prevent improving the performance in the average reward.

media/image70.jpg

Figure 6.

Gain attained by Q-learning generated by QDSEGA

media/image71.jpg

Figure 7.

The number of actions in Q-table generated by QDSEGA

Figures 8 and 9 show that the average reward and the average number of actions in Q-table. From these figures, we can see that the proposed QDSEGA can keep the high average reward with any value of r=20 . Figure 9 shows that the large value of r=20 enables to reduce the number of actions in Q-table.

media/image73.jpg

Figure 8.

Gain attained by Q-learning generated by QDSEGA applied the neighboring crossover when obtaining 100 reward

media/image74.jpg

Figure 9.

The number of actions in Q-table generated by QDSEGA applied the neighboring crossover when obtaining 100 reward

Although the neighboring crossover has an effect to reduce the number of actions, there are some actions that are not used in moving agents. Therefore, we apply the deletion algorithm in Subsection 3.3. Figures 10 and 11 show the results of QDSEGA with neighboring crossover and the deletion algorithm. From these figures, we can see that the deletion algorithm does not degrade the performance in the average reward but have a fine effect to reduce the number of actions. By combining the neighboring crossover and the deletion algorithm, we could obtain more compact control table with high performance than using the previous algorithms.

media/image75.jpg

Figure 10.

Gain attained by Q-learning generated by QDSEGA applied the neighboring crossover and the deletion algorithm when obtaining 100 reward

media/image76.jpg

Figure 11.

The number of actions in Q-table generated by QDSEGA applied the neighboring crossover and the deletion algorithm when obtaining 100 reward

Table 1 shows the average number of actions obtained at the final generation. From this table, we can see that the number of actions is reduced by the neighboring crossover and the deletion algorithm. Especially the deletion algorithm could reduce it without degrading the performance of the developed control table using neighboring crossover.

After obtaining a compact control table, we can examine the states and actions that are used to reach the goal. We can see that in order to achieve the task to bring “L1” to the goal, only two actions are required from the initial states shown in Figure 3. For example, the two actions in Figure 12 are enough to convey “L1” to the goal with ten agents. Figure 13 shows the states or positions of the agents according to the obtained states shown in Figure 12.

Without Deletion AlgorithmPrevious2468
# of actions222.2210.8196.9149.296.5
With Deletion AlgorithmPrevious2468
# of actions46.639.028.030.225.1

Table 1.

Size of the Q-table at the final generation

media/image77.jpg

Figure 12.

Succession of the states to achieve the goal

media/image78.jpeg

Figure 13.

Achievement of carrying the load to the goal

5. Conclusion

In this chapter, we show the effectiveness of the neighboring crossover and the deletion algorithm especially in reducing the size of the Q-table. By reducing the Q-table, it becomes easy to read the Q-table that is required for attaining the objective to reach the goal and minimizes the memory to store the developed control table.

As for other further study, we can bring other objective functions to achieve the goal. In Figures 6, 8, and 10, we compared the average reward as shown in the previous study (Ito and Gofuku, 2003, Ito et al., 2004). From these figures, we could minimize the total moving cost of all the agents to achieve the goal.

Furthermore, Ito and Gofuku (2003) examined the effectiveness of QDSEGA for multi-agent system with heterogeneous ability. We can show the effectiveness of the neighboring crossover in that problem too.

6. Acknowledgements

This work was partially supported by the MEXT, Japan under Collaboration with Local Communities Project for Private Universities starting 2005.

References

1 - T. C. Belding, 1995 The distributed genetic algorithm revisited. Proceedings of 6th International Conference on Genetic Algorithms, 114 121 , 1-55860-370-0 of Pittsburgh, July 1995, Morgan Kaufmann Publishers, Inc., San Francisco, USA
2 - J. H. Holland, 1986 Escaping brittleness: the possibilities of general purpose learning algorithms applied to parallel rulebased system, Machine Learning: An artificial intelligence approach, 2 593 623 , Morgan Kaufmann Publishers Inc., 0-93461-300-1 Francisco, USA
3 - K. Ito, F. Matsuno, 2002 A study of reinforcement learning for the robot with many degrees of freedom-Acquisition of locomotion patterns for multi legged robot-, Proceedings of IEEE International Conference on Robotics and Automation, 392 397 , 0-78037-272-7 D.C., USA, May 2002, Institute of Electrical and Electronics Engineers, Inc., Piscataway, USA
4 - K. Ito, A. Gofuku, 2003 Hybrid autonomous control for heterogeneous multi-agent system, Proceedings of 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2500 2505 , 0-78037-860-1 Vegas, October 2003, Institute of Electrical and Electronics Engineers, Inc., Piscataway, USA
5 - K. Ito, A. Gofuku, M. Takeshita, 2004 Hybrid autonomous control for multi mobile robots, Advanced Robotics, 18 1 83 99 , 0169-1864
6 - B. Mandelick, P. Spiessens, 1989 Fine-grained parallel genetic algorithms, Proceedings of 3rd International Conference on Genetic Algorithms, 428 433 , 1-55860-006-3 Mason University, June 1989, Morgan Kaufmann Publishers, Inc., San Francisco, USA
7 - H. Muhlenbein, M. Schomisch, J. Born, 1991 The parallel genetic algorithm as function optimizer, Proceedings of 4th International Conference on Genetic Algorithms, 271 278 , 1-55860-208-9 Diego, July 1991, Morgan Kaufmann Publishers, Inc., San Francisco, USA
8 - T. Murata, M. Yamaguchi, 2005 Neighboring Crossover to Improve GA-Based Q-Learning Method for Multi-Legged Robot Control, Proceedings of Genetic and Evolutionary Computation Conference 2005, 145 146 , 1-59593-010-8 D.C., June 2005, The Association for Computing Machinery, Inc., New York, USA
9 - T. Murata, M. Yamaguchi, 2008 Multi-Legged Robot Control Using GA-Based Q-Learning Method With Neighboring Crossover, In: Frontiers in Evolutionary Robotics, Iba (Ed.), 341 352 , I-Tech Education and Publishing, 978-3-90261-319-6 Vienna, Austria
10 - T. Murata, H. Ishibuchi, M. Gen, 2000 Cellular genetic local search for multi-objective optimization, Proceedings of Genetic and Evolutionary Computation Conference 2000, 307 314 , 1-55860-708-0 Vegas, July 2000, The Association for Computing Machinery, Inc., New York, USA
11 - R. S. Sutton, 1988 Reinforcement Learning: An Introduction, The MIT Press, 0-26219-398-1 USA
12 - M. Svinin, S. Ushio, K. Yamada, K. Ueda, 2001 An evolutionary approach to decentralized reinforcement learning for walking robots, Proceedings of the 6th International Symposium on Artificial life and Robotics, 176 179 , Tokyo, January 2001
13 - R. Tanese, 1989 Distributed genetic algorithms, Proceedings of 3rd International Conference on Genetic Algorithms, 434 439 , 1-55860-006-3 Mason University, June 1989, Morgan Kaufmann Publishers, Inc., San Francisco
14 - C. J. C. H. Watkins, P. Dayan, 1992 Technical note q-learning, Machine Learning, 8 279 292
15 - K. Yamada, K. Ohkura, M. Svinin, K. Ueda, 2001 Adaptive segmentation of the state space based on bayesian discrimination in reinforcement learning, Proceedings of the 6th Int. Symposium on Artificial life and Robotics, 168 171 , Tokyo, January 2001