Open access peer-reviewed chapter

A Q-Learning-Based Approach for Simple and Multi-Agent Systems

By Ümit Ulusoy, Mehmet Serdar Güzel and Erkan Bostanci

Submitted: February 23rd 2019 | Reviewed: July 9th 2019 | Published: April 22nd 2020

DOI: 10.5772/intechopen.88484



This study proposes different machine learning-based solutions for both single and multi-agent systems on a 2-D simulation platform, namely, Robocode. This dynamic and programmable platform allows agents to interact with the environment and with each other by employing a variety of battling strategies. Q-Learning is one of the leading machine learning-based solutions applied to such problems. However, the control problem becomes harder for continuous spaces. One of the main difficulties of reinforcement learning (RL) is designing an appropriate reward function: for simple tasks the function can be described with only a few parameters, whereas for more complex goals specifying it can be challenging. Recent studies show that neural network-based approaches can handle these challenges and learn control strategies from 2-D or 1-D data. Besides these problems of RL algorithms for single robots, once the number of robots increases and the system must behave as a multi-agent system, the overall design requirements become more complex. Accordingly, the proposed system is validated on different battle scenarios, in which the performance of the Q-Learning-based system and of the supervised learning techniques is compared. Results reveal the superiority of the ANN-based approach over the other methods.


  • multi-agent systems
  • Q-Learning
  • Robocode
  • auto-encoder
  • neural network
  • battling strategy

1. Introduction

Swarm robotics is a scientific field that integrates swarm intelligence and cooperative robotics to establish and coordinate robots that achieve challenging tasks within a reasonable time [1, 2]. Multi-agent systems, on the other hand, are considered to be the coordination of autonomous agents that complete tasks by exchanging or sharing information over a network. They resemble the swarm intelligence discipline in several ways [3, 4], since multi-agent systems mainly deal with the coordination of multiple interacting agents so as to complete different tasks. The key objective of those systems is to coordinate rather simple agents instead of using a single complex agent [5]. This coordination gives such systems properties that individual agents may not achieve on their own, namely, robustness, scalability, and flexibility [5, 6].

The Robocode platform, on the other hand, is a game development platform that allows developers to design robot battle tanks to fight against other tanks [7]. The battles run in real time, and the game is played in a two-dimensional simulation environment by single or multiple robots, as shown in Figure 1. These robots can be defined as single robots, or some of them can be marked as team robots. Each of them is equipped with battling behaviors that let it decide on movement, firing, and targeting in order to keep its energy high and destroy its opponents. Time is measured in ticks, and each robot is allowed only one movement per tick. At the end of each round (game), the total score of each participant is calculated from its “fire damage,” “ram damage,” and “survival status.” This lets a team obtain the highest score even if its robots did not survive.

Figure 1.

An example screenshot from the Robocode environment [7].

The flexibility, scalability, and robustness of the Robocode platform encourage the authors to employ machine learning-based approaches for multi-agent system problems. Despite its advantages, the platform also poses some challenges that should be handled appropriately. The critical issues are as follows:

  • Opponent rounds are not visible and the environment is not fully observable.

  • Sensors used by robots are limited.

  • The number of actions is quite high, which makes learning harder.

  • The speed of robots slows down during firing and turning behaviors.

  • When a robot’s gun temperature is high, the robot cannot fire, which forces designers to take all of these parameters into account.

Robocode attracts a great deal of attention from a large community of researchers, students, and engineers, in which design concepts and source code are shared. Tournaments and leagues are arranged via websites, so the rankings of customized robots are continuously updated [7]. Game strategies are therefore critical and continuously evolve through different approaches. Robocode game strategies are characterized as trees of atomic elements corresponding to actions and observations in a battle.

Machine learning has been widely used in single [8, 9] and multi-agent systems [2, 3]. This encourages researchers to apply machine learning or meta-heuristic-based methods to train and prepare robot teams for battles in the Robocode environment. For instance, some studies employ genetic algorithms to generate varied and evolving behaviors [8, 10]. Besides, decision tree and neural network-based solutions have been employed to estimate a strategy that obtains a higher rank in the league [11, 12]. Those studies show that machine learning is an efficient way of designing and implementing strategies for such an environment. Accordingly, this study is inspired by those previous studies and introduces three different machine learning-based approaches to train and prepare robots for battle. The first approach employs reinforcement learning to train a single robot and a team of robots separately so that they can survive in a tank battle. It is shown that, despite its discrete structure, Q-Learning can be adapted to such a complex game. In addition, a neural network-based design, similar to one employed in a previous study [13], is implemented in order to compare the results. Finally, an auto-encoder-based model is designed to train a number of robots, allowing them to battle to the death in an arena. Similar studies can be seen in [14, 15, 16]. The next section introduces the proposed methods separately. The experiments are defined and the results are evaluated in detail in the experimental section. Lastly, the study is concluded in the conclusion section.


2. Methodology

This study proposes three different machine learning-based solutions to the multi-agent battling game. The first employs a reinforcement learning approach, which aims to maximize the reward collected during the learning procedure. The second approach, on the other hand, relies on a supervised learning algorithm based on an artificial neural network architecture. Finally, an auto-encoder-based model has been designed and implemented to train the robots for the challenge. Each of those solutions is detailed in turn.

2.1 Q-Learning robot for Robocode

Reinforcement learning (RL) aims to take suitable actions that maximize reward in a specific situation. It is employed by various software and machines to find the best possible behavior in a particular situation. Reinforcement learning differs from supervised learning in that the agent itself decides what to do to accomplish the given task. Instead of employing a training dataset, the agent learns from its experiences. Q-Learning is one of the most popular RL algorithms and is preferred for this study due to its efficiency and popularity [17]. The agent observes the environment and performs an action chosen by the previously defined policy. The agent then obtains the consequence of the action, or reward, from the environment. This state and action pair is kept for future use since it gives clues about the reward. The algorithm aims to generate a Q-Table that holds the maximum expected future reward for each action at each state. The Q-Learning update rule is based on the Bellman equation so as to estimate the optimal Q-Value, and it is given as Eq. (1):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \tag{1}$$

where, at each time t, the agent selects an action a_t, observes a reward r_t, and enters a new state s_{t+1}. Besides, α refers to the learning rate, whereas γ is the discount factor. With this algorithm, a Q-Learning robot is designed according to the rules and environment of Robocode. Any robot knows the enemy position, the bearing angle, and the distance to the enemy by employing its radar; then, at each step, the robot selects the action that maximizes the upcoming reward for the current state. These state and action pairs are stored in a table that is updated during the game, and the robot collects rewards at each tick to update the table (Q-Table) as a result of the applied actions. The lookup table contains the following states and actions for the proposed Q-Learning robot for the Robocode problem (see Table 1). The flow chart of the Q-Learning robot is given in Figure 2.
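To make the update rule of Eq. (1) concrete, here is a one-step worked example. The numbers are invented for illustration and are not taken from the experiments: α = 0.1, γ = 0.9, a current estimate Q(s_t, a_t) = 0.5, a reward r_t = 1, and a best next-state estimate of 2.0. Assuming the standard Q-Learning update, the new estimate would be:

```latex
\begin{aligned}
Q(s_t, a_t) &\leftarrow 0.5 + 0.1 \left[ 1 + 0.9 \times 2.0 - 0.5 \right] \\
            &= 0.5 + 0.1 \times 2.3 \\
            &= 0.73
\end{aligned}
```

The estimate moves only a fraction α of the way toward the target value 2.8, which is why many rounds are needed before the table stabilizes.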

States | Actions
Robot location | Run away from the enemy
Enemy location | Move toward the enemy
Bearing angle with the enemy | Hold the current position
Energy level | Spin clockwise or anticlockwise

Table 1.

List of states and actions for Q-Learning robot of Robocode.
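A lookup table needs discrete states, so continuous readings such as position and bearing have to be binned. The following is a minimal sketch of one way to do that; the bin counts, the 800 × 600 field, the energy ceiling of 100, and all function names are assumptions made for illustration, and the enemy location from Table 1 is left out to keep the table small. The chapter does not describe its own discretization.

```python
# Hypothetical discretization; none of these bin counts come from the chapter.
FIELD_W, FIELD_H = 800.0, 600.0   # assumed battlefield size in pixels
MAX_ENERGY = 100.0                # assumed energy ceiling used for binning
X_BINS, Y_BINS, BEARING_BINS, ENERGY_BINS = 8, 6, 4, 3
ACTIONS = ["run_away", "move_toward", "hold_position", "spin"]


def to_bin(value, upper, bins):
    """Map a value in [0, upper] to an integer bin in [0, bins - 1]."""
    clamped = min(max(value, 0.0), upper)
    return min(int(clamped / upper * bins), bins - 1)


def state_index(x, y, bearing_deg, energy):
    """Flatten the binned state into a single row index of the Q-table."""
    xb = to_bin(x, FIELD_W, X_BINS)
    yb = to_bin(y, FIELD_H, Y_BINS)
    bb = to_bin(bearing_deg % 360.0, 360.0, BEARING_BINS)  # wraps negative bearings
    eb = to_bin(energy, MAX_ENERGY, ENERGY_BINS)
    return ((xb * Y_BINS + yb) * BEARING_BINS + bb) * ENERGY_BINS + eb


N_STATES = X_BINS * Y_BINS * BEARING_BINS * ENERGY_BINS
# One row per state, one column per action, all Q-Values starting at zero.
q_table = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]
```

Finer bins give the robot a more precise picture of the battle but multiply the number of rows that must be visited during training.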

Figure 2.

Q-Learning robot training flow chart.

The pseudocode of the Q-Learning algorithm is given as follows:

Q-Learning algorithm:

Requires: States s: 1 to n;

Actions a: 1 to m;

Reward function;

α (learning rate), γ (discount factor)

Ensures: Updated Q-Table for state-action pairs


Initialize the state-action table Q(s, a)

Select the current state s

While a final state or a threshold value has not been reached:

    Select an action a according to the action selection policy

    Obtain the reward r for the selected action along with the next state s'

    Update the Q-Value of the current state s and action a according to Eq. (1), then set s to s'
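The pseudocode above can be sketched as runnable Python. This is an illustration on a toy five-cell corridor, not the Robocode robot: the environment, the epsilon-greedy policy with random tie-breaking, and every parameter value are assumptions chosen so the example runs on its own.

```python
import random


def q_learning(n_states, n_actions, step, episodes=500, alpha=0.1, gamma=0.9,
               epsilon=0.1, max_steps=100, seed=0):
    """Tabular Q-learning; step(s, a) must return (next_state, reward, done)."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0  # every episode starts in state 0
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)           # explore
            else:
                best = max(q[s])                       # exploit, ties broken at random
                a = rng.choice([i for i in range(n_actions) if q[s][i] == best])
            s_next, r, done = step(s, a)
            # Eq. (1): move Q(s, a) toward r + gamma * max over a' of Q(s', a')
            target = r if done else r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
            if done:
                break
    return q


def corridor_step(s, a):
    """Toy corridor: action 1 moves right, action 0 moves left; reward 1 at cell 4."""
    s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    done = s_next == 4
    return s_next, (1.0 if done else 0.0), done


q = q_learning(5, 2, corridor_step)
```

After training, the learned table prefers moving right in every non-terminal cell, which is the shortest route to the reward.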



2.2 Artificial neural network robot for Robocode to approximate Q-Values

Artificial neural networks are considered universal approximators that can be adapted to many different and challenging problems. Several studies have already applied them to the Robocode environment; some of those references can be seen in Section 1. Accordingly, a multilayer perceptron inspired by those studies has been adapted and designed for this study. It searches for the best output at each Robocode tick based on actions and states, which allows the maximum Q-Values to be approximated.

The proposed neural network has a very simple structure. The input layer takes the X position, the Y position, the distance between the robot and the opponent, the bearing angle, the action, and the bias values. The second layer is a fully connected hidden layer. The final layer outputs the Q-Values, as seen in Figure 3. The sigmoid function is employed as the activation function, and the number of nodes in the hidden layer is chosen by trial and error so as to obtain higher learning accuracy.
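A forward pass through a network of this shape can be sketched in plain Python. The hidden size of 6, the random weight range, the linear output unit, and the function names are assumptions for illustration; the chapter states only the inputs, the sigmoid activation, and that the hidden size was tuned by trial and error.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Weights of a fully connected layer; the last entry of each row is the bias weight."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def q_value(features, hidden_w, output_w):
    """Inputs -> sigmoid hidden layer -> one linear output read as a Q-Value."""
    x = list(features) + [1.0]                       # append the bias input
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in hidden_w]
    h = hidden + [1.0]                               # bias for the output unit
    return sum(w * v for w, v in zip(output_w, h))


rng = random.Random(0)
N_HIDDEN = 6                                         # hypothetical hidden size
hidden_w = init_layer(5, N_HIDDEN, rng)              # inputs: x, y, distance, bearing, action
output_w = [rng.uniform(-0.5, 0.5) for _ in range(N_HIDDEN + 1)]

# Score four candidate actions for one (already scaled) state and keep the best.
state = [4.0, 3.0, 2.5, 1.5]
scores = [q_value(state + [float(a)], hidden_w, output_w) for a in range(4)]
best_action = max(range(4), key=lambda a: scores[a])
```

Feeding the action in as an input, as the text describes, means one network evaluation per candidate action at every tick.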

Figure 3.

ANN architecture to approximate Q-Values.

2.3 Deep auto-encoders applied to Robocode to approximate Q-Values

A stacked sparse auto-encoder is a type of deep neural network built by stacking sparse auto-encoders, with a classifier regularly used as the final layer, mainly for classification or regression problems [18]. To the authors’ knowledge, this model has not been applied to such a problem, which encourages its use here. Consequently, an example model designed for this problem is shown in Figure 4.

Figure 4.

A stacked sparse auto-encoder (SSAE) having two hidden layers and softmax classifier.

Accordingly, the auto-encoders are first trained by an unsupervised training method [18]. The output of the first sparse auto-encoder is used as the input to the second one, and the output of the second auto-encoder becomes the input to the classifier, as shown in Figure 4. The auto-encoders and the “SoftMax” classifier are then stacked and trained in a supervised manner with the backpropagation algorithm to estimate the optimum Q-Value. Each auto-encoder is trained using the cost function given in Eq. (2), in which the error term is the mean square error (MSE):

$$E_r = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{t} \left( x_{ij} - \hat{x}_{ij} \right)^2 + \lambda \, \Omega_{weights} + \beta \, \Omega_{sparsity} \tag{2}$$

where E_r is the error rate, x is the input, and x̂ is the restored data; the coefficient λ weights the L2 “Weight Regularization” term, and the coefficient β weights the “Sparsity Regularization” term; m is the number of observations, and t is the training data label number.

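The term Ω_weights denotes the “Weight Regularization” and is defined as follows:

$$\Omega_{weights} = \frac{1}{2} \sum_{l=1}^{X} \sum_{j=1}^{n} \sum_{i=1}^{k} \left( w_{ji}^{(l)} \right)^2 \tag{3}$$

Here X indicates the number of hidden layers, n signifies the number of observations, and k is the number of variables in the training data [18].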

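The “Sparsity Regularization” term, on the other hand, can be defined as follows:

$$\Omega_{sparsity} = \sum_{i} KL\left( \rho \,\|\, \hat{\rho}_i \right) = \sum_{i} \left[ \rho \log \frac{\rho}{\hat{\rho}_i} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_i} \right] \tag{4}$$

where ρ is the desired average activation, ρ̂_i denotes the average output activation of neuron i, and “KL” is the Kullback-Leibler divergence, which measures the difference between two probability distributions over the same data. The details of those equations can be seen in [18].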
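The three parts of the auto-encoder cost described above can be combined in a few lines of Python. This sketch assumes the cost is the mean squared reconstruction error plus λ times an L2 weight penalty plus β times a KL-based sparsity penalty; the function names and argument layout are illustrative and are not the authors’ implementation.

```python
import math


def kl_divergence(rho, rho_hat):
    """KL divergence between Bernoulli distributions with means rho and rho_hat."""
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))


def sparse_ae_cost(inputs, outputs, weights, mean_activations, lam, beta, rho):
    """Reconstruction error + lam * weight penalty + beta * sparsity penalty.

    inputs, outputs: one row per observation (original and reconstructed values).
    weights: list of layers, each a list of rows of weights.
    mean_activations: average activation of each hidden neuron, strictly in (0, 1).
    """
    m = len(inputs)
    mse = sum((x - x_hat) ** 2
              for row, row_hat in zip(inputs, outputs)
              for x, x_hat in zip(row, row_hat)) / m
    omega_weights = 0.5 * sum(w * w for layer in weights for row in layer for w in row)
    omega_sparsity = sum(kl_divergence(rho, r) for r in mean_activations)
    return mse + lam * omega_weights + beta * omega_sparsity
```

For example, with a single observation `[[1.0, 0.0]]` reconstructed as `[[0.0, 0.0]]`, one weight of 2.0, λ = 0.1, and hidden activations already at the target ρ, the cost is 1.0 + 0.1 × 2.0 + 0 = 1.2.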

3. Experimental results

As mentioned above, Robocode is a tank-combat emulator developed by IBM alphaWorks [7]. Basically, a tank or team must navigate the environment to avoid being shot by its rivals. Three different machine learning-based approaches are employed to train single and multi-agent systems to win battles against their opponents autonomously. A desktop computer with an Intel Core i7-6700 CPU @ 2.60 GHz and 16 GB of RAM is used to conduct the experiments. Each method and its results are presented separately through the scenarios defined below.

3.1 Scenario 1

This scenario illustrates a single robot battle, in which a customized robot trained with Q-Learning (AUQRobot) fights against SpinBot. An example screenshot is shown in Figure 5. Up to 12,000 rounds were played to train the robot. Figure 6 illustrates the change of the winning percentage over the rounds. According to the graph, the winning percentage reaches up to 87% thanks to reinforcement learning combined with a greedy action selection method. In parallel, the cumulative reward collected over the rounds follows a similar curve. This result is obtained under Robocode’s maximum data storage constraints. The results were obtained with a Q-Table of 9216 elements, and increasing the table size would probably increase the overall winning performance.

Figure 5.

A screenshot obtained from Scenario 1 (SpinBot Robot).

Figure 6.

Winning percentage of AUQRobot within rounds.

Within this scenario, AUQRobot also fights against the TrackerBot Robot (see Figure 7), another popular robot used in Robocode. The same training configuration is applied, and a 92% winning rate is obtained. An example screenshot from the Robocode platform is shown in Figure 8. According to the results, Tracker Robot never survived during a 20-round battle. AUQRobot lost 8% against the opponent.

Figure 7.

A screenshot obtained from Scenario 1 (Tracker Robot).

Figure 8.

A screenshot illustrating Total score for Scenario 1 (Tracker Robot).

3.2 Scenario 2

This scenario illustrates a single robot battle, in which a customized robot (AUNNRobot) fights against SpinBot. The robot, implemented according to the artificial neural network architecture described above, was trained against SpinBot for between 200 and 50,000 iterations. Table 2 lists the configuration of the ANN-based system.

Input name | Parameter range | NN input range
Position X | 0–800 px | {0.0, 0.1 … 7.9, 8.0}
Position Y | 0–600 px | {0.0, 0.1 … 7.9, 8.0}
Distance to enemy | 0–1000 px | {0.0, 0.1 … 5.9, 6.0}
Bearing angle between robot and enemy | 0°–360° | {0.0, 0.1 … 5.9, 6.0}

Table 2.

Configuration of ANN-based system.
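Table 2 maps each raw reading onto a small range with a step of 0.1. One plausible way to produce such inputs is a clamped linear rescaling followed by rounding, sketched below. The linear mapping and the helper names are assumptions; the table gives only the end points of each range.

```python
def scale_input(value, raw_max, nn_max):
    """Linearly map value from [0, raw_max] to [0, nn_max], rounded to one decimal."""
    clamped = min(max(value, 0.0), raw_max)
    return round(clamped / raw_max * nn_max, 1)


# (raw maximum, NN maximum) pairs copied from Table 2
RANGES = {
    "x": (800.0, 8.0),
    "y": (600.0, 8.0),
    "distance": (1000.0, 6.0),
    "bearing": (360.0, 6.0),
}


def encode(x, y, distance, bearing):
    """Return the four scaled network inputs in a fixed order."""
    raw = {"x": x, "y": y, "distance": distance, "bearing": bearing}
    return [scale_input(raw[k], *RANGES[k]) for k in ("x", "y", "distance", "bearing")]
```

Keeping all inputs within a few units of each other prevents the large pixel coordinates from saturating the sigmoid units.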

The neural network performs linear regression, and the Q-Values obtained from the Q-Learning algorithm were employed to train the system. Figure 9 illustrates the winning percentage of the ANN-based robot during the training procedure. Results reveal that the ANN-based method starts learning rapidly but converges later than the reinforcement-based approach.
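Training a network on Q-Values amounts to regression on (input, Q-Value) pairs. The sketch below fits a tiny network with one sigmoid hidden layer and a linear output by stochastic gradient descent on the squared error. The class name, layer sizes, learning rate, and the four made-up samples are all assumptions for illustration and do not reproduce the authors’ setup.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyQNet:
    """One sigmoid hidden layer and a linear output unit (hypothetical helper)."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        xb = list(x) + [1.0]                          # input plus bias
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1] + [1.0]
        return xb, h, sum(w * v for w, v in zip(self.w2, h))

    def predict(self, x):
        return self.forward(x)[2]

    def train_step(self, x, target, lr=0.1):
        """One gradient step on 0.5 * (prediction - target) ** 2."""
        xb, h, y = self.forward(x)
        err = y - target
        for j, row in enumerate(self.w1):
            # backpropagate through hidden unit j using w2[j] before it changes
            grad_h = err * self.w2[j] * h[j] * (1.0 - h[j])
            for i in range(len(row)):
                row[i] -= lr * grad_h * xb[i]
        for j in range(len(self.w2)):
            self.w2[j] -= lr * err * h[j]
        return 0.5 * err * err


# Made-up (features, Q-Value) pairs standing in for entries read from a Q-Table.
samples = [([0.0, 0.0], 0.0), ([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 2.0)]
net = TinyQNet(n_in=2, n_hidden=4)
loss_before = sum(0.5 * (net.predict(x) - q) ** 2 for x, q in samples)
for _ in range(2000):
    for x, q in samples:
        net.train_step(x, q)
loss_after = sum(0.5 * (net.predict(x) - q) ** 2 for x, q in samples)
```

Compared with a Q-Table, the network stores a fixed number of weights regardless of how finely the state is measured, at the cost of approximate rather than exact Q-Values.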

Figure 9.

Winning percentage of AUNNRobot within rounds.

Accordingly, a battle between AUQRobot and AUNNRobot, both of which have already been trained against the same robot class, is considered an appropriate way to compare the performance of the two systems (see Figure 10). In general, neither participant clearly outperforms its opponent, but AUNNRobot has an advantage of 54% to 46% over AUQRobot based on 50 rounds, as can be seen in Figure 11.

Figure 10.

A screenshot obtained from Scenario 2 for AUQRobot vs. AUNNRobot.

Figure 11.

Results for 50 rounds for Scenario 2.

The deep auto-encoder-based method has also been trained, and the winning percentage of the resulting robot, namely AUAERobot (see Figure 13), over the training iterations is illustrated in Figure 12. It provides, however, a lower winning rate than AUNNRobot. It should be noted that if raw image data were employed as input instead of position values, the deep auto-encoder and CNN-based architectures might outperform the AUNNRobot architecture. Such input is, on the other hand, costly both to obtain as training data and to train on, compared with the aforementioned approaches. Despite these challenges, applying those architectures to the given problem is planned as future work. Consequently, AUNNRobot rather than AUAERobot is preferred to compete with AUQRobot.

Figure 12.

Winning percentage of AUAERobot within iterations.

Figure 13.

A screenshot obtained from Robocode for Scenario 2 (AUAERobot).

3.3 Scenario 3

This scenario illustrates a multi-agent robot battle, in which a team of customized robots (AUQRobot) fights against a SpinBot team. Since one of the starting points of this study is to perform multi-agent team battles, the trained robot is first used to form a team against a single robot class. Afterward, the members of this team are programmed not to strike each other. Finally, the robot class on which the training was performed is defined as a robot team of AUQRobot. Figure 14 shows a screenshot from the battlefield. Results reveal that the customized decentralized AUQRobot team outperforms the “SpinBot Team” by an average of 65% to 35%, as shown in Figure 15.
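The rule that team members do not strike each other can be expressed as a simple guard before firing. The sketch below is a language-neutral illustration with hypothetical names; in Robocode’s Java API the teammate check would typically use `TeamRobot.isTeammate`, and the gun can only fire once its heat has dropped to zero.

```python
def should_fire(scanned_name, teammate_names, gun_heat):
    """Fire only at robots outside the team, and only when the gun has cooled down."""
    if scanned_name in teammate_names:
        return False          # never shoot a teammate
    return gun_heat <= 0.0    # a hot gun cannot fire


# Hypothetical robot names in the style Robocode assigns to team members.
team = {"au.AUQRobot (1)", "au.AUQRobot (2)"}
decisions = [
    should_fire("sample.SpinBot (1)", team, 0.0),
    should_fire("au.AUQRobot (2)", team, 0.0),
    should_fire("sample.SpinBot (1)", team, 1.2),
]
```

Because each robot applies this check locally, no central coordinator is needed, which matches the decentralized team described above.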

Figure 14.

A screenshot obtained from Scenario 3 (multi-agent battle).

Figure 15.

A screenshot obtained from Robocode to illustrate results for Scenario 3.

3.4 Scenario 4

This scenario illustrates a battle of multi-agent systems consisting of five robot tanks. The first team is derived from AUQRobot, whereas the second one is derived from the AUNNRobot model. An example screenshot obtained from this scenario is shown in Figure 16. Several different competitions (experiments) were conducted between those robot teams, and results reveal that the AUNNRobot team outperforms its opponent with a winning rate between 67% (minimum) and 74% (maximum). Figure 17, generated from the Robocode platform during the experiments, covers a 50-round battle and illustrates the team performance of AUNNRobot against its opponent.

Figure 16.

A screenshot obtained from Robocode for Scenario 4.

Figure 17.

A screenshot obtained from Robocode to illustrate results for Scenario 4.

4. Conclusion

This paper introduces and compares some of the popular machine learning-based approaches for single and multi-agent systems by employing a popular 2-D game simulator, namely, Robocode. This platform allows researchers to design customized robot teams so as to join competitions and fight tank battles against players and designers from all around the world. Despite the challenges of the continuous space problem arising from the characteristics of the game, a Q-Learning-based model is introduced for the problem. Besides, an ANN-based model is designed to approximate Q-Values instead of constructing a huge Q-Table, which in essence is not a realistic approach. In addition, previous experience shows that stacked auto-encoders (SAEs) may offer an alternative supervised learning approach once labeled data is obtained. However, as position data is employed instead of raw images, SAEs do not provide advantages such as denoising images or reducing the input size. Given these results, raw images illustrating game states would be a better input for deep architectures when the goal is a stronger model than the ANN architecture. With the given input, however, the ANN model outperforms the other two approaches on both single and multi-agent systems. The experimental results and their evaluation encourage the authors to design an SAE- or CNN-based model using raw images as future work. Those models will need only raw image data for training and will probably outperform both the RL- and ANN-based models, but they may need a larger amount of training data and require excessive training time to form a suitable model. Despite this discussion, the performance of those algorithms on multi-agent systems other than Robocode cannot be estimated without implementing and evaluating them.

© 2020 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License, which permits use, distribution and reproduction for non-commercial purposes, provided the original is properly cited.

How to cite and reference


Ümit Ulusoy, Mehmet Serdar Güzel and Erkan Bostanci (April 22nd 2020). A Q-Learning-Based Approach for Simple and Multi-Agent Systems, Multi Agent Systems - Strategies and Applications, Ricardo López - Ruiz, IntechOpen, DOI: 10.5772/intechopen.88484. Available from:
