Stable Walking Pattern Generation for a Biped Robot Using Reinforcement Learning

In this research, a stable biped walking pattern is generated using reinforcement learning. The walking pattern in the forward direction is a simple third-order polynomial, and a sinusoidal function is used for the sideways direction. To complete the forward walking pattern, four boundary conditions are needed. To avoid jerky motion, the initial position and velocity and the final position and velocity of the joint are selected as boundary conditions. A desired motion or posture can also be achieved through the initial and final positions. The final velocity of the walking pattern is related to stability, but a proper value is hard to choose, so it is treated as a learning parameter, and a reinforcement learning algorithm is used to find a proper value for this boundary condition. For the sideways movement, the sway amount is selected as the learning parameter, and a reinforcement learning agent finds a proper value for it. To test the algorithm, a three-dimensional simulator that takes into account the whole model of the robot and the environment was developed. The algorithm is verified through simulation.


Introduction
Among the various research fields involving humanoid robots, this research centers on mobility. For a humanoid robot, its method of locomotion is critical. Numerous movement methods have been considered, including wheels and caterpillar-type motion, quadruped motion, and hexapod walking methods inspired by the motions of animals and insects. While these methods have their respective strengths, it is preferable that humanoid robots to be used in human society be capable of biped motion using two legs, resembling a human, as our environment is geared to biped walking. By employing biped motion, a humanoid robot will be able to navigate stairs easily and walk along common walkways. In addition, it will be more familiar to humans.
The realization of biped walking is, however, relatively difficult because a biped walking robot is a highly complex system that is inherently unstable. The realization of biped walking started with the idea of static walking. The robot known as WABOT-1, which was developed by Waseda University in the 1970s, required 45 seconds per step but was the first biped walking robot that utilized the concept of static walking [23]. Static walking is characterized by slow biped movement, and the effect of linear or angular momentum of the robot is neglected.
Dynamic walking was considered after the general idea known as the ZMP (Zero Moment Point) was introduced to biped walking robots by Vukobratovic [1]. The ZMP is the point on the ground plane at which the total moment due to the ground reaction force becomes zero. If the ZMP is located within the support region, the robot does not fall down while walking. The first robot to which the idea of the ZMP was applied successfully was the WL-10 series from Waseda University. This robot can walk with 1.3 seconds between each step. After the appearance of the first dynamic biped walking robot, researchers developed a variety of walking algorithms based on the ZMP.
Since the first successful demonstrations of biped walking at Waseda University in the 1970s and 1980s, there have been numerous attempts to realize stable biped walking robustly and efficiently, with a number of notable results [2][3][4][5][6]. The humanoid robots developed by these research groups can walk steadily on flat or inclined ground and can even run [7][8][9]. Many notable algorithms have been developed for stable walking, and these methods can be categorized into four paradigms [10]: (a) the use of passive walking as a starting point for the design of active walkers; (b) the use of "zero moment point" control; (c) the use of a fixed control architecture and application of a parameter search to find the parameter settings that yield successful walking gaits; and (d) the development of feedback laws based upon insights into balance and locomotion.
HUBO, the first humanoid robot in Korea, was developed by Oh et al. at KAIST in 2004 [2] [11] [12][13]. It is a child-sized (125 cm tall) biped walking robot with 41 DOF (Degree Of Freedom). This humanoid robot combines several biped walking methods for stable walking. For the walking strategy, a walking pattern specific to a given environment is initially designed and a ZMP (Zero Moment Point) feedback controller and other subcontrollers are then used to maintain stability for a dynamically changeable environment. Many researchers use only a ZMP feedback controller. While stable walking can be maintained in this manner, however, it is difficult to generate desired motions. Hence, HUBO uses the aforementioned (b) and (c) paradigms to overcome this problem.
The key challenge with the existing method used by HUBO, however, is the determination of the proper parameters for designing or generating a stable walking pattern. It is difficult to find proper parameters as they are influenced by many factors such as the posture of the robot and the ground conditions. The existing robot HUBO determines these parameters through many experiments along with an analysis of walking data using a real system. This process is, however, difficult and time-consuming. Furthermore, only an expert can tune these parameters, and because an unconfirmed walking pattern is tested using a real robot, there is an inherent risk of accidents. This is the starting point of the present research.
In order to overcome these problems, a HUBO simulator and an algorithm that automatically determines an appropriate walking pattern were developed. The HUBO simulator describes the dynamics of the entire system using a physics engine and includes interactions between the robot and its environment, such as reaction forces and collision analysis. The function of this simulator is to test walking patterns. Reinforcement learning is used to find suitable walking pattern parameters automatically in order to ensure stable walking, and the resulting patterns are tested on this simulator. Reinforcement learning is a learning method that mimics the human learning process (i.e., learning from experience). Furthermore, this control method is usable even if the information or the model of the given system is unclear. Besides reinforcement learning, many other methods, such as those utilizing a neural oscillator, a neural network, or fuzzy logic, can be used to solve this problem. However, these methods are complex compared to reinforcement learning and require an expert or reliable data. Thus, reinforcement learning is used for the generation of stable walking patterns in this study.
Earlier research on the subject of biped walking using reinforcement learning focused primarily on stable walking. However, the posture of a robot is as important as stable walking. For example, posture is particularly important when the robot is climbing stairs or walking across stepping stones. In these cases, foot placement by the robot is very important. Each foot should be placed precisely or the robot can collapse. Thus, the main goal of this research is to determine a walking pattern that satisfies both stable walking and the required posture (foot placement) using reinforcement learning. In particular, the Q-learning algorithm is used as the learning method and CMAC (Cerebellar Model Articulation Controller) serves as the generalization method. The Q-learning algorithm is easy to implement and its convergence is not affected by the learning policy. Hence, it has been used in many applications.

Related work
Former studies concerning the realization of stable biped walking using reinforcement learning are categorized below.
(a) The use of reinforcement learning as a sub-controller to support the main controller
(b) The use of reinforcement learning as a main controller or a reference generator
In Case (a), reinforcement learning is normally used as a gain tuner of the main controller or as a peripheral controller for stable walking. In Case (b), reinforcement learning is used directly to generate a stable walking pattern or as the main controller for stable walking.
Chew and Pratt [14][54] simulated their biped walking robot, Spring Flamingo (Fig. 2-1), in a plane (two-dimensional simulation). This seven-link planar bipedal robot weighed 12 kg, was 0.88 m in height and had bird-like legs. A reinforcement learning system was used as the main controller. The following states were chosen: (a) the velocity of the hip in the forward direction (x-coordinate); (b) the x-coordinate of an earlier swing ankle measured with reference to the hip; and (c) the step length. The goal was to enable the robot to walk with constant speed; thus, the learning system received a reward of 0 when the robot walked within the speed bounds and a negative value otherwise. The position of the swing foot was used as the action. Additionally, a torque controller in each ankle was used to control the ankle joint torque. The same type of torque controller was also used to maintain the velocity of the body. The ankle joint torque was limited to a certain stable value; hence, the robot could walk stably without considering the ZMP. However, because the goal was to realize walking with constant speed, the posture of the robot was not considered.
peripheral controllers were used. The central controller used the experience of the peripheral controllers to learn an average control policy. Using several peripheral controllers, it was possible to generate various stable walking patterns. The main controller activated specific peripheral controllers in an approach that was suitable for specific situations. However, the architecture of the controller was complex, and this approach required many learning trials and a lengthy convergence time.
Morimoto, Cheng, Atkeson, and Zeglin [16][60] (Fig. 2-3) used a simple five-link planar biped robot to test their reinforcement learning algorithm. The foot of each leg had a shape resembling a 'U', and no joints were used in the ankle. Consequently, it moved in the manner of a passive walker. The goal of the learning system was to walk with constant speed, and the states were as follows: (a) the velocity of the hip in the forward direction; and (b) the forward distance between the hip and the ankle. The reward depended simply on whether the robot fell down or remained upright, and the action was the angle of the knee joint. The hip joint trajectory was fixed but the step period could vary. If the swing leg touched the ground before the end of the current step period, the next step period was decreased. Conversely, if the swing leg touched the ground after the current step period, the step period was increased. This work concentrated only on stable walking; the posture of the robot was not considered.
Schuitema et al. [17] also used reinforcement learning to simulate their planar robot. Their robot, termed Meta, is a passive dynamic walking robot with two active hip joints. The goal of their learning system was to have the robot walk successfully for more than 16 steps. The state space consisted of six dimensions: the angle and angular velocity of the upper stance leg, the upper swing leg, and the lower swing leg. To avoid the system having to learn the same thing twice, symmetry between the left and right legs was implemented by mirroring the left and right leg state information when the stance leg changed. There was one action dimension, the torque applied to the hip joint, which was given a range between -8 and 8 Nm. If the robot walked forward successfully, the learning system received a reward. Additionally, if the body of the robot moved backward, the learning system was penalized. Various experiments were simulated under various constraints, and the results of each experiment were compared.
Kim et al. [18] used a reinforcement learning system for ZMP compensation. In their research, two-mode Q-learning was used for ZMP compensation against external disturbances in a standing posture. The performance of the Q-learning system was improved by using the failure experiences of the learning system more effectively, along with its successful experiences. The roll angle and the roll angular velocity of the ankle were selected as the states. For the action, ankle rolling was given three discrete levels (±0.5°, 0°) during a period of 20 ms. If the agent selected an action opposing the external force, it received a reward. If the angle and ZMP constraints were exceeded, the agent was penalized.
Other researchers have also proposed learning approaches for stable biped walking [19][20][21]. However, existing research in this area in which reinforcement learning is utilized concerns only stable walking while the posture of the robot is not considered. These researchers normally utilize the position of the swing leg for stable walking. It is, however, difficult to locate the foot in a desired position using only the swing leg. Moreover, it was necessary to use another controller or additional devices for the motion of the support leg. Therefore, this research focuses on the control of the support leg as opposed to control of the swing leg for stable and desired motion walking.

Sagittal plane
There are several methods of designing a stable walking pattern, but recent research can be categorized into two groups [22]. The first approach is the 'inverted pendulum model control method' [46][47]. In this method, a simple inverted pendulum model is used as a biped walking model. Based on this model, a proper ZMP reference is generated and a ZMP feedback controller is designed to follow this reference. As this method uses a simple inverted pendulum model, its control structure is very simple. Furthermore, because it follows the ZMP reference for stable walking, stability is always guaranteed. However, it requires a proper ZMP reference, and it is difficult to define the relationship between the ZMP reference and the posture of the biped walking robot clearly and accurately. Therefore, it is difficult to select the proper ZMP reference if both the posture of the biped walking robot and its walking stability are important. A pattern generator, which translates the ZMP reference to a walking pattern, is also required.
The second approach is the 'accuracy model method'. This method requires an accurate model of the biped walking robot and its environment. In this method, a stable walking pattern is generated in advance based on the abovementioned accurate model and the biped walking robot follows this walking pattern without a ZMP feedback controller. One advantage of this method is that it allows control of the biped walking robot with a desired posture. Additionally, it does not require a ZMP controller. However, the generated walking pattern is not a generally functional walking pattern. For example, the walking pattern that is generated for flat ground is not suitable for inclined ground.
Compared to the 'inverted pendulum model control method', the 'accuracy model method' does not guarantee stability against disturbances; however, it has its own strengths. First, it is possible to control the motion of the biped robot using this method. The 'inverted pendulum model control method' only guarantees stability if the ZMP reference is correct, and it does not allow the motion to be controlled. Second, this method is more intuitive compared to the 'inverted pendulum model control method'. Thus, it is easy to apply physical intuition using this method. Third, a ZMP controller is not required. Hence, the overall control architecture is simpler with this method compared to the 'inverted pendulum model control method'.
However, an additional problem with the 'accuracy model method' involves difficulty in obtaining an accurate model of the robot and its environment, including such factors as the influence of the posture of the robot, the reaction force from the ground, and so on. Consequently, the generated walking pattern should be tuned by experiments. The generated walking pattern for a specific environment is sensitive to external forces, as this method does not include a ZMP controller. However, when the precise posture of the biped walking robot is required, for example, when moving upstairs or through a doorsill, the 'accuracy model method' is very powerful [9].
In an effort to address the aforementioned issues, the algorithm generating walking patterns based on the 'accuracy model method' was developed using reinforcement learning. To generate a walking pattern, initially, the structure of the walking pattern should be carefully selected. Selection of the type of structure is made based on such factors as polynomial equations and sine curves according to the requirements. The structure of the walking pattern is selected based on the following four considerations [24].
(a) The robot must be easy to operate. There should be minimal input from the operator in terms of the step time, stride, and mode (e.g. forward/backward, left/right) as well as commands such as start and stop.
(b) The walking patterns must have a simple form, must be smooth, and must have a continuum property. It is important that the walking patterns be clear and simple. The trajectory of the walking patterns should have a simple analytic form and should be differentiable due to the velocity continuum. After the walking patterns are formulated, the parameters for every step are updated.
(c) The calculation must be easy to implement in an actual system. The calculation burden and memory usage should be small and the pattern modification process should be flexible.
(d) The number of factors and parameters that are to be tuned must be small. The complexity of the learning process for the walking patterns increases exponentially as the number of factors and parameters increases.
In this research, based on these considerations, a third-order polynomial pattern for the support leg was designed as the walking pattern. This pattern starts from the moment one foot touches the ground and ends the moment the other foot touches the ground (Fig. 3-3).
To create or complete the third-order forward walking pattern, as shown in Eq. 3-1, four boundary conditions are needed. These boundary conditions were chosen with a number of factors taken into account. First, to avoid jerking motions and formulate a smooth walking pattern, the walking pattern must be continuous. For this reason, the position and velocity of the hip at the beginning of the walking pattern for the support leg were chosen as boundary conditions. Additionally, when the foot of the robot is to be placed in a specific location, for example when traversing uneven terrain or walking across stepping stones, the final position of the walking pattern is important. This final position is related to the desired posture or step length, and this value is defined by the user. Hence, the final position of the hip can be an additional boundary condition. Lastly, the final velocity of the walking pattern is utilized as a boundary condition. Using this final velocity, it is possible to modify the shape of the walking pattern without changing the final position, enabling the stabilization of the walking pattern [24]. From these four boundary conditions, a third-order polynomial walking pattern can be generated. However, it is difficult to choose the correct final velocity of the pattern, as exact models of the biped robot, the ground, and other environmental factors are unknown. The existing HUBO robot uses a trial-and-error method to determine the proper final velocity parameter, but numerous trials and experiments are required to tune the final velocity. Thus, in order to find a proper value for this parameter, a reinforcement learning algorithm is used.
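As a rough illustration of how the four boundary conditions determine the third-order pattern, the following sketch solves for the cubic coefficients in closed form. The function names, the step duration T, and the polynomial written as x(t) = a0 + a1 t + a2 t^2 + a3 t^3 are assumptions made for this example; they are not taken from the HUBO implementation.

```python
def cubic_coefficients(x0, v0, xf, vf, T):
    """Coefficients (a0, a1, a2, a3) of x(t) = a0 + a1*t + a2*t**2 + a3*t**3
    such that x(0) = x0, x'(0) = v0, x(T) = xf and x'(T) = vf."""
    dx = xf - x0
    a0 = x0
    a1 = v0
    a2 = (3.0 * dx - (2.0 * v0 + vf) * T) / T**2
    a3 = (-2.0 * dx + (v0 + vf) * T) / T**3
    return a0, a1, a2, a3


def evaluate(coeffs, t):
    """Return (position, velocity) of the cubic pattern at time t."""
    a0, a1, a2, a3 = coeffs
    position = a0 + a1 * t + a2 * t**2 + a3 * t**3
    velocity = a1 + 2.0 * a2 * t + 3.0 * a3 * t**2
    return position, velocity
```

In this sketch, changing only vf reshapes the trajectory between the two end points while the final position stays at xf, which is the property that makes the final velocity a convenient learning parameter.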

Coronal plane
Coronal plane movements are periodic motions. Because the overall range of these movements is smaller than that of the sagittal plane motion, a simple sine curve is used. If the movement in the z direction is constant, the coronal plane motion can be described by Eq. 3-2, where Y is the sway amount and w is the step period.
From the simple inverted pendulum model, the ZMP equation can be approximated using Eq. 3-3, where l denotes the length from the ankle joint of the support leg to the mass center of the robot.
From Eq. 3-2 and Eq. 3-3, the ZMP can be expressed using Eq. 3-4. The length l and the step period w are given parameters, and the acceleration of gravity g is a known parameter. The only unknown parameter is the sway amount. The sway amount can be determined by considering the step period, the DSP (Double Support Phase) ratio and the support region. If the amplitude of the ZMP is located within the support region, the robot is stable. It is relatively easy to determine the unknown parameter (the sway amount) compared to the sagittal plane motion. However, it is unclear which parameter value is most suitable. The ZMP model is simplified and linearized, and no ZMP controller is used in this research. Thus, an incorrect parameter value may result from the analysis.
Therefore, using the reinforcement learning system, the optimal parameter value for stable walking using only low levels of energy can be determined.
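The relationship between the sway amount and the ZMP amplitude can be sketched numerically as follows. The sketch assumes a sway of the form y(t) = Y sin(omega t) and the usual linearized inverted pendulum approximation y_zmp = y - (l/g) y'', which gives a ZMP amplitude of Y (1 + l omega^2 / g). Treating omega as an angular frequency, the function names, and the half-width of the support region are assumptions for illustration only.

```python
import math

G = 9.81  # gravitational acceleration [m/s^2]


def zmp_amplitude(sway, omega, l):
    """ZMP amplitude for y(t) = sway * sin(omega * t) under the
    linearized inverted pendulum approximation (assumed model)."""
    return sway * (1.0 + l * omega**2 / G)


def max_sway(half_support_width, omega, l):
    """Largest sway amount whose ZMP amplitude stays inside the support region."""
    return half_support_width / (1.0 + l * omega**2 / G)


def is_stable(sway, omega, l, half_support_width):
    """True if the ZMP amplitude does not leave the (assumed) support region."""
    return zmp_amplitude(sway, omega, l) <= half_support_width
```

Such an analytic bound only gives a starting range for the sway amount; as noted above, the simplified model can be inaccurate, which is why the final value is left to the learning agent.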

Introduction
Reinforcement learning is based on trial-and-error methodology. It can be hazardous to apply a reinforcement learning system to an actual biped walking system before the learning system has been trained sufficiently through many trials, as the walking system has likely not yet been fully explored by the learning system. In particular, when such a system is inherently unstable, as in the case of a biped walking robot, attention to detail is essential. Therefore, it is necessary to train a learning system sufficiently before applying it to a real system. For this reason, simulators are typically used. The HUBO simulator, which was developed for this study, is composed of a learning system that is in charge of all learning processes, a physics engine that models a biped robot and its environment, and utility functions to validate the simulation results. Fig. 4-1 shows these modules and the relationships between them. As shown in the figure, learning contents or data obtained from the reinforcement learning module are stored through a generalization process. In this study, the CMAC algorithm is used as the generalization method; however, other generalization methods can be easily adapted. The dynamics module, which contains a physics engine, informs the reinforcement learning module of the current states of HUBO. It also receives the action (the final velocity of the walking pattern and the sway amount) from the reinforcement learning module, generates a walking pattern, and returns a reward. For the visualization of the movement of a biped walking robot, the OpenGL library is used. Because all components of the HUBO simulator are modularized, it is easy to use with new algorithms or components without modification.
The HUBO simulator contains all of the components necessary for simulating and testing biped walking systems and control algorithms. In addition, all modules are open and can be modified and distributed without limitation. The HUBO simulator follows the GPL (GNU General Public License) scheme.

Physics Engine
To obtain viable simulation results, the dynamics model of a simulator is very important. If the dynamics model differs greatly from a real model, the result of the simulator is useless. Therefore, it is important to ensure that the simulation model resembles the actual model to the greatest extent possible. Essentially, the model of a biped walking system should contain a robot model as well as a model of the environment of the robot. Many researchers only consider the robot model itself and neglect a model of the environment, which is in actuality more important in a realistic simulation of biped walking.
For this reason, a physics engine was used to build a realistic dynamic model in this study. A physics engine is a tool or API (Application Program Interface) that is used for computer simulation programs. In this research, ODE (Open Dynamics Engine) [38] was used to develop the robot and environmental model in an effort to represent the actual condition of the robot accurately. ODE is a rigid body physics engine initially developed by Russell Smith. Its source code is open and is governed by the open source community. ODE provides libraries for dynamics analyses, including collision analyses. The performance of ODE has been validated by various research groups [37][39][40], and many commercial and engineering programs use ODE as a physics engine.

Learning System
The learning system of the HUBO simulator consists of a learning module and a generalization module. The reinforcement learning module uses the Q-learning algorithm, which is based on the Q-value. To store the various Q-values that represent actual experience or trained data, generalization methods are needed. Various generalization methods can be used for this. In the present work, the CMAC (Cerebellar Model Articulation Controller) algorithm is employed. This algorithm converges quickly and is readily applicable to real systems.
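The combination of Q-learning with a CMAC can be sketched as follows. This is a minimal illustration, not the implementation used in the HUBO simulator: the class name, the number of tilings, the tile resolution, the learning constants, and in particular the discretization of the action into a small set of candidates are all assumptions. States are expected to be normalized to the range from -1 to 1.

```python
import random
from collections import defaultdict


class CMACQ:
    """Q-function stored in several offset tilings (a simple CMAC)."""

    def __init__(self, n_tilings=8, tiles_per_dim=10, actions=(0, 1, 2),
                 alpha=0.1, gamma=0.9):
        self.n_tilings = n_tilings
        self.tiles_per_dim = tiles_per_dim
        self.actions = tuple(actions)   # assumed discrete action candidates
        self.alpha = alpha              # learning rate
        self.gamma = gamma              # discount factor
        self.weights = defaultdict(float)

    def _tiles(self, state, action):
        """Keys of the active tiles for a state in [-1, 1]^n and an action."""
        keys = []
        for k in range(self.n_tilings):
            offset = k / self.n_tilings  # shift each tiling by a fraction of a tile
            idx = tuple(int((s + 1.0) / 2.0 * self.tiles_per_dim + offset)
                        for s in state)
            keys.append((k, idx, action))
        return keys

    def q(self, state, action):
        """Q-value as the sum of the weights of the active tiles."""
        return sum(self.weights[key] for key in self._tiles(state, action))

    def best_action(self, state):
        return max(self.actions, key=lambda a: self.q(state, a))

    def choose(self, state, epsilon=0.1):
        """Epsilon-greedy action selection."""
        if random.random() < epsilon:
            return random.choice(self.actions)
        return self.best_action(state)

    def update(self, state, action, reward, next_state, terminal=False):
        """One Q-learning update; returns the temporal-difference error."""
        target = reward
        if not terminal:
            target += self.gamma * max(self.q(next_state, a)
                                       for a in self.actions)
        error = target - self.q(state, action)
        step = self.alpha / self.n_tilings
        for key in self._tiles(state, action):
            self.weights[key] += step * error
        return error
```

Because neighboring states share tiles, an update for one state also shifts the Q-values of nearby states, which is the generalization effect the CMAC is used for here.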
Setting up states and a reward function is the most important process in the efficient use of reinforcement learning. When setting up states, using physical meanings is optional; however, it is important that the most suitable states for achieving the goal are selected. Additionally, the reward function should describe the goal in order to ensure success. The reward function can represent the goal directly or indirectly. For example, if the goal for a biped walking robot is to walk stably, the learning agent receives the reward directly if the robot walks stably without falling down. Otherwise, it is penalized. Alternatively, the reward function can describe the goal of stable walking indirectly, through such factors as the pitch or roll angle of the torso while walking and the walking speed. In either case, it is important that the reward suitably describes the goal.
Fig. 4-2 shows the main window of the HUBO simulator. The motion of HUBO calculated using ODE is displayed in the center region of the HUBO simulator using OpenGL. Each step size or foot placement can be modified from the main window. Fig. 4-3 shows the learning information window. This window shows information such as the current states and the reward associated with the learning module. In addition, the learning rate and the update rate can be modified from this window. Fig. 4-4 shows the body data window. This window shows the current position and orientation of each body. As lower body data is important for the system, only the data of the lower body is represented. Fig. 4-5 shows the joint angle of the lower body. The data of the force and torque for each ankle joint is shown in the force-torque data window in Fig. 4-6.

Layout
The HUBO simulator was developed using the Cocoa® library under a Mac OS X® environment. As Cocoa is based on the Objective-C language and all structures are modularized, it is easy to port to other platforms such as Linux® and Windows®.

States, action and reward
The biped walking pattern generation system can be viewed as a discrete system. Before a new walking step begins, the learning module receives the information of the current states and generates the walking pattern. The robot then follows the generated walking pattern. Following this, the walking pattern is finished and the process starts again. Therefore, this system can be viewed as a discrete system in which the time step is the walking pattern period or the step period. In this study, the walking pattern starts at the moment of the SSP (Single Support Phase) and ends at the moment of the next SSP. At the beginning of the SSP, the learning module receives the current states and calculates the action. Simultaneously, an evaluation of the former action is carried out by the learning module.

Sagittal plane
To set up proper states for the sagittal plane motion, a simple inverted pendulum model is used. From the linearized inverted pendulum model, the ZMP equation can be formulated, as shown in Eq. 4-1.
From Eq. 4-1, the position and acceleration of the mass center are directly related to the ZMP. As the ZMP is related to the stability of the biped walking system, it is feasible to select the position and acceleration of the mass center as states. In addition, to walk stably with minimal energy consumption, the robot should preserve energy, implying that the robot should utilize its momentum (angular or linear). The momentum reflects current and future states; it is related to the velocity of the mass center. Therefore, the velocity of the mass center was also chosen as a state in this study. Selected states and the reasons for their selection are summarized in Table 4-1. All states are normalized to the range from -1.0 to 1.0. However, the reinforcement learning agent has no prior data regarding the maximum values of the states. It receives this data during the training and updates it automatically. First, these maximum values are set to be sufficiently small; in this research, the initial value is 0.1. The reinforcement learning agent then updates the maximum value at every step if the current values are larger than the maximum values.
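The normalization with automatically updated maximum values described above can be sketched as follows. The class and method names are hypothetical; only the initial maximum of 0.1 and the rule of enlarging the maximum whenever a larger value is observed come from the text.

```python
class RunningMaxNormalizer:
    """Scale each state into [-1, 1] using the largest magnitude seen so far."""

    def __init__(self, n_states, initial_max=0.1):
        # The maximum starts small and only grows during training.
        self.max_abs = [initial_max] * n_states

    def normalize(self, raw_states):
        normalized = []
        for i, value in enumerate(raw_states):
            if abs(value) > self.max_abs[i]:
                self.max_abs[i] = abs(value)  # update the stored maximum
            normalized.append(value / self.max_abs[i])
        return normalized
```

One consequence of this scheme, under these assumptions, is that the same raw state can map to different normalized values early and late in training, until the stored maxima settle.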

State | Reason
The position of the mass center with respect to the support foot | Relation between the position of the mass center and the ZMP, and the body posture
The velocity of the mass center | Angular or linear momentum
The acceleration of the mass center | Relation between the position of the mass center and the ZMP
Table 4-1. States for the sagittal plane motion

The learning parameter learnt through reinforcement learning is the final velocity. It is an unknown parameter in the initial design of the walking pattern. The boundary conditions of the walking pattern were discussed in Chapter 3. Eq. 4-2 shows these conditions again. From Eq. 4-2, Conditions 1 and 2 are determined from the former walking pattern and Condition 3 is the given parameter (the desired step size) from the user. Only the final velocity is unknown, and it is difficult to determine this value without precise analysis. Hence, the action of the reinforcement learning system is this final velocity (Table 4-2).

Action | Reason
Final velocity of the walking pattern | Only the final velocity is an unknown parameter, and it is related to stable walking
Table 4-2. Action for the sagittal plane motion

The reward function should be the correct criterion for the current action. It also represents the goal of the reinforcement learning agent. The reinforcement learning agent should learn to determine a viable parameter value for the generation of the walking pattern with the goal of stable walking by the robot. Accordingly, in this research, the reward reflects both whether the robot falls down or remains upright and how good the resulting walking is. Many candidates exist for the latter purpose, but the body rotation angle (Fig. 4-7) was finally chosen based on trial and error. Table 4-3 shows the reward and associated reasons. If the robot falls down, the reinforcement learning agent receives a large negative value as a reward; in other cases, it receives positive values according to the body rotation angle. The pitch angle of the torso represents the feasibility of the posture of the robot.

Reward | Reason
Fall down | This denotes the stability of the robot (or absence of stability)
Pitch angle of the torso | It represents how good it is for stable dynamic walking
Table 4-3. Reward for the sagittal plane motion

Fig. 4-8 shows the overall structure of the generation of the walking pattern using reinforcement learning. The reinforcement learning system receives the current states and calculates the proper action, and the walking pattern generator generates the walking pattern based on this action. The reinforcement learning system learns the suitability of the action from its result, and this process is repeated until the reinforcement learning system shows reasonable performance.

Coronal plane
The coronal plane motion is assumed to be weakly coupled to the sagittal plane motion. Thus, the states for the sagittal plane motion are not considered in the coronal plane motion learning implementation. Under the inverted pendulum model, the motion is described by a linear dynamic equation (Eq. 4-3). Eq. 4-3 can be integrated to show the relationship between $\dot{y}$ and $y$ at all times before the next support exchange event, where $C$ denotes the constant of this integration. When $C$ is greater than zero, the mass approaching the vertical plane that passes through the pivoting point will be able to travel across the plane. When $C$ is less than zero, the mass will not be able to travel across the vertical plane. Instead, it reverts to its original direction of travel at some instant. If $\dot{y}$ and $y$ are given, it is possible to predict stability using $C$. Thus, $\dot{y}$ (the velocity of the torso) and $y$ (the position of the torso with respect to the support foot) are used as states.
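The stability prediction from C can be made concrete with a short sketch. Because the equation is not reproduced here, the sketch assumes the standard linear inverted pendulum form, y'' = (g / l) y, whose first integral is C = y'^2 / 2 - (g / (2 l)) y^2; the pendulum length l, the gravity constant, and the function names are assumptions for illustration.

```python
G = 9.81  # gravitational acceleration in m/s^2 (assumed value)


def pendulum_constant(y, y_dot, length, g=G):
    """C = 0.5*y_dot**2 - (g / (2*length)) * y**2, the first integral of
    the assumed linear inverted pendulum model y_ddot = (g/length)*y."""
    return 0.5 * y_dot**2 - (g / (2.0 * length)) * y**2


def will_cross_vertical_plane(y, y_dot, length, g=G):
    """True if the mass is moving toward the vertical plane through the
    pivot (y = 0) and C > 0, so it can travel across the plane."""
    moving_toward_plane = y * y_dot < 0.0
    return moving_toward_plane and pendulum_constant(y, y_dot, length, g) > 0.0
```

Under this assumed form, C > 0 means the velocity is still nonzero when y reaches zero, while C < 0 means the velocity vanishes at a nonzero y and the mass turns back, which matches the two cases described in the text.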
As a simple sine function is used in the walking pattern of the coronal plane motion and because the step period is given, the only unknown parameter is the amplitude (sway amount) of the sine function. Thus, the amplitude is used as the action. The reinforcement learning system for the coronal plane motion adopts a reward that only issues a punishment value when a failure state is encountered, that is, when the roll angle of the torso $\theta$ leaves the range bounded by $\Theta_l$ and $\Theta_u$. Here, $R$ is a negative constant (punishment). It is important to note that the failure condition based on $\theta$ is checked at all times.
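A minimal sketch of the coronal plane pattern and its punishment-only reward follows. The phase and frequency of the sine function (one half cycle per step), the zero reward outside failure states, the strict inequalities at the bounds, and all names are assumptions made for illustration.

```python
import math


def sway_position(t, amplitude, step_period):
    """Sideways torso reference at time t. Assumes the torso completes one
    half sine cycle per step; amplitude is the learned sway amount."""
    return amplitude * math.sin(math.pi * t / step_period)


def coronal_reward(roll_deg, lower_deg, upper_deg, punishment=-1.0):
    """Return 0 while the torso roll angle stays inside its bounds and a
    negative constant (punishment) once a failure state is reached."""
    if lower_deg < roll_deg < upper_deg:
        return 0.0
    return punishment
```

Because the check depends only on the current roll angle, it can be evaluated at every simulation step rather than only at support exchange.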

Experiment
To test the logic of the reinforcement learning algorithm, several experiments were carried out. It was assumed that the sagittal plane and the coronal plane motion are weakly related; for this reason, the coronal plane motion was considered first. The reinforcement learning agent for the coronal plane motion learns the stable walking pattern parameter. Based on this learning agent, the reinforcement learning agent for the sagittal plane motion is added.
Step walking was tested first, and forward walking tests were performed afterwards.

5.1 Step walking
As step walking does not contain forward motion, it is similar to coronal plane motions. Hence, the reinforcement learning agent for coronal plane motions learns stable parameters first through step walking experiments. Following this, the learning agent for the sagittal plane motion is added. To determine the suitable parameters, the learning agent for the coronal plane motion learns the sway amount. The update rate α and learning rate γ are set to 0.2 and 0.8, respectively, and the ε-greedy value is set to 0.2 initially. The ε-greedy value converges to 0 as the learning agent finds the successful parameter.
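How these three settings interact can be sketched as below. The sketch assumes a tabular, Q-learning-style update over a discretized set of sway amounts, in which α acts as the step size and γ weights the value of the next state; the text here does not confirm this exact form, and the decay rule for the ε-greedy value is likewise an assumption.

```python
import random

ALPHA = 0.2    # update rate from the experiment settings
GAMMA = 0.8    # second rate from the experiment settings
EPSILON = 0.2  # initial epsilon-greedy value


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action index with probability epsilon, otherwise the
    index of the largest estimated value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def q_update(q_values, action, reward, best_next_value, alpha=ALPHA, gamma=GAMMA):
    """Move the estimate for `action` toward reward + gamma * best_next_value."""
    target = reward + gamma * best_next_value
    q_values[action] += alpha * (target - q_values[action])
    return q_values[action]


def decay_epsilon(epsilon, success, factor=0.5):
    """Shrink epsilon after a successful trial so exploration fades out
    (the halving factor is a hypothetical choice)."""
    return epsilon * factor if success else epsilon
```

With an ε-greedy value of 0.2, roughly one trial in five explores a different sway amount; as successful trials accumulate and the value decays toward 0, the agent settles on the parameter it has found.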
The experimental conditions are shown in Table 5-1. As shown in Fig. 5-1, step walking converged in 17 trials. The parameter for step walking (the sway amount) converges to 0.055 m. Fig. 5-2 shows the sideways (y) direction movement of the torso; the movement is periodic and very stable. The roll angle of the torso is within 0.4 degrees, except for the initial movement, and is very stable, as shown in Fig. 5-3. Fig. 5-4 shows the z-direction movement of the foot, and shows that the robot walks stably without falling down. Additionally, it shows that the DSP time is set to 10% of the step period.

Forward walking - 15 cm
Based on the learning agent for coronal plane motions, which was fully trained through the step walking experiment, a forward walking experiment of 15 cm was performed to find a stable forward walking pattern. The experimental conditions are shown in Table 5-2.
As shown in Fig. 5-5, the learning agent learns the proper parameters within 14 trials. The converged final velocity of the walking pattern is 0.3 m/s. The walking pattern is identical to the forward (x) direction movement of the torso, and Fig. 5-6 shows that it is very stable. Moreover, the pitch angle of the torso does not exceed 2 degrees (the failure condition), as shown in Fig. 5-7. Earlier research used the motion of the swing leg for stable walking, while in this research the walking pattern for the support leg is considered. Therefore, it is possible to place the foot in the desired position. Figs. 5-8 and 5-9 represent the movement of the foot. As shown in Fig. 5-9, the foot is located in the desired position (0.15 m).

Forward walking - 20 cm
To test the robustness of the reinforcement learning agent for sagittal plane motions, an additional forward walking test was performed. In this test, the learning agent determined a stable parameter within 21 trials, and the pitch angle of the torso was within 1.3 degrees while walking. Fig. 5-14 shows that the foot is placed in the desired position.

Conclusion
The main purpose of this research is to generate a stable walking pattern. Many methods for stable biped walking exist, but they can be categorized into two groups. One is the 'inverted model based control method'. In this method, a simple inverted pendulum model is used as the biped walking model. Based on this model, a proper ZMP reference is generated and a ZMP feedback controller is designed to follow this reference. The second is known as the 'accuracy model method'. It requires an accurate model of the biped walking robot and its environment. In this method, a stable walking pattern is generated in advance based on this accurate model, and the biped walking robot follows the walking pattern without a ZMP feedback controller. One advantage of this method is that it allows control of the biped walking robot with a desired posture. Additionally, it does not require a ZMP controller. Each method has its own strengths and weaknesses; in this research, the 'accuracy model method' is used.
However, a problem with the 'accuracy model method' is the difficulty of obtaining an accurate model of the robot and its environment, including such factors as the influence of the posture of the robot and the reaction force from the ground. Consequently, the generated walking pattern must be tuned by experiments. A walking pattern generated for a specific environment is sensitive to external forces, as this method does not include a ZMP controller. The tuning process is also time-consuming and requires an expert.
In this research, reinforcement learning is used to solve this problem. Reinforcement learning is different from supervised learning, the kind of learning studied in most current research in machine learning, statistical pattern recognition, and artificial neural networks. The two are similar in that both treat nonlinear problems of finding an optimal solution in a given environment. However, supervised learning can find a good solution if and only if good examples are provided by an external supervisor. Reinforcement learning learns without an external supervisor, so it is a good choice when the system model is unknown or only partially known.
Reinforcement learning is based on trial-and-error methodology. It can be hazardous to apply a reinforcement learning system to an actual biped walking system before the learning system has been trained sufficiently through many trials, as the walking system has likely not been fully analyzed by the learning system. In particular, when such a system is inherently unstable, as a biped walking robot is, attention to detail is essential. Therefore, it is necessary to train a learning system sufficiently before applying it to a real system. For this reason, the HUBO simulator was developed. The HUBO simulator contains the physics engine, the learning system, the generalization module and other utility modules. The physics engine contains not only the robot model but also the environment model, so it is possible to simulate the interaction between the robot and its environment, such as the ground. Because its structure is modular and simple, it is easy to update or add functions or algorithms.
In this research, a stable biped walking pattern generation algorithm using reinforcement learning is developed based on the HUBO simulator. Unlike former research, the walking pattern of the support leg is considered. Existing research uses the motion of the swing leg for stable biped walking, and extra controllers are needed to control the motion of the support leg. Because the algorithm developed in this research addresses the walking pattern of the support leg, and the motion of the swing leg is determined by the given specifications, extra controllers are not needed and the overall control structure is very simple. The algorithm also generates the stable biped walking pattern automatically, so supervisors or experts are not needed. In addition, algorithms developed in former research were limited to planar robot systems: they considered the sagittal plane motion only and neglected the coronal plane motion. The algorithm developed through this research considers both the sagittal and the coronal plane motion. It is assumed that the sagittal and the coronal plane motions are weakly coupled, so the reinforcement learning systems for each plane are trained separately, and after sufficient learning the two learning systems are combined. The performance of the stable biped walking pattern generation algorithm is validated through several experiments. These experiments are carried out using the HUBO simulator, and proper states and a reward function for stable biped walking are found. The algorithm converges stably, its performance is superior to that of existing research, and its convergence is faster. Although this research does not include an experiment on the real system, the logic of the algorithm is tested and verified using the HUBO simulator, and the performance of the algorithm is distinguished from that of existing research.
It remains necessary to test and apply the algorithm on the real system, and subsequent research will address this. The algorithm has also been tested on forward walking only; in subsequent research, various motions such as side walking and asymmetric biped walking will be tested.

Contributions
The HUBO simulator, which contains the physics engine, the learning system, the generalization module and other utility modules, is developed.
Proper states and a reward function are found for the reinforcement learning system.
The third order polynomial walking pattern for the motion of the support leg is generated using reinforcement learning.
In contrast to former research, both the sagittal and the coronal plane motion are considered.

Future work
It is necessary to apply and test the algorithm on the real system.
It is necessary to test various motions such as side walking and asymmetric biped walking.