Comparison between the on-line proposed algorithm and off-line tuning.
The Reinforcement Learning (RL) problem has been widely researched and applied in several areas (Sutton & Barto, 1998; Sutton, 1988; Singh & Sutton, 1996; Schapire & Warmuth, 1996; Tesauro, 1995; Si & Wang, 2001; Van Buijtenen et al., 1998). In dynamical environments, a learning agent receives rewards or penalties according to its performance, and uses them to learn good actions.
In identification problems, information from the environment is needed in order to propose an approximate system model; thus, RL can be used to gather this information on-line. Off-line learning algorithms have reported suitable results in system identification (Ljung, 1997); however, these results are limited by the quality and quantity of the available data. The development of on-line learning algorithms for system identification is therefore an important contribution.
In this work, an on-line learning algorithm based on RL using the Temporal Difference (TD) method is presented for identification purposes. The basic propositions of RL with TD are used and, as a consequence, the linear TD(λ) algorithm proposed in (Sutton & Barto, 1998) is modified and adapted for system identification, and the reinforcement signal is generically defined according to the temporal difference and the identification error. Thus, the main contribution of this paper is a generic on-line identification algorithm based on RL.
The proposed algorithm is applied to the parameter adjustment of a Dynamical Adaptive Fuzzy Model (DAFM) (Cerrada et al., 2002; Cerrada et al., 2005). In this case, the prediction function is a non-linear function of the fuzzy model parameters, and a non-linear TD(λ) algorithm is obtained for the on-line adjustment of the DAFM parameters.
In the next section, the basic aspects of the RL problem and the DAFM are reviewed. The third section is devoted to the proposed on-line learning algorithm for identification purposes. The algorithm performance for the identification of time-varying non-linear systems is shown with an illustrative example in the fourth section. Finally, conclusions are presented.
2. Theoretical background
2.1. Reinforcement learning and temporal differences
RL deals with the problem of learning by trial and error in order to achieve an overall objective (Sutton & Barto, 1998). RL is concerned with problems where the learning agent does not know what it must do. Thus, the agent must discover an action policy to maximize the
On the other hand, the TD method makes it possible to solve the prediction problem by taking into account the difference (error) between two prediction values at successive instants
The RL problem can be viewed as a prediction problem where the objective is the estimation of the discounted gain defined by equation (1), by using the
Let be the prediction of
The real value of
which describes a temporal difference. The reinforcement value
A learning agent that uses equation (5) for parameter adjustment is called
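As a concrete picture of the TD machinery reviewed in this subsection, the following is a minimal sketch of the textbook linear TD(λ) update with accumulating eligibility traces, in the form commonly attributed to Sutton & Barto (1998). It is not claimed to reproduce equations (1)-(5) of this work term by term. The function name `td_lambda_linear`, the step size `alpha`, the discount `gamma`, the trace decay `lam` and the toy two-state chain are all assumptions introduced for illustration.

```python
import numpy as np

def td_lambda_linear(features, rewards, alpha=0.1, gamma=0.9, lam=0.8, n_sweeps=1):
    """Textbook linear TD(lambda) with accumulating traces (illustrative sketch).

    features: array of shape (T + 1, n), feature vector of each visited state
    rewards:  sequence of length T, reward received on each transition
    Returns the weight vector theta of the linear prediction theta @ x.
    """
    features = np.asarray(features, dtype=float)
    theta = np.zeros(features.shape[1])
    for _ in range(n_sweeps):
        trace = np.zeros_like(theta)  # eligibility trace, reset per episode
        for t in range(len(rewards)):
            x, x_next = features[t], features[t + 1]
            # temporal difference between two successive predictions
            delta = rewards[t] + gamma * (theta @ x_next) - theta @ x
            # decay the trace and add the gradient of the linear prediction
            trace = gamma * lam * trace + x
            theta = theta + alpha * delta * trace
    return theta

# Hypothetical toy chain: state 0 -> state 1 -> terminal, reward 1 on the last step.
# One-hot features for the two non-terminal states, zeros for the terminal state.
chain_features = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
chain_rewards = [0.0, 1.0]
theta_chain = td_lambda_linear(chain_features, chain_rewards, n_sweeps=2000)
```

On this toy chain the learned predictions approach the discounted gains of the two states, 0.9 and 1.0, which is the behaviour the prediction problem asks for.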
2.2. Dynamical adaptive fuzzy models
Without loss of generality, a MISO (Multiple Inputs-Single Output) fuzzy logic model is a linguistic model defined by the following
The DAFM is obtained from the previous rule base (6), by supposing input values defined by fuzzy singletons, Gaussian membership functions of the fuzzy sets defined for the fuzzy output variables, and the center-average defuzzification method. Then, the inference mechanism provides the following model (Cerrada et al., 2005):
In this work, the initial values of the parameters are randomly selected in a certain interval, the number of rules
Clearly, by taking the functions
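To make the structure of such a model more tangible, the sketch below evaluates a common textbook fuzzy system with singleton inputs, Gaussian membership functions, product inference and center-average defuzzification. It uses fixed parameters and is only an assumed static analogue: it does not reproduce model (7), and the dynamic parameter functions that characterise the DAFM are not modelled. The names `fuzzy_output`, `centers`, `widths` and `consequents` are hypothetical.

```python
import numpy as np

def fuzzy_output(x, centers, widths, consequents):
    """Center-average output of a Gaussian fuzzy system (illustrative sketch).

    x:           input vector of shape (n,)
    centers:     Gaussian centers, shape (M, n), one row per rule
    widths:      Gaussian widths, shape (M, n), strictly positive
    consequents: output center of each rule, shape (M,)
    """
    x = np.asarray(x, dtype=float)
    # normalised distance of the input to each rule's Gaussian center
    z = (x - np.asarray(centers, dtype=float)) / np.asarray(widths, dtype=float)
    # firing strength of each rule: product of the Gaussian memberships
    firing = np.exp(-np.sum(z ** 2, axis=1))
    # center-average defuzzification: firing-weighted mean of the rule outputs
    return float(firing @ np.asarray(consequents, dtype=float) / np.sum(firing))

# Hypothetical two-rule, one-input model
demo_centers = np.array([[0.0], [2.0]])
demo_widths = np.ones((2, 1))
demo_consequents = np.array([1.0, 3.0])
y_mid = fuzzy_output([1.0], demo_centers, demo_widths, demo_consequents)
```

An input halfway between the two centers fires both rules equally, so the output is the mean of the two consequents. The output is a non-linear function of the centers and widths, which is why a non-linear learning rule is needed to tune them.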
3. RL-based on-line identification algorithm
In this work, the fuzzy identification problem is solved by using the weighted identification error as the prediction function in the RL problem, and by suitably defining the reinforcement value according to the identification error. Thus, minimizing the prediction error (4) leads to minimizing the identification error.
Substituting (15) into (5) gives the following learning algorithm for parameter adjustment:
where the expression in equation (15) can be viewed as the
From (14), the function
Substituting (18) into (17) yields the learning algorithm.
In the prediction problem, a good estimation of
On the other hand, a suitable adjustment of the identification model means that the following condition is satisfied:
In this way, the identification error into the prediction function
Thus, an accurate adjustment of parameters is expected. Usually,
In this work, the proposed RL-based algorithm is applied to fuzzy identification and the identification model is provided by the DAFM in (7). Then, the prediction function
3.1. Gradient-descent-based analysis
The proposed identification learning algorithm can be studied as a gradient-descent method with respect to the parametric predictive function
In this case, an error measure is defined as:
In this work, the learning algorithm (17) resembles the learning algorithm (24), based on the gradient-descent method, where
Then, the parameter adjustment is made at each iteration in order to attain the expected value of the prediction function
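The gradient-descent view can be illustrated with a generic step on a squared error measure for a non-linear parametric prediction function. This is a sketch under stated assumptions, not the error measure or algorithm (24) of this work: the `tanh` prediction function, the step size `eta`, the fixed `target` and all names are hypothetical choices made only to keep the example self-contained.

```python
import numpy as np

def predict(theta, x):
    # hypothetical non-linear prediction function of the parameters
    return np.tanh(theta @ x)

def grad_predict(theta, x):
    # gradient of tanh(theta @ x) with respect to theta
    return (1.0 - np.tanh(theta @ x) ** 2) * x

def gradient_step(theta, x, target, eta):
    """One gradient-descent step on E = 0.5 * (target - predict(theta, x))**2."""
    error = target - predict(theta, x)
    # dE/dtheta = -error * grad_predict, so descending adds eta * error * gradient
    return theta + eta * error * grad_predict(theta, x)

theta0 = np.array([0.1, -0.2])
x_demo = np.array([1.0, 2.0])
target_demo = 0.5

theta_run = theta0.copy()
for _ in range(500):
    theta_run = gradient_step(theta_run, x_demo, target_demo, eta=0.1)
```

Each step moves the parameters so that the prediction approaches the target value. In a temporal-difference setting the target would itself be built from the reinforcement and the next prediction, which is where such an update departs from plain supervised gradient descent.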
4. Illustrative example
This section presents an illustrative example of fuzzy identification of time-varying non-linear systems, using the proposed on-line RL-based identification algorithm and the DAFM described in section 2.2. Comparisons with an off-line gradient-based tuning algorithm are presented in order to highlight the algorithm performance. For off-line adjustment purposes, the input-output training data are obtained from a Pseudo-Random Binary Signal (PRBS) input signal. The performance of the fuzzy identification is evaluated according to the identification relative error
The system is described by the following difference equation:
In this case, the unknown function
In the following, the fuzzy identification performance obtained with the DAFM and the proposed RL-based tuning algorithm is presented. Equation (17) is used for parameter adjustment, with the prediction function defined in (14) and the reinforcement defined in (21)-(22). Here,
4.1. Initial condition dependence
In order to show the sensitivity of the algorithm to the initial conditions of the fuzzy model parameters, the following figures show the tuning algorithm performance. In this case, the system is described by equation (31):
Figure 3 shows the tuning process by using a model with
The previous tests show that the performance and the sensitivity of the proposed on-line algorithm are adequate in terms of (a) the initial conditions of the DAFM parameters, (b) changes in the internal dynamics (the term
These are very important aspects to evaluate when considering an on-line identification algorithm. In the example, even though the initial error depends on the initial conditions of the DAFM parameters, the learning algorithm evolves well. Table 1 also shows the number of rules
This work proposes an on-line tuning algorithm based on reinforcement learning for the identification problem. Both the prediction function and the reinforcement signal have been defined by taking into account the identification error, in line with the classical recursive identification algorithms. The presence of the reinforcement signal in the proposed tuning algorithm makes it possible to reject the identification error in the prediction function, so that the parameter adjustment does not depend only on the gradient direction.
The proposed algorithm has been applied to fuzzy identification, so the prediction function is a non-linear function of the fuzzy model parameters. In this case, the proposed identification model is a Dynamical Adaptive Fuzzy Model (DAFM), which has reported good performance in identification problems.
In order to show the algorithm performance, an illustrative example of time-varying non-linear system identification using a DAFM has been developed. The results have been compared with those of the off-line gradient-based learning algorithm. The performance obtained by using the DAFM with the proposed on-line algorithm is adequate in terms of the main aspects to be taken into account in on-line identification: the initial conditions of the model parameters, changes in the internal dynamics, and changes in the input signal.
Even though similar results could be obtained by using the DAFM with off-line tuning, off-line tuning needs historical data of good quality and in sufficient quantity to reach a suitable validation phase. This highlights the value of on-line learning algorithms, and the proposed RL-based on-line tuning algorithm could be an important contribution to system identification in dynamical environments with perturbations, for example in the process control area.
This work has been partially presented at the International Symposium on Neural Networks ISNN 2006.
Cerrada, M.; Aguilar, J.; Colina, E. & Titli, A. (2002). An approach for dynamical adaptive fuzzy modeling, pp. 156-161, Hawaii, USA, May 2002, Canada
Cerrada, M.; Aguilar, J.; Colina, E. & Titli, A. (2005). Dynamical membership functions: an approach for adaptive fuzzy modelling, Vol. 152, No. 3 (June 2005), pp. 513-533
Ljung, L. (1997). Prentice Hall, New York
Schapire, R. E. & Warmuth, M. K. (1996). On the worst-case analysis of temporal differences learning algorithms, Vol. 22, pp. 95-121
Si, J. & Wang, Y.-T. (2001). On line learning control by association and reinforcement, Vol. 12, No. 2, pp. 264-276
Singh, S. P. & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces, Vol. 22, pp. 123-158
Sutton, R. S. & Barto, A. G. (1998). The MIT Press, Cambridge
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences, Vol. 3, pp. 9-44
Tesauro, G. (1995). Temporal difference learning and TD-Gammon, Vol. 38, No. 3, pp. 58-68
Van Buijtenen, W. M.; Schram, G.; Babuska, R. & Verbruggen, H. B. (1998). Adaptive fuzzy control of satellite attitude by reinforcement learning, Vol. 6, No. 2, pp. 185-194
Wang, L. X. (1994). Prentice Hall, New Jersey