1. Introduction
The Reinforcement Learning (RL) problem has been widely researched and applied in several areas (Sutton & Barto, 1998; Sutton, 1988; Singh & Sutton, 1996; Schapire & Warmuth, 1996; Tesauro, 1995; Si & Wang, 2001; Van Buijtenen et al., 1998). In dynamical environments, a learning agent receives rewards or penalties according to its performance, and uses them to learn good actions.
In identification problems, information from the environment is needed in order to propose an approximate model of the system; thus, RL can be used to gather this information on-line. Off-line learning algorithms have reported suitable results in system identification (Ljung, 1997); however, these results are bounded by the quality and quantity of the available data. In this way, the development of on-line learning algorithms for system identification is an important contribution.
In this work, an on-line learning algorithm based on RL with the Temporal Difference (TD) method is presented for identification purposes. Here, the basic propositions of RL with TD are used; as a consequence, the linear TD(λ) algorithm proposed in (Sutton & Barto, 1998) is modified and adapted for system identification, and the reinforcement signal is generically defined according to the temporal difference and the identification error. Thus, the main contribution of this paper is the proposition of a generic on-line identification algorithm based on RL.
The proposed algorithm is applied to the parameter adjustment of a Dynamical Adaptive Fuzzy Model (DAFM) (Cerrada et al., 2002; Cerrada et al., 2005). In this case, the prediction function is a non-linear function of the fuzzy model parameters, and a non-linear TD(λ) algorithm is obtained for the on-line adjustment of the DAFM parameters.
In the next section, the basic aspects of the RL problem and the DAFM are reviewed. Section 3 is devoted to the proposed on-line learning algorithm for identification purposes. The algorithm performance for the identification of time-varying non-linear systems is shown with an illustrative example in Section 4. Finally, conclusions are presented.
2. Theoretical background
2.1. Reinforcement learning and temporal differences
RL deals with the problem of learning based on trial and error in order to achieve the overall objective (Sutton & Barto, 1998). RL is related to problems where the learning agent does not know in advance which actions it must take. Thus, the agent must discover an action policy that maximizes the expected cumulative reward (return)

$$R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1},$$

where $r_{t+k+1}$ is the reward (reinforcement) received from the environment at time $t+k+1$ and $0 \leq \gamma \leq 1$ is a discount factor.
On the other hand, the TD method solves the prediction problem by taking into account the difference (error) between two prediction values computed at successive instants. Let $P_t$ denote the prediction of the return $R_t$, computed as a parametric function $P_t = P(s_t, \theta_t)$ of the state $s_t$ and the parameter vector $\theta_t$. The real value of $R_t$ is not available at time $t$; it is approximated by the reinforcement received one step later plus the discounted next prediction, which yields the prediction error

$$e_t = r_{t+1} + \gamma P_{t+1} - P_t, \qquad (4)$$

which describes a temporal difference. The reinforcement value $r_{t+1}$ is the scalar reward or penalty provided by the environment at time $t+1$. According to (Sutton & Barto, 1998), the parameters are then adjusted by

$$\theta_{t+1} = \theta_t + \alpha\, e_t \sum_{k=1}^{t} \lambda^{\,t-k}\, \nabla_{\theta} P_k, \qquad (5)$$

where $\alpha > 0$ is the learning rate and $0 \leq \lambda \leq 1$ weights the past gradients. The learning agent using equation (5) for the parameter adjustment is called the TD(λ) algorithm; when $P_t$ is a linear function of $\theta_t$, (5) corresponds to the linear TD(λ) algorithm.
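For concreteness, the following is a minimal sketch of the linear TD(λ) update (4)-(5), assuming a linear predictor $P_t = \theta^{\top}\phi(s_t)$ over state features $\phi$; the variable names are illustrative only.

```python
import numpy as np

def td_lambda_step(theta, trace, phi_t, phi_next, r_next,
                   alpha=0.05, gamma=0.95, lam=0.8):
    """One step of linear TD(lambda), following equations (4)-(5).

    theta    : parameter vector of the linear predictor P = theta . phi
    trace    : eligibility trace (discounted sum of past gradients)
    phi_t    : feature vector of the current state
    phi_next : feature vector of the next state
    r_next   : reinforcement received after the transition
    """
    p_t = theta @ phi_t                      # current prediction P_t
    p_next = theta @ phi_next                # next prediction P_{t+1}
    e_t = r_next + gamma * p_next - p_t      # temporal difference, equation (4)
    trace = lam * trace + phi_t              # in the linear case, grad of P_t is phi_t
    theta = theta + alpha * e_t * trace      # parameter adjustment, equation (5)
    return theta, trace
```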
2.2. Dynamical adaptive fuzzy models
Without loss of generality, a MISO (Multiple-Input, Single-Output) fuzzy logic model is a linguistic model defined by the following rule base:

$$R^{l}: \text{IF } x_1 \text{ is } F_1^{l} \text{ and } \ldots \text{ and } x_n \text{ is } F_n^{l} \text{ THEN } y \text{ is } G^{l}, \quad l = 1, \ldots, M, \qquad (6)$$

where $x_i$ ($i = 1, \ldots, n$) are the input variables, $y$ is the output variable, $F_i^{l}$ and $G^{l}$ are fuzzy sets, and $M$ is the number of rules.
The DAFM is obtained from the previous rule base (6) by assuming singleton fuzzification of the input values, Gaussian membership functions for the fuzzy sets defined for the fuzzy output variables, and the center-average defuzzification method. Then, the inference mechanism provides the fuzzy model (7) (Cerrada et al., 2005), whose adjustable parameters define the dynamical (time-varying) membership functions of the model.
In this work, the initial values of the parameters are randomly selected within a certain interval, and the number of rules M is fixed beforehand.
Clearly, by taking the functions that define the membership function parameters as constants, the DAFM reduces to a classical adaptive fuzzy model (Wang, 1994).
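As a point of reference, the sketch below implements the classical fuzzy model of this family, assuming singleton fuzzification, fixed Gaussian membership functions and center-average defuzzification; in the DAFM the membership function parameters additionally become time-varying (dynamical) functions. The function name and array shapes are illustrative.

```python
import numpy as np

def fuzzy_model_output(x, centers, spreads, theta):
    """Center-average fuzzy model with Gaussian membership functions.

    x       : input vector, shape (n,)
    centers : membership function centers, shape (M, n)
    spreads : membership function spreads, shape (M, n)
    theta   : consequent parameters, shape (M,)
    """
    # Firing strength of each rule: product of the Gaussian memberships
    w = np.exp(-(((x - centers) / spreads) ** 2)).prod(axis=1)
    # Center-average defuzzification
    return float(w @ theta / w.sum())

# Example: a 2-input model with M = 3 rules and randomly initialized parameters
rng = np.random.default_rng(0)
M, n = 3, 2
y_hat = fuzzy_model_output(np.array([0.2, -0.4]),
                           centers=rng.uniform(-1.0, 1.0, (M, n)),
                           spreads=rng.uniform(0.3, 1.0, (M, n)),
                           theta=rng.uniform(-1.0, 1.0, M))
```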
3. RL-based on-line identification algorithm
In this work, the fuzzy identification problem is solved by using the weighted identification error as a prediction function in the RL problem, and by suitably defining the reinforcement value according to the identification error. Thus, the minimization of the prediction error (4) leads to the minimization of the identification error.
Let the prediction function be defined as a weighted function of the identification error between the measured system output and the output of the identification model; this definition corresponds to equation (14), from which the expression in (15) is obtained.
By replacing (15) into (5), a learning algorithm for the parameter adjustment is obtained, where the expression in equation (15) can be viewed as the temporal difference associated with the identification error. From (14), the gradient of the prediction function with respect to the model parameters is given in (18). By replacing (18) into (17), the complete learning algorithm is obtained.
In the prediction problem, a good estimation of the prediction function is attained when the prediction error (4) tends to zero. On the other hand, a suitable adjustment of the identification model means that the following condition is accomplished: the identification error between the measured system output and the model output tends to zero.
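As a reference, this condition can be written generically as follows, assuming the usual notation $y(k)$ for the measured output, $\hat{y}(k)$ for the model output and a small tolerance $\varepsilon$ (these symbols are assumptions, not necessarily the paper's notation):

$$\lim_{k \to \infty} \left| y(k) - \hat{y}(k) \right| \leq \varepsilon, \qquad \varepsilon \geq 0 .$$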
The reinforcement value is defined in (21)-(22) according to the identification error, rewarding the agent when the error is small and penalizing it otherwise. In this way, the identification error embedded in the prediction function is rejected by the reinforcement signal, and the parameter adjustment does not depend only on the gradient direction. Thus, an accurate adjustment of the parameters is expected. Usually, the reinforcement is a bounded scalar signal provided by the environment.
In this work, the proposed RL-based algorithm is applied to fuzzy identification and the identification model is provided by the DAFM in (7). Then, the prediction function becomes a non-linear function of the fuzzy model parameters, and a non-linear TD(λ) algorithm is obtained for the on-line adjustment of the DAFM parameters.
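Since equations (14)-(18) and (21)-(22) are tied to the specific DAFM structure, the sketch below only illustrates how one step of such an RL-based on-line tuning could be organized. It assumes a hypothetical squared-error prediction function and a hypothetical threshold-based reinforcement as stand-ins for (14) and (21)-(22); model and model_grad denote any differentiable identification model (such as the DAFM) and the gradient of its output with respect to the parameters.

```python
import numpy as np

def reinforcement(id_error, tol=0.05):
    """Hypothetical reinforcement: reward small identification errors,
    penalize large ones (a stand-in for equations (21)-(22))."""
    return 1.0 if abs(id_error) < tol else -abs(id_error)

def rl_identification_step(theta, trace, prev_pred, x_t, y_t,
                           model, model_grad,
                           alpha=0.02, gamma=0.9, lam=0.5):
    """One on-line tuning step following the TD(lambda) form of (5),
    with the prediction function built from the identification error.

    theta      : current parameter vector of the identification model
    trace      : eligibility trace (discounted sum of past gradients)
    prev_pred  : prediction function value at the previous instant
    x_t, y_t   : current regressor vector and measured system output
    model      : callable model(x, theta) -> estimated output
    model_grad : callable model_grad(x, theta) -> d output / d theta
    """
    y_hat = model(x_t, theta)
    id_error = y_t - y_hat                           # identification error
    pred = 0.5 * id_error ** 2                       # hypothetical prediction function (cf. (14))
    r = reinforcement(id_error)                      # reinforcement signal (cf. (21)-(22))
    delta = r + gamma * pred - prev_pred             # temporal difference (cf. (4))
    grad_pred = -id_error * model_grad(x_t, theta)   # gradient of pred w.r.t. theta
    trace = lam * trace + grad_pred                  # eligibility trace (cf. the sum in (5))
    theta = theta + alpha * delta * trace            # parameter adjustment (cf. (5)/(17))
    return theta, trace, pred
```

In practice, the learning parameters (alpha, gamma, lam) and the reinforcement threshold must be chosen so that the prediction function, and therefore the identification error, decreases along the iterations.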
3.1. Descent-gradient-based analysis
The proposed identification learning algorithm can be studied as a descent-gradient method with respect to the parametric predictive function.
In this case, an error measure is defined as:

$$E_t = \frac{1}{2}\big(\hat{P}_t - P_t\big)^2,$$

where $\hat{P}_t$ denotes the expected (target) value of the prediction function at time $t$ and $P_t$ is the value currently provided by the parametric model.
In this work, the learning algorithm (17) has the form of a learning algorithm of type (24), which is based on the descent-gradient method.
Then, the parameter adjustment is made at each iteration in order to attain the expected value of the prediction function.
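As a worked illustration of this descent-gradient view, using the error measure $E_t$ defined above (a sketch, not necessarily the paper's exact equation (24)), the gradient step reads:

$$\theta_{t+1} = \theta_t - \alpha\,\frac{\partial E_t}{\partial \theta_t} = \theta_t + \alpha\,\big(\hat{P}_t - P_t\big)\,\nabla_{\theta} P_t .$$

If the target value is taken as $\hat{P}_t = r_{t+1} + \gamma P_{t+1}$, the difference $\hat{P}_t - P_t$ coincides with the temporal difference (4), so the TD-based adjustment (5) can be read as a descent-gradient step on $E_t$, with the eligibility trace generalizing the single gradient $\nabla_{\theta} P_t$.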
4. Illustrative example
This section shows an illustrative example of the fuzzy identification of a time-varying non-linear system by using the proposed on-line RL-based identification algorithm and the DAFM described in Section 2.2. Comparisons with an off-line gradient-based tuning algorithm are presented in order to highlight the algorithm performance. For the off-line adjustment, the input-output training data are obtained from a Pseudo-Random Binary Signal (PRBS) input. The performance of the fuzzy identification is evaluated according to the identification relative error and the Root Mean Square Error (RMSE).
The system is described by a time-varying non-linear difference equation in which the non-linear function defining the system dynamics is unknown. In this case, this unknown function is approximated by the DAFM.
In the following, the fuzzy identification performance obtained by using the DAFM with the proposed RL-based tuning algorithm is presented. Equation (17) is used for the parameter adjustment, with the prediction function defined in (14) and the reinforcement defined in (21)-(22).
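As an illustration of the evaluation setup, the sketch below shows one simple way to generate a PRBS excitation and to compute the RMSE reported in Table 1; the switching levels, holding times and seed are assumptions, not the settings used in the paper.

```python
import numpy as np

def prbs(n_samples, low=-1.0, high=1.0, min_hold=5, max_hold=20, seed=0):
    """Pseudo-Random Binary Signal: switches between two levels, holding
    each level for a random number of samples."""
    rng = np.random.default_rng(seed)
    u, level = [], low
    while len(u) < n_samples:
        u.extend([level] * int(rng.integers(min_hold, max_hold + 1)))
        level = high if level == low else low
    return np.array(u[:n_samples])

def rmse(y_measured, y_estimated):
    """Root Mean Square Error between measured and estimated outputs."""
    y_measured = np.asarray(y_measured)
    y_estimated = np.asarray(y_estimated)
    return float(np.sqrt(np.mean((y_measured - y_estimated) ** 2)))
```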
| M  | RMSE off-line | RMSE on-line |
|----|---------------|--------------|
| 6  | 0.1110        | 0.0838       |
| 8  | 0.1285        | 0.1084       |
| 10 | 0.1327        | 0.1044       |
| 15 | 0.1069        | 0.0860       |
| 20 | 0.1398        | 0.1056       |

Table 1. Comparison between the proposed on-line algorithm and off-line tuning (RMSE for different numbers of rules M).
4.1. Initial condition dependence
In order to show the sensitivity of the algorithm with respect to the initial conditions of the fuzzy model parameters, the following figures show the tuning algorithm performance. In this case, the system is described by the time-varying non-linear difference equation (31).
Figure 3 shows the tuning process by using a model whose initial parameter values are randomly selected.
The previous tests show that the performance and the sensitivity of the proposed on-line algorithm are adequate in terms of: (a) the initial conditions of the DAFM parameters, (b) changes in the internal dynamics of the system, and (c) changes in the input signal.
These are very important aspects to be evaluated when considering an on-line identification algorithm. In the example, even though the initial error depends on the initial conditions of the DAFM parameters, a good evolution of the learning algorithm is accomplished. Table 1 also shows that, for the different numbers of rules M, the RMSE obtained with the proposed on-line algorithm is lower than the one obtained with off-line tuning.
5. Conclusion
This work proposes an on-line tuning algorithm based on reinforcement learning for the identification problem. Both the prediction function and the reinforcement signal have been defined by taking into account the identification error, in accordance with classical recursive identification algorithms. The presence of the reinforcement signal in the proposed tuning algorithm permits the rejection of the identification error in the prediction function; hence, the parameter adjustment does not depend only on the gradient direction.
The proposed algorithm has been applied to fuzzy identification, where the prediction function is a non-linear function of the fuzzy model parameters. In this case, the proposed identification model is a Dynamical Adaptive Fuzzy Model (DAFM), which has reported good performance in identification problems.
In order to show the algorithm performance, an illustrative example related to time-varying non-linear system identification using a DAFM has been developed. The obtained results have been compared with those of an off-line gradient-based learning algorithm. The performance obtained by using the DAFM with the proposed on-line algorithm is adequate in terms of the main aspects to be taken into account in on-line identification: the initial conditions of the model parameters, changes in the internal dynamics, and changes in the input signal.
Even though similar results could be obtained by using the DAFM with off-line tuning, in that case historical data of good quality and in sufficient quantity are needed to reach a suitable validation phase. This highlights the value of on-line learning algorithms, and the proposed RL-based on-line tuning algorithm could be an important contribution to system identification in dynamical environments with perturbations, for example in the process control area.
Acknowledgments
This work has been partially presented at the International Symposium on Neural Networks (ISNN 2006).
References
1. Cerrada, M.; Aguilar, J.; Colina, E. & Titli, A. (2002). An approach for dynamical adaptive fuzzy modeling, pp. 156-161, Hawaii, USA, May 2002.
2. Cerrada, M.; Aguilar, J.; Colina, E. & Titli, A. (2005). Dynamical membership functions: an approach for adaptive fuzzy modelling. Fuzzy Sets and Systems, Vol. 152, No. 3, (June 2005), pp. 513-533.
3. Ljung, L. (1997). Prentice Hall, New York.
4. Schapire, R. E. & Warmuth, M. K. (1996). On the worst-case analysis of temporal-difference learning algorithms. Machine Learning, Vol. 22, pp. 95-121.
5. Si, J. & Wang, Y.-T. (2001). On-line learning control by association and reinforcement. IEEE Transactions on Neural Networks, Vol. 12, No. 2, pp. 264-276.
6. Singh, S. P. & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, Vol. 22, pp. 123-158.
7. Sutton, R. S. & Barto, A. G. (1998). Reinforcement Learning: An Introduction, The MIT Press, Cambridge.
8. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, Vol. 3, pp. 9-44.
9. Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, Vol. 38, No. 3, pp. 58-68.
10. Van Buijtenen, W. M.; Schram, G.; Babuska, R. & Verbruggen, H. B. (1998). Adaptive fuzzy control of satellite attitude by reinforcement learning. IEEE Transactions on Fuzzy Systems, Vol. 6, No. 2, pp. 185-194.
11. Wang, L.-X. (1994). Adaptive Fuzzy Systems and Control: Design and Stability Analysis, Prentice Hall, New Jersey.