Artificial neural network (NN) models have been widely adopted in the field of time series forecasting over the last two decades. As a kind of soft-computing method, neural forecasting systems are easier to build, thanks to their learning algorithms, than traditional linear or nonlinear models, which require advanced mathematical techniques and a long process to find optimized model parameters. The good function approximation ability and strong sample learning performance of NNs became known through the error back propagation (BP) learning algorithm applied to a feed-forward multi-layer NN called the multi-layer perceptron (MLP) (Rumelhart et al., 1986). Since this milestone of neural computing, there have been more than 5,000 publications on NNs for forecasting (Crone & Nikolopoulos, 2007).
To simulate complex phenomena, chaos models have been studied since the middle of the last century (Lorenz, 1963; May, 1976). Among NN models, the radial basis function network (RBFN) was applied to chaotic time series prediction early on (Casdagli, 1989). To design the structure of the hidden layer of an RBFN, a cross-validated subspace method was proposed, and the system was applied to predict noisy chaotic time series (Leung & Wang, 2001). A two-layered feed-forward NN, in which all hidden units use the hyperbolic tangent activation function and the final output unit uses a linear function, gave high prediction accuracy for the Lorenz system and the Henon and logistic maps (Oliveira et al., 2000).
For real time series data, NN and advanced NN models (Zhang, 2003) are reported to provide more accurate forecasts than traditional statistical models such as the autoregressive integrated moving average (ARIMA) model (Box & Jenkins, 1976), and the performance of different NNs on financial time series was confirmed by Kodogiannis & Lolis (2002). Furthermore, several time series forecasting competitions using benchmark data have been held in the past decades, in which many kinds of NN methods showed strong predictive ability against other new techniques, e.g., vector quantization, fuzzy logic, Bayesian methods, Kalman filters and other filtering techniques, support vector machines, etc. (Lendasse et al., 2007; Crone & Nikolopoulos, 2007).
Meanwhile, reinforcement learning (RL), a kind of goal-directed learning, has been widely applied in control theory, autonomous systems, and other fields of intelligent computation (Sutton & Barto, 1998). When the environment of an agent is a Markov decision process (MDP) or a partially observable Markov decision process (POMDP), exploratory behaviour lets the agent obtain reward or punishment from the environment, and the action policy is then modified to acquire more reward. When the prediction error for a time series is treated as reward or punishment from the environment, RL can be used to train predictors built from neural networks.
In this chapter, two kinds of neural forecasting systems using RL are introduced in detail: a self-organizing fuzzy neural network (SOFNN) (Kuremoto et al., 2003) and a multi-layer perceptron (MLP) predictor (Kuremoto et al., 2005). The results of experiments on Lorenz chaos showed the effectiveness of the method compared with a conventional learning method (BP).
2. Architecture of neural forecasting system
The general flow of neural forecasting is shown in Fig. 1. The time series data at step t can be embedded into a new n-dimensional space according to Takens' theorem (Takens, 1981). Eq. (1) gives the reconstructed vector space, which serves as the input layer of the NN; the time delay used there is arbitrary. An example of a 3-dimensional reconstruction is shown in Fig. 2. The output layer of a neural forecasting system usually has one neuron, whose output is the prediction result.
There are various architectures of NN models, including MLP, RBFN, recurrent neural network (RNN), autoregressive recurrent neural network (ARNN), neuro-fuzzy hybrid network, ARIMA-NN hybrid model, SOFNN, and so on. The training rules of NNs also vary widely: they include not only well-known methods, e.g., BP, orthogonal least squares (OLS), and fuzzy inference, but also evolutionary computation, e.g., genetic algorithm (GA), particle swarm optimization (PSO), and genetic programming (GP), as well as RL, and so on.
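The delay-coordinate reconstruction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name `delay_embed`, the ordering of the components (most recent value first), and the use of the next value as the one-step-ahead target are all choices made for this example.

```python
def delay_embed(series, n, tau):
    """Build (input_vector, target) pairs for one-step-ahead prediction.

    Assumed layout: the input at step t is
    (y(t), y(t - tau), ..., y(t - (n - 1) * tau)) and the target is y(t + 1).
    """
    if n < 1 or tau < 1:
        raise ValueError("n and tau must be positive integers")
    start = (n - 1) * tau  # first step with enough history
    pairs = []
    for t in range(start, len(series) - 1):
        x = [series[t - i * tau] for i in range(n)]
        pairs.append((x, series[t + 1]))
    return pairs


# Hypothetical toy series, embedded in 3 dimensions with delay 1.
toy_series = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
toy_pairs = delay_embed(toy_series, n=3, tau=1)
```

Each input vector has n components, matching the n neurons of the input layer, and each target corresponds to the single output neuron.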
2.1. MLP with BP
MLP, a feed-forward multi-layer network, is one of the best-known classical neural forecasting systems; its structure is shown in Fig. 3. BP is commonly used as its learning rule, and the system performs well in function approximation and nonlinear prediction.
The gradient parameter β is usually set to 1.0, and, to match the output range of f(u), the time series data should be rescaled to (0.0, 1.0).
BP is a supervised learning algorithm that uses sample data to train the NN to produce more accurate outputs by modifying all the connections between layers. Conventionally, the error function is the mean square error given by Eq. (5).
Here the discount parameter and the learning rate each take a value between 0.0 and 1.0. Training iterations continue until the error function has converged sufficiently.
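As a rough illustration of the role of β and of the rescaling step, the following Python sketch assumes that f(u) is the logistic sigmoid with gradient β, that min-max scaling is used, and that the network has no bias terms. None of these details is confirmed by the text here; all function names are hypothetical.

```python
import math


def sigmoid(u, beta=1.0):
    """Logistic sigmoid with gradient parameter beta (assumed form of f(u))."""
    return 1.0 / (1.0 + math.exp(-beta * u))


def minmax_scale(series):
    """Rescale a series linearly so its values lie between 0.0 and 1.0.

    Note: min-max scaling reaches the end points 0.0 and 1.0 exactly.
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series is constant and cannot be rescaled")
    return [(v - lo) / (hi - lo) for v in series]


def mlp_forward(x, w_hidden, w_out, beta=1.0):
    """Forward pass of a one-hidden-layer MLP without bias terms (assumption)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)), beta)
              for row in w_hidden]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)), beta)


scaled = minmax_scale([2.0, 4.0, 6.0])
```

Because the sigmoid output always lies strictly between 0 and 1, targets outside that range could never be matched, which is why the data are rescaled first.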
2.2. MLP with RL
One important feature of RL is its stochastic action policy, which enables exploration of adaptive solutions. Fig. 4 shows an MLP whose output layer is a single neuron with a Gaussian function. A hidden layer consisting of the variables of the distribution function is added. The activation function of the units in each hidden layer is still the sigmoid function (or hyperbolic tangent function) (Eq. (8)-(10)).
The prediction value is then given by Eq. (11).
In these equations the gradient constants belong to the sigmoid functions, and the connection weights link the kth hidden neuron to the neurons μ and σ in the statistical hidden layer and to the input neurons, respectively. The modification of these connection weights is calculated by the RL algorithm described in Section 3.
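The idea of a Gaussian output neuron can be sketched as follows. The example assumes that the output layer draws the prediction from a normal distribution whose mean μ and standard deviation σ are supplied by the statistical hidden layer; the density below is the standard normal density, and the function names are hypothetical.

```python
import math
import random


def gaussian_pdf(y, mu, sigma):
    """Density of the normal distribution with mean mu and deviation sigma."""
    return (math.exp(-((y - mu) ** 2) / (2.0 * sigma ** 2))
            / (math.sqrt(2.0 * math.pi) * sigma))


def sample_prediction(mu, sigma, rng):
    """Draw one stochastic prediction from the Gaussian output neuron."""
    return rng.gauss(mu, sigma)


# Illustrative values only: mu and sigma would come from the hidden layers.
rng = random.Random(0)
samples = [sample_prediction(0.5, 0.1, rng) for _ in range(10000)]
sample_mean = sum(samples) / len(samples)
```

A larger σ makes the predictions more exploratory, while a smaller σ concentrates them around μ.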
2.3. SOFNN with RL
A neuro-fuzzy hybrid forecasting system, SOFNN, trained with an RL algorithm is shown in Fig. 5. A hidden layer consisting of fuzzy membership functions categorizes each dimension of the input data at steps t = 1, 2, ..., S (Eq. (12)).
Fuzzy inference, which calculates the fitness of each rule for an input set, is executed by the fuzzy rules layer (Eq. (13)).
Here i = 1, 2, ..., n indexes the input dimensions; j indexes the membership functions, whose number is 1 initially; each membership function has its own mean and standard deviation for its input; and c denotes the membership function connected to the kth rule, taking values in j = 1, 2, ..., l, where l is the maximum number of membership functions. If an adaptive threshold is applied, the multiplication or combination of membership functions and rules can be realized automatically, so that the network has a self-organizing function for dealing with different features of the inputs.
Here the connection weights link the rules to the neurons for the mean and standard deviation of the stochastic function described by Eq. (11). The output of the system is obtained by generating a random value according to this probability function.
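The membership, rule fitness, and self-organizing steps can be illustrated with a short Python sketch. It rests on several assumptions not confirmed by the text here: Gaussian membership functions, rule fitness computed as the product of the connected memberships, and a growth rule that adds a membership function centred on the input when no existing one exceeds the threshold. All names and the data layout are hypothetical.

```python
import math


def membership(x, mean, std):
    """Gaussian membership value of x (assumed form of Eq. (12))."""
    return math.exp(-((x - mean) ** 2) / (2.0 * std ** 2))


def rule_fitness(x, rule, mfs):
    """Fitness of one rule as the product of its memberships (assumed Eq. (13)).

    rule[i] is the index c of the membership function used for input i;
    mfs[i] is the list of (mean, std) pairs for input i.
    """
    fitness = 1.0
    for i, c in enumerate(rule):
        mean, std = mfs[i][c]
        fitness *= membership(x[i], mean, std)
    return fitness


def grow_memberships(x, mfs, threshold, init_std, max_count):
    """Add a membership function centred on x[i] where none fits well enough."""
    added = []
    for i, xi in enumerate(x):
        best = max(membership(xi, m, s) for m, s in mfs[i])
        if best < threshold and len(mfs[i]) < max_count:
            mfs[i].append((xi, init_std))
            added.append(i)
    return added


# Two inputs, each starting with a single membership function.
mfs = [[(0.2, 0.1)], [(0.8, 0.1)]]
x = [0.2, 0.3]
before = rule_fitness(x, [0, 0], mfs)
added = grow_memberships(x, mfs, threshold=0.5, init_std=0.1, max_count=5)
after = rule_fitness(x, [0, 1], mfs)
```

In this toy case the second input is far from its only membership function, so a new one is created for it, and a rule using the new function fits the input fully.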
3. SGA of RL
3.1. Algorithm of SGA
An RL algorithm, Stochastic Gradient Ascent (SGA), was proposed by Kimura and Kobayashi (Kimura & Kobayashi, 1996, 1998) to deal with POMDPs and continuous action spaces. Experimental results showed that the SGA learning algorithm was successful on cart-pole control and maze problems. In time series forecasting, the output of the predictor can be regarded as an action of the agent, and the prediction error can be used as reward or punishment from the environment, so SGA can be used to train a neural forecasting system by renewing the internal variable vector of the NN (Kuremoto et al., 2003, 2005).
The SGA algorithm is given below.
Step 1. Observe an input from the time series training data.
Step 2. Predict a future value according to the probability given by the stochastic policy.
Step 3. Receive the immediate reward, which is determined from the prediction error.
Step 4. Calculate the characteristic eligibility and the eligibility trace.
Here the equations involve a discount factor, and the index i denotes the ith internal variable vector.
Step 5. Calculate the modification of each internal variable by Eq. (19).
Here b denotes the reinforcement baseline.
Step 6. Improve the policy by renewing its internal variables by Eq. (20).
Here the internal variables comprise the synaptic weights and the other internal variables of the forecasting system, and the learning rate is positive.
Step 7. For the next time step t+1, return to Step 1.
The characteristic eligibility, shown in Eq. (17), relates the change of the policy function to the change of the system's internal variable vector (Williams, 1992). In effect, the algorithm uses reward and punishment to modify the stochastic policy by renewing its internal variables through Step 4 and Step 5. Training stops when the prediction error on the sample data has converged sufficiently.
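The steps above can be sketched for a deliberately simplified predictor. The example assumes a Gaussian policy whose mean is a linear function of the input and whose deviation is fixed, a reward of +r when the absolute error is within a threshold and -r otherwise, and the usual SGA form in which the eligibility trace is the discounted sum of characteristic eligibilities. The exact form of Eq. (16)-(20) may differ, and every name below is hypothetical.

```python
import random


def reward(prediction, target, epsilon, r):
    """Step 3: +r inside the error threshold, -r outside (assumed form)."""
    return r if abs(prediction - target) <= epsilon else -r


def sga_update(w, trace, eligibility, rew, gamma, alpha, baseline=0.0):
    """Steps 4 to 6: update the eligibility trace, then renew the variables."""
    new_trace = [e + gamma * d for e, d in zip(eligibility, trace)]
    new_w = [wi + alpha * (rew - baseline) * d
             for wi, d in zip(w, new_trace)]
    return new_w, new_trace


def train_linear_gaussian(pairs, sigma=0.1, gamma=0.9, alpha=0.001,
                          epsilon=0.05, r=1.0, epochs=10, seed=0):
    """Run SGA on a linear-mean Gaussian policy (a toy stand-in for the NN)."""
    rng = random.Random(seed)
    n = len(pairs[0][0])
    w = [rng.random() for _ in range(n)]  # random initial values in (0, 1)
    trace = [0.0] * n
    for _ in range(epochs):
        for x, target in pairs:                            # Step 1
            mu = sum(wi * xi for wi, xi in zip(w, x))
            prediction = rng.gauss(mu, sigma)              # Step 2
            rew = reward(prediction, target, epsilon, r)   # Step 3
            # Characteristic eligibility: d ln(pi) / d w_i for the mean.
            elig = [(prediction - mu) / sigma ** 2 * xi for xi in x]
            w, trace = sga_update(w, trace, elig, rew, gamma, alpha)
    return w


toy_pairs = [([0.2, 0.1], 0.3), ([0.3, 0.2], 0.4)]
weights = train_linear_gaussian(toy_pairs)
```

In the forecasting systems of this chapter the internal variables also include the weights that feed the deviation σ, whose characteristic eligibilities follow the same pattern.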
3.2. SGA for MLP
For the MLP forecasting system described in Section 2.2 (Fig. 4), the characteristic eligibility of Eq. (21)-(23) can be derived from Eq. (8)-(11) with respect to the internal variables.
The initial values of the internal variables are random numbers in (0, 1) at the first iteration of training. The gradient constants and the reward parameters, including r, given by Eq. (16) take empirical values.
3.3. SGA for SOFNN
Here the membership function is given by Eq. (12) and the fuzzy inference by Eq. (13). The initial values of the internal variables are random numbers in (0, 1) at the first iteration of training. The reward r and the threshold of evaluation error in Eq. (16) take empirical values.
4. Experiments
A chaotic time series generated by the Lorenz equations was used as the benchmark for forecasting experiments with three systems: MLP using BP, MLP using SGA, and SOFNN using SGA. Prediction precision was evaluated by the mean square error (MSE) between the forecasted values and the time series data.
4.1. Lorenz chaos
The butterfly-like attractor generated by three ordinary differential equations (Eq. (28)) is well known from the early stage of the study of chaotic phenomena (Lorenz, 1969).
Here the coefficients of the equations are constants. In the forecasting experiments, the chaotic time series was taken from the dimension o(t) of Eq. (29).
The training sample consisted of 1,000 data points, and the following 500 points served as unknown data for evaluating the accuracy of short-term (i.e., one-step-ahead) prediction.
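A Lorenz time series of the kind used here can be generated with a simple Euler discretization, as in the Python sketch below. The constants, step size, and initial state are the classic textbook values (σ = 10, r = 28, b = 8/3, Δt = 0.01), chosen for illustration only; they are assumptions and not necessarily the values used in the chapter's experiments. Taking the first state variable as the observed series is likewise an assumption.

```python
def lorenz_series(steps, dt=0.01, sigma=10.0, r=28.0, b=8.0 / 3.0,
                  state=(1.0, 1.0, 1.0)):
    """Return `steps` samples of the first Lorenz variable (Euler method).

    All default parameter values are illustrative assumptions.
    """
    x, y, z = state
    series = [x]
    for _ in range(steps - 1):
        dx = sigma * (y - x)
        dy = x * (r - z) - y
        dz = x * y - b * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        series.append(x)
    return series


# 1,000 training points followed by 500 evaluation points.
lorenz_x = lorenz_series(1500)
```

Before being fed to a network with sigmoid outputs, such a series would still need to be rescaled to (0.0, 1.0).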
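The data split and the MSE evaluation can be written down directly. The sketch below is illustrative only; the function names are hypothetical and the placeholder series stands in for the rescaled Lorenz data.

```python
def split_series(series, n_train=1000, n_test=500):
    """Split a series into training data and the unknown data that follow."""
    if len(series) < n_train + n_test:
        raise ValueError("series is too short for the requested split")
    return series[:n_train], series[n_train:n_train + n_test]


def mse(predictions, targets):
    """Mean square error between forecasted values and time series data."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Placeholder data standing in for a rescaled time series.
data = [i / 1500.0 for i in range(1500)]
train, test = split_series(data)
example_error = mse([1.0, 2.0], [1.0, 4.0])
```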
4.2. Experiment of MLP using BP
It is both important and difficult to construct a good MLP architecture for nonlinear prediction. An experimental study (Oliveira et al., 2000) reported different prediction results for the Lorenz time series with the architecture n : 2n : n : 1, where n denotes the embedding dimension; the cases n = 2, 3, 4 were investigated for predictions over different terms, including long-term prediction.
For the short-term prediction considered here, a three-layer MLP using BP with the 3 : 6 : 1 structure shown in Fig. 3 was used in the experiment, and a time delay of 1 was used for the embedding input space. The gradient constant of the sigmoid function was β = 1.0, the discount constant was 1.0, the learning rate was 0.01, and training finished when the error E(W) fell below a preset threshold. The prediction results after 2,000 training iterations are shown in Fig. 6, and the change of the prediction error over the training iterations is shown in Fig. 7. The one-step-ahead prediction results are shown in Fig. 8. The 500-step MSE of one-step-ahead forecasting by MLP using BP was 0.0129.
4.3. Experiment of MLP using SGA
A four-layer MLP forecasting system with SGA and the 3 : 60 : 2 : 1 structure shown in Fig. 4 was used in the experiment, and a time delay of 1 was used for the embedding input space. With given gradient constants for the sigmoid functions, a discount constant of 0.9, a given learning rate, and the reward set by Eq. (30), training was stopped after 30,000 iterations, by which point convergence of E(W) could be observed. The prediction results after 0, 5,000, and 30,000 training iterations are shown in Fig. 9, Fig. 10, and Fig. 11 respectively. The change of the prediction error during training is shown in Fig. 12. The one-step-ahead prediction results are shown in Fig. 13. The 500-step MSE of one-step-ahead forecasting by MLP using SGA was 0.0112, which is 13.2% lower than that of MLP using BP (0.0129).
4.4. Experiment of SOFNN using SGA
A five-layer SOFNN forecasting system with SGA, with the structure shown in Fig. 5, was used in the experiment, and a time delay of 2 was used in 3-, 4-, or 5-dimensional embedding input spaces. The initial weights were random values in (0.0, 1.0), the discount was 0.9, the learning rate was preset, the reward r was set by Eq. (31), and training was again stopped after 30,000 iterations, by which point convergence of E(W) could be observed. The prediction results after training are shown in Fig. 14, where the number of input neurons was 4 and the data were rescaled to (0.0, 1.0). The change of the prediction error during training is shown in Fig. 15. The one-step-ahead prediction results are shown in Fig. 16. The 500-step MSE of one-step-ahead forecasting by SOFNN using SGA was 0.00048, which is 96.3% lower than that of MLP using BP (0.0129) and 95.7% lower than that of MLP using SGA (0.0112).
One advanced feature of SOFNN is its data-driven structure building. The numbers of membership function neurons and rules increased with the samples (1,000 steps in the training experiment) and the iterations (30,000 in the training experiment), as confirmed by Fig. 17 and Fig. 18. When training finished, the numbers of membership function neurons for the 4 input neurons were 44, 44, 44, and 45 respectively, and the number of rules was 143.
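The relative improvements can be rechecked from the three reported MSE values. The short Python sketch below treats "improvement" as the relative reduction in MSE, which is an interpretation of the wording rather than a definition given in the text; the function name is hypothetical.

```python
def mse_reduction_percent(baseline_mse, new_mse):
    """Relative reduction of the MSE, in percent of the baseline."""
    return (baseline_mse - new_mse) / baseline_mse * 100.0


# MSE values reported in the experiments.
MSE_MLP_BP = 0.0129
MSE_MLP_SGA = 0.0112
MSE_SOFNN_SGA = 0.00048

sga_vs_bp = mse_reduction_percent(MSE_MLP_BP, MSE_MLP_SGA)        # about 13.2
sofnn_vs_bp = mse_reduction_percent(MSE_MLP_BP, MSE_SOFNN_SGA)    # about 96.3
sofnn_vs_sga = mse_reduction_percent(MSE_MLP_SGA, MSE_SOFNN_SGA)  # about 95.7
```

Under this reading, the 96.3% figure is the reduction relative to MLP using BP and the 95.7% figure is the reduction relative to MLP using SGA.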
Although RL has developed into one of the most important methods of machine learning, it is still seldom adopted in forecasting theory and prediction systems. Two kinds of neural forecasting systems using SGA learning were described in this chapter, and the training and short-term forecasting experiments showed that they performed well compared with the conventional NN prediction method. Although MLP with SGA and SOFNN with SGA needed more training iterations than MLP with BP, the computation time of each was no more than a few minutes on a computer with a 3.0 GHz CPU.
One problem of these RL forecasting systems is that the value of the reward in the SGA algorithm strongly influences learning convergence, so the optimum reward has to be searched for experimentally for each time series. Another problem, specific to SOFNN with SGA, is how to tune the initial value of the deviation parameter of the membership functions and the threshold, both of which were also adjusted by observing the prediction error in training experiments. In fact, when SOFNN with SGA was applied to the neural forecasting competition “NN3”, where 11 time series sets were used as benchmarks, it did not perform well enough in long-term prediction compared with the results of other methods (Kuremoto et al., 2007; Crone & Nikolopoulos, 2007). All these problems remain to be resolved, and it is expected that RL forecasting systems will be developed remarkably in the future.