Training Deep Neural Networks with Reinforcement Learning for Time Series Forecasting

As efficient nonlinear function approximators, artificial neural networks (ANNs) have been widely applied to time series forecasting. ANNs are usually trained by error back-propagation (BP), a supervised learning algorithm proposed by Rumelhart et al. in 1986. To improve the robustness of ANNs in predicting unknown time series, the authors previously proposed training them with a reinforcement learning algorithm named stochastic gradient ascent (SGA), which was originally proposed by Kimura and Kobayashi for control problems in 1998. In 2012, we also successfully used a deep belief net (DBN) built by stacking multiple restricted Boltzmann machines (RBMs) for time series forecasting. This chapter introduces a state-of-the-art time series forecasting system that combines RBMs with a multilayer perceptron (MLP) and is trained by the SGA algorithm. Experimental results showed the high prediction precision of the novel system not only for benchmark data but also for real phenomenon time series data.


Introduction
Artificial neural networks (ANNs), which are mathematical models for function approximation, classification, pattern recognition, nonlinear control, etc., have been successfully applied to time series analysis and forecasting since the 1980s, in place of linear models such as ARIMA [1] from the 1970s [2][3][4][5][6][7]. In 1989, Casdagli [2] used a radial basis function network (RBFN), a kind of feed-forward neural network with Gaussian hidden units, to predict chaotic time series data such as the Mackey-Glass, the Ikeda map, and the Lorenz chaos. In 2004, Lendasse et al. [3,4] organized a time series forecasting competition for neural network prediction methods using a five-block artificial time series named CATS. The goal of the CATS competition was to predict the 100 missing values of the time series, which was divided into five sets, each containing 980 known values followed by 20 successive unknown values (details are in Section 3.1). There were 24 submissions to the competition, and five kinds of methods were selected by IJCNN2004: filtering techniques including Bayesian methods, Kalman filters, and so on; recurrent neural networks (RNNs); vector quantization; fuzzy logic; and ensemble methods. As the organizers commented, different prediction precisions were reported even when similar prediction methods were used, owing to the know-how and experience of the authors. The development of time series forecasting by ANNs is therefore still in progress.
As classifiers or function approximators, ANNs owe their advantages to the nonlinear transforms they apply to the input space. In fact, units (or neurons) with nonlinear firing functions that are connected to each other usually produce a higher-dimensional output space and various feature spaces in the network. Additionally, because an ANN is a connective system, it is not necessary to design a fixed mathematical model for each nonlinear phenomenon; it is enough to adjust the weights of the connections between units. According to the report of the NN3-Artificial Neural Networks and Computational Intelligence Forecasting Competition [5], more than 5000 publications on time series forecasting using ANNs had appeared by 2007.
To find suitable parameters of an ANN, such as the weights of the connections between neurons, the error back-propagation (BP) algorithm [6] is generally used in the training process. However, because every sample (a pair of input data and output data) is used in the BP method, noisy data influence the optimization of the model, and the robustness of the model to unknown input becomes weak. Another problem of ANN models is how to determine the structure of the network, i.e., the number of layers and the number of neurons in each layer. To overcome these problems of BP, Kuremoto et al. [7] adopted a reinforcement learning (RL) method, "stochastic gradient ascent (SGA)" [8], to adjust the connection weights of units, and particle swarm optimization (PSO) to find the optimal structure of the ANN. SGA, proposed by Kimura and Kobayashi, is an improvement of Williams' REINFORCE [9], which uses rewards to modify stochastic policies (likelihood). In the SGA learning algorithm, the accumulated modification of the policy, named the "eligibility trace," is used to adjust the parameters of the model (see Section 2). In the case of time series forecasting, the reward of the RL system can be defined by a suitable error zone instead of the distance (error) between the output of the model and the teacher data, which is used in the BP learning algorithm. The sensitivity to noisy data can therefore be reduced, and the robustness to unknown data may be raised. As a deep learning method for time series forecasting, Kuremoto et al. [10] were the first to apply Hinton and Salakhutdinov's deep belief net (DBN), a kind of stacked auto-encoder (SAE) composed of multiple restricted Boltzmann machines (RBMs) [11]. An improved DBN for time series forecasting, composed of multiple RBMs and a multilayer perceptron (MLP) [6], is proposed in [12].
The improved DBN with RBMs and MLP [6] is superior to the conventional DBN [5] for time series forecasting because it uses a continuous output unit, whereas the conventional one had a binary-valued unit in the output layer.
Just as the RL method SGA was applied to MLP, RBFN, and the self-organized fuzzy neural network (SOFNN) [7], the prediction precision of a DBN trained by SGA may also be higher than that obtained with the BP learning algorithm. Furthermore, the prediction precision can be raised by a hybrid model that first forecasts the future data with the linear model ARIMA and then corrects the forecast with the error predicted by an ANN trained on the error time series [13,14].
In this chapter, we concentrate on introducing the DBN composed of multiple RBMs and an MLP and show the higher efficiency of the RL method SGA for training the DBN [15,16].
The model of time series forecasting is given as follows:
$$x_{t+1} \approx f(x_t, x_{t-1}, \ldots, x_{t-n+1}) \quad (1)$$
where $t = 1, 2, 3, \ldots, T$ is the time, $n$ is the dimensionality of the input of the function $f(x)$, $x_t$ is the time series data, and $x_{t+1}$ is the unknown future data as well as the output of the model.
A deep belief net (DBN) composed of restricted Boltzmann machines (RBMs) and a multilayer perceptron (MLP) is shown in Figure 1.
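To make the forecasting model concrete, the following minimal sketch builds training pairs from a series: each input holds the n most recent values and the target is the next value. The function name `make_windows` and the toy series are illustrative assumptions, not part of the original system.

```python
def make_windows(series, n):
    """Return (inputs, targets) for one-ahead forecasting.

    Each input is (x_t, x_{t-1}, ..., x_{t-n+1}) and its target is x_{t+1}.
    """
    inputs, targets = [], []
    for t in range(n - 1, len(series) - 1):
        window = [series[t - k] for k in range(n)]  # most recent value first
        inputs.append(window)
        targets.append(series[t + 1])
    return inputs, targets


# Toy series used only for illustration
series = [0.1, 0.2, 0.3, 0.4, 0.5]
inputs, targets = make_windows(series, n=3)
for window, target in zip(inputs, targets):
    print(window, "->", target)
```

A series of length T yields T - n pairs, so n trades off the amount of context per sample against the number of training samples.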

RBM
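A restricted Boltzmann machine (RBM) is a kind of probabilistic generative neural network composed of two layers of units: a visible layer and a hidden layer (see Figure 2).
Units of different layers are connected to each other with weights $w_{ij} = w_{ji}$, where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$ index the units of the visible layer and the hidden layer, respectively. The outputs of units $v_i$, $h_j$ are binary, i.e., 0 or 1, except for the initial value of the visible units, which is given by the input data. The probabilities that a hidden unit and a visible unit take the value 1 are as follows:
$$p(h_j = 1 \mid v) = \frac{1}{1 + \exp\left(-b_j - \sum_{i=1}^{n} w_{ij} v_i\right)} \quad (2)$$
$$p(v_i = 1 \mid h) = \frac{1}{1 + \exp\left(-b_i - \sum_{j=1}^{m} w_{ij} h_j\right)} \quad (3)$$
Here $b_i$, $b_j$ are the biases of units. The learning rules of RBM are given as follows:
$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle - \langle \tilde{v}_i \tilde{h}_j \rangle \right) \quad (4)$$
$$\Delta b_i = \varepsilon \left( \langle v_i \rangle - \langle \tilde{v}_i \rangle \right) \quad (5)$$
$$\Delta b_j = \varepsilon \left( \langle h_j \rangle - \langle \tilde{h}_j \rangle \right) \quad (6)$$
where $\varepsilon$ is a learning rate, $\langle v_i h_j \rangle$, $\langle v_i \rangle$, $\langle h_j \rangle$ are the expectations of the first Gibbs sampling ($k = 0$), and $\langle \tilde{v}_i \tilde{h}_j \rangle$, $\langle \tilde{v}_i \rangle$, $\langle \tilde{h}_j \rangle$ are those of the $k$th Gibbs sampling; the rules work even when $k = 1$.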
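The RBM learning rules with k = 1 can be sketched as below. This is a minimal pure-Python illustration under stated assumptions rather than the authors' implementation: the hidden probabilities (instead of sampled binary states) are used in the update statistics, which is a common practical choice, and the layer sizes, learning rate, and input vector are arbitrary.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def hidden_probs(v, w, b_hid):
    # p(h_j = 1 | v) for every hidden unit j
    return [sigmoid(b_hid[j] + sum(w[i][j] * v[i] for i in range(len(v))))
            for j in range(len(b_hid))]


def visible_probs(h, w, b_vis):
    # p(v_i = 1 | h) for every visible unit i
    return [sigmoid(b_vis[i] + sum(w[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b_vis))]


def sample(probs, rng):
    # Draw binary states from Bernoulli probabilities
    return [1.0 if rng.random() < p else 0.0 for p in probs]


def cd1_update(v0, w, b_vis, b_hid, lr, rng):
    """One contrastive-divergence step with k = 1; parameters change in place."""
    ph0 = hidden_probs(v0, w, b_hid)                  # statistics at k = 0
    h0 = sample(ph0, rng)
    v1 = sample(visible_probs(h0, w, b_vis), rng)     # reconstruction of the input
    ph1 = hidden_probs(v1, w, b_hid)                  # statistics at k = 1
    for i in range(len(v0)):
        for j in range(len(b_hid)):
            w[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        b_vis[i] += lr * (v0[i] - v1[i])
    for j in range(len(b_hid)):
        b_hid[j] += lr * (ph0[j] - ph1[j])


rng = random.Random(0)
n_visible, n_hidden = 3, 2                            # illustrative sizes
w = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_visible)]
b_vis = [0.0] * n_visible
b_hid = [0.0] * n_hidden
cd1_update([0.2, 0.7, 0.5], w, b_vis, b_hid, lr=0.1, rng=rng)
print(w)
```

In pretraining, such a step would be repeated over all input windows for each RBM in turn, with the hidden activations of one RBM serving as the visible input of the next.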

MLP
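A multilayer perceptron (MLP) is the most popular neural network and is generally composed of three layers of units: an input layer, a hidden layer, and an output layer (see Figure 3).
The output of the output unit $y = f(z)$ and of the hidden units $z_j = f(x)$ is given by the following logistic sigmoid functions:
$$y = \frac{1}{1 + \exp\left(-\sum_{j=1}^{K} w_j z_j\right)} \quad (7)$$
$$z_j = \frac{1}{1 + \exp\left(-\sum_{i=1}^{n} v_{ji} x_i\right)} \quad (8)$$
Here $n$ is the dimensionality of the input, $K$ is the number of hidden units, and $v_{ji}$, $w_j$ are the connection weights between the input and hidden layers and between the hidden and output layers, respectively. The learning rules of MLP using the error back-propagation (BP) method [5] are given as follows:
$$\Delta w_j = -\varepsilon (y - \tilde{y}) \, y (1 - y) \, z_j \quad (9)$$
$$\Delta v_{ji} = -\varepsilon (y - \tilde{y}) \, y (1 - y) \, w_j \, z_j (1 - z_j) \, x_i \quad (10)$$
where $0 < \varepsilon < 1$ is a learning rate and $\tilde{y}$ is the teacher signal. The learning algorithm of MLP using BP is as follows:
Step 1. Observe an input $x(t) = (x_t, x_{t-1}, \ldots, x_{t-n+1})$.
Step 2. Predict a future data $y_t = x_{t+1}$ according to Eqs. (7) and (8).
Step 3. Calculate the modification of connection weights, $\Delta w_j$, $\Delta v_{ji}$, according to Eqs. (9) and (10).
Step 4. Modify the connections, $w_j \leftarrow w_j + \Delta w_j$, $v_{ji} \leftarrow v_{ji} + \Delta v_{ji}$.
Step 5. For the next time step $t + 1$, return to step 1.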
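The five BP steps can be illustrated with the following sketch of a three-layer MLP trained online. It assumes a squared-error loss, sigmoid hidden and output units, and no bias terms; the sizes, learning rate, and single toy sample are arbitrary choices made for illustration.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, v, w):
    # Hidden outputs z_j and network output y, both logistic sigmoids
    z = [sigmoid(sum(v_j[i] * x[i] for i in range(len(x)))) for v_j in v]
    y = sigmoid(sum(w[j] * z[j] for j in range(len(z))))
    return z, y


def bp_step(x, target, v, w, lr):
    """One online BP update for the squared error 0.5 * (y - target)^2."""
    z, y = forward(x, v, w)
    delta_out = (target - y) * y * (1.0 - y)
    for j in range(len(w)):
        delta_hid = delta_out * w[j] * z[j] * (1.0 - z[j])  # uses w_j before its update
        for i in range(len(x)):
            v[j][i] += lr * delta_hid * x[i]
        w[j] += lr * delta_out * z[j]
    return y


rng = random.Random(0)
n_inputs, n_hidden = 3, 4                                   # illustrative sizes
v = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(n_hidden)]
w = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
x, target = [0.2, 0.4, 0.6], 0.8                            # toy sample scaled into (0, 1)

error_before = (forward(x, v, w)[1] - target) ** 2
for _ in range(200):
    bp_step(x, target, v, w, lr=0.5)
error_after = (forward(x, v, w)[1] - target) ** 2
print(error_before, error_after)
```

Because the output unit is a sigmoid, the time series is assumed to be scaled into the interval (0, 1) before training.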

The training method of DBN
As in the training process proposed in [10], the training of the DBN is performed in two steps. The first step, pretraining, applies the learning rules of the RBM, i.e., Eqs. (4)-(6), to each RBM independently. The second step is a fine-tuning process that uses the pretrained parameters of the RBMs and the BP algorithm. These processes are shown in Figure 4 and Eqs. (11)-(13).
In the case of reinforcement learning (RL), the output is decided by a probability distribution, e.g., the Gaussian distribution $y \sim \pi(\mu, \sigma^2)$. The output units are therefore the mean $\mu$ and the standard deviation $\sigma$ instead of a single unit $y$.
The learning algorithm of stochastic gradient ascent (SGA) [7] is as follows.
Step 1. Observe an input $x(t) = (x_t, x_{t-1}, \ldots, x_{t-n+1})$.
Step 2. Predict a future data $y_t = x_{t+1}$ according to a probability $y_t \sim \pi(x_t, w)$ with an ANN model constructed from the parameters $w = (w_{\mu j}, w_{\sigma j}, w_{ij}, v_{ji})$.
Step 3. Receive a scalar reward/punishment $r_t$ determined by comparing the prediction error with $\zeta$, an evaluation constant greater than or equal to zero.
Step 4. Calculate the characteristic eligibility $e_i(t)$ and the eligibility trace $D_i(t)$:
$$e_i(t) = \frac{\partial}{\partial w_i} \ln \pi(y_t; x_t, w) \quad (18)$$
$$D_i(t) = e_i(t) + \gamma D_i(t-1) \quad (19)$$
where $0 \le \gamma < 1$ is a discount factor and $w_i$ denotes the $i$th internal variable of the DBN.
Step 5. Calculate the modification $\Delta w_i(t)$:
$$\Delta w_i(t) = (r_t - b) D_i(t) \quad (20)$$
where $b \ge 0$ denotes the reinforcement baseline (it can be set to zero).
Step 6. Improve the policy, Eq. (16), by renewing its internal variable $w_i$ by Eq. (21):
$$w_i \leftarrow w_i + \varepsilon \Delta w_i(t) \quad (21)$$
where $0 \le \varepsilon \le 1$ is a learning rate.
Step 7. For the next time step $t + 1$, return to step 1.
The characteristic eligibility $e_i(t)$, shown in Eq. (18), expresses how the policy function changes with a change of the system's internal variable vector. In effect, the algorithm combines the reward/punishment with the eligibility trace to modify the stochastic policy, renewing its internal variables through Step 4 and Step 5.
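The SGA steps above can be sketched for a deliberately simple policy. The following is an illustration under several assumptions that are not taken from the chapter: the policy mean is a linear function of the input rather than the output of a DBN, the standard deviation is held fixed, and the reward is +1 when the squared prediction error is within the zone ζ and -1 otherwise. All numeric settings are arbitrary.

```python
import math
import random


def sga_step(x, x_next, w, trace, sigma, zeta, gamma, lr, baseline, rng):
    """One SGA update for a Gaussian policy whose mean is linear in the input."""
    mu = sum(w_i * x_i for w_i, x_i in zip(w, x))
    y = rng.gauss(mu, sigma)                               # Step 2: stochastic prediction
    reward = 1.0 if (y - x_next) ** 2 <= zeta else -1.0    # Step 3: assumed +1/-1 error zone
    for i in range(len(w)):
        e_i = (y - mu) / sigma ** 2 * x[i]                 # Step 4: d ln(pi) / d w_i
        trace[i] = e_i + gamma * trace[i]                  # Step 4: eligibility trace D_i
        w[i] += lr * (reward - baseline) * trace[i]        # Steps 5 and 6
    return y, reward


rng = random.Random(0)
series = [0.5 + 0.4 * math.sin(0.3 * t) for t in range(200)]   # toy series
n = 3
w = [0.0] * n
trace = [0.0] * n
rewards = []
for t in range(n - 1, len(series) - 1):
    x = [series[t - k] for k in range(n)]                  # Step 1: observe the input
    _, r = sga_step(x, series[t + 1], w, trace, sigma=0.1, zeta=0.01,
                    gamma=0.9, lr=0.001, baseline=0.0, rng=rng)
    rewards.append(r)
print(w, sum(rewards) / len(rewards))
```

Note that the weights change only through the sign of the reward and the eligibility trace, so a single noisy sample cannot contribute an arbitrarily large error term as it can under a squared-error loss.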
The characteristic eligibilities $e_{w_{\mu j}}(t)$, $e_{w_{\sigma j}}(t)$, $e_{v_{ij}}(t)$ in the MLP part of the DBN, and the $e_i(t)$ of the RBM of the $L$th layer of the DBN, are derived from Eq. (18) by the chain rule. The learning rate $\varepsilon$ in Eq. (21) affects the learning performance of the fine-tuning of the DBN: different values result in different training errors (mean squared error (MSE)), as shown in Figure 5. An adaptive learning rate defined as a linear function of the learning error, with a constant coefficient $\beta \ge 0$, is proposed in Eq. (27).
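As a worked illustration of the characteristic eligibility for a Gaussian policy, the sketch below gives the derivatives of the log-density with respect to the mean and the standard deviation and checks them against finite differences. It covers only the output distribution and assumes that σ denotes the standard deviation; propagating these terms to the individual weights requires the chain rule through whatever network produces μ and σ, which is not shown here.

```python
import math


def log_gaussian(y, mu, sigma):
    # ln of the Gaussian density with mean mu and standard deviation sigma
    return -math.log(sigma) - 0.5 * math.log(2.0 * math.pi) - (y - mu) ** 2 / (2.0 * sigma ** 2)


def eligibility_mu(y, mu, sigma):
    # d ln(pi) / d mu
    return (y - mu) / sigma ** 2


def eligibility_sigma(y, mu, sigma):
    # d ln(pi) / d sigma
    return ((y - mu) ** 2 - sigma ** 2) / sigma ** 3


def numeric_grad(f, a, h=1e-6):
    # Central finite difference, used only to check the analytic forms
    return (f(a + h) - f(a - h)) / (2.0 * h)


y, mu, sigma = 0.7, 0.5, 0.4                               # arbitrary test point
print(eligibility_mu(y, mu, sigma), numeric_grad(lambda m: log_gaussian(y, m, sigma), mu))
print(eligibility_sigma(y, mu, sigma), numeric_grad(lambda s: log_gaussian(y, mu, s), sigma))
```

The σ term changes sign when the squared deviation equals σ², so outputs far from the mean push σ up and outputs close to it push σ down.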

Optimization of meta-parameters
The number of RBMs that constitute the DBN and the number of neurons in each layer seriously affect the prediction performance. In [9], the particle swarm optimization (PSO) method is used to decide the structure of the DBN, and in [13] it is suggested that the random search method [16] is more efficient. In the time series forecasting experiments with DBN and SGA shown in this chapter, these meta-parameters were decided by random search within preset exploration limits. The optimization algorithm of these meta-parameters by the random search method is as follows:
Step 1. Set random values of the meta-parameters within the exploration limits.
Step 2. Predict a future data $y_t \approx x_{t+1}$ by MLP or DBN using the current weighted connections.
Step 3. If the error between $y_t$ and $x_{t+1}$ is reduced enough, store the values of the meta-parameters; else if the error no longer changes, stop the exploration; otherwise return to step 1.
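The random search procedure can be sketched as follows. The exploration limits, the parameter names, the stopping rule based on a fixed number of trials without improvement, and the `toy_error` stand-in for training and evaluating a network are all hypothetical choices made for illustration; the chapter's actual limits are not reproduced here.

```python
import random

# Hypothetical exploration limits: (low, high) per meta-parameter
LIMITS = {"n_rbm": (1, 3), "n_units": (2, 20), "learning_rate": (1e-3, 1e-1)}


def sample_params(limits, rng):
    # Step 1: draw every meta-parameter uniformly within its limits
    params = {}
    for name, (low, high) in limits.items():
        if isinstance(low, int):
            params[name] = rng.randint(low, high)
        else:
            params[name] = rng.uniform(low, high)
    return params


def random_search(evaluate, limits, max_trials, patience, rng):
    """Keep the best setting found; stop after `patience` trials without improvement."""
    best_params, best_error, stale = None, float("inf"), 0
    for _ in range(max_trials):
        params = sample_params(limits, rng)
        error = evaluate(params)          # Step 2: train and predict, then measure the error
        if error < best_error:            # Step 3: store improved meta-parameters
            best_params, best_error, stale = params, error, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_params, best_error


def toy_error(params):
    # Stand-in for the prediction error of a trained network
    return (params["n_units"] - 10) ** 2 + abs(params["learning_rate"] - 0.01)


best, err = random_search(toy_error, LIMITS, max_trials=50, patience=10, rng=random.Random(0))
print(best, err)
```

In a real experiment `evaluate` would pretrain and fine-tune a network with the sampled structure and return its validation error, which makes each trial far more expensive than in this toy setting.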

CATS benchmark time series data
CATS time series data is the artificial benchmark data for the forecasting competition with ANN methods [3,4]. This artificial time series consists of 5000 data, among which 100 are missing (hidden by the competition organizers). The missing data exist in five blocks:
• Elements 981 to 1000
• Elements 1981 to 2000
• Elements 2981 to 3000
• Elements 3981 to 4000
• Elements 4981 to 5000
The mean square error $E_1$ is used as the prediction precision in the competition, and it is computed from the 100 missing data and their predicted values as follows:
$$E_1 = \frac{1}{100} \sum_{t \in M} (y_t - x_t)^2$$
where $M$ is the set of the 100 missing element numbers listed above, $x_t$ is the missing data, and $y_t$ is the long-term prediction result of the missing data. The CATS time series data is shown in Figure 6.
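The $E_1$ measure over the five missing blocks can be computed as in the sketch below. The dictionary-based data layout keyed by 1-based element number and the synthetic values are assumptions for illustration only.

```python
# Inclusive (start, end) element numbers of the five missing blocks
MISSING_BLOCKS = [(981, 1000), (1981, 2000), (2981, 3000), (3981, 4000), (4981, 5000)]


def missing_indices(blocks=MISSING_BLOCKS):
    # Expand the block boundaries into the 100 missing element numbers
    return [t for start, end in blocks for t in range(start, end + 1)]


def e1(true_values, predictions, blocks=MISSING_BLOCKS):
    """Mean square error over the missing elements.

    Both arguments are dicts keyed by 1-based element number.
    """
    idx = missing_indices(blocks)
    return sum((predictions[t] - true_values[t]) ** 2 for t in idx) / len(idx)


# Synthetic example: every prediction is off by exactly 2
true_values = {t: float(t) for t in missing_indices()}
predictions = {t: float(t) + 2.0 for t in missing_indices()}
print(len(missing_indices()), e1(true_values, predictions))
```

Because each block requires predicting 20 successive unknown values, the predictions entering $E_1$ are long-term forecasts in which each predicted value is fed back as input for the next.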
The prediction results for the different blocks of CATS data are shown in Figure 7. Compared with the conventional learning method of DBN, i.e., Hinton's unsupervised RBM learning method [6,8] followed by back-propagation (BP), the proposed method, which uses the reinforcement learning method SGA instead of BP, was superior in terms of the average prediction precision $E_1$ (see Figure 7f). In addition, the proposed method, DBN with SGA, yielded the highest prediction precision (in the $E_1$ measurement) among all previous studies, such as MLP with BP, the best prediction of the CATS competition at IJCNN'04 [4], the conventional DBNs with BP [9,11], and hybrid models [13]. The details are shown in Table 1.
The meta-parameters obtained by the random search method are shown in Table 2. We also found that the learning MSE, i.e., the MSE of the one-ahead prediction results, showed worse convergence for the proposed method than for the conventional BP training. Figure 8 shows the learning MSE of the two methods for the first block. The MSE given by BP converged over a long training process, whereas SGA gave an unstable prediction MSE. However, following the basic idea of a sparse model, the better long-term prediction results of the proposed method suggest that it may successfully avoid the over-fitting problem, which arises when a model is fitted too strictly to the training samples and loses its robustness for unknown data.

Real time series data
Three types of natural phenomenon time series data provided by Aalto University [17] were used in the one-ahead forecasting experiments of real time series data. The prediction results of these three datasets are shown in Figure 9. Short-term prediction error is shown in Table 3. DBN with the SGA learning method was superior in all cases.

Table 2. Meta-parameters of DBN used for the CATS data (block 1).

Figure 8. Change of the learning error during fine-tuning (CATS data).
The efficiency of random search in finding the optimal meta-parameters, i.e., the structure of the RBMs and MLP, learning rates, discount factor, etc., which are explained in Section 2.5, is shown in Figure 10 for the case of DBN with the SGA learning algorithm. The random search results are shown in Table 4.
We also used seven types of natural phenomenon time series data from TSDL [18]. The data to be predicted were chosen based on [19] and are named Lynx, Sunspots, River flow, Vehicles, RGNP, Wine, and Airline. The short-term (one-ahead) prediction results are shown in Figure 11 and Table 5.
From Table 5, it can be confirmed that SGA was superior to BP except in the cases of Vehicles and Wine. Table 6 shows an interesting result of the random search for meta-parameters: the structures of the DBN differed between datasets, not only in the number of units in each layer but also in the number of RBMs. With the SGA learning method, the numbers of layers for Sunspots, River flow, and Wine were larger than those of the DBN using BP learning.

Discussions
The experimental results showed that the DBN composed of multiple RBMs and an MLP is the state-of-the-art predictor compared with all conventional methods in the case of CATS data. Furthermore, for real time series data, training the DBN with the RL method SGA may be more efficient than using the conventional BP algorithm. Here let us glance back at the development of this useful deep learning method.
• Why is the DBN composed of multiple RBMs and an MLP [11,13] better than the DBN with multiple RBMs only [9]?
The output of the last RBM of the DBN, i.e., a hidden unit of the last RBM, has a binary value during the pretraining process. The weights of the connections between this unit and the units of the visible layer of the last RBM are affected by it and have lower complexity than when multiple units with continuous values are used, i.e., an MLP, or the so-called full connections in deep learning architectures.
• How are RL methods used in ANN training?
In 1992, Williams proposed adopting an RL method named REINFORCE to modify artificial neural networks [8]. In 2008, Kuremoto et al. showed that the RL method SGA is more efficient than the conventional BP method in the case of time series forecasting [6]. Recently, researchers at DeepMind Ltd. adopted RL into deep neural networks, which resulted in the famous game software AlphaGo [20][21][22][23].
• Why is SGA more efficient than BP?
Generally, the training process of an ANN by BP uses the mean square error as the loss function, so every sample, including noisy data, affects the learning process and its results. Meanwhile, SGA uses a reward, which may be an error zone, to modify the parameters of the model. It therefore has higher robustness to noisy data and to the unknown data of real problems.

Conclusions
A deep belief net (DBN) composed of multiple restricted Boltzmann machines (RBMs) and a multilayer perceptron (MLP) for time series forecasting was introduced in this chapter. The training method of the DBN was also discussed, and a reinforcement learning (RL) method, stochastic gradient ascent (SGA), was shown to be superior to the conventional error back-propagation (BP) learning method. The robustness of SGA comes from the use of a relaxed prediction error during the learning process, in contrast to the BP method, which uses the errors of every sample to modify the model. Additionally, the structure of the DBN was optimized by the random search method. Time series forecasting experiments using the benchmark CATS data and real time series datasets showed the effectiveness of the DBN. As for future work, some problems remain to be solved, such as how to design the variable learning rate and the reward, which strongly influence the learning performance, and how to prevent the explosion of the characteristic eligibility trace in SGA.

Table 3. Prediction MSE of real time series data [17].

Table 6. Size of time series data and structure of prediction network.

Author details