## 1. Introduction

An essential element for many management decisions is an accurate forecasting. There are several methods and techniques to forecast time series that include traditional forecasting techniques with theoretical foundations in statistics. These methods present some obstacles and complexities to overcome; one of the most important ones is the difficulty to select the model that can provide the best adjustment for a specific dataset, many attempts have to be usually done until the best model can be obtained. Considering this scenario, different machine learning techniques have been recently used in this problem, such as Artificial Neural Network (ANN), Evolutionary Computation (EC), in particular, Genetic Programming (GP), which is considered a promising approach to forecast noisy complex series (Kaboudan, 2000), there are many other works founded in the literature that use (GP) to Time Series Prediction. On the other hand, recently advances in the machine learning field show that the application of the Boosting algorithm is a powerful approach to increase the accuracy of forecasting methods. Boosting algorithm was proposed and developed by Freund and Schapire (1996). According to Allwein et al. (2000), Boosting is a method of finding a highly accurate hypothesis by combining many "weak" hypotheses, each of which is only moderately accurate. Paris et al. (2004) proposed GPBoost that uses the Boosting algorithm with the GP as base learner. We have proposed a new formula for the updating of the weights and for obtain the final hypothesis of the predictor. This algorithm was called of Boosting Correlation Coefficients (BCC) and it is based on the correlation coefficient instead of the loss function used by traditional Boosting algorithms. To evaluate this approach we conducted three experiments. In the first one, the BCC was used to forecast real time series, in this experiment the mean squared error (MSE) has been used to compare the accuracy of the proposed method against the results obtained by GP, GPBoost and the traditional statistical methodology (ARMA). In the second, to prove the efficiency of the proposed methodology a widespread Monte Carlo simulation was done covering the entire ARMA spectrum, in which artificial series were generated from the parametric space of the principal ARMA models, they are AR(1), AR(2), MA(1), MA(2) e ARMA(1,1). The database generated was composed by 214.000 time series with 150 observations each one. The training set was composed by 90% of date and the others 10% composes the test set. The results were compared out of sample and the BCC showed better performance than ARMA methodology, Genetic Programming and GPBoost. Finally, the BCC algorithm was also applied to multiple regressions problem and the results obtained from this method were compared with the results from Artificial Neural Network, Model Tree and Boosting. This comparison showed that the BCC supplied better results than other ones. In way compare the performance of the BCC methodology with other methods, many statistical tests were performed such as Median Square Error (MSE), Root Median Square Error (RMSE) and a non parametric test Friedman. The results were compared out of sample and the BCC methodology had been presented accurate forecasts. Future research Considering that GP is able to provide solutions of high quality, and after the success of our own experiments (Souza et al., 2007a), we are encouraged to further explore GP towards finding solutions to the problem of modeling and pattern discovery of complex time series and in additional we will investigate the procedure BCC using GP as a base learner to analyze probabilistic and stochastic processes. We will investigate new tools that can work GP to more effectively solve this problem. One of the most important applications for the time series analysis is in stock markets. The goal of this task is to choose the best stocks when making an investment, and to decide which is the best strategy at the moment. Therefore, we will investigate the appropriate means for using GP in this task, as well as other general problems in financial time series. An another application that we must investigate is in Biological Networks, for example, gene regulatory network.

## 2. Genetic programming

Genetic Programming (GP) is an Evolutionary Computation Technique in which the individuals are computational programs. This theory was developed by John Koza (1992) and it is based on Genetic Algorithm (GA) presented by John Holland (1975). In accordance to Banzhaf (1998) and Kaboudan (2000) GP is known as an effective research paradigm in Artificial Intelligence and Machine Learning, and have been studied in the most diverse areas of knowledge, such as: digital cirucuits, data mining, molecular biology, optimization taks and another ones. In nature, those individuals that better adapt to the environment that surrounds them, have greater chance to survive. They pass their genetic characteristics to their descendents, who will suffer modifications to better adapt to the environment. After many generations, this population reaches a natural evolution. In Genetic Programming (GP), the evolutionary algorithm operates over a population of programs that have different forms and sizes. The initial population must have enough diversity, that is, the individuals must have most of the characteristics that are necessary to solve the problem, because characteristics that do not exist in the initial population will probably not appear during the evolutionary process. The evolutionary process is guided by a fitness function that measures the individual’s ability to solve the problem. Those individuals that better solve the problem will receive a better fitness value and consequently, will have a better chance to be selected for the next generation. The choice of this function depends on the domain of the problem. A good choice is essential to provide good results. Once the individuals are selected, it is time to apply the genetic operators. These are: Reproduction – an individual is replicated to the next generation, with no modification in its structure; Crossover – two programs are recombined to generate two offspring and Mutation – a new sub-tree replaces a randomly selected part of a program. This process is repeated until a satisfactory solution or a stop criterion is reached. Instead of a population of beings, GP works with a population of computer programs. The goal of the GP algorithm is to select, through recombination of “genes”, the program that better solves a given problem. The main elements of GP are:

Program Structure: a tree is the most used structure to represent programs in GP. Each node can be a function or a terminal. A function has to be evaluated considering its parameters while a terminal has its own value. The terminal (T) and function (F) datasets must be provided by the user in accordance to the current problem. For example, if the datasets are: F = {+, −, *, /} and T = {x, 2} are one simple variable arithmetic expression can be generate, such as x * x + 2 or (x

^{2}+2). Figure 1 shows the abstract syntax tree for that expression.Fitness Function and Selection: in nature, individuals are selected based on how well they fit to the environment. The individuals that are able to better solve the problem have better chance to be selected.

Parameters: there are some parameters that will guide the evolutionary process, these parameters will limit and control the search performance. Some of them are: genetic operators rates (crossover rate, mutation rate), population size, selection rate (tournament size), maximum depth of the individual, etc.

In GP the population is composed by individuals that are computational programs (Koza, 1992). The first step of the algorithm is to create randomly the initial population that is the Generation 0. After that, there are two majors tasks processed in a loop with two main steps:

The evaluation of each program by using a fitness function: the GP algorithm receives the set that includes the values that represent the solution for the problem. For example, in a Symbolic Regression problem, it is necessary to provide the set of values of x and f (x) to the GP algorithm. These values are applied to the programs generated with the defined sets of operators and terminals. At the end, the fitness value is obtained.

The new population is created by selecting individuals that have better fitness value and by applying the genetic operators.

Each run of this loop represents a new generation of individuals, that are the new population that will substitute the previous one. This process is repeated until a solution is found or until a maximum number of generations is reached. At the end, the GP algorithm presents the best tree that is able to solve the given problem in the best way. The pseudo code of the GP algorithm is showed in the Figure 2.

## 3. Boosting algorithms

The Boosting algorithm was proposed and developed by Freund and Schapire (1996) for binary problems. According to Schapire and Singer (1997) Boosting is a method to find a highly accurate hypothesis by combining many weaker hypotheses, each of which is only moderately accurate. It manipulates the training examples to generate multiple hypotheses. In each iteration the learning algorithm uses different weights on the training examples and returns a hypothesis *h*
_{
t
}. The weighted error of *h*
_{
t
} is computed and applied to update the weights on the training examples. The result of the change in weights is to place higher weights in training examples that were misclassified by *h*
_{
t
}, and lower weights on examples that were correctly classified. The final classifier is composed by the weighted vote of the individual classifiers. Freund and Schapire (1996) introduced the AdaBoost algorithm commonly referred to as the Discrete Boosting algorithm, Ridgeway (1999) developed for binary classification problems. Freund and Schapire (1997) extended AdaBoost to a multi-class case, which they called AdaBoost.M1 and AdaBoost.M2. In order to solve regression problems, Schapire (2002) extended the AdaBoost.M2 and called it AdaBoost.R. It solves regression problems by reducing them to classification ones. Drucker (1997) developed AdaBoost.R2 algorithm, which is an ad hoc modification of AdaBoost.R. He carried out some experiments with AdaBoost.R2 for regression problems and obtained some promising results (Solomatine and Shrestha, 2004). All these approaches follow the view of boosting as a gradient descent procedure (Breiman, 1997), or as residual-fitting, (Buhlmann and Yu, 2003), (Mason et al. 1999). Instead of being trained on a different sample of the same training set, as in previous boosting algorithms, a regressor is trained on a new training set that has different target values (e.g., the residual error of the sum of the previous regressor) (Assad and Bone, 2003). Bone et al. (2003) adapted a Boosting algorithm to time series forecasting using neural networks as base learners. Their experiments showed that Boosting actually provides improved results in regression problems. Iba (1999) proposed a version of AdaBoost for GP and regression. In his work the distribution was implemented picking up examples to generate a new learning set for each Boosting round. The probability of an example being picked up is proportional to its weight, and any example can be picked from 0,1 up to several times. According to Paris (2002), this approach makes the implementation easier, but the precision of weights is somewhat lost in the process.

### 3.1. GPBoost

Paris proposed a Boosting method that retains the precision of weights and operates on the whole set of examples for every round of Boosting. Their algorithm, named GPBoost, uses a weighted based fitness function in GP and the generic AdaBoost.R2 algorithm to update the weights its pseudo code is showed in the figure 3. The GP technique allows us to obtain accurate models from different datasets with diverse characteristics, and from the obtained model to estimate or predict the behavior of the system along time. On the other side, using Boosting, the results obtained with the merging hypothesis are always better than the results obtained with GP technique.

### 3.2. Boosting Correlation Coefficients (BCC)

After many studies through the Boosting algorithms, it is possible to point out that these algorithms have been sufficiently explored over classification problems. Less emphasis, however, has been given to regression problems. The Boosting algorithm for regression problems is an adaptation of the concepts used in classification. The traditional form of obtaining the weights of each example is based on error or loss function. However, the loss function is one of the possible information that can be used to obtain these weights. Recently, (Souza et al., 2007; 2009) used the BCC algorithm that is Boosting using Correlation Coefficients to time series forecasting and regressions problem. The method is a new approach of the Boosting algorithm for regression and is empirically based. BCC uses the coefficients of correlation for the updating of the weights and it has been observed that this directly influences the minimization of the loss function. The same coefficients can also be used in the final combination of predictors. The correlation coefficient is a statistical measure of the interdependence of two or more random variables. Fundamentally, the value indicates how much of a change in one variable is explained by a change in another. The BCC method is based on this measure and will be described within this section, but firstly, a brief review on the definition of Correlation Coefficients is given.

Definition: The correlation between two samples X and Y, with m observations, is calculated by using the Equation 1.where

The Boosting Correlation Coefficient is a metric function that measures the relation degree between two variables, in this case between the real and the predicted values. The structure of the BCC algorithm is similar to the GPBoost algorithm. First of all, the weight distribution *D*
_{
t
} is initialized in Step 1 and the boosting iterations start (Step 2) by calling each time the GP algorithm. After the GP’s complete execution, the best individual ft in the run is chosen. After, the loss function is computed. To calculate the loss function different forms can be used, such as the exponential showed in Equation 2. Then, the correlation coefficients are calculated (Equation 3) and after the next weights are updated using the Equation 4. Finally, Step 3, the output must be computed as a combination of the different generated hypotheses. To calculate the final hypothesis F, T functions ft will be combined using again the correlation coefficients, see Equation 5.

*i*=

*m*

The intention in use the correlation coefficients is to promote a smoothness in the update of the weights, because it was observed that there is an inherent roughness on it, on the other hand the correlation coefficients were used to combine the obtained hypothesis in only one hypothesis that can be better than each one because of its goodness of estimation.

## 4. Time Series Forecasting

Time Series Forecasting is an important area of the knowledge and there are many applications in the real world. Accurate forecasting is an essential element for many management decisions. There are several methods and techniques to find a good model that can be used to produce accurate forecasting the traditional techniques have your foundations in statistics. The most important model statistical methodology is the Autoregressive and Moving Average (ARMA) models. These methods presents some obstacles and complexities to overcome. The major difficulty is to select the good model that can best adjustment for a specific dataset, usually many attempts must be performed until the best model must be found. Because of these difficulties, many researchers have been done several efforts to overcome these problems, such as Artificial Neural Network (ANN), Evolutionary Computation (EC) and in special Genetic Programming (GP) that have been provided good results in time series forecasting.

## 5. Experiments using real time series

This section will describe the experiments that have been accomplished using the BCC algorithm with GP as a base learner and the results are compared with other techniques such as Box & Jenkins; traditional GP and GPBoost. In the first experiment the algorithms were used in a group of academic series, in the second one (Section 6) we used a widespread Monte Carlo simulation covering the entire ARMA spectrum. First of all, we will describe the data sets that were used in the experiments, second the configuration of the methodology that were used and after the results are presented and the evaluate of the algorithms is done.

### 5.1. Academic and Benchmark Time Series

The academic series used in this experiment can be found at the website: (www.ime.usp.br/~pam/ST.html) and the Benchmark series at (www.economatica.com). A brief explanation about the series is presented at Table 1. Each data set was divided into two other data sets: training set that contains the 90% first values from the data set that were used to train the methods and the second one that contains the remaining values.

### 5.2. Box and Jenkins methodology- ARMA models

The Box & Jenkins methodology is one of the most important and recognized work in Time Series area. The research was made by George Box and Gwiliyn Jenkins (1970) and it is based on the Wold (1954) studies, who proved that any time series can be represented by a structure of infinity moving average. The methodology adjust Autoregressive and Moving Average Models ARMA(p, q) to the data set, where p is the parameter of the Autoregressive and q is the parameter of the Moving Average models and they represent the order of the model to be used. The Box and Jenkins methodology is composed by four steps.

*Step 1.*Identification of the model to be used on the dataset, the best model is selected by using Akaike (1974) Information Criterion (AIC), see Equation 6, where*k*is the number of parameters in the statistical model, and L is the maximized value of the likelihood function for the estimated model.

### 5.3. Configuration of GP, GPBoost and BCC

To apply these algorithms, we chose the tool Lil-GP 1.0 (Zongker, 1995) which is a free and easily configurable software, implemented according to Koza’s GP (1992). To solve the problems it is necessary to provide the configuration files, standard GP parameters, functions and variables used to generate the models, input and output files (training and test set), as well as to specify the fitness function. The parameters used in this work to configure GP tool are showed at Table 3. These parameters have been gotten empirically after many tests have been accomplished. The terminal set used was T = { Z_{t−1}, Z_{t−2}, Z_{t−3}, Z_{t−4}} that is, the last four observations are used to make a prediction at the time t. Beside these a constant is also used and the functions set is F = {+, −, *, /, log, cos, sin, exp, root}. This Function set allows us to obtain non linear models that have better adjust to the complex series than linear models. The fitness function used was defined as the Weighted Root Mean Square Error (WRMSE) that is, in general, used to measure the accuracy of a forecasting method. The WRMSE is showed in Equation 7, where x_{i} is the real value,
*D*
_{
i
} is the weight of each example and *n* is the size of the dataset. In this experiment, individuals with WRMSE equal to 0 or near to 0 are the best.

The Boosting algorithm with the different output hypotheses has been implemented in C computer language. For each data set, ten models of each algorithm (GP, GP-Boost and BCC) were obtained using a different random initial seed and for the GPBoost and BCC algorithm it was used ten Boosting iterations. After that, each generated model was used to forecast the values in the test set.

### 5.4. Results

In order to evaluate the performance of these methods, we used the average of the Mean Square Error (MSE) obtained by using 10 initial seeds over the test set. This measure was used as a parameter of comparison because it is accepted by the statistical community. For the ARMA model, there is only one prediction and then the value of n is one. The results of this experiment are summarized in Table 4. The BCC algorithm presented the best performance, in almost all the data set the MSE average was the lowest.

Beside the MSE, it was applied a non parametric test, Friedman Test (Siegal, 1988) and (Demsar, 2006), to estimate the BCC performance against the other methods. The null hypothesis that all the algorithms are equivalent was tested and rejected, after that the post-hoc multiple comparisons were performed. The results are summarized in Table 5. The algorithms ARMA, GP, GP-Boost and BCC and the result of the test shows that there is a significant difference between ARMA and BCC and between GPBoost and BCC all the other algorithms have no significant difference.

## 6. Experiment using Monte Carlo simulation

A widespread Monte Carlo simulation has been done to exhaustively evaluate the performance of the BCC algorithm with GP as a base learner, in which we simulated artificial time series that belong to the entire spectrum of the structures AR(1), MA(1), AR(2), MA(2) and ARMA(1,1). To generate these series we used the free statistical software R). The parameters were put through variations within their respective parametric spaces and a noise component was added. The noise has normal distribution with mean 0 and standard deviation 1.

### 6.1. Parameters setting

In order to setting the parameters that were used in this simulation, it was considered the stationary region in the parametric space of the principal structure of the ARMA models: AR(1), AR(2), MA(1), MA(2), ARMA(1, 1). The stationary region of the AR(1) and MA(1) structure are represented by the Equation 8. Where _{}is the autoregressive parameter and _{} is the moving average parameter, both of them are defined from -1 to 1. This space was divided using step 0.1

The parametric space of the stationary region of the AR(2) structure is defined by the Equation 9. Where _{}and_{}are the autoregressive parameters, for the MA(2) structure there is no restriction of the stationary (Morettin and Toloi, 2004), but its invertibility region is the same of the stationary region of the AR(2) structure. This region was divided using step 0.2 on the x and y axes.

Finally, the parametric space of the stationary region of the ARMA(1, 1) structure is defined by the Equation 10. Where _{}is the autoregressive parameter and _{} is the moving average parameter.

Using this criterion, the data set included 214.000 series distributed for each structure as showed in the Table 6. For each set of parameter, 500 series were created, for example, the AR(1) structure has 19 parameters, then the number of series is 500 × 19 = 9,500. Each data set was divided in training and test set in the same way that was made for the real time series. The number of the examples of each data set was 150, then the training set contain 135 examples and the test set 15.

### 6.2. Evaluation metrics

Despite using MSE as a comparison measure, this is not enough to guarantee that the algorithm is better than another one. To analyze more precisely the performance of the proposed algorithm against other, the Friedman test was performed. The null hypothesis that states that all the algorithms are equivalent was rejected and then, the post-hoc multiple comparisons were performed. The analysis was made for each structure considering the MSE results of each set of parameters. For example, the MA(2) structure has two hundred different sets of parameters, for each of them, one MSE result is obtained for each algorithm. In this simulation, only one seed was used to GP due to the largest of the data set. The training set contains the first 90% of the values from the data set and was used to train the methods. The testing set contains the remaining values and was used to evaluate the methods.

### 6.3. Computational implementation

For each series from the data set (214,000) it was run 10 Boosting algorithm (GPBoost and BCC). The data sets have been run in a computational platform that includes 42 dual processor computers. The computer language used to implement the algorithms GPBoost, BCC and GP was C++ that used the Lil-GP 1.0 to run the GP as a sub routine. The data set was divided into 428 groups each one containing 500 series. Each group ran 10 Boosting algorithms including GPBoost and BCC. The computational time to run each group was 32 hours.

### 6.4. Results

The results were analyzed through the MSE in the test set, the MSE was calculated in accordance with Equation 11. Where serie(x_{i}) are the real values, ARMA(x_{i}) are the predicted values from ARMA models, GPBoost(x_{i}) are the predicted values from GPBoost and BCC(x_{i}) are the predicted values from BCC algorithm.

(11) |

In Table 7 is presented the MSE in the forecasted values for all the algorithms. Table 8 shows the results of the MSE for all the structure, due to space reasons only the sum of the MSE is presented. Table 9 shows the pos-hoc multiple comparisons results. In this Table, the symbol “FALSE” is used to denote that there is no statistical difference between the methods ARMA, GP, GP-Boost and BCC. From these Tables it is possible to conclude that:

For AR(1) structure, the BCC is significantly better than the other algorithms; GP and GPBoost algorithms have no difference, as well as, ARMA and GP algorithms.

For AR(2) structure, the BCC is significantly better than ARMA and GP algorithms, and equal to GPBoost; ARMA and GP have no difference.

For MA(1) structure, the BCC is significantly better than the other algorithms; there is no difference between: ARMA and GP algorithms, ARMA and GP-Boost algorithms, and GP and GP-Boost algorithms.

For MA(2) structure, the BCC is significantly better than the other algorithms; there is no difference between ARMA and GP algorithms; GP-Boost is better than GP and ARMA algorithms.

For ARMA(1,1) structure, the BCC is significantly better than the other algorithms; there is no difference between ARMA and GP algorithms; GP-Boost is better than GP and ARMA algorithms.

Concluding, in almost all the cases the BCC method is the best and when the method is not the best, there is no statistical differences between the methods. GP-Boost is significantly better than GP and ARMA algorithms in many cases, but GP and ARMA algorithms have no difference.

## 7. Regression problems

In this section we describe an experiment applying the BCC algorithm for Multivariate Regressions Problems. The goal is to evaluate the proposed method in this kind of problem. In order to compare with other techniques, we used the same datasets and evaluation measures described in Souza and Vergilio (2006). In their work, they presented the results obtained with the application of Neural Networks (multi-layer perceptron, ANN), M5 Model Tree (MT), Adaboost.R2 (R2) and Adaboost.RT (RT), here replicated for comparison purposes with our approach.

### 7.1. Data sets

The dataset´s features are described in Table 10, these datasets are used as a benchmark for comparing results with the previous research. The dataset was divided into training (70%) and test (30%). The subsets were obtained using a random selection without replacement strategy. The procedure was repeated for ten times with different seeds in order to obtain ten statistically different subsets of data. These ten subsets were prepared to obtain consistent results. The BCC algorithm was implemented using the software R and as base learner the Cart (Classification and Regression Tree) algorithm was used. The reason to select this learning technique is that it is simple, easy and fast to train.

### 7.2. Results

The performances of the different algorithms are presented in the Table 11. By analyzing the results we can conclude that there is no technique that will always present the best results. For some datasets the best model was obtained by using one approach, for other datasets the best model was obtained with another one. In order to analyze more precisely the algorithm’s relative performance some quantitative measurement is needed rather than just subjective comparison. For this reason, following the methodology used by Souza and Vergilio (2006), we used the so-called scoring matrix. It shows the average relative performance (in %) of one technique over another technique for all the data sets considered. Element of scoring matrix SMij should be read as the i^{th} machine’s average performance (header row in Table 12) against machine j (header column in Table 12) and is calculated for a N number of datasets as is showed in the Equation 12 (for i ≠ j ). From Table 12 it can be clearly observed that BCC scores highest.

## 8. Conclusion

In this paper we present the BCC algorithm that uses the correlation coefficients between the real and the forecasting value obtained using GP as a base learner. The correlation coefficient is used for update the weights and for the generation of the final formula. Differently from works found in the literature, in this paper we investigate the use of the correlation metrics as a factor, besides the error metric. This new approach, called Boosting using Correlation Coefficients (BCC), has been empirically obtained when trying to improve the results from the other methods. The correlation coefficient was considered because two algorithms could present the same mean error and different correlation coefficients for a dataset. This difference on behavior of the two algorithms can be measured by the correlation coefficient. A good correlation coefficient results in small errors for each example. The correlation coefficient is used in the proposed algorithm with two purposes: the first use of the correlation coefficient is for the update of the weights, here the intent is to promote a smoothness of the update, because it has been observed that the original equation has an inherent roughness. The second use is to combine the final hypothesis, and in this case the intent is to allow that each hypothesis contributes to the final hypothesis according to its goodness of estimation. This idea was evaluated through two groups of experiments. In the first group of experiments, we explore the BCC for time series forecasting, using Genetic Programming (GP) as a base learner. Differently from works found in the literature a great number of series were used considering academic series and a widespread Monte Carlo simulation. In the Monte Carlo Simulation, series were generated

in the entire parametric space for the main ARMA structures: AR(1), AR(2), MA(1) MA(2) and ARMA(1,1). From all these experiments we can conclude that in almost all cases the BCC method is the best and when the method is not the best, there is no statistical difference between the compared methods. The goal of the second group of experiments was to evaluate the proposed method in multivariate regression problems. We have compared the BCC algorithm with the results reported by other authors, comparing a M5 model tree (MT),

bagging, AdaBoost.R2, AdaBoost.RT and Artificial Neural Networks (ANN). For our work, a model tree was chosen as the base learner. We use a scoring matrix to analyze the algorithms’ relative performance. It shows the average relative performance (in%) of one technique over another technique. Like in time series forecasting it can be clearly observed that BCC scores the highest. We can conclude that the proposed algorithm is very advantageous for regression problems. The BCC algorithm was evaluated using two different base learners and it always showed good results. These results encourage us to carry out future experiments to explore other base learners. We intend to better evaluate the proposed approach and to explore meta-learning to select the best algorithm according to the characteristics of the datasets. The meta-learning approach will help us to better understand the problems that better fit to one algorithm or another. Other future works are the investigation of other application domains like Software Reliability. Also, an interesting idea is to extend the work for classification problems, however, the correlation coefficient

can be used only for continuous variables, and then, other metrics must be considered.