Comparison Between Equations Obtained by Means of Multiple Linear Regression and Genetic Programming to Approach Measured Climatic Data in a River

© 2012 Arganis et al., licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Comparison Between Equations Obtained by Means of Multiple Linear Regression and Genetic Programming to Approach Measured Climatic Data in a River


Introduction
The Ebro River is located in north-eastern Spain. After crossing the Catalan coastal mountain system, the Ebro reaches the sea. Along the lower part of the river, about 100 km from the mouth, there is a system of three reservoirs: Mequinenza (1500 hm 3 ), Ribarroja (210 hm 3 ) and Flix (11 hm 3 ). These reservoirs regulate the hydrologic regime of the lower part of the river until it reaches the sea. The Mequinenza and Ribarroja reservoirs were finished in the late 1960s (in 1966 and 1969, respectively), while the Flix reservoir was completed in 1945. About 5 km downstream from the Flix reservoir is the Ascó nuclear power plant, which began its activity in December 1984 [1].
Ascó Nuclear Power Station, located on the Ebro River in Spain (Figure 1), takes river water for cooling purposes. The temperature of discharged water must be less than 13 ºC, however five kilometers downstream a water temperature of nearly 14ºC was estimated and such an anomaly was reported to the nuclear center. A detailed analysis shows the relationship between water temperature variation and the presence of a cascade dam system upstream of the Ascó Nuclear Power Station. Water temperature decreases downstream in the outlets of cascade dam systems [1]. During the winter period there also exists thermal stratification within the river, whereby water temperatures near deep intake areas are considerably less than the ambient temperature. Such a situation impacts water taken for cooling purposes by Ascó Nuclear Power Station.
Throughout the years, the human being has made use of fluvial ecosystems. Some actions have caused changes in the thermal regimes of rivers (eg. [2 ,3]).
Reservoirs and the use of water for cooling are the most important sources of water temperature modifications caused by humans. The use of water for cooling, usually by power plants, causes the water to become warmer [4]. This is often called "thermal pollution".
Reservoirs can cause various effects, depending on various factors such as the climate, the size of the impoundment, the residence time, the stability of the thermal stratification and the depth of the outlet [5]. Due to thermal stratification occurs, the water from deep-release reservoirs is cooler in the summer and warmer in the winter than it would be without the reservoir [6,7]. Water diversions can also alter water temperature regimes because they reduce discharge, which causes water temperature range to increase throughout the year [8]. Irrigation is also known to decrease discharge and increase water temperature [9].
In order to preserve the ecological balance it is very important to have a continuous inspection of water quality in that portion of the river. Freshwater organisms are mostly ectotherms and are therefore largely influenced by water temperature. Some of the expected consequences of a water temperature increase are life-cycle changes [4,10], and shifts in the distribution of species with the arrival of allochthonous species [11,12] and the expansion of epidemic diseases [13] as a possible result. Also, aquatic flora and fauna depend on dissolved oxygen to survive and this water quality parameter is a function of water temperature as well.
Water temperature variation analysis, in a river with a cascade dam, involves several hydrological and environmental aspects because of the dams impact on aquatic flora and fauna as shown by [14,15,16,1,17,18,19].
Because temperature is a water quality parameter that affects aquatic flora and fauna, it is important to have mathematical models which allow one to make estimations of water temperature behavior. These models are based on climatic data such as solar radiation, net radiation, relative humidity, air temperature, and wind speed. Accurate water temperature modeling may help diminish the environmental impact of increased water temperature on aquatic flora and fauna within the river.
Genetic programming (GP) algorithms have been used to derive equations which estimate the ten minute average water temperature from known variables such as relative humidity, air temperature, wind speed, solar radiation, and net radiation [20]. Only air temperature and relative humidity were associated with water temperature in some of the resulting equations, even though solar radiation is known to increase water temperature in rivers and ponds.
A correlation analysis could prove the implicit participation of solar radiation as a variable in air temperature, even though an explicit solar radiation term does not appears in the equation. Solar radiation was assumed to be independent with respect to water temperature resulting from neglecting the lag time between a change in the solar radiation value and the corresponding change in water temperature, [1] estimated this lag time to be nearly 160 minutes. By inputting data to both the genetic programming algorithm and multiple linear regression (MLR) in this study, it was possible to identify the relative significance of each climatic variable in estimating water temperature.
Tests were made from data collected at the Ribarroja Station, which is located on the Ebro River in Spain (Figure 1).

Genetic programming
Evolutionary Computation (EC) are learning, search and optimization algorithms based on the theories of natural evolution and genetic. The steps of the basic structure of this kind of algorithms are shown in Figure. First, an initial population of potential solutions is randomly created (in the case of a Simple Genetic Algorithm (SGA), the initial population is composed of binary individuals). Then, the individuals of this population are evaluated considering the problem to be solved (environment) where a fitness value is assigned to each individual depending on how close individuals are to the optimum. A new generation is created by selecting the fitter solutions of previous generation and then, genetic operators such as crossover and mutation (Alter P(t) of Figure 2) are applied to selected individuals in order to create a new population (offsprings) which improve their fitness values in comparison to previous generation. This new population is evaluated and selection, crossover and mutation are again applied. This process continues until a termination criterion is reached (this is commonly established as the maximum number of generation).
Genetic Programming (GP) is a class of Evolutionary Algorithm (EA) [ 21,22,23] where individuals in the population are computer programs, usually expressed as syntax trees or as corresponding expressions in prefix notation (see Figure 3).  . Genetic programming representation: syntax tree, LISP or prefix notation, mathematical function and MATLAB program As seen from Figure 3, individuals are created based on a function and terminal set according to the problem to be solved. A root node is generally a function selected randomly from the function set. Then, functions and terminals are chosen in order to form the syntax tree that represents an individual. It is important to set a maximum depth or maximum number of nodes, thus the size of the individuals can be control and avoid bloating. Bloat is the rapid growth of programs produced by genetic programming or variable coding heuristics.
The fitness value of the population is usually calculated by running each individual with the problem input data, or testing data, and see how close the output of the program (individual) is to some desired (reference) output specified by the user.
Each generation, fitter individuals are evolved by means of crossover and mutation. Crossover is a sexual genetic operator that takes two parent-individuals, randomly selects a node in each parent and exchanges the associated sub-branch starting from the selected node between the parents producing two new individuals. Due to GP uses variables individuals representation, the selected nodes for crossing over two individuals are different in each parent. Note that if the parents to crossover are identical, the new two offsprings are generally different to the parents because the node selected for crossing over is different in each paren. In contrast to Genetic Algorithms, when two identical parents are crossing over, the offsprings are similar to their parents because the crossing point is the same for both parents and they have the same length.
Mutation is a asexual genetic operator that takes an individual, randomly selects a node and replaces the associated branch for a new branch generated based on the primitive set (functions and terminals sets).
The application of evolutionary computing algorithms has expanded in the last few years to several engineering applications, particularly in regards to hydraulics and hydrological engineering. Examples include: studies of hydroinformatics by [24,25]; studies in rainfall runoff modeling by [26][27][28][29][30][31] . The unit hydrograph for a typical urban basin was obtained by means of genetic programming in [32].
A study of Chezy's roughness coefficient by [33], who also uses an evolutionary polynomial regression in [34,35]. A deep percolation model using genetic programming was obtained by [36]. Models related to sediments were obtained with genetic programming by [37].
Evapotranspiration phenomena has been predicted by means of genetic programming [38]. The flood routing problem was analyzed by means of genetic programming by [39] and the soil moisture too [40].
In this work, a genetic programming algorithm operating in the MATLAB environment [41] developed at the Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas (IIMAS), Universidad Nacional Autónoma de México (UNAM) was applied and compared with a traditional curve adjustment technique, in an attempt to get another useful application of these optimization procedures. Here, a stochastic universal selection method was used [42] (Baker, 1987); crossover operator was used with a probability of 90% (see Table 1). It is important to mention that two different mutation operators were used. The first one with a probability of 5% randomly selects a branch and then it exchanges this selected branch by a new generated one. The second mutation operator works by selecting constant values and with a probability of 5%, these constants are mutated by adding a random value of a defined range.
This climatic data modeling problem is expressed as a symbolic regression, a common application of genetic programming, where function set consists of arithmetic and trigonometric functions and terminals set consists of climatological variables which are described in next section.

Input data
Water temperature (Tw), solar radiation (rs), net radiation (rn), relative humidity (hr), air temperature (Ta), and wind speed (Vv) data measured at the Ribarroja Station from January to June of 1998 were utilized in this study. The ten minute water temperature average was calculated using all of these variables. Later, the averaged air temperature and relative humidity (in decimals) were filtered to take into account a seven day relay. Data filtering was done with the following equation: Where : Vi is the original independent variable t f Vi is the filtered independent variable and k is the size or widow filter (in this case k=6).
Recorded solar radiation at minute ti has its influence on water temperature at instant ti+160 [1] and such a gap needs to be taken into account for all considered data. For example, the first data point of the dependent variable, ten minute average water temperature at instant ti+160, was coupled with the first data point of the independent variable, such as solar radiation at instant ti. For the independent variables, net radiation (rn) and wind speed (vw) values of ti+160 were used, while air temperature and relative humidity values were considered using both seven day filtering and values corresponding to instant ti+160 .

Objective function
The objective function was to minimize the mean square error between the calculated and measured data using the following equation: Where: Z is the function to minimize Tw is the average of measured temperature each ten minute interval in ºC T w is the calculated temperature with the genetic programming algorithm in ºC, and n is the data number.

Parameter setting
Parameters used in the genetic programming algorithm are shown in Table 1 The function cosine (cos) was included in the function set due to preliminary tests, where a reduction in mean quadratic error was obtained, included this cosine function. This fact is related to one of the two properties that GP individuals must satisfy: sufficiency. This property says that the set of terminals and the set of functions should be defined in order to express a solution to the study problem [23]. The second property, closure, specifies that each of the functions in the function set can be able to accept, as its argument, any value and data type that may possibly be returned by any function and any value or data type that can be possibly assumed by any terminal [23]. In this approach, a protected division was implemented in order to avoid a division by zero. In this situation occurs, a high value is returned.
By including the cosine function, associated equation also presented a good reproduction of the periodic behavior of water temperature over time.

Multiple linear regressions
Coefficients a1,a2,a3,…an, are weighting factors which allow one to see the relative importance of each variable xi as y is approached. Indirectly the coefficients can indicate if there is a strong correlation or lack of correlation between xi and y.
This method is often applied for several hydrology problems such as: forecasting equations for standardized runoff in a region of a country with standardized teleconnection indices, when El Niño or La Niña phenomenon occur [43] (González et al., 2000), or as an auxiliary method in estimating intensity-duration-frequency curves. In this research, regressions were made using the Microsoft Excel data analysis tool.

Results and discussion
Measured climatic data of the above variables, corresponding from January to June of 1998, were fed into both the symbolic regression genetic programming model and the multiple linear regression model in order to estimate water temperature. The models were then applied using data from January to June of 1999 in order to approach water temperature averages. Comparisons for the1998 and 1999 results were then made.
The genetic programming algorithm (equation 4) determined the next mathematical model which approaches the water temperature (average of each ten minutes). Where: Tw corresponds to the average water temperature each ten minute interval at instant t+160 in ºC Ta is the average air temperature each ten minute interval, with seven days filtering, corresponding to instant t+160, in ºC hr represents the average relative humidity each ten minutes interval, with seven days filtering, corresponding to instant t+160 in decimals rs is the average solar radiation each ten minutes interval, at instant t, in W/m 2 rn corresponds to the average net radiation each ten minutes interval, corresponding to instant t+160, in W/m 2 and finally, vv represents the average wind speed each ten minutes interval, corresponding to instant t+160, in m/s.
The objective function value using equation 5 was 0.8724. Figure 4 represents both measured and calculated water temperature variation versus time using both equations (4) and (5). Measured and calculated water temperature values also appear in Figure 5 with equations (4) and (5) in comparison with the identity function. In Figures 5 the measured data were compared against the identity function and the best correlation between these values was found using genetic programming (r=0.9697). Water temperature approach with multiple linear regression and genetic programming algorithm from January to June 1999 Equations (4) and (5) were applied to measured data from 1999 at the Ribarroja Station in order to arrive at the average water temperature. Measured water temperature data and the obtained residuals using both models are shown in Figure 6.
According to Figure 6 the differences between measured and calculated water temperature shown were up to 5.5 °C (underestimation) and about 0.5 °C (overestimation) while differences with equation 5 reported an underestimation near to 4.5°C and the overestimation of almost 2°C, so the range of variation in water temperature reported by both equations is almost the same.
In order to get better results in future works must be analyzed the data standardization as a preprocessing to get new mathematical linear and nonlinear models [44], The variables could be standardized by subtracting the mean and dividing by the standard deviation: where: Z standardized variable, dimensionless Tw variable before standardization, with physical dimensions Another possibility to analyze is the splitting of the considered function by taking into account the different times of year that causes a variation in water temperature behavior.

Conclusions
Water temperature adjustment curves, in a gauged station on the Ebro River in Spain, were obtained by means of two procedures: a genetic programming algorithm (equation 4) and a multiple linear regression (equation 5), using data from 1998. The multiple linear regression method yielded a function containing the five considered variables (solar radiation, net radiation, wind speed, air temperature and relative humidity) with each variable weighted. The genetic programming algorithm yielded a function where water temperature was obtained only as a function of air temperature and relative humidity. The others variables were eliminated by the evolution algorithm due to the lack of correlation between water temperature and the remaining variables although solar radiation is implied inside the air temperature term.

Residuals
Measured Residuals MLR eq(5) Residuals GP eq(4) Residuals eq(4) mean =1.04, standard deviation=1.08 eq(5) mean = 0.43, standard deviation=1.09 Comparing measured data with calculated data, for the year 1998, led to only minor errors in estimating the average water temperature using the genetic programming algorithm. When equations (4) and (5) were applied to another year, 1999, minor mean quadratic error in estimating water temperature was obtained using the multiple linear regression equation (5). The mean quadratic error associated with the multiple linear regression equation (5) for 1999 was 1.375 ºC; whereas with the genetic programming equation (4) was 2.248 ºC. This error can be considered acceptable if one takes in account the average temperature from January to June 1998 was 12.54 ºC, whereas the average temperature in 1999 for the same period was 11.62 ºC. The residuals obtained with equations (4) and (5) using data for the year 1999 had average values of 1.04 ºC and 0.43 ºC, respectively and with this criteria, multiple linear regression model can be considered better than the GP. However, reviewing the standard deviations, both models had almost the same value (1.09 ºC and 1.08 ºC, respectively).
The described procedures are then useful because equations similar to (4) or (5) can estimate important water quality characteristics, such as water temperature, using previously measured climatic data, predicted climatic data, and hydrological parameters for a given time period.
Engineer's criteria and common sense must be considered before to apply any model to simulate physical variables.
Some standardization procedures to the involved data are suggested in order to improve the results from new models that can be obtained.
The methods here applied are undoubtedly useful in several areas of knowledge, and can led us to new approaches to physical phenomena by considering measured field data.
Future work is focuses on the use of NARMAX (Non-linear Autorregressive Moving Average with eXogenous inputs) model combined with genetic programming in order to model the water temperature providing more accurate equations.