Nonlinear Evapotranspiration Modeling Using Artificial Neural Networks

Reference evapotranspiration (ETo) is an important component of the hydrologic cycle and one of the most difficult to quantify accurately. Estimating or measuring ETo is not simple because a number of climatic parameters affect the process. Numerous conventional (direct and indirect) and non-conventional, soft computing (artificial neural network, ANN) methods exist for estimating ETo. Direct methods are limited by measurement errors, high cost, and the impracticality of acquiring point measurements for spatially variable locations, whereas indirect methods are limited by the unavailability of all necessary climate data and by a lack of generalizability (they need local calibration). In contrast to conventional methods, soft computing models can estimate ETo accurately with minimal climate data, which overcomes several limitations of the conventional ETo methods. This chapter reviews the application of ANN methods for estimating ETo at 15 locations in India using six climatic variables as inputs. The performance of the ANN models was compared with that of multiple linear regression (MLR) models in terms of root mean squared error, coefficient of determination, and the ratio of average output to average target ETo values. The results suggest that the ANN models performed better than the MLR models for all locations.
Keywords: backpropagation, Broyden-Fletcher-Goldfarb-Shanno


Introduction
Evapotranspiration (ET) is the combined process of evaporation and transpiration losses. Almost 62% of the precipitation that falls on the continents is returned to the atmosphere through the ET process [1]. ET plays a significant role in the hydrological cycle, and its estimation is very important in various fields of water resources. A common procedure for estimating actual crop evapotranspiration (ETcrop) is to first estimate reference evapotranspiration (ETo) and then apply an appropriate crop coefficient (kc). ETo is an important component of the hydrologic cycle and one of the most difficult to quantify accurately. ETo is defined for a hypothetical crop of uniform height (12 cm) that is actively growing (crop resistance of 70 s m⁻¹), completely shades the ground (albedo of 0.23) and has an unlimited supply of water [2]. The Food and Agriculture Organization (FAO) considers the above definition as the standard and sole method for estimating ETo if sufficient climatic data are available [3,4].
Estimation of ETo is complex because of the influence of various climatic variables (maximum and minimum air temperature, wind speed, solar radiation, and maximum and minimum relative humidity) and the nonlinear relationship between the climatic data and ETo. Although users have a number of methods for measuring or estimating ETo directly or indirectly, most of them have limitations regarding data availability or regional applicability. In addition, to use these methods, users are required to make reasonable estimates for some of the parameters in the employed ETo models, which involves uncertainties and might not result in reliable ETo estimates [5]. Further, it is difficult to develop accurately representative physically based models for complex nonlinear hydrological processes such as ETo, because the physical relationships involved in the system can be too complicated to be represented accurately in a physically based manner.
The above limitations create a need for techniques that can accurately estimate ETo values with minimal input data and that are easy to apply without knowledge of the physical processes inside the system. Artificial neural network (ANN) techniques, which can provide a model to predict and investigate a process without a complete understanding of it, can be a useful tool for this purpose. These techniques are also of interest because of their knowledge-discovery property. In contrast to conventional methods, ANNs can estimate ETo accurately with minimal climate data, and they may have the advantages of being inexpensive, independent of specific climatic conditions, free of the need to specify physical relations, and capable of precise modeling of nonlinear complex systems. In the last decade, many researchers have used ANN techniques for modeling the ETo process [6][7][8][9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25].

Review of literature
This section discusses some of the significant contributions made by various researchers in applying different ANN techniques to model ETo or pan evaporation (Ep). A radial basis function (RBF) neural network was developed in the C language to estimate daily soil water evaporation [26]. The input layer of the network consisted of average relative air humidity, air temperature, wind speed (Ws) and soil water content. The results of the RBF networks were compared with those of multiple linear regression (MLR) techniques. A feed-forward back-propagation (FFBP)-based ANN model was developed to estimate daily Ep from measured weather variables [27]. Different input combinations were used to model Ep, and the developed ANN models were compared with the Priestley-Taylor and MLR models. An RBF neural network model was developed to estimate the FAO Blaney-Criddle b factor [28]. The input layer of the RBF model consisted of minimum daily relative humidity (RHmin), daytime Ws and the mean ratio of actual to possible sunshine hours (n/N). The b values estimated by the RBF models were compared with the corresponding b values produced using regression equations. FFBP ANN models were implemented for the estimation of daily ETo using six basic climatic parameters as inputs [16]. The ANNs were trained using three learning methods (with different learning rates and momentum coefficients), different numbers of processing elements in the hidden layers, and different numbers of hidden layers. The results of the developed ANN models were compared with the Penman-Monteith (PM) method and with lysimeter-measured ETo. ANN-based back-propagation models for estimating Class A Ep with minimal climate data (four input combinations) were developed and compared with existing conventional methods [22].
The potential of RBF neural networks for estimating daily rice crop ET using limited climatic data was demonstrated in [23]. Six RBF networks, each using a different input combination of climatic variables, were trained and tested. The model estimates were compared with lysimeter-measured ET. A sequentially adaptive RBF network was applied to forecast monthly ETo [29]. Sequential adaptation of the parameters and structure was achieved using an extended Kalman filter. The criterion for network growing was obtained from the Kalman filter's consistency test, and the criteria for pruning neurons and connections were based on a statistical parameter significance test. The results showed that the developed network learned to forecast ETo,t+1 (current or next month) from ETo,t-11 (at a lag of 12 months) and ETo,t-23 (at a lag of 24 months) with high reliability. Another study examined whether a reliable estimate of ETo can be obtained from temperature data alone [24]. An RBF neural network was developed for estimating ETo and compared with temperature-based empirical models.
ANN-based daily ETo models were trained on the basis of different categories of conventional ETo estimation methods: temperature based (FAO-24 Blaney-Criddle), radiation based (the FAO-24 Radiation method for arid regions and the Turc method for humid regions) and combinations of these (FAO-56 PM) [14]. The Hargreaves and ANN methods for estimating monthly ETo from temperature data alone were compared in [19]. The ANN models were developed with data from six stations and tested with data from the remaining six stations, which were not used in model development. The capability of ANNs for converting Ep to ETo was studied using temperature data [18]. The conventional method, which uses a pan coefficient (Kp) as a factor to convert Ep to ETo, was considered for this comparison. Generalized ANN (GANN)-based ETo models corresponding to the FAO-56 PM, FAO-24 Radiation, Turc and FAO-24 Blaney-Criddle methods were developed in [15]. These models were trained using pooled data from four California Irrigation Management Information System (CIMIS) stations with FAO-56 PM computed values as targets. The developed GANN models were tested at different stations that were not used in training. Multilayer perceptron (MLP) neural networks for estimating daily Ep from maximum and minimum air temperature and extraterrestrial radiation were developed in [20]. The potential of ANNs to estimate ETo from air temperature data was examined in [21]. That study also compared the estimates provided by the ANNs and by the Hargreaves equation, using the FAO-56 PM model as the reference model.

Study area and data collected
For the purpose of this study, 15 meteorological stations in India were chosen. Figure 1 shows the geographical locations of the selected stations and their agro-ecological regions (AERs). Daily meteorological data from 2001 to 2005 are available at these stations for the following variables: minimum air temperature (Tmin), maximum air temperature (Tmax), minimum relative humidity (RHmin), maximum relative humidity (RHmax), mean wind speed (Ws), and solar radiation (Sra). Table 1 lists the 15 climatic stations along with their altitude and the duration of available data. The study area is bounded by the longitudes 68°7′ and 97°25′ E and the latitudes 8°4′ and 37°6′ N. The annual potential evapotranspiration of India is 1771 mm. It varies from a minimum of 1239 mm in Jammu and Kashmir to a maximum of 2100 mm in Gujarat [30].

Theoretical consideration
The concept of neural networks was introduced in [31]. The neural network approach, also referred to as 'connectionism' or 'parallel distributed processing', adopts a 'brain metaphor' of information processing. Information processing in a neural network occurs through interactions among a large number of simulated neurons. A neural network (NN) is a simplified model of the human nervous system, consisting of interconnected neurons in a parallel distributed system that can learn and memorize information. In an NN, the inter-neuron connection strengths, known as 'synaptic weights', are used to store the acquired knowledge [32]. In other words, an ANN discovers the relationship between a set of inputs and desired outputs without being given any information about the actual processes involved; it is in essence based on pattern recognition. How the inter-neuron connections are arranged determines the topology of a network. A biological neuron is the fundamental unit of the human nervous system; it receives and combines signals from other neurons through input paths called 'dendrites'. Each signal arriving at a neuron along a dendrite passes through a junction called a 'synapse', which is filled with neurotransmitter fluid that produces electrical signals that reach the soma, or cell body, where the signals are processed [16]. If the combined input signal after processing is stronger than a threshold value, the neuron activates and produces an output signal, which is transferred through the axon to other neurons. Similarly, an ANN consists of a large number of simple processing units called neurons (or nodes) linked by weighted connections. A comprehensive description of neural networks was presented in a series of papers [33][34][35], which provide valuable information for researchers.

Model of a neuron
The main function of an artificial neuron is to generate an output by applying a nonlinear activation function to the weighted sum of all its inputs. Figure 2 illustrates a nonlinear model of a neuron, which forms the basis for designing an ANN. The input layer neurons receive the input signals ($x_i$), and these signals are passed to the cell body through the synapses. Each synapse, or connecting link, is characterized by its own weight or strength. A signal at the input of synapse i connected to neuron k is multiplied by the synaptic weight $w_{ki}$. The input signals, weighted by the respective synapses of the neuron, are added by a linear combiner. An activation function, or squashing function, limits the permissible amplitude range of the output of a neuron to some finite value. An external bias ($b_k$) increases or decreases the net input of the activation function, depending on whether it is positive or negative, respectively. In mathematical form, a neuron k may be described by the following equations:
$$u_k = \sum_{i=1}^{n} w_{ki} x_i \tag{1}$$
$$y_k = \varphi(u_k + b_k) \tag{2}$$
where $x_1, x_2, x_3, \ldots, x_n$ = input signals; $w_{k1}, w_{k2}, \ldots, w_{kn}$ = synaptic weights of neuron k; $u_k$ = linear combiner output due to the input signals; $b_k$ = bias; $\varphi(\cdot)$ = activation function; $y_k$ = output signal of neuron k.
Let $v_k$ be the induced local field, or activation potential, which is given as:
$$v_k = u_k + b_k \tag{3}$$
Now, Eqs. (1), (2) and (3) can be written as:
$$v_k = \sum_{i=0}^{n} w_{ki} x_i \tag{4}$$
$$y_k = \varphi(v_k) \tag{5}$$
In Eq. (4), a new synapse with input $x_0 = +1$ is added, and its weight is $w_{k0} = b_k$, to account for the effect of the bias.
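The neuron model above can be illustrated with a minimal Python sketch. The function names, the logistic activation and the numerical values are illustrative assumptions, not taken from the study; the sketch only shows that adding the bias explicitly and folding it in as a weight on a constant input $x_0 = +1$ give the same output.

```python
import math

def logistic(v):
    """Logistic sigmoid, used here only as an example activation."""
    return 1.0 / (1.0 + math.exp(-v))

def neuron_output(inputs, weights, bias, activation):
    """y_k = phi(u_k + b_k), with u_k the weighted sum of the inputs."""
    u_k = sum(w * x for w, x in zip(weights, inputs))  # linear combiner
    v_k = u_k + bias                                   # induced local field
    return activation(v_k)

def neuron_output_augmented(inputs, weights, bias, activation):
    """Same neuron, with the bias treated as weight w_k0 on input x_0 = +1."""
    x_aug = [1.0] + list(inputs)
    w_aug = [bias] + list(weights)
    v_k = sum(w * x for w, x in zip(w_aug, x_aug))
    return activation(v_k)

# Illustrative (hypothetical) signals, weights and bias
x = [0.5, -1.0, 2.0]
w = [0.4, 0.3, -0.2]
b = 0.1

y1 = neuron_output(x, w, b, logistic)
y2 = neuron_output_augmented(x, w, b, logistic)
print(y1, y2)
```

Both calls return the same value, which is why the bias is often handled as one more trainable weight.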

Neural network architecture parameters
Determining an appropriate neural network architecture is one of the most important tasks in the model-building process. Various types of neural networks are analyzed to find the most appropriate architecture for a particular problem. Multilayer feed-forward networks are generally found to outperform the others. Although they are among the most fundamental models, multilayer feed-forward networks are the most popular type of ANN structure for practical applications.

Number of hidden layers
There is no fixed rule for selecting the number of hidden layers of a network. Therefore, a trial-and-error method was used to select the number of hidden layers. Even a single hidden layer of neurons (with a sigmoid activation function) can be sufficient to model any solution surface of practical interest [36].

Number of hidden neurons
The ability of an ANN to generalize to data not included in training depends on selecting a sufficient number of hidden neurons to store the higher-order relationships necessary for adequately abstracting the process. There is no direct and precise way of determining the most appropriate number of neurons to include in a hidden layer, and the problem becomes more complicated as the number of hidden layers increases. Some studies indicate that a larger number of neurons in the hidden layer provides a solution surface that closely fits the training patterns. In practice, however, too many hidden neurons produce a solution surface that deviates significantly from the trend of the surface at intermediate points, or that provides too literal an interpretation of the training points; this is called 'overfitting'. Further, a large number of hidden neurons reduces the speed of the network during training and testing. Conversely, too few hidden neurons result in an inaccurate model whose solution surface deviates from the training patterns. Therefore, choosing the optimum number of hidden neurons is one of the important training decisions in ANN modeling. To solve this problem, several neural networks with different numbers of hidden neurons are calibrated (trained), and the one with the best performance together with a compact structure is accepted.

Types of activation functions
The activation function, or transfer function, denoted by φ(v), defines the output of a neuron in terms of the induced local field v. It is valuable in ANN applications because it introduces a degree of nonlinearity between inputs and outputs. The logistic sigmoid, hyperbolic tangent and linear functions are some widely used transfer functions in ANN modeling.
Logistic sigmoid function: This is a continuous function that maps the output into the range 0 to 1 and is defined as [32]:
$$\varphi(v) = \frac{1}{1 + e^{-v}} \tag{6}$$
Hyperbolic tangent function: It is used when the desired range of the output of a neuron is between −1 and 1 and is expressed as [32]:
$$\varphi(v) = \tanh(v) = \frac{e^{v} - e^{-v}}{e^{v} + e^{-v}} \tag{7}$$
Linear function: It calculates the neuron's output by simply returning the value passed to it. It can be expressed as:
$$\varphi(v) = v \tag{8}$$
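As a small illustration, the three transfer functions can be written in Python as follows. This is a sketch that assumes the standard unit-slope forms of the logistic and hyperbolic tangent functions; the function names are illustrative.

```python
import math

def logistic_sigmoid(v):
    """Squashes v into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def hyperbolic_tangent(v):
    """Squashes v into the open interval (-1, 1)."""
    return math.tanh(v)

def linear(v):
    """Returns the induced local field unchanged."""
    return v

# Compare the three functions at a few induced local field values
for v in (-2.0, 0.0, 2.0):
    print(v, logistic_sigmoid(v), hyperbolic_tangent(v), linear(v))
```

The bounded sigmoid and hyperbolic tangent are typical choices for hidden layers, while the unbounded linear function suits an output layer that must reproduce a continuous quantity.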

Neural network architectures
The manner in which the neurons of a neural network are structured is intimately linked with the learning algorithm used to train the network. This leads to the formation of network architectures, which are classified into distinct classes depending on the information flow. The different network architectures include: (a) multilayer perceptrons, (b) recurrent networks, (c) RBF networks, and (d) Kohonen self-organizing feature maps.

Multilayer perceptrons (MLPs)
MLPs are layered feed-forward networks (with one or more layers) typically trained with static back-propagation (Figure 3); they are therefore also called FFBP neural networks. These networks have found their way into countless applications requiring static pattern classification. The architecture consists of an input layer, an output layer and one or more hidden layers. The input signal moves only in the forward direction, from the input nodes through the hidden nodes to the output nodes. The function of a hidden layer is to perform intermediate computations between the input and output layers through the weights. The major advantages of FFBP networks are that they are easy to handle and can approximate any input-output map [37].

Recurrent neural networks (RNN)
RNNs may be fully recurrent networks (FRNs) or partially recurrent networks (PRNs). An FRN feeds the outputs of the hidden layer back to itself, whereas a PRN starts from a fully recurrent network and adds a feed-forward connection (Figure 3). A simple RNN can be constructed by modifying a multilayer feed-forward network through the addition of a 'context layer'. At the first epoch, the new inputs are sent to the RNN and the previous contents of the hidden layer are passed to the context layer; at the next epoch, this information is fed back to the hidden layer. Weights are likewise calculated from the hidden layer to the context layer and vice versa. An RNN can have an infinite memory depth and can thus find relationships through time as well as through the instantaneous input space. Recurrent networks are the state of the art in nonlinear time series prediction, system identification, and temporal pattern classification [37][38][39].

Radial basis function (RBF) networks
An RBF network is a three-layer feed-forward network with a nonlinear Gaussian transfer function between the input and hidden layers and a linear transfer function between the hidden and output layers (Figure 3). An RBF network requires more hidden neurons than a standard FFBP network, but it tends to learn much faster than an MLP [37]. The most common basis function is the Gaussian function, given by:
$$R_i(x) = \exp\left[-\sum_{j=1}^{n} \frac{(x_j - c_{ij})^2}{2\sigma_{ij}^2}\right] \tag{9}$$
where $R_i$ = basis (Gaussian) function of hidden neuron i; $c_{ij}$ = cluster center; $\sigma_{ij}$ = width of the Gaussian function. The centers and widths of the Gaussians (the input-to-hidden-layer parameters) are set by unsupervised learning rules, typically some clustering technique, and supervised learning is applied to the output layer. Once the centers are determined, the connection weights between the hidden layer and the output layer can be determined through ordinary back-propagation (gradient-descent) training. The output layer performs a simple weighted sum with a linear output:
$$y = \sum_{i=1}^{m} w_i R_i(x) + w_o \tag{10}$$
where $w_i$ = connection weight between hidden neuron i and the output neuron; $w_o$ = bias; $x$ = input vector; m = number of hidden neurons.
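The forward pass of such a network can be sketched in Python as follows. The sketch assumes a Gaussian basis function with one width per input dimension and a single linear output neuron; the centers, widths, weights and bias are hypothetical values chosen for illustration rather than fitted parameters.

```python
import math

def gaussian_basis(x, center, widths):
    """Gaussian response of one hidden neuron to the input vector x."""
    s = sum((xj - cj) ** 2 / (2.0 * sj ** 2)
            for xj, cj, sj in zip(x, center, widths))
    return math.exp(-s)

def rbf_output(x, centers, widths, weights, bias):
    """Linear output: bias plus weighted sum of the hidden responses."""
    hidden = [gaussian_basis(x, c, s) for c, s in zip(centers, widths)]
    return bias + sum(w * r for w, r in zip(weights, hidden))

# Hypothetical network with two hidden neurons and two inputs
centers = [[0.0, 0.0], [1.0, 1.0]]
widths = [[1.0, 1.0], [1.0, 1.0]]
weights = [2.0, -1.0]
bias = 0.5

y = rbf_output([0.0, 0.0], centers, widths, weights, bias)
print(y)
```

An input located exactly at a center gives that hidden neuron a response of 1, and the response decays with distance from the center at a rate set by the widths.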

ANN learning paradigms
Broadly speaking, there are two types of learning process: supervised and unsupervised. In supervised learning, the network is presented with examples of known input-output data pairs, after which it starts to mimic the presented input-output behavior or pattern. In unsupervised learning, the network learns on its own, in a kind of self-study without a teacher.
Supervised learning: Also called 'associative learning', it involves providing the network with a set of inputs and desired outputs. It is like learning with the help of a teacher. The teacher has knowledge of the environment, and this knowledge is represented by a set of input-output examples. The environment is, however, unknown to the neural network. The network parameters (i.e., synaptic weights and biases) are adjusted iteratively, in a step-by-step fashion, under the combined influence of the training vector and the error signal. After training is complete, the neural network is able to deal with the environment by itself [32]. Among supervised networks, the FFBP NN is the most popular. In FFBP NNs, neurons are organized into layers, and information is passed from the input layer to the final output layer in a unidirectional manner. Any ANN consists of neurons (also called nodes or parallel processing elements), and the layers are interconnected through weights (W). A three-layer FFBP NN (input (i), hidden (j) and target/output (k) layers) with weights $W_{ij}$ and $W_{jk}$ is shown in Figure 4. During training of the FFBP NN, the initial (randomized) weight values are adjusted according to the error calculated between the output and target values; these errors are back-propagated (from right to left in Figure 4) until the minimum error criterion is achieved.
Unsupervised learning: The network is provided with inputs but not with desired outputs. The system itself must then decide which features it will use to group the input data. This is often referred to as self-organization or adaptation. Provision is made for a task-independent measure of the quality of the representation that the network is required to learn, and the free parameters of the network are optimized with respect to that measure [32]. The most widely used unsupervised neural network is the Kohonen self-organizing map (KSOM).

Kohonen self-organizing map (KSOM)
The KSOM maps the input data onto a two-dimensional discrete output map by clustering similar patterns. It consists of two interconnected layers: a multidimensional input layer and a competitive output layer with w neurons (Figure 5). Each node or neuron i (i = 1, 2, …, w) is represented by an n-dimensional weight or reference vector $w_i = [w_{i1}, \ldots, w_{in}]$. The w nodes can be ordered so that similar neurons are located together and dissimilar neurons are located far apart on the map. The topology of the network is defined by the number of output neurons and their interconnections. The general network topology of a KSOM is either a rectangular or a hexagonal grid. The number of neurons (map size), w, may vary from a few dozen to several thousand, and it affects the accuracy and generalization capability of the KSOM. The optimum number of neurons (w) can be determined by the following equation [41]:
$$w = 5\sqrt{N} \tag{11}$$
where N = total number of data samples or records. Once w is known, the number of rows and columns in the KSOM can be determined as:
$$\frac{l_1}{l_2} = \sqrt{\frac{e_1}{e_2}} \tag{12}$$
where $l_1$ and $l_2$ = number of rows and columns, respectively; $e_1$ = largest eigenvalue of the training data set; $e_2$ = second largest eigenvalue.
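The map-sizing heuristic above can be worked through in a short Python sketch. It assumes that the side lengths satisfy both $l_1 l_2 \approx w$ and $l_1 / l_2 = \sqrt{e_1 / e_2}$, and then rounds to whole neurons; the sample count (the number of days from 1 January 2001 to 31 December 2005) and the two eigenvalues are illustrative inputs, not results from the study.

```python
import math

def ksom_map_size(n_samples, e1, e2):
    """Suggested neuron count and grid side lengths for a KSOM.

    n_samples: number of data records N
    e1, e2: largest and second largest eigenvalues of the training data
    """
    w = 5.0 * math.sqrt(n_samples)   # heuristic number of neurons
    ratio = math.sqrt(e1 / e2)       # required ratio l1 / l2
    l2 = max(1, round(math.sqrt(w / ratio)))
    l1 = max(1, round(math.sqrt(w * ratio)))
    return w, l1, l2

# 1826 daily records; eigenvalues 4.0 and 1.0 are hypothetical
w, l1, l2 = ksom_map_size(1826, e1=4.0, e2=1.0)
print(round(w, 2), l1, l2)
```

Because of rounding, the product of the two side lengths only approximates the heuristic neuron count.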

Training the KSOM
The KSOM is trained iteratively; initially, the weights are randomly assigned. When an n-dimensional input vector x is sent through the network, the Euclidean distance between the input and the weight vector of each of the w neurons of the map is computed by:
$$\lVert x - w_i \rVert = \sqrt{\sum_{j=1}^{n} (x_j - w_{ij})^2} \tag{13}$$
where x = input data sample (vector); $w_i$ = prototype (weight) vector of neuron i; $\lVert \cdot \rVert$ denotes the Euclidean distance.
The best matching unit (BMU), also called the 'winning neuron', is the neuron whose weight vector most closely matches the input. At each training iteration t, learning takes place in the BMU and its neighboring neurons, with the aim of reducing the distance between the weights and the input:
$$w_m(t+1) = w_m(t) + \alpha(t)\, h_{lm}(t)\, [x(t) - w_m(t)] \tag{14}$$
where α = learning rate; l and m = positions of the winning neuron and its neighboring output nodes, respectively; $h_{lm}$ = neighborhood function of the BMU l at iteration t.
Figure 5. Kohonen self-organizing map [40].
The most commonly used neighborhood function is the Gaussian, which is expressed as:
$$h_{lm} = \exp\left(-\frac{\lVert l - m \rVert^2}{2\sigma^2}\right) \tag{15}$$
where $\lVert l - m \rVert$ = distance between neurons l and m on the map grid; σ = width of the topological neighborhood.
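One training iteration of this scheme can be sketched in Python as below. The sketch assumes the common update rule in which every weight vector moves toward the input by an amount scaled by the learning rate and a Gaussian neighborhood centered on the BMU; the three-neuron map, grid positions and parameter values are hypothetical.

```python
import math

def best_matching_unit(x, weights):
    """Index of the neuron whose weight vector is closest to x."""
    dists = [math.dist(x, w) for w in weights]
    return dists.index(min(dists))

def som_update(x, weights, positions, alpha, sigma):
    """Single KSOM iteration: find the BMU, then pull every neuron toward x."""
    l = best_matching_unit(x, weights)
    new_weights = []
    for m, w_m in enumerate(weights):
        grid_dist = math.dist(positions[l], positions[m])
        h_lm = math.exp(-grid_dist ** 2 / (2.0 * sigma ** 2))  # Gaussian neighborhood
        new_weights.append([wj + alpha * h_lm * (xj - wj)
                            for wj, xj in zip(w_m, x)])
    return l, new_weights

# Hypothetical 1 x 3 map with two-dimensional weight vectors
weights = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
positions = [(0, 0), (0, 1), (0, 2)]
x = [0.9, 1.2]

bmu, updated = som_update(x, weights, positions, alpha=0.5, sigma=1.0)
print(bmu, updated)
```

In practice both the learning rate and the neighborhood width are usually decreased as training proceeds, so that the map first orders itself globally and then fine-tunes locally.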
The training steps are repeated until convergence. After the KSOM network is constructed, the homogeneous regions (clusters) are defined on the map. The performance of the trained KSOM network is evaluated using two errors: the total topographic error ($t_e$) and the quantization error ($q_e$).
The topographic error, $t_e$, indicates the degree to which the topology of the data is preserved when the map is fitted to the original data set:
$$t_e = \frac{1}{N} \sum_{i=1}^{N} u(x_i) \tag{16}$$
where $u(x_i)$ = binary integer that is equal to 1 if the first and second best matching units of the map are not adjacent units, and 0 otherwise.
The quantization error, $q_e$, indicates the average distance between each data vector and its BMU at convergence, that is, the quality of the map fit to the data:
$$q_e = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - w_{li} \rVert \tag{17}$$
where $w_{li}$ = prototype vector of the best matching unit for $x_i$.
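Both error measures are simple averages over the data set, as the following Python sketch illustrates. It assumes a rectangular grid on which two units count as adjacent when their grid distance is 1 (diagonal neighbors are not treated as adjacent); the map and data values are hypothetical.

```python
import math

def quantization_error(data, weights):
    """Average distance between each sample and its best matching unit."""
    total = sum(min(math.dist(x, w) for w in weights) for x in data)
    return total / len(data)

def topographic_error(data, weights, positions):
    """Fraction of samples whose two best matching units are not adjacent."""
    count = 0
    for x in data:
        order = sorted(range(len(weights)),
                       key=lambda i: math.dist(x, weights[i]))
        first, second = order[0], order[1]
        # Assumption: adjacency means a grid distance of 1
        if math.dist(positions[first], positions[second]) > 1.0 + 1e-9:
            count += 1
    return count / len(data)

# Hypothetical 1 x 3 map whose ordering is deliberately distorted
weights = [[0.0, 0.0], [5.0, 0.0], [1.0, 0.0]]
positions = [(0, 0), (0, 1), (0, 2)]
data = [[0.1, 0.0], [4.5, 0.0]]

q_e = quantization_error(data, weights)
t_e = topographic_error(data, weights, positions)
print(q_e, t_e)
```

Here the first sample has its best and second-best units at opposite ends of the grid, so half of the samples contribute to the topographic error.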

Type of ANN training algorithms
Training basically involves feeding training samples as input vectors through a neural network, calculating the error of the output layer, and then adjusting the weights of the network to minimize the error. There are different methods for adjusting the weights; these methods are called 'training algorithms'. The objective of a training algorithm is to minimize the difference between the predicted output values and the measured output values [6]. Different training algorithms include: (i) the gradient descent with momentum back-propagation (GDM) algorithm, (ii) the Levenberg-Marquardt (LM) algorithm, (iii) the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton algorithm, (iv) the resilient back-propagation (RBP) algorithm, (v) the conjugate gradient algorithm, (vi) the one-step secant (OSS) algorithm, (vii) the cascade correlation (CC) algorithm, and (viii) the Bayesian regularization (BR) algorithm. Only the training algorithms used in this study are briefly described below.

Gradient descent with momentum back propagation (GDM) algorithm
This method uses back-propagation to calculate the derivatives of the performance (cost) function with respect to the weight and bias variables of the network. Each variable is adjusted according to gradient descent with momentum. The equation used to update the weights and biases is given by:
$$\Delta w_{ji}(n) = \alpha\, \Delta w_{ji}(n-1) - \eta\, \frac{\partial E}{\partial w_{ji}} \tag{18}$$
where $\Delta w_{ji}(n)$ = correction applied to the synaptic weight connecting neuron i to neuron j at iteration n; α = momentum; η = learning-rate parameter; E = error function. The equation is known as the generalized delta rule, and it is probably the simplest and most common way to train a network [37].
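The update rule can be demonstrated on a toy problem in Python. In this sketch the error function is a simple two-parameter quadratic with a known gradient, standing in for the back-propagated gradient of a network; the function names, learning rate and momentum value are illustrative assumptions.

```python
def gdm_step(weights, grads, prev_deltas, eta, alpha):
    """Gradient descent with momentum: delta(n) = alpha * delta(n-1) - eta * grad."""
    deltas = [alpha * d_prev - eta * g for d_prev, g in zip(prev_deltas, grads)]
    new_weights = [w + d for w, d in zip(weights, deltas)]
    return new_weights, deltas

def gradient(w):
    """Gradient of the toy error E = 0.5 * ((w0 - 1)^2 + 4 * (w1 + 2)^2)."""
    return [w[0] - 1.0, 4.0 * (w[1] + 2.0)]

w = [0.0, 0.0]
deltas = [0.0, 0.0]          # no previous correction at the first iteration
for _ in range(200):
    w, deltas = gdm_step(w, gradient(w), deltas, eta=0.1, alpha=0.5)

print(w)                     # approaches the minimum at [1, -2]
```

The momentum term reuses part of the previous correction, which smooths the trajectory and can speed up progress along directions where the gradient keeps the same sign.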

Levenberg-Marquardt (LM) algorithm
This method is a modification of the classic Newton algorithm for finding an optimum solution to a minimization problem. In particular, the LM algorithm uses the so-called Gauss-Newton approximation, which keeps the Jacobian matrix and discards the second-order derivatives of the error. The LM algorithm interpolates between the Gauss-Newton algorithm and the method of gradient descent. To update the weights, the LM algorithm uses an approximation of the Hessian matrix:
$$W_{new} = W_{old} - \left(J^{T} J + \lambda I\right)^{-1} J^{T} e \tag{19}$$
where W = weights; e = errors; I = identity matrix; λ = learning parameter; J = Jacobian matrix (first derivatives of the errors with respect to the weights and biases); $J^{T}$ = transpose of J; $J^{T} J$ = approximate Hessian matrix. For λ = 0, the algorithm becomes the Gauss-Newton method. For very large λ, the LM algorithm becomes the steepest descent algorithm. The parameter λ governs the step size and is automatically adjusted (based on the direction of the error) at each iteration in order to secure convergence. If the error decreases between weight updates, λ is decreased by a factor λ⁻. Conversely, if the error increases, λ is increased by a factor λ⁺. Both λ⁻ and λ⁺ are defined by the user. With the LM algorithm, training converges quickly as the solution is approached, because the Hessian does not vanish at the solution. The LM algorithm has large computational and memory requirements, and hence it is practical only for small networks. It is often characterized as more stable and efficient: it is faster and less easily trapped in local minima than other optimization algorithms [37].
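A compact Python sketch of the LM iteration is given below. It assumes the standard update W_new = W_old − (JᵀJ + λI)⁻¹Jᵀe and applies it to a hypothetical two-parameter curve fit, y = a·exp(b·x), rather than to a neural network, so that the 2 × 2 linear system can be solved in closed form. The starting point, the initial λ and the adjustment factors (0.1 and 10) are illustrative choices.

```python
import math

def residuals(params, xs, ys):
    """Errors e = model output - target for the model a * exp(b * x)."""
    a, b = params
    return [a * math.exp(b * x) - y for x, y in zip(xs, ys)]

def jacobian(params, xs):
    """First derivatives of each error with respect to a and b."""
    a, b = params
    return [[math.exp(b * x), a * x * math.exp(b * x)] for x in xs]

def sse(params, xs, ys):
    return sum(r * r for r in residuals(params, xs, ys))

def lm_step(params, xs, ys, lam):
    """Solve (J^T J + lam * I) d = J^T e and return params - d."""
    e = residuals(params, xs, ys)
    J = jacobian(params, xs)
    a11 = sum(r[0] * r[0] for r in J) + lam
    a12 = sum(r[0] * r[1] for r in J)
    a22 = sum(r[1] * r[1] for r in J) + lam
    g1 = sum(r[0] * ei for r, ei in zip(J, e))
    g2 = sum(r[1] * ei for r, ei in zip(J, e))
    det = a11 * a22 - a12 * a12
    d1 = (a22 * g1 - a12 * g2) / det
    d2 = (a11 * g2 - a12 * g1) / det
    return [params[0] - d1, params[1] - d2]

def levenberg_marquardt(params, xs, ys, lam=0.01, lam_down=0.1,
                        lam_up=10.0, max_iter=100):
    err = sse(params, xs, ys)
    for _ in range(max_iter):
        trial = lm_step(params, xs, ys, lam)
        trial_err = sse(trial, xs, ys)
        if trial_err < err:          # error decreased: accept, reduce lambda
            params, err = trial, trial_err
            lam *= lam_down
        else:                        # error increased: reject, raise lambda
            lam *= lam_up
        if err < 1e-20:
            break
    return params, err

# Synthetic, noise-free data generated with a = 2 and b = 0.5
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
fitted, final_err = levenberg_marquardt([1.5, 0.3], xs, ys)
print(fitted, final_err)
```

A rejected step leaves the parameters unchanged and only increases λ, which shortens the next step and turns it toward the steepest descent direction.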

Online and batch modes of training
Online learning updates the weights after the presentation of each exemplar. In contrast, batch learning updates the weights after the presentation of the entire training set. When the training data sets are highly redundant, the online mode is able to take advantage of this redundancy and provides effective solutions to large and difficult problems. On the other hand, the batch mode of training provides an accurate estimate of the gradient vector; convergence to a local minimum is thereby guaranteed under simple conditions [23].
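The difference between the two modes can be seen in a minimal Python sketch that trains a single linear weight on a squared-error criterion. The data set, learning rate and function names are hypothetical and serve only to show where the weight update occurs in each mode.

```python
def online_epoch(w, data, eta):
    """One pass over the data, updating the weight after each exemplar."""
    for x, target in data:
        error = w * x - target
        w -= eta * error * x
    return w

def batch_epoch(w, data, eta):
    """One pass over the data, with a single update from the mean gradient."""
    grad = sum((w * x - target) * x for x, target in data) / len(data)
    return w - eta * grad

# Hypothetical exemplars following target = 2 * x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w_online = w_batch = 0.0
for _ in range(100):
    w_online = online_epoch(w_online, data, eta=0.05)
    w_batch = batch_epoch(w_batch, data, eta=0.05)

print(w_online, w_batch)     # both approach 2
```

The online mode makes three updates per epoch here and the batch mode one, so for the same number of epochs the online weight moves further, at the cost of following a noisier gradient estimate.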

Multiple linear regression (MLR)
The MLR technique attempts to model the relationship between two or more explanatory (independent) variables and a response (dependent) variable by fitting a linear equation to the observed data. The general form of an MLR model is given as [42]:
$$Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \cdots + \beta_k X_{k,i} + \varepsilon_i \tag{20}$$
where $Y_i$ = ith observation of the dependent variable Y; $X_{1,i}, X_{2,i}, \ldots, X_{k,i}$ = ith observations of the independent variables $X_1, X_2, \ldots, X_k$, respectively; $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ = fixed (but unknown) parameters; $\varepsilon_i$ = random variable that is normally distributed.
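The task of regression modeling is to estimate the unknown parameters ($\beta_0, \beta_1, \beta_2, \ldots, \beta_k$) of the MLR model [Eq. (20)]. The pragmatic form of the statistical regression model obtained after applying the least squares method is as follows [42]:
$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1,i} + \hat{\beta}_2 X_{2,i} + \cdots + \hat{\beta}_k X_{k,i}$$
where $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ = least squares estimates of the parameters. The difference between the observed $Y_i$ and the estimated $\hat{Y}_i$, that is, $e_i = Y_i - \hat{Y}_i$, is called the residual (or residual error).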
The purpose of developing MLR models is to establish a simple equation that is easy to use and interpret. MLR modeling is very useful, especially when field data are limited. Moreover, it is versatile, because it can accommodate any number of independent variables [43].
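A least squares fit of an MLR model can be sketched in plain Python by solving the normal equations. This is an illustrative implementation under the assumption of a small, well-conditioned problem; the function names and the synthetic data (generated from Y = 1 + 2 X1 + 3 X2 without noise) are hypothetical.

```python
def fit_mlr(X, y):
    """Least squares estimates [b0, b1, ..., bk] from the normal equations."""
    rows = [[1.0] + list(r) for r in X]          # prepend the intercept column
    p = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] +
         [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def predict(beta, x):
    """Estimated response for one observation x = [x1, ..., xk]."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))

# Synthetic observations from Y = 1 + 2 * X1 + 3 * X2
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
y = [9.0, 8.0, 19.0, 18.0, 26.0]

beta = fit_mlr(X, y)
residual_errors = [yi - predict(beta, xi) for xi, yi in zip(X, y)]
print(beta, residual_errors)
```

For larger or ill-conditioned problems, a numerically more robust solver (for example, one based on QR decomposition) is generally preferable to forming the normal equations directly.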

The FAO-56 Penman-Monteith method
The FAO-56 PM method is recommended as the standard method for estimating ETo at locations where measured lysimeter data are not available. The equation for estimating daily ETo can be written as [3]:
$$ET_o = \frac{0.408\, \Delta\, (R_n - G) + \gamma\, \dfrac{900}{T + 273}\, W_s\, (e_s - e_a)}{\Delta + \gamma\, (1 + 0.34\, W_s)}$$
where $ET_o$ = reference evapotranspiration calculated by the FAO-56 PM method (mm day⁻¹); $R_n$ = daily net radiation (MJ m⁻² day⁻¹); γ = psychrometric constant (kPa °C⁻¹); Δ = slope of the saturation vapor pressure versus air temperature curve (kPa °C⁻¹); $e_s$ and $e_a$ = saturation and actual vapor pressure (kPa), respectively; T = average daily air temperature (°C); G = soil heat flux (MJ m⁻² day⁻¹); $W_s$ = daily mean wind speed (m s⁻¹).
The ETo values obtained from the above equation are used as target data for the ANN models because lysimeter-measured values are unavailable.
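As an illustration, the daily FAO-56 PM equation can be evaluated with a short Python function. The sketch assumes that all intermediate quantities (Δ, γ, net radiation, soil heat flux and vapor pressures) have already been computed; the function name and the input values are illustrative and are not data from the stations in this study.

```python
def fao56_pm_eto(delta, r_n, g, gamma, t_mean, wind_speed, e_s, e_a):
    """Daily reference evapotranspiration (mm/day) by the FAO-56 PM equation.

    delta: slope of the saturation vapor pressure curve (kPa/degC)
    r_n, g: net radiation and soil heat flux (MJ/m2/day)
    gamma: psychrometric constant (kPa/degC)
    t_mean: average daily air temperature (degC)
    wind_speed: daily mean wind speed (m/s)
    e_s, e_a: saturation and actual vapor pressure (kPa)
    """
    radiation_term = 0.408 * delta * (r_n - g)
    aerodynamic_term = (gamma * (900.0 / (t_mean + 273.0))
                        * wind_speed * (e_s - e_a))
    return (radiation_term + aerodynamic_term) / (
        delta + gamma * (1.0 + 0.34 * wind_speed))

# Illustrative inputs for a single day
eto = fao56_pm_eto(delta=0.122, r_n=13.28, g=0.14, gamma=0.0666,
                   t_mean=16.9, wind_speed=2.078, e_s=1.997, e_a=1.409)
print(round(eto, 2))
```

Keeping the radiation and aerodynamic terms separate makes it easy to inspect how much each contributes to the daily total.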

Methodology
For the purpose of this study, 15 different climatic locations distributed over four agro-ecological regions (AERs) are selected. The selected locations are Parbhani, Kovilpatti, Bangalore, Solapur, Udaipur (semi-arid); Anantapur and Hissar (arid); Raipur, Faizabad, Ludhiana, and Ranichauri, (sub-humid); and Palampur, Jorhat, Mohanpur, and Dapoli (humid). Daily climate data of T min , T max , RH min , RH max , W s , S ra for the period of 5 years (January 1, 2001 to December 31, 2005) was collected from All India Coordinated Research Project on Agrometeorology (AICRPAM), Central Research Institute for Dryland Agriculture (CRIDA), Hyderabad, Telangana, India. These data were used for the development and testing of various ANN-based ET o models. Due to the unavailability of lysimeter measured ET o values for these stations, it is estimated by the FAO-56 PM method, which has been adopted as a standard equation for the computation of ET o and calibrating other Eqs. [10]. The normalization technique was applied to both the input and target data before training and testing such that all data points lies in between 0 and 1. The normalization process removes the cyclicity of the data. The following procedure was adopted for normalizing the input and output data sets. Each variable, X i, in the data set was normalized (X i, norm ) between 0 and 1 by dividing its value by the upper limit of the data set, X i, max . Resulting data was then used for mapping.  (2005) sets. ANN models were trained with the LM algorithm consists of one hidden layer (sigmoid transfer function) and one output layer (linear transfer function). The parameters that were fixed after a number of trials include: RMSE = 0.0001, learning rate = 0.65, momentum rate = 0.5, epochs = 500, and initial weight range = À0.5 to 0.5. The developed various ANN models were compared with basic statistical MLR models. 
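The normalization step described above amounts to a division by the maximum of each variable, as in the following Python sketch. The function names and sample values are hypothetical; note that this scaling maps the data into the 0 to 1 range only when all values are non-negative.

```python
def normalize_by_max(values):
    """Scale a variable by its upper limit; returns scaled values and the limit."""
    upper = max(values)
    return [v / upper for v in values], upper

def denormalize(values_norm, upper):
    """Map normalized values (for example, model outputs) back to original units."""
    return [v * upper for v in values_norm]

# Hypothetical daily values of one input variable
raw = [10.0, 20.0, 40.0]
scaled, upper = normalize_by_max(raw)
restored = denormalize(scaled, upper)
print(scaled, upper, restored)
```

The upper limit of each variable has to be stored, because model outputs produced in normalized form must be multiplied by it to recover values in physical units.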
The ANN models developed were evaluated and compared using the error functions described in Table 2. The training window of the model contains the general information used for training the networks, such as the error tolerance, the Levenberg parameter (lambda), and the maximum number of simulation cycles. Weights can be selected in two ways: they can be randomized, or they can be read from an existing weight file from a previous training run.
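The three performance indices used in this chapter can be sketched in Python as follows. RMSE is the usual root mean squared error, and R ratio is taken as the ratio of the average output to the average target ET o . The form of R 2 is an assumption here (squared Pearson correlation), since the exact definitions are those of Table 2; the function names and sample values are illustrative only.

```python
import math

def rmse(target, output):
    """Root mean squared error between target and model output."""
    n = len(target)
    return math.sqrt(sum((o - t) ** 2 for t, o in zip(target, output)) / n)

def r_squared(target, output):
    """Squared Pearson correlation (assumed form of R^2)."""
    n = len(target)
    mean_t = sum(target) / n
    mean_o = sum(output) / n
    cov = sum((t - mean_t) * (o - mean_o) for t, o in zip(target, output))
    var_t = sum((t - mean_t) ** 2 for t in target)
    var_o = sum((o - mean_o) ** 2 for o in output)
    return cov * cov / (var_t * var_o)

def r_ratio(target, output):
    """Mean output divided by mean target: > 1 over-, < 1 underestimation."""
    return (sum(output) / len(output)) / (sum(target) / len(target))

# Hypothetical ETo values in mm/day, for illustration only.
target = [3.0, 4.0, 5.0, 6.0]
output = [3.2, 3.9, 5.1, 6.2]

print(rmse(target, output), r_squared(target, output), r_ratio(target, output))
```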

Results and discussion
To highlight the necessity of using complex ANN models, it is necessary to show the results obtained using MLR models.

Training of ANN models for daily ET o estimation
All the ANN models were trained following the procedure described in the methodology, and after each training run three performance indices (RMSE, R 2 , and R ratio ) were calculated to find the optimum neural network. Several runs with different architectural configurations were used to determine the optimal number of hidden neurons. The optimum network was selected as the one with the minimum RMSE and maximum R 2 values. It is worth mentioning that R ratio is used only to know whether the models overestimated or underestimated ET o values. Training with a higher number of hidden nodes might increase the performance of the ANN models, but it also requires more computation time and makes the architecture more complex, because each network has to complete the set number of epochs [7]. Therefore, the search for the optimum number of hidden nodes was restricted to trial runs with 1-15 hidden nodes (i.e., more than 15 hidden nodes were not tried). Figure 6 shows the relationship between RMSE and the number of hidden nodes of the ANN models for four locations (Parbhani, Hissar, Faizabad, and Dapoli) during training. These locations were chosen randomly, one from each agro-ecological region, such that Parbhani, Hissar, Faizabad, and Dapoli represent semi-arid, arid, sub-humid, and humid climates, respectively.
For the ANN models, the best network was obtained with i + 1 hidden nodes (where i = number of nodes in the input layer) for most of the locations. Thus, i + 1 hidden nodes are sufficient to model the ET o process using ANN models [13-16,44-46]. Table 3 shows the performance statistics of the ANN models for the 15 locations during training; only the results for the optimal network structure, obtained at i + 1 hidden nodes, are summarized there.
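The selection procedure described in this section (one training run per candidate size from 1 to 15 hidden nodes, keeping the network with minimum RMSE and, on ties, maximum R 2 ) can be sketched in Python as below. The function `sweep_hidden_nodes` and the callable `train_and_score` are hypothetical names. The stand-in scorer is an artificial curve built so that the demo has a minimum at i + 1; it does not reproduce Figure 6 or Table 3.

```python
def sweep_hidden_nodes(train_and_score, candidates=range(1, 16)):
    """Train once per candidate size and return (best_size, all_results).

    train_and_score(n_hidden) must return a (rmse, r2) pair; the best
    network has the lowest RMSE, with ties broken by the highest R^2.
    """
    results = []
    for n_hidden in candidates:
        rmse, r2 = train_and_score(n_hidden)
        results.append((n_hidden, rmse, r2))
    best = min(results, key=lambda r: (r[1], -r[2]))
    return best[0], results

n_inputs = 6  # Tmax, Tmin, RHmax, RHmin, Ws, Sra

def fake_train_and_score(n_hidden):
    """Stand-in for real training: an artificial error curve, not measured data."""
    rmse = 0.12 + 0.01 * abs(n_hidden - (n_inputs + 1))
    r2 = 1.0 - 0.1 * rmse
    return rmse, r2

best_size, results = sweep_hidden_nodes(fake_train_and_score)
print(best_size)
```

In practice `train_and_score` would wrap the actual network training and return the indices computed on the training data.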

FAO-56 PM-based ANN models
The ET o process is a function of various climatic factors (T max , T min , RH max , RH min , W s , and S ra ). Therefore, it is pertinent to take into account the combined influence of all the climatic parameters on ET o estimation. The ANN models corresponding to the FAO-56 PM method were developed with T max , T min , RH max , RH min , W s , and S ra as inputs and the FAO-56 PM ET o as target. Table 4 shows the performance statistics of the ANN and MLR models for the 15 locations during testing. Comparison of the results indicated that the ANN models performed better than the MLR models for all locations except Bangalore. This is confirmed by the lower RMSE (mm day −1 ) and higher R 2 values of the ANN models as compared to the MLR models. The R ratio values of the MLR models for the 15 locations are close to one, which simply indicates that, on average, these models neither over- nor underestimated ET o . However, their relatively high RMSE and lower R 2 values indicate that, on a daily basis, these models did over- and under-estimate ET o values. Although the ANN models performed well compared to the MLR models, at some locations they also over- or under-estimated the ET o values. The ANN models overestimated (R ratio > 1) ET o values at Palampur. The over- and under-estimations by all ANN models for the above locations were less than 3%, which is negligible. The overall performance of the models was ANN > MLR for most of the locations, except for Bangalore, where it was MLR > ANN. The results suggest that the non-linearity of the ET o process can be adequately modeled using ANN models.
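A small worked example may help to show why an R ratio close to one does not rule out large daily errors: over- and under-estimations on individual days cancel in the averages but not in RMSE. The Python sketch below uses hypothetical ET o values chosen for easy arithmetic, and the reading of the 3% figure as |R ratio − 1| × 100 is an assumption.

```python
import math

# Hypothetical daily ETo values (mm/day); not taken from Table 4.
target = [4.0, 6.0, 5.0, 3.0]
output = [5.0, 5.0, 4.0, 4.0]  # each day is off by exactly 1 mm/day

mean_target = sum(target) / len(target)
mean_output = sum(output) / len(output)

# Ratio of average output to average target: the daily errors cancel.
r_ratio = mean_output / mean_target

# RMSE keeps every daily error because the deviations are squared.
rmse = math.sqrt(sum((o - t) ** 2 for t, o in zip(target, output)) / len(target))

# Assumed interpretation: percentage over- (+) or under- (-) estimation.
percent_deviation = (r_ratio - 1.0) * 100.0

print(r_ratio, rmse, percent_deviation)
```

Here R ratio equals one and the percentage deviation is zero, yet the RMSE is 1 mm per day, which is why RMSE and R 2 rather than R ratio are used to rank the models.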
The scatter plots of the FAO-56 PM ET o against the ET o estimated with the ANN models for the 15 climatic locations in India are shown in Figure 7. The scatter plots confirm the statistics given in Table 4. Regression analysis was performed between the FAO-56 PM ET o and the ET o estimated with the ANN models, and the best-fit lines are shown in Figure 7. The values of R 2 for the ANN models were found to be >0.968. The fit line equations (y = a 0 x + a 1 ) in Figure 7 gave values of the a 0 and a 1 coefficients close to one and zero, respectively. Because the ANN models outperformed the MLR models, time series plots of the ANN models with 1 year of data (during testing) are shown in Figure 8 for four selected locations: Parbhani, Hissar, Faizabad, and Dapoli. These plots indicate that the ET o estimated using the ANN models matched the FAO-56 PM ET o well, except for a few peak values in the case of Faizabad.
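The best-fit line y = a 0 x + a 1 used in the scatter plots can be obtained by ordinary least squares. The Python sketch below assumes that x is the FAO-56 PM ET o and y is the ANN estimate; the function name and the sample values are hypothetical and serve only to illustrate coefficients close to one and zero.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = a0 * x + a1; returns (a0, a1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a0 = sxy / sxx            # slope
    a1 = mean_y - a0 * mean_x  # intercept
    return a0, a1

# Hypothetical ETo pairs in mm/day, for illustration only.
fao56_pm = [2.0, 3.0, 4.0, 5.0, 6.0]
ann_estimate = [2.1, 2.9, 4.0, 5.1, 5.9]

a0, a1 = fit_line(fao56_pm, ann_estimate)
print(a0, a1)
```

A slope near one with an intercept near zero means the estimates lie along the 1:1 line of the scatter plot.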

Summary and conclusions
Evapotranspiration is an important component of the hydrologic cycle and one of the most difficult to quantify accurately. Before any irrigation system is designed, information on crop water requirements or crop evapotranspiration is needed, which can be calculated using reference evapotranspiration. There exist direct measurement methods (lysimeters) and indirect estimation procedures (physically based and empirical) for modeling ET o . Direct methods are arduous and costly, and they are limited by the lack of skilled manpower to collect accurate measurements. The difficulty in estimating ET o with the indirect physically based methods lies in the unavailability of all the necessary climate data, whereas the application of empirical methods is limited because they are not suitable for all climatic conditions and need local calibration. ANNs are efficient in modeling complex processes without formulating any mathematical relationships related to the physical process. This study was undertaken to develop ANN models corresponding to the conventional FAO-56 PM ET o method for 15 individual stations in India.
The potential of ANN models corresponding to the FAO-56 PM method was evaluated for 15 locations. The ANN models were developed with six inputs (T max , T min , RH max , RH min , W s , and S ra ) and the FAO-56 PM ET o as target. The optimum number of hidden neurons was determined with trials of 1-15 hidden nodes. The ANN models gave lower RMSE values at i + 1 (i = number of inputs) hidden nodes for estimating ET o . Comparison of the MLR and ANN models indicated that the ANN models performed better for all locations. However, on average, the over- and under-estimations of ET o (<3%, which is negligible) by the MLR models were smaller than those of the ANN models. In brief, based on the above discussion on ET o modeling, the following specific conclusions are drawn:
• For estimating ET o using an ANN model, a network with a single hidden layer and i + 1 (i = number of input nodes) hidden nodes was found to be adequate.
• ANN-based ET o estimation models performed better than the MLR models for all locations.
However, it should be noted that only climate data from different agro-ecological regions of India were used in this analysis, and the results might differ for climates in other countries.