## 1. Introduction

While water scarcity has become a key concern worldwide, it is particularly so in arid and semiarid regions with limited potable water sources. In designing water distribution systems (WDS), engineers have typically used a “fixture unit” method, which considers the sum of fixture unit demands, facility types, and socioeconomic factors to determine peak demand. However, this overestimates the actual peak demand by as much as 100% [1]. Due to various uncertainties, including those associated with demand, engineers often include large safety factors when designing WDS. Given that WDS rely mainly on regional energy and resources, an overdesigned system can have environmental impacts that will appear in region(s) beyond the jurisdictional boundaries of the system. While short-term demand forecasts are critical to a WDS daily operations [2], long-term forecasts are required for future planning and management of the systems. In providing an accurate estimate of water demand, a robust demand-forecasting model assists managers in designing a more environmentally sustainable WDS and in managing available water resources more efficiently. When coupled with a water demand management strategy, such models can help managers overcome operational problems (e.g., low pressure during peak demands) and issues related to asset management (e.g., nonreplacement of assets or replacement by lower capacity assets reaching the end of their economic life). It has been estimated that a well-predicted monthly average demand might be up to 400% lower than peak demands that cause low pressure; however, a more realistic model can enhance resource management and operating systems. This will eventually lead to significant savings for water and energy (for running pumps, treatment plants, etc.) industries. Considering weather conditions and population, the prime objective of the present study was to develop a predictive model for monthly average water demand. While the present study proposed a generic framework that could be easily adjusted for any specific case, the City of Kelowna (British Columbia, Canada) was employed as a test case.

## 2. Literature review

Water demand varies greatly both regionally and seasonally. Increasing urbanization and industrialization as well as emerging issues such as shifting weather patterns and population growth have significant impacts on water demand. The main components in demand prediction are the explanatory variables and time scales used. Selecting explanatory variables for a predictive model depend on the desired time scale and the availability of data. Simple models using very few explanatory variables have shown promising accuracy for short-term prediction [3, 4]. In general, the explanatory variables affecting water demand are of two types: weather (e.g., temperature, relative humidity, and rainfall) and socioeconomic (e.g., population and income). Weather conditions affect short-term prediction while their socioeconomic counterparts can affect long-term predictions [5–7]. As has been highlighted by significant worldwide changes in climate, both in terms of weather conditions and global warming, water availability is prone to great uncertainty [8]. Therefore, the impact of evolving weather conditions on long-term water demand predictions should receive greater attention. Furthermore, researchers who have considered weather conditions in short-term water demand prediction have established that it is not feasible to feed online automated WDS with real time weather information [9]. As a result, limited studies have considered weather conditions in their demand forecasting models [10–12]. **Table 1** summarizes the relevant literature. Temperature, precipitation, pan evaporation, and number of days since the last rainfall were used in a forecasting model [13]. Another study used temperature, relative humidity, rainfall, wind speed, and air pressure as weather parameters in their hourly water demand model for Sao Paulo, Brazil [12]. **Table 1** shows the previous researchers did not consider socioeconomic and weather conditions simultaneously since their effects are highly dependent on the forecast’s time scale. Traditionally, WDS utilities have used historical patterns as explanatory variables in predicting future water demands. Scarce water reserves and the rapid increase in urbanization have raised awareness and led to implementation of statistical approaches. Multiple linear regression (MLR) and time series were the most popular techniques used in the early stages of demand forecasting [6]. While MLR has been widely used to better understand the determinants of water demand [14–18], its major drawback is the fact that it considers linear relationships among variables and water demand, such relationships are nonlinear by nature. Time series have been introduced along with regression as methods for demand forecasting [10, 19]. Due to the common belief that they can deal with complex systems [20], artificial neural networks (ANNs) have been widely applied in water demand forecasting [21–23, 2]. Comparing regression, univariate time series, and ANN models, one study found ANN models drawing on standard rainfall and maximum temperature data could better predict weekly water demand than other models [6]. Similarly, drawing on temperature and rainfall data in their forecasting models, researchers concluded that ANN models provided more reliable forecasts for peak weekly demand than time series and simple and multiple linear regressions [22]. Results of another study showed ANN models performed better for hourly forecasts, whereas regression models were more accurate in forecasting daily demand [23]. To improve the accuracy and robustness of demand forecasting models, hybrid models combining or modifying ANN, MLR, and time series techniques have been tested [24–27]. However, application of nonlinear regression in demand forecasting has remained limited to studies using support vector machines (SVMs) [28–30] and training nonlinear relationships through linear regression models [6, 31]. The present study compares gene expression programming (GEP) and SVM nonlinear approaches. Inspired by Darwin’s theory of evolution, GEP was recently proposed in engineering disciplines to optimize the structure of input variables fed into predictive models [32]. Being a self-learning algorithm, GEP has several advantages over conventional predictive models. GEP defines individual block structures (input variables, response, and function sets) and selects the optimized operating functions and multipliers through the process of learning algorithms. Results of one study indicated GEP models outperformed traditional linear models in the field of hydrology [32]. Since weather information is one of the major determinants of water demand, this research employed GEP to develop a robust and accurate demand-forecasting model.

No. | Reference | Method | Determinant | Time scale |
---|---|---|---|---|

1 | [16] | Linear regression | Seasonal dummies, derivatives of weather and price | Monthly demand |

2 | [17] | Linear regression | Density, building size, lot size, household size, income, price, temp, rain, drought dummies | Bimonthly demand |

3 | [18] | Regression using Bayesian moment entropy | Population density | Annual demand |

4 | [13] | Decomposed daily demand followed by composite forecasts | Daily demand and hourly demand | Daily demand |

5 | [19] | Univariate time series | Y_{t−1} | Annual residential demand |

6 | [22] | Regression and ANN | Temp, rainfall, and lags of peak demand | Peak weakly demand |

7 | [23] | ANN | Temperature, rainfall, and delayed demand | Daily demand |

8 | [2] | Time series | Univariate demand series, temperature in a multivariate model | Daily, weekly, monthly, annual |

9 | [6] | Time series and ANN | Delayed demands, temperature, and rainfall | Weekly demand |

10 | [24] | Holt-Winters multiplicative smoothing modified regression | Precipitation, temperature, humidity, lagged demand | Weekly (6 days) |

11 | [26] | Weighted average regression and ANN | Historical demand and time | Annual demand |

12 | [27] | Decomposed annual demand, regression and ANN | GDP, population, temperature, greenery coverage, delayed demand | Annual demand |

13 | [31] | Wavelet-deinoizing and ANN | 7-year long time series of demand | Monthly demand |

14 | [28] | SVM with RBF function is compared with ANN | Delayed demand, population | Daily demand |

15 | [29] | ANN, SVM, Monte Carlo | Rain, demand, wind speed, atmospheric pressure | Hourly demand |

16 | [30] | SVM and adaptive Fourier series | Wind speed, temperature, demand, humidity, and rainfall | Hourly demand |

## 3. Study area and data collection

This research focused on the City of Kelowna located in the Okanagan Valley (British Columbia, Canada). The City has five water districts including the City of Kelowna District (CKD), Glenmore Ellison Irrigation District (GEID), Black Mountain Irrigation District (BMID), Rutland Water District (RWD), and the South East Kelowna Irrigation District (SEKID). The CKD served as the study area of this research. Using three major pumping stations, the CKD primarily supplies water from the Okanagan Lake. The present study used monthly mean water demand data from 1996 to 2010 (http://www.kelowna.ca/). The population censuses of 1996, 2001, 2006, and 2011, along with the best-fit parabolic equation (with coefficient of determination of *R*^{2} ≈ 1) allowed estimation of the population in noncensus years. Weather indices including temperature, wind speed, relative humidity, and rainfall, were drawn from the Environment Canada weather data (http://kelowna.weatherstats.ca/) collected at Station A (latitude 49°57′13″N, longitude 119°22′29″W) located at the City of Kelowna’s airport.

## 4. Methodology

### 4.1. Model development

To determine water demand (*D*) in millions of liters (ML), this research used population (*P*) and hotel occupancy factor (HOR) as socioeconomic parameters (the City of Kelowna is one of the hot spots for tourism in North America), and temperature (*T*) in °C, relative humidity (RH) in percent, and rainfall (*R*) in millimeters as weather parameters. As these parameters did not have the same order of magnitude, they were normalized prior to models development by

where *X* is the standardized magnitude of parameter *x*, *μ* and σ are the corresponding mean and standard deviation, respectively. Phase space reconstruction of each explanatory variable was used prior to GEP modeling to define the structure of the model inputs. This was done to identify the stochastic or deterministic nature of the collected data. For a given proper lag time, the phase space was built by applying Taken’s theorem [33] and transforming the time-series data into the geometry of a single moving point along a trajectory, where each point corresponds to a datum. Average mutual information (AMI) was used to determine the proper lag time of water demand for phase space reconstruction of all input factors. This was done to achieve a comprehensive understanding of input factors, variable self-interaction, and assess the use of lag times in demand forecasting models. Labeled *M*_{a}*D*_{b}OP_{c}, where *a*, *b*, and *c* ∈ {1, 2, 3} a total of 27 models were created (**Table 2**), which combined three input types [*M*_{1}: demand data only; *M*_{2}: demand and climatic data; *M*_{3}: demand, climatic, and demographic data], three lag times [*D*_{1}: 1 month lag; *D*_{2}: 1 and 2 month lags; *D*_{3}: 1, 2, and 3 month lags], and three types of genetic operators [OP_{1}: {+, −, *x*}; OP_{2}: {+, −, *x*, *x*^{2}, *x*^{3}}; OP_{3}: {+, −, *x*, *x*^{2}, *x*^{3}, √, e^{x}, log, ln}] used in developing the GEP models.

Data were used in partitions of 144 samples for training (1996–2007) and 35 samples for validation (2008–2010). The time series of water demand over the time period of 1996–2010 (**Figure 1**) shows a relatively regular periodic cycle of water demand in CKD that is mainly due to seasonal changes.

### 4.2. Genetic expression programming (GEP)

Introduced by Ferreira, GEP is an emerging soft computing technique [34]. The strategy used for the learning algorithms was the optimal evolution using the genetic operators. Following Ferreira, this research defined the overall structure of the GEP model by: 30 chromosomes, eight head sizes, and three genes [35]. The selected head size determined how complex each model parameter was. Each of the gene heads underwent a set of different arrangements to model the feeding data. Selecting new random populations was followed by reproduction in order to reach the most suitable model under optimized stopping conditions. Models were developed based on three genes linked together by an addition function. The number of genes per chromosome specified the layers or blocks involved in building the whole model. Although a large gene was useful, dividing the chromosomes into simpler units resulted in a more efficient and manageable learning process. RMSE was used as a fitness function to fit a curve to target values. The stopping condition was a maximum fitness and coefficient of determination (*R*^{2}). Ten numerical constants were used as floating data point in each gene.

### 4.3. Lag time

The literature lists three methods for estimating lag time, AMI, autocorrelation function (ACF), and correlation integral (CI) [36–38]. AMI is considered the best since ACF reflects only linear properties and CI requires a large set of data [39]. Consequently, the present study employed AMI defined as:

where the joint probability of two successive time series, *P*(*X*_{i}, *X*_{i+τ}) and the product of their individual marginal probability, *P*(*X*_{x}) · *P*(*X*_{i+x}), were used to find the optimum lag time. This lag can contribute to the maximum information added on *X*_{i} by the successive time series *X*_{i+τ}. The prime objective of using this approach was to make sure these time series were independent and thereby better represented the dynamics of the system in the phase space. In other words, a balanced independency was desirable in identifying an optimum delay time.

### 4.4. Support vector machines (SVM)

For SVM models, in which genetic operators are not used, the input types remained *M*_{1}, *M*_{2}, or *M*_{3}, while the lag times remained *D*_{1}, *D*_{2}, or *D*_{3}. This study compared the performance of radial basis function (RBF), polynomial (Poly), and Linear (Lin) kernels. These were appended to the input type and lag, e.g., *M*_{1}*D*_{1}RBF, *M*_{1}*D*_{1}Poly, *or M*_{1}*D*_{1}Lin. **Figure 2** shows the structure of the SVM model. Kernel functions (RBF, Poly, or Lin) were used to map the input vectors into higher dimensions in space.

In this method, the input vectors are considered as supports forming the backbone of the whole model structure through a training process. If *N* samples of the population given by

where *X* is an input parameter with *m* components and *Y* is its response output variable, *W* is a weight vector, *b* represents a bias, and *φ* is a transfer function which exhibits nonlinear behavior, mapping the input vectors into a higher dimensional space. As these mapped vectors can compromise the complex nonlinear regression of the input space, Cortes and Vapnik introduced the convex optimization problem with an insensitivity loss function [40]:

where ξ_{k} and *C* is a positive trade-off parameter that determines the degree of the empirical error in the optimization problem. Following previous researchers [41, 42], the optimization was simultaneously undertaken through Lagrangian multipliers under Karush Kuhn-Tucker (KTT) conditions.

## 5. Results and discussion

The prime objective of using phase space reconstruction was to find a proper lag time for developing the models in this study. In order to have a comprehensive understanding of model performance, GEP models were defined by all lag times up to the optimum value determined for water demand in the CKD. The AMI calculations of the water demand in the CKD resulted in a lag time of 3 months. **Figure 3** shows that the first local minimum point occurs at 3 months, allowing the AMI an optimum lag time for phase space reconstruction (*τ* = 0.6591 for 2 months, *τ* = 0.5073 for 3 months).

**Figure 4a–c** shows the phase space diagrams of water demand for *τ* = 1, 2, and 3 months, respectively. Each figure represents the state of WDS demand at the given time. The evolution of phase space in this time series was given by reconstructing a pseudo phase space in which the demand of CKD, a nonlinear system, was considered by its self-interaction using AMI [43]. **Figure 4c** (*τ* =3) has a more regular pattern in comparison with the other two previous states of phase space (*τ* = 1, 2; **Figure 4a** and **b**, respectively), showing a lag time of 3 months to be optimum.

Prior to analysis with GEP models, a correlation table between the explanatory variables and water demand provided a better understanding of how to define the input factors (**Table 3**). The correlations were 0.92, 0.84, −0.83, 0.11, and −0.01 for *D* vs. *T*, *D* vs. HOR, *D* vs. RH, *D* vs. *P*, and *D* vs. *R*, respectively. Interestingly, water demand was highly correlated to temperature and hotel occupancy rate in CKD, showing the periodic cycle of demand due to seasonal changes. This research, however, employed all input factors in evolving the GEP models.

**Table 4** shows all 27 GEP models developed in the present study. Three superior models were highlighted in each category or classification of determinants. Interestingly, a lag time of 3 months outperformed other combinations in all different classifications which show the importance of using phase space construction in studying complex systems. This shows that an appropriate lag time determined by AMI can significantly improve the performance of the forecasting model. Different genetic operators were also used to understand which mathematical operations better define the nature of these determinants. The first operator {+, −, *x*} showed better performance in the first two classifications, i.e., for demand based and demand plus climatic info based categories. The second operator (OP_{2}) {+, −, *x*, *x*^{2}, *x*^{3}} outperformed other operators in (OP3) (demand + socioeconomic + climatic information) of input parameters in which socioeconomic factors were included. It is interesting that using more complex mathematical operations, as in OP_{3} {+, −, *x*, *x*^{2}, *x*^{3}, √, *e*^{x}, log, ln} consistently reduced the quality of the models’ performance. This showed that water demand forecasting could be reasonably explained by models using basic mathematical operations despite its complexity. Used to investigate the sensitivity of the models to determinant classification, the genetic operator, and lag time, the performance indices of MAE and RMSE did little to distinguish among the best performing models (*M*_{1}*D*_{3}OP_{1}, *M*_{2}*D*_{3}OP_{1}, and *M*_{3}*D*_{3}OP_{2}) in each category, i.e., MAE = 0.304, 0.3035, and 0.291, respectively, and RMSE = 0.3984, 0.3664, and 0.3660. While *R*^{2} values showed M2 and M3 models to slightly outperform M1 models, plotting observed and predicted demand over time, as well as scatter plots of observed vs. predicted demand served to further delineate differences in performance (**Figure 5**). Comparing cumulative water demand calculated by each of the three top models to observed values showed the *M*_{1}*D*_{3}OP_{1} and *M*_{3}*D*_{3}OP_{2} models to be more accurate than *M*_{2}*D*_{3}OP_{1} (**Figure 6**). In order to distinguish between *M*_{1}*D*_{3}OP_{1} and *M*_{3}*D*_{3}OP_{2} a plot of cumulative (observed – predicted) was plotted (**Figure 7**). This showed model *M*_{3}*D*_{3}OP_{2} to be the best given the lesser fluctuations in errors and a consistent pattern throughout the plot’s time period. This better performance may be attributable to the combination of socioeconomic factors with demand and climatic data; this might having resulted in a more consistently accurate model, which lowered the error associated compared to the other two models.

[i] -
**M*_{1}, Demand; *M*_{2}, Demand + Climactic; *M*_{3}, Demand + Climactic + Socioeconomic; *D*_{1}, **τ** (lag) = 1 month; *D*_{2}, **τ** = 2 months; *D*_{3}, **τ** = 3 months; *OP*_{1}, {+, −, x}; *OP*_{2}, {+, −, x, x2, x3}; *OP*_{3}, {+, −, x, x2, x3, √, ex, log, ln}; *R*^{2}, coefficient of determination; MAE, mean absolute error; RMSE, root mean square error.

The superior GEP models from each classification were compared to SVM models implementing three different kernel functions (RBF, Poly, and Lin). Training and testing performance indices for the SVM models developed with each of the three kernel functions showed *Poly* kernel functions to outperform RBF and Lin functions (**Table 5**). The fact that Lin kernels performed poorly indicates that the nature of input parameters could not be considered using such functions. The *M*_{2}*D*_{3}Poly model was selected as the superior SVM model to be compared with the GEP models (**Figure 8**).

## 6. Conclusion

In an attempt to improve model prediction accuracy, a wide range of modeling techniques has been proposed by researchers over recent years in the water demand forecasting field. The present research explored a new approach to modeling water demand, namely genetic expression programming along with phase space reconstruction. In this method, input factors are not randomly chosen as in previous studies. Instead, appropriate lag time determinations made by the AMI method defined the structure of the explanatory variables employed in the models. The outcome of this research demonstrated GEP models to be highly sensitive to classification of input factors, proper lag time, and selection of genetic operators. In general, soft computing techniques like GEP should receive more attention in forecasting behaviors of complex systems such as WDS. These models can offer valuable information to WDS operators and designers to deploy optimum determinants in their forecast models. The three best GEP models proposed in this research were compared using different performance indices, however, differentiating between them was difficult due to the similarity in statistical index values. One of three GEP models was selected due to lower cumulative error in predicting demand and less fluctuation in comparison with the other two GEP models. However, these models were slightly outperformed by a SVM model, which showed even better performance indices. This shows that both GEP and SVM can be useful techniques in water demand forecasting and can account for nonlinearity of the input parameters