Using Principal Component Scores and Artificial Neural Networks in Predicting Water Quality Index

Introduction
The management of river water quality is a major environmental challenge, in particular the identification of point and non-point sources of pollutants. Industrial and municipal wastewater discharges can be considered constant polluting sources, unlike surface water runoff, which is seasonal and highly affected by climate. According to Aiken et al. (1982), 42 tributaries in Peninsular Malaysia, including the Langat River, are categorized as very polluted. By 1999, there were about 13 polluted tributaries and 36 polluted rivers as a result of human activities such as industry, construction and agriculture (Department of Environment, Malaysia (DOE), 1999). In 1990, 48 rivers were classified as clean, but the number had fallen to 32 by 1999 (Rosnani Ibrahim, 2001).
Surface water pollution is identified as the major problem affecting the Langat River Basin in Malaysia. Expanding development within the river basin has in turn increased the pollution loading into the Langat River. To avoid further degradation, the DOE has installed telemetric stations along the river basin to monitor water quality continuously. As a result, abundant data have been collected since 1988. There are 927 monitoring stations located within 120 river basins throughout Malaysia. Water quality data are used to determine the water quality status and to classify the rivers based on the water quality index (WQI) and the Interim National Water Quality Standards for Malaysia (INWQS). The WQI provides a useful way to predict changes and trends in water quality by considering multiple parameters. It is calculated from six selected water quality variables, namely dissolved oxygen (DO), biochemical oxygen demand (BOD), chemical oxygen demand (COD), suspended solids (SS), ammoniacal nitrogen (AN) and pH (DOE, 1997).
It is well known that pollution loading into river systems from the environment involves a complex interaction of many factors (e.g. chemical, physical and meteorological interactions). The primary pollutants are emitted from land use activities surrounding the river basin (e.g. agriculture, forest, urban, industrial and others). Rapid urbanization along the Langat River plays an important role in the increase of point source (PS) and non-point source (NPS) pollution. In view of this complex interaction, modelling techniques are needed to address the problem. However, obtaining models that adequately represent the dynamic behaviour of field data is not easy. The major contributing factors are a lack of good understanding and description of the phenomena involved, the limited availability of reliable and complete field data sets, and the need to estimate the numerous parameters involved. Beck (1986) noted that an increase in model complexity will undoubtedly increase the number of parameters, leading to problems of identification.
Applications of artificial neural networks (ANNs) to environmental problems are becoming more common (Silverman and Dracup, 2000; Scardi, 2001; Recknagel et al., 2002; Bowden et al., 2005; Muttil and Chau, 2007). ANNs are computing systems that were originally designed to simulate the structure and function of the brain (Rumelhart et al., 1986), and their application is a relatively new concept in environmental modelling. If trained properly, a neural network model is capable of 'learning' linear as well as nonlinear features in the data (Elsner and Tronis, 1992).
An ANN consists of a set of simple processing units (neurons) arranged in a defined architecture and connected by weighted channels; in remote sensing applications, for example, these act to transform remotely sensed data into a classification. ANN classification techniques differ from conventional ones: they are distribution-free, may sometimes use small training sets (Hepner et al., 1990) and, once trained, are computationally rapid, which is of value in processing large data sets (Gershon and Miller, 1993). Furthermore, ANNs have been shown to map land cover more accurately than many widely used statistical classification techniques (Benediktsson et al., 1990; Foody et al., 1995) and alternatives such as evidential reasoning (Peddle et al., 1994).
It has been proposed that ANN is the best tool for modelling non-linear environmental relationships (Zhang and Stanley, 1997; Jain and Indurthy, 2003). Research undertaken at Imperial College London has investigated the capability of the ANN approach in modelling spatial and temporal variations in river water quality (Clarici, 1995). ANNs were used as a predictive model for the cyanobacteria Anabaena spp. in the River Murray, South Australia (Maier et al., 1998). DeSilets et al. (1992) also used ANN to predict salinity. Ha and Stenstrom (2003) proposed a neural network approach to examine the relationship between storm water quality and various types of land use.
ANN has been successfully applied to the study of river water quality in Malaysia (Zarita Zainudin, 2001; Mohd Ekhwan Toriman and Hafizan Juahir, 2003; Hafizan Juahir et al., 2003a,b; Hafizan et al., 2004a,b; Ruslan Rainis et al., 2004). An approach for identifying possibilities for water quality improvement could be developed using this concept. Such information could provide opportunities for better river basin management to control river water pollution in Malaysia. Hafizan Juahir et al. (2003a) showed that the ANN model performs better than the autoregressive integrated moving average (ARIMA) model in forecasting DO. The use of ANN for river regulation (Mohd. Ekhwan Toriman and Hafizan Juahir, 2003) and the application of the second order back propagation method (Hafizan Juahir et al., 2004a) have also been reported.
In the natural environment, water quality is a multivariate phenomenon, at least as reflected in the multitude of constituents used to characterize the quality of a water body. Water quality is very difficult to model because of the different interactions between pollutants and meteorological variables. Principal component analysis (PCA) is one approach to this problem and has received increasing attention as an accepted method in environmental pattern recognition (Simeonov et al., 2003; Wunderline et al., 2001; Helena et al., 2000; Loska and Wiechula, 2003).
The objective of this study is to use PCA to classify predictor variables according to their interrelation, and to obtain a parsimonious prediction model (i.e., a model that depends on as few variables as necessary) for the WQI of the Langat River, using other physico-chemical and biological data as predictor variables. For this purpose, principal component scores of 23 physico-chemical and biological water quality parameters were generated and selected appropriately as input variables in ANN models for predicting WQI.

The data and monitoring sites
The water quality data in this study were obtained from seven stations along the main Langat River (Fig. 1). The water quality monitoring stations are manned by the DOE and the Ministry of Natural Resources and Environment of Malaysia. The selected stations are listed in Table 1. The data used in the study cover the period from September 1995 to May 2002. Seven sites were chosen, namely Teluk Panglima Garang (site 7), Teluk Datok (site 6), Putrajaya (site 5), Kajang (site 4), Cheras (site 3), Hulu Langat (site 2), and Pangsoon and Ulu Lui (site 1). Sites 3 to 7 are located in the region of high pollution load, as there are several wastewater drains situated in the middle and downstream reaches of the Langat River basin. Site 2 is partly situated in the middle stream region, designated as moderately polluted. Site 1 and part of site 2 are located upstream on the Langat River, in an area of relatively low river pollution. It is worth noting that some stations have missing data and not all stations were consistently sampled.

Fig. 1. The seven water quality stations (Sb) selected in this study along the main river.

Principal component analysis
In this work, PCA was performed on the above-mentioned water quality parameters to rank their relative significance and to describe their interrelation patterns. Chosen PC scores were then used as inputs to the ANN models. Each PC score is a linear combination of the measured variables:
$$z_{ij} = a_{i1}x_{1j} + a_{i2}x_{2j} + \dots + a_{im}x_{mj} \qquad (1)$$
where z is the component score, a is the component loading, x is the measured value of the variable, i is the component number, j is the sample number and m is the total number of variables.
The PCs generated by PCA are sometimes not readily interpreted; therefore, it is advisable to rotate them by varimax rotation. Varimax rotation ensures that each variable is maximally correlated with only one PC and has a near-zero association with the other components (Abdul-Wahab et al., 2005; Sousa et al., 2007). PCs with eigenvalues greater than 1 are considered significant (Kim and Mueller, 1987) and were subjected to varimax rotation; a typical criterion is that the retained PCs explain 75-95% of the total variance (Chen and Mynett, 2003). The rotations were carried out in order to obtain new groups of variables. Variables with communality greater than 0.7 are considered to have significant factor loadings (Stevens, 1986).
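The selection and rotation steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: the data are synthetic, the function names `pca_kaiser` and `varimax` are hypothetical, PCA is assumed to be run on the correlation matrix, and the rotation uses the common SVD-based varimax iteration.

```python
import numpy as np

def pca_kaiser(X):
    """PCA on the correlation matrix, keeping components with eigenvalue > 1."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardise each variable
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)            # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]                  # re-sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0                               # eigenvalue-greater-than-1 rule
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    scores = Z @ eigvecs[:, keep] / np.sqrt(eigvals[keep])  # standardised PC scores
    return eigvals, loadings, scores

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation; returns rotated loadings and rotation matrix."""
    p, k = loadings.shape
    R = np.eye(k)
    d_old = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        target = L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        R = u @ vt
        d = s.sum()
        if d_old != 0 and d / d_old < 1 + tol:         # criterion stopped improving
            break
        d_old = d
    return loadings @ R, R

# Synthetic stand-in data (assumption): 305 samples, 8 variables, 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(305, 2))
mixing = rng.normal(size=(2, 8))
X = latent @ mixing + 0.5 * rng.normal(size=(305, 8))

eigvals, loadings, scores = pca_kaiser(X)
rotated, R = varimax(loadings)
rotated_scores = scores @ R                            # scores of the rotated PCs
communality = (rotated ** 2).sum(axis=1)               # per-variable communality
```

Because the rotation is orthogonal, the communality of each variable is the same before and after rotation; only the distribution of loadings across components changes.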

Artificial neural networks for WQI prediction
In this work, the back propagation (BP) ANN was used in the development of all the prediction models. The activation (transfer) function of a back-propagation network is usually a differentiable sigmoid (S-shaped) function, which provides the non-linear mapping from inputs to outputs. A three-layer back-propagation ANN is used in this study. The number of input and output neurons is determined by the nature of the problem under study. In this study, the networks were trained, tested and validated with one hidden layer and 1 to 10 hidden neurons. This choice was based on the work of Jiang et al. (2004), who found that the results with one hidden layer were better than those with two hidden layers, and that the best performance was obtained using a structure with 3 to 6 neurons in the hidden layer. The output neuron (layer) gives the predicted WQI value.
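A three-layer back-propagation network of the kind described above can be sketched as follows. This is an illustrative sketch only, not the MATLAB implementation used in the study: the function names `train_bp` and `predict`, the linear output neuron, the learning rate, the weight initialisation and the synthetic data are all assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_bp(X, y, n_hidden=4, lr=0.1, epochs=2000, seed=0):
    """Batch gradient-descent training of an input-hidden-output network."""
    rng = np.random.default_rng(seed)
    y = y.reshape(-1, 1)
    W1 = rng.normal(0.0, 0.5, size=(X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, size=(n_hidden, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)                 # hidden layer (sigmoid activation)
        out = H @ W2 + b2                        # single linear output neuron (assumption)
        err = out - y
        losses.append(float(np.mean(err ** 2)))  # mean squared error for this epoch
        g_out = 2.0 * err / len(X)               # gradient of the loss w.r.t. the output
        g_hid = (g_out @ W2.T) * H * (1.0 - H)   # error propagated back through the sigmoid
        W2 -= lr * (H.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X.T @ g_hid)
        b1 -= lr * g_hid.sum(axis=0)
    return (W1, b1, W2, b2), losses

def predict(params, X):
    W1, b1, W2, b2 = params
    return (sigmoid(X @ W1 + b1) @ W2 + b2).ravel()

# Synthetic stand-in for six component scores and a non-linear target (assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = np.tanh(X[:, 0]) + 0.5 * X[:, 1] * X[:, 2]

params, losses = train_bp(X, y, n_hidden=4)
y_hat = predict(params, X)
```

Repeating the call with `n_hidden` from 1 to 10 and comparing the results on held-out data mirrors the search over hidden-layer sizes described in the text.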
Two different types of ANN models were developed. In the first type, prediction was performed based on the original PCs. In the second type, scores of varimax-rotated PCs (ANN-RPCs) with eigenvalues greater than 1 were selected as inputs. For this model, prediction of WQI was performed using two to six rotated principal components separately.
The original PC and rotated PC (RPC) data sets each consist of 305 observations (rows) and are divided into training, testing and validating sets for WQI prediction. The ANN-predicted WQI values are compared with the WQI values calculated using the DOE-WQI formula, which is based on six water quality parameters, namely DO, COD, BOD, AN, SS and pH (DOE, 1997). The input data matrix consists of 305 observations (rows) and 23 water quality variables (columns) [305 × 23]. The observed data for each station are arranged according to time of observation, from September 13, 1995 to June 7, 2002. Table 2 describes the data structure. The validation set comprises at least 10% of the whole data set, and the remaining data are divided into a 75% training set and a 25% testing set (Kuo et al., 2007).
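One way to realise such a partition is sketched below. It assumes that the 75%/25% figures apply to the data left after the validation set is removed and that observations are assigned at random; the text does not state whether the split was random or chronological, and the function name `split_indices` is hypothetical.

```python
import math
import numpy as np

def split_indices(n, val_frac=0.10, train_frac=0.75, seed=0):
    """Return (train, test, validation) index arrays for n observations."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)                     # random assignment (assumption)
    n_val = math.ceil(val_frac * n)              # at least 10% for validation
    n_train = int(train_frac * (n - n_val))      # 75% of the remainder for training
    val = idx[:n_val]
    train = idx[n_val:n_val + n_train]
    test = idx[n_val + n_train:]                 # everything left over is test data
    return train, test, val

train_idx, test_idx, val_idx = split_indices(305)
print(len(train_idx), len(test_idx), len(val_idx))   # 205 69 31
```

For a chronological split, the permutation could be replaced by `np.arange(n)` so that the index order follows the time of observation.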

Determination of model performance
The model's behaviour in both the learning (training and testing) and validating phases is evaluated using two statistics: the correlation coefficient (R), assessed at the 95% confidence limit, and the mean bias error (MBE), or residual error. In their standard forms, these are given by
$$R = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(\hat{x}_i-\bar{\hat{x}})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(\hat{x}_i-\bar{\hat{x}})^2}} \qquad (2)$$
$$\mathrm{MBE} = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{x}_i) \qquad (3)$$
where $\hat{x}_i$ and $x_i$ represent the observed values and the corresponding forecast values for i = 1, 2, ..., n, and $\bar{\hat{x}}$ and $\bar{x}$ are their respective means, so that the MBE is the mean of forecast minus observed.
These two statistics are used to evaluate the accuracy of the forecasts and to compare the forecasting ability of each approach.
The 95% confidence limit is used to determine whether the predicted output lies within the confidence range. It is assumed that a predicted value falls into an interval within which there is an associated uncertainty. According to Wackerly et al. (1996), this uncertainty is derived from the residual errors that have already been calculated within that range of values. If the residual errors are randomly distributed, a general rule of thumb states that they will lie within two standard deviations of their mean with a probability of 0.95. This method has been used to measure ANN prediction performance by several researchers (Bishop, 1995; Tibshirani, 1996; Shao et al., 1997; Zhang et al., 1998; Lowe and Zapart, 1999; Townsend and Tarassenko, 1999).
ANN models and statistical analyses were carried out using MATLAB 7.0 and XLSTAT2008 (Excel2003 add-in) for Windows.
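The performance measures in this section can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the study's own code: the function names are hypothetical, the MBE is taken as forecast minus observed, the limits are the mean residual plus or minus two standard deviations (the rule of thumb quoted above), and the WQI values are synthetic.

```python
import numpy as np

def correlation_coefficient(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    o = observed - observed.mean()
    p = predicted - predicted.mean()
    return float((o * p).sum() / np.sqrt((o ** 2).sum() * (p ** 2).sum()))

def mean_bias_error(observed, predicted):
    """Mean residual, taken here as forecast minus observed (assumption)."""
    return float(np.mean(predicted - observed))

def fraction_outside_limits(observed, predicted, k=2.0):
    """Fractions of residuals below and above mean +/- k standard deviations."""
    resid = predicted - observed
    centre, spread = resid.mean(), resid.std(ddof=1)
    below = float(np.mean(resid < centre - k * spread))
    above = float(np.mean(resid > centre + k * spread))
    return below, above

# Small check: a forecast that is always 0.5 too high.
obs = np.array([1.0, 2.0, 3.0, 4.0])
pred = obs + 0.5
r = correlation_coefficient(obs, pred)
mbe = mean_bias_error(obs, pred)

# Synthetic WQI-like values with random forecast errors (assumption).
rng = np.random.default_rng(0)
wqi_obs = rng.uniform(40.0, 95.0, size=305)
wqi_pred = wqi_obs + rng.normal(0.0, 3.0, size=305)
below, above = fraction_outside_limits(wqi_obs, wqi_pred)
```

With randomly distributed residuals, `below + above` should be close to 0.05; a much larger fraction suggests that the errors are not well described by the two-standard-deviation rule.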

Results and discussion
After PCA, of the 23 principal components generated, only the six PCs with eigenvalues greater than 1 (Table 3) were selected as ANN inputs. The selected PCs explained 75.1% of the total variation. Furthermore, communality values were high for the selected PCs, for example 93% for Cond., 95% for Sal., and 98% for DS and TS (Table 4). These results further support the chosen number of PCs (Stevens, 1986).
For the first six rotated PCs (RPCs), the loadings from PCA are given in Table 4, with the highest loadings shown in bold. For instance, Cond., Sal., DS, TS, Cl, Ca, K, Mg and Na have high correlations with RPC1. Eighteen variables with strong loadings were included in the six selected RPCs. The significant variables are Cond., Sal., DS, TS, Cl, Ca, K, Mg and Na in RPC1; DO, BOD and AN in RPC2; SS and Tur in RPC3; and NO₃⁻ and PO₄³⁻ in RPC4. The only meaningful loadings in RPC5 and RPC6 are pH and Zn, respectively.

Table 3. Descriptive statistics of selected original PCs with eigenvalues more than 1.
Table 4. Rotated factor loadings using six PCs.

Using the original principal component scores as inputs, the best architecture consists of a three-layer network with 23 input neurons, 10 neurons in the hidden layer and one neuron in the output layer. With RPC scores as inputs, the best architectures were achieved with almost the same number of hidden neurons; the hidden layers consist of 9 and 10 neurons, respectively. Training was carried out for a maximum of 10000 iterations. The network was selected on the basis of the maximum correlation coefficient (R) and the 95% confidence limit. This study also attempts to place a 95% confidence interval on the WQI prediction produced by the best ANN model. Figures 3, 4 and 5 compare the predicted values with the upper (UL) and lower (LL) limits of the 95% confidence interval for the ANN-RPC6 and ANN-PC23 models. For ANN-RPC6, only 4.3% of the 305 predicted values fell outside the 95% confidence limits (1% below the LL and 3.3% above the UL). For ANN-PC23, 25% of the 305 observations fell outside the limits (14% below the LL and 11.8% above the UL). This shows that using a reduced set of rotated PC scores as input gives better results without losing information. It is thus apparent that ANN prediction using scores of varimax-rotated PCs results in a more accurate WQI prediction.

Conclusion
In this work, a combination of PCA and ANN is used to predict WQI based on 23 historical water quality parameters. The original predictors were selected based on the available Malaysian DOE data. To obtain the latent variables used as inputs to the ANN, two different approaches were used: one based on un-rotated original PCs and the other based on varimax-rotated PCs.
Using six PCs, significant loadings are observed for Cond, Sal, DS, TS, Cl, Ca, K, Mg and Na in PC1; DO, BOD and AN in PC2; SS and Tur in PC3; NO₃⁻ and PO₄³⁻ in PC4; pH in PC5; and Zn in PC6. ANN models based on these 6 PC scores can predict WQI with acceptable