Constructing models, comparing their predictions with observations, and trying to improve them constitute the core of the scientific approach to understanding complex systems like large river basins (Even et al., 2007). These processes require manipulation of huge historical data sets, which might be available in different formats and from various stakeholders. The challenge is then to first pre-process the data to similar lengths, with minimal loss of integrity, before manipulating it as per the initial objectives. In the Upper and Middle Vaal Water Management Areas (WMAs) of the Vaal River, bounded by the Vaal Dam outlet and the Bloemhof Dam inlet, the overall objective of on-going research is to model surface raw water quality variability in order to predict the cost of treatment to potable water standard. This paper reports on part of the overall research. Its objective was to show how a huge and non-consistent water quality data set could be downsized to manageable aspects with minimal loss of integrity. Within that scope, challenges were also highlighted.
One of the more important forms of knowledge extraction is the identification of the more relevant inputs. When identified, they may be treated as a reduced input for further manipulation. In water quality data analysis, data collection, cleaning and pre-processing are often the most time-consuming phases. All inputs and targets have to be transferred directly from instrumentation or from other media, tagged and arranged in a matrix of vectors with the same lengths (Alfassi et al., 2005). If vectors have outliers and/or missing values these have to be identified for correction or to be discarded. More complex mathematical correlations are sometimes employed to identify redundant, co-linear inputs, or inputs with little information content (Alfassi et al., 2005).
Sources and sinks of variables in hydrodynamics, also known as forcing functions, are the cause of change in water quality (Martin et al., 1998). To capture intermediate-scale processes that are spotty in spatial extent, the calibration data is sampled extensively and averaged over sufficient spatial scales so that the condition is captured over time. Although many water constituents are non-conservative in nature, a few conservative ones that approach ideal behaviour under limited conditions could be used for modelling and calibration.
Data sets spanning many years have been collected by various stakeholders, including the Department of Water Affairs (DWA) and the Water Boards which treat bulk water for potable use. For management of the basin as a whole these data sets come in handy, but the major challenge is collating them into uniform and useable data, given that the different stakeholders monitor selected parts of the basin for their own specific purposes. Some sampling points might be dropped, or new points picked up, as emerging pollution threats require tracing and monitoring in order to mitigate effects. Still, a useable data set has to be constructed to monitor pollution and other threats, in addition to informing and alerting decision makers regarding environmental and human health issues. This paper shows how inconsistent and scattered data sets from 13 monitoring points were pre-treated and downsized to SO₄²⁻ inter-relationships. SO₄²⁻ is a very important parameter in surface water quality variability in this region because of the existence of gold and coal mining activities. Threats from acid mine drainage are real.
The study area, as indicated in Fig. 1, shows the spatial relationships of the sampling points located on the Vaal River (VR) and its tributaries as follows: B1-B10 on Blesbokspruit River (BR); K10-K10, K6-K25 and K9-K19 on Klip River (KR); K12-N8 on Natalspruit River (NR); K1-R2 on Withokspruit River, which is a tributary of Rietspruit River (RR); K3-R3 on another tributary of RR; K2-R1 and K4-R4 on RR; S1-S1 and S4-S2 on Suikerbosrant River (SR); and V7-VRB37 and V9-VRB24 on VR.
Water quality data from 13 surface raw water quality monitoring points covering the period 1 January 2003 to 30 November 2009 was manipulated to remove limits of detection as well as gaps in sampling periods. An example of raw data is presented in Table 1 for sampling points Y and Z and for only Chl-α, COD, EC and DOC. The extracted data sample covered 5 July 2004 to 26 July 2004.
Using the list of variables in Table 2, comparisons among points entailed obtaining or converting the raw data to match sampling periods among the points. Although there are several interpolation techniques, cubic interpolation was chosen for the time-series data set because the method is shape-preserving. Interpolation created date-interpolated daily data using Matlab R2009b.
3.1. Manipulating data falling below or above detectable limits
Values reported as above the detection limit (e.g. x > 500) were assumed to be one magnitude higher than the given value, whereas values reported as below the detection limit (e.g. x < 1.1) were replaced by the given value multiplied by 0.75, giving absolute values that could be manipulated as normal data (Ochse, 2007).
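The substitution rule can be sketched in a few lines. The sketch below is in Python rather than the Matlab used in the study, and the function name `censored_to_value`, the accepted text formats and the `above_factor` parameter are illustrative assumptions. The source fixes only the 0.75 multiplier for below-limit values; how "one magnitude higher" translates into a number is not stated, so that factor is left to the caller.

```python
import re

# "<1.1" or "x< 1.1" marks a value below the detection limit.
_BELOW = re.compile(r"^\s*(?:x\s*)?<\s*([0-9.]+)\s*$")
# ">500", "x > 500" or "500 < x" marks a value above the upper limit.
_ABOVE = re.compile(r"^\s*(?:(?:x\s*)?>\s*([0-9.]+)|([0-9.]+)\s*<\s*x)\s*$")


def censored_to_value(entry, above_factor, below_factor=0.75):
    """Turn a reported entry into a plain number.

    below_factor=0.75 follows the text; above_factor is an assumption
    that the caller must choose, because the text does not quantify it.
    """
    text = str(entry)
    match = _BELOW.match(text)
    if match:
        return float(match.group(1)) * below_factor
    match = _ABOVE.match(text)
    if match:
        limit = match.group(1) or match.group(2)
        return float(limit) * above_factor
    return float(text)  # ordinary measurement


# Example: a below-limit entry and an ordinary entry.
print(censored_to_value("x< 1.1", above_factor=10))
print(censored_to_value("72.00", above_factor=10))
```

Keeping both factors as parameters makes the substitution easy to audit and to change if a different convention is adopted for censored values.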
Figure 1. Monitoring points in study area, bounded by the two dams.
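| Date | Y Chl-α | Y COD | Y EC | Y DOC | Z Chl-α | Z COD | Z EC | Z DOC |
|---|---|---|---|---|---|---|---|---|
| 5-Jul-04 | | | | | 17.00 | 19.00 | 105.00 | 4.90 |
| 7-Jul-04 | 8.10 | 20.00 | 80.00 | 8.30 | | | | |
| 12-Jul-04 | | | | | 5.60 | 19.00 | 99.00 | 6.10 |
| 19-Jul-04 | | | | | 8.30 | 21.00 | 96.00 | |
| 21-Jul-04 | 74.00 | 27.00 | 88.00 | 8.70 | | | | |
| 26-Jul-04 | | | | | 6.90 | 24.00 | 97.00 | 5.50 |

Table 1. Raw data for monitoring points Y and Z.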
| Parameter | Unit | Description | Abbreviation |
|---|---|---|---|
| so42_ | mg/L | sulphate | SO₄²⁻ |
| cn_ | mg/L | cyanide | CN⁻ |
| ec | mS/m | conductivity | EC |
| do | mg/L | dissolved oxygen | DO |
| fc | CFU/100mL | faecal coliforms | Fc |
| Hg | µg/L | mercury | Hg |
| Cl_ | mg/L | chloride | Cl⁻ |
| f_ | mg/L | fluoride | F⁻ |
| no2_ | mg/L | nitrite | NO₂⁻ |
| no3_ | mg/L | nitrate | NO₃⁻ |
| Low_Hg | µg/L | low mercury | Hg |
| Mn | mg/L | manganese | Mn |
| pH | - | - | - |
| po43_ | mg/L | phosphate | PO₄³⁻ |
| s | mg/L | sulphur | S |
| ss | mg/L | suspended solids | SS |
| Temp | °C | temperature | - |
| T_Silica | mg/L | total silica | - |
| Turb | NTU | turbidity | - |
| nh4_ | mg/L | ammonium | NH₄⁺ |
| Chla | µg/L | chlorophyll-α | Chl-α |
| cod | mg/L | chemical oxygen demand | COD |
| doc | mg/L | dissolved organic carbon | DOC |
| Mo | mg/L | molybdenum | Mo |
| Si | mg/L | silicon | Si |
| p | mg/L | phosphorus | P |
| Fe | mg/L | iron | Fe |

Table 2. Parameters under consideration.
3.2. Matlab codes for cubic interpolation
3.2.1. Cubic interpolation
Data interpolation is an application based on underlying geometric algorithms. Data may be uniform, that is, sampled over uniform intervals, or it may be scattered, that is, sampled over irregular intervals. When the sample data is scattered, the interpolation techniques use a triangulation-based approach as a basis for computing interpolated values. Table 3 provides the Matlab code for date-interpolating a single column.
To interpolate many columns, the single-column code was adjusted as in Table 4.
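For readers without access to the Matlab listings, the following is a minimal Python sketch of shape-preserving piecewise cubic Hermite interpolation (the `pchip` scheme named in the Matlab warning trace), applied to one column at daily steps. It is not the authors' code from Tables 3 and 4. The function names `pchip_slopes` and `interpolate_daily` are hypothetical, and the sample input is the July 2004 COD series for point Z as read from Table 1.

```python
from bisect import bisect_right
from datetime import date


def _end_slope(h0, h1, m0, m1):
    """One-sided three-point slope at a boundary, limited to keep the shape."""
    d = ((2 * h0 + h1) * m0 - h0 * m1) / (h0 + h1)
    if d * m0 <= 0:
        return 0.0
    if m0 * m1 <= 0 and abs(d) > 3 * abs(m0):
        return 3 * m0
    return d


def pchip_slopes(x, y):
    """Derivative estimates at the knots (Fritsch-Carlson style)."""
    n = len(x)
    h = [x[i + 1] - x[i] for i in range(n - 1)]
    if any(step <= 0 for step in h):
        raise ValueError("dates must be strictly increasing (duplicate date?)")
    m = [(y[i + 1] - y[i]) / h[i] for i in range(n - 1)]
    if n == 2:
        return [m[0], m[0]]  # two points: straight line
    d = [0.0] * n
    for k in range(1, n - 1):
        if m[k - 1] * m[k] > 0:  # same sign: weighted harmonic mean
            w1 = 2 * h[k] + h[k - 1]
            w2 = h[k] + 2 * h[k - 1]
            d[k] = (w1 + w2) / (w1 / m[k - 1] + w2 / m[k])
    d[0] = _end_slope(h[0], h[1], m[0], m[1])
    d[-1] = _end_slope(h[-1], h[-2], m[-1], m[-2])
    return d


def interpolate_daily(dates, values):
    """Return {date: value} for every day from the first to the last sample."""
    if len(dates) != len(values) or len(dates) < 2:
        raise ValueError("need at least two (date, value) pairs")
    x = [d.toordinal() for d in dates]
    slopes = pchip_slopes(x, values)
    daily = {}
    for day in range(x[0], x[-1] + 1):
        i = min(bisect_right(x, day) - 1, len(x) - 2)
        h = x[i + 1] - x[i]
        s = (day - x[i]) / h
        h00 = 2 * s**3 - 3 * s**2 + 1
        h10 = s**3 - 2 * s**2 + s
        h01 = -2 * s**3 + 3 * s**2
        h11 = s**3 - s**2
        daily[date.fromordinal(day)] = (
            h00 * values[i] + h10 * h * slopes[i]
            + h01 * values[i + 1] + h11 * h * slopes[i + 1]
        )
    return daily


# COD at point Z, July 2004 (sampled on the 5th, 12th, 19th and 26th).
sample_dates = [date(2004, 7, 5), date(2004, 7, 12),
                date(2004, 7, 19), date(2004, 7, 26)]
sample_cod = [19.0, 19.0, 21.0, 24.0]
z_cod = interpolate_daily(sample_dates, sample_cod)
print(round(z_cod[date(2004, 7, 13)], 2))  # 19.07
```

For these four readings the sketch gives 19.07 on 13 July 2004, which agrees with the COD entry for point Z in Table 7. Values near the two ends of this short extract differ from Table 7, because the full series has neighbouring samples that change the end slopes. Several columns can be handled by calling `interpolate_daily` once per column on that column's own sampling dates.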
3.2.2. Challenges during interpolation
An empty cell at any position of the matrix, for example a missing date or value, returned an error similar to the one in Table 5.
% Load the data with lots of missing dates. Note that in this example
% missing dates are not represented by NaN but are left out completely
>> [data,textdata] = xlsread('book.xls');
% Convert the text date to date numbers (you may have to change the date
% format depending on how your dates appear in Excel)
Warning: NaN found in Y, interpolation at undefined values will result in undefined values.
In interp1 at 178
Warning: All data points with NaN in their value will be ignored.
In polyfun\private\chckxy at 103
In pchip at 59
In interp1 at 283
Elapsed time is 0.042557 seconds.
Table 5.
NaN.
Another common error was a misplaced decimal point or full stop introduced during data capture (Table 6). Matlab could not manipulate such an entry for interpolation because it was not a numeric value. A duplicated or non-formatted date would also produce an error that required debugging before a complete interpolated data set could be obtained. These, among other similar errors, required manual debugging through each whole data set, each a 2526 x 28 matrix. With a perfect matrix, an interpolation took a fraction of a second.
| Measured parameter | Measured parameter |
|---|---|
| 72.00 | 0.29 |
| 3.75.0 | 0.31 |
| 70.00 | 0.29 |

Table 6. A highlighted error arising from data capture.
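The kinds of capture error described above (empty cells, entries such as 3.75.0 that are not numbers, and duplicated or badly formatted dates) can be located automatically before interpolation. The Python sketch below is one hedged way to do so and is not part of the original Matlab workflow. The function name `find_capture_errors`, the row layout and the `%d-%b-%y` date format (as in 5-Jul-04) are assumptions.

```python
from datetime import datetime


def find_capture_errors(rows, date_format="%d-%b-%y"):
    """Scan rows of (date_text, [cell_text, ...]) and list the problems.

    Returns a list of (row_index, message) tuples; an empty list means
    the rows are ready for interpolation as far as these checks go.
    """
    problems = []
    first_seen = {}  # parsed date -> index of the first row that used it
    for i, (date_text, cells) in enumerate(rows):
        try:
            day = datetime.strptime(date_text.strip(), date_format).date()
        except ValueError:
            problems.append((i, f"unparseable date {date_text!r}"))
        else:
            if day in first_seen:
                problems.append((i, f"date duplicates row {first_seen[day]}"))
            else:
                first_seen[day] = i
        for j, cell in enumerate(cells):
            text = cell.strip()
            if not text:
                problems.append((i, f"column {j}: empty cell"))
                continue
            try:
                float(text)
            except ValueError:
                problems.append((i, f"column {j}: not a number {text!r}"))
    return problems


# The values follow Table 6; the dates are made up for the example.
rows = [
    ("5-Jul-04", ["72.00", "0.29"]),
    ("7-Jul-04", ["3.75.0", "0.31"]),
    ("7-Jul-04", ["70.00", ""]),
]
for row_index, message in find_capture_errors(rows):
    print(row_index, message)
```

A report of this kind points directly at the offending row and column, which reduces the manual search through a 2526 x 28 matrix to a short list of cells to correct.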
The 13 sampling points’ data was interpolated to the same lengths from 1 January 2003 to 30 November 2009, for the 27 parameters, and then combined into one file for processing using Stata, in order to reduce the matrix. Analysis used case-wise correlation, factor analysis, multivariate linear regression and one-way ANOVA.
Initial inspection indicated that the data exhibited gross temporal inconsistency. Sampling dates did not match, in addition to missing values. Table 7 shows the interpolated data for points Z and Y for 5 to 21 July 2004.
| Date | Y Chl-α | Y COD | Y EC | Y DOC | Z Chl-α | Z COD | Z EC | Z DOC |
|---|---|---|---|---|---|---|---|---|
| 5-Jul-04 | | | | | 17.00 | 19.00 | 105.00 | 4.90 |
| 6-Jul-04 | | | | | 16.26 | 19.00 | 104.74 | 4.97 |
| 7-Jul-04 | 8.10 | 20.00 | 80.00 | 8.30 | 14.58 | 19.00 | 104.04 | 5.14 |
| 8-Jul-04 | 8.80 | 20.13 | 80.12 | 8.32 | 12.36 | 19.00 | 103.06 | 5.37 |
| 9-Jul-04 | 10.80 | 20.35 | 80.44 | 8.35 | 9.97 | 19.00 | 101.92 | 5.63 |
| 10-Jul-04 | 13.93 | 20.66 | 80.94 | 8.37 | 7.80 | 19.00 | 100.77 | 5.86 |
| 11-Jul-04 | 18.01 | 21.04 | 81.59 | 8.39 | 6.21 | 19.00 | 99.75 | 6.03 |
| 12-Jul-04 | 22.87 | 21.50 | 82.33 | 8.41 | 5.60 | 19.00 | 99.00 | 6.10 |
| 13-Jul-04 | 28.35 | 22.01 | 83.15 | 8.44 | 5.75 | 19.07 | 98.41 | 6.09 |
| 14-Jul-04 | 34.28 | 22.56 | 84.00 | 8.46 | 6.14 | 19.26 | 97.82 | 6.06 |
| 15-Jul-04 | 40.48 | 23.16 | 84.85 | 8.48 | 6.66 | 19.54 | 97.26 | 6.01 |
| 16-Jul-04 | 46.79 | 23.78 | 85.67 | 8.51 | 7.24 | 19.88 | 96.76 | 5.96 |
| 17-Jul-04 | 53.04 | 24.43 | 86.41 | 8.54 | 7.76 | 20.25 | 96.36 | 5.90 |
| 18-Jul-04 | 59.05 | 25.08 | 87.06 | 8.58 | 8.15 | 20.64 | 96.10 | 5.85 |
| 19-Jul-04 | 64.66 | 25.73 | 87.56 | 8.61 | 8.30 | 21.00 | 96.00 | 5.80 |
| 20-Jul-04 | 69.70 | 26.38 | 87.88 | 8.65 | 8.22 | 21.39 | 96.03 | 5.75 |
| 21-Jul-04 | 74.00 | 27.00 | 88.00 | 8.70 | 8.02 | 21.86 | 96.12 | 5.70 |

Table 7. Date-interpolated data for monitoring points Y and Z.
The full-length raw data set for Z (2003 to 2009), shown in Fig. 2, was interpolated and graphed in Fig. 3. Only 4 of the 27 variables (Chl-α, COD, EC and DOC) were plotted, to reduce congestion and to illustrate the cubic interpolation concept clearly.
Figure 2.
Monitoring point (Z)’s raw input data.
Figure 3.
Monitoring point (Z)’s cubic-interpolated data.
Whereas Fig. 2 showed a legend with 4 data sets, Fig. 3’s legend included the interpolated data, colour-coded for clarity. IChla, Icod, Iec and Idoc (IChl-α, ICOD, IEC and IDOC) represented the interpolations of the 4 variables used. Daily interpolation was chosen for this study because after interpolation, any other data interval, for example monthly or yearly variation, could be computed without repeating the time-consuming interpolation process.
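The point about deriving other intervals from the daily series can be illustrated with a short Python sketch that averages a daily series into monthly means. The function name `monthly_means` and the dictionary layout are assumptions for illustration, and the study's own aggregation steps are not described in the text.

```python
from collections import defaultdict
from datetime import date


def monthly_means(daily):
    """Average a {date: value} daily series into {(year, month): mean}."""
    groups = defaultdict(list)
    for day, value in daily.items():
        groups[(day.year, day.month)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in sorted(groups.items())}


# Two interpolated Chl-α values for point Z from Table 7, used as toy input.
daily_chla = {date(2004, 7, 5): 17.00, date(2004, 7, 6): 16.26}
print(monthly_means(daily_chla))
```

Yearly values follow the same pattern with `day.year` alone as the grouping key, so the interpolation itself never has to be repeated.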
4.1. Case-wise correlation analysis
Case-wise correlation analysis indicated that SO₄²⁻ had a significant linear relationship with all variables except DO. It was strongly positively correlated with EC (0.8720), Cl⁻ (0.7273), S (0.9053) and Mn (0.4779), and strongly negatively correlated with pH (-0.5380). Table 8 provides detailed output.
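To make the idea of a case-wise correlation concrete, the Python sketch below computes Pearson correlation coefficients after dropping every record in which any variable is missing. Reading "case-wise" as this listwise deletion is an assumption about the Stata setup, and the function name `casewise_correlations` and the toy columns are hypothetical; none of the study's data is reproduced.

```python
from math import sqrt


def casewise_correlations(columns):
    """Pearson r for every pair of columns, using complete records only.

    columns: {name: [value or None, ...]} with equal-length lists.
    Returns (number_of_complete_records, {(name_a, name_b): r}).
    """
    names = list(columns)
    records = [
        rec for rec in zip(*(columns[name] for name in names))
        if all(value is not None for value in rec)
    ]
    n = len(records)
    if n < 2:
        raise ValueError("need at least two complete records")
    means = [sum(rec[k] for rec in records) / n for k in range(len(names))]
    result = {}
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            sab = sum((r[a] - means[a]) * (r[b] - means[b]) for r in records)
            saa = sum((r[a] - means[a]) ** 2 for r in records)
            sbb = sum((r[b] - means[b]) ** 2 for r in records)
            if saa == 0 or sbb == 0:
                raise ValueError("a constant column has no correlation")
            result[(names[a], names[b])] = sab / sqrt(saa * sbb)
    return n, result


# Toy columns only; None marks a missing value.
toy = {
    "sulphate": [1.0, 2.0, 3.0, 4.0, None],
    "ec": [2.0, 4.0, 6.0, 8.0, 10.0],
    "ph": [8.0, None, 6.0, 5.0, 4.0],
}
n_used, r = casewise_correlations(toy)
print(n_used, r[("sulphate", "ec")], r[("sulphate", "ph")])
```

Because the interpolated matrix has no gaps, case-wise deletion removes nothing there; the rule matters mainly when raw, uninterpolated columns are compared.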
4.2. Factor analysis
The major aim of factor analysis is to simplify, in an orderly way, a large number of interrelated measures into a few representative constructs or factors (Ho, 2006). The 27 variables were subjected to this technique in order to reduce the data set. The data was collapsed into 3 latent constructs (Table 9 and Table 10).
Their eigenvalues were noted to be 5.82041, 2.62148 and 2.12070. Factors 1 and 3 were cross-loaded, so Table 11 was constructed on the basis that DOC appeared to be conceptually relevant to Factor 3 (physical parameters) while COD remained relevant to Factor 1 (conductivity related). Factor 2 incorporated unique variables which were not cross-loaded into any of the other factors but for which no good common description could readily be assigned. Variables which could not be placed into any of the 3 factors were also deleted from Table 11, effectively reducing the number of variables (see Ho, 2006).
Rotated factor loadings (pattern matrix) and unique variances.
EC and Cl⁻, together with FC, Hg, F⁻, NO₃⁻, Low_Hg, Mn, pH, S, SS, Temp, T_Silica, Turb, NH₄⁺, COD, Si, P and Fe, were good predictors of SO₄²⁻ concentration, and the fitted model explained 82% of the total variation (Table 12).
4.3. One-way ANOVA
Table 13 gives the means and standard deviations for each of the sampling points over the entire sampling period.
Comparison of SO₄²⁻ by sample_ID (Table 14) showed that K6-K25, K9-K19, V7-VRB37 and V9-VRB24; K10-K10 and K3-R3; and K2-R1 and K4-R4, were statistically similar. The mean values of SO₄²⁻ of the remaining sampling points were significantly different.
Analysis of Variance

| Source | SS | df | MS | F | Prob > F |
|---|---|---|---|---|---|
| Between groups | 4.1391e+09 | 12 | 344925487 | 3926.94 | 0.0000 |
| Within groups | 2.8832e+09 | 32825 | 87835.795 | | |
| Total | 7.0223e+09 | 32837 | 213853.757 | | |

Bartlett's test for equal variances: chi2(12) = 7.4e+04, Prob>chi2 = 0.000
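As a hedged illustration of how the quantities in the ANOVA output relate to one another, the Python sketch below computes the between-group and within-group sums of squares, the mean squares and the F ratio for grouped values. The function name `one_way_anova` and the toy groups are assumptions. The study itself used Stata, and neither the p-value nor Bartlett's test is reproduced here.

```python
def one_way_anova(groups):
    """One-way ANOVA table entries for {label: [values, ...]}.

    Returns a dict with SS, df and MS for the between-group and
    within-group sources, plus the F ratio (MS between / MS within).
    """
    all_values = [v for values in groups.values() for v in values]
    n_total = len(all_values)
    k = len(groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and spare observations")
    grand_mean = sum(all_values) / n_total
    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        ss_between += len(values) * (mean - grand_mean) ** 2
        ss_within += sum((v - mean) ** 2 for v in values)
    df_between = k - 1
    df_within = n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "df_between": df_between,
        "ms_between": ms_between,
        "ss_within": ss_within, "df_within": df_within,
        "ms_within": ms_within,
        "f": ms_between / ms_within,
    }


# Toy groups only, standing in for sampling points.
toy_groups = {"A": [1.0, 2.0, 3.0], "B": [2.0, 3.0, 4.0], "C": [5.0, 6.0, 7.0]}
print(one_way_anova(toy_groups)["f"])
```

The same relationship can be checked against the output above: 13 sampling points give 12 between-group degrees of freedom, and dividing the between-group mean square (344925487) by the within-group mean square (87835.795) gives approximately 3926.9, the reported F value.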
Case-wise correlation, focussing on SO₄²⁻, indicated that the variable ‘DO’ was not significant. Among the other significant variables, it was noted that SO₄²⁻ was highly significantly correlated to EC, Cl⁻ and S.
Factor analysis yielded some underlying correlations to support the case-wise correlation analysis. In addition to grouping the variables into 3 factors, the variables which were highly correlated to SO₄²⁻ from case-wise correlation were loaded together with SO₄²⁻ in Factor 1. This was expected because factor analysis is also based on the assumption that all variables are correlated to some degree. Factor 3 was made up of largely physical parameters while Factor 1 contained variables that had something to do with the conductivity of a water sample. Factor 2 did not exhibit any cross-loading with the other 2 factors, yet it was still very difficult to assign a common description to it. Variables CN⁻, DO, FC, F⁻, PO₄³⁻, Chl-α and P could be safely deleted as they were not loaded into any of the 3 factors.
Multivariate linear regression indicated that out of the 26 variables that could predict SO₄²⁻, only 20 were significant, accounting for 82% of the total variation of SO₄²⁻.
While correlation and regression provided linear relationships, factor analysis could be used for data reduction. Although it is sometimes difficult to find a common name to assign to a factor, these statistical approaches allow individual factors, or elements within a factor, to be further analysed as necessary, with minimal loss of data integrity.
From one-way ANOVA, SO₄²⁻ mean concentration values indicated that monitoring point K2-R1 (1128.82±815 mg/L) was within the vicinity of the source of SO₄²⁻. Attenuation of the variable was noted as its mean value decreased along the Rietspruit River at K4-R4 and then along the Klip River at K6-K25 and K9-K19, before the Klip River discharged into the Vaal River. From monitoring point B1-B10 (also close to a source of SO₄²⁻), another established route was through S4-S2, before the Suikerbosrant River discharged into the Vaal River upstream of the Klip River. Surface raw water containing high levels of SO₄²⁻ was not draining via K1-R2 and S1-S1. Based on SO₄²⁻ mean concentration values only, and for management purposes, K1-R2 and S1-S1 could be left out of the monitoring programme, saving on financial resources. Comparison of SO₄²⁻ by sample_ID showed that K6-K25, K9-K19, V7-VRB37 and V9-VRB24; K10-K10 and K3-R3; and K2-R1 and K4-R4, were statistically similar.
The major challenge was pre-processing of the non-consistent water quality data over the 7 years. The inconsistency resulted from missing data, largely where some of the stakeholders dropped or established water quality variables and monitoring points over the years as monitoring priorities changed in response to new and emerging pollution threats. The challenge of insufficient and inconsistent data for water quality modelling remains a limitation in the formulation of good and practically useable models. However, interpolations and correlations, including factor analysis and regression, could help build better data sets, especially for pollution trending in river basin management. This could be used to support large-scale public decisions.
The financial assistance of the South African Department of Science and Technology (DST) is hereby acknowledged. Opinions expressed and conclusions arrived at are those of the authors and are not necessarily to be attributed to the DST. The authors would also like to thank Tshwane University of Technology for hosting and co-funding this research. DWA, the Water Research Commission, Rand Water Board (co-funding), Midvaal Water Company and Sedibeng Water are also sincerely acknowledged, especially for providing very valuable and vital data.
2. Cloot, A. & Roux, G. L. (1997). Modelling algal blooms in the middle Vaal River: a site specific approach. Water Research, 271.
3. DWAF (2007). Integrated water quality management plan for the Vaal River system. Pretoria, South Africa.
4. Dzwairo, B. & Otieno, F. A. O. (2010). Integrating quality and cost of surface raw water: Upper and Middle Vaal Water Management Areas South Africa. Water Science and Technology: Water Supply, 10, 2, 201-207, 1606-9749.
5. Dzwairo, B., Otieno, F. A. O. & Ochieng’, G. M. (2010a). Making a case for systems thinking approach to integrated water resources management (IWRM). International Journal of Water Resources and Environmental Engineering, 1, 5, 107-113, 2141-6613.
6. Dzwairo, B., Otieno, F. A. O., Ochieng’, G. M. & Letsoalo, M. A. (2010b). Downsizing water quality data for river basin management - Focussing on Sulphate: Vaal River, South Africa. Proceedings of the 11th WaterNet/WARFSA/GWP-SA Symposium: ‘IWRM for National and Regional Integration: Where Science, Policy and Practice Meet’. Elephant Hills Hotel, Victoria Falls, Zimbabwe. 27 October - 29 October, 2010.
7. Even, S., Billen, G., Bacq, N., Théry, S., Ruelland, D., Garnier, J., Cugier, P., Poulin, M., Blanc, S., Lamy, F. & Paffoni, C. (2007). New tools for modelling water quality of hydrosystems: An application in the Seine River basin in the frame of the Water Framework Directive. Science of The Total Environment, 274-291.
8. Gouws, K. & Coetzee, P. P. (1997). Determination and partitioning of heavy metals in sediments of the Vaal Dam System by sequential extraction. Water SA, 23, 3, 217.
9. Herold, C. E., Le Roux, P. J., Nyabeze, W. R. & Gerber, A. (2006). WQ2000 Salinity Model: enhancement, technology transfer and implementation of user support for the Vaal system. Umfula Wempilo Consulting. Pretoria, South Africa.
10. Ho, R. (2006). Handbook of univariate and multivariate data analysis and interpretation with SPSS. Florida, Chapman and Hall/CRC: Taylor and Francis Group, 1584886021.
11. Martin, J. L., McCutcheon, S. C. & Martin, M. L. (1998). Hydrodynamics and Transport for Water Quality Modeling. Taylor & Francis, Inc, 9780873716123.
12. Naicker, K., Cukrowska, E. & McCarthy, T. S. (2003). Acid mine drainage arising from gold mining activity in Johannesburg, South Africa and environs. Environmental Pollution, 29-40.
13. Ochse, E. (2007). Seasonal rainfall influences on main pollutants in the Vaal River barrage reservoir: a temporal-spatial perspective. Magister Artium MA, University of Johannesburg.
14. Pieterse, A., Roos, J., Roos, K. & Pienaar, C. (1987). Preliminary observations on cross-channel and vertical heterogeneity in environmental and algological parameters in the Vaal River at Balkfontein, South Africa. Water SA, 12, 4, 173-184, 0378-4738.
15. Stevn, D. J. & Toerien, D. F. (1976). Eutrophication levels of some South African impoundments. IV. Vaal dam. Water SA, 225357.
Written By
Bloodless Dzwairo, George M. Ochieng’, Maupi E. Letsoalo and Fredrick A.O. Otieno
Submitted: November 18th, 2010. Published: August 1st, 2011.