Analysis of Water Quality Data for Scientists



Introduction
In the last few decades the need for stochastic models and the use of time series and data analysis methods in surface and groundwater research has increased greatly. The reason behind this is the increase in the size and number of datasets, which has made it necessary to investigate the connections between random variables and, in the case of time series, their characteristics.
The most often used models are deterministic, although they are prepared from a single sampling event. It must be stated that the statistics and model results obtained from this sampling event can change significantly if the sampling is repeated, because the results are random variables (Kovács & Székely, 2006). In the case of deterministic models this problem is handled by means of sensitivity analyses, so the uncertainty in the applied model remains. This may be the reason why the following can be found in the international literature regarding this question: "The future is stochastic modeling" (Kovács & Szanyi, 2005; Wilkinson, 2006).
This chapter is intended to introduce the application of a few exploratory data analysis techniques, primarily through examples. Exploratory data analysis methods are useful and important tools for obtaining an overview of systems which can be described by many different parameters, for determining the latent and explicit connections between the parameters, and for sorting and grouping the data on a mathematical basis.
The greatest value of this chapter lies in its interdisciplinary character; it casts light on environmental problems originating from a wetland ecosystem, a river and a groundwater system.
Fig. 1. Data in four dimensions (sampling locations (x_i; y_i), parameters, time) (based on Kovács et al., 2008)
As an example, let us imagine that in a certain area many parameters are sampled from more than one groundwater monitoring well (GMW) at the same time (plane S1). The data obtained are then recorded on worksheets where each row corresponds to a GMW and each column to a parameter. This data matrix is considered to be static. Besides the common univariate statistical methods, multi-variate ones can be used, such as cluster, discriminant, principal component and factor analysis, along with multi-dimensional scaling.
Cluster analysis (CA) and multi-dimensional scaling can be used on sampling locations when there is a need to reveal similarities. Another aim can be to determine the background processes that explain most of the variance of the original dataset. This can be achieved using principal component analysis (PCA) or factor analysis (FA). During CA the rows of the data matrix are the objects of the analysis, whereas during PCA or FA its columns are.
In most cases the datasets obtained contain the parameter determining the fourth dimension, time. In this case the data matrix is not static. Staying with the previous example, if more than one GMW is sampled equidistantly in time, one is dealing with the problems described in plane S2. In this plane the most frequent question is which background processes drive the temporal fluctuation of the sampled parameter (in our case the water level). Because consecutive temporal samples are not independent of each other, only dynamic factor analysis (DFA) can be applied to answer the question raised (Márkus et al., 1999; Ritter et al., 2007). Its application began only in the last years of the 20th century, and with its promising results its role in solving environmental problems is expected to increase (Kovács et al., 2004). Plane S3 is where the time series of multiple parameters are examined at only one sampling location. Here "classical time series methods" (Shumway & Davis, 2000; Hans, 2005) can be employed, the use of which implies determining, among other things, the trend and periodicity of the parameters. In many cases determining these two characteristics is a key question of the study. If the periodicity and trend of a certain process are extracted they can be used for forecasting, but only if one is certain that the driving processes still exist and will continue to exist in the distant future. However, further expansion of this topic is not an aim of this chapter.
Returning to the spatial alignment of the sampling locations: with two spatial coordinates assigned to every parameter sample, one is able to visualize its spatial distribution (on isoline maps), which can be of great use. However, if the spatiotemporal changes of a parameter are of interest (for example in non-stationary cases), only a few tools are at hand, and the research to solve four-dimensional problems is ongoing. Here we refer the reader to Dryden et al. (2005).

Common problems in data handling
Accurate results can only be expected from multi-variate methods if the datasets used contain the information needed to describe the investigated processes precisely; in other words, if the amount of data obtained is sufficient. Determining what amount of information is sufficient is the duty and responsibility of the given discipline. As an important requirement, the number of sampling locations should exceed the number of analyzed parameters (Füstös et al., 1986); in this way statistical stability is ensured.
One of the most important criteria regarding the data matrix is that there should be no data missing. Unfortunately, in many cases (mostly with water quality data) data are missing from the datasets. The solution could be data replacement, but this has to be done with caution. In practice it often happens that the missing data are replaced with '0'. This is a serious error and must always be corrected. Another frequent mistake occurs when the values of a measured parameter are below the detection limit and the analyst therefore sets the value in the matrix to, for example, the detection limit or half of the detection limit. A dataset like this can give misleading results.
Extreme or outlying values can also lead to inaccurate results. Deciding precisely which datum is really outlying or extreme and which one is mistyped is a key question. The variability of the parameter in question can be of help in deciding it.
In water quality data it is often observable that a parameter is a linear combination of one or more others. Such a parameter cannot, of course, be used in the course of multi-variate data analysis. These criteria can only be met if the data matrix is checked for these and other kinds of errors before analysis. This is the most tedious and time-consuming part of the research. However, skipping this step will inevitably lead to incorrect results and conclusions.
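The handling of missing and below-detection-limit entries can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the case studies: the list `raw`, the `"<0.05"` notation and the helper `parse_value` are hypothetical. Missing entries are stored as NaN rather than 0, and censored values are flagged so that they are not silently treated as measurements.

```python
import numpy as np

# Hypothetical concentrations; None marks a missing sample and
# "<0.05" a result reported below the detection limit.
raw = ["1.20", None, "<0.05", "0.80", "<0.05", "2.10"]


def parse_value(entry):
    """Return (value, censored); missing entries become NaN, never 0."""
    if entry is None:
        return np.nan, False
    if entry.startswith("<"):
        # Keep the detection limit for reference, but flag it as censored
        return float(entry[1:]), True
    return float(entry), False


parsed = [parse_value(e) for e in raw]
values = np.array([v for v, _ in parsed])
censored = np.array([c for _, c in parsed])

n_missing = int(np.isnan(values).sum())   # entries with no result at all
n_censored = int(censored.sum())          # entries below the detection limit
# Mean of the genuinely measured values only (NaN is skipped)
mean_measured = np.nanmean(values[~censored])
```

Keeping the two flags separate lets the analyst decide later, and explicitly, how censored and missing values should be treated.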
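A linear dependence between parameters can be detected before the analysis, for instance through the rank of the centered data matrix. The following is a sketch on synthetic data; the function name `has_linear_dependence` and the generated matrix are assumptions made for illustration. A typical real-world case would be a summed parameter stored next to its components.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical matrix: 20 sampling locations x 3 independent parameters
base = rng.normal(size=(20, 3))
# A fourth parameter that is an exact linear combination of the first two
dependent = 2.0 * base[:, 0] - 0.5 * base[:, 1]
data = np.column_stack([base, dependent])


def has_linear_dependence(matrix):
    """True if the centered columns do not have full column rank."""
    centered = matrix - matrix.mean(axis=0)
    return np.linalg.matrix_rank(centered) < matrix.shape[1]


dependent_found = has_linear_dependence(data)   # fourth column is redundant
clean_matrix_ok = not has_linear_dependence(base)
```

The rank test only reports that a dependence exists; which parameter to drop remains a decision for the analyst.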

Applied methods
The most effective order in which the presented methods should be used is the following. After checking the data matrix for the errors described above, it is necessary to examine the data using uni-variate methods such as descriptive statistics, distribution analysis and hypothesis testing of some sort, and finally to determine the stochastic connections with correlation analysis (Helsel & Hirsch, 2002). Out of these methods, correlation analysis is the one that will be discussed. The next step is the application of multi-variate methods. The first method suggested is cluster analysis. Its results are groups of similar sampling locations. Discriminant analysis is suggested as a verification tool, and Wilks' lambda distribution as a tool for determining which parameters influenced the formation of the cluster groups the most, along with an overview of the groups' statistics (box-and-whiskers plots).
As a last step it is proposed to determine the driving background processes using PCA and, if possible, to visualize the results on maps for better interpretation.

Correlation (stochastic connections)
A frequent question is what kind of relation can be revealed between two parameters. The strength of the connection should be described numerically. The most common measure is the Pearson correlation coefficient, which measures the strength and direction of a linear relationship between two variables. It is calculated as follows:
$$ r_{(x,y)} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \, \sigma_Y} $$
where cov(X,Y) is the covariance of X and Y, and σ_X and σ_Y are their standard deviations. This means that the correlation coefficient retains every good quality of the covariance, while the division by the standard deviations makes it independent of measurement units and also solves the problem of the upper and lower boundaries. The properties of the correlation coefficient are the following:
• -1 ≤ r(x,y) ≤ 1
• if the relation between X and Y is positive, then r(x,y) > 0; if it is negative, then r(x,y) < 0
• if the correlation coefficient is zero the two variables are uncorrelated, however this does not mean that they are independent
• if two variables are independent then r(x,y) = 0
In the studies presented, correlation coefficients higher than 0.71 (in absolute value) were considered to represent a strong linear relationship (Füst, 1997).
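The coefficient and the 0.71 threshold can be illustrated with a short sketch. The function names `pearson_r` and `is_strong` and the sample vectors are hypothetical. The second example shows a pair of variables that is uncorrelated although one is a deterministic function of the other.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


def is_strong(r, threshold=0.71):
    """Strong linear relationship in the sense used in this chapter."""
    return abs(r) > threshold


x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
r_linear = pearson_r(x, 2.0 * x + 1.0)   # perfectly linear relation
r_square = pearson_r(x, x ** 2)          # dependent, yet uncorrelated
```

Because the coefficient only captures linear relationships, a value near zero should prompt a look at the scatterplot rather than the conclusion that the parameters are unrelated.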

Cluster
Clustering is a kind of coding, in which a certain sampling location, originally described with many parameters (runoff, chemical oxygen demand etc.), is now described with only one value, its group code (cluster number). It is important to note that during clustering it is not the number of parameters but the number of sampling locations that is decreased, by placing the similar ones into groups. It is an important criterion that every sampling location has to belong to a group, but only to one group. Obviously there are many possible group configurations. The main aim is to place the similar sampling locations into the same group; this similarity, however, has to be measured by assigning a distance (metric) to each pair of sampling locations, which are placed in an N-dimensional space. If the distance between two sampling locations is small, then they are highly similar to each other. If the distance is zero they are perfectly similar. From this it should be clear that choosing the right distance is a key question. It needs skill and practice, which is why the verification of cluster results is compulsory.
"Cluster analysis (CA) classifies a set of observations into two or more mutually exclusive "unknown" groups based on combinations of interval variables. The purpose of cluster analysis is to discover a system of organizing observations, usually people, into groups, where members of the groups share properties in common" (Stockburger, 2001).
There are basically two types of clustering, K-Means CA and Hierarchical CA. In the former, which is frequently suggested for large datasets, one has to predetermine how many groups are required; in the latter one only has to determine the groups after obtaining the dendrogram, the graphical output of the CA. In this study divisive Hierarchical CA was applied, where one group is divided into more and more groups. Its opposite is agglomerative Hierarchical CA, in which the number of groups is reduced during the analysis.
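A hierarchical clustering of sampling locations can be sketched with SciPy as shown below. Note that SciPy's `linkage` implements the agglomerative variant, not the divisive one applied in this study. The data matrix, the standardization step and the choice of Ward linkage with Euclidean distance are assumptions made for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical data matrix: 6 sampling locations (rows) x 2 parameters
data = np.array([
    [1.0, 10.0], [1.2, 11.0], [0.9, 9.5],
    [5.0, 30.0], [5.3, 31.0], [4.8, 29.0],
])

# Standardize the columns so that no parameter dominates the distance
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# Agglomerative clustering (Ward's method, Euclidean distance)
tree = linkage(z, method="ward")

# Cut the tree into at most two groups; each location gets one group code
labels = fcluster(tree, t=2, criterion="maxclust")
```

The array `tree` can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the dendrogram from which the number of groups is normally chosen.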

Discriminant analysis and Wilks' lambda distribution
To verify the accuracy of the results, discriminant analysis can be used. It shows to what extent the planes separating the groups can be distinguished by building a predictive model for group membership. The model is composed of a discriminant function (for more than two groups a set of discriminant functions) based on linear combinations of the predictor variables that provide the best discrimination between the groups. The functions are generated from a sample of cases for which the group membership is known; the functions can then be applied to new cases that have measurements for the predictor variables but whose group membership is as yet unknown (Afifi et al., 2004). The result of the discriminant analysis is often visualized on the surface stretched between the first two discriminating planes (function 1 & function 2, e.g. Fig. 13) (Ketskeméty & Izsó, 2005).
After the verification of the cluster groups, the role of each parameter in determining the formation of the cluster groups should be analyzed. Using Wilks' λ distribution, a Wilks' λ quotient is assigned to every parameter, where the quotient is:
$$ \lambda = \frac{\sum_{i}\sum_{j}\left(x_{ij}-\bar{x}_{i}\right)^{2}}{\sum_{i}\sum_{j}\left(x_{ij}-\bar{x}\right)^{2}} $$
where $x_{ij}$ is the j-th element of the i-th group, $\bar{x}_{i}$ is the mean of the i-th group, and $\bar{x}$ is the total mean.
The value of λ is the ratio of the within-group sum of squares to the total sum of squares. It is a number between 0 and 1. If λ=1, then the mean of the discriminant scores is the same in all groups and there is no inter-group variability, so in our case the parameter did not affect the formation of the cluster groups (Afifi et al., op. cit.). On the contrary, if λ=0, then that particular parameter affected the formation of the cluster groups the most. The smaller the quotient is, the more the parameter determines the formation of the cluster groups.
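The quotient described above, the within-group sum of squares divided by the total sum of squares, can be computed for a single parameter as in the following sketch. The function name `wilks_lambda` and the two example parameters are hypothetical.

```python
import numpy as np


def wilks_lambda(values, groups):
    """Within-group sum of squares divided by the total sum of squares."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    total_ss = ((values - values.mean()) ** 2).sum()
    within_ss = 0.0
    for g in np.unique(groups):
        members = values[groups == g]
        within_ss += ((members - members.mean()) ** 2).sum()
    return within_ss / total_ss


groups = np.array([1, 1, 1, 2, 2, 2])   # cluster codes of six locations
# A parameter whose values differ strongly between the two groups
lam_separating = wilks_lambda([1, 2, 3, 11, 12, 13], groups)
# A parameter with identical values in both groups
lam_neutral = wilks_lambda([1, 2, 3, 1, 2, 3], groups)
```

The first parameter yields a quotient close to 0, the second a quotient of 1, in line with the interpretation given above.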

Box-and-whiskers-plots
Box-and-whiskers plots are great tools for visualizing more than one statistic of a parameter on one graph, making the interpretation clearer. The boxes show the interquartile range and the black line in the box is the median. The two upright lines (whiskers) represent the data within 1.5 times the interquartile range of the box. The data lying between 1.5 and 3 times the interquartile range from the box are indicated with a circle (outliers), and those lying more than 3 times the interquartile range from the box are considered extreme values and are indicated with an asterisk (Norusis, 1993). For an example see Fig. 8.
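The classification of values into regular, outlying and extreme can be sketched as follows. The function name `classify_boxplot` and the sample values are hypothetical, and the quartiles are computed with NumPy's default linear interpolation, which may differ slightly from the quartile definition used by a given statistical package. The sketch assumes a non-zero interquartile range.

```python
import numpy as np


def classify_boxplot(values):
    """Label each value as 'regular', 'outlier' or 'extreme'."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1  # assumed to be non-zero
    # Distance outside the box, expressed in units of the IQR
    dist = np.maximum(q1 - values, values - q3) / iqr
    return np.where(dist > 3, "extreme",
                    np.where(dist > 1.5, "outlier", "regular"))


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 40]
labels = classify_boxplot(sample)
```

For this sample the quartiles are 3.5 and 8.5, so the value 20 lies 2.3 interquartile ranges above the box (outlier) and the value 40 lies 6.3 above it (extreme).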
For better interpretation it is necessary in every case possible to visualize the results on maps of some sort.

Principal component analysis compared with factor analysis
The principal component analysis (PCA) and factor analysis (FA) methods are used to analyze multidimensional data. The goal is to determine the background processes, while describing the observed parameters with fewer hypothetical variables without any significant loss of the information in the original data.
During PCA the measured chemical and physical parameters are correlated, whereas the hypothetical variables (called principal components) are uncorrelated and are obtained as linear combinations of the original parameters. The PCA decomposes the total variance of the original variables into principal components which explain the original variance in a monotonically decreasing way. The correlation coefficients between the original parameters and the principal components are the factor loadings. They express the weights of the original parameters in the principal components; however, they do not give an exact answer as to whether a weight has to be considered significant or not, or how many principal components are important.
During FA the hypothetical variables are called factors. They are classified into three categories: (1) common factors (influencing multiple parameters), (2) specific factors (influencing one parameter) and (3) error factors (arising for example from inaccuracy during the measurement). In this method only the common factors and their factor loadings are determined, because identifying specific and error factors along with their factor loadings usually causes mathematical difficulties. As a result, the common factors only explain a part of the total variance of the original parameters. While the PCA uses a correlation matrix, the FA uses an adjusted correlation matrix. In this matrix the elements on the diagonal (communalities) can be estimated in different ways; as a result, different solutions (of the FA) may be acceptable. The basic model of FA suits the conditions of datasets in earth sciences better (Geiger, 2007), which may be the reason for its successful application (Voudouris et al., 1997; 2000). In this chapter only PCA was applied.
In regard to the software used, we suggest STATISTICA for its good visual output and user-friendly interface, SPSS because it is more commonly used and for its "syntax system", and last but not least R, because it is free, open-source software and the most up-to-date software available.
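A PCA based on the correlation matrix can be sketched through an eigendecomposition. The function name `pca_correlation` and the synthetic data are assumptions for illustration. The loadings are obtained as eigenvectors scaled by the square roots of the eigenvalues, which makes them equal to the correlations between the original parameters and the principal components.

```python
import numpy as np


def pca_correlation(data):
    """Return explained variance ratios and loadings from the correlation matrix."""
    data = np.asarray(data, dtype=float)
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # sort in decreasing order
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    # Loadings: correlations between parameters (rows) and components (columns)
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return explained, loadings


rng = np.random.default_rng(1)
common = rng.normal(size=100)                    # a shared background process
data = np.column_stack([
    common + 0.3 * rng.normal(size=100),         # parameter driven by it
    common + 0.3 * rng.normal(size=100),         # second parameter driven by it
    rng.normal(size=100),                        # unrelated parameter
])
explained, loadings = pca_correlation(data)
```

In this synthetic case the first component carries most of the variance and the two parameters driven by the common process have the largest loadings (in absolute value) on it.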

Case studies giving an example on the data analysis methods
In sections 3.1-3.3 three case studies are presented to give an example of the most efficient application of the basic data analysis methods (section 2) using the datasets of a mitigation wetland, a river and a groundwater area (Fig. 2).
Fig. 2. The location of the three case studies numbered in order of appearance

The Kis-Balaton Water Protection System (KBWPS)
The KBWPS is a mitigation wetland located at the mouth of the River Zala at Lake Balaton, the largest shallow freshwater lake in Central Europe (Padisák & Reynolds, 2003). It was established on the remains of the original Kis-Balaton Wetland (KBW), which, due to artificial water level modifications, decreased in area and functionality in the late 19th century. The original KBW used to naturally filter the waters of the River Zala, which supplies 45% of Lake Balaton's water and 35-40% of its nutrient input (Lotz, 1988; Kovács et al., 2010).
During the time period between the function loss of the KBW and the construction of the KBWPS, the Zala's waters were less filtered. This resulted in the deterioration of Lake Balaton's water quality. The KBWPS was constructed to retain nutrients at the mouth of the River Zala and to stop the degradation of Balaton's water quality (Tátrai et al., 2000; Kovács et al., 2010).

The structure of the KBWPS
The construction of the KBWPS was planned to take place in two phases (Fig. 3) as an extended wetland. Phase I was finished in 1985 after a five-step flooding (Korponai et al., 2010). It resulted in a eutrophic pond, which gives ideal conditions for algae to reproduce.
Phase II remains incomplete. Only a 16 km² area has been functioning since 1992. Its habitat could be described as a "classic" wetland, with 95% macrophyte coverage, primarily reeds (Nguyen et al., 2005; Tátrai et al., op. cit.). It is a highly protected nature conservation area and under the de jure protection of the Ramsar Convention (1971).

Sampling locations and examined parameters on the KBWPS
Since the installation of Phase I weekly sampling has been carried out by the laboratory of the West Transdanubian Water Authority's Kis-Balaton Department at thirteen sampling locations. In the following, the most important locations are described in detail, heading downstream.
• "Z15", the inlet of Phase I, which typifies the water of the River Zala
• "Kb4; Kb6; Kb7"
• "Kb9", the sampling location that typifies the waters of the Cassette, which is the site of biological experiments, and is the most eutrophic place in the system
• "Kb10"
• "Z11", the interface between Phase I and Phase II, which typifies the water generated by Phase I, the eutrophic pond
• "202i; 203; 209; 210"
• "205", the location where the Combined Belt Canal and drainage pipes from fishing lakes coming from Somogy County enter the KBWPS
• "Z27", the outlet of the KBWPS, which typifies the water output of Phase II, the "classic" wetland area of the KBWPS
In this study the following parameters were examined for the investigated time interval with the methods described in section 2.2.
The sampling was conducted and the samples were prepared according to the current Hungarian standards (MSZ 12749 & MSZ 12750).
In the case of the KBWPS the dataset was prepared by the same authority during the whole investigated time period. In most cases, however, the scientist is not this fortunate and has to deal with problems that originate from different sampling methods and even different standards (in the case of international research). To explore these problems, and perhaps even solve them, a pre-analysis and probably a cross-verification of the data must be carried out before the analysis. Even in our case, the KBWPS, every single datum was checked to see whether it was an extreme, an outlying or a mistyped value. An x-y scatterplot was created for each parameter for the whole time interval and analyzed manually. It would not have been sufficient to set a high and a low margin or even to apply the three-sigma rule (Pukelsheim, 1994), because there were certain events that resulted in extremely high values (e.g. floods) that can easily be recognized from their surroundings in the graphs, and these would have been falsely discarded.
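The limitation of the three-sigma rule mentioned above can be illustrated with a short sketch. The function name `three_sigma_flags` and the series `runoff` are hypothetical; the last value stands for a flood event. The rule flags the flood peak, although it is a real measurement, which is why flagged values are better treated as candidates for manual review than deleted automatically.

```python
import numpy as np


def three_sigma_flags(values):
    """True where a value lies more than three standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std(ddof=1)   # sample standard deviation
    return np.abs(values - mean) > 3 * std


# Hypothetical weekly series with a flood peak as the last value
runoff = np.array([10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
                   11, 9, 10, 12, 10, 9, 11, 10, 10, 100], dtype=float)
flags = three_sigma_flags(runoff)

# Indices to be inspected on the scatterplot instead of being discarded
to_review = np.flatnonzero(flags)
```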

Cluster-, discriminant and Wilks' lambda analyses
Using cluster analysis on the 13 sampling locations and the annual averages of the parameters specified in the table for the years 1984-2008, we were able to point out the alignment of the similar sampling locations for each year. The sampling locations basically grouped according to the two constructional phases, but there were interesting exceptions (Fig. 4).

• Between 1997 and 1998 sampling location 202i disconnects from Phase II and connects to the cluster group covering Phase I. This is the only location that changes its orientation and keeps it for the rest of the time analyzed. This occurred because of the constantly high water level of the area. Because of that, the reeds died out and the surroundings of sampling location 202i became similar to Phase I: an open and eutrophic water space became dominant and the area of the "classic" wetland decreased, along with the system's efficiency (Hatvani et al., 2011).
• Sampling locations Kb9 and 205 form separate groups, the former in 1997 and 1999, and the latter in 1996 and 1997. The former does so because it is located in the Cassette, which constitutes a separate water space with almost no water flow and a long water retention time, and is therefore highly eutrophic; the latter (205) does so because the Combined Belt Canal and the drainage pipes of fishing lakes from Somogy County join the system here.
If one uses cluster analysis the results always have to be verified. The discriminant analysis showed that 100.0% of the original grouped cases were correctly classified. Using cluster and discriminant analyses it was possible to point out key problems in the water level management of the system (Hatvani et al., 2011). For better interpretation it is suggested that the spatial cluster results be visualized on a map of some sort in every possible case.
As discussed previously, Wilks' lambda distribution determines the parameters that affect the formation of the previously discussed cluster groups the most. A Wilks' λ quotient was assigned to every parameter for every year and then clustered. Grouping the eleven quotients, the parameters were separated into three different groups according to how much they determine the original spatial cluster groups (Fig. 5).
Group 1 contains: chlorophyll-a, dissolved P, suspended solids, pH, CO₃²⁻, HCO₃⁻ and dissolved O₂. These parameters had the smallest Wilks' λ quotients (average: 0.32), so they affected the formation of the cluster groups the most and were responsible for the separation of the cluster group covering the eutrophic pond.
Group 2 contains the parameters which also stand in close relation to the eutrophication processes: Ca²⁺, COD_C, total N and total P; their average quotient was 0.55.
Finally, the parameters in Group 3 (NH₄-N, NO₂-N, NO₃-N, Mg²⁺, SO₄²⁻, Cl⁻, Fe²⁺, Mn²⁺, Na⁺, K⁺) generated an average quotient of 0.69 during the Wilks' λ distribution, so they affect the orientation of the spatial cluster groups the least and play a great role in the separation of the cluster group covering the Wetland (Phase II).
With these three methods used together a global picture was obtained regarding the similarities of the KBWPS' sampling locations and the parameters that drive these similarities. After gaining knowledge concerning the connection between the sampling locations and the parameters behind them, the next step would be to familiarize oneself with the processes evolving in the different areas of the water protection system.

Stochastic connections
From the correlation matrix it became clear that at Z15 only the phosphorus forms (TP, SRP, dissolved P), and at Z11 and Z27, besides the phosphorus forms, the Na⁺ and Cl⁻ ions have correlation coefficients higher than 0.71 (in absolute value). Meanwhile, the number of weak linear connections (correlation coefficients between ±0.2) at the three sampling locations is 137, 120 and 136, respectively.
In summary it can be said that in the eutrophic area of the KBWPS (Phase I) there are fewer weak linear connections than in the riverine or wetland areas.However, regarding the whole system, the linear connection between the 22 parameters cannot be considered generally significant.This is reflected in the following PCA results.

Principal component analysis
The PCA results presented concern the three cardinal sampling locations and one peculiar sampling location of the KBWPS (Z15, Z11, Z27 and Kb9) and their surroundings. Their summer datasets were analyzed for the time interval 1984-2008.
The aim was to find the parameters that determine the processes evolving mainly in Phase I of the Kis-Balaton Water Protection System (KBWPS).
The biggest problem which had to be faced in the course of the PCA was that the datasets were time series, where data follow each other and are therefore not independent‡ because they are too close to each other in time, and the PCA cannot handle this kind of input§. To solve this problem only three summer months (May-July-September) were analyzed, with two-month gaps in between. This is the growing season, when primary production is most intensive (Wetzel, 2001).
‡ The spatial or temporal distance at which consecutive data become independent can be measured, for example, with variogram analysis (Kovács et al., 2011).
Regarding the results, only the first principal component (PC) can be considered significant. It explains 25 to 30% of the data's variance, while the second PC only explained 15-19%, and so was discarded. The low explanatory value may originate from the parameters' weak linear connections, reflected in the correlation matrices discussed in section 3.1.3.2.
If the constitution of the first component is analyzed in more detail, the following can be said:
• At Z15 (inlet, the River Zala) the parameters TP, dissolved P and SRP are present in the first PC. This corresponds to what is already known about the River Zala, namely that there is practically no planktonic life in its water, and benthic eutrophication is not dominant in its sediment. These parameters originate from diffuse loads, thus their concentration only depends on the runoff of the River Zala. The highest concentrations were observable at peak flooding times.
• At Kb9 (Cassette) the parameters Chl-a and TP are present in the first PC. These are the main parameters that the OECD (Vollenweider & Kerekes, 1982) uses to classify trophic conditions. Their presence is no surprise, because at Kb9 the water is still, with almost no water flow and a long hydraulic residence time. Bottom-up** processes dominate at this sampling location and its processes can be described with the Vollenweider model (Vollenweider, 1976).
In summary it can be stated that PCA is a universal tool for determining the dominant processes of the area of a certain sampling location or of a whole system. The facts that were perceptible to the naked eye, such as the eutrophic conditions of the Cassette, are now expressed in numbers and can be the subject of further scientific studies.

Conclusions regarding the results obtained from the KBWPS
The development and the functioning of the KBWPS is a good testing ground for new habitat remediation techniques and a good example for presenting the application of the methods described above. The result of each method gave extra information to scientists and confirmed their previous suspicions. For example, the constant water level deteriorates the wetland vegetation (Pomogyi et al., 1996) and therefore decreases the efficiency of the system, and this process reached a peak in 1997 and 1998. It was always known that the water need of different vegetation types is different; the key result was that the irreversible change at 202i happened in those particular years. This was just one example highlighted regarding the mitigation wetland.
§ Instead of PCA, dynamic factor analysis was developed to handle temporal dependence.
** Bottom-up control: ecological scenario in which the abundance or biomass of organisms is mainly determined by a lack of resources and mortality owing to starvation (Pernthaler, 2005).
In the next section (3.2) the same methods are applied to the River Tisza, the largest tributary of the Danube, Europe's second longest river.

River Tisza
The River Tisza collects the waters of the Carpathian Basin's eastern region. According to Lászlóffy (1982), its watershed area is 157,186 km². Less than one third of this is located in Hungary (Fig. 6). From its spring in the Maramureşului Mountains to its confluence with the Danube, it stretches for 966 km across Ukraine, Romania, Slovakia, Hungary and Serbia (Sakan et al., 2007). The Hungarian section of the Tisza from border to border is 594.5 km long. Its average runoff is 25.4 billion m³ per year (Pécsi, 1969).
Despite the fact that in the last one and a half centuries numerous anthropogenic activities have influenced this area, it is still considered to have one of the most natural river valleys among Europe's large rivers (Zsuga & Szabó, 2005). For this reason it is in our common interest to protect it. If we only take Hungary into account, the lives of approximately 400 settlements and 1,500,000 inhabitants depend on its runoff and water quality.

Sampling locations and examined parameters on the River Tisza
Many surface waters are monitored as part of the National Sampling Network. In the case of the Tisza, data from the first five Hungarian sampling locations were analyzed (258.7 river km) (Fig. 7). The exact specifications of these monitoring locations can be found in Hungarian Standard No. MSZ 12749:1993.

Results of the analyses conducted on the River Tisza
Many temporal approaches can be employed in such work: the whole year can be examined, or the dataset can be separated into seasons. In the former case the processes of the whole year can be followed, in the latter the seasonal changes. To shed light on the seasonal changes, the winter and summer data were analyzed separately. Summer was considered to last from June to October, and winter from November to March.
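The seasonal split defined above can be expressed as a small helper. The function name `season` is hypothetical. Months that fall into neither season (April and May) are returned as `None`, so that they can be excluded explicitly from the seasonal analyses.

```python
def season(month):
    """Map a month number (1-12) to 'summer', 'winter' or None."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    if 6 <= month <= 10:            # June to October
        return "summer"
    if month >= 11 or month <= 3:   # November to March
        return "winter"
    return None                     # April and May belong to neither season


# Example: assign a season to each month of a hypothetical sampling calendar
calendar = {m: season(m) for m in range(1, 13)}
```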

Cluster-and discriminant analyses and Wilks' lambda statistics
Regarding the River Tisza's data series, cluster analysis was conducted on averages formed from the parameters. This approach gives a longer perspective on the data. One average was formed at each sampling location from each parameter's total dataset.
The cluster analysis resulted in three groups. As a verification tool, discriminant analysis was applied.
†† Council for Mutual Economic Assistance, founded by the Soviet Union in 1949.
‡‡ Mineral-N is the sum of NH4-nitrogen, NO2-nitrogen and NO3-nitrogen.
The main question was whether these groups (formed from the averages) are present at all and whether they are discoverable if the data are analysed in more detail (not as averages). The answer is yes.
By using discriminant analysis on the discrete data, the cluster groups formed from the averages were reproduced, for example, to 94.8% in summer and 89% in winter. This means that the clustering from the averages is correct and representative.
To stay with our examples, Table 1 shows the Wilks' lambda coefficient in summer and winter for each parameter. It is clear from Table 1 that the sulphate ion is the most determining in both seasons. Parallel to the Wilks' lambda distribution, it is suggested that the parameters' variability be analyzed, because it gives a much wider picture of certain parameters. In Fig. 8, three parameters are presented on box-and-whiskers plots: one with a small (sulphate, Fig. 8/A), one with a medium (calcium, Fig. 8/B) and one with a high (pH, Fig. 8/C) Wilks' lambda coefficient. It is clear that sulphate is the most variable, calcium is less so, and pH (which, having the highest coefficient, was the least influential regarding the Wilks' lambda distribution) is the least variable parameter.

Stochastic connections
The stochastic connections were analyzed using correlation analysis. The connection between the parameters was analyzed through different approaches. First the whole year, then the different seasons (winter, summer) were taken into account.
According to each approach (whole year, winter/summer), the number of strong correlations increases downstream. At Tiszabecs, the number of strong correlations is only seven (Table 2), while at Tiszalök it reaches 36. This can be explained by the flow conditions. In the area of the water barrage system the water flow slows down and (according to the spiral model§§) so does physical transport; suspended solids are deposited, the water becomes more transparent and light limitation decreases (Padisák, 2005). This gives organisms an opportunity to incorporate the nutrients into their systems faster and more efficiently. Usually the River Tisza is autotrophic during the summer months, but tributary input may considerably exceed net autochthonous production (Istvánovics et al., 2010). During the summer months there were fewer strong linear connections than in the winter months. If the results from the whole year are compared to the summer and winter ones, it can be stated that the correlation matrix obtained from the winter data more closely resembles the annual correlation matrix than does the summer one.
As can be seen in Table 2, in the Tiszalök area the number of correlating parameters suddenly rises in both winter and summer.
Summarizing the correlation analysis, it can be stated that the number of correlations increases downstream. The annual, winter and summer results differ both in the number of correlations and in the parameters which correlate. During the summer there are fewer linear connections, but these few are between the parameters related to the organic processes.
As previously stated, if the results for the whole year are compared to the summer and winter ones, it can be seen that the correlation matrix obtained from the winter data more closely resembles the annual correlation matrix than does the summer one. So, if just the whole year had been analyzed, vital differences between the summer and winter would have been lost.
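Counting strong linear connections at a sampling location, as summarized in Table 2, can be sketched as below. The threshold |r| ≥ 0.7 is the one given in the caption of Table 2; the function names and the parameter values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def strong_pairs(data, threshold=0.7):
    """Return (parameter, parameter, r) for every pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(sorted(data), 2):
        r = pearson_r(data[a], data[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical measurements at one sampling location.
location_data = {
    "Ca": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Mg": [2.0, 4.0, 6.0, 8.0, 10.5],
    "NO3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
pairs = strong_pairs(location_data)
print(len(pairs), [(a, b) for a, b, _ in pairs])
```

Running the same count on the whole-year, summer and winter subsets of each location gives the kind of comparison discussed above.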
After analyzing the connections between the SLs based on the sampled parameters, the next step is to take a closer look at the processes evolving in the river.
§§ The riverine spiral model describes recycling of nutrients together with physical transport downstream (Padisák, 2005).

Principal component analysis
To determine which background processes govern the water quality of the River Tisza, PCA was applied to the summer, winter and whole year data.
Before the PCA was conducted, the number of parameters had to be decreased, either because a parameter was not sampled during certain time periods or because it was unsystematically sampled over the whole investigated time period. In other cases, the parameter itself contained information concerning other parameters (e.g. specific conductance). There are other cases in which the dataset has to be reduced; more examples can be seen in section 3.3.2.3.
In terms of their importance, only the parameters with a factor score (in absolute value) higher than 0.7 were taken into account in the first and second principal components (PC).
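The selection rule above can be illustrated with a minimal PCA sketch on the correlation matrix. It assumes that the 0.7 cut-off is applied to the loadings (the correlations between parameters and components), which is how the "factor scores" of the text are interpreted here; the synthetic data, the parameter names and the helper functions are hypothetical.

```python
import numpy as np


def pca_loadings(data):
    """PCA on the correlation matrix; returns loadings and explained-variance ratios."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Loading = eigenvector scaled by the square root of its eigenvalue.
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return loadings, eigvals / eigvals.sum()


def important_parameters(loadings, names, component, threshold=0.7):
    """Names of parameters whose absolute loading on one component exceeds the threshold."""
    return [n for n, value in zip(names, loadings[:, component]) if abs(value) > threshold]


# Synthetic example: Ca and Mg share a common driver, NO3 varies independently.
rng = np.random.default_rng(0)
driver = rng.normal(size=200)
names = ["Ca", "Mg", "NO3"]
data = np.column_stack([
    driver + 0.1 * rng.normal(size=200),
    driver + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

loadings, explained = pca_loadings(data)
print(important_parameters(loadings, names, component=0))
print(np.round(explained[:2].sum(), 2))  # share of variance explained by the first two PCs
```

The sign of a loading is arbitrary, which is why the comparison is made on absolute values.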
The summarized results of the PCA can be found in Table 3.
Table 3. Summarized results of the PCA. None: there were no factor scores ≥ 0.7.
Regarding the results, it can be said that the first two components explain approximately 50% of the data's total variance, independent of their spatial and temporal distribution.
Regarding the summer results, at Tiszabecs, Záhony and Balsa mostly the N-forms can be found in the first PC. In the second PC, the ions responsible for halobity (Mg²⁺, Na⁺, K⁺, Cl⁻) appear. Between Balsa and Tiszalök the background processes show a peculiar change: the scale tilts from the organic components towards the inorganic ones. At Tiszalök in the summer (according to the first PC), the major ions (e.g. Ca²⁺, Mg²⁺, Na⁺, K⁺, Cl⁻) play a determining role; the Polgár SL shows the same pattern. The fact that at Tiszabecs, Záhony and Balsa the N-forms are the most determining in the first PC leads us to the assumption that biological processes such as saprobity and trophic conditions are responsible for the background processes. Since no direct relationship was observed between nutrient levels and phytoplankton biomass (Istvánovics et al., 2010), other factors are responsible for changes in the N-forms. In contrast, the results from Tiszalök and Polgár show a change in the determining processes. After Tiszalök, the inorganic processes (e.g. aggregation, dissolution) take the place of the N-forms in the first PC.
From the perspective of the winter results, the first PC's explanatory power varies between 20% and 40%. At each SL except for Tiszabecs the ions determine the background processes.
In the second factor, the N-forms are dominant.
Regarding the whole year's PCA results it can be stated that the annual conditions resemble the winter ones to a high degree (just as in the case of the correlation results).In the first PC (explanatory power: 21-39%) the ions take on the determining role, while in the second one the N-forms are dominant.
Regarding the temporal distribution, it is clear that during the winter inorganic processes are dominant in determining the Tisza's water quality and these results represent the annual conditions much more than the summer results do.
As with the correlation analyses, PCA results which are not temporally divided are not satisfactory; this can only be confirmed, however, if the summer and winter data are analyzed separately, as was done in this case.

Conclusions regarding the results obtained from the River Tisza
It is clear that the methodology applied in the case of the KBWPS can successfully be applied to a river as well. The advantage of the methodology used is that it delimits subsystems based not on geography but on mathematics. It was known from the literature that three large hydro-geomorphic sections can be distinguished along the river (Istvánovics et al., 2010). The meandering upstream and downstream sections are separated by the section impacted by the two reservoirs. The PCA, for example, clearly separated the Hungarian section of the River Tisza into two sub-areas (with the Tiszalök SL (water barrage system) in the middle), where the Tiszabecs and Balsa SLs belong to the upstream section and the Tiszalök and Polgár SLs to the reservoir section. This section, however, causes discontinuity in the environmental gradient along the river. The separation is the strongest in summer, when autotrophic processes become driving forces in the reservoir. Having presented the two surface waters, the final case study, the analysis of SE Hungary's groundwater, is discussed next.

Sampling locations and examined parameters
Drinking water has a high arsenic content in south-eastern Hungary. Within the framework of the pilot project Sustainable management and treatment of arsenic bearing groundwater in Southern Hungary (SUMANAS, LIFE05 ENV/H/000418), the formation of the arsenic (geological in origin) and the development of its decontamination were examined. The geological part of the study was prepared by Smaragd GSH Ltd. Appropriate water chemistry analysis results made the utilization of multivariate data analysis methods possible.
Subsurface water resources of the Pleistocene aquifer in the south-eastern part of the Great Hungarian Plain contain arsenic in varying amounts. SE Hungary is a subsiding back-arc basin that was filled with river sediments (aleurite, sand and clay) in the Pleistocene. One of the two rivers, the Tisza, has a metamorphic, volcanic and re-accumulated sedimentary catchment area, while the Maros River mainly derives its sediments from volcanic territory (Nádor et al., 2007). Arsenic has been transported by fluvial fine-grained sediments (fraction: <2 µm), adsorbed on the surfactant amorphous iron-oxyhydroxides (Varsányi & Ó. Kovács, 2006), clay minerals and organic materials.
The primary transportation, accumulation and desorption of the arsenic from the sediments into the water are determined by adsorbents (amorphous iron-hydroxides, the surface of the organic matter, clay minerals) (Lin & Puls, 2000; Varsányi & Ó. Kovács, 2006), physicochemical conditions (redox conditions, pH), and changes in the groundwater's parameters (recharge, flow regime). These agents display great variability over time and space, causing divergent arsenic concentrations in the groundwater.
Groundwater sampling was carried out at 202 groundwater wells (like those mentioned in section 2.1.1). All the wells were located in SE Hungary and the bordering area of Romania. Most of these were supply wells, plus a few monitoring wells, situated at an average elevation of 93 m above Baltic sea level. Groundwater temperature varies with depth. The average water temperature is 19.6 ºC, the lowest is 12.1 ºC, while the highest is 81 ºC.
In the course of water chemistry analysis, the separation of the individual As species is the most important step, because As(III) is 60 times more toxic to the human body than As(V). For this reason, determining the quantity of the different arsenic species present in the groundwater is a crucial point in this study. The method for measuring As(V) and As(III) separately was developed during the project by Bálint Analitika Ltd. (in: Körös Valley District Environment and Water Directorate, 2008).
The parameters analysed can be found in Table 4, excluding Na⁺ (mg l⁻¹) and conductivity (µS cm⁻¹). One of the biggest problems concerning the database arose when the concentration of the analysed parameter was below the detection limit. This occurred in many cases for a few parameters (cadmium, mercury, lead etc.), so these were simply left out of the calculations.
In the case of a few other parameters (ammonium, nitrite, nitrate, sulphate), concentrations both above and below the detection limit were observed. When a value is below the detection limit, a common practice is to replace it with, for example, half of the detection limit (other solutions are mentioned in section 2.1.2). If this is done, the examination of stochastic relationships (PCA) may generate huge errors. In the case of cluster analysis, however, replacing values below the detection limit with values close to zero may only lead to minor errors, which can be accepted. In summary, it is essential to keep in mind which methods can handle values below the detection limit and which cannot. In some cases using these values will lead to huge errors (e.g. PCA); in other cases the results can still be considered correct (CA).
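The handling of values below the detection limit described above might look like the following sketch. It assumes that censored readings are stored as `None`; the cut-off `max_censored` for dropping a parameter, the detection limits and the readings are all hypothetical and are not taken from the study.

```python
def censored_fraction(values):
    """Share of readings reported as below the detection limit (stored as None)."""
    return sum(v is None for v in values) / len(values)


def substitute_half_detection_limit(values, detection_limit):
    """Replace each censored reading (None) with half of the detection limit."""
    return [detection_limit / 2 if v is None else v for v in values]


def prepare_for_clustering(values, detection_limit, max_censored=0.5):
    """Return substituted values, or None if too many readings are censored.

    max_censored is a hypothetical cut-off, not a value given in the text.
    """
    if censored_fraction(values) > max_censored:
        return None  # parameter is left out, as was done for cadmium, mercury, lead
    return substitute_half_detection_limit(values, detection_limit)


# Hypothetical nitrite readings (mg/l) with a hypothetical detection limit of 0.02 mg/l.
nitrite = [0.05, None, 0.12, None]
print(prepare_for_clustering(nitrite, detection_limit=0.02))

# A mostly censored series, e.g. a trace metal, is dropped instead.
cadmium = [None, None, None, 0.001]
print(prepare_for_clustering(cadmium, detection_limit=0.0005))
```

Following the text, the substituted series would be used for cluster analysis only; for PCA the substituted values can distort the correlation structure.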

Stochastic connections
The correlation matrix shows whether there is a linear relationship between the measured parameters or not. The example is as follows.
The correlation matrix revealed a strong linear connection between conductivity (measured on site) and Na⁺ and HCO₃⁻. If conductivity and Na⁺ content are plotted on a diagram (Fig. 9), besides the linear relationship one may easily recognize the different character of the connection in different concentration ranges. Thus, the application of different regression functions would be practical if the aim is to determine the relationship at this resolution. The above-mentioned case is depicted in Fig. 9/A. This graph is split into three parts (Fig. 9/B, C & D). The groups of sampling points presented in these figures are the results of the subsequent cluster analyses. They represent different geographical regions.
The figure series draws attention to the fact that great differences exist between the presented groups that can be interpreted as the result of different hydrogeological conditions (Fig. 9/ B, C, D).Regarding the correlation relationships, space does not allow us to enter into further details in the present paper.
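Fitting a separate regression function to each group, as suggested for the Na⁺ and conductivity relationship in Fig. 9, can be sketched with ordinary least squares per group. A straight line is assumed here purely for illustration; the group labels and values are hypothetical and are not read from Fig. 9.

```python
def fit_line(x, y):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_by_group(samples):
    """Fit one line per group from (group, na, conductivity) tuples."""
    grouped = {}
    for group, na, conductivity in samples:
        xs, ys = grouped.setdefault(group, ([], []))
        xs.append(na)
        ys.append(conductivity)
    return {group: fit_line(xs, ys) for group, (xs, ys) in grouped.items()}


# Hypothetical (group, Na+ in mg/l, conductivity in µS/cm) samples.
samples = [
    ("group 1", 50.0, 400.0), ("group 1", 100.0, 600.0), ("group 1", 150.0, 800.0),
    ("group 2", 200.0, 900.0), ("group 2", 300.0, 1200.0), ("group 2", 400.0, 1500.0),
]
for group, (slope, intercept) in fit_by_group(samples).items():
    print(group, round(slope, 2), round(intercept, 1))
```

Differing slopes between groups would reflect the differing character of the relationship across concentration ranges noted above.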

Cluster and discriminant analyses and Wilks' lambda statistics
In this section two cluster results are presented: one containing the "problematic"*** values and one conducted only on the subsurface waters' anions and cations used for facies determination.
During cluster analysis, the first step was to examine the datasets containing the parameters with values lower than the detection limit, and then to replace non-detectable values with half of the detection limit (as mentioned above); as a result, the amount of data available for the cluster analysis increased. Two out of three strongly correlating parameters (conductivity and Na⁺) were removed in order to reduce their collective effect. This is the reason why these cannot be found in the following analyses (e.g. Table 4).
Four groups were determined and placed on a map, three of which separated explicitly (Fig. 10). Different groups are marked with different colours. The fourth group contains outliers (e.g. shallow monitoring wells and a deep thermal well), thus there is as yet no explanation for its constitution. For the purposes of better interpretation, the three groups were named after the geographical region they are located in, and the fourth one was named Outlier. From now on these will be referred to using the following names. The first group is the Maros group, in the area of the Maros alluvial fan and the Makó graben. The second group is the Körös group, in the area of the Körös basin. The third group's wells are geologically and geographically situated in the Maros alluvial fan; however, it was given the name Arad group after the city around which they are located. As mentioned before, the clustering was repeated with a dataset without the "problematic" values. This time only the anions and cations used for the subsurface waters' facies determination were considered. Results were again visualized on a map (Fig. 11).
*** In section 3.3.2.2 the term "problematic" refers to parameters with a few data below the detection limit. If their number is high, the dataset's variance is low.
The similarities and differences between the results of the two clusterings (Figs 10 & 11) are discussed next. The main difference is that, after the second clustering (Fig. 11) with the reduced database (containing the parameters Ca²⁺, Mg²⁺, Na⁺, K⁺, HCO₃⁻, Cl⁻ and SO₄²⁻), the wells of the second group (Körös) in Fig. 10 join the wells of the first (Maros) and fourth (Outlier) groups. Nevertheless, a few wells kept their original relationship and stayed in the second group (Körös).
Using the database without the "problematic" parameters, the data were plotted on a Piper diagram (Fig. 12) (the colour codes of the groups were retained). It is easy to see that the spatially separated groups are no longer present. This is no surprise, because the two methods serve different purposes.
To verify the grouping, discriminant analysis was used, which showed that 94.6% of the original grouped cases were correctly classified using the grouping based on the data including the "problematic" parameters. Then the contents of the actual groups were changed according to the software's (SPSS) suggestions. As a final result, 100% of the original grouped cases proved to be correctly classified, with the cross-validation resulting in a value of 96%. In order to demonstrate the effect of the individual parameters on the formation of the cluster groups, a Wilks' lambda quotient was calculated for each parameter (Table 4). As expected, the most influential parameter was the organic NO₃⁻. Among the inorganic chemical components, the cluster grouping was notably influenced by anions such as HCO₃⁻ and SO₄²⁻, by the cations Mg²⁺ and Ca²⁺, and by pH and total Fe as well.
At this point enough results had been obtained to see clearly why the two cluster results (Figs 10 & 11) resemble each other. The reason is that the main parameters necessary for determining hydrogeological facies play an important part in forming the groups (based on the Wilks' lambda quotients), while the differences originate from the other influencing parameters. Spatial separation is much more obvious in the latter case. It is suggested that the statistics of the different groups' parameters be presented on box-and-whiskers diagrams. As an example, HCO₃⁻ and As(V) are presented (Figs 14 & 15). Based on these results, the individual groups can be described.
The most important aim of the hydrogeological investigation of the introduced area was to determine the distribution of arsenic in the groundwater. The groundwater's different As(V) concentrations can be seen in Fig. 15. It is obvious that the arsenic accumulated mostly in the waters of the deep and a few shallow wells (Outlier group), and the least in the Arad group's wells. In the Arad group arsenic was traceable in 5 out of 33 wells. Two of the five sampled wells fall within the area of the Maros alluvial fan. The different forms of arsenic show great variance in the groundwater of the other areas, but the arsenic content of groundwater is clearly higher in the area of the Körös basin.

Principal component analysis
According to the literature, two out of the three groups (Arad and Maros) are located in different parts of the same flow system, while the third represents another flow direction. It is worth examining the dominant processes taking place in the different groups. However, a new problem had to be faced: the variance of a few parameters was critically small within the different groups. For many hydrogeologically important parameters the standard deviation was too small, so they had to be discarded. Examples included sulphate in the Körös group and As(III) and As(V) in the Arad group; the same can be said of the total iron content in the Arad area.
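Screening out parameters whose variability within a group is too small for PCA can be sketched as below. The text does not state the criterion that was used, so the coefficient of variation and the cut-off `min_cv` are assumptions made for this illustration, as are the function names and values.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the absolute mean."""
    centre = mean(values)
    spread = pstdev(values)
    if centre == 0:
        return 0.0 if spread == 0 else float("inf")
    return spread / abs(centre)


def split_by_variability(group_data, min_cv=0.05):
    """Separate parameters into those kept for PCA and those discarded.

    min_cv is a hypothetical cut-off, not a value given in the text.
    """
    kept, discarded = [], []
    for name, values in group_data.items():
        target = kept if coefficient_of_variation(values) >= min_cv else discarded
        target.append(name)
    return kept, discarded


# Hypothetical concentrations (mg/l) within one cluster group.
koros_group = {
    "SO4": [1.0, 1.0, 1.0, 1.01],
    "HCO3": [300.0, 420.0, 510.0, 380.0],
}
kept, discarded = split_by_variability(koros_group)
print(kept, discarded)
```

A relative measure is used here because the parameters are reported in different units and ranges; an absolute threshold on the standard deviation would be another possible choice.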
The most important results of the PCA, regarding the parameters and groups, are presented in Table 5. Absolute values of factor loadings higher than 0.71 are indicated in bold and red.
Regarding the PCA result it can be said that no consistent conclusion can be drawn without discrepancy, either globally, or regarding the different groups and parameters.
Table 5. The factor scores of the parameters which suited the conditions of the PCA in each cluster group

Conclusions regarding the results obtained from the groundwater system's analysis of SE Hungary and the bordering Romanian area
The explicit separation of the groundwater characteristics in the different parts of the Great Hungarian Plain has been a well-known fact for a long time (Rónai, 1985). Based on the dissolved cations, the individual water types are related to the three extended hydrogeological units (Duna-Tisza interfluve South-Tiszántúl, Maros alluvial fan, Körös basin) (Rónai, 1985; Varsányi & Ó. Kovács, 2006). Our investigations confirm this. Furthermore, the investigations expand what is already known with the result from the Romanian area. Based on the results of the applied multivariate data analysis methods, the groundwater sampled in the Makó graben originating from the Duna-Tisza interfluve does not separate from the characteristic water type of the Maros alluvial fan.
Based on hydraulic modelling, results obtained from the upper 600 m (screen depths of drinking water supply wells) and water age data, the Maros alluvial fan consists of one uniform gravitational flow system. Towards Romania hydraulic heads gradually increase.
Based on the literature and the results obtained, it seems that the Arad group is located at the beginning of the regional flow system, while its middle part and discharge area are situated in the Hungarian part of the Maros alluvial fan. Along the flow path in the Maros alluvial fan, depending on the quantity of bound cations in the clay minerals, monovalent and divalent ions may be exchanged. This results in a systematic change of the dissolved cations' concentration in the direction of the flow. In the case of the Ca(HCO₃)₂ and Mg(HCO₃)₂ water types, the concentration of divalent cations decreases, while the concentration of Na⁺ increases in the direction of the flow (Varsányi, 2001).
The Körös basin (bordering the Maros alluvial fan) is an individual hydrogeological system that is, based on the water ages and the high Na⁺ content of the groundwater, situated at the end of a NE-SW oriented gravitational flow system. Besides the gravitational flow, the area can be characterized by slow upward flow originating from sediment compaction.
These statements do not contradict the results explained above; nevertheless, our results show that differences in the three areas' groundwater chemistry are affected not only by gravitational flow systems.It is important to mention that data analysis methods may provide significant extra information during the exploration of a certain area's hydrogeological conditions, but separating different flow systems and flow regimes based only on data analysis is not possible.
The results of the PCA may reveal background processes taking place in a gravitational flow system, such as cation exchange processes, or the role of Na⁺, which has an important place in both the Körös and the Arad group. In the case of the Arad group this contradicts the group's location in the flow system. For other parameters (for example high chloride and sulphate concentrations), anthropogenic contamination may be in the background or, considering the rivers' ablation area, a geological origin is feasible as well.
In order to determine the origin of the contaminants, further investigations are needed. The place of the Körös group in the flow system does not contradict the high factor score of Na⁺; however, this high value, in comparison to its factor score in the Maros alluvial fan, implies a background process, the albite-montmorillonite reaction in the sediments of the basin at depths of 60-500 meters (Varsányi, 2001).

Summary
In our chapter we introduced a few methods known for decades in earth sciences and geology (Davis, 2003). We developed an order of application which seemed beneficial during our work. The results obtained were of great use in studies where a hypothesis needed verification or discarding. When choosing the case studies, our direct aim was to present datasets with problems commonly faced by scientists, through three water environments: the Kis-Balaton Water Protection System, the River Tisza and a groundwater system of SE Hungary and the bordering area of Romania. We hope this chapter will be of use for every scientist who has to work with water quality data.
In the correlation coefficient, the denominator is the product of the standard deviations of variables X and Y, while the numerator is the covariance.

Fig. 5. Dendrogram of Wilks' lambda quotients
• Z11 (interface, representing Phase I): the parameters TP, Chl-a, TN, Ca²⁺, suspended solids and Cl⁻ are present in the first PC, indicating Phase I's eutrophic and algae-dominant waterbody. Again, the Vollenweider model can be used to describe this environment. The calcium ion indicates biogenic carbonate precipitation, which is a dominant process in certain locations of Phase I.
• Z27 (outlet, describing Phase II): the parameters K⁺, Na⁺, Cl⁻ and NO₃⁻ are present in the first PC, where decomposition processes are dominant. The reason for NO₃⁻ being one of the dominant parameters is that nitrification is the main process in this section of the system, indicating aerobic conditions.

Fig. 7. River Tisza's first five sampling locations in Hungary
During the research, a 31-year-long dataset was analyzed, consisting of 300,000 data. From 1970 only one sample a week was taken at Tiszabecs and Polgár, according to COMECON††'s specifications (T. Nagy et al., 2004). At Balsa only one sample was taken per month. At Záhony 26 samples were taken every year. In 1994 Hungarian Standard No. MSZ 12749:1993 came into force. As a result, since 1993 26 samples have been taken every year uniformly.

Fig. 9. x-y scatterplot showing the relationship between Na + and conductivity for all the groups (A), group 1 (B), group 2 (C), and group 3 (D)

Fig. 10. Cluster results obtained from the database including the "problematic" parameters, where the same colour indicates the same group.

Fig. 11. Cluster results obtained from the database excluding the "problematic" parameters. The same colour indicates the same group.

Fig. 12. Piper diagram obtained from the database excluding the "problematic" parameters
The result of the discriminant analysis can be visualized in Fig. 13 on the plane spanned by the first two discriminant functions (function 1 & function 2). Separation of the different groups can be easily observed. The wells in the Outlier group separated significantly from all the other groups.

Fig. 13. Visualized results of the cross-validated discriminant results with the "problematic" parameters included.

Table 1. Parameters' Wilks' lambda coefficients in summer and winter.

Table 2. The number of strong linear connections (|r| ≥ 0.7) at each sampling location by temporal distribution

Table 4. Wilks' lambda quotients of each parameter in increasing order