Descriptive statistics for dissolved oxygen concentration between January 2002 and May 2013.
“Water is the principle of all things” Thales of Miletus.
Water quality models are essential tools for the assessment of the impact of ecosystem changes in response to variable inputs, as well as the interactions occurring within the system . Taking this into account, developed countries have continually been monitoring and classifying their water. The over-abstraction of fresh water is a problem in certain parts of Europe, especially in coastal countries and islands in the Mediterranean, leading to the decrease of groundwater bodies, the deterioration of water quality and the destruction of certain habitats . Water quality monitoring procedures may be used in decision-making processes for supporting policy options. For this reason, several European Union (EU) countries have developed a national water quality system considering the characteristic structures of their own rivers and have used these types of indicators to evaluate their current water quality levels. Water quality monitoring is usually conducted using physical and chemical parameters; nevertheless, biological factors such as algae, especially diatoms, benthic macro invertebrates and fish, have recently been used for this purpose . Nowadays, the majority of the population in northern and central Europe is connected to wastewater treatment plants that make use of advanced treatments. This development allows for a decrease in the load of organic matter and nutrients transported by rivers and subsequently lead to an improvement of the state of water resources .
The management of water resources is regulated by EU directives and their transposition into national legislation. For instance, in Portugal, the Law nº58/2005 (Law of Water) ensures implementation into national law the Directive nº2000/60/CE (the Water Framework Directive, WFD), which creates the institutional framework for the sustainable management of surface, interior, transitional, coastal and even groundwater. According to this directive, each member-state has to project, improve and recover all surface waters in order to achieve the good qualitative and quantitative status of all water bodies by 2015 [5,6]. The Decree-Law nº 77/2006 complements the WFD by characterizing the waters of a river basin. A regulatory instrument establishes the status of surface waters and groundwater and their ecological potential.
According to the directive's framework, the monitoring of water resources essentially has two purposes. The first is the evaluation of water status (surveillance monitoring) and the second is the implementation of programmes that include measures for identifying water resources at risk of failing environmental objectives (operational monitoring). If surveillance-monitoring reveals that the water in question has not reached the necessary "ecological status", a basin management plan should be implemented for the duration of one year. During this process, indicative parameters of all elements of biological, hydromorphological and general physico-chemical quality, as well as significant discharges in the basin containing pollutants, should be monitored in order to permit the classification of ecological status. In operational monitoring, measures are implemented and the status of water is evaluated as a result of the implementation of such measures .
In this context, every European country organizes itself using various entities that monitor water resources at various levels. The water area in Portugal consists of six large sets of entities: the trusteeship policy (national and regional), advisory bodies (national and regional), public water management (national and regional), utilities users and regulators (urban, agricultural, energy developments and multipurpose), mixed structures and associations and non-governmental entities.
The Water Portal website aims to respond to the provisions of Chapter VIII of the Water Law, especially its art. 87º, which empowers the National Water Authority – Institute of Water (Instituto da Água I.P.) to create a national information system for water and place it at the disposal "of... entities that have responsibilities [in] exercising public functions or [in] providing public services directly or indirectly related to water [to] the technical and scientific community and the public in general." . The national water resources monitoring system is supported by a database that stores, prepares and publicize both hydro-meteorological and water quality (surface and groundwater) data, which have been collected by the monitoring network of the Ministry of the Environment (through the http://snirh.pt portal system). The National Information System for Water Resources (Sistema Nacional de Informação de Recursos Hídricos – SNIRH) was created by the Institute of Water in mid-1995. The monitoring network consists of automatic and conventional stations, some of which are equipped with remote transmission. The portal system also publishes monthly thematic syntheses aimed at characterizing the national water availability, technical reports, the mapping of water resources (e.g., flood zones) and maintaining technical documents and photographs related to water resources , following the significant economic support from European funds devoted to investments in sewage systems. Portugal has had slow growth in the area of wastewater treatment. The rate of the population connected to public wastewater treatment plants was about 56% in 2005 and 74% in 2009 (see Figure 1).
Despite high investments at various levels of Portuguese public administration and those of economic actors, the economic crisis of recent years has led to a disinvestment in the monitoring and in the water treatment systems. Undoubtedly, the economic constraints of local authorities in Portugal have led to a reduction in investments in waste management, which may jeopardize the quality of surface water in several Portuguese hydrological basins (see Figure 2).
Surface water-monitoring systems are important tools for analysing environmental processes and for identifying changes within these processes. In this context, statistical tools and methodologies are relevant for assessing water quality and for allowing the analysis of available data yielded by information systems. The temporal evolution analysis of water quality indicators is an important tool in this context, as the real-time monitoring procedures. Thus, modern statistical techniques are an integral part of any water monitoring process.
This paper presents a set of issues that analysts encounter in the analysis of data that support decision-making processes. The focus of this work is on contributions to the analysis of time series of water quality variables. The study of these time series has characteristics related to data collection procedures, as well as to several economic and geographic aspects. Some characteristics of water quality variables pose challenges to their statistical modelling, namely, seasonality, change-point detection, the existence of outliers, etc. Some of these topics will be addressed in this chapter. The presentation of these topics is accompanied by the study of water quality data from the hydrological basin of the Vouga River and the Ria de Aveiro lagoon. In particular, statistical modelling is performed using data pertaining to the dissolved oxygen concentration variable (in mg/l) from the Carvoeiro water-monitoring site in the Vouga River basin. Section 2 discusses the area under study, while Section 3 presents a data description with an exploratory analysis of the primary monitoring sites of the Vouga basin. Section 4 addresses the analysis of a time series through the decomposition into its main components and their treatment, while Section 5 presents the modelling of time series through a state space model approach.
2. The study area
The Vouga River is situated at the centre of Portugal at an altitude of about 930m, near the geodesic landmark Facho da Lapa in Serra da Lapa, a mountain located in the district of Viseu. The river flows 148 km before empting into Ria de Aveiro. The watershed of the Vouga (also referred to as Vouga e Ribeiras Costeiras) is the second largest basin among the watercourses that run exclusively in Portuguese territory, comprising a total area of 3706 Km2. More specifically, the Vouga basin is located in the transition zone between the north and south of Portugal, i.e., between the watersheds of the Douro in the north and the Mondego in the south (see Figure 3).
Several rivers flow into the Ria de Aveiro estuarine system. The hydrologic regime involves a summer low flow condition and the dynamics of the coastal lagoon are dominated by tidal oscillation. Ria de Aveiro is characterized by its rich biodiversity, as well as by increasing pressure related to anthropogenic activities near its margins, i.e., building and land occupation and agricultural and industrial activities. This has resulted in a significant change to the lagoon's morphology and in the constant input of a large volume of anthropogenic nutrients, as well as contaminant loads, which in turn has had a negative impact on water circulation and the water quality of the lagoon, . The construction, management and operation of the multi-municipality system drainage of the Ria de Aveiro is the responsibility of SIMRIA – Integrated Sanitation of Municipalities of Ria, SA, a private company with a public capital majority (established by Decree-Law nº 101/97 of 26 April).
Several studies have been developed around this watershed and in particular, around the Ria de Aveiro lagoon. Some studies focus on the ecological systems related to Ria de Aveiro and the diversity of its flora and fauna (e.g., [12, 13]. Other studies have examined the biochemical properties of the region's environmental systems. Still others have studied the hydro-meteorological aspects of the watershed. This diversity and multidisciplinary approach to the Ria de Aveiro lagoon and its associated watershed was recently formalized by the University of Aveiro as 'Grupo uariadeaveiro'. The group aims to monitor and contribute to the management of the Ria de Aveiro by focusing mainly on knowledge transfer activities; by doing so, it can also contribute to the enhancement of other activities within the University of Aveiro. Additionally, special attention to the activities of the primary entities and relevant stakeholders associated with the management of the Ria de Aveiro lagoon can in this way be maintained in order to ensure the useful and effective contribution of the University of Aveiro. The University of Aveiro, through its uariadeaveiro group, is committed to facilitating the provision of scientific knowledge about this lagoon area and places it at the service of the region. Thus, the University of Aveiro hopes to contribute to the improvement of the protection, enhancement and management of the Ria de Aveiro.
The website http://www.ua.pt/riadeaveiro/ (see Figure 4) was established as a mechanism for the dissemination of knowledge and activities of interest pertaining to the Ria de Aveiro. This work in progress aims to establish itself as the most important collection of data about the Ria de Aveiro in the country. It is hoped that this initiative will help to create internal synergies within the scientific community focused on the Ria de Aveiro and also to strengthen bridges of communication with regional stakeholders in order to maximize the successful management of the Ria.
The webpage of the uariadeaveiro group includes five menus. The first is dedicated to the presentation of the group and its goals. The second is dedicated to the library, where access can be gained to all scientific and non-scientific publications regarding the Ria de Aveiro. The third page provides spatial information on the Ria de Aveiro and allows for access to other pages with geographic information of interest regarding the Ria systems. The fourth page is devoted to the approximation entities and users of the Ria. The fifth page discusses research projects completed and ongoing, relevant training courses and activities related to the Ria de Aveiro .
The average fresh water flow into the Ria de Aveiro is about 40 m3/s. The Vouga and Antuã rivers are the main sources of fresh water, with an average annual flow of 24 m3/s and 2.4 m3/s; both of these rivers are part of the Vouga watershed, .
The main tributaries of the Vouga River are, from upstream to downstream, the River Mel, the Sul River, the Varoso, the River Teixeira, the River Arões, the River Mau and the Caima River on the right bank; on its left bank, the River Ribamá, the MarneI and the River Águeda with its major tributary, the Alfusqueiro. Officially, the catchment of the River Vouga consists in the rivers that dewater into the Ria de Aveiro with the exception of Vala da Fervença, in the Mira area, which flows to the southern end of the Ria de Aveiro .
A set of water monitoring sites (see Figure 5), where several variables pertaining to water quality can be collected, is available in the hydrological basin of the Vouga River. However, some problems have arisen in the context of statistical modelling, i.e., some water monitoring sites and variables yield little data and/or have missing values. Furthermore, due to a lack of economic resources and other factors, data collection was discontinued at some sites.
Within the SNIRH system, 78 water-monitoring sites are registered within the hydrological basin of the Vouga River. However, data collection is not continuous and some stations had been deactivated somewhere in time. Relative to the DO concentration, 26 stations showed a significant data set, with data up to May 2013 (the last month available in the system). Hence, the statistical analysis will be focus mainly on this data set.
3. Data set description
The statistical analysis was performed using a data set related to the dissolved oxygen concentration in a large set of water monitoring sites located in the hydrological basin of the River Vouga, a basin associated with the Ria de Aveiro lagoon. The analysis focused on the DO concentration because it is one of the most important variables in the evaluation of a river's water quality  and because of its continuity in terms of measurements taken at all selected water quality monitoring sites under analysis.
The DO variable is widely recognized as one of the most relevant for the monitoring of water quality. DO concentration is controlled by several physical processes such as the rate of production of oxygen by algae, the nitrogen cycle and other biochemical processes . At the same time, available oxygen is the primary factor affecting aquatic life in rivers. A good level of dissolved oxygen is essential for aquatic life. Oxygen concentration is therefore, the most important parameter for aquatic life and ecological states in estuaries . Dissolved oxygen analysis measures the amount of gaseous oxygen dissolved in an aqueous solution. Oxygen gets into water by diffusion from the surrounding air, by aeration and as a waste product of photosynthesis. Adequate dissolved oxygen is necessary for good water quality and is a necessary element to all forms of life. Natural stream purification processes require adequate oxygen levels in order to provide for aerobic life forms. When dissolved oxygen levels in water drop below 5 mg/l, aquatic life is put under stress. With a decrease in oxygen concentration the stress on life forms increases. Oxygen levels that remain below 1-2 mg/l for a few hours can result in large numbers of fish deaths .
The SNIRH system holds data regarding the DO concentration of 78 water-monitoring sites between April 1989 and May 2013. However, among them several sites hold little or exhibit several missing values, which are associated with large gaps between measurements. Thus, the statistical descriptive analysis will be performed primarily on the largest time series. A significant number of locations had less than 23 observations for all time periods. These time series will not be considered in the descriptive analysis. On the one hand, only 26 water-monitoring sites had at least 100 observations and these will be considered in the exploratory analysis. On the other hand, only between January 2002 and May 2013 are consistent data sets available for all locations. Thus, the main data set that will be considered consist of these 26 time series during this period.
Data collection was not equally distributed during all time periods. Sometimes there was more than one measurement in a month. Therefore, in order to not lose information, we took the average of all measurements if there was more than one observation in a month. Table 1 presents descriptive statistics for the 26 time series between January 2002 and May 2013 (137 months) and Figure 6 shows the 26 time series representations. The graphical representation clearly shows that time series exhibited seasonal behaviour, as was expected due to the nature of the data.
|Site||Code||Abbrev.||N. obs.||Min||Max||Average||St dev||Skew||Kurtosis|
|Captação Rio Ínsua||08G/02||INS||122||6.4||12.4||9.31||1.05||-.33||.65|
|Ponte São João de Loure||10F/04||LOU||112||5.4||11.0||8.24||1.25||-.02||.01|
|Ponte Vale Maior||09G/01||MAI||112||6.2||12.0||8.62||1.12||-.05||-.05|
|Aç. Rio Alfusqueiro||09H/05||ALF||113||2.9||12.0||7.80||1.75||-.31||-.01|
|São João Serra||09H/04||SER||115||6.0||12.0||8.70||1.18||.01||-.08|
|São Miguel Mato||09I/06||MAT||111||4.3||12.0||8.44||1.53||-.41||.03|
|Ponte Canha (Vouga)||10G/06||CAN||114||2.6||10.1||6.89||1.92||-.26||-.96|
An exploratory analysis showed that, in general, observations were not normally distributed. Indeed, in some locations, the distribution of observations is skewed or leptokurtic. This fact must be considered in the modelling procedures, since Gaussian distribution is a usual assumption in several statistical analyses. Figure 7 represents box-plots for DO concentrations. Boxplots are able to identify several moderate outliers in many locations. However, due to existence of temporal correlation structure in the measurements, the analyses of outliers must be carefully considered.
As DO concentration exhibits seasonal behaviour, it is necessary to analyse the monthly averages. Figure 8 shows the monthly averages of DO concentration during the analysed period for the 26 water-monitoring sites. These results indicated that DO concentration was greater in the winter months and presented lower values in the summer months. This fact is directly related to hydro-meteorological conditions, since DO concentration is influenced primarily by precipitation amounts and temperature. The monthly standard deviations presented in Figure 9 exhibit a diffuse pattern. Although it is difficult to identify a common pattern for all sites, a graphic representation indicates that in several locations there was larger variability during the months of January, September, October and November. The lower variability of the monthly DO concentration measures in December may be justified by the reduced observations typically collected during this month.
4. Time series analysis
This section addresses several issues in the statistical modelling of time series water quality variables. Water quality variables present characteristics that pose some challenges to modelling procedures. By their nature, these variables have properties that must be incorporated in order for statistical techniques to yield good performance.
There are different ways of dealing with these features based on different approaches. One way of modelling a time series with components as either a trend or a seasonal/stochastic behaviour is by employing the time series decomposition method. This method consists of sequentially and individually modelling each component so that trends and/or seasonal components can be estimated and subtracted from the data to generate a noise sequence.
In the present work, we present, from a practical point of view, the most common methodologies applied to the time series of DO concentration in the Carvoeiro water-monitoring site. The choice of this monitoring site is related to its location in the main river, in the middle section, near to a point of abstraction of water for human consumption. Furthermore, the time series data in this location is long – the data is available between April 1989 and September 2012 – and it will used to apply the statistical procedures.
4.1. Heteroscedasticity and trend modelling
Generally, environmental time series can be non-stationary in a variety of ways. The most common case occurs when data present a trend and/or heteroscedasticity. When data has variance in terms of time, i.e., , the most commonly applied procedure is the application of data transformation.
In several areas there are models that accommodate heteroscedasticity within their structure. For example, in econometrics, ARCH models are commonly employed in the modelling of financial time series that exhibit time-varying volatility. When an autoregressive moving average model (ARMA) is assumed for the error variance, the model is a generalized autoregressive conditional heteroscedasticity  model.
Heteroscedasticity in an environmental time series is commonly treated using the Box-Cox transformation 
which is indexed by , where is the transformation of the original observation . Note that for , the transformation is the natural logarithm of . On the one hand, the Box-Cox transformation is not suitable for all heteroscedasticity types. On the other hand, this transformation, among others, changes the data magnitude hindering the interpretation of the model.
As will be shown later, this work considers a class of models (state space models) that is able to incorporate some types of heteroscedasticity and does not require data transformation. A time series frequently has a trend, i.e., it is not an observed process with a constant mean. In general, the trend is modelled using a deterministic function , usually a polynomial function, power function or logarithmic function. The adjustment of the function to data is generally performed using the least square method:
Nevertheless, the most common function is of the type . However, the simplicity of this function is not always appropriate to model real data, see for instance Figure 10. The DO concentration data for the Carvoeiro water-monitoring site represented in Figure 10 corresponds to an extended period of time. The graphical representation clearly indicates that the time series is not stationary in mean and the variability does not follow a monotonic function in time. Furthermore, there is no unique function among those referred above that globally fit the time series in order to model the trend. Thus, we can identify a trend function for the time series defined by sub-periods of time.
Figure 10 suggests that there is a time where the trend function changes its expression, possibly with a non-linear function prior to that time and a linear form following it. For simplicity and according to the exploratory analysis, we considered the most suitable function to the first part of the series to be a quadratic function and for the second part, a constant.
After fitting a regression model with a quadratic trend up to a certain time , to all possible times and a constant after that instant, we chose the value of with the smallest residual sum of squares . This procedure led to = 139, i.e., October 2000 and the estimated trend was:
Figure 11 presents the series resulting from the subtraction of the adjusted trend to the original data, . As is shown, this new series does not present a deterministic trend; however, as is the norm for monthly environmental data, seasonal behaviour is present.
The easiest way to handle seasonality is to subtract from January’s data the overall January average, from February’s data the overall February average, etc. . An alternative method involves estimating both trend and seasonality coefficients together. In this case, it is necessary to define a set of dummy variables as,
where is the period (in our case ). If the trend component is polynomial, as previously considered, the model shows unique representation through a linear regression model. As it is generally appropriate that the sum of the seasonal coefficients must be zero, the linear model should be re-parameterized in order to overcome the linear dependence of variables . Thus, the model can be presented as:
where is white noise. The vector of coefficients is estimated by the least squares method. In order to facilitate the interpretation of results, we can consider the model with a constant parameter , which leads to the following transformations:
In general, these two approaches produce similar estimates. The first method is easier to implement and less laborious than the second. Furthermore, the regression model provides optimal estimates in certain conditions that assume that the errors are not time correlated, an assumption that is not valid for the data in the current analysis. Hence, for simplicity and considering the data characteristics, we adopted the first method, that is, the decomposition procedure that estimates trends and seasonal separately. Table 2 presents the estimates for the seasonal coefficients of the Carvoeiro data after the subtraction of the adjusted trend component.
As expected, from the DO perspective, the water quality was better in the winter months and worse during the summer months. The DO concentration is associated with weather conditions, particularly with precipitation amounts .
The Box-Cox transformation is associated, for example, with multiplicative models where the time series components are multiplicative as . However, there are also other types of heterocedasticity. Figure 12 shows for each month both the overall monthly averages and standard deviations of the DO concentration in Carvoeiro. This analysis indicated that variability had not been constant throughout the year. Indeed, the linear correlation coefficient between averages and standard deviations was 0.44, which showed that when the monthly average was large, the standard deviation tended to be large as well. This type of heteroscedasticity will be taken into account in the modelling procedure via the state space model approach.
4.2. Temporal dependence
It is well known that there exists a certain level of persistence in the behaviour of nature . Indeed, environmental data is influenced by meteorological conditions that may persist from one month to another . The series of residuals pertaining to the DO concentration in Carvoeiro does not have the characteristics of white noise. Figure 12 shows that there is a significant temporal autocorrelation according to an autoregressive process of order 1, an AR(1). Thus, stochastic modelling of residuals is needed in order to accommodate the serial correlation.
When modelling a time series using the decomposition method approach, the procedure that is generally adopted for finding the best fit of an ARIMA model to past values of a time series is the Box-Jenkins methodology (for more details, see . The modelling of the dependence of environmental data significantly influences decisions in the application of other modelling statistical procedures such as, for example, the existence of non-stationarities in a series or in change point detection procedures. Therefore, in section 5, we consider a class of models that allows for incorporating several components as the temporal dependence in a dynamic manner and which will be used to model the DO concentration in the water-monitoring of Carvoeiro.
5. Time series modelling
State space models are a class of extremely versatile time series models. Their popularity is mainly due to the Kalman filter. This algorithm, , has been used in many different areas to describe dynamic systems evolution. The primary goal of the Kalman filter algorithm is to find estimates of unobservable variables based on observable variables related through a set of equations designated by a state space model (SSM).
In general, such models are defined by the following equations:
The first equation is the measurement equation and relates the vector of observable variables, , with the vector of unobservable variables, , referred to as states. The matrix is a matrix of known coefficients and is a white noise vector, called the measurement error, with covariance matrix .
Furthermore, the vector of states varies in time according to the second equation, the state equation, where is a matrix of autoregressive coefficients and is a white noise vector with covariance matrix . The disturbances and are assumed to be uncorrelated, that is, for all and s.
A subclass of these models with a particular interest arises when the state vector is a stationary process with mean , . In this case, to ensure that the state process and the state equation follows a stationary VAR(1) with a mean , it is assumed that the eigenvalues of the autoregressive matrix are inside the unit circle. For more details, see .
5.1. A brief description of the Kalman filter algorithm
As the state is unobservable, it is necessary to obtain its predictions. The Kalman filter algorithm uses an orthogonal projection of the state based on the available information up to the moment. Assuming that the parameters of a state space models are known, the Kalman filter recursions provide the best linear predictors for filtering and forecasting the vector of states.Let represent the predictor of based on the information up to time and be its mean square error (MSE). As the orthogonal projection is a linear estimator, the predictor for the observable variable, , is given by
When at time , is available, the prediction error or innovation, , is used to update the prediction of , the filter prediction, through the equation
where is called the Kalman gain matrix. Furthermore, the MSE of the update predictor verifies the relation . In turn, at time , the forecast for the state vector is given by the equation
with MSE matrix .
When the disturbances are Gaussian, the Kalman filter provides the minimum mean square estimate for .
5.2. Parameter estimation
The application of the Kalman filter to predict the unknown variables requires the estimation of the model’s parameters.
The vector of parameters and the mean vector if the state is stationary must be estimated from the data. When the Gaussian distribution is suitable to the disturbances and the conditional log-likelihood of a sample is given by:
where . The maximum likelihood estimates are obtained by maximizing the log-likelihood, i.e.,
The state space model’s parameters are estimated by the maximum likelihood via numerical procedures.,The Newton-Raphson method can be adopted or, more often, it is employed the EM algorithm .
If the Gaussian distribution is not appropriate or the optimal properties are not required in the modelling procedure, other estimators can be considered. For example, distribution-free estimators can be adopted, which do not assume any distribution for the disturbances (see [27, 23]).
5.3. A mixed-effect state space model
In order to incorporate the temporal correlation and to accommodate the heterogeneity of the variances of the sub-series of each month in terms of DO concentration in Carvoeiro, we adopted a mixed-effect state-space model. The model is defined as:
where is the trend component, is a periodic function with 12 averages as previously mentioned, the process is a stationary AR(1) with mean and Gaussian errors and is the measurement error, assumed to be a Gaussian white noise process.
The model has two primary components: a regression structure, which incorporates both the trend and the seasonality, and an unobservable process , the state. It is assumed that the state process will calibrate the deterministic structure previously estimated by and in presented in Table 2.
Another important feature of the model is that by its own formulation, it allows the existence of heterogeneity of variances previously identified. That is, the stochastic calibrator factor allows for the dynamic modelling of the heteroscedasticity during the year and incorporates the time dependence identified through the ACF and PACF functions in Figure 12.
Table 3 presents the maximum likelihood estimates of and their respective standard errors. As expected, the mean of the calibration factor was close to 1. This meant that the deterministic component represented the global behaviour of data. On the other hand, the calibration factor followed a stationary autoregressive process AR(1) process since .
Once the model had been well identified, the Kalman filter was run in order to obtain the forecasts and filtered predictions of the states . The Kalman filter provided predictions to the states and their respective MSE estimates. Figure 13 represents the filtered predictions of the calibration factors for each month and their respective empirical confidence intervals at a 95% level obtained by:
In order to assess the adjustment of the model, one step-ahead forecasts were computed with respective confidence intervals at 95% (see Figure 14):
The percentage of observations outside of the respective empirical confidence interval was 6.36%. The residuals analysis showed that there were some outliers; the biggest residual occurred in January 1992.
Normality was rejected at a 5% significant level, considering the Kolmogorov-Smirnov test, since the p-value was 2.8%. However, the Gaussian distribution was not rejected (K-S p-value=20.0%) when the largest outliers were not considered. As state space models are associated to the Kalman filter, in general, the impact of the outliers turns out not to be significant, since they are attenuated through the filtering procedure.
As can be seen in Figure 14, a large proportion of the observations outside the respective empirical confidence interval belonged to the first years of the series. This may indicate the existence of different structures in the observable variables, which can be found through the analysis of change points in the state variables. This issue is discussed in the following section.
5.4. Change point detection
The detection of structural changes in environmental data is one of the most interest issues in the applied statistical analysis. In the past decade, there has been significant progress within change point detection. However, these methodologies still require more complex statistical procedures or computational requirements. For example, in , a simple BIC-like multiple change point penalty is proposed that is based on the total number of change points. For further recent developments regarding change point detection in environmental applications, [29,30].
The state space modelling approach adopted in this work allows for applying standard change point detection procedures, since this model incorporates several data characteristics. Thus, as these properties of environmental data are taking into account in the modelling procedure, simple change-point detection techniques can be applied if adequately chosen.
This subsection deals with change point detection in the state process instead of the observed process This detection is made through the filtered predictions of the calibration factors, , obtained in the filter procedure. It is reasonable to investigate changes in the structure of that may be relevant to the water-monitoring process. For this purpose, the filtered predictions of the state process can be analysed in relation to the existence of one change (or more) in its mean as the best prediction of an AR(1) process.
To do so, we used maximum type test statistics, referred in , alongside their correct adaptation to an autoregressive process. The basic test evaluates the existence of a change point in the mean of independent and identically distributed Gaussian variables in an unknown time.
The hypotheses of the basic test are:
are independent and distributed according to the same Gaussian distribution
are distributed according to and are distributed according to , with .
The test for the existence of the change point can be performed by a statistic test known as the maximum type. Since the variance is generally unknown, this approach consists of a sequence of t tests and the analysis of the significance of their maximums. Thus, the test statistic is given by:
The null hypothesis is rejected if the statistics are larger than a corresponding critical value. However, the calculation of the exact critical value is not easy, due to the test statistic being the maximum of the sequence , ,..., . Nevertheless, approximated critical values can be obtained by different methods, namely:
the Bonferroni inequality and its improvement;
Table 4 replicates the approximated critical values obtained through simulation and presented in . However, these critical values assume that the observations are uncorrelated and according to , it is necessary to correct them. Thus, in order to obtain the corrected critical value that takes into account the time correlation of an AR(1), the initial critical value must be multiplied by .
The simplest method for detecting multiple change points is through binary segmentation.
The first work to propose binary segmentation within a stochastic process setting was . Binary segmentation is a generic technique for multiple change-point detection in which, initially, the entire dataset is searched for one change-point, typically using a CUSUM-like procedure. When a change-point is detected, the data are split into two (hence, ‘binary’) sub-segments, defined by the detected change-point. A similar search is then performed on each sub-segment, possibly resulting in further splits. The recursion on a given segment continues until a certain criterion is satisfied .
|5% critical value||1% critical value|
In order to apply the change point detection procedure to the state filtered prediction of Carvoeiro, it was necessary to interpolate the critical value at 5% to , which is 3.207. Taking into account the autoregressive parameter estimate presented in Table 3, the corrected critical value at 5% was noted as 5.079. Figure 15 represents the observed values of the statistics , , as well as both corrected and uncorrected critical values at 5% in the first step of the binary segmentation procedure. As can be seen, the null hypothesis is rejected, i.e., a statistically significant change point that can be identified as April 1990.
Table 5 presents the results from the complete binary segmentation procedure, which ended when the observed value of the test statistics was less than the respective corrected critical value (the critical value was obtained by interpolation according to the sub-sample size). This procedure identified two change points: April 1990 and December 1991. Note that in the final step of the binary segmentation procedure, the largest value of the statistic tests was 3.202, which would lead to the detection of another change point if the correction to the AR(1) process had not been considered, since the uncorrected critical value was 3.200. However, considering the AR(1) correction, the critical value was 5.069, i.e., the observed statistic test was not statistically significant.
|Month/Year||Critical value||Critical value|
The complete binary segmentation procedure identified two significant change points that corresponded to three different levels for the mean of the process . In chronological order, the mean levels of the process were 1.190, 0.835 and 1.005. This means that during the first period of time, the DO variable was on average 19% higher than those predicted by the regression component. In the second period of time, the observations were on average 16.5% lower, while in the last period of time, observations were on average close to the regression model's predictions.
The identification of statistically significant change points provides information that can be useful to environmental investigators or technicians. Nevertheless, from a statistical point of view, this information allows for obtaining more accurate models with better adjustments. Thus, we must consider the state space model defined in subsection 5.3, but only where the state process has three average levels, one for each sub-series identified in the change point procedure.
Thus, taking three means , and according to the identified change points of April 1990 and December 1991, the percentage of observations outside of the respective empirical confidence interval, at 95% level, for the forecasted one step-ahead of the DO concentration in the new model is 5.30% instead of 6.36% in the first model. On the other hand, the final model exhibited a determination coefficient equal to instead of in the first model.
The statistical analysis of water quality data is a relevant tool for the monitoring of ecosystems and water systems for human consumption. The information contained in water variable time series allows studying the past but also enables monitoring the present and planning the future. Biochemical analyses of water must be accompanied by statistical studies in order to identify patterns and analyse the evolution of water quality in its different forms.
This work presented a characterization of the River Vouga Basin in Portugal and focused primarily on the dissolved oxygen concentration variable. The complexity of databases renders more difficult to perform consistent statistical analyses of water quality. Statistical issues were presented and discussed in this paper to make their implementation easier for other researchers. Several problems were discussed and some solutions were provided, always keeping simplicity and practical relevance in mind.
Authors were partially supported by Portuguese funds through CIDMA – Center for Research and Development in Mathematics and Applications and the Portuguese Foundation for Science and Technology (FCT – Fundação para a Ciência e a Tecnologia") within project PEst-OE/MAT/UI04106/2013.