## 1. Introduction

The diffusion of a pollutant in a geographical space, is a phenomenon whose main characteristics are the following: it is driven by complex physical and chemical laws influenced by local environmental parameters, it spreads out in a three dimensional space and, finally it is of dynamic nature. Two main approaches may be proposed to allow a spatial estimation of population exposure:

a “deterministic” approach which transforms estimated emissions derived from inventories into atmospheric concentrations (Hauglustaine (1998), and

a “probabilistic” approach which consists of spatializing concentrations obtained from an air quality monitoring network.

Predicting the pollutant concentration at a given point in space at a given time and assigning to that value an indication of its reliability is out of the scope of entirely deterministic models. By choosing a probabilistic framework geostatistics opens the way to models that can be characterized by actual observations for providing practical answers to a large variety of questions specifically related to risk analysis. Typical questions addressed by this approach are: “what is the probable value of the pollutant concentration at a specific location without measurement?”, “what is the probability that the admissible pollution level is reached at the same point?”. We must pay attention that unlike deterministic modelling, a sufficient number of measures is required to ensure reliable mapping.

This framework aims at considering the phenomenon under observation as an outcome of a random process (Matheron,1989). This process is represented by a random function Z(x, y, z, t) where x, y and z are the coordinates in a geographical three dimensional space and t is the time axis. For simplicity we will illustrate the intention of this article by reducing the work space to two dimensions, hence the random function is now noted Z(x,y). By reintroducing the third dimension of the space and especially the temporal component, new difficulties arise for specifying completely the models from actual data. These will be discussed later.

Geostatistics generally proposes that the whole set of possible random functions is restricted to stationary random functions, particularly regarding their mean and variance values.

Even with this important restriction the geostatistical framework is richer than other possible models, like, for instance, the purely statistical models which suppose that measures taken at two different points are not correlated. Clearly the pollution at two different location is likely to be more similar when these points are close than at a larger distance. Characterizing that correlation is at the centre of the geostatistical approach.

The methods developed under the geostatistical framework are many and depend on the nature of the question. Simple issues can be easily solved by linear geostatistical methods. When the questions are more complex, particularly when dealing with risk analysis, non linear geostatistical methods are applicable.

The outline of the chapter is:

reminders of the most important concepts underlying the geostatistical methods.

necessity of non linear methods

discussion on what is achievable to estimate without resorting to simulations

example of the estimation of the population exposed to a pollution

example dealing with a temporal issue

## 2. Reminders on the models in geostatistics

Characterising a spatial distribution in the general case is complex, because of the numerous parameters that would be necessary to infer from the available measures. In order to get a map of a property measured at some points over a given area, geostatistics proposes a simple method called kriging[1] - named after Dr Krige, who first pointed out the drawbacks of traditional estimation methods of gold deposits in South Africa. The method aims to interpolate the value at any point by assigning weights depending on the correlation with the data measures (Chilès, 1999). The correlation is quantified by means of the variogram[1] - that can be estimated from bi-point statistics of the data (Cressie, 1998; Goovaerts, 1997). By introducing other measured parameters or factors that may have an influence on the parameter of interest the result of the interpolation can be obtained as a linear combination of all parameters at data point referred to as cokriging (Jeannée, 2006). Like any linear interpolator, kriging fails in providing more information related to confidence levels or risk. In order to go ahead in that direction non linear geostatistics based on Gaussian models has been developed. The particular case of data whose distribution is Gaussian is intensively used in statistics as well as in geostatistics for the simple reason that the density of probability only depends on two parameters, i.e. the mean and the variance. Introducing the idea of Gaussian spatial distribution, the additional knowledge of the variogram will entirely characterise the probabilities at several points simultaneously. When the distribution of values is not Gaussian, which is the case most of the time, we come back to the Gaussian case by means of a transform of the data measures, by a process called normal score transform. The variable of interest Z(x), x designating the location of a point in the space, can be written as the transform by the Gaussian anamorphosis function, noted

When the anamorphosis function

The probabilistic model gives also access to statistical quantities like quantiles. The order

Since the anamorphosis

In an analogous way the confidence interval for the statistical risk

In practice we will be interested to the distribution law of a Random Function in a given point, conditioned by the available measures at the experimental stations. For a Random Function of Gaussian spatial law, that conditional distribution is simply:

In absence of any measurement error, at any data location α we have_{2} in the example presented hereafter) it allows to account for the higher pollution at the town centre than at the periphery.

In an analogous way, the confidence interval at the statistical risk level

## 3. Pros and cons of linear estimation methods

Let’s take the example of the annual mean concentration in NO_{2} in the city of Rouen (France). If we imagine we had a perfect knowledge of the actual pollution, the map would resemble that of figure 1 (which is in fact a simulation), showing the high spatial variability of NO_{2} from one point to another.

The available data consist of samples measured on passive diffusive tubes (Roth, 2001), that have been located in the Rouen city and suburbs during the year 2000 for 6 fortnights in total. These measures have been made on 89 sites spread over an area of 300 km^{2}. On the maps (Fig.1, 2, 5, 6 and 7) only the central area of 9x9 km^{2} with 59 sites equipped with diffusive tubes is represented. Public health regulations require experts to evaluate the probability to overcome the annual threshold of 40µg/m^{3} as well as the number of inhabitants potentially exposed to excessive pollutant concentrations. We will assume that the threshold applies to the concentration at any point of the space and that the measuring period is large enough to neglect the temporal term of the estimation error.

A first solution would be to predict the pollution concentrations at unsampled locations. Among different interpolators, the techniques based on kriging provide the best linear interpolator in the sense that they minimise the variance of the estimation error. To be specific we have used in this example a co-kriging (extension of kriging for processing several correlated variables) with a co-factor combining information related to pollution sources and the spatial distribution of the inhabitants. The resulting map (Fig. 2.) provides a simplified image of the real phenomenon. If it reproduces the main tendencies of the pollution phenomenon, it fails in describing the variations of the NO_{2} concentration at a scale of a few hundred metres, i.e. inferior to the spacing between the measures (1km).

By looking at both maps (Fig. 1 and 2) it is immediately visible that applying a threshold (here 40µg/m^{3}) on the estimated concentrations would lead to an under-estimation of the “contaminated surface” with respect to the right surface obtained from the threshold applied on the real, but unknown, values. The bias cannot be neglected: the contaminated area is estimated as 352 ha whereas its real surface equals 569 ha. The underestimation is observed because the threshold is relatively high with respect to the average level of the measured concentrations. At the contrary for a low threshold, we would have observed an over-estimation. The figure 3 shows that the histogram of the estimated concentrations is more squeezed around the average values, that is known as the smoothing effect of kriging, and that the bimodal distribution is not kept on the estimated values.

The scatter diagram (Fig. 4) between the “reality” and its estimation illustrates the presence of an error due to the interpolation procedure. This error can only be diminished by adding information, i.e. other measures. The uncertainty resulting from that error can be represented by the notion of statistical risk.

The methods derived from kriging are by nature inadequate for reproducing the actual variability, because the kriging methods aim to minimise the difference between the actual value and its interpolation: the interpolated values by kriging include necessarily less extreme values and more intermediate values, conveying the so-called smoothing effect of kriging. What is at stake is the confusion between the unknown map of the actual concentration and the available maps that are only estimates of the reality. Contrary to a topographical map (which can be considered as “exact”) every map of a pollutant calculated from a sampling will never be the reality, is it resulting from geostatistical methods or from physic-chemical models. Those maps will only provide an image more or less close to the reality.

## 4. Non linear estimation or simulations?

In real situations we will never have access to the actual phenomenon which is unique and largely unknown. The geostatistical framework proposes to consider that phenomenon as a realisation – among other equally possible – of a random phenomenon characterised by its spatial structure. In our example it amounts to say that the real phenomenon could resemble one of the representations of figure 5, analogous to the initial representation from a general point of view while differing notably on detailed local features.

Reality being unknown at most locations, we use a model of a random function, from which it is possible to derive statements correct in probability. We will favour the models that, after non linear operators (like applying a threshold) produce non biased results. The maps of figure 5 have been obtained by simulating outcomes of a Gaussian random function back-transformed by an anamorphosis function. The principle of geostatistical simulations consist in drawing outcomes of the random function, keeping the statistical distribution and the spatial structure. The conditional simulations will recreate the measured values at the experimental data points, i.e. is an exact interpolator. When many simulations are available, the probability of exceeding a threshold can be empirically calculated at every point by computing the proportion of the outcomes that exceed the threshold.

Nevertheless in the framework of the model, the probability of exceeding any threshold can be directly calculated without resorting to simulations, by just applying the distribution law of the annual concentration Z conditional to the data according to equation (1). The advantage of this approach is that the calculation is analytical and straightforward.

In the example of the pollution by NO_{2} in Rouen, the Gaussian model provides at any location the probability of exceeding the threshold of 40µg/m^{3}. The studied area has a surface S of 81km^{2} (i.e. 8100 ha). By multiplying the spatial mean of the probability by S we get an unbiased estimate of the surface probably exposed to the pollution. We get then a surface of 583 ha, close to the true surface of 569 ha (Fig.3.), instead of 352 ha calculated from the kriged map.

If the estimated concentration is less than the threshold, it does not mean that the risk of exceeding the threshold is nil, because of the estimation error. Conversely it may happen that the estimated concentration is above the threshold while the actual concentration is less. In practice the estimation error is unknown but the probabilistic model gives access to its mean (zero because of the unbiasedness condition applied in kriging) and its variance. The calculation of the probability of exceeding a threshold takes into account that variance or more precisely the estimation variance

threshold and the estimation is not precise the probability of exceeding threshold tends to increase. An intermediate probability reveals the uncertainty due to the estimation error, particularly when the estimated value is close to the threshold and the estimation is not precise. The map of the probability can be used to classify the different zones of the urban area with regard to the levels of risks and taking into account the uncertainty in the knowledge of the actual pollution. Such maps are a useful and can help in making decisions related to the protection of the population from pollution.

## 5. Evaluation of the population exposed to the pollution

More complex questions, like evaluating functions depending on several variables at several points simultaneously, require a simulations approach. Let’s take as an example the case of the computation of statistics on the population exposed to a threshold. For sake of simplification we will not consider the movement of the population during the year.

The number of inhabitants *b* _{i }is supposed to be known on a fixed grid, the numbers concerned by the exceeding of threshold *s* is*.*

For the risk δ related to the local exceeding of the threshold, the exposed population is estimated by

This calculation is based on adding up the population of each cell independently. This is a naïve way to proceed.

In fact the quantiles of the exposed population *B(s)* require the spatial law of the concentration. They must be calculated by means of simulations (Deraisme, 2002) because it is mandatory to take into account the probability to exceed the threshold simultaneously in different locations. The spatial distribution of the exposed population is characterized by the simulation of the concentration on a regular grid. For a given simulation the exposed population is simply obtained by adding the inhabitants living in the cells where the simulated concentration exceeds the threshold of 40µg/m^{3}. By repeating this computation on each of the 100 simulations we get a reliable representation of the distribution of the exposed population (Fig. 8). It should be noted that carrying out a large number of simulations is leading to computationally heavy.

For the statistical risk of 10%, the exposed population is estimated as 33 480 inhabitants: the probability that the exposed population is more than 33 480 inhabitants is 0.1. In other words, there is a 90% chance that 33 480 inhabitants are exposed to the excess of the threshold. Another way to say it is that this value represents the quantile at 90% of the exposed population.

The first “naïve” calculation, multiplying the population map with the probability map truncated at the risk of 10%, would have given 47 880 inhabitants, all living in zones where the probability of exceeding the threshold is above 0.9. The significant difference is due to the fact that, in the first case, the risk is applied to the quantity we want to evaluate, i.e. the exposed population, while in the “naïve” way the statistical risk is applied on the NO_{2} concentration independently of the population. The quantile of the exposed population depends on the probability to exceed the threshold together at different locations, which is ignored by the “naïve” calculation.

For more simple calculations, such as the average population exposed to the pollution, simulations are not necessary. We get the same result, i.e. 28150 inhabitants, by multiplying the population by the probability map (without any cut) and summing up the contribution of each cell. Averaging simulations and calculating directly the conditional expectation are equivalent.

## 6. Introducing the temporal factor

In the previous example only the spatial variability has been taken into account in the risk evaluation. Yet the variability in time is an important factor inherent to the definition of some official regulations.

In the following example we will examine the calculation of the quantile of daily mean values of SO_{2} concentrations on the urban area of Le Havre. For that study, twenty four analysers gave daily records of the SO_{2} concentration for a year. The aim is to identify the zones where the daily concentration exceeds the level of 250 µg/m^{3}, prescribed by the regulations, for more than 4 days per year, i.e. 1% of the year. In other words we will look at the spatial distribution of the quantile at 99% of the daily SO_{2} concentration.

The evaluation of a quantile of high order requires that the measures recorded every day are complete for all data locations and all day of the year. At least we must be certain that the missing values are inferior to the threshold.

A simple method consists in calculating the quantile at 99% on each analyser and in modelling its spatial distribution, by means of geostatistical simulations of that quantile. (Fig.9).

This simplified method ignores the variability in time: for a given quantile we will get the same results whether the pollution peaks are synchronous between the different analysers or not. Without fully characterising all complex correlations in space and time, an

intermediate solution consists of simulating the daily concentrations for all days of a year, independently for each day (see first step of the figure 10). An automatic procedure providing 200 simulations has been used. We have obtained in total 365x200= 73000 draws of the daily concentration at regular locations of a grid covering the urban area. The results have been validated by analysing the distributions and spatial correlations for several specific situations.

A map of the quantiles at the 99% frequency has been calculated for each simulation of the 365 days as shown in the second step of the scheme of figure 10. By comparing the 200 quantiles to the threshold we can then get the probability of exceeding the threshold (Fig.11): at each location it is the number of occurrences where the simulated quantile is above the threshold divided by the total number of simulations.

Both maps of the figures 9 and 11 look, at first glance, rather close, however by computing the difference between both calculations, important discrepancies can be observed (see Fig. 12).

The correlation between the mean quantile calculated from 200 simulations and the average of 20 direct simulations of the quantile (Fig.13) shows clearly the bias due to the implicit hypothesis that the pollution events occur simultaneously: the direct simulation of the quantile gives almost systematically values above the quantile calculated from the daily simulations of the SO_{2} concentration.

This schematic example illustrates the risk of important bias generated by excessive simplifications. It still remains that the correlation in time should also be introduced.

## 7. Conclusion

Geostatistics provides a rigorous conceptual frame work for modelling phenomena spreading in space and time. The application of the geostatistical methods requires some care because, if they are not adapted for answering appropriately the problem, the consequences on the results may be very significant. An effective limitation lies in the use of kriging techniques which provide an optimal solution to estimate quantities depending linearly on the variable of interest. However for evaluating the risk of atmospheric pollution, applying a threshold on the pollutant concentration is typically a non linear operation. If the threshold is applied on the kriged map the calculations will result in a potentially importantl bias. More complex models are then required, they are enhanced by non linear geostatistical methods.

The calculation of the conditional probability using the Gaussian model makes the solution of simple problems quite straightforward: delineation of the geographical area concerned by the pollution or evaluation of the average of the number of inhabitants exposed to it.

The conditional simulations provide a consistent solution and the most adequate results for the evaluation of the population exposed to pollution. These techniques are used to evaluate health risks. When dealing with simulations a sufficient number of outcomes must be generated, particularly to derive high order quantiles. The problems become still more complex when the time dimension of the pollution events has to be taken into account, such as the control of the number of days when the pollutant concentration exceeds a threshold. The simulations provide a correct solution but are heavy in computations. The example on SO_{2} in Le Havre is favourable because the data are available at the same locations all over the year. By performing the simulations independently for each day, the most important bias can be avoided. Nevertheless a complete spatial and temporal model would have the theoretical and practical advantage of handling globally the whole set of measures. Further work is required before being able to apply such models. The present regulations in European countries often refer to complex criteria combining different time scales. The power of simulation techniques is not, in theory, limited regarding this complexity, as soon as the model correctly represents the pollutant concentrations at the origin of the pollution events.

The applications illustrated here have been successfully carried out because the criteria were simple (a threshold applied on one pollutant for a given period of time). When the criteria are multiple (several pollutants, several time scales) the simulations will definitely be the only possible method, but computationally heavy. The need for better models, particularly dealing with the temporal aspect, is still important.

## Acknowledgments

The data used for illustrating applications of geostatistical methods have been kindly provided by Air Normand, an organization in charge of the air quality control in the Normandy region of France. The computations have been performed using the Isatis software marketed by Geovariances.

## Notes

- Kriging is a linear combination of data with weights chosen for minimizing the variance of the estimation error.
- The variogram represents the variance of the difference between two measures as a function of the distance between these measures.