PM10 Hazard level evaluation over selected 7 counties
Particulate matter (PM) refers to solid particles and liquid droplets found in air. Many man-made and natural sources produce PM directly, or produce pollutants that react in the atmosphere to form PM. The resulting solid and liquid particles come in a wide range of sizes, and particles that are 10 micrometers or less in diameter (PM10) can be inhaled into and accumulate in the respiratory system, where they are believed to pose health risks (Environmental Protection Agency, 2010). Particulate matter is one of the six primary air pollutants that the Environmental Protection Agency (EPA) regulates, because exposure to high outdoor PM10 concentrations causes increased disease and death (Environmental Protection Agency, 2010). Therefore PM10 concentrations, among many other air pollutants, are sampled and measured at various places in California, United States.
The general trend of PM concentrations in the air in California is decreasing, but PM continues to be monitored and observed. The California standard for PM10 is an annual arithmetic mean of 20 µg/m3, and the national standard was 50 µg/m3 before 2006 (California Environmental Protection Agency Air Resources Board, 2010; Environmental Protection Agency, 2010). The State of California sets very high standards for its air quality, and air pollutants are carefully monitored.
In reality, however, it is too costly in terms of time, money, and manpower to keep all 213 sites monitoring and recording. Fig. 1 shows a complete map of all 213 sample locations for PM10. One must note that these sample sites are never all used in any given year; PM10 samples are taken at different locations each year. At best, 102 PM10 samples are collected in a year, and at worst only 61. Comparisons of PM10 between years are therefore difficult, because of missing data at sample sites. It is also difficult to construct annual kriging maps from actual observations, since the air pollutants were measured at different locations each year, although the originally planned site design was statistically quite careful.
Each year, approximately 40% of the 213 sites were actually observed. We call a site that does not have a recorded PM10 value in a given year a "missing value" site. Since the missing values follow no pattern, they would seriously distort the kriging map constructions. This is clearly demonstrated in Fig. 2: in 1989, 61 PM10 samples were collected (29% of the 213 locations), and in 2000, 94 PM10 samples were collected (44%).
The data scarcity raises a series of five fundamental issues for the spatial-temporal modelling and prediction of California PM10 data, namely:
(1) The necessity of recognizing the impreciseness in analyzing the spatial-temporal pattern of California PM10 records, which inevitably affects the soundness of a geostatistical analysis;
(2) Which theoretical foundations are appropriate for modelling impreciseness uncertainty;
(3) How to fill in the "missing value" sites so that "complete" records are available, where each record is either an annual average from the original observations recorded at the site (40%) or a value predicted from "neighbourhood sites" (60%), i.e., how to obtain spatial-temporal imprecise PM10 values by interpolation and extrapolation;
(4) How to estimate the parameters of the uncertain processes (temporal patterns), particularly the rate of change parameter;
(5) How to create annual kriging maps (19 maps) under spatial isotropy and stationarity assumptions, so that the changes can be analyzed through the kriging map of the difference between 2007 and 1989 and the kriging map of the location rate of change.
These issues will be addressed in the remaining sections sequentially.
2. The necessity of modelling impreciseness in California PM10 spatial-temporal analysis
Impreciseness is a fundamental and intrinsic feature of PM10 spatial-temporal modelling, owing to the shortage and incompleteness of observational data. Spatially, 213 sites are involved; temporally, PM10 observations were collected from 1989 to 2007, a 19-year period. Over this period only two sites (Site 2125 and Site 2804) have complete 19-year records. Sixteen sites have only one record (8%), and 70 sites have 10 or more records. A statistically meaningful time-series analysis requires at least 50 data points per site, so classical (probabilistic) time-series analysis cannot be performed. For a quick overall evaluation of the PM10 records at each site, we borrow an idea from statistical quality control (Electric, 1956, Montgomery, 2001). We do not apply the traditional 6-sigma rule; rather, we classify the PM10 records into four groups: 1-, 2-, 3-, 4-. The four group limits in Table 1 reflect the national standard (50 µg/m3) and the California state standard (20 µg/m3). For example, 1- is for a location whose PM10 values fall between 5 and 20 µg/m3.
|County name||1-||2-||3-||4-||No. of Sites|
One must be aware that the classification is not absolute; additional rules are added, similar to quality control chart pattern analysis (Electric, 1956):
(1) if a site has a single record, classify the site hazard level according to the group in which that record falls; (2) if a site has a sequence of records, some of them, particularly the early ones, may fall in a higher (or lower) hazard level, but if the last three records fall in a lower (or higher) hazard level, the later level is chosen for the site.
Additional rule 1 can be attributed to confirmation by expert's knowledge, while additional rule 2 can be regarded as an expert's decision based on the trend pattern.
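The two additional rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text gives the 5 to 20 µg/m3 range for group 1 only, so the upper bounds 35 and 50 used below for groups 2 and 3 are hypothetical placeholders, and the fallback used when the last three records disagree (the most frequent level) is also an assumption, since the text does not say what happens in that case.

```python
from collections import Counter

# Assumed upper bounds (µg/m3) of hazard levels 1 to 3; level 4 is anything above.
# Only the 20 µg/m3 bound of level 1 comes from the text; 35 and 50 are placeholders.
ASSUMED_UPPER_BOUNDS = (20.0, 35.0, 50.0)


def level_of(pm10, bounds=ASSUMED_UPPER_BOUNDS):
    """Return the hazard level (1 to 4) of a single annual PM10 value."""
    for level, upper in enumerate(bounds, start=1):
        if pm10 <= upper:
            return level
    return len(bounds) + 1


def classify_site(records, bounds=ASSUMED_UPPER_BOUNDS):
    """Classify a site from its chronological list of annual PM10 values."""
    if not records:
        raise ValueError("site has no PM10 records")
    levels = [level_of(value, bounds) for value in records]
    # Rule (1): a single record decides the level directly.
    if len(levels) == 1:
        return levels[0]
    # Rule (2): if the last three records share one level, that later level wins.
    last_three = levels[-3:]
    if len(last_three) == 3 and len(set(last_three)) == 1:
        return last_three[0]
    # Fallback (assumption, not stated in the text): the most frequent level.
    return Counter(levels).most_common(1)[0][0]
```

For instance, a site whose early records exceed 50 µg/m3 but whose last three records lie below 20 µg/m3 is assigned level 1 by rule (2).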
Fig. 3 shows the classifications of seven sites from the 7 counties selected in Table 1; one site is picked from each county for illustration. A red plot indicates hazard level 1, a green plot hazard level 2, a purple plot hazard level 3, and a black plot hazard level 4.
It is evident that, facing the impreciseness caused by incomplete data recording, one has to rely on expert's knowledge to compensate for the inadequacy and inaccuracy of the collected observational data. Impreciseness refers to a term with a connotation specified by an uncertain measure or an uncertainty distribution for each of the actual or hypothetical members of an uncertainty population (i.e., a collection of expert's knowledge). An uncertain process is a repeating process whose outcomes follow no describable deterministic pattern but follow an uncertainty distribution, such that the uncertain measure of the occurrence of each outcome can only be approximated or calculated.
Uncertainty modelling without a measure specification has no rigorous mathematical foundation, and consequently the modelling exercise is baseless and blind. In other words, measure specification is a prerequisite to spatial-temporal data collection and analysis. For example, without Kolmogorov's (1950) three axioms of probability measure, randomness is not defined and thus statistical data analysis and inference have no foundation at all.
Definition 2.1: Impreciseness is an intrinsic property of a variable or an expert's knowledge being specified by an uncertain measure.
It is therefore necessary to seek an appropriate form of uncertainty theory to meet the impreciseness challenges. In the theoretical basket, interval uncertainty theory (Moore, 1966), fuzzy theory (Zadeh, 1965, 1978), grey theory (Deng, 1984), rough set theory (1982), upper and lower previsions (or expectations) (Walley, 1991), or Liu’s uncertainty theory (2007, 2010) may be chosen.
Imprecise probability theory (Utikin and Gurov, 1998) may be a typical answer to the inaccuracy and inadequacy of observational data. However, imprecise probability based spatial modelling requires overly heavy assumptions. As Utikin and Gurov (2000) commented, “the probabilistic uncertainty model makes sense if the following three premises are satisfied: (i) an event is defined precisely; (ii) a large amount of statistical samples is available; (iii) probabilistic repetitiveness is embedded in the collected samples. This implies that the probabilistic assumption may be unreasonable in a wide scope of cases.” Guo et al. (2007) and Guo (2010) did attempt to address spatial uncertainty from the point of view of fuzzy logic and, later, of Liu's (2007) credibility theory.
Nevertheless, Liu’s (2007, 2010) uncertainty theory is the only one built on an axiomatic uncertain measure foundation and fully justified with mathematical rigor. Therefore it is logical to engage Liu’s (2007, 2010, 2011) uncertainty theory for guiding us to understand the intrinsic character of imprecise uncertainty and facilitate an accurate mathematical definition of impreciseness in order to establish the foundations for uncertainty spatial modelling under imprecise uncertainty environments.
3. Uncertain measure and uncertain calculus foundations
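A key concept in uncertainty theory is the uncertain measure, which is a set function defined on a sigma-algebra generated from a non-empty set. Formally, let $\Gamma$ be a nonempty set (space), and $\mathcal{L}$ the $\sigma$-algebra on $\Gamma$. Each element, let us say, $\Lambda \in \mathcal{L}$, is called an uncertain event. A number denoted as $\mathcal{M}\{\Lambda\}$, $0 \le \mathcal{M}\{\Lambda\} \le 1$, is assigned to event $\Lambda$, which indicates the uncertain measuring grade with which event $\Lambda$ occurs. The normal set function $\mathcal{M}$ satisfies the following axioms given by Liu (2011):
Axiom 1: (Normality) $\mathcal{M}\{\Gamma\} = 1$.
Axiom 2: (Self-Duality) $\mathcal{M}$ is self-dual, i.e., for any $\Lambda \in \mathcal{L}$, $\mathcal{M}\{\Lambda\} + \mathcal{M}\{\Lambda^{c}\} = 1$.
Axiom 3: (Subadditivity) $\mathcal{M}\{\bigcup_{i=1}^{\infty} \Lambda_i\} \le \sum_{i=1}^{\infty} \mathcal{M}\{\Lambda_i\}$ for any countable event sequence $\{\Lambda_i\}$.
Axiom 4: (Product Measure) Let $(\Gamma_k, \mathcal{L}_k, \mathcal{M}_k)$ be the $k$-th uncertain space, $k = 1, 2, \ldots, n$. Then the product uncertain measure $\mathcal{M}$ on the product measurable space $(\Gamma, \mathcal{L})$, with $\Gamma = \Gamma_1 \times \cdots \times \Gamma_n$ and $\mathcal{L} = \mathcal{L}_1 \times \cdots \times \mathcal{L}_n$, is defined by
$$\mathcal{M} = \mathcal{M}_1 \wedge \mathcal{M}_2 \wedge \cdots \wedge \mathcal{M}_n.$$
That is, for each product uncertain event $\Lambda = \Lambda_1 \times \cdots \times \Lambda_n$ (i.e., $\Lambda_k \in \mathcal{L}_k$, $k = 1, \ldots, n$), the uncertain measure of the event is
$$\mathcal{M}\{\Lambda\} = \min_{1 \le k \le n} \mathcal{M}_k\{\Lambda_k\}.$$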
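To make the first three axioms concrete, the following Python sketch checks normality, self-duality, and subadditivity for a set function on a three-element space. The space, the event names `g1`, `g2`, `g3`, and the measure values are illustrative assumptions chosen for this example; they are not taken from the PM10 study.

```python
import math
from itertools import combinations

GAMMA = frozenset({"g1", "g2", "g3"})

# Illustrative measure values on the power set of GAMMA (assumed for this example).
MEASURE = {
    frozenset(): 0.0,
    frozenset({"g1"}): 0.6,
    frozenset({"g2"}): 0.3,
    frozenset({"g3"}): 0.2,
    frozenset({"g1", "g2"}): 0.8,
    frozenset({"g1", "g3"}): 0.7,
    frozenset({"g2", "g3"}): 0.4,
    GAMMA: 1.0,
}


def is_normal(measure, gamma):
    """Axiom 1: the whole space has measure 1."""
    return math.isclose(measure[gamma], 1.0)


def is_self_dual(measure, gamma):
    """Axiom 2: an event and its complement have measures summing to 1."""
    return all(math.isclose(measure[e] + measure[gamma - e], 1.0) for e in measure)


def is_subadditive(measure, tol=1e-12):
    """Axiom 3: the measure of a union never exceeds the sum of the measures."""
    events = list(measure)
    for size in range(2, len(events) + 1):
        for family in combinations(events, size):
            union = frozenset().union(*family)
            if measure[union] > sum(measure[e] for e in family) + tol:
                return False
    return True
```

Note that the singleton measures sum to 1.1 rather than 1, so this set function is not a probability measure, yet it satisfies the three axioms checked here.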
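Definition 3.2: (Liu, 2007, 2010, 2011) An uncertainty variable $\xi$ is a measurable function from an uncertainty space $(\Gamma, \mathcal{L}, \mathcal{M})$ to the set of real numbers, i.e., for any Borel set $B$ of real numbers, $\{\xi \in B\} = \{\gamma \in \Gamma : \xi(\gamma) \in B\} \in \mathcal{L}$.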
Remark 3.3: Parallel to the revelation of the connotation of randomness in geostatistics, impreciseness occupies a fundamental position in geospatial-temporal uncertainty statistical analysis. In the California PM10 spatial-temporal study, nearly 60% of the site-year records are missing, so that in order to fill the "missing" observations we have to engage expert's knowledge to pursue "complete sequences" (i.e., to have 19 PM10 values at each individual site), which is inevitably imprecise and incomplete. Impreciseness here refers to a term with an intrinsic property governed by an uncertainty measure or an uncertainty distribution for each of the actual or hypothetical members of an uncertainty population (i.e., a collection of expert's knowledge). An uncertainty process is a repeating process whose outcomes follow no describable deterministic pattern but follow an uncertainty distribution, such that the uncertain measure of the occurrence of each outcome can only be approximated or calculated.
Remark 3.4: Impreciseness exists in engineering, business and research practices due to measurement imperfections, or due to more fundamental reasons, such as insufficient available information,..., or due to a linguistic nature, because it is an unarguable fact that impreciseness exists intrinsically in expert’s knowledge on the real world.
Definition 3.5: Let $\xi$ be an uncertainty quantity of impreciseness on an uncertainty measure space $(\Gamma, \mathcal{L}, \mathcal{M})$. The uncertainty distribution of $\xi$ is
$$\Phi(x) = \mathcal{M}\{\gamma \in \Gamma : \xi(\gamma) \le x\}.$$
An imprecise variable $\xi$ is an uncertainty variable and thus is a measurable mapping, i.e., $\xi: \Gamma \to \mathbb{R}$. An observation of an imprecise variable is a real number (or more broadly, a symbol, an interval, a real-valued vector, a statement, etc.), which is a representative of the population or equivalently of an uncertainty distribution $\Phi$ under a given scheme comprising the set $\Gamma$ and the $\sigma$-algebra $\mathcal{L}$. The single value of a variable with impreciseness should not be understood as an isolated real number but rather as a representative or a realization from the uncertain population.
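Definition 3.6: (Lipschitz condition) Let $f$ be a real-valued function, $f: \mathbb{R}^n \to \mathbb{R}$. If there exists a positive constant $K$ such that, for any $x, y \in \mathbb{R}^n$,
$$|f(x) - f(y)| \le K\,\|x - y\|,$$
then $f$ is said to satisfy the Lipschitz condition with constant $K$.
Definition 3.7: (Lipschitz continuity) Let $f: \mathbb{R}^n \to \mathbb{R}$. If
$$|f(x) - f(y)| \le K\, d(x, y) \quad \text{for all } x, y \in \mathbb{R}^n,$$
where $d$ is some metric (for example, Euclidean distance in $\mathbb{R}^n$), then $f$ is Lipschitz continuous globally. If,
for each $x_0 \in \mathbb{R}^n$, the inequality holds with some constant $K(x_0)$ on the open ball $B(x_0, r)$ of center $x_0$ and radius $r > 0$, then $f$ is Lipschitz continuous locally.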
Remark 3.8: As a continuity requirement, Lipschitz continuity is stronger than continuity in the sense of Newton calculus, but weaker than differentiability in the Newton sense. In other words, Lipschitz continuity does not guarantee the existence of the first-order derivative everywhere, but neither does it mean nowhere differentiability. If the derivative $f'(x)$ of a function $f$ with Lipschitz constant $K$ exists somewhere, its value is bounded, since
$$\left|\frac{f(x+h) - f(x)}{h}\right| \le K$$
by recalling the definition of the Newton derivative
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}.$$
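Similar to the concept of a stochastic process in probability theory, an uncertain process $\{X_t, t \ge 0\}$ is a family of uncertainty variables indexed by $t$ and taking values in the state space $S \subseteq \mathbb{R}$.
Definition 3.9: Let $\{C_t, t \ge 0\}$ be an uncertain process such that (i) $C_0 = 0$ and almost all sample paths are Lipschitz continuous; (ii) $C_t$ has stationary and independent increments; (iii) every increment $C_{s+t} - C_s$ is a normal uncertainty variable with expected value 0 and variance $t^2$, i.e., the uncertainty distribution of $C_{s+t} - C_s$ is
$$\Phi(x) = \left(1 + \exp\left(-\frac{\pi x}{\sqrt{3}\,t}\right)\right)^{-1}.$$
Then $C_t$ is called a canonical process.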
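The normal uncertainty distribution of a canonical process increment is easy to evaluate numerically. The sketch below assumes the standard form of Liu's normal uncertainty distribution with expected value `e` and standard deviation `sigma`; the function names are hypothetical and chosen for this illustration.

```python
import math


def normal_uncertainty_cdf(x, e=0.0, sigma=1.0):
    """Normal uncertainty distribution: 1 / (1 + exp(pi * (e - x) / (sqrt(3) * sigma)))."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = math.pi * (e - x) / (math.sqrt(3.0) * sigma)
    if z > 700.0:  # exp(z) would overflow; the distribution value is numerically 0
        return 0.0
    return 1.0 / (1.0 + math.exp(z))


def canonical_increment_cdf(x, t):
    """Distribution of an increment over a time span t: expected value 0, variance t**2."""
    return normal_uncertainty_cdf(x, e=0.0, sigma=t)
```

The distribution is symmetric about 0, so `canonical_increment_cdf(0.0, t)` equals 0.5 for any positive `t`, and a longer time span `t` flattens the curve.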
Remark 3.10: Brownian motion $B_t$ in probability theory is continuous almost everywhere and nowhere differentiable, while Liu's canonical process $C_t$ is Lipschitz-continuous and, if $C_t$ is differentiable somewhere, the derivative is bounded. Therefore $C_t$ is smoother than $B_t$.
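Definition 3.11: Suppose $C_t$ is a canonical process, and $f$ and $g$ are given functions. Then
$$dX_t = f(t, X_t)\,dt + g(t, X_t)\,dC_t$$
is called an uncertain differential equation. A solution to the uncertain differential equation is the uncertain process $X_t$ satisfying it for any $t > 0$.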
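Remark 3.12: Since $dX_t$ and $dC_t$ are only meaningful under the umbrella of the uncertain integral, an uncertain differential equation $dX_t = f(t, X_t)\,dt + g(t, X_t)\,dC_t$ is an alternative representation of
$$X_t = X_0 + \int_0^t f(s, X_s)\,ds + \int_0^t g(s, X_s)\,dC_s.$$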
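Definition 3.13: Let $C_t$ be a canonical process. The uncertain differential equation
$$dG_t = \alpha G_t\,dt + \sigma G_t\,dC_t$$
has a solution
$$G_t = \exp(\alpha t + \sigma C_t),$$
which is called the geometric canonical process, where $\alpha$ can be called the drift coefficient and $\sigma$ can be called the diffusion coefficient of the geometric canonical process due to the roles played respectively.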
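Because $\ln G_t = \alpha t + \sigma C_t$ is a normal uncertainty variable with expected value $\alpha t$ and standard deviation $\sigma t$, the distribution of $G_t$ and its inverse have closed forms. The following Python sketch evaluates both, assuming $G_0 = 1$, $\sigma > 0$, and $t > 0$; the function names are hypothetical.

```python
import math


def geometric_canonical_cdf(x, t, alpha, sigma):
    """Uncertainty distribution of G_t = exp(alpha*t + sigma*C_t) at x > 0."""
    if x <= 0 or t <= 0 or sigma <= 0:
        raise ValueError("x, t and sigma must be positive")
    z = math.pi * (alpha * t - math.log(x)) / (math.sqrt(3.0) * sigma * t)
    if z > 700.0:  # avoid overflow in exp; the value is numerically 0
        return 0.0
    return 1.0 / (1.0 + math.exp(z))


def geometric_canonical_quantile(p, t, alpha, sigma):
    """Inverse uncertainty distribution of G_t at belief degree p in (0, 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if t <= 0 or sigma <= 0:
        raise ValueError("t and sigma must be positive")
    logit = math.log(p / (1.0 - p))
    return math.exp(alpha * t + sigma * t * math.sqrt(3.0) / math.pi * logit)
```

At `p = 0.5` the quantile reduces to `exp(alpha * t)`, so a negative drift coefficient gives a median path that decays geometrically in time.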
4. Spatial interpolation and extrapolation via inverse distance approach
Statistically, spatial interpolation and extrapolation modelling is a kind of linear regression modelling exercise, as in the kriging methodology. Considering the shortage of California PM10 data records, we utilize a weighted linear combination approach first proposed by Shepard (1968). The weights are the inverse distances between the missing value cell and the cells with actually observed PM10 values. The weight construction is a deterministic method, which is neutral and not linked to any specific measure theory. It is widely used in spatial prediction and map construction in geostatistics, but it is not probability oriented; rather, it is inspired by molecular mechanics. A unique aspect of geostatistics is the use of regionalized variables, which fall between random variables and completely deterministic variables. The weight of an observed PM10 value is inversely proportional to its distance from the location being estimated.
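A minimal Python sketch of inverse distance weighting follows. It is an illustration, not the VBA macro used in the study: it assumes planar site coordinates with Euclidean distance, and the `power` parameter is an assumption introduced here (1 matches the inverse distance weights described above, while 2 is a common choice in Shepard-type interpolation).

```python
import math


def idw_predict(target, observations, power=1.0):
    """Predict the value at `target` from (location, value) pairs by inverse distance weighting.

    target: (x, y) tuple in planar coordinates (an assumption of this sketch).
    observations: list of ((x, y), value) pairs with observed PM10 values.
    """
    if not observations:
        raise ValueError("at least one observation is required")
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in observations:
        distance = math.hypot(target[0] - x, target[1] - y)
        if distance == 0.0:
            return value  # the target coincides with an observed site
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total
```

For example, with one observation of 10 at distance 1 and another of 50 at distance 3, the weights are 1 and 1/3, and the prediction is (10 + 50/3) / (4/3) = 20.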
We wrote a VBA macro to carry out the interpolations and extrapolations that "fill" the 2408 missing value cells from the 1639 cells with PM10 values (213 sites over 19 years give 4047 cells in total). After the interpolations and extrapolations, every site has 19 PM10 values. To check whether the inverse distance approach gives accurate predictions for cells without an observed PM10 value, we performed a re-interpolation and re-extrapolation scheme: each true PM10 record is deleted in turn and then filled from the remaining records. The resulting mean of the squared errors is 59.885, which is statistically significant (asymptotically).
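The re-interpolation check described above is a leave-one-out scheme, which can be sketched as follows. The code is an illustration under the same assumptions as any simple inverse distance predictor (planar coordinates, inverse distance weights with an assumed `power` parameter); it does not reproduce the reported value 59.885, which depends on the actual PM10 records.

```python
import math


def idw(target, observations, power=1.0):
    """Inverse distance weighted prediction at `target` from ((x, y), value) pairs."""
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in observations:
        distance = math.hypot(target[0] - x, target[1] - y)
        if distance == 0.0:
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


def leave_one_out_mse(observations, power=1.0):
    """Delete each record in turn, predict it from the rest, and average the squared errors."""
    if len(observations) < 2:
        raise ValueError("at least two observations are required")
    squared_errors = []
    for i, (location, value) in enumerate(observations):
        remaining = observations[:i] + observations[i + 1:]
        prediction = idw(location, remaining, power)
        squared_errors.append((prediction - value) ** 2)
    return sum(squared_errors) / len(squared_errors)
```

With only two records of 10 and 30, each is predicted by the other, both errors are 20, and the mean squared error is 400.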
We plotted sites 2045, 2744, 2199, 2263, 2297, 2914, and 2248 (which appear in Fig. 3) in Fig. 4. Comparing Fig. 3 and Fig. 4, only Site 2744 changes hazard level (moving up to the next higher level), while the hazard levels of the other six sites are unchanged. This gives some justification for the inverse distance approach. Keep in mind that the aim of this article is to investigate whether the PM10 level changed over the 19-year period from 1989 to 2007. The change need not be accurately determined, only reasonably calculated, because of the impreciseness of the completed PM10 records.
5. Uncertain analysis of site temporal pattern
Once the interpolations and extrapolations by the inverse distance approach are completed, a "complete" data set is available, containing 4047 data records of 213 sites over 19 years. The next task is to model the uncertain temporal pattern of a given site. The "complete" data set obviously contains impreciseness uncertainty because of the interpolations and extrapolations. We cannot be sure that this impreciseness uncertainty is random uncertainty, so we continue to use uncertain measure theory for the temporal uncertainty modelling.
Recall that the Definition 3.13 in Section 3 facilitates a uncertain geometric canonical process,. Notice that may not fit the data reality so that we propose a modified uncertain geometric canonical process, with:
Recall the relevant definitions in Section 3, we have
But note that for,
Notice that the incrementis independent of, i.e., is independent of.
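since if $\xi_1$ and $\xi_2$ are independent uncertain variables with uncertainty distributions $\Phi_1$ and $\Phi_2$ respectively, then the joint uncertainty distribution of $(\xi_1, \xi_2)$ is $\Phi(x_1, x_2) = \Phi_1(x_1) \wedge \Phi_2(x_2)$. Hence we obtain the expression of: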
Then the "variance-covariance" matrix for uncertain vector
For the site, the regression model is
Then in terms of the weighted least square criterion we can define an objective function as
We further notice that
Then it is reasonable to estimate by
Furthermore, we notice that
Also, we can evaluate
in terms of numerical integration, Then an estimate for matrix is obtained:
Finally, we use the approximated objective function
to obtain a pair of estimates. This estimation process is repeated for each site until all 213 weighted least squares estimates are obtained.
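The general shape of a weighted least squares fit for one site can be sketched in Python. This is a simplified stand-in, not the estimator derived above: it assumes a log-linear model ln y_t = beta + alpha * t, where the intercept `beta` is a hypothetical parameter, and it uses diagonal weights only, whereas a full "variance-covariance" matrix would call for generalized least squares. The weights 1/t**2 in the usage note are likewise an assumption, motivated by the variance t**2 of a canonical process increment.

```python
import math


def fit_log_linear_wls(times, values, weights):
    """Weighted least squares fit of ln(value) = beta + alpha * time.

    Returns (alpha, beta), where alpha plays the role of the rate of change.
    """
    if not (len(times) == len(values) == len(weights)) or len(times) < 2:
        raise ValueError("need at least two points and equal-length inputs")
    logs = [math.log(v) for v in values]
    s_w = sum(weights)
    s_t = sum(w * t for w, t in zip(weights, times))
    s_z = sum(w * z for w, z in zip(weights, logs))
    s_tt = sum(w * t * t for w, t in zip(weights, times))
    s_tz = sum(w * t * z for w, t, z in zip(weights, times, logs))
    det = s_w * s_tt - s_t * s_t
    if det == 0.0:
        raise ValueError("time points must not all be identical")
    # Solve the two weighted normal equations for the slope and the intercept.
    alpha = (s_w * s_tz - s_t * s_z) / det
    beta = (s_z - alpha * s_t) / s_w
    return alpha, beta
```

For a site with years indexed t = 1, ..., 19, one might call `fit_log_linear_wls(times, values, [1.0 / t**2 for t in times])`; a negative `alpha` then indicates a geometric decline in PM10.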
Recall the definition of the coefficient $\alpha$: its sign and absolute value indicate the geometric change over the 19 years. Since the estimation procedure for $\alpha$ involves all the spatial-temporal information, it is reasonable to plot the estimates in a kriging map to reveal the overall changes over the 19-year period.
6. Kriging maps and time-change maps based on completed PM10 data
Kriging map presentation is vital for a geostatistician's visualization, since maps reveal hidden information and the whole picture. A sample statistic typically condenses widespread information into a single numerical point, whereas a kriging map is a map statistic (or a statistical map) that contains infinitely many values aggregated from limited "sample" information (i.e., observations). Kriging itself is not specifically probability oriented; it is another weighted linear combination prediction, but it requires more mathematical assumptions. In fuzzy geostatistics, a fuzzy kriging scheme has also been developed (Bardossy et al., 1990).
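The ordinary kriging (OK) predictor at an unobserved location $s_0$ is the weighted linear combination
$$\hat{Z}(s_0) = \sum_{i=1}^{n} \lambda_i Z(s_i),$$
where $s_1, \ldots, s_n$ are spatial locations with observation available and the coefficients $\lambda_i$ satisfy the OK linear equation system
$$\sum_{j=1}^{n} \lambda_j \gamma(s_i - s_j) + \mu = \gamma(s_i - s_0), \quad i = 1, \ldots, n, \qquad \sum_{j=1}^{n} \lambda_j = 1,$$
with $\mu$ a Lagrange multiplier.
The OK system is generated under the assumptions of an additive spatial model
$$Z(s) = m + \varepsilon(s),$$
where $m$ and $\varepsilon(s)$ are the unknown constant mean and the zero-mean random error function respectively. Accordingly, the variogram of the random error function is just defined by
$$2\gamma(h) = E\left[(\varepsilon(s+h) - \varepsilon(s))^2\right],$$
where $h$ is the separation vector between two spatial points $s$ and $s + h$; under the isotropy assumption, $\gamma$ depends on $h$ only through its length $\|h\|$.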
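As an illustration of how an ordinary kriging system is solved in practice, the sketch below builds and solves the standard OK linear system (variogram values between sites, a Lagrange multiplier, and weights summing to one) in plain Python. The exponential variogram with unit sill and range is an assumption made for the example; in an actual study the variogram model would be fitted to the data, and all function names here are hypothetical.

```python
import math


def exponential_variogram(h, sill=1.0, range_=1.0):
    """Assumed isotropic variogram model with zero nugget."""
    return sill * (1.0 - math.exp(-h / range_))


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("kriging system is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ordinary_kriging(target, sites, values, variogram=exponential_variogram):
    """Return (prediction, weights) at `target` from observed sites and values."""
    n = len(sites)

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    # Left-hand side: variogram values between sites, bordered by the unbiasedness row.
    a = [[variogram(dist(sites[i], sites[j])) for j in range(n)] + [1.0] for i in range(n)]
    a.append([1.0] * n + [0.0])
    # Right-hand side: variogram values between each site and the target.
    b = [variogram(dist(s, target)) for s in sites] + [1.0]
    solution = solve_linear(a, b)
    weights = solution[:n]  # the last entry is the Lagrange multiplier
    return sum(w * v for w, v in zip(weights, values)), weights
```

The weights always sum to one, and with a zero nugget the predictor reproduces the observed value exactly at an observed site.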
The 213 observation sites now each have 19 years of PM10 values, so a "complete" data set of 4047 data records is available, and 19 ordinary kriging prediction maps are generated for comparison. Fig. 5 shows the PM10 concentration in the State of California for all 19 years. It is very interesting to examine the change in PM10 concentrations through the 19 years, based upon the modelled complete 213-site data. In particular, 1998 shows an extremely low PM10 concentration. Although air quality varies over the years, the PM10 concentration is in general decreasing, showing a trend of improving air quality.
As one can see from Fig. 6, the PM10 concentration has clearly decreased over the 19 years, and air quality has improved remarkably. The blue and green colours show negative changes, and red shows positive or near-positive changes. Counties such as San Diego, Inyo, Santa Barbara, and Imperial still show an increase in PM10 concentration and indicate bad air quality, while Kern, Modoc, and Siskiyou counties show the most improvement in air quality. The left map in Fig. 6 is constructed from the difference between the 2007 and 1989 PM10 records at each location, 213 values in total. The difference map obviously uses only the PM10 records of the two years 1989 and 2007; the information from the seventeen years 1990, 1991,..., 2006 does not participate in the construction of the change map. The right map in Fig. 6 shows the rate of change over the 19-year period from 1989 to 2007.
Note that the calculations of $\alpha$ involve all nineteen years through temporal regression, and the dependent variable is estimated from the actual PM10 observations across all the available locations. Therefore the rate of change parameter at each individual location contains all the spatial-temporal information, and it is reasonable to say that the rate of change parameter is an aggregate statistic for revealing the 19-year changes over the 213 locations. The $\alpha$ kriging map is thus different from the 2007-1989 difference kriging map. A positive sign of $\alpha$ indicates an increasing trend in PM10 concentration, while a negative sign of $\alpha$ indicates a decreasing trend. The absolute value of $\alpha$ reveals the magnitude of change of PM10 concentration. It is worth reporting that, among the 213 locations, 193 have a negative $\alpha$, while 20 (approximately 9%) have a positive $\alpha$.
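The sign summary above is a simple tally over the site-level rate of change estimates, which can be sketched as follows. The function name and the dictionary keys are hypothetical, and the usage note relies on synthetic placeholder values rather than the study's estimates.

```python
def summarize_trend_signs(alphas):
    """Count sites with decreasing (alpha < 0) and increasing (alpha > 0) PM10 trends."""
    if not alphas:
        raise ValueError("no rate of change estimates supplied")
    decreasing = sum(1 for a in alphas if a < 0)
    increasing = sum(1 for a in alphas if a > 0)
    return {
        "decreasing": decreasing,
        "increasing": increasing,
        "flat": len(alphas) - decreasing - increasing,
        "share_increasing": increasing / len(alphas),
    }
```

Applied to a synthetic list of 193 negative and 20 positive values, the share of increasing sites is 20/213, which is approximately 9%.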
Air quality and health are always central issues of public concern about the quality of life. In this chapter, we examined PM10 levels in the State of California over 19 years, from 1989 to 2007. Facing the difficulty of lacking "complete" PM10 observational data, we utilised the inverse distance weighting methodology to "fill" in the locations with missing values. Doing so introduces impreciseness uncertainty, which is not necessarily explained by a probability measure foundation. We noted the character of a regionalized variable in geostatistics and therefore engaged Liu's (2010, 2011) uncertainty theory to address the impreciseness uncertainty. We developed a series of spatial-temporal methods founded on uncertain measure theory, including the inverse distance scheme, the kriging scheme, and the geometric canonical process based weighted regression analysis, in order to extract the change information from the incomplete 1989-2007 PM10 records. The use of the rate of change parameter alpha is a new idea; it is an aggregate change index that utilizes all the spatial-temporal information available, unlike a two-year difference map, which uses only the first and last years. However, due to the limitations of our ability, we are unable to demonstrate the detailed uncertain measure based spatial analysis model. In future research, we plan to develop a more solid uncertain spatial prediction methodology.
I would like to thank the California Air Resources Board for providing the air quality data used in this paper. This study is supported financially by the National Research Foundation of South Africa (Ref. No. IFR2009090800013) and (Ref. No. IFR2011040400096).