Air Pollution Analysis with a Possibilistic and Fuzzy Clustering Algorithm Applied in a Real Database of Salamanca (México)

concentration patterns (located in the superior part). Once the groups are identified, we apply the PFCM clustering algorithm. "Environmental Monitoring" is a book designed by InTech - Open Access Publisher in collaboration with scientists and researchers from all over the world.


Environmental Monitoring
Examining and studying air pollutant information is very important for a better understanding of human exposure and its potential impacts on health and welfare. In recent years, the city of Salamanca has been catalogued as one of the most polluted cities in Mexico (Zuk et al., 2007). Sulphur dioxide (SO₂) and particulate matter (PM₁₀) are the criteria air pollutants with the highest concentrations in Salamanca, where three monitoring stations have been installed in order to know the level of air pollution; the records of each monitoring station are handled separately. Currently, an environmental contingency alarm is activated when the daily average pollutant concentration exceeds an established threshold at a single monitoring station. In this work, we propose to apply the PFCM (Possibilistic Fuzzy c-Means) clustering algorithm to the data measured at the three monitoring stations, so that a local environmental contingency alarm can be raised according to the pollutant concentration reported by each monitoring station, while general (city-wide) environmental contingency alarms depend on the levels provided by the combined measure. The PFCM algorithm is thus used to find the prototypes of the patterns that represent the relation between the SO₂ and PM₁₀ air pollutants. For this relation analysis we use records from January 2007. Once the prototypes have been estimated, the average pollution at each monitoring station is compared with the prototypes. The overall analysis uses a data set from January to December 2007, which includes the SO₂ and PM₁₀ pollutant concentrations and the meteorological variables wind speed, wind direction, temperature, and relative humidity. The impact of the meteorological variables on the dispersion of pollutants is also analyzed by computing correlation coefficients.
This correlation analysis is simple and is intended to improve decision making in environmental programs. Only the data gathered by the Nativitas monitoring station are used for the correlation analysis. This paper is organized as follows: Section 2 presents the features of the study case and explains the air pollution problem in Salamanca. Section 3 introduces the PFCM (Possibilistic Fuzzy c-Means) clustering algorithm and the correlation coefficient. Section 4 presents the results obtained. Finally, Section 5 presents our conclusions.

Study case
Salamanca is located in the state of Guanajuato, Mexico, and has an approximate population of 234,000 inhabitants (INEGI, 2005). The city is 340 km northwest of Mexico City, with coordinates 20°34'22" North latitude and 101°11'39" West longitude. It lies in a valley surrounded by the Sierra Codornices, where there are elevations with an average height of 2,000 meters Above Mean Sea Level (AMSL). Salamanca has been one of the Mexican cities with the most important industrial development in the last fifty years. Refinery and power generation industries settled there in the 1950s and 1970s, respectively. These industries constitute the main energy source for the local, regional and national economy. However, the increase in population and in the number of vehicles, the industrial, refinery and thermoelectric activities, as well as the orography and climatic characteristics, have led to an increase in SO₂ and PM₁₀ concentrations (INE, 2004). The existing orography hinders the dispersion of pollutants by the wind, which produces the worst pollutant concentrations. SO₂ emissions are larger than those in the Metropolitan Area of Mexico City or in Guadalajara, the two biggest cities of Mexico, even though these have a larger population than Salamanca (Cortina-Januchs et al., 2009). Sulfur dioxide is produced fundamentally by the combustion of fossil fuels, and its main source is the energy generation sector. That is, the industrial sector generates 99.3 % of this pollutant, and only approximately 0.06 % is generated by the transport sector. Particles produced by electric power generation represent 29 % of the total emissions, followed by vehicular traffic on unpaved roads with 27 %, agricultural burning with 17 %, and the transport sector with 10 %; the remaining 17 % is emitted by other sub-sectors.
The city authorities have made important efforts to measure and record the concentrations of pollutants (Zamarripa & Sainez, 2007). In 1999 the Air Quality Monitoring Patronage (AQMP) was formed. Since then the AQMP has been in charge of running the Automatic Environmental Monitoring Network (AEMN) and of disseminating its information. This information is validated by the Institute of Ecology (IE), which constantly analyzes the levels of pollutants (INE, 2004). The AEMN consists of three fixed stations and one mobile station. The fixed stations are Cruz Roja (CR), Nativitas (NA), and DIF. The fixed stations cover approximately 80 % of the urban area, while the mobile station covers the remaining 20 %. Fig. 1 illustrates the location of the three fixed stations. Each station has the necessary instrumentation to automatically track the concentration of pollutants and the meteorological variables every minute. Table 1

Clustering algorithms
In this work we take advantage of the qualities of fuzzy and possibilistic clustering algorithms in order to find $c$ groups in an unlabeled data set $Z = \{z_1, z_2, \ldots, z_k, \ldots, z_N\}$ in an $M$-dimensional space, where the $z_k$ nearest to a prototype, or group center, $v_i$ belong to the group $i$ among the $c$ possible groups. The membership of each $z_k$ to the different groups depends on the kind of partition of the $M$-dimensional space where the data set is defined. This way, a c-partition can be hard (or crisp), fuzzy, or possibilistic (Bezdek et al., 1999). The hard c-partition of the space for a data set $Z(k) = \{z_k \mid k = 1, 2, \ldots, N\}$ of finite dimension and $c$ groups, where $2 \le c < N$, is defined by (1); (2) defines the fuzzy c-partition, whereas (3) defines the possibilistic c-partition.
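For reference, the three partition spaces can be written as follows. This is a reconstruction in the standard notation of Bezdek et al. (1999), not a copy of the chapter's own equations (1)-(3), whose notation may differ; $U = [\mu_{ik}]$ denotes the $c \times N$ partition matrix.

```latex
\begin{align}
M_{hcm} &= \Big\{ U \in \Re^{c \times N} \;\Big|\; \mu_{ik} \in \{0, 1\}\ \forall i, k;\ \sum_{i=1}^{c} \mu_{ik} = 1\ \forall k;\ 0 < \sum_{k=1}^{N} \mu_{ik} < N\ \forall i \Big\} \tag{1} \\
M_{fcm} &= \Big\{ U \in \Re^{c \times N} \;\Big|\; \mu_{ik} \in [0, 1]\ \forall i, k;\ \sum_{i=1}^{c} \mu_{ik} = 1\ \forall k;\ 0 < \sum_{k=1}^{N} \mu_{ik} < N\ \forall i \Big\} \tag{2} \\
M_{pcm} &= \Big\{ U \in \Re^{c \times N} \;\Big|\; \mu_{ik} \in [0, 1]\ \forall i, k;\ \forall k\ \exists i:\ \mu_{ik} > 0;\ 0 < \sum_{k=1}^{N} \mu_{ik} < N\ \forall i \Big\} \tag{3}
\end{align}
```

The only difference between (2) and (3) is that the possibilistic partition drops the requirement that the memberships of each point sum to one.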

Fuzzy c-Means algorithm
The Fuzzy c-Means clustering algorithm (FCM) was initially developed by Dunn (1973) and later generalized by Bezdek (1981). This algorithm is based on the optimization of the objective function given by (4), where $V$ is the vector of prototypes of the $c$ groups, which are calculated according to $D^2_{ikA_i} = \|z_k - v_i\|^2_{A_i}$, a squared inner-product distance norm, and $m \in [1, \infty)$ is a weighting exponent which determines the fuzziness of the partition. The optimal c-partition for the Fuzzy c-Means algorithm is reached through the pair $(U^*, V^*)$ which locally minimizes the objective function $J_{fcm}$, according to the alternating optimization (AO).
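As a reminder of what is being optimized, the FCM objective and its two alternating-optimization updates can be written as below. This is the standard form from Bezdek (1981), given here as a reconstruction under the assumption of the Euclidean norm ($A_i = I$), with $D_{ik}^2 = \|z_k - v_i\|^2$; the chapter's own equations (4)-(6) may use slightly different notation.

```latex
\begin{align}
J_{fcm}(Z; U, V) &= \sum_{i=1}^{c} \sum_{k=1}^{N} \mu_{ik}^{m}\, D_{ik}^{2},
  \qquad \text{subject to } \sum_{i=1}^{c} \mu_{ik} = 1\ \forall k \tag{4} \\
\mu_{ik} &= \left[ \sum_{j=1}^{c} \left( \frac{D_{ik}}{D_{jk}} \right)^{2/(m-1)} \right]^{-1},
  \qquad 1 \le i \le c,\ 1 \le k \le N \tag{5} \\
v_i &= \frac{\sum_{k=1}^{N} \mu_{ik}^{m}\, z_k}{\sum_{k=1}^{N} \mu_{ik}^{m}},
  \qquad 1 \le i \le c \tag{6}
\end{align}
```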
Theorem FCM (Bezdek, 1981). Following the previous equations of the FCM algorithm, the solution can be reached with the following steps: FCM-AO-V Given the data set $Z$, choose the number of clusters $1 < c < N$, the weighting exponent $m > 1$, and the ending tolerance $\delta > 0$.
I Provide an initial value for each of the prototypes $v_i$, $i = 1, \ldots, c$. These values are generally given at random.
II Calculate the distance of $z_k$ to each of the prototypes $v_i$, using $D^2_{ikA_i}$.
III Calculate the membership values $\mu_{ik}$ using equation (5).
IV Update the prototypes $v_i$ using equation (6).
V Verify whether the error is equal to or lower than $\delta$. If so, stop; otherwise, go to step II.
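The FCM-AO-V steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the Euclidean norm, measures the error of step V as the norm of the change in the prototypes, and the names `fcm`, `fcm_memberships`, `V0`, `max_iter` and `seed` are introduced here for illustration only.

```python
import numpy as np

def fcm_memberships(Z, V, m):
    """Steps II-III: squared distances, then memberships mu[i, k]."""
    # D2[i, k] = ||z_k - v_i||^2 (Euclidean norm assumed).
    D2 = ((Z[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    D2 = np.fmax(D2, 1e-12)  # guard against division by zero
    inv = D2 ** (-1.0 / (m - 1.0))
    # Equivalent to mu_ik = 1 / sum_j (D_ik / D_jk)^(2 / (m - 1)).
    return inv / inv.sum(axis=0, keepdims=True)

def fcm(Z, c, m=2.0, delta=1e-5, max_iter=300, V0=None, seed=0):
    """Fuzzy c-Means by alternating optimization; returns (U, V)."""
    Z = np.asarray(Z, dtype=float)
    # Step I: initial prototypes, random data points unless V0 is given.
    if V0 is None:
        rng = np.random.default_rng(seed)
        V = Z[rng.choice(Z.shape[0], size=c, replace=False)].copy()
    else:
        V = np.asarray(V0, dtype=float).copy()
    for _ in range(max_iter):
        W = fcm_memberships(Z, V, m) ** m
        # Step IV: prototypes as means weighted by mu_ik^m.
        V_new = (W @ Z) / W.sum(axis=1, keepdims=True)
        # Step V: stop when the prototypes move by at most delta.
        err = np.linalg.norm(V_new - V)
        V = V_new
        if err <= delta:
            break
    return fcm_memberships(Z, V, m), V
```

Calling `U, V = fcm(Z, c=2)` on an `(N, M)` array returns a `(c, N)` membership matrix whose columns sum to one and a `(c, M)` array of prototypes.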
The FCM calculates a membership value $\mu_{ik}$ for each point $z_k$ as a function of all the prototypes $v_i$. The sum of the membership values of $z_k$ to the $c$ groups must be equal to one. However, a problem arises when several points are equidistant from the prototypes of the groups, because the FCM is not able to detect noise points, nor to distinguish the points nearest to the prototypes from the furthest ones. Pal et al. (2004) show an example with two points located on the boundary of two groups, one point near the prototypes and the other far away from them. This must be handled with care, as the two points are not equally representative of the groups, even though they have the same membership values. One way to overcome this drawback is to use a possibilistic algorithm.

Possibilistic c-Means algorithm
The Possibilistic c-Means clustering algorithm (PCM) (Krishnapuram & Keller, 1993) is based on typicality values and relaxes the FCM constraint that the membership values of a point to all the $c$ groups must sum to one. Thus, the PCM measures the similarity of each data point to a single prototype $v_i$ using a typicality value in [0, 1]. The data points nearest to the prototypes are considered typical, more distant data points are atypical, and data points with zero, or almost zero, typicality values are considered noise (Ojeda-Magaña et al., 2009a). The objective function $J_{pcm}$ proposed by Krishnapuram & Keller (1993) for this algorithm is given by (7), where the first term of $J_{pcm}$ is identical to that of the FCM objective function, which is based on the distance of the points to the prototypes. The second term, which includes a penalty $\gamma_i$, tries to bring $t_{ik}$ toward 1.
Theorem PCM (Krishnapuram & Keller, 1993): if $\gamma_i > 0$, $1 \le i \le c$, $m > 1$ and $Z$ has at least $c$ distinct data points, then $(T, V) \in M_{pcm} \times \Re^{c \times N}$ may minimize $J_{pcm}$ only if the first-order conditions on the typicality values $t_{ik}$ and the prototypes $v_i$ hold. Krishnapuram and Keller (1993, 1996) recommend applying the FCM first, so that the initial values of the PCM algorithm can be estimated. They also suggest computing the penalty $\gamma_i$ with equation (11), where $K > 0$, although the most common value is $K = 1$, and the membership values $\{\mu_{ik}\}$ are those calculated with the FCM algorithm in order to reduce the influence of noise.
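For reference, the PCM objective, the necessary conditions of the theorem, and the penalty estimate can be written as below. This is the standard form given by Krishnapuram & Keller (1993), reconstructed here with $D_{ik}^2 = \|z_k - v_i\|^2$; only the labels (7) and (11) are tied to equation numbers cited in the text, and the chapter's own notation may differ.

```latex
\begin{align}
J_{pcm}(Z; T, V, \gamma) &= \sum_{i=1}^{c} \sum_{k=1}^{N} t_{ik}^{m}\, D_{ik}^{2}
  + \sum_{i=1}^{c} \gamma_i \sum_{k=1}^{N} (1 - t_{ik})^{m} \tag{7} \\
t_{ik} &= \frac{1}{1 + \left( D_{ik}^{2} / \gamma_i \right)^{1/(m-1)}},
  \qquad 1 \le i \le c,\ 1 \le k \le N \\
v_i &= \frac{\sum_{k=1}^{N} t_{ik}^{m}\, z_k}{\sum_{k=1}^{N} t_{ik}^{m}},
  \qquad 1 \le i \le c \\
\gamma_i &= K\, \frac{\sum_{k=1}^{N} \mu_{ik}^{m}\, D_{ik}^{2}}{\sum_{k=1}^{N} \mu_{ik}^{m}},
  \qquad K > 0 \tag{11}
\end{align}
```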
The PCM algorithm is very sensitive to the $\{\gamma_i\}$ values, and the typicality values depend directly on them. For example, if the value of $\gamma_i$ is small, the typicality values $t_{ik}$ of $T$ are also small, whereas if the value of $\gamma_i$ is high, the $t_{ik}$ are also high. For this work, the $\{\gamma_i\}$ values are obtained from equation (11).
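The dependence of the typicality values on $\gamma_i$ can be checked numerically. The sketch below assumes the standard Krishnapuram & Keller (1993) expressions, $\gamma_i = K \sum_k \mu_{ik}^m D_{ik}^2 / \sum_k \mu_{ik}^m$ and $t_{ik} = 1 / (1 + (D_{ik}^2/\gamma_i)^{1/(m-1)})$; the function names `penalty_gamma` and `pcm_typicality` are hypothetical helpers, not part of the original work.

```python
import numpy as np

def penalty_gamma(U, D2, m=2.0, K=1.0):
    """Penalty gamma_i from FCM memberships U[i, k] and squared distances D2[i, k]."""
    W = np.asarray(U, dtype=float) ** m
    return K * (W * D2).sum(axis=1) / W.sum(axis=1)

def pcm_typicality(D2, gamma, m=2.0):
    """Typicality t[i, k] in (0, 1]; equals 0.5 where D2[i, k] == gamma[i]."""
    D2 = np.asarray(D2, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    return 1.0 / (1.0 + (D2 / gamma[:, None]) ** (1.0 / (m - 1.0)))

# One cluster, three points at squared distances 0, 1 and 4 from its prototype.
D2 = np.array([[0.0, 1.0, 4.0]])
T_small = pcm_typicality(D2, gamma=[1.0])  # smaller gamma, lower typicality
T_large = pcm_typicality(D2, gamma=[4.0])  # larger gamma, higher typicality
```

With $m = 2$, a point whose squared distance equals $\gamma_i$ gets typicality 0.5, so $\gamma_i$ acts as the squared radius at which a point stops being considered typical of cluster $i$.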
The original PCM algorithm has a known problem: the prototypes of different groups sometimes coincide (Hoppener et al., 2000), even if the natural structure of the data has well-delimited groups. To avoid it, Timm et al. (Timm et al., 2004; Timm & Kruse, 2002) modified the objective function to include a constraint based on the repulsion among groups, thus avoiding identical groups when they must be different.
The objective of fuzzy clustering algorithms is to find the internal structure of a numerical data set by dividing it into n different subgroups, where the members of each subgroup have a high similarity with its prototype (centroid, cluster center, signature, template, code vector) and a high dissimilarity with the prototypes of the other subgroups. This justifies the existence of each of the subgroups (Andina & Pham, 2007). A simplified representation of a numerical data set as n subgroups helps us to get a better comprehension and knowledge of the data set (Barron-Adame et al., 2007). Besides, the partitional clustering algorithms (hard, fuzzy, probabilistic or possibilistic) provide, after a learning process, a set of prototypes as the most representative elements of each subgroup.
Ruspini (1970) was the first to use fuzzy sets for clustering. After that, Dunn (1973) developed the first fuzzy clustering algorithm, named Fuzzy c-Means (FCM), with a fuzziness parameter $m$ equal to 2. Later on, Bezdek (1981) generalized this algorithm. In the FCM, the membership degree of each point to each fuzzy set $A_i$ is calculated according to its prototype. The sum of all the membership degrees of each individual point to all the fuzzy sets must be equal to one. Krishnapuram & Keller (1993) developed the Possibilistic c-Means (PCM) clustering algorithm, whose principal characteristic is the relaxation of the restriction that gives the relative typicality property of the FCM. The PCM provides a similarity degree between the data points and each of the prototypes, a value known as absolute typicality or simply typicality (Pal et al., 1997). So, the points nearest to a prototype are identified as typical, whereas the furthest points are identified as atypical, and noise Ojeda-Magaña et al.

PFCM clustering algorithm
Pal et al. (1997) proposed using the membership degrees as well as the typicality values, looking for a better clustering algorithm. They called it Fuzzy Possibilistic c-Means (FPCM). However, the constraint that the typicality values sum to one was the origin of a problem, particularly when the algorithm is applied to large data sets. In order to avoid this problem, Pal et al. (2005) proposed to relax this constraint and developed the PFCM clustering algorithm, where the function to be optimized is given by (12), subject to the constraints $\sum_{i=1}^{c} \mu_{ik} = 1\ \forall k$; $0 \le \mu_{ik}, t_{ik} \le 1$, and with the constants $a > 0$, $b > 0$, $m > 1$ and $\eta > 1$. The parameters $a$ and $b$ define the relative importance of the membership degrees and the typicality values. The parameter $\mu_{ik}$ in (12) has the same meaning as in the FCM. The same holds for the $t_{ik}$ values with respect to the PCM algorithm. Theorem PFCM (Pal et al., 2005): The membership degrees are calculated with equation (13), the typicality values with (14), and the prototypes with equation (15). The iterative process of this algorithm follows these steps: PFCM-AO-V Given the data set $Z$, choose the number of clusters $1 < c < N$, the weighting exponents $m > 1$, $\eta > 1$, and the values of the constants $a > 0$ and $b > 0$.
I Provide an initial value for each of the prototypes $v_i$, $i = 1, \ldots, c$. These values are generally given at random.
II Run the FCM-AO-V algorithm.
III With these results, calculate the penalty parameter $\gamma_i$ for each cluster $i$. Take $K = 1$.
IV Calculate the distance of $z_k$ to each of the prototypes $v_i$ using $D^2_{ikA_i}$.
V Calculate the membership values $\mu_{ik}$ using equation (13).
VI Calculate the typicality values $t_{ik}$ using equation (14).
VII Update the value of the prototypes $v_i$ using equation (15).
VIII Verify whether the error is equal to or lower than $\delta$. If so, stop; otherwise, go to step IV.
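The PFCM-AO-V steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it uses the Euclidean norm and the update equations in the form published by Pal et al. (2005), namely memberships as in the FCM, typicalities $t_{ik} = 1/(1 + (b\,D_{ik}^2/\gamma_i)^{1/(\eta-1)})$, and prototypes weighted by $a\,\mu_{ik}^m + b\,t_{ik}^\eta$. The function names, the explicit `V0` argument, and the stopping rule on the prototype change are choices made here for illustration.

```python
import numpy as np

def _sqdist(Z, V):
    """D2[i, k] = ||z_k - v_i||^2, floored to avoid division by zero."""
    return np.fmax(((Z[None] - V[:, None]) ** 2).sum(axis=2), 1e-12)

def _memberships(D2, m):
    """FCM-style memberships; each column sums to one."""
    inv = D2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)

def pfcm(Z, V0, a=1.0, b=5.0, m=2.0, eta=2.0, K=1.0, delta=1e-5, max_iter=300):
    """Possibilistic Fuzzy c-Means sketch; returns (U, T, V)."""
    Z = np.asarray(Z, dtype=float)
    V = np.asarray(V0, dtype=float).copy()          # step I
    # Step II: plain FCM run to obtain starting prototypes.
    for _ in range(max_iter):
        W = _memberships(_sqdist(Z, V), m) ** m
        V_new = (W @ Z) / W.sum(axis=1, keepdims=True)
        err = np.linalg.norm(V_new - V)
        V = V_new
        if err <= delta:
            break
    # Step III: penalty gamma_i from the FCM result.
    D2 = _sqdist(Z, V)
    W = _memberships(D2, m) ** m
    gamma = K * (W * D2).sum(axis=1) / W.sum(axis=1)
    for _ in range(max_iter):
        D2 = _sqdist(Z, V)                          # step IV
        U = _memberships(D2, m)                     # step V
        T = 1.0 / (1.0 + (b * D2 / gamma[:, None]) ** (1.0 / (eta - 1.0)))  # step VI
        W = a * U ** m + b * T ** eta
        V_new = (W @ Z) / W.sum(axis=1, keepdims=True)  # step VII
        err = np.linalg.norm(V_new - V)
        V = V_new
        if err <= delta:                            # step VIII
            break
    # U and T belong to the prototypes before the final (sub-delta) update.
    return U, T, V
```

The default values `a=1`, `b=5`, `m=2`, `eta=2` mirror the parameter setting reported later in this chapter; with `b > a` a distant outlier receives a very low typicality and therefore pulls the prototypes less than it would under plain FCM.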

PFCM clustering algorithm in the AEMN
As is well known, partitional clustering algorithms need a minimum of two groups. In our problem, however, we have only one group, formed by the patterns [SO₂; PM₁₀] of pollutant concentrations. Therefore, a synthetic cloud of patterns is proposed, defined by a covariance matrix and a vector of centers. The number of patterns (4320) is the same in the synthetic cloud and in the pollutant concentration data.
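One way to build such a synthetic cloud is sketched below. It assumes the cloud is drawn from a multivariate normal distribution, which the text suggests but does not state; the helper `add_synthetic_cloud` is hypothetical, and the mean vector and covariance matrix must be supplied by the user, since the values used by the authors are not available here.

```python
import numpy as np

def add_synthetic_cloud(patterns, mean, cov, seed=0):
    """Stack a Gaussian cloud with as many patterns as the measured data.

    patterns: (N, 2) array of measured [SO2, PM10] patterns.
    mean, cov: centre and covariance of the synthetic cloud (user-chosen).
    Returns the (2N, 2) stacked data and a label vector (0 = measured, 1 = synthetic).
    """
    patterns = np.asarray(patterns, dtype=float)
    rng = np.random.default_rng(seed)
    # Same number of synthetic patterns as measured ones, as in the text.
    cloud = rng.multivariate_normal(mean, cov, size=patterns.shape[0])
    Z = np.vstack([patterns, cloud])
    labels = np.r_[np.zeros(len(patterns), dtype=int), np.ones(len(cloud), dtype=int)]
    return Z, labels
```

The stacked array `Z` can then be passed to a clustering routine with `c = 2`, and the prototype associated with the measured patterns is the one of interest.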

Correlation coefficient
The correlation coefficient $r$ (also called Pearson's product moment correlation, after Karl Pearson; Pérez et al., 2000) is used to determine the strength and direction of the relationship between two variables. This form of correlation requires that both variables are normally distributed interval or ratio variables. The correlation coefficient is calculated by eq. (16), where $n$ is the number of data points. The numerical values of the correlation coefficient range from +1 to -1. If two variables move exactly together, the value of the correlation coefficient is 1, which indicates perfect positive correlation. If two variables move exactly opposite to each other, the value of the correlation coefficient is -1. Low numerical values, such as -0.10 or +0.15, indicate little relationship between the two variables.
... station we observe that either the SO₂ or the PM₁₀ pollutant concentrations are the highest. At the DIF monitoring station we observe the highest PM₁₀ concentrations in the AEMN network. The main proposal of this work is to apply the PFCM clustering algorithm to the AEMN in Salamanca, as well as to integrate the pollutant measures from the three monitoring stations. The PFCM initial parameters ($a$, $b$, $m$ and $\eta$) are very important in order to reduce the effect of outliers on the pattern prototypes. Pal et al. (2005) recommend a value of $b$ larger than the value of $a$ in order to reduce these effects. In addition, a small value for $\eta$ and a value greater than 1 for $m$ are recommended. Nevertheless, choosing too high a value of $m$ reduces the effect of the memberships of the data to the clusters, and the algorithm behaves as a simple PCM. Taking into account the previous recommendations, the initial parameters for the PFCM clustering algorithm were set as follows: $a = 1$, $b = 5$, $m = 2$ and $\eta = 2$. The prototypes found are shown in Fig. 4 (a and b).
In Fig. 4(a) the daily averages of the SO₂ concentrations are presented for each monitoring station together with the corresponding prototypes. It is also observed that the Cruz Roja monitoring station records the highest SO₂ concentrations; this is due to its location near the refinery. The prototypes in this case were very low in comparison with the observed SO₂ concentrations, because only one station (Cruz Roja) observed high SO₂ concentrations. According to the analyzed patterns, the emitted pollutant is measured only by the Cruz Roja monitoring station (see Fig. 4). Fig. 4(b) shows the daily averages of the PM₁₀ concentrations and the resulting prototypes. In this case, the observed averages are very similar at the three monitoring stations. The dispersion of PM₁₀ in the city is more uniform than that of SO₂.
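As a small illustration of the correlation coefficient $r$ described earlier in this section, the sketch below computes Pearson's $r$ with the usual computational formula, $r = \big(n\sum xy - \sum x \sum y\big) / \sqrt{\big(n\sum x^2 - (\sum x)^2\big)\big(n\sum y^2 - (\sum y)^2\big)}$. This formula is assumed to match eq. (16), which is not reproduced in the text; the function name `pearson_r` is introduced here for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    num = n * (x * y).sum() - x.sum() * y.sum()
    den = np.sqrt((n * (x ** 2).sum() - x.sum() ** 2)
                  * (n * (y ** 2).sum() - y.sum() ** 2))
    return num / den
```

For instance, `pearson_r(wind_speed, so2)` applied to two aligned series from one station (both names are placeholders for user data) gives the strength and sign of their linear relationship; the result agrees with `np.corrcoef(x, y)[0, 1]`.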

Conclusions
There is currently a program to improve the air quality in the city of Salamanca, Mexico. This program has established thresholds for several levels of contingency depending on the SO₂ and PM₁₀ pollutant concentrations. However, a particular level of contingency for the city is declared taking into account the highest pollutant concentration provided by one of the three monitoring stations. For example, if a pollutant concentration exceeds a given threshold at a single monitoring station, the contingency alarm applies to the whole city. This value is normally provided by the Cruz Roja station, due to its proximity to the refinery and power generation industries. Looking for local and general contingency levels in the city, we have proposed to estimate a set of prototypes that represent a calculated measure of the pollutant concentrations according to the values measured at the three fixed stations. In this way, a local contingency alarm can be activated in the area of impact of the pollution, depending on each station, and a general contingency alarm can be activated according to the values provided by the prototypes. Nevertheless, the latter case requires adjusting the thresholds: the current values would be used only for local contingencies, because they depend on the measured values of pollutant concentrations, whereas the general contingency requires thresholds defined as a function of the calculated values.