
## 1. Introduction

There are many investigations reported in the scientific literature about Particulate Matter (PM) 2.5 and PM10 in urban and suburban environments [Vega et al 2002, Querol et al 2004, Fuller et al 2004].

In this contribution, the information acquired from PMx monitoring systems is used to accurately forecast particle concentration using diverse soft computing techniques.

A number of works have been published in the area of airborne particulate forecasting. For example, Chelani et al [2001] trained hidden-layer neural networks for CO forecasting in India. Caselli et al [2009] used a feedforward neural network to predict PM10 concentration. Other works, such as Kurt et al [2010], constructed neural network models using many input variables (e.g. wind, temperature, pressure, day of the week, date, concentration), making the models overly complex and inaccurate.

However, little of the scientific literature discusses robust forecasting methods based on soft computing techniques. These techniques include neuro-fuzzy inference methods, fuzzy clustering techniques and support vector machines. Each of these algorithms is discussed separately, together with its results. Furthermore, a comparison of all methods is made to emphasize their advantages as well as their disadvantages.

## 2. Fuzzy inference methods

Fuzzy inference systems (FIS), also known as fuzzy rule-based systems, are a major unit of a fuzzy logic system. Decision-making is an important part of the entire system: the FIS formulates suitable rules, and the decision is made based upon those rules. This is mainly based on the concepts of fuzzy set theory, fuzzy IF–THEN rules, and fuzzy reasoning. A FIS uses "IF–THEN" statements, with the connectors "OR" or "AND" in the rule statements, to form the necessary decision rules.

A fuzzy inference system consists of a fuzzification interface, a rule base, a database, a decision-making unit, and finally a defuzzification interface, as described in Chang et al (2006). A FIS with these five functional blocks is depicted in Fig. 1.

The function of each block is as follows:

- A rule base containing a number of fuzzy IF–THEN rules;
- A database which defines the membership functions of the fuzzy sets used in the fuzzy rules;
- A decision-making unit which performs the inference operations on the rules;
- A fuzzification interface which transforms the crisp inputs into degrees of match with linguistic values; and
- A defuzzification interface which transforms the fuzzy results of the inference into a crisp output.

The working of a FIS is as follows. The inputs are converted into fuzzy values by the fuzzification method. After fuzzification, the rule base is formed; the rule base and the database are jointly referred to as the knowledge base. Finally, defuzzification converts the fuzzy result into a real-world value, which is the output.

The steps of fuzzy reasoning (inference operations upon fuzzy IF–THEN rules) performed by FISs are:

1. Compare the input variables with the membership functions in the antecedent part to obtain the membership value of each linguistic label (this step is often called fuzzification).
2. Combine (through a specific t-norm operator, usually multiplication or min) the membership values in the premise part to obtain the firing strength (weight) of each rule.
3. Generate the qualified consequent (either fuzzy or crisp) of each rule depending on its firing strength.
4. Aggregate the qualified consequents to produce a crisp output (this step is called defuzzification).

A typical fuzzy rule in a Sugeno-type fuzzy model has the format shown in equation 1:

IF x is A AND y is B THEN z = f(x, y) (1)

where A and B are fuzzy sets in the antecedent, and z = f(x, y) is a function in the consequent. Usually f(x, y) is a polynomial in the input variables x and y that describes the output of the system within the fuzzy region specified by the antecedent of the rule.

A typical rule in a FIS model has the form (Sugeno et al 1988): IF Input 1 = x AND Input 2 = y, THEN Output is z = ax + by + c.

Furthermore, the final output of the system is the weighted average of all rule outputs, computed as

z = (Σ_i w_i z_i) / (Σ_i w_i)

where w_i is the firing strength of rule i and z_i is its output.
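As an illustration, the weighted-average inference of a first-order Sugeno (TSK) system with Gaussian membership functions can be sketched as follows; the membership parameters and rule coefficients are hypothetical, chosen only to show the mechanics:

```python
import numpy as np

def gaussmf(x, c, sigma):
    """Gaussian membership function with center c and width sigma."""
    return np.exp(-0.5 * ((x - c) / sigma) ** 2)

def tsk_inference(x, y, rules):
    """First-order Sugeno (TSK) inference for two inputs.
    Each rule is ((c1, s1), (c2, s2), (a, b, c)): Gaussian antecedents
    on each input and a linear consequent z = a*x + b*y + c."""
    weights, outputs = [], []
    for (c1, s1), (c2, s2), (a, b, c) in rules:
        w = gaussmf(x, c1, s1) * gaussmf(y, c2, s2)  # product t-norm
        weights.append(w)
        outputs.append(a * x + b * y + c)
    weights, outputs = np.array(weights), np.array(outputs)
    # final output: weighted average of all rule outputs
    return float(np.dot(weights, outputs) / np.sum(weights))
```

For example, `tsk_inference(1.0, 2.0, [((0, 1), (0, 1), (2, 3, 1))])` fires the single rule and returns its consequent 2·1 + 3·2 + 1 = 9.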

## 3. Fuzzy clustering techniques

There are a number of fuzzy clustering techniques available. In this work, two fuzzy clustering methods have been chosen: Fuzzy C-Means clustering and the fuzzy clustering subtractive algorithm. These methods have proven to be among the most reliable fuzzy clustering methods, as well as better forecasters in terms of absolute error according to some authors [Sin, Gomez, Chiu].

Since it was suggested in 1985, the Takagi-Sugeno fuzzy model methodology [Takagi et al 1985, Sugeno et al 1988], also known as the TSK model, has been widely applied to theoretical analysis, control applications and fuzzy modelling.

A fuzzy system needs the antecedent and consequent to express the logical connection between the input and output datasets that is used as a basis to produce the desired system behavior [Sin et al 1993].

### 3.1. Fuzzy clustering means (FCM)

Fuzzy C-Means clustering (FCM) is an iterative optimization algorithm that minimizes the cost function given by:

J = Σ_{i=1}^{c} Σ_{k=1}^{n} (μ_ik)^m ||x_k − v_i||^2

where n is the number of data points, c is the number of clusters, x_k is the kth data point, v_i is the ith cluster center, μ_ik is the degree of membership of the kth data point in the ith cluster, and m is a constant greater than 1 (typically m = 2) [Aceves et al 2011]. The degree of membership μ_ik is defined by:

μ_ik = 1 / Σ_{j=1}^{c} (||x_k − v_i|| / ||x_k − v_j||)^{2/(m−1)}

Starting with a desired number of clusters c and an initial guess for each cluster center v_i, i = 1, 2, …, c, FCM converges to a solution for v_i that represents either a local minimum or a saddle point of the cost function [Bezdek et al 1985]. The FCM method uses fuzzy partitioning such that each point can belong to several clusters, with membership values between 0 and 1. FCM includes predefined parameters such as the weighting exponent m and the number of clusters c.
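A minimal sketch of the FCM iteration described above, alternating the membership and center updates until the memberships stabilize; the random initialization, tolerance and iteration cap are illustrative choices, not prescribed by the method:

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Fuzzy C-Means: alternately update cluster centers v and
    fuzzy memberships u until the memberships change by less than tol."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    u = rng.random((c, n))
    u /= u.sum(axis=0)                            # memberships sum to 1 per point
    for _ in range(max_iter):
        um = u ** m
        v = um @ X / um.sum(axis=1, keepdims=True)  # weighted-mean centers
        # distances from every center to every point (small offset avoids /0)
        d = np.linalg.norm(X[None, :, :] - v[:, None, :], axis=2) + 1e-12
        u_new = d ** (-2.0 / (m - 1.0))
        u_new /= u_new.sum(axis=0)                  # normalized membership update
        if np.abs(u_new - u).max() < tol:
            u = u_new
            break
        u = u_new
    return v, u
```

On two well-separated blobs of points, the two returned centers land near the blob means, with memberships close to 0 or 1.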

### 3.2. Fuzzy clustering subtractive

The subtractive clustering method assumes each data point is a potential cluster center and calculates a measure of the likelihood that each data point would define a cluster center, based on the density of surrounding data points. Consider n data points (x_1, x_2, …, x_n) in m dimensions, each a potential cluster center; the density function D_i of data point x_i is given by:

D_i = Σ_{j=1}^{n} exp(−||x_i − x_j||^2 / (r_a/2)^2)

where *r*_{a} is a positive constant. The data point with the highest potential is the one surrounded by the most data points: *r*_{a} defines a neighbourhood radius, and data points beyond *r*_{a} have little influence on the density of a given data point.

After calculating the density function of each data point, it is possible to select the data point with the highest potential as the first cluster center. Assuming that *X*_{c1} is selected and *D*_{c1} is its density, the density of each data point is then amended by:

D_i ← D_i − D_{c1} exp(−||x_i − X_{c1}||^2 / (r_b/2)^2)

The density function of the data points close to the first cluster center is thereby reduced, so these points are unlikely to become the next cluster center. *r*_{b} defines the neighbourhood in which the density function is reduced; usually the constant *r*_{b} > *r*_{a} (a common choice is *r*_{b} = 1.5 *r*_{a}), in order to avoid cluster centers that lie too close to one another [Yager et al 1994].
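The density-selection and suppression steps above can be sketched as follows; writing the exponents as α = 4/r_a² and β = 4/r_b² is equivalent to the (r/2)² denominators in the formulas, and r_b = 1.5 r_a is the common default:

```python
import numpy as np

def subtractive_clustering(X, ra=1.0, n_clusters=2):
    """Subtractive clustering sketch: pick the point of highest density
    as a center, then suppress the density around it and repeat."""
    rb = 1.5 * ra                      # rb > ra avoids closely spaced centers
    alpha = 4.0 / ra ** 2
    beta = 4.0 / rb ** 2
    # squared pairwise distances between all data points
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    D = np.exp(-alpha * d2).sum(axis=1)          # density of each point
    centers = []
    for _ in range(n_clusters):
        i = int(np.argmax(D))                    # highest-potential point
        centers.append(X[i])
        D = D - D[i] * np.exp(-beta * d2[i])     # suppress its neighbourhood
    return np.array(centers)
```

Because centers are always existing data points, the method needs no initial guess, unlike FCM.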

## 4. Support vector machines

Support vector machine (SVM) theory was developed by Vapnik in 1995 and is applied in many machine-learning applications such as object classification, time series prediction, regression analysis and pattern recognition. Support vector machines are based on the principle of structural risk minimization (SRM) [Vapnik et al 1995, 1997].

In the analysis using SVM, the main idea is to map the original data *x* into a feature space *F* of higher dimensionality via a non-linear mapping function, which is generally unknown, and then carry out linear regression in that feature space [Vapnik 1995]. The regression approximation thus addresses the problem of estimating a function from a given data set produced by that function. The SVM method approximates the function by:

y(x) = Σ_{j=1}^{m} w_j φ_j(x) + b

where *w* = [*w*_{1}, …, *w*_{m}] is the weight vector, *b* is the bias coefficient and φ(x) = [φ_{1}(x), …, φ_{m}(x)] is the basis function vector.

The learning task is transformed into finding the network weights that minimize an error function defined through the ε-insensitive loss function *Lε(d, y(x))*, given by:

Lε(d, y(x)) = |d − y(x)| − ε if |d − y(x)| ≥ ε, and 0 otherwise
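The ε-insensitive loss is zero inside the ε-tube around the target and grows linearly outside it, which can be written compactly as:

```python
import numpy as np

def eps_insensitive_loss(d, y, eps):
    """Vapnik's epsilon-insensitive loss: zero when the residual |d - y|
    lies inside the eps-tube, linear in the excess outside it."""
    return np.maximum(np.abs(d - y) - eps, 0.0)
```

For instance, with ε = 1 a residual of 0.5 costs nothing while a residual of 4 costs 3, so only points outside the tube influence the fit.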

The optimization problem so defined is solved by introducing the Lagrange multipliers α_i, i = 1, 2, …, k, responsible for the functional constraints defined in Eq. 9. The minimization of the Lagrange function is transformed into the dual problem [Vapnik et al 1997]:

With constraints:

where C is a regularization constant that determines the trade-off between the training risk and the model complexity.

According to the nature of quadratic programming, only the data points associated with non-zero Lagrange multipliers contribute to the final solution. Here *K(x*_{i}*, x*_{j}*) = φ(x*_{i}*)·φ(x*_{j}*)* is the inner product kernel, which must satisfy Mercer's condition [Osuna et al 1997] required for the generation of valid kernel functions.

Thus, the output y(x) associated with the input training data x can be expressed in terms of the support vectors by:

y(x) = Σ_i (α_i − α_i*) K(x_i, x) + b

where x_{i} are the learning vectors. This leads to the SVM architecture of Fig. 2 [Vapnik 1997, Cristianini et al 2000].
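Given a trained dual solution, the kernel expansion above can be evaluated directly. In this sketch the support vectors, the coefficients (α_i − α_i*) and the bias b are assumed to come from an already-solved optimization; the Gaussian kernel mirrors the one used in the experiments below:

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=2.0):
    """RBF (Gaussian) kernel; a standard Mercer kernel."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def svr_predict(x, support_vectors, coeffs, b, sigma=2.0):
    """SVR output y(x): kernel expansion over the support vectors,
    where coeffs[i] = alpha_i - alpha_i* from the dual solution."""
    return sum(c * gaussian_kernel(sv, x, sigma)
               for sv, c in zip(support_vectors, coeffs)) + b
```

Only support vectors (points with non-zero coefficients) appear in the sum, which is why the number of support vectors is reported alongside the error in Section 5.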

The following methodology for the design, training and testing of the SVM is proposed, based on a review of Vapnik, Osowski et al [2007] and Sapankevych et al [2009]:

(a) Preprocess the input data: select the most relevant features, scale the data to the range [−1, 1], and check for possible outliers.

(b) Select an appropriate kernel function, which determines the hypothesis space of the decision and regression function.

(c) Select the parameters of the kernel function (e.g. the variances of the Gaussian kernels).

(d) Choose the penalty factor C and the desired accuracy by defining the ε-insensitive loss function.

(e) Validate the model on test data unseen during training and, if not satisfied, iterate between steps (c) (or, eventually, (b)) and (e).
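The first step's scaling of the data to the range [−1, 1] can be sketched as a per-feature min-max transform; note that the minima and maxima fitted on the training data should be reused for the test data rather than refitted:

```python
import numpy as np

def scale_to_unit(X):
    """Min-max scale each feature column of X to [-1, 1].
    Returns the scaled data plus the per-column min and max so the same
    transform can be applied to test data. Assumes no constant columns."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    Xs = 2.0 * (X - lo) / (hi - lo) - 1.0
    return Xs, lo, hi
```

Scaling keeps any single feature (e.g. pressure vs. PM10 concentration) from dominating the kernel distances.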

## 5. Discussion of results

Simulations were performed with the fuzzy clustering algorithms using equations [3-7]. In this case study, datasets from Mexico City in 2007 were chosen to construct the fuzzy model, while the data of 2008 and 2009 from the same geographic zone were used for training and validation, respectively. The resulting fuzzy clustering model was then compared to the real data of Northwest Mexico City in 2010.

The results obtained show an average least mean square error (LMSE) of 11.636 using Fuzzy C-Means, whilst FCS shows an average LMSE of 10.59. Table 1 lists the experiments carried out. An example of these results is shown in figure 4 for FCM, and figure 5 shows the estimation made using FCS at Northwest Mexico City.

In figures 4 and 5, the raw data (blue solid line) and the constructed fuzzy model (dashed-starred green line) show that the trained model approximates the raw data with an average least mean square error of 8.7%, implying that a fuzzy model can be accurately constructed using this technique.

Table 1. List of the experiments carried out using FCM and FCS.

| Site      | LMSE using FCM | LMSE using FCS |
|-----------|----------------|----------------|
| Northwest | 10.1917        | 7.4807         |
| Northeast | 13.6282        | 13.7374        |
| Center    | 18.5757        | 15.1409        |
| Southwest | 5.0411         | 7.4953         |
| Southeast | 10.7428        | 9.1188         |

Table 1 shows that the best prediction in terms of error is obtained at the southwest site for both Fuzzy C-Means and fuzzy clustering subtractive, whilst the poorest estimation occurs at the city center. This may be due to the high variations in PM10 concentration there, which make it more difficult to predict; however, more research is needed to confirm this.

Furthermore, detailed simulations were carried out using support vector machines following the proposed methodology shown in figure 3. These simulations used the same dataset as the fuzzy clustering techniques. In this case, a σ of 2 was chosen, and ε values of 11 and 13 were chosen since they were demonstrated to give better results in previous contributions (Sotomayor et al 2010, Sotomayor et al 2011). Figure 6 shows the results of the model using support vector machines with a Gaussian kernel, whilst figure 7 shows the results using the same datasets with a polynomial kernel.

Figure 6 summarizes the results, with the support vector machine output (red circles), the raw data (black crosses) and the behavior of the data (solid black line). The Gaussian kernel with an ε of 13 (fig 6a) gives an error of 11.8 using the same LMSE algorithm as the fuzzy model, with a total of 157 support vectors. In the case of figure 6b, using the same Gaussian kernel and σ but an ε of 11, the support vector model improves to an LMSE of 8.7.

For figure 7a, the estimation gives an error of 9.8 using a σ of 2 and an ε of 11, with 177 support vectors. Likewise, figure 7b shows the estimation using a third-degree polynomial kernel with an ε of 13; in this case, an LMSE of 10.1 is obtained with 183 support vectors.

## 6. Conclusions and further work

An assessment of the performance of the fuzzy systems generated using fuzzy clustering subtractive and Fuzzy C-Means was made, taking into account the number of membership functions, the rules, and the least mean square error for PM10 particles. As a case study, estimations were made at Northwest Mexico City for 2010, giving consistent results.

In the case of SVMs, it can be concluded that for this case study an ε of 11 gives a better estimation than an ε of 13 for the Gaussian kernel. In general, the Gaussian kernel gives better estimations than the corresponding polynomial kernel, and fuzzy clustering gives better estimations than both the Gaussian and polynomial kernels, although in-depth studies are needed to corroborate these results for other scenarios.

For future work, more SVM kernels can be implemented and compared to find out which give better estimations. Also, SVMs can be combined with other techniques, such as the wavelet transform, to improve the performance of these algorithms.