Number of samples in each dataset.
To fulfill the national air quality standards, many countries have created emissions monitoring strategies on air quality. Nowadays, policymakers and air quality executives depend on scientific computation and prediction models to monitor the factors that cause air pollution, especially in industrial cities. Air pollution is considered one of the primary problems that can cause many human health problems such as asthma, lung damage, and even death. In this study, we investigate the development of forecasting models for air pollutant attributes including particulate matter (PM2.5, PM10), ground-level ozone (O3), and nitrogen dioxide (NO2). The dataset used was collected from Dubrovnik city, which is located in the east of Croatia. The collected data contain missing values. Therefore, we suggest the use of a Layered Recurrent Neural Network (L-RNN) to impute the missing value(s) of air pollutant attributes and then build forecasting models. We adopted four regression models to forecast air pollutant attributes: Multiple Linear Regression (MLR), Decision Tree Regression (DTR), Artificial Neural Network (ANN), and L-RNN. The obtained results show that the proposed imputation method enhances the overall performance of the forecasting models.
- imputing missing data
- air pollutants
- layered recurrent neural network
Air quality monitoring and management have attracted great attention from researchers and the public in recent years. Air pollution causes serious problems and infections in living organisms and poses environmental risks. Harmful emission of industrial waste into the air is one of the common environmental influences that disturb air quality specifications and the national economy. Significant publications have shown that air pollution has harmful effects on human health. Air pollution affects living organisms through its impact on the cardiac, vascular, pulmonary, and neurological systems. For example, air pollution in the City of New York causes the death of more than 3000 people and the hospitalization of 200 persons. It was found that many of these reported incidents were caused by exposure to PM2.5 and other pollutant attributes. In 2010, it was estimated that ambient particulate matter (PM) caused 3.2 million premature deaths. Moreover, several analyses and research papers highlight that there is an exponential relationship between PM values and cardiovascular disease, and a significant relation between NO2 concentrations and cardiovascular disease [8, 9].
Air pollution arises from many sources, such as vehicle fumes, agricultural and industrial activity, and natural sources like volcanoes. Common air pollutants are classified into two groups: trace gases such as carbon monoxide (CO), nitrogen dioxide (NO2), ground-level ozone (O3), and sulfur dioxide (SO2), and particulate matter (PM2.5 or PM10, classified by aerodynamic diameter). Tropospheric ground-level ozone (O3) is a secondary factor that can damage human health and ecosystems [12–41]. O3 is one of the most serious oxidants and is harmful to human skin and to lung tissue when inhaled [15, 16]. Several side effects impair pulmonary function and cause symptoms such as headache, weight loss, cough, shortness of breath, hoarseness, and pain while breathing. Moreover, several epidemiological studies focus on the relation between O3 pollution and mortality.
Air pollution monitoring and control is a major global challenge [19, 20]. To develop and train air quality prediction models, meteorological data for the investigated area should be collected and used. This data mostly consists of physical parameters that include temperature, dew point, wind direction, wind speed, cloud cover, cloud layer(s), ceiling height, visibility, current weather, amount of precipitation, and many more [21, 42]. These attributes greatly influence the concentration of pollutants in the area of interest.
Cities are exposed to air pollutants both indoors and outdoors. Several monitoring stations (i.e., sensors) are used to monitor air quality by collecting data from different locations inside cities. These stations are used to collect accurate data on gases or particulate matter. The sensors can be categorized as wired or wireless. Wired sensors require great effort for deployment and maintenance, and they can easily break down for several reasons (e.g., in an environment close to a volcano, the hot gases and steam can easily damage a wired network [24, 43]). Wireless sensors are still at an early stage. However, they perform well compared to wired sensors in both deployment and maintenance. Both types of sensors send the collected data to a central station for further processing. However, the data collection process sometimes suffers from problems such as power failure, sensor faults, man-made measurement errors, and many others. Figure 1 depicts the process of collecting data from different sensors. For example, if a gas sensor (e.g., the O3 sensor) does not work accurately, the collected data will not be complete and accurate. As a result, the air quality prediction model may not be accurate if the percentage of missing data is high.
Missing data cause serious problems for developing prediction models. The presence of missing data can severely reduce the quality of air quality prediction models. To solve this problem, we may either remove the missing data or impute it. Removing the missing data may reduce the application performance, while imputing missing data may enhance the overall performance without losing any of the collected data. Many methods exist to impute missing data. Researchers have applied either simple methods, such as replacing a missing value with the average value, or complex methods, such as machine learning methods. Imputing missing values based on the average is less accurate than imputing them with machine learning methods.
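The two simple strategies mentioned above, removing incomplete records and replacing a gap with the column average, can be illustrated with a minimal Python sketch. The sample readings, the use of `None` as the missing-value marker, and the function names `listwise_delete` and `mean_impute` are assumptions made for illustration only; they are not taken from the collected dataset.

```python
from statistics import mean

def listwise_delete(records):
    """Drop every record that contains at least one missing value (None)."""
    return [r for r in records if None not in r]

def mean_impute(records):
    """Replace each missing value with the mean of the observed values in its column."""
    n_cols = len(records[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in records if r[j] is not None]
        col_means.append(mean(observed))  # raises if a column has no observed value
    return [[col_means[j] if v is None else v for j, v in enumerate(r)]
            for r in records]

# Hypothetical readings (three values per record); None marks a gap.
records = [
    [40.0, 42.0, 44.0],
    [38.0, None, 46.0],
    [None, 44.0, 48.0],
    [42.0, 46.0, 50.0],
]
kept = listwise_delete(records)   # only the two complete records survive
filled = mean_impute(records)     # all four records survive, gaps are filled
```

Deletion halves this toy dataset, whereas imputation keeps every record, which is the trade-off discussed in the paragraph above.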
The main goal of this study is to propose a hybrid model that can predict the daily average concentration of air pollutants based on missing data imputation. The proposed model is a machine learning approach that can enhance the performance of air pollution monitoring systems inside cities. The Layered Recurrent Neural Network (L-RNN) has been used successfully in a variety of applications such as detection of heart failure and time-series data classification. L-RNNs were also explored earlier to handle the missing data problem [25, 29]. In this research, we first explore the use of L-RNN for imputing the missing data collected from Dubrovnik city, which is in the east of Croatia. In the second step, we develop a series of models for predicting NO2, O3, PM2.5, and PM10 using machine learning models (i.e., MLR, DTR, ANN, and L-RNN).
The rest of this chapter is organized as follows. Section 2 presents related work on air pollution prediction and the literature on missing data. Section 3 describes the proposed methodology. Section 4 presents predictive models using machine learning concepts. Section 5 presents the data collection process. The evaluation criteria used in this chapter are presented in Section 6. Section 7 presents the experimental results. Finally, Section 8 concludes the work.
2. Related works
2.1 Imputation vs. removing data
One of the most common challenges in developing prediction models is the data cleaning and exploratory analysis phase. This phase becomes challenging when missing values are present. In general, there is no single standard method for dealing with missing data. The missing data problem occurs when no value is recorded for an attribute while collecting data. Missing data are usually represented by placeholder symbols such as N/A. Two methods are adopted in the literature to handle missing data: removing the affected records and imputing the missing values.
Several researchers remove the missing data from the collected dataset if the percentage of missing data is below a small threshold. However, if the percentage of missing data exceeds that threshold, the dataset should be examined carefully. Many approaches have been investigated by researchers to solve the missing data problem. For example, the listwise deletion method removes the records with missing or incomplete data from the collected dataset. This method works well if the percentage of missing data is very small and does not affect the overall accuracy. The pairwise deletion method keeps the records with missing data and tries to reduce the loss that occurs in the listwise deletion method. Deleting missing values remains an acceptable approach for some applications. Mary and Arockiam investigated missing data in a case study of air pollution. They proposed an ST-correlated proximate approach to impute the incomplete dataset for the air pollution system. The authors compared the results of the proposed approach with different statistical methods. Sta investigated the process of collecting data for modern urban cities. The author proposed a framework that groups the collected data into three clusters: complete, ambiguous, and missing data. The author imputed the missing data and enhanced the overall performance of the proposed system. Xiaodong et al. proposed a Hot Deck imputation approach that imputes the incomplete records (missing data) using the similarity between complete and incomplete data.
In statistics, imputation is defined as the process of replacing missing data with substituted values. Replacing a whole data point is called unit imputation, while replacing a component of a data point is called item imputation. Imputation is considered a successful way to avoid the difficulties associated with listwise deletion of missing values. Suhani et al. proposed a machine learning approach based on the fuzzy kNN technique to impute the missing data for a selected case from the medical field. The authors ignore the missing data whose entropy value is less than a predetermined value and recover the incomplete data whose entropy is higher than the predetermined value using a fuzzy kNN algorithm. Chen et al. applied a machine learning approach based on a convolutional neural network to impute the missing data for a real medical dataset. The authors improved the overall performance after imputing missing data. Turabieh et al. proposed a dynamic model based on deep learning neural networks for missing data imputation. The authors showed that the proposed model improves the overall performance of medical applications after imputing missing data.
2.2 Air pollution prediction
Air pollution is a serious problem that negatively affects human health, the environment, and the climate. Governments and organizations have published several initiatives to reduce the concentrations of air pollutants, but high concentrations still exist. As a result, monitoring the concentrations of air pollutants is needed. Air monitoring consists of several steps: 1) monitoring sites with wired or wireless sensors, 2) collecting data that should be accurate and complete, 3) analyzing the collected data using predictive models based on machine learning, and 4) making decisions to reduce the concentrations of air pollutants. This process should be performed correctly to ensure that the concentration of air pollutants is under control.
Many researchers have used different types of machine learning methods to predict the concentrations of air pollutant indicators. For example, Perez and Gramsch applied a feed-forward neural network to predict the concentration of PM2.5 and PM10 in Santiago, Chile. The obtained results show that the proposed approach enhances the prediction of PM2.5 and PM10. Lana et al. employed regression models to predict several air pollutants (CO, NO, NO2, O3, and PM10) for the city of Madrid, Spain. The obtained results highlight the importance of reducing air pollutants in the city of Madrid. Kamińska employed an ensemble learning method based on random forests to model the relationship between the concentrations of air pollutants and nine variables describing meteorological conditions, temporal conditions, and traffic flow. The data were collected over the 2 years 2015 and 2016 for Wrocław city. The data consist of hourly values of wind speed, wind direction, temperature, air pressure and relative humidity, temporal variables, and traffic flow. The obtained results show that the season plays a vital role in the overall performance. Kamińska also proposed a probabilistic forecasting method to predict the concentrations of NO2. The dataset represents the hourly values of the concentration of NO2, wind speed, and traffic flow for the main intersection in Wrocław city. The obtained results show that wind speed is a vital factor in the concentration of NO2.
Shang et al. employed a novel prediction method that hybridized the regression tree (CART) and ensemble extreme learning machine (EELM) methods to predict the hourly concentration of the PM2.5 air pollutant. The training dataset used in this research was obtained from the meteorological data of the Yancheng urban area, while the testing data (i.e., the air pollutant concentration) were obtained from the City Monitoring Centre. The obtained results demonstrate the effectiveness of the proposed method in predicting PM2.5. A hybrid framework based on three different machine learning methods (i.e., genetic algorithm [GA], random forests [RFs], and backpropagation neural networks [BPNN]) was proposed by Dotse et al. The proposed hybrid approach is used to predict daily PM10 in Brunei Darussalam. Sun and Sun proposed a hybrid model to predict PM2.5 in Baoding city in China, using a combination of three methods (i.e., principal component analysis [PCA], least squares support vector machine [LSSVM], and cuckoo search [CS]). The obtained results show that the PCA algorithm works as a feature selection algorithm that reduces the dimensionality of the input dataset, while CS shows promising results in predicting PM2.5. The main shortfall of this work is that it is applicable to short-term PM2.5 forecasting.
A dynamic fuzzy synthetic evaluation model for predicting the concentration of three air pollutants (i.e., PM2.5, PM10, and SO2) in two cities in China has been proposed by Xu et al. The obtained results show that the proposed model can be employed to build a robust air quality monitoring system for early warning. A novel hybrid model based on extreme learning machine (ELM) was employed by Luo et al. to predict the concentration level of PM10 for the cities of Beijing and Harbin in China. Aznarte proposed an ELM approach that is optimized by cuckoo search (CS) to enhance the overall performance of ELM. A probabilistic forecasting approach was applied to predict NO2 in Madrid, Spain. Wang et al. proposed a novel hybrid machine learning approach based on a decomposition method and an extreme learning machine (ELM) optimized by differential evolution (DE) to predict air pollutants in the cities of Beijing and Shanghai in China. Kumar and Goyal applied Multiple Linear Regression (MLR) and Principal Component Regression (PCR) methods to predict several air pollutants in Delhi, India. A MultiLayer Perceptron (MLP) neural network was adopted by Aly et al. to predict PM10 in Delhi, India. The authors also applied two other algorithms (i.e., Naïve Bayes [NB] and Support Vector Machine [SVM]), and MLP outperformed both NB and SVM. Vibha and Satyendra applied seven neural network models trained with the Levenberg–Marquardt (LM) algorithm to predict the daily PM10 in two cities in India.
3. Proposed methodology
The main purpose of this research is to develop different machine learning methods to predict daily average air pollutant concentrations such as O3, PM2.5, PM10, and NO2 values given data with many missing values. The process consists of two phases: 1) imputing missing data based on an L-RNN model, and 2) developing predictive models using several machine learning algorithms, namely MLR, DTR, ANN, and L-RNN. Our proposed approach starts by collecting data from sensors. If the collected data contain missing values, an imputation process starts based on the L-RNN that predicts the concentration of air pollutants. This process is repeated until the collected data have no missing value(s). Once the collected dataset is complete, a machine learning model is selected to predict the daily average of air pollutant concentrations. The selected model is evaluated based on two evaluation criteria: Root Mean Square Error (RMSE) and the coefficient of determination ($R^2$). The proposed approach is depicted in Figure 2. The following subsections describe the proposed approach.
A layered recurrent neural network is a neural network with local feedback, which makes it particularly suited to predicting daily air pollutant attributes: it incorporates a time delay into the training process through a feedback connection between the output layer and the hidden layer(s). Figure 3 demonstrates the feedback connection. Simply put, during the training process, the output of the recurrent neural network is added to the output of the hidden layer. The resulting sum is used as the argument of the transfer function to obtain the output in the succeeding iteration. Eq. (1) gives the output of the L-RNN in terms of the input values of the hidden layer, the input values of the output layer, and the weights associated with each of them. The final output is obtained from Eq. (2) by applying a transfer function. In this work, we employed back-propagation through time in the training phase for the proposed L-RNN structure.
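The feedback idea described above can be sketched as a forward pass in plain Python. This is only an illustration under stated assumptions: a single hidden layer, a `tanh` transfer function, a linear output neuron, and hypothetical weight names (`w_in`, `w_fb`, `w_out`); the exact form of Eq. (1) and Eq. (2) and the network used in this chapter may differ.

```python
import math

def lrnn_forward(sequence, w_in, w_fb, w_out, b_hidden, b_out):
    """Run a single-output recurrent network over a sequence of input vectors.

    sequence : list of input vectors, one per time step
    w_in     : one weight vector per hidden neuron (hidden x inputs)
    w_fb     : feedback weight from the previous output to each hidden neuron
    w_out    : weight from each hidden neuron to the output neuron
    """
    y_prev = 0.0                      # assumption: no feedback before the first step
    outputs = []
    for x in sequence:
        hidden = []
        for j in range(len(w_in)):
            # weighted inputs plus the delayed (fed-back) network output
            s = sum(w * xi for w, xi in zip(w_in[j], x)) + w_fb[j] * y_prev + b_hidden[j]
            hidden.append(math.tanh(s))   # assumed transfer function
        y = sum(w * h for w, h in zip(w_out, hidden)) + b_out  # linear output neuron
        outputs.append(y)
        y_prev = y                    # becomes the feedback for the next step
    return outputs

# Hypothetical weights: two inputs, two hidden neurons, one output.
w_in = [[0.5, -0.2], [0.1, 0.3]]
w_fb = [0.4, -0.3]
w_out = [1.0, 0.5]
outputs = lrnn_forward([[1.0, 2.0], [1.0, 2.0]], w_in, w_fb, w_out, [0.0, 0.0], 0.0)
```

Although the same input vector is presented twice, the two outputs differ because the second step also receives the first output through the feedback connection; setting `w_fb` to zeros reduces the sketch to an ordinary feed-forward network.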
3.2 Data imputation using L-RNN
To implement the data imputation process, we divided the data into two groups: 1) the complete dataset [without missing value(s)] and 2) the incomplete dataset [with missing value(s)]. A holdout method is used to train and test the L-RNN. The complete dataset is divided into three datasets: a training dataset, a testing dataset, and a validation dataset. The incomplete dataset is then fed to the trained L-RNN model, which imputes the missing value(s). This process is repeated dynamically whenever a record with missing value(s) is received.
The computational complexity of the model depends on the structure of the L-RNN and the number of missing data in the received record. The computational complexity will increase exponentially if the number of missing values increases. Figure 4 illustrates the process used to impute the missing value(s) (i.e., the concentration value of O3, NO2, PM2.5, PM10).
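The workflow of this subsection (separate complete from incomplete records, hold out part of the complete records, then fill the gaps with a trained model) can be sketched as follows. Everything specific in the sketch is an assumption: the 70/15/15 split percentages are placeholders, and `row_mean_predictor` is a trivial stand-in for the trained L-RNN rather than the model used in this chapter.

```python
import random

def split_complete_incomplete(records):
    """Separate records without gaps from records with at least one None."""
    complete = [r for r in records if None not in r]
    incomplete = [r for r in records if None in r]
    return complete, incomplete

def holdout_split(records, percents=(70, 15, 15), seed=0):
    """Shuffle the complete records and cut them into train/validation/test parts.

    The 70/15/15 percentages are an assumption made for this sketch.
    """
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * percents[0] // 100
    n_val = len(shuffled) * percents[1] // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def impute(records, predict_missing):
    """Fill every gap by calling the trained model on the affected record."""
    filled = []
    for r in records:
        r = list(r)
        while None in r:                  # repeat until the record is complete
            j = r.index(None)
            r[j] = predict_missing(r, j)
        filled.append(r)
    return filled

def row_mean_predictor(record, j):
    """Stand-in for the trained L-RNN: mean of the observed values in the record."""
    observed = [v for v in record if v is not None]
    return sum(observed) / len(observed)

# Hypothetical data: 20 complete records plus 2 records with gaps.
records = [[float(10 + i), float(12 + i), float(14 + i)] for i in range(20)]
records += [[10.0, None, 30.0], [None, 4.0, 8.0]]
complete, incomplete = split_complete_incomplete(records)
train, val, test = holdout_split(complete)
imputed = impute(incomplete, row_mean_predictor)
```

Passing the predictor as a function keeps the imputation loop independent of the model, so a trained network could replace the stand-in without changing the rest of the sketch.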
4. Predictive models using machine learning
Several machine learning methods can be used to predict air quality. However, we have limited this research to four methods: MLR, DTR, ANN, and L-RNN. To avoid the over-fitting problem in the training process, we employed the k-fold cross-validation method with k = 5. The following subsections explore each learning method in more detail.
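As a minimal sketch of how k-fold cross-validation with k = 5 partitions the data, the following Python code builds the folds and averages a score over them. The helper names `k_fold_indices` and `cross_validate`, and the caller-supplied `fit` and `score` functions, are assumptions for illustration; the sketch does not shuffle the records, which is a simplification.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # the first (n_samples % k) folds receive one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_validate(xs, ys, fit, score, k=5):
    """Average a score over k folds; `fit` and `score` are supplied by the caller."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model,
                            [xs[i] for i in test_idx],
                            [ys[i] for i in test_idx]))
    return sum(scores) / len(scores)

folds = list(k_fold_indices(12, k=5))
```

Each sample is used for testing exactly once and for training in the remaining k - 1 folds, so the averaged score reflects performance on data the model did not see during fitting.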
4.1 Multiple linear regression
Linear regression (LR) is one of the most well-known algorithms in statistics and machine learning. The main idea of LR is to find the relationship between input and output numerical variables. There are several related regression techniques, such as simple linear regression, multiple linear regression, logistic regression, ordinal regression, multinomial regression, and discriminant analysis. LR has been employed successfully in many areas as a machine learning algorithm [53, 54]. MLR is a classical statistical method that tries to find a relationship between several input variables and an output variable. Simply put, MLR tries to find an approximating linear function between the independent input variables and the dependent output variable. Eq. (3) gives the regression line in MLR.

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon \tag{3}$$

where $y$ is the dependent output variable, $x_i$ is the $i$-th independent input variable, $\beta_i$ is the regression coefficient of $x_i$, $k$ is the number of independent input variables, and $\varepsilon$ is the error term. Eq. (4) presents a compact version of Eq. (3).

$$y_j = \beta_0 + \sum_{i=1}^{k} \beta_i x_{ij} + \varepsilon_j, \qquad j = 1, \ldots, n \tag{4}$$

where $n$ represents the number of samples, $x_{ij}$ represents the value of the $i$-th independent input variable in the $j$-th sample, and $\varepsilon_j$ is the residual error in the $j$-th sample. In matrix form, Eq. (4) can be written as Eq. (5), where $\mathbf{X}$ is the $n \times (k+1)$ matrix whose $j$-th row is $(1, x_{1j}, \ldots, x_{kj})$. The coefficient vector can then be calculated with the standard least-squares method as shown in Eq. (6).

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \tag{5}$$

$$\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{y} \tag{6}$$

Therefore, once the coefficient vector is known, the generated MLR model can predict the dependent output variable from the independent input variable(s).
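The least-squares fit described above can be sketched in plain Python by forming the normal equations and solving them with Gaussian elimination. The function names and the toy data are assumptions for illustration, and the sketch assumes the normal-equation matrix is non-singular; a numerical library would normally be used instead of a hand-written solver.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                      # back substitution
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x

def fit_mlr(samples, targets):
    """Return least-squares coefficients [b0, b1, ..., bk] from the normal equations."""
    rows = [[1.0] + list(s) for s in samples]         # prepend the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict_mlr(coeffs, sample):
    """Evaluate the fitted linear function for one sample."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], sample))

# Toy data generated from y = 1 + 2*x1 + 3*x2 (hypothetical, noise-free).
samples = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
targets = [1.0 + 2.0 * s[0] + 3.0 * s[1] for s in samples]
coeffs = fit_mlr(samples, targets)
```

Because the toy targets are noise-free, the recovered coefficients match the generating values up to rounding error; with real measurements the fit minimizes the sum of squared residuals instead.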
4.2 Decision tree regression
The DTR is employed in this chapter to predict the air pollutant attributes because it can handle complex data and requires less training time than other prediction models. Simply put, DTR uses if-then conditions to predict the appropriate output value(s). The DTR follows three steps to predict the output value(s):
Step 1: Determining the parameter settings for DTR, such as the prediction accuracy criterion, the split selection rule, the stopping rule for splitting, and the rule for selecting the optimal tree.
Step 2: Selecting the splits used to predict values of the continuous dependent variable. Split quality is usually measured with a node impurity measure, which indicates the relative homogeneity of the cases in the terminal nodes.
Step 3: Determining when to stop splitting, which is related to the minimum number of nodes. The aim is to select the right-sized tree, which is called the optimum tree.
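The split selection of Step 2 and the stopping rule of Step 3 can be illustrated with a short Python sketch for a single input variable. It assumes the sum of squared deviations as the node impurity measure and a hypothetical `min_leaf` parameter as the stopping rule; the toolbox implementation used in the experiments may apply different criteria.

```python
def sse(values):
    """Sum of squared deviations from the mean: the node impurity assumed here."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys, min_leaf=4):
    """Return (threshold, impurity) of the best 'x <= threshold' split, or None.

    A split is only considered if both children keep at least `min_leaf` samples.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # cannot split between equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        impurity = sse(left) + sse(right)
        if best is None or impurity < best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, impurity)
    return best

# Hypothetical data with a clear step between x = 5 and x = 6.
xs = list(range(1, 11))
ys = [10.0] * 5 + [20.0] * 5
split = best_split(xs, ys, min_leaf=4)
```

A full tree applies `best_split` recursively to each child and predicts the mean of the target values in the leaf that a sample reaches; when `best_split` returns `None`, the node has too few samples and becomes a leaf.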
4.3 Artificial neural network
Artificial Neural Network (ANN) has been used widely in many forecasting applications due to its ability to handle complex data. Without any information about the mathematical model that relates the input and output variables, an ANN can learn the hidden relationship between them. In general, there are many kinds of ANNs, such as the Feedforward Neural Network (FFNN), Recurrent Neural Network (RNN), and Convolutional Neural Network (CNN). In this chapter, we adopted two types of neural networks trained with the back-propagation method: the standard feed-forward neural network (ANN) and the Layered Recurrent Neural Network (LRNN).
5. Data collection
The dataset used in this research was collected from Dubrovnik city, which is located in the east of Croatia. Dubrovnik has a Mediterranean climate with many hours of sunshine per year and is considered the sunniest place in Croatia. In this dataset, the concentration of O3 was monitored with a commercial Teledyne API 400E UV photometric O3 analyzer, while the concentration of NO2 was monitored with a Teledyne API 200E chemiluminescent NO2 analyzer. O3 and NO2 concentrations were measured every minute, and the output signals were stored in a datalogger. The collected data were validated and averaged. The concentrations of PM10 and PM2.5 were monitored with the GRIMM model EDM 180. Samples of PM particles were collected by gravimetric methods throughout the day to obtain 24-hour averages of concentrations. All instruments are regularly maintained and calibrated. Meteorological data were obtained from the Meteorological and Hydrological Service of Croatia. The dataset was collected during 2015 and 2016.
Table 1 shows the number of records in each dataset used in this chapter. For example, the O3 dataset has 699 total records, of which 200 records are incomplete. Figure 5b demonstrates the missing data pattern for each dataset, where the x-axis presents the 24 hours (i.e., input variables), while the y-axis presents the observations during the 2 years. Figure 5a shows that data are missing in the second year for the NO2 dataset, when the NO2 sensors did not work. Since the percentage of missing data is too high for the affected records to be simply removed, we examined the collected data carefully to maintain the performance of air quality prediction systems. As a result, imputing the missing data is needed.
|Percentage of missing data %||16.96||17.08||20.26||28.80|
|Total number of records||730||731||699||731|
6. Evaluation criteria
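In this research, we employed two different evaluation criteria: Root Mean Square Error (RMSE) and the coefficient of determination ($R^2$), defined below.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(y_j - \hat{y}_j\right)^2} \tag{7}$$

$$R^2 = 1 - \frac{\sum_{j=1}^{n}\left(y_j - \hat{y}_j\right)^2}{\sum_{j=1}^{n}\left(y_j - \bar{y}\right)^2} \tag{8}$$

where $y_j$ and $\hat{y}_j$ denote the actual and predicted values of air pollution concentrations, respectively, $n$ represents the number of instances, and $\bar{y}$ stands for the average of the actual values of the air pollution concentrations.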
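Both criteria are straightforward to compute. The following Python sketch uses the standard definitions of RMSE and the coefficient of determination; the function names and the four sample values are assumptions for illustration and are not results from this chapter.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical daily averages and model predictions.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 8.0]
error = rmse(actual, predicted)
fit_quality = r_squared(actual, predicted)
```

A lower RMSE and an $R^2$ closer to 1 both indicate a better model, which is how the two criteria are read in the result tables.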
7. Experimental results
In this work, two different types of experiments were performed to develop a prediction model for pollutant parameters with missing data: (i) removing missing or incomplete records, and (ii) imputing the missing data. Four regression models were employed in this work (i.e., MLR, DT, ANN, and LRNN). All experiments were performed using the MATLAB R2019b environment. The following subsections discuss the obtained results.
7.1 Results without imputing missing data
The first set of experiments in this chapter is based on removing all records with missing data. Table 2 shows the obtained results of the four regression models. The LRNN model outperforms the other models in three datasets (i.e., NO2, PM10, and PM2.5) based on RMSE and $R^2$ values, while ANN outperforms the other models in the O3 dataset based on RMSE. The performance of the MLR method is the worst over all datasets.
7.2 Imputing data using LRNN
For imputing missing data, we employed L-RNN as a dynamic prediction model based on the current states of the collected data. In general, there are two different training algorithms for L-RNN: real-time recurrent learning, where a fixed set of weights is applied recursively during the training process, and back-propagation through time, where the L-RNN structure alternates between feed-forward and feedback structures. In this work, we used back-propagation through time.
The parameter settings used in this case are shown in Table 3. A holdout method is employed to train the L-RNN on the complete dataset, which is divided into training, validation, and testing subsets. The reason for employing the holdout method is to reduce the complexity and execution time of the proposed imputing model. After imputing missing data, we employed k-fold cross-validation with k = 5 in the training process for the four machine learning methods (i.e., MLR, DTR, L-RNN, and ANN) to evaluate the complete dataset.
|Number of iterations||1000|
|Number of neurons in the input layer||Number of input data|
|Number of neurons in the hidden layer||Number of input data /2|
|Number of neurons in the output layer||1|
7.3 Results after imputing missing data
7.3.1 MLR models
In this part, we employed MLR as a prediction model after imputing missing data. Eq. (9), Eq. (10), Eq. (11), and Eq. (12) show the MLR results for PM2.5, PM10, O3, and NO2, respectively. Table 4 shows the obtained results of the MLR method. The performance of MLR is acceptable over all datasets.
7.3.2 DT models
In this work, the minimum leaf size used is 4, and the maximum number of splits is 6. The main reason for using this setting is to simplify the generated tree. The obtained DT models for each dataset are shown in Figure 6. Table 5 presents the obtained results of DT over all datasets.
7.3.3 ANN models
Figure 7 shows the ANN structure used in this chapter, where we have three inputs and a single output. Table 6 shows the obtained results of ANN over all datasets. The performance of ANN is excellent compared to MLR and DT.
7.3.4 LRNN models
In this chapter, we employed the LRNN as a regression model to predict the daily average of air pollutant attributes. Table 7 shows the parameter settings for LRNN as a regression method. These settings were selected carefully to fit our data based on a set of preliminary experiments. Table 8 shows the obtained results of LRNN. The performance of LRNN is outstanding based on the convergence curves shown in Figure 8. The LRNN method can converge within 1000 epochs. Moreover, the obtained results of LRNN are promising compared to the previous methods.
|LRNN||Number of epoch||1000|
|Training function||Back propagation|
7.4 Analysis of the results
Table 9 shows the obtained results before and after imputing missing data. The LRNN model outperforms the other models in three datasets (i.e., NO2, PM10, and PM2.5) based on RMSE and $R^2$ values, while ANN outperforms the other models in the O3 dataset based on RMSE. The performance of MLR is the worst.
|Dataset||Regression model||After imputing||Without imputing|
From the obtained results, it can be seen that the LRNN model has an outstanding performance, where $R^2$ equals 0.90 in three datasets. However, these results are not ideal, since part of the data is removed for PM2.5 and for NO2. Removing the missing data discards several records, and the dataset may lose important information. Figure 5 shows the actual and predicted values for the NO2 dataset using all regression methods after imputing missing data.
For further analysis, comparing the results reported in Table 9, we notice that the performance of MLR over PM10 is reduced after imputing the missing data, while the performance of DT, LRNN, and ANN is improved after imputing missing data for the PM10 dataset. In general, the performance of the regression models is improved compared to the results reported in Table 2. For example, the $R^2$ value of ANN over the O3 dataset was 0.76 before imputing missing data and becomes 0.78 afterwards, while the RMSE is also improved. We can therefore conclude that imputing missing data improves air quality measurement systems without losing any record of the collected data.
8. Conclusion and future work
Data collection from remote sensors suffers from missing data, which reduces the overall performance of air quality monitoring systems. Monitoring air pollution is not an easy task, since several measurements are used to evaluate air quality. In this study, four measurements are used to predict air pollution concentrations (i.e., O3, NO2, PM2.5, and PM10). We imputed the missing data using the Layered Recurrent Neural Network (L-RNN). The performance of four different machine learning models (i.e., MLR, DTR, ANN, and L-RNN) was investigated to predict the average daily air pollution concentrations. The proposed method improved the performance of the air quality monitoring system. In future work, we plan to study different methods based on machine learning concepts to enhance the prediction of air pollutant systems. Moreover, we will investigate the general design of Internet of Things (IoT) applications to improve the performance of the air quality monitoring system.
The authors would like to acknowledge the Croatian Meteorological and Hydrological Service for their support.