Open access peer-reviewed chapter

Ensemble Methods in Environmental Data Mining

By Goksu Tuysuzoglu, Derya Birant and Aysegul Pala

Submitted: October 31st 2017 | Reviewed: January 25th 2018 | Published: February 17th 2018

DOI: 10.5772/intechopen.74393


Environmental data mining is the nontrivial process of identifying valid, novel, and potentially useful patterns in data from environmental sciences. This chapter proposes ensemble methods in environmental data mining that combine the outputs from multiple classification models to obtain better results than could be obtained by an individual model. The study presented in this chapter focuses on several ensemble strategies in addition to the standard single classifiers popularly used in the literature, such as decision tree, naive Bayes, support vector machine, and k-nearest neighbor (KNN). This is the first study that compares four ensemble strategies for environmental data mining: (i) bagging, (ii) bagging combined with random feature subset selection (the random forest algorithm), (iii) boosting (the AdaBoost algorithm), and (iv) voting of different algorithms. In the experimental studies, ensemble methods are tested on different real-world environmental datasets in various subjects such as air, ecology, rainfall, and soil.


  • data mining
  • classification
  • ensemble learning
  • environmental data
  • bagging
  • random forest
  • AdaBoost

1. Introduction

Environmental data mining is defined as extracting knowledge from huge sets of environmental data. It is an interdisciplinary area of both computer and environmental sciences, including but not limited to environmental information management systems, decision support systems, recommender systems, and environmental data analytics.

Environmental data mining based on ensemble learning is a rather young research area where a set of learners are trained sequentially on the dataset to better analyze and understand environmental processes and systems. However, it is not well-known yet how ensemble methodology can be utilized in order to improve the performance of a single method. For this purpose, this chapter presents the findings of a systematic survey of what is currently done in the area and aims to investigate the ability of different ensemble strategies for environmental data mining.

Ensemble learning in environmental data mining (ELEDM) can be viewed as a combination of three main areas: data mining (DM), machine learning (ML), and environmental science (Figure 1). ML in environmental science is learning-driven, meaning that machines teach themselves to recognize patterns by analyzing environmental data, whereas DM is discovery-driven, meaning that patterns are automatically discovered from environmental data. DM uses many ML methods, including ensemble learning methods.

Figure 1.

Interdisciplinary structure of ensemble learning in environmental data mining (ELEDM).

The novelty and main contributions of this chapter are as follows. First, it provides a brief survey of ensemble learning used in environmental data mining. Second, it presents how an ensemble of classifiers can be applied on environmental data in order to improve the performance of a single classifier. Third, it is the first study that compares different ensemble strategies on different environmental datasets in terms of classification accuracy.


2. Related work

Data mining techniques have been recently utilized in environmental studies for processing environmental data and converting it to useful patterns to obtain valuable knowledge and make right decisions when dealing with environmental problems. Many of the developed techniques in data mining can often be tailored to fit environmental data.

Recently, ensemble learning has been one of the active research fields in machine learning. Thus, it has been utilized in a very broad range of areas such as marketing, banking, insurance, health, telecommunication, and manufacturing. In contrast to these studies, our work proposes an ensemble learning approach that combines several models to solve environmental problems.

2.1. Ensemble-based environmental data mining

Ensemble classifiers have been applied to different environmental subjects, such as air [1, 2, 3, 4, 5, 6], water [7, 8, 9], soil [10, 11, 12], plant [13], forests [14, 15], climate [16, 17, 18], noise [19], rainfall [20], energy [21, 22, 23], as well as living organisms [18, 24, 25]. Some of the ensemble-based environmental data mining studies are compared in Table 1. The table lists the scope of each study, the year it was performed, the algorithms used, the type of data mining task, the success rate with the validation method, and the ensemble strategy. In addition, if more than one algorithm is presented and compared, the proposed one (the most successful one) is also indicated. As shown in the table, ensembles of models for classification or prediction have attracted more interest in environmental science than ensemble clustering and anomaly detection [2, 22]. Although ensemble clustering has been used in many areas, especially in bioinformatics, only a few studies [4, 25] have been conducted so far in environmental science.

| Ref. | Year | Type | Description | Data mining task | Ensemble strategy | Algorithms | Validation |
|---|---|---|---|---|---|---|---|
| [22] | 2017 | Energy | Identification of anomalous consumption patterns in building energy consumption | Anomaly detection | 2, 4 | RF, SVR, CCAD-SW using autoencoder and PCA, EAD | TPR = 98.10%; FPR = 1.98% (for EAD model) |
| [18] | 2016 | Climate | Determination of the impact of climate change on the habitat suitability for large brown trout | Prediction | 1, 2 | Generalized additive models, MLP with bagging ensembles, RF, SVM, and fuzzy rule-based systems (TSK) | Threefold cross validation; weighted MSE = 0.18 (MLP with bagging ensembles); overall true skill statistics (TSS) = 0.69 (RF) |
| [11] | 2015 | Soil | Classification of complex land use/land cover categories of desert landscapes using remotely sensed data | Classification | 2, 3 | RF and boosted ANNs | Mean class user’s accuracy = 86.7% (for boosted ANN) and 86.6% (for RF ANN) |
| [26] | 2015 | Soil | Solve the problem of rare classes’ classification on dust storm forecasting | Classification | 2, 3 | SMOTE with AdaBoost and RF (SARF), SVM, fuzzy ANN | Tenfold cross validation; accuracy = 96.51% (SARF) |
| [4] | 2015 | Air | Forecasting of air pollutant values for the Attica area | Clustering | 2, 4 | SOM for clustering, FFANN and RF ANN for regression, FIS to obtain fuzzy values | Tenfold cross validation; RMSE and R2 |
| [9] | 2014 | Water | Predictive modeling of groundwater nitrate pollution | Prediction | 2 | RF regression, LR | ROC = 0.923 (for model RF-A); AUC = 0.911 (for model RF-B) |
| [25] | 2013 | Living organisms | Construction of habitat models for living species in the Lake Prespa, Macedonia; in the soils of Denmark; and in the Slovenian rivers | Clustering | 1, 2 | RF and bagged multitarget predictive clustering tree (PCT) and single-target DT | Tenfold cross validation |
| [5] | 2012 | Air | Prediction of the Macau’s air pollution index | Prediction | 1 | Bootstrap sampling with replacement and random sampling without replacement using ANFIS method as base learner | RMSE = 12.21 (ANFIS with random sampling) |
| [2] | 2011 | Air energy | Detect overconsumption of fuel in aircrafts | Anomaly detection | 1 | Bootstrap sampling on each of the regression tree (tree), elastic network, GP, and stable GP regression methods | ROC = 0.90; NRMSE varied consistently between 85 and 90% |

Table 1.

Comparison of ensemble-based environmental data mining studies.

ANN, artificial neural network; SVR, support vector regression; PCA, principal component analysis; MLP, multilayer perceptron; SOM, self-organizing maps; EAD, ensemble anomaly detection; FIS, fuzzy inference system; GP, Gaussian process; MSE, mean squared error; RMSE, root-mean-square error; TPR, true positive rate; ROC, receiver operating characteristic.

The idea of using an ensemble of classifiers rather than the single best classifier has been proposed in several environmental data mining studies [5, 11, 26]. It is apparent that ensemble learners boost the performance of the single classifiers. Different models pick up different patterns in data. By pooling all these predictions together, as long as they are reasonably independent, informed, and diverse, the outcomes tend to be better.

One of the most popular ensemble learning strategies, bagging, is also well suited to developing models for solving environmental problems. For example, it has been utilized to forecast the air pollution level of a region [5] and to establish habitat models for living species [25].

The second type of ensemble learning strategy, the random forest (RF) algorithm, has also been applied for classifying environmental data. It has been applied to predict pollutant occurrences in groundwater [9], to determine the impact of climate change on the habitat suitability for a fish species [18], and to predict dust storms accurately [26].

Another ensemble learning strategy (boosting), the AdaBoost algorithm, has been used in various types of environmental applications, such as classifying complex land use/land cover categories of desert landscapes using remotely sensed data [11], solving the problem of rare classes’ classification in dust storm forecasting [26], and discovering plant species for automatic weed control [27].

Training with different algorithms in each ensemble (voting) is another commonly used ensemble strategy in environmental science. Examples include the identification of anomalous consumption patterns in building energy consumption [22] and the forecasting of air pollutant values of a region [4].

Differently from existing studies, the study presented in this chapter focuses on applying four distinct ensemble strategies to environmental datasets using (i) different training sets formed by random sampling with replacement (bagging), (ii) different training sets obtained by random instance and feature subset selection (random forest), (iii) different training sets using random sampling with replacement over weighted data (AdaBoost), and (iv) different algorithms (voting).

2.2. Advantages of ensemble-based environmental data mining

Some of the advantages of environmental data mining are given below:

  • Prediction of parameters expected based on other parameters or under different cases in environmental studies, for example, prediction of rainfall [20], climate change [16, 17, 18], species richness/diversity [24, 25], and atmospheric parameters [28].

  • Construction of models to reduce the consumption of energy [21, 22, 23] and raw materials [2] such as wood, grass, metal, steel, plastics, glass, paper, fuel, and natural gas.

  • Clustering the items in environmental data to describe the current situation more clearly and to plan different activities for different clusters [4, 25].

  • Classification of environmental audio and environmental noise [19].

  • Processing ecological data for better modeling ecological systems [24, 25].

  • Analyzing environmental data toward a better quality control such as air quality [1, 5, 6] and water quality [7, 8, 9].

  • Identifying unexpected patterns in environmental data using a data mining algorithm and detecting anomalies in environmental data [2, 22] (bad values, changes, errors, noise, fraud, and abnormal activities) in order to raise an alarm.

  • Determination of the most important factor that affects the environment using a data mining technique such as decision tree and random forest [29].

  • Development of a model to manage resources effectively [2, 21, 23], including environmental resources such as air, water, and soil; flow resources such as solar power [30] and wind energy; and natural resources such as coal, gas, and forests.

  • Discovering patterns that can be used for better waste management and recycling.

  • Analyzing the records of financial transactions related to environmental economics for better decision-making, i.e., investigating the financial impacts of environmental policies.

  • Using ensemble methods as a preprocessing step before performing the essential environmental study.

  • Clustering environmental documents according to their topics and main contents.

  • Usage of process mining to improve work management in environmental science.


3. Background information

3.1. Ensemble learning

Ensemble learning is a machine learning technique where multiple learners are trained to solve the same problem and their predictions are combined into a single output that is likely to perform better on average than any individual ensemble member. The fundamental idea behind ensemble learning is to combine weak learners into one strong learner that has a lower generalization error and is less sensitive to overfitting in the presence of noise or small sample size. This is because different classifiers can misclassify different patterns, so accuracy can be improved by combining the decisions of complementary classifiers.
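The benefit of combining complementary classifiers can be illustrated with a small calculation. The sketch below is not taken from the chapter's experiments; it assumes a binary problem, an odd number of classifiers, and fully independent errors (an idealization that real ensembles only approximate), and the function name `majority_vote_accuracy` is chosen here purely for illustration.

```python
from math import comb


def majority_vote_accuracy(n_classifiers: int, p_correct: float) -> float:
    """Probability that a strict majority of independent classifiers is right.

    Assumes a binary task and that every classifier is correct with the same
    probability p_correct, independently of the others.
    """
    majority = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k)
        * p_correct**k
        * (1 - p_correct) ** (n_classifiers - k)
        for k in range(majority, n_classifiers + 1)
    )


# Three independent classifiers, each 70% accurate, vote together:
# 0.7^3 + 3 * 0.7^2 * 0.3 = 0.784
print(round(majority_vote_accuracy(3, 0.7), 3))
```

Under the same assumptions, members that are worse than chance make the ensemble worse rather than better, which is one reason the base learners need to be both diverse and reasonably informed.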

3.2. Elements of an ensemble classifier

A typical ensemble framework for classification tasks contains four fundamental components, described as follows:

  • Training set: a training set is a special set of labeled examples providing known information that are used for training.

  • Base inducer(s) or base classifier(s): an inducer is a learning algorithm that is used to learn from a training set. A base inducer obtains a training set and constructs a classifier that generalizes relationship between the input features and the target outcome.

  • Diversity generator: it is clear that nothing is gained from an ensemble model if all ensemble members are identical. The diversity generator is responsible for generating the diverse classifiers and decides the type of every base classifier that differs from each other. Diversity can be realized in different ways depending on the accuracy of individual classifiers for the improved classification performance. Common diversity creation approaches are (i) using different training sets, (ii) combining different inducers, and (iii) using different parameters for a single inducer.

  • Combiner: the task of the combiner is to produce the final decision by combining all classification results of the various base inducers. There are two main methods of combining: weighting methods and meta-learning methods. Weighting methods give each classifier a weight proportional to its strength and combine their votes based on these weights. The weights can be fixed or dynamically determined when classifying an instance. Common weighting methods are majority voting, performance weighting, Bayesian combination, and vogging. Meta-learning methods learn from new training data created from the predictions of a set of base classifiers. The most well-known meta-learning methods are stacking and grading. Weighting methods are useful when the classifiers are built from a single learning algorithm and have comparable success, whereas meta-learning is a good choice when base classifiers consistently classify correctly or consistently misclassify.
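A weighting combiner of the kind described in the last item can be sketched in a few lines. This is a minimal illustration rather than the combiner used in the experiments; the function name `weighted_vote`, the class labels, and the weights are all hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    predictions: one predicted label per base classifier.
    weights: one non-negative weight per base classifier, for example its
             validation accuracy (an assumption made for this sketch).
    """
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


votes = ["polluted", "clean", "clean"]

# Equal weights reduce to plain majority voting.
print(weighted_vote(votes, [1.0, 1.0, 1.0]))   # clean

# A single strong classifier can outweigh two weaker ones (0.9 > 0.3 + 0.4).
print(weighted_vote(votes, [0.9, 0.3, 0.4]))   # polluted
```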


4. Ensemble strategies

In order to construct an ensemble model, any of the following strategies can be performed:

4.1. Strategy 1: different training sets using random sampling with replacement

One ensemble strategy is to train different base learners by different subsets of the training set. This can be done by random resampling of a dataset (i.e., bagging; Figure 2a). When we train multiple base learners with different training sets, it is possible to reduce variance and therefore error.

Figure 2.

Different ensemble strategies: (a) bagging, (b) random forest, (c) AdaBoost, and (d) voting.
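Strategy 1 (bagging) can be sketched as follows. The sketch assumes a toy one-feature dataset and a deliberately simple base inducer (nearest class mean); both are stand-ins chosen for brevity and are not the classifiers or data used in this chapter.

```python
import random
from collections import Counter


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) examples uniformly at random, with replacement."""
    return [rng.choice(dataset) for _ in dataset]


def train_nearest_mean(dataset):
    """Toy base inducer: predict the class whose mean feature value is closest."""
    grouped = {}
    for x, label in dataset:
        grouped.setdefault(label, []).append(x)
    means = {label: sum(xs) / len(xs) for label, xs in grouped.items()}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


def bagging(dataset, n_models=10, seed=0):
    """Train n_models base learners on bootstrap samples; combine by majority vote."""
    rng = random.Random(seed)
    models = [
        train_nearest_mean(bootstrap_sample(dataset, rng)) for _ in range(n_models)
    ]

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict


# Hypothetical readings labeled "low" or "high".
data = [(0.1, "low"), (0.2, "low"), (0.3, "low"),
        (0.8, "high"), (0.9, "high"), (1.0, "high")]
predict = bagging(data)
print(predict(0.15), predict(0.95))
```

Each bootstrap sample omits some examples and repeats others, so the base learners differ slightly; averaging their votes is what reduces variance.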

4.2. Strategy 2: different training sets obtained by random instance and feature subset selection

The combination of bagged decision trees is constructed similarly to Strategy 1, with one significant adjustment: random feature subsets are used (i.e., random forest; Figure 2b). When there are enough trees in the forest, the random forest classifier is less likely to overfit. It is also useful for reducing the variance of low-bias models, and it handles missing values easily.
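The two sources of randomness in Strategy 2 can be sketched as below. Note the simplification: Breiman's random forest draws a fresh feature subset at every split of every tree, whereas this sketch draws one subset per tree (closer to the random subspace method) to keep the code short. Tree induction itself is omitted, and the function name and the sample rows are hypothetical.

```python
import random


def random_forest_views(rows, n_trees, n_features, seed=0):
    """Yield (projected_rows, feature_indices) pairs, one per tree.

    rows: list of equal-length feature vectors.
    Each tree sees a bootstrap sample of the rows, restricted to a random
    subset of n_features columns.
    """
    rng = random.Random(seed)
    n_columns = len(rows[0])
    for _ in range(n_trees):
        # Instances: sampled with replacement (as in bagging).
        sampled_rows = [rng.choice(rows) for _ in rows]
        # Features: sampled without replacement.
        feature_indices = sorted(rng.sample(range(n_columns), n_features))
        projected = [[row[i] for i in feature_indices] for row in sampled_rows]
        yield projected, feature_indices


rows = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
for projected, feature_indices in random_forest_views(rows, n_trees=3, n_features=2):
    print(feature_indices, projected)
```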

4.3. Strategy 3: different training sets using random sampling with replacement over weighted data

This ensemble strategy can be implemented by weighted resampling of the dataset serially by focusing on difficult examples which are not correctly classified in the previous steps (i.e., boosting; Figure 2c). Boosting helps to decrease the bias of otherwise stable learners such as linear classifiers or univariate decision trees also known as decision stumps.
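The reweighting step at the heart of this strategy can be sketched as follows. The code uses a common textbook formulation of discrete AdaBoost for two classes; it assumes the weighted error lies strictly between 0 and 0.5, and it is not necessarily identical to the variant implemented in any particular library. The updated weights can then serve as sampling probabilities for the next training set.

```python
import math


def adaboost_round(weights, is_correct):
    """One boosting round: return (alpha, updated normalized weights).

    weights: current instance weights, summing to 1.
    is_correct: whether the current base classifier got each instance right.
    """
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    alpha = 0.5 * math.log((1 - error) / error)  # vote weight of this classifier
    raw = [
        w * math.exp(-alpha if ok else alpha)  # shrink correct, grow misclassified
        for w, ok in zip(weights, is_correct)
    ]
    total = sum(raw)
    return alpha, [w / total for w in raw]


# Five equally weighted instances; the last one is misclassified.
alpha, new_weights = adaboost_round([0.2] * 5, [True, True, True, True, False])
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update, the misclassified instance carries half of the total weight, so the next base learner is pushed to focus on it.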

4.4. Strategy 4: different algorithms

The other ensemble strategy (i.e., voting; Figure 2d) is to use different learning algorithms to train different base learners on the same dataset. So, the ensemble includes diverse algorithms that each takes a completely different approach. The main idea behind this kind of ensemble learning is taking advantage of classification algorithms’ diversity to face complex data.
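A minimal sketch of Strategy 4 is given below. The three toy inducers (1-nearest neighbor, nearest class mean, and a majority-class baseline) and the one-feature dataset are illustrative assumptions, not the SVM, NB, DT, KNN, and RF models combined in the experimental study.

```python
from collections import Counter


def train_one_nn(data):
    """1-nearest-neighbor on a single numeric feature."""
    return lambda x: min(data, key=lambda example: abs(x - example[0]))[1]


def train_nearest_mean(data):
    """Predict the class whose mean feature value is closest."""
    grouped = {}
    for x, label in data:
        grouped.setdefault(label, []).append(x)
    means = {label: sum(xs) / len(xs) for label, xs in grouped.items()}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


def train_majority_class(data):
    """Baseline that always predicts the most frequent class."""
    majority = Counter(label for _, label in data).most_common(1)[0][0]
    return lambda x: majority


def train_voting_ensemble(inducers, data):
    """Train every inducer on the same dataset and combine by majority vote."""
    models = [induce(data) for induce in inducers]

    def predict(x):
        return Counter(model(x) for model in models).most_common(1)[0][0]

    return predict


# Hypothetical rainfall readings labeled "dry" or "wet".
data = [(2.0, "dry"), (3.0, "dry"), (4.0, "dry"), (9.0, "wet"), (10.0, "wet")]
predict = train_voting_ensemble(
    [train_one_nn, train_nearest_mean, train_majority_class], data
)
print(predict(9.5))  # wet: two of the three models outvote the baseline
print(predict(2.5))  # dry
```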

4.5. Characteristic of different ensemble classifiers

Although ensemble classifiers have a common goal to construct multiple, diverse and predictive models and finally to combine their outputs, each strategy is carried out in different ways using different training sets, combiner or inducer. Table 2 summarizes the properties of different ensemble strategies, the popular algorithms under each category and pros and cons of each ensemble classifier.

| Algorithm | Training set | Classifiers | Combiner | Inducer | Ensemble strategy | Advantage | Weakness |
|---|---|---|---|---|---|---|---|
| Bagging | Random resampling | Inducer independent | Majority voting | Single inducer | 1 | Minimizes variance | A relatively large ensemble size; loss of cooperation with each other |
| Random forest | Random resampling + feature subset | Inducer dependent (decision tree) | Majority voting | Single inducer | 2 | | |
| Boosting | Weighted resampling | Inducer independent | Weighted majority voting | Single inducer | 3 | Boosts the performance of the weak learners | Degrades with noise |
| AdaBoost | Weighted resampling | Inducer independent | Weighted majority voting | Single inducer | 3 | | |
| Stacking | Resampling and k-folding | Inducer independent | Meta-learning | Multiinducer | 1, 4 | Good performance | Storage and time complexity |
| Grading | Resampling and k-folding | Inducer independent | Meta-learning | Multiinducer | 1, 4 | Predictions are graded | Storage and time complexity |
| Voting | Same dataset | Inducer independent | Majority voting | Multiinducer | 4 | Increase predictive accuracy | How classifiers are selected |
| Voting | Same dataset | Inducer independent | Majority voting | Single inducer | 4 | Simple to understand and implement | Limited to a single algorithm performance |

Table 2.

Characteristic of different ensemble classifiers.

4.6. Challenges of ensemble learning in environmental data mining

Although ensemble-based environmental data mining is helpful, as the advantages listed in Section 2.2 indicate, it also poses challenges that can be overcome once one is aware of them. The challenges can be grouped under five main titles: selecting ensemble strategy, determining a satisfactory architecture, computational cost, complex nature of environmental data, and post processing:

  • Selecting ensemble strategy: it is difficult to determine the best ensemble strategy in terms of accuracy, scalability, computational cost, usability, compactness, and speed of classification. Environmental researchers should know how to construct an ensemble model and be aware of the alternative strategies and their advantages and disadvantages. To overcome this problem, environmental data mining is best addressed by computer and environmental scientists working together.

  • Determining a satisfactory architecture: there are two levels of problems in designing ensemble architecture. First, it is necessary to determine the optimal ensemble size. There are three approaches for determining the ensemble size: (i) preselection of the ensemble size, (ii) selection of the ensemble size while training, and (iii) postselection of the ensemble size (pruning). Second, how are learning algorithms and their respective parameters selected to construct the best ensemble? The best values for the input parameters of the algorithms should be determined through a number of tries. These problems are fundamentally different and should be solved separately to improve classification accuracy. Furthermore, it is necessary to update the model when new environmental data is acquired, allowing the up-to-date model to change over time.

  • Computational cost: increasing the number of classifiers usually increases computational cost. To overcome this problem, users may predefine a suitable ensemble size limit, or classifiers can be trained in parallel.

  • Complex nature of environmental data: it is necessary to deal with the high dimensionality and complexity of environmental data. To reduce the dimensionality of the feature vector, feature selection or extraction techniques such as principal component analysis, information gain, and ReliefF can be used. Another problem is dealing with heterogeneous data, which can be addressed by adding problem-specific science algorithms to the solution.

  • Post processing: another critical issue is determining what the best voting mechanism (majority, weighted, average, etc.) for combining the outputs of base classifiers is. Furthermore, the final results should be presented in an appropriate form to help users understand and interpret easily.


5. Experimental study

In this study, different ensemble learning strategies were compared in terms of classification accuracy, precision (PRE), recall (REC), and f-measure (F-MEA). Four ensemble learning strategies were tested on six different real-world environmental datasets. The application was developed using the Weka open-source data mining library.
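For reference, the three metrics besides accuracy can be computed for one class as sketched below. The function name and the example labels are hypothetical; the chapter does not state how per-class values were averaged across classes, so the sketch makes no assumption about that step.

```python
def precision_recall_f_measure(actual, predicted, positive):
    """Return (precision, recall, f-measure) for the class `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)


actual    = ["ozone", "ozone", "ozone", "normal", "normal", "normal"]
predicted = ["ozone", "ozone", "normal", "ozone", "normal", "normal"]
# tp = 2, fp = 1, fn = 1, so all three values equal 2/3.
print(precision_recall_f_measure(actual, predicted, positive="ozone"))
```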

5.1. Dataset description

In this experimental study, six different datasets that are available for public use were selected to determine the best ensemble strategy. Basic characteristics of the investigated environmental datasets are given in Table 3.

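| ID | Dataset name | Year | Attributes | Instances | Type | Link |
|---|---|---|---|---|---|---|
| 1 | Ozone (1 h) | 2008 | 74 | 2536 | Air | |
| 2 | Ozone (8 h) | 2008 | 74 | 2534 | | |
| 5 | Forest type | 2015 | 28 | 523 | Ecology | |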

Table 3.

Environmental datasets and their characteristics.

5.2. Comparison of ensemble strategies

Classification accuracies, precision, recall, and f-measure values for the applied algorithms were obtained using tenfold cross validation. A comparison of the classification accuracies of the applied algorithms for each dataset is displayed in Figure 3. Four weak learners (support vector machine (SVM), naive Bayes (NB), decision tree (DT) applied with the C4.5 algorithm, and K-nearest neighbor (KNN)) and four ensemble learners (bagging, random forest (RF), AdaBoost, and voting) were used to construct classification models from environmental data. For each dataset, the base classifier of the ensemble learners was the weak learner that gave the best classification accuracy on that dataset.

Figure 3.

Comparison of single and ensemble classifiers in terms of classification accuracies.
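The tenfold cross validation behind these comparisons can be sketched as an index split. This is a plain (unstratified) variant written for illustration; the function name is hypothetical, and the instance count of 523 is taken from the forest type dataset in Table 3.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


for train, test in k_fold_indices(523):
    # Each instance is used for testing exactly once across the ten folds.
    print(len(train), len(test))
```

The reported score is then the average of the metric over the ten held-out folds.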

The experimental results were obtained with the optimum parameters (given in Table 4), which were found using grid search. For SVM, the complexity parameter, C, and the exponent value, E, of the polykernel were searched over [10^k for k ∈ {−3, …, 3}] and [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], respectively. For DT, the confidence factor for pruning, C, and the minimum number of objects per leaf, M, were searched over [0.05–0.95] and [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], respectively. The number of neighbors, N, for the KNN classifier was selected in the range [1, 25]. For the RF classifier, the number of randomly chosen attributes, K, and the number of iterations to be performed, I, were searched over the intervals [0–15] and [10–100], respectively. The number of ensemble classifiers for bagging was 10 for each dataset. For the AdaBoost classifier, the weight threshold for weight pruning, P, and the number of iterations to be performed, I, were both selected in the interval [10–100]. Voting was performed using the optimum parameters of the SVM, NB, DT, KNN, and RF classifiers.
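The grid search over the SVM parameters can be sketched as an exhaustive loop over the two ranges given above. The scoring function below is a placeholder invented for this sketch; in an actual experiment it would train the classifier with the given parameters and return its cross-validated accuracy.

```python
from itertools import product


def grid_search(grid, score):
    """Evaluate every parameter combination and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Ranges from the text: C in 10^k for k = -3..3, E in 1..10 (70 combinations).
svm_grid = {
    "C": [10.0**k for k in range(-3, 4)],
    "E": list(range(1, 11)),
}


def toy_score(params):
    """Placeholder score that happens to peak at C = 1000, E = 2."""
    return -(abs(params["C"] - 1000.0) / 1000.0 + abs(params["E"] - 2))


best_params, best_score = grid_search(svm_grid, toy_score)
print(best_params)
```

The cost grows with the product of the range sizes, which is one reason the computational cost listed among the challenges in Section 4.6 matters in practice.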

CECMNDistance metricIKIP
Ozone (1 h)10320.05117Euclidean distance1051010
Ozone (8 h)10350.5515Chebyshev distance10510010
Leaf10010.0521Manhattan distance60010010
Eucalyptus10110.1529Manhattan distance5028040
Forest type10010.15311Manhattan distance50111010
Cloud10010.05117Euclidean distance100010040

Table 4.

Optimum classifier parameters corresponding to each dataset.

The objective of this experiment is to highlight the success of the ensemble strategies in terms of classification accuracy on environmental data. According to the experimental results, the number of correctly classified instances increases when ensemble strategies are applied. In particular, the AdaBoost classifier provides a significant performance gain compared to the other models. SVM is superior to the other single learners; hence, most of the ensemble models selected it as the base learner.

There are a number of cases resulting in poor classification performance, such as the following:

  • In case of the presence of either noisy or missing data

  • If there is an insufficient number of instances available

  • If there are too many classes

  • If a complex relationship is inherent

  • If the feature dependencies are ignored

  • If the feature selection is not well performed

  • If the algorithm parameters are not correctly determined

  • If the class labels are imbalanced

For example, because the “cloud” dataset contains very few instances, inferior results are obtained for most of the applied algorithms, as expected. However, even in such cases, while some algorithms fail, others manage to perform well (e.g., C4.5 DT with 82%). In this situation, the classifier’s performance can be further enhanced by applying ensemble learning methods, as in the case of AdaBoost with 84% classification accuracy for the same dataset. AdaBoost is a powerful ensemble learning algorithm because its distribution update step ensures that instances misclassified by the previous classifier are more likely to be included in the training data of the next classifier, giving it a chance to improve on them.

Because classification accuracy alone is not sufficient to decide whether a learner performs well, the precision, recall, and f-measure values were also calculated for each model (Table 5). The table values also indicate that applying ensemble strategies is generally preferable to using single learners in terms of classifier performance.

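| Dataset | Algorithm | PRE | REC | F-MEA |
|---|---|---|---|---|
| Ozone (1-h) | SVM | 0.97 | 0.97 | 0.95 |
| Ozone (1-h) | C4.5 DT | 0.94 | 0.97 | 0.95 |
| Ozone (8-h) | SVM | 0.93 | 0.94 | 0.93 |
| Ozone (8-h) | C4.5 DT | 0.87 | 0.93 | 0.90 |
| Ozone (8-h) | SVM (bagged) | 0.92 | 0.94 | 0.93 |
| Ozone (8-h) | SVM (AdaBoost) | 0.93 | 0.94 | 0.93 |
| Forest types | SVM | 0.91 | 0.91 | 0.91 |
| Forest types | C4.5 DT | 0.88 | 0.88 | 0.87 |
| Eucalyptus | SVM | 0.65 | 0.65 | 0.65 |
| Eucalyptus | C4.5 DT | 0.66 | 0.65 | 0.64 |
| Cloud | SVM | 0.37 | 0.40 | 0.37 |
| Cloud | C4.5 DT | 0.82 | 0.82 | 0.82 |
| Cloud | C4.5 DT (bagged) | 0.55 | 0.54 | 0.54 |
| Cloud | C4.5 DT (AdaBoost) | 0.84 | 0.84 | 0.84 |
| Leaf | SVM | 0.78 | 0.76 | 0.76 |
| Leaf | C4.5 DT | 0.66 | 0.65 | 0.64 |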

Table 5.

Precision (PRE), recall (REC), and f-measure (F-MEA) results using tenfold cross validation for respective algorithms in each dataset.


6. Conclusion and future work

This study aims to provide helpful guidelines for future applications by presenting the advantages and challenges of ensemble-based environmental data mining and comparing alternative ensemble strategies through experimental studies. It compares four different ensemble strategies for environmental data mining: (i) bagging, (ii) bagging combined with random feature subset selection, (iii) boosting, and (iv) voting. In the experimental studies, ensemble methods are tested on different real-world environmental datasets.

In the future, the following studies can be carried out:

  • Multistrategy ensemble learning that combines several ensemble strategies can be addressed, instead of a single ensemble strategy.

  • Text mining, web mining, and process mining have been used in many engineering fields. However, there is very limited usage of them in environmental engineering. Future research can focus on these subjects.

  • Ontologies can be developed for the environmental domain. We believe that future environmental data mining studies will be supported by ontologies to extract semantic relationships, to improve accuracy, and to develop better decision support systems.

© 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite and reference

Goksu Tuysuzoglu, Derya Birant and Aysegul Pala (February 17th 2018). Ensemble Methods in Environmental Data Mining, Data Mining, Ciza Thomas, IntechOpen, DOI: 10.5772/intechopen.74393. Available from:
