Open access peer-reviewed chapter

Differential Evolution Algorithm in the Construction of Interpretable Classification Models

By Rafael Rivera-Lopez and Juana Canul-Reich

Submitted: November 15th 2017Reviewed: February 20th 2018Published: June 27th 2018

DOI: 10.5772/intechopen.75694

Downloaded: 433

Abstract

In this chapter, the application of a differential evolution-based approach to induce oblique decision trees (DTs) is described. This type of decision trees uses a linear combination of attributes to build oblique hyperplanes dividing the instance space. Oblique decision trees are more compact and accurate than the traditional univariate decision trees. On the other hand, as differential evolution (DE) is an efficient evolutionary algorithm (EA) designed to solve optimization problems with real-valued parameters, and since finding an optimal hyperplane is a hard computing task, this metaheuristic (MH) is chosen to conduct an intelligent search of a near-optimal solution. Two methods are described in this chapter: one implementing a recursive partitioning strategy to find the most suitable oblique hyperplane of each internal node of a decision tree, and the other conducting a global search of a near-optimal oblique decision tree. A statistical analysis of the experimental results suggests that these methods show better performance as decision tree induction procedures in comparison with other supervised learning approaches.

Keywords

  • machine learning
  • classification
  • evolutionary algorithms

1. Introduction

Knowledge discovery refers to the process of nontrivial extraction of potentially useful and previously unknown information from a dataset [1]. Within the stages of this process, data mining stands out since it allows analyzing the data and producing models for their representation. In particular, machine learning provides data mining with useful procedures to build these models, since many of the techniques aimed at information discovery are based on inductive learning. Decision trees (DTs), artificial neural networks (ANN), and support vector machines (SVMs), as well as clustering methods, have been widely used to build predictive models. The ability to track and evaluate every step in the information extraction process is one of the most crucial factors for relying on the models gained from data mining methods [2]. In particular, DTs are classification models characterized by their high levels of comprehensibility and robustness. Knowledge learned via a DT is understandable due to its graphical representation [3], and also DTs can handle noise or data with missing values and make correct predictions [4].

On the other hand, soft-computing-based approaches have been widely used to solve complex problems in almost all areas of science and technology. These approaches try to imitate the process of human reasoning when solving a problem with the objective of obtaining acceptable results in a reasonable time. For the case of data mining, soft computing techniques such as ANN, metaheuristics (MHs), fuzzy logic methods, and other approaches have been used as tools to solve the data mining challenges. In particular, an MH is a general algorithmic template based on intelligent processes and behaviors observed in both nature and other disciplines [5]. Evolutionary algorithms (EAs) are one type of MH that have been successfully applied for providing near-optimal solutions for many computationally complex problems in almost all areas of science and technology. The effectiveness of EAs is due to two factors: (1) they combine a clever exploration of the search space to identify promising areas and (2) they perform an efficient exploitation of these areas aiming to improve the known solution or solutions. EAs are inspired by evolutionary theories that synthesize the Darwinian evolution through natural selection with the Mendelian genetic inheritance. In particular, differential evolution (DE) algorithm is an EA designed for solving optimization problems with variables in continuous domains that, instead of implementing traditional crossover and mutation operators, it applies a linear combination of several randomly selected candidate solutions to produce a new solution [6].

MHs have been previously applied to build DTs, and there exist several surveys that describe their implementation [7, 8, 9, 10, 11]. Some approaches apply a recursive partitioning strategy in which an MH finds a near-optimal test condition for each internal node of a DT; however, the approach most commonly used is to perform a global search in the solution space with the aim of finding near-optimal DTs. Since DE is one of the most powerful EA to solve real-valued optimization problems, and the task of finding a near-optimal oblique hyperplane with real-valued coefficients is an optimization problem in a continuous space, in this chapter, two DE-based methods to induce oblique DTs are described: one implementing a recursive partitioning strategy to find the most suitable oblique hyperplane of each internal node of a decision tree, and the other conducting a global search of a near-optimal oblique decision tree. A statistical analysis of the experimental results suggests that these methods show better performance as decision tree induction procedures in comparison with other supervised learning approaches.

The rest of this chapter is organized as follows: Section 2 provides a set of basic definitions about DTs and the DE algorithm. The induction of oblique DTs by means of MH-based approaches is described in Section 3. The constituent elements of the DE-based methods described in this chapter is discussed in Section 4, and the experimental results are discussed in Section 5. Finally, Section 6 describes the conclusion and the future work.

2. Background

Machine learning methods are an essential tool in emerging disciplines such as data science [12] and business intelligence [13] since they provide efficient predictive models constructed from the data previously collected. DTs, ANN, and SVMs, as well as clustering methods, have been widely used to build these models. A DT is a hierarchical model using an ordered sequence of decisions to predict the class membership of new unclassified instances. An ANN consists of many nonlinear elements connected by links associated with weighted variables operating in parallel [14], in which learning is performed iteratively as the network processes the training instances, trying to simulate the way a human being learns from previous experiences. Finally, one SVM finds the hyperplane that best separates the training instances into two different classes using a set of functions called kernels. The optimal hyperplane is described with a combination of entry points known as support vectors [15].

A DT is an acyclic connected graph with a single root node used as one classification model induced through a set of training instances. A DT contains zero or more internal nodes and one or more leaf nodes [16]. Each internal node evaluates a test condition consisting of a combination of one or more attributes of the dataset, and each leaf node has a class label. The arcs joining an internal node with their successor nodes are labeled with the possible outcomes of its test condition. Each DT branch represents a sequence of decisions made by the model to determine the class membership of a new unclassified instance. The DT induction (DTI) process commonly implements a recursive partition strategy. In each stage of this process, the most appropriate test condition to split the training set is selected according to some partition criterion. As a result of evaluating the training instances with this test condition, two or more instances subsets are created which are assigned to the successor nodes of the current internal node. This process is recursively applied until a stop criterion is reached. If the number of attributes used in the test conditions of the tree internal nodes is regarded, two types of DT can be constructed: axis-parallel or multivariate DTs. An axis-parallel DT is a univariate DT that evaluates a single attribute in each test condition to split the training set. On the other hand, oblique DTs and nonlinear DTs are multivariate DTs in which a linear combination and a nonlinear composition of attributes are utilized in the test conditions of a DT, respectively. Multivariate DTs commonly show better performance, and they are smaller than univariate DTs, but they require more computational effort to induce them. In particular, an oblique hyperplane divides the instance space into two halfspaces, and it is defined as follows:

j=1dhjxj+b>0E1

where dis the number of attributes in the dataset, xjis the value of the j-th attribute, hjis a real-valued coefficient in the hyperplane, and brepresents the independent term of the hyperplane. Figure 1 shows an axis-parallel DT induced from the iris dataset [17] using the J48 method [18], and Figure 2 shows a near-optimal oblique DT constructed from the same dataset by the DE-based method implementing a global search strategy. Iris dataset has four attributes, three class labels, and 150 instances. It is clear that the oblique DT is more compact and more accurate than its axis-parallel version, but it has been proved that to find the oblique hyperplane that minimizes the number of misclassified instances both above and below is an NP-hard problem [19].

Figure 1.

An axis-parallel DT induced from the iris dataset.

Figure 2.

An oblique DT induced from the iris dataset.

On the other hand, MHs are general algorithmic templates that can be easily adapted to solve almost all optimization problems [20]. MHs are nature-inspired procedures using stochastic components to find a near-optimal solution and have several parameters that need to be fitted to the specific problem [21]. In accordance with the number of candidate solutions used in its search procedure, MHs have been grouped in single-solution-based MHs and population-based MHs [22]. Single-solution-based MHs implement intelligent search procedures that iteratively replace a candidate solution with a neighboring solution with the aim of reaching a near-optimal solution. Simulated annealing (SA) and Tabu search (TS) are two well-known single-solution-based MHs. Population-based MHs use a group of candidate solutions in each step of their iterative process. The most commonly used population-based MHs are related to EAs and Swarm intelligence (SI) methods. Genetic algorithms (GA), genetic programming (GP), evolutionary strategies (ES) and DE are the most prominent EAs, and ant colony optimization (ACO) and particle swarm optimization (PSO) are examples of SI methods.

In particular, DE is an effective EA designed to solve optimization problems with real-valued parameters [6]. DE evolves a population X=x1x2xNPof NP chromosomes by applying mutation, crossover, and selection operators with the aim to reach a near-optimal solution. Several DE variants differing in the implementation of the mutation and crossover operators have been described in existing literature. In this chapter, the standard DE algorithm, namedDE/rand/1/binin agreement with the nomenclature adopted to refer DE variants, is used as a procedure to find a near-optimal solution. At each iteration of this evolutionary process, known as a generation, a new population of chromosomes is generated from the previous one. For each i1NPin the g-th generation, xiis taken from the Xg1population, and it is used to build a new vector uiby applying the mutation and crossover operators. Vectors xiand uiare known as the target vector and the trial vector, respectively. To build a new chromosome, instead of implementing a traditional mutation operator, DE first applies a linear combination of several chromosomes randomly chosen from the current population (xr1,xr2,andxr3) to construct a mutated vector vi=xr1+Fxr2xr3, where F is a user-specified value representing a scale factor applied to control the differential variation. Next, the crossover operator determines each parameter in uifrom either xior vi, based on a stochastic decision. If a random value is less than a crossover factor (CF), the j-th parameter value of uiis taken from vi, otherwise its value is uji=xji. Finally, a one-to-one tournament is applied to determine which vector, between xiand ui, is selected as a member of the new population Xg. Figure 3 shows a scheme of the application of the DE operators to build a new chromosome for the next population.

Figure 3.

DE operators applied to build a new chromosome for the next population.

DE has been used in conjunction with several machine learning techniques to implement classification methods [23, 24, 25, 26, 27]. It has been mainly applied to optimize the parameters of classification methods or to conduct preprocessing tasks in a data mining process. DE has several advantages in comparison with other MHs, and since mutation operator is based on a linear combination of several randomly chosen individuals, DE exhibits a good trade-off between its exploitation and exploration skills [28]. On the other hand, although DE requires the definition of a smaller number of parameters compared to other MHs, its performance is sensitive to the values selected for CR, F, and NP.

3. Induction of oblique decision trees using metaheuristics

Several MHs have been used to induce DTs with methods implementing a recursive partitioning strategy. One timeline of these methods is shown in Figure 4. Single-solution-based MHs such as SA and TS have been used to induce DTs through this strategy. SA is applied in the simulated annealing of decision trees (SADT) method [29] that iteratively perturbs one randomly selected coefficient to build a new hyperplane, and in one variant of the oblique classifier 1 (OC1) system [30], named OC1-SA [31], that disturbs simultaneously several coefficients of the best axis-parallel hyperplane found by the OC1 algorithm. TS is used in the linear discriminant and TS (LDTS) method [32] and in the linear discrete support vector DT with TS (LDSDT TS) method [33]. Furthermore, EAs such as ES, GA, and DE also have been applied to build an oblique DT through this strategy. The OC1-ES algorithm [31] and the multimembered ES oblique DT (MESODT) method [34] obtain a near-optimal hyperplane using the 1+1-ES and the μλ-ES, respectively. Furthermore, GA evolves a population of hyperplanes encoded: (1) with a binary chromosome in the binary tree-GA (BTGA) algorithm [35] and in the HereBoy for DT (HBDT) method [36] and (2) with a real-valued chromosome in the OC1-GA algorithm [31] and in the procedures described by Krȩtowski [37], and by Pangilinan and Janssens [38]. Finally, DE is applied in an OC1 variant named OC1-DE algorithm [39].

Figure 4.

Timeline of the MH-based approaches to induce oblique DTs.

On the other hand, several MH-based approaches implementing a global search strategy have been described in the existing literature. GA evolves a population of variable-length chromosomes in the generalized decision tree inducer (GDTI) method [40] and in the evolutionary full tree induction (EFTI) method [41]. Other GA-based approaches such as the Global EA for oblique DTI (GEA-ODT) procedure [42, 43] and the tree analysis with randomly generated and evolved trees (TARGET) algorithm [44] use trees as chromosomes. Furthermore, the standard GP is applied by Liu and Xu [45], the strongly typed GP is used by Bot and Langdon [46, 47], and the grammar-based GP is utilized in the GP with margin maximization (GP-MM) method [48]. Finally, DE is implemented in the perceptron DT (PDT) method [49, 50] and in the DE for inducing oblique DTs (DE-ODT) method [51].

4. DE-based methods to induce oblique decision trees

In this chapter, two methods to induce an oblique DT using the DE/rand/1/bin algorithm are described. The first method, named OC1-DE, is similar to the OC1 system and its variants, but it applies DE to find a near-optimal hyperplane at each internal node of an oblique DT [39]. The second one, named DE-ODT method, implements a global search strategy to induce oblique DTs [51].

4.1. OC1-DE method to search near-optimal oblique hyperplanes

The OC1-DE method is based on the OC1 system [30] and the OC1-GA procedure [31]. The OC1 system applies a two-step process to find a better hyperplane. First, it finds the best axis-parallel hyperplane splitting the instance set. Next, it applies two procedures to disturb the hyperplane coefficients:

  • Sequential perturbation: This is a deterministic rule that adjusts the hyperplane coefficients, taking one at a time and looking for its optimal value.

  • Random vector perturbation: When the sequential perturbation reaches a local optimum, a random vector is added to the current hyperplane with the aim of looking elsewhere in the solutions space.

Finally, this procedure returns as the best hyperplane to the one selected between the best-perturbed hyperplane and the best axis-parallel hyperplane. OC1 uses several criteria to evaluate the quality of the candidate hyperplanes such as information gain [52] and three criteria introduced by Heath [19]: max minority, minority sum, and sum of impurities. Induced DT is pruned by removing sub-trees whose impurity value is less than a predefined threshold value. An improved OC1 version [53] uses the elements defined in the CART method [54]: the Gini index and the twoing rule as splitting criteria and the cost-complexity pruning method.

On the other hand, the OC1-GA method is an OC1 variant encoding the hyperplane coefficients in a real-valued chromosome. First, the axis-parallel hyperplane that best splits the training instances is obtained. This hyperplane is copied to 10%of the initial population and the remaining hyperplanes are randomly created. Then, OC1-GA evolves this population to find a near-optimal hyperplane evaluating its quality through the twoing rule. Oblique DT is then induced in a recursive partitioning strategy, and it is pruned using the cost-complexity pruning method.

The DE implementation to find a near-optimal hyperplane at each internal node of an oblique DT is shown in the Algorithm 1. First, the axis-parallel hyperplane that best splits a set of training instances is obtained (line 1). It is copied to 10%of the initial population, as is proposed in [31], and the remaining hyperplanes are randomly created (line 2). Each random hyperplane is constructed considering that almost two instances with different class labels are separated by the hyperplane. Next, this population is evolved through several generations using the DE operators (lines 3–19), and the best hyperplane in the population is selected (line 20). Finally, the OC1-DE algorithm returns the hyperplane selected between the best axis-parallel hyperplane and the best oblique hyperplane produced by DE (line 21).

The hyperplane returned by the OC1-DE is used as the test condition of a new internal node that is added in an oblique DT. This hyperplane is used to split the training instances into two subsets. The OC1-DE method is recursively applied using each subset until a leaf node is created as all instances in the subset have the same class label or a threshold value of unclassified instances is reached. The quality of the hyperplane is obtained using the twoing rule. Finally, a pruning procedure is applied in order to reduce the overfitting of DT produced and to improve its predictive power.

4.2. DE-ODT method to induce oblique decision trees

The DE-ODT method implements a global search strategy in which the DE algorithm is applied to find a near-optimal oblique DT, where each real-valued chromosome encodes only a feasible oblique DT.

4.2.1. Linear representation of oblique decision trees

In the DE-ODT method, each chromosome represents the internal nodes of a complete binary oblique DT stored in a fixed-length real-valued vector (Figure 5). This vector encodes the set of hyperplanes used as test conditions of the oblique DT. Vector size is determined using both the number of attributes and the number of class labels of the training set whose model is induced. Since each internal node of an oblique DT has a hyperplane as its test condition, the size of the real-valued vector xiused to encode each i-th candidate solution in the population is fixed as ned+1, where neis the estimated number of internal nodes of a complete binary oblique DT. Considering that: (1) an oblique DT is more compact than its univariate version and since (2) the DT size is related to the structure of the training set, the DE-ODT method determines the value of nebased on both the number of attributes and the number of class labels (s) in it.

Figure 5.

Linear encoding scheme for the internal nodes of a complete binary oblique tree.

If the number of internal nodes of a complete binary DT with height His 2H11and the number of leaf nodes of the same DT is 2H1, two heights can be obtained: Hi=log2d+1+1, and Hl=log2s+1. Using these equations, neis determined as follows:

ne=2maxHiHl11,E2

and, the size of the real-valued parameter vector representing a sequence of nehyperplanes for a training set with dattributes is computed as follows:

n=ned+1.E3

As an example, if a hypothetical dataset with three numerical attributes and three class labels is used to induce an oblique DT, then d=3and s=3. In this case, Hi=log24+1=3and Hl=log23+1=3. Finally, ne=2max331=7. This implies that the oblique DT could have seven internal nodes. Finally, one chromosome representing a candidate solution in the evolutionary process of this problem has 28 real-valued parameters.

4.2.2. Induction of feasible oblique decision trees

The DE-ODT method applies the following steps to map an oblique DT from a chromosome xiof the population:

  1. Hyperplanes construction: xiis used to build the vector wirepresenting the sequence of candidate hyperplanes utilized in the internal nodes of a partial DT. Since the values of xirepresent the hyperplane coefficients contained in these nodes, the following criterion applies: Values x1ixd+1iare assigned to the hyperplane h1, the values xd+2ix2d+2iare assigned to the hyperplane h2, and so on. For each j1ne, and for each k1d+1, the k-th coefficient of hjis designed as follows

    hkj=xj1d+1+ki.E4

These hyperplanes are assigned to the elements of wi: h1is assigned to w1i, h2is assigned to w2i, and so on. Figure 6 shows an example of the construction of a set of hyperplanes from one chromosome for the hypothetical dataset previously described. Once wiis completed, it is used to create a partial DT with only internal nodes.

  • Partial oblique decision tree construction: wiis used to create the partial tree (pTi). First, the element in the initial location of wiis used as the root node of pTi. Next, the remaining elements of wiare inserted in pTias successor nodes of those previously added so that each new level of the tree is completed before placing new nodes at the next level, in a similar way to the breadth-first search strategy. Figure 7 shows an example of the construction of a partial oblique DT from wi. In this figure, it can be observed that w1iis selected as the tree root node, w2iand w3iare placed as the successor nodes of w1i, w4iand w5iare designed as the successor nodes of w2i, and so on.

  • Oblique decision tree completion: The final stage of the mapping scheme adds leaf nodes in pTiusing the training set. In this stage, one instance set is assigned to a node (the complete training set for the root node of the tree), and it is labeled as an internal node. To evaluate each instance in this set using the hyperplane associated to the internal node, two instances subsets are created, and they are assigned to the successor nodes of this node. This assignment is repeated for each node of the partial DT. If the internal node is located at the end of a branch of the DT, then two leaf nodes are created, and they are designated as successor nodes of this node. The instances subsets created are assigned to these leaf nodes. On the other hand, if all instances in the set assigned to the internal node have the same class label, it is labeled as a leaf node and its successor nodes are removed, if they exist. Figure 8 shows an example of this tree-completion procedure. Figure 8 shows that all the instances assigned to w3and w5have the same class label, so they are designated as leaf nodes, and the successor nodes of w3are removed from the tree. On the other hand, since w4is the ending node of a branch, its instance set is split using its hyperplane, the instances subsets produced are assigned to two new leaf nodes, and their majoritarian classes are assigned as their class labels. It can be observed that this tree has three internal nodes and four leaf nodes.

  • Figure 6.

    Construction of a set of hyperplanes from xi.

    Figure 7.

    Construction of a partial oblique DT with only internal nodes.

    Figure 8.

    Completion of an oblique DT using pTi and the training instances.

    4.2.3. General structure of the DE-ODT method

    The Algorithm 2 shows the structure of the DE-ODT method described in this chapter. This procedure requires to identify the training set used to induce an oblique DT, as well as the three control parameters applied by the DE algorithm (CR, F, and NP) and the threshold value (τ) used to determine if a node is labeled as a leaf node. First, the DE-ODT method gets the attributes vector (a), the vector of class labels (c), and the instance set (ι) from the dataset whose model must be built (line 1). Next, the value of dand nare computed (lines 2–4). Then, the DE algorithm evolves a population of real-valued chromosomes encoding oblique DTs. DE selects the best candidate solution xbestin the last population as the result of its evolutionary process (line 5). After that, a near-optimal oblique DT is constructed applying the procedures described in the previous paragraphs (lines 6–8). Since the DE-ODT method uses an a priori definition of the size of the chromosome, it is possible that some leaf nodes in the DT do not meet the following conditions: that the size of its instances subset is less than τor that all the instances in the subset belong to the same class. The DE-ODT method refines the DT by replacing nonoptimal leaf nodes with sub-trees (line 9). Finally, the oblique DT is pruned to reduce the possible overfitting generated by applying this refinement (line 10).

    This procedure allows inducing feasible oblique DTs with a different number of nodes, although they are represented with a fixed-length parameter vector.

    5. Experimental study

    In this chapter, the experimental study carried out to analyze the performance of the DE-based methods for DTI is detailed. First, a description of the datasets used in this study, as well as the definition of the parameters of each method, are given. Then, both the model validation technique used in the experiments and the statistical tests applied to evaluate the results obtained are outlined. Finally, a discussion about the performance of the DE-based methods is provided.

    5.1. Experimental setup

    A benchmark of 20 datasets chosen from the UCI machine learning repository [55] is used to carry out the experimental study. These datasets have been selected as their attributes are numerical, their instances are classified into two or more classes, and most of them are imbalanced datasets. Table 1 shows the description of these datasets. To ensure that the comparison of the results achieved by the DE variants with those produced by other approaches is not affected by the treatment of the data, all datasets used in this study do not have missing values. Also, the data are not preprocessed, filtered, or normalized, that is, they are used as they are obtained from the UCI repository.

    DatasetInstancesAttributesClassesClass distributionDatasetInstancesAttributesClassesClass distribution
    Glass2149770 76 17 0 13 9 29Diabetes76882500 268
    Balance-scale62543288 49 288Heart-statlog270132150 120
    Iris1504350 instances per classAustralian690142307 383
    Ionosphere351342126 225Wine17813359 71 48
    Sonar20860297 111Vehicle846184212 217 218 199
    Liver-disorder34562145 200Page-blocks54731054913 329 28 88 115
    Blood-t74842570 178Breast-tissue-61069622 21 14 15 16 18
    Movement-libras360901524 instances per classParkinsons19522248 147
    Seeds2106370 instances per classSegment2310197330 instances per class
    Ecoli33678143 77 52 35 20 5 2 2Spambase46015721813 2788

    Table 1.

    Description of datasets used in the experiments.

    The DE-based methods are implemented in the Java language using the JMetal library [56]. The mutation scale factor is linearly decreased from 0.5to 0.1as the evolutionary process progresses, and the crossover rate is fixed at 0.9. The decrement in the F value allows more exploration of search space at the beginning of the evolutionary process, and with the passage of the generations, it tries to make a better exploitation of promising areas of this space [57]. The population size is adjusted to 5n, with 250 and 500 chromosomes as lower and upper bound, respectively. These bounds are used to ensure that the population is not so small as not to allow a reasonable exploration of the search space and is not so large as to impact the runtime of the algorithm. Furthermore, the fitness function used in the DE-ODT method computes the training accuracy of each DT in population, and the twoing rule is used as fitness value in the OC1-DE method. The best oblique DT induced by these methods is pruned using the error-based pruning (EBP) approach [58]. Finally, the threshold value used to determine whether a node should be labeled as one leaf node is set to be two instances, and the DT size is defined as the number of leaf nodes of the oblique DT.

    In this study, a repeated stratified 10-fold cross-validation (CV) procedure is used to estimate the predictive performance of the DE-based methods, and the Friedman test [59] is applied to carry out a statistical analysis of the results produced by these methods as compared to them with those obtained by other classification methods. This nonparametric statistical test evaluates the statistical significance of the experimental results through computing the p-value without making any assumptions about the distribution of the analyzed data. This p-value is used to accept or to reject the null hypothesis H0of the experiment which holds that the performance of the compared algorithms does not present significant differences. If the p-value does not exceed a predefined significance level, H0is rejected and the Bergmann-Hommel (BH) post hoc test [60] is conducted to detect the differences between all existing pairs of algorithms. These statistical tests are applied using the scmamp R library [61].

    The results obtained with the DE-based methods are compared with those achieved by several supervised learning methods available on the WEKA data mining software [62]. First, the accuracy and the size of the DTs got by these algorithms are compared with those obtained by the J48 method [63] and by the SimpleCART (sCART) [54] procedure. Next, the accuracy of the DTs constructed with the DE-based procedures is compared with those achieved using the following classification methods: Naïve Bayes (NB) [64], multilayer perceptron (MLP) [65], radial basis function neural network (RBF-NN) [66], and random forest (RF) [67].

    5.2. Comparison with DTI methods

    Table 2 and Figure 9 show the average accuracies of the DTs induced by the DTI algorithms as well as those achieved by the OC1-DE method. In Table 2, the best result for each dataset is highlighted with bold numbers, and the numbers in parentheses refer to the ranking reached by each method for each dataset. The last row in this table indicates the average ranking of each method. It is observed that the DE-based methods produce better results than those generated by the other DTI algorithms.

    DatasetJ48sCARTOC1-DEDE-ODT
    Glass67.62(4)71.26(2)71.31(1)68.97(3)
    Diabetes74.49(3)74.56(2)73.37(4)75.79(1)
    Balance-scale77.82(4)78.74(3)93.92(1)91.97(2)
    Heart-statlog78.15(2)78.07(3)74.11(4)81.11(1)
    Iris94.73(3)94.20(4)96.73(2)97.17(1)
    Australian84.35(4)85.19(2.5)85.19(2.5)85.61(1)
    Ionosphere89.74(3)88.86(4)91.11(2)92.28(1)
    Wine93.20(1)89.49(4)92.58(2)91.88(3)
    Sonar73.61(3)70.67(4)77.65(2)79.34(1)
    Vehicle72.28(2)69.91(4)72.32(1)71.33(3)
    Liver-disorders65.83(4)66.64(3)67.63(2)71.16(1)
    Page-blocks96.99(2)96.76(4)96.88(3)97.07(1)
    Blood-t78.20(2)77.86(3)76.35(4)78.70(1)
    Breast-tissue-634.81(3)32.45(4)34.91(2)38.85(1)
    Movement-libras69.31(2)65.64(3)75.11(1)55.63(4)
    Parkinsons84.72(4)86.31(3)87.95(1)86.43(2)
    Seeds90.90(3.5)90.90(3.5)93.76(1)91.79(2)
    Segment96.79(1)95.83(3)95.93(2)94.78(4)
    Ecoli82.83(4)83.15(3)83.51(2)84.72(1)
    Spambase92.68(2)92.35(3)92.19(4)93.94(1)
    Average ranking2.8253.2502.1751.750

    Table 2.

    Average accuracies obtained by the DTI algorithms and the DE-based methods.

    Figure 9.

    Graphical comparison of the average accuracies obtained by the DTI algorithms and the DE-based methods.

    A statistical test of the experimental results is conducted to evaluate the performance of the DE-based methods. First, the Friedman test is run and its resulting statistic value is 16.197 for four methods and 20 datasets, which has a p-value of 1.033×103. When evaluating this p-value with a significance level of 5%, H0is rejected. Next, the BH post hoc test is applied to find all the possible hypotheses which cannot be rejected. Table 3 shows both the average rank (AR) of the results yielded by each method and the p-values computed by comparing the average accuracies achieved by the DE-based procedures versus those obtained by the other DTI methods. The p-values highlighted with bold numbers indicate that H0is rejected for this pair of methods since they show different performance. Unadjusted p-values are calculated with the average ranks of the two methods being compared, as is described by Demšar in [68]. These values are used by the BH post hoc test to compute the corresponding adjusted p-values. Table 3 shows that the DE-ODT method has a better performance than the other DTI methods since it has the lowest average rank (1.750), and its results are statistically different than these methods. Figure 10 shows a graph where the nodes represent the compared methods and the edges joining two nodes indicate that the performance of these methods does not present significant differences. The values shown in the edges are the p-values computed by the BH post hoc test. This figure is based on that obtained using the scmamp library, and in it is observed that the DE-based methods are not statistically different between them, and that the DE-ODT method is statistically different with the DTI methods. This statistical results indicate that the DE-ODT method is the better DTI method to build oblique DT.

    MethodAROC1-DEDE-ODT
    UnadjustedBHUnadjustedBH
    J482.8251.1134e-011.1134e-018.4584e-032.5375e-02
    sCART3.2508.4584e-032.5375e-022.3856e-041.4131e-03
    OC1-DE2.1752.9786e-015.9572e-01
    DE-ODT1.7502.9786e-015.9572e-01

    Table 3.

    The p-values for multiple comparisons among DTI algorithms and the DE-based methods.

    Figure 10.

    The p-value graph of the DTI algorithms and the DE-based methods.

    On the other hand, the average sizes of the DTs constructed by the DE-based algorithms and also of those induced by the J48 and the sCART methods are shown in Table 4 and Figure 11. Similar to Table 2, the best result for each dataset in Table 4 is highlighted with bold numbers, and the numbers in parentheses refer to the ranking reached by each method for each dataset. These results indicate that the DE-ODT method produces the most compact DTs. Also, it is observed that the size of the DTs built for the OC1-DE method has less complexity than those yielded by the J48 method.

    DatasetJ48sCARTOC1-DEDE-ODT
    Glass23.58(4)8.00(1)21.61(3)11.08(2)
    Diabetes22.20(3)3.00(1)41.55(4)14.77(2)
    Balance-scale41.60(4)13.00(2)15.24(3)5.01(1)
    Heart-statlog17.82(4)16.00(2)17.43(3)7.23(1)
    Iris4.64(3)5.00(4)3.00(1)3.37(2)
    Australian25.75(4)5.00(1)21.90(3)15.64(2)
    Ionosphere13.87(4)3.00(1)7.20(2)7.73(3)
    Wine5.30(3)5.00(2)5.48(4)4.71(1)
    Sonar14.45(4)10.00(2)10.24(3)6.13(1)
    Vehicle69.50(3)80.00(4)56.74(2)44.25(1)
    Liver-disorders25.51(4)3.00(1)22.65(3)6.60(2)
    Page-blocks42.91(4)22.00(1)38.70(3)24.56(2)
    Blood-t6.50(1)10.00(3)22.39(4)8.46(2)
    Breast-tissue-622.45(4)8.00(1)14.09(3)8.97(2)
    Movement-libras47.52(4)30.00(3)27.46(1)29.07(2)
    Parkinsons10.24(4)7.00(2)7.11(3)4.85(1)
    Seeds7.42(4)6.00(3)4.78(2)3.17(1)
    Segment41.21(4)41.00(3)30.53(2)27.91(1)
    Ecoli18.59(4)15.00(3)12.57(2)7.06(1)
    Spambase103.37(4)75.00(3)74.42(2)31.70(1)
    Average ranking3.652.152.651.55

    Table 4.

    Average DT sizes obtained by the DTI methods.

    Figure 11.

    Average DT sizes of several DTI methods.

    5.3. Comparison with other classification methods

    Table 5 and Figure 12 show the average accuracies got by several classification methods as well as those obtained by the DE-based methods. In this table, we can observe that the RF algorithm and the MLP method construct more accurate classifiers than the others, and also that the DE-based procedures induce DTs with better accuracy than the models built by both the RBF-NN algorithm and the NB method.

    DatasetNBMLPRBF-NNRFOC1-DEDE-ODT
    Glass49.44(6)67.29(4)65.09(5)79.95(1)71.31(2)68.97(3)
    Diabetes75.76(3)74.75(4)74.04(5)76.18(1)73.37(6)75.79(2)
    Balance-scale90.53(4)90.69(3)86.34(5)81.71(6)93.92(1)91.97(2)
    Heart-statlog83.59(1)79.41(5)83.11(2)82.41(3)74.11(6)81.11(4)
    Iris95.53(5)96.93(2)96.00(4)94.73(6)96.73(3)97.17(1)
    Australian77.19(6)83.42(4)82.55(5)86.77(1)85.19(3)85.61(2)
    Ionosphere82.17(6)91.05(5)91.71(3)93.39(1)91.11(4)92.28(2)
    Wine97.47(4)98.03(1.5)97.70(3)98.03(1.5)92.58(5)91.88(6)
    Sonar67.69(6)81.59(2)72.60(5)84.47(1)77.65(4)79.34(3)
    Vehicle44.68(6)81.11(1)65.35(5)75.14(2)72.32(3)71.33(4)
    Liver-disorders54.87(6)68.72(3)65.04(5)72.99(1)67.63(4)71.16(2)
    Page-blocks90.01(6)96.28(4)94.91(5)97.54(1)96.88(3)97.07(2)
    Blood-t75.28(5)78.46(2)78.22(3)73.62(6)76.35(4)78.70(1)
    Breast-tissue-646.42(1)35.47(5)41.13(3)45.19(2)34.91(6)38.85(4)
    Movement-libras64.14(5)80.50(2)75.50(3)82.89(1)75.11(4)55.63(6)
    Parkinsons70.10(6)91.44(1)81.49(5)91.38(2)87.95(3)86.43(4)
    Seeds90.52(6)95.24(1)91.67(5)93.57(3)93.76(2)91.79(4)
    Segment80.17(6)96.21(2)87.31(5)98.07(1)95.93(3)94.78(4)
    Ecoli85.51(2)84.85(3)83.30(6)86.25(1)83.51(5)84.72(4)
    Spambase79.56(6)91.19(4)81.31(5)95.65(1)92.19(3)93.94(2)
    Average ranking4.8002.9254.3502.1253.7003.100

    Table 5.

    Average accuracies obtained by several classification methods.

    Figure 12.

    Graphical comparison of the average accuracies obtained by several classification methods.

    The Friedman statistics computed by analyzing the results got by these six methods with 20 datasets is 27.661, and the corresponding p-value is 4.24×105so that H0is rejected. The BH post hoc test is then applied to find all possible hypotheses that cannot be refused. Table 6 shows the results of these tests, and Figure 13 shows the graph corresponding to these p-values. The value highlighted with bold in Table 6 indicates that the DE-ODT method is statistically different with the NB method, only.

    MethodAROC1-DEDE-ODT
    UnadjustedBHUnadjustedBH
    NB4.8006.2979e-023.7787e-014.0591e-032.8414e-02
    MLP2.9251.9019e-017.6079e-017.6737e-018.9374e-01
    RBF-NN4.3502.7118e-017.7679e-013.4610e-031.3844e-01
    RF2.1257.7623e-035.4336e-030.9342e-023.9736e-01
    OC1-DE3.7003.1049e-017.6079e-01
    DE-ODT3.1003.1049e-017.6079e-01

    Table 6.

    The p-values for multiple comparisons among several classification methods.

    Figure 13.

    The p-value graph of the classification methods.

    The p-values obtained by the BH post hoc test point out that the RF method is statistically different only with the RBF-NN algorithm and the NB method, and that both the MLP method and the DE-ODT procedure are statistically different with the NB method. The comparison between the remaining pairs of algorithms indicates that they have a similar performance. The RF method is the best ranked in this comparison, and the AR of the DE-ODT procedure places it as the third best classification method.

    6. Conclusions

    In this chapter, two DE-based methods to induce oblique DTs are described. The OC1-DE method implements a recursive partitioning strategy to find a near-optimal hyperplane which is used as test condition of an oblique DT. On the other hand, in the DE-ODT method, a global search in the space of oblique DTs is conducted with the aim of finding a near-optimal tree. The DE-ODT method estimates the size of the chromosome encoding a complete tree based on both the number of attributes and the number of classes of the dataset whose model is constructed. This method also defines a scheme to map feasible oblique DTs from this chromosome.

    The experimental results obtained indicate that these DE-based methods are better DTI methods, since they build more accurate and compact oblique DTs than those induced by the J48 and the sCART procedures. The DE-ODT method is better than the OC1-DE since its search procedure uses intelligent search procedures combining their exploration and exploitation skills, thus providing a better way to discover the relationships between the attributes used in the training set, and although the search process is only guided by the accuracy of the DT, the models constructed are more compact than those produced by the methods that implement a recursive partitioning strategy. Among the other compared methods, the results got by the OC1-DE method are better than those obtained by the other methods, since it uses a linear combination of attributes in each test condition of the tree, and it produce better hyperplanes than the axis-parallel hyperplanes.

    Even though the results yielded by the DE-based variants are not better than those produced by the RF algorithm and the MLP-based classifier, they are statistically equivalent. An advantage of these methods is that it constructs models whose decisions and operations are easily understood, and although the RF method also builds DTs, its voting scheme makes it very difficult to trace the way in which the model takes its decisions.

    In this chapter, an analysis of the run time of the algorithms is not performed, since it is known that MHs consume more computational time than other approaches because they work with a group of candidate solutions, unlike the traditional methods where only one DT is induced from the training set. It is important to mention that for many practical applications, the construction of the model is conducted in one offline procedure, so the time of its construction is not a parameter that usually impacts the efficiency of the built model.

    © 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

    How to cite and reference

    Link to this chapter Copy to clipboard

    Cite this chapter Copy to clipboard

    Rafael Rivera-Lopez and Juana Canul-Reich (June 27th 2018). Differential Evolution Algorithm in the Construction of Interpretable Classification Models, Artificial Intelligence - Emerging Trends and Applications, Marco Antonio Aceves-Fernandez, IntechOpen, DOI: 10.5772/intechopen.75694. Available from:

    chapter statistics

    433total chapter downloads

    More statistics for editors and authors

    Login to your personal dashboard for more detailed statistics on your publications.

    Access personal reporting

    Related Content

    This Book

    Next chapter

    Advanced Content and Interface Personalization through Conversational Behavior and Affective Embodied Conversational Agents

    By Matej Rojc, Zdravko Kačič and Izidor Mlakar

    Related Book

    First chapter

    Designing Data-Driven Learning Algorithms: A Necessity to Ensure Effective Post-Genomic Medicine and Biomedical Research

    By Gaston K. Mazandu, Irene Kyomugisha, Ephifania Geza, Milaine Seuneu, Bubacarr Bah and Emile R. Chimusa

    We are IntechOpen, the world's leading publisher of Open Access books. Built by scientists, for scientists. Our readership spans scientists, professors, researchers, librarians, and students, as well as business professionals. We share our knowledge and peer-reveiwed research papers with libraries, scientific and engineering societies, and also work with corporate R&D departments and government entities.

    More About Us