
Classic and Bayesian Tree-Based Methods

By Amal Saki Malehi and Mina Jahangiri

Submitted: July 6th 2018. Reviewed: December 7th 2018. Published: September 18th 2019.

DOI: 10.5772/intechopen.83380


Abstract

Tree-based methods are nonparametric, machine-learning techniques for data prediction and exploratory modeling. They are among the most valuable and powerful data-mining tools and can be used to predict different types of outcome (dependent) variable: quantitative, qualitative, and time until an event occurs (survival data). Depending on the type of outcome variable, the model is called a classification tree, a regression tree, or a survival tree. These methods have several advantages over traditional statistical methods such as generalized linear models (GLMs), discriminant analysis, and survival analysis: they require no assumptions about the functional form relating the outcome variable to the predictor (independent) variables, are invariant to monotone transformations of the predictor variables, handle nonlinear relationships and high-order interactions, accept different types of predictor variable, yield results that are easy to interpret and understand without statistical expertise, and are robust to missing values, outliers, and multicollinearity. Several classic and Bayesian algorithms have been proposed for classification and regression trees; in this chapter, we review these algorithms and the criteria appropriate for assessing their predictive performance.

Keywords

  • classic classification trees
  • Bayesian classification trees
  • classic regression trees
  • Bayesian regression trees

1. Introduction

Different traditional parametric models have been proposed for predicting different types of outcome variable (quantitative, qualitative, and survival data) and for exploratory modeling, among them generalized linear models (GLMs) [1], discriminant analysis [2], and survival analysis [3]. Different nonparametric methods have also been proposed for data prediction, including classic and Bayesian tree-based methods, support vector machines [4], artificial neural networks [5], multivariate adaptive regression splines [6], K-nearest neighbors [7], Bayesian networks [8], and generalized additive models (GAMs) [9].

Classic and Bayesian tree-based methods are machine-learning methods for data prediction and exploratory modeling. They are supervised methods and are among the most powerful and popular tools for classification and prediction. They offer several advantages over traditional statistical methods [10, 11, 12]:

  • easy to interpret, because the results are displayed graphically;

  • understandable without statistical expertise;

  • able to handle high-dimensional and large datasets;

  • free of assumptions about the functional form of the data;

  • able to handle nonlinear relationships and high-order interactions;

  • invariant to monotone transformations of predictor variables;

  • robust to missing values;

  • robust to outliers;

  • robust to multicollinearity;

  • able to extract homogeneous subgroups of observations.

Tree-based methods have been used in different sciences, such as medical and epidemiologic studies [13, 14, 15, 16, 17]. In these studies, tree models are used to determine risk factors of diseases and to identify high-risk and low-risk subgroups of patients. Tree methods can identify subgroups of patients that need different diagnostic tests or treatment strategies; indeed, these methods are useful for subgroup analysis [18, 19].

Several classic and Bayesian tree algorithms have been proposed for classification trees, regression trees, and survival trees. These algorithms classify observations into a finite number of homogeneous subgroups based on the predictor variables. The tree model is called a classification tree, a regression tree, or a survival tree when the outcome variable is a qualitative variable, a quantitative variable, or survival data, respectively. Tree-based methods extract homogeneous subgroups of data by a recursive partitioning process and then fit a constant model, or a parametric model such as linear, Poisson, or logistic regression, for data prediction within these subgroups. Finally, the process is displayed graphically as a tree structure, and this is one of the most attractive properties of tree models [20].

In this chapter, we review classic and Bayesian classification and regression tree approaches. Owing to space limitations, the Bayesian approaches are discussed in more detail, because this chapter provides the first comprehensive review of Bayesian classification and regression trees.

We begin with a discussion of the steps of tree generation for classic classification and regression trees in Section 2. Classic classification trees are covered in Section 3, and Section 4 reviews classic regression trees. Section 5 discusses treed generalized linear models. A review of Bayesian classification and regression trees is provided in Section 6. Appropriate criteria for assessing the predictive performance of tree-based methods are given in Section 7, and Section 8 presents the conclusion.

2. Classic classification and regression trees

In a dataset with an outcome variable Y and a P-vector of predictor variables X = (x_1, ..., x_P), the recursive partitioning process used by classic tree algorithms for tree generation has three main steps: tree growing, stopping the tree growth, and tree pruning. Some tree algorithms use only the first two steps (tree growing and stopping the tree growth). These steps are as follows:

2.1. Tree growing

Tree growing is the first step of tree generation. It is performed by a binary recursive partitioning process based on a splitting function, and the resulting binary tree subdivides the predictor-variable space. Tree growth begins at the root node, the top-most node of the tree, which contains all observations in the learning dataset. The tree grows by either splitting or not splitting each node (each node contains a subset of the learning dataset) into two child nodes, called the left and right daughter nodes, using splitting rules that classify observations into subgroups that are homogeneous in terms of the outcome variable. Splitting rules are selected using a splitting function. The binary recursive partitioning process continues until no node can be split or a stopping rule for tree growth is reached (these stopping rules are listed below). Binary recursive partitioning splits each node into only two nodes, but some tree algorithms can generate multiway splits [20].

Nodes that are split during tree growing are called internal nodes, and the remaining nodes are called terminal nodes. Each internal node contains a subset of the dataset and is the parent of its subnodes. Each sample in the learning dataset is placed in one of the terminal nodes of the tree, and the tree size equals the number of terminal nodes. Each node of the tree is split into left and right daughter nodes according to a splitting rule. If the chosen splitting rule is based on a quantitative predictor variable, observations are sent to the left or right node according to {x_i ≤ s} or {x_i > s}, respectively (s is an observed value of the quantitative predictor variable). If the chosen splitting rule is based on a qualitative predictor variable, observations are sent to the left or right node according to x_i ∈ C or x_i ∉ C, respectively (C is a subset of the categories of the qualitative predictor variable). Many splitting rules are possible at each node, and all possible rules must be checked to determine the best one using a goodness-of-fit criterion. This criterion reflects the degree of homogeneity in the daughter nodes; homogeneity is computed using a splitting function, and the best splitting rule is the one with the highest goodness-of-fit value [20].

Several splitting functions have been proposed for classification trees, including [21]: entropy, information gain, the Gini index, classification error, gain ratio, Marshal correction, chi-square, twoing, the distance measure [22], Kolmogorov-Smirnov [23, 24], and AUC-based splitting [25]. Several studies have compared the performance of these splitting functions [21, 26, 27].
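The short sketch below is purely illustrative (it is not code from any of the cited algorithms): it evaluates candidate binary splits of a quantitative predictor with two common splitting functions, the Gini index and entropy, and picks the threshold with the lowest weighted daughter-node impurity.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy of a node: -sum_k p_k log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def split_impurity(x, y, s, impurity=gini):
    """Weighted impurity of the two daughter nodes produced by the rule {x <= s}."""
    left, right = y[x <= s], y[x > s]
    n = len(y)
    return (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)

# Toy data: the best rule is the threshold s that minimizes the weighted impurity
# (equivalently, maximizes the impurity reduction relative to the parent node).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
best_s = min(np.unique(x)[:-1], key=lambda s: split_impurity(x, y, s, gini))
print(best_s, split_impurity(x, y, best_s, gini), split_impurity(x, y, best_s, entropy))
```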

In the tree growing process, a predicted value is assigned to each node. Data prediction in classification trees such as C4.5 [28], CART [29], CHAID [30], FACT [31], QUEST [32], CRUISE [33], and GUIDE [34] is based on fitting a constant model, such as the proportion of the categories of the outcome variable, at each node of the tree. The CRUISE algorithm can also fit bivariate linear discriminant models [35], and the GUIDE algorithm can also fit kernel density and nearest-neighbor models at each node of the tree [34]. All of the mentioned classification trees except C4.5 accept user-defined misclassification costs, and all except CHAID and C4.5 accept user-defined class prior probabilities.

Data prediction in regression trees such as AID [36], M5 [37], CART [29], and GUIDE [38] is based on fitting a constant model, such as the mean of the outcome variable, at each node of the tree. M5 can also fit a linear regression model, and GUIDE can fit models such as linear regression and polynomial models.

2.2. Stopping the tree growth step

Stopping the tree growth is the second step of tree generation. Tree growth continues as long as possible, and several rules have been proposed for stopping it; some of them are listed below (a configuration sketch follows the list) [29, 39]:

  • A terminal node contains only one observation.

  • All observations in a terminal node belong to a single category of the outcome variable.

  • Node splitting is impossible, because all observations in each terminal node have the same distribution of predictor variables.

  • The goodness-of-fit criterion of every candidate splitting rule falls below a user-specified minimum threshold.

  • The number of observations in a terminal node is less than a user-specified minimum threshold.

  • The tree has reached a user-specified maximum depth.
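As a rough sketch of how such stopping rules appear in practice, the snippet below sets a maximum depth, a minimum terminal-node size, and a minimum impurity decrease in a generic CART-style implementation (scikit-learn is used only for illustration; the parameter names are that library's, not those of the algorithms reviewed below).

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    max_depth=4,                 # user-specified maximum depth of the tree
    min_samples_leaf=5,          # no terminal node with fewer than 5 observations
    min_impurity_decrease=0.01,  # minimum goodness-of-fit gain required for a split
    random_state=0,
)
tree.fit(X, y)
print("number of terminal nodes:", tree.get_n_leaves())
```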

2.3. Tree pruning step

Tree pruning is the third step of tree generation and one of the most important. The tree algorithm first grows a large maximal (saturated) tree, whose nodes cannot be split any further because each terminal node contains a single observation or observations belonging to a single category of the outcome variable, and then prunes it to avoid overfitting. In this step, a nested sequence of pruned subtrees is generated. Finally, an optimal tree is selected from this sequence, based on the lowest misclassification cost (for a classification tree) or the lowest estimated prediction error (for a regression tree) [29].

Several methods have been proposed for tree pruning, including [39, 40]: cost-complexity pruning, reduced-error pruning, pessimistic error pruning, minimum error pruning, error-based pruning, critical value pruning, and minimum description length pruning [41]. Several studies have compared the performance of these pruning methods [39, 40].
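The following is a hedged sketch of cost-complexity pruning followed by best-tree selection, in the spirit of the CART procedure described above; scikit-learn's pruning path is used as an example tool, not as the implementation of any specific algorithm in this chapter.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Grow a large tree, then compute the nested sequence of cost-complexity pruned subtrees.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Estimate the prediction error of each subtree by 10-fold cross-validation and keep
# the complexity parameter whose subtree has the lowest estimated misclassification cost.
scores = [
    cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=10).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
print("selected complexity parameter:", best_alpha)
```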

3. Classic classification trees

Several classic classification tree approaches have been proposed for classifying observations and predicting data in a dataset that contains a qualitative outcome variable Y with K categories (classes) and a P-vector of predictor variables X = (x_1, ..., x_P). We review some of these algorithms: THAID, CHAID, CART, ID3, FACT, C4.5, QUEST, CRUISE, and GUIDE. We checked only the software programs SPSS, STATISTICA, TANAGRA, WEKA, CART, and R for implementations of these methods, and the available programs are mentioned for each model. Owing to space limitations, we only name the other classification tree algorithms: SLIQ [42], SPRINT [43], RainForest [44], OC1 [45], T1 [46], CAL5 [47, 48], and CTREE [49].

3.1. THAID (theta automatic interaction detector)

The THAID classification tree algorithm was developed by Messenger and Mandell in 1972 and is the first published classification tree algorithm [50]. It handles only qualitative predictor variables and uses a greedy search for tree generation. The splitting function in THAID is based on the number of cases in the categories of the outcome variable, and the splitting rule at each node is selected to minimize the total impurity of the two new daughter nodes. THAID does not use any pruning method; tree growth continues as long as the decrease in impurity exceeds a user-specified minimum.

3.2. CHAID (chi-square automatic interaction detector) and exhaustive CHAID

The CHAID classification tree algorithm was developed by Kass in 1980 and is a descendant of THAID [30]. It can generate multiway splits, and its tree-growing process includes three steps: merging, splitting, and stopping. Continuous predictor variables must be categorized, because CHAID accepts only qualitative predictor variables in the tree-generating process. CHAID uses significance tests with a Bonferroni correction as the splitting function, and the best splitting rule is the one with the smallest significance probability. The algorithm generates biased splits and handles missing values. CHAID is implemented in SPSS, STATISTICA, and R (CHAID package).

The exhaustive CHAID algorithm was proposed by Biggs et al. in 1991 as an improved CHAID method. Its splitting and stopping steps are the same as in CHAID; only the merging step was changed, to improve it [51].

3.3. CART (classification and regression trees)

The classic CART model was developed by Breiman et al. in 1984 and is a binary tree algorithm [29]. CART is one of the best-known classic classification and regression tree algorithms for data mining. It generates a classification tree using binary recursive partitioning, and tree generation comprises four steps: (1) tree growing: tree growth is based on a greedy search in which CART grows the tree by sequentially choosing splitting rules; three splitting functions are provided for choosing splitting rules: entropy, the Gini index, and twoing. (2) stopping: the growing process continues until no node can be split, producing a large maximal tree. (3) tree pruning: CART uses cost-complexity pruning to avoid overfitting and to obtain "right-sized" trees; this method generates a nested sequence of pruned subtrees. (4) best tree selection: CART uses an independent test dataset or cross-validation to estimate the prediction error (misclassification cost) of each tree and selects the tree in the sequence with the lowest estimated prediction error.

CART can generate linear combination splits and uses surrogate splits to handle missing values; these surrogate splits are also used to compute an importance score for each predictor variable. This well-known classic tree algorithm nevertheless suffers from problems such as greediness, instability, and bias in split rule selection [52]. CART is available in the software programs CART, R (rpart package), SPSS, STATISTICA, WEKA, and TANAGRA.
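The snippet below sketches how an importance score per predictor can be read off a fitted tree. Note that classic CART derives importance partly from surrogate splits, whereas the scikit-learn tree used here for illustration reports impurity-based importance, so this is an analogy rather than the CART procedure itself.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(data.data, data.target)

# The five predictors with the largest (impurity-based) importance scores.
ranked = sorted(zip(data.feature_names, tree.feature_importances_), key=lambda t: -t[1])
for name, score in ranked[:5]:
    print(f"{name}: {score:.3f}")
```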

3.4. ID3 (Iterative Dichotomiser 3)

The ID3 classification tree algorithm was proposed by Quinlan in 1986 [53]. It is a greedy algorithm that uses information gain, based on the entropy criterion, as its splitting function; the best splitting rule is the one with the highest information gain. ID3 does not use any pruning method, and tree growth continues until all observations in each terminal node belong to a single category of the outcome variable and/or the best information gain is near zero. The algorithm handles only qualitative predictor variables (quantitative predictors must be categorized) and cannot impute missing values. Like CART, ID3 suffers from selection bias, because it favors predictor variables with more values when splitting nodes. ID3 is implemented in WEKA and TANAGRA.

3.5. FACT (Fast and Accurate Classification Tree)

The FACT classification tree algorithm was introduced by Loh and Vanichsetakul in 1988 [31]. In this algorithm, the quantitative predictor variable used to split a node is the one with the largest analysis of variance (ANOVA) F-statistic, and linear discriminant analysis is then used to determine the split point for this variable. FACT transforms qualitative predictor variables into ordered variables in two steps (first, these variables are transformed into dummy vectors; second, the vectors are projected onto the largest discriminant coordinate). FACT generates unbiased splits when the dataset contains only quantitative predictor variables. Unlike other classification tree methods (C4.5, CART, QUEST, GUIDE, and CRUISE), it does not use any pruning method; tree growing stops when a stopping rule is reached. FACT can handle missing values: missing values of quantitative and qualitative predictor variables are imputed at each node by the means and modes of the non-missing values, respectively.

3.6. C4.5

The C4.5 classification tree algorithm was developed by Quinlan in 1993 as an extension of ID3 [28]. It is a greedy algorithm that uses the gain ratio as its splitting function and generates biased splits. Unlike ID3, C4.5 handles both quantitative and qualitative predictor variables as well as missing values. Splits on quantitative predictor variables are binary, while splits on qualitative predictor variables are multiway (a branch is created for each category of the variable). The pruning method used in this algorithm is error-based pruning. C4.5 is available in R (RWeka package), WEKA, and TANAGRA, and can also be obtained from http://www.rulequest.com/Personal/. J4.8 is the Java implementation of C4.5 in the WEKA software.

3.7. QUEST (Quick, Unbiased, and Efficient Statistical Tree)

The QUEST classification tree algorithm was developed by Loh and Shih in 1997 and generates binary splits [32]. Unlike classification algorithms such as CART and THAID, it does not use an exhaustive search (such algorithms suffer from variable selection bias), and it thereby improves computational cost and reduces variable selection bias. QUEST uses statistical tests to select the splitting variable, choosing the variable with the smallest significance probability: the ANOVA F-statistic is used for quantitative predictor variables and the chi-square test for qualitative predictor variables. After the variable is selected, an exhaustive search is performed to find the best split point, using quadratic discriminant analysis. To determine the split point for a qualitative variable, its values must first be transformed as in the FACT algorithm.

Like CART, QUEST can generate linear combination splits, and it uses cost-complexity pruning. Missing values of quantitative and qualitative predictor variables are imputed at each node by the means and modes of the non-missing values, respectively. Software for the QUEST algorithm can be obtained from www.stat.wisc.edu/~loh/.

3.8. CRUISE (Classification Rule with Unbiased Interaction Selection and Estimation)

The CRUISE tree algorithm was introduced by Kim and Loh in 2001; unlike classification tree algorithms such as CART and QUEST, it generates multiway splits [33]. CRUISE is free of selection bias and can detect local interactions. Two methods of variable selection are used in this model: 1D (similar to the method used in QUEST) and 2D. Like CART and QUEST, CRUISE can generate linear combination splits, and it uses cost-complexity pruning. A bivariate linear discriminant model can also be fitted instead of a constant model in each node of the tree [35]. CRUISE uses several methods for imputing missing values in the learning dataset and in the dataset used for pruning. Software for the CRUISE algorithm can be obtained from www.stat.wisc.edu/~loh/.

3.9. GUIDE (Generalized, Unbiased, Interaction Detection, and Estimation)

The GUIDE tree algorithm was introduced by Loh in 2009; it is an evolution of the FACT, QUEST, and CRUISE algorithms and improves on their weaknesses [34]. Like QUEST and CRUISE, it generates unbiased binary splits and can perform splits on combinations of two predictor variables at a time. Also like QUEST and CRUISE, GUIDE uses a two-step, significance-test-based approach for splitting each node: it applies a chi-squared test of independence between each predictor variable and the outcome variable on the data in the node and computes the significance probability. It chooses the variable with the smallest significance probability, finds the split point that minimizes the sum of the Gini indexes, and uses it to split the node into two daughter nodes (a small illustration of the variable-selection step is sketched below).
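The following is a hedged sketch of the variable-selection idea only: each predictor is cross-tabulated against the class label, and the predictor with the smallest chi-square p-value is chosen to split the node. Binning quantitative predictors at their quartiles is an assumption made here for illustration, not a detail taken from the GUIDE documentation.

```python
import numpy as np
from scipy.stats import chi2_contingency

def select_split_variable(X, y):
    """Return the index of the predictor with the smallest chi-square p-value versus y."""
    pvalues = []
    for j in range(X.shape[1]):
        bins = np.quantile(X[:, j], [0.25, 0.5, 0.75])
        groups = np.digitize(X[:, j], bins)          # bin the predictor into four groups
        table = np.array([[np.sum((groups == g) & (y == c)) for c in np.unique(y)]
                          for g in np.unique(groups)])
        chi2, pval, dof, expected = chi2_contingency(table)
        pvalues.append(pval)
    return int(np.argmin(pvalues))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 1] > 0).astype(int)            # only the second predictor is informative
print(select_split_variable(X, y))       # expected to print 1
```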

GUIDE uses cost-complexity pruning (as do other tree algorithms such as CART, QUEST, and CRUISE). It handles missing values by assigning them to a separate category. It can also compute importance scores for the predictor variables and can use nearest-neighbor or bivariate kernel density models instead of a constant model in the nodes of the tree. Software for the GUIDE algorithm can be obtained from www.stat.wisc.edu/~loh/.

3.10. Classification tree algorithms for ordinal outcome variable

Several tree methods have been proposed for predicting an ordinal outcome variable. The twoing splitting function was extended by Breiman et al. for classification trees with an ordinal outcome [29], and Piccarreta extended the Gini-Simpson criterion for this case [54]. Archer proposed an R package (rpartOrdinal) containing splitting functions for generating trees that predict an ordinal outcome [55], and Galimberti et al. developed the rpartScore package, which overcomes some problems of rpartOrdinal [56]. Tutz and Hechenbichler extended ensemble tree methods such as bagging and boosting to ordinal outcomes [57]. For other approaches, see Refs. [49, 57, 58, 59, 60].

3.11. Classification tree algorithms for imbalanced datasets

In an imbalanced dataset, one class of the outcome variable has far fewer samples than the others, and this class is rare. In real applications such as medical diagnosis, the rare class is usually the class of interest. Because of the skewed class distribution, most classification tree algorithms assign all samples of the rare class to the majority class; they are not robust to class imbalance and perform well only on the majority class. Several remedies have been proposed for applying classification tree algorithms to imbalanced datasets, including sampling methods (undersampling, oversampling, and the synthetic minority oversampling technique, SMOTE), cost-sensitive learning, class confidence proportion decision trees [61], and Hellinger distance decision trees [62]. Ganganwar (2012) provides a review of classification algorithms for imbalanced datasets [63].
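One of the remedies named above, cost-sensitive learning, can be expressed as class weights in a generic tree implementation, as in the hedged sketch below (scikit-learn is used only for illustration; the sampling-based remedies such as SMOTE are not shown).

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# A synthetic imbalanced dataset: roughly 95% of the samples fall in the majority class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# 'balanced' reweights the classes inversely to their frequencies, so errors on the
# rare class cost more than errors on the majority class during split selection.
tree = DecisionTreeClassifier(class_weight="balanced", min_samples_leaf=10,
                              random_state=0).fit(X, y)
```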

4. Classic regression trees

Several classic regression trees have been proposed for classifying observations and predicting data in a dataset containing a quantitative outcome variable Y and a P-vector of predictor variables X = (x_1, ..., x_P). We review the AID, CART, M5, and GUIDE regression tree algorithms. Again, we checked only the software programs SPSS, STATISTICA, TANAGRA, WEKA, CART, and R for implementations, and the available programs are mentioned for each model. Owing to space limitations, we refer the reader to the references for other regression tree approaches [49, 64, 65]; for Poisson regression trees, see Refs. [66, 67, 68, 69].

4.1. AID (automatic interaction detector)

The AID regression tree algorithm was proposed by Morgan and Sonquist in 1963 and is the first published regression tree algorithm [36]. It generates binary splits and piecewise constant models. The algorithm uses a greedy search for tree generation, and the splitting rule is selected to minimize the total sum of squared errors. AID suffers from bias in variable selection, does not use any pruning method, and stops growing when the reduction in the total sum of squared errors falls below a predetermined value.

4.2. CART

The CART algorithm covers both classification and regression trees, and the tree-generating process for a regression tree is the same as for a classification tree [29], except that a different splitting function, least squares deviation, is used to choose the splitting rules. To select the best regression subtree, CART uses an independent test dataset or cross-validation to estimate the prediction error of each tree (the sum of squared differences between observations and predictions) and selects the tree in the sequence with the lowest estimated prediction error. As for classification trees, CART uses surrogate splits to impute missing values in regression trees and can generate linear combination splits. CART is available in the software programs CART, R (rpart package), STATISTICA, WEKA, and TANAGRA.

4.3. M5

The M5 tree algorithm was proposed by Quinlan in 1992. Like AID and CART, it first generates a piecewise constant model, and it then fits a linear regression model in the nodes of the tree [37]. Using linear regression models at the nodes improves the prediction accuracy of the tree, and M5 also handles missing values. Like CART, it uses least squares deviation as its splitting function, and it can generate multiway splits. In M5, a smoothing technique applied after pruning further improves prediction accuracy. Wang and Witten (1996) proposed the M5' tree algorithm, which is based on M5 [70]. The method is available in WEKA and R (RWeka package).

4.4. GUIDE

The GUIDE regression tree method was introduced by Loh in 2002 and generates unbiased binary splits [38]. It fits a regression model at each node of the tree and computes the residuals. The residuals are then transformed into a binary variable according to their signs (positive or negative), and the algorithm proceeds as in the classification tree case. As in the classification tree method, missing values are handled with a separate category, and importance scores can be computed for the predictor variables. GUIDE can fit models such as linear regression and polynomial models instead of a constant model in the nodes of the tree. Software for the GUIDE method can be obtained from www.stat.wisc.edu/~loh/.

5. Treed generalized linear models

Tree-based methods such as CART, QUEST, C4.5, and CHAID fit a constant model in the nodes of the tree; consequently, a large tree is often generated, which is hard to interpret. Treed models, unlike conventional tree models, partition the data into subsets and then fit a parametric model, such as linear, Poisson, or logistic regression, instead of using constant models (mean or proportion) for data prediction, as sketched below. Treed models therefore generate smaller trees than tree models. They can also be a good alternative to traditional parametric models such as GLMs when a single parametric model cannot capture the relationship between the outcome variable and the predictor variables across the whole dataset. Several tree algorithms have been developed that fit parametric models in the terminal nodes; see Refs. [71, 72, 73, 74, 75, 76, 77].
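Below is a minimal sketch of the treed-model idea under simplifying assumptions: a shallow tree partitions the data, and an ordinary linear regression is fitted within each terminal node instead of predicting with the node mean. It illustrates the concept only and is not the fitting procedure of any specific treed algorithm cited above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.where(X[:, 0] < 0, 1.0 + 2.0 * X[:, 0], 4.0 - 1.5 * X[:, 0]) + rng.normal(0, 0.3, 500)

# Step 1: a small tree defines the partition (here, two terminal nodes).
partition = DecisionTreeRegressor(max_depth=1).fit(X, y)
leaf_of = partition.apply(X)                   # terminal-node id of each observation

# Step 2: fit a separate linear regression inside each terminal node.
leaf_models = {leaf: LinearRegression().fit(X[leaf_of == leaf], y[leaf_of == leaf])
               for leaf in np.unique(leaf_of)}

def predict(X_new):
    leaves = partition.apply(X_new)
    return np.array([leaf_models[leaf].predict(row.reshape(1, -1))[0]
                     for leaf, row in zip(leaves, X_new)])

print(predict(np.array([[-1.0], [2.0]])))
```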

6. Bayesian classification and regression trees

The classic CART algorithm was developed by Breiman et al. in 1984 and is one of the best-known classic classification and regression tree algorithms for data mining. However, it suffers from problems such as greediness, instability, and bias in split rule selection. CART generates a tree using a greedy search algorithm, which has disadvantages such as limited exploration of the tree space, dependence of future splits on previous splits, optimistic error rates, and an inability to find a global optimum [78]. CART also has an instability problem: resampling or drawing bootstrap samples from the dataset may generate trees with different splits [79]. Finally, the splitting method in the CART model is biased toward predictor variables with many distinct values and with more missing values [80, 81].

Several tree models have been suggested to solve these problems: ensembles of trees such as random forests [82], bagging [83], boosting [84], MultiBoost [85], and LogitBoost [86] (for the instability problem); tree algorithms such as CRUISE [33, 35], QUEST [32], GUIDE [34], CTREE [49], and LOTUS [71] (for the bias in split rule selection); and Bayesian tree approaches and the evtree algorithm [78] (for the greediness problem of CART). In addition, Bayesian tree approaches can quantify uncertainty, and they explore the tree space more fully than classic approaches.

Several Bayesian approaches have been proposed for tree-based methods [87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98]. As with classic trees, a Bayesian model is called a Bayesian classification tree when the outcome variable is qualitative and a Bayesian regression tree when the outcome variable is quantitative. Data prediction in these Bayesian approaches mirrors the classic approaches: Bayesian classification trees fit a constant model, such as the class proportions of the outcome variable, in the terminal nodes, and Bayesian regression trees fit a constant model, such as the mean of the outcome variable, in the terminal nodes.

Classic tree approaches use only the observations for data analysis, whereas Bayesian approaches combine prior information with the observations. Bayesian tree approaches define prior distributions on the components of classic tree models and then use stochastic search algorithms, via Markov chain Monte Carlo (MCMC), or deterministic search algorithms to explore the tree space [87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98].

Bayesian tree approaches involve components such as the prior distribution, the posterior distribution, the data likelihood, the marginal likelihood, a stochastic or deterministic search algorithm for exploring the tree space, a stopping rule for the simulation algorithm (if a stochastic search algorithm is used to simulate from the posterior distribution and explore the tree space), and criteria for identifying good trees (if the model produces several trees). In this section, we review Bayesian tree approaches and also summarize the results of published papers that have used these Bayesian algorithms for data analysis.

6.1. Buntine's Bayesian classification tree approach

The first Bayesian tree approach for classification trees was proposed by Buntine in 1992. It offers a full Bayesian analysis of the classification tree model, using a deterministic search algorithm instead of a stochastic one [87]. Like classic tree models, it uses a splitting function for tree growth, derived from Bayesian statistics and with performance similar to splitting methods such as information gain and Gini. Unlike traditional tree models, and in order to prevent overfitting, Buntine used Bayesian smoothing and averaging techniques instead of pruning the tree.

In this Bayesian approach, prior distributions are defined on the tree space and on the data distribution in the terminal nodes (similar prior distributions are used for the data distribution in the terminal nodes, unlike the prior distributions placed on the tree space). Buntine showed the superior performance of this Bayesian approach in comparison with classic tree algorithms such as the CART model of Breiman et al. and the C4 model of Quinlan et al. [99] on several datasets [87]. This Bayesian approach may be obtained from http://ksvanhorn.com/bayes/free-bayes-software.html.

6.2. CGM’s Bayesian CART approach

CGM (Chipman, George, McCulloch) proposed a Bayesian approach for the CART model in 1998 by defining prior distributions on its two components (Θ, T): a binary tree T with K terminal nodes and a parameter set Θ = (θ_1, θ_2, ..., θ_K) [89, 91, 92, 93]. That is, they define prior distributions on the tree structure and on the parameters in the terminal nodes. In this approach, the joint prior distribution of the components is specified as

$$p(\Theta, T) = p(\Theta \mid T)\, p(T) \tag{1}$$

where p(T) and p(Θ | T) denote the prior distribution of the tree and of the terminal-node parameters given the tree, respectively. The same tree-generating stochastic process is used for p(T) in both the classification and regression tree models [89], and this recursive process for tree growth consists of the following steps (a simulation sketch of this prior follows the list):

  • Start from the tree T that contains only a root node (a terminal node η).

  • Split terminal node η with probability P_SPLIT = α(1 + d_η)^(−β), where d_η is the depth of node η, α is the base probability of growing the tree by splitting the current node, and β determines how quickly the propensity to split decreases as the tree gets larger. The parameters α and β control the shape and size of the tree and act as a penalty that prevents overfitting.

  • If terminal node η splits, a splitting rule ρ is assigned to this node according to the distribution P_RULE (a discrete uniform distribution is used both for selecting the predictor variable that splits node η and for the splitting threshold of the selected variable).

  • Let T be the newly created tree from step 3, and apply steps 2 and 3 to this tree with η equal to each newly created child node.
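The recursion below is a small sketch of this tree-generating prior: each terminal node at depth d splits with probability α(1 + d)^(−β), and the function simulates the number of terminal nodes that one draw from the prior produces. The particular values of α and β are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def prior_tree_size(alpha=0.95, beta=1.5, depth=0):
    """Number of terminal nodes of one tree drawn from the CGM-style prior."""
    if rng.random() < alpha * (1.0 + depth) ** (-beta):
        # the node splits: recurse into the left and right daughter nodes
        return (prior_tree_size(alpha, beta, depth + 1)
                + prior_tree_size(alpha, beta, depth + 1))
    return 1  # the node remains terminal

sizes = [prior_tree_size() for _ in range(5000)]
print("mean prior tree size:", np.mean(sizes))   # larger beta -> smaller trees on average
```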

In this approach, the posterior distribution p(T | X, y) is obtained by combining the marginal likelihood p(y | X, T) with the tree prior p(T) as follows:

$$p(T \mid X, y) \propto p(y \mid X, T)\, p(T) \tag{2}$$

$$p(y \mid X, T) = \int p(y \mid X, \Theta, T)\, p(\Theta \mid T)\, d\Theta \tag{3}$$

where p(y | X, Θ, T) in Eq. (3) is the data likelihood function.

A stochastic search algorithm is used to find good models and to simulate from relation (2), via an MCMC algorithm such as the Metropolis-Hastings algorithm. This Metropolis-Hastings algorithm simulates a Markov chain of trees T^0, T^1, T^2, ...; it starts from an initial tree T^0 and then iteratively simulates the transition from T^i to T^{i+1} in the following two steps:

  1. Generate a candidate tree T* with probability distribution q(T^i, T*).

  2. Set T^{i+1} = T* with the probability below:

$$\alpha(T^{i}, T^{*}) = \min\left\{ \frac{q(T^{*}, T^{i})}{q(T^{i}, T^{*})} \cdot \frac{p(y \mid X, T^{*})\, p(T^{*})}{p(y \mid X, T^{i})\, p(T^{i})},\ 1 \right\} \tag{4}$$

Otherwise, set T^{i+1} = T^i.

In this simulation algorithm, q(T, T*) generates T* from T by randomly selecting one of four moves: a GROW step, a PRUNE step, a CHANGE step, or a SWAP step (the accept/reject computation is sketched below). The simulation is run with multiple restarts rather than as a single long chain, for reasons such as the slow convergence of the posterior simulation, avoiding long waits in regions of tree space with high posterior probability, and generating a wider variety of trees. The stopping criterion of the simulation algorithm is that the chain has become trapped in a local posterior mode.
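Below is a hedged sketch of the accept/reject step in Eq. (4), written with log densities for numerical stability. The callables log_marginal_likelihood, log_tree_prior, log_proposal, and propose_tree are placeholders assumed for illustration; computing them requires a full tree implementation that is not shown here.

```python
import math
import random

def metropolis_hastings_step(T_current, log_marginal_likelihood, log_tree_prior,
                             propose_tree, log_proposal):
    # 1. draw a candidate tree T* from q(T_i, .) (a GROW / PRUNE / CHANGE / SWAP move)
    T_candidate = propose_tree(T_current)

    # 2. log acceptance ratio: proposal ratio + marginal-likelihood ratio + prior ratio
    log_alpha = (log_proposal(T_candidate, T_current) - log_proposal(T_current, T_candidate)
                 + log_marginal_likelihood(T_candidate) - log_marginal_likelihood(T_current)
                 + log_tree_prior(T_candidate) - log_tree_prior(T_current))

    # 3. accept T* with probability min(1, exp(log_alpha)); otherwise keep the current tree
    if math.log(random.random()) < min(0.0, log_alpha):
        return T_candidate
    return T_current
```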

Unlike the classic CART model, this Bayesian approach does not generate a single tree. Good classification trees are therefore selected by criteria such as the lowest misclassification rate and the largest marginal likelihood, and good regression trees by the largest marginal likelihood and the lowest residual sum of squares. Using simulations, CGM showed that the stochastic search algorithm can find better trees than a greedy tree algorithm. They showed that the Bayesian classification approach has a lower misclassification rate than the CART model, and they also used Bayesian model averaging to improve the prediction accuracy of Bayesian classification trees [89].

6.3. DMS's Bayesian CART approach

DMS (Denison, Mallick, Smith) proposed a Bayesian approach for the CART model in 1998; it is quite similar to the Bayesian approach of CGM, with only minor differences [88]. In this approach, prior distributions are defined on the splitting nodes (S), splitting variables (V), splitting rules (R), tree size (K), and parameters of the data distribution in the terminal nodes (ψ).

In this Bayesian approach, the joint distribution of the model parameters is defined as follows, where p(K) is the prior distribution of the tree size, p(θ_K | K) is the prior distribution of the parameter set θ_K = (R_K, S_K, V_K, ψ_K) given the tree size K, and p(y | K, θ_K) is the data likelihood function:

$$p(K, \theta_K, y) = p(K)\, p(\theta_K \mid K)\, p(y \mid K, \theta_K) \tag{5}$$

This Bayesian approach places a prior distribution on the tree size to avoid overfitting the data, using a truncated Poisson distribution with parameter λ, where λ is the expected number of nodes in the tree; a weakly informative prior is obtained by setting λ = 10 (a small numerical sketch follows the equation):

$$p(K) \propto \frac{\lambda^{K} e^{-\lambda}}{K!} \tag{6}$$
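The short calculation below sketches this truncated Poisson prior: the unnormalized weights λ^K e^(−λ)/K! are renormalized over K ≥ 1 (the truncation range 1..K_max used here is an assumption made for illustration).

```python
import math

def tree_size_prior(lam=10, k_max=50):
    """Truncated Poisson prior on tree size K, renormalized over K = 1, ..., k_max."""
    weights = [lam ** k * math.exp(-lam) / math.factorial(k) for k in range(1, k_max + 1)]
    total = sum(weights)
    return {k: w / total for k, w in enumerate(weights, start=1)}

prior = tree_size_prior(lam=10)
print(round(prior[10], 4))   # with lambda = 10, most prior mass sits near 10 nodes
```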

The prior p(θ_K | K) in Eq. (5) is defined as

$$p(\theta_K \mid K) = p(R_K \mid V_K, S_K, K)\, p(V_K \mid S_K, K)\, p(S_K \mid K)\, p(\psi_K \mid V, S, K) \tag{7}$$

so the prior for this Bayesian approach is

$$p(\theta_K \mid K)\, p(K) = p(R_K \mid V_K, S_K, K)\, p(V_K \mid S_K, K)\, p(S_K \mid K)\, p(\psi_K \mid V, S, K)\, p(K) \tag{8}$$

In this approach, Bayesian inference about the tree size K and the parameter set θ_K is based on

$$p(\theta_K, K \mid y) = p(K \mid y)\, p(\theta_K \mid K, y) \tag{9}$$

Simulation from the above equation to find good trees is done using MCMC algorithms; specifically, a reversible jump MCMC algorithm is used to simulate from this equation [100]. The simulation is run as a single long chain with a burn-in period to explore the tree space. In this algorithm, trees cannot have fewer than 5 observations in a terminal node and cannot have a size greater than 6 during the burn-in period of the posterior simulation chain. The reversible jump MCMC algorithm used by DMS to simulate from Eq. (9) has four move types: BIRTH (GROW), DEATH (PRUNE), VARIABLE, and SPLITTING RULE. At each iteration, the BIRTH, DEATH, VARIABLE, and RULE steps are chosen randomly with probabilities b_K, d_K, v_K, and r_K, respectively, and the algorithm is as follows:

  1. Start with an initial tree.

  2. Set K to the size of the present tree.

  3. Generate u ~ U(0, 1).

  4. Go to the step type determined by u, according to the following conditions:

    if u ≤ b_K, go to the BIRTH step;

    else if b_K < u ≤ b_K + d_K, go to the DEATH step;

    else if b_K + d_K < u ≤ b_K + d_K + v_K, go to the VARIABLE step;

    else, go to the RULE step.

The acceptance probability α of each step that changes tree (K, θ) to tree (K′, θ′) is then as follows, where K_die is the number of possible locations for a death in the current tree:

$$\text{BIRTH step:}\quad \alpha = \min\left\{1,\ \text{likelihood ratio} \times \frac{K_{\mathrm{die}}+1}{K}\right\} \tag{10}$$

$$\text{DEATH step:}\quad \alpha = \min\left\{1,\ \text{likelihood ratio} \times \frac{K}{K_{\mathrm{die}}+1}\right\} \tag{11}$$

$$\text{VARIABLE and RULE steps:}\quad \alpha = \min\left\{1,\ \text{likelihood ratio}\right\} \tag{12}$$

If u ≤ α, the proposed tree is accepted; otherwise, it is rejected.

The stopping criterion of the above simulation algorithm is based on the stability of the posterior distribution, which can be assessed by plotting the sampled parameter values against the chain iterations. Unlike CART, this Bayesian approach does not produce a single tree from the stochastic search. Good classification trees are therefore selected by criteria such as the misclassification rate, the deviance (−2 log p(y | K, θ_K)), and the posterior probability: good classification trees have the lowest misclassification rate and deviance and the largest posterior probability, while good regression trees have the largest posterior probability and the lowest residual sum of squares. DMS showed that the Bayesian approach provides richer output and superior performance compared with the classic CART model [88].

6.4. CGM’s hierarchical priors for Bayesian regression tree shrinkage approach

In 2000, CGM proposed a Bayesian approach for the regression tree with a mean-shift model, based on the computational strategy of their 1998 Bayesian approach. Unlike the 1998 approach, it can assume dependence among the parameters in the terminal nodes: hierarchical priors are placed on these parameters, and shrunk trees are therefore generated [90]. Hierarchical priors have several advantages: shrinkage is applied during the stochastic search (unlike earlier tree shrinkage methods, which apply shrinkage only after the tree search), a larger tree can be fitted to the dataset without overfitting, and predictions are improved. Using simulations, CGM showed the superior performance of this new Bayesian approach for the mean-shift regression tree in comparison with their 1998 Bayesian approach, the CART model, and the tree shrinkage methods of Hastie and Pregibon [90, 101].

6.5. WTW's Bayesian CART approach

WTW (Wu, Tjelmeland, West) proposed a Bayesian approach for the CART model in 2007, based on the computational strategy of CGM's 1998 Bayesian approach [95]. In this approach, prior distributions are defined on the tree, the splitting variables, the splitting thresholds, and the parameters in the terminal nodes. Like the approaches of CGM [89, 90, 92, 93], it simulates from the posterior distribution using the Metropolis-Hastings algorithm. The moves used in the WTW simulation algorithm are the GROW, PRUNE, CHANGE, SWAP, and RESTRUCTURE (RADICAL) steps (the first three are similar to the steps of the simulation algorithm in CGM's Bayesian approaches). The RESTRUCTURE step makes large changes to the structure of the tree while leaving the tree size unchanged. Adding this step to the posterior simulation algorithm has several advantages: it improves the convergence of the MCMC algorithm, it eliminates the need for restarts of the simulation algorithm (unlike CGM's Bayesian approaches), and it allows large changes in tree structure without any change in tree size.

In this approach, convergence of the simulation algorithm is diagnosed with plots of the iteration number against the log posterior, the log marginal likelihood, the number of terminal nodes, and the number of times a particular predictor variable appears as a splitting variable in the tree. WTW showed the superior performance of this Bayesian approach in comparison with the CART model: the Bayesian approach had a lower misclassification rate than CART [95].

6.6. OML's Bayesian CART approach

OML (O'Leary, Mengersen, Low Choy) proposed a Bayesian approach for the CART model in 2008 by extending the Bayesian approach of DMS. The two approaches differ in the stopping rule of the simulation algorithm (the convergence diagnostic plots), the criteria for identifying good trees, and the prior distributions for the parameters in the terminal nodes [88, 96, 98].

The stopping criterion of the simulation chain in OML's Bayesian classification tree approach has two steps. The first step involves plotting the iterations against accuracy measures (the false positive rate, false negative rate, and misclassification rate), the log posterior, the log likelihood, and the tree size. If these plots show stability, then in the second step the structure of the component trees (the variables and splitting rules at each splitting node) is examined in the set of good trees; if this structure has stabilized and/or the same trees keep appearing in this set, the simulation chain has converged. Otherwise, the number of iterations must be increased until convergence.

The set of good trees in this Bayesian classification tree approach is determined from the accuracy measures computed from the confusion matrix of Fielding and Bell [102]. Good trees have the lowest misclassification rate and the lowest false positive and false negative rates (or, equivalently, the highest sensitivity and specificity) [96, 98, 103]. After convergence of the simulation chain, two or three trees are selected as the best trees from the set of good trees, based on criteria such as a modal tree structure (trees of the same size with the same variables and splitting rules), the lowest misclassification rate, false negative and false positive rates, and deviance, the highest posterior probability and likelihood, expert judgment, and biological interpretability [96, 98, 103].

The stopping rule of the simulation algorithm for regression trees, like that for classification trees, has two steps. In the first step, the iterations are plotted against the posterior probability, the residual sum of squares, and the deviance. If these quantities are stable, the structure of the component trees in the set of good trees is examined; if this structure has stabilized, the simulation chain has converged. The set of good regression trees is selected on the basis of the highest posterior probability and likelihood and the lowest residual sum of squares and deviance [98].

OML compared Bayesian classification trees with the classic CART model on an ecological dataset and concluded that the Bayesian approach had a smaller false positive rate, misclassification rate, and deviance than the CART model, whereas the CART model had a lower false negative rate but a higher false positive rate [96]. In 2008, they also showed that this Bayesian approach had a lower false negative rate than the Bayesian approach of DMS, whereas the DMS approach had a lower false positive rate and misclassification rate [96].

In 2009, OML compared the predictive performance of random forests with that of Bayesian classification trees on three datasets and concluded that the best tree selected by the Bayesian approach had higher sensitivity and better accuracy than random forests. They suggested that the Bayesian approach may outperform random forests in identifying important predictor variables in datasets with a large number of noise predictors. OML also showed that, unlike random forests, the Bayesian classification tree approach is not biased toward assigning observations to the largest class of the outcome variable when predicting data [103].

In 2011, OML and Hu compared the performance of Bayesian classification trees with the CART of Breiman et al. and concluded that the Bayesian approach had higher sensitivity and specificity than CART. They also investigated overfitting of the Bayesian approach using cross-validation, and the approach showed no evidence of overfitting [98].

6.7. OMML's expert elicitation for Bayesian classification tree approach

OMML (O'Leary, Mengersen, Murray, Low Choy) proposed a Bayesian classification tree approach in 2008, based on the computational strategy of OML's Bayesian classification tree approach but using informative priors [96, 97]. In this approach, informative priors are used to define Dirichlet distributions for the splitting nodes, splitting variables, and splitting rules:

$$p(S_K \mid K) = \mathrm{Dir}(S_K \mid \alpha_{S_1}, \ldots, \alpha_{S_K}) \tag{13}$$

$$p(V_K \mid S_K, K) = \mathrm{Dir}(V_K \mid \alpha_{V_1}, \ldots, \alpha_{V_K}) \tag{14}$$

$$p(R_K \mid V_K, S_K, K) = \mathrm{Dir}(R_K \mid \alpha_{R_1}, \ldots, \alpha_{R_K}) \tag{15}$$

In the Bayesian approach of OML, there was no prior information about the splitting nodes, splitting variables, splitting rules, or the hyperparameters of the Dirichlet distributions in the above equations, so these hyperparameters were set equal to 1 and uniform non-informative priors were used for the splitting nodes, splitting variables, and splitting rules [96, 98, 103]. In the new approach, an expert is asked three questions (ordering, grading, and weighting) about the splitting nodes, splitting variables, splitting rules, and tree size in order to define informative priors; the hyperparameters in relations (13), (14), and (15) are then determined from the answers. Three questions about the size of the tree are used to determine λ in relation (6). DMS and OML used a weakly informative prior for tree size by setting λ = 10 [88, 96, 98, 103], but OMML, unlike DMS and OML, used an informative prior for the size of the tree [96, 97].

O'Leary et al. (2008) investigated the sensitivity of classification trees to the choice of hyperparameters of the informative priors for tree size, splitting nodes, splitting variables, and splitting rules, and concluded that the posterior distribution is relatively robust to these priors except for extreme choices [96, 97].

Using simulations, OMML showed that the best tree obtained from Bayesian classification trees with informative priors had a lower false negative rate than the best tree obtained with non-informative priors [96, 97]. They also showed the superior performance of Bayesian classification trees with informative priors in comparison with proposed expert elicitation approaches for the Bayesian logistic regression model [97, 104, 105, 106, 107].

6.8. Other approaches for Bayesian classification and regression trees

Pratola, like Wu et al., proposed new Metropolis-Hastings proposals for Bayesian regression trees to improve the convergence of the MCMC algorithm [108]. In 2003, CGM proposed Bayesian treed GLMs by extending CGM's 1998 Bayesian approach [91]. Gramacy and Lee developed Bayesian treed Gaussian process models for a continuous outcome by combining standard Gaussian processes with treed partitioning [109]. Other Bayesian approaches have also been proposed for tree-based models; see Refs. [110, 111, 112] for other Bayesian tree approaches of CGM. Chipman et al. review more advanced Bayesian treed methods in Ref. [113]. For other tree-based Bayesian approaches, see Refs. [114, 115, 116, 117, 118]; Refs. [119, 120] propose Bayesian approaches for ensemble trees.

7. Criteria for determining the predictive performance of classification and regression trees

The predictive performance of classification tree models can be compared using accuracy measures such as [17, 121]: sensitivity, specificity, false positive rate, false negative rate, positive predictive value, negative predictive value, positive likelihood ratio, negative likelihood ratio, accuracy, Youden's index, diagnostic odds ratio (DOR), F-measure, and area under the ROC curve (AUC). Sensitivity, specificity, the positive and negative predictive values, Youden's index, and accuracy lie between 0 and 1, and values near 1 indicate better predictive performance. The false positive and false negative rates also lie between 0 and 1, and values near 0 indicate better predictive performance. Classification tree models with a positive likelihood ratio greater than 10, a negative likelihood ratio less than 0.1, and a high diagnostic odds ratio have good predictive performance. The AUC is an overall performance measure between 0 and 1: higher values indicate better overall performance, and a perfect diagnostic tool has an AUC of 1.
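The brief sketch below computes several of the listed accuracy measures from a 2x2 confusion matrix; the labels and predicted probabilities are purely illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.35, 0.8, 0.3, 0.65, 0.55])
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
youden_index = sensitivity + specificity - 1
positive_lr = sensitivity / (1 - specificity)        # positive likelihood ratio
dor = (tp * tn) / (fp * fn)                           # diagnostic odds ratio
auc = roc_auc_score(y_true, y_prob)                   # area under the ROC curve
print(sensitivity, specificity, youden_index, positive_lr, dor, auc)
```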

The predictive performance of regression tree algorithms can be compared using criteria such as [122, 123]: the Pearson correlation coefficient, root mean squared error (RMSE), relative error (RE), mean error (ME), mean absolute error (MAE), and bias.
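A corresponding sketch for regression trees, again with purely illustrative numbers: RMSE, MAE, bias (mean error), and the Pearson correlation between observed and predicted values.

```python
import numpy as np

y_obs = np.array([3.1, 2.4, 5.6, 4.8, 6.0])
y_hat = np.array([2.9, 2.7, 5.1, 5.0, 6.3])

rmse = np.sqrt(np.mean((y_obs - y_hat) ** 2))    # root mean squared error
mae = np.mean(np.abs(y_obs - y_hat))             # mean absolute error
bias = np.mean(y_hat - y_obs)                    # mean error (bias)
pearson_r = np.corrcoef(y_obs, y_hat)[0, 1]      # Pearson correlation coefficient
print(rmse, mae, bias, pearson_r)
```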

8. Conclusion

Bayesian trees have some advantages over classic tree-based approaches. The classic CART model cannot explore the tree space fully, and the resulting tree is only locally optimal because of the greedy search algorithm. Bayesian tree approaches, by contrast, investigate different tree structures with different splitting variables, splitting rules, and tree sizes, so they can explore the tree space more thoroughly than classic approaches; indeed, Bayesian approaches are a remedy for this problem of the CART model. CART is also biased toward predictor variables with many distinct values, and Bayesian tree models can remedy this problem as well: because the Bayesian approaches proposed by CGM, DMS, OML, and WTW use uniform distributions for selecting the splitting node, splitting variables, and splitting rules, they generate unbiased splits, with no bias toward predictor variables with more possible splits. Unlike classic tree approaches, these approaches generate several trees, which allows researchers to select the best tree according to the aim of the study: in some studies sensitivity matters most to the researcher, and in others specificity does.

Some authors have compared Bayesian approaches with classic tree approaches such as CART, the random forests of Breiman, and other models. Most of these papers tend to present the Bayesian method as superior to all other competitors. This can happen for a variety of reasons: publication bias (methods that do not demonstrate superior performance typically do not get published), the choice of examples that demonstrate the superiority of the new method, or more careful use of the new method than of the competing methods. More reliable comparisons are likely to come from studies that propose no new method and are devoted entirely to comparing existing approaches. For some of these papers, see Refs. [124, 125, 126, 127].

According to the empirical results, Bayesian approaches perform better than the classic CART model. Nevertheless, despite the advantages of Bayesian tree approaches over classic tree models, the number of published articles that use Bayesian tree approaches for data analysis is small. A major reason may be the lack of user-friendly software and the need for programming knowledge. By contrast, the number of published papers employing the CART model, random forests, and other classic tree models is large, partly because these models are implemented in many software programs, such as CART, SPSS, TANAGRA, STATISTICA, R, and WEKA.

Bayesian tree approaches need more research, because, unlike CART and random forests, they cannot impute missing values. They also cannot create linear combination splits, unlike other tree algorithms (CART, QUEST, and CRUISE); although such splits are hard to interpret, results indicate that tree methods with linear combination splits have better prediction accuracy than trees with univariate splits [128].

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
