Abstract
The main idea of this work is the machine learning technique of bagging, or Bootstrap AGGregatING, which is extended here to average the classifiers based on a distance function. The idea of this function is to find the shortest distance from each data point to the classification boundary, using the 'Manhattan' distance for decision trees and, as an alternative measure, the 'Mahalanobis' distance for linear discriminant analysis (LDA); we call this modified bagging in this work. By providing a weighted voting system instead of equal-weight voting, the classification error is reduced. Modified bagging is a viable option for reducing the variance, which is a component of the classification error. From the analysis, we conclude that modified bagging gives a statistically significant improvement on Ripley's data set for different bootstrap sample sizes.
Keywords
- machine learning
- bootstrap aggregating
- modified bagging
- classification error
- LDA
1. Introduction
Advances in data collection and processing technology have led to a revolution in database management. Classification plays a key role in understanding, summarising and finding useful patterns in data, for example in medicine, agriculture, archaeology, taxonomy and industry. Statistical classification has traditionally been developed not only by statisticians but also by engineers and computer scientists, in fields such as machine learning and neural networks.
There are two main aspects to classification: cluster analysis and discriminant analysis. The goal of cluster analysis or unsupervised learning is to identify classes from data, that is, to determine the number of clusters and assign each observation to a specific class. Each cluster can be defined as a region of observations that are similar enough to each other to be grouped together. Cluster analysis is applied in medicine to cluster the incidences of specific types of tumours and in history to group archaeological findings. Recent applications of clustering techniques have come from biology, for instance the discovery of tumour classes using gene expression data (bioinformatics). A popular clustering algorithm is the k-means algorithm [1].
By contrast, in discriminant analysis or supervised learning, we have a training sample (or learning sample) of data, in which we observe the outcome and feature measurements of a set of objects. Using these data, the aim is then to construct (or build) a classification rule to predict the outcome for unseen objects (the test set). Contexts in which discriminant analysis is fundamental include, for example: (i) a mechanical procedure for sorting a collection of coins into several classes (e.g. 10, 20, 50 pence or 1 pound), in which the obtained measurements could be diameter, shape or weight; a measurement on which the coins differ can be used for sorting the coins into pre-specified classes. (ii) Patients admitted to a hospital after a heart attack; data such as heart rate, blood pressure, the patient's age and medical history, and a wide variety of other information are often obtained, and it would be useful to develop criteria to differentiate low-risk from high-risk patients. (iii) In the social sciences, information collected from polls is used to predict the outcome of elections. Note that in the statistical literature, supervised learning problems (classification problems) are usually, but not always, referred to as discrimination.
All of these problems have in common the requirement of using knowledge of previously classified data to build a classifier in order to assign a class to a new observation. We refer to the construction or building of classification rules from data as discrimination, classification or learning. The quality of a classifier is characterised by the misclassification error rate on the test set. The main assumption is that observations in the test set are generated from the same underlying distribution as the observations in the training set. Almost all methods of discriminant analysis can be seen as ways of constructing such a classifier.
Two contrasting points of view have been taken. We could take a parametric family, which can only be applied when the general parametric form of the probability density functions (pdfs) is known or assumed, either from theoretical knowledge or from studying a training set. This method's most well-known use is linear discriminant analysis (LDA), which assumes that the class-conditional densities are multivariate normal distributions with distinct means but a common covariance matrix. This is in contrast to a non-parametric approach, which makes no assumption about the form of the class-conditional densities.
In discriminant analysis, more information is available on the observations than in cluster analysis. Applying discriminant analysis after a cluster analysis makes sense because the clusters (classes) produced by the analysis can ultimately be used for prediction. Although discriminant analysis and cluster analysis provide a useful dichotomy of classification problems, numerous real-world problems combine the characteristics of both settings. Note that the focus of this work is on supervised learning problems only.
In this work, bagging is applied to linear discriminant analysis (LDA). Aggregating several decision classifiers in a committee is generally more accurate than the use of a single classifier, possibly ignoring the opinion of some classifiers. Thus, it might be promising to use the bagging technique to get a better classifier with a more stable solution. Here, we use the LDA classifier with the usual bagging, the averaging combining rule and a modified version. The modified bagging is based on averaging the probability of LDA classifiers by aggregating posterior distance. We consider these techniques from the perspective of a two-class linear discriminant function. Historically, [2] was the first to propose a procedure for a two-group problem based on maximising the separation between the groups, and it is widely applied in assigning future examples to these groups. Note that LDA is usually stable when it is constructed on large training samples and unstable when small training samples are used, that is, small changes in a small training set can lead to noticeably different classifiers.
Discrimination and classification play a key role in understanding, summarising and finding useful patterns in data, for example in medicine, archaeology, taxonomy and industry. The term discrimination or separation refers to the process of deriving classification rules from data of classified observations, whereas classification or allocation refers to the application of the rules to new observations of unknown class. In practice, discrimination and classification frequently overlap and the distinction between them becomes blurred.
1.1 Linear discriminant analysis: overview
Consider modelling each class density as a multivariate normal distribution of the form

$$f_k(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)\right\}, \quad k = 1, \ldots, K, \tag{1}$$

with mean vector $\boldsymbol{\mu}_k$ for class $k$ and a common covariance matrix $\boldsymbol{\Sigma}$.

A general approach for a classification rule is to consider the posterior probabilities. The Bayes discriminant rule with respect to the posterior is to allocate a new observation $\mathbf{x}$ to the class $k$ that maximises

$$P(k \mid \mathbf{x}) = \frac{\pi_k f_k(\mathbf{x})}{\sum_{j=1}^{K} \pi_j f_j(\mathbf{x})}, \tag{2}$$

where $\pi_k$ is the prior probability of class $k$. Under the normality assumption with a common covariance matrix, this is equivalent to allocating $\mathbf{x}$ to the class with the largest linear discriminant function

$$\delta_k(\mathbf{x}) = \mathbf{x}^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k - \tfrac{1}{2}\boldsymbol{\mu}_k^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k + \log \pi_k. \tag{3}$$

In most practical situations, the population quantities $\pi_k$, $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}$ are unknown and are estimated from the training sample by $\hat{\pi}_k = n_k/n$, $\hat{\boldsymbol{\mu}}_k = \frac{1}{n_k}\sum_{y_i = k} \mathbf{x}_i$ and $\hat{\boldsymbol{\Sigma}} = \frac{1}{n-K}\sum_{k}\sum_{y_i = k}(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^{\mathrm{T}}$. With these estimates inserted to give parametric estimates of the discriminant functions, the LDA rule allocates a new observation $\mathbf{x}$ to the class with the largest value of

$$\hat{\delta}_k(\mathbf{x}) = \mathbf{x}^{\mathrm{T}}\hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol{\mu}}_k - \tfrac{1}{2}\hat{\boldsymbol{\mu}}_k^{\mathrm{T}}\hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol{\mu}}_k + \log \hat{\pi}_k. \tag{4}$$
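To make the estimation steps concrete, the following sketch (Python with NumPy; the function names are ours, not from the original text) estimates the class priors, mean vectors and pooled covariance matrix from a training sample and evaluates the discriminant functions of Eq. (4).

```python
import numpy as np

def fit_lda(X, y):
    """Estimate the quantities used by the LDA rule in Eq. (4):
    class priors, class mean vectors and the pooled covariance matrix."""
    classes = np.unique(y)
    n, p = X.shape
    priors, means = {}, {}
    pooled = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        priors[k] = len(Xk) / n
        means[k] = Xk.mean(axis=0)
        pooled += (Xk - means[k]).T @ (Xk - means[k])
    pooled /= n - len(classes)                     # pooled covariance estimate
    return classes, priors, means, np.linalg.inv(pooled)

def lda_predict(model, X):
    """Allocate each observation to the class with the largest score in Eq. (4)."""
    classes, priors, means, Sinv = model
    scores = np.column_stack([
        X @ Sinv @ means[k]
        - 0.5 * means[k] @ Sinv @ means[k]
        + np.log(priors[k])
        for k in classes
    ])
    return classes[np.argmax(scores, axis=1)]
```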
2. Bootstrap aggregating or bagging: overview
The quest for a 'good' learner is an important issue in discriminant analysis. Usually, a single classifier constructed on a training set is biased and has large variance. Consequently, such a classifier may be misleading and will typically generalise poorly. One can improve the classifier by combining multiple classifiers instead of using a single classifier. The cooperation of several classifiers as a decision 'committee' has been proposed as a way of reducing the misclassification error.
In Breiman's paper [4], bagging is proposed to enhance the performance of a learning algorithm and has been shown to increase classifier accuracy. Bagging is a general technique that combines, by a majority vote, the outputs of several classifiers, each produced by the same learning algorithm on a bootstrap resample of the training set.
Bagging produces random and independent training sets by sampling with replacement from the original data set. This procedure is repeated B times to obtain B bootstrap samples, and a classifier is constructed on each of them.
The bootstrapping and aggregating approaches are used in bagging as follows: given a training set of n observations ℓ = {(x_i, y_i), i = 1, …, n}, draw B bootstrap samples ℓ^{*b}, b = 1, …, B, each of size n, by sampling with replacement from ℓ; construct a classifier C^{*b} on each bootstrap sample; and combine the B classifiers into a single bagged classifier by a plurality (majority) vote over their predictions.
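As a minimal illustration of the resampling step (our own sketch, not code from the original work), each bootstrap sample is obtained by drawing n observation indices with replacement:

```python
import numpy as np

def bootstrap_samples(X, y, B, seed=0):
    """Yield B bootstrap samples of the training set, each of size n,
    drawn by sampling observation indices with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # indices sampled with replacement
        yield X[idx], y[idx]
```

Because roughly 37% of the observations are left out of each resample on average, the bootstrap classifiers differ from one another, which is what the aggregation step exploits.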
2.1 Mechanics of bagging algorithm on LDA
Bagging has usually been investigated for decision trees. Breiman [4] showed that bagging could reduce the classification error of decision trees, and he noticed that bagging is useful for unstable procedures only. For stable rules like LDA, [7] found that bagging is of little use, because in most cases [4] used large training sets, for which LDA is very stable. In this work, we perform our study for linear discriminant analysis to investigate whether applying bagging can be useful in some situations. Aggregating techniques are implemented on LDA in the following way (a short code sketch is given after the steps):
1. For each bootstrap replicate b = 1, …, B:
- draw a bootstrap sample ℓ^{*b} of size n with replacement from ℓ;
- estimate the parameters used by LDA in Eq. (4);
- construct the LDA classifier C^{*b} on ℓ^{*b};
- store the classifiers {C^{*b}, b = 1, …, B}.
2. Each observation in the test set is classified by every stored classifier, and the B predicted class labels are combined by a majority (plurality) vote.
A final decision rule for the two-class case is
$$C_{\text{bag}}(\mathbf{x}) = \mathrm{sign}\left(\sum_{b=1}^{B} C^{*b}(\mathbf{x})\right),$$
where C^{*b}(x) = ±1 denotes the class predicted by the b-th classifier and sign(·) = ±1 depending on the sign of its argument.
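A compact sketch of this procedure for the two-class case, with labels coded as ±1 (the helper names and label coding are our own assumptions):

```python
import numpy as np

def fit_lda_two_class(X, y):
    """Two-class LDA with labels in {-1, +1}: returns (w, b) so that the
    classifier is sign(w @ x + b), i.e. the difference between the two
    discriminant functions of Eq. (4)."""
    Xp, Xm = X[y == 1], X[y == -1]
    mp, mm = Xp.mean(axis=0), Xm.mean(axis=0)
    S = ((Xp - mp).T @ (Xp - mp) + (Xm - mm).T @ (Xm - mm)) / (len(y) - 2)
    w = np.linalg.solve(S, mp - mm)
    b = -0.5 * (mp + mm) @ w + np.log(len(Xp) / len(Xm))
    return w, b

def bagged_lda_majority(X, y, X_new, B=50, seed=0):
    """Usual bagging of LDA: construct B classifiers on bootstrap samples and
    combine their +/-1 votes by a majority (sign of the vote total).
    The sketch assumes both classes appear in every bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    votes = np.zeros(len(X_new))
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # bootstrap sample of size n
        w, b = fit_lda_two_class(X[idx], y[idx])  # LDA on the resample
        votes += np.sign(X_new @ w + b)           # this classifier's +/-1 vote
    return np.sign(votes)                         # majority decision
```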
2.2 Illustration of bagging algorithm on LDA
Linear discriminant analysis is optimal for providing a classification rule that minimises the misclassification error rate if each class is a sample from a multivariate normal population and the population covariance matrices are all equal. Decision tree classification can achieve lower error rates when these assumptions are violated, for example in the case of bimodality.
To illustrate this, let us consider the situation where LDA is applied to a two-class problem with one-dimensional data. Sixteen values were generated from the first population and sixteen from the second, giving a training set of 32 observations in total.
In one dimension, the majority rule is, in this special case, equivalent to a 'median' procedure.
Let us consider the above classification example with a two-class problem. In this example, we have 32 training examples. Also, we have B classifiers, each constructed on a bootstrap sample; in one dimension, each LDA classifier reduces to a single threshold (cut-point) on the real line.
The majority vote is usually constituted by 50% + 1 of the votes, the share that has the power to make a decision binding upon the whole committee. Here, the majority criterion is equivalent to the median, since more than half of the classifiers assign an observation to a given class exactly when the median of the B thresholds does.
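The equivalence is easy to verify numerically; the sketch below (purely illustrative, with arbitrary numbers) compares a majority vote over an odd number of one-dimensional threshold classifiers with classification by the median threshold.

```python
import numpy as np

rng = np.random.default_rng(1)
thresholds = rng.normal(6.5, 0.5, size=51)   # one cut-point per bootstrap classifier
x_new = np.linspace(3, 10, 200)              # points to classify

# each classifier votes +1 if x lies above its threshold, -1 otherwise
votes = np.sign(x_new[:, None] - thresholds[None, :])
majority = np.sign(votes.sum(axis=1))

# classifying against the median threshold gives the same labels
median_rule = np.sign(x_new - np.median(thresholds))
print(np.array_equal(majority, median_rule))   # True
```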
3. Averaging combining rule on LDA
Based on the above observations on the variance of the mean and median, we introduce here the averaging combining rule for applying bagging to LDA classifiers, which is based on averaging the coefficients of the classifiers built on bootstrap replicates to create a final decision rule. As we have seen in the above example, averaging has an advantage over the median procedure, which is in some sense equivalent to a majority vote, and this makes it possible to use the averaging combining rule. Notice, however, that the averaging combining rule makes sense only when the rules are 'consistent' at the boundaries. Thus, bagging in this case is organised as follows (a code sketch follows the steps):
1. For each bootstrap replicate b = 1, …, B:
- draw a bootstrap sample ℓ^{*b} of size n with replacement from ℓ;
- estimate the parameters used by LDA in Eq. (4);
- construct the LDA classifier C^{*b} on ℓ^{*b};
- store the classifiers {C^{*b}, b = 1, …, B}.
2. Combine the LDA classifiers by averaging their coefficients into a final decision rule: the final linear rule uses the averaged coefficients, which for the two-class case is equivalent to
$$C_{\text{avg}}(\mathbf{x}) = \mathrm{sign}\left(\frac{1}{B}\sum_{b=1}^{B} \delta^{*b}(\mathbf{x})\right),$$
where $\delta^{*b}(\mathbf{x})$ is the linear discriminant score of the b-th bootstrap classifier.
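A minimal sketch of the averaging combining rule, reusing the fit_lda_two_class helper from the earlier sketch and again coding the two classes as ±1: the coefficient vectors and intercepts of the B bootstrap classifiers are averaged into a single linear rule.

```python
import numpy as np

def averaged_lda(X, y, B=50, seed=0):
    """Averaging combining rule: average the LDA coefficients (w, b) fitted
    on B bootstrap samples into one final linear decision rule."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w_sum, b_sum = np.zeros(p), 0.0
    for _ in range(B):
        idx = rng.integers(0, n, size=n)           # bootstrap sample
        w, b = fit_lda_two_class(X[idx], y[idx])   # helper from Section 2.1 sketch
        w_sum, b_sum = w_sum + w, b_sum + b
    w_bar, b_bar = w_sum / B, b_sum / B            # averaged coefficients
    return lambda X_new: np.sign(X_new @ w_bar + b_bar)
```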
3.1 Modified bagging algorithm on LDA
As mentioned previously, the averaging combining rule performs well only if the rules constructed on the bootstrap samples are consistent at the boundaries. In practice, this situation may not hold, because bootstrap samples have different features, which leads to classifiers that are inconsistent at the boundaries, especially if the bootstrap sample size is small. For this reason, we introduce an alternative procedure, which is based on the posterior distance, or Mahalanobis distance. The posterior distance treats each class, with its respective mean vector and the common covariance matrix, as the specification of a probability distribution for that class. For each new observation, we calculate the probability that the observation came from each class; the observation is then classified to the class which gives the highest probability. To illustrate, consider the one-dimensional example shown in Figure 3.
Suppose, first of all, that we have two probability distributions with a common covariance matrix: observations drawn from class 1 have a known distribution with mean 4, whereas those from class 2 have a known distribution with mean 9. The classifier assigns a new observation to the class whose mean is nearest in Mahalanobis distance, so with a common variance the decision boundary lies midway between the two means, at 6.5.
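For this one-dimensional example (class means 4 and 9; the common variance is taken as 1 purely for illustration, since the text does not specify it), the rule reduces to comparing two Mahalanobis distances:

```python
import numpy as np

means = np.array([4.0, 9.0])   # class means from the example
var = 1.0                      # common variance (illustrative value)

def mahalanobis_1d(x, mean, var):
    """One-dimensional Mahalanobis distance |x - mean| / sd."""
    return np.abs(x - mean) / np.sqrt(var)

for x in (5.0, 7.0, 10.0):
    d = mahalanobis_1d(x, means, var)            # distance to each class mean
    print(x, "-> class", 1 + int(d[1] < d[0]))   # the nearer class wins
# with equal variances the boundary lies midway between the means, at x = 6.5
```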
4. Mechanics of modified (weighted) bagging algorithm on LDA
A commonly used distance measure is the Mahalanobis distance, which is defined as

$$D_k(\mathbf{x}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu}_k)^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)},$$

where $\boldsymbol{\mu}_k$ is the mean vector of class $k$ and $\boldsymbol{\Sigma}$ is the common covariance matrix.
The procedure for modified (or weighted) bagging is performed by the following steps (Algorithm 3). Given a training set ℓ = {(x_i, y_i), i = 1, …, n}:
1. For each bootstrap replicate b = 1, …, B:
- draw a bootstrap sample ℓ^{*b} of size n with replacement from ℓ;
- estimate the parameters used by LDA in Eq. (4);
- construct the LDA classifier C^{*b} on ℓ^{*b}.
2. For each new observation x, compute its distance to the classification boundary of each classifier C^{*b} and use this distance as the weight w_b(x) of that classifier's vote.
3. A final decision rule:
$$C_{\text{wbag}}(\mathbf{x}) = \mathrm{sign}\left(\sum_{b=1}^{B} w_b(\mathbf{x})\, C^{*b}(\mathbf{x})\right),$$
where sign(·) = ±1 depending on the sign of its argument and C^{*b}(x) = ±1 is the class predicted by the b-th classifier. A code sketch of this scheme is given below.
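The sketch below gives one possible reading of the weighted scheme (reusing fit_lda_two_class from the Section 2.1 sketch): each bootstrap classifier's ±1 vote is weighted by the distance of the new observation from that classifier's decision boundary. The particular weighting function is our assumption for illustration; the original algorithm may normalise the distance differently.

```python
import numpy as np

def weighted_bagged_lda(X, y, X_new, B=50, seed=0):
    """Modified (weighted) bagging sketch: each bootstrap LDA classifier casts
    a +/-1 vote weighted by how far the observation lies from that
    classifier's decision boundary w @ x + b = 0."""
    rng = np.random.default_rng(seed)
    n = len(y)
    score = np.zeros(len(X_new))
    for _ in range(B):
        idx = rng.integers(0, n, size=n)
        w, b = fit_lda_two_class(X[idx], y[idx])     # classifier on the resample
        margin = X_new @ w + b
        dist = np.abs(margin) / np.linalg.norm(w)    # distance to the boundary
        score += dist * np.sign(margin)              # weighted +/-1 vote
    return np.sign(score)                            # sign of the weighted total
```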
4.1 Ripley’s simulated data
The data set is a two-class classification problem in two features. A training set of 250 observations is used, and the error rates are estimated using a further sample of 1000 observations. We use the same data as in [8], as illustrated in Figure 4. The class distributions are chosen to allow a best-possible error rate of about 8% and are in fact equal mixtures of two normal distributions. The LDA classifier constructed on the full training set and tested on the test set gives a test error rate of about 10.8%; its decision boundary has been superimposed (solid line) on the graph.
In this example, we perform weighted bagging, the averaging rule and usual bagging for LDA. One reason for choosing the LDA classifier is ease of comparison. Another reason, which illustrates the point nicely, is that the training sample for the LDA is small, and therefore the LDA becomes less stable. Here, we consider Ripley's data set with different bootstrap sample sizes.
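A sketch of the kind of comparison described here, using a synthetic two-class mixture of normals in place of Ripley's original data (the generator parameters are illustrative only, and the bagging helpers are those sketched in earlier sections); the printed error rates are whatever the run produces, not the figures reported in the text.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_mixture(n):
    """Two classes, each an equal mixture of two bivariate normals
    (Ripley-style structure; parameters chosen for illustration only)."""
    y = rng.choice([-1, 1], size=n)
    centres = {-1: [(-0.7, 0.3), (0.3, 0.3)], 1: [(-0.3, 0.7), (0.4, 0.7)]}
    X = np.array([rng.normal(centres[c][rng.integers(2)], 0.25) for c in y])
    return X, y

X_tr, y_tr = make_mixture(250)    # training set of 250 observations
X_te, y_te = make_mixture(1000)   # test set of 1000 observations

for name, pred in [("usual bagging   ", bagged_lda_majority(X_tr, y_tr, X_te)),
                   ("weighted bagging", weighted_bagged_lda(X_tr, y_tr, X_te))]:
    print(name, "test error:", np.mean(pred != y_te))
```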
As a result, a classifier developed using a large bootstrap sample size may be comparable to the classifier developed using the entire training set. Therefore, some features of the classifier built on the whole training set may be present in the bagged classifiers. This phenomenon is most pronounced for the LDA classifier. In general, the performance of modified bagging and usual bagging, as well as of the averaging combining rule, is strongly affected by the training sample size and the data distribution.
5. Conclusion
Combining several decision classifiers in a committee can provide an improvement over the use of a single classifier. Bagging is a general technique that combines the benefits of bootstrapping and aggregating. However, bagging has mainly been investigated for decision trees and much less for linear discriminant analysis (LDA). Breiman [4] claims that bagging works well for unstable procedures, which often have high variance. This variance is a major component of the classification error, which can be reduced by aggregating, often producing good results.
The idea of bagging is extended to averaging the classifiers based on the Manhattan distance for decision trees (see [9]) or the Mahalanobis distance for linear discriminant analysis (LDA) instead of a majority vote. In the first direction, we focused on building decision trees, and the experimental findings and analysis demonstrate that, for all sample sizes, the weighted bagged classifier outperforms conventional bagging at various tree depths. It is interesting to note that weighted bagging with stump trees somewhat outperforms standard bagging. In the second direction, we concentrated on building linear discriminant analysis (LDA) classifiers. We have shown, using Ripley's simulated data, that weighted bagging can provide a good classifier, better than the averaging rule and the usual bagged classifier when constructed on a small training sample. As the training sample size increases, the classifier becomes stable, and both modified bagging and usual bagging become identical to the averaging rule. These techniques depend on the training sample size, as well as on the distribution of the data used to construct the classifier. Therefore, it is crucial to take these factors into account when using this strategy to enhance classifier performance.
5.1 Potential research directions
We have concluded that a single LDA classifier could sometimes be preferable to usual bagging. In general, a simple majority vote is not a good choice for the aggregating rule, whereas the weighted vote technique is often a good choice for a bagged classifier in decision trees [9].
Furthermore, weighted bagging may also perform well with the LDA classifier. However, the choice of aggregating technique may be important, and it strongly depends on the training sample size and the distribution of the data.
In this study, we apply weighted bagging only to the case of a two-dimensional predictor space; the practically more important scenario of many variables remains to be explored. We leave this as an open research question.
Acknowledgments
I would like to thank Professor Charles Taylor of the University of Leeds for his scientific input on this work.
References
- 1. MacQueen J. Some methods for classification and analysis of multivariate observations. In: LeCam LM, Neyman J, editors. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. Berkeley, CA: University of California Press; 1967. pp. 281-297
- 2. Fisher RA. The use of multiple measurements in taxonomic problems. Annals of Eugenics. 1936;7:179-188
- 3. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. New York, NY: Springer; 2017
- 4. Breiman L. Bagging predictors. Machine Learning. 1996a;24:123-140
- 5. Efron B, Tibshirani R. An Introduction to the Bootstrap. London: Chapman & Hall; 1993
- 6. Davison AC, Hinkley DV. Bootstrap Methods and their Application. Cambridge: Cambridge University Press; 1997
- 7. Breiman L. Bias, variance and arcing classifiers. Technical report, Statistics Department, University of California, Berkeley; 1996b
- 8. Ripley BD. Neural networks and related methods for classification (with discussion). Journal of the Royal Statistical Society B. 1994;56:409-456
- 9. El Gimati Y. Weighted bagging in decision trees: Data mining. JINAV: Journal of Information and Visualization. 2020;1(1):1-14. DOI: 10.35877/454RI.jinav149