Open access peer-reviewed chapter

Modified Bagging in Linear Discriminant Analysis: Machine Learning

Written By

Yousef El Gimati

Submitted: 19 August 2023 Reviewed: 20 September 2023 Published: 02 May 2024

DOI: 10.5772/intechopen.113260

From the Edited Volume

Research Advances in Data Mining Techniques and Applications

Edited by Yves Rybarczyk


Abstract

The main idea of this work is to use the machine learning technique of BAGGING, or Bootstrap AGGregatING, extended so that the classifiers are averaged on the basis of a distance function. The idea of this function is to find the shortest distance from each data point to the classification boundary, using the 'Manhattan' distance in decision trees; an alternative measure, the 'Mahalanobis' distance, is used for Linear Discriminant Analysis (LDA) and is called modified bagging in this work. By providing a weighted voting system instead of equal-weight voting, the classification error is reduced. Modified bagging is a viable option for reducing the variance, which is a component of the classification error. Referring to the analysis, we conclude that modified bagging gives a statistically significant improvement on Ripley's data set with different bootstrap sample sizes.

Keywords

  • machine learning
  • bootstrap aggregating
  • modified bagging
  • classification error
  • LDA

1. Introduction

Advances in data collection and processing technology have led to a revolution in database management. Classification plays a key role in understanding, summarising and finding useful patterns in data, for example in medicine, agriculture, archaeology, taxonomy and industry. Statistical classification has traditionally been developed not only by statisticians but also by engineers and computer scientists, in fields such as machine learning and neural networks.

There are two main aspects to classification: cluster analysis and discriminant analysis. The goal of cluster analysis, or unsupervised learning, is to identify classes from data, that is, to determine the number of clusters and assign each observation to a specific class. Each cluster can be defined as a region of observations that are similar enough to each other to be grouped together. Cluster analysis is applied in medicine to cluster the incidences of specific types of tumours and in history to group archaeological findings. Recent applications of clustering techniques have come from biology, for instance the discovery of tumour classes using gene expression data (bioinformatics). A popular clustering algorithm is the k-means algorithm [1], which aims to partition a given data set into k clusters based on squared distance.

By contrast, in discriminant analysis, or supervised learning, we have a training sample (or learning sample) of data in which we observe the outcome and feature measurements of a set of objects. Using these data, the aim is then to construct (or build) a classification rule to predict the outcome for unseen objects (the test set). Contexts in which discriminant analysis is fundamental include, for example: (i) a mechanical procedure for sorting a collection of coins into several classes (e.g. 10, 20, 50 pence or 1 pound), in which the obtained measurements could be diameter, shape or weight; a measurement on which the coins differ can be used for sorting the coins into pre-specified classes. (ii) Patients admitted to a hospital after a heart attack, for whom data such as heart rate, blood pressure, the patient's age and medical history, and a wide variety of other information are often obtained; it would be useful to develop criteria to differentiate low-risk from high-risk patients. (iii) In the social sciences, information collected from polls is used to predict the outcome of elections. Note that in the statistical literature, supervised learning problems (classification problems) are usually, but not always, referred to as discrimination.

All of these problems have in common the requirement of using knowledge of previously classified data to build a classifier in order to assign a class to a new observation. We refer to the construction or building of classification rules from data as discrimination, classification or learning. The quality of a classifier is characterised by the misclassification error rate on the test set. The main assumption is that observations in the test set are generated from the same underlying distribution as the observations in the training set. Almost all methods of discriminant analysis can be seen as ways of constructing a classifier.

Two contrasting points of view have been taken. We could take a parametric family, which can only be applied when the general parametric form of the probability density functions (pdfs) is known or assumed, either from theoretical knowledge or from studying a training set. The best-known example of this approach is linear discriminant analysis (LDA), which assumes that the class-conditional densities are multivariate normal distributions with distinct means but a common covariance matrix. This is in contrast to a non-parametric, or distribution-free, approach in which no assumption is made about the type of distribution from which the samples are drawn. Examples include kernel density estimation and k-nearest neighbours. Classification decision trees provide another example of a successful non-parametric method.

Compared with cluster analysis, more information is available on the observations in discriminant analysis. Applying discrimination after a cluster analysis makes sense because the clusters (classes) produced by the analysis can ultimately be used for prediction. Although discriminant analysis and cluster analysis provide a useful dichotomy of classification problems, numerous real-world problems combine the characteristics of both. Note that the focus of this chapter is on supervised learning problems only.

In this work, bagging is applied to linear discriminant analysis (LDA). Aggregating several decision classifiers in a committee is generally more accurate than using a single classifier, possibly ignoring the opinion of some classifiers. Thus, it might be promising to use the bagging technique to obtain a better classifier with a more stable solution. Here, we use the LDA classifier with the usual bagging, the averaging combining rule and a modified version. The modified bagging is based on averaging the LDA classifiers by aggregating them according to a posterior distance. We consider these techniques from the perspective of a two-class linear discriminant function. Historically, [2] was the first to propose a procedure for a two-group problem based on maximising the separation between the groups, and it is widely applied in assigning future examples to these groups. Note that LDA is usually stable when it is constructed on large training samples and unstable when small training samples are used, that is, when n is small relative to d. Here, we show that LDA is not always as stable as one would hope; consequently, applying modified bagging can be useful in some situations (see [3]).

Discrimination and classification play a key role in understanding, summarising and finding useful patterns in data, for example in medicine, archaeology, taxonomy and industry. The term discrimination, or separation, refers to the process of deriving classification rules from data of classified observations, whereas classification, or allocation, refers to the application of the rules to new observations of unknown class. In practice, discrimination and classification frequently overlap and the distinction between them becomes blurred.

1.1 Linear discriminant analysis: overview

Consider modeling each class density as a multivariate normal distribution with the following forms:

$$ f_i(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\left\{ -\tfrac{1}{2}(x-\mu_i)'\,\Sigma_i^{-1}(x-\mu_i) \right\}, \quad i = 1, 2 \tag{1} $$

with mean vector μi, covariance matrix Σi and d the dimension of the feature space. Further, if we suppose the special case where each of the classes has a common covariance matrix Σ, this leads to a particularly simple linear discriminant analysis (or LDA).

A general approach to a classification rule is to consider the posterior probabilities. The Bayes discriminant rule with respect to the posterior is to allocate a new observation z to the population for which πi fi(z) is maximised, where the πi are the prior probabilities. Then, the rule that allocates z to Π1 is

$$ (\mu_1-\mu_2)'\,\Sigma^{-1} z \;-\; \tfrac{1}{2}(\mu_1-\mu_2)'\,\Sigma^{-1}(\mu_1+\mu_2) \;\ge\; \log\frac{\pi_2}{\pi_1} \tag{2} $$

In most practical situations, the population quantities μi, Σi and πi are unknown and are replaced by their sample counterparts. Hence, the samples of ni observations from each population Πi are used to define a sample-based rule by replacing:

μi with x̄i, the estimated mean vector in Πi; Σ with Sp, the pooled sample covariance matrix; and πi with π̂i = ni/n, the estimated prior probability. These estimates are given by

$$ \bar{x}_j = \frac{1}{n_j}\sum_{i:\,y_i=j} x_i, \qquad S_j = \frac{1}{n_j-1}\sum_{i:\,y_i=j} (x_i-\bar{x}_j)(x_i-\bar{x}_j)', \quad j = 1, 2 \tag{3} $$
$$ S_p = \frac{(n_1-1)S_1 + (n_2-1)S_2}{n_1+n_2-2} \tag{4} $$

With these estimates inserted to give parametric estimates of f1 and f2, the rule (2) now becomes a sample linear discriminant rule: allocate z to Π1 if

$$ (\bar{x}_1-\bar{x}_2)'\,S_p^{-1} z \;-\; \tfrac{1}{2}(\bar{x}_1-\bar{x}_2)'\,S_p^{-1}(\bar{x}_1+\bar{x}_2) \;\ge\; \log\frac{\hat{\pi}_2}{\hat{\pi}_1} \tag{5} $$
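
To make the sample rule concrete, the following minimal NumPy sketch (not part of the original chapter; the function name and the default equal priors are illustrative assumptions) implements Eqs. (3)–(5) for two classes:

```python
import numpy as np

def lda_allocate(X1, X2, z, prior1=0.5, prior2=0.5):
    """Sample linear discriminant rule of Eq. (5): return 1 if z is allocated
    to population 1, and 2 otherwise. X1 and X2 are (n_i x d) arrays of
    training observations from classes 1 and 2."""
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # Pooled sample covariance matrix, Eq. (4)
    Sp = ((n1 - 1) * np.cov(X1, rowvar=False) +
          (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    w = np.linalg.solve(Sp, xbar1 - xbar2)           # Sp^{-1} (xbar1 - xbar2)
    lhs = w @ z - 0.5 * w @ (xbar1 + xbar2)          # left-hand side of Eq. (5)
    return 1 if lhs >= np.log(prior2 / prior1) else 2

# Illustrative use on simulated two-dimensional data
rng = np.random.default_rng(1)
X1 = rng.normal(loc=[-2.0, 0.0], scale=1.2, size=(16, 2))
X2 = rng.normal(loc=[2.0, 0.0], scale=1.2, size=(16, 2))
print(lda_allocate(X1, X2, z=np.array([-1.5, 0.3])))  # expected output: 1
```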

2. Bootstrap aggregating or bagging: overview

The quest for a 'good' learner is an important issue in discriminant analysis. Usually, a single classifier constructed on a training set is biased and has large variance. Consequently, such a classifier may be misleading and will typically generalise poorly. One can improve the classifier by combining multiple classifiers instead of using a single classifier. The cooperation of several classifiers as a decision 'committee' has been proposed as a way of reducing misclassification error.

In Breiman's paper [4], bagging, which has been shown to increase classifier accuracy, is used to enhance the performance of a learning algorithm. Bagging is a general technique in which several classifier outputs, each produced by a single learning algorithm on a bootstrap resample of the training set, are combined by a majority vote to improve the performance of the given learning algorithm.

Bagging draws random, independent training sets with replacement from the original data set. This procedure is repeated B times (say B = 50), and the resulting B classifiers are combined [5]. A final classifier or prediction is obtained by averaging the individual classifiers in regression, or by taking the majority vote over the individual classifiers in classification.

The bootstrapping and aggregating approaches are used in bagging as follows: take a training set of n observations ℓ = {(xi, yi), i = 1, …, n}, with xi the feature vector and yi the response, taking values in y ∈ {1, 2, …, J}. Draw B bootstrap samples, each of size m, with replacement from the original training set. Construct a classifier from each bootstrap sample; the final classifier outputs the class predicted most often by these sub-classifiers (for more details see [6]). Figure 1 contains full details:

Figure 1.

Illustration of the majority vote proposed by the researcher: BAGGING = Bootstrap + AGGregatING.

2.1 Mechanics of bagging algorithm on LDA

Bagging has usually been investigated for decision trees. Breiman [4] showed that bagging can reduce the classification error of decision trees, and he noticed that bagging is useful only for unstable procedures. For stable rules such as LDA, [7] found that bagging is of little use, because in most cases [4] used large training sets, for which LDA is very stable. In this work, we perform our study on linear discriminant analysis to investigate whether applying bagging can be useful in some situations. Aggregating techniques are implemented on LDA in the following way:

Algorithm 1. Bagging algorithm on LDA

 1.  for b = 1 to B do

      ℓ*b = bootstrap sample of size m from ℓ (i.i.d. sample with replacement)

      estimate the parameters used by LDA in Eq. (4)

      construct the LDA classifier c ℓ*b (x)

endfor

store the classifiers {c ℓ*b (x), b = 1, 2, …, B}

 2.  Each observation in ℓ or in a test set is then classified using a majority vote of the classifiers c ℓ*1 (x), c ℓ*2 (x), …, c ℓ*B (x).

     The final decision rule is: cbag(x) = argmax_{y ∈ {0,1}} Σ_{b=1}^{B} I[c ℓ*b (x) = y],

     where I[·] is the indicator function: I[·] = 1 if c ℓ*b (x) = y and 0 if c ℓ*b (x) ≠ y.
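
Algorithm 1 could be realised in Python roughly as follows. This is an illustrative sketch, not the author's code: it assumes numeric class labels, the two-class rule of Eq. (5) with priors estimated by ni/n, and that every bootstrap sample contains both classes; the names fit_lda and bagged_lda are ours.

```python
import numpy as np

def fit_lda(X, y):
    """Fit the two-class LDA rule (Eqs. (3)-(5)) and return a predict function."""
    classes = np.unique(y)
    X1, X2 = X[y == classes[0]], X[y == classes[1]]
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    Sp = ((n1 - 1) * np.cov(X1, rowvar=False) +
          (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    w = np.linalg.solve(Sp, xbar1 - xbar2)
    cut = 0.5 * w @ (xbar1 + xbar2) + np.log(n2 / n1)   # estimated priors n_i / n
    return lambda Z: np.where(Z @ w >= cut, classes[0], classes[1])

def bagged_lda(X, y, B=50, m=None, rng=np.random.default_rng(0)):
    """Algorithm 1: B bootstrap LDA classifiers combined by majority vote."""
    n = len(X)
    m = n if m is None else m
    classifiers = []
    for _ in range(B):
        idx = rng.choice(n, size=m, replace=True)        # bootstrap sample of size m
        classifiers.append(fit_lda(X[idx], y[idx]))      # c_{l*b}(x)
    classes = np.unique(y)
    def predict(Z):
        votes = np.stack([c(Z) for c in classifiers])    # shape (B, n_test)
        count0 = (votes == classes[0]).sum(axis=0)       # votes for the first class
        return np.where(count0 > B / 2, classes[0], classes[1])
    return predict
```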

2.2 Illustration of bagging algorithm on LDA

Linear discriminant analysis is optimal for providing a classification rule that minimises the misclassification error rate if each class is a sample from a multivariate normal population and the population covariance matrices are all equal. Decision tree classification can achieve lower error rates when these assumptions are violated, for example in the case of bimodality.

To illustrate this, let us consider the situation where LDA is applied to a two-class problem with one-dimensional data. Sixteen values are drawn from the first population N(−2, 1.2) and sixteen from the second population N(2, 1.2). The separation of these two sets of 'univariate xs' is assessed by the linear discriminant rule in Eq. (5). In Figure 2, one can see that the LDA classifier built on the complete data is unable to separate the data accurately. Constructing classifiers on bootstrapped samples based on 10 replicates sometimes gives a better classifier and sometimes a worse one. Aggregating the bootstrap versions using a simple majority vote of classifiers can give a better classifier than the original. An interesting observation from this example is that the bagged classifier is closer to the Bayes rule (where the densities intersect) than the LDA classifier.

Figure 2.

LDA classifier vertical (solid) line, bootstrap classifiers (dotted lines) and majority vote vertical (dashed) line. The two classes are shown by ᴼ and *. Note that there are 16 observations in each class.

In one dimension, the majority rule is, as a special case, equivalent to a 'median' procedure.

Let us consider the above classification example with a two-class problem. In this example, we have 32 training examples. We also have classifiers cb(x), b = 1, 2, …, B, as shown in Figure 1, so that if x < cb(x), then x is assigned to class 1 and otherwise to class 2. For a given new observation z, classify z to class 1 if #{b = 1, 2, …, 10 : cb(x) > z} > #{b = 1, 2, …, 10 : cb(x) < z}, that is, by majority vote.

The majority vote is usually constituted by 50% + 1 of the committee, which has the power to make a decision binding upon the whole. Here, the majority criterion is equivalent to the median, since #{b = 1, 2, …, 10 : cb(x) > c̃} = 10/2, so that if z < c̃, then z is classified to class 1 and otherwise to class 2, where c̃ is the median of cb(x), b = 1, 2, …, 10. Notice that it is very difficult to define the median in more than one dimension. However, when comparing the averaging classifier with the median classifier, it is necessary to evaluate moments such as the variance, since a suitable classification rule should have low variance to ensure a low error rate. Let μ̃ denote the median of the population. Standard asymptotic theory shows that, if we knew the density function f, then (4n[f(μ̃)]²)⁻¹ is an estimator of var(μ̃). For example, in the case of the normal distribution we have μ̃ = μ, so that f(μ̃) = f(μ) = 1/(σ√(2π)). So, the variance of the mean is smaller here than that of the median.
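
For completeness, this standard asymptotic comparison (a textbook result, stated here rather than taken from the chapter) can be written as

$$ \operatorname{var}(\tilde{\mu}) \approx \frac{1}{4n\,[f(\tilde{\mu})]^{2}}, \qquad f(\tilde{\mu}) = \frac{1}{\sigma\sqrt{2\pi}} \;\;\Longrightarrow\;\; \operatorname{var}(\tilde{\mu}) \approx \frac{\pi\sigma^{2}}{2n} \approx 1.57\,\frac{\sigma^{2}}{n} \;>\; \operatorname{var}(\bar{x}) = \frac{\sigma^{2}}{n}. $$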


3. Averaging combining rule on LDA

Based on the above observations on the variance of the mean and the median, we introduce here the averaging combining rule for applying bagging to LDA classifiers, which is based on averaging their coefficients, built on bootstrap replicates, to create a final decision rule. As we have seen in the above example, averaging has an advantage over the median procedure, which is in some sense equivalent to a majority vote; this makes it possible to use the averaging combining rule. Notice, however, that the averaging combining rule makes sense only when the rules are 'consistent' at the boundaries. Thus, bagging in this case is organised as follows:

Algorithm 2. Averaging combining algorithm on LDA

  1.  for b = 1 to B do

      ℓ*b = bootstrap sample of size m from ℓ (i.i.d. sample with replacement)

      estimate the parameters used by LDA in Eq. (4)

      construct the LDA classifier c ℓ*b (x)

endfor

    store the classifiers {c ℓ*b (x), b = 1, 2, …, B}

  2.  combine the LDA classifiers by averaging their coefficients into a final decision rule:

$$ c_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} c_{\ell^{*b}}(x) \tag{6} $$
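
Under the same assumptions as the earlier bagging sketch (illustrative names, numeric two-class labels, every bootstrap sample containing both classes), Algorithm 2 might be sketched as follows: the bootstrap LDA coefficients and cut-off are averaged into a single linear rule, in the spirit of Eq. (6).

```python
import numpy as np

def averaged_lda(X, y, B=50, m=None, rng=np.random.default_rng(0)):
    """Algorithm 2: average the bootstrap LDA coefficients into one rule."""
    n = len(X)
    m = n if m is None else m
    classes = np.unique(y)
    ws, cuts = [], []
    for _ in range(B):
        idx = rng.choice(n, size=m, replace=True)        # bootstrap sample of size m
        Xb, yb = X[idx], y[idx]
        X1, X2 = Xb[yb == classes[0]], Xb[yb == classes[1]]
        xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
        n1, n2 = len(X1), len(X2)
        Sp = ((n1 - 1) * np.cov(X1, rowvar=False) +
              (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
        w = np.linalg.solve(Sp, xbar1 - xbar2)           # coefficients of c_{l*b}
        ws.append(w)
        cuts.append(0.5 * w @ (xbar1 + xbar2) + np.log(n2 / n1))
    w_bar, cut_bar = np.mean(ws, axis=0), np.mean(cuts)  # averaged coefficients
    return lambda Z: np.where(Z @ w_bar >= cut_bar, classes[0], classes[1])
```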

3.1 Modified bagging algorithm on LDA

As mentioned previously, the averaging combining rule performs well only if the rules constructed on the bootstrap samples are consistent at the boundaries. In practice, this situation may not hold, because the bootstrap samples have different features, which leads to inconsistent classifiers at the boundaries, especially if the bootstrap sample size is small. For this reason, we introduce an alternative procedure, which is based on a posterior distance or Mahalanobis distance. The posterior distance treats each class's mean vector and covariance matrix as the specification of a probability distribution for that class. For each new observation, we calculate the probability that the observation came from each class; the observation is then classified to the class which gives the highest probability. To illustrate, consider the one-dimensional example shown in Figure 3.

Figure 3.

Two probability densities, from N(4, 1.5²) and N(9, 1.5²). The dashed lines mark c ℓ*b (z1) and c ℓ*b (z2); according to the posterior distance, z1 and z2 are assigned to class 1 and class 2, respectively.

Suppose, first of all, that we have two probability distributions with a common covariance matrix: observations drawn from class 1 have a known distribution with mean 4, whereas those from class 2 have a known distribution with mean 9. The observation z1 has a higher probability under class 1 than under class 2, and consequently c ℓ*b (z1) assigns z1 to class 1. In contrast, z2 has a larger probability under class 2 than under class 1; consequently, z2 is classified to class 2. The posterior distance differs from the Euclidean distance in that it takes into account the variance of the probability distribution for each class.


4. Mechanics of modified (weighted) bagging algorithm on LDA

A commonly used distance measure is the Mahalanobis distance, defined as d(xi, xj) = [(xi − xj)′ Sp⁻¹ (xi − xj)]^(1/2), where Sp is the pooled sample covariance matrix, and xi and xj are the respective vectors of measurements on observations i and j. Compared with the Euclidean or Manhattan distance, this distance measure has the advantage of explicitly accounting for any correlations that might exist between variables. It is similar to the posterior distance when the data for each class are similarly distributed. Thus, a second, modified version of bagging is implemented in the following way:
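
As a small, assumed helper (not the author's code), the Mahalanobis distance with the pooled covariance matrix can be computed as:

```python
import numpy as np

def mahalanobis(xi, xj, Sp):
    """Mahalanobis distance between observations xi and xj using the pooled
    sample covariance matrix Sp; with Sp = I it reduces to Euclidean distance."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(Sp, diff)))
```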

Algorithm 3. Modified bagging algorithm on LDA.

  for b = 1 to B do

      ℓ*b = bootstrap sample of size m from ℓ (i.i.d. sample with replacement)

      estimate the parameters used by LDA in Eq. (4)

      construct the LDA classifier c ℓ*b (x)

      for each new observation z*, let

$$ D(z^{*}) = \min\left\{ \min_{x:\, c_{\ell^{*b}}(x) \neq c_{\ell^{*b}}(z^{*})} d(z^{*}, x),\;\; \min_{x\ \text{on edge of}\ X} d(z^{*}, x) \right\} \tag{7} $$

endfor

$$ c_{\text{bag}}(z^{*}) = \operatorname{sign}\left( \sum_{b=1}^{B} D(z^{*})\, c_{\ell^{*b}}(z^{*}) \right) \tag{8} $$

  where sign(·) = ±1 depending on the sign of its argument.

The procedure for modified (or weighted) bagging is performed by the following steps (Algorithm 3). Given a training set ℓ = {(xi, yi), i = 1, …, n}, which consists of n observations with xi the two-dimensional feature vector and yi ∈ {−1, +1}, draw bootstrap samples ℓ*b, b = 1, …, B, each of size m with replacement from the original training set ℓ. Let Z be a 2D array covering the sample values x, and define the distance between z1 and z2 as

$$ d(z_1, z_2) = |z_{11} - z_{21}| + |z_{12} - z_{22}|. \tag{9} $$
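
The distance-weighted aggregation of Algorithm 3 could be sketched as follows, under our assumptions: class labels coded as ±1, the Manhattan distance of Eq. (9), each bootstrap classifier supplied as a function mapping an (n, 2) array to ±1 labels (for example, as produced by the earlier bagging sketch), and a rectangular grid of points standing in for the 2D array Z covering the sample values. All names are illustrative.

```python
import numpy as np

def boundary_distance(z, predict, grid):
    """D(z*), Eq. (7): shortest Manhattan distance (Eq. (9)) from z* either to
    a grid point the classifier labels differently from z*, or to the edge of
    the region X covered by the grid."""
    label = predict(z[None, :])[0]
    other = grid[predict(grid) != label]                 # oppositely labelled grid points
    lo, hi = grid.min(axis=0), grid.max(axis=0)
    d_edge = float(np.minimum(z - lo, hi - z).min())     # distance to the edge of X
    if len(other) == 0:
        return d_edge
    d_bnd = float(np.abs(other - z).sum(axis=1).min())   # Manhattan distance, Eq. (9)
    return min(d_bnd, d_edge)

def weighted_bagged_predict(z, classifiers, grid):
    """Eq. (8): sign of the distance-weighted sum of the bootstrap predictions,
    with the two classes coded as +1 and -1."""
    total = 0.0
    for predict in classifiers:                          # c_{l*b}, b = 1, ..., B
        D = boundary_distance(z, predict, grid)          # weight for this classifier
        total += D * float(predict(z[None, :])[0])
    return 1 if total >= 0 else -1
```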

4.1 Ripley’s simulated data

The data set is a two-class classification problem in two features. A training set of 250 observations is used, and the error rates are estimated using a further sample of 1000 observations. We use the same data as in [8], as illustrated in Figure 4. The class distributions are chosen to allow a best-possible error rate of about 8% and are in fact equal mixtures of two normal distributions. The LDA classifier is constructed on the full training set and tested on the test set, with a test error rate of about 10.8%, and has been superimposed (solid line) on the graph.

Figure 4.

Two-class problem data from [8]. The two classes are displayed by 0 and 1. Note that there are 100 observations in each class. The LDA classifier is constructed on the full data and has been superimposed, with an error rate of 10.8% (solid line), along with usual bagging (dashed line) and weighted bagging (dotted line).

In this example, we perform weighted bagging, the averaging rule and usual bagging for LDA. One reason for choosing the LDA classifier is ease of comparison. Another is that it nicely illustrates that when the training sample is small, LDA becomes less stable. Here, we consider Ripley's data set with different bootstrap sample sizes (m = 75 and 100). Bootstrap replicates of small training samples differ more from each other than those of large training samples (by statistical properties). Therefore, it appears that bagging may be advantageous for an LDA classifier built on relatively small training samples, especially when the classifier is most influenced by perturbations in the composition of the training sample. However, a very small training sample may be very misleading. The left-hand side (LHS) of Figure 5 compares weighted bagging, the averaging rule and usual bagging, constructed on a training sample of size m = 75 and evaluated on the test set. One can clearly see the superiority of weighted bagging over the other two procedures as the iterations increase. However, the optimal error rate is reached after 7 iterations by usual bagging, which gave a better improvement than the later iterations. The test error rate was about 10.3% for modified bagging and 10.6% for both the averaging rule and usual bagging. When the training sample size is increased to m = 100, as shown on the RHS of the same figure, all procedures are less accurate compared with the LHS plot produced with a training sample of size 75. With a further increase of the training sample size, up to m = 200, modified bagging and usual bagging become more stable and nearly identical to the averaging rule. This result is consistent with the simulation study performed by Breiman [7] on LDA.

Figure 5.

Weighted bagging (solid line) and usual bagging (dashed line) of the different training sizes from Ripley’s data set. The left panel is based on m = 75, while the right panel is based on m = 100. Note that test error rates for both procedures are estimated over 50 repetitions.

As a result, a classifier developed using a large training sample may be comparable to the classifier developed using the entire training set. Therefore, some features of the classifier built on the whole training set may be present in the bagged classifiers. This phenomenon is most pronounced for the LDA classifier. In general, the performance of modified bagging and usual bagging, as well as of the averaging combining rule, is strongly affected by the training sample size and the data distribution.


5. Conclusion

Combining several decision classifiers in a committee can provide an improvement over the use of a single classifier. Bagging is a general technique which combines benefits of bootstrapping and aggregating. However, bagging is mainly investigated for decision trees and much less in linear discriminant analysis (LDA). Breiman [4] claims that bagging works well for unstable procedures, which often have high variance. This variance is a major component of classification error, which can be reduced by aggregating, often producing good results.

The idea of bagging is extended to averaging the classifiers based on the Manhattan distance in decision trees (see [9]) or the Mahalanobis distance in linear discriminant analysis (LDA), instead of a majority vote. In the first direction, we focused on building decision trees, and the experimental findings and analysis demonstrate that, for all sample sizes, the weighted bagged classifier outperforms conventional bagging at various tree depths. It is interesting to note that weighted bagging with stump trees somewhat outperforms standard bagging. In the second direction, we concentrated on building linear discriminant analysis (LDA) classifiers. We have shown, using Ripley's simulated data, that weighted bagging can provide a good classifier, better than the averaging rule and the usual bagged classifier constructed on a small training sample. As the training sample size increases, the classifier becomes stable, and both modified bagging and usual bagging become identical to the averaging rule. These techniques depend on the training sample size, as well as on the distribution of the data used to construct the classifier. Therefore, it is crucial to consider these factors when using this strategy to enhance classifier performance.

5.1 Potential research directions

We have concluded that LDA could sometimes be preferable to usual bagging. In general, a simple majority vote strategy is not a good choice for the aggregating rule, whereas a weighted vote technique is often a good choice for a bagged classifier in decision trees [9].

Furthermore, weighted bagging may also perform well with the LDA classifier. However, the choice of aggregating technique may be important, and it strongly depends on the training sample size and the distribution of the data.

We only used weighted bagging in this study in the situation of a two-dimensional predictor space; what about the practically more important scenario of many variables? We leave this as an open research question.


Acknowledgments

I would like to thank Professor Charles Taylor from Leeds University for his scientific input to this work.

References

  1. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Le Cam LM, Neyman J, editors. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. Berkeley, CA: University of California Press; 1967. pp. 281-297
  2. Fisher RA. The use of multiple measurements in taxonomic problems. Annals of Eugenics. 1936;7:179-188
  3. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. New York, NY: Springer; 2017
  4. Breiman L. Bagging predictors. Machine Learning. 1996a;24:123-140
  5. Efron B, Tibshirani R. An Introduction to the Bootstrap. London: Chapman & Hall; 1993
  6. Davison AC, Hinkley DV. Bootstrap Methods and their Application. Cambridge: Cambridge University Press; 1997
  7. Breiman L. Bias, variance and arcing classifiers. Technical report, Statistics Department, University of California, Berkeley; 1996b
  8. Ripley BD. Neural networks and related methods for classification (with discussion). Journal of the Royal Statistical Society, Series B. 1994;56:409-456
  9. El Gimati Y. Weighted bagging in decision trees: Data mining. JINAV: Journal of Information and Visualization. 2020;1(1):1-14. DOI: 10.35877/454RI.jinav149
