Open access peer-reviewed chapter - ONLINE FIRST

On the Use of Modified Winsorization with Graphical Diagnostic for Obtaining a Statistically Optimal Classification Accuracy in Predictive Discriminant Analysis

Written By

Augustine Iduseri

Submitted: February 21st, 2022 Reviewed: March 17th, 2022 Published: April 27th, 2022

DOI: 10.5772/intechopen.104539

From the Edited Volume

Principal Component Analysis [Working Title]

Prof. Fausto Pedro García Márquez


Abstract

In predictive discriminant analysis (PDA), the classification accuracy is only statistically optimal if each group sample is normally distributed with different group means, and each predictor variance is similar between the groups. This can be achieved by accounting for homogeneity of variances between the groups using the modified winsorization with graphical diagnostic (MW-GD) method. The MW-GD method involves the identification and removal of legitimate contaminants in a training sample, with the aim of obtaining a truly optimal training sample that can be used to build a predictive discriminant function (PDF) that yields a statistically optimal classification accuracy. However, the use of this method has yet to receive significant attention in PDA. An alternative statistical interpretation of the graphical diagnostic information associated with the method, for situations where it is difficult to differentiate between the shapes of a variable in the groups of the 2-D area plot, remains a problem to be resolved. Therefore, this paper provides a more comprehensive analysis of the idea and concept of the MW-GD method, and proposes an alternative statistical interpretation of the informative graphical diagnostic associated with the method for use when it is difficult to differentiate between the shapes of a variable in the groups of the 2-D area plot.

Keywords

• winsorization
• informative graphical diagnostic
• optimal training sample
• predictive discriminant analysis
• statistically optimal classification accuracy

1. Introduction

Binary classification, compared with multiple-group classification, has a wide range of real-world applications in many areas of human endeavor, such as criminal justice, education, medicine, email analysis, human resources management, pattern recognition, energy and environmental management, financial data analysis and economics, production systems management and technical diagnosis, and marketing, among others. Where the classification criterion comprises one or several predictor variables along with a categorical criterion, such a prediction requires the use of predictive discriminant analysis (PDA). PDA is still the optimal method when the costs of misclassifying groups are clearly different and when there is greater interest in the accuracy of classifying separate groups. In most cases, evaluating the proportion of correct classification of a predictive discriminant function (PDF) in all sub-populations is equivalent to the estimation of the actual hit rate, $P^a$ [1, 2]. That is, $P^a$ is the expected proportion of correct classification when a PDF built from a given training sample is validated on training samples from the same population. In PDA, to improve or optimize the classification accuracy or actual hit rate, researchers often rely on feature selection methods. The aim of feature selection in PDA is to choose the best subset of important predictor variables, which effectively reduces the intricacy of the PDF, thus facilitating interpretation, enhancing or optimizing the classification accuracy, and reducing the training time. Nevertheless, the promise of optimizing classification accuracy using variable selection methods is almost always unfulfilled, because the derived PDF is often obtained from a training sample that does not meet a near-optimal condition [1, 3, 4, 5, 6]. The actual hit rate of a PDF may be considered statistically optimal only if the assumptions of normality and/or homogeneity of variances are taken into account [5, 7]. This means that having a better subset is no guarantee of achieving a statistically optimal classification accuracy.

In general, the task of enhancing or improving classification accuracy has been examined in two ways. Several researchers use feature or variable selection techniques to select the best subset of predictors with which to construct a classification model. In addition to conventional feature selection techniques, such as the stepwise and all-possible-subset methods [4, 8, 9], some widely known and used methods include principal component analysis (PCA), used to obtain a set of low-dimensional features from a large set of features [10, 11]; the branch and bound technique, which uses a greedy procedure to obtain the best subset [12]; the genetic search algorithm [13, 14]; the shrinkage methods [10, 15]; the particle swarm optimization (PSO) approach, a meta-heuristic technique used to enhance classification accuracy [16]; representative methods based on dictionary learning (DL) for classification [17, 18, 19]; support vector machines (SVMs) [20]; and the hyperparameter tuning approach [21, 22]. There is also the sequential analysis approach [23], heteroscedastic discriminant analysis merged with feature selection [24], and the modified leave-one-out cross-validation (LOOCV) method used as an alternative to the all-possible-subset method [25]. A PDF's classification accuracy is only statistically optimal if each group sample is normally distributed with different group means, and each predictor variance is similar between the groups [7]. None of these basic assumptions regarding the validity and reliability of the PDF is considered by any of the above methods. To address these gaps in feature selection techniques, other investigators have sought alternatives in robust PDA, replacing conventional estimators with robust estimators.
Some variants of these alternative methods include dimensionality reduction/feature extraction for outlier detection (DROUT) [26], the minimum covariance determinant (MCD) [27], S-estimators [28], the one-step M-estimator (MOM), and the winsorized one-step M-estimator (WMOM) [29]. These methods concentrate on building a PDF that is robust to deviations due to the presence of outliers in the training sample. Besides the presence of outliers, most training samples also contain hidden influential observations resulting either from an incorrect distributional assumption or from the inherent variability of the dataset [30]. These hidden influential observations are not considered by any of the above methods for optimizing the actual hit rate. Consequently, the PDF solution obtained by either of the two approaches may be optimal but not statistically optimal. To overcome the problem of hidden influential observations, Iduseri and Osemwenkhae [6] proposed a novel method for obtaining an optimal training sample. Their method, known as the modified winsorization with graphical diagnostic (MW-GD) method, yielded a PDF solution that was statistically optimal both for the training sample that gave rise to it and for other training samples from the same population. However, the graphical diagnostic associated with this method may be difficult to interpret if there are no significant differences between the shapes of a variable in the groups of the 2-D area plot, and yet there is evidence of hidden influential observations in the training sample.

This paper provides a more comprehensive analysis of the idea and concept of the MW-GD method, and proposes an alternative statistical interpretation of the informative graphical diagnostic associated with the method for use when it is difficult to differentiate between the shapes of a variable in the groups of the 2-D area plot. The remaining sections of this paper are organized as follows. Sections 2 and 3 discuss the problems posed by the presence of outliers and legitimate contaminants in the training sample that yields the PDF, the concept of statistical optimality of the PDF classification accuracy, and the robustness of the PDF, respectively. Section 4 describes in detail the idea and concept of the modified winsorization with graphical diagnostic for obtaining a statistically optimal training sample, and presents the proposed alternative statistical (numerical) interpretation of the informative graphical diagnostic. Section 5 presents the results and discussion based on the application of two real-life samples, while Section 6 presents the conclusions.

2. Outliers and legitimate contaminants in PDA

In PDA, an outlier is an observation that is not a member of a group, and is often indicative of an incorrect measurement or an incorrect allocation of the unit or observation. Such an outlying observation can cause severe problems that even the robustness of PDA may not overcome. Over the last two decades, many articles have been published on detecting outliers in discriminant analysis (DA) [31, 32, 33, 34, 35, 36, 37, 38]. In PDA, a popular means of treating outliers is to construct multiple PDFs, with assumed outliers added and with assumed outliers removed [1]. The primary issue with this approach is whether potential outliers should be removed one at a time, two at a time, or all at once. With the SPSS DISCRIMINANT procedure, the chi-squared distribution is used to establish the typicality probability. These typicality probabilities are used to identify potential outliers in the context of PDA. However, Huberty and Olejnik [1] pointed out that when the group covariance matrices are not equal, the unit typicality probabilities are difficult to interpret because different distance metrics are used in the calculation. A common distance index used for detecting outliers or influential observations in the context of PDA is the Mahalanobis distance [39], which is also calculated as a byproduct of the SPSS DISCRIMINANT procedure.

However, there are also hidden influential observations (or legitimate contaminants) resulting either from an incorrect distributional assumption (i.e., when the data turn out to have a different structure than originally assumed) or from the inherent variability of the dataset; see Osborne [30], and Iglewicz and Hoaglin [40] for more details. While hidden influential observations may genuinely belong to a training sample, if they are not distributed randomly they may reduce normality, which often leads to violation of the sphericity and multivariate normality assumptions in PDA. Hidden influential observations can also adversely affect the quality of the PDA solution and its generality. Yet how to identify and remove hidden influential observations before building a classification model (particularly in PDA) has not received significant attention in the literature from statisticians or methodologists, and therefore not from substantive researchers either. Moreover, the SPSS typicality or Mahalanobis index may not be able to identify hidden influential observations because, unlike outliers, their units often genuinely belong to a group.

Therefore, much emphasis should be placed on cleaning the training sample to ensure that it meets a near-optimum condition by removing all legitimate contaminants. This is similar to optimizing decision trees (in particular, classification trees), which consists in reducing the amount of impurity; see Myatt [41] for details. In the context of PDA, such cleaning improves the similarity of each predictor variable's variance between groups, thus improving the approximation of the true shape. This will no doubt help guarantee the statistical optimality of the PDF solution or classification accuracy.

3. Statistical optimality of a PDF classification accuracy

A PDF’s classification accuracy is only statistically optimal if each group sample is normally distributed with different group means, and each predictor variance is similar between the groups [5, 7]. In addition to the above requirements, it is recommended that there be at least four to five times as many cases as predictors in order to produce more accurate estimates. Note that in PDA, the failure of the training sample to meet the assumption of normality can result in a decrease in efficiency and accuracy—see Lachenbruch [42] as cited in Klecka [43]. However, a minor violation of this assumption will not decrease the accuracy of the classification. As long as the distributions of predictors are reasonably comparable, the estimation of most multivariate parameters does not require multivariate normality [44]. Moreover, under the central limit theorem, there is no need to worry about the assumption of normality as long as each group sample contains a very large number of observations. As a general rule, a PDF will still perform strongly against non-normality as long as the smallest group has over 20 cases, and the number of predictors is less than six [45]. Due to these robustness properties of PDA, researchers are barely concerned about the assumption of normality.

But where non-normality is due to outliers and/or hidden influential observations rather than skewness, violating this assumption has serious consequences, because PDA is very sensitive to outliers [45]. Likewise, more cases may be classified into the group with greater dispersion when the assumption of equality of variance-covariance matrices is not tenable [45]. In addition, the likelihood of belonging to a group may be distorted, and the PDF may not separate the groups as much as possible [43]. The accuracy of the discriminant weight estimates may be reduced if the variances of the predictors are not all similar between groups; they may be precise but not unbiased [46]. When the homogeneity of variance test is significant, it indicates that the training sample is contaminated with outliers and/or hidden influential observations, and the significance tests are unreliable [3, 45]. It is apparent from the foregoing that, if the assumption of homogeneity of variances is not satisfied, it is probable that the assumption of multivariate normality is likewise not satisfied. This suggests that the multivariate normality and homogeneity of variance assumptions can be taken into account if outliers and hidden influential observations are completely removed from a training sample. The practice of researchers relying on the robustness properties of PDA without checking for outliers and hidden influential observations, which may hinder maximal separation between the groups, seems to be the norm. This practice is further encouraged by the general acceptance of a hit rate 25% above that of chance. A researcher who obtains a 95% hit rate is unlikely to care about the two basic assumptions of PDA.

However, the reason for such good performance might simply be that the data support simple linear or quadratic separation boundaries. The general belief that linear classifiers are robust to minor violations of their basic assumptions (in particular, the assumption of multivariate normality) is often not tenable. Studies have shown that the reliability of a PDF solution depends on adherence to the underlying assumptions [5]. The primary objective of PDA is classification, and if the percentage of correct classifications is not satisfactory, it is likely that the predictor variances are not similar across groups; that is to say, the training sample is not statistically optimal. Therefore, it is necessary to adopt a screening method that will effectively identify and remove legitimate contaminants from training samples before using them to build a PDF. Iduseri and Osemwenkhae [6] proposed the modified winsorization with graphical diagnostic (MW-GD) method to identify and remove legitimate contaminants from training samples. The MW-GD method produced a statistically optimal training sample when applied to a real dataset, and the resulting PDF yielded a hit rate that was statistically optimal. As a result, the uncertainty about the PDF's actual hit rate was greatly reduced. However, the informative graphical diagnostic associated with the proposed method may be difficult to interpret if there are no significant differences between the shapes of a variable in the groups of the 2-D area plot.

This paper proposes an alternative statistical interpretation of the informative graphical diagnostic when confronted with the challenge of differentiating between a variable shape in the groups of the 2-D area plot.

4. Identification and removal of legitimate contaminants in PDA

4.1 Correction for bias of discriminant weights in PDA

In general, extreme scores or outliers bias estimates of any parameter. One notable way to correct for bias is to change the data by altering the scores so as to reduce the impact of the extreme scores or adjust the shape of the distribution. Notable variants of this score-changing approach include transforming the data, trimming the data, and winsorizing the data. However, in PDA, where one is interested in differences between sets of variables or groups, transformation may not be a good choice for correcting the bias of discriminant weights. This is because transformation can change the units of measurement, which may in turn affect the interpretation of the data, because the data now relate to a different construct than the original data [47]. Similarly, trimming the data seems wasteful, since one simply discards lots of data. To overcome these inherent drawbacks of both methods, the winsorization method was adopted. This approach involves replacing a percentage of the highest scores with the next highest score in the data, while the same percentage of the lowest scores is replaced with the next lowest score. One major challenge with this method is that even the next highest or next lowest score might still be an extreme score. Another variant of winsorization involves replacing an extreme score with a score three standard deviations from the mean. This variant also suffers a major drawback: as noted by Field [47], the standard deviation will itself be biased by extreme scores, so scores are being replaced with a value that has been biased by the very scores being replaced. To address the observed shortcomings of both variants of the winsorization method, Iduseri and Osemwenkhae [6] proposed the modified winsorization with graphical diagnostic (MW-GD) method. The method proved very effective in identifying and removing legitimate contaminants.
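As an illustration, the classic percentage-based winsorization described above can be sketched in a few lines. This is the conventional variant being critiqued, not the MW-GD method, and the data values are hypothetical:

```python
from statistics import mean

def winsorize(scores, k):
    """Classic winsorization: pull the k smallest scores up to the (k+1)-th
    smallest and the k largest down to the (k+1)-th largest."""
    s = sorted(scores)
    low, high = s[k], s[len(s) - k - 1]
    return [min(max(x, low), high) for x in s]

data = [2, 5, 6, 7, 8, 9, 10, 11, 12, 40]   # 2 and 40 are extreme scores
w = winsorize(data, 1)
print(w)                     # [5, 5, 6, 7, 8, 9, 10, 11, 12, 12]
print(mean(data), mean(w))   # 11.0 8.5
```

Note that the replacement values are themselves order statistics of the contaminated sample, which is precisely the drawback noted above: the "next highest" score may still be an extreme score.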

4.2 The modified winsorization with graphical diagnostic (MW-GD) method

In this section, the modified winsorization with graphical diagnostic (MW-GD) method originally proposed by Iduseri and Osemwenkhae [6] is presented. In addition, a proposed alternative statistical interpretation of the informative graphical diagnostic associated with the MW-GD method, for use when it is difficult to differentiate between bar shapes in the 2-D area plot, is also presented. The MW-GD method, which involves a three-step procedure, effectively identifies and eliminates legitimate contaminants from predictor variables so that their variances between the groups are similar. The aim is to ensure that the training sample, $D_N^t$, satisfies the basic assumptions (particularly the assumption of homogeneity of variances) of PDA. The three-step procedure produces an optimal training sample that can be used to construct a PDF whose percentage of correct classification is statistically optimal not only for the training sample that produced it, but also for other training samples from the same population.

4.2.1 Algorithm for the modified winsorization with graphical diagnostic (MW-GD)

Let $D_N^t = \left[x_1, x_2, \dots, x_N\right]_{P \times N}$ be a training sample matrix that comprises the set of $N$ observations $\{(x_i, y_i)\}_{i=1}^{N}$, obtained from a real dataset using any of the conventional feature selection techniques, where $x_i \in \{1, 2, \dots, P\}$ denotes the corresponding predictor variable label, $y_i \in \{1, 2, \dots, K\}$ denotes the corresponding group label, $P$ is the number of predictor variables, and $K$ is the number of groups.

Step 1: Identification of the predictor variables with legitimate contaminants.

For the training sample $D_N^t$, the scores or observations of the predictor variables, $X_1, \dots, X_N$, are first arranged in ascending order so that the extreme scores at both ends can be identified. For each predictor variable in the unordered training sample $D_N^t$, a pair of scores (i.e., the most extreme and the least extreme scores initially identified at both ends of each distribution) is deleted and replaced with the median value, before the mean of the remaining scores is calculated. The median value was adopted in order to satisfy the assumption of independence of all cases: the median is a positional average independent of all other cases, whereas the mean depends on all other cases. Consequently, substituting the identified influential observations with the median value respects the assumption that all observations must be independent. This process is repeated for the other, less extreme pairs of values, and stops when five modified winsorized means have been obtained for each predictor variable. When the modified winsorized mean values are plotted, any predictor variable whose bar shapes are not similar between groups in the 2-D area plot is identified as a predictor variable with legitimate contaminants.
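Under the assumption that the median of the full sample is reused as the replacement value at every stage, Step 1 can be sketched as follows for a single predictor variable in one group (the scores below are hypothetical):

```python
from statistics import mean, median

def modified_winsorized_means(scores, max_pairs=5):
    """MW-GD Step 1 (sketch): for k = 0..max_pairs, replace the k smallest
    and k largest scores with the sample median, then take the mean of the
    result. Assumes the median of the full sample is reused at every stage."""
    med = median(scores)
    s = sorted(scores)
    n = len(s)
    means = []
    for k in range(max_pairs + 1):
        modified = [med] * k + s[k:n - k] + [med] * k
        means.append(round(mean(modified), 2))
    return means

# Hypothetical scores for one predictor variable in one group
group = [341.5, 330.0, 325.0, 320.0, 318.0, 315.0, 310.0, 305.0, 300.0, 295.0]
print(modified_winsorized_means(group))
# [315.95, 315.6, 315.9, 316.2, 316.5, 316.5]
```

In the full method, these values (the raw mean at k = 0 plus five modified winsorized means) are computed for every predictor variable in every group and plotted as a 2-D area plot; bars whose shape differs between groups flag the variable with legitimate contaminants.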

Step 2: Removing legitimate contaminants from the identified predictor variables.

To determine what percentage of winsorization is required to eliminate the legitimate contaminants, the modified winsorization process is repeated only for the predictor variable(s) identified as containing legitimate contaminants until the highest hit rate is attained, thus obtaining a near-optimal training sample given as:

$$D_{N,\text{optimal}}^t = \left[x_1, x_2, \dots, x_P\right] \tag{1}$$

Step 3: Obtaining a statistically optimal hit rate.

The optimal training sample of Eq. (1) is then used to build the optimized PDF, $Z_{\text{optimal}}$, given as:

$$Z_{\text{optimal}} = u_1 X_1 + u_2 X_2 + \dots + u_P X_P = \eta\left(D_{N,\text{optimal}}^t\right) \tag{2}$$

where $Z_{\text{optimal}}$ is the optimized PDF, the $u_i$ are the discriminant weights, the $X_i$ are the predictor variables, and $\eta\left(D_{N,\text{optimal}}^t\right)$ indicates that the PDF is constructed from an optimal training sample.

To get a statistically optimal hit rate, let:

$$d_j = \begin{cases} 1 & \text{if } \hat{Z}_j = Z_j \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

where $\hat{Z}_j$ is the predicted response for the $j$th observation in the optimized training sample, and $Z_j$ is the actual value for the $j$th observation in the optimized training sample. Therefore, a statistically optimal hit rate for the optimized PDF in Eq. (2) is given as:

$$P^a = \frac{1}{N_t}\sum_{j=1}^{N_t} d_j \times 100 \tag{4}$$

where $N_t$ is the total number of cases over all groups in the optimized training sample. If we redefine $d_j$ as:

$$d_j = \begin{cases} 1 & \text{if } \hat{Z}_j = Z_j \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

where $\hat{Z}_j$ is the predicted response for the $j$th observation computed with the $j$th observation removed from the training sample, then the leave-one-out cross-validation (LOOCV) estimate of the optimized hit rate (4) is given by:

$$\hat{P}_{\text{LOOCV}}^a = \frac{1}{N_v}\sum_{j=1}^{N_v} d_j \times 100 \tag{6}$$

where $N_v$ is the number of validation samples.
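The hit-rate computations in Eqs. (3)-(6) reduce to counting matches between actual and predicted group labels. A minimal sketch, with hypothetical labels standing in for the PDF's output, is:

```python
def hit_rate(actual, predicted):
    """Eqs. (3)-(4): percentage of cases whose predicted group label matches
    the actual group label (d_j = 1 on a match, 0 otherwise)."""
    d = [1 if zhat == z else 0 for zhat, z in zip(predicted, actual)]
    return 100.0 * sum(d) / len(d)

# Hypothetical group labels for a 10-case training sample
actual = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
resub  = [1, 1, 2, 1, 1, 2, 2, 2, 1, 2]   # resubstitution predictions
loo    = [1, 2, 2, 1, 1, 2, 2, 2, 1, 2]   # each case predicted with itself held out

print(hit_rate(actual, resub))   # 80.0 -- apparent hit rate, Eq. (4)
print(hit_rate(actual, loo))     # 70.0 -- LOOCV estimate, Eq. (6)
```

The only difference between Eq. (4) and Eq. (6) is how the predictions are produced: for the LOOCV estimate, each case is classified by a PDF rebuilt with that case left out.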

4.3 The proposed alternative statistical interpretation for the informative graphical diagnostic

As described in Step 1 of Section 4.2.1, the graphical representation of the modified winsorized means makes it possible to easily identify the predictor variable(s) whose variance is not similar between the groups. If the variance of a predictor variable is not similar between groups, the bars representing the modified winsorized mean values for that variable in the 2-D area plot will not have similar shapes. Such a variable is interpreted as the identified variable containing legitimate contaminants. This means that the more the shapes of the bars differ between the groups, the easier it is to interpret the 2-D area plot. It is therefore necessary to provide an alternative interpretation for when it is difficult to differentiate between the shapes of a variable in the groups of the 2-D area plot. Accordingly, a simple statistical or numerical interpretation of the informative graphical diagnostic is proposed for such cases. This alternative numerical interpretation consists of the following two simple steps:

Step 1: Fitting the modified winsorized means values to a linear regression model.

For each group, the modified winsorized means and their corresponding winsorized percentage values for each predictor variable are fitted to a linear regression model given as:

$$Y_{11} = a + b_{11}X, \quad \dots, \quad Y_{1P} = a + b_{1P}X \tag{7}$$

and

$$Y_{21} = a + b_{21}X, \quad \dots, \quad Y_{2P} = a + b_{2P}X \tag{8}$$

where $Y_{11}, \dots, Y_{1P}$ and $Y_{21}, \dots, Y_{2P}$ are the modified winsorized mean values for the $P$ predictor variables in groups 1 and 2, respectively, and $X$ is the corresponding winsorized percentage value.

Step 2: Obtaining the absolute difference between the corresponding regression coefficients for the groups.

The absolute difference between the obtained regression coefficients (i.e., the slopes) for groups 1 and 2 is computed as:

$$\delta_{\text{abs}} = \left|b_{1i} - b_{2i}\right|, \quad i = 1, \dots, P \tag{9}$$

where the subscripts 1 and 2 represent groups 1 and 2, respectively. The predictor variable with an absolute difference of 0.75 or greater is identified as the variable with legitimate contaminants. In PDA, if two samples are equal in size, there is always a 50/50 chance of correct classification, and most researchers would accept a classification accuracy 25% greater than that expected by chance; hence the choice of 0.75 as the decision boundary.
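The two steps can be sketched as follows, using the winsorized percentages and the X2 modified winsorized means reported in Table 1 as input. A closed-form least-squares slope stands in for a full regression fit, since only the coefficient b is needed:

```python
def slope(x, y):
    """Least-squares slope b of the fit y = a + b*x (Eqs. (7)-(8))."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    den = sum((xi - xbar) ** 2 for xi in x)
    return num / den

# Winsorized % (X) and the X2 modified winsorized means from Table 1
pct = [0, 4, 8, 12, 16, 20]
g1  = [329.48, 317.32, 307.60, 298.32, 290.30, 282.70]   # X2, group 1
g2  = [177.20, 173.20, 169.16, 166.68, 166.22, 165.94]   # X2, group 2

delta = abs(slope(pct, g1) - slope(pct, g2))   # Eq. (9)
print(round(delta, 2))   # 1.75
print(delta >= 0.75)     # True: X2 is flagged as contaminated
```

For X2 the absolute slope difference is about 1.75, well above the 0.75 decision boundary, which matches the graphical verdict in Section 5 that X2 is the variable carrying the legitimate contaminants.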

5. Computational results and discussion

5.1 Using the modified winsorization with graphical diagnostic (MW-GD) method

To evaluate the effectiveness of the modified winsorization with graphical diagnostic (MW-GD) method, two real samples (see [6] for the two datasets) were considered, with the second used as a validation sample. The first training sample is from a renowned financial journal among Japanese business leaders, comparable to the Economist, Financial Times, and Business Week in Europe and the United States of America. This dataset contains 50 observations from each of the two groups of Japanese financial institutions, each bank being evaluated using the following seven performance indexes: (1) return on total assets (= total profits/average total assets), (2) labor profitability (= total profits/total employees), (3) equity to total assets (= total equity/average total assets), (4) total net working capital, (5) return on equity (= earnings available for common/average equity), (6) cost-profit ratio (= total operating expenditures/total profits), and (7) bad loan ratio (= total bad loans/total loans). Taking into account the beneficial effect of feature selection and outlier detection as a preprocessing step, a best subset and the critical values of the Mahalanobis distance were first obtained using the SPSS stepwise method and its COMPUTE command for the critical value of the Mahalanobis distance. The stepwise approach produced a best subset comprising return on total assets (X1), labor profitability (X2), equity to total assets (X3), and bad loan ratio (X7). The SPSS output of the PDF constructed from the training sample of four variables is given by:

$$Z = 0.005X_1 + 0.006X_2 + 0.004X_3 + 0.005X_7 \tag{10}$$

Thus, the classification accuracy of the PDF $Z$ in Eq. (10) and its LOOCV estimate are given by Eqs. (11) and (12).

$$P^a = \frac{1}{N_t}\sum_{j=1}^{N_t} d_j \times 100 = 86.0\% \tag{11}$$

$$\hat{P}_{\text{LOOCV}}^a = \frac{1}{N_v}\sum_{j=1}^{N_v} d_j \times 100 = 85\% \tag{12}$$
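For illustration, scoring a new bank with the discriminant weights of Eq. (10) is a single weighted sum; the case values below are invented for the example:

```python
# Discriminant weights from Eq. (10); the case values are hypothetical
weights = {"X1": 0.005, "X2": 0.006, "X3": 0.004, "X7": 0.005}
case    = {"X1": 420.0, "X2": 300.0, "X3": 530.0, "X7": 830.0}

z = sum(weights[v] * case[v] for v in weights)
print(round(z, 3))   # 10.17
```

Group assignment would then compare the score z with a cutting score, which SPSS reports alongside the function but which is not reproduced here.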

The two critical values from the SPSS output, using 0.95 and 0.99 as the probabilities in the IDF.CHISQ function with four predictor variables, were given as:

COMPUTE critical = IDF.CHISQ(0.95, 4) = 9.49
COMPUTE critical = IDF.CHISQ(0.99, 4) = 13.28
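IDF.CHISQ(p, df) returns the chi-square quantile. Outside SPSS, essentially the same critical values can be reproduced in plain Python via the Wilson-Hilferty cube-root approximation (an approximation, so it lands near, not exactly on, the SPSS values):

```python
from math import sqrt
from statistics import NormalDist

def chi2_critical(p, df):
    """Approximate the chi-square quantile (what SPSS IDF.CHISQ(p, df)
    computes exactly) via the Wilson-Hilferty cube-root approximation."""
    z = NormalDist().inv_cdf(p)        # standard normal quantile
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * sqrt(c)) ** 3

print(round(chi2_critical(0.95, 4), 2))   # about 9.46  (SPSS: 9.49)
print(round(chi2_critical(0.99, 4), 2))   # about 13.31 (SPSS: 13.28)
```

An exact quantile (e.g., scipy.stats.chi2.ppf) should be preferred when available; the approximation is shown only to make the SPSS output reproducible with the standard library.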

The Mahalanobis distance values for all cases, as reported in the casewise statistics table, were all lower than the two critical values, suggesting that there are neither outliers nor hidden influential observations in the dataset or training sample. Next, the MW-GD algorithm was applied to the training sample $D_N^t$ consisting of four predictor variables.

At step 1 of the MW-GD method, a total of five modified winsorized means (in addition to the unwinsorized mean) was obtained for each of the four predictor variables. The summary of the winsorized mean values is presented in Table 1.

| Group | No. of replaced sample points | X1 | X2 | X3 | X7 | Winsorized % |
|---|---|---|---|---|---|---|
| 1 | 0 | 428.70 | 329.48 | 536.34 | 821.32 | 0 |
| 1 | 2 | 421.08 | 317.32 | 533.90 | 824.50 | 4 |
| 1 | 4 | 415.50 | 307.60 | 533.70 | 827.40 | 8 |
| 1 | 6 | 412.28 | 298.32 | 533.56 | 830.36 | 12 |
| 1 | 8 | 408.90 | 290.30 | 534.46 | 833.62 | 16 |
| 1 | 10 | 406.20 | 282.70 | 535.42 | 835.56 | 20 |
| 2 | 0 | 341.50 | 177.20 | 368.46 | 778.70 | 0 |
| 2 | 2 | 335.24 | 173.20 | 366.92 | 785.80 | 4 |
| 2 | 4 | 329.66 | 169.16 | 365.20 | 790.74 | 8 |
| 2 | 6 | 324.76 | 166.68 | 362.94 | 794.22 | 12 |
| 2 | 8 | 319.86 | 166.22 | 360.68 | 797.30 | 16 |
| 2 | 10 | 315.44 | 165.94 | 358.66 | 800.12 | 20 |

Table 1.

Modified winsorized means for up to five pairs of winsorized values.

To provide a meaningful interpretation of Table 1, the modified winsorized means for both groups, as shown in Table 1, were plotted as a 2-D area plot in Excel. The process involves entering the modified winsorized means of the first variable, X1, for both groups into the Excel spreadsheet in pairs (with the X1 values for groups 1 and 2 occupying the first two columns from row 1 to row 6), followed by variable X2 (with the X2 values for groups 1 and 2 occupying columns 3 and 4 from rows 8 to 13), and then variables X3 and X7, proceeding downward in steps. The graphical representation is presented in Figure 1.

A cursory look at Figure 1 shows that the winsorized mean values for the four predictor variables in both groups, represented by the 2-D area plot, have similar shapes (i.e., similar variances within the groups) except for predictor variable X2. In Figure 1, the bar shape of predictor variable X2 in group 1 looks like a rectangle, whereas in group 2 it looks like a trapezium. The observed difference in the shape of the X2 bars indicates that variable X2 does not have similar variances in the groups, and it is therefore the only variable with legitimate contaminants. This means that the training sample does not satisfy the assumption of homogeneity of variances. This finding was corroborated by the Box's M test for equality of variance-covariance matrices for this training sample, which was significant. Apart from manually replacing the extreme values at both ends with the median value, the average calculation time needed to generate each row of results for groups 1 and 2 in Table 1, using the R aggregate() function for each percentage of winsorization, was 2 seconds.

At step 2 of the MW-GD method, the modified winsorization process was performed for variable X2 only. For each percentage of winsorization, a PDF was constructed using all four predictor variables. The summary of hit rate results for each percentage of winsorization is presented in Table 2. Table 2 shows that the highest hit rate of 97.00% was achieved when 5 data points at each end of the data were replaced by the median value. This means that all the legitimate contaminants in variable X2 were completely replaced at 20% winsorization. In other words, at 20% winsorization, the lack of homogeneity of variances observed in variable X2 was accounted for, thus obtaining a near-optimal training sample, $D_{N,\text{optimal}}^t$.

| % of winsorization | No. of replaced sample points | % of training sample correctly classified |
|---|---|---|
| 4 | 2 | 85.00 |
| 8 | 4 | 88.00 |
| 12 | 6 | 91.00 |
| 16 | 8 | 94.00 |
| 20 | 10 | 97.00ᵃ |
| 24 | 12 | 96.00 |

Table 2.

Summary of hit rate results for each percent of modified winsorization for predictor variable, X2.

ᵃ Optimal winsorization occurs at 20%, with hit rate = 97.00%.
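Step 2 is essentially a search over winsorization levels. A sketch of the selection logic follows, with a stand-in evaluator that simply replays Table 2's hit rates in place of the real "winsorize X2, rebuild the PDF, score the training sample" computation:

```python
def best_winsorization(levels, evaluate):
    """MW-GD Step 2 (sketch): evaluate the hit rate at each winsorization
    level and keep the level that maximizes it."""
    best = max(levels, key=evaluate)
    return best, evaluate(best)

# Stand-in evaluator: replays the hit rates reported in Table 2
# (keys are winsorization percentages, values are hit rates in %).
table2 = {4: 85.0, 8: 88.0, 12: 91.0, 16: 94.0, 20: 97.0, 24: 96.0}

level, rate = best_winsorization(list(table2), table2.get)
print(level, rate)   # 20 97.0
```

Because the hit rate declines once the optimum is passed (96.00 at 24% in Table 2), a practical variant can stop the search as soon as the rate drops.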

The optimized training sample $D_{N,\text{optimal}}^t$ was then used to construct a PDF. The SPSS output for the obtained PDF is given as:

$$Z_{\text{opt}} = 0.003X_1 + 0.018X_2 + 0.001X_3 + 0.004X_7 \tag{13}$$

Thus, the hit rate of the PDF $Z_{\text{opt}}$ in Eq. (13) and its LOOCV estimate are given by Eqs. (14) and (15).

$$P^a = \frac{1}{N_t}\sum_{j=1}^{N_t} d_j \times 100 = 97.0\% \tag{14}$$

$$\hat{P}_{\text{LOOCV}}^a = \frac{1}{N_v}\sum_{j=1}^{N_v} d_j \times 100 = 91\% \tag{15}$$

In addition to the dataset from Japanese banks, a second real dataset was used to validate the first set of results, (11), (12), (14), and (15). This validation sample was obtained from the academic records of junior secondary school (JSS) 2 students at the University Demonstration Secondary School (UDSS), University of Benin, Nigeria. The dataset contains 30 observations for each of two classes: Science and Art. It consists of average scores over three consecutive terms for eleven (11) subjects: English Language (X1), Mathematics (X2), Integrated Science (X3), Social Studies (X4), Introductory Technology (X5), Business Studies (X6), Home Economics (X7), Agricultural Science (X8), Fine Art (X9), Physical and Health Education (X10), and Computer Studies (X11). Using the SPSS stepwise method, a subset of three variables comprising Introductory Technology (X5), Physical and Health Education (X10), and Computer Studies (X11) was obtained. The SPSS output for the PDF is given as:

Z = 0.135X5 − 0.102X10 + 0.058X11    (16)

Thus, the hit rate of the PDF Z in Eq. (16) and its LOOCV estimate are given by Eqs. (17) and (18).

P(a) = (1/Nt) Σ_{j=1}^{Nt} dj × 100 = 85.0%    (17)

P̂LOOCV(a) = (1/N) Σ_{j=1}^{nv} dj × 100 = 83%    (18)

Also, the two critical values from the SPSS outputs, obtained by using 0.95 and 0.99 as the probabilities in the IDF.CHISQ function with three predictor variables, were given as:

COMPUTE critical = IDF.CHISQ(0.95, 3) = 7.81
COMPUTE critical = IDF.CHISQ(0.99, 3) = 11.34

The Mahalanobis distance values for all cases, as reported in the casewise statistics table, were all lower than the two critical values. This means that there are neither outliers nor hidden influential observations in this dataset or training sample. Once again, the proposed algorithm was applied to this second training sample, DNt, consisting of three predictor variables.
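This screening step can be sketched as follows. The critical values 7.81 and 11.34 are taken from the SPSS output quoted above; `mahalanobis_d2` is an illustrative numpy implementation, not the SPSS casewise computation itself:

```python
import numpy as np

def mahalanobis_d2(X):
    """Squared Mahalanobis distance of each row of X from the sample mean,
    using the sample covariance matrix (ddof = 1, as np.cov defaults to)."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    diff = X - mu
    return np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)

# Chi-square critical values quoted from the SPSS IDF.CHISQ output
# (probabilities 0.95 and 0.99, three predictor variables):
CRIT_95, CRIT_99 = 7.81, 11.34

def screen_outliers(X):
    """Flag cases whose squared distance exceeds either critical value."""
    d2 = mahalanobis_d2(X)
    return d2 > CRIT_95, d2 > CRIT_99
```

If no case exceeds either critical value, the training sample is taken to be free of outliers and hidden influential observations, as found here.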

At step 1 of the MW-GD method, the modified winsorized means for up to five pairs of winsorized values were obtained for each of the three predictor variables. The summary of the modified winsorized mean values is presented in Table 3.

Group | No. of replaced sample points | X5    | X10   | X11   | Winsorized %
1     |  0                            | 67.27 | 69.77 | 73.30 |  0.00
      |  2                            | 67.40 | 70.07 | 73.43 |  6.67
      |  4                            | 67.43 | 70.17 | 73.60 | 13.33
      |  6                            | 67.47 | 70.13 | 73.67 | 20.00
      |  8                            | 67.53 | 70.13 | 73.70 | 26.67
      | 10                            | 67.60 | 70.17 | 73.73 | 33.33
2     |  0                            | 51.47 | 62.40 | 63.93 |  0.00
      |  2                            | 51.27 | 62.53 | 63.97 |  6.67
      |  4                            | 51.07 | 62.70 | 63.73 | 13.33
      |  6                            | 50.97 | 62.83 | 63.60 | 20.00
      |  8                            | 50.77 | 63.00 | 63.50 | 26.67
      | 10                            | 50.73 | 63.23 | 63.30 | 33.33

(X5, X10, X11: modified winsorized mean values.)

Table 3.

Winsorized means for up to five pairs of winsorized values.

Again, to interpret Table 3, the modified winsorized means for both groups, as shown in Table 3, were plotted with a 2-D area plot in Excel. The graphical representation is presented in Figure 2.

A cursory look at Figure 2 shows that the winsorized average values for the three predictor variables in both groups, represented by the 2-D area plot, have a similar shape (i.e., similar variances within the groups). The similar shape shown by the three variables in each group indicates that there are no legitimate contaminants in the training sample, DNt. This implies that the fit between the training sample DNt and the basic assumptions of PDA (in particular, the assumption of homoscedasticity) is sufficient to construct a PDF whose hit rate can be said to be statistically optimal. Therefore, for this dataset, the MW-GD algorithm ends at step 1. The initial training sample of three variables obtained from the second real dataset using the SPSS stepwise method is therefore an optimal training sample.

5.2 Using the proposed alternative statistical interpretation for the informative graphical diagnostic

If the modified winsorized means for each variable in Table 1 are denoted as the dependent variable Y, Y = (Y1, Y2, Y3, Y4), and the winsorized percent as the independent variable X, then the six pairs of values for X (0, 4, 8, 12, 16, 20) and Y1 (428.70, 421.07, 415.50, 412.28, 408.90, 406.20) are well described by a linear regression model.

At step 1 of the proposed alternative statistical interpretation for the informative graphical diagnostic, the six values of the modified winsorized means for each of the four variables in groups 1 and 2 are fitted against the six values of the winsorized percent in a linear regression model. The obtained values of the regression coefficient b for the four fitted regression models, for each of groups 1 and 2, are summarized in Table 4.

Group               | X1     | X2     | X3     | X7
1                   | −1.088 | −2.316 | −0.022 | 0.725
2                   | −1.295 | −0.569 | −0.500 | 1.036
δabs = |b1i − b2i|  |  0.2   |  1.8   |  0.5   | 0.3

Table 4.

Regression coefficients, b, of the regression models fitted to the Table 1 data.

At step 2 of the proposed alternative approach, the absolute difference between the obtained regression coefficients (i.e., the slopes) for each variable in groups 1 and 2 was computed. These absolute values are presented in the last row of Table 4.
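The two-step diagnostic can be sketched in Python using the X1 series quoted above for group 1. The names `slope` and `flag_contaminated` are illustrative, while the 0.75 decision boundary is the one used in the text:

```python
import numpy as np

# Winsorized percent (X) and the group-1 winsorized means of X1 (Y1)
# quoted in the text above.
pct = np.array([0, 4, 8, 12, 16, 20], dtype=float)
y1_g1 = np.array([428.70, 421.07, 415.50, 412.28, 408.90, 406.20])

def slope(x, y):
    """Step 1: least-squares slope b of a simple linear regression of y on x."""
    return np.polyfit(x, y, 1)[0]

b = slope(pct, y1_g1)  # close to the -1.088 reported for group 1, X1 in Table 4

def flag_contaminated(b_group1, b_group2, boundary=0.75):
    """Step 2: flag a variable when |b1 - b2| exceeds the decision boundary."""
    return abs(b_group1 - b_group2) > boundary

# With the Table 4 coefficients for X2: |(-2.316) - (-0.569)| > 0.75,
# so X2 is flagged as carrying legitimate contaminants.
assert flag_contaminated(-2.316, -0.569)
```

A variable that is flagged is then submitted to the modified winsorization of step 2 of the MW-GD method; an unflagged variable is left unchanged.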

The two-step approach of the proposed alternative method was repeated using the data of Table 3. The obtained values of the regression coefficient b for the three fitted regression models, for each of groups 1 and 2, are summarized in Table 5, with the absolute differences presented in its last row.

Group               | X5     | X10   | X11
1                   |  0.009 | 0.009 |  0.013
2                   | −0.023 | 0.024 | −0.020
δabs = |b1i − b2i|  |  0.0   | 0.0   |  0.0

Table 5.

Regression coefficients, b, of the regression models fitted to the Table 3 data.

A cursory look at Table 4 shows that the regression coefficient (−2.316) of X2 in group 1 differs markedly from the corresponding regression coefficient (−0.569) of X2 in group 2, and from the regression coefficients obtained for the variables X1, X3, and X7 in groups 1 and 2, respectively. Also, the value of 1.8, the absolute difference between the regression coefficients of X2 in groups 1 and 2, is well above the decision boundary value of 0.75. This indicates that variable X2 does not have similar variances in the groups formed by the grouping variable, and it is thus identified as the variable with legitimate contaminants. In Table 5, the values of the regression coefficients for the three variables in groups 1 and 2 are approximately equal. Also, the absolute differences between the regression coefficients of X5, X10, and X11 in groups 1 and 2 are all equal, and well below the decision boundary value of 0.75. This indicates that there are no legitimate contaminants in X5, X10, or X11, and therefore that the fit between the validation sample DNt and the basic assumptions of PDA is sufficient to construct a PDF whose hit rate can be said to be statistically optimal.

6. Conclusions

This paper addresses the issue of achieving a statistically optimal classification accuracy in PDA by first achieving an optimal training sample. For the first real dataset, a training sample of four variables was obtained using the SPSS stepwise method; this training sample gave a hit rate of 86.0%. When all legitimate contaminants in one of the four variables had been identified and eliminated using the MW-GD method, an optimal training sample was achieved. The optimized training sample was used to construct the PDF Zopt in Eq. (13), which gave a classification accuracy of 97.0% when tested on the initial training sample of four variables. This significant increase in classification accuracy suggests that the MW-GD method effectively enhances the similarity of each predictor variable's variance between groups, thus accounting for the basic assumptions needed to achieve a statistically optimal classification accuracy. Using the second real dataset, a training sample of three variables was obtained and used to construct the PDF Z in Eq. (16), which yielded an 85.0% hit rate. When the modified winsorized mean values of the three variables were plotted, the shapes of the three variables in the two groups were similar. This means that the 85.0% hit rate of the PDF Z cannot be increased further, because the training sample from which it was built was already statistically optimal.

The uniqueness of the MW-GD method lies in its ability to effectively identify and eliminate legitimate contaminants in one or several predictor variables, thus resolving any significant differences in the variances of the predictor variables between the groups. In other words, the MW-GD method is unique in its ability to sufficiently account for the basic assumptions required to achieve statistically optimal classification accuracy in PDA. As a result, an optimal training sample obtained from the first real dataset gave a statistically optimal hit rate of 97.0%, compared with an initial maximum hit rate of 86.0%. For the second real dataset, the method was successful in confirming the optimality of the initial training sample obtained using the SPSS stepwise method. Similarly, the graphical diagnostic was able to identify the predictor variable(s) whose variance was not similar between the groups. Consequently, the graphical diagnostic associated with the proposed method could be used as an alternative graphical test of homogeneity of variances in PDA.

Another important contribution of this paper to the MW-GD method is the proposed alternative statistical interpretation for the graphical diagnostic associated with the method, demonstrated in Subsection 5.2. This alternative statistical interpretation proved very effective at identifying the variable with legitimate contaminants, and could serve as a useful alternative tool whenever it is difficult to differentiate between the variable shapes in the groups of the 2-D area plot.

Finally, only two real training samples have been used, so the validity of the experimental results is limited to the scope of the datasets used. More experimental results are therefore needed before a final conclusion can be reached on the efficiency of the MW-GD method compared with classical alternatives known to improve classification accuracy in PDA.

Nomenclature

DNt

this is the complete list of objects used in building the predictive discriminant function (PDF)

DNt

this is the complete list of objects without outliers and hidden influential observation used in building the predictive discriminant function (PDF)

K

the number of groups or categorical criterion

N

total number of cases over all groups in a dataset

Nt

total number of cases over all groups in the optimized training sample

nv

the number of validation samples

P

the number of predictor variables

Pa

this is the percentage of cases on the diagonal of the confusion matrix, or simply the percent of correct classification

p^a

this is the hit-rate obtained by applying a rule based on a particular training sample to future samples taken from the same population

p^LOOCVa

this is the leave-one-out cross-validation (LOOCV) estimate of the optimized hit rate

Xi

predictor variables

ui

discriminant weights

Z

a predictive discriminant function (PDF) created by a linear combination of observable variables

ZOptimal

an optimized predictive discriminant function (PDF) created by a linear combination of observable variables without outliers and hidden influential observations

Zj

value for the jth observation in the optimized training sample

Z^j

predicted response for the jth observation in the optimized training sample

Z^j

predicted response for the jth observation calculated with the jth observation removed from the training sample

