Summary of the 10-fold cross-validation calibration process - The Δ settings giving best accuracy rate concerning training blocs.

## Abstract

In data mining, classification is the process of assigning one among previously known classes to a new observation. Mathematical algorithms are used intensively for classification. In these, a generalization is inferred from the data so as to classify new cases, or individuals. The algorithm may misclassify an individual if the inference machine is not able to discriminate it sufficiently. Therefore, it is necessary to go further into the analysis of the information provided by the individual, until it can be sufficiently identified as belonging to a class. This chapter develops this idea for the improvement of a certain class of classifiers, using medical data sets to validate the new algorithm proposed here: the Multivariate-Stepwise Gaussian Classifier (MSGC). The results show that MSGC is at least as competitive as the Gaussian Maximum Likelihood Classifier. MSGC attained the greater accuracy rate in two of the data sets and obtained identical results in the two remaining data sets. Concerning medical applications, once a classification method has been successfully validated for a particular scope of data, the recommendation would be to use it for the best diagnosis. Meanwhile, other algorithms could be tested until they prove effective enough to be put into practice.

### Keywords

- data mining
- classification
- algorithm
- medical diagnosis and prediction

## 1. Introduction

Mankind has performed classification since ancient times, as a part of daily life and survival. As humanity has evolved, our motivation to classify has become broader and more complex, encompassing a wide variety of fields such as engineering, management, banking, marketing, psychology, and medical diagnosis and prediction.

In the context of data mining, classification can be understood as the process of assigning to a new observation (sample) one among a set of previously known classes. The rapid increase in computational processing capacity, coupled with the low cost of storage, has contributed to the greater use of supervised or unsupervised mathematical algorithms for computational classification. In the learning phase of these algorithms, a certain kind of generalization is inferred from the data, so that new cases, or individuals, can be classified by the inference machine.

In the medical field, there are several examples of research successfully applying computational classification as an aid to medical diagnosis. One instance is the research in [1], which applies a multivariate statistical analysis to explore the Dermatology Data Set (available in the UCI data repository [2]) and constructs a classifier, based only on the 12 clinical attributes, as an aid for the first medical consultation and diagnosis of erythemato-squamous dermatological diseases. The results provide knowledge that can help to enrich the dermatological diagnoses made by doctors. The classifier, developed using linear discriminant analysis (LDA), obtains a high mean accuracy rate over the six diseases (83.73% correct classifications). This rate means that patients have a good chance of being treated adequately, while biopsies may still be requested to confirm the diagnosis.

Other studies have used the same data. A classification algorithm developed in [3] was tested on the Dermatology Data Set, with reported mean accuracy rates of 96.2% and, for a modified version of the algorithm, 99.2%. Note that it used all 34 features in the data set (clinical and histological attributes), which certainly informs the classifier further, since it works knowing the biopsy results. In [4], the Dermatology Data Set is classified by decision trees employing all 34 features; the authors reported an error rate of 5.5 ± 1.46, and a modified decision tree based on a genetic algorithm for attribute selection achieved an error rate of 4.2 ± 0.96. In [5], a classification algorithm based on genetic algorithms that discover comprehensible IF-THEN rules is demonstrated; applied to all 34 features of the Dermatology Data Set, it reached a 95% accuracy rate. Many other studies focusing on several medical data sets are listed on the UCI data repository website and can be accessed by the reader.

However, occasionally such generalization may not correctly classify an individual if the inference machine is not able to sufficiently discriminate it among the possible classes. Therefore, it is necessary to go more deeply into the analysis of the information provided by the individual being classified, until it can be sufficiently identified as belonging to a class.

This pursuit, moreover, may be analogous to the efforts made by physicians while performing their crucial diagnoses. In fact, medical theory and practice acknowledge as a basic foundation of medicine that no two individuals are alike, either in health or in illness. For this reason, more and more medical guidelines pursue this maxim, considering individualities in the midst of large numbers; examples are family physician programs, homeopathy, psychoanalysis, the encouragement of anamnesis rather than superficial, machine-driven consultations, and recent considerations involving slow medicine. So as not to lengthen this subject too much, reference is made to the works [6] (p. 5–6), [7] (p. 11–12), [8] and [9] (p. 3). It would be possible to refer to a series of other initiatives that denote the search for health in its individual fullness, but what matters is that common sense says that such a foundation should also inform statistical methods and artificial intelligence applied to the classification of individuals.

In this context, this chapter seeks to develop this idea for the improvement of a certain class of well-known classifiers; for this purpose, it uses real medical data sets for the validation of the algorithm proposed here.

## 2. Classifiers based on the assumption that the form for the underlying density function is known

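In parametric classification techniques, we learn from data under the assumption that the form of the underlying density function is known. The most common procedure is to assume the normal distribution, as is the case of the Gaussian Maximum Likelihood Classifier (GMLC). Suppose there are $c$ classes $\omega_1, \ldots, \omega_c$. GMLC assigns a sample $\mathbf{x}$ to the class $\omega_i$ ($i = 1, \ldots, c$) having the highest likelihood among the classes. GMLC assumes that the data follow the multivariate normal density function:

$$p(\mathbf{x} \mid \omega_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^{T}\,\Sigma_i^{-1}\,(\mathbf{x}-\boldsymbol{\mu}_i)\right] \qquad (1)$$

In this equation, $\mathbf{x}$ is the $d$-dimensional sample vector, $\boldsymbol{\mu}_i$ is the mean vector of class $\omega_i$ and $\Sigma_i$ is its covariance matrix; the quadratic form in the exponent is the squared Mahalanobis distance between $\mathbf{x}$ and $\boldsymbol{\mu}_i$.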
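As a rough illustration of the GMLC decision rule, the sketch below evaluates the logarithm of the multivariate normal density for each class and assigns the sample to the class with the largest value. It is not the chapter's implementation (which was written in R); the function names `solve`, `log_density` and `gmlc_predict`, and the toy class parameters, are assumptions made for this example, and the class means and covariance matrices are taken as already estimated.

```python
import math

def solve(a, b):
    """Solve a.x = b by Gauss-Jordan elimination; also return det(a)."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)], det

def log_density(x, mean, cov):
    """Logarithm of the multivariate normal density at x."""
    d = len(x)
    diff = [xi - mi for xi, mi in zip(x, mean)]
    z, det = solve(cov, diff)                 # z = cov^{-1} (x - mean)
    maha_sq = sum(di * zi for di, zi in zip(diff, z))
    return -0.5 * (d * math.log(2 * math.pi) + math.log(det) + maha_sq)

def gmlc_predict(x, params):
    """params maps each class label to (mean, cov); return the most likely label."""
    return max(params, key=lambda label: log_density(x, *params[label]))

# Toy parameters (hypothetical, not taken from any data set in the chapter).
params = {
    0: ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]),
    1: ([3.0, 3.0], [[1.0, 0.5], [0.5, 2.0]]),
}
label = gmlc_predict([2.8, 3.5], params)
```

Working with the log-density instead of the density itself avoids numerical underflow and does not change which class attains the maximum.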

The description above applies to well-known classifiers such as linear discriminant analysis (LDA), quadratic discriminant analysis (QDA) and regularized discriminant analysis (RDA), which are reliable classifiers based on GMLC computations that reach good results in several data situations. A basic difference between these three classifiers is that LDA assumes that all classes share a common covariance matrix, QDA allows each class its own covariance matrix, and RDA relies on regularization parameters, which provide an optimal mix of the sample covariance matrix, the global covariance matrix and the identity matrix, for instance, if (

However, these methods have their shortcomings. Barreto [11] lists those most commonly identified in the literature, such as the fact that the mean and covariance estimates are optimal only asymptotically and can produce lower classification accuracy when the training sample is small; in fact, unless many more than

Beyond these problems, this chapter discusses the fact that these methods assign a sample to the class that maximizes the density in Eq. (1) among the $c$ classes. Therefore, by inspecting Eq. (1), it is easy to see that this density will have difficulty classifying a sample whose Mahalanobis distances to the means of the classes involved are close to one another, a particular situation that induces misclassification errors.
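The point can be made explicit with a short derivation. The notation is an assumption made for illustration: $\mathbf{x}$ is a $d$-dimensional sample, $\boldsymbol{\mu}_i$ and $\Sigma_i$ are the mean vector and covariance matrix of class $\omega_i$, and $D_i^2(\mathbf{x})$ is the squared Mahalanobis distance. Taking the logarithm of the multivariate normal density gives

$$\ln p(\mathbf{x} \mid \omega_i) = -\frac{1}{2} D_i^2(\mathbf{x}) - \frac{1}{2}\ln|\Sigma_i| - \frac{d}{2}\ln 2\pi, \qquad D_i^2(\mathbf{x}) = (\mathbf{x}-\boldsymbol{\mu}_i)^{T}\,\Sigma_i^{-1}\,(\mathbf{x}-\boldsymbol{\mu}_i),$$

so that, for two classes $\omega_i$ and $\omega_j$,

$$\ln p(\mathbf{x} \mid \omega_i) - \ln p(\mathbf{x} \mid \omega_j) = -\frac{1}{2}\left[D_i^2(\mathbf{x}) - D_j^2(\mathbf{x})\right] - \frac{1}{2}\ln\frac{|\Sigma_i|}{|\Sigma_j|}.$$

When the two distances are nearly equal, the first term almost vanishes and the decision rests on the determinant term alone. If the covariance matrices are also similar, the whole difference is close to zero, and small estimation errors in the means or covariances can flip the assigned class.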

The solution is to benefit both from the training set and from the information provided by the new sample to be classified. In this way, the classifier can take into account new information that improves the overall generalization provided by these traditional methods. Therefore, the proposal in this chapter is to make the classification algorithm able to identify and treat the samples presenting close values for their Mahalanobis distances, until each sample reveals its actual class more clearly to the Gaussian classifier.

## 3. The Multivariate-Stepwise Gaussian Classifier: A new classification algorithm

What is proposed is a new classification method: the Multivariate-Stepwise Gaussian Classifier (MSGC). MSGC builds on the GMLC method already described. Its contribution is to treat a sample individually if it presents close values for its Mahalanobis distances with respect to the class means involved in the classification, so that the discrimination made by the classifier is, in principle, inconclusive. In this case, the algorithm employs dimensionality reduction by disregarding, one by one, in a stepwise process, the

The key question is: what numerical dissimilarity between the Mahalanobis distances from a sample to the class means makes its classification optimal? It can be anticipated that the answer depends on the data set under study, which requires prior calibration of the method proposed here.

### 3.1. Description of the algorithm

Given _{.}’ The Multivariate-Stepwise Gaussian Classifier (MSGC) algorithm pseudocode is (considering

(0) begin algorithm (initialize variables and counters,

(1) while

(1.1)

(1.2) while

(1.2.1) calculate the Mahalanobis distances

(1.2.2) calculate

(1.2.3) if

end if;

(1.2.4) if

else.

end do (referring to step 1.2);

(1.3) if

(1.3.1) if

assign to sample

end if;

(1.3.2) if

assign to sample

end if;

else.

(1.3.3) if

assign to sample

end if;

(1.3.4) if

assign to sample

end if;

(1.4)

end do (referring to step 1);

end of the algorithm.

Note that for simplicity of exposition, the above pseudocode was written for

Finally in this section, it should be added that recent literature involving classifiers which are in some way based on the GMLC method makes no mention of an algorithm that works like MSGC. See [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28].

The Multivariate-Stepwise Gaussian Classifier (MSGC) algorithm was implemented by means of The R Program for Statistical Computing [29] (version 2.14.0).
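The R implementation is not reproduced in this chapter. The Python sketch below only illustrates the stepwise idea described in this section, and it rests on several assumptions that are not confirmed by the text: closeness is measured as the difference between the two smallest Mahalanobis distances and compared with a threshold `delta`; at each step the variable whose removal widens that gap the most is disregarded; and if no subset of variables reaches the threshold, the decision made with all variables is kept. The sketch also compares Mahalanobis distances only and ignores the determinant term of the likelihood. All function names are hypothetical.

```python
import math

def solve(a, b):
    """Solve a.x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def mahalanobis(x, mean, cov, keep):
    """Mahalanobis distance restricted to the variable indices in `keep`."""
    diff = [x[i] - mean[i] for i in keep]
    sub = [[cov[i][j] for j in keep] for i in keep]
    z = solve(sub, diff)
    return math.sqrt(max(0.0, sum(d * s for d, s in zip(diff, z))))

def gap_and_label(x, params, keep):
    """Gap between the two smallest distances, and the label of the nearest class."""
    dist = sorted((mahalanobis(x, m, c, keep), k) for k, (m, c) in params.items())
    return dist[1][0] - dist[0][0], dist[0][1]

def msgc_predict(x, params, delta):
    """Return (label, variables used). With delta = 0 no variable is ever dropped."""
    everything = list(range(len(x)))
    keep = everything
    gap, label = gap_and_label(x, params, keep)
    full_label = label
    while gap < delta and len(keep) > 1:
        # ASSUMPTION: greedy step, drop the variable whose removal widens the gap most.
        candidates = [[v for v in keep if v != out] for out in keep]
        keep = max(candidates, key=lambda ks: gap_and_label(x, params, ks)[0])
        gap, label = gap_and_label(x, params, keep)
    if gap >= delta:
        return label, keep
    # ASSUMPTION: if the gap never reaches delta, keep the full-dimension decision.
    return full_label, everything

# Toy two-class example (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
params = {0: ([0.0, 0.0], identity), 1: ([2.0, 0.0], identity)}
result = msgc_predict([1.1, 5.0], params, delta=0.1)
```

In the toy example the second variable contributes equally to both distances and masks the difference carried by the first one, so disregarding it widens the gap between the two classes.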

## 4. Comparing MSGC with traditional GMLC method

### 4.1. Methodology

Some real data sets from the UCI repository [2] (available from: http://archive.ics.uci.edu/ml/datasets/) are used to compare MSGC to GMLC method.

A 10-fold cross-validation is widely used in the related literature, for example [13, 30], to obtain a more stable estimate of the performance of a classification method. It was therefore used here.

So to calibrate the MSGC algorithm and define the best value for

For comparison, GMLC was also implemented in the R program and applied to exactly the same blocs generated by the 10-fold cross-validation process described above.
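The fold construction and the calibration loop can be sketched as follows. This is an illustration under assumptions, not the procedure actually coded for the chapter: `make_folds` cuts a shuffled index list into chunks of size ceil(n/10), which happens to reproduce the fold sizes reported for the four data sets in Section 4.2; `accuracy` is a hypothetical callable that returns the accuracy rate obtained on a training bloc for a given threshold; and the grid of candidate thresholds (0.0 to 1.0 in steps of 0.1) is assumed.

```python
import math
import random

def make_folds(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and cut them into k mutually exclusive folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    size = math.ceil(n / k)           # the last fold receives the remainder
    return [idx[i * size:(i + 1) * size] for i in range(k)]

def calibrate_delta(folds, accuracy, grid):
    """For each bloc, pick the threshold with the best accuracy on the other folds."""
    best = []
    for held_out in range(len(folds)):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        # Ties are resolved in favour of the first (smallest) value in the grid.
        best.append(max(grid, key=lambda d: accuracy(train, d)))
    return best

grid = [i / 10 for i in range(11)]    # ASSUMPTION: candidate thresholds
folds = make_folds(768)               # e.g. the 768 Pima entries
```

Because the chunk size is rounded up, this simple splitter suits sample sizes like those used here; for very small `n` relative to `k` it could leave the last folds empty.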

### 4.2. Presentation of data sets and comparison of classification results

The Pima Indians Diabetes Data Set comprises 768 entries (8 medical and demographic attributes and a class variable), 500 of the entries classified as category 0 and 268 classified as category 1. Attribute information: (1) number of times pregnant, (2) plasma glucose concentration at 2 hours in an oral glucose tolerance test, (3) diastolic blood pressure (mm Hg), (4) triceps skin fold thickness (mm), (5) 2-hour serum insulin (mu U/ml), (6) body mass index (weight in kg/(height in m)^2), (7) diabetes pedigree function, (8) age (years) and (9) class variable (0 or 1). Ten mutually exclusive folds were randomly sampled from the Pima Indians Diabetes Data Set (9 validation folds including 77 entries and the tenth fold comprising 75). The key importance of classifying the Pima Indians Diabetes Data Set lies in the possibility of diagnosing diabetes from the numerical attributes, since class 1 is interpreted as tested positive for diabetes.

The Breast Cancer Wisconsin (Original) Data Set comprises 699 entries (9 attributes and a class variable, plus a sample code number), 458 of them classified as category 2 “benign” and 241 classified as category 4 “malignant” (recoded as 0 and 1, respectively). Attribute information: (1) sample code number (id number), (2) clump thickness: 1–10, (3) uniformity of cell size: 1–10, (4) uniformity of cell shape: 1–10, (5) marginal adhesion: 1–10, (6) single epithelial cell size: 1–10, (7) bare nuclei: 1–10, (8) bland chromatin: 1–10, (9) normal nucleoli: 1–10, (10) mitoses: 1–10 and (11) class: (2 for benign and 4 for malignant). Sixteen original entries with missing data were removed. Ten mutually exclusive folds were then randomly sampled from the remaining 683 entries (9 validation folds including 69 entries and the tenth fold comprising 62). This data set can be used to predict the severity (benign or malignant) of a clump of cells from the nine numerical attributes.

Haberman’s Survival Data Set comprises 306 entries (three attributes and a class variable), 81 of them classified as category 2 and the remaining 225 classified as category 1 (recoded as 1 and 0, respectively). Attribute information: (1) age of patient at time of operation (numerical), (2) patient’s year of operation (year-1900, numerical), (3) number of positive axillary nodes detected (numerical) and (4) survival status (class attribute), 1 = the patient survived 5 years or longer, 2 = the patient died within 5 years. Ten mutually exclusive folds were randomly sampled from Haberman’s Survival Data Set (9 validation folds including 31 entries and the tenth fold comprising 27). The main interest in classifying Haberman’s Survival Data Set is the attempt to predict the 5-year survival of patients undergoing breast cancer surgery, taking into account their age at the time of surgery, the year of the operation and the number of positive axillary nodes detected.

The Mammographic Mass Data Set supports the discrimination of benign and malignant mammographic masses based on BI-RADS attributes and the patient’s age. It comprises 961 entries (five attributes and a class variable). The class associated with each record is the field “severity,” 0 or 1. Attribute information: (1) BI-RADS assessment: 1–5 (ordinal), (2) age: patient’s age in years (integer), (3) shape: mass shape: round = 1, oval = 2, lobular = 3, irregular = 4 (nominal), (4) margin: mass margin: circumscribed = 1, microlobulated = 2, obscured = 3, ill-defined = 4, spiculated = 5 (nominal), (5) density: mass density: high = 1, iso = 2, low = 3, fat-containing = 4 (ordinal) and (6) severity: benign = 0 or malignant = 1 (binominal). A total of 131 original entries with missing data were removed. Ten mutually exclusive folds were randomly sampled from the remaining 830 entries (all of them with 83 entries). Regarding the Mammographic Mass Data Set, [2] states that “*Mammography is the most effective method for breast cancer screening available today. However, the low positive predictive value of breast biopsy resulting from mammogram interpretation leads to approximately 70% unnecessary biopsies with benign outcomes. (…) This data set can be used to predict the severity (benign or malignant) of a mammographic mass lesion from BI-RADS attributes and the patient’s age*.”
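The two preparation steps mentioned in these descriptions, removing entries with missing data and recoding the class labels, can be sketched as below. The sketch assumes that missing values are marked with `?`, as in the raw UCI files, and that each entry has already been read into a list of strings. The function names and the example rows are hypothetical and are not actual records from any of the data sets.

```python
def drop_incomplete(rows, missing="?"):
    """Keep only the rows in which no field equals the missing-value marker."""
    return [row for row in rows if missing not in row]

def recode_classes(rows, mapping, class_col=-1):
    """Return copies of the rows with the class column recoded through `mapping`."""
    recoded = []
    for row in rows:
        new_row = list(row)
        new_row[class_col] = mapping[new_row[class_col]]
        recoded.append(new_row)
    return recoded

# Hypothetical rows in the layout of the Breast Cancer Wisconsin (Original) file:
# id, nine attributes scored 1-10, class (2 = benign, 4 = malignant).
rows = [
    ["1001", "5", "1", "1", "1", "2", "1", "3", "1", "1", "2"],
    ["1002", "8", "4", "5", "1", "2", "?", "7", "3", "1", "4"],
    ["1003", "8", "10", "10", "8", "7", "10", "9", "7", "1", "4"],
]
clean = recode_classes(drop_incomplete(rows), {"2": 0, "4": 1})
```

For Haberman’s Survival Data Set the same helper would be called with the mapping `{"1": 0, "2": 1}`, following the recoding stated above.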

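To illustrate the 10-fold cross-validation process for MSGC calibration, Table 1 summarizes the values of Δ that gave the best accuracy rate on the training blocs (PI: Pima Indians Diabetes; BR: Breast Cancer Wisconsin; HB: Haberman’s Survival; MA: Mammographic Mass).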

DATA | Bloc 1 | Bloc 2 | Bloc 3 | Bloc 4 | Bloc 5 | Bloc 6 | Bloc 7 | Bloc 8 | Bloc 9 | Bloc 10 |
---|---|---|---|---|---|---|---|---|---|---|
PI | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.1 |
BR | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
HB | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
MA | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 |

From Table 1, it is possible to see that best values for

Table 2 summarizes the mean accuracy rate and its standard error (in parentheses) for all data sets and methods (the best result for each data set is highlighted in bold). Both methods were proficient in classifying the data and obtained relatively similar results.

DATA | MSGC | GMLC |
---|---|---|
PIMA | 73.41 (2.25) | |
BREAST | 94.92 (0.75) | 94.92 (0.75) |
HABERMAN’S | 75.10 (2.42) | 75.10 (2.42) |
MAMMOGRAPHIC | 80.00 (1.11) | |

From Table 2, we can see that MSGC attained the greatest accuracy rate in two out of four data sets (PIMA and MAMMOGRAPHIC). For HABERMAN’S and BREAST, both methods achieved identical results, since for these data sets MSGC was set with Δ = 0.0 (Table 1).

Note that accuracy rate was chosen as the criterion for comparison, but in medicine, sometimes the physician needs to know other criteria like sensitivity, specificity or precision; in this case, the data analyst should take care to also calculate them based on the algorithm results.
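These additional criteria follow directly from the counts of true and false positives and negatives. The sketch below computes them for a two-class problem, together with the mean and standard error of a set of per-fold accuracy rates. Treating label 1 as the positive class and computing the standard error as the sample standard deviation divided by the square root of the number of folds are assumptions of this example; the function names are hypothetical.

```python
import math

def _ratio(num, den):
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")

def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, specificity and precision for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "sensitivity": _ratio(tp, tp + fn),   # true positive rate
        "specificity": _ratio(tn, tn + fp),   # true negative rate
        "precision": _ratio(tp, tp + fp),     # positive predictive value
    }

def mean_and_standard_error(values):
    """Mean of the per-fold values and the standard error of that mean."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)

# Hypothetical labels for six individuals (1 = positive).
metrics = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```

In a diagnostic setting, sensitivity and specificity are often more informative than accuracy alone, because the two kinds of error (a missed disease versus a false alarm) rarely carry the same cost.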

We should also remark that these positive results for the MSGC algorithm reach beyond GMLC itself. Since GMLC is the basis of other traditional classification methods (namely RDA, QDA and LDA), an improvement made in GMLC, such as that obtained through the MSGC method, will probably also imply improvements in performance for RDA, QDA and LDA. Future research should verify this.

## 5. Conclusion

A new classification algorithm is presented in this chapter: The Multivariate-Stepwise Gaussian Classifier (MSGC).

MSGC builds on the Gaussian Maximum Likelihood Classifier (GMLC) method. Its contribution is to treat a sample individually if it presents close values for its Mahalanobis distances with respect to the class means involved in the classification, so that the discrimination made by the classifier is, in principle, inconclusive. In this case, MSGC employs dimensionality reduction by disregarding, one by one, in a stepwise process, the

For better performance, MSGC may be calibrated beforehand by means of a training set. A 10-fold cross-validation process was used to calibrate the algorithm.

MSGC was applied for data classification and its performance was compared with the traditional GMLC method considering four real medical data sets available in the UCI data repository. These data represent a range of different types of data dependence structure and dimensionality. The results showed that the performance of the MSGC algorithm is at least as competitive as GMLC. MSGC attained the greatest accuracy rate in two of the data sets (PIMA and MAMMOGRAPHIC). For HABERMAN’S and BREAST data sets, both methods achieved identical results. It was concluded that MSGC can be used as an effective classification tool in a wide range of data sets.

The results presented for the MSGC algorithm reach beyond GMLC itself. Since GMLC is the basis of other traditional classification methods (namely RDA, QDA and LDA), an improvement made in GMLC, such as that obtained through the MSGC algorithm, will probably also imply improvements in performance for RDA, QDA and LDA.

After reaching these conclusions, an additional discussion arises. With the emergence of big data, a robust successor to data mining that arose from the exponential development of computers and storage media since the 1990s, there has been a tendency toward the intensive use of multiple algorithms simultaneously, in supervised or unsupervised approaches, to analyze data and discover patterns. This certainly makes sense since, as already mentioned in this chapter, no single classification method or algorithm is better than all others. Beyond the greater robustness or scalability of some methods over others, what concretely exists is a dependence of the results on the target database.

Therefore, in this context, and considering a matter as important as clinical medicine, once a classification method has been tested and successfully validated for a particular scope of data, the best recommendation would be to use it for the best diagnosis. Meanwhile, if possible, known or new algorithms could be tested on data for various diseases and symptoms until they prove robust and effective enough to be put into medical practice.

Finally, it is important to remark that a mathematical classifier serves as an aid to the crucial medical diagnosis made by the physician.

## Acknowledgments

The author would like to thank the UCI Machine Learning Repository and the data donors for putting real data sets at the disposal of the scientific community and would also like to thank The R Foundation for Statistical Computing and its contributors for developing and making the R program available to the public. The Breast Cancer Wisconsin (original) Data Set was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg, and the author would like to thank him.

## Nomenclature

Abbreviation | Meaning |
---|---|
GMLC | Gaussian Maximum Likelihood Classifier |
LDA | linear discriminant analysis |
MSGC | Multivariate-Stepwise Gaussian Classifier |
QDA | quadratic discriminant analysis |
RDA | regularized discriminant analysis |