Article Type: Research Paper
Date of acceptance: February 2022
Date of publication: March 2022
DOI: 10.5772/acrt.01
Copyright: © 2022 The Author(s). Licensee IntechOpen. License: CC BY 4.0
This article provides an optimisation method using a Genetic Algorithm approach to apply feature selection techniques to large data sets in order to improve accuracy. This is achieved through improved classification, a reduced number of features, and easier interpretation of the model. A clinical dataset, based on heart failure, is used to illustrate the nature of the problem and to show the effectiveness of the techniques developed. Clinical datasets are sometimes characterised as having many variables. For instance, blood biochemistry data has more than 60 variables, which has led to complexities in developing predictions of outcomes using machine-learning and other algorithms. Hence, techniques to make such datasets more tractable are required. Genetic Algorithms can provide an efficient method of low numerical complexity for effectively selecting features. In this paper, a way to estimate the number of required variables is presented, and a genetic algorithm is used in a “wrapper” form to select features for a case study of heart failure data. Additionally, different initial populations and termination conditions are used to arrive at a set of optimal features, and these are then compared with the features obtained using traditional methodologies. The paper provides a framework for estimating the number of variables and generations required for a suitable solution.
Keywords: feature selection, feature optimisation, genetic algorithms, human reasoning, wrapper selection
The explosion of data capture in general, and in the health industry specifically, has created new challenges in terms of understanding and benefitting from the data [1]. This increase has led to a rapid growth in the size and dimensional complexity of datasets, i.e. in the number of distinct measurements or features. When machine-learning and data mining algorithms are applied to high-dimensional data, the performance of pattern modelling and the effectiveness of classification are reduced [2], as the computational complexity increases in terms of both processing time and the storage space required (see [3, 4] and [5]).
Reducing dimensionality can be done by using one of two primary methods, namely feature selection and feature extraction. Feature extraction creates a new smaller set of features that projects the original high-dimensional features to a new feature subset with lower dimensionality. On the other hand, feature selection is the process of creating a smaller subset by removing irrelevant and redundant features. Hence, they are both considered as effective dimension reduction approaches, having the advantage of improving learning performance, increasing computational efficiency, decreasing memory storage, and building better generalisation models [6]. A distinction between the two is that, in feature selection, a subset of the original features is created, whilst for feature extraction, a set of novel features is produced from the original features [7].
Feature selection is based on selecting a subset of features from the original dataset where [8]:
The reduction in classification accuracy is minimised;
The class distribution of the values for the selected features is optimised to be close to the class distribution of the original set of all the features.
There are three recognised approaches to feature selection, namely filter, wrapper and embedded approaches [9, 10]. Filter methods score features independently of any learning algorithm; wrapper methods evaluate candidate subsets using the learning algorithm itself; and embedded methods perform the selection as part of training the model.
In principle, the feature selection problem can be solved optimally using an exhaustive search that evaluates all possible options. In practice, a heuristic method is employed to generate subsets of features for evaluation, where each subset is evaluated to determine the “goodness” of the selected features. The current work focuses on the application of Genetic Algorithms (GAs) to the feature selection problem, while also addressing some of the underlying fundamental problems of Genetic Algorithms, namely (i) when to terminate the algorithm, (ii) how to determine the number of chromosomes to be used, and (iii) how to assess how much of the search space has been explored. By integrating a GA within a feature selection problem, it is possible to develop a framework within which answers to these questions can be obtained.
This paper first casts the problem of feature selection as an optimisation problem, which lends itself naturally to the application of Genetic Algorithms (section 2). Furthermore, by casting exploration of the subset space as a graph problem, it is feasible to derive specific properties for the GAs as applied to the feature selection problem. Finally, in section 3, through the use of a Heart Failure dataset [11, 13], the approach is applied and evaluated [14, 15]. Appendix A provides a list of all of the variables for the case considered below, and Appendix B, Table 4, provides a list of the main acronyms.
This research was primarily based on an experimental approach, in which feature selection was applied to a published data set to identify the most important factors. This was then compared with established practice by clinicians, to validate the computer-based identification of the key factors.
Section 3 describes the theory and relevant background material on feature selection. It develops the theory and provides proofs of some of the relevant properties that enable estimates of the number of iterations required to achieve a solution. This is then applied to a large clinical data set to demonstrate the effectiveness of the techniques and approach.
All feature selection algorithms typically consist of four steps (see figure 1), namely:
a subset generation step;
evaluation of the subset;
a step to check against the stopping criteria; and, once this is met,
a result validation step.
Classically, a heuristic method is employed to generate subsets of features for evaluation, where each subset is evaluated to determine the “goodness” of the selected features. The validation is done by carrying out different tests and comparisons with the previous subset: if a new subset has worse performance, then the previous subset is retained. This process is repeated until a stopping criterion is reached, as shown in figure 1. A feature is a variable in a dataset and, as mentioned previously, if the number of variables is large, this can lead to problems that are intractable, i.e. that cannot be solved in polynomial time. One approach that can aid in finding solutions is to reduce the number of features through feature selection, dropping those features that are deemed unnecessary or that may actually impede the predictive performance of a model.
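As a minimal sketch of this generic loop (an illustration only, not the specific procedure used later in this paper), the following Python fragment generates random candidate subsets, scores each one with a user-supplied `evaluate_subset` function (a hypothetical stand-in for, e.g., cross-validated classifier accuracy), and retains the previous subset whenever the new one performs worse:

```python
import random

def wrapper_feature_selection(n_features, evaluate_subset, max_iters=100, seed=0):
    """Generic wrapper loop: subset generation -> subset evaluation ->
    comparison with the previous best -> stop when the budget is exhausted."""
    rng = random.Random(seed)

    def propose_subset():
        # Heuristic subset generation: include each feature with probability 0.5.
        subset = [i for i in range(n_features) if rng.random() < 0.5]
        return subset or [rng.randrange(n_features)]    # never return an empty subset

    best = propose_subset()
    best_score = evaluate_subset(best)                  # e.g. cross-validated accuracy
    for _ in range(max_iters):                          # stopping criterion: iteration budget
        candidate = propose_subset()                    # step 1: subset generation
        score = evaluate_subset(candidate)              # step 2: subset evaluation
        if score > best_score:                          # step 4: keep only improvements
            best, best_score = candidate, score
    return best, best_score
```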
A general feature selection problem, with N original features, can be stated as follows. Given a set F = {f_1, …, f_N} of features and an evaluation (cost) function J, find a subset S ⊆ F, with |S| = d < N, that optimises J(S).
Given the ease of data collection, a particular issue is that datasets often have far more variables than are strictly necessary to achieve a correct result; this is even more acute in clinical datasets. The problem then becomes one of reconstructing the data set with a reduced, optimal, number of features d < N.
A feature set S is considered suitable if it is substantially smaller than F while the relationships between the selected features and the class labels are preserved, i.e. the class distribution given S remains close to the class distribution given the full feature set F.
Such a formulation lends itself to a combinatorial search for an optimal set of features (see [21–23]). An alternative approach is to consider the two problems of feature selection and classification together [24], through the use of scaling variables ω = (ω_1, …, ω_N), where each ω_i indicates whether feature i is retained. The length of ω therefore equals the number of original features, and the number of non-zero entries of ω gives the size of the selected subset.
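The precise equations have not survived in this version of the text. As a hedged sketch only (the exact form of the paper's equations may differ), a joint feature selection and classification problem of this kind is commonly written with a binary scaling vector:

```latex
\min_{\omega \in \Omega,\; \theta}\; J(\omega,\theta)
  \;=\; E\bigl(\theta;\ \omega \odot x\bigr) \;+\; \lambda \sum_{i=1}^{N}\omega_i,
\qquad \Omega=\{0,1\}^{N},
```

where x is a feature vector, ω ⊙ x denotes the element-wise scaling of the features by ω, E is the classification error of a classifier with parameters θ, and λ ≥ 0 trades classification accuracy against the number of retained features.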
In order to investigate the properties of (2), we need the following proposition:
Let
Then for 𝜔 ∈𝛺;
The proof is routine and follows the methodology that can be found in [21] and [24]. The approach taken by a feature search algorithm for selecting optimal feature sets is often an iterative one, in which two problems are repeatedly solved; these are referred to below as FSP1 and FSP2.
A Genetic Algorithm ensures that over successive generations, the population “evolves” toward an optimal solution based on the principle of survival of the fittest (see [3, 9, 17]). GAs allow the combining of different solutions generation after generation to extract the best solution in the quickest time [25]. Generally, the search method used by Genetic Algorithms has lower complexity than the more formal approaches mentioned earlier [26].
The individuals in the genetic space are called chromosomes. A chromosome is a collection of genes, where each gene can generally be represented by a real value or by a binary encoding. The number of genes is the total number of features in the data set. Thus, the scaling vector ω is a chromosome, and each ω_i is one of its genes.
At every iteration, or generation, a new set of chromosomes is evaluated using a fitness function; this is typically the cost function used in FSP1. Given a population of size P, essentially P subsets of features are evaluated at every iteration. In this approach, the chromosomes with the poorest performance (or fitness) are eliminated, and are therefore not available for the crossover or mutation operations; in this manner, good subsets are “evolved” over time [22]. Commonly used selection methods are roulette-wheel selection, Boltzmann selection, tournament selection, rank selection, and steady-state selection. The selected subset is then used for reproduction via crossover and mutation. The idea of the crossover operator is that some combination of the best features of the best chromosomes will yield a better subset. Mutation, on the other hand, takes one chromosome and randomly changes the value of a gene, thus adding or deleting features in a subset. This process is repeated until there is no improvement or, as is often the case, until a set number of iterations or generations is reached. This is illustrated in figure 3.
The general structure of the algorithm is as follows:
1. Choose an initial population of chromosomes.
2. Evaluate the fitness of each chromosome.
3. Perform selection, then:
   a. perform crossover;
   b. perform mutation.
4. Evaluate the fitness of the new generation.
5. If the stopping criteria are satisfied, stop; otherwise go to step 2.
Steps 1 and 4 solve FSP2, while steps 3a and 3b solve FSP1.
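A minimal Python sketch of this structure is given below; the helper names, default parameter values, and the use of single-point crossover are illustrative assumptions and do not reproduce the exact implementation of [31]:

```python
import random

def genetic_feature_selection(fitness, n_features, pop_size=50, n_generations=100,
                              crossover_rate=0.8, mutation_rate=0.01, elite=2, seed=0):
    """Sketch of the GA loop above: initial population -> evaluate fitness ->
    selection -> crossover -> mutation -> re-evaluate -> stop on generation budget.
    `fitness` maps a bit-string chromosome (one bit per feature) to a value that
    is MINIMISED (e.g. classification error)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]  # step 1

    def select(scores):
        # Roulette-wheel style selection for minimisation: lower score -> larger weight.
        weights = [max(scores) - s + 1e-9 for s in scores]
        return pop[rng.choices(range(len(pop)), weights=weights, k=1)[0]]

    for _ in range(n_generations):                                   # step 5: budget
        scores = [fitness(c) for c in pop]                           # steps 2/4
        ranked = sorted(range(len(pop)), key=lambda i: scores[i])
        next_pop = [pop[i][:] for i in ranked[:elite]]               # keep the elite
        while len(next_pop) < pop_size:
            p1, p2 = select(scores), select(scores)                  # step 3: selection
            if rng.random() < crossover_rate:                        # step 3a: crossover
                cut = rng.randrange(1, n_features)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]  # step 3b
            next_pop.append(child)
        pop = next_pop
    scores = [fitness(c) for c in pop]
    best_i = min(range(len(pop)), key=lambda i: scores[i])
    return pop[best_i], scores[best_i]
```

In the feature selection setting, `fitness` would be a cost such as the kNN-based function discussed later, so that each chromosome encodes one candidate feature subset.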
The solutions to both FSP1 and FSP2 can be considered to be sets of Kuhn-Tucker points, i.e. points that satisfy the Kuhn-Tucker conditions for FSP1 or for FSP2, but never for both at the same time [22].
For most optimisation techniques, there is often a requirement that the cost function is monotone; however, there is no guarantee that this is satisfied globally. Nevertheless, if the cost function is not locally monotone somewhere, a solution to the problem will not exist. It is possible to estimate the number of generations required to arrive at an optimal set of features if the requirement of monotonicity is enforced in the GAs. Thus, if it were made explicit that from one generation to the next the best subset is an improvement on the previous best, the sequence of cost function values would be monotone. One consequence of the monotone requirement is that the initial population of chromosomes and the final set of chromosomes are disjoint.
This condition can be satisfied for the initial population only; for subsequent generations this may not be possible, since the same operation on two different pairs of chromosomes may result in the same set of children. However, given this restriction, it is interesting to use it to gain insight into the operation of GAs.
The fitness function
The second definition is interesting and is required for further analysis. It essentially focuses on the first, or leading, chromosome of any generation, since for each generation the chromosomes are ordered (definition 2). This definition also implies that, with every successive generation, the fitness of the leading chromosome is better than that of the previous generation.
Two generations
This definition is required to keep in line with the requirements of analysis based on graph searches.
For a given fixed M, where M is the number of chromosomes, a genetically monotone fitness function and genetically disjoint generations, the algorithm needs at most ⌈2^N/M⌉ generations to explore the whole space of feature subsets.
The proof follows from the definitions. The conditions (assumptions) here are often difficult to satisfy. Obtaining an ordered set for a generation is already part of the algorithm, as is the presence of a genetically monotone function in the form of the fitness function. However, the genetically disjoint requirement is difficult to satisfy, and there is no guarantee it will be fulfilled at all during a given run of the algorithm. Thus, the terminating criterion of a genetic algorithm is often simply the number of generations. In most random search algorithms, the function
Two generations
If the sequence of generation sets
The proposition is both intuitive and follows from the fact that every generation set contains chromosomes that have not appeared in any earlier generation.
Louis and Rawlins [27] have shown that the crossover operator has no role to play in the convergence of the algorithm, as the average Hamming distance of the population is maintained in every generation. Their analysis showed that traditional crossover operators, such as one-point, two-point, l-point and uniform crossover, do not change the average Hamming distance of the chromosomes within a generation, and thus the distance is maintained from generation to generation; they therefore suggested that the selection criterion is the main parameter determining the convergence rate of the genetic algorithm. However, Greenhalgh and Marshall [27] showed that, irrespective of the coding (binary or otherwise), the key controller of convergence (that is, of the ability of the genetic algorithm to have visited all of the possible points in the search space) is the mutation probability. They indicated, following on from the work of Aytug and Koehler, the following:
For a given 0 ≤ γ < 1 and μ ≤ 0.5 (where γ is the required level of confidence that the entire search space has been visited, and μ is the mutation probability), an upper bound on the number of generations needed to achieve that level of confidence can be computed.
The proof is intuitive. Even though crossover may lead to all the subsets in a particular generation being identical, the presence of mutation changes this; with both crossover and mutation present, diversity within each generation is maintained [27]. These two conditions imply that every generation contains a set of chromosomes that have not been evaluated in the previous generation and that the crossover operations do not generate the same solutions within a generation, so that at most ⌈2^N/M⌉ generations would be needed to arrive at a solution. Within the definitions of the GAs, it is not possible to guarantee that both of these conditions are satisfied. If they are, it is feasible to determine the number of generations required to arrive at a solution, given the initial population size.
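As a sketch of the counting argument behind such a bound (under the strong assumption, noted above, that successive generations are genetically disjoint, so that each generation evaluates M previously unseen chromosomes):

```latex
G_{\max} \;\le\; \left\lceil \frac{2^{N}}{M} \right\rceil,
```

since the search space contains 2^N distinct feature subsets and each generation can visit at most M of them.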
In order to assess the complexity of the Genetic Algorithms, we need to consider the following:
The initial population size, and
The number of generations required to satisfy the performance criteria.
GAs can often reach a state of stagnation; a possible reason for this is that the above two points act as constraints, and it becomes feasible that the same population is generated at every generation. In order to gain a deeper insight into the process, consider the following:
Since
The dataset under consideration is a real-life heart failure dataset [29]. In this dataset, there are 60 features for 1944 patient records (see Table 3 in Appendix A). The classification is simply “dead” or “alive”. Missing values were imputed by different methods, such as Concept Most Common Imputation (CMCI) and Support Vector Machine (SVM) imputation, and the resulting datasets were tested in order to select a good set for feature selection [29]. The selection of a dataset was based on accuracy, sensitivity and specificity, and the dataset whose imputation was based on SVM was selected. The experiments were designed using Weka (version 3.8.1-199-2016), and validation was done using 10-fold cross-validation. Further details of the dataset and the imputation comparison can be found in Alabed et al. [29].
The GA was implemented as described earlier, following the implementation in [31]. The parameters for the GAs are shown in Table 1.
Table 1. GAs parameters.

GAs Parameter | Value |
---|---|
Number of features | 60 |
Population size | 50, 75, 100 |
Genome length | 60 |
Population type | Bit strings |
Fitness function | kNN-based classification error |
Number of generations | 130 |
Crossover | Arithmetic crossover |
Mutation | Uniform mutation |
Selection scheme | Roulette wheel |
Elite count | 2 |
In most evolutionary methods, and Genetic Algorithms in particular, the population size, i.e. the number of chromosomes, is of particular interest. This often influences the quality of the solution and the computing time, and it also depends on the available memory [32–34]. There have been a number of studies on appropriate population sizes, see [35, 36], with many suggesting that a “small” population size results in poor solutions [37–39], while a “large” population size results in greater complexity in finding a solution [40, 41]. Indeed, this is apparent if the problem is viewed as a search through the 2^N possible feature subsets (2^60 for the dataset considered here).
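As a back-of-the-envelope illustration (not a result from the paper), the snippet below evaluates the size of this search space for the 60 features used here and the corresponding worst-case generation counts for the population sizes of table 1, which makes it clear why exhaustive coverage is impractical and heuristic termination criteria are needed:

```python
import math

N = 60                        # number of features in the heart failure dataset
search_space = 2 ** N         # number of distinct feature subsets
print(f"search space: 2^{N} = {search_space:.3e} subsets")

for M in (50, 75, 100):       # population sizes used in the experiments
    # Worst-case number of generations needed to cover the whole space if every
    # generation evaluated M previously unseen chromosomes (genetically disjoint).
    print(f"M = {M:3d}: at most {math.ceil(search_space / M):.3e} generations")
```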
The chromosomes are evaluated using a fitness function based on Oluleye’s fitness function [31]. Here the function minimises the classification error while at the same time reducing the number of features (see equation (2)). The fitness function that evaluates the chromosomes is based on the kNN function [31], and the chromosomes are ranked according to this kNN-based fitness. In this case, chromosomes with the lowest fitness have a better chance of surviving.
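A hedged sketch of such a fitness function is shown below. It follows the general idea described here (cross-validated kNN error plus a penalty on the number of selected features), but the weighting `alpha`, the value of k, and the helper signature are assumptions rather than the exact function of [31]:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def knn_fitness(chromosome, X, y, alpha=0.01, k=5, folds=10):
    """Fitness to be MINIMISED: cross-validated kNN classification error on the
    selected features plus a term that penalises larger feature subsets."""
    mask = np.asarray(chromosome, dtype=bool)
    if not mask.any():                       # an empty subset gets the worst possible score
        return 1.0
    accuracy = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                               X[:, mask], y, cv=folds).mean()
    return (1.0 - accuracy) + alpha * mask.sum() / mask.size
```

With this convention, the chromosomes with the lowest fitness (lowest penalised error) are the ones most likely to survive, as described above.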
A roulette wheel selection approach was used for selecting the chromosomes for the crossover operations, where each individual is assigned a “slice” of the wheel in proportion to its fitness value. Therefore, the fitter an individual is, the larger its slice of the wheel. The wheel is simulated by a normalisation of the fitness values of the population of individuals, as discussed in FSP2. The selection process is as follows:
Determine the sector angle for each chromosome, proportional to its normalised fitness;
Generate a random number corresponding to a point on the wheel;
Select the chromosome whose sector contains that point;
Repeat until the required number of parents has been selected.
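A minimal sketch of this selection scheme is given below; since the fitness here is an error to be minimised, the fitness values are first inverted so that better (lower-error) chromosomes receive larger slices of the wheel:

```python
import random

def roulette_select(population, fitnesses, n_parents, seed=0):
    """Roulette-wheel selection for a minimisation problem: each chromosome gets a
    slice proportional to its inverted, normalised fitness; a random point on the
    wheel is drawn and the chromosome whose slice contains it is selected."""
    rng = random.Random(seed)
    worst = max(fitnesses)
    slices = [worst - f + 1e-12 for f in fitnesses]   # lower fitness -> larger slice
    total = sum(slices)
    parents = []
    for _ in range(n_parents):
        point = rng.uniform(0.0, total)               # spin the wheel
        acc = 0.0
        for chromosome, s in zip(population, slices):
            acc += s
            if point <= acc:                          # slice containing the point
                parents.append(chromosome)
                break
    return parents
```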
It should be noted that the chromosomes are bit strings; for such a representation we could use a single-point or a multi-point crossover operator. In the experiments reported here, an arithmetic crossover and uniform mutation were used (see table 1).
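For completeness, single-point crossover and uniform (bit-flip) mutation on bit-string chromosomes can be sketched as follows; note that table 1 reports an arithmetic crossover for the actual experiments, so the crossover shown here is an illustrative alternative:

```python
import random

def single_point_crossover(parent_a, parent_b, rng):
    """Cut both parents at one random position and swap the tails."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]

def uniform_mutation(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`, adding or removing features."""
    return [1 - gene if rng.random() < rate else gene for gene in chromosome]

rng = random.Random(42)
a, b = [1, 0, 1, 1, 0, 0], [0, 1, 0, 0, 1, 1]
child_1, child_2 = single_point_crossover(a, b, rng)
child_1 = uniform_mutation(child_1, rate=0.05, rng=rng)
```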
This paper focuses on two aspects of feature selection, (a) the total number of subsets (population size) to be evaluated at each iteration, and (b) the number of features in each subset.
In order to assess the appropriate population size, different population sizes were tested. The results are presented in figures 4 and 5. The best results were obtained with a population of 100.
In order to investigate whether larger population sizes improve the performance dramatically, the population was increased in steps to 400, 600, and 800. Figure 6 shows the accuracy for different generations; the optimal accuracy is 86.3%, which is less than the 87.7% achieved using a population of 100. The results also indicate that an increase in population size does not change the results significantly enough to warrant the additional complexity.
Another test was carried out with three population sizes of 50, 75 and 100 (see figure 4), running the Genetic Algorithms for different numbers of generations and with different settings of the remaining GA parameters.
At the same time, it can be noticed that the fitness of the best chromosome in each generation oscillates after the initial improvements. This indicates that the algorithm has arrived at a good set of features but then starts to oscillate; increasing the number of generations changes neither the fitness of the best chromosome nor the mean fitness. Indeed, the number of generations was increased to 130 and the results were similar. The results shown in figure 6 confirm that the performance does not improve significantly, in line with the earlier analysis of the number of generations likely to be needed and with [15]. Thus, for this dataset, approximately 50 generations are needed for convergence of the best-fit chromosome.
This is further illustrated in figures 7 and 8, where the classification performance of the selected features is shown for the test runs of figure 6. This further confirms what was shown in figures 4 and 5.
The question then remains of determining whether an optimal solution has been obtained. Figure 6 illustrates two aspects of the feature selection problem. The first is that, independent of the number of generations or the population size, the optimal value of the fitness function is reached. However, determining which chromosome is the best is difficult to do independently. The oscillatory behaviour of the fitness of the best chromosome in all cases shows that there is no single best chromosome, and that there may be more than one chromosome which is good. This is where an expert in the application has to be consulted in order to pick an appropriate subset of features for the application.
It has been suggested [43, 44] that the performance can be further improved by running a second-level GA. Thus, starting from the 27 features selected by the genetic algorithm, the number of selected features was reduced from 27 to 14, as shown in figure 7; however, the accuracy was slightly reduced, by about 1%, from 87.8% to 86.77%. Figure 7 shows the results obtained for these tests.
In order to assess this, further tests were carried out using two additional feature selection algorithms: Symmetrical Uncertainty [43, 44] and the Correlation-Based Selector (CFS) [45]. These results were compared with those obtained by Al Khaldy [46], who investigated several feature selection methods, including wrapper and filter methods, and used a representative set of classification methods to evaluate the selected features. These methods enabled the identification of a core set of features from the same dataset. Table 2 lists the features common to this work and that of Al Khaldy [46]; it can be seen that a number of features are common to both.
Common features between GAs, CFS and Symmetrical Uncertainty | Al Khaldy [46] |
---|---|
Urea (mmol/L) | 4 |
Uric Acid (mmol/L) | 4 |
MCV (fL) | 5 |
Iron (umol/L) | 6 |
Ferritin (ug/L) | 4 |
CRP (mg/L) | 3 |
CT-proET1 | 7 |
LVEDD (HgtIndexed) | 6 |
E | 3 |
Height (Exam)(m) | 2 |
FVC (L) | 6 |
It is said that there is no single “best feature selection method” [47], since the performance of feature selection depends on the performance of the learning method. The GAs selected 27 features, with an accuracy of 87.8%. Of these features, three are among those used by clinicians in diagnosing heart failure [46], namely urea, uric acid and creatinine.
Feature selection algorithms based on analytical methods are numerically complex. Genetic Algorithms, on the other hand, are less complex and can be used to arrive at solutions to complex optimisation problems. Casting the feature selection problem as an optimisation problem and using a Genetic Algorithm for its solution provides a useful result. However, the key questions of population size, the number of generations, and so on remain to be answered; this paper provides partial answers to these questions and, through the analysis of a complex dataset, illustrates some rules of thumb. What is interesting is that the required number of generations does not change much for the different population sizes. What changes is the chromosome with the best fitness, which is natural for most feature selection problems: although it is possible to select a good subset, this subset is not unique, and there will always be small variations between runs. The compromise here is based on (a) expert advice, (b) the extra computational effort, and (c) the marginal improvement in performance. The trade-offs then depend on the nature of the application; if the margins are very fine, a small marginal improvement in performance is well worth it.
Feature Number | Feature Name (measurement) | Feature Number | Feature Name |
---|---|---|---|
1 | Age | 31 | MR-proADM (mid-regional pro-adrenomedullin) |
2 | Sodium (mmol/L) | 32 | CT-proET1 (C-terminal proendothelin-1) |
3 | Potassium (mmol/L) | 33 | CT-proAVP (copeptin) |
4 | Chloride (mmol/L) | 34 | PCT (procalcitonin) |
5 | Bicarbonate (mmol/L) | 35 | Rate (ECG) (bpm) |
6 | Urea (mmol/L) | 36 | QRS Width (msec) (QRS complex width) |
7 | Creatinine (umol/L) | 37 | QT (QT interval) |
8 | Calcium (mmol/L) | 38 | LVEDD (cm) (left ventricular end-diastolic diameter) |
9 | Adj Calcium (mmol/L) | 39 | LVEDD (HgtIndexed) |
10 | Phosphate (mmol/L) | 40 | BSA (m2) (body surface area) |
11 | Bilirubin (umol/L) | 41 | Left Atrium (cm) |
12 | Alkaline Phosphatase (iu/L) | 42 | Left Atrium (BSAIndexed) |
13 | ALT (iu/L) (alanine transaminase) | 43 | Left Atrium (HgtIndexed) |
14 | Total Protein (g/L) | 44 | Aortic Velocity (m/s) |
15 | Albumin (g/L) | 45 | E (Examination) |
16 | Uric Acid (mmol/L) | 46 | Height (Exam) (m) |
17 | Glucose (mmol/L) | 47 | Weight (Exam) (kg) |
18 | Cholesterol (mmol/L) | 48 | BMI (body mass index) |
19 | Triglycerides (mmol/L) | 49 | Pulse (Exam) (bpm) |
20 | Haemoglobin (g/dL) | 50 | Systolic BP (mmHg) |
21 | White Cell Count (10^9/L) | 51 | Diastolic BP (mmHg) |
22 | Platelets (10^9/L) | 52 | Pulse BP (mmHg) |
23 | MCV (fL) (mean corpuscular volume) | 53 | Pulse BP (mmHg) |
24 | Hct (fraction) (haematocrit) | 54 | FEV1 (l) (forced expiratory volume) |
25 | Iron (umol/L) | 55 | FEV1 Predicted (l) |
26 | Vitamin B12 (ng/L) | 56 | FEV1 |
27 | Ferritin (ug/L) | 57 | FVC (l) (forced vital capacity) |
28 | CRP (mg/L) (C-reactive protein) | 58 | FVC Predicted (l) |
29 | TSH (mU/L) (thyroid-stimulating hormone) | 59 | FVC |
30 | MR-proANP (mid-regional pro atrial natriuretic peptide) | 60 | PEFR (l) (peak expiratory flow rate) |
Acronym | Meaning |
---|---|
CFS | Correlation-based-selector |
Cost function | |
F | Set of features |
FSP | Feature search problem |
GA | Genetic algorithms |
Fitness function | |
NP | Non-deterministic polynomial-time |
The data and code used in this work are available on request from the authors.
This work was assisted with support from the institutional research support fund. There is no conflict of interest in the work itself or the analysis.
© The Author(s) 2022. Licensee IntechOpen. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.