State of the art of feature selection criteria: the criteria that work with the FS criteria evaluation framework used in this study are underlined.
Hyperspectral imagery consists of hundreds of contiguous spectral bands. However, most of them are redundant. Thus a subset of well-chosen bands is generally sufficient for a specific problem, which makes it possible to design adapted superspectral sensors dedicated to specific land cover classification. Related both to feature selection and extraction, spectral optimization identifies the most relevant band subset for specific applications, involving a band subset relevance score as well as a method to optimize it. This study first focuses on the choice of such a relevance score. Several criteria are compared through both quantitative and qualitative analyses. To have a fair comparison, all tested criteria are compared on classic hyperspectral data sets using the same optimization heuristics: an incremental one to assess the impact of the number of selected bands and a stochastic one to obtain several possible good band subsets and to derive band importance measures out of intermediate good band subsets. Last, a specific approach is proposed to cope with the optimization of bandwidth. It consists in building a hierarchy of groups of adjacent bands, according to a score to decide which adjacent bands must be merged, before band selection is performed at the different levels of this hierarchy.
- band selection
- spectral optimization
- land cover
High-dimensional remote sensing imagery, such as hyperspectral (HS) imagery, generates huge data volumes, consisting of hundreds of contiguous spectral bands. Several difficulties are caused by this high dimensionality. First, the Hughes phenomenon can occur when classifying such data, even though modern classifiers such as support vector machines (SVM) and random forests (RF) are less sensitive to it [2, 3] except when very few training data are available. Second, long computing times are required to process such high-dimensional data. Third, storing such data requires large storage capacity. Last, high-dimensional imagery sometimes has to be displayed, while human vision is limited to three colours [5, 6].
Hyperspectral data consist of hundreds of contiguous spectral bands, but most of these adjacent bands are highly correlated to each other. Thus a subset of well-chosen bands is generally sufficient for a specific problem. This makes it possible to design adapted superspectral sensors dedicated to such specific land cover classification. Spectral optimization (SO) or optimal band extraction (BE) consists in identifying the most relevant spectral band subsets for such specific applications. Spectral optimization is a specific form of dimensionality reduction (DR). DR aims at reducing data volume while minimizing the loss of useful information and especially of class separability. Dimensionality reduction techniques can be separated into feature extraction (FE) and feature selection (FS) categories.
FE consists in reformulating and summing up original information. Principal component analysis (PCA), minimum noise fraction (MNF), independent component analysis (ICA) and linear discriminant analysis (LDA) are examples of state-of-the-art feature extraction techniques. Conversely, FS selects the most relevant features for a problem. When applied to HS data, it is named band selection (BS); compared to FE, it preserves the physical meaning of the selected bands. For instance, in spectroscopy, FS has sometimes been performed by specialists identifying specific absorption bands or spectrum behaviour corresponding to a material, and this knowledge has then been used in expert systems (e.g.  for specific minerals,  for asbestos,  for asphalt or  for urban materials). In the end, SO lies at the intersection of FS and FE, as it aims at optimizing both band positions along the spectrum (FS) and band widths (FE).
This study aims at defining a SO strategy to design superspectral sensors dedicated to specific land cover classification problems. SO and FS are optimization problems involving both a metric to optimize (that is to say a score measuring the relevance of band subsets) and an optimization strategy. This study first focuses on the choice of a FS relevance score suitable for generic optimization heuristics. Both classification performance and selection stability will be considered. As an intermediate result, band importance profiles are derived, providing hints about the relevance of the different parts of the spectrum. Once the FS criterion is chosen, this chapter addresses the optimization of bandwidth, applying FS within a hierarchy of groups of adjacent bands.
2. FS: requirements and state of the art
In the state of the art, FS is often a first step in a specific classification workflow, while the context of this work is the design of superspectral sensors dedicated to specific land cover classification problems. Thus the selected band subset must be as efficient as possible for most classifiers and not only for the FS criterion used. Therefore, to assess the quality of FS criteria, their ability to select feature subsets that discriminate between classes (that is to say their classification performance) must be considered independently from any single classifier. Furthermore, the stability of the proposed solutions also has to be considered. Last but not least, in this sensor design context, there are constraints on the maximum number of bands to select. To sum up, a good FS criterion for sensor design has to be parsimonious, making it possible to select stable band subsets that are discriminant for most classifiers. Thus, for a fair analysis, FS criteria must be compared for the same selected band subset size, and results must be evaluated according to different classifiers. Besides, computing time was not considered an important criterion in this specific context of sensor design, where FS is not a preprocessing step in a classification workflow.
Thus, this study focuses on the comparison of several FS criteria (presented in Section 2.1) for supervised classification problems (that is to say when classes and their ground truth are taken into account). To have a fair comparison, all these criteria will be optimized using the same generic optimization algorithms. Generic optimization heuristics were chosen, in the context of sensor design, since such methods make it easy to control the number of bands to select and to add constraints within the band extraction process, as in the second part of the study. The use of generic optimization methods necessarily excludes from the comparison feature ranking criteria (such as ReliefF [11, 12]) and FS methods where the score and the optimization method are strongly linked, for instance, SVM-RFE. All criteria will be tested on several classic hyperspectral data sets.
2.1 FS: state of the art
Even though hybrid approaches involving several criteria exist [14, 15], FS methods and criteria are often differentiated between ‘filter’ (independent from any classifier), ‘wrapper’ (related to the classification performance of a classifier) and ‘embedded’ (related to the quality of classification models estimated by a classifier, but not directly to classification accuracy). It is also possible to distinguish supervised and unsupervised ones, especially for filters, that is to say whether a notion of classes is taken into account or not. All approaches mentioned below are summed up in Tables 1 and 2. Nevertheless, it must be kept in mind that hybrid approaches involving several criteria belonging to these different FS criteria categories often exist, as, for instance, in  or , where features are selected based on a wrapper method, respectively, guided or associated with filter criteria (mutual information between selected bands and between the ground truth).
Filter methods compute relevance scores independently from any classifier. Some filter methods are ranking approaches: features are ranked according to an individual score of importance. Such individual feature scores can be supervised or unsupervised. For instance, the well-known ReliefF score [11, 12] or scores measuring the correlation between features and ground truth  are supervised ones. However, such individual feature importance measures do not take into account the correlations between selected features. Thus, a feature subset composed of the best features according to such measures is not necessarily an optimal solution, in the sense that it is not parsimonious.
Other ranking methods are unsupervised: they use importance measures calculated from a feature extraction technique. For instance,  ranks bands according to a score of importance calculated from PCA decomposition. Correlated bands are then removed according to a divergence measure. Du et al. and Hasanlou et al. [17, 19] follow a similar approach using ICA instead of PCA. Other unsupervised approaches also use the results of a PCA, selecting the features most similar to the first principal components [48, 49].
Other filter approaches associate a score to feature subsets. In unsupervised case,  also performs a constrained energy minimization to select a set of bands having minimum correlation between each other. In supervised cases, separability measures such as Bhattacharyya or Jeffries-Matusita (JM) distances can be used in order to identify the best feature subsets for separating classes [25, 26, 27, 50]. Other separability measures based on the minimum estimated abundance covariance (related to the ability of the band subset to correctly unmix several sources) have also been used as in .
High-order statistics from information theory such as divergence, entropy and mutual information can also be used to select the best feature sets achieving the minimum redundancy and the maximum relevance, either in unsupervised situations as in [6, 38] or in supervised ones as in [14, 30, 31, 51, 52]. Martínez-Usó et al.  first clusters ‘correlated’ features and then selects the most representative feature of each group. Le Moan et al.  selects the three bands belonging to three red, green and blue spectral domains so that their correlation is minimized. In supervised cases, [14, 30, 31, 51, 53] select the set of bands that are more correlated to the ground truth and less correlated to each other. The most difficult is then to balance both criteria.
The orthogonal projection divergence  is another way to measure correlation between bands by the extent to which it is possible to express one band as a linear combination of the already selected bands. Last,  uses support vector clustering applied to features in order to identify the most relevant ones.
To sum it up, there are many various filter criteria corresponding to different approaches. Ranking methods according to an individual feature importance score remain limited, especially the ones only based on a supervised score, since they are not aware of the dependencies between selected features. Filter approaches associating a score to feature subsets are more interesting. Supervised and unsupervised approaches can be distinguished. Unsupervised approaches are interesting, but in a classification context, there is still a risk to select features that will not be all useful for the classification problem.
The wrapper relevance score associated with a feature set is simply its classification performance (measured by an accuracy score). Examples of such scores can be found in [14, 15, 24, 54] using SVM classifier, [36, 39] maximum likelihood classifier,  random forests,  spectral angle mapper or  a target detection algorithm.
Embedded FS methods are also related to a classifier, but feature selection is performed using a feature relevance score different from a classification accuracy. Most of the time, embedded approaches directly select features during the classifier training step. Several types of embedded FS approaches can be distinguished .
Some embedded approaches are regularization based models. A classifier is trained according to an objective function where a fit-to-data term that minimizes the classification error is associated with a regularization function, penalizing models when the number of features increases or forcing model coefficients associated with some features to be small. Features with the coefficients close to 0 are eliminated. Examples of some approaches can be found in [29, 56, 57]. They also include the L1-SVM  and the least absolute shrinkage and selection operator (LASSO) FS [41, 57] approaches. Such approaches are fast and efficient. However, it can be more difficult to adapt them, for instance, to take into account additional constraints, since FS criterion and optimization method are linked.
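As an illustration, the generic form of such a regularized objective can be written as follows. The notation (weights $w$, bias $b$, loss $\ell$, trade-off parameter $\lambda$, training pairs $(x_i, y_i)$ with $d$ bands) is assumed here for the sketch and is not taken from the cited works; the $\ell_1$ penalty shown is the one used by LASSO and L1-SVM.

```latex
\min_{w,\,b}\; \sum_{i=1}^{n} \ell\bigl(y_i,\; w^{\top}x_i + b\bigr) \;+\; \lambda \lVert w \rVert_1,
\qquad \lVert w \rVert_1 = \sum_{k=1}^{d} |w_k|
```

Bands whose coefficient $w_k$ is driven to zero are discarded, and increasing $\lambda$ generally leaves fewer bands selected, which is why the number of retained bands is controlled only indirectly in this family of methods.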
Other embedded approaches use the built-in mechanism for feature selection in the training algorithm of some classifiers. For instance, random forests (RF)  and decision trees can be considered as performing an embedded feature selection, since, when splitting a tree node, only the most discriminative feature according to Gini impurity criterion is used among a feature subset randomly selected . This FS eliminates the less useful features, but there is no guarantee to select a parsimonious feature subset: redundant features can be selected.
Some embedded approaches also provide feature importance measures, such as random forest classifier . It is processed on samples left out of the bootstrapped samples and is based on the permutation decrease accuracy: the importance of a feature is estimated by randomly permuting all its values in these samples for each tree, as the difference averaged over all the trees between prediction accuracy before and after permuting this feature. Other embedded approaches providing feature importance use them in a pruning process that first uses all features to train a model, before progressively eliminating some of them while maintaining model performance. SVM-RFE  is a well-known embedded approach where the importance of the different features in a SVM model is considered. Such approach has been extended to multiple kernel SVM by , associating a different kernel to each feature, estimating the model and then using the weights associated with these kernels as feature importance measures.
Other approaches do not calculate a score of importance for each feature individually, but evaluate the relevance of sets of features. Such scores often measure the generalization performance of the obtained model. Thus, the FS is not directly performed during the training step, but uses an intermediate result of the training step. For instance, [46, 59] use the generalization performance, e.g. the margin of a SVM classifier, as a separability measure to rank sets of features. The out-of-bag error rate of a random forest  can also be considered as such score. These scores are calculated for feature subsets and measure the generalization performance of the model provided by the classifier. Thus, they can be considered as an alternative between filter separability measures and wrapper scores.
Embedded approaches can also be extended to unmixing methods, as, for instance, in  where band selection is integrated into an endmember and abundance determination algorithm by incorporating band weights and a band sparsity term into an objective function.
2.1.4 Optimization methods
Another issue for band selection is to determine the best set of features corresponding to a given criterion. An exhaustive search is often impossible, especially for wrapper techniques. Therefore, heuristics have been proposed to find a near-optimal solution without visiting the entire solution space. Optimization methods can be either specific to a FS method (as for most embedded ones) or generic. Generic optimization methods can be divided into two groups: sequential and stochastic.
Several incremental search strategies have been detailed in , including the sequential forward search (SFS), which starts from one feature and incrementally adds the feature yielding the best score, and, conversely, the sequential backward search (SBS), which starts from all possible features and incrementally removes the worst feature. Variants such as sequential forward floating search (SFFS) or sequential backward floating search (SBFS) are proposed in . Serpico and Bruzzone  also propose a variant of these methods called steepest ascent (SA) algorithms.
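The forward variant can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the cited works: the function name `sequential_forward_search`, the `relevance` table and `toy_score` are hypothetical, and any subset-level FS criterion could be passed as `score`.

```python
from typing import Callable, List, Sequence, Tuple

Subset = Tuple[int, ...]


def sequential_forward_search(
    bands: Sequence[int],
    score: Callable[[Subset], float],
    n_select: int,
) -> List[Tuple[Subset, float]]:
    """Greedy SFS: return the selected subset and its score for sizes 1..n_select."""
    selected: List[int] = []
    remaining = list(bands)
    history: List[Tuple[Subset, float]] = []
    while remaining and len(selected) < n_select:
        # Try every remaining band and keep the one giving the best subset score.
        best_band = max(remaining, key=lambda b: score(tuple(selected + [b])))
        selected.append(best_band)
        remaining.remove(best_band)
        history.append((tuple(selected), score(tuple(selected))))
    return history


# Hypothetical toy criterion: per-band relevance minus a penalty for adjacent
# (assumed redundant) bands selected together.
relevance = {0: 0.9, 1: 0.8, 2: 0.1, 3: 0.7, 4: 0.2}


def toy_score(subset: Subset) -> float:
    gain = sum(relevance[b] for b in subset)
    penalty = sum(0.5 for a in subset for b in subset if b - a == 1)
    return gain - penalty


history = sequential_forward_search(range(5), toy_score, n_select=3)
for subset, value in history:
    print(subset, round(value, 3))
```

Because bands already selected are never reconsidered, the subsets returned for successive sizes are nested, which is the limitation that the floating variants address.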
Several stochastic optimization strategies have been used for feature selection, including genetic algorithms [14, 15, 46, 54, 55], particle swarm optimization (PSO) [24, 47], clonal selection , ant colony  or even simulated annealing [27, 61].
In the specific case of hyperspectral data, adjacent bands are often very correlated to each other. Thus, hyperspectral band selection faces the problem of the clustering of the spectral bands. Band clustering/grouping has sometimes been performed in association with individual band selection. For instance,  first groups adjacent bands according to conditional mutual information and then performs band selection with the constraint that only one band can be selected per cluster. Su et al.  performs band clustering applying k-means to band correlation matrix and then iteratively removes the too inhomogeneous clusters and the bands that are too different from the representative of their cluster. Martínez-Usó et al.  first clusters ‘correlated’ features and then selects the most representative feature of each group, according to the mutual information. Chang et al.  performs band clustering using a more global criterion taking specifically into account the existence of several classes. Simulated annealing is used to maximise a cost function defined as the sum, over all clusters and over all classes, of correlation coefficients between bands belonging to a same cluster.
3. Which band selection criterion?
This study is a comparison of FS criteria that can be optimized using generic optimization heuristics, thus excluding several specific embedded or ranking approaches. The following FS criteria (listed in Table 3) were evaluated.
3.1 Compared FS criteria
3.1.1 Filter FS criteria
Filter criteria are independent from any classifier. Only scores assessing the relevance of feature subsets were considered, excluding filter FS methods ranking features independently according to an individual feature score (e.g. ReliefF).
Separability measures are used to identify the feature subsets achieving the best class distinction. Fisher, Bhattacharyya and Jeffries-Matusita measures [25, 26, 27, 50] are such scores. They were used assuming Gaussian class models. Let $\mu_c$ and $\Sigma_c$ be the mean vector and covariance matrix of the spectral distribution of class $c$. Fisher separability between classes $i$ and $j$ is defined in equation (1)
Bhattacharyya separability between classes $i$ and $j$ is defined by equation 2.
As Bhattacharyya and Fisher separability measures are defined for binary problems, their means over all possible pairs of classes were used here as FS criteria. To sum up, the following separability measures were used as FS criteria:
Mean Fisher (fisher) separability measures calculated over all pairs of classes (equation 3):
Mean Bhattacharyya (Bdist) separability measures calculated over all pairs of classes (equation 4):
Jeffries-Matusita measure (jm) defined in equation 5:
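For reference, the standard textbook forms of these measures under the Gaussian assumption are recalled below. This is a sketch under stated assumptions: the notation ($\mu_i$, $\Sigma_i$ for the mean vector and covariance matrix of class $i$, restricted to the selected bands, and $C$ for the number of classes) is introduced here, and the chapter's own equations 2, 4 and 5 may differ in normalisation or in the way pairwise values are aggregated.

```latex
% Bhattacharyya distance between Gaussian classes i and j
B_{ij} = \frac{1}{8}\,(\mu_i-\mu_j)^{\top}
         \left(\frac{\Sigma_i+\Sigma_j}{2}\right)^{-1}(\mu_i-\mu_j)
       + \frac{1}{2}\,\ln\frac{\left|\tfrac{1}{2}(\Sigma_i+\Sigma_j)\right|}
                              {\sqrt{|\Sigma_i|\,|\Sigma_j|}}

% Mean over all pairs of classes
\overline{B} = \frac{2}{C(C-1)} \sum_{i<j} B_{ij}

% A common pairwise form of the Jeffries-Matusita distance
JM_{ij} = 2\left(1 - e^{-B_{ij}}\right)
```

Unlike $B_{ij}$, which is unbounded, $JM_{ij}$ saturates (at 2 in this convention; some authors use $\sqrt{2(1-e^{-B_{ij}})}$ instead), so well-separated class pairs cannot dominate an average over pairs.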
3.1.1.2 Mutual information
Another FS criterion based on high-order statistics from information theory, e.g. mutual information (MI), was adapted from  and tested: it took into account both feature-class dependencies and between feature correlations. It is defined in equation 6.
for a feature subset $S$, where $I(f; C)$ is the MI between feature $f$ and classes, $I(f_i; f_j)$ is the MI between features $f_i$ and $f_j$ and $H(f)$ is the entropy of feature $f$. It is referred to as mi in Table 3.
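The building blocks of such a criterion (entropy and mutual information) can be estimated from discretised values as sketched below. This is only an illustration: the function names are hypothetical, band values are assumed to have been quantised into a small number of bins beforehand, and the way these quantities are combined in the chapter's equation 6 is not reproduced here.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of a discrete sample."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def mutual_information(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """Plug-in estimate of I(X;Y) (in nats) from paired discrete samples."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    # p(a,b) * log( p(a,b) / (p(a) p(b)) ), written with raw counts.
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


# Toy example: a quantised band that determines the class, and one that does not.
labels = [0, 0, 1, 1]
informative_band = [5, 5, 9, 9]
uninformative_band = [5, 9, 5, 9]
print(mutual_information(informative_band, labels))    # equals entropy(labels)
print(mutual_information(uninformative_band, labels))  # zero: independent
```

With few training samples, such plug-in estimates are sensitive to the number of bins, so the quantisation is a design choice that would need to be fixed identically for all bands.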
3.1.2 Wrapper and embedded FS criteria
Several classifiers were used to define wrapper scores related to their classification performances achieved using feature subsets. Only fast classifiers which did not require an optimization of hyper-parameters were used:
Maximum likelihood classification (ML): assuming a Gaussian model for the spectral distribution of classes, mean vectors and covariance matrices are estimated for each class during the training step. Each new sample is then labelled by its most probable class according to the model.
SAM and SID: these classifiers are specific to hyperspectral data. The spectral angle mapper (SAM) consists in classifying a sample according to the angle between its spectrum and reference spectra. The spectral information divergence  comes from dissimilarity measures between statistical distributions and more precisely the Kullback-Leibler measure.
Support Vector Machine (SVM): SVM has been intensively used to classify remote sensing data and especially hyperspectral data [2, 15, 64]. Training a SVM classifier aims at estimating the best frontiers between classes. Only a one-against-one linear SVM was used here, since it is fast and avoids the optimization of hyper-parameters required by other kernels. Besides, using a linear SVM introduces a constraint to select bands achieving a linear separation between classes.
Decision trees (DT) .
Random forests (RF)  is a modification of bagging applied with decision trees. It can achieve a classification accuracy comparable to boosting  or SVM . It does not require assumptions on the distribution of the data, which is interesting when different types or scales of input features are used. It was successfully applied to remote sensing data such as multispectral data, hyperspectral data or multisource data. This ensemble classifier is a combination of tree predictors built from multiple bootstrapped training samples. For each node of a tree, a subset of features is randomly selected. Then, the best feature with regard to Gini impurity measure is used for node splitting. For classification, each tree gives a unit vote for the most popular class at each input instance, and the final label is determined by a majority vote of all trees.
These different classifiers were chosen because their underlying principles are different from each other. SAM, SID and ML rely on class models, while the others use inter-class separation models. RF can model even complex class frontiers while remaining quite fast, while linear SVM selects features achieving a separation between classes that is as linear as possible.
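The two hyperspectral-specific distances mentioned above can be sketched as follows, using their usual definitions (angle between spectra for SAM, symmetric Kullback-Leibler divergence between normalised spectra for SID). The function names, the reference spectra and the nearest-reference `classify` helper are illustrative assumptions; spectra are assumed strictly positive for SID.

```python
import math
from typing import Callable, Dict, Sequence

Spectrum = Sequence[float]


def spectral_angle(x: Spectrum, ref: Spectrum) -> float:
    """Angle (radians) between a spectrum and a reference spectrum."""
    dot = sum(a * b for a, b in zip(x, ref))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in ref))
    # Clamp to [-1, 1] to guard against rounding before taking the arccosine.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def spectral_information_divergence(x: Spectrum, ref: Spectrum) -> float:
    """Symmetric Kullback-Leibler divergence between normalised spectra."""
    sum_x, sum_ref = sum(x), sum(ref)
    p = [a / sum_x for a in x]
    q = [b / sum_ref for b in ref]
    return sum(
        pi * math.log(pi / qi) + qi * math.log(qi / pi) for pi, qi in zip(p, q)
    )


def classify(
    x: Spectrum,
    references: Dict[str, Spectrum],
    distance: Callable[[Spectrum, Spectrum], float],
) -> str:
    """Label a sample with the class of the closest reference spectrum."""
    return min(references, key=lambda label: distance(x, references[label]))


# Hypothetical three-band reference spectra.
references = {"vegetation": [1.0, 2.0, 3.0], "road": [3.0, 2.0, 1.0]}
sample = [2.0, 4.0, 6.5]
print(classify(sample, references, spectral_angle))
print(classify(sample, references, spectral_information_divergence))
```

Both distances are insensitive to a global scaling of the spectrum, which is why the sample above is matched to the reference with the same shape despite its higher overall level.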
Wrapper FS scores measuring classification performance were considered:
Kappa coefficient: for all of these classifiers, the Kappa coefficient has been used as a FS score.
Classification confidence score: in addition, another FS score taking into account the classification confidence was also used. Indeed, most classifiers provide classification confidence indices and a class membership measuring the degree to which the sample belongs to the different classes according to the classifier. Let be a set of labelled ground truth samples and their associated true label . Let be the class membership measuring the probability for x to belong to class . A possible feature selection score taking into account class membership measures and thus classification confidence can be defined by equation 7:
with if and otherwise and the label given to x by the classifier. Such a score measures both the ability to correctly classify the test samples for a given feature set and the separability between classes. Indeed, the more samples are correctly classified, the more the score increases. The more confident the classifier is for correctly classified samples, the more the score increases. The more confident the classifier is for mislabelled samples, the more the score decreases. This confidence score was used in our experiments only for RF and linear SVM classifiers.
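One plausible reading of this description can be sketched as follows. It is an assumption, not the chapter's equation 7: here the membership of the predicted class is added for a correctly classified sample and subtracted for a misclassified one, and the total is averaged over the samples. The function name and the dictionary-based membership format are hypothetical.

```python
from typing import Dict, Hashable, Sequence


def confidence_score(
    true_labels: Sequence[Hashable],
    memberships: Sequence[Dict[Hashable, float]],
) -> float:
    """Mean signed confidence: + membership if correct, - membership if wrong."""
    total = 0.0
    for y, m in zip(true_labels, memberships):
        predicted = max(m, key=m.get)  # label given by the classifier
        sign = 1.0 if predicted == y else -1.0
        total += sign * m[predicted]
    return total / len(true_labels)


# Toy example: two confident correct decisions and one confident error.
true_labels = ["a", "b", "a"]
memberships = [
    {"a": 0.75, "b": 0.25},
    {"a": 0.25, "b": 0.75},
    {"a": 0.25, "b": 0.75},
]
print(confidence_score(true_labels, memberships))
```

In this reading, a confident error costs as much as a confident correct decision gains, which matches the three monotonic behaviours listed in the text; other weightings (for instance the margin between the two highest memberships) would satisfy them as well.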
Embedded FS criteria. The two following criteria measuring the generalization performance of two classifiers were also tested. They are not pure embedded but can be considered as intermediate between wrapper and embedded. However, differentiating them from previous common wrapper scores, they are here referred to as ‘embedded’ in the sense that they assess the classification performance directly using a measure calculated directly while training the classifier and not after an evaluation of the model on a test data set. These scores are:
The margin of a linear 1-vs-1 SVM classifier (without parameter optimization) (svm.lin.marg), that is to say the distance between the class frontier and its support vectors.
The out-of-bag error  of a RF classifier (rf.oob). The out-of-bag samples are left out of the bootstrapped samples when training the RF.
3.2 Assessment approach
It must be kept in mind that this study is a comparison of FS criteria and not of optimization methods. Thus all criteria were optimized using the same optimization heuristics on the same classic hyperspectral data sets (Section 3.3). The proposed workflow (Figure 1) includes two steps. The suitable number of bands to select is first estimated for each data set, using an incremental FS optimization algorithm called sequential forward floating search (SFFS). Then, the core comparison of FS criteria is performed: the criteria are optimized to select this fixed number of bands using a stochastic FS optimization algorithm. A genetic algorithm (GA) (Section 3.2.2) was used, since it proved to be efficient and generic enough to be used for all tested criteria. Besides, it can provide valuable intermediate results (Section 3.2.4) to assess FS stability. GA was launched several times to select this fixed number of bands for all tested FS criteria. It thus provided several possible band subset solutions. Indeed, performing FS several times was also a way to benefit from the stochastic nature of GA and thus to explore more band subset configurations. These different solutions were then quantitatively evaluated, according to different classifiers, in order to draw conclusions about their relevance that are largely independent of any given classifier (Section 3.2.3). Besides, to perform a qualitative analysis of the obtained solutions (and especially their stability), band importance measures were derived from intermediate results provided by this stochastic FS (Section 3.2.2). These measures made it possible to visually identify the parts of the spectrum considered as important by the FS criterion and to qualitatively analyse the stability of the proposed band subset solutions according to the FS criterion.
In practice, for each FS criterion, the GA feature selection process was launched five times on five limited data sets (100 training and 500 (300 for Indian Pines) testing samples) randomly selected with replacement among the whole data set. To sum it up, at the end, 25 ‘optimal’ feature subset solutions were thus obtained for each criterion and had to be evaluated (Figure 2).
3.2.1 Optimal band subset size using a sequential FS algorithm
Intermediate results of a sequential FS algorithm were used to identify how many bands must be selected. In our experiments, the sequential forward floating search (SFFS) algorithm was used .
This optimization method provides useful intermediate results. Indeed, it selects the ‘best’ sets of bands for different band subset sizes, starting from 1. Thus, it provides for each size both the selected band subset (which can then be evaluated according to the performance of several classifiers) and the value reached by the FS score. Therefore, it makes it possible to observe the evolution of the FS score and of classification quality with the number of selected bands, and then to decide how many bands are necessary to obtain suitable results. Other sequential methods such as SVM-RFE  or SFS could also provide such information, but contrary to them, SFFS has the advantage of questioning at each step the set of bands selected at the previous step, which allows modifications of the already selected band subset.
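The floating mechanism can be sketched as follows. This is a simplified illustration of the usual SFFS scheme, not the implementation used in the experiments: after each forward addition, the least useful band is removed as long as this yields a subset strictly better than the best one recorded so far for that size. The function name and the small score table are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Subset = Tuple[int, ...]


def sffs(
    bands: Sequence[int],
    score: Callable[[Subset], float],
    max_size: int,
) -> Dict[int, Tuple[Subset, float]]:
    """Return, for each subset size, the best (subset, score) found by SFFS."""
    best: Dict[int, Tuple[Subset, float]] = {}
    current: List[int] = []
    while len(current) < max_size:
        candidates = [b for b in bands if b not in current]
        if not candidates:
            break
        # Forward step: add the band giving the best subset score.
        added = max(candidates, key=lambda b: score(tuple(current + [b])))
        current.append(added)
        value = score(tuple(current))
        if len(current) not in best or value > best[len(current)][1]:
            best[len(current)] = (tuple(current), value)
        # Floating backward steps: question the bands selected so far.
        while len(current) > 2:
            worst = max(
                current, key=lambda b: score(tuple(x for x in current if x != b))
            )
            reduced = [x for x in current if x != worst]
            reduced_value = score(tuple(reduced))
            if reduced_value > best[len(reduced)][1]:
                current = reduced
                best[len(reduced)] = (tuple(reduced), reduced_value)
            else:
                break
    return best


# Hypothetical criterion where the best pair does not contain the best single band.
table = {
    frozenset({0}): 0.6, frozenset({1}): 0.5, frozenset({2}): 0.5,
    frozenset({0, 1}): 0.7, frozenset({0, 2}): 0.7, frozenset({1, 2}): 0.9,
    frozenset({0, 1, 2}): 1.0,
}
best = sffs([0, 1, 2], lambda s: table[frozenset(s)], max_size=3)
for size in sorted(best):
    print(size, best[size])
```

On this toy criterion, a purely forward search would keep the pair containing band 0, whereas the backward step replaces it with the better pair (1, 2), illustrating how previously selected bands are questioned.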
3.2.2 Band subset solutions using a genetic algorithm
Genetic algorithms (GA) are a family of stochastic optimization heuristics simulating evolution mechanisms on a population of individuals. Each individual is associated with a score measuring its adaptation and its aptitude to stay alive. In the FS context, each individual is a feature subset and the score is the FS score.
Algorithm 1 Genetic algorithm.
The aim is to select fewer than n bands among a band set B. J is the FS score to optimize.
Initialization (t = 0): randomly generate a population P(0) of N individuals, i.e. sets of bands.
repeat
  t ← t+1
  Calculate the score J of each band subset of the current population.
  Keep only the N' (N' < N) best band subsets of the current population. Let P' be this remaining population.
  Generate a new population P(t) of N individuals from P':
  for all new individuals do
    Randomly select 2 parents among P'.
    Obtain a new individual by randomly crossing these 2 parents.
    Random mutations occur (randomly replacing a selected band by another one) in order to avoid staying in a local optimum.
  end for
until the stopping condition is met
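The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: subsets have a fixed size `n_select`, crossover draws the child's bands from the union of its two parents, the stopping condition is a fixed number of generations, and all parameter names and default values are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Subset = Tuple[int, ...]


def genetic_band_selection(
    bands: Sequence[int],
    score: Callable[[Subset], float],
    n_select: int,
    pop_size: int = 30,
    n_keep: int = 10,
    n_generations: int = 40,
    mutation_rate: float = 0.2,
    seed: int = 0,
) -> Tuple[Subset, List[List[Subset]]]:
    """Return the best subset found and the subsets kept at each generation."""
    rng = random.Random(seed)
    band_list = list(bands)
    # Initialization: random population of band subsets.
    population = [
        tuple(sorted(rng.sample(band_list, n_select))) for _ in range(pop_size)
    ]
    best = max(population, key=score)
    kept_per_generation: List[List[Subset]] = []
    for _ in range(n_generations):
        # Keep only the best subsets of the current population.
        survivors = sorted(population, key=score, reverse=True)[:n_keep]
        kept_per_generation.append(survivors)
        if score(survivors[0]) > score(best):
            best = survivors[0]
        # Generate a new population from the survivors.
        population = []
        for _ in range(pop_size):
            parent_a, parent_b = rng.sample(survivors, 2)
            pool = sorted(set(parent_a) | set(parent_b))
            child = rng.sample(pool, n_select)  # random crossing of 2 parents
            if rng.random() < mutation_rate:
                unused = [b for b in band_list if b not in child]
                if unused:  # mutation: replace one selected band by another
                    child[rng.randrange(n_select)] = rng.choice(unused)
            population.append(tuple(sorted(child)))
    return best, kept_per_generation


# Hypothetical additive toy criterion over 8 bands.
weights = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6]


def toy_score(subset: Subset) -> float:
    return sum(weights[b] for b in subset)


best, kept = genetic_band_selection(range(8), toy_score, n_select=3)
print(best, toy_score(best))
```

The function also returns the subsets kept at every generation, since these intermediate populations are what the band importance measures of the next subsection are derived from.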
3.2.2.1 GA-derived importance measures
The GA approach has some advantages for our problem. Usually, only the best solution is kept, although GA has visited many other candidates. Many of them have scores quite similar to the score of the best solution: they are almost as good as the final solution. Therefore, these intermediate results can be used to determine which bands are often selected in the solutions (see Figure 3) of these intermediate good band subset populations. Thus, an individual band importance score (defined in equation 8) is calculated for each band, measuring the frequency with which it has been selected by GA among the different best sets of bands obtained for all generations.
To increase robustness, GA can be launched several times (i.e. so that different initializations and mutations occur) and over several training/testing sets randomly extracted from the whole data set. The proposed importance score is calculated for each of these results. Finally, the mean of these scores is considered for each band, giving the importance associated with this band.
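A possible way to compute such an importance profile is sketched below. It assumes that importance is the fraction of retained subsets containing the band, which may differ from the exact normalisation of the chapter's equation 8; the function names and the input format (one list of retained subsets per GA launch) are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Subset = Tuple[int, ...]


def band_importance(subsets: Sequence[Subset], n_bands: int) -> List[float]:
    """Fraction of the retained subsets in which each band appears."""
    counts = Counter(b for subset in subsets for b in subset)
    return [counts[b] / len(subsets) for b in range(n_bands)]


def mean_importance(runs: Sequence[Sequence[Subset]], n_bands: int) -> List[float]:
    """Average the per-run importance profiles over several GA launches."""
    profiles = [band_importance(run, n_bands) for run in runs]
    return [sum(p[b] for p in profiles) / len(profiles) for b in range(n_bands)]


# Toy example: two launches, each with two retained subsets of two bands.
run_1 = [(0, 1), (0, 2)]
run_2 = [(0, 3), (1, 3)]
print(mean_importance([run_1, run_2], n_bands=4))
```

Averaging per-launch profiles, rather than pooling all subsets, gives each launch the same weight even if the launches retained different numbers of subsets.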
3.2.3 Quantitative evaluation
In the state of the art, FS is often considered as a first step in a specific classification workflow. In this context, wrappers are considered as achieving the best classification performance for a problem while sometimes lacking generality and being too classifier dependent. However, in our superspectral sensor design context, selected band subsets must be as efficient as possible for most classifiers and not only for the FS criterion used. Therefore, selected band subsets were evaluated here considering the classification quality reached with several classifiers.
Kappa coefficient was used as classification quality measure for the following classifiers: ML, RF and 1-vs-1 SVM with a radial basis function (RBF) kernel (with optimized parameters). Note that the latter was the only one not previously involved in a tested FS criterion. Thus, RBF SVM is the only classifier that is completely independent from all tested FS criteria. In more detail, evaluation was performed and averaged on five training/testing sample sets: for each of them, classifiers were trained using 50 samples per class (in order to be in a difficult case with few training samples), and results were evaluated on all remaining ground truth samples. For each FS criterion, all selected band subsets (obtained for the several launches of the algorithm) were evaluated, and the mean Kappa coefficient was then computed over all of them (see Figure 2).
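For reference, the Kappa coefficient compares the observed agreement between predicted and true labels with the agreement expected by chance. A minimal sketch of its computation is given below; the function name is hypothetical, and the formula is the usual Cohen's Kappa.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Cohen's Kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement from the marginal class frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    # Undefined (division by zero) if both labellings contain a single class.
    return (observed - expected) / (1 - expected)


print(cohen_kappa([0, 0, 1, 1], [0, 0, 1, 1]))  # perfect agreement
print(cohen_kappa([0, 0, 1, 1], [0, 1, 0, 1]))  # agreement at chance level
```

Unlike overall accuracy, Kappa is close to zero for a classifier that merely reproduces the class proportions, which makes it better suited to comparing band subsets on data sets with unbalanced classes.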
3.2.4 Selected band stability
Another evaluation criterion of the FS criteria quality was the stability of the selected features. As explained in Section 3.2.2, band importance profiles (Figure 3) can be derived from intermediate results of a GA feature selection. As the contiguous bands in hyperspectral data are correlated, such band importance profiles should be quite regular and smooth (i.e. not too noisy). The smoothness/regularity of these profiles is thus related to the stability of the solutions obtained using a FS criterion. Furthermore, the final optimal solutions provided by the different launches of GA can also be examined. This analysis remains only qualitative.
3.3 Data sets
Three publicly available benchmark hyperspectral data sets were used for the experiments:
Pavia City Centre scene²: This first data set is a hyperspectral scene acquired by the ROSIS sensor over the city centre of Pavia with a 1.3 m spatial resolution. It is a reflectance VNIR hyperspectral image with a spectral range from 460 nm to 860 nm. Noisy bands have been discarded, and only 102 of the original 115 spectral bands have been kept. It covers an urban area (city centre). Its associated land cover ground truth consists of nine urban classes (materials and vegetation).
Indian Pines scene³: This hyperspectral scene was collected by the AVIRIS sensor over the Indian Pines test site in north-western Indiana. It is a radiance VNIR-SWIR hyperspectral image consisting of 220 spectral bands ranging from 400 to 2500 nm. Its associated ground truth consists of agricultural classes and other classes concerning perennial vegetation (forest, grass). In our experiments, only nine of the original classes were kept. The discarded classes contained fewer than 400 samples, which was considered too few for our experiments.
Salinas scene⁴: This hyperspectral scene was collected by the AVIRIS sensor over the Salinas Valley in California at a 3.7 m spatial resolution. It is an at-sensor radiance VNIR-SWIR hyperspectral image consisting of 224 spectral bands ranging from 400 to 2500 nm. Its associated ground truth consists of agricultural classes, that is to say different crop types at different growth stages.
3.4 Results and discussion
3.4.1 Optimal number of bands using SFFS
An optimal number of bands to select was identified using the SFFS incremental FS method, starting from one selected band and growing the band subset up to a maximal number of bands. This maximum was fixed to 20 in view of the superspectral sensor design application, for which the number of possible spectral bands is limited. In practice, the influence of the number of selected bands was examined both on the FS score and on the classification performance (measured by Kappa and by the F-score of the worst classified class) of a RBF SVM classifier using the best selected band subset. The optimal number of bands was chosen as the one beyond which these scores virtually no longer increase. Results obtained using several FS scores were also considered to make this decision, so that the final number of bands to select is a trade-off between several FS criteria.
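The rule "the number of bands beyond which the scores virtually no longer increase" can be made explicit as follows. This is only a sketch under stated assumptions: the tolerance `tol`, the function names and the use of the largest plateau size as the trade-off between criteria are all hypothetical choices, whereas in the study the decision was made by inspecting the curves.

```python
def plateau_size(scores, tol=0.01):
    """Smallest number of bands after which no later score exceeds it by more than tol.

    scores[k] is the quality (e.g. Kappa) reached with k + 1 selected bands.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    for k, value in enumerate(scores):
        # Largest score reachable by selecting more bands than k + 1.
        if max(scores[k:]) - value <= tol:
            return k + 1
    return len(scores)


def consensus_size(score_curves, tol=0.01):
    """Assumed trade-off across criteria: the largest plateau size over all curves."""
    return max(plateau_size(curve, tol) for curve in score_curves)
```

With one curve per FS criterion (or per quality index), `consensus_size` returns a single number of bands that is sufficient for every curve considered.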
For the Pavia data set, the influence of the number of selected bands on the FS score and on the classification performance (measured by Kappa and the F-score of the worst classified class) of a RBF SVM classifier using the best selected band subset can be seen in Figure 4. The different quality indices change little beyond five bands, except for the minimal F-score, which increases slightly up to seven bands. Similar results were obtained using several FS criteria, even though some differences exist. For instance, the quality indices increase more slowly for jm than for rf.conf in Figure 4. Thus seven bands were selected for the Pavia data set in further experiments.
The same kind of results was obtained for Salinas, and seven bands were also selected for this data set in further experiments.
For Indian Pines, the results are slightly different, as shown in Figure 5. The FS score increases quickly until seven bands are selected. Then, it remains almost constant for rf.conf but continues to increase very slightly for jm. The same phenomenon can be observed for the classification accuracies reached by a RBF SVM classifier using the selected band subsets. For the rf.conf FS criterion, a maximum is reached around 10–11 selected bands, while for jm, a plateau is reached for these values, followed by a new slight increase.
However, it must be kept in mind that this data set is more difficult than the other ones. On the one hand, it offers fewer training/testing samples (and thus an increased risk of over-fitting). On the other hand, classes are more difficult to distinguish from each other, and raw classification results (that is to say without any regularization post-processing step) remain noisy. Thus 10 bands were selected in further experiments for the Indian Pines data set.
3.4.2 Comparison of FS criteria
The GA optimization heuristic was then run to select 7 bands for Pavia, 10 bands for Indian Pines and 7 bands for Salinas. For each FS score, several feature subset solutions were proposed by the GA. Their classification quality (Kappa), averaged over all of them and obtained with several classifiers, is presented in Figure 6. At first glance, the Kappa coefficients reached using features selected according to the different FS scores are most of the time correlated across the classifiers (RBF SVM, RF and ML) used for evaluation. Indeed, if a FS score leads to the best classification for one classifier, it is generally also the best for the other classifiers. Thus the relevance of a FS score appears to be fairly independent of the classifier used at the validation step.
It can also be noticed from Figure 6 that the best FS scores lead to roughly equivalent classification quality. This is clearly visible for Pavia and to a lesser extent for Salinas. In contrast, results differ more on Indian Pines. This might be due to the fact that Indian Pines is a more difficult data set, with stronger intra-class variability and inter-class similarity, whereas Pavia is a fairly simple data set with few well-distinguished classes. These results will now be discussed for each category of FS criteria. Band importance provided by the GA will also be considered.
3.4.2.1 Comparison of wrapper criteria
It can be seen from Figure 6 that the FS scores sam.K and sid.K perform worse than the other wrapper scores. This phenomenon appears strongly for Indian Pines and Salinas and is also a slight trend for Pavia. The fact that it is more striking on the Indian Pines scene can be related to the strong intra-class variability of this data set.
The other wrapper scores relying on the Kappa coefficient as a measure of classification performance lead to roughly equivalent quantitative results. However, band importance profiles (Figures 7 and 8) provide additional information. For instance, for the Pavia data set (Figure 7), the FS score svm.lin.K tends to select the first bands of the spectrum (around band 5), even though these bands are quite noisy. The ml.K score performs very well in terms of classification performance but tends to be very sensitive to a probable atmospheric artefact, giving high importance to bands 80 to 85 and especially to band 82. This part of the spectrum corresponds to an atmospheric correction artefact, and not to a truly discriminant phenomenon. This tendency to select bands corresponding to this artefact is also observed for other FS scores.
Using classification confidence-based FS scores instead of classic classification accuracy scores tends to improve results. This trend can be observed in Figure 6 both for RF and SVM: using rf.conf instead of rf.K or using svm.lin.conf instead of svm.lin.K tends to slightly improve classification quality. Considering the band importance profiles obtained for Pavia (Figure 7), using rf.conf instead of rf.K avoids selecting the noisy bands around band 5. Band importance profiles obtained using rf.conf also seem to be slightly more regular than those obtained using rf.K, both for Pavia (Figure 7) and Indian Pines (Figure 8). Thus, using a confidence-based FS score tends to regularize feature importances and thus to stabilize feature selection.
3.4.2.2 Comparison of wrapper and embedded criteria
The classification quality reached using the two tested embedded criteria (svm.lin.marg and rf.oob) was generally lower than with the wrapper scores associated with these two classifiers. This is especially clear for svm.lin.marg, which is the worst FS score for all classifiers used at the evaluation step.
Even though it performs quite well, feature subsets selected using rf.oob generally lead to worse classification performance than the best wrapper scores, especially rf.K and rf.conf, which are also associated with random forests.
3.4.2.3 Comparison of wrapper and filter criteria
Considering classification quality (Figure 6), mutual information (mi) leads to different results for the various data sets: on the Pavia data set, feature subsets selected according to this FS score reach a classification performance as good as that of the best wrapper scores, while on the Indian Pines data set, the results are among the worst. Band importance profiles (Figures 9 and 10) obtained using mi are also very different from those obtained for the other FS scores: they tend to neglect wide parts of the spectrum. This is especially striking for the Indian Pines data set, where bands 30 to 100 are not considered important, contrary to other FS scores.
The other tested filter FS scores are separability measures. They perform very well in terms of classification quality (Figure 6): they lead to classification results as good as or better than those obtained using the best wrapper FS scores. In particular, the Jeffries-Matusita separability distance (jm) appears to be one of the best FS scores.
However, considering the band importance profiles obtained for Pavia (Figure 9), jm tends to focus strongly on a part of the spectrum (bands 80 to 85) affected by artefacts caused by atmospheric corrections. This phenomenon also occurred for bdist and fisher and, as explained above, was also observed for some wrapper FS scores.
Furthermore, band importance profiles obtained using the jm FS score seem slightly noisier or more difficult to interpret than those obtained using the best wrapper FS scores (rf.K, rf.conf).
FS score comparison. Some wrapper, embedded and filter FS scores were tested and evaluated on several data sets:
svm.lin.marg appears clearly as the worst of them, performing poorly on all data sets.
Others (sam.K, sid.K and mi) perform quite well on simple data sets but poorly on the most difficult one (Indian Pines).
Most of the others perform well, leading to good classification performance. The best FS scores are filter separability measures or wrapper FS scores. However, some slight trends can be observed:
Filter separability scores tend to lead to slightly better classification results than wrapper scores. In particular, jm often appears as the best FS score according to the quantitative analysis. However, considering band importance profiles, it tends to lead to less regular profiles and thus to less stable solutions than some wrapper scores. Besides, these scores appear to be sensitive to an atmospheric correction artefact for the Pavia data set.
Confidence-based wrapper scores taking into account classification confidence (rf.conf or svm.lin.conf) perform better than classic wrapper scores expressed as a simple classification “hard label” error rate. This trend could be observed both in quantitative (classification performance) and qualitative (band importance profiles) analyses. Indeed, taking into account classification confidence tends to regularize feature importances and provide more stable feature subsets.
In the end, the most interesting FS scores are rf.conf for wrappers and jm for filters, since they lead to the best quantitative results. rf.conf seems to provide more stable results than jm, considering its more regular band importance profile. Besides, it is more robust to some artefacts (e.g. the atmospheric correction artefact for Pavia). However, even though computing times were not discussed in this study, it must be added that feature selection using filter separability measures (such as jm) is faster than using wrapper scores such as rf.conf.
Thematic comments. Conclusions about interesting spectrum parts can be drawn using the importance profiles provided by the different FS criteria:
Optimized spectral configurations are different from one FS criterion to another. Indeed, some parts of the spectrum are identified as important by most FS criteria, but other ones correspond to a clear disagreement.
Spectrum parts considered as important can often be understood considering the spectra of classes. Indeed, they can correspond to almost constant spectrum parts located before or after a strong variation of spectra of some classes. They can also correspond to intersections between the spectra of several classes.
For the Indian Pines and Salinas scenes, no precaution was taken to handle noisy bands corresponding to the main atmospheric absorption windows. However, importance measures associated with these bands were very weak for most FS criteria (except the worst of them). Such an observation can be considered as an additional quality criterion for the tested FS scores.
Band importance profiles obtained for Indian Pines are often more difficult to analyse than for Pavia. Nevertheless, some common trends could be observed, especially in the SWIR domain, where some blobs along the spectrum are visible for most FS criteria and might correspond approximately to the locations of some spectral bands of the WorldView-3 satellite.
4. Exploring bandwidth and extracting optimal spectral bands using hierarchical band merging
The previous section was dedicated to the identification of a FS score. This score was used for band selection, that is to say to select a subset of original bands out of a hyperspectral data set (without optimizing their widths). This section focuses on band extraction and considers band subsets composed of spectral bands with different spectral widths. Indeed, optimizing spectral width is important when designing a spectral sensor: wider bands are a way to limit signal noise, but bands that are too wide can also lead to a loss of useful information.
4.1 Band grouping and band extraction: state of the art and proposed strategy
4.1.1 State of the art
Band grouping and clustering. In the specific case of hyperspectral data, adjacent bands are often strongly correlated with each other. Thus, band selection raises the question of clustering the spectral bands of a hyperspectral data set. This can be a way to limit the band selection solution space. Band clustering/grouping has sometimes been performed in association with individual band selection. For instance,  first groups adjacent bands according to conditional mutual information and then performs band selection with the constraint that only one band can be selected per cluster. Su et al.  perform band clustering by applying k-means to the band correlation matrix and then iteratively remove the clusters that are too inhomogeneous and the bands that are too different from the representative of the cluster to which they belong. Martínez-Usó et al.  first cluster ‘correlated’ features and then select the most representative feature of each group, according to mutual information. Chang et al.  perform band clustering using a more global criterion that specifically takes into account the existence of several classes: simulated annealing is used to maximise a cost function defined as the sum, over all clusters and over all classes, of the sum of correlation coefficients between bands belonging to the same cluster. Bigdeli et al. and Prasad et al. [68, 69] perform band clustering, but not for band extraction: a multiple SVM classifier is defined, training one SVM classifier per cluster. Bigdeli et al.  have compared several band clustering/grouping methods, including k-means applied to the correlation matrix and an approach considering the local minima of mutual information between adjacent bands as cluster borders. Prasad and Bruce  propose another band grouping strategy, starting from the first band of the spectrum and progressively growing it with adjacent bands until a stopping condition based on mutual information is reached.
Band extraction. Specific band grouping approaches have been proposed for spectral optimization. De Backer et al.  define spectral bands by Gaussian windows along the spectrum and propose a band extraction method that optimizes a score based on a separability criterion (Bhattacharyya error bound) using simulated annealing.  merges bands according to a criterion based on mutual information. Jensen and Solberg  merge adjacent bands by decomposing some reference spectra of several classes into piece-wise constant functions. Wiersma and Landgrebe  define optimal band subsets using an analytical model considering spectra reconstruction errors. Serpico and Moser  propose an adaptation of their steepest ascent algorithm to band extraction, also optimizing a JM separability measure. Minet et al.  apply genetic algorithms to define the most appropriate spectral bands for target detection. Last, some studies have also examined the impact of spectral resolution , without selecting an optimal band subset.
4.1.2 Proposed approach
The approach proposed in this study consists in first building a hierarchy of groups of adjacent bands. Then, band selection is performed at the different levels of this hierarchy.
Thus, it is here intended to use the hierarchy of groups of adjacent bands as a constraint for band extraction and a way to limit the number of possible combinations, contrary to some existing band extraction approaches such as  that extract optimal bands according to JM information using an adapted optimization method or  that directly use a genetic algorithm to optimize a wrapper score.
4.2 Hierarchical band merging
The first step of the proposed approach consists in building a hierarchy of groups of adjacent bands that are then merged. Even though it is here intended to be used to select an optimal band subset, this hierarchy of merged bands can also be a way to explore several band configurations with varying spectral resolution, that is to say with contiguous bands with different bandwidth.
4.2.1 Proposed algorithm
Notations. The original bands form an ordered set. The hierarchy of merged bands is a sequence of levels. Each level of this hierarchy is composed of merged bands, that is to say ordered groups of adjacent original bands.
Thus, each merged band is defined as a spectral domain, delimited by its first and last original bands.
Thus, the merged band obtained when merging two such adjacent merged bands is the spectral domain running from the beginning of the first one to the end of the second one.
A score has to be optimized during the band merging process.
The proposed hierarchical band merging approach is a bottom-up one. The algorithm is defined below:
Initialization: the first level of the hierarchy is the set of original bands (that is to say that each merged band of the first level of the hierarchy only contains one individual original band).
Band merging: create level l + 1 from level l:
Find the pair of adjacent merged bands at level l that optimizes the score when merged, and merge it to obtain level l + 1.
A table is defined to link the different merged bands at consecutive hierarchy levels:
each merged band at level l is associated with the merged band at level l + 1 that contains it.
At the end, the value of a pixel in a merged band is defined as the mean of its values over the different bands it contains.
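The bottom-up procedure described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: merged bands are represented here as inclusive `(first, last)` ranges of original band indices, the score is passed as a generic `cost` function to be minimized, and the names `build_hierarchy`, `merged_value` and `links` are hypothetical.

```python
def build_hierarchy(n_bands, cost):
    """Bottom-up merging of adjacent bands.

    cost(level) -> float scores a candidate level, given as a list of
    (first, last) inclusive ranges of original band indices; lower is better.
    Returns (levels, links), where links[l][j] is the index at level l + 1 of
    the merged band that contains band j of level l.
    """
    levels = [[(i, i) for i in range(n_bands)]]  # level 0: one band per group
    links = []
    while len(levels[-1]) > 1:
        current = levels[-1]
        best = None
        # Try every pair of adjacent merged bands and keep the cheapest merge.
        for j in range(len(current) - 1):
            merged = (current[j][0], current[j + 1][1])
            candidate = current[:j] + [merged] + current[j + 2:]
            c = cost(candidate)
            if best is None or c < best[0]:
                best = (c, j, candidate)
        _, j_star, next_level = best
        levels.append(next_level)
        # Bands after the merged pair are shifted down by one index.
        links.append([j if j <= j_star else j - 1 for j in range(len(current))])
    return levels, links


def merged_value(pixel, band):
    """Value of a pixel in a merged band: mean over the original bands it spans."""
    first, last = band
    return sum(pixel[first:last + 1]) / (last - first + 1)
```

Any of the band merging criteria can be plugged in as `cost`, provided it is expressed as a quantity to minimize for a whole candidate level.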
4.2.2 Band merging criteria
Several optimization scores were examined. (In the algorithm described in Section 4.2.1, this score is to be minimized.) They can be either supervised or unsupervised, depending on whether classes are considered or not at this step.
4.2.2.1 Correlation between bands
Between-band correlation (either the classic normalized correlation coefficient or mutual information) (see Figure 11) measures the dependence between bands. A first band merging criterion therefore merges adjacent bands according to how strongly they are correlated with each other, in order to obtain consistent groups of adjacent correlated bands.
Such measure inspired from  can be defined by the next function in equation 9 (intended to be minimized):
where is the correlation score between bands and .
4.2.2.2 Spectra approximation error
Band merging can also use the method as described in  to decompose some reference spectra of several classes into piece-wise constant functions (Figure 12). Adjacent bands are then merged trying to minimize the reconstruction error between the original and the piece-wise constant reconstructed spectra.
Such measure is defined by the next function (see equation 10) for a set of spectra:
where denotes the mean of spectra over spectral domain .
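A possible reading of this criterion is sketched below: each reference spectrum is replaced, over every merged band, by its mean over that band, and the reconstruction error is accumulated over all bands and all spectra. The use of a squared error and the function name `approximation_error` are assumptions made for this illustration; merged bands are represented as inclusive `(first, last)` ranges of original band indices.

```python
def approximation_error(spectra, level):
    """Sum of squared errors between each spectrum and its piece-wise constant
    approximation, where every merged band (first, last) is replaced by the
    mean of the spectrum over that band."""
    total = 0.0
    for spectrum in spectra:
        for first, last in level:
            segment = spectrum[first:last + 1]
            mean = sum(segment) / len(segment)
            total += sum((value - mean) ** 2 for value in segment)
    return total
```

A level made only of individual original bands has a zero error, and the error can only grow or stay equal as bands are merged, which is why this quantity is minimized when choosing the next pair to merge.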
Another criterion to merge adjacent bands is their contribution to the separability between classes. Possible separability measures are the Bhattacharyya distance (B-distance) or the Jeffries-Matusita distance [25, 50], already used as a FS score in Section 3.
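For reference, under the usual assumption that classes i and j follow Gaussian distributions with means \mu_i, \mu_j and covariance matrices \Sigma_i, \Sigma_j, these two measures have the standard closed forms below. The symbols are introduced here for illustration only, and the exact variant used in the study may differ (some authors define the Jeffries-Matusita distance as the square root of the expression given here).

```latex
B_{ij} = \frac{1}{8} (\mu_i - \mu_j)^{T}
         \left( \frac{\Sigma_i + \Sigma_j}{2} \right)^{-1}
         (\mu_i - \mu_j)
       + \frac{1}{2} \ln
         \frac{\left| \frac{\Sigma_i + \Sigma_j}{2} \right|}
              {\sqrt{|\Sigma_i| \, |\Sigma_j|}}

JM_{ij} = 2 \left( 1 - e^{-B_{ij}} \right)
```

With this convention, JM_{ij} saturates at 2 for perfectly separable classes, whereas B_{ij} is unbounded.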
At a level of the band merging hierarchy, the best set of merged bands is the one that maximizes class separability. So a possible criterion (to minimize) for band merging can be defined by equation 11 as
Figure 13 shows results on the Pavia data set for the three criteria described in the previous section. The separability-based criterion tends to lead to results that differ more from those of the other two criteria. The different criteria do not consider the same parts of the spectrum as having to be kept at fine resolution. For instance, the correlation and spectra reconstruction criteria tend to quickly merge bands between numbers 30 and 32, while separability tends to preserve them at fine resolution. Conversely, separability tends to quickly merge some bands in the red-edge domain, while the other criteria keep this domain at fine resolution. This can be understood by considering the underlying criteria: adjacent bands are not strongly correlated with each other in this domain, and the slope of the spectra is steep for vegetation classes; thus they cannot be merged easily according to the correlation or spectra approximation error criteria. In contrast, the only useful information for classification (e.g. for class separability) is the fact that there is a slope there, and thus the values of the bands before and after this domain. Thus, merging these red-edge bands has little impact on class separability.
As the hierarchy of merged bands can also be a way to explore several configurations of contiguous bands with different spectral resolutions, the band configurations corresponding to the different levels were evaluated using a classification quality measure. Thus, for each level, a classification was performed using a support vector machine (SVM) classifier with a radial basis function (rbf) kernel and evaluated. Its Kappa coefficient was considered.
These results are presented in Figure 14. It can be seen that some spectral configurations made it possible to obtain better results than at the original spectral resolution. Configurations obtained using the correlation coefficient generally perform worse than those obtained with the two other criteria. Except for Pavia, the spectra piece-wise approximation error merging criterion tends to lead to the best results. For Pavia, however, the classification Kappa reached using the different criteria remained very similar.
4.3 Band selection within the hierarchy
4.3.1 Greedy algorithm
To optimize the spectral configuration for a limited number of merged bands, a greedy approach was first used: it performed band selection at the different levels of the hierarchy of merged bands, paying no attention to the results obtained at the previous level. Thus a set of merged bands was selected at each level of the hierarchy.
The feature selection (FS) score to optimize was the JM separability measure. It was optimized at each level of the hierarchy using an SFFS incremental optimization heuristic .
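As a reminder of how such an incremental heuristic works, a compact sketch of sequential forward floating selection is given below. It is a generic, simplified version written for illustration, not the implementation used in the study; `score` stands for any FS score to maximize (for example a JM separability measure computed on the candidate subset), and the function name `sffs` and its arguments are hypothetical.

```python
def sffs(candidates, score, n_select):
    """Sequential forward floating selection of n_select items maximizing score.

    score(subset) -> float takes a list of candidates; higher is better.
    """
    if not 1 <= n_select <= len(candidates):
        raise ValueError("n_select must be between 1 and len(candidates)")
    selected = []
    best = {}  # best[k] = (score, subset) of the best subset of size k seen so far
    while len(selected) < n_select:
        # Forward step: add the candidate that yields the best enlarged subset.
        remaining = [c for c in candidates if c not in selected]
        added = max(remaining, key=lambda c: score(selected + [c]))
        selected = selected + [added]
        value = score(selected)
        k = len(selected)
        if k not in best or value > best[k][0]:
            best[k] = (value, selected)
        else:
            selected = best[k][1]  # fall back to the better subset already known
        # Floating backward steps: drop one item while this strictly improves
        # on the best subset known for the smaller size.
        while len(selected) > 2:
            reduced = max(
                ([x for x in selected if x != c] for c in selected),
                key=score,
            )
            value = score(reduced)
            if value > best[len(reduced)][0]:
                selected = reduced
                best[len(reduced)] = (value, reduced)
            else:
                break
    return list(best[n_select][1])
```

In the greedy approach, such a routine would simply be called once per level of the hierarchy, with the merged bands of that level as candidates.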
Results obtained on the Pavia data set are presented in Figure 15: five merged bands (as in ) were selected at each level of the hierarchy of merged bands. The positions of the selected merged bands do not change much when climbing the hierarchy, except when reaching the lowest spectral resolution configurations. At some levels of the hierarchy, the position of a selected merged band can also move and then come back to its initial position when climbing the hierarchy.
Thus, the bands selected at one level can be used to initialize the algorithm at the next level. This modified method is presented in Section 4.3.2.
The merged band subsets selected at the different levels of the hierarchy were evaluated according to a classification quality measure. As in the previous section, the Kappa coefficient reached by a rbf SVM was considered. Results for the Pavia and Indian Pines data sets can be seen in Figure 16. At each level of the hierarchy, 5 bands were selected for Pavia, and 10 bands for Indian Pines. It can be seen that these accuracies remain very close to each other whatever the band merging criterion used, and no band merging criterion is clearly better than the others. Results obtained using merged bands are generally better than those obtained using the original bands.
4.3.2 Taking into account the band merging hierarchy during selection
4.3.2.1 Proposed algorithm
The previous merged band selection approach is greedy and computationally expensive. An adaptation of the SFFS heuristic was therefore proposed to directly take the band merging hierarchy into account in the band selection process. As for the hierarchical band merging algorithm, a bottom-up approach was chosen. Contrary to the greedy approach, this one uses the band subset selected at the previous (lower) level when performing band selection at a new level of the hierarchy of merged bands. This algorithm is described below:
A set of selected merged bands is defined at each level of the hierarchy. (NB: the same number of bands is selected at each level of the hierarchy.)
Initialization: the standard SFFS band selection algorithm is applied to the base level of the hierarchy.
Iterations over the levels of the hierarchy:
Generate the set of selected bands at level l + 1 from the set selected at level l.
Remove possible duplicates from this new set.
If bands are then missing from the set, find the band whose addition maximizes the FS score.
Then apply the classic SFFS algorithm until the required number of bands is selected.
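One iteration of this hierarchy-aware selection can be sketched as follows. The sketch rests on several assumptions: the link between consecutive levels is given as a list `link` mapping each band index at level l to the index of the merged band that contains it at level l + 1, the FS score is a generic function `score` to maximize, and the final SFFS refinement is replaced by plain forward additions for brevity. The function name `propagate_selection` is hypothetical.

```python
def propagate_selection(selected, link, n_next, score, n_select):
    """Carry a band selection from one hierarchy level to the next.

    selected: indices of the merged bands chosen at level l.
    link[j]: index at level l + 1 of the merged band containing band j of level l.
    n_next: number of merged bands at level l + 1.
    score(subset) -> float: FS score of a list of level l + 1 band indices
    (higher is better).
    """
    # Map each selected band to the band containing it at the next level and
    # drop the duplicates that appear when two selected bands were merged.
    mapped = []
    for j in selected:
        if link[j] not in mapped:
            mapped.append(link[j])
    # Refill: add the band that maximizes the score until enough bands are chosen.
    # (Simplification: forward additions only, without floating backward steps.)
    while len(mapped) < min(n_select, n_next):
        remaining = [b for b in range(n_next) if b not in mapped]
        mapped.append(max(remaining, key=lambda b: score(mapped + [b])))
    return mapped
```

Because most selected bands are simply carried over from one level to the next, far fewer score evaluations are needed than when a full selection is restarted at every level.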
Results obtained on the Pavia scene for the band merging criterion ‘spectra piece-wise approximation error’ are presented in Figure 17: five merged bands were selected at each level of the hierarchy, starting from an initial solution obtained at the bottom level of the hierarchy.
As for previous experiments, results were evaluated both for the Pavia (5 selected bands) and Indian Pines (10 selected bands) data sets. The Kappa reached by rbf SVM classification for the merged band subsets selected at the different levels of the hierarchy (built with the band merging criterion ‘spectra piece-wise approximation error’) can be seen both for the greedy FS algorithm and for the hierarchy-aware one in Figure 18: the results remain very close, whatever the optimization algorithm.
Both algorithms lead to equivalent results in terms of classification performance (see Table 4), while the proposed hierarchy-aware algorithm is considerably faster.
Hyperspectral imagery consists of hundreds of contiguous spectral bands, but only a subset of well-chosen bands is generally sufficient for a specific classification problem. So it is possible to design superspectral sensors dedicated to specific land cover classification tasks. This chapter presented a spectral optimization strategy to identify the most relevant spectral band subset for such sensor, optimizing both band position and width. Spectral optimization involves a band subset relevance score as well as a method to optimize it.
This study first focused on the definition of this relevance score. Several filter, wrapper and embedded scores compatible with generic optimization heuristics were compared, and both their classification performance and selection stability were considered for the band selection problem. In the end, most of them brought good results. The Jeffries-Matusita distance score tended to lead to slightly better quantitative classification results than the best wrapper scores, while being less stable. Wrapper scores taking into account classification confidence performed better than classic wrapper scores expressed as a simple classification “hard label” error rate. For instance, a random forest confidence-based score was identified as one of the best criteria, considering both quantitative and qualitative analyses. As an intermediate result of this FS criteria comparison, a method to create band importance profiles according to the different criteria was proposed, providing visual hints about the relevance of the different parts of the spectrum. Then the study focused on the optimization of bandwidth, which is important in a spectral sensor design context: wider bands are a way to limit signal noise, but bands that are too wide can also lead to a loss of useful information. A strategy was proposed that consists in building a hierarchy of groups of adjacent bands before applying band selection at the different levels of this hierarchy, using an adaptation of an incremental algorithm to this problem. This band grouping strategy made it possible to limit the combinatorial complexity of the problem while considering relevant band subsets composed of spectral bands with different spectral widths. It was also a way to consider several possible solutions and evaluate their impact.
- Pavia data set is provided by Pavia University and available at http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes.
- Indian Pines data set is provided by Purdue University and available at https://engineering.purdue.edu/~biehl/MultiSpec/hyperspectral.html.
- Salinas data set was downloaded from http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes.