A Multi-Features Fusion of Multi-Temporal Hyperspectral Images Using a Cooperative GDD/SVM Method A Multi-Features Fusion of Multi-Temporal Hyperspectral Images Using a Cooperative GDD/SVM Method

Considering the emergence of hyperspectral sensors, feature fusion has been more and more important for images classification, indexing and retrieval. In this chapter, a cooperative fusion method GDD/SVM (Generalized Dirichlet Distribution/Support Vector Machines), which involves heterogeneous features, is proposed for multi-temporal hyperspectral images classification. It differentiates, from most of the previous approaches, by incorporating the potentials of generative models into a discriminative classifier. Therefore, the multi-features, including the 3D spectral features and textural features, can be integrated with an efficient way into a unified robust framework. The experimental results on a series of Hyperion images show that the precision is 92.64% and the recall is 91.87%. The experiments on AVIRIS dataset also confirm the improved performance and show that this cooperative fusion approach has consistence over different testing datasets.


Introduction
Considering the emergence of hyperspectral sensors, feature fusion has been more and more important for images classification, indexing and retrieval. In this chapter, a cooperative fusion method GDD/SVM (Generalized Dirichlet Distribution/Support Vector Machines), which involves heterogeneous features, is proposed for multi-temporal hyperspectral images classification. It differentiates, from most of the previous approaches, by incorporating the potentials of generative models into a discriminative classifier. Therefore, the multi-features, including the 3D spectral features and textural features, can be integrated with an efficient way into a unified robust framework. The experimental results on a series of Hyperion images show that the precision is 92.64% and the recall is 91.87%. The experiments on AVIRIS dataset also confirm the improved performance and show that this cooperative fusion approach has consistence over different testing datasets.

Problem statement
The semantic categorization of remote-sensing images requires analysis of many features of the images such as texture, spectral profiles, etc. Current feature fusion approaches commonly concatenate different features. It gives, generally good results and several approaches have been proposed using this schema. However, most of them have various conditional constraints, such as noise and imperfection, which might retain the use of such systems under degraded performance. However, how to fuse heterogeneous features in a flexible way is still an open research question.
Similarly, in the area of Supervised Machine Learning (SML), diversity with respect to the errors committed by component classifiers has received much attention. Generative and discriminative approaches are two distinct schools of probabilistic machine learning. It has shown that discriminative approaches such as SVM [1] outperform model based approaches due to their flexibility in decision boundaries estimation. Conversely, since that discriminative methods are concerned with boundaries, all the classes need to be estimated conjointly [2]. Complementary, one of the interesting characteristics, that generative models have over discriminative ones, is that they are learnt independently for each class. Moreover, following their modeling power, generative models are able to deal with missing data. An ideal fusion method should combine these two approaches in order to improve the classification accuracy.

Generalized dirichlet distribution
Priors based on Dirichlet location-scale mixture of normals are widely used to model densities as mixtures of normal kernels. A random density f arising from such a prior can be expressed as where φ(·) is the standard normal density and the mixing distribution P follows a Dirichlet process.
[3] initiated a theoretical study of these priors for the problem of density estimation. They showed that if a density f 0 satisfies certain conditions, then a Dirichlet location mixture of normals achieves posterior consistency at f 0 . Their conditions can be best summarized as f 0 having a moment generating function on an open interval containing [−1, 1]. Ghosal and van der Vaart (2001) extended these results to rate calculations for the more general Dirichlet location-scale mixture prior. However, they restricted the scale parameter σ to a compact interval [σ, σ] ⊂ (0, ∞).

Preliminaries
To make this chapter relatively self-contained, we recall the definitions of posterior consistency in the context of density estimation and regression. These definitions formalize the concept that in order to achieve consistency, the posterior should concentrate on arbitrarily small neighborhoods of the true model when more observations are made available.
Posterior consistency for density estimation: Suppose X 1 , X 2 , · · · are independent and identically distributed according to an unknown density f 0 . We take the parameter space as F -a set of probability densities on the space of the observations and consider a prior distribution Π on F . Then the posterior distribution Π(·|X 1 , · · · , X n ) given a sample X 1 , · · · , X n is obtained as, We say that the posterior achieves weak (or strong) posterior consistency at f 0 if for any weak (or L 1 ) neighborhood U of f 0 , Π(U|X 1 , X 2 , · · · , X n ) → 1 almost surely as n → ∞.
Posterior consistency for regression: Suppose one observes Y 1 , Y 2 , · · · from the model s are known non-random covariate values and ǫ i 's are independent and identically distributed with an unknown symmetric density f 0 . The regression coefficients α 0 , β 0 are also unknown. Here, it is appropriate to consider the parameter space as Θ = F * × R × R, where F * is a set of symmetric probability densities on R with a prior Π on Θ. The posterior distribution Π(·|Y ! , · · · , Y n ) is then computed as, .
We say that the posterior achieves weak consistency at ( f 0 , α 0 , β 0 ) if for any weak neighborhood U of f 0 and any δ > 0, almost surely as n → ∞.

Density estimation: weak consistency
We start with weak posterior consistency for the problem of density estimation. Our main tool is the following theorem due to Schwartz (1965).
A prior Π achieves weak posterior consistency at a density f 0 , if We would use the notation f 0 ∈ KL(Π) to indicate that a density f 0 satisfies (2).
General Mixture Priors First consider the case when the mixing distribution P in (1) follows some general distributionΠ, not necessarily a Dirichlet process. It is reasonable to assume that the weak support ofΠ contains all probability measures on R × R + that are compactly supported. The next lemma reveals the implication of this property.
Consider an f 0 ∈ F such that x 2 f 0 (x)dx < ∞. Supposef = φ * P is such thatP((−a, a) × (σ, σ)) = 1 for some a > 0, 0 < σ < σ. Then for any ǫ > 0, there exists a weak neighborhood W ofP such that for any f = φ * P with P ∈ W, The proof of this lemma is similar to the proof of Theorem 3 of Ghosal et al. (1999) and we present it in the appendix. Here we state and prove the main result.
Let f 0 (x) be a continuous density on R satisfying: 1. f 0 is nowhere zero and bounded above by M < ∞. 2 Then, f 0 ∈ KL(Π).
Assumption 4 provides the important moment condition on f 0 . Assumption 2 is satisfied by most of the common densities and assumption 3 can be viewed as a regularity conditions. The interval [x − 1, x + 1] that appears in assumption 3 can be replaced by [x − a, x + a] for any a > 0.
Proof. of Theorem 3.1.2 Note that, Therefore, the result would follow if for any ǫ > 0, we can find anf which makes f 0 log f 0 f dx < ǫ/2 and also satisfies the condition of Lemma 3.1.2. Next we show how to construct such anf .
Consider the densities f n = φ * P n , n ≥ 1, with P n 's constructed as, dP n (θ, σ) = t n I (θ∈[−n,n]) f 0 (θ)δ σ n (σ) (5) where σ n = n −η , t n = ( n −n f 0 (y)dy) −1 , I A is the indicator function of a set A and δ x is the point mass at a point x. Note that f n can be simply written as, Find a positive constant ξ such that Since t n → 1 and σ n → 0, (7) would imply that f n (x) → f 0 (x) as n → ∞ by continuity of f 0 . Therefore one can conclude, Since t n is a decreasing sequence and f 0 (θ) < M for all θ ∈ R, one can readily see that for all n ≥ 1 and all x ∈ R, Now, fix an x ∈ R. Since, |x − θ| ≤ |x| + n for all θ ∈ [−n, n] and t n ≥ 1, it follows that for all n ≤ |x|, The last inequality follows from the fact that τ η φ(τ η (|x| + τ)) is decreasing in τ for τ ≥ 1.
A little algebraic manipulation with (9) and (12) obtains, ∀n ≥ 1, From the assumptions of Theorem 3.2, it can be easily verified that the function on the right hand side of the above display is f 0 integrable. Therefore an application of DCT on (8) implies that, Therefore we can simply choosef = f n 0 for some large enough n 0 .

Dirichlet mixture of normals
Next we considerΠ = Dir(αG 0 ), a Dirichlet process with parameter αG 0 . Here α is a positive constant and G 0 is a probability measure on R × R + .
Note that the moment condition of Theorem 3.1.2 is substantially reduced.
Let f 0 be a density on R satisfying and for large x < 0, Other than the important moment condition on f 0 this theorem also requires some regularity in the tail of the base measure G 0 . For example, assumption 3,3' requires the tail of G 0 not to decay faster than a polynomial rate for the scale parameter σ. This condition seems very reasonable since the Cauchy density itself can be written as a scale mixture of normals with the mixing density having a polynomial decay towards infinity.
A standard choice for G 0 is the conjugate normal-inverse gamma distribution (see Escobar and West 1995), under which, θ|σ ∼ N(0, ξσ 2 ) and σ −2 ∼ Gamma(r, λ), for some ξ, r, λ > 0. For such a G 0 with r ∈ (1/2, 1), one can show that the conditions of Theorem 3.2 hold true with η ∈ (2r/(1 + r), 1), β = r(2 − η) and γ = 2r. For example, the conditions in Assumptions 3, 3' are satisfied since, for some positive constants c, c ′ . To see that the conditions of Assumptions 4, 4' also hold, note that, An argument similar to the one provided above shows that the second term, namely, Pr(σ −2 < e −2|x| η +1 ) is bounded by a constant times e −2r|x| η +r . Therefore, this term can be made smaller than c|x| −γ for a suitable constant c. Now, using the inequality and φ(·) are the standard normal distribution and density functions, we obtain for some positive constants c, c ′ , c ′′ . The desired inequality follows from these two bounds. Therefore, such a choice of G 0 would lead to posterior consistency, for example, when f 0 is a Cauchy density.
Proof. of Theorem 3. 2 We simply need show that such an f 0 satisfies the condition of Lemma Define a class of subsets of R × R + indexed by x ∈ R, as follows: These sets are of particular interest, since for f = φ * P, By the assumptions of the Theorem, this quantity can be made arbitrarily small for a suitably large x 0 if we can show that P(K x ) > c 1 exp(−c 2 |x| η ) for all |x| > x 0 for some fixed constants c 1 , c 2 > 0. Therefore it suffices to prove that, For any τ > 0 there exists an x 0 > 0 and a set The proof of this Lemma is fairly technical. It makes an extensive use of the tail behavior of a random probability P arising from a Dirichlet process. For clarity of reading, we present details of the proof in the Appendix.

Density estimation: strong consistency
We establish L 1 -consistency of a Dirichlet location-scale mixture of normal prior Π by verifying the conditions of Theorem 8 of Ghosal et al. (1999). This theorem is reproduced below.
Let Π be a prior on F such that f 0 ∈ KL(Π). If there is a δ < ǫ/4, c 1 , c 2 > 0, β < ǫ 2 /8 and F n ⊆ F such that for all n large, then Π achieves strong posterior consistency at f 0 .
Here J(δ, G) denotes logarithm of the covering number of G by L 1 balls of radii δ.
We first show how to calculate J(δ, G) for certain type of sets G. For some a > 0, u > l > 0 define Then, where b 0 , b 1 and b 2 depend upon κ but not on a, l or u.
Proof. Take F n = F κ a n ,l n ,u n . Then the conditions of Theorem 4 are easily verified using Lemma 4 for a suitable choice of κ > 0. IfΠ = Dir(αG 0 ), verification of conditions 1 and 2 becomes particularly simple. For example, if G 0 is a product of a normal on θ and an inverse gamma on σ 2 , then the conditions of theorem 4 are satisfied if a n = O( √ n), l n = O(1/ √ n) and u n = O(e n ).

Support vector machines
We give, in this section, a very brief presentation of Support Vector Machines (SVMs) that is needed for the definition of their functional versions. We refer the reader to e.g. [4] for a more comprehensive presentation. As stated in section ??, X denotes an arbitrary Hilbert space. Our presentation of SVM departs from the standard introduction because it assumes that the observations belong to X rather than to a d. This will make clear that the definition of SVM on arbitrary Hilbert spaces is not the difficult part in the construction of functional SVM. We will discuss problems related to the functional nature of the data in section 4.1.5.
Our goal is to classify data into two predefined classes. We assume given a learning set, i.e. N examples (x 1 , y 1 ), . . . , (x N , y N ) which are i.i.d. realizations of the random variable pair (X, Y) where X has values in X and Y in {−1, 1}, i.e. Y is the class label for X which is the observation.

Hard margin SVM
The principle of SVM is to perform an affine discrimination of the observations with maximal margin, that is to find an element w ∈ X with a minimum norm and a real value b, such that y i ( w, x i + b) ≥ 1 for all i. To do so, we have to solve the following quadratic programming problem: The classification rule associated to (w, b) is simply (x) = sign( w, x + b). In this situation (called hard margin SVM), we request the rule to have zero error on the learning set.

Soft margin SVM
In practice, the solution provided by problem (P 0 ) is not very satisfactory. Firstly, perfectly linearly separable problems are quite rare, partly because non linear problems are frequent, but also because noise can turn a linearly separable problem into a non separable one. Secondly, choosing a classifier with maximal margin does not prevent overfitting, especially in very high dimensional spaces (see e.g. [5] for a discussion about this point).
A first step to solve this problem is to allow some classification errors on the learning set. This is done by replacing (P 0 ) by its soft margin version, i.e., by the problem: Classification errors are allowed thanks to the slack variables ξ i . The C parameter acts as an inverse regularization parameter. When C is small, the cost of violating the hard margin constraints, i.e., the cost of having some ξ i > 0 is small and therefore the constraint on w dominates. On the contrary, when C is large, classification errors dominate and (P C ) gets closer to (P 0 ).

Non linear SVM
As noted in the previous section, some classification problems don't have a satisfactory linear solution but have a non linear one. Non linear SVMs are obtained by transforming the original data. Assume given an Hilbert space H (and denote ., . H the corresponding inner product) and a function φ from X to H (this function is called a feature map). A linear SVM in H can be constructed on the data set (φ(x 1 ), y 1 ), . . . , (φ(x N ), y N ). If φ is a non linear mapping, the classification rule (x) = sign( w, φ(x) H + b) is also non linear.
In order to obtain the linear SVM in H one has to solve the following optimization problem: It should be noted that this feature mapping allows to define SVM on almost arbitrary input spaces.

Dual formulation and Kernels
Solving problems (P C ) or (P C,H ) might seem very difficult at first, because X and H are arbitrary Hilbert spaces and can therefore have very high or even infinite dimension (when X is a functional space for instance). However, each problem has a dual formulation. More precisely, (P C ) is equivalent to the following optimization problem (see [6]): This result applies to the original problem in which data are not mapped into H, but also to the mapped data, i.e., (P C,H ) is equivalent to a problem (D C,H ) in which the x i are replaced by φ(x i ) and in which the inner product of H is used. This leads to: Solving (D C,H ) rather than (P C,H ) has two advantages. The first positive aspect is that (D C,H ) is an optimization problem in N rather than in H which can have infinite dimension (the same is true for X ).
The second important point is linked to the fact that the optimal classification rule can be written ( . This means that both the optimization problem and the classification rule do not make direct use of the transformed data, i.e. of the φ(x i ). All the calculations are done through the inner product in H, more precisely through the values φ(x i ), φ(x j ) H . Therefore, rather than choosing directly H and φ, one can provide a so called Kernel function K such that K(x i , x j ) = φ(x i ), φ(x j ) H for a given pair (H, φ).
In order that K corresponds to an actual inner product in a Hilbert space, it has to fulfill some conditions. K has to be symmetric and positive definite, that is, for every N, x 1 , . . . , x N in X and α 1 , . . . , If K satisfies those conditions, according to Moore-Aronszajn theorem [? ], there exists a Hilbert space H and feature map φ such that

The case of functional data
The short introduction to SVM proposed in the previous section has clearly shown that defining linear SVM for data in a functional space is as easy as for data in d, because we only assumed that the input space was a Hilbert space. By the dual formulation of the optimization problem (P C ), a software implementation of linear SVM on functional data is even possible, by relying on numerical quadrature methods to calculate the requested integrals (inner product in L 2 (µ), cf section ??).
However, the functional nature of the data has some effects. It should be first noted that in infinite dimensional Hilbert spaces, the hard margin problem (P 0 ) has always a solution when the input data are in general positions, i.e., when N observations span a N dimensional subspace of X . A very naive solution would therefore consists in avoiding soft margins and non linear kernels. This would not give very interesting results in practice because of the lack of regularization (see [5] for some examples in very high dimension spaces, as well as section ??).
Moreover, the linear SVM with soft margin can also lead to bad performances. It is indeed well known (see e.g. [7]) that problem (P C ) is equivalent to the following unconstrained optimization problem: CN . This way of viewing (P C ) emphasizes the regularization aspect (see also [8][9][10]) and links the SVM model to ridge regression [? ]. As shown in [11], the penalization used in ridge regression behaves poorly with functional data. Of course, the loss function used by SVM (the hinge loss, i.e., h(u, v) = max(0, 1 − uv)) is different from the quadratic loss used in ridge regression and therefore no conclusion can be drawn from experiments reported in [11]. However they show that we might expect bad performances with the linear SVM applied directly to functional data. We will see in sections ?? and ?? that the efficiency of the ridge regularization seems to be linked with the actual dimension of the data: it does not behave very well when the number of discretization points is very big and thus leads to approximate the ridge penalty by a dot product in a very high dimensional space (see also section ??).
It is therefore interesting to consider non linear SVM for functional data, by introducing adapted kernels. As pointed out in e.g. [10], (P C,H ) is equivalent to Using a kernel corresponds therefore both to replace a linear classifier by a non linear one, but also to replace the ridge penalization by a penalization induced by the kernel which might be more adapted to the problem (see [9] for links between regularization operators and kernels). The applications presented in ?? illustrate this fact.

Overview of the proposed fusion schema
In this chapter, we propose a new technique in remote-sensing images classification by fusing heterogeneous representations. The proposed approach involve several steps including preprocessing; features extraction; features fusion; matching and classification stages. The block diagram of the proposed technique is shown in Fig. 1. In our previous work [12], we proposed a novel 3D model which design the spectral signature as a three dimensional function which are the time, reflectance, and wavelength band (equation 1). For each pixel, we generated a surface (3D Mesh) which generalizes the usual signature by adding a time dimension. We call this new representation the multi-temporal spectral signature. Interested readers can refer to [12].

Images pre-processing and features extraction
In this study multi-temporal hyperspectral images constitutes the source data. Spectral and textural features are the foundational data for this kind of images. The 3D spectral features are extracted from the relative mesh of a given pixel (multi-temporal spectral signature) while the textural ones are derived directly from images. Mainly, two features vectors are generated for each pixel as follows: Heat kernel signature (HKS) : The HKS is a signature computed only from the intrinsic geometry of an object. Suppose (m, g) is a compelte Riemannian manifold, g is the Riemannian metric. δ is the Laplace-Beltrami operator. The eigenvalues {λ n } and eigenfunctions {φ n } of δ are δφ n = λ n φ n , where φ n is normalized to be orthonormal in L 2 (M). The Laplace spectrum is given by 0 = λ 0 < λ 1 ≤ λ 2 ≤ . . . , λ n → ∞. △ is the Laplace-Beltrami operator. As a local shape descriptor, Sun et al. [? ] defined the heat kernel signature (HKS) by : where λ 0 , λ 1 , · · · ≥ O are eigenvalues and 0 , φ 1 , . . . are the corresponding eigenfuctions on the Laplace-Beltrami operator, satisfying δ X φ i = λ i φ i . Let's denote this vector by Y.

Spatio-temporal Gabor filters:
Texture is one of the important characteristics used in identifying objects or regions of interest. It contains important information about the structural arrangement of surfaces. Fusing texture with 3D spectral information is conducive to the interpretation of remote seeing image [13]. We use a method for dynamic texture modeling based on spatio-temporal Gabor filters. Briefly, the sequence of images is convolved with a bank of spatiotemporal Gabor filters and a feature vector is constructed with the energy of the responses as components. Let's denote this vector by Y ′ .

Multi-Features fusion based on a cooperative GDD/SVM classifier
In this section, we present an approach that combines an SVM classifier [1] with a generatively trained GDD model and profits, accordingly, from the advantages of both techniques. The key idea here is to concatenate the extracted features into one vector and to project it in a new space. First, a straightforward feature combination approach is used to concatenate feature vectors (Y and Y ′ ) to a single feature vector X = (X i1 , . . . , X idim ). The dim size may differ from one pixel to another making the fusion and classification a challenging tasks. To overcome this limit, we use the Generalized Dirichelet Distribution (GDD) model [14] to map each feature vector into its Fisher score. Therefore, the Fisher kernel function from the GDD is used to replace the Gaussian kernel in the classical SVM.
Let (X 1 , . . . , X N ) denote a collection of N multi-temporal hyperspectral pixels. Each data X i is assumed to have dim size, X = (X i1 , . . . , X idim ). Each data X i is assumed to be drawn from the following finite mixture model : where M is the number of components, the P(j), (0 < P(j) < 1 and ∑ dim j=1 P(j) = 1) are the mixing proportions and p(X/j, θ j ) is the Probability Density Function PDF. θ is the set of parameters to be estimated : θ = (α 1 , . . . , α M , P(1), . . . , P(M)).
If the random vector X = (X i1 , . . . , X idim ) follows a Dirichelet distribution, the joint density function is given by : Since that each feature vector X may has an arbitrary dimension, the proposed method defines the fusion as a projection from one feature vector space (spectral bands) to another with a fixed dimentionnality. Accordingly, the feature-level fusion is done by projecting the vector X combining into one vector in the Fisher space. Thus, the generative model will have its impact on the final classification result through the projection of the extracted features in this new space.
SVM classifier is used to classify the fused features and the multi-temporal dataset of images. Given the generative model obtained by GDD with parameters θ, we compute for each sample X the Fisher score U d = ▽ θ logP(x|θ) (the gradient of the log likehood of x for model θ). The Fisher kernel operates in the gradient space of the generative mode and provides a natural similarity measure between data samples. For each sample, this score is a vector of fixed dimentionality. Using this score, the Fisher Information matrix is defined as I = E X i U X i T U X i . After Fisher score normalization, we compute the Fisher kernel function on the basis of the Euclidean distance between the scores of the new sample and the training samples : In the second stage, suppose our training set S consists of labels input vectors (X i , z j ), i = 1, . . . , m where X i ∈ R n and z i ∈ {±1}. Given a kernel matrix and a set of labels z i for each sample, the SVM proceeds to learn a classifier of the form, where the coefficients α i are determined by solving a constrained quadratic program which aims to maximize the margin between the classes. In our experiments we used the LIBSVM package. Our research deals with multi-class problem. The One-Vs-One approach is adopted to extend the proposed approach to multi-temporal hyperspectral classification.

Results and discussion
The proposed approach was tested on two different data sets. The datasets involve several types of information with dimensions ranging from 176 to 183 bands. The first dataset, Hyperion, contains vegetation type data, is divided into five classes, has 183 spectral bands and has a pixel size of 30m. The second set is from an airborne sensor (AVIRIS), divided into 7 classes, has 176 spectral bands and a pixel size of 18m. First, we present experiments that assess the classification accuracy of the proposed approach (PA). We also included the direct SVM fusion and a probabilistic fusion approach in our comparison as a baseline. Figure (2) summarizes the results obtained. At each level of label noise we carry out four experiments, and the figures show the mean performance. The strength of this approach is that it combines the rich modeling power of GDD with the discriminative power of the SVM algorithm.