Open access peer-reviewed chapter

Hyperspectral and Multispectral Image Fusion Using Deep Convolutional Neural Network - ResNet Fusion

Written By

K. Priya and K.K. Rajkumar

Submitted: 06 May 2022 Reviewed: 18 May 2022 Published: 19 July 2022

DOI: 10.5772/intechopen.105455

From the Edited Volume

Hyperspectral Imaging - A Perspective on Recent Advances and Applications

Edited by Jung Y. Huang


Abstract

In recent years, deep learning-based HS-MS fusion has become a very active research tool for the super-resolution of hyperspectral images. Deep convolutional neural networks (CNNs) help to extract more detailed spectral and spatial features from the hyperspectral image. In a CNN, each convolution layer takes its input from the previous layer, which may cause a loss of information as the depth of the network increases. This loss of information causes vanishing gradient problems, particularly in the case of very high-resolution images. To overcome this problem, in this work we propose a novel HS-MS ResNet fusion architecture with the help of skip connections. The ResNet fusion architecture contains residual blocks with different numbers of stacked convolution layers; in this work we tested residual blocks with two-, three-, and four-stacked convolution layers. To strengthen the gradients and decrease the negative effects of gradient vanishing, we implemented the ResNet fusion architecture with different skip connections such as short, long, and dense skip connections. We measured the strength and superiority of our ResNet fusion method against traditional methods on four public datasets using standard quality measures and found that our method outperforms all other compared methods.

Keywords

  • convolution neural network
  • residual network
  • ResNet fusion
  • stacked layer
  • dense skip connection

1. Introduction

Spectral imaging technology captures a contiguous spectrum for each image pixel over a selected range of wavelength bands. Thus, spectral images accommodate more information than conventional monochromatic or RGB images. The wide range of spectral information available in hyperspectral images brings spectral imaging technology into a new horizon of research for analyzing pixel content at the macroscopic level. This change promises revolutionary developments in many walks of life in the coming future. In general, spectral images are divided into multispectral (fewer than 20 sampled wavelength bands) and hyperspectral (more than 20 bands). A multispectral image (MSI) captures a maximum of about 20 spectral bands, whereas a hyperspectral image (HSI) captures hundreds of contiguous spectral bands at a time. Due to this exciting prominence, HSI is now an emerging area that at the same time faces many challenges in analyzing the minute details of pixel content in image processing and computer vision [1].

Hyperspectral images (HSIs) are rich in spectral information, which greatly strengthens their information-storing ability. This property of HSI enables rapid growth in many areas such as remote sensing, medical science, the food industry, and various computer vision tasks. However, hyperspectral sensors capture all these bands over narrow wavelength ranges, which limits the amount of energy received by each band. Therefore, HSI information can easily be influenced by many kinds of noise, which lowers the spatial resolution of HSI [2].

Many studies have been introduced in the literature to control the tradeoff between the spatial and spectral resolution of hyperspectral images. As a result, many HS-MS fusion methods have evolved in the past decades to address it. This straightforward HS-MS fusion approach has become a popular and trending research area in image processing and computer vision. The early approach is pansharpening-based image fusion, which fuses spectral and spatial information from low-resolution multispectral (LR-MS) images with high-resolution (HR) panchromatic (PAN) images to enhance the spatial and spectral resolution of the fused image. Subsequently, pansharpening algorithms were gradually extended to HS-MS image fusion [3].

In HS-MS fusion, a hyperspectral image with high spatial and spectral resolution is estimated by fusing an LR-HS image with an HR-MS image of the same scene. However, the quality of the estimated spatial and spectral data is highly influenced by the constraints used in the fusion process. Recently, neural network-based methods have been widely used to improve HS-MS fusion quality in both the spatial and spectral domains. One such network, the convolutional neural network (CNN) in deep learning (DL), performs much better in image reconstruction, super-resolution, object detection, etc. [4].

In a CNN, each layer takes the output of the previous layer as input, which tends to lose information as the network goes deeper. In this work, we use ResNet-based HS-MS fusion by adding skip connections between the convolution layers. A skip connection helps to carry identity information throughout the deep convolutional network [5].

The remaining sections of this paper are arranged as follows: Section 2 reviews the literature on HS-MS fusion methods, both traditional and deep learning-based. Section 3 describes the materials and methods used in this work. Sections 4 and 5 present the problem formulation and the implementation of our work in detail. The results of our proposed method are discussed in Section 6, and finally, Section 7 concludes the proposed work with future scope.


2. Review of literature

2.1 Traditional methods

Many algorithms have been proposed in past decades to enhance the spatial quality of HS images. One popular and attractive approach is HS-MS image fusion, which is mainly divided into four groups: component substitution (CS), multi-resolution analysis (MRA), Bayesian approaches, and spectral unmixing (SU) [6]. The CS and MRA methods are described under the concept of an injection framework, in which high-quality information from one image is injected into another [7]. Bayesian methods, by contrast, use the posterior distribution of prior information about the target image, conditioned on the given HS and MS images [8]. Later, spectral unmixing-based HS-MS image fusion was introduced and is one of the most promising and widely used methods for enhancing the quality of HS images.

In the SU method, the quality of the abundance estimation highly depends on the accuracy of the endmembers. Therefore, any obstruction during the endmember extraction process leads to inconsistency in the abundance estimation. To overcome this limitation, Paatero and Tapper in 1994 [9] introduced the nonnegative matrix factorization (NMF) method, which was popularized by Lee and Seung in 1999 [10]. It has become an emerging tool for processing high-dimensional data due to its automatic feature extraction capability. The main advantage of NMF is that it yields a unique solution to the problem compared with other unmixing techniques [11]. In general, NMF-based spectral unmixing jointly estimates both the endmembers and the corresponding fractional abundances in a single step, mathematically represented as follows:

Y = EA    (1)

where the output matrix Y is simultaneously factorized into two nonnegative matrices E (endmember) and A (abundance) without any prior knowledge; hence NMF falls under an unsupervised framework [12]. NMF has since become one of the trending methods for blind-source spectral unmixing problems. NMF factorizes the input matrix into a product of two nonnegative matrices (the endmember matrix E and the abundance matrix A) by enforcing nonnegativity, so the NMF method is highly relevant to SU for enhancing image quality under these constraints. Finally, SU-based fusion is accomplished by using the coupled NMF (CNMF) method to obtain an enhanced hyperspectral image with high spatial and spectral quality. The CNMF fusion algorithm gives a high-fidelity reconstructed image compared with other existing fusion methods [13].
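To make Eq. (1) concrete, the following is a minimal NumPy sketch of the classic multiplicative update rules of Lee and Seung [10] for this factorization; the rank k, iteration count, and initialization are illustrative choices, not values from the chapter.

```python
import numpy as np

def nmf(Y, k, n_iter=200, eps=1e-9):
    """Factorize Y (bands x pixels) as Y ~ E @ A per Eq. (1), with
    E (bands x k) the endmembers and A (k x pixels) the abundances."""
    rng = np.random.default_rng(0)
    L, N = Y.shape
    E = rng.random((L, k))   # nonnegative random initialization
    A = rng.random((k, N))
    for _ in range(n_iter):
        # Lee-Seung multiplicative updates keep both factors nonnegative
        A *= (E.T @ Y) / (E.T @ E @ A + eps)
        E *= (Y @ A.T) / (E @ A @ A.T + eps)
    return E, A
```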

Yokoya et al. in 2012 [14] introduced the coupled nonnegative matrix factorization (CNMF) method, an unsupervised unmixing-based HS-MS image fusion. CNMF uses a straightforward approach to the unmixing and fusion processes, so its mathematical formulation and implementation are not as complex as those of other existing fusion methods. Finally, this method optimizes the solution with minimum residual error and reconstructs a high-fidelity hyperspectral image.

Simoes et al. in 2015 [15] introduced a super-resolution method for hyperspectral images termed HySure. This method formulates a model that preserves the edges between objects during unmixing-based data fusion. It uses an edge-preserving constraint called the vector total variation (VTV) regularizer, which preserves edges and promotes piecewise smoothness in the spatial content of the image.

Lin et al. in 2018 [16] introduced a convex optimization-based CNMF (CO-CNMF) method, incorporating a sparsity regularizer and a sum-of-squared-distances (SSD) regularizer. To extract high-quality data from the images, this method uses the SSD regularizer and enforces sparsity through ℓ1-norm regularization. Adding these two regularization terms through two convex subproblems helps to upgrade the performance of the existing CNMF method. However, performance degradation may occur in the CO-CNMF algorithm as the noise level increases; it is therefore necessary to add image denoising and spatial smoothing constraints to this fusion method.

Yang et al. in 2019 [17] introduced a total variation and signature-based regularized CNMF method named TVSR-CNMF. The TV regularizer is added to the abundance matrix to ensure the image's spatial smoothness. Similarly, a signature-based regularizer (SR) is added to the endmember matrix for extracting high-quality spectral data. This method thus helps to reconstruct a hyperspectral image with good spatial and spectral quality.

Yang et al. in 2019 [18] introduced a sparsity and proximal minimum-volume regularized CNMF method named SPR-CNMF. The minimum-volume regularizer controls and minimizes the distance between the selected endmembers and the center of mass of the selected region in the image to reduce computational complexity. It refines the solution at each iteration until it reaches the simplex with minimum volume. This method improves fusion performance by controlling the loss of cubic structural information.

Influenced by this work, we implemented an unmixing-based fusion algorithm named fully constrained CNMF (FC-CNMF). This method is a modified version of CNMF that includes all the spatial and spectral constraints available in the literature. In our method, a minimum-volume simplex constraint is imposed on the endmember matrix to fully exploit the spectral information. Similarly, sparsity and total variation constraints are incorporated into the abundance matrix to provide dimensionality reduction and spatial smoothness. Finally, we evaluated the quality of the fused image obtained by FC-CNMF against the methods discussed in the literature using standard quality measures. From these evaluations, we found that our method performs better, yielding higher fidelity in the reconstructed images.

These traditional approaches reconstruct the high-resolution hyperspectral image by fusing the high-quality data from hyperspectral and multispectral images. To improve the quality of the reconstructed images, they rely on different constraints such as sparsity, the minimum-volume simplex, and total variation regularization. The performance and quality of the reconstructed HS image are highly influenced by these constraints, so the existing methods still leave ample room for enhancing HSI quality.

2.2 Deep learning methods

Deep learning (DL) is a subbranch of machine learning (ML) that has recently shown remarkable performance in research fields, especially image processing and computer vision. DL is based on artificial neural networks and has been widely used in areas such as super-resolution, classification, image fusion, and object detection. DL-based image fusion methods can extract deep features automatically from the image. Therefore, DL-based methods overcome the difficulties faced by conventional image fusion methods and make the whole fusion process easier and simpler.

A deep learning-based HS-MS image fusion concept was first introduced by Palsson et al. in 2017 [19]. In this method, they used a 3-D convolutional neural network (3D-CNN) to fuse LR-HS and HR-MS images to construct an HR-HS image. This method improves the quality of the hyperspectral image while reducing noise and computational cost. However, they focused on enhancing the spatial data of the LR-HS image without any changes to the spectral information, which caused degradation of the spectral data [19].

Later, Masi et al. in 2017 [20] proposed a CNN architecture for image super-resolution that uses a deep CNN to extract both spatial and spectral features. The deep CNN is used to acquire features from HSI with a very complex spatial-spectral structure. However, the authors used a single-branch CNN architecture, which makes it difficult to extract discriminating features from the image.

To overcome this drawback, Shao and Cai in 2018 [21] designed a fusion method extending the CNN to the depth of a 3D-CNN for better fusion performance. For implementation, they used a remote sensing image fusion neural network (RSIFNN) with two separate CNN branches: one branch extracts the spectral data and the other extracts the spatial data from the image. In this way, the method exploits both the spectral and spatial information of the input images to reconstruct a hyperspectral image with high spectral and spatial resolution.

Yang et al. in 2019 [22] introduced a deep two-branch CNN for HS-MS fusion. This method uses a two-branch CNN architecture to extract spectral and spatial features from the LR-HSI and HR-MSI. The features extracted by the two branches are concatenated and then passed to fully connected layers to obtain the HR-HSI. In conventional fusion methods, the HR-HSI is reconstructed band by band, whereas in CNN-based methods all bands are reconstructed jointly, which helps to reduce the spectral distortion in the fused image. However, this method uses a fully connected layer for image reconstruction, which is heavily weighted and increases the number of network parameters.

Chen et al. in 2020 [23] introduced a spectral-spatial feature extraction fusion CNN (S2FEF-CNN), which extracts joint spectral and spatial features using three S2FEF blocks. The S2FEF method uses 1D and 2D convolution networks to extract spectral and spatial features, respectively, and then fuses them. It uses fully connected network layers for dimensionality reduction, which further reduces the network parameters during fusion. This method shows good results with less computational complexity than other deep learning-based fusion methods.

Although deep learning-based fusion methods have achieved tremendous improvements, they still possess many drawbacks [24]. As the network goes deeper, its performance saturates and then rapidly degrades. This is because each convolution layer takes as input the output of the previous layer, so by the time the last layer is reached, much of the meaningful information obtained in the initial layers has been lost. The information loss worsens as the network architecture gets deeper, bringing negative effects such as overfitting; this effect is called the vanishing gradient problem [25].

Due to the vanishing gradient problem, existing deep learning-based fusion cannot extract the detailed features of high-dimensional images. He et al. [5] introduced a deep network with residual learning to address the vanishing gradient problem. In this framework, a residual connection is added between the layers to diminish the performance degradation. Networks built on this concept are called residual networks, or ResNets. Therefore, in this work our aim is to bring this ResNet architecture into the standard CNN to extract more detailed features from both the spatial and spectral data of HSI.


3. Materials and methods

3.1 Dataset

Four real datasets are used in this work: Washington DC Mall, Botswana, Pavia University, and Indian Pines. The Washington DC Mall dataset is a well-known dataset captured by the HYDICE sensor; it covers a spectral range from 400 to 2500 nm with a 1278 × 307 pixel size and 191 bands. The Botswana dataset, captured by the Hyperion sensor over the Okavango Delta in Botswana, covers a spectral range from 400 to 2500 nm with a 1476 × 256 pixel size and 145 bands. The Pavia University dataset was captured by the Reflective Optics System Imaging Spectrometer (ROSIS-3) over the University of Pavia, northern Italy, in 2003; it covers a spectral range from 430 to 838 nm with a 610 × 340 pixel size and 103 bands. Finally, the Indian Pines dataset was captured by the AVIRIS sensor over the Indian Pines test site in northwestern Indiana, USA, in 1992; it covers a spectral range from 400 to 2500 nm with a 512 × 614 pixel size and 192 bands [25]. All these datasets have been widely used in earlier spectral unmixing-based fusion research.

3.2 Convolution neural networks

Convolutional neural networks (CNNs) play an important role in deep learning models. A CNN is an algorithm specially designed to work with images, extracting deep features through convolution. Convolution is a process that applies a kernel filter across every element of an image so that the network can understand and react to each element within the image. This concept of convolution is especially helpful for extracting specific features from high-dimensional images. A convolutional network architecture is composed of an input layer, an output layer, and one or more hidden layers. The hidden layers are combinations of convolution layers, pooling layers, activation layers, and normalization layers. These layers automatically detect essential features without any human supervision, so CNNs are considered a powerful tool for image processing [27]. The three main building blocks are described below, followed by a short code sketch.

  1. Convolution layer

    The convolution layer extracts various features from the input image with the help of filters. In the convolution layer, a mathematical operation is performed between the input image and a filter of m × m kernel size. The filter slides across the input image, and the dot product between the filter and the corresponding part of the image is computed. This process is repeated, convolving the kernel over the whole image, and the output of the convolution operation is called a feature map. The feature map contains essential information about the image, such as the boundaries and edges of objects [28].

  2. Pooling layer

    The convolution layer is followed by a pooling layer, which reduces the size of the feature map while maintaining the essential features. There are two types of pooling layers: max pooling and average pooling. Max pooling takes the largest element of each region of the feature map, whereas average pooling computes the average of the elements [28].

  3. Activation function

    One of the most important characteristics of any CNN is its activation function. There are several activation functions, such as sigmoid, tanh, softmax, and ReLU, each with its own importance. ReLU is the most commonly used activation function in DL and accounts for the nonlinear nature of the input data [28].
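The three building blocks above fit together as convolution → activation → pooling. A minimal PyTorch sketch (channel counts and kernel sizes are illustrative, not taken from the chapter):

```python
import torch
import torch.nn as nn

# Convolution -> ReLU -> max pooling, as described above.
block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding="same"),  # extracts a feature map
    nn.ReLU(),                                        # nonlinear activation
    nn.MaxPool2d(2),                                  # keeps the max of each 2x2 window
)

x = torch.randn(1, 3, 64, 64)   # dummy image batch
print(block(x).shape)           # torch.Size([1, 32, 32, 32])
```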

3.3 Residual network (ResNet)

A residual network is formed by stacking several residual blocks together. Each residual block consists of convolution layers, batch normalization, and activation layers. Batch normalization processes the data and brings numerical stability by using scaling techniques without distorting the structure of the data. The activation layer is added to the residual network to help the neural network learn more complex data. CNN and other deep learning methods use the ReLU (rectified linear unit) function in the activation layer to accommodate the nonlinear nature of the image data. The residual blocks allow information to flow from the first layers to the last layers of the network through residual or skip connections. Therefore, a ResNet can effectively carry features of the input data to the output of the network and thus alleviate the vanishing gradient problem.

Let x be the input to the residual block. After processing x with the two stacked convolution layers of a residual unit, we obtain F(W_i, x), where W_i are the weights of the convolution layers. In a ResNet, before the output of the stacked layers F(W_i, x) is passed to the next layer, the term x, the input of the residual block, is added to it, providing an additional identity mapping known as a skip connection. Therefore, the general formulation of a residual block can be represented as follows:

y = F(W_i, x) + x    (2)

Here x is the input and y is the output of the residual unit; y is then the input to the next residual block. The function F(W_i, x) represents the output of the stacked convolution layers, and W_i are the weights associated with the ith residual block. Figure 2 uses two convolution layers in the residual unit, so the output of this residual unit can be written as:

Figure 1.

HS–MS fusion using CNN.

F(x, W) = W_2 ReLU(W_1 x)    (3)

where ReLU represents the rectified linear unit activation function, and W_1 and W_2 are the weights associated with convolution layers 1 and 2 of the residual block. A deep residual network consists of many stacked residual blocks, and each block can be formulated in general as follows:

x_{i+1} = F(x_i, W_l) + x_i    (4)

where F is the output of the residual block with l stacked convolution layers, x_i is the residual connection to the ith residual block, and x_{i+1} is the output of the ith residual block, calculated through the skip connection by element-wise addition. After passing through the ReLU activation layer, the output of the residual network can be represented as:

y = ReLU(x_{i+1})    (5)
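Putting Eqs. (2)-(5) together, a two-stacked-layer residual block with batch normalization and an identity skip can be sketched in PyTorch as follows; the 64-channel width mirrors the architecture used later, but the block itself is a generic illustration:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two stacked convolutions with an identity skip: y = ReLU(F(x, W) + x),
    i.e. Eqs. (2)-(5). The 64-channel width is illustrative."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding="same")
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding="same")
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # ReLU(W_1 x)
        out = self.bn2(self.conv2(out))           # W_2 ReLU(W_1 x), Eq. (3)
        return self.relu(out + x)                 # skip connection, Eqs. (4)-(5)
```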

4. Problem formulation

Let Z ∈ R^{L×N} denote a high-resolution hyperspectral image with L spectral bands and N pixels. The observed LR-HSI, obtained by downsampling the spatial dimension of Z with a Gaussian blur factor d, is represented as Y_h ∈ R^{L×(N/d)} with L bands and N/d pixels. Similarly, the observed HR-MSI, obtained by downsampling the spectral dimension of Z, is represented as Y_m ∈ R^{L_m×N} with L_m bands and N pixels, where L_m < L [27]. Therefore, the hyperspectral image can be mathematically modeled as:

Z = EA + R    (6)

where Z is the original reference image, E and A are the endmember and abundance matrices, and R is the residual matrix.

The observed Y_h and Y_m, spatially and spectrally degraded versions of the image Z, are further represented mathematically by:

Y_m = SZ + R_m    (7)
Y_h = ZB + R_h    (8)

where B ∈ R^{N×(N/d)} is a Gaussian blur filter with blurring factor d used to blur the spatial content of the reference hyperspectral image Z to obtain the LR-HSI Y_h. The spectral response function S ∈ R^{L_m×L} is used to downsample the spectral content of the reference hyperspectral image Z to obtain the HR-MSI Y_m. The term L_m denotes the number of spectral bands in the multispectral image after downsampling. In this work, the reference image Z is downsampled in its spectral dimension using the standard Landsat 7 multispectral response, which provides high-quality visual imagery of the Earth's surface, as the HR-MSI with L_m = 7 [28]. Both B and S are sparse matrices containing zeros and ones. In the literature, the residual matrices R_m and R_h are generally assumed to be zero-mean Gaussian noise. Therefore, the original CNMF objective is written as:

CNMF(E, A) = ||Y_h − E A_h||_F^2 + ||Y_m − E_m A||_F^2    (9)

However, in this work we treat the residual terms R_m and R_h as nonnegative residual matrices to account for the nonlinearity effects in the image fusion [29]. The objective function of the original CNMF method expressed in Eq. (9) can then be rewritten as:

CNMF(E, A, R) = ||Y_h − (E A_h + R_h)||_F^2 + ||Y_m − (E_m A + R_m)||_F^2    (10)

Eq. (10) therefore represents the proposed model of HS-MS fusion, including the nonlinear nature of the image. To implement this model, we use the standard deep neural network architectures CNN and ResNet. For further enhancement of the proposed method, we implemented modified ResNet architectures with different stacked layers and multiple skip connections.
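As an illustration of the observation model in Eqs. (7) and (8), the following sketch simulates Y_h and Y_m from a reference cube Z using a per-band Gaussian blur with decimation and a spectral response matrix S; SciPy is assumed, and the blur width sigma is an illustrative choice:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(Z, S, d=4, sigma=2.0):
    """Simulate the observations of Eqs. (7)-(8) from a reference cube Z.
    Z: (L, H, W) HR-HSI; S: (Lm, L) spectral response; d: blur/decimation factor."""
    L, H, W = Z.shape
    # Y_h (Eq. 8): blur every band spatially, then decimate by d
    blurred = np.stack([gaussian_filter(Z[b], sigma) for b in range(L)])
    Y_h = blurred[:, ::d, ::d]
    # Y_m (Eq. 7): mix the L bands down to Lm bands with S
    Y_m = np.tensordot(S, Z, axes=([1], [0]))   # shape (Lm, H, W)
    return Y_h, Y_m
```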


5. Problem implementation

5.1 CNN fusion architecture

In the CNN architecture, a 1D CNN convolution operation is performed over the observed HS image Y_h of dimension L_h × N_h, with L_h spectral bands and N_h pixels, with the help of a filter to obtain the spectral data. In the same way, a 2D CNN convolution operation is performed over the observed MS image Y_m of dimension L_m × N_m, with L_m spectral bands and N_m pixels, to obtain the spatial data. Finally, the high spectral component obtained from Y_h and the high spatial component obtained from Y_m are fused together to reconstruct the HR-HSI. The entire deep neural network-based HS-MS fusion is shown in Figure 1.

In the CNN architecture, a Conv1D() convolution filter with kernel size r and weights v is used for extracting spectral data from the LR-HSI Y_h, represented as follows:

f_spec = Conv1D(ReLU(F(v_i, Y_h)))    (11)

Similarly, a Conv2D() convolution filter with kernel size r × r and weights w is used for extracting spatial data from the HR-MSI image Y_m, represented as:

f_spat = Conv2D(ReLU(F(w_ij, Y_m)))    (12)

The two convolutional branches use ReLU (rectified linear unit) activation functions, i.e., ReLU(x) = max(x, 0), to provide a nonlinear mapping of the data. Finally, the extracted spatial and spectral features are fused to obtain a high-quality reconstructed image, as shown in Eq. (13).

F = ReLU(f_spec × f_spat)    (13)

To implement this CNN fusion architecture, we use two convolution networks, 1D and 2D. Both use the same number of convolution layers and the same kernel sizes. Each network uses four convolution layers with 32, 64, 128, and 256 filters. Kernel sizes of 3 × 3 and 1 × 3 are used for the 2D CNN and 1D CNN, respectively, to extract the spatial and spectral information of the image. The architecture and parameters of the CNN HS-MS fusion are shown in Table 1.

Layer | Network | Filters | Kernel size | Stride | Padding | Activation
Conv 1 | Conv 1D | 32 | 1 × 3 | 1 | Same | ReLU
Conv 1 | Conv 2D | 32 | 3 × 3 | 1 | Same | ReLU
Conv 2 | Conv 1D | 64 | 1 × 3 | 1 | Same | ReLU
Conv 2 | Conv 2D | 64 | 3 × 3 | 1 | Same | ReLU
Conv 3 | Conv 1D | 128 | 1 × 3 | 1 | Same | ReLU
Conv 3 | Conv 2D | 128 | 3 × 3 | 1 | Same | ReLU
Conv 4 | Conv 1D | 256 | 1 × 3 | 1 | Same | ReLU
Conv 4 | Conv 2D | 256 | 3 × 3 | 1 | Same | ReLU
Output layer | Conv 1D | 1 | 1 × 1 | 1 | Same | ReLU
Output layer | Conv 2D | 1 | 1 × 1 | 1 | Same | ReLU

Table 1.

The Simple CNN Fusion Architecture.
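Table 1 maps directly onto two small sequential networks. A hedged PyTorch rendering (the input channel counts and the way the two branches are later joined are assumptions on our part, not details from the chapter):

```python
import torch.nn as nn

def branch(conv, in_ch=1, filters=(32, 64, 128, 256), k=3):
    """One branch of Table 1: four conv layers (stride 1, 'same' padding,
    ReLU) plus the single-filter 1x1 output layer. conv is nn.Conv1d
    (1 x 3 kernels, spectral branch) or nn.Conv2d (3 x 3, spatial branch)."""
    layers, ch = [], in_ch
    for f in filters:
        layers += [conv(ch, f, kernel_size=k, padding="same"), nn.ReLU()]
        ch = f
    layers += [conv(ch, 1, kernel_size=1), nn.ReLU()]   # output layer
    return nn.Sequential(*layers)

spectral_branch = branch(nn.Conv1d)   # operates along the spectral axis of Y_h
spatial_branch  = branch(nn.Conv2d)   # operates over the spatial axes of Y_m
```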

In a CNN, each layer takes as input the output of the previous layer, which loses information as the network architecture goes deeper. This problem in deep neural networks leads to overfitting of the data and is known as the vanishing gradient problem [24]. To overcome this, we implemented HS-MS fusion using an alternative ResNet-based network architecture. In the ResNet, we introduce skip connections between the convolution layers. A skip connection helps to carry identity information throughout the deep convolutional network.

5.2 Resnet fusion architecture

The ResNet fusion architecture for HS-MS fusion uses residual or skip connections, which improve the feature extraction capability. For implementation, we use a 1D ResNet to extract spectral features from the LR-HSI and a 2D ResNet to extract spatial features from the HR-MSI. Both the 1D and 2D ResNet architectures consist of three residual blocks, each having two convolutional layers and 64 filters, as shown in Figure 2. A 3 × 3 kernel size for the 2D ResNet and a 1 × 3 kernel size for the 1D ResNet are used for extracting the spatial and spectral data from the MSI and HSI. Each residual block has a ReLU activation layer to accommodate the nonlinearity constraints included in the proposed hyperspectral image fusion model, as explained in Eq. (10). Finally, the feature embedding and image reconstruction are performed using another 2D CNN.

Figure 2.

Residual block with two stacked layers.

  1. Spectral generative network

    The spectral data of the hyperspectral image Y_h are extracted using the 1D ResNet. Initially, spectral data are extracted from the LR-HSI using a 1D CNN, and the residual connection r(Y_h) is mapped across the stacked convolution layers. The sum of the 1D CNN output and r(Y_h) is given as input to the next residual block, and this process is repeated for every residual block in the ResNet. The entire process in the 1D ResNet is expressed mathematically as:

    f(Y_h^l) = ReLU(W_l Y_h^l)    (14)
    f_spec(Y_h^l) = f(Y_h^l) + r(Y_h^l)    (15)

    Therefore, the output of the ith residual block is represented as:

    f_spec^i = f_spec^{i−1}(Y_h^l) + r^{i−1}(Y_h^l)    (16)

    where Y_h denotes the input LR-HSI data, i indexes the residual units i = 1, 2, 3, …, I, and l indexes the convolution layers l = 1, 2, 3, …, L. The weights of the convolution kernels are represented as W. Finally, the ReLU activation function is applied to introduce nonlinearity into the output of the deep network:

    F_spec = ReLU(f_spec)    (17)

  2. Spatial generative network

    The spatial data of the HR-MSI Y_m are extracted using the 2D ResNet. Initially, spatial data are extracted from the HR-MSI using a 2D CNN, and the residual connection r(Y_m) is mapped across the stacked convolution layers. The sum of the 2D CNN output and r(Y_m) is given as input to the next residual block, and this process is repeated for every residual block in the ResNet. The entire process in the 2D ResNet is expressed mathematically as:

    f(Y_m^l) = ReLU(W_l Y_m^l)    (18)
    f_spat(Y_m^l) = f(Y_m^l) + r(Y_m^l)    (19)

    Therefore, the output of the ith residual block is represented as:

    f_spat^i = f_spat^{i−1}(Y_m^l) + r^{i−1}(Y_m^l)    (20)

    where Y_m denotes the input HR-MSI data, i indexes the residual blocks i = 1, 2, 3, …, I, and l indexes the convolution layers l = 1, 2, 3, …, L. The weights of the convolution kernels are represented as W. Finally, similar to the spectral extraction, ReLU is applied to introduce nonlinearity into the spatial output of the deep network:

    F_spat = ReLU(f_spat)    (21)

  3. Fusion of spectral-spatial data

    The spectral data from the LR-HSI and the spatial data from the HR-MSI are extracted using the ResNets with sizes (1 × 1 × Spec) and (Spat × Spat × 1), respectively. After obtaining the spatial and spectral features, the next step is to fuse this information by element-wise multiplication.

    F_Z = F_spec × F_spat    (22)

    Then, the feature embedding and image reconstruction are performed using a ReLU activation layer. The proposed ResNet fusion framework is shown in Figure 3. Therefore, the final generated HR-HSI Z can be written as:

    Z = ReLU(F_Z)    (23)

  4. Different stacked layers and skip connection

    We also propose an extension of the ResNet fusion architecture that varies the number of stacked convolution layers (2 to 4) in the residual block to increase the fusion performance of the deep network. The two-layer residual block contains two stacked convolution layers followed by a ReLU activation layer; similarly, the three-layer and four-layer residual blocks contain three and four stacked convolution layers followed by a ReLU activation layer. In addition, we extend the ResNet fusion architecture with different skip connections, which help to regulate the flow of information through a deeper network more effectively. For this, we use long skip and dense skip connections, as shown in Figure 4. The long skip connections are designed by creating a connection between alternate residual layers i and i + 2, along with a short skip connection between every layer in the ResNet. In a dense skip connection, each layer i obtains additional input from all the preceding layers and passes its own feature maps to all the subsequent layers. Using dense skip connections, each layer in the ResNet receives feature maps from all the preceding layers, which limits the number of filters and network parameters needed for extracting deep features. To obtain a high-fidelity reconstructed image, we propose a modified version of the ResNet with long and dense skip connections, as shown in Figure 4.

Figure 3.

The framework of the proposed ResNet Fusion architecture.

Figure 4.

Representation of short, long, and dense skip connection on ResNet.

Figure 4 shows three ResNet architectures, each having three residual blocks (Res Block), with three different types of skip connections. Algorithm 1 summarizes the procedure of our proposed ResNet fusion method; a code sketch of the same procedure follows the algorithm.

Algorithm 1: ResNet Fusion
Input: LR hyperspectral image Y_h and HR multispectral image Y_m
begin
  1. Extract spectral features from Y_h and spatial features from Y_m using the ResNet

  2. r(Y_h) ← Y_h and r(Y_m) ← Y_m

  3. for each residual block i = 1, 2, 3, …, I do

  4. for each convolution layer l = 2, 3, 4 in the residual block do  # stacked convolution layers

    f(Y_h^l) = ReLU(W_l Y_h^l)

    f(Y_m^l) = ReLU(W_l Y_m^l)

    end for

    # add the residual connection

    f_spec(Y_h^l) = f(Y_h^l) + r(Y_h^l)

    f_spat(Y_m^l) = f(Y_m^l) + r(Y_m^l)

    r(Y_h) ← f_spec(Y_h^l)

    r(Y_m) ← f_spat(Y_m^l)

    end for

  5. The extracted spectral features F_spec of size (1 × 1 × Spec) and spatial features F_spat of size (Spat × Spat × 1) are fused by element-wise multiplication:

  6. F_Z = F_spec × F_spat

  7. The HR-HSI is generated after feature embedding and image reconstruction using a ReLU activation layer:

  8. Z = ReLU(F_Z)

end
Output: HR hyperspectral image Z
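The loop structure of Algorithm 1 can be rendered compactly in PyTorch. In the sketch below, the inputs are assumed to be already embedded to 64 feature maps, and the broadcasting shapes of the fused features follow step 5; these plumbing details are ours, not the chapter's:

```python
import torch
import torch.nn as nn

class ResNetBranch(nn.Module):
    """Steps 2-4 of Algorithm 1 for one branch: I residual blocks, each with
    `stacked` convolutions and a carried skip connection r. The input is
    assumed to be already embedded to `channels` feature maps."""
    def __init__(self, conv=nn.Conv2d, channels=64, blocks=3, stacked=2):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.ModuleList([conv(channels, channels, 3, padding="same")
                           for _ in range(stacked)])
            for _ in range(blocks)
        ])
        self.relu = nn.ReLU()

    def forward(self, x):
        r = x                               # step 2: r(Y) <- Y
        for layers in self.blocks:          # step 3: each residual block
            out = r
            for conv in layers:             # step 4: stacked convolutions
                out = self.relu(conv(out))  # f = ReLU(W_l Y^l)
            r = out + r                     # residual connection, Eqs. (15)/(19)
        return self.relu(r)

def fuse(f_spec, f_spat):
    """Steps 5-8: broadcasted element-wise product followed by ReLU,
    assuming f_spec is a (1 x 1 x Spec)-style vector and f_spat a spatial map."""
    return torch.relu(f_spec * f_spat)      # Z = ReLU(F_spec x F_spat)
```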


6. Results and discussion

In this paper, we initially implemented CNN-based fusion by extracting the spectral data from the LR-HSI using a 1D convolution network and the spatial data from the HR-MSI using a 2D convolution network. These extracted spatial and spectral features are then fused together to obtain the HR-HSI. Extracting more detailed features from HS and MS images requires a deep CNN architecture, but as the CNN architecture becomes deeper it introduces the vanishing gradient problem. To overcome this, we implemented an unsupervised ResNet fusion network using skip connections. The proposed ResNet fusion inherits all the advantages of the standard CNN and additionally allows the design of a deeper network without performance degradation during feature extraction. Therefore, the proposed ResNet fusion architecture extracts more discriminative features from both the HSI and MSI and finally reconstructs a high-resolution HSI by fusing these high-quality features.

The performance of the CNN and ResNet fusion methods is evaluated on four benchmark datasets using the standard quality measures SAM, ERGAS, PSNR, and UIQI [30]. We also compared the performance of CNN and ResNet fusion against the baseline fusion methods CNMF [14], FC-CNMF, and S2FEF-CNN [23]. Among these, CNN shows better performance than CNMF and FC-CNMF, and ResNet-based fusion shows outstanding performance compared with all other methods including CNN. The results obtained by the CNN and ResNet fusion methods against the baselines on the four benchmark datasets are shown in Table 2. A low SAM indicates good spectral quality in the fused image, and a low ERGAS indicates good statistical quality of the reconstructed image. High PSNR and UIQI indicate good spatial quality and a high-fidelity reconstruction with little spectral distortion. Table 2 further shows that good spectral preservation is obtained on the Botswana dataset, whose SAM value is reduced by more than 0.02. At the same time, significant spatial preservation is achieved on the Indian Pines dataset, whose PSNR value increases by about 1.5 dB.

Dataset | Metric | CNMF | FC-CNMF | CNN | S2FEF-CNN | ResNet
Pavia University | SAM | 0.0633 | 0.0652 | 0.0451 | 0.0441 | 0.0409
Pavia University | ERGAS | 0.5423 | 0.4502 | 0.4311 | 0.4901 | 0.4029
Pavia University | PSNR | 64.4502 | 64.8923 | 65.1299 | 64.4915 | 66.1127
Pavia University | UIQI | 0.8779 | 0.9316 | 0.9262 | 0.9665 | 0.9872
Indian Pines | SAM | 0.5113 | 0.3976 | 0.4525 | 0.4118 | 0.3896
Indian Pines | ERGAS | 0.8733 | 0.6991 | 0.6434 | 0.7192 | 0.6170
Indian Pines | PSNR | 62.6779 | 63.1076 | 63.1311 | 64.8165 | 65.2971
Indian Pines | UIQI | 0.7988 | 0.8432 | 0.8118 | 0.8776 | 0.8991
Washington DC Mall | SAM | 0.5609 | 0.5998 | 0.5956 | 0.5519 | 0.5171
Washington DC Mall | ERGAS | 0.5741 | 0.5034 | 0.4993 | 0.4886 | 0.4850
Washington DC Mall | PSNR | 64.09 | 64.12 | 64.19 | 65.11 | 65.1358
Washington DC Mall | UIQI | 0.9199 | 0.9409 | 0.9213 | 0.9365 | 0.9656
Botswana | SAM | 0.2541 | 0.2179 | 0.2233 | 0.2108 | 0.1908
Botswana | ERGAS | 0.5194 | 0.4989 | 0.5034 | 0.4992 | 0.4698
Botswana | PSNR | 63.1123 | 63.4321 | 63.9019 | 64.0116 | 64.8798
Botswana | UIQI | 0.9703 | 0.9772 | 0.9715 | 0.9827 | 0.9960

Table 2.

The performance evaluation of different fused algorithms on four hyperspectral datasets.
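For reference, two of the four quality measures reported in Table 2 can be computed as below; these are the standard definitions of SAM and PSNR, written as a hedged NumPy sketch rather than the exact evaluation code used for the table:

```python
import numpy as np

def sam(ref, est, eps=1e-12):
    """Mean spectral angle (radians) between cubes of shape (L, N);
    lower values mean better spectral preservation."""
    num = np.sum(ref * est, axis=0)
    den = np.linalg.norm(ref, axis=0) * np.linalg.norm(est, axis=0) + eps
    return float(np.mean(np.arccos(np.clip(num / den, -1.0, 1.0))))

def psnr(ref, est):
    """Peak signal-to-noise ratio in dB; higher means better spatial quality."""
    mse = np.mean((ref - est) ** 2)
    return float(10 * np.log10(ref.max() ** 2 / mse))
```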

The above work is extended by varying the number of stacked convolution layers in the residual blocks of the ResNet. The experimental results obtained with different stacked convolution layers are shown in Table 3. The SAM values in Table 3 make clear that the spectral quality of the image degrades as the number of stacked layers in the residual block increases. The UIQI values in Table 3 likewise reveal that the quality of the reconstructed image diminishes as the number of stacked layers increases. The PSNR and ERGAS values remain stable, which confirms the spatial consistency of our proposed method. We therefore conclude from Table 3 that the ResNet fusion network with two stacked convolution layers acquires the most discriminative features from the source images and guarantees the quality of the reconstructed image.

Dataset | Metric | 2 layers | 3 layers | 4 layers
Pavia University | SAM | 0.0409 | 0.065 | 0.069
Pavia University | ERGAS | 0.4029 | 0.4029 | 0.4029
Pavia University | PSNR | 66.1127 | 66.1127 | 66.1127
Pavia University | UIQI | 0.9872 | 0.9713 | 0.9622
Indian Pines | SAM | 0.3896 | 0.4186 | 0.4553
Indian Pines | ERGAS | 0.6170 | 0.6170 | 0.6170
Indian Pines | PSNR | 65.2971 | 65.2971 | 65.2971
Indian Pines | UIQI | 0.8991 | 0.8904 | 0.8801
Washington DC Mall | SAM | 0.5171 | 0.5529 | 0.5721
Washington DC Mall | ERGAS | 0.4850 | 0.4850 | 0.4850
Washington DC Mall | PSNR | 65.1358 | 65.1358 | 65.1358
Washington DC Mall | UIQI | 0.9656 | 0.9432 | 0.9209
Botswana | SAM | 0.1908 | 0.1978 | 0.2085
Botswana | ERGAS | 0.4698 | 0.4698 | 0.4698
Botswana | PSNR | 64.8798 | 64.8798 | 64.8798
Botswana | UIQI | 0.9960 | 0.9822 | 0.9589

Table 3.

The performance of ResNet fusion by varying the stacked layers.

Figure 5 shows the visual representation of the output of our proposed ResNet fusion method on the four benchmark datasets against all other baseline methods. From the figure, it is evident that ResNet fusion with two stacked convolution layers produces better results in most of the highlighted areas of the four datasets (Figure 5).

Figure 5.

The ground truth and fused image of different methods using four benchmark datasets.

We further extend the ResNet fusion architecture to reduce the number of parameters, making our proposed method more efficient and effective in handling high-dimensional data. For that, we add short skip, long skip, and dense skip connections to the ResNet architecture with two stacked convolution layers. Table 4 gives the total number of network parameters required by the ResNet architecture with each skip connection. From Table 4, it is clear that the ResNet architecture with dense skip connections requires far fewer network parameters than the ResNets with short and long skip connections.

Architecture | Number of parameters
CNN | 31,586,081
ResNet with short skip | 8,045,825
ResNet with long skip | 390,529
ResNet with dense skip | 19,393

Table 4.

The number of network parameters for different skip connections.
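Counts like those in Table 4 can be reproduced for any PyTorch model with a one-line helper:

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total trainable parameters, comparable to the counts in Table 4."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```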

  1. Time complexity

    The performance and running time of all the proposed algorithms on the four benchmark datasets are compared in Figure 6. From this figure, it is evident that ResNet fusion with dense skip connections takes the least running time while reconstructing a high-fidelity hyperspectral image. Comparing the ResNets with long skip and short skip connections, the long skip architecture shows better performance and running time than the short skip architecture. Evaluating all the ResNet fusion architectures together, the ResNet with dense skip connections outperforms the other two. Among the baselines, the FC-CNMF method shows better performance and running time than CNN-based fusion. Finally, we conclude that the ResNet with dense skip connections, with its smaller number of network parameters, shows the best performance for reconstructing an HR-HSI with good spatial and spectral quality compared with all other proposed methods. However, although all our proposed methods perform well, the cost incurred in terms of time remains high.

  2. Resnet HS-MS fusion model

    The experimental analysis of our ResNet fusion architecture with various parameters was carried out to build a general model for our proposed HS-MS ResNet fusion algorithm. For this purpose, we trained the network using cropped HSI and MSI image pairs from each dataset: each dataset is cropped into several patches and then divided into training and testing data (see the patch-cropping sketch below). For the Pavia University dataset of size 610 × 340 × 103, a patch size of M × N × L = 15 × 15 × 103 gave high performance for our network model. Similarly, we created training and testing samples for the other three datasets. The patch size for the Washington DC Mall dataset was M × N × L = 19 × 19 × 191, for the Botswana dataset M × N × L = 17 × 17 × 145, and for the Indian Pines dataset M × N × L = 19 × 19 × 192, giving a network model with good running time and network parameters.

    We measured the quality metrics of our ResNet fusion while varying the number of stacked layers and found that residual blocks with two stacked convolution layers perform better than the others. The most significant part of a ResNet is the skip connection, which helps information flow through the network more efficiently and effectively. We therefore also experimented with three skip connections: short skip, long skip, and dense skip. From this experiment, we found that the ResNet with dense skip connections reduces the number of network parameters to a large extent.

    Finally, we built a generative ResNet model for the fusion of HS-MS images, as shown in Table 5. The ResNet fusion model uses 1D and 2D convolution networks. These two convolution networks consist of three residual blocks, each containing two convolution layers with 64 filters, a 3 × 3 kernel size, stride = 1, max pooling, and padding = same. To make the information flow accurately throughout the network, we use dense skip connections. Finally, a 2D convolution decodes the reconstructed image into the original format.
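A sketch of the patch preparation described above, assuming the cube is held as a NumPy array; the stride and the 80/20 split are illustrative assumptions:

```python
import numpy as np

def crop_patches(cube, patch=15, stride=15):
    """Crop an (H, W, L) hyperspectral cube into (patch, patch, L) tiles,
    e.g. 15 x 15 x 103 patches for Pavia University as described above."""
    H, W, _ = cube.shape
    tiles = [cube[i:i + patch, j:j + patch, :]
             for i in range(0, H - patch + 1, stride)
             for j in range(0, W - patch + 1, stride)]
    return np.stack(tiles)

def split(tiles, frac=0.8, seed=0):
    """Shuffle the tiles and split them into training and testing sets."""
    idx = np.random.default_rng(seed).permutation(len(tiles))
    cut = int(frac * len(tiles))
    return tiles[idx[:cut]], tiles[idx[cut:]]
```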

Figure 6.

The running time of traditional and deep learning HS-MS image fusion.

Name | Layer | Network | Kernel size | Input size | Input content | Stride | Padding | Activation | Output size | Output content
Input layer | Conv 1 | 1D-CNN | 1 × 3 | 1 | 1D image (spectral) | 1 | same | ReLU | 64 | 1DConv1
Input layer | Conv 1 | 2D-CNN | 3 × 3 | 2 | 2D image (spatial) | 1 | same | ReLU | 64 | 2DConv1
Residual block 1 | Conv 2 | 1D-CNN | 1 × 3 | 64 | 1DConv1 | 1 | same | ReLU | 64 | 1DConv2
Residual block 1 | Conv 2 | 2D-CNN | 3 × 3 | 64 | 2DConv1 | 1 | same | ReLU | 64 | 2DConv2
Residual block 1 | Conv 3 | 1D-CNN | 1 × 3 | 64 | 1DConv2 | 1 | same | ReLU | 64 | 1DConv3
Residual block 1 | Conv 3 | 2D-CNN | 3 × 3 | 64 | 2DConv2 | 1 | same | ReLU | 64 | 2DConv3
Residual block 1 | Skip connection | Add 1 | | | 1DConv1 + 1DConv3 | | | | | 1DResB1
Residual block 1 | Skip connection | Add 1 | | | 2DConv1 + 2DConv3 | | | | | 2DResB1
Residual block 2 | Conv 4 | 1D-CNN | 1 × 3 | 64 | 1DResB1 | 1 | same | ReLU | 64 | 1DConv4
Residual block 2 | Conv 4 | 2D-CNN | 3 × 3 | 64 | 2DResB1 | 1 | same | ReLU | 64 | 2DConv4
Residual block 2 | Conv 5 | 1D-CNN | 1 × 3 | 64 | 1DConv4 | 1 | same | ReLU | 64 | 1DConv5
Residual block 2 | Conv 5 | 2D-CNN | 3 × 3 | 64 | 2DConv4 | 1 | same | ReLU | 64 | 2DConv5
Residual block 2 | Skip connection | Add 2 | | | 1DConv1 + 1DResB1 + 1DConv5 | | | | | 1DResB2
Residual block 2 | Skip connection | Add 2 | | | 2DConv1 + 2DResB1 + 2DConv5 | | | | | 2DResB2
Residual block 3 | Conv 6 | 1D-CNN | 1 × 3 | 64 | 1DResB2 | 1 | same | ReLU | 64 | 1DConv6
Residual block 3 | Conv 6 | 2D-CNN | 3 × 3 | 64 | 2DResB2 | 1 | same | ReLU | 64 | 2DConv6
Residual block 3 | Conv 7 | 1D-CNN | 1 × 3 | 64 | 1DConv6 | 1 | same | ReLU | 64 | 1DConv7
Residual block 3 | Conv 7 | 2D-CNN | 3 × 3 | 64 | 2DConv6 | 1 | same | ReLU | 64 | 2DConv7
Residual block 3 | Skip connection | Add 3 | | | 1DConv1 + 1DResB1 + 1DResB2 + 1DConv7 | | | | | 1DResB3
Residual block 3 | Skip connection | Add 3 | | | 2DConv1 + 2DResB1 + 2DResB2 + 2DConv7 | | | | | 2DResB3
Max pooling | Conv 8 | 1D-CNN | 1 × 3 | 64 | 1DResB3 | 1 | same | ReLU | 32 | 1DConv8
Max pooling | Conv 8 | 2D-CNN | 3 × 3 | 64 | 2DResB3 | 1 | same | ReLU | 32 | 2DConv8
Flatten layer | Conv 9 | 1D-CNN | 1 × 1 | 32 | 1DConv8 | 1 | same | ReLU | 1 | Spectral data
Flatten layer | Conv 9 | 2D-CNN | 1 × 1 | 32 | 2DConv8 | 1 | same | ReLU | 1 | Spatial data
Upsampling layer | Conv 10 | 2D-CNN | 3 × 3 | 1 | Spectral/spatial data | 1 | same | ReLU | 32 | Spectral × Spatial
Output layer | Conv 11 | 2D-CNN | 3 × 3 | 32 | Spectral × Spatial | 1 | same | ReLU | 64 | Fused image

Table 5.

The ResNet dense-skip architecture for HS-MS image fusion.
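The Add rows of Table 5 describe the dense-skip wiring: the skip sum of block i collects the first convolution output and every earlier block output. A hedged PyTorch sketch of one such branch (single-channel input and 64 filters assumed; not the exact training code behind Table 5):

```python
import torch.nn as nn

class DenseSkipBranch(nn.Module):
    """Dense-skip wiring of Table 5: the Add layer of block i sums the
    block's own stacked-convolution output with the stem output and
    every earlier block output."""
    def __init__(self, conv=nn.Conv2d, channels=64, blocks=3):
        super().__init__()
        self.stem = conv(1, channels, 3, padding="same")   # Conv 1 of Table 5
        self.blocks = nn.ModuleList([
            nn.Sequential(conv(channels, channels, 3, padding="same"), nn.ReLU(),
                          conv(channels, channels, 3, padding="same"), nn.ReLU())
            for _ in range(blocks)
        ])
        self.relu = nn.ReLU()

    def forward(self, x):
        carried = [self.relu(self.stem(x))]   # feeds every later Add layer
        out = carried[0]
        for block in self.blocks:
            out = block(out)                  # two stacked convolutions
            out = sum(carried) + out          # Add i: dense skip connection
            carried.append(out)               # pass this block's map forward
        return out
```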


7. Conclusion

In this work, we implemented HS-MS fusion with deep learning methods because of their strong ability to extract features from images. At first, we implemented the HS-MS fusion process with a conventional CNN. However, in a CNN each layer takes the output of the previous layer, which tends to lose information as the network goes deeper. We therefore also implemented the fusion process in a ResNet by adding skip connections between the convolution layers. These skip connections help to extract more detailed features from the images without degradation problems. Our ResNet fusion architecture includes three residual blocks, and each block is a combination of stacked convolution layers and skip connections. Moreover, we modified the ResNet fusion architecture with different numbers of stacked layers and found that the ResNet with two stacked layers gives the most accurate results. Finally, we extended the ResNet architecture to reduce the number of parameters by using different skip connections: short skip, long skip, and dense skip. From the experimental analysis, we found that the ResNet with dense skip connections improves image reconstruction performance with far fewer network parameters and less running time than the other fusion methods. This deep residual network helps to extract nonlinear features with the help of the ReLU activation layer. The experiments and performance analysis of our algorithm were carried out quantitatively on four benchmark datasets. The fusion results indicate that the ResNet with dense skip fusion shows outstanding performance over traditional and DL methods, preserving the spatial and spectral data of the reconstructed image to a large extent.

References

  1. Hagen N, Kudenov MW. Review of snapshot spectral imaging technologies. Optical Engineering. 2013;52(10):090901
  2. Feng F, Zhao B, Tang L, Wang W, Jia S. Robust low-rank abundance matrix estimation for hyperspectral unmixing. IET International Radar Conference (IRC 2018). 2019;2019(21):6406-6409
  3. Dhore AD, Veena CS. Evaluation of various pansharpening methods using image quality metrics. In: 2nd International Conference on Electronics and Communication Systems (ICECS). IEEE; 2015. DOI: 10.1109/ecs.2015.7125039
  4. Wang Z, Chen B, Lu R, Zhang H, Liu H, Varshney PK. FusionNet: An unsupervised convolutional variational network for hyperspectral and multispectral image fusion. IEEE Transactions on Image Processing. 2020;29:7565-7577
  5. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2016. pp. 770-778
  6. Loncan L, de Almeida LB, Bioucas-Dias JM, Briottet X, et al. Hyperspectral pansharpening: A review. IEEE Geoscience and Remote Sensing Magazine. 2015;3(3):27-46
  7. Vivone G, et al. A critical comparison among pansharpening algorithms. IEEE Transactions on Geoscience and Remote Sensing. 2015;53(5):2565-2586
  8. Wei Q, Bioucas-Dias J, Dobigeon N, Tourneret J-Y. Hyperspectral and multispectral image fusion based on a sparse representation. IEEE Transactions on Geoscience and Remote Sensing. 2015;53:3658-3668
  9. Paatero P, Tapper U. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics. 1994;5:111-126
  10. Lee DD, Seung HS. Algorithms for non-negative matrix factorization. In: Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press; 2001. pp. 556-562
  11. Tong L, Zhou J, Qian B, Yu J, Xiao C. Adaptive graph regularized multilayer nonnegative matrix factorization for hyperspectral unmixing. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing. 2020;13:434-447
  12. Cao J, et al. An endmember initialization scheme for nonnegative matrix factorization and its application in hyperspectral unmixing. ISPRS International Journal of Geo-Information. 2018;7:195. DOI: 10.3390/ijgi7050195
  13. Nascimento JMP, Bioucas-Dias JM. Vertex component analysis: A fast algorithm to unmix hyperspectral data. IEEE Transactions on Geoscience and Remote Sensing. 2005;43(4)
  14. Yokoya N, Yairi T, Iwasaki A. Coupled nonnegative matrix factorization unmixing for hyperspectral and multispectral data fusion. IEEE Transactions on Geoscience and Remote Sensing. 2012;50:528-537
  15. Simoes M, Bioucas-Dias J, Almeida L, Chanussot J. A convex formulation for hyperspectral image super resolution via subspace-based regularization. IEEE Transactions on Geoscience and Remote Sensing. 2015;53:3373-3388
  16. Lin C-H, Ma F, Chi C-Y, Hsieh C-H. A convex optimization-based coupled nonnegative matrix factorization algorithm for hyperspectral and multispectral data fusion. IEEE Transactions on Geoscience and Remote Sensing. 2018;56(3):1652-1667. DOI: 10.1109/tgrs.2017.2746078
  17. Yang F, Ma F, Ping Z, Xu G. Total variation and signature-based regularizations on coupled nonnegative matrix factorization for data fusion. IEEE Access. 2019;7:2695-2706. DOI: 10.1109/ACCESS.2018.2857943
  18. Yang F, Ping Z, Ma F, Wang Y. Fusion of hyperspectral and multispectral images with sparse and proximal regularization. IEEE Access. 2019. DOI: 10.1109/ACCESS.2019.2961240
  19. Palsson F, Sveinsson JR, Ulfarsson MO. Multispectral and hyperspectral image fusion using a 3-D convolutional neural network. IEEE Geoscience and Remote Sensing Letters. 2017;14:639-643
  20. Masi G, Cozzolino D, Verdoliva L, Scarpa G. Pansharpening by convolutional neural networks. Remote Sensing. 2017;8(7):594
  21. Shao Z, Cai J. Remote sensing image fusion with deep convolutional neural network. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing. 2018;11(5):1656-1669
  22. Yang J, Zhao Y-Q, Chan J. Hyperspectral and multispectral image fusion via deep two-branches convolutional neural network. Remote Sensing. 2019;10(5):800
  23. Chen L, Wei Z, Xu Y. A lightweight spectral-spatial feature extraction and fusion network for hyperspectral image classification. Remote Sensing. 2020;12:1395. DOI: 10.3390/rs12091395
  24. Song W, Li S, Fang L, Lu T. Hyperspectral image classification with deep feature fusion network. IEEE Transactions on Geoscience and Remote Sensing. 2018;56(7):3173-3184
  25. Available from: http://lesun.weebly.com/hyperspectral-data-set.html
  26. Goodfellow I, Bengio Y, Courville A. Deep Learning. MIT Press; 2016. Available from: https://www.deeplearningbook.org/
  27. Ma F, Yang F, Ping Z, Wang W. Joint spatial-spectral smoothing in a minimum-volume simplex for hyperspectral image super-resolution. Applied Sciences. 2019;10(1)
  28. Available from: https://www.usgs.gov/landsat-missions/landsat-7
  29. Hong D, Yokoya N, Chanussot J, Zhu XX. An augmented linear mixing model to address spectral variability for hyperspectral unmixing. IEEE Transactions on Image Processing. 2018
  30. Wang Z, Bovik AC. A universal image quality index. IEEE Signal Processing Letters. 2002;9(3):81-84
