
## Abstract

Reconstructing a high-resolution (HR) hyperspectral (HS) image from an observed low-resolution (LR) hyperspectral image or a high-resolution multispectral (RGB) image captured by existing imaging cameras is an important research topic for acquiring comprehensive scene information in both the spatial and spectral domains. HR-HS image reconstruction mainly follows two research strategies: optimization-based methods and deep convolutional neural network-based learning methods. The optimization-based approaches estimate the HR-HS image by minimizing the reconstruction errors of the available low-resolution hyperspectral and high-resolution multispectral images under different prior constraints, such as representation sparsity, spectral physical properties, and spatial smoothness. Recently, the deep convolutional neural network (DCNN) has been applied to resolution enhancement of natural images and has been proven to achieve promising performance. This chapter provides a comprehensive description of both the conventional optimization-based methods and the recently investigated DCNN-based learning methods for HS image super-resolution, mainly including the spectral reconstruction CNN and the spatial and spectral fusion CNN. Experimental results on benchmark datasets are presented to validate the effectiveness of HS image super-resolution in terms of both quantitative metrics and visual quality.

### Keywords

- hyperspectral imaging
- image super-resolution
- optimization-based approach
- deep convolutional neural network (DCNN)
- spectral reconstruction
- spatial and spectral fusion

## 1. Introduction

Hyperspectral (HS) imaging simultaneously acquires a set of images of the same scene over a large number of narrow-band wavelengths, which can effectively describe the spectral distribution of every scene point and provide intrinsic, discriminative spectral information about the scene. The acquired dense spectral bands can benefit numerous applications, including object recognition and segmentation [1, 2, 3, 4, 5, 6, 7, 8, 9], medical image analysis [10], and remote sensing [11, 12, 13, 14, 15], to name a few. However, despite the abundant spectral information, HS imaging generally yields much lower spatial resolution than ordinary panchromatic and RGB imaging, since photon collection in HS sensors is performed over a much larger spatial region to guarantee a sufficiently high signal-to-noise ratio. The low spatial resolution of HS images leads to heavy spectral mixing of different materials in a scene and greatly degrades the performance of scene analysis and understanding. Therefore, the reconstruction of a high-resolution hyperspectral (HR-HS) image using image processing and machine learning techniques has attracted a lot of attention.

Especially in the remote sensing field, a low-resolution (LR) multispectral or HS image is usually available together with an HR single-channel panchromatic image, and the fusion of these two images is generally known as the pan-sharpening technique [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]. Motivated by the fact that human vision is more sensitive to luminance, traditional pan-sharpening techniques mainly concentrated on reliable illumination restoration by substituting a computed component of the LR-HS image with the HR information of the panchromatic image, for example, via hue-saturation decomposition or principal component analysis. However, these simple approaches unavoidably cause spectral distortion in the resulting image. More recently, HS image super-resolution research has actively investigated optimization methods that minimize the reconstruction error of the available LR-HS and HR-MS (HR-RGB) images [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], which have demonstrated impressive performance. The basic idea of these optimization-based approaches is to represent the spectrum as a matrix decomposition under different constraints, such as representation sparsity, spectral physical properties, and spatial context similarity, where the composed matrices are iteratively optimized to more accurately approximate the observed images. In particular, HS image super-resolution based on matrix factorization and spectral unmixing [40, 41, 42, 43] has been actively investigated [16, 17, 27, 28]; it is motivated by the fact that HS observations can be represented as a linear combination of a reflectance basis (the spectral signatures of the pure materials) with a weight vector, denoting the fractions of the pure materials in the spectral response, that is assumed to be sparse. A coupled nonnegative matrix factorization (CNMF) by Yokoya et al.
[19], inspired by the physical property of nonnegative weights in the linear combination, has been proposed to estimate the HR-HS image from a pair of HR-MS and LR-HS images. Although the CNMF approach provides acceptable spectral recovery performance, its solution is usually not unique [44], which may lead to unsatisfactory spectral recovery results. Lanaras et al. [10] proposed to integrate a coupled spectral unmixing strategy into HS super-resolution and conducted the optimization with the proximal alternating linearized minimization method, which requires good initial estimates of the two decomposed factors, the reflectance signatures and the fraction vectors, to provide impressive results. Furthermore, taking into consideration the physical meaning of the spectral linear combination of reflectance signatures and implementation efficiency, most work generally assumes that the number of pure materials in the observed scene is smaller than the number of spectral bands, which is not always satisfied in real applications.

Motivated by the successful applications of sparse representation to natural image analysis [14, 15], such as image denoising, super-resolution, and representation, sparsity-promoting approaches that do not explicitly impose the physical-meaning constraint on the reflectance signatures (basis), and thus permit an over-complete basis, have widely been applied to HS super-resolution [18, 19]. Inspired by work on general RGB image analysis with sparse representation, Grohnfeldt et al. [11] explored a joint sparse representation for HS image super-resolution. By learning corresponding HS and MS (RGB) patch dictionaries from prepared image pairs, this work assumed the same sparse coefficients for the corresponding MS and HS patch dictionaries, so that the coefficients can be calculated from the MS input patch only. However, this procedure was conducted on each individual band; it mainly pursued faithful reconstruction of the local structure (patch) and completely ignored the spectral correlation between channels. Therefore, several other works [19, 22] investigated sparse spectral representation by reconstructing all band spectra jointly instead of the local structure of each individual band. Akhtar et al. [13] explored a sparse spatio-spectral representation that computes the optimized sparse coefficients of each spectral pixel while assuming that the pixels in a local grid region use the same atoms, thereby integrating the spatial structure. For computational efficiency, a generalized simultaneous orthogonal matching pursuit (G-SOMP) algorithm was proposed for estimating the sparse coefficients in [22]. Later, the same research group integrated sparse representation with a Bayesian dictionary learning algorithm to improve HS image super-resolution performance and demonstrated its effectiveness. Dong et al.
[21] proposed a nonnegative structured sparse representation (NSSR) approach that takes the spatial structure into consideration and conducted the optimization with the alternating direction method of multipliers (ADMM). NSSR outperformed the other state-of-the-art approaches in HS image recovery by a large margin. Furthermore, Han et al. [45] proposed to recover the HR-HS output by minimizing the coupled reconstruction error of the available LR-HS and HR-RGB images under the following constraints: (1) sparse representation with an over-complete spectral dictionary in the coupled unmixing strategy [17] and (2) self-similarity of the sparse spectral representation over the global structures and the local spectra present in the available HR-RGB image, which further improved HS image recovery performance in both visual and quantitative terms.

Deep convolutional neural networks (CNNs) have recently shown great success in various image processing and computer vision applications. CNNs have also been applied to RGB image super-resolution and achieved promising performance. Dong et al. [46] proposed a three-layer CNN architecture (SRCNN), which demonstrates about 0.5–1.5 dB improvement and much lower computational cost compared with the popularly used sparse-representation-based methods, and they further extended SRCNN to directly handle the available LR images without an explicit upsampling operation, called fast SRCNN. Kim et al. [47] exploited a very deep CNN based on the VGG-net architecture and concentrated on estimating only the missing high-frequency image (residual image). Ledig et al. integrated two different types of networks, a generative network and a discriminative network (a generative adversarial network, GAN), to estimate a much sharper HR image. For applying CNNs to HS image super-resolution, Li et al. [48] applied an SRCNN-like structure to super-resolve the HS image from the LR-HS image only. These CNN architectures take only the LR image as input, and the resolution expansion factor is practically limited to below 8 in both height and width. There are also several works exploring CNN-based methods with various backbone architectures to expand the spectral resolution with only an HR-RGB image as input [49, 50]. This chapter introduces several research works based on DCNN learning for HS image reconstruction.

On the other hand, regarding the observed data used, HR-HS image reconstruction can be divided into three research directions: (1) spatial resolution enhancement from hyperspectral imaging, (2) spectral resolution enhancement from RGB imaging, and (3) fusion of the observed HR-RGB and low-resolution (LR) HS images of the same scene. Spatial resolution enhancement has popularly been applied to single natural image super-resolution [46, 47], and impressive performance has been achieved, especially with deep learning methods, for resolution expansion factors from 2 to 4. The deep convolutional neural network (DCNN) has also been adopted for predicting the HR-HS image from a single LR-HS image [48], which validated the feasibility of HS image super-resolution for small expansion factors. However, the spatial resolution of the available HS image is considerably low compared with the commonly observed RGB image, so the expansion factor required for HR-HS image reconstruction is large, for example, more than 10 in both the horizontal and vertical directions. Thus, the reconstructed HS image with acceptable quality usually cannot reach the spatial resolution required by different applications. Spectral resolution enhancement, that is, RGB-to-spectrum reconstruction [49, 50], has recently become a hot research line, since a single RGB image can easily be collected with a low-cost visual sensor. Although the impressive potential of RGB-to-spectrum reconstruction has been demonstrated, there is still large room for performance improvement in real applications. Fusing an LR-HS image with the corresponding HR-RGB image to obtain an HR-HS image has shown promising performance [18, 19, 22, 30] compared with the spatial and spectral resolution enhancement methods.
Fusion is usually solved as an optimization problem with prior knowledge, such as sparse representation and spectral physical properties, imposed as constraints, which requires comprehensive analysis of the target scene beforehand and may vary from scene to scene. Motivated by the amazing performance of DCNNs in natural image super-resolution, Han et al. [51] proposed a spatial and spectral fusion network (SSF-Net) for HR-HS image reconstruction and validated its better results in spite of the simple concatenation of the upsampled LR-HS image and the HR-RGB image. However, upsampling the LR-HS image and simply concatenating it cannot effectively integrate the existing spatial structure and spectral properties and also increases the computational cost. In addition, precise alignment of the input LR-HS and HR-RGB images is needed, which is extremely difficult due to their large difference in spatial resolution. This chapter introduces several advanced DCNN-based learning methods for hyperspectral image super-resolution and demonstrates their impressive performance on benchmark datasets. The basic concept of hyperspectral image super-resolution is shown in Figure 1.

## 2. Problem formulation of HS image super-resolution

The goal of HS image super-resolution is to recover a HR-HS image **Z** ∈ ℝ^(L×N), where *L* denotes the spectral band number and *N* = *W* × *H* is the number of pixels, with *W* and *H* denoting the image width and height, respectively, from a HR-MS (HR-RGB) image **Y** ∈ ℝ^(3×N) and a LR-HS image **X** ∈ ℝ^(L×M) of the same scene, where *M* ≪ *N* is the number of LR pixels. The two observations are commonly modeled as spatially and spectrally degraded versions of **Z**:

**X** = **ZH**,  (1)

where **H** ∈ ℝ^(N×M) is the spatial degradation operator combining blurring and downsampling, and

**Y** = **RZ**,  (2)

where **R** ∈ ℝ^(3×L) is the spectral response function of the RGB camera, which is assumed to be known. The estimation of **Z** can then be formulated as the minimization of the coupled reconstruction error:

min_**Z** ‖**X** − **ZH**‖²_F + ‖**Y** − **RZ**‖²_F.  (3)

Since the number of unknowns (*NL*) is much larger than the number of available measurements (*ML* + 3*N*), the above optimization problem is highly ill-posed, and proper regularization terms are required to narrow the solution space and ensure stable estimation. A widely adopted constraint is that each pixel spectrum **z**_n can be represented as a linear combination of the spectral signatures **d**_k (the *k*-th endmember) of *K* distinct materials:

**z**_n = Σ_(k=1..K) α_(n,k) **d**_k = **D**α_n,  s.t. **d**_k ≥ 0, α_n ≥ 0, **1**^T α_n = 1,  (4)

where **D** = [**d**_1, …, **d**_K] ∈ ℝ^(L×K) and α_n denotes the fractional abundances of the *K* materials for the *n*-th pixel. Taking into consideration the physical properties of spectral reflectance, the elements of the spectral signatures and the fractional abundances are nonnegative, as shown in the first and second constraint terms of Eq. (4), and the abundance vector of each pixel sums to one.

According to the spectral degradation model in Eq. (2), each pixel of the HR-RGB image can also be represented with the same spectral signatures:

**y**_n = **RD**α_n,  (5)

where **y**_n is the RGB vector of the *n*-th HR pixel.

The matrix representation forms of Eqs. (4) and (5) can be formulated as:

**X** ≈ **DAH**,  **Y** ≈ **RDA**,  (6)

where **A** = [α_1, …, α_N] ∈ ℝ^(K×N) collects the representation coefficients of all HR pixels. Substituting Eq. (6) into the data fidelity terms of Eq. (3) gives the joint optimization problem:

min_(**D**,**A**) ‖**X** − **DAH**‖²_F + ‖**Y** − **RDA**‖²_F,  s.t. **D** ≥ 0, **A** ≥ 0.  (7)

The goal of Eq. (7) is to solve both the spectral dictionary **D** and the coefficient matrix **A**; the HR-HS image is then reconstructed as **Z** = **DA**.
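The coupled degradation model and the resulting dimension gap can be sketched numerically. The following NumPy example uses toy sizes (all dimensions, the average-pooling operator, and the random spectral response are illustrative stand-ins, far smaller than the real CAVE setting) and confirms that the unknowns outnumber the measurements:

```python
import numpy as np

# Toy dimensions (hypothetical, much smaller than the 31-band, 512x512 CAVE setup)
L, W, H_img = 8, 16, 16      # spectral bands, HR width/height
N = W * H_img                # number of HR pixels
s = 4                        # spatial downsampling factor
M = N // (s * s)             # number of LR pixels

rng = np.random.default_rng(0)
Z = rng.random((L, N))               # unknown HR-HS image, L x N

# Spatial degradation: average pooling over s x s blocks, i.e. X = Z @ H_mat
H_mat = np.zeros((N, M))
for j in range(M):
    by, bx = divmod(j, W // s)       # block coordinates of LR pixel j
    for dy in range(s):
        for dx in range(s):
            n = (by * s + dy) * W + (bx * s + dx)
            H_mat[n, j] = 1.0 / (s * s)
X = Z @ H_mat                        # LR-HS observation, L x M

# Spectral degradation: camera spectral response R (3 x L), Y = R @ Z
R = rng.random((3, L))
R /= R.sum(axis=1, keepdims=True)
Y = R @ Z                            # HR-RGB observation, 3 x N

print("unknowns:", L * N, "measurements:", L * M + 3 * N)
```

With these sizes there are 2048 unknowns but only 896 measurements, which is the ill-posedness that the sparsity and self-similarity priors address.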

## 3. Self-similarity constrained sparse representation for HS image super-resolution

The complete pipeline of the self-similarity constrained sparse representation for HS image super-resolution is illustrated in Figure 2. The main contribution of this method is a nonnegative sparse representation coupled with a self-similarity constraint to regularize the solution of Eq. (7). Denoting the self-similarity regularization term on the sparse coefficient matrix **A** as R(**A**), the objective function is formulated as:

min_(**D**,**A**) ‖**X** − **DAH**‖²_F + ‖**Y** − **RDA**‖²_F + γ R(**A**),  s.t. **D** ≥ 0, **A** ≥ 0,  (8)

where γ is a parameter balancing the data fidelity terms and the self-similarity regularization. The construction of the regularization term from the HR-RGB image is described in Section 3.2.

### 3.1 Online HS dictionary learning

Since different materials can have a very large variety of HS reflectances, learning a common HS dictionary for various scenes with different materials would lead to considerable spectral distortion. In order to obtain an adaptive HS dictionary that reconstructs the pixel spectra well, this study conducts the learning procedure directly on the observed LR-HS image **X** of the target scene, solving a nonnegative sparse dictionary learning problem of the form:

min_(**D**,**B**) ‖**X** − **DB**‖²_F + λ‖**B**‖₁,  s.t. **D** ≥ 0, **B** ≥ 0,  (9)

where **B** is the sparse coefficient matrix of the LR-HS pixels and λ controls the representation sparsity. Since the LR-HS pixels of the target scene cover the same materials as the desired HR-HS image, the learned dictionary adapts to the spectra of the scene under consideration.
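As an illustration of learning a nonnegative spectral dictionary directly from the LR-HS pixels, the following sketch uses plain multiplicative NMF updates. This is a generic stand-in rather than the chapter's exact online algorithm; `learn_hs_dictionary`, the atom count, and the iteration budget are all illustrative choices:

```python
import numpy as np

def learn_hs_dictionary(X, n_atoms=16, n_iter=200, eps=1e-9, seed=0):
    """Learn a nonnegative spectral dictionary D (L x K) from LR-HS pixels X (L x M)
    via multiplicative NMF updates -- a simple stand-in for online dictionary learning."""
    rng = np.random.default_rng(seed)
    L, M = X.shape
    D = rng.random((L, n_atoms)) + eps
    B = rng.random((n_atoms, M)) + eps
    for _ in range(n_iter):
        B *= (D.T @ X) / (D.T @ D @ B + eps)   # update sparse-like codes
        D *= (X @ B.T) / (D @ B @ B.T + eps)   # update dictionary atoms
        scale = D.max(axis=0)                  # rescale atoms, compensate in codes
        D /= scale
        B *= scale[:, None]
    return D, B

rng = np.random.default_rng(1)
X = rng.random((8, 100))          # toy LR-HS pixels: 8 bands, 100 pixels
D, B = learn_hs_dictionary(X, n_atoms=12)
print("relative error:", np.linalg.norm(X - D @ B) / np.linalg.norm(X))
```

The atom rescaling keeps the dictionary bounded while preserving the product **DB**, so the fit is unaffected.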

### 3.2 Extraction of self-similarity constraint

The self-similarity regularization term is built from two types of similarity cues extracted from the HR-RGB image:

• **Global-structure self-similarity**: Pixels with similar spatial structure, represented as the concatenated RGB spectra within a local square window, share similar hyperspectral information, so the sparse vectors reconstructing the hyper-spectra of these pixels should also be similar. This holds for both nearby patches and nonlocal patches across the whole image plane, which we name global-structure self-similarity.

• **Local-spectral self-similarity**: Pixels in a local region with similar RGB values in the HR-RGB image are likely to belong to the same material, so the sparse vectors of different HR pixels within a local region (superpixel) should be similar. Note that a superpixel is usually not a square patch.

The global-structure self-similarity is represented by global-structure groups G_p, p = 1, …, *P* (*P* groups in total), which are obtained by clustering all similar patches (spatial structures) in the HR-RGB image with *K*-means; G_p denotes the set of pixels belonging to the *p*-th group. The local-spectral self-similarity is formulated with superpixels S_q, q = 1, …, *Q* (*Q* superpixels in total), obtained via the SLIC superpixel segmentation method; S_q denotes the *q*-th superpixel. Since the pixels in the same global-structure group have similar spectral-spatial structure, we approximate the sparse vector of any pixel in a given group by a weighted average of the sparse vectors of all pixels in this group. Similarly, the sparse vector of a pixel can also be approximated by a weighted average of the sparse vectors of all pixels in the same local-spectral superpixel. With both self-similarity constraints, the sparse vector of the *n*-th pixel can be formulated as:

α_n ≈ Σ_(i ∈ G_p) w^G_(n,i) α_i,  α_n ≈ Σ_(j ∈ S_q) w^L_(n,j) α_j,  (10)

where α_n is the *n*-th sparse vector, α_i is the *i*-th sparse vector in the global-structure group G_p containing pixel *n*, and α_j is the *j*-th sparse vector in the superpixel S_q containing pixel *n*.

To be more specific, the weight w^G_(n,i) is determined by the similarity of the local RGB patches **p**_n and **p**_i around the *n*-th and *i*-th pixels, where each patch **p** is a vector of the concatenated RGB values of the pixels in a local square window:

w^G_(n,i) = exp(−‖**p**_n − **p**_i‖² / h) / Σ_(i ∈ G_p) exp(−‖**p**_n − **p**_i‖² / h),  (11)

where h is a bandwidth parameter controlling the decay of the weight with the patch distance between the *n*-th and *i*-th pixels (so each pixel's weights are normalized to sum to one). The weights w^L_(n,j) within the superpixels are defined analogously from the RGB values.

We then build affinity matrices **W**^G and **W**^L collecting the weights of all pixels, with which the two constraints in Eq. (10) can be written compactly as **A** ≈ **A**(**W**^G)^T and **A** ≈ **A**(**W**^L)^T; the self-similarity regularization term penalizes the deviations ‖**A** − **A**(**W**^G)^T‖²_F and ‖**A** − **A**(**W**^L)^T‖²_F.
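The per-group similarity weights can be sketched as follows. Here `similarity_weights` is a hypothetical helper, and the Gaussian weighting with bandwidth `h` mirrors the patch-similarity weights described above; the group labels would come from K-means patch clustering or SLIC superpixels:

```python
import numpy as np

def similarity_weights(features, labels, h=0.5):
    """Row-stochastic weight matrix W: W[n, i] = exp(-||p_n - p_i||^2 / h) for
    pixels i in the same group as n (zero elsewhere), normalized per row.
    `features` holds per-pixel descriptors (e.g. flattened RGB patches),
    `labels` the group index of each pixel."""
    n_pix = features.shape[0]
    W = np.zeros((n_pix, n_pix))
    for n in range(n_pix):
        same = np.flatnonzero(labels == labels[n])
        d2 = np.sum((features[same] - features[n]) ** 2, axis=1)
        W[n, same] = np.exp(-d2 / h)
    W /= W.sum(axis=1, keepdims=True)
    return W

rng = np.random.default_rng(2)
feats = rng.random((30, 27))            # 30 pixels, 3x3x3 RGB-patch descriptors
labels = rng.integers(0, 4, size=30)    # hypothetical group assignment
W = similarity_weights(feats, labels)
# A sparse coefficient matrix A (K x n_pix) is then regularized toward A @ W.T
```

Each row sums to one, so multiplying the coefficient matrix by the transposed weights produces exactly the weighted averages of the sparse vectors within each group.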

With the self-similarity constraints of the global structure and local spectra, the sparse representation becomes more robust and is expected to be similar for locations within the same clustered global group and local superpixel. Given the HS dictionary learned in Section 3.1, the sparse coefficient matrix is estimated by minimizing the self-similarity regularized objective, and the HR-HS image is finally reconstructed from the dictionary and the estimated coefficients.

### 3.3 Experimental results

We evaluate the self-similarity constrained sparse representation method using two publicly released hyperspectral imaging databases: the CAVE and Harvard datasets. The CAVE dataset includes 32 indoor images of paintings, toys, food, and so on, captured under controlled illumination. The Harvard dataset has 50 indoor and outdoor images captured under daylight illumination. The image size in the CAVE dataset is 512 × 512, and the Harvard images are cropped to 1024 × 1024. We use the original HS images as the ground-truth **Z** and downsample them by a factor of 32 to create the 16 × 16 and 32 × 32 LR-HS images **X**, while the corresponding HR-RGB images **Y** are generated from the ground truth with the camera spectral response function.
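The simulation of the LR-HS observation from a ground-truth cube amounts to block averaging, which can be sketched as follows (sizes here are toy stand-ins for the 31-band, 512 × 512 CAVE images):

```python
import numpy as np

def block_downsample(hs, factor):
    """Average-pool an HS cube (bands, H, W) by `factor` -- simulates the LR-HS input."""
    b, h, w = hs.shape
    return hs.reshape(b, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

Z = np.random.default_rng(3).random((31, 64, 64))   # toy ground truth (CAVE uses 512x512)
X = block_downsample(Z, 32)                          # 31 x 2 x 2 LR-HS observation
```

Averaging (rather than subsampling) matches the photon-collection interpretation of the spatial degradation and preserves the mean radiance of each block.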

#### 3.3.1 Compare results with the state-of-the-art methods

First, we compare the recovery performance of the HR-HS images with our proposed method (including the online dictionary learning procedure and self-similarity constraints) and the state-of-the-art HS image SR methods, including the matrix factorization (MF) method [18], the coupled nonnegative matrix factorization (CNMF) method [19], the sparse nonnegative matrix factorization (SNMF) method [20], the generalized simultaneous orthogonal matching pursuit (GSOMP) method [13], the Bayesian sparse representation (BSR) method [9], the coupled spectral unmixing (CSU) method [10], and the nonnegative structured sparse representation (NSSR) method [21]. Table 1 reports the average RMSE, PSNR, SAM, and ERGAS results over the 32 images of the CAVE dataset [32], while Table 2 shows the average results over the 50 images of the Harvard dataset [33].

| | MF [18] | CNMF [19] | SNMF [20] | GSOMP [13] | BSR [9] | CSU [10] | NSSR [21] | Ours |
|---|---|---|---|---|---|---|---|---|
| RMSE | 3.03 ± 0.97 | 2.93 ± 1.30 | 3.26 ± 1.57 | 6.47 ± 2.53 | 3.13 ± 1.57 | 3.0 ± 1.40 | 2.21 ± 1.19 | 2.17 ± 1.08 |
| PSNR | 39.37 ± 3.76 | 39.53 ± 3.55 | 38.73 ± 3.79 | 32.48 ± 3.08 | 39.16 ± 3.91 | 39.50 ± 3.63 | 42.26 ± 4.11 | 42.28 ± 3.86 |
| SAM | 6.12 ± 2.17 | 5.48 ± 1.62 | 6.50 ± 2.32 | 14.19 ± 5.42 | 6.75 ± 2.37 | 5.8 ± 2.21 | 4.33 ± 1.37 | 3.98 ± 1.27 |
| ERGAS | 0.40 ± 0.22 | 0.39 ± 0.21 | 0.44 ± 0.23 | 0.77 ± 0.32 | 0.37 ± 0.22 | 0.41 ± 0.27 | 0.30 ± 0.18 | 0.28 ± 0.18 |

**Table 1.** Quantitative comparison of the self-similarity constrained sparse representation with the state-of-the-art methods on the CAVE dataset.

| | MF [18] | CNMF [19] | SNMF [20] | GSOMP [13] | BSR [9] | CSU [10] | NSSR [21] | Ours |
|---|---|---|---|---|---|---|---|---|
| RMSE | 1.96 ± 0.97 | 2.08 ± 1.34 | 2.20 ± 0.94 | 4.08 ± 3.55 | 2.10 ± 1.60 | 1.7 ± 1.24 | 1.76 ± 0.79 | 1.64 ± 1.20 |
| PSNR | 43.19 ± 3.87 | 43.00 ± 4.44 | 42.03 ± 3.61 | 38.02 ± 5.71 | 43.11 ± 4.59 | 43.40 ± 4.10 | 44.00 ± 3.63 | 45.20 ± 4.56 |
| SAM | 2.93 ± 1.06 | 2.91 ± 1.18 | 3.17 ± 1.07 | 4.99 ± 2.99 | 2.93 ± 1.33 | 2.9 ± 1.05 | 2.64 ± 0.86 | 2.63 ± 0.97 |
| ERGAS | 0.23 ± 0.14 | 0.23 ± 0.11 | 0.26 ± 0.27 | 0.41 ± 0.24 | 0.24 ± 0.15 | 0.24 ± 0.20 | 0.21 ± 0.12 | 0.16 ± 0.15 |

**Table 2.** Quantitative comparison of the self-similarity constrained sparse representation with the state-of-the-art methods on the Harvard dataset.

It can be seen from Tables 1 and 2 that our approach obtains the best recovery performance on all quantitative metrics, and the performance improvement on the CAVE dataset is more significant than on the Harvard dataset. The NSSR method [21] is the closest to ours, and both methods show a relatively large advantage over the other methods. In addition, our method shows the largest improvement over NSSR [21] in the SAM values. This is because SAM is strongly affected by slight spectral distortions of pixels with small magnitudes. Thus, we can conclude that our proposed approach not only robustly recovers the HS image but also suppresses noise and artifacts, especially for pixels with small spectral magnitudes, owing to the imposed global-structure and local-spectral self-similarity constraints.
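For reference, the four quantitative metrics can be computed as follows. This sketch follows the usual definitions in the HS fusion literature; the `scale` argument (spatial resolution ratio) and the assumed [0, 255] value range are conventions, not values fixed by the chapter:

```python
import numpy as np

def hs_metrics(ref, est, scale=32):
    """RMSE, PSNR, SAM (degrees), and ERGAS for HS cubes (bands, H, W) in [0, 255]."""
    diff = ref - est
    rmse = np.sqrt(np.mean(diff ** 2))
    psnr = 20 * np.log10(255.0 / rmse)
    # SAM: mean spectral angle over all pixels
    r = ref.reshape(ref.shape[0], -1)
    e = est.reshape(est.shape[0], -1)
    cos = np.sum(r * e, axis=0) / (
        np.linalg.norm(r, axis=0) * np.linalg.norm(e, axis=0) + 1e-12)
    sam = np.degrees(np.mean(np.arccos(np.clip(cos, -1.0, 1.0))))
    # ERGAS: band-wise relative RMSE, scaled by the resolution ratio
    band_rmse = np.sqrt(np.mean(diff ** 2, axis=(1, 2)))
    band_mean = np.mean(ref, axis=(1, 2))
    ergas = (100.0 / scale) * np.sqrt(np.mean((band_rmse / band_mean) ** 2))
    return rmse, psnr, sam, ergas

r = np.random.default_rng(7).random((5, 8, 8)) * 200 + 10   # toy reference cube
print(hs_metrics(r, r + 1.0))                                # metrics for a unit offset
```

A constant offset of 1 gives an RMSE of exactly 1 and a PSNR of 20·log10(255) ≈ 48.1 dB, which is a quick sanity check for the implementation.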

#### 3.3.2 Compared results without self-similarity constraints

One of the key differences of our method from existing ones (such as MF [18]) is the two types of imposed self-similarities formulated by the regularization term. To validate their contributions, Table 3 reports the recovery results on the CAVE and Harvard datasets without any self-similarity constraint, with the local-spectral similarity only, and with the global-structure similarity only.

| | Without both (CAVE) | Local simil. only (CAVE) | Global simil. only (CAVE) | Without both (Harvard) | Local simil. only (Harvard) | Global simil. only (Harvard) |
|---|---|---|---|---|---|---|
| RMSE | 2.81 ± 1.42 | 2.25 ± 1.15 | 2.32 ± 1.20 | 1.83 ± 1.30 | 1.66 ± 1.20 | 1.88 ± 1.32 |
| PSNR | 40.05 ± 3.97 | 42.00 ± 3.91 | 41.78 ± 4.05 | 44.16 ± 4.39 | 45.01 ± 4.51 | 44.02 ± 4.56 |
| SAM | 5.46 ± 1.89 | 4.24 ± 1.36 | 4.59 ± 1.46 | 2.86 ± 1.06 | 2.69 ± 1.00 | 2.99 ± 1.09 |
| ERGAS | 0.37 ± 0.20 | 0.30 ± 0.18 | 0.31 ± 0.19 | 0.23 ± 0.16 | 0.19 ± 0.15 | 0.18 ± 0.16 |

**Table 3.** Recovery results on the CAVE and Harvard datasets without any self-similarity constraint, with the local-spectral similarity only, and with the global-structure similarity only.

#### 3.3.3 Evaluation results by changing parameter γ

In addition, we evaluate the HR-HS image recovery performance while changing the parameter γ, which balances the data fidelity terms and the self-similarity regularization.

#### 3.3.4 Visual quality comparison

Figures 5 and 6 show the recovered HS images and the difference images with respect to the ground truth, with one example each from the CAVE and Harvard datasets, respectively. Since the CSU [10] and NSSR [21] methods, along with ours, provide the most impressive performance among all evaluated methods, as shown in Tables 1 and 2, we only give the compared results of our method and the CSU [10] and NSSR [21] methods for checking the differences in visual quality. It is obvious that the HS images recovered by our approach have smaller absolute difference magnitudes for most pixels than the results of the CSU and NSSR methods. It is also worth noting that when self-similarity is not applied, our results show quite similar appearance to those of the NSSR method [21], which again reflects the effectiveness of imposing the self-similarity constraint.

## 4. DCNN-based HS image super-resolution

Motivated by the success in natural image super-resolution and its simple formulation, our previous work explored a simple DCNN-based HS image super-resolution method following a CNN structure similar to [46], which mainly consists of three convolutional layers explained as three operations in the mapping process from LR images to HR images. This explanation follows the schematic concept of sparse-coding-based SR: patch extraction and representation, nonlinear mapping, and reconstruction. Patch extraction obtains overlapping patches from the input image and represents each patch as a high-dimensional vector. The convolutional layers act as a nonlinear function that maps a high-dimensional vector (conceptually the patch representation) to another high-dimensional vector (the feature map in the middle layer of the CNN). The reconstruction step combines the mapped CNN features into the final HR image. The above CNN architecture for Y-component recovery in natural image SR adopts spatial filters of sizes 9 × 9, 1 × 1, and 5 × 5 in the three convolutional layers, respectively.
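A minimal forward pass of such a three-layer network can be sketched in NumPy. The naive `conv2d` below uses 'valid' convolutions and random weights, so it illustrates only the 9-1-5 layer structure and the resulting shapes, not a trained model:

```python
import numpy as np

def conv2d(x, w):
    """'Valid' 2-D convolution: x (C_in, H, W), w (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    H, W = x.shape[1] - k + 1, x.shape[2] - k + 1
    y = np.zeros((c_out, H, W))
    for o in range(c_out):
        for i in range(c_in):
            for dy in range(k):
                for dx in range(k):
                    y[o] += w[o, i, dy, dx] * x[i, dy:dy + H, dx:dx + W]
    return y

def srcnn_forward(x, w1, w2, w3):
    """SRCNN-style mapping: patch extraction, nonlinear mapping, reconstruction."""
    f = np.maximum(conv2d(x, w1), 0)   # 9x9 patch extraction + ReLU
    f = np.maximum(conv2d(f, w2), 0)   # 1x1 nonlinear mapping + ReLU
    return conv2d(f, w3)               # 5x5 reconstruction

rng = np.random.default_rng(4)
x = rng.random((1, 33, 33))                       # upsampled LR input (one channel)
w1 = rng.standard_normal((64, 1, 9, 9)) * 0.01    # 9x9, 64 filters
w2 = rng.standard_normal((32, 64, 1, 1)) * 0.01   # 1x1, 32 filters
w3 = rng.standard_normal((1, 32, 5, 5)) * 0.01    # 5x5, 1 output channel
y = srcnn_forward(x, w1, w2, w3)
print(y.shape)   # (1, 21, 21): 'valid' convolutions trim the borders
```

The 64- and 32-channel widths are illustrative; in practice, 'same' padding or border cropping of the target is used so input and output align.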

The intuitive way to apply the above baseline CNN architecture to HS image SR is to learn the HR-HS image **Z** directly from the available LR-HS image **X**, called the spatial CNN. Another research line exploits a CNN architecture for learning **Z** from the available HR-MS (RGB) image **Y**, named the spectral CNN. However, the spatial CNN and the spectral CNN each take only one domain of the available data, **X** or **Y**, as input and completely exclude the other. Therefore, this chapter introduces a spatial and spectral fusion architecture, named SSF-CNN, for recovering the HR-HS image. Recent CNN work incorporates shortcut connections between layers for more accurate and efficient training of substantially deeper architectures, such as ResNets and Highway Networks, or exploits concatenation between different layers for information and feature reuse, such as DenseNet, which yields considerable improvements in different applications. In the scenario of our HS image SR application, the available HR-RGB image already has the target high spatial resolution, and the expansion factor in the spectral domain (about 10, from 3 to 31 bands) is much smaller than that in the spatial domain (32 times, from 16/32 to 512/1024 in the horizontal and vertical directions, respectively). We therefore concatenate the available HR-RGB image (a partial part of the input: Partial) to the outputs of the Conv and ReLU blocks (Densely) in the CNN structure for transferring the maximum available spatial information, and name this new CNN architecture PDCon-SSF. The schematic structures of the spatial CNN, spectral CNN, SSF-CNN, and PDCon-SSF are shown in Figure 7.
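The fusion input of SSF-CNN and the partial dense concatenation of PDCon-SSF can be sketched as simple array operations; nearest-neighbor upsampling stands in for the actual upsampling, and all sizes and channel counts are toy values:

```python
import numpy as np

def upsample_nn(x, factor):
    """Nearest-neighbor upsampling of an HS cube (bands, h, w) -- a simple
    stand-in for the bicubic upsampling applied before fusion."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

rng = np.random.default_rng(5)
lr_hs = rng.random((31, 4, 4))        # toy LR-HS (CAVE: 31 x 16 x 16)
hr_rgb = rng.random((3, 128, 128))    # toy HR-RGB (CAVE: 3 x 512 x 512)

# SSF-CNN input: upsampled LR-HS stacked with the HR-RGB image
ssf_input = np.concatenate([upsample_nn(lr_hs, 32), hr_rgb], axis=0)
print(ssf_input.shape)                 # 34-channel fusion input

# PDCon-SSF: the HR-RGB image is concatenated again after a Conv+ReLU block
features = rng.random((64, 128, 128))  # hypothetical intermediate feature maps
pdcon = np.concatenate([features, hr_rgb], axis=0)
```

Re-injecting the RGB channels at every block is what keeps the full-resolution spatial detail available deep in the network.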

Recently, we also investigated a residual network architecture for HS image super-resolution. The residual network takes the concatenated cube of the HR-RGB and upsampled LR-HS images as input and simultaneously exploits the spectral attributes of the LR-HS image and the spatial context of the HR-RGB image to estimate a more robust HR-HS image. Taking the characteristics of HS image super-resolution into consideration, we modified the ResNet architecture, originally proposed for higher-level computer vision problems such as image classification and detection, by removing unnecessary modules to simplify the network for this low-level vision problem. Furthermore, as evidenced in pan-sharpening research, the estimated HR-HS image should have spatial structure similar to that of the HR-RGB image, so we utilize the input RGB image to guide the spatial structure of the learned feature maps in our proposed ResNet. We first upsample the LR-HS image to the same size as the HR-RGB image and stack them together with a "Concat" layer. Multiple residual blocks with alternately conjoined spectral and spatial reconstruction layers, implemented with convolutional kernel sizes 1 and *n* (*n* > 1), respectively, are used for effectively modeling the nonlinear spectral mapping and the spatial structure. Our constructed ResNet architecture consists of 5 residual blocks, each including a set of conjoined spectral and spatial reconstruction layers, as shown in Figure 8. The first 3 residual blocks have 128 feature maps, and the last 2 residual blocks have 256 feature maps. The output of the *m*-th residual block is expressed as:

**F**_m = **F**_(m−1) + f_m(**F**_(m−1)),

where **F**_(m−1) is the input feature map of the *m*-th block and f_m(·) denotes the residual mapping implemented by the conjoined 1 × 1 spectral and *n* × *n* spatial convolutional layers.

The guidance connections of the HR-RGB image are shown as dotted lines in Figure 8. Our ResNet-based HR-HS image recovery model is trained by minimizing the mean squared error (MSE) between the estimated HR-HS image and the ground truth **Z**.
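One residual block with conjoined 1 × 1 spectral and n × n spatial convolutions can be sketched as follows; the weights are random and the sizes are toy values (the chapter uses 128/256 feature maps), so this only illustrates the skip-connection structure:

```python
import numpy as np

def conv_same(x, w):
    """2-D convolution with zero padding so the spatial size is preserved.
    x: (C_in, H, W), w: (C_out, C_in, k, k), k odd."""
    c_out, c_in, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    H, W = x.shape[1], x.shape[2]
    y = np.zeros((c_out, H, W))
    for o in range(c_out):
        for i in range(c_in):
            for dy in range(k):
                for dx in range(k):
                    y[o] += w[o, i, dy, dx] * xp[i, dy:dy + H, dx:dx + W]
    return y

def residual_block(x, w_spec, w_spat):
    """One block: 1x1 'spectral' conv, then n x n 'spatial' conv, added back to
    the input via an identity skip, i.e. F_m = F_{m-1} + f_m(F_{m-1})."""
    f = np.maximum(conv_same(x, w_spec), 0)   # 1x1 spectral reconstruction + ReLU
    f = conv_same(f, w_spat)                  # n x n spatial reconstruction
    return x + f                              # identity skip connection

rng = np.random.default_rng(6)
x = rng.random((16, 12, 12))                        # toy feature maps
w_spec = rng.standard_normal((16, 16, 1, 1)) * 0.05
w_spat = rng.standard_normal((16, 16, 3, 3)) * 0.05
out = residual_block(x, w_spec, w_spat)
print(out.shape)    # same shape as the input, so blocks stack directly
```

Because the block preserves the feature-map shape, five such blocks can be chained, matching the residual update given above.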

### 4.1 Experimental results

We also validate the HR-HS image reconstruction performance of the DCNN-based methods using the CAVE and Harvard datasets. We randomly selected 20 HS images from the CAVE database to train the CNN model, and the remainder is used to validate the performance of the proposed CNN method. For the Harvard database, 10 HS images were randomly selected for CNN model training, and the remaining 40 HS images are used as test data for validation. Figure 9 shows the HR-RGB images of the test samples from the CAVE database and several test samples from the Harvard database.

#### 4.1.1 Compared results of different CNN models

As introduced above, CNN-based methods can recover the HR-HS image from the available LR-HS image, the HR-RGB image, or the concatenated cube of both, which are named the spatial CNN, the spectral CNN, the spatial and spectral fusion CNN (SSF-CNN), and, for the extended version of SSF-CNN, PDCon-SSF. The baseline network is a three-layer convolutional architecture. For the CAVE database, we randomly select 20 images for learning the different types of CNN models and save the CNN model parameters after 0.5 and 1 million iterations. The remaining 12 images in the CAVE database are used for evaluating the recovery performance of the different CNN models. The averages and standard deviations of RMSE, PSNR, SAM, and ERGAS over the 12 test images are shown in Table 4, which shows much better results for the spectral CNN than the spatial CNN, owing to the smaller expansion factor in the spectral domain (about 10, from 3 to 31 bands) than in the spatial domain (32 times, from 16 to 512 in the horizontal and vertical directions, respectively), as well as significant performance improvements with the SSF-CNN and PDCon-SSF models. One recovered HS image example from the CAVE database and the corresponding residual images with respect to the ground-truth HR image are visualized in Figure 10 for the different CNN models.

From Table 4 and Figure 10, it can be seen that the SSF-based CNN models provide significant performance improvements compared with the spatial CNN and the spectral CNN. Thus, for the Harvard database, we only train the SSF-CNN and PDCon-SSF models, with 1 million iterations, using 10 randomly selected images, and the remaining 40 images are used for evaluation. In addition, in order to validate the generalization of the learned CNN models, we predict the HR-HS images of the Harvard test samples using the parameters of the SSF-CNN and PDCon-SSF models learned on the CAVE training samples. The averages and standard deviations of RMSE, PSNR, SAM, and ERGAS over the 40 Harvard test images are shown in Table 5, which shows that the SSF-CNN and PDCon-SSF models learned on the CAVE training samples already provide reasonable recovery performance, and that the quantitative measures are further improved by the models learned on the Harvard samples, even with only 10 training images. One recovered HS image example from the Harvard database and its corresponding residual images with respect to the ground-truth HR image are visualized in Figure 11 for the SSF-CNN and PDCon-SSF models learned on the CAVE and Harvard training samples, respectively.

#### 4.1.2 Compared results of different baseline CNN architectures

As mentioned above, we also investigated a residual network architecture for HS image super-resolution, which has a different baseline CNN architecture from the SSF-CNN. Under the same experimental settings, we implemented the DCNN-based HS image reconstruction using the three-layer CNN and the ResNet architecture with five residual blocks. The compared quantitative results on both the CAVE and Harvard datasets are shown in Table 6. One recovered HS image example from the CAVE database and the corresponding residual images with respect to the ground-truth HR image are visualized in Figure 12 for the ResNet-RGB, SSF-Net, and ResNet-based fusion models.

## 5. Conclusions

This chapter introduced recent research on HS image super-resolution. We first described the problem formulation of HS image super-resolution and provided the mathematical model relating the observed HR-RGB and LR-HS images to the desired HR-HS image. We then gave a detailed description of an optimization-based method, the self-similarity constrained sparse representation, and the recently proposed DCNN-based methods. Experimental results validated that the recently proposed HS image super-resolution methods achieve promising performance on benchmark datasets.