Hyperspectral Image Super-Resolution Using Optimization and DCNN-Based Methods

Reconstructing a high-resolution (HR) hyperspectral (HS) image from an observed low-resolution (LR) hyperspectral image or a high-resolution multispectral (RGB) image obtained with existing imaging cameras is an important research topic for capturing comprehensive scene information in both the spatial and spectral domains. HR-HS image reconstruction mainly follows two research strategies: optimization-based methods and deep convolutional neural network-based learning methods. The optimization-based approaches estimate the HR-HS image by minimizing the reconstruction errors of the available low-resolution hyperspectral and high-resolution multispectral images under different prior constraints such as representation sparsity, spectral physical properties, and spatial smoothness. Recently, deep convolutional neural networks (DCNNs) have been applied to resolution enhancement of natural images and have achieved promising performance. This chapter provides a comprehensive description of not only the conventional optimization-based methods but also the recently investigated DCNN-based learning methods for HS image super-resolution, which mainly include the spectral reconstruction CNN and the spatial and spectral fusion CNN. Experimental results on benchmark datasets are shown to validate the effectiveness of HS image super-resolution in terms of both quantitative values and visual quality.


Introduction
Hyperspectral (HS) imaging simultaneously acquires a set of images of the same scene over a large number of narrow wavelength bands, which effectively describes the spectral distribution of every scene point and provides intrinsic and discriminative spectral information about the scene. The acquired dense spectral bands benefit numerous applications, including object recognition and segmentation [1][2][3][4][5][6][7][8][9], medical image analysis [10], and remote sensing [11][12][13][14][15], to name a few. Although HS imaging provides abundant spectral information, it generally yields much lower spatial resolution than ordinary panchromatic and RGB images, since photon collection in HS sensors is performed over a much larger spatial region to guarantee a sufficiently high signal-to-noise ratio. The low spatial resolution of HS images leads to strong spectral mixing of different materials in a scene and greatly degrades the performance of scene analysis and understanding. Therefore, the reconstruction of high-resolution hyperspectral (HR-HS) images using image processing and machine learning techniques has attracted a lot of attention.
Especially in the remote sensing field, a low-resolution (LR) multispectral or HS image is usually available together with a HR single-channel panchromatic image, and the fusion of these two images is generally known as pan-sharpening. Motivated by the fact that human vision is more sensitive to luminance, traditional pan-sharpening techniques mainly concentrated on reliable illumination restoration by substituting the calculated component of the LR-HS image with the HR information of the panchromatic image, using hue-saturation transforms and principal component analysis. However, these simple approaches unavoidably cause spectral distortion in the resulting image. Recently, HS image super-resolution research has actively investigated optimization methods that minimize the reconstruction error of the available LR-HS and HR-MS (HR-RGB) images [16][17][18][19][20][21][22][23][24][25][26][27][28][29][30], which have shown impressive performance. The basic idea of these optimization-based approaches is that the spectra can be represented by a matrix decomposition under different constraints such as representation sparsity, spectral physical properties, and spatial context similarity, and that the decomposed matrices are iteratively optimized to approximate the observed images more accurately. Recently, HS image super-resolution based on matrix factorization and spectral unmixing [40][41][42][43] has been actively investigated [16,17,27,28]. These methods are mainly motivated by the fact that HS observations can be represented by a linear combination of reflectance function bases (the spectral signatures of the pure materials), with the weight vector that denotes the fractions of the pure materials in the spectral response assumed to be sparse. A coupled nonnegative matrix factorization (CNMF) method by Yokoya et al. [19], inspired by the physical property of nonnegative weights in the linear combination, was proposed to estimate the HR-HS image from a pair of HR-MS and LR-HS images. Although the CNMF approach provides acceptable spectral recovery performance, its solution is usually not unique [44], so it does not always lead to satisfactory spectral recovery results. Lanaras et al. [10] proposed to integrate a coupled spectral unmixing strategy into HS super-resolution and conducted the optimization with the proximal alternating linearized minimization method, which requires good initial points for the two decomposed factors, the reflectance signatures and the fraction vectors, to provide impressive results. Furthermore, considering the physical meaning of the spectral linear combination of reflectance signatures and the implementation efficiency, most work assumes that the number of pure materials in the observed scene is smaller than the number of spectral bands, which is not always satisfied in real applications.
Motivated by the successful applications of sparse representation in natural image analysis [14,15], such as image de-noising, super-resolution, and representation, sparsity-promoting approaches, which do not explicitly impose the physical meaning constraint on the reflectance signatures (basis) and thus permit an over-complete basis, have been widely applied to HS super-resolution [18,19]. Inspired by the work on sparse representation in general RGB image analysis, Grohnfeldt et al. [11] explored a joint sparse representation for HS image super-resolution. By learning the corresponding HS and MS (RGB) patch dictionaries from prepared pairs, this work assumed the same sparse coefficients for the corresponding MS and HS patch dictionaries, so that the coefficients can be calculated from the MS input patch only. However, the above procedure was conducted on each individual band, which mainly considered a good reconstruction of the local structure (patch) and completely ignored the spectral correlation between channels. Therefore, several other works [19,22] investigated sparse spectral representation by reconstructing the spectra over all bands instead of the local structure of each individual band. Akhtar et al. [13] explored a sparse spatiospectral representation by calculating the optimized sparse coefficients of each spectral pixel while assuming that the pixels in a local grid region use the same atoms, in order to integrate the spatial structure. For computational efficiency, a generalized simultaneous orthogonal matching pursuit (G-SOMP) was proposed for estimating the sparse coefficients in [22]. Later, the same research group integrated sparse representation with a Bayesian dictionary learning algorithm to improve HS image super-resolution performance and demonstrated its effectiveness. Dong et al. [21] proposed a nonnegative structured sparse representation (NSSR) approach that takes the spatial structure into account and conducted the optimization with the alternating direction method of multipliers (ADMM). NSSR outperformed the other state-of-the-art approaches by a large margin in HS image recovery. Furthermore, Han et al. [45] proposed to recover the HR-HS output by minimizing the coupled reconstruction error of the available LR-HS and HR-RGB images with the following constraints: (1) the sparse representation with an over-complete spectral dictionary in the coupled unmixing strategy [17] and (2) the self-similarity of the sparse spectral representation in the global structures and the local spectra present in the available HR-RGB image, which further improved the HS image recovery performance in both visual quality and quantitative measures.
Deep convolutional neural networks (CNNs) have recently shown great success in various image processing and computer vision applications. CNNs have also been applied to RGB image super-resolution and achieved promising performance. Dong et al. [46] proposed a three-layer CNN architecture (SRCNN), which demonstrates about 0.5-1.5 dB improvement and much lower computational cost compared with the widely used sparse representation-based methods, and they further extended SRCNN to deal directly with the available LR images without a mathematical upsampling operation, called fast SRCNN. Kim et al. [47] exploited a very deep CNN architecture based on the VGG-net architecture and concentrated on estimating only the missing high-frequency image (residual image). Ledig et al. integrated two different types of networks, a generator network and a discriminator network (together called a GAN), for estimating a much sharper HR image. To apply CNNs to HSI SR, Li et al. [48] used a structure similar to SRCNN to super-resolve the HSI from the LR-HS image only. These CNN architectures take only the LR image as input, and the expanding factor of resolution enhancement is theoretically limited to be lower than 8 in both height and width. There are also several works exploring CNN-based methods with various backbone architectures to expand the spectral resolution with only the HR-RGB image as input [49,50]. This chapter introduces several research works based on DCNN learning for HS image reconstruction.
On the other hand, with regard to the observed data that are used, HR-HS image reconstruction can be divided into three research directions: (1) spatial resolution enhancement from hyperspectral imaging, (2) spectral resolution enhancement from RGB imaging, and (3) fusion methods based on the observed HR-RGB and low-resolution (LR) HS images of the same scene. Spatial resolution enhancement has been widely used for single natural image super-resolution [46,47], and impressive performance has been achieved, especially with deep learning methods, for resolution expanding factors from 2 to 4. The deep convolutional neural network (DCNN) has also been adopted for predicting the HR-HS image from a single LR-HS image [48], which validated the feasibility of HS image super-resolution for small expanding factors. However, the spatial resolution of the available HS image is considerably lower than that of the commonly observed RGB image, so the expanding factor for HR-HS image reconstruction is required to be large, for example, more than 10 in both the horizontal and vertical directions. Thus, a reconstructed HS image with acceptable quality usually cannot reach the spatial resolution required by different applications. Spectral resolution enhancement for RGB-to-spectrum reconstruction [49,50] has recently become a hot research line, since it needs only a single RGB image, which can easily be collected with a low-price visual sensor. Although the impressive potential of RGB-to-spectrum reconstruction has been demonstrated, there is still large room for performance improvement in real applications. Fusing a LR-HS image with the corresponding HR-RGB image to obtain a HR-HS image has shown promising performance [18,19,22, 30] compared with the spatial and spectral resolution enhancement methods. It is usually solved as an optimization problem with prior knowledge such as sparse representation and spectral physical properties as constraints, which requires a comprehensive analysis of the target scene in advance and varies from scene to scene. Motivated by the remarkable performance of DCNNs in natural image super-resolution, Han et al. [51] proposed a spatial and spectral fusion network (SSF-Net) for HR-HS image reconstruction and validated the better results of the SSF-Net in spite of the simple concatenation of the upsampled LR-HS image and the HR-RGB image. However, upsampling the LR-HS image and simply concatenating it with the HR-RGB image cannot effectively integrate the existing spatial structure and spectral properties and also increases the computational cost. In addition, precise alignment of the input LR-HS and HR-RGB images is needed, which is extremely difficult due to the large difference in spatial resolution between them. This chapter introduces several advanced DCNN-based learning methods for hyperspectral image super-resolution and demonstrates their impressive performance on benchmark datasets. The basic concept of hyperspectral image super-resolution is shown in Figure 1.

Problem formulation of HS image super-resolution
The goal of HS image super-resolution is to recover a HR-HS image $Z' \in \mathbb{R}^{W \times H \times L}$, where $L$ denotes the number of spectral bands and $W$ and $H$ denote the image width and height, respectively, from a HR-MS image $Y' \in \mathbb{R}^{W \times H \times l}$ ($l \ll L$) and a LR-HS image $X' \in \mathbb{R}^{w \times h \times L}$ ($w \ll W$, $h \ll H$). The HR-MS image commonly used in the HS image SR scenario is an RGB image with $l = 3$ spectral bands. The matrix forms of $Z'$, $X'$, and $Y'$ are denoted as $Z \in \mathbb{R}^{L \times N}$, $X \in \mathbb{R}^{L \times M}$, and $Y \in \mathbb{R}^{3 \times N}$, respectively, with $N = W \times H$ and $M = w \times h$. Both $X$ (LR-HS) and $Y$ (HR-RGB) can be expressed as linear transformations of $Z$ (the desired HS image):

$$X = ZD, \quad Y = RZ, \tag{1}$$

where $D \in \mathbb{R}^{N \times M}$ is the decimation matrix, which blurs and down-samples the HR-HS image to form the LR-HS image, and $R \in \mathbb{R}^{3 \times L}$ represents the RGB camera spectral response functions that map the HR-HS image to the HR-RGB image. Given $X$ and $Y$, $Z$ can be estimated by minimizing the following reconstruction error:

$$\hat{Z} = \arg\min_{Z} \|X - ZD\|_F^2 + \|Y - RZ\|_F^2, \tag{2}$$

where $\|\cdot\|_F$ denotes the Frobenius norm. By minimizing the reconstruction errors of the observed LR-HSI $X$ and the HR-RGB image $Y$ in Eq. (2), we attempt to recover the HR-HSI $Z$. The intuitive way to solve Eq. (2) is to adopt an optimization-based strategy that minimizes Eq. (2) to provide an estimate of the HR-HSI $Z$. This chapter first explores the alternating back-projection (ABP) algorithm, which iteratively updates the HR-HSI $Z$ with the aim of minimizing Eq. (2). Back-projection [12] is well known as an efficient iterative procedure for minimizing the reconstruction error. Since back-projection requires an initial estimate for updating the next $Z^{t}$, we simply upsample the LR-HS image $X$ as the initial state $Z^{0}$. The alternating update of $Z^{t}$ at the $t$-th step is given by Eq. (3), where $(\cdot)^{T}$ denotes the transpose of a matrix and $(\cdot)^{-1}$ denotes the inverse of a matrix. $\lambda_1$ and $\lambda_2$ are hyper-parameters that control the updating weights. After a predefined number of alternating iterations, we expect to obtain an estimated HR-HSI $Z$ that reconstructs the observed LR-HSI $X$ and the HR-RGB image $Y$ well.
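To make the observation model and the back-projection idea concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes the model $X = ZD$, $Y = RZ$, a decimation matrix that averages $s \times s$ blocks, a random matrix standing in for the camera response, and pseudo-inverse style back-projection operators $(D^{T}D)^{-1}D^{T}$ and $R^{T}(RR^{T})^{-1}$. The exact update used in the chapter is not reproduced here, so the update form and all function names are assumptions.

```python
import numpy as np

def block_average_matrix(W, H, s):
    """Decimation matrix D (N x M) averaging s x s blocks, pixels in row-major order."""
    w = W // s
    D = np.zeros((W * H, w * (H // s)))
    for r in range(H):
        for c in range(W):
            D[r * W + c, (r // s) * w + (c // s)] = 1.0 / (s * s)
    return D

def recon_error(Z, X, Y, D, R):
    """Reconstruction error ||X - ZD||_F^2 + ||Y - RZ||_F^2."""
    return np.linalg.norm(X - Z @ D) ** 2 + np.linalg.norm(Y - R @ Z) ** 2

def alternating_back_projection(X, Y, D, R, lam1=1.0, lam2=1.0, n_iter=5):
    # Assumed back-projection operators (pseudo-inverses of D and R).
    D_back = np.linalg.inv(D.T @ D) @ D.T      # M x N
    R_back = R.T @ np.linalg.inv(R @ R.T)      # L x 3
    # Initial state: nearest-neighbour upsampling of the LR-HS image.
    Z = X @ (D > 0).T.astype(float)
    for _ in range(n_iter):
        Z = Z + lam1 * (X - Z @ D) @ D_back    # correct the spatial (LR-HS) residual
        Z = Z + lam2 * R_back @ (Y - R @ Z)    # correct the spectral (HR-RGB) residual
    return Z

rng = np.random.default_rng(0)
W, H, s, L = 8, 8, 4, 6                        # toy sizes, not the benchmark sizes
Z_true = rng.random((L, W * H))
D = block_average_matrix(W, H, s)
R = rng.random((3, L))                         # stand-in for a camera response
X, Y = Z_true @ D, R @ Z_true

Z0 = X @ (D > 0).T.astype(float)
Z_hat = alternating_back_projection(X, Y, D, R)
err0 = recon_error(Z0, X, Y, D, R)
err1 = recon_error(Z_hat, X, Y, D, R)
print(err0, err1)
```

In this noise-free toy setting the reconstruction error drops to numerical zero, yet `Z_hat` generally differs from `Z_true`, because many images satisfy both observations at once.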
Since the number of unknowns ($N \times L$) is much larger than the number of available measurements ($M \times L + 3N$), the above optimization problem is highly ill-posed, and proper regularization terms are required to narrow the solution space and ensure stable estimation. A widely adopted constraint is that each pixel spectrum $z_n \in \mathbb{R}^{L}$ of $Z$ lies in a low-dimensional space and can be decomposed as [30]:

$$z_n = \sum_{k=1}^{K} \alpha_{k,n} b_k = B\alpha_n, \quad \text{s.t. } b_{i,k} \ge 0,\ \alpha_{k,n} \ge 0,\ \sum_{k=1}^{K} \alpha_{k,n} = 1, \tag{4}$$

where $B = [b_1, b_2, \cdots, b_K] \in \mathbb{R}^{L \times K}$ is the set of all spectral signatures ($b_k$ is also called the $k$-th endmember) of $K$ distinct materials, and $\alpha_n$ represents the fractional abundances of all $K$ materials for the $n$-th pixel. Considering the physical properties of spectral reflectance, the elements of the spectral signatures and the fractional abundances are nonnegative, as shown in the first and second constraint terms of Eq. (4), and the abundance vector of each pixel sums to one.
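As a quick worked check of how ill-posed the problem is, the short sketch below counts unknowns and measurements for one configuration used in the experiments of this chapter (512 × 512 HR pixels, 31 bands, 16 × 16 LR pixels, RGB guidance). The function name is hypothetical.

```python
def count_unknowns_and_measurements(W, H, w, h, L, l=3):
    """Return (N*L, M*L + l*N) for an HR size W x H and an LR size w x h."""
    N, M = W * H, w * h
    return N * L, M * L + l * N

unknowns, measurements = count_unknowns_and_measurements(512, 512, 16, 16, 31)
ratio = unknowns / measurements
print(unknowns, measurements, round(ratio, 2))
```

Under these sizes there are roughly ten unknowns per measurement, and almost all measurements come from the RGB image, which is why spectral priors such as the low-dimensional model of Eq. (4) matter.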
According to $Y = RZ$, each pixel $y_n \in \mathbb{R}^{3}$ in the HR-RGB image can be decomposed as:

$$y_n = R z_n = RB\alpha_n = \hat{B}\alpha_n, \tag{5}$$

where $\hat{B} = RB$ denotes the RGB spectral dictionary obtained by transforming the HS dictionary $B$ with the camera spectral response function $R$. With a corresponding pair of previously learned spectral dictionaries, $\hat{B}$ and $B$, the sparse fractional vector $\alpha_n$ can be estimated from the HR-RGB pixel $y_n$ only.
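The idea of estimating a sparse, nonnegative abundance vector from a single RGB pixel can be sketched as follows. This is an illustrative projected ISTA solver for $\min_{\alpha \ge 0} \tfrac{1}{2}\|y - RB\alpha\|_2^2 + \lambda \sum_k \alpha_k$, not the solver used in the chapter; the dictionary, the response matrix, and all names are random or hypothetical stand-ins.

```python
import numpy as np

def nonneg_sparse_code(y, B_rgb, lam=1e-3, n_iter=500):
    """Projected ISTA for min_a 0.5*||y - B_rgb a||^2 + lam*sum(a) subject to a >= 0."""
    step = 1.0 / np.linalg.norm(B_rgb, 2) ** 2       # 1 / Lipschitz constant of the gradient
    a = np.zeros(B_rgb.shape[1])
    for _ in range(n_iter):
        grad = B_rgb.T @ (B_rgb @ a - y)
        a = np.maximum(a - step * (grad + lam), 0.0)  # shrink and project onto a >= 0
    return a

def objective(a, y, B_rgb, lam=1e-3):
    return 0.5 * np.linalg.norm(y - B_rgb @ a) ** 2 + lam * a.sum()

rng = np.random.default_rng(0)
L, K = 31, 10
B = rng.random((L, K))            # hypothetical HS dictionary
R = rng.random((3, L))            # stand-in for the camera response
B_rgb = R @ B                     # RGB dictionary obtained from B and R

alpha_true = np.zeros(K)
alpha_true[[2, 7]] = [0.6, 0.4]   # two active materials
y = B_rgb @ alpha_true            # observed RGB pixel

alpha = nonneg_sparse_code(y, B_rgb)
z_hat = B @ alpha                 # HS spectrum implied by the estimated code
print(objective(np.zeros(K), y, B_rgb), objective(alpha, y, B_rgb))
```

With only three RGB measurements per pixel, many codes explain `y` equally well, so the recovered `alpha` need not equal `alpha_true`; this ambiguity is what additional regularization has to resolve.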
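The matrix forms of Eqs. (4) and (5) can be written as:

$$Z = BA, \quad Y = \hat{B}A, \tag{6}$$

where $A = [\alpha_1, \alpha_2, \cdots, \alpha_N] \in \mathbb{R}^{K \times N}$ is the coefficient matrix and $\hat{B} = RB$. Substituting Eq. (6) into Eq. (2), we obtain the problem below, with the nonnegative constraints on both $B$ and $A$ applied in the same manner as in Eq. (4). Unless otherwise noted, the nonnegative constraint is imposed on both the dictionary and the sparse matrix in the following derivations:

$$\{B^{*}, A^{*}\} = \arg\min_{B, A} \|X - BAD\|_F^2 + \|Y - \hat{B}A\|_F^2. \tag{7}$$

The goal of Eq. (7) is to solve for both the spectral dictionary $B$ and the coefficient matrix $A$, with proper regularization terms to achieve a stable and accurate solution.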

Self-similarity constrained sparse representation for HS image super-resolution
The complete pipeline of the self-similarity constrained sparse representation for HS image super-resolution is illustrated in Figure 2. The main contribution of this method is a nonnegative sparse representation coupled with a self-similarity constraint to regularize the solution of Eq. (7).

Figure 2. Schematics of self-similarity constrained sparse representation for HS image super-resolution: (1) learn the HS dictionary B from the input LR-HS image X, (2) explore the global-structure and local-spectral self-similarity, (3) convex optimization of the objective function with sparse and self-similarity constraints on the sparse matrix A for estimating the required HR-HS image.

Specifically, two additional terms are added to Eq. (7) as:

$$\{B^{*}, A^{*}\} = \arg\min_{B, A} \|X - BAD\|_F^2 + \|Y - \hat{B}A\|_F^2 + \lambda \|A\|_1 + \eta\, \Omega(A), \tag{8}$$

where $\|A\|_1$ denotes the sparsity constraint on the coefficient matrix and $\Omega(A)$ represents the self-similarity regularization term. $\lambda$ and $\eta$ are hyper-parameters that control the contributions of the two constraint terms. Our study solves Eq. (8) in the following three steps: (1) online learning of the HS dictionary from the input LR-HS image, (2) exploring the global-structure and local-spectral self-similarity from the input HR-RGB image, and (3) conducting the convex optimization with the previously learned HS dictionary and the extracted self-similarity to estimate the HR-HS image. The details of these procedures are described in the following three subsections.

Online HS dictionary learning
Since different materials have a very large variety of HS reflectance, learning a common HS dictionary for various scenes with different materials would lead to considerable spectral distortion. In order to obtain an adaptive HS dictionary that reconstructs the pixel spectra well, this study conducts the learning procedure directly on the observed LR-HS image $X$ in an online manner. The objective function for building the HS dictionary that represents the pixel spectra is formulated as follows:

$$\{B^{*}, \hat{A}^{*}\} = \arg\min_{B, \hat{A}} \|X - B\hat{A}\|_F^2 + \lambda \|\hat{A}\|_1, \tag{9}$$

where $\hat{A}$ is the sparse matrix for the pixels in the LR-HS image. In our study, we also impose the nonnegative constraints on both the sparse matrix $\hat{A}$ and the spectral dictionary $B$, and thus existing dictionary learning methods such as K-SVD cannot be applied to our optimization problem. We follow the optimization algorithm in [21] and adopt the ADMM technique to transform the constrained dictionary learning problem into an unconstrained version. The unconstrained dictionary learning problem is then solved with an alternating optimization algorithm. After obtaining the HS dictionary $B^{*}$ by optimizing Eq. (9) with the observed LR-HS image, we optimize only $A$ in Eq. (8) while fixing $B^{*}$.
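The chapter solves the nonnegative dictionary learning problem with ADMM following [21]. As a simpler stand-in that still respects the nonnegativity of both factors, the sketch below uses multiplicative updates in the style of nonnegative sparse coding for the objective $\tfrac{1}{2}\|X - BA\|_F^2 + \lambda \sum A$. It is a deliberately swapped-in technique for illustration only, and the function name, sizes, and data are hypothetical.

```python
import numpy as np

def nonneg_dictionary_learning(X, K, lam=0.01, n_iter=200, seed=0, eps=1e-12):
    """Alternate multiplicative updates of the code A and the dictionary B (both stay >= 0)."""
    rng = np.random.default_rng(seed)
    L, M = X.shape
    B = rng.random((L, K)) + eps
    A = rng.random((K, M)) + eps
    history = []
    for _ in range(n_iter):
        A *= (B.T @ X) / (B.T @ B @ A + lam + eps)   # sparse nonnegative code update
        B *= (X @ A.T) / (B @ (A @ A.T) + eps)       # nonnegative dictionary update
        history.append(0.5 * np.linalg.norm(X - B @ A) ** 2 + lam * A.sum())
    return B, A, history

rng = np.random.default_rng(1)
# Toy LR-HS image in matrix form: 31 bands, 200 pixels, 5 underlying materials.
X = rng.random((31, 5)) @ rng.random((5, 200))
B, A, history = nonneg_dictionary_learning(X, K=8)
print(history[0], history[-1])
```

Practical variants usually also normalize the dictionary atoms, since otherwise the sparsity penalty can be reduced simply by rescaling $B$ and $A$ against each other; that step is omitted here for brevity.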

Extraction of self-similarity constraint
The regularization term $\Omega(A)$ in Eq. (8) is formulated with two types of self-similarity, which are extracted from the HR-RGB image (see Figure 2 for illustration):
• Global-structure self-similarity: Pixels with similar spatial structure, represented as the concatenated RGB spectra within a local square window, share similar hyperspectral information, so the sparse vectors for reconstructing the hyper-spectra of these pixels should also be similar. This applies to both nearby patches and nonlocal patches in the whole image plane, and we call it global-structure self-similarity.
• Local-spectral self-similarity: Pixels in a local region with similar RGB values in the HR-RGB image usually belong to the same material, so the sparse vectors of different HR pixels are similar within a local region (superpixel). Note that a superpixel is usually not a square patch.
The global-structure self-similarity is represented by global-structure groups $\mathcal{G} = \{g_1, g_2, \cdots, g_P\}$ (in total $P$ groups), which are obtained by clustering all similar patches (spatial structures) in the HR-RGB image with K-means; $g_p$ (each $g_p$ may have a different length) is a vector consisting of the pixel indices in the $p$-th group. The local-spectral self-similarity is formulated with the superpixels $\mathcal{L} = \{l_1, l_2, \cdots, l_Q\}$ (in total $Q$ superpixels) obtained via the SLIC superpixel segmentation method; $l_q$ is also a vector composed of the pixel indices in the $q$-th superpixel. Since the pixels in the same global-structure group have similar spectral-spatial structure, we calculate the sparse vector of any pixel in a given group as a weighted average of the sparse vectors of all pixels in this group. Similarly, the sparse vector of a pixel can also be approximated by a weighted average of the sparse vectors of all pixels in the same local-spectral superpixel. With both self-similarity constraints, the sparse vector of the $n$-th pixel can be formulated as:

$$\alpha_n = \gamma \sum_{i \in g_p} w^{g}_{n,i}\, \alpha_i + (1 - \gamma) \sum_{j \in l_q} w^{L}_{n,j}\, \alpha_j, \tag{10}$$

where $g_p$ and $l_q$ are the global-structure group and the superpixel that contain the $n$-th pixel, and $w^{g}_{n,i}$ is the global-structure weight for the $n$-th sparse vector $\alpha_n$; it adjusts and merges the contribution of the $i$-th sparse vector $\alpha_i$ belonging to the same global-structure group. Analogously, $w^{L}_{n,j}$ weights the $j$-th sparse vector $\alpha_j$ belonging to the same local-spectral superpixel. $\gamma$ is a parameter that balances the contributions of the global-structure and local-spectral self-similarity.
To be more specific, $w^{g}_{n,i}$ ($0 < w^{g}_{n,i} < 1$ and $\sum_i w^{g}_{n,i} = 1$) measures the similarity between the RGB intensities of the patches $p_n$ and $p_i$ centered at the $n$-th and $i$-th pixels. Each patch is the set of pixels in an $R \times R$ window, so each $p$ is a $3R^2$-dimensional ($R \times R \times$ RGB) vector. The weight is a decreasing function of the Euclidean distance between the patch RGB values:

$$w^{g}_{n,i} = \frac{1}{z^{g}_{n}} \exp\left(-\frac{\|p_n - p_i\|_2^2}{h_g}\right), \tag{11}$$

where $z^{g}_{n}$ is a normalization factor defined as $z^{g}_{n} = \sum_{i \in g_p} \exp\left(-\|p_n - p_i\|_2^2 / h_g\right)$ to guarantee that $\sum_{i \in g_p} w^{g}_{n,i} = 1$, and $h_g$ is the smoothing kernel parameter for $3R^2$-dimensional vectors. The local-spectral weight $w^{L}_{n,j}$ is defined in exactly the same form, but with $p_n$ and $p_j$ being the RGB values of the $n$-th and $j$-th pixels (so each $p$ is a three-dimensional vector here) and with a smoothing kernel parameter $h_L$ for three-dimensional vectors.
We then build the affinity matrices $W^{g} \in \mathbb{R}^{N \times N}$ and $W^{L} \in \mathbb{R}^{N \times N}$, whose elements encode the pairwise similarities calculated using Eq. (11). Finally, the regularization term constrained by the two types of self-similarity is represented as:

$$\Omega(A) = \left\|A - \gamma A W^{g} - (1 - \gamma) A W^{L}\right\|_F^2. \tag{12}$$

With the global-structure and local-spectral self-similarity constraints, the sparse representation becomes more robust and is expected to be similar for locations in the same clustered global group and local superpixel. Given the HS dictionary $B^{*}$ pre-learned using Eq. (9) and the self-similarity regularization term in Eq. (12), Eq. (8) is convex and can be solved efficiently by an optimization algorithm. We apply the ADMM technique to solve Eq. (8); please refer to [45] for the detailed optimization procedure.
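The construction of the affinity matrices and the self-similarity penalty can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions: the group labels and superpixel labels are given directly (in the chapter they come from K-means and SLIC), the weights use a Gaussian kernel of the squared Euclidean distance normalized within each group, each pixel is included in its own group sum, and the convention `W[i, n]` $= w_{n,i}$ is chosen so that column $n$ of `A @ W` is the weighted average for pixel $n$. All names and sizes are hypothetical.

```python
import numpy as np

def affinity_from_groups(features, labels, h):
    """Build W with W[i, n] = w_{n,i} for pixels sharing a label; each column sums to 1."""
    N = features.shape[0]
    W = np.zeros((N, N))
    for g in np.unique(labels):
        idx = np.flatnonzero(labels == g)
        F = features[idx]
        d2 = ((F[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)  # squared distances
        K = np.exp(-d2 / h)
        W[np.ix_(idx, idx)] = K / K.sum(axis=0, keepdims=True)   # normalize per pixel n
    return W

def self_similarity_penalty(A, W_g, W_l, gamma):
    """Squared Frobenius norm of A - gamma*A*W_g - (1-gamma)*A*W_l."""
    return np.linalg.norm(A - gamma * A @ W_g - (1 - gamma) * A @ W_l) ** 2

rng = np.random.default_rng(0)
N, K, R_win = 12, 5, 3
patches = rng.random((N, 3 * R_win ** 2))     # 3R^2-dimensional RGB patch vectors
rgb = rng.random((N, 3))                      # per-pixel RGB values
group_labels = np.arange(N) % 3               # stand-in for K-means groups
superpixel_labels = np.arange(N) // 4         # stand-in for SLIC superpixels

W_g = affinity_from_groups(patches, group_labels, h=1.0)
W_l = affinity_from_groups(rgb, superpixel_labels, h=0.1)

A_const = np.tile(rng.random((K, 1)), (1, N)) # identical code for every pixel
A_rand = rng.random((K, N))
pen_const = self_similarity_penalty(A_const, W_g, W_l, gamma=0.3)
pen_rand = self_similarity_penalty(A_rand, W_g, W_l, gamma=0.3)
print(pen_const, pen_rand)
```

A coefficient matrix whose columns are identical incurs essentially zero penalty, while unrelated columns are penalized, which is the behaviour the regularizer is meant to encourage within groups and superpixels.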

Experimental results
We evaluate the self-similarity constrained sparse representation method using two publicly released hyperspectral imaging databases: the CAVE and Harvard datasets. The CAVE dataset includes 32 indoor images of paintings, toys, food, and so on, which are captured under controlled illumination. The Harvard dataset has 50 indoor and outdoor images captured under daylight illumination. The images in the CAVE dataset have 512 × 512 pixels and 31 spectral bands of 10 nm width, covering the visible spectrum from 400 to 700 nm. The images in the Harvard dataset have 1392 × 1040 pixels and 31 spectral bands of 10 nm width, covering the visible spectrum from 420 to 720 nm. For the Harvard dataset, we extract the top-left 1024 × 1024 pixels as the HR images under study. We take the original images in the datasets as the ground-truth Z and downsample them by a factor of 32 to create 16 × 16 images for the CAVE dataset and 32 × 32 images for the Harvard dataset, which is implemented by averaging over 32 × 32 pixel blocks as done in [10,21]. The observed HR-RGB images Y are generated by multiplying the spectral channels of the ground-truth image with the spectral response R of a Nikon D700 camera. We evaluate the recovery performance of the estimated HS images using four quantitative metrics: root-mean-square error (RMSE), peak signal-to-noise ratio (PSNR), spectral angle mapper (SAM) [9], and relative dimensionless global error in synthesis (ERGAS) [34]. SAM [9] gives the degree of spectral distortion between a pixel spectrum in the estimated HR-HS image and the corresponding one in the ground-truth HR-HS image. We calculate the overall SAM of one image under study by averaging the SAMs computed over all pixels. The value of SAM is expressed in degrees and thus normalized into the range (−90, 90). The smaller the absolute value of SAM, the smaller the spectral distortion. 
The ERGAS [34] calculates the average amount of the relative difference error, where the absolute difference error is normalized by intensity mean in each band. The smaller the ERGAS, the smaller the relative difference error is.
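The four metrics can be sketched as below for images stored in matrix form (bands × pixels). These are common textbook definitions and may differ in detail from the exact evaluation code behind the reported tables; in particular, the PSNR peak value of 255 and the ERGAS scaling by `100 / scale` are assumptions, and all names are hypothetical.

```python
import numpy as np

def rmse(Z, Z_hat):
    return np.sqrt(np.mean((Z - Z_hat) ** 2))

def psnr(Z, Z_hat, peak=255.0):
    return 20.0 * np.log10(peak / rmse(Z, Z_hat))

def sam_degrees(Z, Z_hat, eps=1e-12):
    """Mean spectral angle (degrees) between corresponding pixel spectra (columns)."""
    num = (Z * Z_hat).sum(axis=0)
    den = np.linalg.norm(Z, axis=0) * np.linalg.norm(Z_hat, axis=0) + eps
    return np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0))).mean()

def ergas(Z, Z_hat, scale):
    """Per-band RMSE normalized by the band mean, averaged over bands."""
    band_rmse = np.sqrt(np.mean((Z - Z_hat) ** 2, axis=1))
    band_mean = Z.mean(axis=1)
    return 100.0 / scale * np.sqrt(np.mean((band_rmse / band_mean) ** 2))

rng = np.random.default_rng(0)
Z = rng.random((31, 1000)) * 255.0             # toy ground truth, 31 bands
Z_noisy = Z + rng.normal(0.0, 2.0, Z.shape)    # toy estimate

print(rmse(Z, Z_noisy), psnr(Z, Z_noisy), sam_degrees(Z, Z_noisy), ergas(Z, Z_noisy, 32))
```

Note that SAM is insensitive to a per-pixel scaling of the spectrum, so `sam_degrees(Z, 2 * Z)` is essentially zero even though the RMSE is large; this is why SAM complements the intensity-based metrics.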

Comparison with the state-of-the-art methods
First, we compare the recovery performance of the HR-HS images obtained with our proposed method (including the online dictionary learning procedure and the self-similarity constraints) and with the state-of-the-art HS image SR methods, including the matrix factorization (MF) method [18], the coupled nonnegative matrix factorization (CNMF) method [19], the sparse nonnegative matrix factorization (SNNMF) method [20], the generalized simultaneous orthogonal matching pursuit (G-SOMP) method [13], the Bayesian sparse representation (BSR) method [9], the coupled spectral unmixing (CSU) method [10], and the nonnegative structured sparse representation (NSSR) method [21]. Table 1 lists the average RMSE, PSNR, SAM, and ERGAS results over the 32 images in the CAVE dataset [32], while Table 2 shows the average results over the 50 images from the Harvard dataset [33].
It can be seen from Tables 1 and 2 that our approach obtains the best recovery performance for all quantitative metrics, and that the performance improvement on the CAVE dataset is more significant than on the Harvard dataset. The NSSR method [21] has the closest performance to ours, and both methods show a relatively large advantage over the other methods. In addition, our method shows its largest improvement over NSSR [21] in the SAM values. This is because, for SAM, a slight spectral distortion of pixels with small magnitudes affects the value greatly. Thus, we can conclude that our proposed approach not only robustly recovers the HS image but also suppresses noise and artifacts, especially for pixels with small spectral magnitudes, due to the imposed global-structure and local-spectral self-similarity constraints.

Results without self-similarity constraints
One of the key differences between our method and existing ones (such as MF [18]) is the two types of imposed self-similarity formulated by the regularization term $\Omega(A)$ in Eq. (8). Without the $\Omega(A)$ term, Eq. (8) can still be solved by an optimization method such as ADMM. In addition, we can also adopt either the global or the local self-similarity separately, i.e., by taking only the $W^{g}$ or the $W^{L}$ term in Eq. (12). We conduct such experiments under the same experimental conditions, and the same quantitative metrics as in Tables 1 and 2 are shown for both datasets in Table 3. Taking only the local self-similarity into consideration significantly improves the results on both datasets for all quantitative metrics and contributes more than considering only the global self-similarity, but additionally integrating the global self-similarity, as in our complete approach, further improves the results.

Evaluation results by changing parameter γ
In addition, we evaluate the HR-HS image recovery performance while changing the parameter γ, which adjusts the contributions of the global-structure and local-spectral self-similarity. For the CAVE dataset, the parameter γ is changed from 0 (local-spectral self-similarity only) to 1 (global-structure self-similarity only) with an interval of 0.1, and the same metrics are used to show the contributions of the global and local self-similarity. Figure 3 (a)-(d) gives the curves of the quantitative measures RMSE, PSNR, SAM, and ERGAS, respectively, which show that γ = 0.3 gives the best performance. For the Harvard dataset, we also conducted experiments with the parameter γ set to 0, 0.1, 0.2, ⋯, and the curves of the quantitative measures RMSE, PSNR, SAM, and ERGAS are given in Figure 4.

Visual quality comparison
Figures 5 and 6 show the recovered HS images and the difference images with respect to the ground truth for one example from the CAVE dataset and one from the Harvard dataset, respectively. Since our method, the CSU method [10], and the NSSR method [21] perform markedly better than all other evaluated methods, as shown in Tables 1 and 2, we only compare these three methods in terms of visual quality. The HS images recovered by our approach clearly have smaller absolute difference magnitudes for most pixels than the results of the CSU and NSSR methods. It is also worth noting that when self-similarity is not applied, our results appear quite similar to those of the NSSR method [21], which also reflects the effectiveness of imposing the self-similarity constraint.

Figure 5. The visualized results of the recovered HR images from the "cloth" image in the CAVE dataset. The first column shows the ground-truth HR image and the input LR image, respectively. The second to fifth columns show results from CSU [10], NSSR [21], and our method with and without self-similarity, where the upper part provides the recovered images and the lower part gives the absolute difference maps w.r.t. ground-truth. Close-up views are provided below each full resolution image.

DCNN-based HS image super-resolution
Motivated by the success of CNNs in image super-resolution and by their simple formulation, our previous work explored a simple DCNN-based HS image super-resolution method that follows a CNN structure similar to that in [46]. This structure mainly consists of three convolutional layers, which are interpreted as three operations in the mapping from LR images to HR images. The interpretation follows the schematic concept of sparse coding-based SR: patch extraction and representation, nonlinear mapping, and reconstruction. Patch extraction obtains overlapping patches from the input image and represents each patch as a high-dimensional vector. The convolutional layers in the CNN perform feature learning and act as a nonlinear function that maps one high-dimensional vector (conceptually the patch representation) to another high-dimensional vector (the feature map in the middle layer of the CNN). The reconstruction step combines the mapped CNN features into the final HR image. The above CNN architecture, designed for Y-component recovery in natural image SR, adopts spatial filters of sizes 9 × 9, 1 × 1, and 5 × 5 in its three convolutional layers. Since HSI SR attempts to recover high resolution in both the spatial and spectral domains, and the spectral response has been shown to be more important in HSI SR, we set the spatial filter sizes to 3 × 3, 3 × 3, and 5 × 5 with full connection in the spectral domain. The input is either one of the available LR-HS and HR-RGB images or the concatenated LR-HS and HR-RGB cubic data.

Figure 6. The visualized results of the recovered HR images from the "imgf1" image in the Harvard dataset. The first column shows the ground-truth HR image and the input LR image, respectively. The second to fifth columns show results from CSU [10], NNSR [21], and our method with and without self-similarity, with the upper part showing the recovered images and the lower part showing the absolute difference maps w.r.t. ground-truth. Close-up views are provided below each full resolution image.
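The three-layer baseline described above can be sketched in a few lines of NumPy. This is only an illustrative forward pass, not the trained model: the hidden widths (64 and 32), the edge padding, the random initialization, and the function names `conv2d_same`, `init_params`, and `three_layer_cnn` are assumptions made for this example. Only the filter sizes 3 × 3, 3 × 3, and 5 × 5 and the full connection across spectral channels come from the text.

```python
import numpy as np

def conv2d_same(x, w, b):
    """Same-size 2-D convolution, fully connected across channels.

    x: (H, W, C_in), w: (k, k, C_in, C_out), b: (C_out,).
    Edge padding is an assumption of this sketch.
    """
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)), mode="edge")
    H, W, _ = x.shape
    out = np.zeros((H, W, w.shape[3]))
    for i in range(k):
        for j in range(k):
            # Each spatial tap mixes all input bands into all output maps.
            out += xp[i:i + H, j:j + W, :] @ w[i, j]
    return out + b

def init_params(c_in, c_out, widths=(64, 32), sizes=(3, 3, 5), seed=0):
    """Random weights for the three layers (hidden widths are hypothetical)."""
    rng = np.random.default_rng(seed)
    chans = (c_in,) + tuple(widths) + (c_out,)
    return [(0.01 * rng.standard_normal((k, k, a, b)), np.zeros(b))
            for k, a, b in zip(sizes, chans[:-1], chans[1:])]

def three_layer_cnn(x, params):
    """Conv + ReLU, Conv + ReLU, Conv (reconstruction)."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h = np.maximum(conv2d_same(x, w1, b1), 0.0)
    h = np.maximum(conv2d_same(h, w2, b2), 0.0)
    return conv2d_same(h, w3, b3)

# Fused input: 31 HS bands plus 3 RGB channels, mapped to 31 output bands.
x = np.random.default_rng(1).random((16, 16, 34))
z_hat = three_layer_cnn(x, init_params(c_in=34, c_out=31))
print(z_hat.shape)
```

Changing `c_in` to 31 or 3 gives the corresponding single-input variants with the same layer structure.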
The intuitive way to apply the above baseline CNN architecture to HSI SR is to learn the HR-HS image Z directly from the available LR-HS image X; we call this the spatial CNN. Another research line exploits a CNN architecture to learn the HR-HS image Z from the available HR-MS (RGB) image Y, named the spectral CNN. However, the spatial CNN and the spectral CNN each take only one of the available LR-HS and HR-MS images, X or Y, as input and completely exclude the other. Therefore, this chapter introduces a spatial and spectral fusion CNN architecture, named SSF-CNN, for recovering the HR-HS image. Recent CNN work incorporates shorter connections between layers for more accurate and efficient training of substantially deeper architectures, as in ResNets and Highway Networks, or exploits concatenation between different layers for information and feature reuse, as in DenseNet; both strategies yield considerable improvements in different applications. In our HSI SR scenario, the available HR-RGB image already has the target high spatial resolution, and the expanding factor in the spectral domain (about 10, from 3 to 31 bands) is much smaller than that in the spatial domain (32, from 16/32 to 512/1024 pixels in both the horizontal and vertical directions). We therefore concatenate the available HR-RGB image (a part of the input data: Partial) to the outputs of the Conv and ReLU blocks (Densely) in the CNN structure to transfer the maximum available spatial information, and name this new CNN architecture PDCon-SSF. The schematic structures of the spatial CNN, spectral CNN, SSF-CNN, and PDCon-SSF are shown in Figure 7.
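To make the difference between the four variants concrete, the following sketch builds the network input for the spatial, spectral, and fused cases and counts the input channels of each layer when the HR-RGB image is concatenated again after every Conv and ReLU block. The nearest-neighbor upsampling, the hidden widths (64, 32), and the helper names `upsample_nearest`, `build_inputs`, and `pdcon_input_channels` are assumptions for illustration; the chapter does not specify them here.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbor spatial upsampling of an (H, W, C) cube (assumed method)."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def build_inputs(lr_hs, hr_rgb):
    """Inputs of the spatial CNN, spectral CNN, and SSF-CNN."""
    factor = hr_rgb.shape[0] // lr_hs.shape[0]
    up_hs = upsample_nearest(lr_hs, factor)
    return {
        "spatial": up_hs,                                 # LR-HS only
        "spectral": hr_rgb,                               # HR-RGB only
        "ssf": np.concatenate([up_hs, hr_rgb], axis=2),   # fused cube
    }

def pdcon_input_channels(c_in, widths, c_rgb=3):
    """Input channels per conv layer when HR-RGB is appended after each Conv+ReLU."""
    return [c_in] + [w + c_rgb for w in widths]

lr_hs = np.zeros((16, 16, 31))    # 16 x 16 pixels, 31 bands
hr_rgb = np.zeros((512, 512, 3))  # 512 x 512 pixels, 3 channels
inputs = build_inputs(lr_hs, hr_rgb)
print({name: cube.shape for name, cube in inputs.items()})
print("spectral factor:", 31 / 3, "spatial factor:", 512 // 16)
print("PDCon-SSF layer inputs:", pdcon_input_channels(34, (64, 32)))
```

With these hypothetical widths, the second and third layers of PDCon-SSF see 67 and 35 input channels instead of 64 and 32, which is the only structural change relative to SSF-CNN in this sketch.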
Recently, we also investigated a residual network architecture for HS image super-resolution. The residual network takes the concatenated cubic data of the available HR-RGB image and the upsampled LR-HS image as input, and simultaneously maintains the spectral attributes of the LR-HS image and the spatial context of the HR-RGB image to estimate a more robust HR-HS image. Considering the characteristics of HS image super-resolution, we modified the ResNet architecture, which was originally proposed for higher-level computer vision problems such as image classification and detection, by removing unnecessary modules to simplify the network for this low-level vision problem. Furthermore, since pansharpening research has shown that the estimated HR-HS image should have spatial structure similar to that of the HR-RGB image, we utilize the input RGB image to guide the spatial structure of the learned feature maps in our proposed ResNet. We first upsample the LR-HS image to the same size as the HR-RGB image and stack the two with a "Concat" layer. Multiple residual modules with alternately connected spectral and spatial reconstruction layers, implemented with convolutional kernel sizes 1 and n (n > 1), respectively, are used to model the nonlinear spectral mapping and the spatial structure. Our ResNet architecture consists of 5 residual blocks, each including a set of connected spectral and spatial reconstruction layers, as shown in Figure 8. In Figure 8, the first 3 residual blocks have 128 feature maps, and the last 2 residual blocks have 256 feature maps. The output of the m-th residual block is expressed as:

$$F_m = F_{m-1} + \mathrm{Spat}_3\left(\mathrm{Spec}_1\left(F_{m-1}\right)\right)$$

where $\mathrm{Spec}_1(\cdot)$ denotes the spectral reconstruction layer with convolutional kernel size 1, $\mathrm{Spat}_3(\cdot)$ denotes the spatial reconstruction layer with convolutional kernel size 3, and $F_{m-1}$ is the input of the residual block.
Furthermore, considering the HR spatial structure in the observed HR-RGB image, we use the HR-RGB image to guide the spatial structure of the learned feature maps in the residual blocks, which is modeled by stacking the input HR-RGB image Y and the input feature map $F_{m-1}$. Thus, with the added guidance connection, the output of a residual block is modified as:

$$F_m = F_{m-1} + \mathrm{Spat}_3\left(\mathrm{Spec}_1\left([F_{m-1}, Y]\right)\right)$$

where $[\cdot,\cdot]$ denotes stacking along the channel dimension. The guidance connections of the HR-RGB image are shown as dotted lines in Figure 7. Our ResNet-based HR-HS image recovery model is trained by minimizing the mean square error (MSE) between the estimated HR-HS image and the ground-truth Z.
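A guided residual block of the kind described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions rather than the chapter's implementation: the order spectral layer then spatial layer, the ReLU between them, the zero padding, the absence of biases, and the names `spec1`, `spat3`, `guided_residual_block`, and `mse_loss` are all choices made for this example.

```python
import numpy as np

def spec1(x, w):
    """Spectral layer: 1x1 convolution, a per-pixel linear map over channels.

    x: (H, W, C_in), w: (C_in, C_out).
    """
    return x @ w

def spat3(x, w):
    """Spatial layer: 3x3 convolution with zero padding.

    x: (H, W, C_in), w: (3, 3, C_in, C_out).
    """
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    return sum(xp[i:i + H, j:j + W] @ w[i, j]
               for i in range(3) for j in range(3))

def guided_residual_block(f_prev, rgb, w_spec, w_spat):
    """Stack the HR-RGB guide onto the block input, then add the residual."""
    stacked = np.concatenate([f_prev, rgb], axis=2)   # guidance connection
    h = np.maximum(spec1(stacked, w_spec), 0.0)       # ReLU placement is assumed
    return f_prev + spat3(h, w_spat)                  # skip connection

def mse_loss(z_hat, z):
    """Mean square error between the estimate and the ground truth."""
    return float(np.mean((z_hat - z) ** 2))

rng = np.random.default_rng(0)
C = 128                                   # feature maps in the first blocks
f_prev = rng.random((8, 8, C))
rgb = rng.random((8, 8, 3))
w_spec = 0.01 * rng.standard_normal((C + 3, C))
w_spat = 0.01 * rng.standard_normal((3, 3, C, C))
f_next = guided_residual_block(f_prev, rgb, w_spec, w_spat)
print(f_next.shape)
```

Dropping the `np.concatenate` step and shrinking `w_spec` to `(C, C)` gives the unguided block, so the guidance connection costs only three extra input channels per block in this sketch.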

Experimental results
We also validate the performance of HR image reconstruction with the DCNN-based methods on the CAVE and Harvard datasets. For the CAVE database, we randomly selected 20 HSIs to train the CNN models and used the remaining images to evaluate the proposed CNN methods. For the Harvard database, 10 HSIs were randomly selected for CNN model training, and the remaining 40 HSIs were used for testing. Figure 9 shows the HR-RGB images of the test samples from the CAVE database and several test samples from the Harvard database.
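The random train/test partition can be reproduced with a fixed seed, as in the short sketch below. The dataset sizes follow from the counts given in this chapter (20 training and 12 test images for CAVE, 10 training and 40 test images for Harvard); the seed value and the helper name `split_indices` are hypothetical, and the actual images selected in the experiments are not specified.

```python
import random

def split_indices(n_images, n_train, seed=0):
    """Randomly pick n_train image indices for training; the rest are test."""
    rng = random.Random(seed)
    train = sorted(rng.sample(range(n_images), n_train))
    train_set = set(train)
    test = [i for i in range(n_images) if i not in train_set]
    return train, test

cave_train, cave_test = split_indices(n_images=32, n_train=20)
harvard_train, harvard_test = split_indices(n_images=50, n_train=10)
print(len(cave_train), len(cave_test), len(harvard_train), len(harvard_test))
```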

Compared results of different CNN models
As introduced above, the CNN-based method can recover the HR-HS image from the available LR-HS image alone, from the HR-RGB image alone, or from the concatenated cubic data of the LR-HS and HR-RGB images. The corresponding models are named the spatial CNN, the spectral CNN, and the spatial and spectral fusion CNN (SSF-CNN), together with an extended version of SSF-CNN, PDCon-SSF. The baseline network is a three-layer convolutional architecture. For the CAVE database, we randomly select 20 images for learning the different types of CNN models and save the model parameters after 0.5 and 1 million iterations. The remaining 12 images in the CAVE database are used to evaluate the recovery performance of the different CNN models. The average and standard deviation of RMSE, PSNR, SAM, and ERGAS over the 12 CAVE test images are shown in Table 4. The spectral CNN gives much better results than the spatial CNN, owing to the smaller expanding factor in the spectral domain (about 10, from 3 to 31 bands) than in the spatial domain (32, from 16 to 512 pixels in both the horizontal and vertical directions), and the SSF-CNN and PDCon-SSF-CNN models yield a significant further improvement. One recovered HS image example and the corresponding residual images with respect to the ground-truth HR image from the CAVE database are visualized in Figure 10 for the different CNN models.
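For reference, the four quality measures can be computed as in the following sketch. These are the commonly used definitions, not necessarily the exact variants behind the tables: the peak value of 255 (8-bit intensity range), SAM reported in degrees, and the ERGAS scaling by the spatial downsampling ratio are assumptions, as are the function names.

```python
import numpy as np

def rmse(z_hat, z):
    """Root mean square error over all pixels and bands."""
    return float(np.sqrt(np.mean((z_hat - z) ** 2)))

def psnr(z_hat, z, peak=255.0):
    """Peak signal-to-noise ratio in dB (peak value is an assumption)."""
    return float(20.0 * np.log10(peak / rmse(z_hat, z)))

def sam_degrees(z_hat, z, eps=1e-12):
    """Spectral angle mapper: mean angle between per-pixel spectra, in degrees."""
    a = z_hat.reshape(-1, z.shape[-1])
    b = z.reshape(-1, z.shape[-1])
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps)
    return float(np.degrees(np.mean(np.arccos(np.clip(cos, -1.0, 1.0)))))

def ergas(z_hat, z, ratio):
    """ERGAS: 100 / ratio * sqrt(mean over bands of (RMSE_b / mean_b)^2)."""
    band_rmse = np.sqrt(np.mean((z_hat - z) ** 2, axis=(0, 1)))
    band_mean = np.mean(z, axis=(0, 1))
    return float(100.0 / ratio * np.sqrt(np.mean((band_rmse / band_mean) ** 2)))

# Toy example: a flat ground truth of 100 and an estimate offset by 10.
z = np.full((4, 4, 31), 100.0)
z_hat = z + 10.0
print(rmse(z_hat, z), psnr(z_hat, z), sam_degrees(z_hat, z), ergas(z_hat, z, ratio=32))
```

Lower RMSE, SAM, and ERGAS and higher PSNR indicate better reconstruction; SAM is insensitive to a per-pixel intensity scaling, which is why it complements the intensity-based measures.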
From Table 4 and Figure 10, it can be seen that the SSF-based CNN models provide significant performance improvement over the spatial CNN and the spectral CNN. Thus, for the Harvard database, we train only the SSF-CNN and PDCon-SSF models, for 1 million iterations on 10 randomly selected images, and use the remaining 40 images for evaluation. In addition, to validate the generalization of the learned CNN models, we predict the HR-HS images of the Harvard test samples with the SSF-CNN and PDCon-SSF-CNN models learned from the CAVE training samples. The average and standard deviation of RMSE, PSNR, SAM, and ERGAS over the 40 Harvard test images are shown in Table 5. The SSF-CNN and PDCon-SSF models learned from the CAVE training samples already provide reasonable recovery performance, and the quantitative measures improve further with the models learned from only 10 Harvard training images. One recovered HS image example and its corresponding residual images with respect to the ground-truth HR image from the Harvard database are visualized in Figure 11 using the SSF-CNN and PDCon-SSF-CNN models learned from the CAVE and Harvard training samples, respectively.

Table 4. The average and standard deviation of RMSE, PSNR, SAM, and ERGAS using different CNN models of three-layer architecture under 0.5 and 1 million iteration training on CAVE database.

Figure 10. The "superballs" image example from the CAVE database. The first row shows the ground-truth HR image and the recovered images by spatial CNN, spectral CNN, CSU [22], NNSR [12], and the proposed spatial and spectral CNN architectures, SSF-CNN and PDCon-SSF-CNN, respectively. The second row gives the input LR image, the absolute difference images between the ground-truth image, and the recovered HR-HS images in the first row.

Compared results of different baseline CNN architectures
As mentioned above, we also investigated a residual network architecture for HS image super-resolution, which uses a different baseline CNN architecture from that of the SSF-CNN. Under the same experimental settings, we implemented the DCNN-based HS image reconstruction using the three-layer CNN and the ResNet architecture with five residual blocks. The quantitative comparison is shown in Table 6 for both the CAVE and Harvard datasets. One recovered HS image example and the corresponding residual images with respect to the ground-truth HR image from the CAVE database are visualized in Figure 12 using the ResNet-RGB, SSF-Net, and ResNet-based fusion models.

Table 5. The average and standard deviation of RMSE, PSNR, SAM, and ERGAS of the test samples of Harvard database using different CNN models, where "SSF-CNN-CAVE" and "PDCon-SSF-CAVE" denote the learned CNN models using the training images from CAVE database.

Figure 11. An image example from the Harvard database. The first row shows the ground-truth HR image and the recovered images by CSU [22], NNSR [12], and the proposed PDCon-SSF-CNN using CAVE training images, SSF-CNN, and PDCon-SSF-CNN using Harvard training images, respectively. The second row gives the input LR image, the absolute difference images between the ground-truth image, and the recovered HR-HS images in the first row.

Conclusions
This chapter introduced recent research on HS image super-resolution. We first described the problem formulation for HS image super-resolution and provided the mathematical model relating the observed HR-RGB and LR-HS images to the required HR-HS image. We then gave detailed descriptions of an optimization-based method, self-similarity constrained sparse representation, and of the recently proposed DCNN-based methods. Experimental results validated that the recently proposed HS image super-resolution methods achieve promising performance on benchmark datasets.

Table 6. The compared average and standard deviation of RMSE, PSNR, SAM, and SSIM using the ResNet-RGB, SSF-Net [51], and the ResNet-based fusion methods on both CAVE and Harvard databases.