There are two mast cameras (Mastcam) onboard the Mars rover Curiosity. Both Mastcams are multispectral imagers with nine bands in each. The right Mastcam has three times higher resolution than the left. In this chapter, we apply some recently developed deep neural network models to enhance the left Mastcam images with help from the right Mastcam images. Actual Mastcam images were used to demonstrate the performance of the proposed algorithms.
- Curiosity rover
- image fusion
- deep learning
- transition learning
1. Introduction
The Curiosity rover (Figure 1) carries several instruments that are used to characterize the Martian surface. For example, the Alpha Particle X-ray Spectrometer (APXS) can analyze rock samples collected by the robotic arm and extract the compositions of rocks; the Laser-Induced Breakdown Spectroscopy (LIBS) instrument can extract spectral features from the vaporized fumes and deduce rock compositions at a distance of 7 m; and the Mastcam imagers can perform surface characterization from 1 km away.
The two Mastcam multispectral imagers are separated by 24.2 cm. As shown in Figure 2, the left Mastcam (34 mm focal length) has three times the field of view of the right Mastcam (100 mm focal length). In other words, the right imager has three times the resolution of the left. To generate stereo images or to construct a 12-band image cube by fusing bands from the left and right Mastcams [4, 5, 6], a practical solution is to downsample the right images to the resolution of the left images, which avoids the artifacts caused by the Bayer pattern or by JPEG compression loss. Although this approach has practical merits, it may restrict the potential of the Mastcams. First, downsampling the right images throws away the high spatial resolution pixels in the right bands. Second, the lower resolution of the current stereo images may degrade the augmented reality or virtual reality experience of users. If one can instead apply advanced pansharpening algorithms to the left bands, then one obtains a 12-band high-resolution image cube for stereo vision and image fusion.
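As a concrete illustration of the downsampling option (a sketch only, not the actual Mastcam processing pipeline), the snippet below block-averages a band by the 3:1 resolution ratio; the function name and toy data are hypothetical:

```python
import numpy as np

def downsample_by_3(img):
    """Downsample a 2D band by a factor of 3 via block averaging.

    Height and width are cropped to multiples of 3 before averaging.
    """
    h, w = img.shape
    h3, w3 = h - h % 3, w - w % 3
    blocks = img[:h3, :w3].reshape(h3 // 3, 3, w3 // 3, 3)
    return blocks.mean(axis=(1, 3))

right_band = np.arange(36.0).reshape(6, 6)  # toy "right Mastcam" band
left_like = downsample_by_3(right_band)     # matches left-camera resolution
print(left_like.shape)  # (2, 2)
```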
In the past two decades, there have been many papers discussing the fusion of a high-resolution panchromatic (pan) image with a low-resolution multispectral image (MSI) [10, 11, 12, 13, 14]. This is known as pansharpening. In our recent papers [15, 16], we proposed an unsupervised network structure to address the image fusion/super-resolution (SR) problem for hyperspectral images (HSI), referred to as HSI-SR, where a low-resolution (LR) HSI with high spectral resolution and a high-resolution (HR) MSI with low spectral resolution are fused to generate an HSI with high resolution in both the spatial and spectral dimensions. Similar to MSI, HSI has found extensive applications [17, 18, 19, 20, 21]. In this chapter, we adopt the approach designed in [15, 16], referred to as the unsupervised sparse Dirichlet Network (uSDN), to enhance Mastcam images, where we treat the right Mastcam image as an MSI with high spatial resolution and the left Mastcam image as an HSI with low spatial resolution.
In this chapter, we focus on the application of uSDN to enhance Mastcam images. In Section 2, we first introduce the problem of HSI-SR and then briefly summarize the key ideas of uSDN. In Section 3, we apply uSDN on actual Mastcam images. In Section 4, we include some further enhancements of uSDN and experiments. In Section 5, we introduce a transition learning concept, which is a natural extension of uSDN. Some preliminary results are also included. Finally, we conclude the chapter with some remarks.
2. The uSDN algorithm for HSI-SR
In this section, we describe the uSDN algorithm developed in [15, 16]; for more details, please refer to those references. We first formulate the HSI-SR problem to facilitate the discussion of Mastcam enhancement. Table 1 summarizes the mathematical symbols used in this chapter.
| Symbol | Description |
| --- | --- |
| $\bar{Y}_h$ / $Y_h$ | 3D/2D LR HSI |
| $\bar{Y}_m$ / $Y_m$ | 3D/2D HR MSI |
| $\hat{\bar{Y}}_m$ / $\hat{Y}_m$ | 3D/2D reconstructed HR MSI |
| $\Phi_h$ | Spectral bases of HSI |
| $\Phi_m$ | Spectral bases of MSI |
| $S_h$ | Coefficients/representations of HSI |
| $S_m$ | Coefficients/representations of MSI |
| $X$ | Reconstructed 2D HSI |
| $W$, $b$ | Network weights and bias |
| $E_h(\cdot)$ / $E_m(\cdot)$ | Encoder of the HSI/MSI |
| $D(\cdot)$ | Decoder of the HSI and MSI |
| $\theta_{E_h}$ / $\theta_{E_m}$ | Encoder weights of the HSI/MSI |
| $\theta_D$ | Decoder weights of the HSI and MSI |
| $\mathbf{s}_i$ | Representation vector of a single pixel |
The basic idea of uSDN is illustrated in Figure 3. First, the LR HSI $\bar{Y}_h \in \mathbb{R}^{w \times h \times L}$, with width $w$, height $h$, and $L$ spectral bands, is unfolded into a 2D matrix $Y_h \in \mathbb{R}^{wh \times L}$. Similarly, the HR MSI $\bar{Y}_m \in \mathbb{R}^{W \times H \times l}$, with width $W$, height $H$, and $l$ spectral bands, is unfolded into a 2D matrix $Y_m \in \mathbb{R}^{WH \times l}$, and the SR HSI $\bar{X}$ is unfolded into a 2D matrix $X \in \mathbb{R}^{WH \times L}$. Note that, generally, the spatial resolution of the MSI is much higher than that of the HSI, that is, $W \gg w$ and $H \gg h$, and the spectral resolution of the HSI is much higher than that of the MSI, that is, $L \gg l$. The objective is to reconstruct the HSI $X$ with both high spatial and high spectral resolution from the LR HSI and the HR MSI.
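The unfolding step can be sketched in a few lines of NumPy; the dimensions and variable names below are toy values for illustration:

```python
import numpy as np

# Toy LR HSI cube: width w = 4, height h = 5, L = 9 spectral bands
w, h, L = 4, 5, 9
Y_h_3d = np.random.rand(h, w, L)

# Unfold the 3D cube into a 2D matrix: one row per pixel, one column per band
Y_h = Y_h_3d.reshape(w * h, L)
print(Y_h.shape)  # (20, 9)
```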
Due to hardware limitations, each pixel in an HSI or MSI may cover more than one constituent material, leading to mixed pixels. Each mixed pixel can be assumed to be a linear combination of a few basis vectors (or source signatures). Both the LR HSI and the HR MSI can thus be expressed as linear combinations of basis vectors with their corresponding proportional coefficients (referred to as representations in deep learning), as expressed in Eqs. (1) and (2):

$$Y_h = S_h \Phi_h, \qquad (1)$$

$$Y_m = S_m \Phi_m, \quad \Phi_m = \Phi_h \mathcal{R}, \qquad (2)$$

where $\Phi_h \in \mathbb{R}^{c \times L}$ and $\Phi_m \in \mathbb{R}^{c \times l}$ denote the spectral bases of $Y_h$ and $Y_m$, respectively. They preserve the spectral information of the images. $S_h \in \mathbb{R}^{wh \times c}$ and $S_m \in \mathbb{R}^{WH \times c}$ are the proportional coefficients of $Y_h$ and $Y_m$, respectively. Since the coefficients indicate how much of each spectral basis is present at specific spatial locations, they preserve the spatial structure of the HSI. The relationship between the HSI and MSI bases is expressed in the right part of Eq. (2), where $\mathcal{R} \in \mathbb{R}^{L \times l}$ is the transformation matrix given as a prior from the sensor [22, 23, 24, 25, 26, 27, 28, 29].
With $\Phi_h$ carrying the high spectral information and $S_m$ carrying the high spatial information, the desired HR HSI is generated by Eq. (3):

$$X = S_m \Phi_h. \qquad (3)$$

See Figure 3. Since the ground truth $X$ is not available, the problem has to be solved in an unsupervised fashion. In addition, the linear combination assumption enforces the representation vectors of the HSI and MSI to be non-negative and sum-to-one, that is, $\sum_{j=1}^{c} s_{ij} = 1$ and $s_{ij} \ge 0$, where $\mathbf{s}_i$ is a row vector of either $S_h$ or $S_m$ [24, 29].
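The linear mixing model and the reconstruction step amount to a few matrix products. The sketch below illustrates them with random toy data; the variable names mirror the roles described above (spectral bases, representations, sensor transformation) and all values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
c, L, l = 5, 9, 3          # c bases, L HSI bands, l MSI bands
n_lr, n_hr = 20, 180       # pixel counts of the LR HSI and HR MSI

# Spectral bases and sensor transformation (toy values)
Phi_h = rng.random((c, L))            # HSI spectral bases
R = rng.random((L, l))                # transformation matrix from the sensor
Phi_m = Phi_h @ R                     # Eq. (2), right part: MSI bases

def sum_to_one_rows(n, c):
    """Rows are non-negative and sum to one, as the mixing model requires."""
    s = rng.random((n, c))
    return s / s.sum(axis=1, keepdims=True)

S_h = sum_to_one_rows(n_lr, c)        # LR HSI representations
S_m = sum_to_one_rows(n_hr, c)        # HR MSI representations

Y_h = S_h @ Phi_h                     # Eq. (1): LR HSI
Y_m = S_m @ Phi_m                     # Eq. (2): HR MSI
X = S_m @ Phi_h                       # Eq. (3): desired HR HSI
print(X.shape)  # (180, 9)
```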
The unsupervised uSDN architecture is shown in Figure 4. It has three unique features. First, the network consists of two encoder-decoder networks that extract the representations of the LR HSI and the HR MSI, respectively. The two networks share the same decoder, so that both the spectral and spatial information of the two modalities can be extracted in an unsupervised setting. Second, the representations of both modalities are enforced to follow a Dirichlet distribution, whereby the sum-to-one and non-negative properties are naturally incorporated into the network [30, 31, 32, 33, 34]. The solution space is further regularized with a sparsity constraint. Third, the angular difference between the representations of the two modalities is minimized to preserve the spectral information of the reconstructed HR HSI.
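The sum-to-one and non-negativity constraints can be illustrated with a simple softmax mapping of raw encoder outputs. Note this is only a simplified stand-in for the Dirichlet construction used in uSDN, meant to show the properties the representations must satisfy:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax: outputs are non-negative and each row sums to one."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

logits = np.random.randn(6, 5)   # raw encoder outputs: 6 pixels, 5 bases
S = softmax(logits)              # valid representation vectors
print(S.sum(axis=1))             # each entry is 1.0
```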
3. Mastcam image enhancement using uSDN with improvements
3.1 Applying uSDN for Mastcam enhancement
uSDN has been thoroughly evaluated on two widely used benchmark datasets, CAVE and Harvard. Details can be found in [15, 16]. Here, we adopt uSDN to enhance the resolution of Mastcam images. As mentioned earlier, the right Mastcam has higher resolution than the left. Hence, we treat the right Mastcam images as HR MSI and the left images as LR HSI. Although uSDN was introduced to deal with the general HSI super-resolution problem, we can treat Mastcam image enhancement simply as a special case of HSI-SR.
For quantitative comparison, the root mean squared error (RMSE) and spectral angle mapper (SAM) are applied to evaluate the reconstruction error and the amount of spectral distortion, respectively.
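A minimal NumPy sketch of the two metrics, with images arranged as pixels x bands matrices (the function names are ours):

```python
import numpy as np

def rmse(x, y):
    """Root mean squared error between two images."""
    return float(np.sqrt(np.mean((x - y) ** 2)))

def sam(x, y, eps=1e-12):
    """Mean spectral angle (radians) between per-pixel spectra.

    x, y: (num_pixels, num_bands) matrices.
    """
    dot = np.sum(x * y, axis=1)
    denom = np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1) + eps
    return float(np.mean(np.arccos(np.clip(dot / denom, -1.0, 1.0))))

ref = np.random.rand(100, 9)                 # toy reference spectra
est = ref + 0.01 * np.random.randn(100, 9)   # toy estimate
err, angle = rmse(est, ref), sam(est, ref)
```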
The results are shown in Figure 5. The reconstructed image is very close to the ground truth. Most methods require that the size of the high-resolution image be an integer multiple of the size of the low-resolution image. Thus, we only compare our method with CNMF, which works for arbitrary image sizes. The results are shown in Table 2. We observe that uSDN outperforms CNMF.
3.2 Improvement based on uSDN
In this section, we summarize some further improvements of uSDN, obtained by fine-tuning the existing network structure in order to further enhance the fusion performance.
The existing structure of uSDN described in Section 3.1 is improved in two ways. First, in Section 3.1, the architecture consists of two deep networks, for the representation learning of the LR HSI and the HR MSI, respectively, and only the decoders of the two networks are shared. The spectral information (i.e., the decoder of the LR HSI network) is extracted through the LR HSI network. Then the representation layer of the HR MSI is optimized by enforcing spectral angle similarity. However, this introduces an additional cost function, that is, angular difference minimization, and the optimization procedure is time consuming. In the improved uSDN, most of the encoder weights of the HR MSI network are shared with those of the LR HSI encoder. Only a couple of encoder weights are updated during the HR MSI optimization. In this way, the representations of both the LR HSI and HR MSI networks are reinforced to follow Dirichlet distributions with parameters following the same trends, and the representations extracted from the LR HSI match the patterns of those extracted from the HR MSI, as shown in Figure 6.
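The weight-sharing idea can be caricatured with a small NumPy encoder in which only the input layer is modality-specific (the two modalities have different band counts) while the deeper weights are shared. This is an illustrative simplification, not the actual uSDN layer layout:

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda a: np.maximum(a, 0.0)

L_bands, l_bands, hidden, c = 9, 3, 8, 5   # toy layer widths

# Input layers are modality-specific (different numbers of bands) ...
W_in_hsi = rng.standard_normal((L_bands, hidden))
W_in_msi = rng.standard_normal((l_bands, hidden))
# ... while the deeper encoder weights are shared between both networks,
# so the two sets of representations follow the same trends.
W_shared = rng.standard_normal((hidden, hidden))
W_out = rng.standard_normal((hidden, c))

def encode(pixels, W_in):
    z = relu(relu(pixels @ W_in) @ W_shared) @ W_out
    e = np.exp(z - z.max(axis=1, keepdims=True))   # softmax: sum-to-one rows
    return e / e.sum(axis=1, keepdims=True)

S_h = encode(rng.random((20, L_bands)), W_in_hsi)   # LR HSI representations
S_m = encode(rng.random((180, l_bands)), W_in_msi)  # HR MSI representations
```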
Second, to further reduce the spectral distortion of the estimated HR HSI, instead of using the $L_2$ loss, we adopt the $L_{21}$ loss, which encourages the network to reduce the spectral loss of each pixel. Compared to the network with the $L_2$ loss, the network with the $L_{21}$ loss is able to extract the spectral information of images more accurately. The $L_{21}$ loss not only reduces the spectral distortion of the estimated HR HSI, but also improves the convergence speed of the network.
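The difference between the two losses can be sketched as follows, taking the $L_2$ loss over the whole pixels x bands matrix and the $L_{21}$ loss as the sum of per-pixel spectral norms (an assumption about the exact form used):

```python
import numpy as np

def l2_loss(x, y):
    """Frobenius-norm loss over the whole image matrix."""
    return float(np.sqrt(np.sum((x - y) ** 2)))

def l21_loss(x, y):
    """Sum of per-pixel (per-row) L2 norms: penalizes each spectrum."""
    return float(np.sum(np.linalg.norm(x - y, axis=1)))

est = np.random.rand(50, 9)   # toy estimate: 50 pixels, 9 bands
ref = est.copy()
ref[0] += 0.1                 # distort one pixel's spectrum
print(l2_loss(est, ref), l21_loss(est, ref))
```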
The result of the proposed method on an individual HSI is visualized in Figure 7. When we optimize the network with the $L_{21}$ loss, we can observe that the difference between the estimated MSI and the ground-truth MSI is very small, with an RMSE of 1.7428 and a SAM of 0.25615.
4. Combination of Dirichlet-Net and U-Net
In this section, we propose to combine the Dirichlet-Net with a U-Net to mitigate the mis-registration issue in the left and right Mastcam images.
Since in real scenarios, the images from the left and right cameras may not match each other perfectly even after registration, we propose a combination of Dirichlet-Net and U-Net to further improve the fusion performance using non-perfectly registered patches. We propose an unsupervised architecture as shown in Figure 8, which consists of two deep networks, an improved Dirichlet-Net for the representation learning of the MSI, and a U-Net for switching the low-resolution spatial information patches with high-resolution spatial information patches. Then the HR MSI of the left Mastcam image is generated by combining its spectral information with the spatial information of improved resolution.
From the last step in Figure 8, we are able to extract both the spectral and spatial information from the LR MSI (left Mastcam) and the HR MSI (right Mastcam). Although the scenes from the left and right cameras are not the same, we assume they share the same group of spectral bases. If we can improve the spatial information of the LR MSI using the HR MSI, the quality of the LR MSI can be enhanced.
The architecture of the U-Net is illustrated in the lower part of Figure 8. We first train a U-Net to recover the extracted spatial information (representations) of the HR MSI through convolution and deconvolution layers. The convolution layers extract HR spatial features from the representations, and the deconvolution layers take these extracted features to rebuild the spatial information. Then we extract features from the spatial patches of the LR HSI with the same convolution layers and switch these feature patches with their most similar feature patches among the HR spatial features. Finally, the left Mastcam image with enhanced resolution, X, is generated by feeding the switched patches into the deconvolution layers of the U-Net and the decoder of the Dirichlet-Net.
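A nearest-neighbor patch-switching step along these lines can be sketched in NumPy, using non-overlapping patches and Euclidean distance; the actual method operates on learned feature maps, so the arrays and function name below are illustrative:

```python
import numpy as np

def switch_patches(lr_feat, hr_feat, p=3):
    """Replace each non-overlapping p x p patch of lr_feat with its most
    similar (smallest Euclidean distance) p x p patch from hr_feat."""
    H, W = hr_feat.shape
    # Collect candidate HR patches on a p-strided grid
    hr_patches = [hr_feat[i:i + p, j:j + p]
                  for i in range(0, H - p + 1, p)
                  for j in range(0, W - p + 1, p)]
    out = lr_feat.copy()
    h, w = lr_feat.shape
    for i in range(0, h - p + 1, p):
        for j in range(0, w - p + 1, p):
            patch = lr_feat[i:i + p, j:j + p]
            dists = [np.linalg.norm(patch - q) for q in hr_patches]
            out[i:i + p, j:j + p] = hr_patches[int(np.argmin(dists))]
    return out

lr = np.random.rand(6, 6)     # toy LR feature map
hr = np.random.rand(12, 12)   # toy HR feature map
enhanced = switch_patches(lr, hr)
print(enhanced.shape)  # (6, 6)
```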
Here, we show experimental results from the proposed combined (Dirichlet-Net and U-Net) approach in Figures 9 and 10. We can observe that the reconstructed left Mastcam image is sharper than the raw MSI captured directly by the left camera, and the spectral distortion of the recovered MSI is small, even though only part of the high-resolution MSI (right Mastcam image) is given from the right camera. Note that, due to memory constraints, only a small patch can be recovered at a time; thus, there exist some disconnected parts in the results. This issue is addressed in Section 5.
5. Spatial representation improvement with transition learning
High spatial resolution images have one natural property: the transitions among neighboring pixel values are smooth. The patch-based method aims to replace the LR patches from the LR MSI representations with the most similar HR patches from the HR MSI representations. Since the LR MSI and HR MSI are unregistered and there is no ground truth for the enhanced MSI, the patch-based improvement cannot guarantee smooth transitions in the reconstructed images; that is, the replaced patches may not match their neighbors. Therefore, in this section, we propose another structure, based on transition learning, to further improve the spatial resolution of the LR HSI. The main structure is shown in Figure 11.
To learn smooth transitions between pixels, we first extract sub-images from the representations of the HR MSI with stride 3, as shown in the lower part of Figure 11. For example, since the super-resolution factor is 3, we extract 9 sub-images from the HR MSI representations. Then the network learns the transitions between the center sub-image and the other 8 sub-images. Since the LR MSI and HR MSI have similar statistical distributions, we assume that the transitions among pixels in both modalities are the same. Therefore, the representations of the LR MSI can be treated as the center sub-image of the enhanced MSI, and the other 8 sub-images of the enhanced MSI can be estimated by feeding the representations of the LR MSI into the trained network. There are still residuals between the reconstructed and the ideal representations of the enhanced MSI. This time, we adopt the principle described earlier to add high-frequency residuals to the enhanced MSI.
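The stride-3 sub-image extraction can be sketched as follows (toy data; the function name is ours):

```python
import numpy as np

def extract_subimages(S, r=3):
    """Split a 2D representation map into r*r sub-images with stride r.

    sub[k] holds the pixels at offset (k // r, k % r); for r = 3 this
    yields the 9 sub-images described above, sub[4] being the center one.
    """
    return [S[di::r, dj::r] for di in range(r) for dj in range(r)]

S_hr = np.arange(81.0).reshape(9, 9)   # toy HR representation map
subs = extract_subimages(S_hr)
print(len(subs), subs[0].shape)  # 9 (3, 3)
```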
Here, the experimental results of the proposed approaches are compared with the results from bicubic interpolation and the state-of-the-art single-image super-resolution method EnhanceNet, as shown in Figures 12–14. Note that, since EnhanceNet only offers 4X pre-trained weights, we show its 4X reconstruction results for a fair comparison, in case the down-sampling procedure reduces the quality of the reconstructed images. Bicubic interpolation does not improve the resolution much. EnhanceNet was trained on a natural image dataset; thus, it works poorly on remote sensing images. Compared to the bicubic or EnhanceNet methods, we can observe that the proposed methods not only improve the spatial resolution of the LR MSI, but also preserve the spectral information well, even though the images from the left and right cameras are not registered. The transition-based approach works better than the patch-based one, because it learns the relationship between the reconstructed pixels.
6. Conclusions
In this chapter, we summarized the application of several deep learning-based image fusion algorithms to enhance Mastcam images from the Mars rover Curiosity. The first algorithm, termed uSDN, is based on the Dirichlet-Net, which incorporates the sum-to-one and sparsity constraints. Two improvements of uSDN were then investigated. Finally, a transition learning-based approach was developed. Promising results using actual Mastcam images were presented. More research will be carried out in the future to continue the above investigations.
This work was supported in part by NASA NNX12CB05C and NNX16CP38P.