Remote scene classification results of the five well-known pre-trained deep CNNs on three different remote sensing datasets.
Deep convolutional neural networks (CNNs) have been widely used to obtain high-level representations in various computer vision tasks. However, for the task of remote scene classification, there are not enough images to train a very deep CNN from scratch. Instead, transferring successful pre-trained deep CNNs to remote sensing tasks provides an effective solution. First, from the viewpoint of generalization power, we examine whether deep CNNs need to be deep when applied to remote scene classification. Then, pre-trained deep CNNs with fixed parameters are transferred for remote scene classification, which addresses both the long training time and parameter over-fitting. With five well-known pre-trained deep CNNs, experimental results on three independent remote sensing datasets demonstrate that transferred deep CNNs can achieve state-of-the-art results in an unsupervised setting. This chapter also provides a baseline for applying deep CNNs to other remote sensing tasks.
- convolutional neural network
- remote sensing
- scene classification
- deep learning
- generalization power
Remote sensing image processing has made great advances in recent years, from low-level tasks, such as segmentation, to high-level ones, such as classification [1, 2, 3, 4, 5, 6, 7]. However, the task becomes progressively more difficult as the level of abstraction increases, going from pixels to objects and then to scenes. Classifying remote scenes according to a set of semantic categories is a very challenging problem because of high intra-class variability and low inter-class distance [5, 6, 7, 8, 9]. Therefore, more representative, higher-level representations are desirable and will certainly play a dominant role in scene-level tasks. The deep convolutional neural network (CNN), which is acknowledged as the most successful and widely used deep learning model, attempts to learn high-level features corresponding to a high level of abstraction. Its recent impressive results on classification and detection tasks have brought dramatic improvements over the previous state-of-the-art records on a number of benchmarks [11, 12, 13, 14]. In theory, considering the subtle differences among categories in remote scene classification, we may attempt to form high-level representations for remote scenes from CNN activations. However, acquiring large-scale, well-annotated remote sensing datasets is costly, and in practice a high-capacity deep CNN easily over-fits when trained on a small dataset. In other words, with a limited remote sensing dataset, deep CNNs fit the training data almost perfectly but do not generalize well to test data, which ultimately results in poor performance.
ImageNet is a large-scale dataset that offers a very comprehensive database of more than 1.2 million categorized natural images in 1000+ classes. Deep CNN models trained on this dataset serve as the backbone for many segmentation, detection, and classification tasks on other datasets. Moreover, some very recent works have demonstrated that the representations learned by deep CNNs pre-trained on large datasets such as ImageNet are transferable to other image classification tasks. Some works have also started to apply them to the remote sensing field and obtained state-of-the-art results on some specific datasets [15, 18, 19]. However, the generalization power of features learned by deep CNNs fades markedly when the features of remote sensing images differ from those of natural images in the ImageNet dataset [15, 18]. Therefore, to solve the problem discussed above, the generalization power of deep CNNs plays the key role. We find that the generalization power of a deep CNN is related to its depth. A deeper architecture trained on a large-scale dataset may lead to a more general hypothesis for remote scenes. To our surprise, features learned in deeper layers are more general than those learned in shallower layers of a deep CNN when we transfer them for remote scene classification. This overturns the traditional view that features in the shallow layers of a deep CNN are composed of basic visual patterns (e.g., salient edges and borders) and are therefore more general for test data. Inspired by this, we evaluate the generalization power of transferred deep CNNs for remote scenes under different conditions and explore the proper way to apply deep CNNs to remote scene classification with limited remote sensing data.
We conduct extensive experiments with transferred deep CNNs and evaluate their generalization power on different remote sensing datasets that vary in spatial information. The results show that the depth of CNNs contributes to their generalization power. Features from deeper layers are more general for test data and bring better performance in remote scene classification. Then, we conduct extensive experiments with different pre-trained deep CNNs such as CaffeNet, GoogLeNet, and ResNet. This chapter introduces few new techniques, and our study so far is mainly empirical. However, a thorough report on the generalization power of deep CNNs for remote scene classification has considerable value for applying deep CNNs to remote sensing images. A satisfactory answer to this question would not only help to make features of remote scenes more interpretable in deep CNNs but might also lead to more principled and reliable deep architecture design. Our main contributions are summarized as follows:
- We thoroughly investigate how transferred deep CNNs work for remote scene classification with limited remote sensing data and how their generalization power affects their performance.
- This chapter challenges the classical view of features learned in deep CNNs by showing that high-level features learned in deeper layers are more general than basic features (e.g., salient edges and borders) learned in shallower layers. Features learned in the shallow layers of deep CNNs are not general enough for remote scenes. This leads us to believe that the depth of CNNs enhances the generalization power of the learned features and is essential for remote scene classification.
- Based on various pre-trained deep CNNs, we evaluate our proposed method on different remote sensing datasets that vary in spatial and spectral information. The results show that our proposed method can learn better features for remote scenes. In an unsupervised setting, our proposed method achieves state-of-the-art performance on some public remote scene datasets.
The rest of the chapter is organized as follows. Section 2 presents the successful pre-trained deep CNNs in current use and the way to transfer them for remote scene classification. Section 3 analyzes the generalization power of features in different layers of a transferred deep CNN. Experiments are presented in Section 4, and we conclude the chapter in Section 5.
2. Transferred deep CNNs for remote scene classification
Convolutional neural networks are generally presented as systems of interconnected processing units that compute values from their inputs and produce outputs that may in turn feed further units. The typical architecture of a deep CNN is composed of multiple cascaded layers of various types.
Among the different types of layers, the convolutional layer is responsible for capturing features from the images. The first layers usually obtain low-level features (like edges, lines, and corners), while the later ones obtain high-level features (like structures, objects, and shapes). The process performed in this type of layer can be decomposed into two phases: (i) the convolution step, where a fixed-size window runs over the image with some stride, defining a region of interest, and (ii) the processing step, which uses the pixels inside each window as input for the neurons that finally perform the feature extraction from the region. The continuous form and discrete form of the convolutional operation can be expressed as Eqs. (1) and (2), respectively:
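In standard textbook notation, the two forms are commonly written as below. The symbols $f$ (input signal), $g$ (kernel), $t$, $\tau$, $n$, and $m$ are assumed here for illustration and may differ from the notation used elsewhere in the chapter.

```latex
\begin{align}
(f * g)(t) &= \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, \mathrm{d}\tau \tag{1} \\
(f * g)[n] &= \sum_{m=-\infty}^{\infty} f[m]\, g[n - m] \tag{2}
\end{align}
```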
As to the input map, the convolutional operation can be further illustrated by Figure 1:
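The sliding-window procedure can also be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the chapter's implementation: the function name `conv2d`, the toy image, and the 2 × 2 edge filter are all hypothetical, no padding is used, and the kernel is applied without flipping (cross-correlation), as most CNN frameworks do.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (both lists of rows), with no padding.

    The kernel is applied without flipping (cross-correlation).
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Weighted sum of the pixels inside the current window.
            acc = 0.0
            for u in range(k_h):
                for v in range(k_w):
                    acc += image[top + u][left + v] * kernel[u][v]
            row.append(acc)
        output.append(row)
    return output


# Toy 4x4 input with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2x2 filter that responds to left-to-right intensity increases.
edge_kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d(image, edge_kernel, stride=1)
```

The resulting feature map is non-zero only where the window straddles the edge, which is the behavior the edge-like first-layer filters discussed later rely on.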
Conventionally, a nonlinear function, usually called the activation function, is applied after the convolutional operation. There are many alternatives for the activation function, such as the sigmoid function and the tanh function. The most popular activation function nowadays is the rectified linear unit (ReLU). ReLU has several advantages over the others: (i) it is less prone to saturation during the learning process, (ii) it induces sparsity in the hidden units, and (iii) it suffers less from the vanishing-gradient problem than the sigmoid and tanh functions. The mathematical form of the ReLU can be shown as follows:
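In its standard form, ReLU is defined as $f(x) = \max(0, x)$. The short sketch below compares it with the sigmoid to make the saturation argument concrete; the helper names are illustrative only.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes the rest."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid; it shrinks toward 0 for large |x|."""
    s = sigmoid(x)
    return s * (1.0 - s)


inputs = [-4.0, -1.0, 0.5, 4.0]
# Negative inputs are zeroed, which yields sparse activations.
activations = [relu(x) for x in inputs]
```

For a large positive input such as 4.0, the ReLU gradient stays at 1 while the sigmoid gradient has already dropped below 0.02, which is why stacked sigmoid layers tend to pass back very small gradients.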
Typically, after obtaining the convolved feature activations, we would next like to aggregate statistics of these features at various locations; this aggregation is called pooling. Pooling within a pooling region translates convolved feature activations into pooled features, which are much lower in dimension and can improve classification results (i.e., less over-fitting). Pooling regions are usually contiguous areas in the convolved feature maps, and the pooled features are usually generated from the same filter. These pooled features are then “translation invariant.” Although several novel pooling approaches have been proposed, max pooling and average pooling are still the most commonly used approaches, as shown in Figure 2.
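As a minimal sketch of the two pooling variants, assuming non-overlapping 2 × 2 regions with stride 2 (the function name `pool2d` and the toy feature map are hypothetical):

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Aggregate each size x size window into one value (max or average)."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    pooled = []
    for top in range(0, len(feature_map) - size + 1, stride):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[top + u][left + v]
                for u in range(size)
                for v in range(size)
            ]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
```

Both variants reduce the 4 × 4 map to 2 × 2; max pooling keeps the strongest response in each region, while average pooling keeps the mean response.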
After several convolutional and pooling layers come the fully connected layers, which connect every neuron in the previous layer to every neuron in the current layer. Since the fully connected layers hold most of the parameters, over-fitting can easily happen. To prevent this, the dropout method is employed, as shown in Figure 3. This technique randomly drops several neuron outputs, which then no longer contribute to the forward pass or to backpropagation. These neuron drops are equivalent to decreasing the number of neurons in the network, which improves the speed of training and makes model combination practical, even for deep networks.
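A minimal sketch of dropout on a list of activations follows. It uses the common "inverted dropout" variant, in which the kept activations are rescaled during training so that nothing changes at test time; this variant, the function name, and the drop probability of 0.5 are assumptions made for illustration rather than details taken from the chapter.

```python
import random


def dropout(activations, drop_prob=0.5, training=True, rng=None):
    """Inverted dropout: zero each activation with probability `drop_prob`.

    Kept activations are divided by the keep probability so that the
    expected value of each output matches the input.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)  # dropout is disabled at test time
    rng = rng or random.Random()
    keep_prob = 1.0 - drop_prob
    output = []
    for a in activations:
        if rng.random() < keep_prob:
            output.append(a / keep_prob)  # kept and rescaled
        else:
            output.append(0.0)  # dropped: no forward or backward signal
    return output


hidden = [1.0] * 10
train_out = dropout(hidden, drop_prob=0.5, rng=random.Random(0))
test_out = dropout(hidden, training=False)
```

During training each output is either 0.0 or 2.0 here, while the test-time call returns the activations unchanged.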
Finally, after all the convolution, pooling, and fully connected layers, a classifier layer may be used to calculate the class probability of each instance.
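The chapter does not name the classifier layer; a common choice is softmax, sketched here as an assumption. The scores are made-up values for three hypothetical classes.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]  # made-up outputs of the last fully connected layer
probs = softmax(scores)
predicted_class = probs.index(max(probs))
```

The predicted class is simply the index with the highest probability, which is also the index of the highest raw score.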
Based on the typical architecture of a deep CNN, AlexNet, CaffeNet, VGG-VD, MSRA-Net, NIN, GoogLeNet, Inception V3, Inception V4, and ResNet have all proved to be effective in detection or classification tasks and achieve state-of-the-art performance.
In summary, we illustrate the evolution of the structure of deep CNNs in Figure 4:
However, the successful deep CNNs discussed above do not perform as well as expected when we apply them directly to remote scene classification. An effective solution, recently explored in [15, 18, 20], is to transfer deep features trained on the ImageNet dataset to remote sensing images. Deep CNNs pre-trained on the ImageNet dataset can be treated as fixed feature extractors. In a feedforward pass, they extract a global feature representation of the remote sensing images. With this global representation, a simple classifier can perform remote scene classification. Taking a step further, a fine-tuning strategy is usually applied to the deeper layers of transferred deep CNNs to further improve their performance in remote scene classification. Typically, the first few layers are frozen, because low-level features are assumed to fit remote scenes better, and the deeper layers are allowed to keep learning by training them with remote sensing images. Taking AlexNet as an example, we show the fine-tuning strategy in Figure 5.
Although the strategy of fine-tuning the deeper layers of transferred deep CNNs with remote sensing images achieves near-perfect performance in remote scene classification, we challenge the theoretical basis of this strategy by showing that not all low-level features in shallow layers are general enough for remote scenes; some of them even show very poor generalization power in the transferring process. We find that the depth of transferred CNNs enhances their generalization power and guarantees a general hypothesis for remote scene classification. The detailed results are discussed in Section 3. This finding in transferred deep CNNs gives an answer to the very recent discussion about whether the generalization power of deep CNNs comes from sheer memorization or from an available hypothesis.
3. Generalization power of features in different layers of transferred deep CNN
As mentioned in Section 2, when transferring deep CNNs pre-trained on ImageNet for remote scene classification, we typically assume that features (e.g., salient edges and borders) in the shallow layers are generic, while features in the deep layers are more specific to the dataset used for pre-training and thus need to be fine-tuned on the target dataset. Therefore, the traditional strategy for transferring pre-trained deep CNNs to remote scene classification is to freeze the shallow layers and fine-tune the last deep layers. However, this assumption raises the question of how the “depth” of transferred deep CNNs affects the features of remote scenes in the transferring process. To answer this question, we take CaffeNet pre-trained on ImageNet as an example and thoroughly analyze the features of remote scenes in its different layers when we transfer it for remote scene classification on the UC Merced dataset.
First, we take a close look at the features of remote sensing images in the first convolutional layer of the pre-trained CaffeNet. In Figure 6, we visualize the convolutional filters of the first convolutional layer. These convolutional filters are learned by pre-training CaffeNet on the ImageNet dataset. We can see that the earlier filters are learned for extracting edges in different directions and the later filters are learned for extracting different colors. For example, the first, fifth, and ninth filters are mainly used to extract features in the lower-right oblique direction, while the second, sixth, and eighth filters are mainly used to extract features in the lower-left oblique direction. Based on the architecture of the pre-trained CaffeNet, we obtain 96 feature maps in the first convolutional layer by applying these convolutional filters to a remote sensing image. In Figure 6(a), we find that the first, fifth, and ninth feature maps contain features of the input image in the lower-right oblique direction, while the second, sixth, and eighth feature maps contain features of the input image in the lower-left oblique direction. However, in Figure 6(b), we cannot see obvious features in these two directions in the corresponding feature maps. The input images in Figure 6(a) and (b) belong to the same remote scene class, yet the features extracted from them by the filters in the first convolutional layer of the pre-trained CaffeNet are very different from each other. Compared with the everyday optical images in the ImageNet dataset, remote sensing images are much more complex. Some convolutional filters in the shallow layers of the pre-trained CaffeNet may be effective for some remote sensing images while having little effect on others. Not all features in the shallow layers of the pre-trained CaffeNet are general for remote sensing images.
Furthermore, we try to visualize the features of the input remote sensing image learned in the deeper layers of the pre-trained CaffeNet. However, as we can see in Figure 7, the feature maps of the remote sensing image become increasingly fuzzy from the second convolutional layer onward. As depth increases, the representations of the remote scene become more and more abstract. To reveal how the depth of the pre-trained CaffeNet affects the generalization power of its features, we use the t-SNE algorithm [26, 27] to show the distribution of the features learned from the two input remote sensing images. In Figure 8, t-SNE visualizes the feature maps in different convolutional layers by giving each data point a location in a 2-D map.
Figure 8 shows the separability of features learned in different convolutional layers of the pre-trained CaffeNet when we apply it to two different remote sensing images that belong to the same remote scene class. In Figure 8(a), the 2-D features of the two input images are clearly separated from each other in the first convolutional layer. Notably, from Figure 8(a)–(e), the deeper the layer, the more overlap we can observe between the features of the two remote sensing images. Therefore, in contrast to the common belief that features in shallow layers are more generic, they are susceptible to changes in the input remote sensing images. Indeed, the filters of the first convolutional layer are similar to HOG, SURF, or SIFT (edge detectors, color detectors, texture, etc.). They give representative information for different input images. However, this information also conveys the specific characteristics of the dataset used to pre-train CaffeNet. As a result, features extracted in the shallow layers of the pre-trained CaffeNet may not be general enough for remote scene classification in the transferring process. On the other hand, it seems that the depth of the pre-trained CaffeNet enhances the generalization power of its features. Regardless of the specific meaning of edges or colors, high-level features in deeper layers represent the semantic meaning of the input remote sensing image. Based on this analysis of the features in the pre-trained CaffeNet, we believe that the depth of pre-trained CNNs brings a general hypothesis for remote scene classification. It plays an important role when we apply pre-trained CNNs to the task of remote scene classification.
The main objective of this chapter is to evaluate different deep CNNs transferred for remote scene classification. Therefore, we organize the experiments for transferred deep CNNs around various deep CNN architectures and various remote sensing datasets. We try to find out where the generalization power of deep CNNs comes from and to identify the proper way to apply deep CNNs to remote scene classification. All the developed code relies on the MatConvNet framework, which provides a complete deep learning toolkit for training and testing models. In addition, it should be noted that all the experiments are performed on an HP z820 with two Intel(R) Xeon(R) CPUs clocked at 2.60 GHz and 32 GB of RAM. An NVIDIA Quadro K2000 series card is used as the graphics processing unit.
4.1 Experimental setup
In this section, we carry out a number of experiments based on different architectures of deep CNNs. To evaluate the effectiveness of pre-trained deep CNNs transferred for remote scene classification, we conduct experiments on three remote sensing datasets. These three datasets differ in spatial and spectral information. We compare the performance of pre-trained deep CNNs with the state-of-the-art results on these three datasets. Note that, apart from learning the classifier, all the experiments are unsupervised.
The three publicly available datasets used in our experiments are as follows:
UC Merced land use dataset. This dataset is composed of 2100 overhead scene images divided into 21 land use scene classes. Each class consists of 100 aerial images measuring 256 × 256 pixels, with a spatial resolution of 0.3 m per pixel in the red-green-blue color space. The example images for each class are shown in Figure 9. This dataset was extracted from aerial orthoimagery downloaded from the United States Geological Survey (USGS) National Map of the following US regions: Birmingham, Boston, Buffalo, Columbus, Dallas, Harrisburg, Houston, Jacksonville, Las Vegas, Los Angeles, Miami, Napa, New York, Reno, San Diego, Santa Barbara, Seattle, Tampa, Tucson, and Ventura. So far, this dataset is the most popular one and has been widely used for the task of remote scene classification and retrieval.
WHU-RS dataset. Collected from Google Earth, this dataset is composed of 950 aerial scene images of 600 × 600 pixels, which are uniformly distributed over 19 scene classes, 50 for each class. The images have a spatial resolution of up to 0.5 m and red, green, and blue spectral bands; example images for each class are shown in Figure 10. This dataset is challenging due to the high variations in resolution, scale, orientation, and illumination of the images.
Brazilian coffee scenes dataset. This dataset consists of only two scene classes (coffee class and non-coffee class), and each class has 1438 image tiles with a size of 64 × 64 pixels cropped from SPOT satellite images over four counties in the State of Minas Gerais, Brazil: Arceburgo, Guaranesia, Guaxupe, and Monte Santo. This dataset uses the green, red, and near-infrared bands because they are the most useful and representative ones for distinguishing vegetation areas. Figure 11 shows three example images for each of the coffee and non-coffee classes in false colors.
In the experiments, we divide each dataset into five folds. For the UC Merced dataset, WHU-RS dataset, and Brazilian coffee scenes dataset, each of the five folds contains 420 images, 190 images, and 600 images, respectively. Then, the classification accuracy and standard deviation are calculated with fivefold cross-validation. Five well-known pre-trained deep CNNs (AlexNet, CaffeNet, VGG-VD16, GoogLeNet, and ResNet) described in Section 2 are used to test the effectiveness of pre-trained deep CNNs in the experiments. As noted above, all the experiments follow an unsupervised framework except for learning the classifier.
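The fivefold protocol can be sketched as follows. This is an illustration under assumptions, not the chapter's code: the helper names are hypothetical, the split is a simple seeded shuffle, the standard deviation is the sample standard deviation, and a trivial majority-class evaluator on toy labels stands in for the real feature extractor and classifier.

```python
import random
import statistics
from collections import Counter


def make_folds(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and deal them into `n_folds` near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[k::n_folds] for k in range(n_folds)]


def cross_validate(folds, evaluate):
    """Hold out each fold once; return mean accuracy and sample SD."""
    accuracies = []
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        accuracies.append(evaluate(train_idx, test_idx))
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# 2100 toy labels laid out like UC Merced: 21 classes, 100 samples each.
labels = [i % 21 for i in range(2100)]


def majority_baseline(train_idx, test_idx):
    """Stand-in evaluator: always predict the most frequent training label."""
    guess = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    return sum(labels[i] == guess for i in test_idx) / len(test_idx)


folds = make_folds(len(labels))
fold_sizes = [len(fold) for fold in folds]  # five folds of 420 indices each
mean_acc, sd_acc = cross_validate(folds, majority_baseline)
```

In an actual experiment, `majority_baseline` would be replaced by a function that trains the classifier on the features of the training folds and scores it on the held-out fold.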
4.2 Experiment results of remote scene classification
We evaluate transferred deep CNNs for the task of remote scene classification based on the five well-known deep CNN architectures (AlexNet, CaffeNet, VGG-VD16, GoogLeNet, and ResNet) pre-trained on ImageNet. To transfer the deep CNNs for remote scene classification, we use the five pre-trained deep CNNs to extract high-level features from the input images. These input images are resized by down-sampling or up-sampling to 227 × 227 for the pre-trained AlexNet and CaffeNet and to 224 × 224 for the pre-trained VGG-VD16, GoogLeNet, and ResNet. A linear SVM is used as the classifier.
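To make the final classification step concrete, the sketch below trains a binary linear SVM by full-batch subgradient descent on the regularized hinge loss. It is a toy illustration under assumptions: the solver, the hyperparameters, and the 2-D "feature vectors" are made up and do not reflect the chapter's actual SVM implementation, and a multi-class dataset would additionally need a scheme such as one-vs-rest.

```python
def train_linear_svm(features, labels, lr=0.1, reg=0.01, epochs=200):
    """Binary linear SVM (labels in {-1, +1}) trained by full-batch
    subgradient descent on the L2-regularized hinge loss."""
    n, dim = len(features), len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [reg * wi for wi in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for x, y in zip(features, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:  # only margin violators contribute
                for d in range(dim):
                    grad_w[d] -= y * x[d] / n
                grad_b -= y / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 according to the side of the separating hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Made-up 2-D vectors standing in for high-dimensional CNN activations.
features = [
    [2.0, 2.5], [3.0, 2.0], [2.5, 3.0],
    [-2.0, -2.5], [-3.0, -2.0], [-2.5, -3.0],
]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(features, labels)
predictions = [predict(w, b, x) for x in features]
```

Because the CNN weights stay fixed, this classifier is the only component that is fitted to the labeled remote sensing images.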
With various pre-trained deep CNN models and remote sensing datasets, the remote scene classification performances are shown in Table 1. In Table 1, Ac and SD denote accuracy and standard deviation, respectively.
|Pre-trained deep CNN|UC Merced Ac (%)|UC Merced SD|WHU-RS Ac (%)|WHU-RS SD|Brazilian coffee scenes Ac (%)|Brazilian coffee scenes SD|
In the experiment, pre-trained deep CNNs are directly used as feature extractors in an unsupervised manner. With the last fully connected layer removed, the remaining parts of the pre-trained deep CNNs extract high-dimensional feature vectors from remote sensing images. These feature vectors are taken as the final image representation and passed to a linear SVM classifier. From Table 1, we can see that all transferred deep CNNs generated from AlexNet, CaffeNet, VGG-VD16, and GoogLeNet achieve state-of-the-art performance. Pre-trained deep CNNs show strong generalization power in the transferring process. In addition, to our surprise, ResNets, the most successful deep CNNs to date, fail to obtain good results, whether they have 50, 101, or 152 layers. In ResNets, shortcut connections bring fewer parameters and make the network much easier to optimize. At the same time, the direct connection between input and output brings poor generalization ability when we transfer them to other tasks. On the other hand, as shown in Figure 11, the spatial information of the remote sensing images in the Brazilian coffee scenes dataset is very simple. However, these remote sensing images are not ordinary optical images; they use the green, red, and near-infrared bands. In Table 1, the relatively poor performance on this dataset comes from the difference in spectral information when we transfer pre-trained deep CNNs for remote scene classification.
To test the performance of transferred deep CNNs on each remote scene class, in Figure 12 we draw the confusion matrix of the experimental results on the UC Merced dataset based on the pre-trained CaffeNet.
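A confusion matrix of this kind can be computed as in the following sketch. The labels are toy values for three hypothetical classes, not results from the chapter; rows index the true class and columns the predicted class.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Count how often true class r (row) is predicted as class c (column)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def per_class_accuracy(matrix):
    """Diagonal count divided by the row total for each true class."""
    return [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(matrix)
    ]


# Toy labels: classes 0 and 1 are occasionally confused with each other.
true_labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
predicted_labels = [0, 0, 0, 1, 1, 1, 0, 1, 2, 2, 2, 2]
cm = confusion_matrix(true_labels, predicted_labels, n_classes=3)
class_acc = per_class_accuracy(cm)
```

Off-diagonal entries show which pairs of classes are mistaken for each other, which is how "close" classes such as similar residential categories become visible.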
In Figure 12, the experiment yields perfect or near-perfect accuracy for most of the scene categories. The relatively lower classification accuracy lies in the categories of buildings, dense residential, medium residential, and tennis court. However, all these classes have some very “close” neighbors. Taking dense residential as an example, it suffers from the presence of very close classes, like buildings and medium residential, which we cannot distinguish even by eye. Taking the pre-trained CaffeNet as an example, Figure 13 shows in detail how the representation of an optical remote sensing image changes through the network.
Reconstructions of the convolutional feature maps in the earlier network layers and of the fully connected layers, abbreviated as “conv” and “fc,” are shown in Figure 13. Figure 13 shows that the representations of the convolutional layers are still photographically similar to the remote sensing image to some extent, although they become fuzzier and fuzzier from “conv1” to “conv5.” In addition, the fully connected layers rearrange the information from lower layers to generate representations that are more abstract. They are composed of parts (e.g., the wings of airplanes) that are similar but not identical to the ones found in the original image.
In Table 2, we compare our best result achieved via transferred deep CNNs with various state-of-the-art methods on the UC Merced dataset. With a straightforward and simple framework, the transferred deep CNN achieves outstanding performance on this dataset. Note that our proposed method provides only a basic framework for directly transferring pre-trained deep CNNs to remote scene classification in an unsupervised manner. The effectiveness of the fine-tuning approach depends heavily on the number of images in the remote sensing dataset, and its computation time is higher than that of our proposed strategy.
To solve the problem that deep CNNs tend to over-fit when trained on a limited remote sensing dataset, the generalization power of deep CNNs plays the key role. In this chapter, we transfer deep CNNs pre-trained on everyday images to remote scene classification and provide insight into the generalization power of features in the transferred deep CNNs. In the extensive experiments above, the deep architecture of CNNs, which extracts semantic features of remote scenes, has proven to be critical for remote scene classification. Specifically, several practical observations from the experiments and some limitations of our study are summarized as follows:
- From Table 1, we can see that with our proposed method the classification accuracies on the UC Merced dataset and the WHU-RS dataset both reach state-of-the-art levels, near 95%. In addition, the small standard deviation of the classification accuracy suggests that our proposed method is stable when applied to remote scene classification. To our surprise, ResNets, the most successful deep CNNs to date, fail to obtain good results when we transfer them for remote scene classification, whether they have 50, 101, or 152 layers. Shortcut connections in ResNets bring poor generalization ability when we transfer them to remote scenes. This phenomenon indicates that not all successful deep CNNs are suitable for transfer to the task of remote scene classification.
- Contrary to the traditional view that all basic features (e.g., salient edges and borders) in the shallow layers of a deep CNN are more general than those learned in deep layers, we find that some features in the shallow layers of deep CNNs show poor generalization power when we transfer them for remote scene classification. High-level features learned in the deeper layers of transferred deep CNNs are more general than these basic features.
- In the remote sensing field, the scale of remote sensing datasets will keep growing. On the other hand, the structure of deep CNNs will be optimized, and they will need fewer and fewer parameters. Therefore, we could obtain more and more useful information from remote sensing datasets, which provides a priori knowledge for pre-trained deep CNNs and results in better generalization power.
Based on our study, future research directions for applying deep CNNs to remote scene classification may be as follows. First, as discussed above, when the most successful ResNet is transferred for remote scene classification, it does not work as we expected. What is the proper architecture of a deep CNN that is suitable for transfer to remote scenes? Second, instead of directly transferring pre-trained deep CNNs for remote scene classification, could we replace some of the basic features that show poor generalization power in the shallow layers of a transferred deep CNN? Finally, with more and more remote sensing information becoming available, how can we use this a priori knowledge when we apply deep CNNs to remote scene classification?
In this chapter, we have presented a framework to investigate the effectiveness of transferred deep CNNs for remote scene classification. We test transferred deep CNNs on different remote sensing datasets and take a close look at the generalization power of their features.
The two main conclusions of this work are as follows. (1) Most CNNs transferred from well-known pre-trained deep CNNs, provided they do not use shortcut connections in the deep architecture as ResNet does, achieve state-of-the-art performance in remote scene classification. (2) In the context of remote scene classification, we further confirm that the generalization power derived from deep architectures brings a general hypothesis. Compared with basic features (e.g., salient edges and borders), features in deeper layers are more general for remote scenes. Experiments on three remote sensing datasets with different image resolutions have provided insightful information. The transferred deep CNN improves the classification accuracy of remote scenes on the UC Merced dataset with a gain of up to 1.49% compared with other methods. High-level features in the deeper layers of transferred deep CNNs are more general for remote scene classification and result in satisfactory performance in an unsupervised setting.
We believe our work in this chapter provides a thorough analysis of the generalization power of transferred deep CNNs for remote scene classification. It can serve as a good baseline for applying deep CNNs to other remote sensing datasets.
This work was supported by the National Natural Science Foundation of China under Grant No. 61601499 and No. 61601505. All the funds above can cover the costs to publish in open access.
Conflict of interest
The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.