Satellite sensors’ characteristics and resolutions.
Remote sensing images and techniques are powerful tools to investigate earth’s surface. Data quality is the key to enhance remote sensing applications and obtaining clear and noise-free set of data is very difficult in most situations due to the varying acquisition (e.g., atmosphere and season), sensor and platform (e.g., satellite angles and sensor characteristics) conditions. With the increasing development of satellites, nowadays Terabytes of remote sensing images can be acquired every day. Therefore, information and data fusion can be particularly important in the remote sensing community. The fusion integrates data from various sources acquired asynchronously for information extraction, analysis, and quality improvement. In this chapter, we aim to discuss the theory of spatiotemporal fusion by investigating previous works, in addition to describing the basic concepts and some of its applications by summarizing our prior and ongoing works.
- spatiotemporal fusion
- satellite images
- depth images
- pixel-level spatiotemporal fusion
- feature-level spatiotemporal fusion
- decision-level spatiotemporal fusion
Obtaining a high-quality satellite image with a complete representation of earth’s surface is crucial to get clear interpretability of data, which can be used for monitoring and managing natural and urban resources. However, because of the internal and external influences of the imaging system and its surrounding environment, the quality of remote sensing data is often insufficient. The internal imaging system conditions include the spectral characteristics, resolution and other factors of the sensor, algorithms used to calibrate the images, etc. The surrounding environment refers to all external/environmental influences such as weather and season. These influences can cause errors and outliers within the images; for instance, shadow and cloud may cause obstructions in the scene and may occlude part of the information regarding an object. These errors must be resolved in order to produce high-quality remote sensing product (e.g., land-cover maps).
With the rapid and increasing development of satellite sensors and their capabilities, studies have shown that fusion of data from multisource, multitemporal images, or both is the key to recover the quality of a satellite image. Image fusion is known as the task of integrating two or more images into a single image [1, 2, 3]. The fusion of data essentially utilizes redundant information from multiple images to resolve or minimize uncertainties associated with the data, with goals such as to reject outliers, to replace and fill missing data points, and to enhance spatial and radiometric resolutions of the data. Fusion has been used in a wide range of remote sensing applications such as radiometric normalization, classification, change detection, etc. In general, there are two types of fusion algorithms: spatial-spectral [4, 5, 6, 7] and spatiotemporal fusion [8, 9, 10]. Spatial-spectral fusion uses the local information in a single image to predict the pixels’ true values based on spectrally similar neighboring pixels. It is used for various types of tasks and applications such as filling missing data (also known as image inpainting) and generating high-resolution images (e.g., pan-sharpening  and super-resolution ). It can include filtering approaches such as fusing information within a local window using methods such as interpolation [13, 14], maximum a posteriori (MAP), Bayesian model, Markov random fields (MRFs), and Neural Networks (NN) [4, 12, 15, 16, 17, 18]. Although spatial-spectral fusion is efficient, it is not able to incorporate information from temporal images, which produce dramatic radiometric differences such as those introduced by meteorological, phenological, or ecological changes. For instance, radiometric distortions and impurities in an image due to metrological changes (e.g., heavy cloud cover, haze, or shadow) cannot be entirely detected and suppressed by spatial-spectral fusion since it only operates locally within a single image. To address this issue, researchers suggested spatiotemporal fusion, which encompasses spatial-spectral fusion and offers a filtering algorithm that is invariant to dynamic changes over time, in addition to being robust against noise and radiometric variations. Identifying spatiotemporal patterns is the core to spatiotemporal fusion, where the patterns are intended to define a correlation between shape, size, texture, and intensity of adjacent pixels across images taken at different times, of different types, and from different sources.
Spatiotemporal fusion has been an active area of study over the last few decades . Many studies have shown that maximizing the amount of information through integrating the spatial, spectral, and temporal attributes can lead to accurate stable predictions and enhance the final output [8, 9, 19, 20, 21]. Spatiotemporal fusion can be applied within local and global fusion frameworks, where locally it can be performed using weighted functions and local windows around all pixels [22, 23, 24], and globally using optimization approaches [25, 26]. Additionally, spatiotemporal fusion can be performed on various data processing levels depending on the desired techniques and applications to be used . It also can depend on the type of data used; for instance, per-pixel operations are well suited for images acquired from the same imaging system (i.e., same sensor) since they undergo similar calibration process and minimum spectral differences in terms of having the same number of bands and bandwidth ranges in the spectrum, whereas feature- or decision-level fusion is more flexible and able to handle heterogeneous data such as combing elevation data (e.g., LiDAR) with satellite images . Fusion levels include:
Pixel-level image fusion: This is a direct low-level fusion approach. It involves pixel-to-pixel operation, where the physical information (e.g., intensity values, elevation, thermal values, etc.) associated with each pixel within two or more images is integrated into a single value . It includes methods such as spatial and temporal adaptive reflectance fusion model (STARFM), Spatial and Temporal Reflectance Unmixing Model (STRUM), etc. [22, 23, 24].
Feature-level image fusion: It involves extracting and matching distinctive features from two or more overlapping images using methods such as dimensionality reduction like principal component analysis (PCA), linear discriminant analysis (LDA), SIFT, SURF, etc. [2, 28]. Fusion is then performed using the extracted features and the coefficients corresponding to them [2, 29]. Some other common methods that include spatiotemporal fusion on feature-level are sparse representation and deep learning algorithms [10, 30, 31, 32, 33, 34, 35, 36, 37, 38].
Decision-level image fusion is a high-level of fusion method that requires each image to be processed individually until an output (e.g., classification map). The outputs are then postprocessed using decision-level fusion techniques [2, 39]. This level of fusion can include the previous two levels of fusion (i.e., per-pixel operations or extracted features) within its operation [40, 41].
In this chapter, we will focus on the concept, methods, and applications of the spatiotemporal-based fusion at all levels of fusion. We will discuss all aspects of spatiotemporal fusion starting from its concepts, preprocessing steps, the approaches, and techniques involved. We will also discuss some examples that apply spatiotemporal fusion for remote sensing applications.
This book chapter introduces the spatiotemporal analysis in fusion algorithms to improve the quality of remote sensing images. We will explore spatiotemporal fusion advantages and limitations, as well as, their applications and associated technicalities under three scenarios:
Pixel-level spatiotemporal fusion
Feature-level spatiotemporal fusion
Decision-level spatiotemporal fusion
The organization of this chapter is as follows: Section 2 describes remote sensing data and acquisition and generation processes and necessary preprocessing steps for all fusion levels. Section 3 talks about spatiotemporal fusion techniques under the three levels of fusion: pixel-level, feature-level, and decision-level, which can be applied to either multisource, multitemporal, or multisource multitemporal satellite images. Section 4 describes some applications applying spatiotemporal fusion, and finally Section 5 concludes the chapter.
2. Generic steps to spatiotemporal fusion
Spatiotemporal analysis allows investigation of data from various times and sources. The general workflow for any spatiotemporal fusion process is shown in Figure 1. The process description toward a fused image is demonstrated in Figure 1(a), where it describes the process of input acquisition, preprocessing steps, and finally the fusion. Data in remote sensing are either acquired directly from a sensor (e.g., satellite images) or indirectly generated using algorithms (e.g., depth image from dense image matching algorithms ) (see Figure 1(b)). It also includes data from single or multiple sources (see Figure 1(b)); however, combing multisource and multitemporal images requires preprocessing steps to assure data consistency for analyses. The preprocessing steps can include radiometric and geometric correction and alignment (see Figure 1(a)). The main spatiotemporal fusion algorithm is then performed using one or more of the three levels of fusion as a base for their method. In this section, we will discuss the most common preprocessing steps in spatiotemporal fusion, as well as, the importance and previous techniques used in spatiotemporal fusion in the three levels of fusion to improve the quality of images and remote sensing applications.
2.1 Data acquisition and generation
Today, there exists a tremendous number of satellite sensors with varying properties and configurations providing researchers with access to a large amount of satellite data. Remote sensing images can be acquired directly from sensors or indirectly using algorithms. It is also available with a wide range of properties and resolutions (i.e., spatial, spectral, and temporal resolutions), which are described in detail in Table 1.
|Type of resolution||Spatial resolution||Spectral resolution||Temporal resolution|
|Definition||Describes the ground area covered by a single pixel in the satellite images. It is also known as the ground sampling distance (GSD) and can range from a few hundreds of meters to sub-meters. Satellite sensors like Moderate Resolution Imaging Spectroradiometer (MODIS) produce coarse-resolution images with 250, 500, and 1000 meters, while fine-resolution images are produced by satellites like very high-resolution (VHR) satellites at the sub-meter level .||Refers to the ability of satellite sensors to capture images with wide ranges of the spectrum. It includes hyperspectral (HS) images with thousands of bands or multispectral (MS) images with few numbers of bands (up to 10–15 bands) . It may also include task-specific bands that are beneficial to study the environment and weather, like the thermal band as in Landsat 7 thematic mapper plus (ETM+) . Spectral resolution also refers to the wavelength interval in the spectral signal domain; for instance, MODIS has 36 bands falling between 0.4 and 14.4 μm, whereas Landsat 7 (ETM+) has 7 bands ranging from 0.45 to 0.9 μm.||It is the ability of satellite sensors to capture an object or phenomena in certain periods of time, also known as the revisiting time of sensor at a certain location on the ground. Today, modern satellite systems allow monitoring earth’s surface over short and regular periods of time; for instance, MODIS provides almost a daily coverage, while Landsat covers the entire earth surface every 16 days.|
2.1.1 Data acquisition
Generally, there exist two types of remote sensing sensor systems: active and passive sensors . Active sensors record the signal that is emitted from the sensor itself and received back when it reflects off the surface of the earth. They include sensors like Light Detection and Ranging (LiDAR) and Radar. Passive sensors record the reflected signal off the ground after being emitted from a natural light source like the Sun. They include satellite sensors that produce satellite images such as Landsat, Satellite Pour l’Observation de la Terre (SPOT), MODIS, etc.
2.1.2 Data generation
Sometimes in remote sensing, the derived data can be also taken as measurements. Examples include depth images with elevation data derived through photogrammetric techniques on satellite stereo or multi-stereo images , classification maps, change detection maps, etc. In this section, we will discuss two important examples of the commonly fused remote sensing data and their generation algorithms:
2.1.3 Depth maps (or digital surface model (DSM))
3D geometric elevation information can either be obtained directly using LiDAR or indirectly using dense image matching algorithms such as Multiview stereo (MVS) algorithms. However, because LiDAR data are expensive and often unavailable for historic data (before 1970s when LiDAR was developed), generating depth images using MVS algorithms is more convenient and efficient. MVS algorithms include several steps:
Images acquisition and selection to perform MVS algorithm requires having at least a pair or more of overlapping images captured from different viewing angles that assure selecting an adequate number of matching features. Specifically, this refers to the process of feature extraction and matching, where unique features are being detected and matched in pairs of images using feature detectors and descriptors methods such as Harris, SIFT, or SURF .
Dense image matching and depth map generation: Dense image matching refers to the process of producing dense correspondences between two or among multiple images, and with their pre-calculated geometrical relationship, depth/height information can be determined through ray triangulation . The dense correspondences problem, with pre-calculated image geometry, turns to a 1-D problem in rectified image (also called epipolar image) , called disparity computation, which is basically the difference between the left and right views as shown below:
where and are distance of pixel in the left and right images accordingly, f is the focal length, T is the distance between the cameras, and z is the depth. The depth (z) is then estimated from Eq.  by taking the focal length times the distance between the cameras divided by the disparity as follows:
2.1.4 Classification maps
Image classification can be divided into two categories: 1) Supervised classification is a user-guided process, where classification depends on a prior knowledge about the data that are extracted from the predefined training samples by the user; some popular supervised classification methods include support vector machine (SVM), random forest (RF), decision trees DT, etc. [49, 50, 51]. 2) Unsupervised classification is a machine-guided process, where the algorithms classify the pixels in the image by grouping similar pixels to come up with specific patterns that define each class. These techniques include segmentation, clustering, nearest neighbor classification, etc. .
2.2 Preprocessing steps
2.2.1 Geometric correction
Image registration and alignment is an essential preprocessing step in any remote sensing application that processes two or more images. For accurate analyses of multisource multitemporal images, it is necessary that overlapping pixels in the images correspond to the same coordinates or points on the earth’s surface. Registration can be performed manually by selecting control points (CPs) between a pair of images to determine the transformation parameters and wrap the images with respect to a reference image . An alternative approach is an automated CP extraction that operates based on mutual information (MI) and similarity measures of the intensity values . According to , there are a few common and sequential steps for image registration including the following steps:
Unique feature selection, extraction, and matching refers to the process where unique features are detected using feature extraction methods, then matched to their correspondences in a reference image. A feature can be a shape, texture, intensity of a pixel, edge, or an index such as vegetation and morphological index. According to [54, 55], features can be extracted based on the content of a pixel (e.g., intensity, depth value, or even texture) using methods such as SIFT, difference of Gaussian (DOG), Harris detection, and Histogram of oriented gradient (HOG) [53, 56, 57, 58] or based on patch of pixels [59, 60, 61] like using deep learning methods (e.g., convolutional neural networks (CNNs)), which can be used to extract complete objects to be used as features.
Transformation refers to the process of computing the transformation parameters (e.g., rotation, translation, scaling, etc.) necessary to convolve an image to a coordinate system that matches a reference image. The projection and transformation methods include similarity, affine, projective, etc. .
Resampling is the process where an image is converted into the same coordinate system as the reference image using the transformation parameters; it includes methods such as interpolation, bilinear, polynomial, etc. .
2.2.2 Radiometric correction
Radiometric correction is essential to remove spectral distortion and radiometric inconsistencies between the images. It can be performed either using absolute radiometric normalization (ARN) or relative radiometric normalization (RRN) [62, 63, 64]. ARN requires prior knowledge of physical information related to the scene (e.g., weather conditions) for normalization [63, 65, 66, 67], while, RRN radiometrically normalizes the images based on a reference image using methods such as dark object subtraction (DOS), histogram matching (HM), simple regression (SR), pseudo-invariant features (PIF), iteratively re-weighted MAD transformation, etc. [62, 64, 68].
3. Data analysis and spatiotemporal fusion
Pixels in remote sensing data are highly correlated over space and time due to earth’s surface characteristics, repeated patterns (i.e., close pixels belong to the same object/class), and dynamics (i.e., season). The general algorithm for spatiotemporal fusion is demonstrated in Figure 2, where all levels of fusion follow the same ideology. The minimum image requirement for spatiotemporal fusion is a pair of images whether they are acquired from multiple sources or time, the input images are represented with t1 to tn in Figure 2. The red square can be either a single raw pixel, an extracted feature vector, or processed pixel with valuable information (e.g., probability value indicating the class of a pixel). The fusion algorithm then finds the spatiotemporal patterns over space (i.e., the coordinates (x, y), and pixel content) and time (t) to predict accurate and precise values of the new pixels (see Figure 2). In this section, we will provide an overview of some previous works regarding spatiotemporal image fusion that emphasize on the importance of space-time correlation to enhance image quality and discuss this type of fusion in the context of three levels of fusion: pixel-level, feature-level, and decision-level.
3.1 Pixel-level spatiotemporal fusion
As mentioned in the introduction, pixel-based fusion is the most basic and direct approach to fuse multiple images by performing pixel-to-pixel operations; it has been used in a wide range of applications and is preferred because of its simplicity. Many studies performing pixel-level fusion algorithms realized the power of spatiotemporal analysis in fusion and used it in a wide range of applications such as monitoring, assessing, and managing natural sources (e.g., vegetation, cropland, forests, flood, etc.), as well as, urban areas . Most of the pixel-level spatiotemporal fusion algorithms operate as a filtering or weighted-function method; they process a group of pixels in a window surrounding each pixel to compute the corresponding spatial, spectral, and temporal weights (see Figure 3). A very popular spatiotemporal fusion method that set the base for many other fusion methods is spatial and temporal adaptive reflectance fusion model (STARFM); it is intended to generate a high-resolution image with precise spectral reflectance by merging multisource fine- and coarse-resolution images . Their method resamples the coarse-resolution MODIS image to have a matching resolution as the Landsat TM image, after that it computes the overall weight by calculating the spectral and temporal differences between the images. STARFM is highly effective in detecting phenological changes, but it fails to handle heterogeneous landscapes with rapid land-cover changes and around mixed pixels . To address this issue,  have proposed Enhanced STARFM (ESTARFM); it applies a conversion coefficient to assess the temporal differences between fine- and coarse-resolution images. In , Hilker also addressed the problem of drastic land-cover change by proposing Spatial Temporal Adaptive Algorithm for mapping Reflectance Change (STAARCH), which applies Tasseled cap transformation  to detect the seasonal changes over a landscape. For further improvement of these algorithms, studies have suggested using machine learning methods to identify similar pixels by their classes . The authors also show an example on using machine learning unsupervised classification within the spatiotemporal fusion to enhance its performance. They used clustering on one of the images using the ISODATA method , where pixels are considered similar if the difference between the current and central pixel in the window is less than one standard deviation of the pixels in the cluster. Other methods use filtering algorithms to enhance the spatial and spectral aspects of images, in addition to embedding the temporal analysis to further enhance the quality and performance of an application. For instance,  proposed a method that combines the basic bilateral filter with STARFM to estimate land surface temperature (LST). In , they proposed a 3D spatiotemporal filtering as a preprocessing step for relative radiometric normalization (RRN) to enhance the consistency of temporal images. Their idea revolves around finding the spatial and spectral similarities using a bilateral filter, followed by assessing the temporal similarities for each pixel against the entire set of images. The temporal weight, which assesses the degree of similarity, is computed using an average Euclidean distance using the multitemporal data. In addition to the weighted-based functions, approaches such as unmixing-based and hybrid-based methods are also common in spatiotemporal fusion . The unmixing-based methods predict the fine-resolution image reflectance by computing the mixed pixels from coarse-resolution image , while hybrid-based methods use a color mapping function that computes the transformation matrix from the coarse-resolution image and apply it on the finer resolution image .
3.2 Feature-level spatiotemporal fusion
Feature-level fusion is a more complex level of fusion, unlike pixel-based operations, it can efficiently handle heterogeneous data that vary in modality and source. According to , feature-based fusion can either be conducted directly using semantically equivalent features (e.g., edges) or through probability maps that transform images into semantically equivalent features. This characteristic allows fusion to be performed regardless of the type and source of information . Fusion can then be performed using arithmetic (e.g., addition, division, etc.) and statistical (e.g., mean, median, maximum, etc.) operations; the general process of feature-based fusion is shown in Figure 4. The approach in  demonstrates a simple example on feature-level spatiotemporal fusion to investigate and monitor deforestation; in their method, they combined data from meduim-resolution synthetic aperture radar (SAR) and MS Landsat data, they extracted features related to vegetation and soil location (using scattering information and Normalized Difference Fraction Index (NDFI) respectively), finally, fusion was performed through decision tree classifier. Both [26, 62] point out to the most popular methods in feature-level fusion, which include Laplacian pyramid, gradient pyramid, morphological pyramid, high-pass filter, and wavelet transform methods [77, 78, 79, 80, 81]. A very famous fusion example in this category is inverse discrete wavelet (IDW) transform, which is a wavelet transform fusion approach; it uses temporal images with varying spatial resolutions to down-sample the coarse-resolution image. It basically extracts the wavelet coefficients from the fine-resolution image and uses them to down-sample the coarse-resolution image . Sparse representation is another widely used learning-based method in feature-level fusion due to its good performance [8, 10, 30, 31]. All sparse representation algorithms share the same concept and core idea, where the general steps include: 1) dividing the input images into patches, 2) extracting distinctive features from the patches (e.g., high-frequency feature patches), 3) generating coefficients from the feature patches, 4) training jointly using dictionaries to find similar structures by extracting and matching feature patches, and finally, 5) fusion using the training information and extracted coefficients [8, 10, 30, 79, 83, 84].
Another state-of-the-art approach in feature- and decision-level fusion is deep learning or artificial neural networks (ANNs). They are currently a very active area of interest in many remote sensing fields (especially image classification) due to their outstanding performance that surpasses traditional methods [32, 33, 34, 35, 36, 37, 38, 76, 82, 83, 84]. They are also capable of dealing with multi-modality like images from varying sources and heterogeneous data, for instance, super-resolution and pan-sharpening images from different sensors, combining HS and MS images, combining images with SAR or LiDAR data, etc. [32, 33, 34, 35, 36, 37, 38]. In feature-level fusion, the ANN is either performed on the images for feature extraction or to learn from the data itself . The extracted features from the temporal images or classification map are used as an input layer, which are then weighted and convoluted within several intermediate hidden layers to result in the final fused image [32, 33, 34, 35, 37, 79]. For instance,  uses neural networks (CNN) to extract features from RGB image and a DSM elevation map, which are then fed into the SVM training model to generate an enhanced semantic labeling map. ANNs have also been widely used to solve problems related to change detection of bi-temporal images such as comparing multi-resolution images  or multisource images , which can be solved in a feature-learning representation fashion. For instance, the method in  directly compares stacked features extracted from a registered pair of images using deep belief networks (DBNs).
3.3 Decision-level spatiotemporal fusion
The decision-level fusion operates on a product level, where it requires images to be fully and independently processed until the meaningful output is obtained (e.g., classification or change detection maps) (see Figure 5). Decision-level fusion can adapt to different modularities like combing heterogeneous data such as satellite and depth images, which can be processed to common outputs (e.g., full/partial classification maps) for fusion . Additionally, the techniques followed by this fusion type are often performed under the umbrella of Boolean or statistical operations using methods like likelihood estimation, voting (e.g., majority voting, Dempster-Shafer’s estimation, fuzzy Logic, weighted sum, etc.) [88, 89, 90]. In , they provide an example on the mechanism of decision-level fusion; they developed a fusion approach to detect cracks and defects on ground surface, they first convert multitemporal images into spatial density maps using kernel density estimation (KDE), then, fused the pixels density values using a likelihood estimation method. In general, most of the decision-level fusion techniques rely on probabilistic methods, where they require generating an initial label map with each pixel upholding a probability value and indicating its belonging to a certain class, which can be generated using traditional classification methods like the supervised (e.g., random forest) or unsupervised (e.g., clustering or segmentation) classification (see Section 2.1.2.). Another advantage of the decision-level fusion is that it can be implemented while incorporating both levels of fusion, the pixel- and feature-level fusion. The method in  shows a spatiotemporal fusion algorithm that includes all levels of fusion, where they propose a post-classification refinement algorithm to enhance the classification maps. First, they generate probability maps for all temporal images using random forest classifier (as an initial classification map); then they use a recursive approach to iteratively process every pixel in the probability maps by fusing the multitemporal probability maps with the elevation from the DSMs using a 3D spatiotemporal filtering. Similarly,  have also proposed fusion of probability maps for building detection purposes, where they first generate the probability maps, then fuse them using a simple 3D bilateral filter.
Recently, more focus has been driven toward using spatiotemporal fusion to recover the quality of 3D depth images generated from MVS (e.g., DSM fusion). Median filtering is the oldest and most common fusion approach for depth images; it operates by computing the median depth of each pixel from a group of pixels at the same location in the temporal images . The median filtering is robust to outliers and is efficient in filling missing depth values. However, the median filter only exploits the temporal domain; to further enhance its performance and the precision of the depth values, studies suggest spatiotemporal median filtering. In , the authors have proposed an adaptive median filtering that operates based on the class of the pixels; they use an adaptive window to isolate pixels belonging to the same class, then choose the median pixel based on the location (i.e., adaptive window) and temporal images. In , the authors also show that spatiotemporal median filtering can be improved by adopting an adaptive weighing-filtering function that involves assessing the uncertainty of each class in the spatial and temporal domains in the depth images using standard deviation. The uncertainty will then be used as the bandwidth parameter to filter each class individually. The authors in  also suggested a per-pixel fusing technique to select the depth value for each pixel by using a recursive K-median clustering approach that generates one to eight clusters until it reaches the desired precision.
Other complex yet efficient methods used in decision-level fusion are deep learning algorithms as mentioned previously in Section 2.3.2. . They are either used as postprocessing refinement approaches or to learn end-to-end from a model . For example, the method in  used a postprocessing enhancement step for semantic labeling, where they first generate probability maps using two different methods, RF and CNN using multimodal data (i.e., images and depth images), then they fused the probability maps using Conditional random fields (CRFs) as postprocessing approach. In , on the other hand, the authors used a model learning-based method, where they first semantically segment multisource data (i.e., image and depth image) using a SegNet network, then fuse their scores using a residual learning approach.
4. Examples on spatiotemporal fusion applications
4.1 Spatiotemporal fusion of 2D images
4.1.1 Background and objective
A 3D spatial-temporal filtering algorithm is proposed in  to achieve relative radiometric normalization (RRN) by fusing information from multitemporal images. RRN is an important preprocessing step in any remote sensing application that requires image comparison (e.g., change detection) or matching (e.g., image mosaic, 3D reconstruction, etc.). RRN is intended to enhance the radiometric consistency across set of images, in addition to reducing radiometric distortions that result due to sensor and acquisition conditions (as mentioned in Section 1.1.). Traditional RRN methods use a single reference image to radiometrically normalize the rest of the images. The quality of the normalized images highly depends on the reference image, which requires the reference image to be noise-free or to have minimum radiometric distortions. Thus, the objective of  is to generate high-quality radiometrically consistent images with minimum distortion by developing an algorithm that fuses the spatial, spectral, and temporal information across a set of images.
The core of the 3D spatiotemporal filter is based on the bilateral filter, which is used to preserve the spectral and spatial details. It is a weighting function that applies pixel-level fusion on images from multiple dates (see Figure 4(a)). The general form of this filter is as follows:
where the original and filtered images are indicated using I and . The weight for every pixel at point j into the fused pixel is indicated using . The filtering is carried out on the entire space of the set of imagesincluding all domains, that is, the spatial (i.e., pixels’ coordinates (x, y)), the spectral (i.e., intensity value), and temporal (i.e., intensity of temporal images). The spatial and spectral weights are described by  and are indicated in Eqs. (4) and (5) respectively
where, I is the pixel value at x and y locations, and and are the spatial and spectral bandwidths respectively that set the degree of filtering based on the spatial and spectral similarities between the central pixel and nearby pixels. The novelty of this filter is in the design of the temporal weight, where it computes the resemblance between every image and the entire set of images using an average Euclidean distance as the follows
where are the difference between the current image being processed and all other images and is the degree of filtering along the temporal direction. Eq. (6) allows all images to contribute toward each other in enhancing the overall radiometric characteristics and consistency without requiring a reference image for the RRN process.
4.1.3 Experimental results and analysis
The 3D spatial-temporal filter was conducted on three experiments with varying resolutions and complexities. Experiments 1 and 2 were applied on urban and sub-urban areas respectively; each experiment had five medium-resolution images from Landsat 8 satellite (with 15- to 30- m spatial resolution). Experiment 3 was on a fine-resolution image from Planet satellite (with 3-m spatial resolution).
Figure 6(b) and (c) shows an example of the input and results of the filter using the data from experiment 1 (i.e., the urban area). The input images show a significant discrepancy in the radiometric appearance (see Figure 6(b)); however, the heterogeneity between multitemporal images is reduced after the filtering process (see Figure 6(c)). By comparing the original and filtered images in Figure 6(c), we can notice that the land covers are more similar in the filtered images than in the original images. For instance, the water surface (shown in Figure 6(c) in blue bold dashed line) used to have a clear contrast in intensity in the original images, but after the filtering process, they become more spectrally alike in terms of intensity looks and ranges.
The experiments are also validated numerically using transfer learning classification (using SVM) to test the consistency between the normalized filtered images. The transfer learning classification uses a reference training data from one image and applies it to the rest of the images. The results in Table 2 indicate that the filtered images have higher accuracy than the nonfiltered original images, where the average improvement in accuracy is ∼6%, 19%, and 2% in all three experiments respectively. Reducing the uncertainty in the filtering process by not requiring a reference image for normalization was the key to this algorithm. The algorithm was formulated to take advantage of the temporal direction by treating all images in the dataset as a reference. Therefore, it will have higher confidence to distinguish between actual objects and radiometric distortions (like clouds) in the scene when processing each pixel.
|Transfer learning classification|
|Exp. I Suburban|
|Exp. I - Urban|
4.2 Spatiotemporal fusion of multisource multitemporal images
4.2.1 Background and objective
Multitemporal and multisource satellite images often generate inconsistent classification maps. Noise and misclassifications are inevitable when classifying satellite images, and the precision and accuracy of classification maps vary based on the radiometric quality of the images. The radiometric quality is a function of the acquisition and sensor conditions as mentioned in the background in Section 1.1. The algorithm can also play a major role in the accuracy of the results; some classification algorithms are more efficient than others, while some can be sensitive to the spatial details in the images like complex dense areas and repeated patterns, which lead objects of different classes to have similar spectral reflectance. The acquisition time, type of algorithm, and distribution of objects in the scene are huge factors that can degrade the quality and generate inconsistent classification maps across different times. To address these issues, the authors in  proposed a 3D iterative spatiotemporal filtering to enhance the classification maps of multitemporal very high-resolution satellite images. Since the 3D geometric information is more stable and is invariant to spectral changes across temporal images,  proposed combining the 3D geometric information in the DSM with multitemporal classification maps to provide spectrally invariant algorithm.
The 3D iterative spatiotemporal filter is a fusion method that combines information from various types, sources, and times. The algorithm is a combination of feature and decision levels of fusion; it is described in detail in Algorithm 1. The first step is to generate initial probability maps for all images using random forest classification. The inference model is then built to recursively process every pixel in the probability maps using a normalized weighing function that computes the total weight based on the spatial (), spectral (), and temporal () similarities. The temporal weight is based on the elevation values in the DSMs. The probability value for every pixel is computed and updated using and the previous iteration until it satisfies the convergence condition, which requires the difference between the current and previous iterations to be under a certain limit.
Algorithm 1: Pseudo code of the proposed 3D iterative spatiotemporal filter 
4.2.3 Experimental results and analysis
The proposed filter was applied to three datasets that include an open area, residential area, and school area. The input data include multisource and multitemporal very high-resolution images and DSMs; the probability maps were created for six types of classes: buildings, long-term or temporary lodges, trees, grass, ground, and roads (see Figure 7(a) for more details about the input data). Figure 7(b) shows a sample of the filtering results. We can see that the initial classification of the building (circled with an ellipse) is mostly incorrectly classified to long-term lodge; however, it keeps improving as the filtering proceeds through the iterations.
The overall accuracy was reported, and it indicates that the overall enhancement in the accuracy is about ∼2–6% (see Table 3). We can also notice that dense areas such as the residential area have the lowest accuracy range (around 85%), while the rest of the study areas had accuracy improvement in the 90% range. It indicates that the filtering algorithm is dependent on the degree of density and complexity in the scene, where objects are hard to distinguish in condensed areas due to mixed pixel and spectral similarity of different objects.
|Date||Test region 1||Test region 2||Test region 3|
|Before (%)||After (%)||∆ (%)||Before (%)||After (%)||∆ (%)||Before (%)||After (%)||∆ (%)|
4.3 Spatiotemporal fusion of 3D depth maps
4.3.1 Background and objective
Obtaining high-quality depth images (also known as depth maps) is essential for remote sensing applications that process 3D geometric information like 3D reconstruction. MVS algorithms are widely used approaches to obtain depth images (see Section 2.1.2.); however, depth maps generated using MVS often contain noise, outliers, and incomplete representation of depth like having missing data, holes, or fuzzy edges and boundaries. A common approach to recover the depth map is by fusing several depth maps through probabilistic or deterministic methods. However, most fusion techniques in image processing focus on the fusion of depth images from Kinect or video scenes, which cannot be directly applied on depth generated from satellite images due to the nature of images. The difference between depth generated from satellite sensors and Kinect or video cameras include:
Images captured indoor using Kinect or video cameras have less noise, since they are not exposed to external environmental influences like atmospheric effects.
Kinect or video cameras generate a large volume of images, which can improve dense matching, while the number of satellite images is limited due to the temporal resolution of the satellite sensor.
The depth from satellite images is highly sensitive to the constant changes in the environment and the spatial characteristics of the earth surface like the repeated patterns, complexity, sparsity, and density of objects in the scene, which can obstruct or create mismatching errors in the dense image matching process.
Most depth fusion algorithms for geospatial data focus on median filtering (see Section 4.3.), but it still needs some improvement in terms of robustness and adaptivity to the scene content. To address the aforementioned problems,  proposed an adaptive and semantic-guided spatiotemporal filtering algorithm to generate a single depth map with high precision. The adaptivity is implemented to address the issue of varied uncertainty for objects of different classes.
The adaptive and semantic-guided spatiotemporal filter is a pixel-based fusion method, where the depth of the fused pixel is inferred using multitemporal depths and a prior knowledge about the pixel class and uncertainty. A reference orthophoto is classified using a rule-based classification approach that uses normalized DSM (nDSM) with indices such as normalized difference vegetation index (NDVI). The uncertainty is then measured for all four classes (trees, grass, buildings, and ground and roads) using the standard deviation. The uncertainty is measured spatially using the classification map and also across the temporal images. The adaptive and semantic-guided spatiotemporal filter is intended to enhance the median filter, thus it uses height as the base to the fused pixel, where the general form of the filter is expressed as
where is the fused pixel; i, j are the pixel’s coordinates; is the median height value from the temporal DSMs; and the spectral, spatial, and temporal height weights are expressed as , , and respectively. The and are described in Eqs. (4) and (5) that measure the spectral and spatial components from the orthophoto. The is a measure of similarity for the height data across temporal images, and it can be computed using the following formula:
where is the adaptive height bandwidth, which varies based on the class of pixel as follows:
4.3.3 Experimental results and analysis
The method in  was experimented on three datasets with varying complexities. The satellite images are taken from the World-View III sensor, and depth is generated using MVS algorithm on every image pair using RSP (RPC Stereo Processor) software developed by  and semi-global matching (SGM) algorithm . Figure 8 describes the procedures followed by the fusion algorithm, in addition to the visual results where it shows that noise and missing elevation points were recovered in the fused image. The validation of three experiments shows that this fusion technique can achieve up to 2% increase in the overall accuracy of the depth map.
Spatiotemporal fusion is one of the powerful techniques to enhance the quality of remote sensing data, hence, the performance of its applications. Recently, it has been drawing great attention in many fields, due to its capability to analyze and relate the space-interaction on ground, which can lead to promising results in terms of stability, precision, and accuracy. The redundant temporal information is useful to develop a time-invariant fusion algorithm that leads to the same inference from the multitemporal geospatial data regardless of the noise and changes that occur occasionally due to natural (e.g., metrology, ecology, and phenology) or instrumental (e.g., sensor conditions) causes. Therefore, incorporating spatiotemporal analysis in any of the three levels of fusion can boost their performance, where it can be flexible to handle data from multiple sources, types, and times. Despite the effectiveness of spatiotemporal fusion, there are still some issues that may affect the precision and accuracy of the final output. These considerations must be taken into account while designing the spatiotemporal fusion algorithm. For example, spatiotemporal analysis for per-pixel operations is highly sensitive to mixed pixels especially for coarse-resolution images where one pixel may contain the spectral information of more than one object. The accuracy of the spatiotemporal fusion can also be sensitive to the complexity of the scene, where in densely congested areas such as cities the accuracy may be less than open areas or sub-urban areas (as mentioned in the examples in Section 4.). This is due to the increase in the heterogeneity of the images in these dense areas. This issue can be solved using adaptive spatiotemporal fusion algorithms, which is a not widely investigated area of study in current practices. Feature and decision levels of fusion can partially solve this problem by learning from patches of features or classified images, but their accuracy will also be under the influence of the feature extraction algorithm or the algorithm to derive the initial output. For instance, mismatching features can result in fusing unrelated features or data points, thus produce inaccurate coefficients for the feature-level fusion model. Another observation is the lack of studies that relates the number of temporal images and the fusion output accuracy, which is useful to decide the optimal number of input images for fusion. Additionally, it is rarely seen that the integrated images are picked before fusion, where assessing and choosing good images can lead to better results. Spatiotemporal fusion algorithms are either local or global approaches, the local algorithms are simple and forward like pixel-level fusion or local filtering like the methods in [19, 22], while global methods tend to perform extensive operations for optimization purposes like in . In future works, we aim to explore how these explicitly modeled spatiotemporal fusion algorithms can be enhanced by the power of more complex and inherent models such as deep learning-based models to drive more important remote sensing applications.
The authors would like to express their gratitude to Planet Co. for providing them with the data; to sustainable institute at the Ohio state university, Office of Naval Research (Award No. N000141712928) for partial support of the research; and to the Johns Hopkins University Applied Physics Laboratory and IARPA and the IEEE GRSS Image Analysis and Data Fusion Technical Committee for making the benchmark satellite images available.