Selected Issues and Constraints of Image Matching in Terrain-Aided Navigation: A Comparative Study

The sensitivity of global navigation satellite systems to disruptions precludes their use in conditions of armed conflict with an opponent possessing comparable technical capabilities. In military unmanned aerial vehicles (UAVs) the aim is to obtain navigational data to determine the location and correction of flight routes by means of other types of navigational systems. To correct the position of an UAV relative to a given trajectory, the systems that associate reference terrain maps with image information can be used. Over the last dozen or so years, new, effective algorithms for matching digital images have been developed. The results of their performance effectiveness are based on images that are fragments taken from source files, and therefore their qualitatively identical counterparts exist in the reference images. However, the differences between the reference image stored in the memory of navigation system and the image recorded by the sensor can be significant. In this paper modern methods of image registration and matching to UAV position refinement are compared, and adaptation of available methods to the operating conditions of the UAV navigation system is discussed.


Introduction
Global navigation satellite systems are widely used in both civil and military technology areas. The advantage of such systems is very high accuracy in determining the coordinates, however, the possibility of easy interference precludes their use in conditions of armed conflict with an opponent equipped with comparable technical capabilities. In the case of military autonomous unmanned aerial vehicles (UAVs), in particular cruise missiles (CM), the aim is therefore to determine navigation data for specifying the position and correcting the flight paths by means of other types of navigation and self-guidance systems.
Such systems are usually based on inertial navigation systems (INS) which use accelerometers, angular rate gyroscopes and magnetometers to provide relatively accurate tracking of an object's position and orientation in space. However, they are exposed to drift and systematic errors of sensors, hence the divergence between the actual and the measured position of the object is constantly increasing with time. This results in a significant navigational error. Therefore, two types of systems designed to correct the position of an object in relation to a given trajectory are normally used in the solutions of the UAV/CM navigation and self-guidance systems. The first group contains systems whose task is to determine the position on the basis of data obtained from radio altimeters, related to reference height maps. Such systems include, for example: TERCOM (terrain contour matching), used in Tomahawk cruise missiles, SITAN (Sandia inertial terrain-aided navigation), using terrain gradients as input for the modified extended Kalman filter (EKF) estimating the position of the object, and VATAN (Viterbi-algorithm terrain-aided navigation), a version of the system based on the Viterbi algorithm and characterisedin relation to the SITAN systemwith lower mean square error of position estimation [1][2][3][4][5]. The main disadvantage of these solutions is the active operation of measuring devices, which reveals the position of the object in space and eliminates the advantages associated with the use of a passive (and therefore undetectable) inertial navigation system. The second group consists of systems associating reference terrain maps with image information obtained by means of visible, residual or infrared light cameras [6,7]. Such systems include the American DSMAC (digital scene matching area correlator), also used in Tomahawk missiles [8,9], and its Russian counterpart used in Kalibr (aka Club) missiles. Their advantage is both the accuracy of positioning and the secrecy (understood as passivity) of operation.
Due to the dynamic development of UAVs/CMs equipped with navigation systems operating independently of satellite systems and a number of problems associated with the implementation of the discussed issue, the assessment on the sensitivity of the selected methods to environmental conditions and constraints in the measurement systems, which often negatively affect the results obtained, has been carried out. The essence of the work is to consider issues related to the processing of image information obtained from optical sensors carried by UAV/CM and its association with terrain reference images. In particular, issues of the correctness of image data matching and the limitations of the possibilities of their similarities' assessment are considered. The article compares modern image matching methods assuming real conditions for obtaining information. The main goal set by the authors is to verify selected algorithms, identify the key aspects determining the effectiveness of their operation and indicate potential directions of their development.

State of the art
The operation of classic object identification algorithms, indicating the similarities between the recorded and reference images (the so-called patterns), is mainly based on the use of correlation methods. These algorithms, although effectively implemented in solutions to typical technical problems, are insufficiently effective in the case of topographic navigation. It is related to, inter alia, the limitations and conditions in the measurement system, environmental conditions and characteristics of the detected objects, which have a strong negative impact on the obtained correlation results. This disqualifies the possibility of their direct use in the tasks of matching reference terrain maps with the acquired image information.
A particularly significant obstacle is the fact that the sensory elements of navigation systems installed on UAV/CM record image data in various environmental and lighting conditions [10]. Frequently, reference data of high informative value, due to various conditions, constitute a pattern of little use or even lead to incorrect results. This is the case, for example, when the reconnaissance is conducted in different weather conditions than those in which the UAV/CM mission takes place (Figure 1). Therefore, image feature matching becomes a complex issue. The conditions related to the image recording parameters, e.g. variable view angle, maintaining scale or using various types of sensors, turn out to be equally important.
Image matching methods began to be strongly developed with the dissemination of digital image in technology. Initially, the classical Fourier and correlation methods were used. However, these methods did not allow for successful multimodal, multi-temporal, and multi-perspective matching of different images. The taxonomy of the classical methods used in the image matching process was presented in the early 1990s [11]. The image feature space, considered as a source of information necessary for image matching, was defined and local variations in the image were identified as the greatest difficulty in the matching process. In the 21st century, further development of methods based on the features of the image continued [12]. It should be emphasised that most image matching methods based on image features include four consecutive basic steps: feature detection, feature matching, model estimation of mutual image transformation and final transformation of the analysed image. These methods became an alternative to the correlation and Fourier methods. For over a dozen years, new, effective algorithms for processing and matching digital images have been developed, using statistical methods based on matching local features in images [11][12][13][14], cf. Figure 2. Their authors point to the greater invariance of the proposed algorithms to perspective distortions, rotation, translation, scaling and lighting changes. Given their high reliability under static conditions, as well as their low sensitivity to changes in the optical system's position, including translation, orientation and scale, it is justified to conduct studies in order to verify their usefulness and effectiveness. The paper focuses on modern image matching algorithms, which can potentially be used in topographic navigation issues. It should be stressed that the problem is completely different in the indicated context. This is due to the fact that although the matched images represent the same area of the terrain, the manner and time of the recording differ significantly from each other. This is not a typical application of these algorithms, hence a limited effectiveness of their operation can be expected.
The common feature of all methods is the use of the so-called scale space described in [15], allowing the decimation of image data and examination of similarities between images of different scales. A significant step in the development of image matching methods based on local features was the development of the Scale-Invariant Feature Transform (SIFT) algorithm [16]. In this algorithm, the characteristics are selected locally and their position does not change while the image is scaled. Their indication is done by determining the local extremes of function Dx ðÞ , that is the difference between the results of image Ix , y ðÞ convolution with Gaussian functions Gx , y, σ ðÞ with different values of scale parameter σ: A more numerically efficient version of the SIFT algorithm, called Speeded-Up Robust Features (SURF) is based on the so-called integral images [17]. Both methods use the basic processing steps described in [12]. Additionally, in order to ensure the effectiveness of feature detection in images of different resolutions, a scale space, consisting of octaves which represent the series of responses of a convolutional filter with a variable size, was introduced.
Simply put, the detection of the characteristic point is based on the use of the determinant of a Hessian matrix det H ðÞ . In the case of SURF, the second-order derivatives of the Gaussian function G approximated by the box filters B xx , B xy , B yy , and the integral image are also used [18]. The Hessian matrix in these methods takes the form L xy x, y, σ ðÞ L yy x, y, σ ðÞ where The determinant of the Hessian matrix after approximation using box filters and the Frobenius norm is given as After detecting the local extremes of det H ðÞ ,similarlytoDx ðÞ , the location of characteristic points representing local features, called blob-like for the SIFT and SURF methods, is determined. In this step, the SIFT method also rejects features whose contrast is lower than the assumed threshold t by comparing | Dx ðÞ | < t as well as points lying on isolated edges. This is done by comparing the value of the Hessian matrix H trace quotient, and its determinant with the curvature coefficient r: In 2011, an alternative method to SIFT and SURF, Oriented FAST and Rotated BRIEF (ORB), was proposed [19]. The method was based on the modified Features from Accelerated Segment Test (FAST) detector [20,21], enabling corner and edge detection, and a modified Binary Robust Independent Elementary Features (BRIEF) descriptor [22]. This approach involves changing the scale of the image on the basis of blurring with an increasing value of Gaussian filter. Despite the noise reduction and enhancing the uniformity of areas interpreted by human beings as unique (e.g. surface of the lake, wall of a building, shape of a vehicle, etc.), it causes blurring of their edges. This often leads to the inability to indicate the boundaries between areas and to define characteristic points in their neighbourhood.
The solution to this problem was proposed in the KAZE method (Japanese for "wind") [23]. Unlike the SIFT and SURF methods, which use the Gaussian function causing isotropic diffusion of luminance to generalise the image, in the KAZE method the generalisation is based on nonlinear diffusion in consecutive octaves of the scale [24]. The anisotropic image blurring in this method depends on the local luminance distribution. Nonlinear diffusion can be presented in the following equation: The blur intensity can be adapted by the introduced conductivity function c, which is usually related to time. However, using the approach proposed in [15], parameter t is related to the image scale. Various forms of the conductivity function c were proposed in related works developing the use of nonlinear diffusion in the context of image filtration [24][25][26]. One of the functions used for nonlinear diffusion can be: where ∇L σ is the gradient of the Gaussian blurred function of the original image I on the scale σ, and k is the contrast ratio.
This function allows for blurring the image while maintaining the edges of structures. As a result, more features can be detected at different image scales. However, it involves the use of a gradient, which in the case of intense image disturbance, e.g. in the form of a shadow, may cause an unfavourable (due to the subsequent detection of features) distribution of diffusion in the image.
An important stage of the considered methods is the description of a characteristic point by means of a vector containing information about its surroundings. The SIFT method uses a luminance gradient and in the SURF method the image response to horizontally and vertically oriented Haar wavelet is applied. In general, around the characteristic point in the area with a defined radius dependent on the σ scale, a certain number of cells are created and dominant values of the gradient or the responses to Haar wavelets are determined. These are the basis for calculating the so-called feature metrics. Finally, the dominant orientation is established. Characteristic features in the SIFT method are determined by where mx , y ðÞ is the gradient size, θ x, y ðÞ is the orientation, and L σ is the blurred image discussed above.
The SIFT method creates thereunder a gradient histogram that sums up the determined values in four cells. In analogous cells, according to the SURF method, the responses to Haar wavelets distributed along the radii in the neighbourhood of the point with an interval of π=3 are summed up. In each SURF subregion, a vector v is determined: where d x and d y are the characteristic point's neighbourhood responses to the horizontally and vertically oriented Haar wavelets, respectively.
In the KAZE method the procedure is similar as for the SURF method with the difference that the first order derivatives from the image function are used. The point description operation is performed for all levels in the adopted scale space, thereby creating a pyramid of vectors assigned to subsequent levels containing an increasingly generalised image.
The Maximally Stable Extremal Regions (MSER) method introduced in [27] has a different approach to the detection and description of local features. In this method, regions (shapes), referred to as maximally stable, are selected as the characteristics of the image. The image in this method is treated as a function I which transforms where D is the domain of I and S is its set of values, usually S ¼ 0, 1, … , 255 fg . Regions (areas, shapes) with a specific (typically average) luminance level can be determined in the image. Region Q is understood as a subset of pixels in the image which is a continuous subset of D such that ∀p, q ∈ Q exist sequences p, a 1 , a 2 , … , q and pAa 1 , … , a i Aa iþ1 , … , a n Aq, where A ⊂ DÂDis the neighbourhood relation and the formula a i Aa iþ1 is the neighbourhood between pixels a i and a iþ1 . The extremal region is Q ⊂ D, such that ∀p ∈ Q, ∀q ∈ ∂Q : Ip ðÞ > Iq ðÞ(maximum intensity region) or Ip ðÞ < Iq ðÞ(minimum intensity region). The desired maximally stable extremal region (MSER) is the region R ¼ Q i * , which for the sequence of extremal regions Q 1 , … , Q iÀ1 , Q i , nested i.e. Q i ⊂ Q iþ1 and for qi ðÞ¼|Q iþΔ \Q iÀΔ |=|Q i | has a local minimum at i * . Whereas Δ ∈ S is the stability parameter and the luminance threshold. The procedure of determining MSER regions is repeated throughout the assumed σ scale space.
In the feature description stage, a vector using image moments is determined for each region. Based on the moments m 00 , m 01 , … , m 20 , the centre of gravity of each MSER region and the ellipse approximating the region are determined according to the procedure described in [28]. The ellipse equation is given as The orientation θ and the size of the ellipse defined by its a 1 and a 2 axes allow to describe the features of the region taken for comparison in the matching step. The moment m of the order p þ q ðÞ of the MSER region used to determine the centre of gravity of the C ¼ Ix g , y g region can be represented as follows: The use of moments and the centre of gravity is also a feature of ORB method which uses machine learning approach for corner detection. After their detection, based on the image moments, the centre of gravity C is determined for each corner according to the formula where On the basis of the corner's position and centre of gravity, the orientation of the feature is determined as shown in the equation: The feature description step uses the assigned orientation to complete the binary BRIEF descriptor [22], with the condition of verifying the belonging of the point L σ x, y ðÞ to the matrix W θ . It is based on the simple comparison of the pixel luminance values in the neighbourhood of the feature: The matrix W θ is the product of the original matrix W, containing the locations of the points, which are subject to the tests, and the rotation matrix based on the determined angles θ x, y ðÞ . In such case the ORB vector describing the feature takes the form: The common element for the described methods is the stage of comparing the distinguished features detected on the reference and registered images. It is of fundamental importance in the field of absolute terrain position designation, because the location of the matched features is the source of determining the matrix of mutual image transformation. In this comparison, vectors describing the features in a given method, e.g. feature metric and its orientation, are taken into account.
The determination of the similarity between the feature description vectors v a and v b is based on various measures. The most commonly used are the distances defined as follows: The third frequently used norm for binary vectors is the Hamming distance given as: Another approach for matching two features is the nearest neighbour algorithm based on the ratio of the distances d 1 and d 2 . However, it should be remembered that the matching result for the described distances may vary, hence the importance of the features' detection and description step.
The final step in all the discussed methods is the statistical verification of a set of matched local features. It happens that, as a result of the initial comparison of the vectors which describe the features, mismatches resulting from the acquisition conditions described above are indicated. Therefore, after the pre-processing step, additional criteria are applied to distinguish matches from mismatches, e.g. based on the Random Sample Consensus (RANSAC) method [29]. This method allows for the estimation of a mathematical model describing the location of local features in the image provided that most of the matched points fit into this model (with the assumed maximum error). Then those points that do not fit into the estimated model are discarded in the step of determining the image transformation matrix.

Problem formulation
The following set is considered: Elements of I are two-dimensional discrete signals (digital images) and describe the same part of the Earth's surface, but recorded at different times and therefore under different environmental and lighting conditions. Image I j chosen from the set I is treated as a reference signal, i.e. characterised by excellent structural similarity to itself. In order to compare any I k image selected from the set I with reference image I j , the following similarity measures were used: mean square error J MSE and the related J PSNR (peak-signal-to-noise ratio) and J SSIM (structural similarity index measure).
The mean square error is determined by the formula: I j x, y ðÞ À I k ðx, yÞ ÂÃ 2 (22) in which M and N are the width and height of images in pixels. Index J PSNR is defined as: where ℓ is the range of changes of the luminance value, while the index J SSIM can be described as: (24) in which μ I j and μ I k are the mean values of the I j and I k image luminance, σ I j and σ I k are the variances of I j and I k , σ I j I k is covariance of a pair I j , I k ÀÁ , and ξ 1 ¼ 0:01ℓ ðÞ 2 and ξ 2 ¼ 0:03 ℓ ðÞ 2 are positive constants avoiding instability when the denominator of the Eq. (24) is very close to zero [30][31][32].
For such defined initial conditions, the best match of the subsequent elements of the set I in relation to the reference element I j is sought, assuming that the similarity index measures of the examined pairs I j , I k ÀÁ are strongly undesirable, i.e.
The term best match is understood as defining certain vectors v j and v k of values, which characterise the considered signals I j and I k , and then linking them in a way that makes it possible to explicitly state that the selected pair I j , I k ÀÁ describes the same fragment of the Earth's surface.

Performance analysis
In order to verify the sensitivity of the selected methods to limitations in the measurement system and environmental changes, a number of studies taking into account the actual conditions of obtaining information were conducted. Due to their difficult nature, they were performed with the use of computer simulation methods. The research was carried out in three stages. In the first stage, a detailed analysis of the test sets, using the values of the similarity indexes of the elements defined in the article, was completed. On the basis of the performed tests, special cases were selected and subjected to detailed analysis. In the further part of the study, the methods and verification of the correctness of image data matching in the scope of mutual matching of the sets presented to the analysis were compared. Finally, the influence of changes in the contrast of the acquired image on the number of features detected and the subsequent matching results was examined.

Analysis of test set elements
For the initial numerical tests, the test set I consisting of four elements was adopted, whereby I 0 is treated as reference. Each element of the set I is a 24-bit digital image with a size of 1080 Â 1080 pixels, representing the same fragment of the Earth's surface, with a centrally located characteristic terrain object (Figure 3).
The object is located in a natural environment characteristic for tundra and therefore distinguished by a rocky ground with a very low plant cover, dominated by mosses and lichens. Image I 0 (reference) was taken in the autumn and mostly brown colours, associated with the tundra soils and rock formations in this area, prevail in it. The image I 1 shows the environment in spring-summer conditions, i.e. during the growing season. Images I 2 and I 3 were taken in winter, with snow cover, whereby in the case of I 3 , there is also a strong cloud cover. Test set I elements' similarity index measures, determined on the basis of the Eqs. (22)-(24) with the reference image I 0 , are presented in Table 1 (columns 2-5).
Based on the obtained results, it can be shown that the elements constituting the test set I were selected so that only one of them (I 1 ) has a relatively high degree of   It should be noted that I 2 and I 3 were also carefully selected. Both images show a similar arrangement of snow cover, which is reflected in the determined values of the pair similarity indexes I 2 , I 3 ðÞ , cf. Table 1, columns 6 and 7. This is justified in practice: it may happen that the data from the reconnaissance are accurate (e.g. they take into account the snow in the area concerned, the lack of leaves on the trees in late autumn or the high water level in spring), but strong fogging or rainfall make the image I 3 obtained from UAV/CM recording systems several hours later significantly different from the reference image (in this case I 2 ).

Comparison of the selected methods of image feature matching
The test set I presented in SubSection 4.1 was examined in order to compare the effectiveness of image matching performed by algorithms using local features. The SURF, MSER, ORB and KAZE methods were taken into account. Image I 0 is a pattern, and I 1 , I 2 , I 3 are matched images. In the algorithms, the values of parameters proposed by their authors were used with the exception of the features' similarity threshold, which was lowered to the level of 50% due to large differences between individual elements of the set I. The best matching of individual features in the compared images was assumed, using the similarity measures proposed for these methods. The RANSAC method was used for the final correction of the matched features, for which an affine transformation model between images was adopted. In order to verify the effectiveness of the considered methods and the correctness of the parameters adopted in the last study, the pattern was replaced. It was assumed that I 2 is a reference image and I 3 is a matched image. The matching results of the individual test pairs of I are shown in Figures 4-8 and in Tables 2-5.
Analysis of the matching results has shown that the selected algorithms are not effective when the matched images, despite the same content, differ significantly, cf. pair I 0 , I 3 ðÞ . All methods were most effective in matching the pair I 0 , I 1 ðÞ . While the SURF and MSER methods indicated mismatches for matching pairs I 0 , I 2 ðÞ and I 0 , I 3 ðÞ , the ORB method did not (cf. Table 3 and Table 4). The KAZE method identified correctly the fragment of the image on which the corresponding features of the pair I 0 , I 2 ðÞ were located. When comparing a relatively similar pair I 2 , I 3 ðÞ ,it appeared that all algorithms indicated mismatches or lack thereof, with KAZE and MSER indicating two correct matches each ( Table 5).
In general, the KAZE method proved to be the most effective, while the ORB method showed the least processing efficiency of the set I. Due to the lack of any pair I 0 , I 2 ðÞ , I 0 , I 3 ðÞ and I 2 , I 3 ðÞ matches, no graphical results are presented for the ORB method. A potential cause of a lack of pair I 2 , I 3 ðÞ matches is a large contrast change, characteristic of the occurrence of acquisition interference, such as the fog visible in the image I 3 .

Effect of contrast change on the number of the detected features
The research focused on the analysis of the effect of contrast change on the number of features detected in the image. For this purpose, the contrast of the image I 0 had been gradually reduced until a uniform colour throughout the image was obtained. Afterwards, the transformed set was further analysed. SURF, MSER, ORB and KAZE methods were used again. Figure 9 shows the cumulative results of this study.    On the basis of the results obtained, it can be concluded that the number of features detected by the examined methods decreases with image contrast reducing, which results in a smaller statistical sample processed in each subsequent step of these methods. This may be the cause of the lower matching efficiency of the methods considered for images that are significantly different from each other.

Conclusions and final remarks
The results of the algorithms presented in the literature are usually related to images that are fragments of source images, i.e. have their qualitatively identical counterparts in Ref. images. In the analysed cases, the differences between the reference image stored in the memory of the navigation system and that recorded by the sensor are significant. As a result, there are certain consequences that often prevent the image representing the same field object from being effectively matched. This is due to real environmental conditions and restrictions on obtaining information. The measurement system parameters and the quality of the images taken have a direct impact on the number of detected features. For example, the lack of complete information about the accuracy of field object's image mapping makes it impossible to properly select the size of the filters. This results in the detection of objects that are completely irrelevant to the issue considered, such as bushes, leaves or grass blades, which are highly variable over time. Consequently, it has a significant impact on the performance of individual algorithms.
The study concluded that the use of statistical algorithms such as RANSAC improves the effectiveness of the selected methods. However, the results obtained strongly depend on the size of the set taken into consideration and the match/ mismatch ratio. Therefore, in the terrain image processing, it is necessary to conduct an analysis of the informational characteristics of the examined objects and the conditions of acquisition. This allows for extracting characteristic points whose description does not significantly change due to atmospheric conditions.
The results of the simulation tests enable a general conclusion that the methods considered are often insufficient to determine the coordinates of a UAV/CM flying under unfavourable environmental conditions. The greatest development potential, in the context of the implementations examined in this work, is characterised by methods based on anisotropic diffusion, which in the course of simulation studies showed the highest effectiveness. Therefore, it seems justified to focus the research effort on further development of new image processing methods within the group of anisotropic diffusion methods. In particular, it is proposed to take the informative character of terrain images as determinants of the input parameters of the designed processing methods into account, to apply pre-processing methods aimed at decimation of the input data, their segmentation and determination of the main components, and to extend the definition of the designed methods with additional criteria increasing the effectiveness of detection and image feature matching. The newly developed methods should be aimed at the improvement of feature detection efficiency in terrain images and the selection of processing parameters taking into account environmental conditions as well as limitations and conditions in the measurement system.